Osteoporosis Risk Prediction with Machine Learning

The dataset offers comprehensive information on health factors influencing osteoporosis development, including demographic details, lifestyle choices, medical history, and bone health indicators. It aims to facilitate research in osteoporosis prediction, enabling machine learning models to identify individuals at risk. Analyzing factors like age, gender, hormonal changes, and lifestyle habits can help improve osteoporosis management and prevention strategies.

This dataset was created by Amit KulKarni and was last updated in March 2024.

The dataset is from Kaggle.com. If you would like to view or work with the dataset, please click on the download button below. The dataset is in CSV format.

Download Osteoporosis.csv

Some parts of this project are not displayed here. The full R code for the project is available in my GitHub repository:

Min Chang's GitHub

Loading, Seed, & Data Cleaning

The R script prepares the environment for building and analyzing a machine learning model on the osteoporosis data. The first steps are to install the necessary packages, load them into the R session, set a seed for reproducibility, and load and clean the dataset.

After installing and loading the required packages, I set a random seed (1234) with set.seed so that random operations, such as splitting the data for machine learning, give the same results on every run.
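As a minimal sketch of what the seed buys us (not the project's exact script), the snippet below draws the same random sample twice and then makes a hypothetical 80/20 split of row indices. The row count and split ratio are assumptions chosen for illustration.

```r
# Fix the random number generator so random operations repeat exactly.
set.seed(1234)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces the same draw.
set.seed(1234)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE

# Hypothetical 80/20 split of row indices for a data set with 100 rows.
set.seed(1234)
n_rows <- 100
train_idx <- sample(seq_len(n_rows), size = floor(0.8 * n_rows))
test_idx <- setdiff(seq_len(n_rows), train_idx)
```

Because the seed is set immediately before the split, rerunning the script assigns the same rows to the training and test sets each time.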

Next, I loaded the data and removed every row containing a missing value (NA) with the na.omit function, so that the dataset is clean and ready for analysis. This step is crucial for maintaining the integrity of the model's input data.

The glimpse function gives a quick view of the table's columns, their types, and their first few values, and the dim function returns the table's dimensions (the number of rows and columns).
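The cleaning step can be sketched on a small stand-in table. The toy data frame and its column names below are assumptions for illustration, not the real contents of Osteoporosis.csv, and base R's str is used in place of dplyr's glimpse so the example needs no extra packages.

```r
# Toy data frame standing in for the osteoporosis CSV (illustrative columns only).
# With the real file, this would be something like read.csv("Osteoporosis.csv").
toy <- data.frame(
  Age = c(45, 67, NA, 52, 71),
  Gender = c("Female", "Male", "Female", NA, "Female"),
  Osteoporosis = c(0, 1, 1, 0, 1)
)

dim(toy)  # rows and columns before cleaning: 5 3

# na.omit() drops every row that contains at least one NA.
clean <- na.omit(toy)

dim(clean)  # rows and columns after cleaning: 3 3

# Compact overview of columns and types, similar in spirit to dplyr::glimpse().
str(clean)
```

One thing to keep in mind: read.csv treats only the string "NA" as missing by default, so blank cells in text columns stay as empty strings unless na.strings = c("", "NA") is passed, and na.omit would not remove those rows.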

Logistic Regression

The R code for this step creates a logistic regression model to analyze the relationship between various predictors and the likelihood of osteoporosis. The model is built with the glm function, which fits generalized linear models, of which logistic regression is one.

The model formula specifies that Osteoporosis (the dependent variable) is predicted by a combination of explanatory variables: Age, Gender, Family History, Race/Ethnicity, Vitamin D Intake, Smoking, Medical Conditions, Medications, Prior Fractures, and Hormonal Changes.

The data for fitting the model is taken from a data frame named train, which holds the observations (rows) and the specified variables (columns) from the osteoporosis dataset that the model needs.

The family parameter specifies the type of model to be fitted. Here, binomial (with its default logit link) indicates that a logistic regression is being performed. Logistic regression is used when the dependent variable is binary, in this case the presence or absence of osteoporosis.
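The glm call described above can be sketched as follows. This is a minimal illustration on simulated data, not the project's actual model: the toy data frame, the reduced set of predictors, the 80/20 split, and the 0.5 cutoff are all assumptions made for the example.

```r
set.seed(1234)

# Simulated stand-in for the osteoporosis data (not real observations).
n <- 200
toy <- data.frame(
  Age = round(runif(n, 20, 90)),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Smoking = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Simulated binary outcome whose risk rises with age (purely illustrative).
toy$Osteoporosis <- rbinom(n, 1, plogis(-4 + 0.07 * toy$Age))

# Hypothetical 80/20 train/test split.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# family = binomial makes glm() fit a logistic regression.
fit <- glm(Osteoporosis ~ Age + Gender + Smoking,
           data = train, family = binomial)
summary(fit)

# type = "response" returns predicted probabilities instead of log-odds.
pred_prob <- predict(fit, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Share of test rows classified correctly at the 0.5 cutoff.
mean(pred_class == test$Osteoporosis)
```

The coefficients in summary(fit) are on the log-odds scale, so exp(coef(fit)) gives odds ratios, which are often easier to read for risk factors.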

Random Forest Model and Feature Importance Analysis

I then fitted a Random Forest model to predict osteoporosis and to assess how much each feature contributes to the predictions. Random Forest is a popular machine learning algorithm known for its robustness and accuracy, and it is especially useful in classification tasks such as predicting disease occurrence from multiple predictors.
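A feature importance analysis of this kind might look like the sketch below. It assumes the randomForest package is installed and uses simulated data with illustrative column names. The number of trees and the choice of the Gini-based importance measure are assumptions, not settings taken from the project.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1234)

# Simulated stand-in for the osteoporosis data (not real observations).
n <- 200
toy <- data.frame(
  Age = round(runif(n, 20, 90)),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Smoking = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# The outcome must be a factor for randomForest() to run classification.
toy$Osteoporosis <- factor(rbinom(n, 1, plogis(-4 + 0.07 * toy$Age)))

# importance = TRUE also records permutation-based importance.
rf <- randomForest(Osteoporosis ~ ., data = toy,
                   ntree = 500, importance = TRUE)

# One row per predictor; sort by mean decrease in Gini impurity.
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

# varImpPlot(rf) would draw the same ranking as a dot chart.
```

Printing rf also shows the out-of-bag error estimate, which gives a rough accuracy check without a separate test set.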

The feature importance analysis from the Random Forest model ranks each feature by its significance in predicting osteoporosis. Understanding these key predictors allows for a targeted approach in healthcare settings, focusing on significant risk factors for better management and prevention strategies. This approach is valuable in medical research and practice, where identifying and prioritizing risk factors can lead to more effective interventions and improved patient outcomes.