Predicting Potential Diabetic Female Patients with Diagnostic Measurements

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes. In this project, I developed classification models, including K-Nearest Neighbors (KNN), to predict whether a female patient has diabetes.

Download Diabetes.csv

To see the full Python work for this project, check out my GitHub repository!

Min Chang's GitHub

The Data

The dataset contains 768 entries and 9 columns. Here's a brief overview of the columns:

Pregnancies: Number of times pregnant.
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
BloodPressure: Diastolic blood pressure (mm Hg).
SkinThickness: Triceps skin fold thickness (mm).
Insulin: 2-hour serum insulin (mu U/ml).
BMI: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.
Age: Age (years).
Outcome: Class variable (0 or 1) indicating if the patient has diabetes.

Using Python, I employed a method called K-Nearest Neighbors (KNN), which predicts whether a patient has diabetes by looking at the most similar patients in the dataset. The question I asked myself was "What happened to patients who had similar measurements?" The KNN plot below shows the training and testing accuracy of the model as the number of neighbors (k) varies from 1 to 15. Here are my observations and insights from the plot and the preprocessing steps:

Observations from the Plot:

Training Accuracy: It is highest when the number of neighbors is lowest (k=1). This is expected, because each training point is then its own nearest neighbor, so the model simply memorizes the training data. Such a model often overfits.
Testing Accuracy: It tends to increase as the number of neighbors increases, reaches a peak, and then either stabilizes or decreases slightly. This behavior suggests that a middle ground in the value of k helps the model generalize better compared to very low or very high values.
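
The sweep over k described above can be sketched without any third-party libraries. This is a minimal illustration and not the code from the repository: the patients are randomly generated (make_synthetic_patients is a hypothetical helper), only two made-up features are used, and in practice scikit-learn's KNeighborsClassifier would be the usual choice.

```python
import math
import random
from collections import Counter


def make_synthetic_patients(n, seed=0):
    """Generate made-up (glucose, bmi) rows with a 0/1 label; not real data."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.35 else 0
        glucose = rng.gauss(140 if label else 110, 25)
        bmi = rng.gauss(35 if label else 30, 6)
        rows.append((glucose, bmi))
        labels.append(label)
    return rows, labels


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    # On a tied vote, the label seen first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of rows in the evaluation set that are predicted correctly."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y)
    )
    return hits / len(eval_y)


# Assumed 75/25 split of the synthetic rows; the real split may differ.
x, y = make_synthetic_patients(300)
train_x, train_y = x[:225], y[:225]
test_x, test_y = x[225:], y[225:]

train_acc, test_acc = {}, {}
for k in range(1, 16):
    train_acc[k] = accuracy(train_x, train_y, train_x, train_y, k)
    test_acc[k] = accuracy(train_x, train_y, test_x, test_y, k)
    print(f"k={k:2d}  train={train_acc[k]:.3f}  test={test_acc[k]:.3f}")

best_k = max(test_acc, key=test_acc.get)
print("k with the highest testing accuracy:", best_k)
```

With k=1 the training accuracy comes out as 1.0, since every training row is at distance zero from itself. The features here are left unscaled to keep the sketch short; with the real columns, which use very different units, standardizing them first is a common step.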

Insights:

I found that using about 9 to 11 similar patients (neighbors) to make a prediction works best. With too few, the predictions are too tailored to individual cases and can be misleading. With too many, the predictions become too general and ignore important differences between patients.
The testing accuracy improved as I tuned the number of neighbors used for comparisons. This helped me find a sweet spot where the model neither overfits (too few neighbors) nor underfits (too many neighbors).

Next, I created a heatmap, a visual tool that helps to understand the relationships between different variables in a dataset. Here, the heatmap displays the correlation between the features in the diabetes dataset. Correlation measures the degree to which two variables move in relation to each other. A positive value means they tend to increase together, while a negative value means one tends to decrease as the other increases.
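
To make the correlation values concrete, here is a small sketch of how a single cell of such a heatmap can be computed. It assumes the heatmap uses the Pearson coefficient (the default for pandas' DataFrame.corr); the pearson function and the sample readings are illustrative and are not taken from Diabetes.csv.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up readings for illustration only; not rows from the dataset.
glucose = [85, 100, 120, 140, 165, 180]
outcome = [0, 0, 0, 1, 1, 1]
print("glucose vs outcome:", round(pearson(glucose, outcome), 2))

# Two sanity checks: a perfectly increasing and a perfectly decreasing pair.
print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0
print(pearson([1, 2, 3], [6, 4, 2]))  # -1.0
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.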

Key Insights from the Correlation Matrix:

Glucose and Outcome: There is a moderate positive correlation (0.47) between glucose levels and the diabetes outcome. This indicates that higher glucose levels are associated with a higher likelihood of diabetes, which aligns well with medical understanding.
BMI and Outcome: BMI (Body Mass Index) shows a weaker positive correlation (0.29) with the diabetes outcome. This suggests that higher BMI could be a risk factor for diabetes.
Age and Outcome: Age shows a weak positive correlation (0.24) with the diabetes outcome, suggesting that diabetes is more common among the older patients in this dataset.
Insulin and Glucose: There is a moderate positive correlation (0.33) between insulin and glucose levels. As expected, higher glucose levels often require higher insulin levels to be managed.
Skin Thickness and BMI: Skin thickness is moderately correlated (0.39) with BMI, possibly reflecting the fact that body fat can influence skinfold thickness.

Recommendations:

Targeted Screening: Given that glucose shows the strongest correlation with the diabetes outcome among the features above, targeted screening using glucose tests could be especially effective for early detection of diabetes.
Lifestyle Interventions: Programs aimed at controlling BMI through diet and exercise might be effective in reducing the risk of developing diabetes, particularly in those who are older or at higher risk.
Further Research: More detailed analysis could be conducted on the interplay between insulin and glucose levels, which could inform better management strategies for patients with higher glucose readings.

This visualization helps in understanding the relationships between various health indicators and diabetes, providing a basis for targeted interventions and further investigation.

Side Goal:

I set out to create a reliable tool for predicting diabetes from patient data. By adjusting both how missing data is handled and how many similar patients are used for comparison, I was able to improve the tool's accuracy. Further refinements, along with evaluation measures beyond accuracy, could provide even more reliable predictions and help healthcare professionals make informed decisions about diabetes risk.
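
As one possible reading of "additional evaluation measures", the sketch below computes precision, recall, and F1 alongside accuracy for a binary outcome. The function names and the label lists are hypothetical and made up for illustration; they are not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means diabetes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def evaluation_report(y_true, y_pred):
    """Accuracy, precision, recall, and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,  # of predicted positives, how many are right
        "recall": recall,        # of actual positives, how many are found
        "f1": f1,
    }


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
report = evaluation_report(y_true, y_pred)
print(report)
```

Recall is often the measure to watch in a screening setting, because a missed diabetic patient (a false negative) is usually costlier than a false alarm.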