Paper Title
Predictive Modeling of Diabetes Using Machine Learning
Abstract
Diabetes has become a global epidemic, necessitating innovative approaches for early detection and management.
This research employs machine learning techniques to predict diabetes using two distinct datasets: a
focused dataset of females of Pima Indian heritage and a comprehensive dataset encompassing diverse medical and
demographic information. Our methodology involves meticulous data cleaning, balancing outcomes with SMOTE, and
extensive data visualization. Eight machine learning models are assessed, including Logistic Regression, SVC, AdaBoost
Classifier, K Neighbors Classifier, and Gaussian Naïve Bayes. Model selection is done by cross-validation, and
metrics such as accuracy, precision, recall, and F1 score are calculated for each model. Random Forest and Gradient
Boosting emerged as the most effective models for the focused Pima Indian dataset and the comprehensive dataset,
respectively. Confusion matrices were plotted to examine the performance of these
classification models. In the Pima dataset, the findings challenge the conventionally assumed correlation between age and
diabetes, while the second dataset reinforces established patterns. The study emphasizes population-specific considerations in diabetes
prediction models and advocates for tailored approaches. Combining diverse datasets enhances the robustness of our models,
paving the way for accurate and personalized diabetes prediction.
Keywords - Diabetes, Machine Learning, Confusion Matrix, Accuracy
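
The four evaluation metrics named in the abstract can all be derived from the counts in a binary confusion matrix. The following is a minimal sketch of that relationship, not the study's actual evaluation code: the names `ConfusionMatrix`, `confusion_matrix`, and `classification_metrics` are hypothetical, the label 1 is assumed to mean "diabetic", and the toy labels are illustrative rather than taken from either dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier (hypothetical helper type)."""
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    fn: int  # predicted negative, actually positive
    tn: int  # predicted negative, actually negative


def confusion_matrix(y_true, y_pred, positive=1):
    """Tally the four outcome counts; `positive` marks the diabetic class (assumed)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp=tp, fp=fp, fn=fn, tn=tn)


def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(cm):
    """Compute accuracy, precision, recall, and F1 score from the counts."""
    total = cm.tp + cm.fp + cm.fn + cm.tn
    precision = _safe_div(cm.tp, cm.tp + cm.fp)
    recall = _safe_div(cm.tp, cm.tp + cm.fn)
    return {
        "accuracy": _safe_div(cm.tp + cm.tn, total),
        "precision": precision,
        "recall": recall,
        "f1": _safe_div(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
    print(classification_metrics(confusion_matrix(y_true, y_pred)))
```

On a dataset balanced with SMOTE, accuracy is less distorted by class imbalance, but precision and recall remain worth reporting separately because missed diagnoses (false negatives) and false alarms (false positives) carry different costs in a screening setting.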