2 Yildiz Technical University, Department of Electronics and Communication Engineering, Istanbul, Turkey
Abstract
Heart disease is one of the leading causes of mortality worldwide, regardless of origin, ethnicity, or color. Timely diagnosis and accurate prediction of heart disease can significantly reduce its impact and save lives. The heart disease prediction model in this study was developed using a heart disease dataset collected from US citizens by the Centers for Disease Control and Prevention. Four supervised machine learning classification algorithms (logistic regression, decision tree, random forest, and k-nearest neighbors) were used for model development and were tuned with grid search, an exhaustive hyperparameter optimization technique, combined with five-fold cross-validation. The models were evaluated and compared on several performance metrics, including accuracy, ROC-AUC, and PR-AUC, with confidence intervals for the accuracy estimates used to assess model reliability. The logistic regression, decision tree, random forest, and k-nearest neighbors classifiers achieved accuracies of 76.25%, 93.80%, 93.95%, and 90.87%, respectively. Overall, the random forest classifier performed best, achieving ROC-AUC, PR-AUC, accuracy, precision, recall/sensitivity, specificity, and F1-score values of 98.55%, 98.42%, 93.95%, 89.99%, 98.90%, 88.99%, and 94.24%, respectively. The analysis also shows that the random forest model outperforms many existing classifiers for heart disease prediction. This novel model, which leverages a large dataset and comprehensive metrics, supports early detection in clinical settings. It also informs public health strategies by identifying factors highly correlated with heart disease, such as age category, general health, difficulty walking, and diabetes. Furthermore, for model interpretability, SHapley Additive exPlanations (SHAP) analysis was performed, providing explainable AI-based insights into how individual features influence predictions.
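As an illustration of how the threshold-based metrics reported above relate to one another, the following minimal sketch derives accuracy, precision, recall/sensitivity, specificity, and F1-score from binary confusion-matrix counts and attaches a confidence interval to the accuracy estimate. The counts are hypothetical and are not taken from the study, and the normal-approximation (Wald) interval is an assumption, since the abstract does not state which interval method was used.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n samples.

    Assumption: z = 1.96 corresponds to a 95% confidence level; the study
    may have used a different level or a different interval method.
    """
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Hypothetical test-set counts, chosen only for illustration.
tp, fp, tn, fn = 90, 10, 85, 15

metrics = classification_metrics(tp, fp, tn, fn)
low, high = accuracy_confidence_interval(metrics["accuracy"], tp + fp + tn + fn)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"accuracy 95% CI: [{low:.4f}, {high:.4f}]")
```

ROC-AUC and PR-AUC are not covered by this sketch because they depend on the ranking of predicted probabilities rather than on a single confusion matrix.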