Heart Disease Prediction ML

A machine learning system for early heart disease risk prediction that compares several algorithms, including Random Forest, Logistic Regression, and KNN. The best model, Random Forest, reached 88% accuracy on a combined dataset of 918 patient records.

This research project addresses cardiovascular diseases (CVDs), which account for 17.9 million deaths annually (31% of global mortality). Focusing on early detection through machine learning, we developed a predictive system using data from five major heart disease datasets across different hospitals, combining 918 patient records with 11 key health indicators.

Figures: Random Forest confusion matrix; feature correlation heatmap; ROC curves for Logistic Regression, KNN, Decision Trees, and Random Forest; final results.

Data Analysis & Preprocessing

  • Combined and cleaned datasets from five hospitals: Statlog Heart (270), Long Beach VA (200), Hungarian (294), Cleveland (303), and Switzerland (123), for 1,190 records in total before cleaning
  • Implemented comprehensive data preprocessing including duplicate removal and categorical variable encoding
  • Conducted extensive Exploratory Data Analysis (EDA) revealing key correlations between health indicators
  • Applied feature scaling and normalization techniques for optimal model performance
  • Utilized cross-validation and grid search for robust model evaluation
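
The preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the project's actual pipeline: the column names, the five sample rows, and the `preprocess` helper are all hypothetical, and the choice of one-hot encoding plus standard scaling is an assumption about how "categorical variable encoding" and "feature scaling" were done.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample; column names are assumptions, not the project's schema.
raw = pd.DataFrame({
    "Age": [40, 49, 37, 40, 54],
    "Sex": ["M", "F", "M", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ATA", "ASY"],
    "Cholesterol": [289, 180, 283, 289, 195],
    "HeartDisease": [0, 1, 0, 0, 1],
})


def preprocess(df, target="HeartDisease"):
    """Drop duplicate rows, scale numeric features, one-hot encode categoricals."""
    df = df.drop_duplicates().reset_index(drop=True)
    y = df[target]
    features = df.drop(columns=[target])

    # Scale the numeric columns to zero mean and unit variance.
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(features[numeric_cols]),
        columns=numeric_cols,
        index=features.index,
    )

    # One-hot encode the remaining (categorical) columns.
    encoded = pd.get_dummies(
        features.drop(columns=numeric_cols), drop_first=True, dtype=float
    )
    return pd.concat([scaled, encoded], axis=1), y


X, y = preprocess(raw)
print(X.shape)
print(X.columns.tolist())
```

For brevity the scaler here is fit on the whole frame. When models are evaluated on held-out data, fitting the scaler on the training split only (for example inside a scikit-learn pipeline) avoids leaking test-set statistics.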

Technical Implementation

  • Developed multiple machine learning models: Logistic Regression, Decision Trees, Random Forest, and K-Nearest Neighbors
  • Implemented GridSearchCV for automated hyperparameter optimization
  • Created visualization tools for model performance analysis using confusion matrices and ROC curves
  • Achieved 88% accuracy with Random Forest, outperforming other algorithms
  • Built robust evaluation metrics including precision, recall, and F1-score calculations
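
As an illustration of the workflow above, the sketch below tunes the four model types with `GridSearchCV` and collects the listed metrics. It runs on synthetic data from `make_classification` rather than the real patient records, and the parameter grids, the 80/20 split, and the 5-fold setting are assumptions chosen for the example, so the scores it prints say nothing about the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the dataset (918 rows, 11 features).
X, y = make_classification(n_samples=918, n_features=11, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Each candidate is (estimator, parameter grid); grid values are illustrative.
candidates = {
    "Logistic Regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "KNN": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [5, 11, 21]},
    ),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [5, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated search over the grid, refit on the full training set.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    y_score = search.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

for name, r in results.items():
    print(f"{name}: CV={r['cv_accuracy']:.3f} "
          f"test={r['test_accuracy']:.3f} AUC={r['roc_auc']:.3f}")
```

Wrapping the scaler and the estimator in one pipeline means each cross-validation fold fits its own scaler, which keeps the cross-validation score an honest estimate for the distance-based and linear models.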

Key Findings

  • Random Forest emerged as the most effective model with 88% accuracy
  • Logistic Regression achieved 85% accuracy, with a 91.42% cross-validation score
  • K-Nearest Neighbors reached 87% accuracy with strong ROC-AUC results
  • Identified notable correlations between exercise-induced angina, Oldpeak (ST depression), and heart disease outcomes
  • Observed a higher prevalence of heart disease among male patients
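
The correlation and prevalence findings above come from standard exploratory checks. A minimal sketch of how such checks might be computed with pandas follows. The eight rows are invented for illustration and the column names (`Sex`, `ExerciseAngina`, `Oldpeak`, `HeartDisease`) are assumed, so the numbers it produces do not reflect the project's data.

```python
import pandas as pd

# Invented toy records; not taken from the project's dataset.
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "ExerciseAngina": ["Y", "Y", "N", "Y", "N", "N", "N", "N"],
    "Oldpeak": [2.0, 1.5, 0.0, 3.0, 0.0, 0.5, 1.0, 0.0],
    "HeartDisease": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Encode the binary categoricals as 0/1 so they can enter a correlation matrix.
encoded = df.assign(
    Sex=(df["Sex"] == "M").astype(int),
    ExerciseAngina=(df["ExerciseAngina"] == "Y").astype(int),
)

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    encoded.corr()["HeartDisease"]
    .drop("HeartDisease")
    .sort_values(ascending=False)
)

# Share of positive cases within each sex.
prevalence_by_sex = df.groupby("Sex")["HeartDisease"].mean()

print(corr_with_target)
print(prevalence_by_sex)
```

The full matrix from `encoded.corr()` is the kind of input a correlation heatmap is drawn from, for example with `seaborn.heatmap`.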

The system demonstrates the effectiveness of machine learning in early heart disease detection, with Random Forest showing the most promising results. The project contributes to advancing predictive healthcare technologies while addressing critical ethical and professional considerations in medical AI applications.

Technologies Used

Python, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, GridSearchCV, Jupyter Notebooks, Statistical Analysis

Key Features

  • Multi-model comparison (Random Forest, Logistic Regression, KNN, Decision Trees)
  • Comprehensive data preprocessing pipeline
  • Advanced feature engineering
  • Automated hyperparameter optimization
  • ROC-AUC and confusion matrix analysis
  • Cross-validation implementation
  • Interactive visualization dashboards
  • Statistical significance testing

Challenges Overcome

  • Handling imbalanced medical datasets
  • Ensuring model reliability for clinical use
  • Optimizing hyperparameters across multiple models
  • Managing data quality from multiple sources
  • Addressing ethical considerations in medical AI
Aditya Parmar