A machine learning system for early heart disease risk prediction combining multiple algorithms including Random Forest, Logistic Regression, and KNN, achieving 88% accuracy on a comprehensive healthcare dataset.
This research project addresses cardiovascular diseases (CVDs), which account for 17.9 million deaths annually (31% of global mortality). To support early detection, we developed a machine learning system trained on five heart disease datasets from different hospitals, combined into 918 patient records with 11 key health indicators each.
Data Analysis & Preprocessing
- Combined and cleaned five source datasets: Statlog Heart (270), Long Beach VA (200), Hungarian (294), Cleveland (303), and Switzerland (123), totaling 1,190 records before duplicate removal
- Implemented data preprocessing, including duplicate removal and categorical variable encoding
- Conducted Exploratory Data Analysis (EDA), revealing key correlations between health indicators
- Applied feature scaling and normalization so that features on different scales contribute comparably to the models
- Used cross-validation and grid search for robust model evaluation
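The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names and the toy values are assumptions made up for the example, and the choice of one-hot encoding plus standardization is one common option among several.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy records with made-up values; the column names are assumed for illustration.
records = pd.DataFrame({
    "Age": [54, 61, 54, 47, 39],
    "Sex": ["M", "F", "M", "M", "F"],
    "ExerciseAngina": ["Y", "N", "Y", "N", "N"],
    "Oldpeak": [1.5, 0.0, 1.5, 2.3, 0.0],
    "HeartDisease": [1, 0, 1, 1, 0],
})

# Step 1: drop exact duplicate rows (the third row repeats the first).
deduped = records.drop_duplicates().reset_index(drop=True)

X = deduped.drop(columns="HeartDisease")
y = deduped["HeartDisease"]

# Step 2: one-hot encode the categorical columns.
# Step 3: standardize the numeric columns to zero mean and unit variance.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "ExerciseAngina"]),
        ("num", StandardScaler(), ["Age", "Oldpeak"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

In a real train/test workflow the transformer would be fitted on the training split only and then applied to the test split, so that scaling statistics do not leak from held-out data.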
Technical Implementation
- Developed multiple machine learning models: Logistic Regression, Decision Trees, Random Forest, and K-Nearest Neighbors
- Implemented GridSearchCV for automated hyperparameter optimization
- Created visualization tools for model performance analysis using confusion matrices and ROC curves
- Achieved 88% accuracy with Random Forest, outperforming other algorithms
- Built robust evaluation metrics including precision, recall, and F1-score calculations
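A hedged sketch of how a GridSearchCV search and the evaluation metrics above might fit together is shown below. It runs on synthetic data that only mirrors the dataset's shape (918 rows, 11 features), and the parameter grid is a hypothetical example rather than the grid used in the project, so the printed scores say nothing about the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: same shape as the dataset, not the real records.
X, y = make_classification(
    n_samples=918, n_features=11, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search grid; the project's actual grid is not stated.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test split.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print("Best parameters:", search.best_params_)
print(cm)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print("ROC-AUC:", auc)
```

The same pattern applies to the other models by swapping the estimator and its grid, which keeps the comparison between algorithms on equal footing.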
Key Findings
- Random Forest emerged as the most effective model with 88% accuracy
- Logistic Regression achieved 85% accuracy with a 91.42% cross-validation score
- K-Nearest Neighbors demonstrated 87% accuracy with strong ROC-AUC metrics
- Identified critical correlations between exercise-induced angina, ST depression (Oldpeak), and heart disease outcomes
- Developed insights into demographic patterns showing higher prevalence in male patients
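To make the correlation finding concrete, the sketch below computes a Pearson correlation between a feature and the binary outcome (for a binary variable this is the point-biserial correlation). The values are invented toy data, not patient records, and the variable names are assumptions chosen to echo the indicators named above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy values for illustration only.
oldpeak = [0.0, 0.2, 1.5, 2.0, 0.0, 3.1, 1.0, 0.5]
exercise_angina = [0, 0, 1, 1, 0, 1, 0, 0]  # 1 = angina present
heart_disease = [0, 0, 1, 1, 0, 1, 1, 0]    # 1 = disease present

r_oldpeak = pearson(oldpeak, heart_disease)
r_angina = pearson(exercise_angina, heart_disease)
print(f"Oldpeak vs outcome: {r_oldpeak:.2f}")
print(f"Exercise angina vs outcome: {r_angina:.2f}")
```

A coefficient near +1 or -1 indicates a strong linear association, while a value near 0 indicates little linear association; correlation alone does not establish that a feature causes the outcome.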
The system demonstrates the effectiveness of machine learning in early heart disease detection, with Random Forest showing the most promising results. The project contributes to advancing predictive healthcare technologies while addressing critical ethical and professional considerations in medical AI applications.
Technologies Used
Key Features
- Multi-model comparison (RF, LR, KNN)
- Comprehensive data preprocessing pipeline
- Advanced feature engineering
- Automated hyperparameter optimization
- ROC-AUC and confusion matrix analysis
- Cross-validation implementation
- Interactive visualization dashboards
- Statistical significance testing
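For readers unfamiliar with the ROC-AUC and confusion matrix analysis mentioned above, here is a small from-scratch sketch of both quantities. It is an illustration of the definitions under simple assumptions (binary labels, a fixed 0.5 threshold), not the project's code; the function names are hypothetical and the scores are toy values.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding the scores."""
    tn = fp = fn = tp = 0
    for t, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if t == 1 and predicted == 1:
            tp += 1
        elif t == 1 and predicted == 0:
            fn += 1
        elif t == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
tn, fp, fn, tp = confusion_counts(y_true, y_score)
print(auc, (tn, fp, fn, tp))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; library implementations use sorting to compute the same value more efficiently.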
Challenges Overcome
- Handling imbalanced medical datasets
- Ensuring model reliability for clinical use
- Optimizing hyperparameters across multiple models
- Managing data quality from multiple sources
- Addressing ethical considerations in medical AI
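The source does not say how class imbalance was handled. One common option is to weight each class inversely to its frequency, sketched below with made-up labels. The helper name `balanced_class_weights` is hypothetical; the formula matches what scikit-learn applies when an estimator is given `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up imbalanced labels: six positives, two negatives.
labels = [1] * 6 + [0] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the minority class contributes as much to the training loss in total as the majority class, which is one way to keep a model from simply favoring the more frequent label.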