Predicting Obesity Risk from Behavioral & Demographic Factors
A comparative analysis of tree-based machine learning models to predict obesity risk using behavioral patterns, lifestyle factors, and demographic characteristics.
Project Overview
Obesity affects over 1 billion people globally, with a projected economic impact of $4.32 trillion by 2035. This project applies tree-based machine learning methods to predict obesity risk categories from behavioral and demographic factors.
Methodology & Approach
Data Preparation
Filtered out the synthetic records to work only with the 531 real survey responses. Continuous variables were binned into quantiles, and the target was simplified from 7 to 3 obesity classes.
- Real vs. Synthetic Separation
- Variable Transformations
- Train/Test Split (80/20)
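The preparation steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's R code: the seven label names in `COLLAPSE`, the choice of four bins, and the fixed seed are assumptions made for the example, since the source does not list them.

```python
import random
from statistics import quantiles

# Hypothetical 7 -> 3 mapping; the original label names are assumed, not from the source.
COLLAPSE = {
    "Insufficient_Weight": "Normal",
    "Normal_Weight": "Normal",
    "Overweight_Level_I": "Overweight",
    "Overweight_Level_II": "Overweight",
    "Obesity_Type_I": "Obese",
    "Obesity_Type_II": "Obese",
    "Obesity_Type_III": "Obese",
}

def quantile_bins(values, n_bins=4):
    """Return a bin index (0 .. n_bins-1) for each value, using quantile cut points."""
    cuts = quantiles(values, n=n_bins)  # yields n_bins - 1 cut points
    return [sum(v > c for c in cuts) for v in values]

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

# Usage on made-up values
ages = [18, 21, 22, 23, 25, 27, 30, 34, 41, 55]
age_bins = quantile_bins(ages)
train_rows, test_rows = train_test_split(list(range(len(ages))))
```

Binning on quantiles rather than fixed widths keeps roughly equal counts per bin, which is one plausible reason for the choice with a sample this small.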
Exploratory Analysis
Analyzed distributions and relationships between predictors and obesity classes to understand key patterns.
- Target Distribution Analysis
- Predictor Correlations
- Class Imbalance Assessment
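A class imbalance check of the kind listed above might look like the sketch below. The function names and the sample labels are illustrative only and are not taken from the project.

```python
from collections import Counter

def class_proportions(labels):
    """Share of each class among the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Usage on made-up labels
labels = ["Normal"] * 6 + ["Overweight"] * 3 + ["Obese"] * 1
proportions = class_proportions(labels)
ratio = imbalance_ratio(labels)
```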
Model Implementation
Implemented and compared four tree-based classification models with cross-validation and hyperparameter tuning.
- Decision Tree (Deviance)
- Decision Tree (Gini)
- Bagging
- Random Forest
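The two single-tree variants differ in the impurity measure used to score candidate splits. As a hedged illustration, assuming the standard definitions (Gini index as one minus the sum of squared class proportions, and node deviance as minus two times the sum of each class count times the log of its proportion), the measures can be computed from a node's class counts like this:

```python
import math

def gini(counts):
    """Gini index of a node: 1 - sum(p_k^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def deviance(counts):
    """Node deviance: -2 * sum(n_k * ln(p_k)), skipping empty classes."""
    n = sum(counts)
    return -2.0 * sum(c * math.log(c / n) for c in counts if c > 0)

# Usage on made-up class counts (Normal, Overweight, Obese) in one node
node = [30, 12, 8]
node_gini = gini(node)
node_deviance = deviance(node)
```

Both measures are zero for a pure node and grow as the classes mix, but they weight mixed nodes differently, so the two criteria can select different splits on the same data.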
Model Performance
Comprehensive comparison of all four models using error rate and AUC metrics on the test set.
| Model | Error Rate | AUC Score | Normal Class Error | Overweight Error | Obese Error |
|---|---|---|---|---|---|
| Random Forest (best) | 31.8% | 0.7503 | 14.5% | 80.8% | 7.1% |
| Tree (Deviance) | 33.6% | 0.6489 | 30.3% | 76.9% | 0% |
| Tree (Gini) | 33.6% | 0.7377 | 13.6% | 69.2% | 57.1% |
| Bagging | 34.6% | 0.7207 | 16.7% | 76.9% | 14.3% |
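For reference, overall and per-class error rates can be derived from a confusion matrix as sketched below. The sketch assumes per-class error means the share of each true class that was misclassified, which the source does not state explicitly, and the matrix values are made up rather than taken from the project's results.

```python
def error_rates(confusion):
    """confusion[true][pred] = count. Return (overall error, error per true class)."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(cls, 0) for cls, row in confusion.items())
    per_class = {}
    for cls, row in confusion.items():
        row_total = sum(row.values())
        # Share of this true class that was predicted as something else
        per_class[cls] = 1 - row.get(cls, 0) / row_total if row_total else float("nan")
    return 1 - correct / total, per_class

# Usage on a made-up two-class matrix
toy = {
    "Normal": {"Normal": 8, "Obese": 2},
    "Obese": {"Normal": 1, "Obese": 9},
}
overall_error, class_errors = error_rates(toy)
```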
Key Findings
Random Forest Wins
Achieved lowest error rate (31.8%) and highest AUC (0.75) by decorrelating trees through random feature selection at each split.
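The decorrelation step can be illustrated with a small sketch. At each split, a random forest considers only a random subset of the predictors, while bagging considers all of them. The helper name `candidate_features` is hypothetical, the feature names are borrowed from the findings on this page, and the default subset size (the floor of the square root of the number of predictors) is a common convention for classification, not a setting confirmed by the source.

```python
import math
import random

def candidate_features(features, rng, mtry=None):
    """Pick the predictors eligible at one split.

    mtry=None uses floor(sqrt(p)); mtry=len(features) reproduces bagging.
    """
    if mtry is None:
        mtry = max(1, math.isqrt(len(features)))
    return rng.sample(features, mtry)

# Usage: illustrative feature names only
features = ["Age", "Height", "Water Intake", "Family History",
            "High-Calorie Food", "Physical Activity"]
rng = random.Random(0)
forest_split = candidate_features(features, rng)                       # random subset
bagging_split = candidate_features(features, rng, mtry=len(features))  # all predictors
```

Because a strong predictor is unavailable at many splits, individual trees differ more from one another, and averaging them reduces variance more than it would for bagged trees.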
Age is #1 Predictor
Age emerged as the most influential variable (importance: 24.7), followed by Water Intake (16.7) and Height (16.4).
Genetics Matter
Family history of overweight showed strong predictive power (importance: 15.0), with a higher proportion of obese individuals reporting such a history.
Overweight Challenge
All models struggled with the "Overweight" class (69-81% error), as it shares characteristics with both Normal and Obese groups.
Diet Type Less Important
High-calorie food consumption was ubiquitous (~90%) across all classes, making "what you eat" less predictive than "how much you eat."
Activity Correlation
Physical activity levels were generally lower in the Overweight and Obese groups, suggesting a negative correlation between physical activity and obesity risk.
Project Resources
Access the complete analysis including R code, visualizations, and detailed methodology.