Machine Learning Project

Predicting Obesity Risk from Behavioral & Demographic Factors

A comparative analysis of tree-based machine learning models to predict obesity risk using behavioral patterns, lifestyle factors, and demographic characteristics.

December 2025
Université Laval
MQT-7015
Prof. Cremona, Severino
R Programming, Random Forest, Decision Trees, Classification, Healthcare Analytics

Project Overview

Obesity affects more than 1 billion people worldwide, and its economic impact is projected to reach $4.32 trillion by 2035. This project applies tree-based machine learning methods to predict obesity risk categories from behavioral and demographic factors.

531 Real Observations
16 Predictor Variables
4 Models Compared
68.2% Best Accuracy

Methodology & Approach

Data Preparation

Synthetic records were filtered out, leaving only the 531 real survey responses. Continuous variables were binned into quantiles, and the target was simplified from 7 to 3 obesity classes.

  • Real vs. Synthetic Separation
  • Variable Transformations
  • Train/Test Split (80/20)
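The preparation steps above can be sketched in base R. This is a minimal illustration on toy data, not the project's actual code: the column names (`is_synthetic`, `age`, `level7`), the seven level labels, the use of quartiles, and the 7-to-3 mapping are all assumptions made for the example.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Toy stand-in for the survey data; every column name here is hypothetical.
n <- 200
levels7 <- c("Insufficient", "Normal", "Overweight_I", "Overweight_II",
             "Obesity_I", "Obesity_II", "Obesity_III")
raw <- data.frame(
  is_synthetic = rep(c(FALSE, TRUE), each = n / 2),
  age          = runif(n, 16, 60),
  level7       = factor(sample(levels7, n, replace = TRUE), levels = levels7)
)

# 1. Keep only the real (non-synthetic) responses.
real <- raw[!raw$is_synthetic, ]

# 2. Bin a continuous variable into quartiles (the bin count is an assumption).
breaks <- quantile(real$age, probs = seq(0, 1, by = 0.25))
real$age_bin <- cut(real$age, breaks = breaks, include.lowest = TRUE,
                    labels = c("Q1", "Q2", "Q3", "Q4"))

# 3. Collapse seven levels into three classes (assumed mapping).
map3 <- c(Insufficient = "Normal", Normal = "Normal",
          Overweight_I = "Overweight", Overweight_II = "Overweight",
          Obesity_I = "Obese", Obesity_II = "Obese", Obesity_III = "Obese")
real$class3 <- factor(unname(map3[as.character(real$level7)]),
                      levels = c("Normal", "Overweight", "Obese"))

# 4. 80/20 train/test split by random row index.
train_idx <- sample(nrow(real), size = floor(0.8 * nrow(real)))
train <- real[train_idx, ]
test  <- real[-train_idx, ]
```

Setting `include.lowest = TRUE` keeps the minimum value inside the first bin, which `cut()` would otherwise turn into `NA`.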

Exploratory Analysis

Analyzed distributions and relationships between predictors and obesity classes to understand key patterns.

  • Target Distribution Analysis
  • Predictor Correlations
  • Class Imbalance Assessment
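A class-imbalance check and a predictor-by-class cross-tabulation could look roughly like this in base R. The data frame, its column names (`class3`, `family_history`), and the sampling probabilities are hypothetical and only serve to make the sketch runnable.

```r
set.seed(1)  # arbitrary seed

# Toy data; column names and values are hypothetical.
classes <- c("Normal", "Overweight", "Obese")
df <- data.frame(
  class3         = factor(sample(classes, 300, replace = TRUE,
                                 prob = c(0.6, 0.25, 0.15)), levels = classes),
  family_history = factor(sample(c("no", "yes"), 300, replace = TRUE))
)

# Share of each class: a quick check for class imbalance.
class_share <- prop.table(table(df$class3))
print(round(class_share, 3))

# Row-wise proportions: within each class, the share reporting family history.
by_class <- prop.table(table(df$class3, df$family_history), margin = 1)
print(round(by_class, 3))
```

Using `margin = 1` normalizes within each class, so the rows are comparable even when the classes differ in size.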

Model Implementation

Implemented and compared four tree-based classification models with cross-validation and hyperparameter tuning.

  • Decision Tree (Deviance)
  • Decision Tree (Gini)
  • Bagging
  • Random Forest
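One way to fit the four models is sketched below. It assumes the `tree` and `randomForest` packages, which the source does not name, and uses toy data with hypothetical predictors `x1` to `x4`. The cross-validation and tuning steps are omitted, and `ntree = 500` and `mtry = floor(sqrt(p))` are common defaults rather than the project's tuned values.

```r
library(tree)          # assumed package for the two single trees
library(randomForest)  # assumed package for bagging and random forest

set.seed(123)  # arbitrary seed

# Toy data with hypothetical predictors x1..x4 and a three-level target.
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
score <- d$x1 + 0.5 * d$x2 + rnorm(n, sd = 0.5)
d$class3 <- cut(score, breaks = quantile(score, c(0, 0.5, 0.8, 1)),
                include.lowest = TRUE,
                labels = c("Normal", "Overweight", "Obese"))

train_idx <- sample(n, size = floor(0.8 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]
p <- ncol(d) - 1  # number of predictors

fits <- list(
  tree_deviance = tree(class3 ~ ., data = train, split = "deviance"),
  tree_gini     = tree(class3 ~ ., data = train, split = "gini"),
  # Bagging is a random forest that considers all p predictors at each split.
  bagging       = randomForest(class3 ~ ., data = train, mtry = p,
                               ntree = 500, importance = TRUE),
  # Random forest samples mtry < p candidate predictors at each split.
  random_forest = randomForest(class3 ~ ., data = train,
                               mtry = floor(sqrt(p)),
                               ntree = 500, importance = TRUE)
)

# Test-set error rate for each model.
test_error <- sapply(fits, function(fit) {
  pred <- if (inherits(fit, "tree")) {
    predict(fit, newdata = test, type = "class")
  } else {
    predict(fit, newdata = test)
  }
  mean(pred != test$class3)
})
print(round(test_error, 3))

# Variable importance from the random forest (mean decrease in accuracy).
print(importance(fits$random_forest, type = 1))
```

The only difference between the bagging and random forest fits is `mtry`. The source does not state which importance measure it reports, so `type = 1` here is just one of the two options `importance()` offers.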

Model Performance

Comprehensive comparison of all four models using error rate and AUC metrics on the test set.

Model performance comparison: error rate and AUC scores

Model                 | Error Rate | AUC Score | Normal Class Error | Overweight Error | Obese Error
Random Forest (best)  | 31.8%      | 0.7503    | 14.5%              | 80.8%            | 7.1%
Tree (Deviance)       | 33.6%      | 0.6489    | 30.3%              | 76.9%            | 0%
Tree (Gini)           | 33.6%      | 0.7377    | 13.6%              | 69.2%            | 57.1%
Bagging               | 34.6%      | 0.7207    | 16.7%              | 76.9%            | 14.3%
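For reference, the overall error rate and the per-class errors can both be read off a confusion matrix. The sketch below uses made-up counts purely for illustration; they are not the project's results. AUC is left out because the source does not say how the multi-class AUC was computed.

```r
# Hypothetical confusion matrix: rows are true classes, columns are predictions.
# The counts are made up for illustration and are not the project's results.
classes <- c("Normal", "Overweight", "Obese")
cm <- matrix(c(50,  8,  2,
               12,  6,  8,
                1,  1, 12),
             nrow = 3, byrow = TRUE,
             dimnames = list(truth = classes, predicted = classes))

# Overall error rate: share of observations off the diagonal.
error_rate <- 1 - sum(diag(cm)) / sum(cm)

# Per-class error: share of each true class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

print(round(error_rate, 3))
print(round(class_error, 3))
```

Accuracy is simply one minus the error rate, which is how a 31.8% error rate corresponds to the 68.2% best accuracy quoted in the overview.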

Key Findings

Random Forest Wins

Achieved the lowest error rate (31.8%) and the highest AUC (0.75) by decorrelating trees through random feature selection at each split.

Age is #1 Predictor

Age emerged as the most influential variable (importance: 24.7), followed by Water Intake (16.7) and Height (16.4).

Genetics Matter

Family history of overweight showed strong predictive power (importance: 15.0), with a higher proportion of obese individuals reporting a family history.

Overweight Challenge

All models struggled with the "Overweight" class (69-81% error), as it shares characteristics with both Normal and Obese groups.

Diet Type Less Important

High-calorie food consumption was ubiquitous (~90%) across all classes, making "what you eat" less predictive than "how much you eat."

Activity Correlation

Physical activity levels were generally lower in the Overweight and Obese groups, suggesting a negative correlation between activity and obesity risk.

Project Resources

Access the complete analysis including R code, visualizations, and detailed methodology.