Diabetes Risk Factors in America

Analysis of CDC BRFSS 2015 Health Indicators

Author

Data Analysis Team

Published

December 5, 2025

Executive Summary

Important: Bottom Line

This analysis of 253,680 CDC BRFSS respondents reveals that diabetes risk is strongly associated with modifiable lifestyle factors, particularly poor general health status, high blood pressure, and high cholesterol. Predictive models achieve strong discrimination (AUC = 0.82), enabling early identification of at-risk individuals for targeted intervention. Causal analysis estimates that eliminating high blood pressure could prevent 42% of diabetes cases, while fairness audits identify disparities requiring attention in model deployment.

Key Findings

13.9%
Diabetes Prevalence

0.820
Best Model AUC-ROC

42%
Cases Preventable (BP)

97%
Individual-Level Variance

Primary Risk Factors Identified

  1. General Health Status: Poor self-reported health is the strongest predictor (OR = 1.65 per level)
  2. High Blood Pressure: Associated with 61% higher odds of diabetes (OR = 1.61)
  3. High Cholesterol: 48% increased odds (OR = 1.48)
  4. Difficulty Walking: Strong indicator of metabolic dysfunction
  5. Age: Risk increases substantially after age 45

Model Performance Summary

  • Logistic Regression: AUC = 0.82, excellent interpretability
  • Random Forest: AUC = 0.813, comparable performance
  • At optimal threshold: 78% sensitivity with 71% specificity

Advanced Analyses Highlights

  • Causal Inference: Eliminating high BP could prevent an estimated 42% of cases; reducing BMI by 5 units could prevent an estimated 26%
  • Anomaly Discovery: 15.4% are “resilient” (high risk, no diabetes); 0.09% are “vulnerable” (low risk, with diabetes)
  • Fairness Audit: 20 disparity flags identified; income and education show largest gaps in model performance
  • Multi-Level Analysis: 97% of variance is individual-level; 3% attributable to environmental context

Public Health Implications

  • Screening programs should prioritize individuals with hypertension and hypercholesterolemia
  • General health perception is a powerful, easily-assessed risk indicator
  • Lifestyle modification targeting BMI, physical activity, and diet offers prevention potential
  • Fairness considerations must inform deployment in populations with socioeconomic diversity

Introduction

Background on Diabetes in America

Diabetes mellitus represents one of the most significant public health challenges facing the United States. According to the Centers for Disease Control and Prevention (CDC), over 37 million Americans—approximately 11.3% of the population—have diabetes, with an additional 96 million adults having prediabetes. The economic burden exceeds $327 billion annually in direct medical costs and lost productivity.

Type 2 diabetes, which accounts for 90-95% of all diabetes cases, is largely preventable through lifestyle modifications. Early identification of at-risk individuals enables targeted interventions that can delay or prevent disease onset.

Dataset Description

This analysis utilizes the Behavioral Risk Factor Surveillance System (BRFSS) 2015 Diabetes Health Indicators dataset, a nationally representative survey conducted by the CDC.

dataset_info <- tibble(
  Characteristic = c(
    "Total Respondents",
    "Survey Year",
    "Number of Variables",
    "Geographic Coverage",
    "Sampling Method"
  ),
  Value = c(
    format(n_total, big.mark = ","),
    "2015",
    "22 health indicators",
    "All 50 US states + DC",
    "Stratified random sampling"
  )
)

dataset_info |>
  kable(col.names = c("Characteristic", "Value"),
        align = c("l", "l")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                position = "left")
Characteristic Value
Total Respondents 253,680
Survey Year 2015
Number of Variables 22 health indicators
Geographic Coverage All 50 US states + DC
Sampling Method Stratified random sampling

Research Objectives

  1. Characterize the prevalence and distribution of diabetes and associated risk factors
  2. Identify the strongest predictors of diabetes through statistical analysis
  3. Develop predictive models for diabetes risk stratification
  4. Evaluate model performance using appropriate metrics for imbalanced classification
  5. Generate actionable insights for public health intervention

Analysis Approach

Our analytical pipeline consists of four major phases:

  1. Data Preparation: Cleaning, validation, and feature engineering
  2. Exploratory Analysis: Visualization of distributions and relationships
  3. Statistical Testing: Odds ratios, risk ratios, and hypothesis tests
  4. Predictive Modeling: Logistic regression and random forest classification

Data Description

Dataset Overview

The BRFSS Diabetes Health Indicators dataset contains responses from 253,680 survey participants with 22 variables capturing demographics, health behaviors, and chronic conditions.

Variable Descriptions

variable_info <- tibble(
  Variable = c(
    "Diabetes_012", "HighBP", "HighChol", "CholCheck", "BMI",
    "Smoker", "Stroke", "HeartDiseaseorAttack", "PhysActivity",
    "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare",
    "NoDocbcCost", "GenHlth", "MentHlth", "PhysHlth",
    "DiffWalk", "Sex", "Age", "Education", "Income"
  ),
  Description = c(
    "Diabetes status: 0=No, 1=Prediabetes, 2=Diabetes",
    "High blood pressure diagnosis (0/1)",
    "High cholesterol diagnosis (0/1)",
    "Cholesterol check in past 5 years (0/1)",
    "Body Mass Index (continuous)",
    "Smoked at least 100 cigarettes in lifetime (0/1)",
    "Ever had a stroke (0/1)",
    "Coronary heart disease or heart attack (0/1)",
    "Physical activity in past 30 days (0/1)",
    "Consume fruit 1+ times per day (0/1)",
    "Consume vegetables 1+ times per day (0/1)",
    "Heavy alcohol consumption (0/1)",
    "Any healthcare coverage (0/1)",
    "Could not see doctor due to cost (0/1)",
    "General health: 1=Excellent to 5=Poor",
    "Days of poor mental health (past 30 days)",
    "Days of poor physical health (past 30 days)",
    "Difficulty walking or climbing stairs (0/1)",
    "Sex: 0=Female, 1=Male",
    "Age category: 1=18-24 to 13=80+",
    "Education level: 1-6 scale",
    "Income level: 1=<$10k to 8=>$75k"
  ),
  Type = c(
    "Target", "Binary", "Binary", "Binary", "Continuous",
    "Binary", "Binary", "Binary", "Binary",
    "Binary", "Binary", "Binary", "Binary",
    "Binary", "Ordinal", "Count", "Count",
    "Binary", "Binary", "Ordinal", "Ordinal", "Ordinal"
  )
)

variable_info |>
  kable(col.names = c("Variable", "Description", "Type"),
        align = c("l", "l", "c")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 13) |>
  scroll_box(height = "400px")
Variable Description Type
Diabetes_012 Diabetes status: 0=No, 1=Prediabetes, 2=Diabetes Target
HighBP High blood pressure diagnosis (0/1) Binary
HighChol High cholesterol diagnosis (0/1) Binary
CholCheck Cholesterol check in past 5 years (0/1) Binary
BMI Body Mass Index (continuous) Continuous
Smoker Smoked at least 100 cigarettes in lifetime (0/1) Binary
Stroke Ever had a stroke (0/1) Binary
HeartDiseaseorAttack Coronary heart disease or heart attack (0/1) Binary
PhysActivity Physical activity in past 30 days (0/1) Binary
Fruits Consume fruit 1+ times per day (0/1) Binary
Veggies Consume vegetables 1+ times per day (0/1) Binary
HvyAlcoholConsump Heavy alcohol consumption (0/1) Binary
AnyHealthcare Any healthcare coverage (0/1) Binary
NoDocbcCost Could not see doctor due to cost (0/1) Binary
GenHlth General health: 1=Excellent to 5=Poor Ordinal
MentHlth Days of poor mental health (past 30 days) Count
PhysHlth Days of poor physical health (past 30 days) Count
DiffWalk Difficulty walking or climbing stairs (0/1) Binary
Sex Sex: 0=Female, 1=Male Binary
Age Age category: 1=18-24 to 13=80+ Ordinal
Education Education level: 1-6 scale Ordinal
Income Income level: 1=<$10k to 8=>$75k Ordinal

Class Distribution

The target variable exhibits substantial class imbalance, a critical consideration for model development:

class_dist <- df |>
  count(diabetes_status) |>
  mutate(
    Percentage = paste0(round(n / sum(n) * 100, 1), "%"),
    n = format(n, big.mark = ",")
  )

class_dist |>
  kable(col.names = c("Diabetes Status", "Count", "Percentage"),
        align = c("l", "r", "r")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                position = "left") |>
  row_spec(3, bold = TRUE, background = "#fef3f3")
Diabetes Status Count Percentage
No Diabetes 213,703 84.2%
Prediabetes 4,631 1.8%
Diabetes 35,346 13.9%

Figure: Class Distribution Visualization

Note: Class Imbalance Implications

The severe underrepresentation of prediabetes cases (1.8%) suggests potential underdiagnosis in the population. For modeling purposes, we combine prediabetes and diabetes into a single positive class to improve detection of any glucose metabolism abnormality.


Exploratory Data Analysis

Risk Factor Prevalence by Diabetes Status

The prevalence of key risk factors varies dramatically by diabetes status, revealing the interconnected nature of cardiometabolic conditions.

Figure: Risk Factor Prevalence

Key Observations:

  • Individuals with diabetes show 3-6x higher rates of stroke and heart disease
  • High blood pressure affects 71% of diabetics vs. only 36% of non-diabetics
  • Difficulty walking affects 35% of diabetics, indicating mobility impairment

BMI Distribution Analysis

Body Mass Index shows a clear rightward shift for individuals with diabetes, indicating the strong association between obesity and metabolic dysfunction.

Figure: BMI Distribution by Diabetes Status
bmi_by_status <- df |>
  group_by(diabetes_status) |>
  summarise(
    Mean = round(mean(bmi, na.rm = TRUE), 1),
    SD = round(sd(bmi, na.rm = TRUE), 1),
    Median = round(median(bmi, na.rm = TRUE), 1),
    `Q25` = round(quantile(bmi, 0.25, na.rm = TRUE), 1),
    `Q75` = round(quantile(bmi, 0.75, na.rm = TRUE), 1),
    .groups = "drop"
  )

bmi_by_status |>
  kable(col.names = c("Status", "Mean", "SD", "Median", "Q25", "Q75"),
        align = c("l", rep("r", 5)),
        caption = "BMI Statistics by Diabetes Status") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                position = "left")
BMI Statistics by Diabetes Status
Status Mean SD Median Q25 Q75
No Diabetes 27.7 6.3 27 24 30
Prediabetes 30.7 7.0 30 26 34
Diabetes 31.9 7.4 31 27 35

Age and Diabetes Risk

Diabetes prevalence increases dramatically with age, with the sharpest rise occurring between ages 45-64.

Figure: Age and Diabetes Risk

Correlation Analysis

The correlation heatmap reveals the structure of relationships among health indicators and their association with diabetes status.

Figure: Correlation Heatmap

Strongest Correlations with Diabetes:

top_cors <- stat_results$correlations$diabetes_correlations |>
  head(8) |>
  mutate(
    Direction = ifelse(correlation > 0, "Positive (increases risk)", "Negative (decreases risk)"),
    Correlation = round(correlation, 3)
  ) |>
  select(variable, Correlation, Direction)

top_cors |>
  kable(col.names = c("Variable", "Correlation (r)", "Direction"),
        align = c("l", "r", "l")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                position = "left")
Variable Correlation (r) Direction
gen_hlth 0.303 Positive (increases risk)
high_bp 0.272 Positive (increases risk)
bmi 0.224 Positive (increases risk)
diff_walk 0.224 Positive (increases risk)
high_chol 0.209 Positive (increases risk)
age 0.185 Positive (increases risk)
heart_diseaseor_attack 0.180 Positive (increases risk)
phys_hlth 0.176 Positive (increases risk)

Key EDA Insights

Tip: Summary of Exploratory Findings
  1. Class imbalance is severe: Only 15.8% of respondents have diabetes or prediabetes
  2. Comorbidities cluster: Diabetes strongly co-occurs with hypertension, hypercholesterolemia, and heart disease
  3. BMI is elevated: Mean BMI is 4+ points higher in diabetics (31.9 vs 27.7)
  4. Age is a major factor: Risk increases substantially after age 45
  5. General health perception is the strongest correlate (r = 0.30)

Statistical Analysis

Odds Ratios for Key Risk Factors

Univariate logistic regression reveals the unadjusted association between each predictor and diabetes risk.

or_table <- stat_results$logistic_regression$univariate |>
  arrange(desc(OR)) |>
  head(12) |>
  mutate(
    `Odds Ratio (95% CI)` = sprintf("%.2f (%.2f - %.2f)", OR, or_ci_low, or_ci_high),
    `P-value` = ifelse(p_value < 0.001, "< 0.001", sprintf("%.4f", p_value)),
    Interpretation = case_when(
      OR > 1.5 ~ "Strong risk factor",
      OR > 1.2 ~ "Moderate risk factor",
      OR > 1.0 ~ "Weak risk factor",
      OR < 0.8 ~ "Protective factor",
      TRUE ~ "Minimal association"
    )
  ) |>
  select(variable, `Odds Ratio (95% CI)`, `P-value`, Interpretation)

or_table |>
  kable(col.names = c("Variable", "Odds Ratio (95% CI)", "P-value", "Interpretation"),
        align = c("l", "c", "c", "l"),
        caption = "Univariate Odds Ratios for Diabetes (Top 12 Predictors)") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) |>
  row_spec(1:3, bold = TRUE, background = "#f0f7f4")
Univariate Odds Ratios for Diabetes (Top 12 Predictors)
Variable Odds Ratio (95% CI) P-value Interpretation
high_bp 4.78 (4.67 - 4.90) < 0.001 Strong risk factor
diff_walk 3.70 (3.61 - 3.79) < 0.001 Strong risk factor
heart_diseaseor_attack 3.51 (3.41 - 3.61) < 0.001 Strong risk factor
high_chol 3.24 (3.17 - 3.32) < 0.001 Strong risk factor
stroke 2.97 (2.85 - 3.10) < 0.001 Strong risk factor
gen_hlth 2.17 (2.14 - 2.19) < 0.001 Strong risk factor
smoker 1.41 (1.38 - 1.44) < 0.001 Moderate risk factor
age 1.21 (1.20 - 1.21) < 0.001 Moderate risk factor
sex 1.18 (1.15 - 1.20) < 0.001 Weak risk factor
bmi 1.08 (1.08 - 1.08) < 0.001 Weak risk factor
phys_hlth 1.04 (1.04 - 1.05) < 0.001 Weak risk factor
ment_hlth 1.02 (1.02 - 1.03) < 0.001 Weak risk factor

Risk Ratios for Binary Exposures

For binary risk factors, we calculate risk ratios to quantify the relative risk of diabetes among exposed vs. unexposed individuals.

rr_table <- stat_results$risk_measures |>
  arrange(desc(risk_ratio)) |>
  head(8) |>
  mutate(
    `Risk if Exposed` = sprintf("%.1f%%", risk_exposed),
    `Risk if Unexposed` = sprintf("%.1f%%", risk_unexposed),
    `Risk Ratio (95% CI)` = sprintf("%.2f (%.2f - %.2f)", risk_ratio, rr_ci_low, rr_ci_high),
    `Odds Ratio (95% CI)` = sprintf("%.2f (%.2f - %.2f)", odds_ratio, or_ci_low, or_ci_high)
  ) |>
  select(variable, `Risk if Exposed`, `Risk if Unexposed`, `Risk Ratio (95% CI)`)

rr_table |>
  kable(col.names = c("Risk Factor", "Risk if Present", "Risk if Absent", "Risk Ratio (95% CI)"),
        align = c("l", "r", "r", "c"),
        caption = "Risk Ratios for Binary Risk Factors") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE)
Risk Ratios for Binary Risk Factors
Risk Factor Risk if Present Risk if Absent Risk Ratio (95% CI)
high_bp 27.1% 7.2% 3.76 (3.68 - 3.84)
diff_walk 33.8% 12.1% 2.79 (2.74 - 2.83)
high_chol 24.7% 9.2% 2.69 (2.64 - 2.74)
heart_diseaseor_attack 35.8% 13.7% 2.61 (2.56 - 2.67)
stroke 34.3% 15.0% 2.29 (2.23 - 2.36)
smoker 18.3% 13.7% 1.34 (1.31 - 1.36)
fruits 14.6% 17.8% 0.82 (0.81 - 0.84)
veggies 14.7% 20.2% 0.73 (0.71 - 0.74)
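
As a rough check on how the two measures relate, the risk ratio and odds ratio can be recomputed from the rounded risks in the table. This is a minimal sketch in base R; the helper name `risk_measures_from_risks` is hypothetical and not part of the analysis pipeline, and because the inputs are rounded to one decimal place the results only approximate the reported values.

```r
# Hypothetical helper: risk ratio and odds ratio from two group risks,
# given as proportions (e.g. 0.271 for 27.1%).
risk_measures_from_risks <- function(risk_exposed, risk_unexposed) {
  odds_exposed   <- risk_exposed / (1 - risk_exposed)
  odds_unexposed <- risk_unexposed / (1 - risk_unexposed)
  c(
    risk_ratio = risk_exposed / risk_unexposed,
    odds_ratio = odds_exposed / odds_unexposed
  )
}

# high_bp row of the risk ratio table (rounded inputs)
high_bp_measures <- risk_measures_from_risks(0.271, 0.072)
print(round(high_bp_measures, 2))
```

With these inputs the risk ratio lands close to the tabulated 3.76 and the odds ratio close to the univariate 4.78. The odds ratio sits further from 1 than the risk ratio because diabetes is not a rare outcome among the exposed.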

Chi-Square Test Results

All binary risk factors show statistically significant associations with diabetes status (p < 0.001).

chi_table <- stat_results$chi_square |>
  mutate(
    `Chi-Square` = format(round(chi_square, 1), big.mark = ","),
    `P-value` = ifelse(p_value < 0.001, "< 0.001", sprintf("%.4f", p_value)),
    `Cramer's V` = round(cramers_v, 3),
    `Effect Size` = effect_size
  ) |>
  select(variable, `Chi-Square`, df, `P-value`, `Cramer's V`, `Effect Size`)

chi_table |>
  kable(align = c("l", "r", "c", "c", "r", "c"),
        caption = "Chi-Square Tests of Independence") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE)
Chi-Square Tests of Independence
variable Chi-Square df P-value Cramer's V Effect Size
high_bp 18,539.1 1 < 0.001 0.270 Small
high_chol 11,218.2 1 < 0.001 0.210 Small
stroke 2,786.2 1 < 0.001 0.105 Small
heart_diseaseor_attack 7,941.6 1 < 0.001 0.177 Small
phys_activity 3,738.2 1 < 0.001 0.121 Small
diff_walk 12,519.8 1 < 0.001 0.222 Small
smoker 999.8 1 < 0.001 0.063 Negligible
veggies 889.6 1 < 0.001 0.059 Negligible
hvy_alcohol_consump 815.0 1 < 0.001 0.057 Negligible
fruits 449.4 1 < 0.001 0.042 Negligible
Note: Statistical Significance

All tested associations are highly significant (p < 0.001) given the large sample size. Effect sizes (Cramer’s V) provide more meaningful information about the strength of associations, with high blood pressure showing the strongest relationship.
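
The link between the chi-square statistic and Cramer's V can be sketched as follows. The function name `cramers_v` and the constant `n_assumed` are illustrative; the sketch assumes each test was run on a 2x2 table over all 253,680 respondents, which is consistent with df = 1 but is not stated explicitly in the report.

```r
# Cramer's V from a chi-square statistic.
# n: total sample size; k: the smaller of (number of rows, number of columns).
cramers_v <- function(chi_square, n, k = 2) {
  sqrt(chi_square / (n * (k - 1)))
}

n_assumed <- 253680  # assumption: every respondent enters each test

v_high_bp   <- cramers_v(18539.1, n_assumed)
v_high_chol <- cramers_v(11218.2, n_assumed)
print(round(c(high_bp = v_high_bp, high_chol = v_high_chol), 3))
```

Because V divides the statistic by the sample size, a chi-square in the tens of thousands still corresponds to only a small effect here.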


Feature Engineering

Engineered Features

To improve model performance, we created several derived features:

feature_table <- tibble(
  Feature = c(
    "diabetes_binary",
    "bmi_category",
    "health_score",
    "lifestyle_score",
    "comorbidity_count",
    "age_bmi_interaction"
  ),
  Description = c(
    "Binary target: 1 if prediabetes or diabetes, 0 otherwise",
    "BMI classified into: Underweight, Normal, Overweight, Obese I/II/III",
    "Composite of general, mental, and physical health indicators",
    "Combined physical activity, fruit/vegetable consumption",
    "Count of comorbid conditions (high BP, high cholesterol, heart disease)",
    "Age category multiplied by BMI for interaction effects"
  ),
  Rationale = c(
    "Addresses class imbalance by combining minority classes",
    "Captures non-linear BMI effects and clinical cut-points",
    "Reduces dimensionality while capturing overall health status",
    "Summarizes protective lifestyle behaviors",
    "Captures cumulative disease burden",
    "Models age-dependent BMI effects"
  )
)

feature_table |>
  kable(col.names = c("Feature", "Description", "Rationale"),
        align = c("l", "l", "l")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE)
Feature Description Rationale
diabetes_binary Binary target: 1 if prediabetes or diabetes, 0 otherwise Addresses class imbalance by combining minority classes
bmi_category BMI classified into: Underweight, Normal, Overweight, Obese I/II/III Captures non-linear BMI effects and clinical cut-points
health_score Composite of general, mental, and physical health indicators Reduces dimensionality while capturing overall health status
lifestyle_score Combined physical activity, fruit/vegetable consumption Summarizes protective lifestyle behaviors
comorbidity_count Count of comorbid conditions (high BP, high cholesterol, heart disease) Captures cumulative disease burden
age_bmi_interaction Age category multiplied by BMI for interaction effects Models age-dependent BMI effects
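
A minimal sketch of how four of these features could be derived in base R is shown below. The toy data frame `df_demo` is made up for illustration, the BMI cut-points are the standard WHO classes (the report does not list the exact breaks it used), and `health_score` and `lifestyle_score` are omitted because their formulas are not given.

```r
# Toy rows for illustration only; not taken from the BRFSS data.
df_demo <- data.frame(
  diabetes_012           = c(0, 1, 2),
  bmi                    = c(17, 27, 41),
  high_bp                = c(0, 1, 1),
  high_chol              = c(0, 0, 1),
  heart_diseaseor_attack = c(0, 0, 1),
  age                    = c(3, 8, 11)
)

# Binary target: prediabetes (1) and diabetes (2) form the positive class.
df_demo$diabetes_binary <- as.integer(df_demo$diabetes_012 >= 1)

# BMI classes with left-closed intervals, e.g. [25, 30) is "Overweight".
# Assumption: standard WHO cut-points.
df_demo$bmi_category <- cut(
  df_demo$bmi,
  breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
  labels = c("Underweight", "Normal", "Overweight",
             "Obese I", "Obese II", "Obese III"),
  right = FALSE
)

# Count of the three comorbid conditions named in the table.
df_demo$comorbidity_count <- with(
  df_demo, high_bp + high_chol + heart_diseaseor_attack
)

# Age category multiplied by BMI.
df_demo$age_bmi_interaction <- df_demo$age * df_demo$bmi

print(df_demo)
```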

Data Splitting Methodology

split_info <- tibble(
  Set = c("Training", "Test"),
  `Sample Size` = c("202,944 (80%)", "50,736 (20%)"),
  Purpose = c(
    "Model fitting and hyperparameter tuning",
    "Unbiased performance evaluation"
  ),
  `Stratification` = c("Yes - by diabetes status", "Yes - by diabetes status")
)

split_info |>
  kable(align = c("l", "r", "l", "c")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                position = "left")
Set Sample Size Purpose Stratification
Training 202,944 (80%) Model fitting and hyperparameter tuning Yes - by diabetes status
Test 50,736 (20%) Unbiased performance evaluation Yes - by diabetes status
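
One way to produce a stratified 80/20 split in base R is sketched below. The function `stratified_split` and its seed are hypothetical; the report does not say which tool performed the split, so this only illustrates the idea of sampling within each outcome class.

```r
# Hypothetical helper: sample train_frac of the indices within each class
# of y, so the class proportions match in the training and test sets.
stratified_split <- function(y, train_frac = 0.8, seed = 123) {
  set.seed(seed)
  idx <- seq_along(y)
  train_idx <- unlist(
    lapply(split(idx, y), function(i) {
      # Index via sample.int to stay safe when a class has a single member.
      i[sample.int(length(i), size = round(train_frac * length(i)))]
    }),
    use.names = FALSE
  )
  list(train = sort(train_idx), test = setdiff(idx, train_idx))
}

# Toy outcome with 20% positives.
y_demo <- rep(c(0, 1), times = c(80, 20))
split_demo <- stratified_split(y_demo)

# Class balance is preserved in both partitions.
print(c(train = mean(y_demo[split_demo$train]),
        test  = mean(y_demo[split_demo$test])))
```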

Predictive Modeling

Model Descriptions

We developed and compared two classification approaches:

Logistic Regression

  • Interpretable linear model with probabilistic outputs
  • L2 regularization to prevent overfitting
  • Provides odds ratios for clinical interpretation

Random Forest

  • Ensemble of 500 decision trees
  • Handles non-linear relationships and interactions automatically
  • Provides variable importance rankings

ROC Curves Comparison

The Receiver Operating Characteristic (ROC) curves demonstrate the discrimination ability of both models across all classification thresholds.

Figure: ROC Curves Comparison
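
The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. A minimal base R sketch of that rank-based calculation follows; `auc_rank` is an illustrative name, and the report itself may have used a dedicated package.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# prob: predicted probabilities; truth: 0/1 outcome.
auc_rank <- function(prob, truth) {
  r  <- rank(prob)          # average ranks, so ties count as one half
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
demo_auc <- auc_rank(prob = c(0.1, 0.4, 0.35, 0.8), truth = c(0, 0, 1, 1))
print(demo_auc)
```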

Model Performance Metrics

perf_metrics <- model_results$model_comparison |>
  mutate(across(where(is.numeric), ~round(., 3))) |>
  pivot_longer(-model, names_to = "Metric", values_to = "Value") |>
  pivot_wider(names_from = model, values_from = Value) |>
  mutate(
    Metric = case_when(
      Metric == "accuracy" ~ "Accuracy",
      Metric == "sensitivity" ~ "Sensitivity (Recall)",
      Metric == "specificity" ~ "Specificity",
      Metric == "precision" ~ "Precision (PPV)",
      Metric == "npv" ~ "Negative Predictive Value",
      Metric == "f1_score" ~ "F1 Score",
      Metric == "balanced_accuracy" ~ "Balanced Accuracy",
      Metric == "mcc" ~ "Matthews Correlation Coeff.",
      Metric == "auc_roc" ~ "AUC-ROC",
      Metric == "auc_pr" ~ "AUC-PR",
      Metric == "brier_score" ~ "Brier Score",
      TRUE ~ Metric
    )
  )

perf_metrics |>
  kable(col.names = c("Metric", "Logistic Regression", "Random Forest"),
        align = c("l", "r", "r"),
        caption = "Model Performance Comparison (Threshold = 0.5)") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) |>
  row_spec(c(2, 3, 9), bold = TRUE, background = "#f0f7f4")
Model Performance Comparison (Threshold = 0.5)
Metric Logistic Regression Random Forest
Accuracy 0.850 0.851
Sensitivity (Recall) 0.200 0.201
Specificity 0.972 0.972
Precision (PPV) 0.569 0.574
Negative Predictive Value 0.866 0.867
F1 Score 0.296 0.298
Balanced Accuracy 0.586 0.586
Matthews Correlation Coeff. 0.273 0.276
AUC-ROC 0.820 0.813
AUC-PR 0.447 0.442
Brier Score 0.107 0.108

Confusion Matrices

Figure: Confusion Matrices

Feature Importance Comparison

Figure: Feature Importance

Optimal Threshold Analysis

The default threshold of 0.5 may not be optimal for clinical applications. We evaluated alternative thresholds to balance sensitivity and specificity.

thresh_table <- model_results$optimal_thresholds |>
  filter(model == "Logistic Regression") |>
  mutate(
    sensitivity = paste0(round(sensitivity * 100, 1), "%"),
    specificity = paste0(round(specificity * 100, 1), "%"),
    f1_score = round(f1_score, 3),
    optimal_threshold = round(optimal_threshold, 2)
  ) |>
  select(metric, optimal_threshold, sensitivity, specificity, f1_score)

thresh_table |>
  kable(col.names = c("Optimization Criterion", "Threshold", "Sensitivity", "Specificity", "F1 Score"),
        align = c("l", "r", "r", "r", "r"),
        caption = "Optimal Thresholds for Logistic Regression") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Optimal Thresholds for Logistic Regression
Optimization Criterion Threshold Sensitivity Specificity F1 Score
Youden's J (max sens + spec - 1) 0.15 77.9% 71.1% 0.469
Balanced (sens = spec) 0.17 73.6% 74.7% 0.477
F1 (max F1 score) 0.22 64.3% 81.4% 0.488
Tip: Clinical Recommendation

For screening applications where missing cases is costly, use a lower threshold (~0.15-0.20) to achieve higher sensitivity (~78%) while maintaining acceptable specificity (~71%).
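
The Youden criterion in the table can be sketched as a grid search over thresholds. The function `youden_threshold` and the toy inputs are hypothetical; the second part recomputes J from the rounded sensitivities and specificities reported above.

```r
# Hypothetical helper: scan a threshold grid and return the threshold that
# maximizes Youden's J = sensitivity + specificity - 1.
youden_threshold <- function(prob, truth, grid = seq(0.01, 0.99, by = 0.01)) {
  j <- vapply(grid, function(t) {
    pred <- prob >= t
    sens <- sum(pred & truth == 1) / sum(truth == 1)
    spec <- sum(!pred & truth == 0) / sum(truth == 0)
    sens + spec - 1
  }, numeric(1))
  best <- which.max(j)  # first maximum if several thresholds tie
  list(threshold = grid[best], j = j[best])
}

# Toy predictions that separate the classes perfectly.
toy <- youden_threshold(
  prob  = c(0.05, 0.10, 0.204, 0.296, 0.60, 0.80),
  truth = c(0, 0, 0, 1, 1, 1)
)

# J recomputed from the reported (rounded) operating points.
reported <- data.frame(
  criterion = c("Youden", "Balanced", "F1"),
  sens      = c(0.779, 0.736, 0.643),
  spec      = c(0.711, 0.747, 0.814)
)
reported$j <- reported$sens + reported$spec - 1
print(reported)
```

The recomputed J is highest for the Youden row, as expected, while the F1-optimal threshold trades sensitivity for specificity.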


Causal Inference Framework

From Prediction to Causation

While the previous sections focused on predictive modeling—identifying patterns that forecast diabetes status—this section transitions to causal inference, which asks a fundamentally different question: What would happen if we intervened on a risk factor?

Note: Why Causal Thinking Matters

A predictive model can identify that high blood pressure is correlated with diabetes, but it cannot tell us whether reducing blood pressure would prevent diabetes. Causal inference attempts to answer this counterfactual question using observational data and explicit causal assumptions.

Directed Acyclic Graph (DAG)

We formalized our causal assumptions in a Directed Acyclic Graph (DAG) representing the hypothesized causal relationships among diabetes risk factors.

Figure 1: Causal DAG for Diabetes Risk Factors

Key elements of the DAG:

  • Outcome (Diabetes): The target of all causal paths
  • Exposures of Interest: BMI, Physical Activity, High Blood Pressure
  • Confounders: Variables that influence both exposure and outcome (e.g., Age, Income)
  • Mediators: Variables through which exposures affect outcomes (e.g., BMI mediates exercise effects)
  • Latent Variables: Unmeasured confounders (e.g., Genetics) that represent limitations

DAG Node Classification and Causal Roles

Node Type | Variables | Causal Role
Outcome | Diabetes | Target of intervention analysis
Primary Exposures | BMI, Physical Activity, High BP, High Cholesterol | Modifiable factors we can intervene on
Confounders | Age, Sex, Income, Education | Must be adjusted to avoid bias
Mediators | BMI (for exercise path), General Health | Intermediate on causal path; block for direct effects
Colliders | Heart Disease, Stroke | Conditioning opens backdoor paths; avoid adjusting
Latent (Unmeasured) | Genetics | Cannot adjust; sensitivity analysis required

Minimal Adjustment Sets

The DAG-implied minimal adjustment sets identify which variables must be controlled to obtain unbiased causal effect estimates:

Minimal Sufficient Adjustment Sets for Causal Identification

Causal Question | Adjustment Set | Interpretation
Effect of High BP on Diabetes | No adjustment needed (no backdoor paths) | High BP is downstream of confounders; no backdoor paths open
Effect of BMI on Diabetes | No adjustment needed (direct descendant of confounders) | BMI is a causal ancestor of diabetes; confounders already upstream
Total Effect of Physical Activity | Age, Diet, Healthcare Access, Income OR Education | Must block confounding but not mediation through BMI
Direct Effect of Physical Activity | Age, Diet, Healthcare Access, Income + BMI (mediator blocked) | Block mediation through BMI to isolate exercise-specific effect

Average Treatment Effects (ATE)

Using inverse probability weighting and the identified adjustment sets, we estimated the Average Treatment Effect for key modifiable risk factors:

Average Treatment Effects on Diabetes Risk (percentage points)

Exposure | ATE (95% CI) | Direction | Interpretation
High Blood Pressure | 12.5 pp (11.8 to 13.1) | Increases Risk | High BP increases diabetes risk by 12.5 percentage points
Obesity (BMI >= 30) | 14.6 pp (13.9 to 15.3) | Increases Risk | Obesity increases diabetes risk by 14.6 percentage points
Physical Activity (Total) | -6.0 pp (-6.7 to -5.2) | Decreases Risk | Exercise reduces diabetes risk by 6.0 percentage points (total)
Physical Activity (Direct) | -3.5 pp (-4.2 to -2.8) | Decreases Risk | Exercise reduces diabetes risk by 3.5 percentage points (direct)

Important: Key Causal Finding

High blood pressure has a large estimated causal effect: a 12.5 percentage point increase in absolute diabetes risk, second only to obesity (14.6 pp) among the exposures examined. This corresponds to a number needed to treat of approximately 8 (1 / 0.125): preventing high BP in 8 individuals would prevent 1 diabetes case.
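
The inverse probability weighting approach named in this section can be sketched on simulated data. Everything below is an assumption for illustration: the data are synthetic rather than BRFSS, there is a single confounder (`age_z`), the true effect is fixed at 12 percentage points by construction, and the normalized (weighted-mean) form of the estimator is one common choice that the report does not confirm it used.

```r
set.seed(42)
n <- 20000

# Synthetic data: age affects both the exposure and the outcome.
age_z    <- rnorm(n)
high_bp  <- rbinom(n, 1, plogis(0.8 * age_z))
p_diab   <- 0.10 + 0.12 * high_bp + 0.05 * (age_z > 0)  # true ATE = 0.12
diabetes <- rbinom(n, 1, p_diab)

# Step 1: propensity score, P(high_bp = 1 | age_z).
ps <- fitted(glm(high_bp ~ age_z, family = binomial()))

# Step 2: inverse probability weights.
w <- ifelse(high_bp == 1, 1 / ps, 1 / (1 - ps))

# Step 3: weighted risk in each exposure group, then the difference.
risk_exposed   <- weighted.mean(diabetes[high_bp == 1], w[high_bp == 1])
risk_unexposed <- weighted.mean(diabetes[high_bp == 0], w[high_bp == 0])
ate_ipw <- risk_exposed - risk_unexposed

# Unadjusted difference for comparison (confounded by age).
ate_naive <- mean(diabetes[high_bp == 1]) - mean(diabetes[high_bp == 0])

print(round(c(ipw = ate_ipw, naive = ate_naive, nnt = 1 / ate_ipw), 3))
```

The last element applies the same reciprocal used for the number needed to treat: an absolute risk difference of 0.125 gives 1 / 0.125 = 8.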

Counterfactual Analysis

We estimated what diabetes prevalence would have been under hypothetical interventions:

Counterfactual Analysis: What If We Intervened?

Intervention Scenario | Observed | Counterfactual | Absolute Reduction | Cases Preventable
Everyone exercises | 15.9% | 14.2% | 1.7 pp | 10.5%
BMI reduced by 5 units | 15.9% | 11.7% | 4.1 pp | 26.1%
No high blood pressure | 15.9% | 9.3% | 6.6 pp | 41.6%

Interpretation:

  • Population Attributable Fraction (PAF): The proportion of cases that would be prevented if the exposure were eliminated
  • BMI reduction by 5 units: Would prevent an estimated 26% of diabetes cases
  • Eliminating high blood pressure: The most impactful single intervention at 42% of cases
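
The "Cases Preventable" column follows from the observed and counterfactual prevalences as (observed - counterfactual) / observed. A short base R sketch, using the rounded prevalences from the table, so the results differ slightly from the reported figures (which were presumably computed from unrounded values):

```r
# Share of cases prevented under a counterfactual prevalence.
cases_preventable <- function(observed, counterfactual) {
  (observed - counterfactual) / observed
}

scenarios <- data.frame(
  scenario       = c("Everyone exercises",
                     "BMI reduced by 5 units",
                     "No high blood pressure"),
  observed       = c(15.9, 15.9, 15.9),  # prevalence in percent
  counterfactual = c(14.2, 11.7, 9.3)
)
scenarios$pct_preventable <- 100 * cases_preventable(
  scenarios$observed, scenarios$counterfactual
)
print(scenarios)
```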

E-Value Sensitivity Analysis

Since we cannot measure all confounders (notably genetics), we computed E-values to assess how robust our causal conclusions are to unmeasured confounding:

E-Value Sensitivity Analysis for Unmeasured Confounding

Exposure | Risk Ratio | E-value | Robustness
High Blood Pressure | 1.79 | 2.97 | Moderate
Obesity | 1.92 | 3.26 | Strong
Physical Activity | 1.61 | 2.59 | Moderate

The E-value represents the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both the exposure and the outcome to fully explain away the observed causal effect. For example, an E-value of 3.26 for obesity means an unmeasured confounder would need to triple the risk of both obesity AND diabetes—a very strong relationship—to nullify our estimate.
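
For a risk ratio RR greater than 1, the E-value is RR + sqrt(RR × (RR − 1)); protective risk ratios are inverted first. The sketch below applies that formula to the rounded risk ratios in the table, so the results can differ from the reported E-values in the second decimal place. The function name `e_value` is illustrative.

```r
# E-value for a point estimate on the risk ratio scale.
e_value <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)  # invert protective risk ratios
  rr + sqrt(rr * (rr - 1))
}

e_values <- e_value(c(high_bp = 1.79, obesity = 1.92, phys_activity = 1.61))
print(round(e_values, 2))
```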

Causal Framework Limitations

Warning: Important Assumptions
  1. No unmeasured confounding: We assume all backdoor paths are blocked by measured variables. Genetics, in particular, remains unmeasured.

  2. Positivity: Treatment/exposure must be possible for all covariate strata. Violations may occur in extreme age or BMI groups.

  3. Consistency: The exposure must be well-defined. “High blood pressure” as a binary may obscure dose-response relationships.

  4. No interference: One person’s exposure does not affect another’s outcome—reasonable for diabetes.

  5. Cross-sectional limitation: Our data cannot establish temporal precedence. Bidirectional effects (diabetes causing weight gain) are possible.


Anomaly Discovery: Resilient and Vulnerable Subgroups

Motivation: Who Defies Prediction?

Standard predictive models assume that individuals with similar risk profiles will have similar outcomes. But what about those who defy prediction? Understanding these “anomalous” individuals may reveal:

  • Protective factors not captured in standard models
  • Unmeasured vulnerabilities that warrant investigation
  • Opportunities for personalized intervention

Defining Resilience and Vulnerability

We identified two anomalous subgroups based on prediction residuals:

Anomalous Subgroup Definitions

Subgroup | Definition | Clinical Significance
Resilient | High predicted risk (>75th percentile) but NO diabetes | Potential protective factors to identify
Vulnerable | Low predicted risk (<25th percentile) but HAS diabetes | May have unmeasured risk factors or atypical disease
Expected Healthy | Low predicted risk and no diabetes (model correct) | Benchmark for healthy profiles
Expected Diabetic | High predicted risk and has diabetes (model correct) | Confirms known risk factor importance
Typical | Middle risk range with variable outcomes | Typical prediction uncertainty range
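
The subgroup definitions above can be expressed as a small classification function. This is a sketch under the stated percentile cut-offs; `classify_subgroup` and the toy vectors are hypothetical, and the report's own implementation may differ in detail.

```r
# Hypothetical helper: assign each respondent to a subgroup from the
# predicted risk and the observed 0/1 outcome.
classify_subgroup <- function(pred_risk, has_diabetes,
                              low_q = 0.25, high_q = 0.75) {
  lo <- quantile(pred_risk, low_q, names = FALSE)
  hi <- quantile(pred_risk, high_q, names = FALSE)
  out <- rep("Typical", length(pred_risk))
  out[pred_risk > hi & has_diabetes == 0] <- "Resilient"
  out[pred_risk > hi & has_diabetes == 1] <- "Expected Diabetic"
  out[pred_risk < lo & has_diabetes == 0] <- "Expected Healthy"
  out[pred_risk < lo & has_diabetes == 1] <- "Vulnerable"
  out
}

# Toy example with eight respondents.
pred_demo <- c(0.01, 0.02, 0.10, 0.12, 0.15, 0.18, 0.60, 0.70)
y_demo2   <- c(0,    1,    0,    0,    1,    0,    0,    1)
groups_demo <- classify_subgroup(pred_demo, y_demo2)
print(table(groups_demo))
```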

Subgroup Sizes

Distribution of Anomalous Subgroups

Subgroup | N | % of Sample | Mean Predicted Risk | Actual Diabetes Rate
Resilient | 38,962 | 15.4% | 29.3% | 0.0%
Vulnerable | 238 | 0.1% | 1.5% | 100.0%
Expected Healthy | 75,866 | 29.9% | 1.0% | 0.0%
Expected Diabetic | 37,142 | 14.6% | 51.9% | 100.0%
Typical | 101,472 | 40.0% | 9.2% | 2.6%

Note: Key Finding

15.4% of respondents are “Resilient”: they have high-risk profiles (mean predicted risk 29%) yet do not have diabetes. This substantial group warrants investigation for protective factors. Conversely, only 0.09% are “Vulnerable”, having diabetes despite very low predicted risk.

Comparative Profiles

Figure 2: Subgroup Profile Comparison

Demographic Characteristics

Demographic Profile by Subgroup

Subgroup | N | Mean Age (Category) | % Female | Mean Income (1-8) | Mean Education (1-6)
Resilient | 38,962 | 9.7 | 53.0% | 5.1 | 4.7
Vulnerable | 238 | 6.8 | 53.8% | 6.9 | 5.4
Expected Healthy | 75,866 | 5.9 | 60.8% | 6.9 | 5.4
Expected Diabetic | 37,142 | 9.4 | 52.7% | 5.1 | 4.7

Health Behavior Profiles

Health Behavior Profile by Subgroup

Subgroup | Physically Active | Eats Fruits | Eats Vegetables | Ever Smoked (100+ Cigarettes)
Resilient | 61.9% | 58.7% | 75.3% | 52.7%
Vulnerable | 89.5% | 66.8% | 87.0% | 34.0%
Expected Healthy | 88.8% | 69.1% | 87.6% | 32.5%
Expected Diabetic | 62.1% | 57.9% | 75.0% | 52.3%

Protective Factors in Resilient Individuals

Figure 3: Resilience Factors Analysis

The resilient group has a risk profile very similar to that of the expected diabetic group, and the measured differences between the two are small:

  1. Dietary habits: Fruit (58.7% vs 57.9%) and vegetable (75.3% vs 75.0%) consumption are higher by less than one percentage point
  2. Income: Mean income is identical at the reported precision (5.1 vs 5.1), so any socioeconomic advantage is too small to appear in the demographic table
  3. Better general health perception: Despite similar objective risk factors

Risk Factors in Vulnerable Individuals

Figure 4: Vulnerability Factors Analysis

The vulnerable group—those with diabetes despite low predicted risk—shows:

  1. Higher rates of hypertension: 11.3% vs 7.6% (1.5x the rate in the expected healthy group)
  2. Higher rates of high cholesterol: 20.6% vs 14.7% (1.4x the rate in the expected healthy group)
  3. Poorer self-rated health: Average 1.9 vs 1.7 (on 1-5 scale where lower is better)

Generated Hypotheses for Future Research

Research Hypotheses Generated from Anomaly Discovery
ID Hypothesis Supporting Evidence Priority
1 Physical activity provides protection even at high metabolic risk Resilient group's physical activity rate is not higher than the expected diabetic group's (61.9% vs 62.1%), so the current data do not support this directly High
2 Better dietary habits (fruits/vegetables) may confer resilience to diabetes Resilient group consumes marginally more fruit (58.7% vs 57.9%) High
3 Higher socioeconomic status offers protective resources beyond measured risk factors Resilient and expected diabetic groups have the same mean income at reported precision (5.1 vs 5.1) Medium
4 BMI alone is insufficient - body composition or fat distribution may matter more Large residuals suggest unmeasured factors beyond BMI and standard risk factors High
6 Healthcare access and regular checkups enable early intervention in low-risk individuals Low-risk individuals developing diabetes may have undetected underlying conditions Medium
7 Genetic predisposition may override lifestyle factors in some individuals Vulnerable individuals defy prediction despite low risk (n=238) Medium
8 Inflammatory markers (not measured) may identify vulnerable low-risk individuals Vulnerable group may have elevated inflammatory or metabolic markers not captured High

Clinical Implications of Anomaly Discovery

Tip: Translating Anomalies to Practice
  1. Resilient individuals may benefit from maintenance interventions that preserve their protective factors, even as they age or develop comorbidities

  2. Vulnerable individuals should receive enhanced screening for conditions not captured in standard models (e.g., gestational diabetes history, family history, stress markers)

  3. The existence of large residuals suggests our models are missing important predictors—candidates include genetics, sleep patterns, stress, and detailed dietary quality

  4. Phenotyping approach: Resilient profiles could inform “healthy aging” interventions; vulnerable profiles could guide precision prevention


Algorithmic Fairness Audit

Why Fairness Matters in Healthcare AI

Predictive models deployed in healthcare settings can perpetuate or exacerbate existing disparities if they perform differently across demographic groups. A model that works well “on average” may systematically underserve certain populations.

Important: Ethical Imperative

Healthcare algorithms that inform screening, treatment, or resource allocation decisions must be evaluated for fairness across protected characteristics including sex, age, race/ethnicity, and socioeconomic status.

Fairness Metric Definitions

We evaluated model performance using multiple fairness criteria:

Fairness Metrics Evaluated
Metric Definition Violation Indicates
Statistical Parity Positive prediction rate is equal across groups Differential screening/flagging rates
Equal Opportunity True positive rate (sensitivity) is equal across groups Differential detection of true cases
Predictive Parity Positive predictive value is equal across groups Differential precision in positive predictions
Calibration Predicted probabilities reflect actual outcomes equally across groups Systematic over/under-estimation of risk
Four-Fifths Rule Selection rates for any group should be at least 80% of the highest group Substantial adverse impact on a protected group
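
As a rough illustration of how the ratio-based metrics in the tables that follow can be derived, the base R sketch below computes per-group rates from a confusion matrix and divides them by a reference group's rates. The function names and toy data are hypothetical. The `magnitude` column is defined as the absolute distance of the ratio from 1, which is what the Disparity Score and Magnitude columns in this report appear to imply (for example, 1.78 and 0.78). The 0.8 to 1.25 flagging band is an assumption based on the four-fifths rule.

```r
# Per-group confusion-matrix rates (y and pred_class coded 0/1)
group_metrics <- function(y, pred_class, group) {
  rows <- lapply(split(seq_along(y), group), function(idx) {
    tp <- sum(pred_class[idx] == 1 & y[idx] == 1)
    fp <- sum(pred_class[idx] == 1 & y[idx] == 0)
    tn <- sum(pred_class[idx] == 0 & y[idx] == 0)
    fn <- sum(pred_class[idx] == 0 & y[idx] == 1)
    data.frame(
      selection_rate = (tp + fp) / (tp + fp + tn + fn),  # statistical parity
      tpr = tp / (tp + fn),                              # equal opportunity
      fpr = fp / (fp + tn),                              # predictive equality
      ppv = tp / (tp + fp)                               # predictive parity
    )
  })
  out <- do.call(rbind, rows)
  out$group <- names(rows)
  out
}

# Ratio of each group's rate to the reference group's rate
disparity_ratios <- function(metrics, reference) {
  ref <- metrics[metrics$group == reference, ]
  ratios <- data.frame(
    group = metrics$group,
    statistical_parity_ratio = metrics$selection_rate / ref$selection_rate,
    equal_opportunity_ratio = metrics$tpr / ref$tpr,
    predictive_equality_ratio = metrics$fpr / ref$fpr
  )
  # Assumed flagging rule: selection-rate ratio outside [0.8, 1.25]
  ratios$magnitude <- abs(ratios$statistical_parity_ratio - 1)
  ratios$flagged <- ratios$statistical_parity_ratio < 0.8 |
    ratios$statistical_parity_ratio > 1.25
  ratios
}

# Toy data for illustration only
y          <- c(1, 1, 0, 0, 1, 0,  1, 1, 0, 0, 0, 0)
pred_class <- c(1, 0, 1, 0, 1, 0,  1, 0, 1, 0, 0, 0)
group      <- rep(c("A", "B"), each = 6)

m <- group_metrics(y, pred_class, group)
r <- disparity_ratios(m, reference = "B")
r
```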

Overall Model Fairness Summary

Figure 5: Fairness Overview
Model-Wide Fairness Summary
Metric Value
Overall True Positive Rate 20.1%
Overall False Positive Rate 2.8%
Overall Accuracy 85.1%
Total Disparity Flags 20

Disparities by Protected Attribute

Sex-Based Performance

Figure 6: Fairness by Sex

Finding: No substantial sex-based disparities were flagged. The model performs comparably for males and females.

Age-Based Performance

Figure 7: Fairness by Age
Flagged Age-Based Disparities
Age Group Metric Disparity Score Magnitude
Senior (65+) Predictive equality ratio FP/(FP + TN) 1.78 0.78
Senior (65+) Statistical parity ratio (TP + FP)/(TP + FP + TN + FN) 1.56 0.56
Young (18-44) Equal opportunity ratio TP/(TP + FN) 0.21 0.79
Young (18-44) Predictive equality ratio FP/(FP + TN) 0.07 0.93
Young (18-44) Statistical parity ratio (TP + FP)/(TP + FP + TN + FN) 0.07 0.93

Key Findings:

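  • Senior Adults (65+): Higher false positive rate (1.78x) and positive prediction rate (1.56x); no sensitivity disparity was flagged
  • Young Adults (18-44): Much lower sensitivity (equal opportunity ratio 0.21, i.e., 79% below the reference group), so the model may miss more young diabetics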

Income-Based Performance

Figure 8: Fairness by Income
Flagged Income-Based Disparities
Income Group Metric Disparity Score Magnitude
Low (<$25K) Equal opportunity ratio TP/(TP + FN) 2.60 1.60
Low (<$25K) Predictive equality ratio FP/(FP + TN) 6.72 5.72
Low (<$25K) Statistical parity ratio (TP + FP)/(TP + FP + TN + FN) 5.95 4.95
Medium ($25K-$50K) Equal opportunity ratio TP/(TP + FN) 1.56 0.56
Medium ($25K-$50K) Predictive equality ratio FP/(FP + TN) 2.84 1.84
Medium ($25K-$50K) Statistical parity ratio (TP + FP)/(TP + FP + TN + FN) 2.62 1.62
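Warning: Critical Disparity Alert

Low-income individuals show the most severe fairness violations:

  • 6.7x the false positive rate and 5.95x the positive prediction rate of high-income individuals
  • 2.6x the true positive rate (equal opportunity ratio 2.60): cases are detected far more often in the low-income group, so relative to it the high-income group is under-detected
  • All three ratios are far from parity and well outside the band implied by the four-fifths rule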

Intersectional Analysis

Examining combinations of protected attributes reveals compounded disparities:

Intersectional Fairness Analysis: Sex x Income
Sex + Income Group N Base Rate TPR FPR PPV
Female + High (>$50K) 13,696 8.5% 10.8% 0.8% 54.3%
Female + Low (<$25K) 7,574 25.2% 31.1% 7.6% 58.1%
Female + Medium ($25K-$50K) 7,212 15.9% 16.3% 2.6% 54.2%
Male + High (>$50K) 13,024 12.8% 12.1% 1.3% 57.1%
Male + Low (<$25K) 3,970 25.7% 27.8% 6.6% 59.3%
Male + Medium ($25K-$50K) 5,261 20.6% 19.7% 3.7% 58.0%
Figure 9: Intersectional Fairness Heatmap

Disparity Summary by Protected Attribute

Total Disparity Flags by Protected Attribute
Protected Attribute Number of Flagged Disparities
Age 5
Education 9
Income 6

Recommendations for Bias Mitigation

Based on the fairness audit, we recommend:

  1. Pre-processing approaches:
    • Reweight training samples to balance representation across income groups
    • Consider separate calibration for high-disparity subgroups
  2. In-processing approaches:
    • Add fairness constraints during model training
    • Use adversarial debiasing to reduce reliance on proxy features
  3. Post-processing approaches:
    • Apply group-specific thresholds to equalize TPR
    • Implement confidence intervals that account for group membership
  4. Deployment considerations:
    • Monitor real-world performance by demographic group
    • Establish human-in-the-loop review for high-stakes decisions
    • Document known limitations in model cards
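
To make the post-processing recommendation concrete, the sketch below picks a separate decision threshold for each group so that every group reaches approximately the same true positive rate. It is a minimal base R illustration on simulated data. The function names, the group labels, and the 0.75 target are assumptions, not part of the audited pipeline.

```r
# Largest threshold at which a group's TPR still meets the target
threshold_for_tpr <- function(pred, y, target_tpr) {
  pos_scores <- sort(pred[y == 1], decreasing = TRUE)
  k <- ceiling(target_tpr * length(pos_scores))  # positives that must be captured
  pos_scores[k]                                  # classify positive when pred >= this
}

# One threshold per group
group_thresholds <- function(pred, y, group, target_tpr = 0.75) {
  sapply(split(seq_along(y), group), function(idx) {
    threshold_for_tpr(pred[idx], y[idx], target_tpr)
  })
}

# Simulated scores that run higher in one group (illustration only)
set.seed(42)
n <- 2000
group <- sample(c("low_income", "high_income"), n, replace = TRUE)
y <- rbinom(n, 1, ifelse(group == "low_income", 0.25, 0.10))
pred <- plogis(-2 + 1.5 * y + ifelse(group == "low_income", 0.8, 0) + rnorm(n))

thr <- group_thresholds(pred, y, group, target_tpr = 0.75)
pred_class <- as.integer(pred >= thr[group])
tpr_by_group <- tapply(pred_class[y == 1], group[y == 1], mean)

thr           # thresholds differ by group
tpr_by_group  # sensitivities are approximately equal
```

Equalizing TPR this way generally leaves false positive rates unequal, which is one instance of the metric trade-offs noted in the audit limitations.
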
Tip: Ethical Framework

Fairness is not a purely technical problem. Stakeholder engagement—including patients, clinicians, and community representatives—should inform which fairness criteria are prioritized and how trade-offs are resolved.


Multi-Level Environmental Analysis

Rationale for Multi-Level Modeling

Individual risk factors alone do not fully explain diabetes patterns. The environment in which people live—including healthcare access, local health behaviors, and disease burden—may independently influence diabetes risk or modify individual-level effects.

Multi-level modeling allows us to:

  • Partition variance between individual and contextual levels
  • Estimate environmental effects after controlling for individual characteristics
  • Test cross-level interactions (e.g., does exercise protect more in high-risk environments?)

Data Fusion Methodology

We integrated individual BRFSS data with county-level CDC PLACES data using a propensity-based ecological inference approach:

Data Fusion Pipeline for Multi-Level Analysis
Step Description Data/Output
1 Create environmental composite scores from CDC PLACES county-level measures 3,000+ counties with health measures
2 Stratify counties into environmental risk quintiles 5 strata from Very Low to Very High risk
3 Estimate individual-to-stratum propensity scores Based on demographic and health similarities
4 Fit multi-level logistic regression with individuals nested in environmental strata Individual + contextual predictors + random slopes
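
Steps 1 and 2 of the pipeline can be illustrated with a short base R sketch. The county measures, column names, and the three middle stratum labels are hypothetical. The report specifies only that composites are built from CDC PLACES measures and cut into five strata from Very Low to Very High.

```r
# Simulated county-level measures (hypothetical names and values)
set.seed(456)
counties <- data.frame(
  uninsured_pct  = rnorm(300, 12, 4),
  no_checkup_pct = rnorm(300, 25, 5),
  obesity_pct    = rnorm(300, 33, 5)
)

# Step 1: standardize each measure and average into a composite score
z <- scale(counties)
counties$env_risk_score <- rowMeans(z)

# Step 2: stratify counties into quintiles of the composite
breaks <- quantile(counties$env_risk_score, probs = seq(0, 1, 0.2))
counties$risk_stratum <- cut(
  counties$env_risk_score,
  breaks = breaks,
  include.lowest = TRUE,
  labels = c("Very Low", "Low", "Medium", "High", "Very High")
)

table(counties$risk_stratum)
sd(counties$env_risk_score)
```

Averaging standardized components yields a composite with mean 0 but a standard deviation below 1 unless the components are perfectly correlated. That would be consistent with the SDs reported for the composite scores in this report, if they were built in a similar way.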

Environmental Composite Scores

We constructed six environmental scores from county-level data:

Environmental Composite Score Summary (Standardized)
Environmental Score Mean SD Min Max
healthcare_access_score 0 0.81 -1.88 4.29
health_behavior_score 0 0.56 -2.71 3.01
chronic_disease_score 0 0.90 -2.74 3.52
environmental_risk_score 0 0.51 -1.58 2.41
mental_health_score 0 0.90 -2.41 3.51
physical_health_score 0 0.88 -2.75 3.24
Figure 10: Environmental Context Distributions

Variance Partition Coefficient (VPC)

The VPC quantifies how much of the total variation in diabetes risk is attributable to environmental context vs. individual characteristics:

Variance Partition Coefficient by Model Specification
Model VPC Interpretation
Null (No Predictors) 16.9% Total environmental variance
Individual Predictors Only 0.54% Residual after individual factors
Individual + Environmental Context 0% Residual after all predictors
Individual + Context + Interaction 0% Residual after all predictors
Figure 11: VPC Decomposition
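
For a multi-level logistic model, a common way to compute the VPC is the latent-variable method, which fixes the individual-level variance at π²/3. The sketch below assumes that method was used (the report does not say) and back-calculates the stratum-level variances implied by the VPC values in the table. The function names are hypothetical.

```r
# Latent-variable VPC for a random-intercept logistic model:
# VPC = var_u / (var_u + pi^2 / 3)
vpc_latent <- function(var_u) var_u / (var_u + pi^2 / 3)

# Inverse: random-intercept variance implied by a given VPC
var_from_vpc <- function(vpc) vpc * (pi^2 / 3) / (1 - vpc)

var_null  <- var_from_vpc(0.169)   # null model
var_indiv <- var_from_vpc(0.0054)  # after individual predictors
round(c(null = var_null, individual = var_indiv), 3)

# Share of the null-model VPC that remains after individual predictors
round(0.0054 / 0.169, 3)  # about 0.032
```

The remaining share of about 3.2% is close to the Environmental Context figure in the variance decomposition. This suggests, though the report does not confirm it, that the 97%/3% split describes how much of the between-stratum variance individual characteristics account for.
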
Note: Key Finding: Individual Factors Dominate

97% of variance in diabetes risk is explained by individual-level characteristics. Environmental context accounts for only 3% of total variance. This suggests that while environment matters, targeting individual risk factors remains the primary lever for intervention.

Variance Decomposition

Final Variance Decomposition
Source Variance (%)
Individual Characteristics 96.8
Environmental Context 3.2
Residual/Unexplained 0.0

Fixed Effects: Individual and Contextual Predictors

Multi-Level Model Fixed Effects
Predictor Odds Ratio (95% CI) P-value Level
Individual-Level Predictors
bmi_centered BMI (centered) 1.07 (1.07 - 1.08) < 0.001 Individual (Level 1)
age_centered Age (centered) 1.16 (1.15 - 1.17) < 0.001 Individual (Level 1)
smoker Smoker (100+ lifetime cigarettes) 1.09 (1.03 - 1.16) 0.002 Individual (Level 1)
phys_activity Physical Activity 0.72 (0.68 - 0.77) < 0.001 Individual (Level 1)
high_bp High Blood Pressure 2.32 (2.17 - 2.48) < 0.001 Individual (Level 1)
high_chol High Cholesterol 1.77 (1.66 - 1.88) < 0.001 Individual (Level 1)
Contextual-Level Predictors
env_risk_scaled Environmental Risk (Context) 1.38 (1.26 - 1.51) < 0.001 Contextual (Level 2)
Figure 12: Multi-Level Effects Visualization
Important: Environmental Risk Effect

Environmental risk score is significant (OR = 1.38, 95% CI 1.26 to 1.51, p < 0.001): each one-unit increase in the scaled environmental risk score is associated with 38% higher odds of diabetes after adjusting for individual characteristics. This suggests that community-level interventions (improving healthcare access, promoting healthy food environments) may complement individual-level prevention.
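
Because each odds ratio in the fixed-effects table is expressed per one unit of its predictor, comparisons across larger differences require rescaling. The helper names below are hypothetical. The sketch assumes `bmi_centered` is in raw BMI units, since centering alone does not change the unit.

```r
# Odds ratio for a k-unit difference, given the per-unit odds ratio
or_per_k <- function(or, k) or^k

# Percent change in odds implied by an odds ratio
pct_change <- function(or) (or - 1) * 100

or_per_k(1.07, 5)   # BMI: odds ratio for a 5-unit difference
pct_change(0.72)    # physical activity: percent change in odds
pct_change(1.38)    # environmental risk: percent change per scaled unit
```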

Cross-Level Interactions

Figure 13: Cross-Level Interaction Effects

We tested whether individual-level effects vary by environmental context:

  • Physical activity protection is slightly stronger in high-risk environments
  • BMI effects are consistent across environments
  • Age effects are slightly attenuated in low-risk environments

Geographic Patterns

Environmental risk scores show substantial geographic variation:

Geographic Distribution of Environmental Risk
Classification N Counties Percentage
Cold Spot (Very Low) 134 4.5%
Below Average 826 27.9%
Average 1179 39.9%
Elevated 580 19.6%
Hot Spot (Very High) 237 8.0%

Implications for Population Health

  1. Individual interventions remain primary: The dominance of individual-level variance supports continued focus on modifiable personal risk factors

  2. Environmental context is significant: The 38% increase in odds associated with high-risk environments justifies place-based interventions

  3. Equity lens: Environmental risk concentrates in areas with lower socioeconomic status, compounding individual-level disparities

  4. Policy targets:

    • Improve healthcare access in high-risk communities
    • Address food deserts and physical activity infrastructure
    • Target screening programs to high-burden areas

Discussion

Key Findings Interpretation

Our analysis reveals several clinically meaningful insights:

  1. General health perception is remarkably predictive: Self-rated general health emerged as the strongest single predictor of diabetes status. This simple, single-item measure captures complex interactions between physical function, mental health, and quality of life that contribute to diabetes risk.

  2. The cardiovascular-metabolic nexus: High blood pressure and high cholesterol show some of the strongest associations with diabetes, consistent with the concept of metabolic syndrome. These conditions share common pathophysiological mechanisms and respond to similar lifestyle interventions.

  3. Age is not modifiable but is actionable: While age itself cannot be changed, the strong age-diabetes relationship informs screening guidelines. Our findings support intensified screening for individuals over 45 years of age.

  4. Model performance is clinically useful: An AUC of 0.82 indicates good discrimination ability. At the optimal threshold, the model correctly identifies approximately 78% of diabetics while maintaining 71% specificity.
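
The sensitivity and specificity quoted above depend on the decision threshold. The base R sketch below shows one common way to choose it, maximizing Youden's J (sensitivity + specificity - 1), and compares the result with the default 0.5 cutoff. The report does not state how its optimal threshold was defined, so Youden's J is an assumption here, and the data are simulated.

```r
# Sensitivity and specificity at a given threshold
sens_spec <- function(pred, y, threshold) {
  pred_class <- pred >= threshold
  c(sensitivity = mean(pred_class[y == 1]),
    specificity = mean(!pred_class[y == 0]))
}

# Threshold that maximizes Youden's J over all observed scores
youden_threshold <- function(pred, y) {
  candidates <- sort(unique(pred))
  j <- sapply(candidates, function(t) sum(sens_spec(pred, y, t)) - 1)
  candidates[which.max(j)]
}

# Simulated imbalanced outcome (about 14% positive), illustration only
set.seed(42)
y <- rbinom(2000, 1, 0.14)
pred <- plogis(-2.5 + 1.8 * y + rnorm(2000))

best <- youden_threshold(pred, y)
sens_spec(pred, y, best)  # balanced operating point
sens_spec(pred, y, 0.5)   # default cutoff: high specificity, low sensitivity
```

With an imbalanced outcome, the default 0.5 cutoff typically yields far lower sensitivity than the optimized threshold. This is why metrics reported at different thresholds, such as those in the fairness audit, are not directly comparable to the figures above.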

Clinical Implications

  • Primary Care Screening: The identified risk factors can inform clinical decision-making about diabetes screening intensity
  • Patient Education: High blood pressure and cholesterol control should be emphasized as diabetes prevention strategies
  • Resource Allocation: Risk scores can help prioritize limited prevention program resources

Comparison to Existing Literature

Our findings align with established diabetes epidemiology:

  • The strong association with hypertension mirrors the American Diabetes Association’s recognition of high blood pressure as a key comorbidity
  • BMI effects are consistent with the well-established obesity-diabetes relationship
  • Age patterns match CDC prevalence data showing peak diabetes rates in the 65+ population

Study Strengths

  1. Large, representative sample: Over 250,000 respondents with national coverage
  2. Comprehensive variable set: Multiple domains including behaviors, conditions, and demographics
  3. Rigorous methodology: Proper train/test splitting with stratification
  4. Multiple model comparison: Both interpretable and ensemble methods evaluated

Limitations

Warning: Important Considerations

Data Limitations

Cross-Sectional Design

The BRFSS is a cross-sectional survey, meaning all variables are measured at a single point in time. This precludes establishing temporal precedence—we cannot determine whether risk factors preceded diabetes or vice versa.

Self-Reported Data

All health conditions and behaviors are self-reported, introducing potential:

  • Recall bias: Respondents may inaccurately remember past behaviors
  • Social desirability bias: Underreporting of stigmatized conditions or behaviors
  • Diagnostic bias: Undiagnosed diabetes cases may be misclassified as healthy

Class Imbalance

Despite methodological adjustments, the severe imbalance (84% negative cases) may limit:

  • Detection of subtle predictive signals for the minority class
  • Model calibration in the high-risk range
  • Generalizability to higher-prevalence populations

Missing Prediabetes Cases

The extremely low prediabetes prevalence (1.8%) likely reflects underdiagnosis rather than true prevalence. The American Diabetes Association estimates 38% of US adults have prediabetes—our data captures only a small fraction of these cases.

Variable Limitations

  • No laboratory values: HbA1c, fasting glucose, and other biomarkers unavailable
  • Limited behavioral detail: Frequency and intensity of exercise not captured
  • No dietary quality: Only fruit/vegetable consumption binary indicators

Causal Inference Limitations

The causal analysis presented in this report relies on several strong assumptions:

  1. Unmeasured confounding: The DAG assumes we have identified and measured all relevant confounders. Genetics, in particular, represents a substantial unmeasured variable that could bias estimates.

  2. Positivity violations: Some covariate combinations may have very few or no observations, leading to extreme propensity weights and unstable estimates.

  3. Model specification: The causal effect estimates rely on correct specification of outcome and propensity models. Misspecification could bias results.

  4. Cross-sectional causal claims: Despite our causal framework, the cross-sectional nature of the data means we cannot rule out reverse causation or bidirectional effects.

  5. Transportability: Effects estimated in the BRFSS population may not generalize to other populations with different covariate distributions.
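
One standard way to gauge sensitivity to unmeasured confounding is the E-value. For a risk ratio RR ≥ 1 it equals RR + sqrt(RR × (RR − 1)), and protective estimates are inverted first. The sketch below implements that formula in base R. The input values are illustrative and are not estimates from this report.

```r
# E-value for a risk ratio: the minimum strength of association (on the risk
# ratio scale) that an unmeasured confounder would need with both exposure and
# outcome to fully explain away the observed estimate.
e_value <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)  # invert protective estimates
  rr + sqrt(rr * (rr - 1))
}

e_value(2)     # illustrative harmful effect
e_value(0.5)   # illustrative protective effect, same E-value as RR = 2
e_value(1)     # a null estimate needs no confounding to explain
```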

Anomaly Discovery Limitations

  1. Arbitrary thresholds: The 75th and 25th percentile thresholds for defining “high risk” and “low risk” are arbitrary. Different thresholds would yield different subgroup compositions.

  2. Overfitting risk: Comparing subgroups on the same variables used for prediction risks finding spurious differences.

  3. Small vulnerable group: With only 238 vulnerable individuals, estimates for this group have wide uncertainty and may not replicate.

  4. Missing phenotype data: The “resilience” we identify may simply reflect unmeasured variables rather than true biological resilience.
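
The uncertainty from the small vulnerable group can be made concrete with an exact binomial confidence interval. The sketch below uses the reported hypertension rate (11.3%) and group size (n = 238). The case count is back-calculated from the rounded percentage, so it is an approximation.

```r
n_vulnerable <- 238
n_htn <- round(0.113 * n_vulnerable)  # about 27 cases, inferred from the rounded rate

# Exact (Clopper-Pearson) 95% confidence interval for the proportion
ci <- binom.test(n_htn, n_vulnerable)$conf.int
round(100 * ci, 1)
```

An interval this wide means that the comparison with the expected healthy group (7.6%) should be treated as exploratory.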

Fairness Audit Limitations

  1. Missing protected attributes: Race/ethnicity was not available in our dataset, limiting our ability to assess disparities across these critical dimensions.

  2. Definition-dependent: Different fairness metrics often conflict; satisfying one metric may violate another. Our findings depend on which metrics are prioritized.

  3. Single model evaluated: We audited the random forest model; logistic regression may show different disparity patterns.

  4. Intersectionality challenges: Sample sizes decrease rapidly when examining intersections, reducing statistical power for detecting disparities.

  5. Threshold sensitivity: Fairness metrics at the default 0.5 threshold may differ substantially from those at clinically optimal thresholds.

Multi-Level Analysis Limitations

  1. Ecological inference: We could not link individuals to actual counties; the ecological analysis uses simulated county assignment based on propensity scores.

  2. Aggregation bias: County-level measures may not reflect individual exposures within those counties (ecological fallacy).

  3. Temporal mismatch: The CDC PLACES data (2023) and the BRFSS data (2015) come from different time periods, so county-level context may not reflect conditions at the time of the survey.

  4. Missing geographic variation: The small contextual variance (3%) may partly reflect our inability to capture true geographic clustering.

  5. Causal interpretation: Multi-level effects should not be interpreted causally; environmental selection and confounding remain possible.

Generalizability Considerations

Note: External Validity
  • Results are based on 2015 BRFSS data and may not reflect current diabetes epidemiology
  • Survey excludes institutionalized populations and those without telephone access
  • Self-selected survey respondents may differ systematically from non-respondents
  • Geographic and demographic representation depends on state-level response rates

Recommendations

Public Health Policy Implications

  1. Integrated Screening Programs: Given the strong comorbidity patterns, diabetes screening should be integrated with cardiovascular risk assessment programs

  2. General Health as a Screening Trigger: Consider using self-rated general health as a simple screening question to identify individuals warranting further evaluation

  3. Focus on Modifiable Factors: Public health messaging should emphasize blood pressure control, cholesterol management, and BMI reduction as diabetes prevention strategies

Clinical Screening Recommendations

# Packages this chunk relies on (normally loaded in the report's setup chunk)
library(tibble)
library(knitr)
library(kableExtra)

# Suggested screening frequency for each age group and risk-factor count
screen_table <- tibble(
  `Age Group` = c("18-44", "45-64", "65+"),
  `No Risk Factors` = c("Screen if BMI >= 25", "Screen every 3 years", "Screen annually"),
  `1-2 Risk Factors` = c("Screen every 3 years", "Screen annually", "Screen annually"),
  `3+ Risk Factors` = c("Screen annually", "Screen annually", "Screen every 6 months")
)

# Render as a striped, full-width HTML table
screen_table |>
  kable(align = c("l", "l", "l", "l"),
        caption = "Suggested Screening Frequency by Risk Profile") |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE)
Suggested Screening Frequency by Risk Profile
Age Group No Risk Factors 1-2 Risk Factors 3+ Risk Factors
18-44 Screen if BMI >= 25 Screen every 3 years Screen annually
45-64 Screen every 3 years Screen annually Screen annually
65+ Screen annually Screen annually Screen every 6 months
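
For use outside the rendered report, the same table can be expressed as a small lookup function. This is a sketch: `screening_interval()` is a hypothetical helper, and the age bands and risk-factor bands simply mirror the table above.

```r
# Look up the suggested screening frequency for one person
screening_interval <- function(age, n_risk_factors) {
  stopifnot(age >= 18, n_risk_factors >= 0)
  age_group <- if (age < 45) "18-44" else if (age < 65) "45-64" else "65+"
  risk_band <- if (n_risk_factors == 0) "none" else if (n_risk_factors <= 2) "1-2" else "3+"
  lookup <- list(
    "18-44" = c(none = "Screen if BMI >= 25", "1-2" = "Screen every 3 years", "3+" = "Screen annually"),
    "45-64" = c(none = "Screen every 3 years", "1-2" = "Screen annually", "3+" = "Screen annually"),
    "65+"   = c(none = "Screen annually", "1-2" = "Screen annually", "3+" = "Screen every 6 months")
  )
  lookup[[age_group]][[risk_band]]
}

screening_interval(age = 52, n_risk_factors = 1)
screening_interval(age = 70, n_risk_factors = 3)
```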

Lifestyle Intervention Targets

Priority areas for intervention based on attributable risk:

  1. Blood Pressure Management: Hypertension control could prevent a substantial proportion of diabetes cases
  2. Weight Management: Programs targeting BMI reduction in overweight/obese individuals
  3. Physical Activity: Increasing physical activity levels, particularly in sedentary populations
  4. Dietary Improvement: Increasing fruit and vegetable consumption

Future Research Directions

  1. Longitudinal Studies: Prospective cohort studies to establish temporal relationships
  2. Biomarker Integration: Incorporate laboratory values for improved prediction
  3. Subgroup Analysis: Examine risk factor profiles by race/ethnicity and geographic region
  4. Intervention Trials: Test effectiveness of risk score-guided prevention programs

Conclusion

This comprehensive analysis of CDC BRFSS data provides valuable insights into diabetes risk factors in the American population. Our multi-method approach—spanning predictive modeling, causal inference, anomaly discovery, fairness auditing, and multi-level analysis—yields a nuanced understanding of diabetes epidemiology.

Summary of Key Findings

Predictive Modeling:

  • Diabetes affects 14% of respondents, with likely substantial underdiagnosis of prediabetes
  • General health perception, high blood pressure, and high cholesterol emerge as the strongest predictors
  • Predictive models achieve AUC of 0.82, demonstrating clinical utility for risk stratification

Causal Analysis:

  • High blood pressure has the largest causal effect on diabetes (12.5 percentage point increase)
  • Eliminating high BP could prevent 42% of diabetes cases (highest population attributable fraction)
  • Physical activity directly reduces risk by 3.5 percentage points, with additional indirect effects through BMI
  • E-values suggest moderate-to-strong robustness to unmeasured confounding

Anomaly Discovery:

  • 15.4% of individuals are “resilient”—high risk profiles without diabetes
  • Only 0.09% are “vulnerable”—diabetes despite low predicted risk
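  • Resilient individuals differ only marginally from the expected diabetic group on measured lifestyle factors; vulnerable individuals have higher rates of hypertension and high cholesterol than the expected healthy group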
  • Hypotheses generated for future investigation of protective and risk mechanisms

Fairness Assessment:

  • 20 fairness disparities flagged across age, income, and education groups
  • Most severe disparities in low-income populations: 6.7x the false positive rate and 5.95x the positive prediction rate of high-income respondents
  • Intersectional analysis reveals compounded disadvantages for low-income groups
  • Bias mitigation recommendations provided for deployment

Multi-Level Analysis:

  • 97% of variance is individual-level; 3% attributable to environmental context
  • Environmental risk independently increases odds by 38% after controlling for individual factors
  • Geographic hot spots concentrated in areas with poor healthcare access and high chronic disease burden
  • Place-based interventions may complement individual-level prevention

Public Health Implications

These findings support a multi-pronged approach to diabetes prevention:

  1. Individual-level: Target modifiable risk factors (BP, BMI, physical activity) in high-risk individuals
  2. Causal targeting: Prioritize hypertension control given its large attributable fraction
  3. Equity-focused: Address fairness gaps in screening and prediction for socioeconomically disadvantaged populations
  4. Community-level: Invest in environmental improvements in high-burden areas
  5. Precision approaches: Use anomaly profiles to guide personalized prevention strategies

Final Remarks

This analysis demonstrates the value of extending beyond traditional predictive modeling to address causal, fairness, and contextual questions. While our models can identify who is at risk, the advanced analyses provide insight into why they are at risk and whether we are serving all populations equitably. Future work should prioritize longitudinal validation, incorporation of genetic and biomarker data, and prospective evaluation of fairness-aware deployment strategies.


Appendix

A. Full Logistic Regression Coefficients

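# Packages this chunk relies on (normally loaded in the report's setup chunk)
library(dplyr)
library(knitr)
library(kableExtra)

# Format the coefficient table. `estimate`, `conf.low`, and `conf.high` are
# taken to be on the odds-ratio scale already (i.e., exponentiated upstream).
full_coefs <- model_summary$logistic$coefficients |>
  mutate(
    `Odds Ratio` = round(estimate, 3),
    `95% CI Lower` = round(conf.low, 3),
    `95% CI Upper` = round(conf.high, 3),
    `Std. Error` = round(std.error, 4),
    `Z-statistic` = round(statistic, 2),
    # Report very small p-values as a bound rather than as 0.0000
    `P-value` = ifelse(p.value < 0.001, "< 0.001", sprintf("%.4f", p.value))
  ) |>
  select(term, `Odds Ratio`, `95% CI Lower`, `95% CI Upper`,
         `Std. Error`, `Z-statistic`, `P-value`)

# Render as a compact, scrollable HTML table
full_coefs |>
  kable(align = c("l", rep("r", 6)),
        caption = "Complete Logistic Regression Output") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 12) |>
  scroll_box(height = "400px")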
Complete Logistic Regression Output
term Odds Ratio 95% CI Lower 95% CI Upper Std. Error Z-statistic P-value
(Intercept) 0.000 0.000 0.000 0.1521 -54.85 < 0.001
age_bmi_risk 3.777 2.327 6.137 0.2474 5.37 < 0.001
gen_hlth 1.651 1.623 1.681 0.0090 56.02 < 0.001
high_bp 1.612 1.542 1.685 0.0226 21.11 < 0.001
high_chol 1.482 1.419 1.546 0.0218 18.00 < 0.001
sex 1.225 1.192 1.260 0.0141 14.36 < 0.001
bmi 1.206 1.191 1.220 0.0061 30.56 < 0.001
cardiovascular_risk 1.203 1.168 1.239 0.0151 12.27 < 0.001
health_burden 0.879 0.851 0.907 0.0164 -7.88 < 0.001
age 1.077 1.052 1.103 0.0119 6.24 < 0.001
diff_walk 1.065 1.028 1.104 0.0182 3.48 < 0.001
income 0.939 0.932 0.946 0.0038 -16.82 < 0.001
high_risk_flag 1.054 1.010 1.100 0.0217 2.42 0.0157
education 0.963 0.949 0.977 0.0074 -5.10 < 0.001
bmi_squared 0.998 0.998 0.998 0.0001 -30.31 < 0.001
lifestyle_score 0.999 0.985 1.014 0.0075 -0.08 0.9376
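
The columns in this table are linked through the log-odds scale: the standard error applies to log(OR), the Z-statistic is log(OR) divided by that standard error, and a Wald interval is exp(log(OR) ± z × SE). The sketch below recomputes these quantities for the `high_bp` row as a consistency check. `wald_ci()` is a hypothetical helper, and the report's intervals may have been computed by a different method (for example, profile likelihood), so small differences are expected.

```r
# Wald confidence interval and Z-statistic from an odds ratio and the
# standard error of its log
wald_ci <- function(or, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  log_or <- log(or)
  c(lower  = exp(log_or - z * se),
    upper  = exp(log_or + z * se),
    z_stat = log_or / se)
}

# high_bp row: OR = 1.612, SE = 0.0226 (reported CI 1.542 to 1.685, Z = 21.11)
round(wald_ci(1.612, 0.0226), 3)
```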

B. Technical Details and Reproducibility

Software Environment

Computing Environment
Component Value
R Version 4.5.1
Platform aarch64-apple-darwin20
Operating System darwin20
Report Generated 2025-12-05 16:40:16 CST

Key Package Versions

Key Packages Used in Analysis
Package Version Purpose
tidyverse 2.0.0 Data manipulation, visualization, and tidying
knitr 1.50 Dynamic report generation
kableExtra 1.4.0 Publication-quality tables
scales 1.4.0 Axis and legend formatting

Analysis Pipeline Packages

The following packages were used in the underlying analysis scripts:

Packages Used in Analysis Pipeline
Analysis Package Purpose
Predictive Modeling ranger Random forest implementation
Predictive Modeling pROC ROC curve analysis and AUC calculation
Predictive Modeling yardstick Classification metrics
Causal Inference dagitty DAG specification and adjustment sets
Causal Inference EValue (or custom implementation) E-value sensitivity analysis
Fairness Audit fairness (or custom implementation) Fairness metric computation
Multi-Level Modeling lme4 Mixed-effects logistic regression
Visualization ggplot2 Publication-quality figures

Model Specifications

Model Specifications
Model Specification Notes
Logistic Regression glm(formula, family = binomial(link = 'logit')) L2 regularization applied via glmnet in some analyses
Random Forest ranger(num.trees = 500, mtry = sqrt(p), min.node.size = 10) Out-of-bag error for tuning; probability = TRUE for calibration
Causal (IPW) Inverse probability weighting with DAG-implied adjustment sets Stabilized weights with trimming at 1st/99th percentiles
Multi-Level glmer(formula, family = binomial, nAGQ = 1) Adaptive Gauss-Hermite quadrature with 1 point (Laplace approximation)
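
The IPW row can be illustrated with a compact base R sketch on simulated data. Variable names and the data-generating model are hypothetical. Here "trimming" is implemented as truncating weights at the 1st and 99th percentiles, which is one common reading; dropping observations outside those bounds is another, and the report does not say which was used.

```r
# Simulated confounded exposure and outcome (illustration only)
set.seed(42)
n <- 5000
age <- rnorm(n)
bmi <- rnorm(n)
high_bp  <- rbinom(n, 1, plogis(-0.5 + 0.8 * age + 0.5 * bmi))
diabetes <- rbinom(n, 1, plogis(-2 + 0.7 * high_bp + 0.5 * age + 0.4 * bmi))

# Propensity score: P(high_bp = 1 | confounders)
ps <- fitted(glm(high_bp ~ age + bmi, family = binomial))

# Stabilized weights: marginal exposure probability over the propensity score
p_exposed <- mean(high_bp)
w <- ifelse(high_bp == 1, p_exposed / ps, (1 - p_exposed) / (1 - ps))

# Truncate extreme weights at the 1st and 99th percentiles
bounds <- quantile(w, c(0.01, 0.99), names = FALSE)
w_trim <- pmin(pmax(w, bounds[1]), bounds[2])

# Weighted difference in outcome risk (average treatment effect estimate)
ate <- weighted.mean(diabetes[high_bp == 1], w_trim[high_bp == 1]) -
  weighted.mean(diabetes[high_bp == 0], w_trim[high_bp == 0])

mean(w)  # stabilized weights should average close to 1
ate
```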

Random Seeds

To ensure reproducibility, the following random seeds were used:

Random Seeds for Reproducibility
Analysis Seed Set Via
Train/Test Split 42 set.seed(42)
Random Forest Training 42 set.seed(42)
Anomaly Discovery (Subsampling) 123 set.seed(123)
Multi-Level Model (Propensity Simulation) 456 set.seed(456)
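
As an illustration of how a seed makes the stratified train/test split repeatable, the sketch below samples a fixed fraction within each outcome class using base R. The outcome vector is simulated. The 20% test fraction is an assumption, inferred from the intersectional fairness table (about 50,700 respondents, roughly one fifth of the sample).

```r
set.seed(42)
y <- rbinom(10000, 1, 0.14)  # simulated outcome, about 14% positive

# Sample 20% of the indices within each class so prevalence is preserved
test_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.2 * length(idx)))
}))
train_idx <- setdiff(seq_along(y), test_idx)

# Prevalence should be nearly identical in both partitions
c(train = mean(y[train_idx]), test = mean(y[test_idx]))
```

Re-running the block with the same seed reproduces the same partition; changing the seed changes which rows land in the test set but not the class balance.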

Data Sources

Data Sources
Dataset Source Year N Records Access
BRFSS Diabetes Health Indicators UCI Machine Learning Repository / CDC 2015 253,680 individuals https://archive.ics.uci.edu/dataset/891/
CDC PLACES County-Level Data CDC PLACES: Local Data for Better Health 2023 3,000+ counties https://www.cdc.gov/places/

Code Availability

Tip: Reproducibility Statement

All analysis code is available in the project repository:

  • Data preparation: scripts/01_data_preparation.R
  • Exploratory analysis: scripts/02_exploratory_analysis.R
  • Statistical modeling: scripts/03_statistical_modeling.R
  • Model evaluation: scripts/04_model_evaluation.R
  • Causal inference: scripts/05_causal_inference.R
  • Anomaly discovery: scripts/06_anomaly_discovery.R
  • Fairness audit: scripts/07_fairness_audit.R
  • Multi-level analysis: scripts/08_multilevel_analysis.R
  • Report generation: reports/diabetes_analysis_report.qmd

Output Files

Saved Analysis Results
File Contents Location
statistical_results_diabetes.rds Correlation matrices, chi-square tests, odds ratios, risk ratios output/
model_evaluation_results.rds Model predictions, performance metrics, thresholds, feature importance output/
causal_inference_results.rds DAG, adjustment sets, ATEs, counterfactuals, E-values output/
anomaly_discovery_results.rds Subgroup assignments, profiles, hypotheses, factor analysis output/
fairness_audit_results.rds Fairness metrics by group, intersectional analysis, recommendations output/
data_fusion_results.rds Environmental scores, VPC, fixed effects, geographic patterns output/

C. Data Dictionary

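# Packages this chunk relies on (normally loaded in the report's setup chunk)
library(knitr)
library(kableExtra)

# Render the variable reference (`variable_info` is defined earlier in the report)
variable_info |>
  kable(col.names = c("Variable", "Description", "Type"),
        align = c("l", "l", "c"),
        caption = "Complete Variable Reference") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) |>
  scroll_box(height = "400px")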
Complete Variable Reference
Variable Description Type
Diabetes_012 Diabetes status: 0=No, 1=Prediabetes, 2=Diabetes Target
HighBP High blood pressure diagnosis (0/1) Binary
HighChol High cholesterol diagnosis (0/1) Binary
CholCheck Cholesterol check in past 5 years (0/1) Binary
BMI Body Mass Index (continuous) Continuous
Smoker Smoked at least 100 cigarettes in lifetime (0/1) Binary
Stroke Ever had a stroke (0/1) Binary
HeartDiseaseorAttack Coronary heart disease or heart attack (0/1) Binary
PhysActivity Physical activity in past 30 days (0/1) Binary
Fruits Consume fruit 1+ times per day (0/1) Binary
Veggies Consume vegetables 1+ times per day (0/1) Binary
HvyAlcoholConsump Heavy alcohol consumption (0/1) Binary
AnyHealthcare Any healthcare coverage (0/1) Binary
NoDocbcCost Could not see doctor due to cost (0/1) Binary
GenHlth General health: 1=Excellent to 5=Poor Ordinal
MentHlth Days of poor mental health (past 30 days) Count
PhysHlth Days of poor physical health (past 30 days) Count
DiffWalk Difficulty walking or climbing stairs (0/1) Binary
Sex Sex: 0=Female, 1=Male Binary
Age Age category: 1=18-24 to 13=80+ Ordinal
Education Education level: 1-6 scale Ordinal
Income Income level: 1=<$10k to 8=>$75k Ordinal

Report generated on December 05, 2025 at 04:40 PM
Data Analysis Team | CDC BRFSS Diabetes Health Indicators Project