Modelling Diabetes Risk: Social and Behavioural Factors in BRFSS Data

Multivariable Logistic Regression Analysis · n = 287,804

This notebook estimates adjusted associations between diabetes status and a set of demographic, socioeconomic, and behavioural predictors using multivariable logistic regression.

The data come from the 2024 Behavioral Risk Factor Surveillance System (BRFSS), a nationally representative telephone survey of U.S. adults conducted annually by the CDC. The analysis examines whether these factors remain associated with diabetes status after adjustment for confounding.

1 Setup

Load packages and helpers
library(here)
library(tidyverse)
library(arrow)
library(gtsummary)
library(janitor)
library(broom)            # tidy() — extract model coefficients as a data frame
library(car)              # vif() — variance inflation factors
library(performance)      # check_model() — visual diagnostic plots
library(see)              # plotting backend for performance
library(patchwork)        # plot composition and annotation
library(pROC)             # roc(), auc(), ggroc() — ROC curve and AUC
library(ResourceSelection) # hoslem.test() — Hosmer-Lemeshow goodness-of-fit

source(here("R", "utils.R"))
source(here("R", "plot_theme.R"))

car checks multicollinearity via variance inflation factors, and performance provides diagnostic plots to assess overall model behaviour. pROC evaluates discrimination through the ROC curve and AUC, and ResourceSelection provides the Hosmer-Lemeshow test as a basic check of calibration across risk strata.

Model estimates are extracted with broom and presented using gtsummary for consistent reporting.


2 Data Import

Code
df_raw <- read_parquet(here("data", "processed", "analysis_dataset.parquet"))
cat("Rows:", nrow(df_raw), "| Columns:", ncol(df_raw), "\n")
## Rows: 453241 | Columns: 15

The analytical dataset is produced by the SQL pipeline from the raw 2024 BRFSS XPT file. Respondents with a missing or ambiguous diabetes response are excluded at the SQL stage, so all 453,241 rows carry a definitive binary outcome. No further row-level exclusions are applied until the complete-case step in Data Preparation.


3 Data Preparation

3.1 Factor levels and reference categories

Reference categories determine the comparison group for each odds ratio. We set these to substantively meaningful baselines — typically the lowest-risk or most socioeconomically advantaged group — so that estimated effects reflect higher risk relative to a favourable exposure level, making the direction of associations intuitive.

Set factor levels and reference categories
df <- df_raw |>
  mutate(
    diabetes = factor(diabetes, levels = c(0, 1), labels = c("No", "Yes")),

    age_group = factor(age_group,
                       levels = c("18-34", "35-44", "45-54", "55-64", "65-74", "75+")),

    sex = factor(sex, levels = c("Male", "Female")),

    race_ethnicity = factor(race_ethnicity,
                            levels = c("White", "Black",
                                       "American Indian/Alaskan Native",
                                       "Asian", "Pacific Islander",
                                       "Other/Multiracial",
                                       "Hispanic", "Other")),

    education = factor(education,
                       levels = c("College graduate", "Some college",
                                  "High school graduate", "Did not graduate high school")),

    income_group = factor(income_group,
                          levels = c(">$100k", "$50k-$100k", "$35k-$50k",
                                     "$25k-$35k", "$15k-$25k", "<$15k")),

    bmi_category = factor(bmi_category,
                          levels = c("Normal", "Underweight", "Overweight", "Obese")),

    physically_active = factor(physically_active, levels = c(1, 0),
                               labels = c("Active", "Inactive")),

    smoking_status = factor(smoking_status,
                            levels = c("Never", "Former",
                                       "Current (some days)", "Current (daily)")),

    heavy_drinker = factor(heavy_drinker, levels = c(0, 1),
                           labels = c("No", "Yes")),

    has_provider = factor(has_provider, levels = c(1, 0),
                          labels = c("Has provider", "No provider")),

    mental_health_days = factor(mental_health_days,
                                levels = c("0 days", "1-13 days", "14+ days"))
  )

The first level of each factor becomes the reference category under R’s default treatment coding. This choice affects how contrasts are expressed, but does not change overall model fit.

3.2 Complete-case exclusions

glm() requires no missing values. Rows with a missing value on any predictor are dropped here, and the retained dataset df_cc is used for all subsequent modelling.

Code
predictor_cols <- c(
  "age_group", "sex", "race_ethnicity", "education", "income_group",
  "bmi_category", "physically_active", "smoking_status",
  "heavy_drinker", "has_provider", "mental_health_days"
)

n_before <- nrow(df)
df_cc    <- df |> drop_na(all_of(predictor_cols))
n_after  <- nrow(df_cc)

cat("Before:", n_before, "| After:", n_after, "| Excluded:", n_before - n_after,
    "(", round((n_before - n_after) / n_before * 100, 1), "%)\n")
## Before: 453241 | After: 287804 | Excluded: 165437 ( 36.5 %)

36.5% of respondents are excluded, largely due to the 25.5% income-group missingness identified in the EDA. The reason for this missingness is unknown; if lower-income respondents are more likely to skip the question, the estimated income gradient may be attenuated under complete-case analysis, while the opposite pattern would bias estimates in the other direction. This is the main limitation of the complete-case specification in this model.


4 Model Specification

The outcome — diabetes diagnosis — is binary, so logistic regression is used to model the log-odds of the outcome as a function of the predictors. Coefficients are exponentiated and reported as odds ratios (ORs), the standard effect measure for cross-sectional analyses of this type.

All 11 predictors are entered simultaneously rather than selected stepwise. Stepwise procedures can improve apparent fit in-sample but tend to produce unstable estimates and inflated Type I error. The aim here is to estimate the adjusted association of each predictor with diabetes status, so a theory-driven full model is preferred.

The model includes three demographic predictors (age, sex, race/ethnicity), two socioeconomic predictors (education and income), one access-to-care predictor (having a personal doctor), and five behavioural and clinical predictors (BMI, physical activity, smoking, alcohol use, and mental health).

Code
model_formula <- formula(
  diabetes ~
    age_group + sex + race_ethnicity +
    education + income_group + has_provider +
    bmi_category + physically_active + smoking_status +
    heavy_drinker + mental_health_days
)
model_formula
diabetes ~ age_group + sex + race_ethnicity + education + income_group + 
    has_provider + bmi_category + physically_active + smoking_status + 
    heavy_drinker + mental_health_days

We are using an associational model fit to cross-sectional data, so it estimates adjusted odds rather than causal effects. Diabetes status and the predictors are measured at the same point in time, so the direction of causality cannot be established from this data alone.


5 Model Fitting

The specified logistic regression model is fit to the complete-case dataset.

Code
mod <- glm(model_formula, data = df_cc, family = binomial(link = "logit"))

saveRDS(mod,   here("data", "processed", "mod.rds"))
saveRDS(df_cc, here("data", "processed", "df_cc.rds"))

cat("Converged:", mod$converged, "\n")
## Converged: TRUE
glance(mod)

The model converged successfully. The reduction in deviance relative to the null model indicates that the predictors collectively improve model fit. Adjusted associations are interpreted in the Results section, after diagnostic checks.


6 Assumption Checks

Model fit alone does not guarantee reliable estimates, so key assumptions are checked before interpreting coefficients.

6.1 Multicollinearity — VIF

Highly correlated predictors inflate standard errors and make individual coefficients unstable, even when overall model fit is acceptable. Variance inflation factors (VIFs) quantify this inflation for each predictor.

Code
vif(mod)
##                        GVIF Df GVIF^(1/(2*Df))
## age_group          1.314941  5        1.027757
## sex                1.074021  1        1.036350
## race_ethnicity     1.223534  7        1.014515
## education          1.329057  3        1.048554
## income_group       1.402839  5        1.034429
## has_provider       1.047195  1        1.023325
## bmi_category       1.082000  3        1.013222
## physically_active  1.093264  1        1.045593
## smoking_status     1.167625  3        1.026165
## heavy_drinker      1.012101  1        1.006032
## mental_health_days 1.153756  2        1.036403

Because several predictors are multi-level categorical variables, car::vif() reports generalized VIFs (GVIF). The adjusted values, GVIF^(1/(2·Df)), are all below 1.05 — well under the conventional threshold of 2 — indicating no meaningful collinearity.

Income and education show slightly higher values than the other predictors, consistent with their expected correlation in the population, but neither approaches a level that would inflate standard errors or destabilise coefficient estimates.

6.2 Model diagnostics

Three diagnostic panels assess model fit, influential observations, and collinearity before interpreting coefficients.

Code
cm <- check_model(mod, check = c("binned_residuals", "outliers", "vif"),
                  residual_type = "normal")

diag_plot <- plot(cm) &
  theme(
    plot.title    = element_text(size = 12, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, hjust = 0.5),
    axis.title    = element_text(size = 8),
    plot.margin   = margin(t = 8, r = 8, b = 8, l = 8)
  )

diag_plot[[3]] <- diag_plot[[3]] +
  theme(
    plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(b = 10, l = 60)),
    axis.text.x   = element_text(angle = 45, hjust = 1, vjust = 1, size = 7.5),
    plot.margin   = margin(t = 8, r = 8, b = 18, l = 8)
  )

diag_plot +
  patchwork::plot_annotation(
    title = "Model Diagnostic Checks",
    theme = theme(
      plot.title  = element_text(size = 16, face = "bold", hjust = 0.5),
      plot.margin = margin(t = 14, b = 8)
    )
  )

The binned residuals plot shows systematic deviation outside the error bounds, with residuals consistently negative across the predicted probability range. The model overpredicts diabetes probability across most of the risk distribution, indicating that the linear functional form does not fully capture the relationship between the predictors and the log-odds of the outcome.

This pattern is consistent with non-linear associations in key predictors — particularly age, BMI, and mental health days — and the absence of interaction terms in the current specification. No influential observations are identified in the Cook’s distance panel, as expected given the sample size.

This misfit does not invalidate the model, but it suggests that effect estimates should be interpreted with caution.


7 Results

7.1 Adjusted odds ratios

The primary result of this analysis is the set of adjusted ORs from the full multivariable model — the estimated odds of diabetes for each predictor level, holding all other predictors constant.

Code
tbl_regression(
  mod,
  exponentiate = TRUE,
  estimate_fun = \(x) style_sigfig(x, digits = 3),
  label = list(
    age_group          ~ "Age group",
    sex                ~ "Sex",
    race_ethnicity     ~ "Race/ethnicity",
    education          ~ "Education",
    income_group       ~ "Household income",
    bmi_category       ~ "BMI category",
    physically_active  ~ "Physically active",
    smoking_status     ~ "Smoking status",
    heavy_drinker      ~ "Heavy drinker",
    has_provider       ~ "Has personal doctor",
    mental_health_days ~ "Poor mental health days"
  )
) |>
  bold_p(t = 0.05) |>
  bold_labels() |>
  modify_caption("**Table 1.** Adjusted odds ratios - full social determinants model")
Table 1. Adjusted odds ratios - full social determinants model
Characteristic OR 95% CI p-value
Age group


    18-34
    35-44 2.21 1.99, 2.46 <0.001
    45-54 5.57 5.06, 6.13 <0.001
    55-64 10.9 9.91, 11.9 <0.001
    65-74 14.4 13.2, 15.8 <0.001
    75+ 17.4 15.9, 19.1 <0.001
Sex


    Male
    Female 0.703 0.687, 0.719 <0.001
Race/ethnicity


    White
    Black 1.63 1.57, 1.70 <0.001
    American Indian/Alaskan Native 1.83 1.69, 1.99 <0.001
    Asian 1.94 1.79, 2.10 <0.001
    Pacific Islander 1.80 1.55, 2.07 <0.001
    Other/Multiracial 1.16 1.02, 1.32 0.018
    Hispanic 1.27 1.18, 1.37 <0.001
    Other 1.50 1.44, 1.56 <0.001
Education


    College graduate
    Some college 1.19 1.15, 1.22 <0.001
    High school graduate 1.16 1.13, 1.20 <0.001
    Did not graduate high school 1.33 1.27, 1.40 <0.001
Household income


    >$100k
    $50k-$100k 1.21 1.17, 1.25 <0.001
    $35k-$50k 1.43 1.37, 1.49 <0.001
    $25k-$35k 1.59 1.52, 1.66 <0.001
    $15k-$25k 1.82 1.74, 1.90 <0.001
    <$15k 1.86 1.77, 1.96 <0.001
Has personal doctor


    Has provider
    No provider 0.337 0.318, 0.357 <0.001
BMI category


    Normal
    Underweight 0.675 0.593, 0.766 <0.001
    Overweight 1.68 1.63, 1.74 <0.001
    Obese 3.21 3.11, 3.31 <0.001
Physically active


    Active
    Inactive 1.58 1.54, 1.62 <0.001
Smoking status


    Never
    Former 1.09 1.06, 1.12 <0.001
    Current (some days) 0.895 0.838, 0.955 <0.001
    Current (daily) 1.00 0.963, 1.05 0.9
Heavy drinker


    No
    Yes 0.484 0.456, 0.514 <0.001
Poor mental health days


    0 days
    1-13 days 1.13 1.10, 1.16 <0.001
    14+ days 1.43 1.38, 1.47 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

The adjusted model shows the strongest associations for BMI and age, with obesity and older age groups carrying substantially higher odds of diabetes than their reference categories. Lower income groups also retain elevated odds after adjustment, indicating that the socioeconomic gradient is not fully explained by the behavioural and clinical predictors in the model. Several smaller effects remain directionally consistent with the EDA, though their magnitude and precision vary once the predictors are considered jointly.

7.2 Forest plot

A visual complement to Table 1. Points show adjusted odds ratios for each predictor level, with horizontal bars representing 95% confidence intervals. The dashed line at OR = 1 marks the null (no association), and the x-axis is on a log scale, so equal distances represent equal multiplicative changes in odds. Predictors are split across two figures by substantive domain.

Code
predictor_names <- c(
  "age_group", "sex", "race_ethnicity", "education", "income_group",
  "bmi_category", "physically_active", "smoking_status",
  "heavy_drinker", "mental_health_days", "has_provider"
)

var_labels <- c(
  age_group          = "Age group",
  sex                = "Sex",
  race_ethnicity     = "Race/ethnicity",
  education          = "Education",
  income_group       = "Household income",
  bmi_category       = "BMI category",
  physically_active  = "Physically active",
  smoking_status     = "Smoking status",
  heavy_drinker      = "Heavy drinker",
  mental_health_days = "Poor mental health days",
  has_provider       = "Has personal doctor"
)

demo_vars <- c("age_group", "sex", "race_ethnicity", "education", "income_group")

forest_data <- tidy(mod, conf.int = TRUE, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    variable = str_extract(term, str_c("^(", str_c(predictor_names, collapse = "|"), ")")),
    label    = str_remove(term, variable),
    variable = factor(variable, levels = predictor_names)
  ) |>
  arrange(variable)

forest_demo <- filter(forest_data, variable %in% demo_vars)

ggplot(forest_demo,
       aes(x = estimate,
           y = fct_rev(factor(term, levels = unique(term))),
           xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(color = palette_epi[["blue"]], size = 0.4) +
  facet_wrap(~variable, scales = "free_y", ncol = 2,
             labeller = as_labeller(var_labels)) +
  scale_x_log10() +
  scale_y_discrete(labels = \(x) forest_data$label[match(x, forest_data$term)]) +
  labs(x = "Odds Ratio (log scale)", y = NULL,
       title = "Demographic and socioeconomic predictors")

Adjusted ORs with 95% CIs — demographic and socioeconomic predictors

The demographic and socioeconomic panel shows a clear gradient in diabetes risk. Age is the dominant predictor, with odds increasing sharply and monotonically across older age groups. Income and education show consistent inverse gradients, with lower socioeconomic status associated with higher odds of diabetes.

Race and ethnicity show consistently elevated adjusted odds across most groups, with Asian, AI/AN, Black, and Pacific Islander respondents all showing ORs above 1.6 relative to White respondents. Intervals are wider for smaller subgroups, consistent with the uneven sample composition observed in the EDA. Sex shows a precisely estimated association in the expected direction: female respondents have approximately 30% lower adjusted odds of diagnosis, smaller in magnitude than age or BMI but not negligible.

Code
behav_vars <- c("bmi_category", "physically_active", "smoking_status",
                "heavy_drinker", "mental_health_days", "has_provider")

forest_behav <- filter(forest_data, variable %in% behav_vars)

ggplot(forest_behav,
       aes(x = estimate,
           y = fct_rev(factor(term, levels = unique(term))),
           xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(color = palette_epi[["blue"]], size = 0.4) +
  facet_wrap(~variable, scales = "free_y", ncol = 2,
             labeller = as_labeller(var_labels)) +
  scale_x_log10() +
  scale_y_discrete(labels = \(x) forest_data$label[match(x, forest_data$term)]) +
  labs(x = "Odds Ratio (log scale)", y = NULL,
       title = "Behavioural and clinical predictors")

Adjusted ORs with 95% CIs — behavioural and clinical predictors

The behavioural and clinical panel is dominated by BMI, which shows a strong gradient in risk. Odds of diabetes increase substantially from overweight to obese, while underweight is associated with lower odds, making BMI the largest modifiable predictor in the model.

Physical inactivity and poor mental health are both associated with higher odds of diabetes, with mental health showing a graded pattern across categories. These effects are smaller than BMI but remain consistent in direction.

Smoking shows a non-monotonic pattern, with former smokers at higher odds than current smokers. This is consistent with reverse causation, where individuals may quit smoking following diagnosis. Similar patterns appear for heavy drinking and having a personal doctor, both of which show associations that are unlikely to reflect causal effects and are more plausibly explained by behavioural change or healthcare access following diagnosis.

7.3 Unadjusted vs adjusted comparison

Comparing unadjusted and adjusted odds ratios highlights confounding between predictors. Changes in estimates after adjustment indicate that part of the crude association was explained by other variables in the model.

Code
tbl_uv <- df_cc |>
  select(diabetes, all_of(predictor_cols)) |>
  tbl_uvregression(
    method       = glm,
    y            = diabetes,
    method.args  = list(family = binomial),
    exponentiate = TRUE,
    estimate_fun = \(x) style_sigfig(x, digits = 3),
    label = list(
      age_group          ~ "Age group",
      sex                ~ "Sex",
      race_ethnicity     ~ "Race/ethnicity",
      education          ~ "Education",
      income_group       ~ "Household income",
      bmi_category       ~ "BMI category",
      physically_active  ~ "Physically active",
      smoking_status     ~ "Smoking status",
      heavy_drinker      ~ "Heavy drinker",
      has_provider       ~ "Has personal doctor",
      mental_health_days ~ "Poor mental health days"
    )
  ) |>
  bold_p(t = 0.05)

tbl_mv <- tbl_regression(
  mod,
  exponentiate = TRUE,
  estimate_fun = \(x) style_sigfig(x, digits = 3),
  label = list(
    age_group          ~ "Age group",
    sex                ~ "Sex",
    race_ethnicity     ~ "Race/ethnicity",
    education          ~ "Education",
    income_group       ~ "Household income",
    bmi_category       ~ "BMI category",
    physically_active  ~ "Physically active",
    smoking_status     ~ "Smoking status",
    heavy_drinker      ~ "Heavy drinker",
    has_provider       ~ "Has personal doctor",
    mental_health_days ~ "Poor mental health days"
  )
) |>
  bold_p(t = 0.05)

tbl_merge(
  list(tbl_uv, tbl_mv),
  tab_spanner = c("**Unadjusted OR**", "**Adjusted OR**")
) |>
  bold_labels() |>
  modify_caption("**Table 2.** Unadjusted and adjusted odds ratios")
Table 2. Unadjusted and adjusted odds ratios
Characteristic
Unadjusted OR
Adjusted OR
N OR 95% CI p-value OR 95% CI p-value
Age group 287,804





    18-34


    35-44
2.52 2.27, 2.79 <0.001 2.21 1.99, 2.46 <0.001
    45-54
6.74 6.14, 7.41 <0.001 5.57 5.06, 6.13 <0.001
    55-64
13.5 12.4, 14.8 <0.001 10.9 9.91, 11.9 <0.001
    65-74
16.9 15.5, 18.5 <0.001 14.4 13.2, 15.8 <0.001
    75+
18.7 17.1, 20.5 <0.001 17.4 15.9, 19.1 <0.001
Sex 287,804





    Male


    Female
0.875 0.858, 0.893 <0.001 0.703 0.687, 0.719 <0.001
Race/ethnicity 287,804





    White


    Black
1.55 1.50, 1.60 <0.001 1.63 1.57, 1.70 <0.001
    American Indian/Alaskan Native
1.68 1.56, 1.81 <0.001 1.83 1.69, 1.99 <0.001
    Asian
0.787 0.730, 0.847 <0.001 1.94 1.79, 2.10 <0.001
    Pacific Islander
1.43 1.25, 1.63 <0.001 1.80 1.55, 2.07 <0.001
    Other/Multiracial
1.16 1.03, 1.30 0.016 1.16 1.02, 1.32 0.018
    Hispanic
0.950 0.887, 1.02 0.15 1.27 1.18, 1.37 <0.001
    Other
0.941 0.908, 0.974 <0.001 1.50 1.44, 1.56 <0.001
Education 287,804





    College graduate


    Some college
1.48 1.45, 1.52 <0.001 1.19 1.15, 1.22 <0.001
    High school graduate
1.52 1.48, 1.56 <0.001 1.16 1.13, 1.20 <0.001
    Did not graduate high school
2.04 1.95, 2.13 <0.001 1.33 1.27, 1.40 <0.001
Household income 287,804





    >$100k


    $50k-$100k
1.44 1.40, 1.48 <0.001 1.21 1.17, 1.25 <0.001
    $35k-$50k
1.85 1.78, 1.91 <0.001 1.43 1.37, 1.49 <0.001
    $25k-$35k
2.15 2.08, 2.23 <0.001 1.59 1.52, 1.66 <0.001
    $15k-$25k
2.70 2.60, 2.80 <0.001 1.82 1.74, 1.90 <0.001
    <$15k
2.54 2.43, 2.66 <0.001 1.86 1.77, 1.96 <0.001
BMI category 287,804





    Normal


    Underweight
0.761 0.671, 0.860 <0.001 0.675 0.593, 0.766 <0.001
    Overweight
1.88 1.82, 1.94 <0.001 1.68 1.63, 1.74 <0.001
    Obese
3.52 3.42, 3.63 <0.001 3.21 3.11, 3.31 <0.001
Physically active 287,804





    Active


    Inactive
2.31 2.26, 2.36 <0.001 1.58 1.54, 1.62 <0.001
Smoking status 287,804





    Never


    Former
1.46 1.43, 1.49 <0.001 1.09 1.06, 1.12 <0.001
    Current (some days)
0.959 0.902, 1.02 0.2 0.895 0.838, 0.955 <0.001
    Current (daily)
1.23 1.19, 1.28 <0.001 1.00 0.963, 1.05 0.9
Heavy drinker 287,804





    No


    Yes
0.410 0.387, 0.435 <0.001 0.484 0.456, 0.514 <0.001
Has personal doctor 287,804





    Has provider


    No provider
0.214 0.202, 0.226 <0.001 0.337 0.318, 0.357 <0.001
Poor mental health days 287,804





    0 days


    1-13 days
0.764 0.745, 0.783 <0.001 1.13 1.10, 1.16 <0.001
    14+ days
1.13 1.10, 1.16 <0.001 1.43 1.38, 1.47 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

The crude-to-adjusted comparison shows how much of each association is explained by other predictors in the model. Several socioeconomic and behavioural effects attenuate after adjustment, particularly income, education, physical activity, and smoking, indicating that part of their crude association was shared with age and BMI.

In contrast, age and BMI remain strong with only modest attenuation, suggesting these are more robust associations in the data. The age gradient persists after adjustment, with the 75+ group retaining an OR of 17.4. The obesity effect is similarly stable (3.52 → 3.21), remaining the largest modifiable predictor in the model.

Some predictors change direction or strengthen after adjustment, highlighting more complex confounding. Asian ethnicity shifts from a protective crude association to a higher adjusted odds (0.79 → 1.94). This reflects confounding by socioeconomic status: in the crude analysis, higher average income and education levels suppress the underlying association, which becomes visible after adjustment. The 1–13 mental health day group shows a weaker reversal (0.76 → 1.13), likely reflecting adjustment for correlated behavioural and social factors. These reversals illustrate why comparing crude and adjusted estimates is necessary in observational analyses, rather than a formality.


8 Model Performance

Model performance is assessed in terms of discrimination and calibration before interpreting predicted probabilities.

8.1 Discrimination — ROC curve and AUC

The ROC curve and AUC assess whether the model assigns higher predicted probabilities to respondents with diabetes, indicating how well it ranks risk across the sample.

Code
roc_obj <- roc(
  response  = as.integer(df_cc$diabetes) - 1L,
  predictor = fitted(mod),
  quiet     = TRUE
)

ggroc(roc_obj, color = palette_epi[["vermillion"]], linewidth = 1) +
  geom_abline(slope = -1, intercept = 1, linetype = "dashed", color = "grey60") +
  annotate("text", x = 0.35, y = 0.1,
           label = paste0("AUC = ", round(auc(roc_obj), 3)),
           size = 4.5) +
  labs(title = "ROC Curve — Full Model", x = "Specificity", y = "Sensitivity")

ROC curve — full social determinants model

The model achieves an AUC of 0.771: if a respondent with diabetes and a respondent without are selected at random, the model assigns a higher predicted probability to the case 77.1% of the time. The predictors are demographic, socioeconomic, and self-reported behavioural measures, none of which directly measure metabolic function — clinical markers such as fasting glucose or HbA1c are not available in BRFSS. The next section quantifies how much the socioeconomic and behavioural predictors contribute beyond age, sex, and race/ethnicity alone.

8.2 Incremental value of SDoH — base model vs full model

A base model including age, sex, and race/ethnicity provides a reference point for model performance. Comparing its AUC to the full model shows how much additional discrimination is contributed by the socioeconomic and behavioural predictors.

Code
mod_base <- glm(
  diabetes ~ age_group + sex + race_ethnicity,
  data   = df_cc,
  family = binomial
)

roc_base <- roc(
  response  = as.integer(df_cc$diabetes) - 1L,
  predictor = fitted(mod_base),
  quiet     = TRUE
)

cat("Base model AUC (age + sex + race/ethnicity):", round(auc(roc_base), 3), "\n")
## Base model AUC (age + sex + race/ethnicity): 0.696
cat("Full model AUC (all predictors):            ", round(auc(roc_obj),  3), "\n")
## Full model AUC (all predictors):             0.771
cat("Delta AUC:                                  ", round(auc(roc_obj) - auc(roc_base), 3), "\n\n")
## Delta AUC:                                   0.075

roc.test(roc_obj, roc_base)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc_obj and roc_base
## Z = 69.353, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.07306843 0.07731845
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7712646   0.6960712

The base model (age, sex, race/ethnicity) achieves an AUC of 0.696, compared with 0.771 for the full model (ΔAUC = 0.075). This increase reflects the additional discrimination provided by the socioeconomic and behavioural predictors beyond demographic structure alone.

The DeLong test confirms that this difference is not due to sampling variation (p < 0.001), but the magnitude is more informative than the p-value here: the added predictors improve risk ranking, but do not fundamentally change the model’s ability to separate cases from non-cases.

8.3 Calibration

The Hosmer–Lemeshow test compares observed and expected event counts across deciles of predicted risk. A significant p-value indicates a difference between predicted and observed rates within these groups.

At this sample size (287,804), the test has sufficient power to detect very small deviations, so a near-zero p-value is expected even when calibration is visually close. The p-value alone is insufficient — the calibration plot provides the more complete picture.

Code
cal_data <- tibble(
  predicted = fitted(mod),
  observed  = as.integer(df_cc$diabetes) - 1L
) |>
  mutate(decile = ntile(predicted, 10)) |>
  summarise(
    mean_pred = mean(predicted),
    mean_obs  = mean(observed),
    n         = n(),
    .by       = decile
  )

ggplot(cal_data, aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(aes(size = n), color = palette_epi[["blue"]], show.legend = FALSE) +
  geom_line(color = palette_epi[["blue"]], linewidth = 0.5) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Mean predicted probability", y = "Observed proportion",
       title = "Calibration plot — predicted vs observed\n(by decile of predicted risk)") +
  theme(plot.title = element_text(hjust = 0.5))

Calibration plot — mean predicted probability vs observed proportion by decile

The calibration plot compares predicted probabilities with observed event rates across deciles of risk. Points close to the 45° line indicate good agreement; systematic deviation above the line indicates underprediction, and below indicates overprediction. Point size reflects the number of respondents in each decile.

Predicted probabilities are slightly elevated in the lower-risk deciles and modestly underestimated at the upper end, but remain close to the reference line across the full range. These deviations are small, and the model is well calibrated for population-level interpretation.

Code
hl <- hoslem.test(as.integer(df_cc$diabetes) - 1L, fitted(mod), g = 10)
print(hl)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.integer(df_cc$diabetes) - 1L, fitted(mod)
## X-squared = 124.21, df = 8, p-value < 2.2e-16

The test returns X² = 124.21 (df = 8, p < 0.001). As expected at this sample size, the result is significant and reflects small deviations from perfect calibration. The calibration plot provides the more substantive assessment of fit.

Taken together, these checks address two questions: whether the model separates higher- from lower-risk respondents, and whether predicted probabilities align with observed rates. Here, they matter less for classification and more as evidence that the model is behaving consistently enough to support interpretation of the adjusted odds ratios.


9 Conclusions

9.1 Key findings

Obesity carries the highest adjusted odds in the model, with respondents in the obese BMI category having 3.21 times the odds of a diabetes diagnosis compared to those at normal weight. Age shows the steepest gradient: adjusted odds rise through each successive band, reaching OR 17.4 for the 75+ group relative to 18–34 year olds.

A clear income gradient persists after adjustment, with odds increasing from OR 1.21 in the $50k–$100k group to OR 1.86 in the lowest income bracket, compared to respondents earning over $100k. Racial and ethnic disparities are similarly robust, with Asian, American Indian or Alaska Native, and Black respondents all showing elevated adjusted odds relative to White respondents (OR 1.94, 1.83, and 1.63, respectively), indicating these associations are not fully explained by the socioeconomic and behavioural variables included in the model.

Physical inactivity and poor mental health are independently associated with higher odds (OR 1.58 and 1.43 for 14+ poor mental health days, respectively). In contrast, two predictors show inverse associations: having no personal doctor (OR 0.337) and heavy drinking (OR 0.484). These are unlikely to reflect protective effects. Respondents without a provider are less likely to have received a diagnosis, while the heavy drinking association likely reflects detection bias or behavioural change following diagnosis. Former smokers show modestly elevated odds (OR 1.09), consistent with a similar pattern of cessation after diagnosis.

9.2 Model performance

The full model achieves an AUC of 0.771, compared to 0.696 for the base demographic model, confirming that socioeconomic and behavioural predictors add meaningful discrimination beyond age, sex, and race/ethnicity alone. The calibration plot shows predicted probabilities are well-aligned with observed rates across the risk distribution. The Hosmer-Lemeshow p-value is significant, as expected at this sample size, and does not indicate poor fit.

9.3 Limitations

Cross-sectional design. Diabetes status and all predictors are measured at the same point in time. The direction of causality cannot be established. For example, diabetes diagnosis may prompt changes in physical activity or diet rather than the reverse.

Unweighted analysis. BRFSS uses a complex survey design with stratification, clustering, and post-stratification weights to produce nationally representative estimates. This analysis uses unweighted logistic regression, which produces valid associations within the sample but does not account for the survey design. Population-level prevalence estimates and fully representative ORs would require svyglm() from the survey package with the provided psu, strata, and llcp_weight variables.

Self-reported outcome. Diabetes status is based on self-report of a prior diagnosis. Undiagnosed diabetes, estimated to affect roughly 1 in 4 people with the condition, is not captured, which likely biases associations involving healthcare access toward the null.

Complete-case analysis. Respondents with missing predictor values were excluded (36.5%, n = 165,437), driven primarily by income-group non-response. If lower-income respondents are more likely to skip the question, the income gradient may be attenuated; the opposite pattern would inflate it. The missingness mechanism cannot be verified from these data.


10 Session Info

R session info
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Sydney
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] ResourceSelection_0.3-6 pROC_1.19.0.1           patchwork_1.3.2        
##  [4] see_0.13.0              performance_0.16.0      car_3.1-5              
##  [7] carData_3.0-6           broom_1.0.12            janitor_2.2.1          
## [10] gtsummary_2.5.0         arrow_23.0.1.2          lubridate_1.9.5        
## [13] forcats_1.0.1           stringr_1.6.0           dplyr_1.2.1            
## [16] purrr_1.2.1             readr_2.2.0             tidyr_1.3.2            
## [19] tibble_3.3.1            ggplot2_4.0.2           tidyverse_2.0.0        
## [22] here_1.0.2             
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     farver_2.1.2         S7_0.2.1            
##  [4] fastmap_1.2.0        bayestestR_0.17.0    broom.helpers_1.22.0
##  [7] labelled_2.16.0      digest_0.6.39        timechange_0.4.0    
## [10] lifecycle_1.0.5      magrittr_2.0.5       compiler_4.5.0      
## [13] rlang_1.2.0          sass_0.4.10          tools_4.5.0         
## [16] yaml_2.3.12          gt_1.3.0             knitr_1.51          
## [19] labeling_0.4.3       htmlwidgets_1.6.4    bit_4.6.0           
## [22] xml2_1.5.2           RColorBrewer_1.1-3   abind_1.4-8         
## [25] withr_3.0.2          grid_4.5.0           datawizard_1.3.0    
## [28] scales_1.4.0         insight_1.4.6        cli_3.6.5           
## [31] rmarkdown_2.31       generics_0.1.4       rstudioapi_0.18.0   
## [34] tzdb_0.5.0           commonmark_2.0.0     parameters_0.28.3   
## [37] splines_4.5.0        assertthat_0.2.1     base64enc_0.1-6     
## [40] vctrs_0.7.2          Matrix_1.7-5         jsonlite_2.0.0      
## [43] litedown_0.9         hms_1.1.4            bit64_4.6.0-1       
## [46] Formula_1.2-5        glue_1.8.0           stringi_1.8.7       
## [49] gtable_0.3.6         pillar_1.11.1        htmltools_0.5.9     
## [52] R6_2.6.1             rprojroot_2.1.1      evaluate_1.0.5      
## [55] lattice_0.22-9       markdown_2.0         haven_2.5.5         
## [58] backports_1.5.1      cards_0.7.1          snakecase_0.11.1    
## [61] renv_1.2.0           cardx_0.3.2          Rcpp_1.1.1          
## [64] nlme_3.1-169         mgcv_1.9-4           xfun_0.57           
## [67] fs_2.0.1             pkgconfig_2.0.3