Modelling Diabetes Risk: Social and Behavioural Factors in BRFSS Data

Multivariable Logistic Regression Analysis · n = 287,804

This notebook estimates adjusted associations between diabetes status and demographic, socioeconomic, and behavioural predictors using multivariable logistic regression.

The data come from the 2024 Behavioral Risk Factor Surveillance System (BRFSS), a nationally representative telephone survey of U.S. adults conducted annually by the CDC. The analysis examines whether these factors remain associated with diabetes status after adjustment for confounding.

1 Setup

Load packages and helpers

library(here)
library(tidyverse)
library(arrow)
library(gtsummary)
library(janitor)
library(broom)            # tidy(): extract model coefficients as a data frame
library(car)              # vif(): variance inflation factors
library(performance)      # check_model(): visual diagnostic plots
library(see)              # plotting backend for performance
library(patchwork)        # plot composition and annotation
library(pROC)             # roc(), auc(), ggroc(): ROC curve and AUC
library(ResourceSelection) # hoslem.test(): Hosmer-Lemeshow goodness-of-fit

source(here("R", "utils.R"))
source(here("R", "plot_theme.R"))

car checks multicollinearity via variance inflation factors, and performance provides diagnostic plots to assess overall model behaviour. pROC evaluates discrimination through the ROC curve and AUC, and ResourceSelection provides the Hosmer-Lemeshow test as a basic check of calibration across risk strata.

Model estimates are extracted with broom and presented using gtsummary for consistent reporting.

2 Data Import

Code

df_raw <- read_parquet(here("data", "processed", "analysis_dataset.parquet"))
cat("Rows:", nrow(df_raw), "| Columns:", ncol(df_raw), "\n")
## Rows: 453241 | Columns: 15

The analytical dataset is produced by the SQL pipeline from the raw 2024 BRFSS XPT file. Respondents with a missing or ambiguous diabetes response are excluded at the SQL stage, so all 453,241 rows carry a definitive binary outcome. No further row-level exclusions are applied until the complete-case step in Data Preparation.

3 Data Preparation

3.1 Factor levels and reference categories

Reference categories determine the comparison group for each odds ratio. We set these to meaningful baselines (typically the lowest-risk or most advantaged group) so that estimated effects read as higher risk relative to a favourable exposure level, keeping the direction of associations intuitive.

Set factor levels and reference categories

df <- df_raw |>
  mutate(
    diabetes = factor(diabetes, levels = c(0, 1), labels = c("No", "Yes")),

    age_group = factor(age_group,
                       levels = c("18-34", "35-44", "45-54", "55-64", "65-74", "75+")),

    sex = factor(sex, levels = c("Male", "Female")),

    race_ethnicity = factor(race_ethnicity,
                            levels = c("White", "Black",
                                       "American Indian/Alaskan Native",
                                       "Asian", "Pacific Islander",
                                       "Other/Multiracial",
                                       "Hispanic", "Other")),

    education = factor(education,
                       levels = c("College graduate", "Some college",
                                  "High school graduate", "Did not graduate high school")),

    income_group = factor(income_group,
                          levels = c(">$100k", "$50k-$100k", "$35k-$50k",
                                     "$25k-$35k", "$15k-$25k", "<$15k")),

    bmi_category = factor(bmi_category,
                          levels = c("Normal", "Underweight", "Overweight", "Obese")),

    physically_active = factor(physically_active, levels = c(1, 0),
                               labels = c("Active", "Inactive")),

    smoking_status = factor(smoking_status,
                            levels = c("Never", "Former",
                                       "Current (some days)", "Current (daily)")),

    heavy_drinker = factor(heavy_drinker, levels = c(0, 1),
                           labels = c("No", "Yes")),

    has_provider = factor(has_provider, levels = c(1, 0),
                          labels = c("Has provider", "No provider")),

    mental_health_days = factor(mental_health_days,
                                levels = c("0 days", "1-13 days", "14+ days"))
  )

The first level of each factor becomes the reference category under R’s default treatment coding. This choice affects how contrasts are expressed, but does not change overall model fit.

3.2 Complete-case exclusions

glm() cannot use rows with missing values, so any row missing a predictor is dropped here. The retained dataset df_cc is used for all subsequent modelling.

Code

predictor_cols <- c(
  "age_group", "sex", "race_ethnicity", "education", "income_group",
  "bmi_category", "physically_active", "smoking_status",
  "heavy_drinker", "has_provider", "mental_health_days"
)

n_before <- nrow(df)
df_cc    <- df |> drop_na(all_of(predictor_cols))
n_after  <- nrow(df_cc)

cat("Before:", n_before, "| After:", n_after, "| Excluded:", n_before - n_after,
    "(", round((n_before - n_after) / n_before * 100, 1), "%)\n")
## Before: 453241 | After: 287804 | Excluded: 165437 ( 36.5 %)

36.5% of respondents are excluded, largely due to the 25.5% income-group missingness identified in the EDA. The reason for this missingness is unknown; if lower-income respondents are more likely to skip the question, the estimated income gradient may be attenuated, and the reverse would inflate it. This is the main limitation of the complete-case approach.

4 Model Specification

The outcome, diabetes diagnosis, is binary, so logistic regression models the log-odds of diagnosis as a function of the predictors. Coefficients are exponentiated and reported as odds ratios (ORs), the standard effect measure for this kind of cross-sectional analysis.

All 11 predictors are entered simultaneously rather than selected stepwise. Stepwise procedures can improve apparent fit in-sample but tend to produce unstable estimates and inflated Type I error. The aim here is to estimate the adjusted association of each predictor with diabetes status, so a theory-driven full model is preferred.

The model includes three demographic predictors (age, sex, race/ethnicity), two socioeconomic predictors (education and income), one access-to-care predictor (having a personal doctor), and five behavioural and clinical predictors (BMI, physical activity, smoking, alcohol use, and mental health).

Code

model_formula <- formula(
  diabetes ~
    age_group + sex + race_ethnicity +
    education + income_group + has_provider +
    bmi_category + physically_active + smoking_status +
    heavy_drinker + mental_health_days
)
model_formula

diabetes ~ age_group + sex + race_ethnicity + education + income_group + 
    has_provider + bmi_category + physically_active + smoking_status + 
    heavy_drinker + mental_health_days

This is an associational model fit to cross-sectional data: it estimates adjusted odds, not causal effects. Diabetes status and the predictors are measured at the same point in time, so the direction of causality cannot be established from these data alone.

5 Model Fitting

The specified logistic regression model is fit to the complete-case dataset.

Code

mod <- glm(model_formula, data = df_cc, family = binomial(link = "logit"))

saveRDS(mod,   here("data", "processed", "mod.rds"))
saveRDS(df_cc, here("data", "processed", "df_cc.rds"))

cat("Converged:", mod$converged, "\n")
## Converged: TRUE
glance(mod)

The model converged successfully. The reduction in deviance relative to the null model indicates that the predictors collectively improve model fit. Adjusted associations are interpreted in the Results section, after diagnostic checks.

6 Assumption Checks

Model fit alone does not guarantee reliable estimates, so key assumptions are checked before interpreting coefficients.

6.1 Multicollinearity: VIF

Highly correlated predictors inflate standard errors and make individual coefficients unstable, even when overall model fit is acceptable. Variance inflation factors (VIFs) quantify this inflation for each predictor.

Code

vif(mod)
##                        GVIF Df GVIF^(1/(2*Df))
## age_group          1.314941  5        1.027757
## sex                1.074021  1        1.036350
## race_ethnicity     1.223534  7        1.014515
## education          1.329057  3        1.048554
## income_group       1.402839  5        1.034429
## has_provider       1.047195  1        1.023325
## bmi_category       1.082000  3        1.013222
## physically_active  1.093264  1        1.045593
## smoking_status     1.167625  3        1.026165
## heavy_drinker      1.012101  1        1.006032
## mental_health_days 1.153756  2        1.036403

Because several predictors are multi-level categorical variables, car::vif() reports generalised VIFs (GVIF). The adjusted values, GVIF^(1/(2·Df)), are all below 1.05, well under the conventional threshold of 2, indicating no meaningful collinearity.

Income and education show slightly higher values than the other predictors, consistent with their expected correlation in the population, but neither approaches a level that would inflate standard errors or destabilise coefficient estimates.

6.2 Model diagnostics

Three diagnostic panels assess model fit, influential observations, and collinearity before interpreting coefficients.

Code

cm <- check_model(mod, check = c("binned_residuals", "outliers", "vif"),
                  residual_type = "normal")

diag_plot <- plot(cm) &
  theme(
    plot.title    = element_text(size = 12, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, hjust = 0.5),
    axis.title    = element_text(size = 8),
    plot.margin   = margin(t = 8, r = 8, b = 8, l = 8)
  )

diag_plot[[3]] <- diag_plot[[3]] +
  theme(
    plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(b = 10, l = 60)),
    axis.text.x   = element_text(angle = 45, hjust = 1, vjust = 1, size = 7.5),
    plot.margin   = margin(t = 8, r = 8, b = 18, l = 8)
  )

diag_plot +
  patchwork::plot_annotation(
    title = "Model Diagnostic Checks",
    theme = theme(
      plot.title  = element_text(size = 16, face = "bold", hjust = 0.5),
      plot.margin = margin(t = 14, b = 8)
    )
  )

The binned residuals plot shows systematic deviation outside the error bounds, with residuals consistently negative across the predicted probability range. The model overpredicts diabetes probability across most of the risk distribution: the linear functional form does not fully capture the relationship between the predictors and the log-odds of the outcome.

This pattern is consistent with non-linear associations in key predictors (particularly age, BMI, and mental health days) and the absence of interaction terms in the current specification. No influential observations are identified in the Cook’s distance panel, as expected given the sample size.

This misfit does not invalidate the model, but it suggests that effect estimates should be interpreted with caution.

7 Results

7.1 Adjusted odds ratios

The primary result of this analysis is the set of adjusted ORs from the full multivariable model: the estimated odds of diabetes for each predictor level, holding all other predictors constant.

Code

tbl_regression(
  mod,
  exponentiate = TRUE,
  estimate_fun = \(x) style_sigfig(x, digits = 3),
  label = list(
    age_group          ~ "Age group",
    sex                ~ "Sex",
    race_ethnicity     ~ "Race/ethnicity",
    education          ~ "Education",
    income_group       ~ "Household income",
    bmi_category       ~ "BMI category",
    physically_active  ~ "Physically active",
    smoking_status     ~ "Smoking status",
    heavy_drinker      ~ "Heavy drinker",
    has_provider       ~ "Has personal doctor",
    mental_health_days ~ "Poor mental health days"
  )
) |>
  bold_p(t = 0.05) |>
  bold_labels() |>
  modify_caption("**Table 1.** Adjusted odds ratios from the full social determinants model")

**Table 1.** Adjusted odds ratios from the full social determinants model
Characteristic	OR	95% CI	p-value
Age group
18-34	—	—
35-44	2.21	1.99, 2.46	<0.001
45-54	5.57	5.06, 6.13	<0.001
55-64	10.9	9.91, 11.9	<0.001
65-74	14.4	13.2, 15.8	<0.001
75+	17.4	15.9, 19.1	<0.001
Sex
Male	—	—
Female	0.703	0.687, 0.719	<0.001
Race/ethnicity
White	—	—
Black	1.63	1.57, 1.70	<0.001
American Indian/Alaskan Native	1.83	1.69, 1.99	<0.001
Asian	1.94	1.79, 2.10	<0.001
Pacific Islander	1.80	1.55, 2.07	<0.001
Other/Multiracial	1.16	1.02, 1.32	0.018
Hispanic	1.27	1.18, 1.37	<0.001
Other	1.50	1.44, 1.56	<0.001
Education
College graduate	—	—
Some college	1.19	1.15, 1.22	<0.001
High school graduate	1.16	1.13, 1.20	<0.001
Did not graduate high school	1.33	1.27, 1.40	<0.001
Household income
>$100k	—	—
$50k-$100k	1.21	1.17, 1.25	<0.001
$35k-$50k	1.43	1.37, 1.49	<0.001
$25k-$35k	1.59	1.52, 1.66	<0.001
$15k-$25k	1.82	1.74, 1.90	<0.001
<$15k	1.86	1.77, 1.96	<0.001
Has personal doctor
Has provider	—	—
No provider	0.337	0.318, 0.357	<0.001
BMI category
Normal	—	—
Underweight	0.675	0.593, 0.766	<0.001
Overweight	1.68	1.63, 1.74	<0.001
Obese	3.21	3.11, 3.31	<0.001
Physically active
Active	—	—
Inactive	1.58	1.54, 1.62	<0.001
Smoking status
Never	—	—
Former	1.09	1.06, 1.12	<0.001
Current (some days)	0.895	0.838, 0.955	<0.001
Current (daily)	1.00	0.963, 1.05	0.9
Heavy drinker
No	—	—
Yes	0.484	0.456, 0.514	<0.001
Poor mental health days
0 days	—	—
1-13 days	1.13	1.10, 1.16	<0.001
14+ days	1.43	1.38, 1.47	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

The adjusted model shows the strongest associations for BMI and age, with obesity and older age groups carrying substantially higher odds of diabetes than their reference categories. Lower income groups also retain elevated odds after adjustment, indicating that the socioeconomic gradient is not fully explained by the behavioural and clinical predictors in the model. Several smaller effects remain directionally consistent with the EDA, though their magnitude and precision vary once the predictors are considered jointly.

7.2 Forest plot

A visual complement to Table 1. Points show adjusted odds ratios for each predictor level, with horizontal bars representing 95% confidence intervals. The dashed line at OR = 1 marks the null (no association), and the x-axis is on a log scale, so equal distances represent equal multiplicative changes in odds. Predictors are split across two figures by domain.

Code

predictor_names <- c(
  "age_group", "sex", "race_ethnicity", "education", "income_group",
  "bmi_category", "physically_active", "smoking_status",
  "heavy_drinker", "mental_health_days", "has_provider"
)

var_labels <- c(
  age_group          = "Age group",
  sex                = "Sex",
  race_ethnicity     = "Race/ethnicity",
  education          = "Education",
  income_group       = "Household income",
  bmi_category       = "BMI category",
  physically_active  = "Physically active",
  smoking_status     = "Smoking status",
  heavy_drinker      = "Heavy drinker",
  mental_health_days = "Poor mental health days",
  has_provider       = "Has personal doctor"
)

demo_vars <- c("age_group", "sex", "race_ethnicity", "education", "income_group")

forest_data <- tidy(mod, conf.int = TRUE, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    variable = str_extract(term, str_c("^(", str_c(predictor_names, collapse = "|"), ")")),
    label    = str_remove(term, variable),
    variable = factor(variable, levels = predictor_names)
  ) |>
  arrange(variable)

forest_demo <- filter(forest_data, variable %in% demo_vars)

ggplot(forest_demo,
       aes(x = estimate,
           y = fct_rev(factor(term, levels = unique(term))),
           xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(color = palette_epi[["blue"]], size = 0.4) +
  facet_wrap(~variable, scales = "free_y", ncol = 2,
             labeller = as_labeller(var_labels)) +
  scale_x_log10() +
  scale_y_discrete(labels = \(x) forest_data$label[match(x, forest_data$term)]) +
  labs(x = "Odds Ratio (log scale)", y = NULL,
       title = "Demographic and socioeconomic predictors")

Adjusted ORs with 95% CIs: demographic and socioeconomic predictors

The demographic and socioeconomic panel shows a clear gradient in diabetes risk. Age is the dominant predictor, with odds increasing sharply and monotonically across older age groups. Income and education show consistent inverse gradients, with lower socioeconomic status associated with higher odds of diabetes.

Race and ethnicity show elevated adjusted odds across most groups, with Asian, AI/AN, Black, and Pacific Islander respondents all above OR 1.6 relative to White respondents. Intervals are wider for smaller subgroups, consistent with the uneven sample composition seen in the EDA. Sex shows a precisely estimated association in the expected direction: female respondents have roughly 30% lower adjusted odds, smaller than the age or BMI effects but not negligible.

Code

behav_vars <- c("bmi_category", "physically_active", "smoking_status",
                "heavy_drinker", "mental_health_days", "has_provider")

forest_behav <- filter(forest_data, variable %in% behav_vars)

ggplot(forest_behav,
       aes(x = estimate,
           y = fct_rev(factor(term, levels = unique(term))),
           xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(color = palette_epi[["blue"]], size = 0.4) +
  facet_wrap(~variable, scales = "free_y", ncol = 2,
             labeller = as_labeller(var_labels)) +
  scale_x_log10() +
  scale_y_discrete(labels = \(x) forest_data$label[match(x, forest_data$term)]) +
  labs(x = "Odds Ratio (log scale)", y = NULL,
       title = "Behavioural and clinical predictors")

Adjusted ORs with 95% CIs: behavioural and clinical predictors

The behavioural and clinical panel is dominated by BMI, which shows a strong gradient in risk. Odds of diabetes increase substantially from overweight to obese, while underweight is associated with lower odds, making BMI the largest modifiable predictor in the model.

Physical inactivity and poor mental health are both associated with higher odds of diabetes, with mental health showing a graded pattern across categories. These effects are smaller than BMI but remain consistent in direction.

Smoking shows a non-monotonic pattern, with former smokers at higher odds than current smokers. This is consistent with reverse causation: people may quit smoking after diagnosis. Similar patterns appear for heavy drinking and having a personal doctor; neither is plausibly causal, and both are better explained by behaviour change or healthcare access following diagnosis.

7.3 Unadjusted vs adjusted comparison

Comparing unadjusted and adjusted odds ratios highlights confounding: where an estimate changes after adjustment, part of the crude association was explained by other variables in the model.

Code

tbl_uv <- df_cc |>
  select(diabetes, all_of(predictor_cols)) |>
  tbl_uvregression(
    method       = glm,
    y            = diabetes,
    method.args  = list(family = binomial),
    exponentiate = TRUE,
    estimate_fun = \(x) style_sigfig(x, digits = 3),
    label = list(
      age_group          ~ "Age group",
      sex                ~ "Sex",
      race_ethnicity     ~ "Race/ethnicity",
      education          ~ "Education",
      income_group       ~ "Household income",
      bmi_category       ~ "BMI category",
      physically_active  ~ "Physically active",
      smoking_status     ~ "Smoking status",
      heavy_drinker      ~ "Heavy drinker",
      has_provider       ~ "Has personal doctor",
      mental_health_days ~ "Poor mental health days"
    )
  ) |>
  bold_p(t = 0.05)

tbl_mv <- tbl_regression(
  mod,
  exponentiate = TRUE,
  estimate_fun = \(x) style_sigfig(x, digits = 3),
  label = list(
    age_group          ~ "Age group",
    sex                ~ "Sex",
    race_ethnicity     ~ "Race/ethnicity",
    education          ~ "Education",
    income_group       ~ "Household income",
    bmi_category       ~ "BMI category",
    physically_active  ~ "Physically active",
    smoking_status     ~ "Smoking status",
    heavy_drinker      ~ "Heavy drinker",
    has_provider       ~ "Has personal doctor",
    mental_health_days ~ "Poor mental health days"
  )
) |>
  bold_p(t = 0.05)

tbl_merge(
  list(tbl_uv, tbl_mv),
  tab_spanner = c("**Unadjusted OR**", "**Adjusted OR**")
) |>
  bold_labels() |>
  modify_caption("**Table 2.** Unadjusted and adjusted odds ratios")

**Table 2.** Unadjusted and adjusted odds ratios
Characteristic	Unadjusted OR				Adjusted OR
Characteristic	N	OR	95% CI	p-value	OR	95% CI	p-value
Age group	287,804
18-34		—	—		—	—
35-44		2.52	2.27, 2.79	<0.001	2.21	1.99, 2.46	<0.001
45-54		6.74	6.14, 7.41	<0.001	5.57	5.06, 6.13	<0.001
55-64		13.5	12.4, 14.8	<0.001	10.9	9.91, 11.9	<0.001
65-74		16.9	15.5, 18.5	<0.001	14.4	13.2, 15.8	<0.001
75+		18.7	17.1, 20.5	<0.001	17.4	15.9, 19.1	<0.001
Sex	287,804
Male		—	—		—	—
Female		0.875	0.858, 0.893	<0.001	0.703	0.687, 0.719	<0.001
Race/ethnicity	287,804
White		—	—		—	—
Black		1.55	1.50, 1.60	<0.001	1.63	1.57, 1.70	<0.001
American Indian/Alaskan Native		1.68	1.56, 1.81	<0.001	1.83	1.69, 1.99	<0.001
Asian		0.787	0.730, 0.847	<0.001	1.94	1.79, 2.10	<0.001
Pacific Islander		1.43	1.25, 1.63	<0.001	1.80	1.55, 2.07	<0.001
Other/Multiracial		1.16	1.03, 1.30	0.016	1.16	1.02, 1.32	0.018
Hispanic		0.950	0.887, 1.02	0.15	1.27	1.18, 1.37	<0.001
Other		0.941	0.908, 0.974	<0.001	1.50	1.44, 1.56	<0.001
Education	287,804
College graduate		—	—		—	—
Some college		1.48	1.45, 1.52	<0.001	1.19	1.15, 1.22	<0.001
High school graduate		1.52	1.48, 1.56	<0.001	1.16	1.13, 1.20	<0.001
Did not graduate high school		2.04	1.95, 2.13	<0.001	1.33	1.27, 1.40	<0.001
Household income	287,804
>$100k		—	—		—	—
$50k-$100k		1.44	1.40, 1.48	<0.001	1.21	1.17, 1.25	<0.001
$35k-$50k		1.85	1.78, 1.91	<0.001	1.43	1.37, 1.49	<0.001
$25k-$35k		2.15	2.08, 2.23	<0.001	1.59	1.52, 1.66	<0.001
$15k-$25k		2.70	2.60, 2.80	<0.001	1.82	1.74, 1.90	<0.001
<$15k		2.54	2.43, 2.66	<0.001	1.86	1.77, 1.96	<0.001
BMI category	287,804
Normal		—	—		—	—
Underweight		0.761	0.671, 0.860	<0.001	0.675	0.593, 0.766	<0.001
Overweight		1.88	1.82, 1.94	<0.001	1.68	1.63, 1.74	<0.001
Obese		3.52	3.42, 3.63	<0.001	3.21	3.11, 3.31	<0.001
Physically active	287,804
Active		—	—		—	—
Inactive		2.31	2.26, 2.36	<0.001	1.58	1.54, 1.62	<0.001
Smoking status	287,804
Never		—	—		—	—
Former		1.46	1.43, 1.49	<0.001	1.09	1.06, 1.12	<0.001
Current (some days)		0.959	0.902, 1.02	0.2	0.895	0.838, 0.955	<0.001
Current (daily)		1.23	1.19, 1.28	<0.001	1.00	0.963, 1.05	0.9
Heavy drinker	287,804
No		—	—		—	—
Yes		0.410	0.387, 0.435	<0.001	0.484	0.456, 0.514	<0.001
Has personal doctor	287,804
Has provider		—	—		—	—
No provider		0.214	0.202, 0.226	<0.001	0.337	0.318, 0.357	<0.001
Poor mental health days	287,804
0 days		—	—		—	—
1-13 days		0.764	0.745, 0.783	<0.001	1.13	1.10, 1.16	<0.001
14+ days		1.13	1.10, 1.16	<0.001	1.43	1.38, 1.47	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Several socioeconomic and behavioural effects attenuate after adjustment, particularly income, education, physical activity, and smoking, indicating that part of their crude association was shared with age and BMI.

In contrast, age and BMI remain strong with only modest attenuation, suggesting these are more robust associations in the data. The age gradient persists after adjustment, with the 75+ group retaining an OR of 17.4. The obesity effect is similarly stable (3.52 → 3.21), remaining the largest modifiable predictor in the model.

Some predictors change direction or strengthen after adjustment, highlighting more complex confounding. Asian ethnicity shifts from a protective crude association to elevated adjusted odds (0.79 → 1.94). This reflects confounding by socioeconomic status: in the crude analysis, higher average income and education levels suppress the underlying association, which becomes visible after adjustment. The 1-13 mental health day group shows a weaker reversal (0.76 → 1.13), likely reflecting adjustment for correlated behavioural and social factors. Reversals like these are why the crude-to-adjusted comparison is more than a formality in observational analyses.

8 Model Performance

Model performance is assessed in terms of discrimination and calibration before interpreting predicted probabilities.

8.1 Discrimination: ROC curve and AUC

The ROC curve and AUC assess whether the model assigns higher predicted probabilities to respondents with diabetes, indicating how well it ranks risk across the sample.

Code

roc_obj <- roc(
  response  = as.integer(df_cc$diabetes) - 1L,
  predictor = fitted(mod),
  quiet     = TRUE
)

ggroc(roc_obj, color = palette_epi[["vermillion"]], linewidth = 1) +
  geom_abline(slope = -1, intercept = 1, linetype = "dashed", color = "grey60") +
  annotate("text", x = 0.35, y = 0.1,
           label = paste0("AUC = ", round(auc(roc_obj), 3)),
           size = 4.5) +
  labs(title = "ROC Curve: Full Model", x = "Specificity", y = "Sensitivity")

ROC curve: full social determinants model

The model achieves an AUC of 0.771: if a respondent with diabetes and a respondent without are selected at random, the model assigns a higher predicted probability to the case 77.1% of the time. The predictors are demographic, socioeconomic, and self-reported behavioural measures, none of which directly measure metabolic function; clinical markers such as fasting glucose or HbA1c are not available in BRFSS. The next section quantifies how much the socioeconomic and behavioural predictors contribute beyond age, sex, and race/ethnicity alone.

8.2 Incremental value of the social and behavioural predictors

A base model including age, sex, and race/ethnicity provides a reference point. Comparing its AUC with the full model’s shows how much discrimination the socioeconomic and behavioural predictors add.

Code

mod_base <- glm(
  diabetes ~ age_group + sex + race_ethnicity,
  data   = df_cc,
  family = binomial
)

roc_base <- roc(
  response  = as.integer(df_cc$diabetes) - 1L,
  predictor = fitted(mod_base),
  quiet     = TRUE
)

cat("Base model AUC (age + sex + race/ethnicity):", round(auc(roc_base), 3), "\n")
## Base model AUC (age + sex + race/ethnicity): 0.696
cat("Full model AUC (all predictors):            ", round(auc(roc_obj),  3), "\n")
## Full model AUC (all predictors):             0.771
cat("Delta AUC:                                  ", round(auc(roc_obj) - auc(roc_base), 3), "\n\n")
## Delta AUC:                                   0.075

roc.test(roc_obj, roc_base)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc_obj and roc_base
## Z = 69.353, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.07306843 0.07731845
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7712646   0.6960712

The base model (age, sex, race/ethnicity) achieves an AUC of 0.696, compared with 0.771 for the full model (ΔAUC = 0.075).

The DeLong test confirms that this difference is not due to sampling variation (p < 0.001), but the magnitude is more informative than the p-value here: the added predictors improve risk ranking, but do not fundamentally change the model’s ability to separate cases from non-cases.

8.3 Calibration

The Hosmer-Lemeshow test compares observed and expected event counts across deciles of predicted risk. A significant p-value indicates a difference between predicted and observed rates within these groups.

At this sample size (287,804), the test has sufficient power to detect very small deviations, so a near-zero p-value is expected even when calibration is visually close. The p-value alone is insufficient; the calibration plot provides the more complete picture.

Code

cal_data <- tibble(
  predicted = fitted(mod),
  observed  = as.integer(df_cc$diabetes) - 1L
) |>
  mutate(decile = ntile(predicted, 10)) |>
  summarise(
    mean_pred = mean(predicted),
    mean_obs  = mean(observed),
    n         = n(),
    .by       = decile
  )

ggplot(cal_data, aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(aes(size = n), color = palette_epi[["blue"]], show.legend = FALSE) +
  geom_line(color = palette_epi[["blue"]], linewidth = 0.5) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Mean predicted probability", y = "Observed proportion",
       title = "Calibration plot: predicted vs observed\n(by decile of predicted risk)") +
  theme(plot.title = element_text(hjust = 0.5))

Calibration plot: mean predicted probability vs observed proportion by decile

The calibration plot compares predicted probabilities with observed event rates across deciles of risk. Points close to the 45° line indicate good agreement; systematic deviation above the line indicates underprediction, and below indicates overprediction. Point size reflects the number of respondents in each decile.

Predicted probabilities are slightly elevated in the lower-risk deciles and modestly underestimated at the upper end, but remain close to the reference line across the full range. These deviations are small, and the model is well calibrated for population-level interpretation.

Code

hl <- hoslem.test(as.integer(df_cc$diabetes) - 1L, fitted(mod), g = 10)
print(hl)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.integer(df_cc$diabetes) - 1L, fitted(mod)
## X-squared = 124.21, df = 8, p-value < 2.2e-16

The test returns X² = 124.21 (df = 8, p < 0.001). As expected at this sample size, the result is significant and reflects small deviations from perfect calibration.

Taken together, these checks address two questions: whether the model separates higher- from lower-risk respondents, and whether predicted probabilities align with observed rates. Here, they matter less for classification and more as evidence that the model is behaving consistently enough to support interpretation of the adjusted odds ratios.

9 Conclusions

9.1 Key findings

Obesity carries the highest adjusted odds in the model, with respondents in the obese BMI category having 3.21 times the odds of a diabetes diagnosis compared to those at normal weight. Age shows the steepest gradient: adjusted odds rise through each successive band, reaching OR 17.4 for the 75+ group relative to 18-34 year olds.

A clear income gradient persists after adjustment, with odds increasing from OR 1.21 in the $50k-$100k group to OR 1.86 in the lowest income bracket, compared to respondents earning over $100k. Racial and ethnic disparities are similarly robust, with Asian, American Indian or Alaska Native, and Black respondents all showing elevated adjusted odds relative to White respondents (OR 1.94, 1.83, and 1.63, respectively), so these associations are not fully explained by the socioeconomic and behavioural variables in the model.

Physical inactivity and poor mental health are independently associated with higher odds (OR 1.58 and 1.43 for 14+ poor mental health days, respectively). In contrast, two predictors show inverse associations: having no personal doctor (OR 0.337) and heavy drinking (OR 0.484). These are unlikely to reflect protective effects. Respondents without a provider are less likely to have received a diagnosis, while the heavy drinking association likely reflects detection bias or behavioural change following diagnosis. Former smokers show modestly elevated odds (OR 1.09), consistent with a similar pattern of cessation after diagnosis.

9.2 Model performance

The full model achieves an AUC of 0.771, compared to 0.696 for the base demographic model, confirming that socioeconomic and behavioural predictors add meaningful discrimination beyond age, sex, and race/ethnicity alone. The calibration plot shows predicted probabilities are well aligned with observed rates across the risk distribution. The Hosmer-Lemeshow p-value is significant, as expected at this sample size, and does not indicate poor fit.

9.3 Limitations

Cross-sectional design. Diabetes status and all predictors are measured at the same point in time. The direction of causality cannot be established. For example, diabetes diagnosis may prompt changes in physical activity or diet rather than the reverse.

Unweighted analysis. BRFSS uses a complex survey design with stratification, clustering, and post-stratification weights to produce nationally representative estimates. This analysis uses unweighted logistic regression, which produces valid associations within the sample but does not account for the survey design. Population-level prevalence estimates and fully representative ORs would require svyglm() from the survey package with the provided psu, strata, and llcp_weight variables.

Self-reported outcome. Diabetes status is based on self-report of a prior diagnosis. Undiagnosed diabetes, estimated to affect roughly 1 in 4 people with the condition, is not captured, which likely biases associations involving healthcare access towards the null.

Complete-case analysis. Respondents with missing predictor values were excluded (36.5%, n = 165,437), driven primarily by income-group nonresponse. If lower-income respondents are more likely to skip the question, the income gradient may be attenuated; the opposite pattern would inflate it. The missingness mechanism cannot be verified from these data.

10 Session Info

R session info

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Sydney
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] ResourceSelection_0.3-6 pROC_1.19.0.1           patchwork_1.3.2        
##  [4] see_0.13.0              performance_0.16.0      car_3.1-5              
##  [7] carData_3.0-6           broom_1.0.12            janitor_2.2.1          
## [10] gtsummary_2.5.0         arrow_23.0.1.2          lubridate_1.9.5        
## [13] forcats_1.0.1           stringr_1.6.0           dplyr_1.2.1            
## [16] purrr_1.2.1             readr_2.2.0             tidyr_1.3.2            
## [19] tibble_3.3.1            ggplot2_4.0.2           tidyverse_2.0.0        
## [22] here_1.0.2             
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     farver_2.1.2         S7_0.2.1            
##  [4] fastmap_1.2.0        bayestestR_0.17.0    broom.helpers_1.22.0
##  [7] labelled_2.16.0      digest_0.6.39        timechange_0.4.0    
## [10] lifecycle_1.0.5      magrittr_2.0.5       compiler_4.5.0      
## [13] rlang_1.2.0          sass_0.4.10          tools_4.5.0         
## [16] yaml_2.3.12          gt_1.3.0             knitr_1.51          
## [19] labeling_0.4.3       htmlwidgets_1.6.4    bit_4.6.0           
## [22] xml2_1.5.2           RColorBrewer_1.1-3   abind_1.4-8         
## [25] withr_3.0.2          grid_4.5.0           datawizard_1.3.0    
## [28] scales_1.4.0         insight_1.4.6        cli_3.6.5           
## [31] rmarkdown_2.31       generics_0.1.4       rstudioapi_0.18.0   
## [34] tzdb_0.5.0           commonmark_2.0.0     parameters_0.28.3   
## [37] splines_4.5.0        assertthat_0.2.1     base64enc_0.1-6     
## [40] vctrs_0.7.2          Matrix_1.7-5         jsonlite_2.0.0      
## [43] litedown_0.9         hms_1.1.4            bit64_4.6.0-1       
## [46] Formula_1.2-5        glue_1.8.0           stringi_1.8.7       
## [49] gtable_0.3.6         pillar_1.11.1        htmltools_0.5.9     
## [52] R6_2.6.1             rprojroot_2.1.1      evaluate_1.0.5      
## [55] lattice_0.22-9       markdown_2.0         haven_2.5.5         
## [58] backports_1.5.1      cards_0.7.1          snakecase_0.11.1    
## [61] renv_1.2.0           cardx_0.3.2          Rcpp_1.1.1          
## [64] nlme_3.1-169         mgcv_1.9-4           xfun_0.57           
## [67] fs_2.0.1             pkgconfig_2.0.3