Multiple Linear Regression 2

Lecture 15

Author
Affiliation

Katie Solarz

Duke University
STA 199 Summer 2026: Session I

Published

June 8, 2026

Recap: linear regression

Population model

Multiple linear regression captures the relationship between a numerical outcome \(Y\) and many numerical predictors \(X_1\), \(X_2\), …, \(X_p\):

\[\Large{Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...+ \beta_p X_p+\epsilon}\]

  • \(\beta_0\): True intercept of the relationship between the \(X_j\) and \(Y\)
  • \(\beta_j\): True slope of the relationship between \(X_j\) and \(Y\)
  • \(\epsilon\): Error (residual)

. . .

The model with the greek letters and the error term is the “true,” idealized, population relationship that we could access if we had infinite amounts of perfect data. But we don’t, so we have to settle for…

Estimated model

\[\Large{\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + ... + b_pX_p}\]

  • \(b_j\): Estimated slope of the relationship between \(X_j\) and \(Y\);
  • \(b_0\): your prediction for \(Y\) if all the predictors equal zero;
  • No error term!

. . .

This is your best guess at the true regression function based on the noisy, meager, imperfect data you actually have access to. We still compute the \(b_j\) using the principle of least squares: pick the estimates that make the sum of squared residuals as small as possible.

Example: one numerical predictor

Example: one numerical predictor

bm_fl_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

tidy(bm_fl_fit)
# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

\[ \widehat{\text{body mass (g)}}=-5780.83+49.68\times \text{flipper length (mm)} \]

Example: one numerical predictor

new_penguin <- tibble(
  flipper_length_mm = 205
  )

predict(bm_fl_fit, 
        new_penguin)
# A tibble: 1 × 1
  .pred
  <dbl>
1 4405.

Example: “one categorical predictor”

Example: “one categorical predictor”

This actually gets treated as \(k\) binary (0/1) variables, where k is equal to the number of levels in the categorical predictor, \(k-1\) of which are included as predictors in the model:

bm_island_fit <- linear_reg() |>
  fit(body_mass_g ~ island, data = penguins)

tidy(bm_island_fit)
# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

\[ \begin{aligned} \widehat{\text{body mass (g)}} = 4716 &- 1003 \times \text{islandDream} \\ &- 1010 \times \text{islandTorgersen} \end{aligned} \]

Example: “one categorical predictor”

It’s just a concise way of summarizing the within-group means of the response variable:

penguins |>
  group_by(island) |>
  summarize(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE)
  )
# A tibble: 3 × 2
  island    avg_body_mass
  <fct>             <dbl>
1 Biscoe            4716.
2 Dream             3713.
3 Torgersen         3706.
new_penguins = tibble(
  island = c("Biscoe", "Dream", "Torgersen")
)

predict(bm_island_fit, 
        new_penguins)
# A tibble: 3 × 1
  .pred
  <dbl>
1 4716.
2 3713.
3 3706.

Example: numerical and categorical predictors

Example: numerical and categorical predictors

The additive model:

bm_fl_island_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + island, data = penguins)

tidy(bm_fl_island_fit)
# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -4625.     392.      -11.8  4.29e-27
2 flipper_length_mm     44.5      1.87     23.9  1.65e-74
3 islandDream         -262.      55.0      -4.77 2.75e- 6
4 islandTorgersen     -185.      70.3      -2.63 8.84e- 3

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -4625 &+ 44.5 \times \text{flipper length (mm)} \\ &- 262 \times \text{Dream} \\ &- 185 \times \text{Torgersen} \end{aligned} \]

Example: numerical and categorical predictors

Change one character in the code, and you get the interaction model:

bm_fl_island_int_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm * island, data = penguins)

tidy(bm_fl_island_int_fit) |> select(term, estimate)
# A tibble: 6 × 2
  term                              estimate
  <chr>                                <dbl>
1 (Intercept)                        -5464. 
2 flipper_length_mm                     48.5
3 islandDream                         3551. 
4 islandTorgersen                     3218. 
5 flipper_length_mm:islandDream        -19.4
6 flipper_length_mm:islandTorgersen    -17.4

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -5464 &+ 48.5 \times \text{flipper length (mm)} \\ &+ 3551 \times \text{Dream} \\ &+ 3218 \times \text{Torgersen} \\ &- 19.4 \times \text{flipper length*Dream} \\ &- 17.4 \times \text{flipper length*Torgersen} \end{aligned} \]

Example: numerical and categorical predictors

Same code with each of an additive and interaction model fit:

new_penguin <- tibble(
  flipper_length_mm = 205,
  island = "Biscoe"
  )

predict(bm_fl_island_fit, 
        new_data = new_penguin)
# A tibble: 1 × 1
  .pred
  <dbl>
1 4506.
new_penguin <- tibble(
  flipper_length_mm = 205,
  island = "Biscoe"
  )

predict(bm_fl_island_int_fit, 
        new_data = new_penguin)
# A tibble: 1 × 1
  .pred
  <dbl>
1 4488.

Example: many numerical predictors

2 predictors + 1 response = 3 dimensions. Instead of a line of best fit, it’s a plane of best fit:

Example: many numerical predictors

bm_fl_bl_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

tidy(bm_fl_bl_fit)
# A tibble: 3 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       -5737.      308.      -18.6  7.80e-54
2 flipper_length_mm    48.1       2.01     23.9  7.56e-75
3 bill_length_mm        6.05      5.18      1.17 2.44e- 1

\[ \small\widehat{\text{body mass (g)}}=-5736+48.1\times \text{flipper length (mm)}+6\times \text{bill length (mm)} \]

Example: many numerical predictors

new_penguin <- tibble(
  flipper_length_mm = 200,
  bill_length_mm = 45
)

predict(bm_fl_bl_fit, new_data = new_penguin)
# A tibble: 1 × 1
  .pred
  <dbl>
1 4164.

Welcome to statistics!

The code was basically the same every time. In this second half of the course, the code itself is not the primary complication. The code is merely a labor-saving device. Our focus shifts to understanding the conceptual and technical underpinnings of the code and interpreting the output correctly. This is subtle, and it tends to trip people up.

FAQ

Does it matter what the base level is?

Regarding the leveling of a categorical predictor:

  • It doesn’t matter technically speaking. The math doesn’t change;
  • As long as you know what the baseline is, you are consistent throughout your analysis, and you interpret everything correctly, you’re probably good;
  • Sometimes the application suggests a natural choice of baseline. Interpretation and visualization may be easier if you respect this natural choice
    • For example, if the categorical variable is ordinal, it might make sense to use the “lowest ranked” variable as the baseline, so-to-speak (e.g., predict on grade level, where “Freshman” is the baseline level)

Is interaction the same as separate regressions?

Sort of. You get the same estimates of the slopes and intercepts, but the uncertainty quantification (standard errors) is different.

This matters, but we suppress the nuance in this class.

The interaction model

linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm * island, data = penguins) |>
  tidy() |>
  select(term, estimate, std.error)
# A tibble: 6 × 3
  term                              estimate std.error
  <chr>                                <dbl>     <dbl>
1 (Intercept)                        -5464.     431.  
2 flipper_length_mm                     48.5      2.05
3 islandDream                         3551.     969.  
4 islandTorgersen                     3218.    1680.  
5 flipper_length_mm:islandDream        -19.4      4.94
6 flipper_length_mm:islandTorgersen    -17.4      8.73

So on Biscoe island (where Dream = Torgersen = 0), the line is

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -5464 &+ 48.5 \times \text{flipper length (mm)} \end{aligned} \]

Simple regression within a group

If we fit a simple model with the Biscoe subset of the data:

penguins_biscoe <- penguins |>
  filter(island == "Biscoe")

linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins_biscoe) |>
  tidy()
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -5464.     435.       -12.6 8.48e-26
2 flipper_length_mm     48.5      2.07      23.4 2.20e-54

Same estimates, but different standard errors.

The default picture

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = island)) + 
  geom_point() + 
  geom_smooth(method = "lm")

These are separate simple regressions within each group. The slopes and the intercepts are the same as the interaction model, but the uncertainty bands are different.

A second look at \(R^2\)

Recap: our first look

\(R^2\) measures “goodness-of-fit:”

  • \(R^2\) is “a Rotten Tomatoes score” for your model;
  • It ranges from 0 (horrific) to 1 (perfect) and measures how well the linear model fits the data;
  • In simple linear regression with one numerical predictor, \(R^2\) is literally equal to the square of the correlation coefficient between the predictor \(X\) and the numerical response \(Y\);
  • When you move from one predictor to many, this relationship breaks down, but \(R^2\) remains meaningful.

. . .

So what is it, actually?

Reminder: residuals

Every data point has one:

Some are big, some are small. Some are above the line (positive), and some are below the line (negative).

Reminder: residuals

The residuals are what’s left over after the model attempts to “explain” the response:

\[ e_i = \text{observed} - \text{predicted} = y_i - \widehat{y_i}. \]

Every data point has one:

augment(bm_fl_fit, penguins) |>
  select(flipper_length_mm, body_mass_g, .pred, .resid)
# A tibble: 344 × 4
   flipper_length_mm body_mass_g .pred  .resid
               <int>       <int> <dbl>   <dbl>
 1               181        3750 3212.  538.  
 2               186        3800 3461.  339.  
 3               195        3250 3908. -658.  
 4                NA          NA   NA    NA   
 5               193        3450 3808. -358.  
 6               190        3650 3659.   -9.43
 7               181        3625 3212.  413.  
 8               195        4675 3908.  767.  
 9               193        3475 3808. -333.  
10               190        4250 3659.  591.  
# ℹ 334 more rows

How it started

The original distribution of the response. This is the mess we’re trying to clean up with a model:

# A tibble: 1 × 3
  type        variance   std
  <chr>          <dbl> <dbl>
1 body_mass_g  643131.  802.

How it’s going

The distribution of the residuals (the leftover mess) after the model tries to explain (clean up):

# A tibble: 2 × 3
  type        variance   std
  <chr>          <dbl> <dbl>
1 .resid       154999.  394.
2 body_mass_g  643131.  802.

The main idea

The model is trying to “explain” the variation in the response:

  • Variation in the response (blue distribution), is the mess we’re trying to clean up;
  • Variation in the residuals (red distribution), is the leftover after the model gives it a shot;
  • The more leftover there is, the worse the model did;
  • You can quantify this with the variance. The smaller the variance of the residuals relative to the original response variance, the better.

. . .

That’s what \(R^2\) actually is: the proportion of the variance in the response explained by the model.

\(R^2\approx 0\)

Horrifically awful fit. The model didn’t explain (clean up) anything:

# A tibble: 1 × 3
  var_y var_resid r.squared
  <dbl>     <dbl>     <dbl>
1  1.02      1.02  0.000450

\(R^2\approx 0.28\)

# A tibble: 1 × 3
  var_y var_resid r.squared
  <dbl>     <dbl>     <dbl>
1 0.805     0.574     0.287

\(R^2\approx 0.63\)

# A tibble: 1 × 3
  var_y var_resid r.squared
  <dbl>     <dbl>     <dbl>
1 0.701     0.255     0.636

\(R^2\approx 0.97\)

# A tibble: 1 × 3
  var_y var_resid r.squared
  <dbl>     <dbl>     <dbl>
1 0.762    0.0230     0.970

\(R^2\approx 1\)

Perfect fit. The model explained (cleaned up) everything:

# A tibble: 1 × 3
  var_y var_resid r.squared
  <dbl>     <dbl>     <dbl>
1 0.847  0.000408     1.000

The nitty gritty

\[ \begin{aligned} R^2\, = \begin{matrix} \text{proportion of}\\ \text{variation explained} \end{matrix} &= 1- \begin{matrix} \text{proportion of}\\ \text{variation unexplained} \end{matrix}\\[15pt] &= 1-\frac{\text{unexplained variation}}{\text{total variation}} \\[15pt] &= 1-\frac{\text{spread of residuals}}{\text{spread of response}} \\[15pt] &= 1-\frac{\text{var}(e_i)}{\text{var}(y_i)} . \end{aligned} \]

Literally

augment(bm_fl_fit, penguins) |>
  summarize(
    var_y = var(body_mass_g, na.rm = TRUE),
    var_e = var(.resid, na.rm = TRUE),
    R2 = 1 - var_e / var_y
  )
# A tibble: 1 × 3
    var_y   var_e    R2
    <dbl>   <dbl> <dbl>
1 643131. 154999. 0.759

. . .

linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, penguins) |>
  glance() |>
  select(r.squared)
# A tibble: 1 × 1
  r.squared
      <dbl>
1     0.759

Summary

  • \(R^2\) is a number between zero and one, and it represents the proportion of the variance in the response \(Y\) that is explained by the model fit;

  • This is the interpretation, regardless of how many predictors you have;

  • In simple linear regression with one numerical predictor, \(R^2\) is also the square of the correlation coefficient, but this doesn’t generalize to the multivariate setting.

Model selection

Model selection

Question: Given many competing models for the same response, which model is “best”? What does that even mean?

. . .

Answer: Assign a “quality score” to each model, and pick the model with the best score:

Model Score Verdict
Simple 0.1
Additive 0.5
Interaction 0.48

. . .

Variable selection: all of our models were linear, and they differed only by which predictor variables are included. By ranking models by a quality score and picking the “best” one, we can determine which predictors are the “most important” to include.

So… what “quality score” to choose?

This is a big area of statistical research. There are loads of “quality scores” you could consider, and they prioritize different things: predictive accuracy, goodness-of-fit, simplicity, fairness, etc.

  • \(R^2\)
  • adjusted \(R^2\)
  • Akaike Information criterion (AIC)
  • Bayesian information criterion (BIC)
  • … so many more

We won’t go into detail, but you will if you keep studying statistics.

tidymodels reports many of these for free

bm_fl_fit |>
  glance() |>
  select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
  r.squared adj.r.squared   AIC   BIC
      <dbl>         <dbl> <dbl> <dbl>
1     0.759         0.758 5063. 5074.

\(R^2\)’s dirty little secret

\(R^2\) gives participation trophies. It always goes up when you add any new variable to the model, even if that variable is worthless.

Adding a phony variable

The original model fit:

linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, penguins) |>
  glance() |>
  pull(r.squared)
[1] 0.7589925

. . .

Add a predictor that is literally just a bunch of random numbers:

noisy_penguins <- penguins |>
  mutate(
    bullshit = rnorm(nrow(penguins))
  )

linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + bullshit, noisy_penguins) |>
  glance() |>
  pull(r.squared)
[1] 0.7600283

Adjusted \(R^2\)

Because vanilla \(R^2\) blithely rewards the addition of any variable, no matter how meaningless, you shouldn’t use it for model selection. Instead, use adjusted \(R^2\). The formula is in the textbook if you’re dying to know, but here’s the tea:

  • adjusted \(R^2\) penalizes the inclusion of unnecessary, unhelpful predictors;
  • It prioritizes a parsimonious model:
    • a model that fits/predicts well, and
    • a model that’s not too complicated (so our pear brains can still understand it);
  • Occam’s razor!

For the penguins

bm_fl_fit |> glance() |>
  pull(adj.r.squared)
[1] 0.7582837
bm_island_fit |> glance() |>
  pull(adj.r.squared)
[1] 0.3899995
bm_fl_island_fit |> glance() |>
  pull(adj.r.squared)
[1] 0.7722296
bm_fl_island_int_fit |> glance() |>
  pull(adj.r.squared)
[1] 0.7825604
bm_fl_bl_fit |> glance() |>
  pull(adj.r.squared)
[1] 0.7585415

Balancing fit and complexity

  • More complex models (i.e., models with more predictors) tend to fit the data at hand better, but may not generalize well to new data.

  • Model selection criteria, like adjusted \(R^2\), help balance model fit and complexity to avoid overfitting by penalizing models with more predictors.

Overfitting

  • Overfitting occurs when a model captures not only the underlying relationship between predictors and outcome but also the random noise in the data.

  • Overfitted models tend to perform well on the observed data but poorly on new, unseen data.

    • Good news: We have techniques to detect and prevent overfitting.
    • Bad news: This is a topic for another day.