Multiple Linear Regression 1

Lecture 14

Katie Solarz

Duke University
STA 199 Summer 2026: Session I

June 5, 2026

Recap: simple linear regression

From last time…penguins

A researcher wants to see how body mass varies with flipper length.

outcome: body mass (g) (numerical)
predictor: flipper length (mm) (numerical)

Flipper length is easier to measure, so more plausible you would predict body mass based on that and not the other way around.

How do we concisely summarize the association between two variables?

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()

Line of best fit!

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point() + 
  geom_smooth(method = "lm", color = "cornflowerblue")

Concise numerical summary: correlation

penguins |>
  summarize(
    r = cor(flipper_length_mm, body_mass_g, use = "complete.obs")
  )

# A tibble: 1 × 1
      r
  <dbl>
1 0.871

Measures the strength and direction of the linear association between two numerical variables. Strong and positive in this case.

Model fit

bm_fl_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

tidy(bm_fl_fit)

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

\[ \widehat{\text{body mass (g)}}=-5780.83+49.68\times \text{flipper length (mm)} \]

\(R^2\) (coefficient of determination)

The fraction of the variation in the response explained by the model. A number between 0 (bad) and 1 (good) that measures goodness-of-fit:

bm_fl_fit |>
  glance() |>
  select(r.squared)

# A tibble: 1 × 1
  r.squared
      <dbl>
1     0.759

penguins |>
  summarize(
    r = cor(flipper_length_mm, body_mass_g, use = "complete.obs"),
    R2 = r^2
  )

# A tibble: 1 × 2
      r    R2
  <dbl> <dbl>
1 0.871 0.759

Interpretation

\[ \widehat{\text{body mass (g)}}=-5780.83+49.68\times \text{flipper length (mm)} \]

Intercept: Penguins with 0 mm flipper length are expected, on average, to weigh -5,781 grams (makes no sense!);
Slope: For each additional mm of a penguin’s flipper length, the weight of the penguin is expected to be higher, on average, by 49.7 g.

Prediction

Prediction

new_penguin <- tibble(flipper_length_mm = 205)

predict(bm_fl_fit, new_data = new_penguin)

# A tibble: 1 × 1
  .pred
  <dbl>
1 4405.

\[ \widehat{\text{body mass (g)}}=-5780.83+49.68\times 205\approx 4404.71\text{g} \]

We predict that a penguin whose flipper is 205 mm long will weigh 4,404.71 g, on average.

Linear regression with one categorical predictor

New question

A different researcher wants to look at body weight of penguins based on the island they were recorded on. How are the variables involved in this analysis different?

outcome: body mass in grams (numerical)
predictor: island (categorical, with three levels)

levels(penguins$island)

[1] "Biscoe"    "Dream"     "Torgersen"

Scatterplot: possible, but not so good

Visualize the relationship between body weight and island of penguins. Also calculate the average body weight per island.

ggplot(penguins, aes(x = island, y = body_mass_g)) + 
  geom_point()

Boxplot

Visualize the relationship between body weight and island of penguins. Also, calculate the average body weight per island.

ggplot(
  penguins,
  aes(x = island, y = body_mass_g)
) +
  geom_boxplot() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body mass by island"
  )

penguins |>
  group_by(island) |>
  summarize(
    bm_mean = mean(body_mass_g, na.rm = TRUE)
  )

# A tibble: 3 × 2
  island    bm_mean
  <fct>       <dbl>
1 Biscoe      4716.
2 Dream       3713.
3 Torgersen   3706.

Density plot

ggplot(
  penguins,
  aes(x = body_mass_g, color = island, fill = island)
) +
  geom_density(alpha = 0.5) +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body Mass by Island",
    color = "Island",
    fill = "Island"
  )

Violin plot

ggplot(
  penguins,
  aes(x = island, y = body_mass_g)
) +
  geom_violin() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body Mass by Island"
  )

Multiple geoms

ggplot(
  penguins,
  aes(x = island, y = body_mass_g, color = island)
) +
  geom_boxplot() +
  geom_beeswarm(size = 2, alpha = 0.5) +
  scale_color_colorblind() +
  labs(
    x = "Island",
    y = "Body mass (g)",
    title = "Body Mass by Island",
    color = "Island"
  ) +
  theme(legend.position = "none")

Model - fit

Fit a linear regression model predicting body weight from island and display the results. Why is Biscoe not on the output?

bm_island_fit <- linear_reg() |>
  fit(body_mass_g ~ island, data = penguins)

tidy(bm_island_fit)

# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

Huh?

Dummy variables

dummy_penguins <- penguins |>
  select(body_mass_g, island) |>
  arrange(body_mass_g) |>
  mutate(
    islandDream = if_else(island == "Dream", 1, 0),
    islandTorgersen = if_else(island == "Torgersen", 1, 0),
  )

dummy_penguins

# A tibble: 344 × 4
   body_mass_g island    islandDream islandTorgersen
         <int> <fct>           <dbl>           <dbl>
 1        2700 Dream               1               0
 2        2850 Biscoe              0               0
 3        2850 Biscoe              0               0
 4        2900 Biscoe              0               0
 5        2900 Dream               1               0
 6        2900 Torgersen           0               1
 7        2900 Dream               1               0
 8        2925 Biscoe              0               0
 9        2975 Dream               1               0
10        3000 Dream               1               0
# ℹ 334 more rows

Dummy variables - fit

linear_reg() |>
  fit(
    body_mass_g ~ islandDream + islandTorgersen, 
    dummy_penguins
  ) |>
  tidy()

# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

Exactly the same as before!

Model - interpret

\[ \widehat{\text{body mass(g)}} = 4716 - 1003 \times \text{islandDream} - 1010 \times \text{islandTorgersen} \]

Intercept: Penguins from Biscoe island are expected to weigh, on average, 4,716 g.
Slope - islandDream: Penguins from Dream island are expected to weigh, on average, 1,003 g less than those from Biscoe island.
Slope - islandTorgersen: Penguins from Torgersen island are expected to weigh, on average, 1,010 g less than those from Biscoe island.

Model - predict

What is the predicted body weight of a penguin on Biscoe island? What are the estimated body weights of penguins on Dream and Torgersen islands? Where have we seen these values before?

new_penguins = tibble(
  island = c("Biscoe", "Dream", "Torgersen")
)

predict(bm_island_fit, new_data = new_penguins)

# A tibble: 3 × 1
  .pred
  <dbl>
1 4716.
2 3713.
3 3706.

Model - predict

Calculate the predicted body weights of penguins on Biscoe, Dream, and Torgersen islands by hand.

\[ \widehat{body~mass} = 4716 - 1003 \times islandDream - 1010 \times islandTorgersen \]

Biscoe: \(\widehat{body~mass} = 4716 - 1003 \times 0 - 1010 \times 0 = 4716\)

Dream: \(\widehat{body~mass} = 4716 - 1003 \times 1 - 1010 \times 0 = 3713\)

Torgersen: \(\widehat{body~mass} = 4716 - 1003 \times 0 - 1010 \times 1 = 3706\)

Look familiar?

penguins |>
  group_by(island) |>
  summarize(
    bm_mean = mean(body_mass_g, na.rm = TRUE)
  )

# A tibble: 3 × 2
  island    bm_mean
  <fct>       <dbl>
1 Biscoe      4716.
2 Dream       3713.
3 Torgersen   3706.

Models with categorical predictors

When the categorical predictor has many levels, they’re encoded as dummy variables.
The first level of the categorical variable is the “baseline” level. In a model with one categorical predictor, the intercept is the predicted value of the outcome for the baseline level (x = 0).
Each slope coefficient describes the difference between the predicted value of the outcome for that level of the categorical variable compared to the baseline level.

In general

Predicting continuous outcome \(Y\) using one categorical predictor \(X\) with multiple levels 1, 2, …, \(k\). Create dummy variables for every level except the base level:

\[ cat_k=\begin{cases} 1 & X=k\\ 0 & \text{else} \end{cases} \]

Then fit a regression with multiple dummy predictors:

\[ \widehat{Y} = b_0 + b_1 \times cat_1 + b_2 \times cat_2 \ldots + b_{k-1} \times cat_{k-1} \]

\(b_0\) : the model prediction for a member of the base level;
\(b_1\): how does the prediction change when we move from the base level to level 1?
\(b_2\): how does the prediction change when we move from the base level to level 2?
etc…

We’re not animals. We have technology!

The computer handles all of this for you, but you need to understand the details so you code and interpret it correctly.

Changing the baseline level

By default, R uses the first level of a categorical variable as the baseline level. this is often the first alphabetically, but make sure you check!
We can change the baseline level by reordering the levels of the categorical variable… do you remember how to do this??

Linear regression with one numerical and one categorical predictor

Three variables on one 2D plot

ggplot(penguins, 
  aes(x = flipper_length_mm, y = body_mass_g, color = island)
  ) +
  geom_point()

How do we predict using more than one predictor?

Both of these models use flipper_length_mm and island to predict body_mass_g:

The additive model: parallel lines, one for each island

The additive model: parallel lines, one for each island

bm_fl_island_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + island, data = penguins)

tidy(bm_fl_island_fit)

# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -4625.     392.      -11.8  4.29e-27
2 flipper_length_mm     44.5      1.87     23.9  1.65e-74
3 islandDream         -262.      55.0      -4.77 2.75e- 6
4 islandTorgersen     -185.      70.3      -2.63 8.84e- 3

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -4625 &+ 44.5 \times {\text{flipper length (mm)}} \\ &- 262 \times \text{Dream} \\ &- 185 \times \text{Torgersen} \end{aligned} \]

Where do the three lines come from?

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -4625 &+ 44.5 \times \text{flipper length (mm)} \\ &- 262 \times \text{Dream} \\ &- 185 \times \text{Torgersen} \end{aligned} \]

If penguin is from Biscoe, Dream = 0 and Torgersen = 0:

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -4625 &+ 44.5 \times \text{flipper length (mm)} \end{aligned} \]

If penguin is from Dream, Dream = 1 and Torgersen = 0:

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -4887 &+ 44.5 \times \text{flipper length (mm)} \end{aligned} \]

If penguin is from Torgersen, Dream = 0 and Torgersen = 1:

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -4810 &+ 44.5 \times \text{flipper length (mm)} \end{aligned} \]

Either way, same slope, so the lines are parallel.

The interaction model: different lines for each island

Code

ggplot(
  penguins, 
  aes(x = flipper_length_mm, y = body_mass_g, color = island)
  ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interaction model") +
  theme(legend.position = "bottom")

The interaction model: different lines for each island

bm_fl_island_int_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm * island, data = penguins)

tidy(bm_fl_island_int_fit) |> select(term, estimate)

# A tibble: 6 × 2
  term                              estimate
  <chr>                                <dbl>
1 (Intercept)                        -5464. 
2 flipper_length_mm                     48.5
3 islandDream                         3551. 
4 islandTorgersen                     3218. 
5 flipper_length_mm:islandDream        -19.4
6 flipper_length_mm:islandTorgersen    -17.4

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -5464 &+ 48.5 \times \text{flipper length (mm)} \\ &+ 3551 \times \text{Dream} \\ &+ 3218 \times \text{Torgersen} \\ &- 19.4 \times \text{flipper length (mm)}*\text{Dream} \\ &- 17.4 \times \text{flipper length (mm)}*\text{Torgersen} \end{aligned} \]

Where do the three lines come from?

\[ \begin{aligned} \small\widehat{\text{body mass (g)}} = -5464 &+ 48.5 \times \text{flipper length (mm)} \\ &+ 3551 \times \text{Dream} \\ &+ 3218 \times \text{Torgersen} \\ &- 19.4 \times \text{flipper length (mm)} * \text{Dream} \\ &- 17.4 \times \text{flipper length (mm)} * \text{Torgersen} \end{aligned} \]

If penguin is from Biscoe, Dream = 0 and Torgersen = 0:

\[ \begin{aligned} \widehat{\text{body mass (g)}} = -5464 &+ 48.5 \times \text{flipper length (mm)} \end{aligned} \]

If penguin is from Dream, Dream = 1 and Torgersen = 0:

\[ \begin{aligned} \widehat{\text{body mass (g)}} &= (-5464 + 3551) + (48.5-19.4) \times \text{flipper length (mm)}\\ &=-1913+29.1\times \text{flipper length (mm)}. \end{aligned} \]

Prediction on Biscoe island

Prediction on Biscoe island

new_penguin <- tibble(
  flipper_length_mm = 205,
  island = "Biscoe"
)

predict(bm_fl_island_int_fit, new_data = new_penguin)

# A tibble: 1 × 1
  .pred
  <dbl>
1 4488.

\[ \widehat{\text{body mass (g)}} = -5464 + 48.5 \times 205 \]

Prediction on Dream island

Prediction on Dream island

new_penguin <- tibble(
  flipper_length_mm = 205,
  island = "Dream"
)

predict(bm_fl_island_int_fit, new_data = new_penguin)

# A tibble: 1 × 1
  .pred
  <dbl>
1 4060.

\[ \widehat{\text{body mass (g)}} = (-5464 + 3551) + (48.5 - 19.4) \times 205 \]

Prediction on Torgersen island

Prediction on Torgersen island

new_penguin <- tibble(
  flipper_length_mm = 205,
  island = "Torgersen"
)

predict(bm_fl_island_int_fit, new_data = new_penguin)

# A tibble: 1 × 1
  .pred
  <dbl>
1 4136.

\[ \widehat{\text{body mass (g)}} = (-5464 + 3218) + (48.5 - 17.4) \times 205 \]

Linear regression with many numerical predictors

Multiple numerical predictors

bm_fl_bl_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

tidy(bm_fl_bl_fit)

# A tibble: 3 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       -5737.      308.      -18.6  7.80e-54
2 flipper_length_mm    48.1       2.01     23.9  7.56e-75
3 bill_length_mm        6.05      5.18      1.17 2.44e- 1

\[ \small\widehat{\text{body mass (g)}}=-5736+48.1\times \text{flipper length (mm)}+6\times \text{bill length (mm)} \]

Interpretation

\[ \small\widehat{\text{body mass (g)}}=-5736+48.1\times \text{flipper length (mm)}+6\times \text{bill length (mm)} \]

Interpretations:

We predict that the body mass of a penguin with zero flipper length and zero bill length will be -5,736 g, on average (makes no sense);
Holding all other variables constant, for every additional mm in flipper length, we expect the body mass of penguins to be higher, on average, by 48.1 g;
Holding all other variables constant, for every additional mm in bill length, we expect the body mass of penguins to be higher, on average, by 6 g.

Prediction

new_penguin <- tibble(
  flipper_length_mm = 200,
  bill_length_mm = 45
)

predict(bm_fl_bl_fit, new_data = new_penguin)

# A tibble: 1 × 1
  .pred
  <dbl>
1 4164.

\[ \widehat{\text{body mass (g)}}=-5736+48.1\times 200+6\times 45 \]

Picture? It’s not pretty…

2 predictors + 1 response = 3 dimensions. Ick!

Picture? It’s not pretty…

Instead of a line of best fit, it’s a plane of best fit. Double ick!

Multiple linear regression in general

Population model

Multiple linear regression captures the relationship between a numerical outcome \(Y\) and many numerical predictors \(X_1\), \(X_2\), …, \(X_p\):

\[\Large{Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...+ \beta_p X_p+\epsilon}\]

\(\beta_0\): True intercept of the relationship between the \(X_j\) and \(Y\)
\(\beta_j\): True slope of the relationship between \(X_j\) and \(Y\)
\(\epsilon\): Error (residual)

The model with the greek letters and the error term is the “true,” idealized, population relationship that we could access if we had infinite amounts of perfect data. But we don’t, so we have to settle for…

Estimated / fitted model

\[\Large{\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + ... + b_pX_p}\]

\(b_j\): Estimated slope of the relationship between \(X_j\) and \(Y\);
\(b_0\): your prediction for \(Y\) if all the predictors equal zero;
No error term!

This is your best guess at the true regression function based on the noisy, meager, imperfect data you actually have access to. We still compute the \(b_j\) using the principle of least squares: pick the estimates that make the sum of squared residuals as small as possible.

Next time

Today we saw multiple models that are all attempting to do the same thing: predict body mass.

Which one is “best”?
What does “best” even mean?