Lecture 13
Duke University
STA 199 Summer 2026: Session I
June 4, 2026


critics and audience
movie_scores

How do we know which variable “should” be the response and which should be the predictor? This will depend on the domain and the research question, but in some cases there is a natural choice. In this example, the critic score for a film is typically available before the audience score. Critics can often screen the film in advance, and their reviews are published on opening day. By contrast, the audience score trickles in over the subsequent weeks. So it’s more likely that we would already have the critic score and use it to anticipate the audience score, instead of the other way around.

# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 32.3 2.34 13.8 4.03e-28
2 critics 0.519 0.0345 15.0 2.70e-31

# A tibble: 1 × 1
r
<dbl>
1 0.781

A regression model is a function that describes the relationship between the outcome, \(Y\), and the predictor, \(X\).
\[ \begin{aligned} Y &= \color{black}{\textbf{Model}} + \text{Error} \\[8pt] &= \color{black}{\mathbf{f(X)}} + \epsilon \\[8pt] &= \color{black}{\boldsymbol{\mu_{Y|X}}} + \epsilon \end{aligned} \]
\(\mu_{Y \mid X}\) is the expected value of \(Y\), given (or, conditional on) a particular value of \(X\)

Use simple linear regression to model the relationship between a quantitative outcome (\(Y\)) and a single quantitative predictor (\(X\))
The “idealized” linear regression model, revealed only with infinite data:
\[\Large{Y = \beta_0 + \beta_1 X + \epsilon}\]
The fitted model, estimated from the finite data we actually have:
\[\Large{\widehat{Y} = b_0 + b_1 X}\]
\(b_1\): Estimated slope of the relationship between \(X\) and \(Y\); you may also see \(\widehat{\beta_1}\)
\(b_0\): Estimated intercept of the relationship between \(X\) and \(Y\); you may also see \(\widehat{\beta_0}\)
No error term!
Ideally, as \(n \to \infty\), \(b_0 \to \beta_0\) and \(b_1 \to \beta_1\)
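The convergence claim can be illustrated with a small simulation. This is only a sketch on simulated data, not the movie data: the "true" values `beta_0` and `beta_1`, the seed, the noise level, and the helper `estimate_line()` are all made up for illustration. Base R's `lm()` is used so that the sketch needs no extra packages.

```r
# Simulated illustration (not the movie data): we choose the "true"
# beta_0 and beta_1 ourselves, so we can see how close b_0 and b_1 get.
set.seed(199)          # arbitrary seed, for reproducibility
beta_0 <- 2            # hypothetical true intercept
beta_1 <- 0.5          # hypothetical true slope

# Hypothetical helper: simulate n observations and return c(b_0, b_1).
estimate_line <- function(n) {
  x <- runif(n, min = 0, max = 100)
  epsilon <- rnorm(n, mean = 0, sd = 10)   # the error term
  y <- beta_0 + beta_1 * x + epsilon
  coef(lm(y ~ x))
}

small_fit <- estimate_line(20)       # estimates from a small sample
large_fit <- estimate_line(100000)   # estimates from a large sample

small_fit
large_fit
```

With the larger sample, the estimates should typically land much closer to the chosen `beta_0` and `beta_1` than with the smaller one.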
You’re already familiar with \(y=mx+b\), so why did I switch it up on you? Why the subscripts? Why the Greek letters?


\[\text{residual} = \text{observed} - \text{predicted} = y - \widehat{y}\]
We have \(n\) observations (generally, the number of rows in a data frame)
\(i^{th}\) observation (\(i\) from \(1\) to \(n\)):
\(y_i\) : \(i^{th}\) outcome
\(x_i\) : \(i^{th}\) explanatory variable
\(\widehat{y_i}\) : \(i^{th}\) predicted outcome
\(e_i\) : \(i^{th}\) residual
\[e_i = \text{observed} - \text{predicted} = y_i - \widehat{y}_i\]
\[e^2_1 + e^2_2 + \dots + e^2_n\]
| data | | residuals |
|---|---|---|
| \(x_1 \quad y_1\) | \(\rightarrow\) | \(e_1 = y_1 - \hat{y}_1 = y_1 - (b_0 + b_1 \times x_1)\) |
| \(x_2 \quad y_2\) | \(\rightarrow\) | \(e_2 = y_2 - \hat{y}_2 = y_2 - (b_0 + b_1 \times x_2)\) |
| \(x_3 \quad y_3\) | \(\rightarrow\) | \(e_3 = y_3 - \hat{y}_3 = y_3 - (b_0 + b_1 \times x_3)\) |
| … | \(\rightarrow\) | … |
| \(x_n \quad y_n\) | \(\rightarrow\) | \(e_n = y_n - \hat{y}_n = y_n - (b_0 + b_1 \times x_n)\) |
\[ \downarrow \]
\[ e_1^2 + e_2^2 + e_3^2 + \ldots + e_n^2 \]
We pick \(b_0\) and \(b_1\) so that
\[ \sum_{i=1}^{n} e_i^2 \]
is as small as possible (“best fit”).
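To make "as small as possible" concrete, here is a minimal sketch on made-up data. The vectors `x` and `y`, the helper `ssr()`, and the two guessed lines are all hypothetical, and base R's `lm()` stands in for the fitting step. The sketch compares the sum of squared residuals of the least-squares line with that of two other candidate lines.

```r
# Made-up data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Sum of squared residuals for a candidate intercept b0 and slope b1.
ssr <- function(b0, b1) {
  e <- y - (b0 + b1 * x)   # e_i = y_i - (b0 + b1 * x_i)
  sum(e^2)
}

ls_fit <- coef(lm(y ~ x))                    # least-squares b0 and b1
ssr_best <- ssr(ls_fit[[1]], ls_fit[[2]])

# Two other candidate lines, chosen by eye.
ssr_guess_1 <- ssr(1, 1)
ssr_guess_2 <- ssr(3, 0.5)

c(least_squares = ssr_best, guess_1 = ssr_guess_1, guess_2 = ssr_guess_2)
```

For this toy data the least-squares line is \(\widehat{y} = 1.8 + 0.8x\), with a sum of squared residuals of 2.4, compared with 3 and 3.75 for the two guesses.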
Why do we minimize
\[ \sum_{i=1}^{n} e_i^2, \]
and not
\[ \sum_{i=1}^{n} e_i, \]
or
\[ \sum_{i=1}^{n} |e_i| \; ? \]
Suppose our residuals are
\[ -4,\; -2,\; 1,\; 2,\; 3 \]
Then
\[ \sum_{i=1}^n e_i = (-4)+(-2)+1+2+3 = 0. \]
But the predictions are clearly not perfect!
Problem: Positive and negative residuals cancel each other out.
\[ -4 + 4 = 0 \]
even though both residuals represent prediction errors.
Using the same residuals,
\[ -4,\; -2,\; 1,\; 2,\; 3 \]
the sum of squared residuals is
\[ (-4)^2 + (-2)^2 + 1^2 + 2^2 + 3^2 = 34. \]
Now all prediction errors contribute positively to the summation.
Absolute error solves the cancellation problem:
\[ -4,\; -2,\; 1,\; 2,\; 3 \]
\[ |{-4}| + |{-2}| + |1| + |2| + |3| = 12 \]
So every prediction error contributes positively. However, squared error has an important advantage:
\[ e^2 \]
is a smooth curve, while
\[ |e| \]
has a sharp point at \(e=0\).
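As a result, the sum of squared residuals can be minimized with calculus, which gives closed-form formulas for \(b_0\) and \(b_1\).
Therefore, we usually minimize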
\[ \sum_{i=1}^n e_i^2. \]
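The three candidate criteria compared above can be checked numerically for the example residuals. This is just the slide arithmetic redone in R; the vector name `e` is an arbitrary choice.

```r
e <- c(-4, -2, 1, 2, 3)   # the residuals from the example

sum(e)        # 0: positive and negative residuals cancel
sum(e^2)      # 34: every residual contributes positively
sum(abs(e))   # 12: every residual contributes positively
```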

Squared error gives increasingly more weight to data points with large residuals (the ones far from the line); absolute error is more chill.
fit syntax
If you recall ggplot, it takes two arguments: a data frame and an aesthetic mapping that specifies what columns to use and how to use them. fit is similar. It takes two arguments: a data frame and a formula that specifies what variables to include in the model and how.
The statement y ~ x is called a formula in R. The variable name that appears to the left of the tilde (~) is treated as the response variable, and the variable(s!) to the right of the tilde are treated as explanatory.
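A minimal sketch of the formula syntax, assuming the tidymodels package is installed. The data frame `toy_scores` and its values are made up for illustration; it is not the real `movie_scores` data, so the estimates will not match the table shown earlier.

```r
library(tidymodels)

# Hypothetical toy data, NOT the real movie_scores data frame.
toy_scores <- data.frame(
  critics  = c(10, 35, 50, 70, 90),
  audience = c(40, 45, 62, 68, 80)
)

# response ~ predictor: audience is the response, critics the predictor.
toy_fit <- linear_reg() |>
  fit(audience ~ critics, data = toy_scores)

tidy(toy_fit)   # one row per term: (Intercept) and critics
```

Swapping the two sides of the tilde (`critics ~ audience`) would fit a different model, one that predicts the critic score from the audience score.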
A new movie with a critics’ score of \(x = 20\) is released, and our model predicts that the audience score will be \(\widehat{y}\approx 42.69\), on average:
\[\widehat{\text{audience}} = 32.3 + 0.519 \times \text{critics}\]
The “we expect” and “on average” are a bit redundant, but let’s go belt and suspenders in this class.
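The prediction can be reproduced by plugging \(x = 20\) into the fitted equation. This sketch uses the rounded coefficients printed in the regression output, which give about 42.68; the 42.69 quoted above presumably comes from the unrounded estimates, though that is an assumption. The variable names are arbitrary.

```r
b0 <- 32.3    # intercept estimate, as rounded in the regression output
b1 <- 0.519   # slope estimate, as rounded in the regression output

critics_new <- 20
audience_hat <- b0 + b1 * critics_new
audience_hat   # about 42.68 with these rounded coefficients
```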
When interpreting coefficient estimates in a regression:
🛑 Avoid deterministic, causal-sounding statements: “x makes y go up by 0.519”; “If x = 0, then y will be 32.3”
✅ Prefer: “If x increases by one unit, we expect / predict that y will be higher by 0.519, on average.”
In general, our models give imperfect predictions about average behavior. The predictions are not guarantees, and the relationship may or may not be causal. Establishing causality is an entire class in and of itself (causal inference).
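✅ The intercept is meaningful in the context of the data if the predictor can feasibly take values equal to or near zero, ideally with such values present in the observed data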
🛑 Otherwise, it might not be meaningful!
For example…
The regression line goes through the center of mass point (the coordinates corresponding to average \(X\) and average \(Y\) i.e., (\(\bar{x}, \bar{y}\)))
Why? Under LSR (least-squares regression), \(b_0 = \bar{y} - b_1~\bar{x}\); plugging \(x =\bar{x}\) into the fitted equation gives \(\widehat{y} = b_0 + b_1~\bar{x} = (\bar{y} - b_1~\bar{x}) + b_1~\bar{x} = \bar{y}\)
\(\Rightarrow \widehat{y} = \bar{y} \text{ when } x = \bar{x}\)
Slope has the same sign as the correlation coefficient: \(b_1 = r \frac{s_Y}{s_X}\)
Sum of the residuals is zero: \(\sum_{i = 1}^n e_i = 0\)
Residuals and \(X\) values are uncorrelated
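The properties listed above can be checked numerically. This is a minimal sketch on made-up data: the vectors `x` and `y` are hypothetical, and base R's `lm()` is used for the least-squares fit. Each check should return `TRUE`, allowing for floating-point rounding.

```r
# Made-up data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

fit_lm <- lm(y ~ x)
b0 <- coef(fit_lm)[[1]]
b1 <- coef(fit_lm)[[2]]
e  <- resid(fit_lm)

# 1. The line passes through (x-bar, y-bar).
through_center <- isTRUE(all.equal(b0 + b1 * mean(x), mean(y)))

# 2. The slope equals r * s_Y / s_X, so it shares the sign of r.
slope_matches_r <- isTRUE(all.equal(b1, cor(x, y) * sd(y) / sd(x)))

# 3. The residuals sum to zero.
residuals_sum_zero <- abs(sum(e)) < 1e-10

# 4. The residuals are uncorrelated with the x values.
residuals_uncorrelated <- abs(cor(e, x)) < 1e-10

c(through_center, slope_matches_r, residuals_sum_zero, residuals_uncorrelated)
```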
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-13-modeling-penguins.qmd.
Work through the application exercise in class, and render, commit, and push your edits.