
Lecture 17
Duke University
STA 199 Summer 2026: Session I
June 10, 2026
We have been studying regression:
What combinations of data types have we seen?
What did the picture look like?
Numerical response and one numerical predictor:

Numerical response and one categorical predictor (two levels):

Numerical response; numerical and categorical predictors:


\[ y = \begin{cases} 1 & \text{e.g., Yes, Win, True, Heads, Success}\\ 0 & \text{e.g., No, Lose, False, Tails, Failure}. \end{cases} \]

If we can model the relationship between predictors (\(x\)) and a binary response (\(y\)), we can use the model to do a special kind of prediction called classification.
\[ \mathbf{x}: \text{word and character counts in an e-mail.} \]

\[ y = \begin{cases} 1 & \text{it's spam}\\ 0 & \text{it's legit} \end{cases} \]
\[ \mathbf{x}: \text{features in a medical image.} \]

\[ y = \begin{cases} 1 & \text{it's cancer}\\ 0 & \text{it's healthy} \end{cases} \]
\[ \mathbf{x}: \text{financial and demographic info about a loan applicant.} \]

\[ y = \begin{cases} 1 & \text{applicant is at risk of defaulting on loan}\\ 0 & \text{applicant is safe} \end{cases} \]
\[ \mathbf{x}: \text{word counts (e.g., thou, love, heartbreak), stylistic features} \]

\[ y = \begin{cases} 1 & \text{Taylor Swift}\\ 0 & \text{William Shakespeare} \end{cases} \]


Instead of modeling \(y\) directly, we model the probability that \(y=1\):

Recall regression with a numerical response:
Similar when modeling a binary response:
The S-shaped curve is the logistic function:
\[ \text{Prob}(y = 1) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}. \]
If you set \(p = \text{Prob}(y = 1)\) and do some algebra, you get the simple linear model for the log-odds:
\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]
This is called the logistic regression model.
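The "some algebra" connecting the two forms can be sketched as follows, writing \(\eta = \beta_0+\beta_1x\) as shorthand (the symbol \(\eta\) is introduced here only for convenience):

\[ p = \frac{e^{\eta}}{1+e^{\eta}} \quad\Longrightarrow\quad 1-p = \frac{1}{1+e^{\eta}} \quad\Longrightarrow\quad \frac{p}{1-p} = e^{\eta} \quad\Longrightarrow\quad \log\left(\frac{p}{1-p}\right) = \eta = \beta_0+\beta_1x. \]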
\(p = \text{Prob}(y = 1)\) is a probability: a number between 0 and 1;
\(p / (1 - p)\) is the odds: a number between 0 and \(\infty\);
“The odds of this lecture going well are 10 to 1.”
The log odds \(\log(p / (1 - p))\) is a number between \(-\infty\) and \(\infty\), which is suitable for the linear model.
Why does this “transformation” work? The log function maps positive numbers \((0, \infty)\) to all real numbers \((-\infty, \infty)\).
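To make the three scales concrete, here is a minimal Python sketch that converts between probability, odds, and log-odds. The function names are made up for illustration, and reading "10 to 1" as odds of 10 in favor is an assumption.

```python
import math

def odds(p):
    """Odds in favor of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Log-odds (logit) of a probability p."""
    return math.log(odds(p))

def prob_from_odds(o):
    """Invert the odds: p = o / (1 + o)."""
    return o / (1 + o)

# Assumption: "10 to 1" means odds of 10 in favor, i.e. p = 10/11.
p_lecture = prob_from_odds(10)

# Probabilities near 0 give very negative log-odds; near 1, very positive.
for p in (0.01, 0.5, p_lecture, 0.99):
    print(f"p = {p:.3f}   odds = {odds(p):8.3f}   log-odds = {log_odds(p):7.3f}")
```

A probability of 0.5 sits at odds of 1 and log-odds of 0, which is why the log-odds scale is symmetric around "even chances".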



\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]
The logit function \(\log(p / (1-p))\) is an example of a link function: it maps the probability \(p \in (0, 1)\) onto the whole real line, matching the range of the linear model;
This is an example of a generalized linear model;
We estimate the parameters \(\beta_0,\,\beta_1\) using maximum likelihood (don’t worry about it) to get the “best fitting” S-curve;
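For the curious, the idea behind maximum likelihood can be sketched in a few lines of Python: choose the coefficients that give the observed 0/1 responses the highest probability. The toy data and the plain gradient-ascent loop are assumptions made for illustration; statistical software uses more refined optimizers.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability."""
    return 1 / (1 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log of the probability the model assigns to the observed 0/1 responses."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

def fit(xs, ys, steps=5000, rate=0.05):
    """Climb the log-likelihood by plain gradient ascent (illustrative only)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Residuals y - p_hat drive the update for both coefficients.
        resid = [y - logistic(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += rate * sum(resid)
        b1 += rate * sum(r * x for r, x in zip(resid, xs))
    return b0, b1

# Hypothetical toy data, not the email data set.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit(xs, ys)
print("estimates:", round(b0_hat, 3), round(b1_hat, 3))
print("log-likelihood at (0, 0):   ", round(log_likelihood(0, 0, xs, ys), 3))
print("log-likelihood at estimates:", round(log_likelihood(b0_hat, b1_hat, xs, ys), 3))
```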
The fitted model is
\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0+b_1x. \]
Rows: 3,921
Columns: 6
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,…
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
$ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4,…
# A tibble: 2 × 5
  term          estimate  std.error statistic p.value
  <chr>            <dbl>      <dbl>     <dbl>   <dbl>
1 (Intercept)  -2.27       0.0553     -41.1     0
2 exclaim_mess  0.000272   0.000949     0.287   0.774
Fitted equation for the log-odds:
\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.27 + 0.000272\times \mathit{exclaim\_mess} \]
If exclaim_mess = 0, then
\[ \widehat{p}=\widehat{P(y=1)}=\frac{e^{-2.27}}{1+e^{-2.27}}\approx 0.09. \]
So, our model predicts that an email with no exclamation marks has a 9% probability of being spam.
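The same calculation as a minimal Python sketch, plugging the rounded estimates from the table into the logistic function. The function name `predict_prob` is made up for illustration.

```python
import math

b0 = -2.27       # intercept, rounded as in the fitted-model table
b1 = 0.000272    # slope for exclaim_mess, rounded as in the table

def predict_prob(exclaim_mess):
    """Predicted probability of spam for a given number of exclamation marks."""
    log_odds = b0 + b1 * exclaim_mess
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p_zero = predict_prob(0)
print(round(p_zero, 2))  # 0.09
```

Calling `predict_prob` with other counts shows how slowly the predicted probability moves, since the slope is so close to zero.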
Recall:
\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0+b_1x. \]
Alternatively:
\[ \frac{\widehat{p}}{1-\widehat{p}} = e^{b_0+b_1x} = \color{blue}{e^{b_0}e^{b_1x}} . \]
If we increase \(x\) by one unit, we have:
\[ \frac{\widehat{p}}{1-\widehat{p}} = e^{b_0}e^{b_1(x+1)} = e^{b_0}e^{b_1x+b_1} = {\color{blue}{e^{b_0}e^{b_1x}}}{\color{red}{e^{b_1}}} . \]
A one-unit increase in \(x\) is associated with the odds changing by a multiplicative factor of \(e^{b_1}\). Gross!
\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.27 + 0.000272\times \mathit{exclaim\_mess} \]
For each additional exclamation mark, we predict the odds of an email being spam to be higher by a multiplicative factor of \(e^{0.000272}\approx 1.000272\), on average: an increase of about 0.03%.
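A quick numerical check of this interpretation, as a minimal Python sketch: the ratio of predicted odds at two counts one apart equals \(e^{b_1}\), whatever the starting count. The counts 10 and 11 are arbitrary choices made for illustration.

```python
import math

b0, b1 = -2.27, 0.000272  # rounded estimates from the fitted model

def predicted_odds(exclaim_mess):
    """Predicted odds of spam: exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * exclaim_mess)

# Odds ratio for one extra exclamation mark (10 vs. 11 is arbitrary).
ratio = predicted_odds(11) / predicted_odds(10)
factor = math.exp(b1)

print(round(ratio, 6), round(factor, 6))  # 1.000272 1.000272
```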
Select a number \(0 < p^* < 1\):

Solve for the x-value that matches the threshold:

A new person shows up with \(x_{\text{new}}\). Which side of the boundary are they on?
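The threshold-and-boundary recipe can be sketched in Python as follows. Setting the fitted log-odds equal to \(\log(p^*/(1-p^*))\) and solving gives \(x^* = (\log(p^*/(1-p^*)) - b_0)/b_1\). The coefficients below are hypothetical, chosen only so the numbers come out round, and the rule "classify as 1 when \(\widehat{p} > p^*\)" is an assumed convention.

```python
import math

def boundary(b0, b1, p_star):
    """x-value at which the predicted probability equals the threshold p_star."""
    return (math.log(p_star / (1 - p_star)) - b0) / b1

def classify(x_new, b0, b1, p_star):
    """Return 1 if the predicted probability exceeds p_star, else 0."""
    p_hat = 1 / (1 + math.exp(-(b0 + b1 * x_new)))
    return 1 if p_hat > p_star else 0

# Hypothetical coefficients, not estimated from any data in this lecture.
b0, b1, p_star = -3.0, 0.5, 0.5

x_star = boundary(b0, b1, p_star)
print("boundary:", x_star)                       # 6.0
print("x_new = 4 ->", classify(4, b0, b1, p_star))  # 0
print("x_new = 8 ->", classify(8, b0, b1, p_star))  # 1
```

With a positive slope, values of \(x_{\text{new}}\) above the boundary are classified as 1; with a negative slope the sides swap.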

Two numerical predictors and one binary response:

For the log-odds, a multiple linear regression:
\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m. \]
On the probability scale:
\[ \text{Prob}(y = 1) = \frac{e^{\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m}}{1+e^{\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m}}. \]
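As a minimal Python sketch, the probability-scale formula with any number of predictors is a weighted sum followed by the logistic function. The coefficient values are hypothetical, and the code uses the equivalent form \(1/(1+e^{-z})\) of \(e^{z}/(1+e^{z})\).

```python
import math

def predict_prob(coefs, xs):
    """Prob(y = 1) for coefs = [b0, b1, ..., bm] and predictors xs = [x1, ..., xm]."""
    log_odds = coefs[0] + sum(b * x for b, x in zip(coefs[1:], xs))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical coefficients for two predictors.
coefs = [-1.0, 0.8, -0.5]
p = predict_prob(coefs, [2.0, 1.0])  # log-odds = -1.0 + 1.6 - 0.5 = 0.1
print(round(p, 3))
```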
The decision boundary is linear! Consider two numerical predictors:
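One way to see why the decision boundary is a straight line: with two predictors, the boundary is the set of points where the fitted log-odds equal the log-odds of the threshold \(p^*\). Assuming \(b_2 \neq 0\), solving for \(x_2\) gives

\[ b_0 + b_1x_1 + b_2x_2 = \log\left(\frac{p^*}{1-p^*}\right) \quad\Longrightarrow\quad x_2 = \frac{\log\left(\frac{p^*}{1-p^*}\right) - b_0}{b_2} - \frac{b_1}{b_2}\,x_1, \]

which is a line in the \((x_1, x_2)\) plane with slope \(-b_1/b_2\). Changing \(p^*\) shifts the line without changing its slope.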


To balance out the two kinds of errors:

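Set \(p^* = 0\): classifying as 1 whenever \(\widehat{p} > p^*\), every observation is classified as 1. There are no false negatives, but many false positives.
Set \(p^* = 1\): no predicted probability exceeds the threshold, so every observation is classified as 0. There are no false positives, but many false negatives.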
You pick a threshold in between to strike a balance. The exact number depends on context.
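The trade-off can be sketched in Python by counting both kinds of errors at several thresholds. The predicted probabilities and true labels below are made up for illustration, and the rule "classify as 1 when \(\widehat{p} > p^*\)" is an assumed convention.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
p_hat = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.65, 0.80, 0.90, 0.95]
y     = [0,    0,    0,    1,    0,    0,    1,    1,    1,    1]

def error_counts(p_hat, y, p_star):
    """Return (false positives, false negatives) at threshold p_star."""
    fp = sum(1 for p, t in zip(p_hat, y) if p > p_star and t == 0)
    fn = sum(1 for p, t in zip(p_hat, y) if p <= p_star and t == 1)
    return fp, fn

for p_star in (0.0, 0.25, 0.5, 0.75, 1.0):
    fp, fn = error_counts(p_hat, y, p_star)
    print(f"p* = {p_star:.2f}   false positives = {fp}   false negatives = {fn}")
```

Raising the threshold trades false positives for false negatives; the two extremes eliminate one kind of error entirely at the cost of the other.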
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-15-spam-filter.qmd.
Work through the application exercise in class, and render, commit, and push your edits.