AE 15: Building a spam filter

Suggested answers

In this application exercise, we will

Use logistic regression to fit a model for a binary response variable
Fit a logistic regression model in R
Use a logistic regression model for classification

To illustrate logistic regression, we will build a spam filter from email data.

The data come from incoming emails in David Diez’s Gmail account for the first three months of 2012. All personally identifiable information has been removed.

library(tidyverse)
library(tidymodels)
library(openintro)

glimpse(email)

Rows: 3,921
Columns: 21
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple  <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email   <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time         <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59, 2012-01-01 16:…
$ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
$ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks  <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format       <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number       <fct> big, small, small, small, none, none, big, small, small, …

levels(email$spam)

[1] "0" "1"

The variables we’ll use in this analysis are

spam: 1 if the email is spam, 0 otherwise
exclaim_mess: The number of exclamation points in the email message

Goal: Use the number of exclamation points in an email to predict whether or not it is spam.

Exercises

Exercise 1

Let’s start with some exploratory analysis:

Create an density plot to investigate the relationship between spam and exclaim_mess.

email |>
  filter(exclaim_mess <= 200) |>
  ggplot(aes(x = exclaim_mess, color = spam)) +
  geom_density() +
  xlim(0, 20) + ## constrain x-axis for better viz
  labs(x = "# Exclamation Points",
       y = NULL,
       title = "Density Plot of # Exclamation Points in Email Data",
       color = "Spam Category")

Additionally, calculate the mean number of exclamation points for both spam and non-spam emails.

email |>
  group_by(spam) |>
  summarize(mean_exclaim = mean(exclaim_mess))

# A tibble: 2 × 2
  spam  mean_exclaim
  <fct>        <dbl>
1 0             6.51
2 1             7.32

Exercise 2

Visualize a linear model fit for these data:

ggplot(email, aes(x = exclaim_mess, y = as.numeric(spam))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Is the linear model a good fit for the data? Why or why not?

No; our outcome can only be either 0 or 1, but a linear model predicts nonsensical, impossible values for this outcome.

Exercise 3

Fit the logistic regression model using the number of exclamation points to predict the probability an email is spam:

log_reg_exclaim <- logistic_reg() |>
  fit(spam ~ exclaim_mess, data = email)

tidy(log_reg_exclaim)

# A tibble: 2 × 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -2.27      0.0553     -41.1     0    
2 exclaim_mess  0.000272  0.000949     0.287   0.774

Add your estimates to the fitted equation below

\[\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) = -2.272 + 0.000272 \times exclaim\_mess\]

How does the code above differ from previous code we’ve used to fit regression models?

We used logistic_reg() instead of linear_reg().

Exercise 4

What is the probability the email is spam if it contains 10 exclamation points? Answer the question using the predict() function.

The predicted probability that the new email is spam is 0.0937.

exclaim_new <- tibble(exclaim_mess = 10)
exclaim_new

# A tibble: 1 × 1
  exclaim_mess
         <dbl>
1           10

predict(log_reg_exclaim, new_data = exclaim_new, type = "prob")

# A tibble: 1 × 2
  .pred_0 .pred_1
    <dbl>   <dbl>
1   0.906  0.0937

augment(log_reg_exclaim, new_data = exclaim_new)

# A tibble: 1 × 4
  .pred_class .pred_0 .pred_1 exclaim_mess
  <fct>         <dbl>   <dbl>        <dbl>
1 0             0.906  0.0937           10

A probability is nice, but we want an actual decision. Classify the darn email.

predict(log_reg_exclaim, new_data = exclaim_new, type = "class")

# A tibble: 1 × 1
  .pred_class
  <fct>      
1 0

Exercise 5

Fit a model with all variables in the dataset as predictors.

email_full_fit <- logistic_reg() |>
  fit(spam ~ ., data = email)

tidy(email_full_fit)

# A tibble: 22 × 5
   term         estimate std.error statistic  p.value
   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)  -9.09e+1   9.80e+3  -0.00928 9.93e- 1
 2 to_multiple1 -2.68e+0   3.27e-1  -8.21    2.25e-16
 3 from1        -2.19e+1   9.80e+3  -0.00224 9.98e- 1
 4 cc            1.88e-2   2.20e-2   0.855   3.93e- 1
 5 sent_email1  -2.07e+1   3.87e+2  -0.0536  9.57e- 1
 6 time          8.48e-8   2.85e-8   2.98    2.92e- 3
 7 image        -1.78e+0   5.95e-1  -3.00    2.73e- 3
 8 attach        7.35e-1   1.44e-1   5.09    3.61e- 7
 9 dollar       -6.85e-2   2.64e-2  -2.59    9.64e- 3
10 winneryes     2.07e+0   3.65e-1   5.67    1.41e- 8
# ℹ 12 more rows

If you used this model to classify the emails in the dataset, how would it do? Use the fitted model to classify each email in the dataset, and then calculate the classification error rates (TP, TN, FP, FN).

full_aug <- augment(email_full_fit, new_data = email)
full_aug

# A tibble: 3,921 × 24
   .pred_class .pred_0  .pred_1 spam  to_multiple from     cc sent_email
   <fct>         <dbl>    <dbl> <fct> <fct>       <fct> <int> <fct>     
 1 0             0.867 1.33e- 1 0     0           1         0 0         
 2 0             0.943 5.70e- 2 0     0           1         0 0         
 3 0             0.942 5.78e- 2 0     0           1         0 0         
 4 0             0.920 7.96e- 2 0     0           1         0 0         
 5 0             0.903 9.74e- 2 0     0           1         0 0         
 6 0             0.901 9.87e- 2 0     0           1         0 0         
 7 0             1.000 7.89e-12 0     1           1         0 1         
 8 0             1.000 1.24e-12 0     1           1         1 1         
 9 0             0.862 1.38e- 1 0     0           1         0 0         
10 0             0.922 7.76e- 2 0     0           1         0 0         
# ℹ 3,911 more rows
# ℹ 16 more variables: time <dttm>, image <dbl>, attach <dbl>, dollar <dbl>,
#   winner <fct>, inherit <dbl>, viagra <dbl>, password <dbl>, num_char <dbl>,
#   line_breaks <int>, format <fct>, re_subj <fct>, exclaim_subj <dbl>,
#   urgent_subj <fct>, exclaim_mess <dbl>, number <fct>

full_aug |>
  count(spam, .pred_class) |>
  group_by(spam) |>
  mutate(prop = n / sum(n),
         decision = case_when(
           spam == 0 & .pred_class == 0 ~ "true negative",
           spam == 1 & .pred_class == 1 ~ "true positive",
           spam == 1 & .pred_class == 0 ~ "false negative",
           spam == 0 & .pred_class == 1 ~ "false positive"
         ))

# A tibble: 4 × 5
# Groups:   spam [2]
  spam  .pred_class     n    prop decision      
  <fct> <fct>       <int>   <dbl> <chr>         
1 0     0            3521 0.991   true negative 
2 0     1              33 0.00929 false positive
3 1     0             299 0.815   false negative
4 1     1              68 0.185   true positive