AE 20: Final ‘Live Coding’ Exam Practice 🎬💳

Packages

Use only the following packages for this exam:

library(tidyverse)
library(tidymodels)

Part 1 - Movies

We will work with data from the Internet Movie Database (IMDB). Specifically, the data were a random sample of movies released between 1980 and 2020.

The variables and their descriptions in the raw movies dataset are as follows:

Variable	Description
`name`	name of the movie
`score`	IMDB user rating
`runtime`	duration of the movie
`genre`	main genre of the movie.
`rating`	rating of the movie (R, PG, etc.)
`release_country`	release country
`release_date`	release date (YYYY-MM-DD)
`budget`	the budget of a movie (some movies don’t have this, so it appears as 0)
`gross`	revenue of the movie
`votes`	number of user votes
`year`	year of release
`director`	the director
`writer`	writer of the movie
`star`	main actor/actress
`country`	country of origin
`company`	the production company

Question 1: Clean and recode

The data you’ll use for this question is in the data folder of your repository, and it’s called movies-raw.csv. The goal of the question is to “clean” this raw data and save a new version of it as movies-processed.csv. The following parts walk you through what you need to do to clean the data.

a. Read the dataset called movies-raw.csv and save it as movies_raw.

movies_raw <- read_csv("data/movies-raw.csv")

Rows: 500 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): name, runtime, genre, rating, release_country, release_date, direc...
dbl  (5): score, budget, gross, votes, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

b. In a single pipeline,

remove the character string " mins" from runtime and convert runtime to numeric,
recode the levels of release_country to "United States" and "Not United States", in that order,
recode the levels of genre to "Action", "Comedy", "Drama", "Horror", and "Other", in that order, and
save the resulting data frame as movies_processed.

movies_processed <- movies_raw |>
  mutate(
    runtime = as.numeric(str_remove(runtime, " mins")),
    release_country = if_else(release_country == "United States",
                              "United States",
                              "Not United States"),
    release_country = fct_relevel(release_country, "United States", "Not United States"),
    genre = if_else(!genre %in% c("Action", "Comedy", "Drama", "Horror"),
                     "Other",
                     genre),
    genre = fct_relevel(genre, "Action", "Comedy", "Drama", "Horror", "Other")
  )

c. In a single pipeline, calculate the cutoff for the top 20th percentile of movie scores and store it as top_20_cutoff as a single numeric value.

[!T IP]

You can check if you’ve done this correctly by typing top_20_cutoff in the Console and checking the output, it should look like the following:
> top_20_cutoff
80% 
7.2

top_20_cutoff <- movies_processed |>
  summarize(top_20_cutoff = quantile(score, 0.8)) |>
  pull(top_20_cutoff)

d. In a single pipeline, update movies_processed to include a column called percentile with 2 possible values: “Top 20th”, if the movie’s score is $\geq$ top_20_cutoff, and “Bottom 80th”, if the movie’s score is < top_20_cutoff. You should ensure that percentile is a factor variable coded such that “Bottom 80th” is the baseline level.

movies_processed <- movies_processed |>
  mutate(
    percentile = if_else(score >= top_20_cutoff, "Top 20th", "Bottom 80th"),
    percentile = fct_relevel(percentile, "Bottom 80th", "Top 20th")
  )

Regardless of your solution to the above question, run the chunk below to load a pre-saved movies_processed for use in subsequent questions. You will need to change the YAML setting to #| eval: true prior to rendering your .qmd file.

movies_processed <- read_csv("data/movies-processed.csv")

Question 2: Model

a. Suppose that a movie studio makes decisions on whether to produce a movie or not based on whether they think it will score in the top 20th percentile of IMDB scores. Help them build a model to aid in their decision making.

Split the data into training (75%) and testing (25%) subsets
Fit a model predicting whether the movie is in the Top 20th percentile of scores based on its runtime, genre, and the interaction of these two predictors.
Display a tidy output of the model.
Interpret the intercept in the context of these data with respect to $\widehat{p}$.

For an action movie with a runtime of 0 minutes, we predict the probability of that movie scoring in the top 20th percentile to be $\frac{e^{-11.41}}{1 + e^{-11.41}} \approx 1.11\times10^{-05}$, on average.

set.seed(847)
movies_split <- initial_split(movies_processed)
movies_train <- training(movies_split)
movies_test <- testing(movies_split)

percentile_fit <- logistic_reg() |>
  fit(as.factor(percentile) ~ runtime * genre, data = movies_train)

tidy(percentile_fit)

# A tibble: 10 × 5
   term                 estimate std.error statistic   p.value
   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)         -11.4        2.80      -4.08  0.0000450
 2 runtime               0.0836     0.0231     3.61  0.000304 
 3 genreComedy           1.41       3.85       0.368 0.713    
 4 genreDrama            8.00       3.06       2.61  0.00899  
 5 genreHorror          13.2       13.1        1.01  0.312    
 6 genreOther            4.83       3.36       1.44  0.150    
 7 runtime:genreComedy  -0.00457    0.0338    -0.135 0.893    
 8 runtime:genreDrama   -0.0591     0.0253    -2.33  0.0196   
 9 runtime:genreHorror  -0.133      0.148     -0.900 0.368    
10 runtime:genreOther   -0.0296     0.0284    -1.04  0.297

exp(-11.410900184) / (1 + exp(-11.410900184))

[1] 1.107399e-05

b. Based on your model from part (a), calculate the predicted values of percentile for movies in your testing dataset. Then, in a single pipeline, calculate the false positive and false negative rates for this model. Explicitly state in your narrative the false positive and false negative rates.

The false positive rate is ~4.3%, and the false negative rate is ~75%.

percentile_aug <- augment(percentile_fit, new_data = movies_test)
percentile_aug |>
  count(percentile, .pred_class) |>
  group_by(percentile) |>
  mutate(p = n / sum(n))

# A tibble: 4 × 4
# Groups:   percentile [2]
  percentile  .pred_class     n      p
  <fct>       <fct>       <int>  <dbl>
1 Bottom 80th Bottom 80th    89 0.957 
2 Bottom 80th Top 20th        4 0.0430
3 Top 20th    Bottom 80th    24 0.75  
4 Top 20th    Top 20th        8 0.25

Part 2 - Credit Cards

The data for the second part of the take-home exam is on credit card balances.

The variables in this dataset and their descriptions are as follows:

Variable	Description
`balance`	Credit card balance in $
`income`	Income in $1,000
`student_status`	Whether the individual is a student (`Student`) or not (`Not student`)
`marriage_status`	Whether the individual is a married (`Married`) or not (`Not married`)
`limit`	Credit limit

Assume that these data represent a random sample of American adults.

Question 3: Model compare

a. The dataset is in the data folder of your repository, and it’s called credit.csv. First, load the data with read_csv(), and assign it to an object called credit.

credit <- read_csv("data/credit.csv")

Rows: 347 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): student_status, marriage_status
dbl (3): balance, income, limit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

b. Fit a model predicting balance from all other variables in the dataset. This should be an additive model, i.e., use only main effects, no interaction effects. Display a tidy output of the model.

balance_fit <- linear_reg() |>
  fit(balance ~ ., data = credit)

tidy(balance_fit)

# A tibble: 5 × 5
  term                       estimate std.error statistic   p.value
  <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                -431.     14.2       -30.3   5.15e- 99
2 income                       -7.90    0.257     -30.7   2.29e-100
3 student_statusStudent       419.     18.5        22.7   3.21e- 70
4 marriage_statusNot married    3.66   11.7         0.313 7.54e-  1
5 limit                         0.268   0.00396    67.5   1.01e-199

c. Fit a model predicting balance from all other variables in the dataset, except for one of your choice. This should be an additive model, i.e., use only main effects, no interaction effects. Display a tidy output of the model. Then, write the fitted equation of the model using proper statistical notation.

balance_fit_2 <- linear_reg() |>
  fit(balance ~ income + student_status + limit, data = credit)

tidy(balance_fit_2)

# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)           -430.     13.3         -32.2 8.82e-106
2 income                  -7.90    0.257       -30.8 1.19e-100
3 student_statusStudent  420.     18.4          22.8 1.06e- 70
4 limit                    0.268   0.00396      67.6 2.71e-200

\[ \widehat{\text{balance}} = -429.51 -7.90 \times \text{income} + 419.68 \times \text {Student} + 0.27 \times \text{limit} \]

d. Determine which model – the one from part (a) or the one from part (b) – is the “better” model. Support your answer with an appropriate summary statistic.

Based on a comparison of adjusted $R^2$ values, we would proceed with the model from part (b), which is an additive model with income, student_status, and limit as predictors.

glance(balance_fit)$adj.r.squared

[1] 0.9482126

glance(balance_fit_2)$adj.r.squared

[1] 0.9483488

e. For the model you chose in part (d), interpret the intercept and one of the slopes in context of the data.

Intercept: For a non-student adult with a credit limit of $0 and an income of $0, we predict their credit card balance to be -$429.51, on average.

Slope (Student): All else equal, we predict the credit card balance of a student adult to be $419.68 larger than a non-student adult, on average.

Question 4: Infer

What is the average difference in income between married and not married Americans?

a. Fit a simple linear regression model that estimates the difference in mean income between these two groups. Make sure you display the tidy output. Then, write the corresponding population model using proper statistical notation.

income_diff_fit <- linear_reg() |>
  fit(income ~ marriage_status, data = credit)

tidy(income_diff_fit)

# A tibble: 2 × 5
  term                       estimate std.error statistic  p.value
  <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                   46.6       2.49    18.7   2.39e-54
2 marriage_statusNot married    -2.81      4.00    -0.704 4.82e- 1

\[ \text{income} = \beta_0 + \beta_1\times\text{Not married} + \epsilon \]

b. Compute a 95% bootstrap interval for the slope of the regression line for predicting income (income) from marital status (marriage_status). In your code, use 1,000 bootstrap samples when simulating your bootstrap distribution. Don’t forget to set a seed!

In your narrative, first report your point estimate; then, provide an interpretation of the 95% confidence interval you obtain for the slope in the context of these data.

The point estimate is -2.81. We are 95% confident that, for non married adult, we predict their income to be between $10.85 lower and $4.99 higher than a married adult, on average.

set.seed(847)

observed_fit <- credit |>
  specify(income ~ marriage_status) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term                       estimate
  <chr>                         <dbl>
1 intercept                     46.6 
2 marriage_statusNot married    -2.81

boot_fits <- credit |>
  specify(income ~ marriage_status) |>
  generate(reps = 500, type = "bootstrap") |>
  fit()

ci_95 <- get_confidence_interval(boot_fits,
                                 point_estimate = observed_fit,
                                 level = .95,
                                 type = "percentile")

ci_95

# A tibble: 2 × 3
  term                       lower_ci upper_ci
  <chr>                         <dbl>    <dbl>
1 intercept                      41.8    51.9 
2 marriage_statusNot married    -10.8     4.99

c. Based on your answer to part (b), what would you expect your conclusion to be for a test of the following hypotheses at the 5% discernibility level, and why?

\[ \begin{aligned} H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \end{aligned} \] Because the 95% confidence interval contains 0, we would expect to fail to reject the null hypothesis above at the 5% discernibility level.