AE 20: Final βLive Codingβ Exam Practice π¬π³
Suggested answers
Packages
Use only the following packages for this exam:
Part 1 - Movies
We will work with data from the Internet Movie Database (IMDB). Specifically, the data were a random sample of movies released between 1980 and 2020.
The variables and their descriptions in the raw movies dataset are as follows:
| Variable | Description |
|---|---|
name |
name of the movie |
score |
IMDB user rating |
runtime |
duration of the movie |
genre |
main genre of the movie. |
rating |
rating of the movie (R, PG, etc.) |
release_country |
release country |
release_date |
release date (YYYY-MM-DD) |
budget |
the budget of a movie (some movies donβt have this, so it appears as 0) |
gross |
revenue of the movie |
votes |
number of user votes |
year |
year of release |
director |
the director |
writer |
writer of the movie |
star |
main actor/actress |
country |
country of origin |
company |
the production company |
Question 1: Clean and recode
The data youβll use for this question is in the data folder of your repository, and itβs called movies-raw.csv. The goal of the question is to βcleanβ this raw data and save a new version of it as movies-processed.csv. The following parts walk you through what you need to do to clean the data.
a. Read the dataset called movies-raw.csv and save it as movies_raw.
movies_raw <- read_csv("data/movies-raw.csv")Rows: 500 Columns: 16
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (11): name, runtime, genre, rating, release_country, release_date, direc...
dbl (5): score, budget, gross, votes, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
b. In a single pipeline,
- remove the character string
" mins"fromruntimeand convertruntimeto numeric, - recode the levels of
release_countryto"United States"and"Not United States", in that order, - recode the levels of
genreto"Action","Comedy","Drama","Horror", and"Other", in that order, and - save the resulting data frame as
movies_processed.
movies_processed <- movies_raw |>
mutate(
runtime = as.numeric(str_remove(runtime, " mins")),
release_country = if_else(release_country == "United States",
"United States",
"Not United States"),
release_country = fct_relevel(release_country, "United States", "Not United States"),
genre = if_else(!genre %in% c("Action", "Comedy", "Drama", "Horror"),
"Other",
genre),
genre = fct_relevel(genre, "Action", "Comedy", "Drama", "Horror", "Other")
) c. In a single pipeline, calculate the cutoff for the top 20th percentile of movie scores and store it as top_20_cutoff as a single numeric value.
[!T IP]
You can check if youβve done this correctly by typing
top_20_cutoffin the Console and checking the output, it should look like the following:> top_20_cutoff 80% 7.2
d. In a single pipeline, update movies_processed to include a column called percentile with 2 possible values: βTop 20thβ, if the movieβs score is \(\geq\) top_20_cutoff, and βBottom 80thβ, if the movieβs score is < top_20_cutoff. You should ensure that percentile is a factor variable coded such that βBottom 80thβ is the baseline level.
movies_processed <- movies_processed |>
mutate(
percentile = if_else(score >= top_20_cutoff, "Top 20th", "Bottom 80th"),
percentile = fct_relevel(percentile, "Bottom 80th", "Top 20th")
)Regardless of your solution to the above question, run the chunk below to load a pre-saved movies_processed for use in subsequent questions. You will need to change the YAML setting to #| eval: true prior to rendering your .qmd file.
movies_processed <- read_csv("data/movies-processed.csv")Question 2: Model
a. Suppose that a movie studio makes decisions on whether to produce a movie or not based on whether they think it will score in the top 20th percentile of IMDB scores. Help them build a model to aid in their decision making.
Split the data into training (75%) and testing (25%) subsets
Fit a model predicting whether the movie is in the Top 20th percentile of scores based on its runtime, genre, and the interaction of these two predictors.
Display a tidy output of the model.
Interpret the intercept in the context of these data with respect to \(\widehat{p}\).
For an action movie with a runtime of 0 minutes, we predict the probability of that movie scoring in the top 20th percentile to be \(\frac{e^{-11.41}}{1 + e^{-11.41}} \approx 1.11\times10^{-05}\), on average.
set.seed(847)
movies_split <- initial_split(movies_processed)
movies_train <- training(movies_split)
movies_test <- testing(movies_split)
percentile_fit <- logistic_reg() |>
fit(as.factor(percentile) ~ runtime * genre, data = movies_train)
tidy(percentile_fit)# A tibble: 10 Γ 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -11.4 2.80 -4.08 0.0000450
2 runtime 0.0836 0.0231 3.61 0.000304
3 genreComedy 1.41 3.85 0.368 0.713
4 genreDrama 8.00 3.06 2.61 0.00899
5 genreHorror 13.2 13.1 1.01 0.312
6 genreOther 4.83 3.36 1.44 0.150
7 runtime:genreComedy -0.00457 0.0338 -0.135 0.893
8 runtime:genreDrama -0.0591 0.0253 -2.33 0.0196
9 runtime:genreHorror -0.133 0.148 -0.900 0.368
10 runtime:genreOther -0.0296 0.0284 -1.04 0.297
[1] 1.107399e-05
b. Based on your model from part (a), calculate the predicted values of percentile for movies in your testing dataset. Then, in a single pipeline, calculate the false positive and false negative rates for this model. Explicitly state in your narrative the false positive and false negative rates.
The false positive rate is ~4.3%, and the false negative rate is ~75%.
percentile_aug <- augment(percentile_fit, new_data = movies_test)
percentile_aug |>
count(percentile, .pred_class) |>
group_by(percentile) |>
mutate(p = n / sum(n))# A tibble: 4 Γ 4
# Groups: percentile [2]
percentile .pred_class n p
<fct> <fct> <int> <dbl>
1 Bottom 80th Bottom 80th 89 0.957
2 Bottom 80th Top 20th 4 0.0430
3 Top 20th Bottom 80th 24 0.75
4 Top 20th Top 20th 8 0.25
Part 2 - Credit Cards
The data for the second part of the take-home exam is on credit card balances.
The variables in this dataset and their descriptions are as follows:
| Variable | Description |
|---|---|
balance |
Credit card balance in $ |
income |
Income in $1,000 |
student_status |
Whether the individual is a student (Student) or not (Not student) |
marriage_status |
Whether the individual is a married (Married) or not (Not married) |
limit |
Credit limit |
Assume that these data represent a random sample of American adults.
Question 3: Model compare
a. The dataset is in the data folder of your repository, and itβs called credit.csv. First, load the data with read_csv(), and assign it to an object called credit.
credit <- read_csv("data/credit.csv")Rows: 347 Columns: 5
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (2): student_status, marriage_status
dbl (3): balance, income, limit
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
b. Fit a model predicting balance from all other variables in the dataset. This should be an additive model, i.e., use only main effects, no interaction effects. Display a tidy output of the model.
balance_fit <- linear_reg() |>
fit(balance ~ ., data = credit)
tidy(balance_fit)# A tibble: 5 Γ 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -431. 14.2 -30.3 5.15e- 99
2 income -7.90 0.257 -30.7 2.29e-100
3 student_statusStudent 419. 18.5 22.7 3.21e- 70
4 marriage_statusNot married 3.66 11.7 0.313 7.54e- 1
5 limit 0.268 0.00396 67.5 1.01e-199
c. Fit a model predicting balance from all other variables in the dataset, except for one of your choice. This should be an additive model, i.e., use only main effects, no interaction effects. Display a tidy output of the model. Then, write the fitted equation of the model using proper statistical notation.
balance_fit_2 <- linear_reg() |>
fit(balance ~ income + student_status + limit, data = credit)
tidy(balance_fit_2)# A tibble: 4 Γ 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -430. 13.3 -32.2 8.82e-106
2 income -7.90 0.257 -30.8 1.19e-100
3 student_statusStudent 420. 18.4 22.8 1.06e- 70
4 limit 0.268 0.00396 67.6 2.71e-200
\[ \widehat{\text{balance}} = -429.51 -7.90 \times \text{income} + 419.68 \times \text {Student} + 0.27 \times \text{limit} \]
d. Determine which model β the one from part (a) or the one from part (b) β is the βbetterβ model. Support your answer with an appropriate summary statistic.
Based on a comparison of adjusted \(R^2\) values, we would proceed with the model from part (b), which is an additive model with income, student_status, and limit as predictors.
glance(balance_fit)$adj.r.squared[1] 0.9482126
glance(balance_fit_2)$adj.r.squared[1] 0.9483488
e. For the model you chose in part (d), interpret the intercept and one of the slopes in context of the data.
Intercept: For a non-student adult with a credit limit of $0 and an income of $0, we predict their credit card balance to be -$429.51, on average.
Slope (Student): All else equal, we predict the credit card balance of a student adult to be $419.68 larger than a non-student adult, on average.
Question 4: Infer
What is the average difference in income between married and not married Americans?
a. Fit a simple linear regression model that estimates the difference in mean income between these two groups. Make sure you display the tidy output. Then, write the corresponding population model using proper statistical notation.
income_diff_fit <- linear_reg() |>
fit(income ~ marriage_status, data = credit)
tidy(income_diff_fit)# A tibble: 2 Γ 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 46.6 2.49 18.7 2.39e-54
2 marriage_statusNot married -2.81 4.00 -0.704 4.82e- 1
\[ \text{income} = \beta_0 + \beta_1\times\text{Not married} + \epsilon \]
b. Compute a 95% bootstrap interval for the slope of the regression line for predicting income (income) from marital status (marriage_status). In your code, use 1,000 bootstrap samples when simulating your bootstrap distribution. Donβt forget to set a seed!
In your narrative, first report your point estimate; then, provide an interpretation of the 95% confidence interval you obtain for the slope in the context of these data.
The point estimate is -2.81. We are 95% confident that, for non married adult, we predict their income to be between $10.85 lower and $4.99 higher than a married adult, on average.
set.seed(847)
observed_fit <- credit |>
specify(income ~ marriage_status) |>
fit()
observed_fit# A tibble: 2 Γ 2
term estimate
<chr> <dbl>
1 intercept 46.6
2 marriage_statusNot married -2.81
boot_fits <- credit |>
specify(income ~ marriage_status) |>
generate(reps = 500, type = "bootstrap") |>
fit()
ci_95 <- get_confidence_interval(boot_fits,
point_estimate = observed_fit,
level = .95,
type = "percentile")
ci_95# A tibble: 2 Γ 3
term lower_ci upper_ci
<chr> <dbl> <dbl>
1 intercept 41.8 51.9
2 marriage_statusNot married -10.8 4.99
c. Based on your answer to part (b), what would you expect your conclusion to be for a test of the following hypotheses at the 5% discernibility level, and why?
\[ \begin{aligned} H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \end{aligned} \] Because the 95% confidence interval contains 0, we would expect to fail to reject the null hypothesis above at the 5% discernibility level.