AE 16: Forest classification 🌳🌲
Suggested answers
In this application exercise, we will
- Split our data into testing and training
- Fit logistic regression regression models to testing data to classify outcomes
- Evaluate performance of models on testing data
We will use tidyverse and tidymodels for data exploration and modeling, respectively, and the forested package for the data.
Remember from the lecture that the forested dataset contains information on whether a plot is forested (Yes) or not (No) as well as numerical and categorical features of that plot.
glimpse(forested)Rows: 7,107
Columns: 19
$ forested <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ year <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005,…
$ elevation <dbl> 881, 113, 164, 299, 806, 736, 636, 224, 52, 2240, 104…
$ eastness <dbl> 90, -25, -84, 93, 47, -27, -48, -65, -62, -67, 96, -4…
$ northness <dbl> 43, 96, 53, 34, -88, -96, 87, -75, 78, -74, -26, 86, …
$ roughness <dbl> 63, 30, 13, 6, 35, 53, 3, 9, 42, 99, 51, 190, 95, 212…
$ tree_no_tree <fct> Tree, Tree, Tree, No tree, Tree, Tree, No tree, Tree,…
$ dew_temp <dbl> 0.04, 6.40, 6.06, 4.43, 1.06, 1.35, 1.42, 6.39, 6.50,…
$ precip_annual <dbl> 466, 1710, 1297, 2545, 609, 539, 702, 1195, 1312, 103…
$ temp_annual_mean <dbl> 6.42, 10.64, 10.07, 9.86, 7.72, 7.89, 7.61, 10.45, 10…
$ temp_annual_min <dbl> -8.32, 1.40, 0.19, -1.20, -5.98, -6.00, -5.76, 1.11, …
$ temp_annual_max <dbl> 12.91, 15.84, 14.42, 15.78, 13.84, 14.66, 14.23, 15.3…
$ temp_january_min <dbl> -0.08, 5.44, 5.72, 3.95, 1.60, 1.12, 0.99, 5.54, 6.20…
$ vapor_min <dbl> 78, 34, 49, 67, 114, 67, 67, 31, 60, 79, 172, 162, 70…
$ vapor_max <dbl> 1194, 938, 754, 1164, 1254, 1331, 1275, 944, 892, 549…
$ canopy_cover <dbl> 50, 79, 47, 42, 59, 36, 14, 27, 82, 12, 74, 66, 83, 6…
$ lon <dbl> -118.6865, -123.0825, -122.3468, -121.9144, -117.8841…
$ lat <dbl> 48.69537, 47.07991, 48.77132, 45.80776, 48.07396, 48.…
$ land_type <fct> Tree, Tree, Tree, Tree, Tree, Tree, Non-tree vegetati…
Spending your data
Split your data into testing and training in a reproducible manner and display the split object.
set.seed(253)
forest_split <- initial_split(forested)
forest_split<Training/Testing/Total>
<5330/1777/7107>
What percent of the original forested data is allocated to training and what percent to testing? Compare your response to your neighbor’s. Are the percentages roughly consistent? What determines this in the initial_split()? How would the code need to be updated to allocate 80% of the data to training and the remaining 20% to testing?
~75% of the data is allocated to training, by default.
5330/7107[1] 0.7499648
To allocate 80% to training, we need to add the argument prop = .8 to our initial_split() function.
forested_split_2 <- initial_split(forested, prop = .8)
forested_split_2<Training/Testing/Total>
<5685/1422/7107>
Let’s stick with the default split and save our testing and training data.
forest_train <- training(forest_split)
forest_test <- testing(forest_split)Exploratory data analysis
Create a visualization that explores the relationship between the outcome, one numerical predictor, and one categorical predictor. Then, describe, in a few sentences, what the visualization shows about the relationship between these variables.
Note: Pay attention to which dataset you use for your exploration.
ggplot(forest_train, aes(x = temp_annual_mean, fill = land_type)) +
geom_histogram(alpha = 0.7) +
facet_wrap(~ forested, ncol = 1)We observe that that the majority of plots classified as not forested are of land type “Barren” or “Non-tree vegetation”, not “Tree”. The distribution of temp_annual_mean is centered at a higher value for non-forested plots than for forested, and it’s more concentrated around it’s center as well.
Model 1: Custom choice of predictors
Fit
Fit a model for classifying plots as forested or not based on a subset of predictors of your choice. Name the model forested_custom_fit and display a tidy output of the model.
forested_custom_fit <- logistic_reg() |>
fit(forested ~ precip_annual + elevation + temp_annual_mean, data = forest_train)
tidy(forested_custom_fit)# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -6.31 0.663 -9.52 1.77e- 21
2 precip_annual -0.00170 0.0000664 -25.7 3.59e-145
3 elevation 0.00159 0.000257 6.20 5.52e- 10
4 temp_annual_mean 0.770 0.0559 13.8 4.01e- 43
Interpret
Interpret the intercept (with respect to \(\widehat{p}\)) in the context of these data. Then, interpret one slope coefficient of your choice (with respect to the odds), again in the context of these data.
Intercept:
For a plot of land with 0 annual precipitation, 0 elevation, and an average annual temperature 0, we predict the probability that the plot is not forested is 0.00181323.
All else equal, we expect that for a one unit increase in annual precipitation, the predicted odds that a plot of land is not forested will decrease by a multiplicative factor of 0.9982989, on average.
Predict
Predict for the testing data using this model.
forest_aug <- augment(forested_custom_fit, new_data = forest_test)
forest_aug# A tibble: 1,777 × 22
.pred_class .pred_Yes .pred_No forested year elevation eastness northness
<fct> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 Yes 0.624 0.376 No 2005 164 -84 53
2 Yes 0.986 0.0136 Yes 2005 2240 -67 -74
3 Yes 0.983 0.0171 Yes 2005 787 66 -74
4 Yes 0.997 0.00274 Yes 2005 1330 99 7
5 Yes 0.999 0.00147 Yes 2005 1423 46 88
6 Yes 0.994 0.00577 No 2005 1713 -66 75
7 Yes 0.983 0.0175 Yes 2014 546 -92 -38
8 Yes 0.998 0.00210 Yes 2014 1612 30 -95
9 No 0.0620 0.938 No 2014 235 0 -100
10 No 0.121 0.879 No 2014 302 -31 -94
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
# precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
# temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
# vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>
Evaluate
Calculate the error rates (and their complements) for the testing data using this model.
forest_aug |>
count(forested, .pred_class) |>
group_by(forested) |>
mutate(prop = n / sum(n),
decision = case_when(
forested == "Yes" & .pred_class == "Yes" ~ "True Positive",
forested == "Yes" & .pred_class == "No" ~ "False Negative",
forested == "No" & .pred_class == "Yes" ~ "False Positive",
forested == "No" & .pred_class == "No" ~ "True Negative"
))# A tibble: 4 × 5
# Groups: forested [2]
forested .pred_class n prop decision
<fct> <fct> <int> <dbl> <chr>
1 Yes Yes 821 0.828 True Positive
2 Yes No 170 0.172 False Negative
3 No Yes 107 0.136 False Positive
4 No No 679 0.864 True Negative
Another commonly used display of this information is a confusion matrix. Create this using the conf_mat() function. You will need to review the documentation for the function to determine how to use it.
conf_mat(forest_aug,
truth = forested,
estimate = .pred_class) Truth
Prediction Yes No
Yes 821 107
No 170 679
Sensitivity, specificity, ROC curve
Calculate sensitivity and specificity and draw the ROC curve.
forest_custom_roc <- roc_curve(forest_aug,
truth = forested,
.pred_Yes,
event_level = "first")
forest_custom_roc# A tibble: 1,779 × 3
.threshold specificity sensitivity
<dbl> <dbl> <dbl>
1 -Inf 0 1
2 0.0397 0 1
3 0.0422 0.00127 1
4 0.0452 0.00254 1
5 0.0486 0.00382 1
6 0.0487 0.00509 1
7 0.0490 0.00636 1
8 0.0492 0.00763 1
9 0.0493 0.00891 1
10 0.0502 0.0102 1
# ℹ 1,769 more rows
ggplot(forest_custom_roc, aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = "dashed")Model 2: All predictors
Fit
Fit a model for classifying plots as forested or not based on all predictors available. Name the model forested_full_fit and display a tidy output of the model.
forested_full_fit <- logistic_reg() |>
fit(forested ~ ., data = forest_train)
tidy(forested_full_fit)# A tibble: 20 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -38.7 31.9 -1.22 2.24e- 1
2 year 0.0160 0.0149 1.07 2.85e- 1
3 elevation -0.00195 0.000647 -3.02 2.57e- 3
4 eastness -0.000166 0.000724 -0.230 8.18e- 1
5 northness 0.00223 0.000745 2.99 2.79e- 3
6 roughness -0.00518 0.00148 -3.50 4.57e- 4
7 tree_no_treeNo tree 1.12 0.135 8.27 1.31e-16
8 dew_temp -0.102 0.177 -0.578 5.63e- 1
9 precip_annual -0.000224 0.000105 -2.13 3.31e- 2
10 temp_annual_mean -16.8 12.3 -1.36 1.73e- 1
11 temp_annual_min 0.810 0.104 7.81 5.51e-15
12 temp_annual_max 7.45 6.16 1.21 2.26e- 1
13 temp_january_min 8.14 6.15 1.32 1.86e- 1
14 vapor_min -0.000437 0.00354 -0.124 9.02e- 1
15 vapor_max 0.00876 0.00128 6.86 6.79e-12
16 canopy_cover -0.0431 0.00369 -11.7 1.51e-31
17 lon -0.0948 0.0550 -1.72 8.46e- 2
18 lat 0.119 0.110 1.08 2.80e- 1
19 land_typeNon-tree vegetation -0.601 0.267 -2.25 2.45e- 2
20 land_typeTree -1.53 0.283 -5.39 7.07e- 8
Predict
Predict for the testing data using this model.
forest_full_aug <- augment(forested_full_fit, new_data = forest_test)
forest_full_aug# A tibble: 1,777 × 22
.pred_class .pred_Yes .pred_No forested year elevation eastness northness
<fct> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 Yes 0.916 0.0841 No 2005 164 -84 53
2 Yes 0.834 0.166 Yes 2005 2240 -67 -74
3 Yes 0.994 0.00578 Yes 2005 787 66 -74
4 Yes 0.985 0.0152 Yes 2005 1330 99 7
5 Yes 0.985 0.0153 Yes 2005 1423 46 88
6 Yes 0.737 0.263 No 2005 1713 -66 75
7 Yes 0.986 0.0143 Yes 2014 546 -92 -38
8 Yes 0.982 0.0177 Yes 2014 1612 30 -95
9 No 0.0607 0.939 No 2014 235 0 -100
10 No 0.0226 0.977 No 2014 302 -31 -94
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
# precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
# temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
# vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>
Evaluate
Calculate the error rates (and their complements) for the testing data using this model.
forest_full_aug |>
count(forested, .pred_class) |>
group_by(forested) |>
mutate(prop = n / sum(n),
decision = case_when(
forested == "Yes" & .pred_class == "Yes" ~ "True Positive",
forested == "Yes" & .pred_class == "No" ~ "False Negative",
forested == "No" & .pred_class == "Yes" ~ "False Positive",
forested == "No" & .pred_class == "No" ~ "True Negative"
))# A tibble: 4 × 5
# Groups: forested [2]
forested .pred_class n prop decision
<fct> <fct> <int> <dbl> <chr>
1 Yes Yes 910 0.918 True Positive
2 Yes No 81 0.0817 False Negative
3 No Yes 82 0.104 False Positive
4 No No 704 0.896 True Negative
Sensitivity, specificity, ROC curve
Calculate sensitivity and specificity and draw the ROC curve.
forest_full_roc <- roc_curve(forest_full_aug,
truth = forested,
.pred_Yes,
event_level = "first")
ggplot(forest_full_roc, aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = "dashed")Model 1 vs. Model 2
Plot both ROC curves and articulate how you would use them to compare these models.
forest_custom_roc <- forest_custom_roc |>
mutate(model = "custom")
forest_custom_roc# A tibble: 1,779 × 4
.threshold specificity sensitivity model
<dbl> <dbl> <dbl> <chr>
1 -Inf 0 1 custom
2 0.0397 0 1 custom
3 0.0422 0.00127 1 custom
4 0.0452 0.00254 1 custom
5 0.0486 0.00382 1 custom
6 0.0487 0.00509 1 custom
7 0.0490 0.00636 1 custom
8 0.0492 0.00763 1 custom
9 0.0493 0.00891 1 custom
10 0.0502 0.0102 1 custom
# ℹ 1,769 more rows
forest_full_roc <- forest_full_roc |>
mutate(model = "full")
forest_roc <- rbind(forest_custom_roc, forest_full_roc)
forest_roc# A tibble: 3,558 × 4
.threshold specificity sensitivity model
<dbl> <dbl> <dbl> <chr>
1 -Inf 0 1 custom
2 0.0397 0 1 custom
3 0.0422 0.00127 1 custom
4 0.0452 0.00254 1 custom
5 0.0486 0.00382 1 custom
6 0.0487 0.00509 1 custom
7 0.0490 0.00636 1 custom
8 0.0492 0.00763 1 custom
9 0.0493 0.00891 1 custom
10 0.0502 0.0102 1 custom
# ℹ 3,548 more rows
ggplot(forest_roc, aes(x = 1 - specificity, y = sensitivity, color = model)) +
geom_path() +
geom_abline(lty = "dashed")We would choose to favor the model with the larger AUC; in this case, this is the full model.
forest_full_auc <- roc_auc(forest_full_aug,
truth = forested,
.pred_Yes,
event_level = "first")
forest_full_auc# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.967
forest_custom_auc <- roc_auc(forest_aug,
truth = forested,
.pred_Yes,
event_level = "first")
forest_custom_auc# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.895