AE 16: Forest classification 🌳🌲

Spending your data

Split your data into testing and training in a reproducible manner and display the split object.

set.seed(253)

forest_split <- initial_split(forested)
forest_split

<Training/Testing/Total>
<5330/1777/7107>

What percent of the original forested data is allocated to training and what percent to testing? Compare your response to your neighbor’s. Are the percentages roughly consistent? What determines this in the initial_split()? How would the code need to be updated to allocate 80% of the data to training and the remaining 20% to testing?

~75% of the data is allocated to training, by default.

5330/7107

[1] 0.7499648

To allocate 80% to training, we need to add the argument prop = .8 to our initial_split() function.

forested_split_2 <- initial_split(forested, prop = .8)
forested_split_2

<Training/Testing/Total>
<5685/1422/7107>

Let’s stick with the default split and save our testing and training data.

forest_train <- training(forest_split)
forest_test <- testing(forest_split)

Exploratory data analysis

Create a visualization that explores the relationship between the outcome, one numerical predictor, and one categorical predictor. Then, describe, in a few sentences, what the visualization shows about the relationship between these variables.

Note: Pay attention to which dataset you use for your exploration.

ggplot(forest_train, aes(x = temp_annual_mean, fill = land_type)) +
  geom_histogram(alpha = 0.7) +
  facet_wrap(~ forested, ncol = 1)

We observe that that the majority of plots classified as not forested are of land type “Barren” or “Non-tree vegetation”, not “Tree”. The distribution of temp_annual_mean is centered at a higher value for non-forested plots than for forested, and it’s more concentrated around it’s center as well.

Model 1: Custom choice of predictors

Fit

Fit a model for classifying plots as forested or not based on a subset of predictors of your choice. Name the model forested_custom_fit and display a tidy output of the model.

forested_custom_fit <- logistic_reg() |>
  fit(forested ~ precip_annual + elevation + temp_annual_mean, data = forest_train)

tidy(forested_custom_fit)

# A tibble: 4 × 5
  term             estimate std.error statistic   p.value
  <chr>               <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      -6.31    0.663         -9.52 1.77e- 21
2 precip_annual    -0.00170 0.0000664    -25.7  3.59e-145
3 elevation         0.00159 0.000257       6.20 5.52e- 10
4 temp_annual_mean  0.770   0.0559        13.8  4.01e- 43

Interpret

Interpret the intercept (with respect to \(\widehat{p}\)) in the context of these data. Then, interpret one slope coefficient of your choice (with respect to the odds), again in the context of these data.

Intercept:

For a plot of land with 0 annual precipitation, 0 elevation, and an average annual temperature 0, we predict the probability that the plot is not forested is 0.00181323.

All else equal, we expect that for a one unit increase in annual precipitation, the predicted odds that a plot of land is not forested will decrease by a multiplicative factor of 0.9982989, on average.

exp(-6.310830752) / (1 + exp(-6.310830752   ))

[1] 0.00181323

levels(forest_train$forested)

[1] "Yes" "No"

exp(-0.001702569)

[1] 0.9982989

Predict

Predict for the testing data using this model.

forest_aug <- augment(forested_custom_fit, new_data = forest_test)

forest_aug

# A tibble: 1,777 × 22
   .pred_class .pred_Yes .pred_No forested  year elevation eastness northness
   <fct>           <dbl>    <dbl> <fct>    <dbl>     <dbl>    <dbl>     <dbl>
 1 Yes            0.624   0.376   No        2005       164      -84        53
 2 Yes            0.986   0.0136  Yes       2005      2240      -67       -74
 3 Yes            0.983   0.0171  Yes       2005       787       66       -74
 4 Yes            0.997   0.00274 Yes       2005      1330       99         7
 5 Yes            0.999   0.00147 Yes       2005      1423       46        88
 6 Yes            0.994   0.00577 No        2005      1713      -66        75
 7 Yes            0.983   0.0175  Yes       2014       546      -92       -38
 8 Yes            0.998   0.00210 Yes       2014      1612       30       -95
 9 No             0.0620  0.938   No        2014       235        0      -100
10 No             0.121   0.879   No        2014       302      -31       -94
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
#   precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
#   temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
#   vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>

Evaluate

Calculate the error rates (and their complements) for the testing data using this model.

forest_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(prop = n / sum(n),
         decision = case_when(
           forested == "Yes" & .pred_class == "Yes" ~ "True Positive",
           forested == "Yes" & .pred_class == "No" ~ "False Negative",
           forested == "No" & .pred_class == "Yes" ~ "False Positive",
           forested == "No" & .pred_class == "No" ~ "True Negative"
         ))

# A tibble: 4 × 5
# Groups:   forested [2]
  forested .pred_class     n  prop decision      
  <fct>    <fct>       <int> <dbl> <chr>         
1 Yes      Yes           821 0.828 True Positive 
2 Yes      No            170 0.172 False Negative
3 No       Yes           107 0.136 False Positive
4 No       No            679 0.864 True Negative

Another commonly used display of this information is a confusion matrix. Create this using the conf_mat() function. You will need to review the documentation for the function to determine how to use it.

conf_mat(forest_aug,
         truth = forested,
         estimate = .pred_class)

          Truth
Prediction Yes  No
       Yes 821 107
       No  170 679

Sensitivity, specificity, ROC curve

Calculate sensitivity and specificity and draw the ROC curve.

forest_custom_roc <- roc_curve(forest_aug,
                               truth = forested,
                               .pred_Yes,
                               event_level = "first")

forest_custom_roc

# A tibble: 1,779 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1  -Inf          0                 1
 2     0.0397     0                 1
 3     0.0422     0.00127           1
 4     0.0452     0.00254           1
 5     0.0486     0.00382           1
 6     0.0487     0.00509           1
 7     0.0490     0.00636           1
 8     0.0492     0.00763           1
 9     0.0493     0.00891           1
10     0.0502     0.0102            1
# ℹ 1,769 more rows

ggplot(forest_custom_roc, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = "dashed")

Model 2: All predictors

Fit

Fit a model for classifying plots as forested or not based on all predictors available. Name the model forested_full_fit and display a tidy output of the model.

forested_full_fit <- logistic_reg() |>
  fit(forested ~ ., data = forest_train)

tidy(forested_full_fit)

# A tibble: 20 × 5
   term                           estimate std.error statistic  p.value
   <chr>                             <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                  -38.7      31.9         -1.22  2.24e- 1
 2 year                           0.0160    0.0149       1.07  2.85e- 1
 3 elevation                     -0.00195   0.000647    -3.02  2.57e- 3
 4 eastness                      -0.000166  0.000724    -0.230 8.18e- 1
 5 northness                      0.00223   0.000745     2.99  2.79e- 3
 6 roughness                     -0.00518   0.00148     -3.50  4.57e- 4
 7 tree_no_treeNo tree            1.12      0.135        8.27  1.31e-16
 8 dew_temp                      -0.102     0.177       -0.578 5.63e- 1
 9 precip_annual                 -0.000224  0.000105    -2.13  3.31e- 2
10 temp_annual_mean             -16.8      12.3         -1.36  1.73e- 1
11 temp_annual_min                0.810     0.104        7.81  5.51e-15
12 temp_annual_max                7.45      6.16         1.21  2.26e- 1
13 temp_january_min               8.14      6.15         1.32  1.86e- 1
14 vapor_min                     -0.000437  0.00354     -0.124 9.02e- 1
15 vapor_max                      0.00876   0.00128      6.86  6.79e-12
16 canopy_cover                  -0.0431    0.00369    -11.7   1.51e-31
17 lon                           -0.0948    0.0550      -1.72  8.46e- 2
18 lat                            0.119     0.110        1.08  2.80e- 1
19 land_typeNon-tree vegetation  -0.601     0.267       -2.25  2.45e- 2
20 land_typeTree                 -1.53      0.283       -5.39  7.07e- 8

Predict

Predict for the testing data using this model.

forest_full_aug <- augment(forested_full_fit, new_data = forest_test)
forest_full_aug

# A tibble: 1,777 × 22
   .pred_class .pred_Yes .pred_No forested  year elevation eastness northness
   <fct>           <dbl>    <dbl> <fct>    <dbl>     <dbl>    <dbl>     <dbl>
 1 Yes            0.916   0.0841  No        2005       164      -84        53
 2 Yes            0.834   0.166   Yes       2005      2240      -67       -74
 3 Yes            0.994   0.00578 Yes       2005       787       66       -74
 4 Yes            0.985   0.0152  Yes       2005      1330       99         7
 5 Yes            0.985   0.0153  Yes       2005      1423       46        88
 6 Yes            0.737   0.263   No        2005      1713      -66        75
 7 Yes            0.986   0.0143  Yes       2014       546      -92       -38
 8 Yes            0.982   0.0177  Yes       2014      1612       30       -95
 9 No             0.0607  0.939   No        2014       235        0      -100
10 No             0.0226  0.977   No        2014       302      -31       -94
# ℹ 1,767 more rows
# ℹ 14 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
#   precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
#   temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
#   vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>

Evaluate

Calculate the error rates (and their complements) for the testing data using this model.

forest_full_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(prop = n / sum(n),
         decision = case_when(
           forested == "Yes" & .pred_class == "Yes" ~ "True Positive",
           forested == "Yes" & .pred_class == "No" ~ "False Negative",
           forested == "No" & .pred_class == "Yes" ~ "False Positive",
           forested == "No" & .pred_class == "No" ~ "True Negative"
         ))

# A tibble: 4 × 5
# Groups:   forested [2]
  forested .pred_class     n   prop decision      
  <fct>    <fct>       <int>  <dbl> <chr>         
1 Yes      Yes           910 0.918  True Positive 
2 Yes      No             81 0.0817 False Negative
3 No       Yes            82 0.104  False Positive
4 No       No            704 0.896  True Negative

Sensitivity, specificity, ROC curve

Calculate sensitivity and specificity and draw the ROC curve.

forest_full_roc <- roc_curve(forest_full_aug,
                               truth = forested,
                               .pred_Yes,
                               event_level = "first")

ggplot(forest_full_roc, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = "dashed")

Model 1 vs. Model 2

Plot both ROC curves and articulate how you would use them to compare these models.

forest_custom_roc <- forest_custom_roc |>
  mutate(model = "custom")
forest_custom_roc

# A tibble: 1,779 × 4
   .threshold specificity sensitivity model 
        <dbl>       <dbl>       <dbl> <chr> 
 1  -Inf          0                 1 custom
 2     0.0397     0                 1 custom
 3     0.0422     0.00127           1 custom
 4     0.0452     0.00254           1 custom
 5     0.0486     0.00382           1 custom
 6     0.0487     0.00509           1 custom
 7     0.0490     0.00636           1 custom
 8     0.0492     0.00763           1 custom
 9     0.0493     0.00891           1 custom
10     0.0502     0.0102            1 custom
# ℹ 1,769 more rows

forest_full_roc <- forest_full_roc |>
  mutate(model = "full")

forest_roc <- rbind(forest_custom_roc, forest_full_roc)
forest_roc

# A tibble: 3,558 × 4
   .threshold specificity sensitivity model 
        <dbl>       <dbl>       <dbl> <chr> 
 1  -Inf          0                 1 custom
 2     0.0397     0                 1 custom
 3     0.0422     0.00127           1 custom
 4     0.0452     0.00254           1 custom
 5     0.0486     0.00382           1 custom
 6     0.0487     0.00509           1 custom
 7     0.0490     0.00636           1 custom
 8     0.0492     0.00763           1 custom
 9     0.0493     0.00891           1 custom
10     0.0502     0.0102            1 custom
# ℹ 3,548 more rows

ggplot(forest_roc, aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path() +
  geom_abline(lty = "dashed")

We would choose to favor the model with the larger AUC; in this case, this is the full model.

forest_full_auc <- roc_auc(forest_full_aug,
                               truth = forested,
                               .pred_Yes,
                               event_level = "first")
forest_full_auc

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.967

forest_custom_auc <- roc_auc(forest_aug,
                               truth = forested,
                               .pred_Yes,
                               event_level = "first")
forest_custom_auc

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.895