AE 13: Modeling penguins

Suggested answers

In this application exercise we will be studying penguins. The data can be found in the palmerpenguins package and we will use tidyverse and tidymodels for data exploration and modeling, respectively.

Please read the following context and take a glimpse at the data set before we get started.

This data set comprising various measurements of three different penguin species, namely Adelie, Gentoo, and Chinstrap. The rigorous study was conducted in the islands of the Palmer Archipelago, Antarctica. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data set is called penguins.

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Our goal is to understand better how various body measurements and attributes of penguins relate to their body mass. First, we are going to investigate the relationship between a penguins’ flipper lengths and their body masses.

body_mass_g

penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

Correlation

  • Your turn:

    • What is correlation? What values can correlation take?

    A measure of the strength and direction of the linear relationship between the predictor(s) and outcome. It can be a number between -1 and 1, inclusive.

  • Demo: What is the correlation between flipper length and body mass of penguins?

penguins |>
  summarize(r = cor(flipper_length_mm, body_mass_g, use = "complete.obs"))
# A tibble: 1 × 1
      r
  <dbl>
1 0.871
  • Demo: What is the coefficient of determination?
0.8712018^2 
[1] 0.7589926
  • Demo: What is the slope of the regression line?

The estimated slope, as computed by hand, is ~ 49.7

penguins |>
  summarize(
    r = cor(flipper_length_mm, body_mass_g, use = "complete.obs"),
    sy = sd(body_mass_g, na.rm = TRUE),
    sx = sd(flipper_length_mm, na.rm = TRUE),
    b1 = r * sy / sx
  )
# A tibble: 1 × 4
      r    sy    sx    b1
  <dbl> <dbl> <dbl> <dbl>
1 0.871  802.  14.1  49.7

Defining, fitting, and summarizing a model

  • Demo: Write the population model below that explains the relationship between body mass and flipper length.

\[ \text{body mass (g)} = \beta_0 + \beta_1 \times \text{flipper length (mm)} + \epsilon \]

  • Demo: Fit the linear regression model and display the results. Write the fitted model below.
penguins_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

tidy(penguins_fit)
# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

\[ \widehat{\text{body mass(g)}} = -5780.83136 + 49.68557 \times \text{flipper length (mm)} \]

  • Your turn: Interpret the slope and the intercept in the context of the data.

    • Intercept: For a penguin with a flipper length of 0mm, we expect it to have a body mass of -5780.83g, on average. This is not meaningful in the context of these data because a penguin with 0mm flipper length doesn’t exist.

    • Slopes: For every 1 mm increase in flipper length, we expect a penguin’s body mass to inc. by 49.69 g, on average.

  • Your turn: Recreate the visualization from above, this time adding a regression line to the visualization geom_smooth(method = "lm").

penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", color = "cornflowerblue", se = FALSE)

  • Demo: What is the coefficient of determination (\(R^2\))?

75.9% of the variation in body mass (g) is explainable by the variation in flipper length (mm)

penguins_fit |>
  glance() |>
  select(r.squared)
# A tibble: 1 × 1
  r.squared
      <dbl>
1     0.759
  • Demo: What is the estimated body mass for a penguin with a flipper length of 210?
  • Your turn: What is the estimated body mass for a penguin with a flipper length of 100?
est_bm <- tibble(flipper_length_mm = c(210, 100))

penguins_fit |>
  predict(new_data = est_bm)
# A tibble: 2 × 1
  .pred
  <dbl>
1 4653.
2 -812.
## alternative method

penguins_fit |>
  augment(new_data = est_bm)
# A tibble: 2 × 2
  .pred flipper_length_mm
  <dbl>             <dbl>
1 4653.               210
2 -812.               100

Another model

  • Demo: A different researcher wants to look at body weight of penguins based on the island they were recorded on. How are the variables involved in this analysis different?

Island is a categorical variable while flipper length was numeric

Add response here

  • Demo: Make an appropriate visualization to investigate this relationship below. Additionally, calculate the mean body mass by island.
penguins |>
  ggplot(aes(x = body_mass_g)) +
  geom_histogram() +
  facet_wrap(~ island) +
  labs(x = "Body Mass (g)",
       y = "Count",
       title = "Distribution of Penguin Body Mass by Island")

penguins |>
  group_by(island) |>
  summarize(mean_bmg = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  island    mean_bmg
  <fct>        <dbl>
1 Biscoe       4716.
2 Dream        3713.
3 Torgersen    3706.
  • Demo: Change the geom of your previous plot to geom_point(). Use this plot to think about how R models these data.
penguins |>
  ggplot(aes(x = island, y = body_mass_g)) +
  geom_point()

  • Your turn: Fit the linear regression model and display the results. Write the fitted model below.
penguins_fit_2 <- linear_reg() |>
  fit(body_mass_g ~ island, data = penguins)

tidy(penguins_fit_2)
# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

\[ \widehat{\text{body mass (g)}} = 4716.018 -1003.115 \times \text{Dream} - 1009.645 \times \text{Torgerson} \]

  • Demo: Interpret each coefficient in context of the problem.

    • Intercept: A penguin from Biscoe island is predicted to have a body mass of 4,716.02g, on average.

    • Slopes: A penguin from Dream island is predicted to have a body mass that is 1,003.12g lower than that of a penguin from Biscoe island, on average; a penguin from Torgerson island is predicted to have a body mass that is 1,009.65g lower than that of a penguin from Biscoe island, on average.

  • Demo: What is the estimated body weight of a penguin on Biscoe island? What are the estimated body weights of penguins on Dream and Torgersen islands?

three_penguins <- tibble(
  island = c("Biscoe", "Dream", "Torgersen")
)
penguins_fit_2 |>
  predict(new_data = three_penguins)
# A tibble: 3 × 1
  .pred
  <dbl>
1 4716.
2 3713.
3 3706.