AE 12: Modeling fish 🐟
library(tidyverse)
library(tidymodels)
fish <- read_csv("data/fish.csv")
For this application exercise, we will work with data on fish. The dataset, called fish, contains measurements of two common fish species sold at fish markets.
The data dictionary is below:
| variable | description |
|---|---|
| species | Species name of fish |
| weight | Weight, in grams |
| length_vertical | Vertical length, in cm |
| length_diagonal | Diagonal length, in cm |
| length_cross | Cross length, in cm |
| height | Height, in cm |
| width | Diagonal width, in cm |
Visualizing the model
We’re going to investigate the relationship between the weights and heights of fish.
- Create an appropriate plot to investigate this relationship. Add appropriate labels to the plot.
fish |>
ggplot(aes(x = height, y = weight)) +
geom_point() +
labs(x = "Height (cm)",
y = "Weight (g)",
title = "Fish Weight vs. Height")
- If you were to draw a straight line to best represent the relationship between the heights and weights of fish, where would it go? Why?
The line would have a positive slope, because taller fish tend to weigh more, and a y-intercept below 0.
- Now, let R draw the line for you.
fish |>
ggplot(aes(x = height, y = weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "deeppink") +
labs(x = "Height (cm)",
y = "Weight (g)",
title = "Fish Weight vs. Height")
- What types of questions can this plot help answer?
We can now describe the relationship between fish height and weight, and we could also make predictions of a fish’s weight given its height.
- We can use this line to make predictions. Predict what you think the weight of a fish would be with a height of 10 cm, 15 cm, and 20 cm. Which prediction is considered extrapolation?
10 cm: between 310 and 325 g
15 cm: 625 g
20 cm: 920 g. This prediction is considered extrapolation because we don’t observe any fish with heights > 19.5 cm in our dataset.
- What is a residual?
Observed - predicted; in statistical notation, this is \(y - \widehat{y}\)
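To make the definition concrete, here is a minimal sketch that computes residuals by hand. The `toy` data frame is made up for illustration and is not taken from the fish dataset. The sketch uses base R's `lm()` only so that it runs without extra packages.

```r
# Hypothetical toy data (NOT from the fish dataset), for illustration only
toy <- data.frame(
  height = c(6, 8, 10, 12, 14),
  weight = c(80, 200, 330, 440, 570)
)

# Fit a least-squares line with base R
toy_fit <- lm(weight ~ height, data = toy)

# Residual = observed - predicted
toy$predicted <- predict(toy_fit)
toy$residual <- toy$weight - toy$predicted
toy

# For a least-squares line with an intercept, residuals sum to zero (up to rounding)
sum(toy$residual)
```

A positive residual means the observed fish is heavier than the line predicts; a negative residual means it is lighter.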
Model fitting
- Demo: Fit a model to predict fish weights from their heights.
fish_fit <- linear_reg() |>
fit(weight ~ height, data = fish)
tidy(fish_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -288. 34.0 -8.49 1.83e-11
2 height 60.9 2.64 23.1 2.40e-29
- Predict what the weight of a fish would be with a height of 10 cm, 15 cm, and 20 cm using this model.
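# Plug each height into the fitted equation: weight = -288.415 + 60.916 * height
y_10 <- -288.415 + (60.916 * 10)
y_15 <- -288.415 + (60.916 * 15)
y_20 <- -288.415 + (60.916 * 20)
# Named `predictions` to avoid masking the built-in predict() function
predictions <- c(y_10, y_15, y_20)
predictions
[1] 320.745 625.325 929.905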
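The same arithmetic can be done in one vectorized step. The sketch below reuses the rounded coefficients from the `tidy()` output above. The names `intercept`, `heights`, `predicted`, and `max_observed_height` are introduced here for illustration, and the 19.5 cm cutoff is taken from the earlier answer rather than computed from the data.

```r
# Coefficients copied from the tidy() output above (rounded to three decimals)
intercept <- -288.415
slope <- 60.916

heights <- c(10, 15, 20)

# Vectorized: one prediction per height
predicted <- intercept + slope * heights
predicted

# Assumption taken from the earlier answer: no observed fish is taller than about 19.5 cm
max_observed_height <- 19.5
is_extrapolation <- heights > max_observed_height
data.frame(height = heights, predicted = predicted, is_extrapolation)

# With the fitted model available, this call should give nearly the same
# predictions (small differences come from rounding the coefficients):
# predict(fish_fit, new_data = tibble(height = heights))
```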
- Demo: Calculate predicted weights for all fish in the data and visualize the residuals under this model.
fish_augment <- augment(fish_fit, new_data = fish)
ggplot(fish_augment, aes(x = height, y = weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "deeppink") +
geom_segment(aes(xend = height, yend = .pred), color = "grey") +
geom_point(aes(y = .pred), shape = "circle open") +
theme_minimal() +
labs(x = "Height (cm)",
y = "Weight (g)",
title = "Fish Weight vs. Height",
subtitle = "Residuals overlaid")
Model summary
- Demo: Display the model summary including estimates for the slope and intercept along with measurements of uncertainty around them. Show how you can extract these values from the model output.
tidy_fit <- tidy(fish_fit)
# The first row holds the intercept, the second row holds the slope
int <- tidy_fit$estimate[1]
int
[1] -288.4152
slope <- tidy_fit$estimate[2]
slope
[1] 60.91587
- Demo: Write out your model using mathematical notation.
\(\widehat{\text{weight}} = -288.415 + 60.916 \times \text{height}\) - Fitted model
\(\text{weight} = \beta_0 + \beta_1 \times \text{height} + \epsilon\) - True population model
Correlation
We can also assess correlation between two quantitative variables.
- What is correlation? What values can correlation take?
Correlation is a measure of the strength and direction of the linear relationship between two variables; it is a number between -1 and 1, inclusive.
- Demo: What is the correlation between heights and weights of fish?
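One way to compute a correlation is with `cor()`. The sketch below uses made-up toy vectors, not the fish data, so it runs on its own. The commented line shows how the same call might look for the real dataset, assuming `fish` has been loaded as above.

```r
# Hypothetical toy data (NOT from the fish dataset), for illustration only
height <- c(6, 8, 10, 12, 14)
weight <- c(80, 200, 330, 440, 570)

# Pearson correlation: always between -1 and 1, inclusive
r <- cor(height, weight)
r

# Correlation is symmetric: swapping the variables gives the same value
cor(weight, height)

# For the real data, something like this should work, assuming `fish` is loaded:
# fish |> summarize(r = cor(height, weight))
```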
Adding a third variable
- Demo: Does the relationship between heights and weights of fish change if we take into consideration species? Plot two separate straight lines for the Bream and Roach species.
fish |>
ggplot(aes(x = height, y = weight, color = species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Height (cm)",
y = "Weight (g)",
title = "Fish Weight vs. Height")
Fitting other models
- Demo: We can fit more models than just a straight line. Change the code below to read method = "loess". What is different from the plot created before?
ggplot(fish,
aes(x = height, y = weight)) +
geom_point() +
geom_smooth(method = "loess") +
labs(
title = "Weights vs. heights of fish",
x = "Height (cm)",
y = "Weight (g)"
)
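To see why a loess curve can differ from a straight line, here is a minimal sketch that fits both to deliberately curved toy data and compares how far each fit is from the observations. The data are made up for illustration, and the sketch uses base R's `lm()` and `loess()` with default settings rather than `geom_smooth()`.

```r
# Hypothetical curved toy data (NOT from the fish dataset), for illustration only
toy <- data.frame(height = 1:20)
toy$weight <- toy$height^2

# A straight line versus a locally fitted smooth curve
line_fit <- lm(weight ~ height, data = toy)
curve_fit <- loess(weight ~ height, data = toy)

# Residual sum of squares: smaller means the fit follows the points more closely
rss_line <- sum(residuals(line_fit)^2)
rss_curve <- sum(residuals(curve_fit)^2)
c(line = rss_line, loess = rss_curve)
```

A loess fit bends to follow local patterns, so it tracks curvature that a single straight line cannot. Note also that the plotting code above does not set `se = FALSE`, so `geom_smooth()` draws its default confidence band around the curve.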