AE 14: Modeling Housing Prices 🏠

Suggested answers

In this application exercise we will be studying housing prices. The dataset is a cleaned version of publicly available real estate data. We will use tidyverse and tidymodels for data exploration and modeling, respectively.

We will use the ames_housing dataset from the modeldata package.

Before we use the dataset, we’ll make a few transformations to it.

data(ames)
housing <- ames |>
  select(Sale_Price, Gr_Liv_Area, Bldg_Type, Bedroom_AbvGr, Paved_Drive, Exter_Cond) |>
  mutate(home_type = fct_collapse(Bldg_Type,
    "House" = c("OneFam", "TwnhsE"),
    "Townhouse" = "Twnhs",
    "Duplex" = "Duplex"
  )) |>
  select(-Bldg_Type) |>
  rename(price = Sale_Price, sqft = Gr_Liv_Area, bedrooms = Bedroom_AbvGr) |>
  filter(home_type %in% c("House", "Townhouse", "Duplex"))

Here is a glimpse at the data:

glimpse(housing)
Rows: 2,868
Columns: 6
$ price       <int> 215000, 105000, 172000, 244000, 189900, 195500, 213500, 19…
$ sqft        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804,…
$ bedrooms    <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, 2, 3…
$ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved, Paved…
$ Exter_Cond  <fct> Typical, Typical, Typical, Typical, Typical, Typical, Typi…
$ home_type   <fct> House, House, House, House, House, House, House, House, Ho…

Get to know the data

  • Your turn: What is a typical house price in this dataset? What are some common square footage values? What types of homes are most common? Additionally, explore at least 1-2 other features that could be interesting. Share your findings!
ggplot(housing, aes(x = price)) +
  geom_boxplot()
ggplot(housing, aes(x = sqft)) +
  geom_histogram()
housing |>
  count(home_type)

# A tibble: 3 × 2
  home_type     n
  <fct>     <int>
1 House      2658
2 Duplex      109
3 Townhouse   101

Price vs. square footage

How can we use square footage to model/predict pricing? Here is the model:

price_sqft_fit <- linear_reg() |>
  fit(price ~ sqft, data = housing)

tidy(price_sqft_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   11430.   3271.        3.49 0.000482
2 sqft            114.      2.07     55.0  0       

And here is the model visualized:

ggplot(housing, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm")

  • Your turn: Write the fitted equation of the model in mathematical notation. Then, interpret the intercept and slope.

\[ \widehat{price} = 11430.1960 +113.7605 \times sqft \] For a home with 0 sq. ft., we predict the price to be $11430.20, on average. For each additional sq. ft, we predict the price of a home to increase by $113.76, on average.

Price vs. home type

price_type_fit <- linear_reg() |>
  fit(price ~ home_type, data = housing)

tidy(price_type_fit)
# A tibble: 3 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)         185469.     1537.    121.   0       
2 home_typeDuplex     -45661.     7746.     -5.89 4.20e- 9
3 home_typeTownhouse  -49535.     8036.     -6.16 8.06e-10
  • Your turn: Write the fitted equation of the model in mathematical notation. Then, interpret the intercept and each coefficient in context.

\[ \widehat{price} = 185469.48 - 45660.54 \times Duplex - 49535.42 \times Townhouse \] We predict the price of a house to be $185,469.48, on average. We predict that the price of a duplex will be $45,660.54 lower than the price of a house, on average. We predict that the price of a townhouse will be $49,535.42 lower than the price of a house, on average.

Price vs. square footage and home type

Now, let’s fit a model that use both variables!

Main effects model

The main effects model is another name for the additive model. Here is the model:

price_main_fit <- linear_reg() |>
  fit(price ~ sqft + home_type, data = housing)

tidy(price_main_fit)
# A tibble: 4 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          13395.   3225.        4.15 3.38e- 5
2 sqft                   115.      2.03     56.5  0       
3 home_typeDuplex     -63251.   5338.      -11.8  1.19e-31
4 home_typeTownhouse  -20306.   5552.       -3.66 2.60e- 4
  • Your turn: Write the fitted equation of the model in mathematical notation. Then, interpret the intercept and each slope coefficient.

\[ \widehat{price} = 13394.8010 + 114.5305 \times sqft - 63251.4927 \times Duplex - 20306.3316 \times Townhouse \] We predict the price of a house with 0 sq. ft. to be $13,394.80, on average. All else equal, for each additional sq. ft., we predict the price of a home to increase by $114.53, on average. All else equal, we predict the price of a Duplex to be $63,251.50 lower than the price of a house, on average.

Interaction effects model

Now, we will fit an interaction effects model.

Task: Write code to fit an interaction effects model predicting price from square footage and home type.

model_int <- linear_reg() |>
  fit(price ~ sqft * home_type, data = housing)

tidy(model_int)
# A tibble: 6 × 5
  term                    estimate std.error statistic  p.value
  <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)               9087.    3273.       2.78  5.53e- 3
2 sqft                       117.       2.06    56.9   0       
3 home_typeDuplex          57547.   19114.       3.01  2.63e- 3
4 home_typeTownhouse       13970.   21910.       0.638 5.24e- 1
5 sqft:home_typeDuplex       -73.2     11.1     -6.58  5.56e-11
6 sqft:home_typeTownhouse    -26.9     17.0     -1.59  1.13e- 1

Task: Write the fitted equation using mathematical notation.

\[ \begin{aligned} \widehat{price} = 9087.38517 + 117.39744 \times sqft + \\ 57547.32571 \times Duplex + 13970.23460 \times Townhouse \\ - 73.21084 \times (sqft *Duplex) -26.89557 (sqft * Townhouse) \end{aligned} \]

Task: Write the fitted equations for each home type.

Duplex

\[ \widehat{price} = (9087.38517 + 57547.32571) + (117.39744 - 73.21084) \times sqft \] House

\[ \widehat{price} = 9087.38517 + 117.39744 \times sqft \] Townhouse \[ \widehat{price} = (9087.38517 + 13970.23460) + (117.39744 -26.89557) \times sqft \]

Model Comparison

So, we fit multiple models - how do we know which one is better?

Using glance(), report each of the above models’ adjusted \(R^2\) values.

Which model is the best fit? Which is the worst?

glance(model_int)$adj.r.squared
[1] 0.5445571
glance(price_main_fit)$adj.r.squared
[1] 0.53766
glance(price_sqft_fit)$adj.r.squared
[1] 0.5136041
glance(price_type_fit)$adj.r.squared
[1] 0.0231983

One more model?

Task: Try adding one more variable present in the data frame to the “best” model chosen above. Does it make a difference in adjusted \(R^2\)? What about plain old \(R^2\)?

glimpse(housing)
Rows: 2,868
Columns: 6
$ price       <int> 215000, 105000, 172000, 244000, 189900, 195500, 213500, 19…
$ sqft        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804,…
$ bedrooms    <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, 2, 3…
$ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved, Paved…
$ Exter_Cond  <fct> Typical, Typical, Typical, Typical, Typical, Typical, Typi…
$ home_type   <fct> House, House, House, House, House, House, House, House, Ho…
housing_new_fit <- linear_reg() |>
  fit(price ~ sqft * home_type + bedrooms, data = housing)

tidy(housing_new_fit)
# A tibble: 7 × 5
  term                    estimate std.error statistic  p.value
  <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)              52516.    3970.     13.2    8.02e-39
2 sqft                       136.       2.23   61.0    0       
3 home_typeDuplex          14362.   18325.      0.784  4.33e- 1
4 home_typeTownhouse       -1089.   20834.     -0.0523 9.58e- 1
5 bedrooms                -25385.    1442.    -17.6    6.58e-66
6 sqft:home_typeDuplex       -34.0     10.8    -3.14   1.68e- 3
7 sqft:home_typeTownhouse    -21.1     16.1    -1.31   1.90e- 1
glance(housing_new_fit)$adj.r.squared
[1] 0.5889029
glance(housing_new_fit)$r.squared
[1] 0.5897632
glance(model_int)$r.squared
[1] 0.5453514
glance(model_int)$r.squared
[1] 0.5453514