AE 14: Modeling Housing Prices 🏠
Suggested answers
In this application exercise we will be studying housing prices. The dataset is a cleaned version of publicly available real estate data. We will use tidyverse and tidymodels for data exploration and modeling, respectively.
We will use the ames_housing dataset from the modeldata package.
Before we use the dataset, we’ll make a few transformations to it.
Your turn: Review the code below with your neighbors and write a summary of the data transformation pipeline.
Selected for a subset of relevant columns
Created a new variable,
home_type, by consolidating the levels of the og. factor variable,Bldg_Type; then, got rid ofBldg_TypeRenamed some columns
Subset df to homes of type House, Townhouse, or Duplex
data(ames)
housing <- ames |>
select(Sale_Price, Gr_Liv_Area, Bldg_Type, Bedroom_AbvGr, Paved_Drive, Exter_Cond) |>
mutate(home_type = fct_collapse(Bldg_Type,
"House" = c("OneFam", "TwnhsE"),
"Townhouse" = "Twnhs",
"Duplex" = "Duplex"
)) |>
select(-Bldg_Type) |>
rename(price = Sale_Price, sqft = Gr_Liv_Area, bedrooms = Bedroom_AbvGr) |>
filter(home_type %in% c("House", "Townhouse", "Duplex"))Here is a glimpse at the data:
glimpse(housing)Rows: 2,868
Columns: 6
$ price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213500, 19…
$ sqft <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804,…
$ bedrooms <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, 2, 3…
$ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved, Paved…
$ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typical, Typi…
$ home_type <fct> House, House, House, House, House, House, House, House, Ho…
Get to know the data
- Your turn: What is a typical house price in this dataset? What are some common square footage values? What types of homes are most common? Additionally, explore at least 1-2 other features that could be interesting. Share your findings!
ggplot(housing, aes(x = price)) +
geom_boxplot()
ggplot(housing, aes(x = sqft)) +
geom_histogram()
housing |>
count(home_type)# A tibble: 3 × 2
home_type n
<fct> <int>
1 House 2658
2 Duplex 109
3 Townhouse 101
Price vs. square footage
How can we use square footage to model/predict pricing? Here is the model:
price_sqft_fit <- linear_reg() |>
fit(price ~ sqft, data = housing)
tidy(price_sqft_fit)# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 11430. 3271. 3.49 0.000482
2 sqft 114. 2.07 55.0 0
And here is the model visualized:
ggplot(housing, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm")- Your turn: Write the fitted equation of the model in mathematical notation. Then, interpret the intercept and slope.
\[ \widehat{price} = 11430.1960 +113.7605 \times sqft \] For a home with 0 sq. ft., we predict the price to be $11430.20, on average. For each additional sq. ft, we predict the price of a home to increase by $113.76, on average.
Price vs. home type
price_type_fit <- linear_reg() |>
fit(price ~ home_type, data = housing)
tidy(price_type_fit)# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 185469. 1537. 121. 0
2 home_typeDuplex -45661. 7746. -5.89 4.20e- 9
3 home_typeTownhouse -49535. 8036. -6.16 8.06e-10
- Your turn: Write the fitted equation of the model in mathematical notation. Then, interpret the intercept and each coefficient in context.
\[ \widehat{price} = 185469.48 - 45660.54 \times Duplex - 49535.42 \times Townhouse \] We predict the price of a house to be $185,469.48, on average. We predict that the price of a duplex will be $45,660.54 lower than the price of a house, on average. We predict that the price of a townhouse will be $49,535.42 lower than the price of a house, on average.
Price vs. square footage and home type
Now, let’s fit a model that use both variables!
Main effects model
The main effects model is another name for the additive model. Here is the model:
price_main_fit <- linear_reg() |>
fit(price ~ sqft + home_type, data = housing)
tidy(price_main_fit)# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 13395. 3225. 4.15 3.38e- 5
2 sqft 115. 2.03 56.5 0
3 home_typeDuplex -63251. 5338. -11.8 1.19e-31
4 home_typeTownhouse -20306. 5552. -3.66 2.60e- 4
- Your turn: Write the fitted equation of the model in mathematical notation. Then, interpret the intercept and each slope coefficient.
\[ \widehat{price} = 13394.8010 + 114.5305 \times sqft - 63251.4927 \times Duplex - 20306.3316 \times Townhouse \] We predict the price of a house with 0 sq. ft. to be $13,394.80, on average. All else equal, for each additional sq. ft., we predict the price of a home to increase by $114.53, on average. All else equal, we predict the price of a Duplex to be $63,251.50 lower than the price of a house, on average.
Interaction effects model
Now, we will fit an interaction effects model.
Task: Write code to fit an interaction effects model predicting price from square footage and home type.
model_int <- linear_reg() |>
fit(price ~ sqft * home_type, data = housing)
tidy(model_int)# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9087. 3273. 2.78 5.53e- 3
2 sqft 117. 2.06 56.9 0
3 home_typeDuplex 57547. 19114. 3.01 2.63e- 3
4 home_typeTownhouse 13970. 21910. 0.638 5.24e- 1
5 sqft:home_typeDuplex -73.2 11.1 -6.58 5.56e-11
6 sqft:home_typeTownhouse -26.9 17.0 -1.59 1.13e- 1
Task: Write the fitted equation using mathematical notation.
\[ \begin{aligned} \widehat{price} = 9087.38517 + 117.39744 \times sqft + \\ 57547.32571 \times Duplex + 13970.23460 \times Townhouse \\ - 73.21084 \times (sqft *Duplex) -26.89557 (sqft * Townhouse) \end{aligned} \]
Task: Write the fitted equations for each home type.
Duplex
\[ \widehat{price} = (9087.38517 + 57547.32571) + (117.39744 - 73.21084) \times sqft \] House
\[ \widehat{price} = 9087.38517 + 117.39744 \times sqft \] Townhouse \[ \widehat{price} = (9087.38517 + 13970.23460) + (117.39744 -26.89557) \times sqft \]
Model Comparison
So, we fit multiple models - how do we know which one is better?
Using glance(), report each of the above models’ adjusted \(R^2\) values.
Which model is the best fit? Which is the worst?
glance(model_int)$adj.r.squared[1] 0.5445571
glance(price_main_fit)$adj.r.squared[1] 0.53766
glance(price_sqft_fit)$adj.r.squared[1] 0.5136041
glance(price_type_fit)$adj.r.squared[1] 0.0231983
One more model?
Task: Try adding one more variable present in the data frame to the “best” model chosen above. Does it make a difference in adjusted \(R^2\)? What about plain old \(R^2\)?
glimpse(housing)Rows: 2,868
Columns: 6
$ price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213500, 19…
$ sqft <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804,…
$ bedrooms <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, 2, 3…
$ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved, Paved…
$ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typical, Typi…
$ home_type <fct> House, House, House, House, House, House, House, House, Ho…
housing_new_fit <- linear_reg() |>
fit(price ~ sqft * home_type + bedrooms, data = housing)
tidy(housing_new_fit)# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 52516. 3970. 13.2 8.02e-39
2 sqft 136. 2.23 61.0 0
3 home_typeDuplex 14362. 18325. 0.784 4.33e- 1
4 home_typeTownhouse -1089. 20834. -0.0523 9.58e- 1
5 bedrooms -25385. 1442. -17.6 6.58e-66
6 sqft:home_typeDuplex -34.0 10.8 -3.14 1.68e- 3
7 sqft:home_typeTownhouse -21.1 16.1 -1.31 1.90e- 1
glance(housing_new_fit)$adj.r.squared[1] 0.5889029
glance(housing_new_fit)$r.squared[1] 0.5897632
glance(model_int)$r.squared[1] 0.5453514
glance(model_int)$r.squared[1] 0.5453514