Hypothesis Testing

Lecture 20

Author

Affiliation

Katie Solarz

Duke University
STA 199 Summer 2026: Session I

Published

June 16, 2026

Recap: sampling uncertainty

What if this was my dataset?

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    2.94 
2 log_inc        0.657

What if this was my dataset instead?

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    5.29 
2 log_inc        0.486

What if this was my dataset instead?

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    1.62 
2 log_inc        0.805

Rinse and repeat 1000 times…

Sampling uncertainty

How sensitive are the estimates to the data they are based on?

- Very? Then uncertainty is high, results are unreliable;
- Not very? Uncertainty is low, results are more reliable.

That was for n = 50. What if I was starting with n = 1000?

Sampling uncertainty decreased!
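
As a rough illustration of the comparison above, the sketch below draws many datasets of size 50 and of size 1000 from a made-up population and compares how much the fitted slope varies. The population model (`y = 2 + 0.5 * x + noise`), the seed, and the helper name `simulate_slope` are assumptions chosen for illustration; they are not the data behind the estimates shown earlier.

```r
# Hypothetical population: y = 2 + 0.5 * x + noise (values chosen for illustration only)
set.seed(199)

simulate_slope <- function(n) {
  x <- rnorm(n)
  y <- 2 + 0.5 * x + rnorm(n)
  unname(coef(lm(y ~ x))[2])  # slope estimate from one simulated dataset
}

# Draw 1000 fresh datasets at each sample size and refit the model each time
slopes_n50   <- replicate(1000, simulate_slope(50))
slopes_n1000 <- replicate(1000, simulate_slope(1000))

# The spread of the estimates across datasets is the sampling uncertainty
sd(slopes_n50)
sd(slopes_n1000)
```

The standard deviation of the slopes should come out noticeably smaller for the larger sample size, which is the pattern the slide describes.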

Bootstrapping

Data collection is costly, so we have to do our best with what we already have;
We approximate this idea of “alternative, hypothetical datasets I could have observed” by resampling our data with replacement;
We construct a new dataset of the same size by randomly picking rows out of the original one:
- Some rows will be duplicated;
- Some rows will not appear at all;
- Hence, the new dataset is different from the original;
- Different dataset $\rightarrow$ different estimate
Repeat this process hundreds or thousands of times, and observe how the estimates vary as you refit the model on alternative datasets.
This gives you a sense of the sampling variability of your estimates.
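
The resampling steps above can be sketched in base R. This is a minimal sketch, not the course's own implementation: the data frame `toy`, the helper `boot_slope`, and the simulated values are hypothetical stand-ins for "the one sample we have".

```r
set.seed(199)

# Hypothetical toy data standing in for the single observed sample
toy <- data.frame(x = rnorm(30))
toy$y <- 1 + 0.5 * toy$x + rnorm(30)

boot_slope <- function(data) {
  # Pick row indices with replacement: some rows repeat, some never appear
  rows <- sample(nrow(data), size = nrow(data), replace = TRUE)
  resampled <- data[rows, ]
  # Refit the model on the resampled dataset and keep the slope
  unname(coef(lm(y ~ x, data = resampled))[2])
}

# Repeat the resample-and-refit step 1000 times
boot_slopes <- replicate(1000, boot_slope(toy))

# The spread of the bootstrap slopes approximates the sampling variability
sd(boot_slopes)
```

Each call to `boot_slope()` builds one alternative dataset of the same size as the original, so the vector `boot_slopes` plays the role of the "estimates from datasets I could have observed".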

Bootstrap samples 1

Original data

# A tibble: 6 × 3
     id       x       y
  <int>   <dbl>   <dbl>
1     1  0.432   1.53  
2     2 -2.01    1.80  
3     3 -0.0467  1.43  
4     4 -1.05    0.0518
5     5  0.327   0.820 
6     6 -0.679  -0.961

Original estimates:

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   0.801 
2 x             0.0450

Sample with replacement:

# A tibble: 6 × 3
     id      x      y
  <int>  <dbl>  <dbl>
1     5  0.327  0.820
2     6 -0.679 -0.961
3     6 -0.679 -0.961
4     1  0.432  1.53 
5     6 -0.679 -0.961
6     1  0.432  1.53

Different data $\rightarrow$ new estimates:

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    0.462
2 x              2.11

Bootstrap samples 2

Original data

# A tibble: 6 × 3
     id       x       y
  <int>   <dbl>   <dbl>
1     1  0.432   1.53  
2     2 -2.01    1.80  
3     3 -0.0467  1.43  
4     4 -1.05    0.0518
5     5  0.327   0.820 
6     6 -0.679  -0.961

Original estimates:

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   0.801 
2 x             0.0450

Sample with replacement:

# A tibble: 6 × 3
     id       x      y
  <int>   <dbl>  <dbl>
1     2 -2.01    1.80 
2     5  0.327   0.820
3     1  0.432   1.53 
4     6 -0.679  -0.961
5     3 -0.0467  1.43 
6     2 -2.01    1.80

Different data $\rightarrow$ new estimates:

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    0.913
2 x             -0.236

Bootstrap samples 3

Original data

# A tibble: 6 × 3
     id       x       y
  <int>   <dbl>   <dbl>
1     1  0.432   1.53  
2     2 -2.01    1.80  
3     3 -0.0467  1.43  
4     4 -1.05    0.0518
5     5  0.327   0.820 
6     6 -0.679  -0.961

Original estimates:

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   0.801 
2 x             0.0450

Sample with replacement:

# A tibble: 6 × 3
     id      x      y
  <int>  <dbl>  <dbl>
1     6 -0.679 -0.961
2     1  0.432  1.53 
3     5  0.327  0.820
4     6 -0.679 -0.961
5     6 -0.679 -0.961
6     5  0.327  0.820

Different data $\rightarrow$ new estimates:

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    0.357
2 x              1.96

Confidence intervals

Point estimation: report your single number best guess for the unknown quantity;
Interval estimation: report a range, or interval, of values where you think the unknown quantity is likely to live;
- Interval should be wide enough to capture the truth with high probability;
- Interval should be narrow enough to be informative;
Unfortunately, there is a trade-off. You adjust the confidence level to try to negotiate the trade-off;
Common choices: 90%, 95%, 99%.

Precision vs. accuracy

Data: Houses in Duke Forest

Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020
Scraped from Zillow
Source: openintro::duke_forest

Home in Duke Forest

Goal: Use the area (in square feet) to understand variability in the price of houses in Duke Forest.

Modeling

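library(tidymodels) # linear_reg(), fit(), tidy()
library(openintro)  # duke_forest data
library(knitr)      # kable()

df_fit <- linear_reg() |>
  fit(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 2) # neatly format table to 2 digits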

term	estimate	std.error	statistic	p.value
(Intercept)	116652.33	53302.46	2.19	0.03
area	159.48	18.17	8.78	0.00

Confidence interval for the slope

A confidence interval will allow us to make a statement like “For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus X dollars.”

95% confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution
We are 95% confident that for each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $90.43 to $205.77.

Where do the bounds come from?

Quantiles!
90% of the bootstrap distribution is between the 5% quantile on the left and the 95% quantile on the right;
95% of the bootstrap distribution is between the 2.5% quantile on the left and the 97.5% quantile on the right;
And so on.
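
The mapping from confidence level to quantiles can be written out as a short sketch. The helper names `ci_probs` and `percentile_ci` are hypothetical, and `fake_boot_slopes` is simulated purely as a stand-in for a bootstrap distribution; it is not the Duke Forest output.

```r
# Lower and upper quantile probabilities for a given confidence level
ci_probs <- function(level) {
  alpha <- 1 - level
  c(lower = alpha / 2, upper = 1 - alpha / 2)
}

ci_probs(0.90)  # the 5% and 95% quantiles
ci_probs(0.95)  # the 2.5% and 97.5% quantiles

# Hypothetical stand-in for a bootstrap distribution of slopes
set.seed(199)
fake_boot_slopes <- rnorm(1000, mean = 160, sd = 30)

# Percentile interval: the middle `level` share of the distribution
percentile_ci <- function(estimates, level) {
  unname(quantile(estimates, probs = ci_probs(level)))
}

ci_90 <- percentile_ci(fake_boot_slopes, 0.90)
ci_95 <- percentile_ci(fake_boot_slopes, 0.95)
ci_99 <- percentile_ci(fake_boot_slopes, 0.99)

# Interval widths grow as the confidence level increases
diff(ci_90)
diff(ci_95)
diff(ci_99)
```

Because a higher confidence level reaches further into both tails, the 99% interval contains the 95% interval, which in turn contains the 90% interval. This is the width versus confidence trade-off in concrete terms.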

Computing the CI for the slope I

Calculate the observed slope:

observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept  116652.
2 area          159.

Computing the CI for the slope II

Take 100 bootstrap samples and fit models to each one:

set.seed(1120)

boot_fits <- duke_forest |>
  specify(price ~ area) |>
  generate(reps = 100, type = "bootstrap") |>
  fit()

boot_fits

# A tibble: 200 × 3
# Groups:   replicate [100]
   replicate term      estimate
       <int> <chr>        <dbl>
 1         1 intercept   47819.
 2         1 area          191.
 3         2 intercept  144645.
 4         2 area          134.
 5         3 intercept  114008.
 6         3 area          161.
 7         4 intercept  100639.
 8         4 area          166.
 9         5 intercept  215264.
10         5 area          125.
# ℹ 190 more rows

Computing the CI for the slope III

Percentile method: Compute the 95% CI as the middle 95% of the bootstrap distribution:

get_confidence_interval(
  boot_fits, 
  point_estimate = observed_fit, 
  level = 0.95,
  type = "percentile" # default method
)

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          92.1     223.
2 intercept -36765.   296528.

Computing the CI for the slope IV

If we did it manually…

boot_fits |>
  filter(term == "area") |>
  ungroup() |>
  summarize(
    lower_ci = quantile(estimate, 0.025),
    upper_ci = quantile(estimate, 0.975),
  )

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1     92.1     223.

Changing confidence level

How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?

get_confidence_interval(
  boot_fits, 
  point_estimate = observed_fit, 
  level = 0.95,
  type = "percentile"
)

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          92.1     223.
2 intercept -36765.   296528.

Changing confidence level

## confidence level: 90%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit, 
  level = 0.90, type = "percentile"
)

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          104.     212.
2 intercept  -24380.  256730.

## confidence level: 99%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit, 
  level = 0.99, type = "percentile"
)

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          56.3     226.
2 intercept -61950.   370395.

Recap

Population: Complete set of observations of whatever we are studying, e.g., people, tweets, photographs, etc. (population size = $N$)
Sample: Subset of the population, ideally random and representative (sample size = $n$)
Sample statistic $\ne$ population parameter, but if the sample is good, it can be a good estimate
Statistical inference: Discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by a stochastic (random) process
We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics from different samples from the population
Since we can’t continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability

Hypothesis testing

A hypothesis test is a statistical technique used to evaluate competing claims using data

Null hypothesis, $H_0$: An assumption about the population. With respect to the slope parameter, our null hypothesis is one of “no relationship”
Alternative hypothesis, $H_A$: A research question about the population. With respect to the slope parameter, our alternative hypothesis is that “there is some relationship”


Note: Hypotheses are always at the population level!

Setting hypotheses

Null hypothesis, $H_0$: “There is no relationship.” The slope of the model for predicting the prices of houses in Duke Forest from their areas is 0, $\beta_1 = 0$.
Alternative hypothesis, $H_A$: “There is some relationship.” The slope of the model for predicting the prices of houses in Duke Forest from their areas is different from 0, $\beta_1 \ne 0$.