AE 18: FIFA Inference ⚽

Packages

library(tidyverse)
library(tidymodels)

Data

In this exercise, we will work with data from FiveThirtyEight’s story How To Break FIFA. The dataset contains information on FIFA member countries, including each country’s confederation, share of the global population, share of World Cup TV audience, and GDP-weighted audience share.

The dataset, called fifa_countries_audience.csv, can be found in the data folder.

First, let’s load the data:

fifa <- read_csv("data/fifa_countries_audience.csv")

glimpse(fifa)

Rows: 191
Columns: 5
$ country            <chr> "United States", "Japan", "China", "Germany", "Braz…
$ confederation      <chr> "CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA…
$ population_share   <dbl> 4.5, 1.9, 19.5, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7, …
$ tv_audience_share  <dbl> 4.3, 4.9, 14.8, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8, …
$ gdp_weighted_share <dbl> 11.3, 9.1, 7.3, 6.3, 5.4, 4.2, 4.0, 4.0, 3.5, 3.1, …

Data Dictionary

Variable	Description
`country`	FIFA member country
`confederation`	Confederation to which the country belongs
`population_share`	Country’s share of the global population, recorded as a percentage
`tv_audience_share`	Country’s share of the global World Cup TV audience, recorded as a percentage
`gdp_weighted_share`	Country’s GDP-weighted audience share, recorded as a percentage

ABV (Always Be Visualizing)

Your turn: First, let’s get rid of two outliers. Create a new data frame called fifa_clean that contains only those member countries whose share of the global population is below 10%.

fifa_clean <- fifa |>
  filter(population_share < 10)

Your turn: Create a visualization to explore the relationship between our variables of interest. Let’s consider population_share as our explanatory variable and tv_audience_share as our outcome. Include a linear trendline.

ggplot(fifa_clean, aes(x = population_share, y = tv_audience_share)) +
  geom_point() +
  geom_smooth(method = "lm")

Now let’s imagine we only had a tiny subset of these data to work with:

set.seed(847)
baby_fifa <- fifa_clean |>
  slice(sample(1:nrow(fifa_clean), 25))
glimpse(baby_fifa)

Rows: 25
Columns: 5
$ country            <chr> "Japan", "Madagascar", "South Korea", "Uruguay", "M…
$ confederation      <chr> "AFC", "CAF", "AFC", "CONMEBOL", "UEFA", "CONCACAF"…
$ population_share   <dbl> 1.9, 0.3, 0.7, 0.0, 0.1, 0.1, 0.2, 0.5, 0.1, 0.1, 0…
$ tv_audience_share  <dbl> 4.9, 0.1, 1.8, 0.1, 0.1, 0.1, 0.5, 0.5, 0.3, 0.0, 0…
$ gdp_weighted_share <dbl> 9.1, 0.0, 3.0, 0.1, 0.0, 0.0, 0.1, 0.4, 0.4, 0.0, 0…

Plot the baby thing, again adding a linear trendline:

ggplot(baby_fifa, aes(x = population_share, y = tv_audience_share)) +
  geom_point() +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Inference with the small dataset

Your turn: Obtain the point estimate \(b_1\) from the baby data.

observed_fit <- baby_fifa |>
  specify(tv_audience_share ~ population_share) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term             estimate
  <chr>               <dbl>
1 intercept         -0.0942
2 population_share   2.45

Note

This gives the exact same numbers that you get if you use linear_reg() |> fit(), but we need this new syntax because it plays nice with the tools we have for confidence intervals and hypothesis tests. I know, I hate it too, but it’s the way it is.

Your turn: Typeset the equation for the model fit:

\[ \widehat{tv~audience~share} = -0.09415766 + 2.45268107 \times population~share \]

Your turn: Interpret the slope and the intercept estimates:

We predict that a FIFA member country with 0% share of the global population has a -9.4% share of the global World Cup TV audience, on average.

We would expect that for every percentage point increase in a FIFA member country’s share of the global population, the country’s share of the global World Cup TV audience increases by 2.45 percentage points, on average.

Hypothesis Testing

Let’s consider the hypotheses:

\[ H_0:\beta_1=0 \] \[ H_A: \beta_1\neq 0. \] The null hypothesis corresponds to the claim that a FIFA member country’s share of the global population and its share of the global World Cup TV audience are unrelated / uncorrelated.

Simulate and plot the null distribution for the slope:

set.seed(847)

null_dist <- baby_fifa |>
  specify(tv_audience_share ~ population_share) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

null_dist

# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate term             estimate
       <int> <chr>               <dbl>
 1         1 intercept           0.505
 2         1 population_share   -0.428
 3         2 intercept           0.329
 4         2 population_share    0.419
 5         3 intercept           0.465
 6         3 population_share   -0.233
 7         4 intercept           0.445
 8         4 population_share   -0.141
 9         5 intercept           0.464
10         5 population_share   -0.231
# ℹ 1,990 more rows

null_dist |>
  filter(term == "population_share") |>
  ggplot(aes(x = estimate)) +
  geom_histogram()

Where does our actual point estimate fall under the null distribution? Add a vertical line corresponding to our point estimate and shade the region corresponding to the \(p\)-value.

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

Compute the \(p\)-value for this test and interpret it:

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

# A tibble: 2 × 2
  term             p_value
  <chr>              <dbl>
1 intercept              0
2 population_share       0

Reject the null hypothesis. There is sufficient evidence for some relationship between a FIFA member country’s share of the global population and its share of the global World Cup TV audience.

Interval Estimation

Demo: Generate 500 bootstrap samples, and store them in a new data frame called bstrap_samples.

set.seed(847)

bstrap_samples <- baby_fifa |>
  specify(tv_audience_share ~ population_share) |>
  generate(reps = 500, type = "bootstrap")

Demo: Fit a linear model to each of these bootstrap samples and store the estimates in a new data framed called bstrap_fits.

bstrap_fits <- bstrap_samples |>
  fit()

Demo: Compute 95% confidence intervals for the slope and the intercept using the get_confidence_interval command.

ci_95 <- get_confidence_interval(bstrap_fits,
                        point_estimate = observed_fit,
                        level = 0.95,
                        type = "percentile")

ci_95

# A tibble: 2 × 3
  term             lower_ci upper_ci
  <chr>               <dbl>    <dbl>
1 intercept          -0.174   0.0347
2 population_share    0.955   2.60

Your turn: Verify that you get the same numbers when you manually calculate the quantiles of the slope estimates using summarize and quantile. Pay attention to the grouping.

bstrap_fits |>
  filter(term == "population_share") |>
  ungroup() |>
  summarize(lower_ci = quantile(estimate, .025),
            upper_ci = quantile(estimate, .975))

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.955     2.60

BONUS: You can visualize the confidence interval:

visualize(bstrap_fits) +
  shade_confidence_interval(ci_95)