AE 18: FIFA Inference ⚽

Today we’ll explore the relationship between a FIFA member country’s share of the world population and its share of the global World Cup TV audience.

Packages

library(tidyverse)
library(tidymodels)

Data

In this exercise, we will work with data from FiveThirtyEight’s story How To Break FIFA. The dataset contains information on FIFA member countries, including each country’s confederation, share of the global population, share of World Cup TV audience, and GDP-weighted audience share.

The dataset, called fifa_countries_audience.csv, can be found in the data folder.

First, let’s load the data:

fifa <- read_csv("data/fifa_countries_audience.csv")

glimpse(fifa)

Rows: 191
Columns: 5
$ country            <chr> "United States", "Japan", "China", "Germany", "Braz…
$ confederation      <chr> "CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA…
$ population_share   <dbl> 4.5, 1.9, 19.5, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7, …
$ tv_audience_share  <dbl> 4.3, 4.9, 14.8, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8, …
$ gdp_weighted_share <dbl> 11.3, 9.1, 7.3, 6.3, 5.4, 4.2, 4.0, 4.0, 3.5, 3.1, …

Data Dictionary

Variable	Description
`country`	FIFA member country
`confederation`	Confederation to which the country belongs
`population_share`	Country’s share of the global population, recorded as a percentage
`tv_audience_share`	Country’s share of the global World Cup TV audience, recorded as a percentage
`gdp_weighted_share`	Country’s GDP-weighted audience share, recorded as a percentage

ABV (Always Be Visualizing)

Your turn: First, let’s get rid of two outliers. Create a new data frame called fifa_clean that contains only those member countries whose share of the global population is below 10%.

# add code here
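One possible approach is a single filter() call. The sketch below runs on its own by using a tiny made-up stand-in called toy_fifa (its values are hypothetical, not taken from the real dataset); in the exercise you would apply the same filter to fifa and store the result in fifa_clean.

```r
library(tidyverse)

# Hypothetical stand-in for `fifa`; these values are made up for illustration.
toy_fifa <- tibble(
  country = c("A", "B", "C", "D"),
  population_share = c(19.5, 4.5, 17.7, 0.9)
)

# Keep only countries whose share of the global population is below 10%.
toy_clean <- toy_fifa |>
  filter(population_share < 10)

toy_clean
```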

Your turn: Create a visualization to explore the relationship between our variables of interest. Let’s consider population_share as our explanatory variable and tv_audience_share as our outcome. Include a linear trendline.

# add code here
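A minimal sketch of such a plot, assuming a scatterplot with a least-squares line is what we want here. The data frame toy_clean is simulated purely so the example runs by itself (the slope, intercept, and noise level are arbitrary assumptions); with the real data you would pass fifa_clean instead.

```r
library(tidyverse)

# Simulated stand-in for `fifa_clean`; the numbers are arbitrary, not real data.
set.seed(1)
toy_clean <- tibble(
  population_share = runif(50, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(50, sd = 1.5)
)

# Scatterplot of outcome vs. explanatory variable with a linear trendline.
p <- ggplot(toy_clean, aes(x = population_share, y = tv_audience_share)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Share of global population (%)",
    y = "Share of World Cup TV audience (%)"
  )

p
```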

Now let’s imagine we only had a tiny subset of these data to work with:

set.seed(847)
baby_fifa <- fifa_clean |>
  slice(sample(1:nrow(fifa_clean), 25))
glimpse(baby_fifa)

Plot the baby data, again adding a linear trendline:

# add code here

Inference with the small dataset

Your turn: Obtain the point estimate \(b_1\) from the baby data.

# add code here
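One way this could look with the infer functions that tidymodels loads: specify() declares the outcome and explanatory variable, and fit() returns the intercept and slope estimates. The data frame toy_baby is a simulated stand-in for baby_fifa (an assumption made so the sketch runs on its own), and observed_fit is just an illustrative name.

```r
library(tidyverse)
library(tidymodels)

# Simulated stand-in for `baby_fifa` (25 rows); the numbers are arbitrary.
set.seed(1)
toy_baby <- tibble(
  population_share = runif(25, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(25, sd = 1.5)
)

# Point estimates: one row for the intercept, one for the slope.
observed_fit <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

observed_fit
```

The row whose term is population_share holds the slope estimate \(b_1\).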

Note

The specify() |> fit() syntax gives the exact same numbers that you get if you use linear_reg() |> fit(), but we need this new syntax because it plays nicely with the tools we have for confidence intervals and hypothesis tests. I know, I hate it too, but that’s the way it is.

Your turn: Typeset the equation for the model fit:

\[ add~math~here \]
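As a starting template only (symbols, not numbers), the fitted line has the form below. Here \(b_0\) is assumed notation for the estimated intercept, matching the \(b_1\) used above for the slope; replace both with the estimates you obtained from the baby data.

```latex
\[
\widehat{tv\_audience\_share} = b_0 + b_1 \times population\_share
\]
```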

Your turn: Interpret the slope and the intercept estimates:

Hypothesis Testing

Let’s consider the hypotheses:

\[ H_0: \beta_1 = 0 \]

\[ H_A: \beta_1 \neq 0 \]

The null hypothesis corresponds to the claim that a FIFA member country’s share of the global population and its share of the global World Cup TV audience are unrelated / uncorrelated.

Simulate and plot the null distribution for the slope:

set.seed(847)
# add code here
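A sketch of one way to do this with infer: permute the outcome under the independence null, refit the line to each shuffled dataset, and plot the resulting estimates. The data frame toy_baby is a simulated stand-in for baby_fifa, and reps = 1000 is an arbitrary choice rather than a number given in this exercise.

```r
library(tidyverse)
library(tidymodels)

# Simulated stand-in for `baby_fifa` (25 rows); the numbers are arbitrary.
set.seed(1)
toy_baby <- tibble(
  population_share = runif(25, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(25, sd = 1.5)
)

# Shuffle the outcome 1000 times and refit the model to each shuffled dataset.
set.seed(847)
null_dist <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# Histograms of the permuted estimates, one panel per term.
visualize(null_dist)
```

Under the null, the slope estimates should pile up around zero, since shuffling breaks any link between the two variables.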

Where does our actual point estimate fall under the null distribution? Add a vertical line corresponding to our point estimate and shade the region corresponding to the \(p\)-value.

# add code here

Compute the \(p\)-value for this test and interpret it:

# add code here
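The shading and the \(p\)-value computation could be sketched together as follows, using shade_p_value() and get_p_value() from infer. Everything is rebuilt on a simulated stand-in (toy_baby) so the example runs by itself; the names observed_fit, null_dist, and p_vals are illustrative, and reps = 1000 is an arbitrary assumption.

```r
library(tidyverse)
library(tidymodels)

# Simulated stand-in for `baby_fifa` (25 rows); the numbers are arbitrary.
set.seed(1)
toy_baby <- tibble(
  population_share = runif(25, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(25, sd = 1.5)
)

# Observed intercept and slope.
observed_fit <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

# Null distribution of the estimates under independence.
set.seed(847)
null_dist <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# Mark the observed estimates and shade the two-sided tail region.
null_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")
null_plot

# Two-sided p-values, one row per term.
p_vals <- get_p_value(null_dist, obs_stat = observed_fit, direction = "two-sided")
p_vals
```

The \(p\)-value in the population_share row is the one that addresses \(H_0: \beta_1 = 0\): it is the proportion of permuted slopes at least as extreme as the observed one.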

Interval Estimation

Demo: Generate 500 bootstrap samples, and store them in a new data frame called bstrap_samples.

set.seed(847)
# add code here

Demo: Fit a linear model to each of these bootstrap samples and store the estimates in a new data frame called bstrap_fits.

# add code here
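One way the two bootstrap steps could be written with infer, shown on a simulated stand-in (toy_baby) so the sketch is self-contained; with the real data you would start from baby_fifa. Note that no hypothesize() step is needed, because bootstrapping resamples rows rather than simulating under a null.

```r
library(tidyverse)
library(tidymodels)

# Simulated stand-in for `baby_fifa` (25 rows); the numbers are arbitrary.
set.seed(1)
toy_baby <- tibble(
  population_share = runif(25, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(25, sd = 1.5)
)

# 500 bootstrap samples: each resamples the 25 rows with replacement.
set.seed(847)
bstrap_samples <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  generate(reps = 500, type = "bootstrap")

# Fit the linear model to every bootstrap sample.
bstrap_fits <- bstrap_samples |>
  fit()

bstrap_fits
```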

Demo: Compute 95% confidence intervals for the slope and the intercept using the get_confidence_interval command.

# add code here

Your turn: Verify that you get the same numbers when you manually calculate the quantiles of the slope estimates using summarize and quantile. Pay attention to the grouping.

# add code here
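A sketch comparing the two routes, again on simulated stand-in data (toy_baby) so that it runs on its own. It assumes percentile intervals are intended; the names ci_infer and ci_manual are illustrative. The ungroup() call is there because the bootstrap fits may come back grouped by replicate, and we want quantiles within each term instead.

```r
library(tidyverse)
library(tidymodels)

# Simulated stand-in for `baby_fifa` (25 rows); the numbers are arbitrary.
set.seed(1)
toy_baby <- tibble(
  population_share = runif(25, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(25, sd = 1.5)
)

observed_fit <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

set.seed(847)
bstrap_fits <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  generate(reps = 500, type = "bootstrap") |>
  fit()

# Route 1: 95% percentile intervals from infer, one row per term.
ci_infer <- get_confidence_interval(
  bstrap_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)

# Route 2: the same quantiles by hand, grouped by term rather than replicate.
ci_manual <- bstrap_fits |>
  ungroup() |>
  group_by(term) |>
  summarize(
    lower_ci = quantile(estimate, 0.025),
    upper_ci = quantile(estimate, 0.975)
  )

ci_infer
ci_manual
```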

BONUS: You can visualize the confidence interval:

# add code here
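A possible version of the bonus plot, using shade_confidence_interval() from infer on top of the bootstrap distributions. As before, toy_baby is a simulated stand-in for baby_fifa and the object names are illustrative assumptions.

```r
library(tidyverse)
library(tidymodels)

# Simulated stand-in for `baby_fifa` (25 rows); the numbers are arbitrary.
set.seed(1)
toy_baby <- tibble(
  population_share = runif(25, 0, 5),
  tv_audience_share = 0.5 + 0.5 * population_share + rnorm(25, sd = 1.5)
)

observed_fit <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

set.seed(847)
bstrap_fits <- toy_baby |>
  specify(tv_audience_share ~ population_share) |>
  generate(reps = 500, type = "bootstrap") |>
  fit()

# 95% percentile intervals for the intercept and the slope.
ci <- get_confidence_interval(
  bstrap_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)

# Bootstrap distributions with the interval for each term shaded.
ci_plot <- visualize(bstrap_fits) +
  shade_confidence_interval(endpoints = ci)

ci_plot
```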