AE 18: FIFA Inference ⚽
Today we’ll explore the relationship between a FIFA member country’s share of the world population and its share of the global World Cup TV audience.
Packages
Data
In this exercise, we will work with data from FiveThirtyEight’s story How To Break FIFA. The dataset contains information on FIFA member countries, including each country’s confederation, share of the global population, share of World Cup TV audience, and GDP-weighted audience share.
The dataset, called fifa_countries_audience.csv, can be found in the data folder.
First, let’s load the data:
Rows: 191
Columns: 5
$ country <chr> "United States", "Japan", "China", "Germany", "Braz…
$ confederation <chr> "CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA…
$ population_share <dbl> 4.5, 1.9, 19.5, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7, …
$ tv_audience_share <dbl> 4.3, 4.9, 14.8, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8, …
$ gdp_weighted_share <dbl> 11.3, 9.1, 7.3, 6.3, 5.4, 4.2, 4.0, 4.0, 3.5, 3.1, …
Data Dictionary
| Variable | Description |
|---|---|
| country | FIFA member country |
| confederation | Confederation to which the country belongs |
| population_share | Country’s share of the global population, recorded as a percentage |
| tv_audience_share | Country’s share of the global World Cup TV audience, recorded as a percentage |
| gdp_weighted_share | Country’s GDP-weighted audience share, recorded as a percentage |
ABV (Always Be Visualizing)
- Your turn: First, let’s get rid of two outliers. Create a new data frame called fifa_clean that contains only those member countries whose share of the global population is below 10%.
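A minimal sketch of the filtering step. The CSV itself is not reproduced here, so the stand-in tibble `fifa_head` (a hypothetical name) holds only the first ten values of two columns as printed in the glimpse output above; in the exercise you would start from the full data frame instead.

```r
library(tidyverse)

# Stand-in data: the first ten rows shown in the glimpse output above.
# `fifa_head` is a hypothetical name; the exercise uses the full data set.
fifa_head <- tibble(
  population_share  = c(4.5, 1.9, 19.5, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 14.8, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
)

# Keep only countries whose population share is strictly below 10%.
fifa_clean <- fifa_head |>
  filter(population_share < 10)

fifa_clean
```

Among these ten rows only the 19.5% entry is dropped, leaving nine.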
# add code here
- Your turn: Create a visualization to explore the relationship between our variables of interest. Let’s consider population_share as our explanatory variable and tv_audience_share as our outcome. Include a linear trendline.
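One way such a plot could look, as a sketch only. The data are again a stand-in built from the first ten rows of the glimpse output (filtered to population shares below 10%), and the axis labels are my own wording.

```r
library(tidyverse)

# Stand-in for fifa_clean: first ten rows from the glimpse output,
# with population shares of 10% or more removed.
fifa_clean <- tibble(
  population_share  = c(4.5, 1.9, 19.5, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 14.8, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
) |>
  filter(population_share < 10)

# Scatterplot with a least-squares line (explanatory variable on the x-axis).
p <- ggplot(fifa_clean, aes(x = population_share, y = tv_audience_share)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Share of global population (%)",
    y = "Share of World Cup TV audience (%)"
  )

p
```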
# add code here
Now let’s imagine we only had a tiny subset of these data to work with:
- Plot the baby data, again adding a linear trendline:
# add code here
Inference with the small dataset
- Your turn: Obtain the point estimate \(b_1\) from the baby data.
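A sketch of one way to get the point estimate, assuming the intended workflow is `specify() |> fit()` from the infer package (which tidymodels also attaches). The name `fifa_baby` and its contents are assumptions: the actual subset is not shown here, so nine rows from the glimpse output stand in for it.

```r
library(tidyverse)
library(infer)

# Hypothetical stand-in for the baby data (not the subset used in class).
fifa_baby <- tibble(
  population_share  = c(4.5, 1.9, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
)

# Fit the simple linear regression; the result has one row per term.
observed_fit <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

observed_fit

# The slope b_1 is the estimate on the population_share row.
b_1 <- observed_fit |>
  filter(term == "population_share") |>
  pull(estimate)
```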
# add code here
This gives the exact same numbers that you get if you use linear_reg() |> fit(), but we need this new syntax because it plays nice with the tools we have for confidence intervals and hypothesis tests. I know, I hate it too, but it’s the way it is.
- Your turn: Typeset the equation for the model fit:
\[ add~math~here \]
- Your turn: Interpret the slope and the intercept estimates:
Hypothesis Testing
Let’s consider the hypotheses:
\[ H_0: \beta_1 = 0 \]
\[ H_A: \beta_1 \neq 0 \]
The null hypothesis corresponds to the claim that a FIFA member country’s share of the global population and its share of the global World Cup TV audience are unrelated / uncorrelated.
- Simulate and plot the null distribution for the slope:
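A sketch of how the null distribution might be simulated with infer, by permuting the outcome so that any association with the explanatory variable is broken. The stand-in `fifa_baby` (hypothetical name, rows taken from the glimpse output) and the choice of 1000 permutations are assumptions, not part of the exercise.

```r
library(tidyverse)
library(infer)

# Hypothetical stand-in for the baby data (not the subset used in class).
fifa_baby <- tibble(
  population_share  = c(4.5, 1.9, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
)

set.seed(847)

# Permute the outcome, then refit the model on each permuted data set.
null_dist <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>  # reps = 1000 is an assumption
  fit()

# One histogram panel per term (intercept and slope).
visualize(null_dist)
```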
set.seed(847)
# add code here
- Where does our actual point estimate fall under the null distribution? Add a vertical line corresponding to our point estimate and shade the region corresponding to the \(p\)-value.
# add code here
- Compute the \(p\)-value for this test and interpret it:
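The shading and the \(p\)-value can be sketched together with infer's `shade_p_value()` and `get_p_value()`. As before, `fifa_baby` is a hypothetical stand-in built from the glimpse output, and 1000 permutations is an assumed setting. With so few rows, infer may warn that a reported \(p\)-value of 0 is only an approximation.

```r
library(tidyverse)
library(infer)

# Hypothetical stand-in for the baby data (not the subset used in class).
fifa_baby <- tibble(
  population_share  = c(4.5, 1.9, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
)

# Observed estimates from the actual (unpermuted) data.
observed_fit <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

set.seed(847)
null_dist <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>  # reps = 1000 is an assumption
  fit()

# Vertical line at each observed estimate, with the two-sided region shaded.
visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

# Two-sided p-value for each term; H_0: beta_1 = 0 concerns the slope row.
p_vals <- null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

p_vals
```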
# add code here
Interval Estimation
- Demo: Generate 500 bootstrap samples, and store them in a new data frame called bstrap_samples.
set.seed(847)
# add code here
- Demo: Fit a linear model to each of these bootstrap samples and store the estimates in a new data frame called bstrap_fits.
# add code here
- Demo: Compute 95% confidence intervals for the slope and the intercept using the get_confidence_interval command.
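The three demo steps could be chained roughly as follows. This is a sketch on a hypothetical stand-in for the baby data (`fifa_baby`, rows taken from the glimpse output), and it assumes percentile intervals are what is wanted.

```r
library(tidyverse)
library(infer)

# Hypothetical stand-in for the baby data (not the subset used in class).
fifa_baby <- tibble(
  population_share  = c(4.5, 1.9, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
)

observed_fit <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

set.seed(847)

# Step 1: resample the rows with replacement, 500 times.
bstrap_samples <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  generate(reps = 500, type = "bootstrap")

# Step 2: refit the model on every bootstrap sample.
bstrap_fits <- bstrap_samples |>
  fit()

# Step 3: 95% percentile intervals, one row per term.
ci_95 <- get_confidence_interval(
  bstrap_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)

ci_95
```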
# add code here
- Your turn: Verify that you get the same numbers when you manually calculate the quantiles of the slope estimates using summarize and quantile. Pay attention to the grouping.
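A sketch of the manual check, again on a hypothetical stand-in for the baby data. The bootstrap fits come back grouped by replicate, so the sketch drops that grouping and regroups by term before taking the 2.5% and 97.5% quantiles.

```r
library(tidyverse)
library(infer)

# Hypothetical stand-in for the baby data (not the subset used in class).
fifa_baby <- tibble(
  population_share  = c(4.5, 1.9, 1.2, 2.8, 0.9, 0.9, 0.9, 2.1, 0.7),
  tv_audience_share = c(4.3, 4.9, 2.9, 7.1, 2.1, 2.1, 2.0, 3.1, 1.8)
)

observed_fit <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  fit()

set.seed(847)
bstrap_fits <- fifa_baby |>
  specify(tv_audience_share ~ population_share) |>
  generate(reps = 500, type = "bootstrap") |>
  fit()

# Interval from infer, for comparison.
infer_ci <- get_confidence_interval(
  bstrap_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)

# Manual version: one group per term, not one group per replicate.
manual_ci <- bstrap_fits |>
  ungroup() |>
  group_by(term) |>
  summarize(
    lower_ci = quantile(estimate, 0.025),
    upper_ci = quantile(estimate, 0.975)
  )

manual_ci
```

Leaving the data grouped by replicate would instead produce one pair of quantiles per bootstrap sample, which is not what the interval needs.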
# add code here
- BONUS: You can visualize the confidence interval:
# add code here