Rows: 217 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (2): year, population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 285 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): entity, code, continent
dbl (1): year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the data:
population
# A tibble: 217 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 2022 41129.
2 Albania 2022 2778.
3 Algeria 2022 44903.
4 American Samoa 2022 44.3
5 Andorra 2022 79.8
6 Angola 2022 35589.
7 Antigua and Barbuda 2022 93.8
8 Argentina 2022 46235.
9 Armenia 2022 2780.
10 Aruba 2022 106.
# ℹ 207 more rows
continents
# A tibble: 285 × 4
entity code year continent
<chr> <chr> <dbl> <chr>
1 Abkhazia OWID_ABK 2015 Asia
2 Afghanistan AFG 2015 Asia
3 Akrotiri and Dhekelia OWID_AKD 2015 Asia
4 Aland Islands ALA 2015 Europe
5 Albania ALB 2015 Europe
6 Algeria DZA 2015 Africa
7 American Samoa ASM 2015 Oceania
8 Andorra AND 2015 Europe
9 Angola AGO 2015 Africa
10 Anguilla AIA 2015 North America
# ℹ 275 more rows
Question 1: Join Concept
We want to know what continent all of the variables in the population data frame are in.
What type of join should we use?
Left join with population as the “left” or “main” df because we want to keep all info in this df & supplement with info from continents
Which variable in each data frame should we use?
Country & Entity
Question 2: Implement the Join
Join the two data frames and name assign the joined data frame to a new data frame population_continents .
population_continents<-population|>left_join(continents, by =join_by(country==entity))population_continents
# A tibble: 217 × 6
country year.x population code year.y continent
<chr> <dbl> <dbl> <chr> <dbl> <chr>
1 Afghanistan 2022 41129. AFG 2015 Asia
2 Albania 2022 2778. ALB 2015 Europe
3 Algeria 2022 44903. DZA 2015 Africa
4 American Samoa 2022 44.3 ASM 2015 Oceania
5 Andorra 2022 79.8 AND 2015 Europe
6 Angola 2022 35589. AGO 2015 Africa
7 Antigua and Barbuda 2022 93.8 ATG 2015 North America
8 Argentina 2022 46235. ARG 2015 South America
9 Armenia 2022 2780. ARM 2015 Asia
10 Aruba 2022 106. ABW 2015 North America
# ℹ 207 more rows
How does that look? Take a look at your new data frame!
Question 3: What went wrong?
It might not be obvious, but something is a little weird about this. Go ahead and filter the resulting data frame to see if any of the continent values are NA.
# A tibble: 6 × 6
country year.x population code year.y continent
<chr> <dbl> <dbl> <chr> <dbl> <chr>
1 Congo, Dem. Rep. 2022 99010. <NA> NA <NA>
2 Congo, Rep. 2022 5970. <NA> NA <NA>
3 Hong Kong SAR, China 2022 7346. <NA> NA <NA>
4 Korea, Dem. People's Rep. 2022 26069. <NA> NA <NA>
5 Korea, Rep. 2022 51628. <NA> NA <NA>
6 Kyrgyz Republic 2022 6975. <NA> NA <NA>
There are! This means that there were no rows in the continents data frame with those countries. That seems a little weird. Take a scroll through the continents data frame.
Do you see what the cause of this is??
Question 4: Let’s fix this!
So, countries have to be spelled the exact same way in each data set. The code below renames the missing countries in the population data set to match the spelling in contintents; we will discuss case_when() and other similar functions in more detail next week! Run this code first, and then re-run the join in the subsequent code chunk: we are no longer missing these values!
# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, year.x <dbl>, population <dbl>, code <chr>,
# year.y <dbl>, continent <chr>
The Bigger Picture
How does what we just did fit into the bigger picture of the class?
You are almost never going to be joining data sets just for fun! Joining data sets is a way to enable investigating more interesting patterns in data.
Now, you could use your joined df to create plots or compute summary statistics to investigate the relationship between continent and total population.
Keep joining in mind when you are looking for datasets in your project teams… if you find two datasets that share some common information, feel free to join them into a larger dataset for exploration!
For extra practice, try making a bar plot of total population in each country.
population_continent_new<-population_continent_new|>mutate(population =population/1000)## subset the data and plot only the top 20 most populous countries top_20<-population_continent_new|>slice_max(population, n =20)top_20|>ggplot(aes(x =population, y =fct_reorder(country, population), fill =continent))+geom_col()+labs(title ="20 Most Populous Countries", x ="Population (millions)", y =NULL, fill ="Continent")+theme_minimal()
To see something new, try making a ‘lollipop chart’ of population by country (hint: this will require using geom_point() and geom_segment().
top_20|>ggplot(aes(x =population, y =fct_reorder(country, population), color =continent))+# line segmentgeom_segment(aes(x =0, xend =population, y =country, yend =country), lwd =1.2, alpha =0.7)+# point at the endgeom_point(size =4)+labs( title ="20 Most Populous Countries", x ="Population (millions)", y =NULL, color ="Continent")+theme_minimal()