AE 07: Joining

Goal

Let’s suppose we were interested in examining the total populations of continents. However, these are the two data sets we have available:

world-pop-22.csv: contains data from The World Bank with countries’ populations
continents.csv: contains data mapping countries to continents

To do analysis of population by continent, we need to join!

Load packages and read in data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

population <- read_csv("data/world-pop-2022.csv")

Rows: 217 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (2): year, population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

continents <- read_csv("data/continents.csv")

Rows: 285 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): entity, code, continent
dbl (1): year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a look at the data:

population

# A tibble: 217 × 3
   country              year population
   <chr>               <dbl>      <dbl>
 1 Afghanistan          2022    41129. 
 2 Albania              2022     2778. 
 3 Algeria              2022    44903. 
 4 American Samoa       2022       44.3
 5 Andorra              2022       79.8
 6 Angola               2022    35589. 
 7 Antigua and Barbuda  2022       93.8
 8 Argentina            2022    46235. 
 9 Armenia              2022     2780. 
10 Aruba                2022      106. 
# ℹ 207 more rows

continents

# A tibble: 285 × 4
   entity                code      year continent    
   <chr>                 <chr>    <dbl> <chr>        
 1 Abkhazia              OWID_ABK  2015 Asia         
 2 Afghanistan           AFG       2015 Asia         
 3 Akrotiri and Dhekelia OWID_AKD  2015 Asia         
 4 Aland Islands         ALA       2015 Europe       
 5 Albania               ALB       2015 Europe       
 6 Algeria               DZA       2015 Africa       
 7 American Samoa        ASM       2015 Oceania      
 8 Andorra               AND       2015 Europe       
 9 Angola                AGO       2015 Africa       
10 Anguilla              AIA       2015 North America
# ℹ 275 more rows

Question 1: Join Concept

We want to know what continent all of the variables in the population data frame are in.

What type of join should we use?

Left join with population as the “left” or “main” df because we want to keep all info in this df & supplement with info from continents

Which variable in each data frame should we use?

Country & Entity

Question 2: Implement the Join

Join the two data frames and name assign the joined data frame to a new data frame population_continents .

population_continents <- population |>
  left_join(
    continents,
    by = join_by(country == entity)
  )

population_continents

# A tibble: 217 × 6
   country             year.x population code  year.y continent    
   <chr>                <dbl>      <dbl> <chr>  <dbl> <chr>        
 1 Afghanistan           2022    41129.  AFG     2015 Asia         
 2 Albania               2022     2778.  ALB     2015 Europe       
 3 Algeria               2022    44903.  DZA     2015 Africa       
 4 American Samoa        2022       44.3 ASM     2015 Oceania      
 5 Andorra               2022       79.8 AND     2015 Europe       
 6 Angola                2022    35589.  AGO     2015 Africa       
 7 Antigua and Barbuda   2022       93.8 ATG     2015 North America
 8 Argentina             2022    46235.  ARG     2015 South America
 9 Armenia               2022     2780.  ARM     2015 Asia         
10 Aruba                 2022      106.  ABW     2015 North America
# ℹ 207 more rows

How does that look? Take a look at your new data frame!

Question 3: What went wrong?

It might not be obvious, but something is a little weird about this. Go ahead and filter the resulting data frame to see if any of the continent values are NA.

population_continents |>
  filter(is.na(continent))

# A tibble: 6 × 6
  country                   year.x population code  year.y continent
  <chr>                      <dbl>      <dbl> <chr>  <dbl> <chr>    
1 Congo, Dem. Rep.            2022     99010. <NA>      NA <NA>     
2 Congo, Rep.                 2022      5970. <NA>      NA <NA>     
3 Hong Kong SAR, China        2022      7346. <NA>      NA <NA>     
4 Korea, Dem. People's Rep.   2022     26069. <NA>      NA <NA>     
5 Korea, Rep.                 2022     51628. <NA>      NA <NA>     
6 Kyrgyz Republic             2022      6975. <NA>      NA <NA>

There are! This means that there were no rows in the continents data frame with those countries. That seems a little weird. Take a scroll through the continents data frame.

Do you see what the cause of this is??

Question 4: Let’s fix this!

So, countries have to be spelled the exact same way in each data set. The code below renames the missing countries in the population data set to match the spelling in contintents; we will discuss case_when() and other similar functions in more detail next week! Run this code first, and then re-run the join in the subsequent code chunk: we are no longer missing these values!

population_new_spellings <- population |>
  mutate(country = case_when(
    country == "Congo, Dem. Rep." ~ "Democratic Republic of Congo",
    country == "Congo, Rep." ~ "Congo",
    country == "Hong Kong SAR, China" ~ "Hong Kong",
    country == "Korea, Dem. People's Rep." ~ "North Korea",
    country == "Korea, Rep." ~ "South Korea",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    .default = country
    )
  )

population_continent_new = population_new_spellings |>
  left_join(continents, by = join_by(country == entity))

Are there any NAs left? Run the code to see!

population_continent_new |>
  filter(is.na(continent))

# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, year.x <dbl>, population <dbl>, code <chr>,
#   year.y <dbl>, continent <chr>

The Bigger Picture

How does what we just did fit into the bigger picture of the class?

You are almost never going to be joining data sets just for fun! Joining data sets is a way to enable investigating more interesting patterns in data.
Now, you could use your joined df to create plots or compute summary statistics to investigate the relationship between continent and total population.
Keep joining in mind when you are looking for datasets in your project teams… if you find two datasets that share some common information, feel free to join them into a larger dataset for exploration!

For extra practice, try making a bar plot of total population in each country.

population_continent_new <- population_continent_new |>
  mutate(population = population / 1000)

## subset the data and plot only the top 20 most populous countries 

top_20 <- population_continent_new |>
  slice_max(population, n = 20)

top_20 |>
  ggplot(aes(x = population, y = fct_reorder(country, population),
             fill = continent)) +
  geom_col() +
  labs(title = "20 Most Populous Countries",
       x = "Population (millions)",
       y = NULL,
       fill = "Continent") +
  theme_minimal()

To see something new, try making a ‘lollipop chart’ of population by country (hint: this will require using geom_point() and geom_segment().

top_20 |>
  ggplot(aes(x = population, y = fct_reorder(country, population),
             color = continent)) +
  # line segment
  geom_segment(aes(x = 0, xend = population,
                   y = country, yend = country),
               lwd = 1.2, alpha = 0.7) +
  # point at the end
  geom_point(size = 4) +
  labs(
    title = "20 Most Populous Countries",
    x = "Population (millions)",
    y = NULL,
    color = "Continent"
  ) +
  theme_minimal()