Rows: 217 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (2): year, population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 285 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): entity, code, continent
dbl (1): year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the data:
population
# A tibble: 217 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 2022 41129.
2 Albania 2022 2778.
3 Algeria 2022 44903.
4 American Samoa 2022 44.3
5 Andorra 2022 79.8
6 Angola 2022 35589.
7 Antigua and Barbuda 2022 93.8
8 Argentina 2022 46235.
9 Armenia 2022 2780.
10 Aruba 2022 106.
# ℹ 207 more rows
continents
# A tibble: 285 × 4
entity code year continent
<chr> <chr> <dbl> <chr>
1 Abkhazia OWID_ABK 2015 Asia
2 Afghanistan AFG 2015 Asia
3 Akrotiri and Dhekelia OWID_AKD 2015 Asia
4 Aland Islands ALA 2015 Europe
5 Albania ALB 2015 Europe
6 Algeria DZA 2015 Africa
7 American Samoa ASM 2015 Oceania
8 Andorra AND 2015 Europe
9 Angola AGO 2015 Africa
10 Anguilla AIA 2015 North America
# ℹ 275 more rows
Question 1: Join Concept
We want to know what continent all of the variables in the population data frame are in.
What type of join should we use?
Which variable in each data frame should we use?
Question 2: Implement the Join
Join the two data frames and name assign the joined data frame to a new data frame population_continents .
How does that look? Take a look at your new data frame!
Question 3: What went wrong?
It might not be obvious, but something is a little weird about this. Go ahead and filter the resulting data frame to see if any of the continent values are NA.
There are! This means that there were no rows in the continents data frame with those countries. That seems a little weird. Take a scroll through the continents data frame.
Do you see what the cause of this is??
Question 4: Let’s fix this!
So, countries have to be spelled the exact same way in each data set. The code below renames the missing countries in the population data set to match the spelling in contintents; we will discuss case_when() and other similar functions in more detail next week! Run this code first, and then re-run the join in the subsequent code chunk: we are no longer missing these values!
# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, year.x <dbl>, population <dbl>, code <chr>,
# year.y <dbl>, continent <chr>
The Bigger Picture
How does what we just did fit into the bigger picture of the class?
You are almost never going to be joining data sets just for fun! Joining data sets is a way to enable investigating more interesting patterns in data.
Now, you could use your joined df to create plots or compute summary statistics to investigate the relationship between continent and total population.
Keep joining in mind when you are looking for datasets in your project teams… if you find two datasets that share some common information, feel free to join them into a larger dataset for exploration!
For extra practice, try making a bar plot of total population in each country.
To see something new, try making a ‘lollipop chart’ of population by country (hint: this will require using geom_point() and geom_segment().