Lecture 9
Duke University
STA 199 Summer 2026: Session I
May 27, 2026
Midterm next Monday!
Practice + some review tomorrow in class
Practice questions for both the multiple choice & live coding portions of the exam will be posted tomorrow!
I will post solutions to the practice questions by Saturday morning
Lab 3 after lecture tomorrow
This will be a shorter-than-usual lab & is meant to serve as a recap of all we’ve learned so far (aka, helpful exam prep!)
Reminder: Lab 3 will be due by 12:00pm (noon) on Sunday; there will be no late window for this lab assignment
Solutions will be posted to the website promptly thereafter
Project work begins next week!
Since the midterm takes place during both the formal lecture and lab timeslots next Monday, lecture time next Tuesday (9:30am - 10:45am) will function as a lab meeting
Both lab sessions next week (T, Th) will be dedicated worktime
You will receive your team assignments (2 groups of 3, 2 groups of 4) in lab next Tuesday
As a reminder, projects are completed in groups BUT grades are individual; your final project grade may ultimately differ from your teammates’ if there is unequal participation
Participation is measured by 1) lab attendance on project work days; 2) commit history in project repos on GitHub; 3) peer review forms
Before Midterm 1…
After Midterm 1…

Loading packaged data (library(); df <- library::df), data importation (read_csv, read_xlsx, read_xls)
mutate, fct_relevel, pivot_*, *_join
Collection: we won’t seriously study data collection; we will discuss data importation methods today
Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.
ggplot, geom_*, etc.
summarize, group_by, count, mean, median, sd, quantile, IQR, cor, etc.
When data is neatly stored in a package, such as tidyverse, loading the package loads the dataset; you can explicitly save a packaged df to your RStudio environment by running df <- library::df
Most often, this is not the case
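For the packaged case, a minimal sketch of the df <- library::df pattern. It assumes the dplyr package (part of the tidyverse) and uses its built-in starwars dataset purely as an illustration; any packaged dataset works the same way.

```r
library(dplyr)

# starwars ships with dplyr; it is used here only as an example dataset
df <- dplyr::starwars

# df now shows up in the environment and behaves like any other data frame
glimpse(df)
```

The package::dataset form makes it explicit where the data comes from, which helps when two loaded packages contain objects with the same name.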

read_csv() - file saved as .csv
read_tsv(), read_delim(), etc. - other file formats
read_excel() - R determines file type for you
read_xls(), read_xlsx() - use if you know whether you have a .xls or .xlsx file
Using googlesheets4: read_sheet() - we haven’t covered this in the videos, but might be useful for your projects
Generally, the format is:
For example, recall we worked with durham-climate.csv in AE08. Where is durham-climate.csv?
When in our AE repo, we read in the data with the following code:


use / to separate folder(s) + file names; file path in quotes
Answer: read_csv("data/durham-climate.csv")
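The same idea can be tried without the AE repo. This is a sketch under stated assumptions: the folder example-project and the file example.csv are hypothetical and are created in a temporary directory so the code runs anywhere.

```r
library(readr)

# Hypothetical setup: a project folder that contains data/example.csv
project_dir <- file.path(tempdir(), "example-project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(id = c(1, 2), value = c("a", "b")),
  file.path(project_dir, "data", "example.csv")
)

# Folder and file names are separated by /, and the path is a quoted string
example <- read_csv(
  file.path(project_dir, "data/example.csv"),
  show_col_types = FALSE
)
example
```

Inside a project, the path is written relative to the project folder, which is why "data/durham-climate.csv" is enough in the AE repo.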
ae-kgsolarz?
Where is durham-climate.csv?



This allows us to save data for later usage, share data outside of R, etc.
Using write_csv():
Read a CSV file with tidy data
Split it into subsets based on features of the data
Write out subsets as CSV files
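The three steps above can be sketched as follows. The data, the column names (region, amount), and the file names are all hypothetical, and everything is written to a temporary folder rather than a real project.

```r
library(readr)
library(dplyr)

out_dir <- tempdir()

# Hypothetical tidy CSV file, created here so the example is self-contained
write_csv(
  tibble(
    region = c("north", "north", "south", "south"),
    amount = c(10, 20, 30, 40)
  ),
  file.path(out_dir, "sales.csv")
)

# 1. Read a CSV file with tidy data
sales <- read_csv(file.path(out_dir, "sales.csv"), show_col_types = FALSE)

# 2. Split it into subsets based on a feature of the data
north <- sales |> filter(region == "north")
south <- sales |> filter(region == "south")

# 3. Write out each subset as its own CSV file
write_csv(north, file.path(out_dir, "sales-north.csv"))
write_csv(south, file.path(out_dir, "sales-south.csv"))
```

write_csv() takes the data frame first and the file path second, mirroring read_csv(), which takes the path.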
case_when() is similar to if_else(), but allows multiple cases; if_else() only allows for 2 cases / conditions
case_when() is often used within mutate() to create a new column
Long story short… use if_else() if you are only considering 2 cases / conditions; for > 2 conditions, use case_when()
In words, we’d read the code below as: “If logical_1 evaluates to TRUE (i.e., this condition is met), then choose result_1; else (i.e., this condition is not met), choose result_2”
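A minimal sketch of both functions inside mutate(). The scores data and the cutoffs are hypothetical; they only illustrate the two-case versus multi-case distinction.

```r
library(dplyr)

scores <- tibble(score = c(95, 82, 67))  # hypothetical data

scores <- scores |>
  mutate(
    # 2 cases: if_else(condition, result if TRUE, result if FALSE)
    passed = if_else(score >= 70, "pass", "fail"),
    # > 2 cases: conditions are checked in order and the first TRUE one wins;
    # the final TRUE line acts as the "else" for everything left over
    grade = case_when(
      score >= 90 ~ "A",
      score >= 80 ~ "B",
      TRUE ~ "C or below"
    )
  )
scores
```

Because case_when() stops at the first condition that is met, the order of the conditions matters: a score of 95 also satisfies score >= 80, but it is labeled "A" because that condition comes first.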

Read an Excel file with non-tidy data
Tidy it up!
We’ve seen lots of functions that deal with numeric data (mean, median, sum, etc.) - what about characters?
stringr is a tidyverse package with lots of functions for dealing with character strings
today: str_detect in stringr

str_detect() identifies if a character / sequence of characters is a substring within a longer string
useful in cases when you need to check some condition, for example:
in a filter()
in an if_else() or case_when()
example: which classes in a list are in the stats department?
General form:
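In its basic form, str_detect(string, pattern) returns TRUE or FALSE for each string. A minimal sketch for the stats department question, assuming a hypothetical list of classes in which stats courses have names containing "STA":

```r
library(dplyr)
library(stringr)

# Hypothetical list of classes
classes <- tibble(course = c("STA 199", "MATH 216", "STA 240", "CS 101"))

# One TRUE / FALSE per course: does the name contain "STA"?
str_detect(classes$course, "STA")

# In a filter(): keep only the stats classes
stats_classes <- classes |>
  filter(str_detect(course, "STA"))

# In an if_else(): label every class by department
classes <- classes |>
  mutate(dept = if_else(str_detect(course, "STA"), "stats", "other"))
```

The pattern is case sensitive by default, so "sta" would not match "STA 199".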

Are these data tidy? Why or why not?
What data tidying must be done to go from the original, non-tidy data to the below tidy version of these data?
