Grammar of Data Transformation

Lecture 3

Katie Solarz

Duke University
STA 199 Summer 2026: Session I

May 18, 2026

Announcements / Reminders

  • First “official” lab takes place immediately after this lecture (well, 15 mins after… stretch your legs & touch grass in between)

  • Office hours start tomorrow; you can find time / location details here

  • Come to office hours and / or post on Ed for help!

Lab Assignments (& Exams): Code Style (Do these things!)

Which of these pieces of code is easier to read?

ggplot(bechdel, aes(x=budget_2013,y=gross_2013,color=binary,size=roi)) +
geom_point(alpha = 0.5) + facet_wrap(~ clean_test) 



ggplot(bechdel, aes(x = budget_2013, y = gross_2013,
                    color = binary, size = roi)) +
  geom_point(alpha = 0.5) + 
  facet_wrap(~ clean_test) 

Lab Assignments (& Exams): Code Style (Do these things!)

Code should follow the tidyverse style:

  • there should be spaces before and line breaks after each + when building a ggplot

  • there should also be spaces before and line breaks after each |> in a data transformation pipeline (we will introduce the “pipe” today!)

  • code should be properly indented (check: Code -> Reindent Lines; equivalently, ⌘I)

  • spaces around = signs and spaces after commas

  • you can find all tidyverse style guidelines here

All code should be visible in the PDF output (should not run off the page)! Use line breaks to prevent this.

Outline

  • Last Time: Grammar of data viz in R (via ggplot())

  • Today: Grammar of ‘data wrangling’

Alison Bechdel

The Bechdel Test

(Dykes to Watch Out For - 1985)

Film passes if it has…

  1. two (named) female characters;
  2. who talk to each other;
  3. about something besides a man.

Recent releases

Title                      Year  Bechdel  Director
Dune 2                     2024           M
Conclave                   2024           M
Wicked 1                   2024           M
Bugonia                    2025           M
Wicked 2                   2025           M
Marty Supreme              2025           M
Wuthering Heights          2026           F
The Devil Wears Prada 2    2026           M

Data Transformation

dplyr

Primary package in the tidyverse for data wrangling and transformation

What is data transformation?

  • Creating new variables (perhaps as a function of some existing variables, but not necessarily so…)

  • Reshaping your data frame

  • Summarizing information about your variables

  • And more!

The pipe

  • The pipe, |>, is an operator (a tool) that allows us to link two functions together in a way that is readable from left to right

  • Use |> to pass the output of the previous line of code as the first argument to the function in the following line of code.

  • When reading code “in English”, say “(and) then” whenever you see a pipe.

  • You can string multiple pipes together to continue passing upstream outputs along to downstream functions; a string of pipes is still referred to as a “single pipeline”
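To make the "first argument" rule concrete, here is a minimal sketch using only base R (it assumes R 4.1 or later, where the native pipe |> is available). The vector values is made up for illustration.

```r
# Requires R >= 4.1 for the native pipe |>
values <- c(4, 9, 16)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: "take values, (and) then take square roots, (and) then sum"
piped <- values |>
  sqrt() |>
  sum()

# The left-hand side becomes the FIRST argument of the next function,
# so this is the same as round(3.14159, digits = 1)
rounded <- 3.14159 |>
  round(digits = 1)

nested   # 9
piped    # 9
rounded  # 3.1
```

Both versions compute the same thing; the piped version simply lists the steps in the order they happen.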

Readability

Consider the following sequence of actions that describe the process of getting to campus in the morning:

I need to find my keys, then unlock my car, then start my car, then drive to campus, then park.

Expressed as a set of nested functions in R pseudocode, this would look like:

park(drive(start_car(unlock_car(find("keys"))), to = "campus"))

Writing it out using pipes gives it a more natural (and easier to read) structure:

find("keys") |>
  unlock_car() |>
  start_car() |>
  drive(to = "campus") |>
  park()

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

Core single data frame functions / verbs:

  • filter() / slice() - pick rows based on criteria
  • select() / rename() - select or rename columns by name
  • pull() - grab a column as a vector
  • arrange() - reorder rows
  • mutate() / transmute() - create or modify columns
  • distinct() - filter for unique rows
  • summarize() / count() - reduce variables to values
  • group_by() / ungroup() - modify other verbs to act on subsets
  • relocate() - change column order
  • … (many more)
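Three of the verbs in this list (pull(), distinct(), relocate()) are easy to mix up with their neighbors, so here is a small sketch of each. It assumes dplyr is installed; bechdel_mini is a hypothetical stand-in built by hand from the first five rows of bechdel, not the real data set.

```r
library(dplyr)

# Hypothetical mini data set: a few columns from the first five rows of bechdel
bechdel_mini <- tibble(
  title  = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  year   = c(2013, 2012, 2013, 2013, 2013),
  binary = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL")
)

# pull(): extract one column as a plain vector (not a data frame)
years <- bechdel_mini |>
  pull(year)

# distinct(): keep one row per unique combination of the listed columns
year_status <- bechdel_mini |>
  distinct(year, binary)

# relocate(): move columns; by default they go to the front
reordered <- bechdel_mini |>
  relocate(binary)

years
year_status
reordered
```

Note the difference between pull(year), which returns a vector, and select(year), which returns a one-column data frame.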

Row Operations

slice()

  • slice(): chooses rows based on location

Ex: Display the first five rows of bechdel:

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  slice(1:5)
# A tibble: 5 × 7
  title            year gross_2013 budget_2013   roi binary clean_test
  <chr>           <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
1 21 & Over        2013   67878146    13000000  5.22 FAIL   notalk    
2 Dredd 3D         2012   55078343    45658735  1.21 PASS   ok        
3 12 Years a Sla…  2013  211714070    20000000 10.6  FAIL   notalk    
4 2 Guns           2013  208105475    61000000  3.41 FAIL   notalk    
5 42               2013  190040426    40000000  4.75 FAIL   men       
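slice() accepts any set of row positions, not just 1:5. The sketch below shows a few variations; it assumes dplyr 1.0.0 or later (for slice_head()), and bechdel_mini is a hypothetical hand-typed copy of the five rows shown above.

```r
library(dplyr)

# Hypothetical mini data set: the first five rows of bechdel, two columns only
bechdel_mini <- tibble(
  title = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  year  = c(2013, 2012, 2013, 2013, 2013)
)

# A range of positions
first_two <- bechdel_mini |>
  slice(1:2)

# Specific positions (rows 1 and 4)
picked <- bechdel_mini |>
  slice(c(1, 4))

# Negative positions drop rows instead of keeping them
without_first <- bechdel_mini |>
  slice(-1)

# slice_head() is a helper that does the same job as slice(1:n)
head_two <- bechdel_mini |>
  slice_head(n = 2)
```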

arrange()

  • arrange(): changes the order of the rows; default is ascending order

Order the rows of bechdel by year:

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  arrange(year)
# A tibble: 1,615 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 Back to the F…  1990  590818548    71319016  8.28 FAIL   notalk    
 2 Child's Play 2  1990  108888347    23178680  4.70 PASS   ok        
 3 Dark Angel (I…  1990   15592338    12480828  1.25 FAIL   nowomen   
 4 Die Hard 2      1990  636768095   124808278  5.10 FAIL   dubious   
 5 Edward Scisso…  1990  192479280    35659508  5.40 PASS   ok        
 6 Flatliners      1990  218621858    46357360  4.72 PASS   ok        
 7 Ghost           1990 1310899333    39225459 33.4  FAIL   men       
 8 Goodfellas      1990  166686124    44574385  3.74 FAIL   men       
 9 Home Alone      1990 1359422317    26744631 50.8  FAIL   men       
10 Nikita          1990   17893838    12480828  1.43 PASS   ok        
# ℹ 1,605 more rows
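Since ascending order is only the default, a short sketch of the other common cases may help: descending order with desc(), and a second variable to break ties. It assumes dplyr is installed; bechdel_mini is a hypothetical hand-typed copy of the first five rows of bechdel.

```r
library(dplyr)

# Hypothetical mini data set: the first five rows of bechdel, three columns only
bechdel_mini <- tibble(
  title       = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  year        = c(2013, 2012, 2013, 2013, 2013),
  budget_2013 = c(13000000, 45658735, 20000000, 61000000, 40000000)
)

# Newest movies first; within the same year, smallest budget first
sorted <- bechdel_mini |>
  arrange(desc(year), budget_2013)

sorted
```

Here the four 2013 movies come first, ordered by budget, and the single 2012 movie (Dredd 3D) comes last.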

sample_n()

  • sample_n(): take a random subset of the rows

Display five random rows of bechdel:

bechdel 
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  sample_n(5)
# A tibble: 5 × 7
  title            year gross_2013 budget_2013   roi binary clean_test
  <chr>           <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
1 Saving Grace     2000   54069594     5411634 9.99  PASS   ok        
2 The Adventures…  2011  467703233   134639802 3.47  FAIL   notalk    
3 Dazed and Conf…  1993   25640964    11125966 2.30  PASS   ok        
4 Side Effects     2013   92461120    30000000 3.08  PASS   ok        
5 Virus            1999   62423681   104884652 0.595 FAIL   dubious   
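Because the rows are chosen at random, you will get a different answer each time unless you set a seed first. The sketch below shows this with slice_sample(); sample_n() still works, but the dplyr documentation marks it as superseded by slice_sample() (dplyr 1.0.0 and later). The seed value 199 and the data frame bechdel_mini are arbitrary choices for illustration.

```r
library(dplyr)

# Hypothetical mini data set: the first five rows of bechdel, two columns only
bechdel_mini <- tibble(
  title = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  year  = c(2013, 2012, 2013, 2013, 2013)
)

# Setting the same seed before each draw makes the "random" rows reproducible
set.seed(199)
random_two <- bechdel_mini |>
  slice_sample(n = 2)

set.seed(199)
random_two_again <- bechdel_mini |>
  slice_sample(n = 2)

# Same seed, same rows
identical(random_two, random_two_again)
```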

filter()

  • filter(): chooses rows based on column values
  • You should think of the logic you provide within a filter() function call as telling R what observations it should keep

Keep only the rows of bechdel that pass the test:

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  filter(binary == "PASS")
# A tibble: 753 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 Dredd 3D        2012   55078343    45658735  1.21 PASS   ok        
 2 About Time      2013  102648667    12000000  8.55 PASS   ok        
 3 Admission       2013   36014634    13000000  2.77 PASS   ok        
 4 American Hust…  2013  397915817    40000000  9.95 PASS   ok        
 5 August: Osage…  2013   87609748    25000000  3.50 PASS   ok        
 6 Beautiful Cre…  2013   75392809    50000000  1.51 PASS   ok        
 7 Blue Jasmine    2013  101793664    18000000  5.66 PASS   ok        
 8 Carrie          2013  120268278    30000000  4.01 PASS   ok        
 9 Despicable Me…  2013 1338831390    76000000 17.6  PASS   ok        
10 Elysium         2013  379242208   120000000  3.16 PASS   ok        
# ℹ 743 more rows

filter()

Keep only the movies released before 2000

bechdel |>
  filter(year < 2000)
# A tibble: 337 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 10 Things I H…  1999  137877156    18180006  7.58 PASS   ok        
 2 8MM             1999  185774868    55938481  3.32 FAIL   notalk    
 3 American Beau…  1999  680094591    20976930 32.4  PASS   ok        
 4 American Pie    1999  470616170    16781544 28.0  FAIL   men       
 5 Analyze This    1999  396843410    41953861  9.46 FAIL   notalk    
 6 Anna and the …  1999  109782424   104884652  1.05 FAIL   men       
 7 Anywhere But …  1999   52172744    32164627  1.62 FAIL   dubious   
 8 Austin Powers…  1999  722127642    48946171 14.8  FAIL   notalk    
 9 Being John Ma…  1999   77252870    18180006  4.25 PASS   ok        
10 Black and Whi…  1999   14659560    13984620  1.05 PASS   ok        
# ℹ 327 more rows

filter()

Often (but not always), looks like:

filter(variable [logical operator] value)

or

filter(!is.na(variable))
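The second pattern matters because filter() silently drops rows where the condition evaluates to NA. Here is a minimal sketch; the data frame movies and its values are entirely made up for illustration (bechdel itself is not used here).

```r
library(dplyr)

# Made-up data with one missing budget
movies <- tibble(
  title  = c("Movie A", "Movie B", "Movie C"),
  budget = c(10, NA, 30)
)

# NA > 15 is NA, so Movie B is dropped along with Movie A
big_budget <- movies |>
  filter(budget > 15)

# Keep only rows where budget is known
known_budget <- movies |>
  filter(!is.na(budget))

# Keep only rows where budget is missing
missing_budget <- movies |>
  filter(is.na(budget))
```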

Some logical operators

operator    definition
<           is less than?
<=          is less than or equal to?
>           is greater than?
>=          is greater than or equal to?
==          is exactly equal to?
!=          is not equal to?

More logical operators

operator       definition
x & y          is x AND y?
x | y          is x OR y?
is.na(x)       is x NA?
!is.na(x)      is x not NA?
x %in% y       is x in y?
!(x %in% y)    is x not in y?
!x             is not x? (only makes sense if x is TRUE or FALSE)
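Each of these operators returns TRUE or FALSE for every element, which is exactly what filter() uses to decide which rows to keep. A small base R sketch, using two made-up vectors named year and binary:

```r
# Made-up values for illustration
year   <- c(1990, 1999, 2005, 2013)
binary <- c("PASS", "FAIL", "PASS", "FAIL")

both    <- year < 2000 & binary == "PASS"   # TRUE FALSE FALSE FALSE
either  <- year < 2000 | binary == "PASS"   # TRUE TRUE TRUE FALSE
in_set  <- year %in% c(1990, 2013)          # TRUE FALSE FALSE TRUE
not_in  <- !(year %in% c(1990, 2013))       # FALSE TRUE TRUE FALSE
missing <- is.na(c(1, NA))                  # FALSE TRUE
```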

filter()

Keep only the movies from before 2000 AND that pass the test

bechdel |>
  filter(year < 2000 & binary == "PASS")
# A tibble: 147 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 10 Things I …  1999  137877156    18180006  7.58  PASS   ok        
 2 American Bea…  1999  680094591    20976930 32.4   PASS   ok        
 3 Being John M…  1999   77252870    18180006  4.25  PASS   ok        
 4 Black and Wh…  1999   14659560    13984620  1.05  PASS   ok        
 5 Boys Don't C…  1999   45144602     2796924 16.1   PASS   ok        
 6 But I'm a Ch…  1999    6761310     1678154  4.03  PASS   ok        
 7 Carrie 2: Th…  1999   49674054    29367703  1.69  PASS   ok        
 8 Cruel Intent…  1999  159471926    15383082 10.4   PASS   ok        
 9 Dick           1999   17555926    18180006  0.966 PASS   ok        
10 Drop Dead Go…  1999   29567426    13984620  2.11  PASS   ok        
# ℹ 137 more rows
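Two related points, sketched below under the assumption that dplyr is installed: separating conditions with a comma inside filter() behaves like &, and | keeps rows that satisfy at least one condition. bechdel_mini is a hypothetical hand-typed copy of the first five rows of bechdel.

```r
library(dplyr)

# Hypothetical mini data set: the first five rows of bechdel, four columns only
bechdel_mini <- tibble(
  title      = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  year       = c(2013, 2012, 2013, 2013, 2013),
  binary     = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL"),
  clean_test = c("notalk", "ok", "notalk", "notalk", "men")
)

# AND written with &
and_amp <- bechdel_mini |>
  filter(year == 2013 & binary == "FAIL")

# AND written with a comma: same rows as above
and_comma <- bechdel_mini |>
  filter(year == 2013, binary == "FAIL")

# OR: keep rows that pass the test, or that failed for the reason "men"
either <- bechdel_mini |>
  filter(binary == "PASS" | clean_test == "men")
```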

Column Operations

select()

  • select(): changes whether or not a column is included.

Keep only the title and test status.

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  select(title, clean_test)
# A tibble: 1,615 × 2
   title                  clean_test
   <chr>                  <chr>     
 1 21 & Over              notalk    
 2 Dredd 3D               ok        
 3 12 Years a Slave       notalk    
 4 2 Guns                 notalk    
 5 42                     men       
 6 47 Ronin               men       
 7 A Good Day to Die Hard notalk    
 8 About Time             ok        
 9 Admission              ok        
10 After Earth            notalk    
# ℹ 1,605 more rows

select()

  • select() also allows you to exclude particular columns using the - symbol

Again, keep only the title and test status but this time by explicitly excluding all other columns

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  select(-(2:6))
# A tibble: 1,615 × 2
   title                  clean_test
   <chr>                  <chr>     
 1 21 & Over              notalk    
 2 Dredd 3D               ok        
 3 12 Years a Slave       notalk    
 4 2 Guns                 notalk    
 5 42                     men       
 6 47 Ronin               men       
 7 A Good Day to Die Hard notalk    
 8 About Time             ok        
 9 Admission              ok        
10 After Earth            notalk    
# ℹ 1,605 more rows
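Excluding by position works, but it breaks if the column order ever changes. As a sketch of an alternative, columns can also be excluded by name or by a name range. bechdel_mini is a hypothetical hand-typed copy of the first five rows of bechdel; its roi column is computed as gross_2013 / budget_2013, which appears to match the printed values up to rounding.

```r
library(dplyr)

# Hypothetical mini data set with the same seven columns as bechdel
bechdel_mini <- tibble(
  title       = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  year        = c(2013, 2012, 2013, 2013, 2013),
  gross_2013  = c(67878146, 55078343, 211714070, 208105475, 190040426),
  budget_2013 = c(13000000, 45658735, 20000000, 61000000, 40000000),
  roi         = gross_2013 / budget_2013,
  binary      = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL"),
  clean_test  = c("notalk", "ok", "notalk", "notalk", "men")
)

# Exclude by position (as on the slide)
by_position <- bechdel_mini |>
  select(-(2:6))

# Exclude a range of columns by name
by_range <- bechdel_mini |>
  select(-(year:binary))

# Exclude individual columns by name
by_name <- bechdel_mini |>
  select(-c(year, gross_2013, budget_2013, roi, binary))
```

All three results contain only title and clean_test.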

rename()

  • rename(): changes the name of columns.

Rename clean_test to test_result

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  rename(test_result = clean_test)
# A tibble: 1,615 × 7
   title         year gross_2013 budget_2013    roi binary test_result
   <chr>        <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>      
 1 21 & Over     2013   67878146    13000000  5.22  FAIL   notalk     
 2 Dredd 3D      2012   55078343    45658735  1.21  PASS   ok         
 3 12 Years a …  2013  211714070    20000000 10.6   FAIL   notalk     
 4 2 Guns        2013  208105475    61000000  3.41  FAIL   notalk     
 5 42            2013  190040426    40000000  4.75  FAIL   men        
 6 47 Ronin      2013  184166317   225000000  0.819 FAIL   men        
 7 A Good Day …  2013  371598396    92000000  4.04  FAIL   notalk     
 8 About Time    2013  102648667    12000000  8.55  PASS   ok         
 9 Admission     2013   36014634    13000000  2.77  PASS   ok         
10 After Earth   2013  304895295   130000000  2.35  FAIL   notalk     
# ℹ 1,605 more rows

rename()

Generally, looks like:

rename(new_variable_name = old_variable_name)
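The same pattern extends to several columns at once: give rename() multiple new = old pairs separated by commas. A minimal sketch, where bechdel_mini is a hypothetical hand-typed subset of bechdel and pass_fail is a made-up new name:

```r
library(dplyr)

# Hypothetical mini data set: the first three rows of bechdel, three columns only
bechdel_mini <- tibble(
  title      = c("21 & Over", "Dredd 3D", "12 Years a Slave"),
  binary     = c("FAIL", "PASS", "FAIL"),
  clean_test = c("notalk", "ok", "notalk")
)

# Rename two columns in one call; all other columns are left untouched
renamed <- bechdel_mini |>
  rename(
    test_result = clean_test,
    pass_fail   = binary
  )

names(renamed)
```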

mutate()

  • mutate(): changes the values of columns (i.e., modifies existing columns) and creates new columns.

Create a new variable for the budget in millions

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  mutate(budget_million = budget_2013 / 1000000)
# A tibble: 1,615 × 8
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows
# ℹ 1 more variable: budget_million <dbl>

mutate()

Generally, looks like:

mutate(new_variable_name = function(existing_variable))
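A single mutate() call can create several columns, and the "function" can be ordinary arithmetic on more than one existing column. The sketch below assumes dplyr is installed; bechdel_mini is a hypothetical hand-typed subset of bechdel, and gross_million and roi_check are made-up names. Computing gross divided by budget appears to reproduce the roi column in the printout, up to rounding.

```r
library(dplyr)

# Hypothetical mini data set: the first three rows of bechdel
bechdel_mini <- tibble(
  title       = c("21 & Over", "Dredd 3D", "12 Years a Slave"),
  gross_2013  = c(67878146, 55078343, 211714070),
  budget_2013 = c(13000000, 45658735, 20000000)
)

with_new_columns <- bechdel_mini |>
  mutate(
    budget_million = budget_2013 / 1000000,
    gross_million  = gross_2013 / 1000000,
    # Columns created earlier in the same call can be used right away
    roi_check      = gross_million / budget_million
  )

with_new_columns
```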

Groups of rows

count()

  • count(): count unique values of one or more variables.

Count how many movies pass or fail the Bechdel test.

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  count(binary)
# A tibble: 2 × 2
  binary     n
  <chr>  <int>
1 FAIL     862
2 PASS     753
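count() also accepts more than one variable, in which case it counts each combination, and sort = TRUE puts the largest counts first. A sketch using a hypothetical hand-typed copy of two columns from the first ten rows of bechdel:

```r
library(dplyr)

# Hypothetical mini data set: binary and clean_test for the first ten rows of bechdel
bechdel_mini <- tibble(
  binary     = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL",
                 "FAIL", "FAIL", "PASS", "PASS", "FAIL"),
  clean_test = c("notalk", "ok", "notalk", "notalk", "men",
                 "men", "notalk", "ok", "ok", "notalk")
)

# One row per (binary, clean_test) combination, largest count first
combo_counts <- bechdel_mini |>
  count(binary, clean_test, sort = TRUE)

combo_counts
```

In these ten rows there are five FAIL / notalk movies, three PASS / ok movies, and two FAIL / men movies.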

group_by()

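  • group_by(): groups the rows by the values of one or more variables, so that later verbs such as summarize() act within each group; on its own it changes nothing but the grouping information (note the # Groups: line in the output)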

Group by movies passing or failing the test

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  group_by(binary)
# A tibble: 1,615 × 7
# Groups:   binary [2]
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows
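Since the grouped output looks almost identical to the original, it can help to inspect the grouping directly. A minimal sketch, assuming dplyr is installed; bechdel_mini is a hypothetical hand-typed copy of the first five rows of bechdel.

```r
library(dplyr)

# Hypothetical mini data set: the first five rows of bechdel, two columns only
bechdel_mini <- tibble(
  title  = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42"),
  binary = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL")
)

grouped <- bechdel_mini |>
  group_by(binary)

# Which variable(s) define the groups, and how many groups are there?
group_vars(grouped)  # "binary"
n_groups(grouped)    # 2

# The rows themselves are unchanged
nrow(grouped)        # 5

# ungroup() removes the grouping again
ungrouped <- grouped |>
  ungroup()
```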

summarize()

  • summarize(): collapses a group into a single row.

Compute average budget

bechdel 
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows


bechdel |>
  summarize(mean_budget = mean(budget_2013))
# A tibble: 1 × 1
  mean_budget
        <dbl>
1   57035015.

summarize()

Generally, looks like:

summarize(result_variable_name = function(existing_variable))

group_by() + summarize()

Group by movies passing/failing and compute within-group average budget

bechdel |>
  group_by(binary) |>
  summarize(mean_budget = mean(budget_2013))
# A tibble: 2 × 2
  binary mean_budget
  <chr>        <dbl>
1 FAIL     65877024.
2 PASS     46913086.
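summarize() can compute several summaries at once, and n() reports how many rows are in each group. The sketch below uses a hypothetical hand-typed copy of two columns from the first ten rows of bechdel, so its numbers differ from the full-data results above.

```r
library(dplyr)

# Hypothetical mini data set: binary and budget_2013 for the first ten rows of bechdel
bechdel_mini <- tibble(
  binary      = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL",
                  "FAIL", "FAIL", "PASS", "PASS", "FAIL"),
  budget_2013 = c(13000000, 45658735, 20000000, 61000000, 40000000,
                  225000000, 92000000, 12000000, 13000000, 130000000)
)

budget_summary <- bechdel_mini |>
  group_by(binary) |>
  summarize(
    n_movies      = n(),                  # rows per group
    mean_budget   = mean(budget_2013),
    median_budget = median(budget_2013)
  )

budget_summary
```

Among these ten movies, the seven that fail have a mean budget of 83,000,000 and a median of 61,000,000; the three that pass have a median of 13,000,000.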

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Start with the bechdel data frame:

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Filter for rows where binary is equal to "PASS":

bechdel |>
  filter(binary == "PASS")
# A tibble: 753 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 Dredd 3D        2012   55078343    45658735  1.21 PASS   ok        
 2 About Time      2013  102648667    12000000  8.55 PASS   ok        
 3 Admission       2013   36014634    13000000  2.77 PASS   ok        
 4 American Hust…  2013  397915817    40000000  9.95 PASS   ok        
 5 August: Osage…  2013   87609748    25000000  3.50 PASS   ok        
 6 Beautiful Cre…  2013   75392809    50000000  1.51 PASS   ok        
 7 Blue Jasmine    2013  101793664    18000000  5.66 PASS   ok        
 8 Carrie          2013  120268278    30000000  4.01 PASS   ok        
 9 Despicable Me…  2013 1338831390    76000000 17.6  PASS   ok        
10 Elysium         2013  379242208   120000000  3.16 PASS   ok        
# ℹ 743 more rows

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Arrange the rows in descending order of roi:

bechdel |>
  filter(binary == "PASS") |>
  arrange(desc(roi))
# A tibble: 753 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 The Blair Wit…  1999  543776715      839077 648.  PASS   ok        
 2 The Devil Ins…  2012  157289709     1014639 155.  PASS   ok        
 3 My Big Fat Gr…  2002  768922942     6475896 119.  PASS   ok        
 4 Chasing Amy     1997   39417963      362810 109.  PASS   ok        
 5 Slacker         1991    4200140       39349 107.  PASS   ok        
 6 Insidious       2010  164379554     1602348 103.  PASS   ok        
 7 Paranormal Ac…  2010  280159759     3204696  87.4 PASS   ok        
 8 Paranormal Ac…  2011  322170936     5178454  62.2 PASS   ok        
 9 The Last Exor…  2010  118787648     1922817  61.8 PASS   ok        
10 Cinderella      1997  246710482     4208591  58.6 PASS   ok        
# ℹ 743 more rows

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Select columns title and roi:

bechdel |>
  filter(binary == "PASS") |>
  arrange(desc(roi)) |>
  select(title, roi)
# A tibble: 753 × 2
   title                      roi
   <chr>                    <dbl>
 1 The Blair Witch Project  648. 
 2 The Devil Inside         155. 
 3 My Big Fat Greek Wedding 119. 
 4 Chasing Amy              109. 
 5 Slacker                  107. 
 6 Insidious                103. 
 7 Paranormal Activity 2     87.4
 8 Paranormal Activity 3     62.2
 9 The Last Exorcism         61.8
10 Cinderella                58.6
# ℹ 743 more rows
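Two things about this pipeline are worth checking for yourself: it is equivalent to the nested version of the same calls, and it does not change bechdel unless you assign the result to a name. The sketch below assumes dplyr is installed; bechdel_mini is a hypothetical hand-typed copy of the first ten rows of bechdel (with roi computed as gross_2013 / budget_2013), and top_passing is a made-up name.

```r
library(dplyr)

# Hypothetical mini data set: the first ten rows of bechdel
bechdel_mini <- tibble(
  title = c("21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "42",
            "47 Ronin", "A Good Day to Die Hard", "About Time",
            "Admission", "After Earth"),
  gross_2013 = c(67878146, 55078343, 211714070, 208105475, 190040426,
                 184166317, 371598396, 102648667, 36014634, 304895295),
  budget_2013 = c(13000000, 45658735, 20000000, 61000000, 40000000,
                  225000000, 92000000, 12000000, 13000000, 130000000),
  roi = gross_2013 / budget_2013,
  binary = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL",
             "FAIL", "FAIL", "PASS", "PASS", "FAIL")
)

# Save the result of the pipeline under a new name
top_passing <- bechdel_mini |>
  filter(binary == "PASS") |>
  arrange(desc(roi)) |>
  select(title, roi)

# The same steps as nested calls: correct, but read from the inside out
nested <- select(arrange(filter(bechdel_mini, binary == "PASS"), desc(roi)), title, roi)

top_passing
nrow(bechdel_mini)  # still 10: the original data frame is unchanged
```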

AE 03