Intro to dplyr

We have discussed some basic data manipulation commands in base R in the data and descriptive statistics page.

The tidyverse includes the package dplyr, which is designed to help us readily tackle data manipulation tasks, including the sorts of tasks we will encounter in MATH 140.

I encourage you to reproduce all of the results on this page in your own RStudio session.

Load the earthquakes data that came with our text. This data set gives us information about all major 20th century earthquakes.

If you’ve installed the package openintro you can load the data set into your session by loading the package, and then by running df <- earthquakes Alternatively, you can load the data set into your session directly from its url:

df <- read.csv("https://www.openintro.org/data/csv/earthquakes.csv")

To be able to use the dplyr functions in an RStudio session we must load dplyr, which we do when we load the tidyverse:

library(tidyverse)

The key to using dplyr is the %>% command, which I read as “then”, followed by an action, such as filter() or select(), which we discuss below.

A brief list of commonly used dplyr action verbs:

dplyr command Description
select() selecting columns (variables)
filter() filter (subset) rows
group_by() group the data
summarise() summarize or aggregate the data
arrange() sort the data
join() joining data frames
mutate() create new variables (columns)

Filter

We use the filter() action to find those rows (observations) that meet some criterion for a variable.

For instance, the following code starts with the data frame df, then finds (via filter()) those earthquakes occurring in the region equal to the United States:

df %>% filter(region=="United States")
## # A tibble: 8 × 7
##    year month      day richter area                region        deaths
##   <dbl> <chr>    <dbl>   <dbl> <chr>               <chr>          <dbl>
## 1  1906 April       18     7.7 San Francisco       United States   3000
## 2  1933 March       10     6.2 Long Beach          United States    115
## 3  1964 March       27     9.2 Alaska              United States    131
## 4  1971 February     9     6.6 San Fernando Valley United States     65
## 5  1989 October     17     6.9 San Francisco       United States     63
## 6  1992 June        28     7.3 Landers             United States      3
## 7  1992 June        28     6.5 Big Bear            United States     NA
## 8  1994 January     17     6.8 Northridge          United States     61

Note: We use the double equal signs == when we are searching for entries that match a particular value or character. We use a single equal sign (e.g., x = 4) to assign a value or character to a variable.


Question: Which earthquakes registered 8.0 or higher on the Richter scale?


df %>% filter(richter>=8.0)
## # A tibble: 12 × 7
##     year month       day richter area                   region        deaths
##    <dbl> <chr>     <dbl>   <dbl> <chr>                  <chr>          <dbl>
##  1  1906 January      31     8.8 Esmeraldas (off coast) Ecuador         1000
##  2  1906 August       17     8.6 Valparaiso             Chile           3882
##  3  1907 October      21     8.1 Central                Asia           12000
##  4  1931 August       10     8   Xinjiang               China          10000
##  5  1933 March         2     8.4 Sanriku                Japan           2990
##  6  1934 January      15     8.1 Bihar                  India-Nepal    10700
##  7  1945 November     27     8   Makran Coast           Pakistan        4000
##  8  1946 December     29     8.1 Honshu                 Japan           1362
##  9  1950 August       15     8.6 Assam                  India           1526
## 10  1960 May          21     9.5 South                  Chile           1655
## 11  1964 March        27     9.2 Alaska                 United States    131
## 12  1985 September    19     8   Michoacan              Mexico          9500

After filtering, we can arrange() (sort) this list of earthquakes in descending order of magnitude on the richter scale:

df %>% filter(richter>=8.0) %>% arrange(desc(richter))
## # A tibble: 12 × 7
##     year month       day richter area                   region        deaths
##    <dbl> <chr>     <dbl>   <dbl> <chr>                  <chr>          <dbl>
##  1  1960 May          21     9.5 South                  Chile           1655
##  2  1964 March        27     9.2 Alaska                 United States    131
##  3  1906 January      31     8.8 Esmeraldas (off coast) Ecuador         1000
##  4  1906 August       17     8.6 Valparaiso             Chile           3882
##  5  1950 August       15     8.6 Assam                  India           1526
##  6  1933 March         2     8.4 Sanriku                Japan           2990
##  7  1907 October      21     8.1 Central                Asia           12000
##  8  1934 January      15     8.1 Bihar                  India-Nepal    10700
##  9  1946 December     29     8.1 Honshu                 Japan           1362
## 10  1931 August       10     8   Xinjiang               China          10000
## 11  1945 November     27     8   Makran Coast           Pakistan        4000
## 12  1985 September    19     8   Michoacan              Mexico          9500

Question: Which earthquakes less than 6 on the richter scale caused the most fatalities?


We can use the head() command to list the top three:

head((df %>% filter(richter < 6)) %>% arrange(desc(deaths)),3)
## # A tibble: 3 × 7
##    year month      day richter area      region             deaths
##   <dbl> <chr>    <dbl>   <dbl> <chr>     <chr>               <dbl>
## 1  1960 February    29     5.7 Agadir    Morocco             12000
## 2  1969 July        25     5.9 Guangdong China                3000
## 3  1931 April       27     5.7 Border    Armenia-Azerbaijan   2800

Question: Have any major earthquakes occurred in Japan in May?


df %>% filter(month=="May") %>% filter(region=="Japan")
## # A tibble: 0 × 7
## # ℹ 7 variables: year <dbl>, month <chr>, day <dbl>, richter <dbl>, area <chr>,
## #   region <chr>, deaths <dbl>

Nope! The table has 0 rows.

As a check, here’s a table of earthquakes by month in Japan:

table((df %>% filter(region=="Japan"))$month)
## 
##  December   January      June     March September 
##         1         2         1         2         2

Question: Find the average magnitude of major earthquakes in the US.


mean((df %>% filter(region=="United States"))$richter)
## [1] 7.15

It looks like the average is 7.15.

Select

We can select() certain columns to display (and choose the order in which to display them). The code below creates a new data frame called df1, obtained from df by selecting just the region, richter, and deaths variables. Then we display the first 5 rows of df1.

df1 = df %>% select(region, richter, deaths)
head(df1,5)
## # A tibble: 5 × 3
##   region    richter deaths
##   <chr>       <dbl>  <dbl>
## 1 Guatemala     7.5   2000
## 2 Russia        6.4   4700
## 3 Turkey        7     3500
## 4 Turkey        5.8   1000
## 5 India         7.5  19000

Summary statistics by group


Task: Group the earthquakes by month and record

  1. how many earthquakes there were in each month,
  2. the average magnitude, and
  3. the standard deviation of the magnitudes for each month.

df %>% group_by(month) %>%
  summarise(size=length(richter),
            avg=mean(richter),
            stdev=sd(richter))
## # A tibble: 12 × 4
##    month      size   avg stdev
##    <chr>     <int> <dbl> <dbl>
##  1 April         7  7.09 0.664
##  2 August       12  7.42 0.808
##  3 December     12  7.03 0.710
##  4 February      9  6.71 0.683
##  5 January      11  7.35 0.717
##  6 July         10  7    0.650
##  7 June          8  7.12 0.381
##  8 March        17  7.02 0.859
##  9 May          14  7.16 0.917
## 10 November      6  7.33 0.493
## 11 October       8  7.05 0.760
## 12 September     9  7.28 0.618

Note: This is a very useful thing to do. Later in the course we will regularly want to find the sample mean, sample size, and sample standard deviation across different groups like this.

Here’s a nice display of these summary statistics, which uses the package kableExtra:

library(kableExtra)
df %>% group_by(month) %>%
  summarise(size=length(richter),
            avg=round(mean(richter),2),
            stdev=round(sd(richter),2)) %>%
  kable() %>%   kable_styling(bootstrap_options = c("striped","hover"),full_width=FALSE,position="left")
month size avg stdev
April 7 7.09 0.66
August 12 7.42 0.81
December 12 7.03 0.71
February 9 6.71 0.68
January 11 7.35 0.72
July 10 7.00 0.65
June 8 7.12 0.38
March 17 7.02 0.86
May 14 7.16 0.92
November 6 7.33 0.49
October 8 7.05 0.76
September 9 7.28 0.62

Combining dplyr with ggplot

The packages dplyr and ggplot2 are both a part of the tidyverse and can work together. For instance, we can make a plot on some version of the data frame first modified by dplyr.


Question: Make a line plot that shows the cumulative death count from 20th century earthquake. By cumulative death count, we mean that for each year, we want to know the total number of deaths from earthquakes during the century up to and including that year.


df %>% mutate(cum.death=cumsum(deaths)) %>%
  ggplot()+
  geom_line(aes(x=year,y=cum.death))+
  ylab("cumulative major earthquake fatalities")

It looks like the 70s were particularly devastating. What are those two most devastating earthquakes?

head(df %>% arrange(desc(deaths)),2)
## # A tibble: 2 × 7
##    year month   day richter area     region deaths
##   <dbl> <chr> <dbl>   <dbl> <chr>    <chr>   <dbl>
## 1  1970 May      31     7.9 Chimbote Peru   700000
## 2  1976 July     28     7.5 Tangshan China  255000

Note: The cumsum() command is a built-in R command that gives progressive cumulative sums of a vector.

vec=c(2,4,1,-3)
cumsum(vec)
## [1] 2 6 7 4

Cheat Sheet

The following page has a downloadable dplyr cheat sheet (pdf)