dplyr
We have discussed some basic data manipulation commands in base R in the data and descriptive statistics page.
The tidyverse
includes the package dplyr
,
which is designed to help us readily tackle data manipulation tasks,
including the sorts of tasks we will encounter in MATH 140.
I encourage you to reproduce all of the results on this page in your own RStudio session.
Load the earthquakes data that came with our text. This data set gives us information about all major 20th century earthquakes.
If you’ve installed the package openintro
you can load
the data set into your session by loading the package, and then by
running df <- earthquakes
Alternatively, you can load
the data set into your session directly from its url:
df <- read.csv("https://www.openintro.org/data/csv/earthquakes.csv")
To be able to use the dplyr
functions in an RStudio
session we must load dplyr
, which we do when we load the
tidyverse:
library(tidyverse)
The key to using
dplyr
is the%>%
command, which I read as “then”, followed by an action, such asfilter()
orselect()
, which we discuss below.
A brief list of commonly used dplyr
action verbs:
dplyr command | Description |
---|---|
select() |
selecting columns (variables) |
filter() |
filter (subset) rows |
group_by() |
group the data |
summarise() |
summarize or aggregate the data |
arrange() |
sort the data |
join() |
joining data frames |
mutate() |
create new variables (columns) |
We use the filter()
action to find those rows
(observations) that meet some criterion for a variable.
For instance, the following code starts with the data frame
df
, then finds (via filter()
) those
earthquakes occurring in the region equal to the United States:
df %>% filter(region=="United States")
## # A tibble: 8 × 7
## year month day richter area region deaths
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1906 April 18 7.7 San Francisco United States 3000
## 2 1933 March 10 6.2 Long Beach United States 115
## 3 1964 March 27 9.2 Alaska United States 131
## 4 1971 February 9 6.6 San Fernando Valley United States 65
## 5 1989 October 17 6.9 San Francisco United States 63
## 6 1992 June 28 7.3 Landers United States 3
## 7 1992 June 28 6.5 Big Bear United States NA
## 8 1994 January 17 6.8 Northridge United States 61
Note: We use the double equal signs ==
when we are searching for entries that match a particular value or
character. We use a single equal sign (e.g., x = 4
) to
assign a value or character to a variable.
Question: Which earthquakes registered 8.0 or higher on the Richter scale?
df %>% filter(richter>=8.0)
## # A tibble: 12 × 7
## year month day richter area region deaths
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1906 January 31 8.8 Esmeraldas (off coast) Ecuador 1000
## 2 1906 August 17 8.6 Valparaiso Chile 3882
## 3 1907 October 21 8.1 Central Asia 12000
## 4 1931 August 10 8 Xinjiang China 10000
## 5 1933 March 2 8.4 Sanriku Japan 2990
## 6 1934 January 15 8.1 Bihar India-Nepal 10700
## 7 1945 November 27 8 Makran Coast Pakistan 4000
## 8 1946 December 29 8.1 Honshu Japan 1362
## 9 1950 August 15 8.6 Assam India 1526
## 10 1960 May 21 9.5 South Chile 1655
## 11 1964 March 27 9.2 Alaska United States 131
## 12 1985 September 19 8 Michoacan Mexico 9500
After filtering, we can arrange()
(sort) this list of
earthquakes in descending order of magnitude on the richter scale:
df %>% filter(richter>=8.0) %>% arrange(desc(richter))
## # A tibble: 12 × 7
## year month day richter area region deaths
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1960 May 21 9.5 South Chile 1655
## 2 1964 March 27 9.2 Alaska United States 131
## 3 1906 January 31 8.8 Esmeraldas (off coast) Ecuador 1000
## 4 1906 August 17 8.6 Valparaiso Chile 3882
## 5 1950 August 15 8.6 Assam India 1526
## 6 1933 March 2 8.4 Sanriku Japan 2990
## 7 1907 October 21 8.1 Central Asia 12000
## 8 1934 January 15 8.1 Bihar India-Nepal 10700
## 9 1946 December 29 8.1 Honshu Japan 1362
## 10 1931 August 10 8 Xinjiang China 10000
## 11 1945 November 27 8 Makran Coast Pakistan 4000
## 12 1985 September 19 8 Michoacan Mexico 9500
Question: Which earthquakes less than 6 on the richter scale caused the most fatalities?
We can use the head()
command to list the top three:
head((df %>% filter(richter < 6)) %>% arrange(desc(deaths)),3)
## # A tibble: 3 × 7
## year month day richter area region deaths
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1960 February 29 5.7 Agadir Morocco 12000
## 2 1969 July 25 5.9 Guangdong China 3000
## 3 1931 April 27 5.7 Border Armenia-Azerbaijan 2800
Question: Have any major earthquakes occurred in Japan in May?
df %>% filter(month=="May") %>% filter(region=="Japan")
## # A tibble: 0 × 7
## # ℹ 7 variables: year <dbl>, month <chr>, day <dbl>, richter <dbl>, area <chr>,
## # region <chr>, deaths <dbl>
Nope! The table has 0 rows.
As a check, here’s a table of earthquakes by month in Japan:
table((df %>% filter(region=="Japan"))$month)
##
## December January June March September
## 1 2 1 2 2
Question: Find the average magnitude of major earthquakes in the US.
mean((df %>% filter(region=="United States"))$richter)
## [1] 7.15
It looks like the average is 7.15.
We can select()
certain columns to display (and choose
the order in which to display them). The code below creates a new data
frame called df1
, obtained from df by selecting just the
region, richter, and deaths variables. Then we display the first 5 rows
of df1
.
df1 = df %>% select(region, richter, deaths)
head(df1,5)
## # A tibble: 5 × 3
## region richter deaths
## <chr> <dbl> <dbl>
## 1 Guatemala 7.5 2000
## 2 Russia 6.4 4700
## 3 Turkey 7 3500
## 4 Turkey 5.8 1000
## 5 India 7.5 19000
Task: Group the earthquakes by month and record
df %>% group_by(month) %>%
summarise(size=length(richter),
avg=mean(richter),
stdev=sd(richter))
## # A tibble: 12 × 4
## month size avg stdev
## <chr> <int> <dbl> <dbl>
## 1 April 7 7.09 0.664
## 2 August 12 7.42 0.808
## 3 December 12 7.03 0.710
## 4 February 9 6.71 0.683
## 5 January 11 7.35 0.717
## 6 July 10 7 0.650
## 7 June 8 7.12 0.381
## 8 March 17 7.02 0.859
## 9 May 14 7.16 0.917
## 10 November 6 7.33 0.493
## 11 October 8 7.05 0.760
## 12 September 9 7.28 0.618
Note: This is a very useful thing to do. Later in the course we will regularly want to find the sample mean, sample size, and sample standard deviation across different groups like this.
Here’s a nice display of these summary statistics, which uses the
package kableExtra
:
library(kableExtra)
df %>% group_by(month) %>%
summarise(size=length(richter),
avg=round(mean(richter),2),
stdev=round(sd(richter),2)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped","hover"),full_width=FALSE,position="left")
month | size | avg | stdev |
---|---|---|---|
April | 7 | 7.09 | 0.66 |
August | 12 | 7.42 | 0.81 |
December | 12 | 7.03 | 0.71 |
February | 9 | 6.71 | 0.68 |
January | 11 | 7.35 | 0.72 |
July | 10 | 7.00 | 0.65 |
June | 8 | 7.12 | 0.38 |
March | 17 | 7.02 | 0.86 |
May | 14 | 7.16 | 0.92 |
November | 6 | 7.33 | 0.49 |
October | 8 | 7.05 | 0.76 |
September | 9 | 7.28 | 0.62 |
The packages dplyr
and ggplot2
are both a
part of the tidyverse and can work together. For instance, we can make a
plot on some version of the data frame first modified by
dplyr
.
Question: Make a line plot that shows the cumulative death count from 20th century earthquake. By cumulative death count, we mean that for each year, we want to know the total number of deaths from earthquakes during the century up to and including that year.
df %>% mutate(cum.death=cumsum(deaths)) %>%
ggplot()+
geom_line(aes(x=year,y=cum.death))+
ylab("cumulative major earthquake fatalities")
It looks like the 70s were particularly devastating. What are those two most devastating earthquakes?
head(df %>% arrange(desc(deaths)),2)
## # A tibble: 2 × 7
## year month day richter area region deaths
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1970 May 31 7.9 Chimbote Peru 700000
## 2 1976 July 28 7.5 Tangshan China 255000
Note: The cumsum()
command is a
built-in R command that gives progressive cumulative sums of a
vector.
vec=c(2,4,1,-3)
cumsum(vec)
## [1] 2 6 7 4
The following page has a downloadable dplyr cheat sheet (pdf)