The tidyverse package bundles several useful packages for managing and visualizing data. Today we use some commands from two packages in the tidyverse:
dplyr
Data verbs are commands designed to help us intuitively tackle data frame manipulation tasks. We discuss the syntax for doing data management with dplyr and the pipe command (|> or %>%) in class. A more detailed tutorial, Data Management with dplyr, can be found on our course resource page.
Here are some commonly used data verbs:
| dplyr command | Description |
|---|---|
| filter() | filter (subset) rows by a condition |
| arrange() | sort the data |
| select() | selecting columns (variables) |
| mutate() | create new variables (columns) |
| slice() | consider only the rows indicated |
| group_by() | group the data |
| summarize() | summarize or aggregate the data |
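The last two verbs in the table are usually used together. Here is a minimal sketch, assuming the dplyr package is installed and using R's built-in mtcars data (not a data set from this activity): it groups the cars by number of cylinders, then counts the cars and averages the miles per gallon within each group. The names by_cyl, n_cars, and mean_mpg are chosen just for this illustration.

```r
library(dplyr)

# mtcars ships with R; cyl = number of cylinders, mpg = miles per gallon
by_cyl <- mtcars |>
  group_by(cyl) |>                                  # one group per cylinder count
  summarize(n_cars = n(), mean_mpg = mean(mpg))     # one summary row per group

by_cyl
```

Each group collapses to a single row, so the result has as many rows as there are distinct values of cyl.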
Example: Let’s say I have a data set called
lego that has many columns, three of which are called
name (the name of the set), pieces (how many
pieces it has), and cost. Then the code
lego |>
  filter(pieces > 500) |>
  arrange(desc(cost)) |>
  select(name, pieces, cost)
would be interpreted to mean this: take the data set
lego, then filter the data set to include just those sets
that have more than 500 pieces. Then sort those rows in descending order
by how much they cost. Then select just the three columns
name, pieces, and cost to
display.
The pipe command |> essentially means “then”.
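To make the “then” reading concrete, here is a small sketch that runs the lego example on a made-up data frame. The set names, piece counts, and costs below are invented purely for illustration; they are not real data. The piped version and the nested version carry out the same three steps, so they should give the same result.

```r
library(dplyr)

# Hypothetical data, invented for illustration only
lego <- data.frame(
  name   = c("Set A", "Set B", "Set C", "Set D"),
  pieces = c(250, 800, 1200, 600),
  cost   = c(20, 90, 150, 60)
)

# Piped version: read each |> as "then"
piped <- lego |>
  filter(pieces > 500) |>
  arrange(desc(cost)) |>
  select(name, pieces, cost)

# The same steps as nested function calls, which must be read inside-out
nested <- select(arrange(filter(lego, pieces > 500), desc(cost)), name, pieces, cost)

identical(piped, nested)  # checks that both approaches agree
piped
```

The pipe does not change what the code computes; it only lets us write the steps in the order we think about them.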
ggplot
We discuss the anatomy of a ggplot in class. A more detailed tutorial, Data Visualization with ggplot, can be found on our course resource page.
We can choose to make a histogram,
ggplot(faithful) +
  geom_histogram(aes(x = eruptions), fill = 'wheat', col = 'blue', bins = 25) +
  labs(title = 'Old Faithful eruption time',
       x = 'time (min)')
or a boxplot,
ggplot(faithful) +
  geom_boxplot(aes(x = eruptions), fill = 'seagreen', col = 'black', alpha = .2) +
  labs(title = 'Old Faithful eruption time',
       x = 'time (min)')
or a bar plot (to summarize the frequency distribution of a categorical variable)
ggplot(starwars) +
  geom_bar(aes(y = eye_color)) +
  labs(title = 'Eye color of Star Wars characters',
       y = 'Eye color')
We will work through these steps together in class.
- Fire up RStudio.
- Create a new script.
- Enter your name as a comment in line 1 of your script, e.g.,
# Mike Hitchman
- Load the tidyverse into your session by typing library(tidyverse) in line 2 of your script and then running the line.
- We will talk as a class about the expected format of your script as you proceed. Your script will record code you use to accomplish tasks, as well as comments (beginning with #) when you are asked to discuss the code outputs.
- When you are done with this activity you will submit your script to Blackboard; we will go over how to do this.
- Load the rectangles data into your session by copying the line below into your script, and then running the line. To see if it loaded, check that df appears in your environment tab.
df <- read.csv("https://mphitchman.com/stats/data/mph_rectangles.csv")
- Run summary(df) (enter this into your script) to obtain a quick summary of what the data frame contains. Then answer these questions as comments in your script.
- What are the different rectangle colors in this data set, and how many of each color are there? Use code to answer this question, and summarize your answer with a comment.
Intermission: Here’s what my script looks like after I have completed Q1.
End Intermission
length
- Use mean() and sd() to find the mean and standard deviation of the length variable.
- Determine the IQR of the length variable.
- Visualize the distribution of lengths with a ggplot histogram. Comment in your script about the shape of the distribution (e.g., symmetric and bell-shaped, uniform, skewed right, skewed left, outliers…). Feel free to adjust plot colors and change how many bins to use. Find a plot that gives you a good sense of the overall shape of the distribution.
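As a hedged sketch of the functions involved, the lines below compute the same kinds of summaries for a different variable: the eruptions column of R's built-in faithful data. The rectangle questions are still yours to answer; only base R is used here, and the names x and q are chosen just for this illustration.

```r
# faithful is built into R; eruptions is the eruption time in minutes
x <- faithful$eruptions

mean(x)   # sample mean
sd(x)     # sample standard deviation
IQR(x)    # interquartile range: third quartile minus first quartile

# With default settings, IQR() equals the gap between the 25th and 75th percentiles
q <- quantile(x, c(0.25, 0.75))
unname(q[2] - q[1])
```

The IQR measures the spread of the middle half of the data, so it is less affected by a few extreme values than the standard deviation is.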
width
Repeat Q2 with the width variable.
Which variable, length or width, has a larger IQR? Comment on why this answer is reasonable given the shapes of the two distributions.
What kind of association is there, if any, between a rectangle’s width and its length? Make a point plot to inform your response.
With ggplot(), the plot “geometry” for a point plot (aka ‘scatter plot’) is geom_point().
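For reference, here is a minimal sketch of a point plot, assuming the ggplot2 package is installed and using R's built-in faithful data rather than the rectangles data. The plot is stored in an object named p (a name chosen for this illustration) and then printed.

```r
library(ggplot2)

# faithful is built into R: eruptions = eruption time (min),
# waiting = waiting time until the next eruption (min)
p <- ggplot(faithful) +
  geom_point(aes(x = waiting, y = eruptions), color = 'steelblue', alpha = .6) +
  labs(title = 'Old Faithful: eruption time vs. waiting time',
       x = 'waiting time (min)',
       y = 'eruption time (min)')

p  # printing the object draws the plot
```

Each point is one observation, so the overall pattern of the cloud (rising, falling, or no trend) suggests the kind of association between the two variables.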
What kind of association is there, if any, between a rectangle’s color and its width? Make side-by-side boxplots (by color groups) to inform your response. With ggplot, use geom_boxplot for a boxplot, and the aesthetic in this case would be aes(x = width, y = color) if you want horizontal boxplots of width grouped by color.
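The same aesthetic pattern (a numeric variable on x, a grouping variable on y) can be sketched with R's built-in iris data, which stands in here for the rectangles data. This assumes ggplot2 version 3.3.0 or later, which can draw horizontal boxplots directly from the aesthetics; the object name p is chosen for this illustration.

```r
library(ggplot2)

# iris is built into R: Sepal.Width is numeric, Species is a grouping variable
p <- ggplot(iris) +
  geom_boxplot(aes(x = Sepal.Width, y = Species), fill = 'seagreen', alpha = .2) +
  labs(title = 'Sepal width by species',
       x = 'sepal width (cm)',
       y = 'species')

p  # one horizontal boxplot per species
```

If the boxes sit at clearly different positions along the x-axis, that suggests an association between the grouping variable and the numeric variable.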
Intermission: Counting how many observations meet some criteria.
We can use the sum() command to see how many elements in
a vector meet some condition.
For instance, how many elements in the vector x below
have a value greater than 10?
x = c(3, 13, 7, 12, 10, 4, 5)
sum(x > 10)
## [1] 2
Or, how many elements in x are between 6 and 11?
sum(x > 6 & x < 11)
## [1] 2
Or, how many rectangles in the data frame have width \(w\) such that \(12 \leq w \leq 12.5\)?
sum(df$width >= 12 & df$width <= 12.5)
## [1] 325
So, the proportion of all rectangles with a width in this range is
sum(df$width >= 12 & df$width <= 12.5) / nrow(df)
## [1] 0.1625
In other words, if I pick a rectangle at random from this data set, the probability is 0.1625 that its width is between 12 and 12.5.
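A shortcut worth knowing: because R treats TRUE as 1 and FALSE as 0, taking the mean() of a condition gives the proportion directly. A minimal sketch with the vector x from above:

```r
x <- c(3, 13, 7, 12, 10, 4, 5)

sum(x > 10) / length(x)  # count of matches divided by the number of elements
mean(x > 10)             # the same proportion, computed in one step
```

For a data frame column the same idea applies, with nrow(df) playing the role of length(x).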
End Intermission
What proportion of rectangles have length between 14 and 15?
Intermission: Sorting and filtering data.
We use the filter() verb to find those rows
(observations) that meet some condition (or conditions) for a variable.
We use the arrange() verb to sort by a column.
Example 1: List all the rectangles named Mike.
df |>
  filter(name == 'Mike')
## name length width color
## 1 Mike 13.54 13.48 green
Example 2: What are the five widest yellow rectangles? List them in descending order.
Strategy: filter() the data to include just the yellow rectangles, then arrange() these data in desc() order by their width, then slice() off just rows 1 through 5 for display.
df |>
  filter(color == 'yellow') |>
  arrange(desc(width)) |>
  slice(1:5)
## name length width color
## 1 Butler 13.09 14.49 yellow
## 2 Peggie 12.97 14.48 yellow
## 3 Kennard 12.36 14.48 yellow
## 4 Euna 13.06 14.47 yellow
## 5 Deidre 13.01 14.47 yellow
Example 3: List in alphabetical order all red rectangles that are at least 2 units longer than they are wide.
Strategy: filter() the data to include just the red rectangles that also have (length - width) at least 2, then arrange() these data by their name (by default arrange() sorts a column in ascending order).
df |>
  filter(color == 'red' & (length - width) >= 2) |>
  arrange(name)
## name length width color
## 1 Berton 14.24 11.80 red
## 2 Cornell 13.81 11.60 red
## 3 Netta 13.69 11.57 red
## 4 Nyasia 13.69 11.51 red
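A side note on combining conditions: filter() also accepts several conditions separated by commas and keeps only the rows that satisfy all of them, which is the same as joining them with &. Here is a minimal sketch on a tiny made-up data frame. The object rects and its values are invented for illustration and are not the course data.

```r
library(dplyr)

# Hypothetical mini data frame, invented for illustration only
rects <- data.frame(
  name   = c("A", "B", "C", "D"),
  length = c(14.0, 12.5, 13.9, 13.0),
  width  = c(11.5, 12.0, 11.8, 12.9),
  color  = c("red", "red", "blue", "red")
)

# Two ways to require both conditions at once
with_and   <- rects |> filter(color == 'red' & (length - width) >= 2)
with_comma <- rects |> filter(color == 'red', (length - width) >= 2)

identical(with_and, with_comma)  # checks that both approaches agree
with_and
```

Which form to use is a matter of taste; the comma form can be easier to read when there are many conditions.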
End Intermission
- Among all the green rectangles with length greater than 13, find the name of the one with the smallest width.
- Among all rectangles with width less than 12, find the name and color of the one with the greatest length.