Introduction

The tidyverse is a collection of R packages for managing and visualizing data. Today we use some commands from two packages in the tidyverse:

  • dplyr (data verbs for managing data)
  • ggplot2 (for creating wonderful plots)

dplyr

Data verbs are commands designed to help us tackle data frame manipulation tasks intuitively. We discuss the syntax for data management with dplyr and the pipe command (|> or %>%) in class. For more detail, see the Data Management with dplyr tutorial on our course resource page.

Here are some commonly used data verbs:

dplyr command   Description
filter()        filter (subset) rows by a condition
arrange()       sort the data
select()        select columns (variables)
mutate()        create new variables (columns)
slice()         keep only the rows indicated
group_by()      group the data
summarize()     summarize or aggregate the data
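
The last few verbs in the table are easiest to see on a tiny example. The sketch below uses a made-up four-row lego data frame (the values and the theme column are hypothetical, invented only for illustration) to show mutate(), group_by(), and summarize() working together.

```r
library(dplyr)

# Hypothetical toy data; the theme column and all values are made up.
lego <- data.frame(
  name   = c("Castle", "Racer", "Starship", "Farm"),
  theme  = c("Knights", "City", "Space", "City"),
  pieces = c(1200, 300, 800, 450),
  cost   = c(120, 30, 100, 45)
)

# mutate() adds a column; group_by() + summarize() collapse to one row per theme.
theme_summary <- lego |>
  mutate(cost_per_piece = cost / pieces) |>
  group_by(theme) |>
  summarize(sets = n(), mean_cost_per_piece = mean(cost_per_piece))

theme_summary
```

The result has one row per theme, with the number of sets in that theme and the average cost per piece.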

Example: Let’s say I have a data set called lego that has many columns, three of which are called name (the name of the set), pieces (how many pieces it has), and cost. Then the code

lego |> 
  filter(pieces > 500) |> 
  arrange(desc(cost)) |> 
  select(name,pieces,cost)

would be interpreted to mean this: take the data set lego, then filter it to include just those sets that have more than 500 pieces, then sort those rows in descending order by cost, then select just the three columns name, pieces, and cost to display.

The pipe command |> essentially means “then”.
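
To see what the pipe buys us, it can help to compare a piped chain with the same calls written as nested functions. This is a small sketch on a made-up three-row lego data frame (hypothetical values, used only for illustration); both versions should give the same result.

```r
library(dplyr)

# Hypothetical toy data standing in for the lego data set.
lego <- data.frame(
  name   = c("Castle", "Racer", "Starship"),
  pieces = c(1200, 300, 800),
  cost   = c(120, 30, 100)
)

# Piped: read top to bottom as "take lego, then filter, then arrange, then select".
piped <- lego |>
  filter(pieces > 500) |>
  arrange(desc(cost)) |>
  select(name, pieces, cost)

# Nested: the same three calls, read from the inside out.
nested <- select(arrange(filter(lego, pieces > 500), desc(cost)), name, pieces, cost)

# TRUE if the two approaches produce the same data frame.
identical(piped, nested)
```

The piped version lists the steps in the order they happen, which is why we prefer it in this course.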

ggplot

We discuss the anatomy of a ggplot in class. For more detail, see the Data Visualization with ggplot tutorial on our course resource page.

We can choose to make a histogram,

ggplot(faithful)+
  geom_histogram(aes(x = eruptions), fill = 'wheat', col = 'blue',bins = 25)+
  labs(title = 'Old Faithful eruption time',
       x = 'time (min)')

or a boxplot,

ggplot(faithful)+
  geom_boxplot(aes(x = eruptions), fill = 'seagreen', col = 'black', alpha = .2)+
  labs(title = 'Old Faithful eruption time',
       x = 'time (min)')

or a bar plot (to summarize the frequency distribution of a categorical variable),

ggplot(starwars)+
  geom_bar(aes(y = eye_color))+
  labs(title = 'Eye color of Star Wars characters',
       y = 'Eye color')

Q0. Setup

We will work through these steps together in class.

  1. Fire up RStudio.
  2. Create a new script.
  3. Enter your name as a comment in line 1 of your script, e.g.,

# Mike Hitchman

  4. Load the tidyverse into your session by typing library(tidyverse) in line 2 of your script and then running the line.
  5. We will talk as a class about the expected format of your script as you proceed. Your script will record the code you use to accomplish tasks, as well as comments (beginning with #) when you are asked to discuss the code outputs.
  6. When you are done with this activity you will submit your script to Blackboard. We will go over how to do this.

Q1. The data

  1. Load the rectangles data into your session by copying the line below into your script, and then running the line. To see if it loaded, check that df appears in your Environment tab.

df <- read.csv("https://mphitchman.com/stats/data/mph_rectangles.csv")

  2. Run summary(df) (enter this into your script) to obtain a quick summary of what the data frame contains. Then answer these questions as comments in your script.
       • How many observations are there?
       • Name each variable, and indicate whether it is categorical or numerical.
  3. What are the different rectangle colors in this data set, and how many of each color are there? Use code to answer this question, and summarize your answer with a comment.
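
As a hint for counting categories, here is a sketch of two common approaches on a made-up data frame (pets and its columns are hypothetical, not part of the rectangles data): table() from base R and count() from dplyr.

```r
library(dplyr)

# Hypothetical toy data frame, not the rectangles data.
pets <- data.frame(
  name    = c("Rex", "Tom", "Fido", "Felix", "Tweety"),
  species = c("dog", "cat", "dog", "cat", "bird")
)

# Base R: table() tallies each distinct value in a vector.
species_table <- table(pets$species)

# dplyr: count() gives the same tallies as a data frame with a column named n.
species_counts <- pets |> count(species)

species_table
species_counts
```

Either approach lists each category once along with how many rows fall in it.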

Intermission: Here’s what my script looks like after I have completed Q1.

End Intermission


Q2. Investigating length

  1. Use mean() and sd() to find the mean and standard deviation of the length variable.
  2. Determine the IQR of the length variable.
  3. Visualize the distribution of lengths with a ggplot histogram. Comment in your script about the shape of the distribution (e.g., symmetric and bell-shaped, uniform, skewed right, skewed left, outliers…). Feel free to adjust plot colors and change how many bins to use. Find a plot that gives you a good sense of the overall shape of the distribution.
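
For reference, here is how the summary functions behave on a small made-up vector (not the rectangles data). The sketch also checks that IQR() matches the distance between the first and third quartiles from quantile(), using R's default quantile method. The same functions work on a data frame column such as df$length.

```r
# A small made-up vector, not the rectangles data.
x <- c(3, 13, 7, 12, 10, 4, 5)

mean(x)   # arithmetic mean
sd(x)     # sample standard deviation (divides by n - 1)

# IQR() is the third quartile minus the first quartile.
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

IQR(x)
iqr_by_hand
```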

Q3. Investigating width

Repeat Q2 with the width variable.

Q4. Comparing IQRs

Which variable, length or width, has a larger IQR? Comment on why this answer is reasonable given the shapes of the two distributions.

Q5. Association (num vs num)

What kind of association is there, if any, between a rectangle’s width and its length? Make a point plot to inform your response.

With ggplot(), the plot “geometry” for a point plot (aka ‘scatter plot’) is geom_point().
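
Here is a sketch of a point plot using the built-in faithful data set rather than the rectangles data, so the variable names (waiting, eruptions) differ from the ones you will need. A point plot maps one numerical variable to x and another to y.

```r
library(ggplot2)

# faithful ships with R: eruption duration and waiting time, both in minutes.
p <- ggplot(faithful) +
  geom_point(aes(x = waiting, y = eruptions), color = "blue", alpha = 0.5) +
  labs(title = "Old Faithful: eruption time vs waiting time",
       x = "waiting time (min)",
       y = "eruption time (min)")

p
```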

Q6. Association (cat vs num)

What kind of association is there, if any, between a rectangle’s color and its width? Make side-by-side boxplots (by color groups) to inform your response. With ggplot(), use geom_boxplot() for a boxplot; the aesthetic aes(x = width, y = color) gives horizontal boxplots of width grouped by color.
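
As an illustration of the same pattern on different data, the sketch below draws horizontal side-by-side boxplots with the built-in iris data set (so the variable names are not the ones you need here): a numerical variable on x and a categorical variable on y.

```r
library(ggplot2)

# iris ships with R: Sepal.Length is numerical, Species is categorical.
p <- ggplot(iris) +
  geom_boxplot(aes(x = Sepal.Length, y = Species), fill = "seagreen", alpha = 0.2) +
  labs(title = "Sepal length by iris species",
       x = "sepal length (cm)",
       y = "species")

p
```

Swapping the x and y aesthetics would give vertical boxplots instead.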


Intermission: Counting how many observations meet some criteria.

We can use the sum() command to see how many elements in a vector meet some condition.

For instance, how many elements in the vector x below have a value greater than 10?

x = c(3, 13, 7, 12, 10, 4, 5)
sum(x > 10)
## [1] 2

Or, how many elements in x are between 6 and 11?

sum(x > 6 & x < 11)
## [1] 2

Or, how many rectangles in the data frame have width \(w\) such that \(12 \leq w \leq 12.5\)?

sum(df$width >= 12 & df$width <= 12.5)
## [1] 325

So, the proportion of all rectangles with a width in this range is

sum(df$width >= 12 & df$width <= 12.5) / nrow(df)
## [1] 0.1625

In other words, if I pick a rectangle at random from this data set, the probability is 0.1625 that its width is between 12 and 12.5.
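
A shortcut worth knowing: because R treats TRUE as 1 and FALSE as 0, taking the mean() of a condition gives the proportion directly. A quick sketch using the same vector x as above:

```r
x <- c(3, 13, 7, 12, 10, 4, 5)

# A comparison produces a logical vector, one TRUE or FALSE per element.
x > 10

count_big <- sum(x > 10)    # number of TRUEs
prop_big  <- mean(x > 10)   # proportion of TRUEs

# The proportion is the count divided by the number of elements.
prop_big == count_big / length(x)
```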

End Intermission


Q7. Conditional Counts

What proportion of rectangles have length between 14 and 15?


Intermission: Sorting and filtering data.

We use the filter() verb to find those rows (observations) that meet some condition (or conditions) for a variable. We use the arrange() verb to sort by a column.

Example 1: List all the rectangles named Mike.

df |> 
  filter(name == 'Mike')
##   name length width color
## 1 Mike  13.54 13.48 green

Example 2: What are the five widest yellow rectangles? List them in descending order.

Strategy:

  • First take the data, then
  • filter() the data to include just the yellow rectangles, then
  • arrange() these data in desc() order by their width, then
  • slice() off just rows 1 through 5 for display.

df |> 
  filter(color == 'yellow') |> 
  arrange(desc(width)) |> 
  slice(1:5)
##      name length width  color
## 1  Butler  13.09 14.49 yellow
## 2  Peggie  12.97 14.48 yellow
## 3 Kennard  12.36 14.48 yellow
## 4    Euna  13.06 14.47 yellow
## 5  Deidre  13.01 14.47 yellow
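
Notice the ties in the output above (two rectangles at 14.48, two at 14.47). An alternative to arrange() plus slice() is dplyr's slice_max(), which by default keeps any rows tied with the last one. The sketch below uses a made-up data frame (boxes and its values are hypothetical) to show the difference.

```r
library(dplyr)

# Hypothetical toy data with a tie for second-widest.
boxes <- data.frame(
  name  = c("A", "B", "C", "D", "E"),
  width = c(2.5, 4.0, 3.1, 3.1, 1.2)
)

# arrange() + slice(): sort, then keep exactly the first two rows.
top2_sorted <- boxes |>
  arrange(desc(width)) |>
  slice(1:2)

# slice_max() keeps ties by default, so this can return more than n rows.
top2_ties <- boxes |>
  slice_max(width, n = 2)

# with_ties = FALSE returns exactly n rows.
top2_no_ties <- boxes |>
  slice_max(width, n = 2, with_ties = FALSE)

nrow(top2_sorted)
nrow(top2_ties)
nrow(top2_no_ties)
```

Which behavior you want depends on the question: "the top five" can reasonably mean exactly five rows, or every row tied for a top-five value.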

Example 3: List in alphabetical order all red rectangles that are at least 2 units longer than they are wide.

Strategy:

  • First take the data, then
  • filter() the data to include just the red rectangles that also have (length - width) at least 2, then
  • arrange() these data by their name (by default arrange() sorts a column in ascending order).

df |> 
  filter(color == 'red' & (length - width) >= 2) |> 
  arrange(name)
##      name length width color
## 1  Berton  14.24 11.80   red
## 2 Cornell  13.81 11.60   red
## 3   Netta  13.69 11.57   red
## 4  Nyasia  13.69 11.51   red
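
When filter() needs several conditions that must all hold, you can join them with & as above, or pass them as separate arguments separated by commas. A small sketch on a made-up data frame (boxes and its values are hypothetical) suggests the two forms give the same rows:

```r
library(dplyr)

# Hypothetical toy data, not the rectangles data.
boxes <- data.frame(
  name   = c("A", "B", "C", "D"),
  length = c(5, 9, 7, 8),
  width  = c(4, 6, 6, 3),
  color  = c("red", "red", "blue", "red")
)

# Conditions joined with & ...
with_and <- boxes |>
  filter(color == "red" & (length - width) >= 2)

# ... or given as separate arguments; filter() keeps rows meeting all of them.
with_comma <- boxes |>
  filter(color == "red", (length - width) >= 2)

identical(with_and, with_comma)
```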

End Intermission


Q8. Filter and Sort

  1. Among all the green rectangles with length greater than 13, find the name of the one with the smallest width.
  2. Among all rectangles with width less than 12, find the name and color of the one with the greatest length.