The Scene

On day one of this class, students completed a survey (via Google Forms). I downloaded the results as a .csv file and, after cleaning the data (we’ll discuss what this meant in class), saved the file to our course resource site in the data tab.

This page shows how to load the data into an RStudio session, and some ways to begin exploring it there.

Import the data

df <- read.csv("https://mphitchman.com/stats/data/student_surveys.csv")

You can run this line of code at the console prompt in RStudio (copy and paste!) to load the survey data into your own RStudio session. The code loads the data and gives it the working name df in your session.

Exploring Numerical Data

By clicking on df in your Environment Tab you can check the data set. We have 141 rows and 20 columns. This means that 141 students filled out the survey, and 20 variables for each student have been recorded, including eyecolor, year (at Linfield), height (in inches), phone.usage (per day), and others.

By clicking the arrow in the blue circle next to df in your Environment Tab, you can see a list of the column names, and whether they are categorical or numerical variables.
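
The same information is available from the console. The sketch below uses a small made-up data frame called toy (hypothetical values, not the survey data) so that it runs on its own; with the survey loaded, you would pass df instead.

```r
# A tiny made-up data frame standing in for the survey (hypothetical values)
toy <- data.frame(
  eyecolor = c("Brown", "Blue", "Brown", "Green"),
  height   = c(64, 70, 67.5, 62),
  sleep    = c(7, 8, 6.5, 9)
)

dim(toy)      # number of rows, then number of columns
names(toy)    # the column names
str(toy)      # the type of each column (text columns are categorical, num columns are numerical)
head(toy, 3)  # the first three rows
```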

To ask R to give us information about a particular column in a data frame we need to tell it the name of the data frame and the name of the column of interest, using a $ sign between these names:

df$sleep
##   [1] 11.0000  6.0000  8.0000  7.0000  8.0000  7.0000  6.5000  8.0000  7.0000
##  [10]  6.0000  8.0000  8.0000  6.0000  8.0000  8.0000  7.0000  8.0000 10.0000
##  [19]  7.0000  8.0000  7.5000  8.0000  7.5000  7.0000  7.0000  8.0000  4.0000
##  [28]  7.5833  6.0000  9.0000  7.0000  7.0000 10.0000  9.0000  8.0000  6.5000
##  [37]  7.0000  7.0000  8.0000  8.0000  8.0000  6.0000  6.5000  9.0000  6.0000
##  [46]  7.0000  7.0000  8.0000  8.0000  7.0000  8.0000  5.0000  7.0000  6.0000
##  [55]  6.5000  8.0000  7.0000  7.0000  5.0000  7.5000  7.0000  7.0000  8.0000
##  [64]  8.0000  6.0000  7.0000  7.5000  7.0000  7.0000  7.0000  8.0000  7.0000
##  [73]  8.0000  8.0000  6.0000  7.5000  7.0000  7.0000  7.0000  9.0000  8.0000
##  [82]  8.0000  8.0000  7.0000  8.0000  7.5000  8.5000  6.5000  6.0000  7.0000
##  [91]  8.0000  8.0000  8.0000  6.0000  8.0000  7.0000  8.0000  7.0000  5.0000
## [100]  9.0000  9.0000  7.5000  6.0000  7.5000  8.0000  7.0000  8.0000  8.0000
## [109]  8.0000  6.0000  8.0000  8.0000  8.0000  9.0000  8.0000  8.0000  6.5000
## [118]  8.0000  8.0000  8.0000  5.0000  9.0000  7.5000  8.0000  7.0000  6.0000
## [127]  6.0000  8.0000  4.0000  5.0000  6.0000  9.0000  7.0000  6.0000  7.5000
## [136]  8.0000  8.5000  8.0000  7.0000  6.5000  9.0000

The code above displays the raw data from the sleep column, which doesn’t tell us much in the way of trends or patterns. Note: the numbers in brackets at the start of each row are NOT part of the data. They mark the position in the list of the data point that follows. For instance, if a row of the printout starts with [10], then the data point after [10] is the 10th entry in the list.
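
Those positions can also be used to pull out individual entries with square brackets. As a minimal sketch, the vector below holds the first ten values from the printout above; the name sleep10 is just a label chosen for this illustration.

```r
# The first ten values of df$sleep, copied from the printout above
sleep10 <- c(11, 6, 8, 7, 8, 7, 6.5, 8, 7, 6)

sleep10[10]      # the 10th entry: 6
sleep10[1:3]     # the first three entries: 11 6 8
length(sleep10)  # how many entries there are: 10
```

With the survey loaded, df$sleep[10] would work the same way.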

We can summarize these data with a frequency plot called a histogram. See the basic data visualization tutorial for details on making histograms look sharp.

hist(df$sleep)

Or a boxplot:

boxplot(df$sleep)

It looks like everyone expects to sleep between 4 and 11 hours a night!
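
To read such numbers off without squinting at a plot, range() and summary() report them directly. A minimal sketch on a made-up vector (hypothetical values, not the survey data); with the survey loaded you would use df$sleep in its place.

```r
hours <- c(4, 6, 7, 7, 8, 8, 9, 11)  # hypothetical values

range(hours)    # smallest and largest values: 4 11
median(hours)   # the middle of the sorted values: 7.5
summary(hours)  # minimum, quartiles, median, mean, and maximum in one call
```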

What is the average height of students in this class, as reported in the survey?

mean(df$height)
## [1] 66.75177
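
Since mean() returned a number here, the height column has no missing entries. When a column does contain missing values (NA), mean() returns NA unless told to skip them. A small sketch with made-up heights (hypothetical values):

```r
heights <- c(64, 70, NA, 67)   # hypothetical values; NA marks a skipped question

mean(heights)                  # NA, because one value is missing
mean(heights, na.rm = TRUE)    # 67, the mean of the three recorded values
sum(is.na(heights))            # how many values are missing: 1
```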

Exploring Categorical Data

Use table() to quickly summarize categorical variable responses. For instance, how does the class feel about whether to abolish the penny?

table(df$abolish.penny)
## 
##  No Yes 
##  96  45

What is the class distribution of eye color?

table(df$eyecolor)
## 
##  Blue Brown Green Hazel 
##    30    91    10    10
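
Counts are often easier to compare as percentages. The sketch below rebuilds the counts from the output above by hand (the name counts is just a label for this illustration) and converts them with prop.table().

```r
# Eye color counts copied from the table() output above
counts <- c(Blue = 30, Brown = 91, Green = 10, Hazel = 10)

sum(counts)                         # 141, matching the number of rows in df
round(100 * prop.table(counts), 1)  # percentages: 21.3 64.5 7.1 7.1
```

With the survey loaded, prop.table(table(df$eyecolor)) gives the proportions in one step.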

We can also use table() to investigate a possible relationship between two categorical variables:

table(df$degree,df$abolish.penny)
##           
##            No Yes
##   business 33  17
##   CAS      24  17
##   nursing  39  11

Hmm… what about

table(df$degree,df$class.dread)
##           
##             1  2  3  4  5
##   business 12 15 17  5  1
##   CAS      11  8 11  7  4
##   nursing   7 18 20  2  3
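
Because the degree groups differ in size, row proportions make two-way tables like these easier to compare. As a sketch, the matrix below re-enters the degree by abolish.penny counts from above by hand; the name penny is only a label for this illustration.

```r
# Counts copied from the degree by abolish.penny table above
penny <- matrix(c(33, 17,
                  24, 17,
                  39, 11),
                nrow = 3, byrow = TRUE,
                dimnames = list(degree = c("business", "CAS", "nursing"),
                                abolish.penny = c("No", "Yes")))

# margin = 1 divides each count by its row total
round(prop.table(penny, margin = 1), 2)
```

The Yes column comes out to 0.34 for business, 0.41 for CAS, and 0.22 for nursing. With the survey loaded, prop.table(table(df$degree, df$abolish.penny), margin = 1) does the same thing directly.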

Exploring a numerical variable grouped by a categorical variable

We can make side-by-side boxplots of a numerical variable (haircut, for instance) grouped by some categorical variable, such as… eye color!

boxplot(df$haircut~df$eyecolor)
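
The formula reads "numerical variable ~ grouping variable". It can also be written with bare column names plus a data argument, which keeps the df$ prefix out of the axis labels. A sketch on a made-up data frame (hypothetical values, not the survey data):

```r
# Made-up data standing in for the survey (hypothetical values)
toy <- data.frame(
  eyecolor = c("Blue", "Blue", "Brown", "Brown", "Brown", "Green"),
  haircut  = c(20, 30, 25, 40, 150, 15)
)

# numerical variable ~ grouping variable, with columns looked up in toy
boxplot(haircut ~ eyecolor, data = toy,
        xlab = "Eye color", ylab = "Cost of last haircut")
```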

Once we have loaded the tidyverse, we can access our data verbs to compute the average haircut cost by eye color group.

library(tidyverse)
df |> 
  group_by(eyecolor) |>
  summarize(avg = mean(haircut))
## # A tibble: 4 × 2
##   eyecolor   avg
##   <chr>    <dbl>
## 1 Blue      37.0
## 2 Brown     39.7
## 3 Green     68.7
## 4 Hazel     39

Are these means useful measures of center? To answer this question, we should ask whether there are outliers and/or strong skewness in the haircut data. Let’s look, with a fairly polished plot with various bells and whistles.

ggplot(df)+
  geom_dotplot(aes(x = haircut, fill = eyecolor), binwidth = 5)+
  scale_y_continuous(breaks = NULL, name = "") + # Hide the y-axis
  scale_fill_manual(values = c("blue","brown","green","wheat2"))+ #match dot color with eyecolor
  facet_grid(eyecolor~.)+
  labs(title = 'Cost of haircuts, grouped by eye color')+
  theme_bw()+
  theme(legend.position = 'none')

There are some definite outliers, which influence the means greatly. The medians would be a better measure of center:

df |> 
  group_by(eyecolor) |>
  summarize(M = median(haircut))
## # A tibble: 4 × 2
##   eyecolor     M
##   <chr>    <dbl>
## 1 Blue      25.5
## 2 Brown     35  
## 3 Green     21  
## 4 Hazel     32.5
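
The pull of a single outlier on the mean is easy to see in a small made-up example. The sketch below uses base R’s tapply() instead of the tidyverse verbs, purely so that it runs without loading any packages; the data are hypothetical.

```r
# Made-up data (hypothetical values); 200 is a deliberate outlier
toy <- data.frame(
  eyecolor = c("Blue", "Blue", "Blue", "Brown", "Brown", "Brown"),
  haircut  = c(20, 25, 200, 30, 35, 40)
)

# tapply(values, groups, function) applies the function within each group
tapply(toy$haircut, toy$eyecolor, mean)    # Blue about 81.67, Brown 35
tapply(toy$haircut, toy$eyecolor, median)  # Blue 25, Brown 35
```

The one large value drags the Blue mean far above every other Blue value, while the Blue median stays put.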

Now you try

  1. On average, how many states have students in this class visited? Are there outliers in these data?
  2. On average, how many hours a week do students in this class plan to study? Are there outliers in these data?
  3. Make a histogram of the haircut variable (how much you spent on your last haircut). Based on the histogram, do you think you spend more or less than the typical student in this class? Is this distribution skewed left, skewed right, or symmetric?
  4. Of all the brown-eyed people in the class, how many answered that they went to sleep around midnight?
  5. Compute the median study hours for students in this class, grouped by whether they want to abolish the penny. Which group has a higher median?