The Scene

On day one of this class, students filled out a survey (via Google Forms). I downloaded the results as a .csv file and, after cleaning the data (we’ll discuss what this meant in class), saved the file to the data tab of our course resource site.

In this webpage, we show how to load the data into an RStudio session, and some ways to begin exploring the data in RStudio.

Import the data

df <- read.csv("https://mphitchman.com/stats/data/survey_cleaned_f22.csv")

You can run this line of code at the console prompt in RStudio (copy and paste!) to load the survey data into your own RStudio session. The code loads the data and gives it the working name df in your session.

Exploring Numerical Data

By clicking on df in your Environment Tab you can check the data set. We have 27 rows and 19 columns. This means that 27 students filled out the survey, and 19 variables for each student have been recorded, including eyecolor, year (at Linfield), height (in inches), phone.usage (per day), and others.

By clicking the arrow in the blue circle next to df in your Environment Tab, you can see a list of the column names, and whether they are categorical or numerical variables.
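The same information is also available from the console. Here is a minimal sketch using a small made-up data frame called toy (hypothetical values, not the survey data) so that it runs on its own; with the survey loaded you would pass df instead of toy.

```r
# Made-up stand-in for the survey data (hypothetical values, for illustration only)
toy <- data.frame(
  eyecolor = c("Blue", "Brown", "Brown"),
  height   = c(64, 70, 67)
)

dim(toy)    # number of rows, then number of columns
names(toy)  # the column names
str(toy)    # each column's type: chr (or Factor) is categorical, num or int is numerical
```

Running dim(df) on the survey data should agree with the row and column counts shown in the Environment Tab.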

To ask R to give us information about a particular column in a data frame we need to tell it the name of the data frame and the name of the column of interest, using a $ sign between these names:

df$sleep
##  [1] 7.0 7.0 7.0 7.0 8.0 8.0 8.0 8.0 7.0 8.0 7.5 6.0 7.0 7.0 7.0 6.5 9.0 8.5 8.0
## [20] 6.0 6.0 8.0 8.0 7.5 6.0 8.0 8.0

The code above displays the raw data from the sleep column, which doesn’t tell us much in the way of trends or patterns. Note: the numbers in brackets at the start of each row are NOT part of the data; they mark the position of the data point in the list. For instance, the 2nd row of the printout starts with [20], which just indicates that the first data entry on this row (6.0) is the 20th data point in the column.
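Those bracketed positions can also be used to pull out individual data points. A short sketch, with the sleep values typed in from the printout above so the example runs on its own (with the survey loaded, df$sleep[20] should do the same job):

```r
# Sleep values copied from the printout above
sleep <- c(7, 7, 7, 7, 8, 8, 8, 8, 7, 8, 7.5, 6, 7, 7, 7, 6.5, 9, 8.5, 8,
           6, 6, 8, 8, 7.5, 6, 8, 8)

length(sleep)  # 27 data points, one per student
sleep[20]      # the 20th data point: 6
sleep[1:3]     # the first three data points: 7 7 7
```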

We can summarize these data with a frequency plot which we call a histogram. See the basic data visualization tutorial for details on making histograms look sharp.

hist(df$sleep)

Or a boxplot:

boxplot(df$sleep)

It looks like everyone expects to sleep between 6 and 9 hours a night!
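We can check what the plots suggest with a few numerical summaries. This sketch retypes the sleep values from the printout earlier on this page so that it runs on its own; with the survey loaded, the same functions can be applied to df$sleep directly.

```r
# Sleep values copied from the printout of df$sleep
sleep <- c(7, 7, 7, 7, 8, 8, 8, 8, 7, 8, 7.5, 6, 7, 7, 7, 6.5, 9, 8.5, 8,
           6, 6, 8, 8, 7.5, 6, 8, 8)

range(sleep)    # smallest and largest values: 6 and 9
median(sleep)   # the middle value: 7.5
summary(sleep)  # minimum, quartiles, median, mean, and maximum in one call
```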

What is the average height of students in this class, as reported in the survey?

mean(df$height)
## [1] 66.2963
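Since mean() returned a number here, the height column evidently has no missing values. That will not always be the case: if even one entry is missing (NA), mean() returns NA unless we ask it to skip the missing entries. A small sketch with made-up heights (hypothetical values, not the survey data):

```r
# Made-up heights in inches (hypothetical); one answer is missing
heights <- c(64, 70, NA, 67)

mean(heights)                # NA, because one value is missing
mean(heights, na.rm = TRUE)  # 67, the mean of the three recorded values
```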

Exploring Categorical Data

Use table() to quickly summarize categorical variable responses. For instance, how does the class feel about whether to abolish the penny?

table(df$abolish.penny)
## 
##  No Yes 
##  18   9
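Counts are often easier to compare as proportions. One way to get them is to wrap the table in prop.table(). The sketch below rebuilds the responses from the counts above so it runs on its own; with the survey loaded, prop.table(table(df$abolish.penny)) should give the same proportions.

```r
# Rebuild the responses from the counts shown above: 18 "No" and 9 "Yes"
penny <- rep(c("No", "Yes"), times = c(18, 9))

counts <- table(penny)
prop.table(counts)               # proportions: No 0.667, Yes 0.333 (rounded)
round(100 * prop.table(counts))  # percentages: No 67, Yes 33
```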

What is the class distribution of eye color?

table(df$eyecolor)
## 
##  Blue Brown Green Hazel 
##     9    14     1     3

We can also use table() to investigate a possible relationship between two categorical variables:

table(df$eyecolor,df$abolish.penny)
##        
##         No Yes
##   Blue   7   2
##   Brown 10   4
##   Green  0   1
##   Hazel  1   2
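Because the eye color groups have different sizes, row totals and within-group proportions can make a two-way table easier to read. The sketch below rebuilds the table from the counts above so it runs on its own; with the survey loaded you could instead start from tab <- table(df$eyecolor, df$abolish.penny).

```r
# Rebuild the two-way table from the counts shown above
tab <- as.table(matrix(
  c( 7, 2,
    10, 4,
     0, 1,
     1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(eyecolor      = c("Blue", "Brown", "Green", "Hazel"),
                  abolish.penny = c("No", "Yes"))
))

addmargins(tab)                        # adds row and column totals
round(prop.table(tab, margin = 1), 2)  # proportions within each eye color (each row sums to 1)
```

With groups this small (only one green-eyed student, for instance), the proportions should be read with caution.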

Hmm… what about degree and class.dread?

table(df$degree,df$class.dread)
##                               
##                                1 2 3 5
##   College of Arts and Sciences 4 3 1 0
##   School of Business           1 2 1 0
##   School of Nursing            2 9 3 1

Exploring a numerical variable grouped by a categorical variable

We can make side-by-side boxplots of a numerical variable (haircut, for instance) grouped by some categorical variable, such as… eye color!

boxplot(df$haircut~df$eyecolor)

We can also compute the average haircut by eye color group:

aggregate(df$haircut, list(df$eyecolor), FUN=mean) 
##   Group.1        x
## 1    Blue 29.88889
## 2   Brown 33.92857
## 3   Green 30.00000
## 4   Hazel 40.00000
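The column names Group.1 and x in that output are not very informative. As an alternative, aggregate() also accepts a formula, which keeps the original variable names, and tapply() gives the same means as a named vector. A sketch with made-up data (hypothetical values, not the survey responses):

```r
# Made-up data (hypothetical values, for illustration only)
toy <- data.frame(
  eyecolor = c("Blue", "Blue", "Brown", "Brown", "Brown"),
  haircut  = c(20, 30, 15, 25, 50)
)

# Formula interface: the output columns are named eyecolor and haircut
aggregate(haircut ~ eyecolor, data = toy, FUN = mean)

# tapply() returns the group means as a named vector: Blue 25, Brown 30
tapply(toy$haircut, toy$eyecolor, mean)
```

With the survey loaded, the corresponding call would be aggregate(haircut ~ eyecolor, data = df, FUN = mean).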

Are these means useful measures of center? I don’t know! Are there outliers in the haircut data? Let’s look, with a fairly polished plot with various bells and whistles.

stripchart(df$haircut ~ df$eyecolor,
           main = "Cost of Haircuts",  # title
           xlab = "Cost",              # x-axis label
           ylab = "",                  # y-axis label (empty for this chart)
           method = "stack",           # stack repeated values instead of overplotting them
           col = c("blue", "brown", "green", "wheat2"),  # dot color for each eye color group
           pch = 16)                   # dot shape: solid filled circle

There are some definite outliers, which influence the means greatly. The medians would be a better measure of center:

aggregate(df$haircut, list(df$eyecolor), FUN=median) 
##   Group.1  x
## 1    Blue 27
## 2   Brown 25
## 3   Green 30
## 4   Hazel 30
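To see why the median holds up better, here is a tiny sketch with made-up haircut costs (hypothetical values, not the survey data). Changing one value to an extreme one moves the mean a lot and leaves the median where it was.

```r
# Made-up haircut costs (hypothetical, for illustration only)
costs <- c(20, 25, 30, 35)
mean(costs)    # 27.5
median(costs)  # 27.5

# Same data, but the largest value is replaced by an extreme one
costs_outlier <- c(20, 25, 30, 200)
mean(costs_outlier)    # 68.75, pulled up by the single large value
median(costs_outlier)  # 27.5, unchanged
```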

Now you try

  1. On average, how many states have students in this class visited? Are there outliers in these data?
  2. On average, how many hours a week do students in this class plan to study? Are there outliers in these data?
  3. Make a histogram of the haircut variable (how much you spent on your last haircut). Based on the histogram, do you think you spend more or less than the typical student in this class? Is this distribution skewed left, skewed right, or symmetric?
  4. Of all the brown-eyed people in the class, how many answered that they went to sleep around midnight?
  5. Can you compute the average study hours for students in this class grouped by whether they want to abolish the penny?