Directions

The exercises to complete appear at the bottom of this tutorial. Create a script to answer these questions, using comments both to provide context for your code and to record your written responses. Copy and paste your script into the submission box for this activity on Blackboard. To get started, these should be the first three lines of your script:

# Your Name
library(tidyverse)
df <- read.csv("https://mphitchman.com/stats/data/student_surveys.csv")

I encourage you to read through the tutorial carefully before tackling the questions.

Tutorial

On day one of this class, students completed a survey (via Google Forms). I downloaded the results of the survey as a .csv file and, after cleaning the data (we’ll discuss what this meant in class), saved the file to the data tab of our course resource site. The data is accessible as a link on our website. To load the data in your session and assign it the name df, we run the following line:

Importing the data

df <- read.csv("https://mphitchman.com/stats/data/student_surveys.csv")

Exploring Numerical Data

By clicking on df in your Environment Tab you can check the data set. We have 141 rows and 20 columns. This means that 141 students filled out the survey, and 20 variables for each student have been recorded, including eyecolor, year (at Linfield), height (in inches), phone.usage (per day), and others.

By clicking the arrow in the blue circle next to df in your Environment Tab, you can see a list of the column names, and whether they are categorical or numerical variables.
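
If you prefer typing to clicking, the same information is available from the console. Here is a minimal sketch on a small made-up data frame (the name toy and its values are invented for illustration, not taken from the survey), so it runs without downloading anything:

```r
# Toy stand-in for the survey data (made-up values, for illustration only)
toy <- data.frame(
  eyecolor = c("Brown", "Blue", "Brown", "Hazel"),
  sleep    = c(7, 8, 6.5, 9),
  stringsAsFactors = FALSE
)

dim(toy)    # number of rows, then number of columns
names(toy)  # the column names
str(toy)    # each column's type: chr for text, num for numbers
```

Running dim(df), names(df), and str(df) on the survey data should report the same row count, column count, and column types that the Environment Tab shows.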

To ask R to give us information about a particular column in a data frame we need to tell it the name of the data frame and the name of the column of interest, using a $ sign between these names:

df$sleep
##   [1] 11.0000  6.0000  8.0000  7.0000  8.0000  7.0000  6.5000  8.0000  7.0000
##  [10]  6.0000  8.0000  8.0000  6.0000  8.0000  8.0000  7.0000  8.0000 10.0000
##  [19]  7.0000  8.0000  7.5000  8.0000  7.5000  7.0000  7.0000  8.0000  4.0000
##  [28]  7.5833  6.0000  9.0000  7.0000  7.0000 10.0000  9.0000  8.0000  6.5000
##  [37]  7.0000  7.0000  8.0000  8.0000  8.0000  6.0000  6.5000  9.0000  6.0000
##  [46]  7.0000  7.0000  8.0000  8.0000  7.0000  8.0000  5.0000  7.0000  6.0000
##  [55]  6.5000  8.0000  7.0000  7.0000  5.0000  7.5000  7.0000  7.0000  8.0000
##  [64]  8.0000  6.0000  7.0000  7.5000  7.0000  7.0000  7.0000  8.0000  7.0000
##  [73]  8.0000  8.0000  6.0000  7.5000  7.0000  7.0000  7.0000  9.0000  8.0000
##  [82]  8.0000  8.0000  7.0000  8.0000  7.5000  8.5000  6.5000  6.0000  7.0000
##  [91]  8.0000  8.0000  8.0000  6.0000  8.0000  7.0000  8.0000  7.0000  5.0000
## [100]  9.0000  9.0000  7.5000  6.0000  7.5000  8.0000  7.0000  8.0000  8.0000
## [109]  8.0000  6.0000  8.0000  8.0000  8.0000  9.0000  8.0000  8.0000  6.5000
## [118]  8.0000  8.0000  8.0000  5.0000  9.0000  7.5000  8.0000  7.0000  6.0000
## [127]  6.0000  8.0000  4.0000  5.0000  6.0000  9.0000  7.0000  6.0000  7.5000
## [136]  8.0000  8.5000  8.0000  7.0000  6.5000  9.0000

The code above displays the raw data from the sleep column, which doesn’t tell us much in the way of trends or patterns. Note: the numbers in brackets at the start of each row are NOT part of the data; they mark the position of the data point in the list. For instance, if a row of the printout starts with [10], then the data point after [10] is the 10th entry in the list.
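
Those bracketed positions are also how we ask R for individual entries. As a small sketch, the vector below holds just the first 12 values from the printout above (the name sleep is only a convenience for this example), so the 10th entry should match the value that follows [10]:

```r
# First 12 values of df$sleep, copied from the printout above
sleep <- c(11, 6, 8, 7, 8, 7, 6.5, 8, 7, 6, 8, 8)

sleep[10]       # the 10th entry in the list
sleep[c(1, 2)]  # the first two entries
length(sleep)   # how many entries this shortened vector holds
```

With the full data loaded, df$sleep[10] picks out the same entry.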

We can summarize these data with a frequency plot, which we call a histogram. See the basic data visualization tutorial for details on making histograms look sharp.

hist(df$sleep)

Or a boxplot:

boxplot(df$sleep)

It looks like everyone expects to sleep between 4 and 11 hours a night!
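
We can confirm what the plots suggest with numbers instead of eyeballing. The sketch below uses a handful of values picked from the sleep printout above (not the whole column; the name x is just for this example) to show range() and summary():

```r
# A few values picked from the df$sleep printout (not the full column)
x <- c(11, 6, 8, 7, 4, 7.5833, 9)

range(x)    # smallest and largest value
summary(x)  # min, quartiles, median, mean, and max in one call
```

On the survey data, range(df$sleep) and summary(df$sleep) do the same job for all 141 responses.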

What is the average height of students in this class, as reported in the survey?

mean(df$height)
## [1] 66.75177
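
One thing to watch for: mean() returns NA if a column contains any missing values. The height column evidently has none, since we got a number back, but other data sets may not be so tidy. Here is a small sketch with made-up heights (the vector below is invented for illustration):

```r
# Made-up heights with one missing response
heights <- c(64, 70, NA, 68)

mean(heights)                # NA, because one value is missing
mean(heights, na.rm = TRUE)  # drops the NA first, then averages the rest
```

The same na.rm = TRUE argument works with median(), sd(), and similar summary functions.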

Exploring Categorical Data

Use table() to quickly summarize categorical variable responses. For instance, how does the class feel about whether to abolish the penny?

table(df$abolish.penny)
## 
##  No Yes 
##  96  45

What is the class distribution of eye color?

table(df$eyecolor)
## 
##  Blue Brown Green Hazel 
##    30    91    10    10
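
Counts are often easier to compare as proportions. As a sketch, the code below rebuilds a vector with the same eye color counts shown above (the name eyes is only for this example) and passes the table to prop.table():

```r
# Rebuild a vector with the eye color counts from the table above
eyes <- rep(c("Blue", "Brown", "Green", "Hazel"), times = c(30, 91, 10, 10))

counts <- table(eyes)
counts                         # the counts, as before
round(prop.table(counts), 3)   # each count divided by the total
```

With the survey data loaded, prop.table(table(df$eyecolor)) gives the same proportions directly.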

We can also use table() to investigate a possible relationship between two categorical variables:

table(df$degree,df$abolish.penny)
##           
##            No Yes
##   business 33  17
##   CAS      24  17
##   nursing  39  11

Hmm… what about degree and class.dread?

table(df$degree,df$class.dread)
##           
##             1  2  3  4  5
##   business 12 15 17  5  1
##   CAS      11  8 11  7  4
##   nursing   7 18 20  2  3

Exploring a numerical variable grouped by a categorical variable

We can make side-by-side boxplots of a numerical variable (haircut, for instance) grouped by some categorical variable, such as… eye color!

boxplot(df$haircut ~ df$eyecolor)
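
The formula inside boxplot() reads as "numerical variable ~ grouping variable". A slightly tidier way to write it, sketched here on a small made-up data frame (toy and its values are invented for illustration), is to name the columns in the formula and pass the data frame through the data argument:

```r
# Toy stand-in for the survey data (made-up values, for illustration only)
toy <- data.frame(
  eyecolor = c("Blue", "Blue", "Blue", "Brown", "Brown", "Brown"),
  haircut  = c(20, 25, 120, 30, 35, 40)
)

# Formula interface: numerical variable ~ grouping variable
b <- boxplot(haircut ~ eyecolor, data = toy,
             xlab = "Eye color", ylab = "Cost of last haircut",
             main = "Haircut cost by eye color (toy data)")
b$names  # the groups that were plotted, in left-to-right order
```

This form avoids repeating the df$ prefix and gives cleaner default axis labels.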

Once we have loaded the tidyverse, we can access our data verbs to compute the average haircut cost by eye color group.

library(tidyverse)
df |> 
  group_by(eyecolor) |>
  summarize(avg = mean(haircut))
## # A tibble: 4 × 2
##   eyecolor   avg
##   <chr>    <dbl>
## 1 Blue      37.0
## 2 Brown     39.7
## 3 Green     68.7
## 4 Hazel     39

Are these means useful measures of center? To answer this question, we should ask whether there are outliers and/or strong skewness in the haircut data. Let’s look, with a fairly polished plot with various bells and whistles.

ggplot(df)+
  geom_dotplot(aes(x = haircut, fill = eyecolor), binwidth = 5)+
  scale_y_continuous(breaks = NULL, name = "") + # Hide the y-axis
  scale_fill_manual(values = c("blue","brown","green","wheat2"))+ #match dot color with eyecolor
  facet_grid(eyecolor~.)+
  labs(title = 'Cost of haircuts, grouped by eye color')+
  theme_bw()+
  theme(legend.position = 'none')

There are some definite outliers, which influence the means greatly. The medians would be a better measure of center:

df |> 
  group_by(eyecolor) |>
  summarize(M = median(haircut))
## # A tibble: 4 × 2
##   eyecolor     M
##   <chr>    <dbl>
## 1 Blue      25.5
## 2 Brown     35  
## 3 Green     21  
## 4 Hazel     32.5
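
To see how a single outlier pulls the mean but not the median, it can help to compute both in one summarize() call, along with the group sizes. This is a sketch on a small made-up data frame (toy and its values are invented for illustration, with one deliberately expensive haircut in the Blue group):

```r
library(dplyr)  # already loaded if you ran library(tidyverse)

# Toy stand-in for the survey data (made-up values, for illustration only)
toy <- data.frame(
  eyecolor = c("Blue", "Blue", "Blue", "Brown", "Brown", "Brown"),
  haircut  = c(20, 25, 120, 30, 35, 40)
)

result <- toy |>
  group_by(eyecolor) |>
  summarize(avg = mean(haircut),   # sensitive to the 120 outlier
            M   = median(haircut), # barely affected by it
            n   = n())             # how many students in each group
result
```

In the Blue group the one large value drags the mean well above the median, while in the Brown group, which has no outlier, the two agree.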

Exercises

Use code to arrive at answers to each of these questions. Record your code in the script, make it clear (with comments) which question your code is tackling, and use comments in your script to record your responses.

  1. Find the mean number of states that students in this survey have visited. Also find the median. Are there outliers in these data?
  2. On average, how many hours a week do students in this class plan to study (find the mean)? Are there outliers in these data?
  3. Make a histogram of the haircut variable (how much you spent on your last haircut). Based on the histogram, do you think you spend more or less than the typical student in this class? Is this distribution skewed left, skewed right, or symmetric?
  4. Of all the brown-eyed people in the class, how many answered that they went to sleep at midnight or later?
  5. Group the data according to the degree variable. Which group - Business, Nursing, or the College of Arts and Sciences (CAS) - has the highest mean hours a week that they say they study? Which has the lowest?
  6. Thinking of the averages found in the previous question, do you think there is a significant difference between students' study habits based on their degree track? Make side-by-side boxplots of the study hours variable grouped by degree to inform your response.