The exercises to complete appear at the bottom of this tutorial. Create a script to answer these questions, using comments to provide context for your code and to record your written responses. Copy and paste your script into the submission box for this activity on Blackboard. To get started, these should be the first three lines of your script:
# Your Name
library(tidyverse)
df <- read.csv("https://mphitchman.com/stats/data/student_surveys.csv")
I encourage you to read through the tutorial carefully before tackling the questions.
On day one in this class, students completed a survey (via Google
Forms). I downloaded the results of the survey as a .csv file and,
after cleaning the data (we’ll discuss what this meant in class), saved
the file to the data tab of our course resource site. The data is
accessible as a link on our website. To load the data into your session
and assign it the name df, we run the following line:
df <- read.csv("https://mphitchman.com/stats/data/student_surveys.csv")
By clicking on df in your Environment Tab you can check
the data set. We have 141 rows and 20 columns. This means that 141
students filled out the survey, and 20 variables for each student have
been recorded, including eyecolor, year (at Linfield),
height (in inches), phone.usage (per day), and
others.
By clicking the arrow in the blue circle next to df in
your Environment Tab, you can see a list of the column names, and
whether they are categorical or numerical variables.
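The same checks can be done in code, which is handy because the commands can be recorded in your script. Here is a minimal sketch using a tiny made-up data frame (the name toy and its values are hypothetical stand-ins, not the survey data). Once the survey is loaded, the same three calls should work on df.

```r
# A tiny made-up data frame standing in for the survey (hypothetical values)
toy <- data.frame(
  eyecolor = c("Brown", "Blue", "Brown", "Green"),
  sleep    = c(7, 8, 6.5, 9)
)

dim(toy)    # number of rows, then number of columns
names(toy)  # the column names
str(toy)    # the type of each column, plus its first few values
```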
To ask R to give us information about a particular column in a data frame we need to tell it the name of the data frame and the name of the column of interest, using a $ sign between these names:
df$sleep
## [1] 11.0000 6.0000 8.0000 7.0000 8.0000 7.0000 6.5000 8.0000 7.0000
## [10] 6.0000 8.0000 8.0000 6.0000 8.0000 8.0000 7.0000 8.0000 10.0000
## [19] 7.0000 8.0000 7.5000 8.0000 7.5000 7.0000 7.0000 8.0000 4.0000
## [28] 7.5833 6.0000 9.0000 7.0000 7.0000 10.0000 9.0000 8.0000 6.5000
## [37] 7.0000 7.0000 8.0000 8.0000 8.0000 6.0000 6.5000 9.0000 6.0000
## [46] 7.0000 7.0000 8.0000 8.0000 7.0000 8.0000 5.0000 7.0000 6.0000
## [55] 6.5000 8.0000 7.0000 7.0000 5.0000 7.5000 7.0000 7.0000 8.0000
## [64] 8.0000 6.0000 7.0000 7.5000 7.0000 7.0000 7.0000 8.0000 7.0000
## [73] 8.0000 8.0000 6.0000 7.5000 7.0000 7.0000 7.0000 9.0000 8.0000
## [82] 8.0000 8.0000 7.0000 8.0000 7.5000 8.5000 6.5000 6.0000 7.0000
## [91] 8.0000 8.0000 8.0000 6.0000 8.0000 7.0000 8.0000 7.0000 5.0000
## [100] 9.0000 9.0000 7.5000 6.0000 7.5000 8.0000 7.0000 8.0000 8.0000
## [109] 8.0000 6.0000 8.0000 8.0000 8.0000 9.0000 8.0000 8.0000 6.5000
## [118] 8.0000 8.0000 8.0000 5.0000 9.0000 7.5000 8.0000 7.0000 6.0000
## [127] 6.0000 8.0000 4.0000 5.0000 6.0000 9.0000 7.0000 6.0000 7.5000
## [136] 8.0000 8.5000 8.0000 7.0000 6.5000 9.0000
The code above displays the raw data from the sleep column, which doesn’t tell us much in the way of trends or patterns. Note: the numbers in brackets at the start of each row are NOT part of the data; they mark the position in the list of the first data point in that row. For instance, if a row of the printout starts with [10], then the data point right after [10] is the 10th entry in the list.
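Those bracketed positions can also be used to pull out individual entries. As a small sketch, the vector below holds the first ten sleep values copied from the printout above (the name sleep_head is just for illustration). With the data loaded, df$sleep[10] should return the same 10th entry.

```r
# The first ten sleep values, copied from the printout above
sleep_head <- c(11, 6, 8, 7, 8, 7, 6.5, 8, 7, 6)

sleep_head[10]      # the 10th entry: 6
sleep_head[1:3]     # the first three entries: 11 6 8
length(sleep_head)  # how many entries the vector holds: 10
```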
We can summarize these data with a frequency plot, which we call a histogram. See the basic data visualization tutorial for details on making histograms look sharp.
hist(df$sleep)
Or a boxplot:
boxplot(df$sleep)
It looks like everyone expects to sleep between 4 and 11 hours a night!
What is the average height of students in this class, as reported in the survey?
mean(df$height)
## [1] 66.75177
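One caution when computing means: if a column contains missing values (recorded as NA in R), mean() returns NA unless we tell it to skip them. The height column has no missing values, since a number came back above. Here is a small sketch with made-up heights (the vector heights is hypothetical, not survey data) showing what happens otherwise.

```r
heights <- c(64, 70, NA, 67)  # made-up values; NA marks a missing response

mean(heights)                # NA, because one value is missing
mean(heights, na.rm = TRUE)  # averages the three recorded values: 67
sum(is.na(heights))          # counts the missing values: 1
```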
Use table() to quickly summarize categorical variable
responses. For instance, how does the class feel about whether to
abolish the penny?
table(df$abolish.penny)
##
## No Yes
## 96 45
What is the class distribution of eye color?
table(df$eyecolor)
##
## Blue Brown Green Hazel
## 30 91 10 10
We can also use table() to investigate a possible
relationship between two categorical variables:
table(df$degree,df$abolish.penny)
##
## No Yes
## business 33 17
## CAS 24 17
## nursing 39 11
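Raw counts are hard to compare when the groups differ in size (41 CAS students versus 50 in each of the other two groups). One option is to convert each row to proportions with prop.table(). The sketch below rebuilds the table from the counts printed above so that it runs on its own; the name penny is just for illustration. With the data loaded, prop.table(table(df$degree, df$abolish.penny), margin = 1) should give the same result.

```r
# Counts copied from the degree by abolish.penny table above
penny <- matrix(
  c(33, 17,
    24, 17,
    39, 11),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("business", "CAS", "nursing"), c("No", "Yes"))
)

# margin = 1 divides each count by its row total, so each row sums to 1
round(prop.table(penny, margin = 1), 2)
```

Read this way, about 41% of CAS students favored abolishing the penny, compared with 22% of nursing students.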
Hmm… what about degree and class.dread?
table(df$degree,df$class.dread)
##
## 1 2 3 4 5
## business 12 15 17 5 1
## CAS 11 8 11 7 4
## nursing 7 18 20 2 3
We can make side-by-side boxplots of a numerical variable (haircut, for instance) grouped by some categorical variable, such as… eye color!
boxplot(df$haircut ~ df$eyecolor)
Once we have loaded the tidyverse, we can access our data verbs to compute the average haircut cost by eye color group.
library(tidyverse)
df |>
group_by(eyecolor) |>
summarize(avg = mean(haircut))
## # A tibble: 4 × 2
## eyecolor avg
## <chr> <dbl>
## 1 Blue 37.0
## 2 Brown 39.7
## 3 Green 68.7
## 4 Hazel 39
Are these means useful measures of center? To answer this question, we should ask whether there are outliers and/or strong skewness in the haircut data. Let’s look, with a fairly polished plot with various bells and whistles.
ggplot(df)+
geom_dotplot(aes(x = haircut, fill = eyecolor), binwidth = 5)+
scale_y_continuous(breaks = NULL, name = "") + # Hide the y-axis
scale_fill_manual(values = c("blue","brown","green","wheat2"))+ #match dot color with eyecolor
facet_grid(eyecolor~.)+
labs(title = 'Cost of haircuts, grouped by eye color')+
theme_bw()+
theme(legend.position = 'none')
There are some definite outliers, which influence the means greatly. The medians would be a better measure of center:
df |>
group_by(eyecolor) |>
summarize(M = median(haircut))
## # A tibble: 4 × 2
## eyecolor M
## <chr> <dbl>
## 1 Blue 25.5
## 2 Brown 35
## 3 Green 21
## 4 Hazel 32.5
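To see why a single outlier pulls the mean but barely moves the median, here is a small sketch with made-up haircut costs (the vector cuts is hypothetical, not survey data). It also uses boxplot.stats(), which reports the points a boxplot would draw as outliers.

```r
cuts <- c(20, 25, 30, 35, 300)  # made-up costs with one extreme value

mean(cuts)    # 82, dragged upward by the 300
median(cuts)  # 30, the middle value, unaffected by how extreme the 300 is

# Points more than 1.5 box lengths beyond the box are flagged as outliers
boxplot.stats(cuts)$out  # 300
```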
Use code to arrive at answers to each of these questions. Record your code in the script, make it clear (with comments) which question your code is tackling, and use comments in your script to record your responses.
- Find the mean number of states that students in this survey have visited. Also find the median. Are there outliers in these data?
- On average, how many hours a week do students in this class plan to study (find the mean)? Are there outliers in these data?
- Make a histogram of the haircut variable (how much you spent on your last haircut). Based on the histogram, do you think you spend more or less than the typical student in this class? Is this distribution skewed left, skewed right, or symmetric?
- Of all the brown-eyed people in the class, how many answered that they went to sleep at midnight or later?
- Group the data according to the degree variable. Which group - Business, Nursing, or the College of Arts and Sciences (CAS) - has the highest mean hours a week that they say they study? Which has the lowest?
- Thinking of the averages found in the previous question, do you think there is a significant difference between students’ study habits based on their degree track? Make side-by-side boxplots of the study hours variable grouped by degree to inform your response.