On day one of this class, students filled out a survey (via Google Forms). I downloaded the survey results as a .csv file and, after cleaning the data (we’ll discuss what this meant in class), saved the file to the data tab of our course resource site.
On this page, we show how to load the data into an RStudio session, and some ways to begin exploring the data in RStudio.
df <- read.csv("https://mphitchman.com/stats/data/survey_cleaned_f22.csv")
You can run this line of code at the console prompt in RStudio (copy and paste!) to load the survey data into your own RStudio session. The code loads the data and gives it the working name df in your session.
By clicking on df in your Environment Tab you can check the data set. We have 27 rows and 19 columns. This means that 27 students filled out the survey, and 19 variables for each student have been recorded, including eyecolor, year (at Linfield), height (in inches), phone.usage (per day), and others.
By clicking the arrow in the blue circle next to df in your Environment Tab, you can see a list of the column names, and whether they are categorical or numerical variables.
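The same information is available from the console. The sketch below uses a tiny made-up data frame named toy (it is not the survey data) so that it runs on its own; in your session, apply the same functions to df.

```r
# Made-up stand-in for the survey data: three students, two variables
toy <- data.frame(
  eyecolor = c("Blue", "Brown", "Hazel"),  # categorical
  height   = c(64, 70, 67)                 # numerical (inches)
)

dim(toy)      # number of rows, then number of columns
names(toy)    # the column names
str(toy)      # each column's type (chr = text, num = numeric) and first values
head(toy, 2)  # the first two rows
```

Running dim(df) on the survey data should report the 27 rows and 19 columns noted above.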
To ask R to give us information about a particular column in a data frame we need to tell it the name of the data frame and the name of the column of interest, using a $ sign between these names:
df$sleep
## [1] 7.0 7.0 7.0 7.0 8.0 8.0 8.0 8.0 7.0 8.0 7.5 6.0 7.0 7.0 7.0 6.5 9.0 8.5 8.0
## [20] 6.0 6.0 8.0 8.0 7.5 6.0 8.0 8.0
The code above displays the raw data from the sleep column, which doesn’t tell us much in the way of trends or patterns. Note: the numbers in brackets at the start of each row are NOT part of the data; they mark the position in the list of the first value on that row. For instance, the 2nd row of the printout starts with [20], which just indicates that the first data entry on this row (6.0) is the 20th data point in the column.
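To see that the bracketed numbers really are positions, we can ask for a value by its position. This sketch retypes the 27 values printed above into a vector called sleep (a name chosen here for illustration) so that it runs on its own; in your session, df$sleep[20] does the same thing.

```r
# The 27 sleep values, copied from the printout above
sleep <- c(7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 7.0, 8.0, 7.5, 6.0, 7.0,
           7.0, 7.0, 6.5, 9.0, 8.5, 8.0, 6.0, 6.0, 8.0, 8.0, 7.5, 6.0, 8.0, 8.0)

length(sleep)  # 27 values, one per student
sleep[20]      # the 20th value: 6
sleep[1:5]     # the first five values
```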
We can summarize these data with a frequency plot called a histogram. See the basic data visualization tutorial for details on making histograms look sharp.
hist(df$sleep)
Or a boxplot:
boxplot(df$sleep)
It looks like everyone expects to sleep between 6 and 9 hours a night!
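A quick numerical check can back up what the plots suggest. As a sketch, the 27 values printed earlier are retyped here as sleep (an illustrative name) so the code is self-contained; with the data loaded you would use df$sleep instead.

```r
# The 27 sleep values, copied from the printout of df$sleep
sleep <- c(7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 7.0, 8.0, 7.5, 6.0, 7.0,
           7.0, 7.0, 6.5, 9.0, 8.5, 8.0, 6.0, 6.0, 8.0, 8.0, 7.5, 6.0, 8.0, 8.0)

range(sleep)    # smallest and largest values: 6 and 9
table(sleep)    # how many students gave each answer
summary(sleep)  # minimum, quartiles, median, mean, and maximum
```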
What is the average height of students in this class, as reported in the survey?
mean(df$height)
## [1] 66.2963
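One caution worth knowing: if a column contains missing values (NA), mean() returns NA unless we tell it to skip them. The vector below is made up purely for illustration; whether any survey column actually has missing values is not something this page states.

```r
heights <- c(64, NA, 70)     # hypothetical values with one missing entry

mean(heights)                # NA, because one value is missing
mean(heights, na.rm = TRUE)  # 67, the mean of the two non-missing values
```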
Use table() to quickly summarize categorical variable responses. For instance, how does the class feel about whether to abolish the penny?
table(df$abolish.penny)
##
##  No Yes
##  18   9
What is the class distribution of eye color?
table(df$eyecolor)
##
##  Blue Brown Green Hazel
##     9    14     1     3
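Counts are often easier to compare as proportions. In this sketch the eye colors are rebuilt from the counts printed above (so the students are not in their original order); with the data loaded, prop.table(table(df$eyecolor)) should give the same proportions.

```r
# Rebuild the eye color responses from the printed counts
eyecolor <- rep(c("Blue", "Brown", "Green", "Hazel"), times = c(9, 14, 1, 3))

counts <- table(eyecolor)    # same counts as the table above
props  <- prop.table(counts) # each count divided by the total (27)
round(props, 2)              # proportions rounded to two decimals
```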
We can also use table() to investigate a possible relationship between two categorical variables:
table(df$eyecolor,df$abolish.penny)
##
##         No Yes
##   Blue   7   2
##   Brown 10   4
##   Green  0   1
##   Hazel  1   2
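Because the eye color groups differ in size, row proportions can be easier to read than raw counts. The sketch below retypes the counts from the table above into an object named tab (an illustrative name); in your session you could start from tab <- table(df$eyecolor, df$abolish.penny) instead.

```r
# Counts copied from the two-way table above (matrix() fills column by column)
tab <- as.table(matrix(
  c(7, 10, 0, 1,   # the "No" column
    2,  4, 1, 2),  # the "Yes" column
  nrow = 4,
  dimnames = list(eyecolor      = c("Blue", "Brown", "Green", "Hazel"),
                  abolish.penny = c("No", "Yes"))
))

addmargins(tab)                        # adds row and column totals
round(prop.table(tab, margin = 1), 2)  # proportions within each eye color
```

With groups this small (only one green-eyed student), the proportions should be read with care.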
Hmm… what about degree and class.dread?
table(df$degree,df$class.dread)
##
##                                1 2 3 5
##   College of Arts and Sciences 4 3 1 0
##   School of Business           1 2 1 0
##   School of Nursing            2 9 3 1
We can make side-by-side boxplots of a numerical variable (haircut, for instance) grouped by some categorical variable, such as… eye color!
boxplot(df$haircut~df$eyecolor)
We can also compute the average haircut cost by eye color group:
aggregate(df$haircut, list(df$eyecolor), FUN=mean)
##   Group.1        x
## 1    Blue 29.88889
## 2   Brown 33.92857
## 3   Green 30.00000
## 4   Hazel 40.00000
Are these means useful measures of center? I don’t know! Are there outliers in the haircut data? Let’s look, with a fairly polished plot with various bells and whistles.
stripchart(df$haircut~df$eyecolor,
           main="Cost of Haircuts", # title
           xlab="Cost", # x-axis label
           ylab="", # y-axis label (empty for this chart)
           method="stack", # stack repeated values instead of overplotting them
           col=c("blue","brown","green","wheat2"), # dot color for each group
           pch=16 # dot shape: filled circle
)
There are some definite outliers, which influence the means greatly. The medians would be a better measure of center:
aggregate(df$haircut, list(df$eyecolor), FUN=median)
##   Group.1  x
## 1    Blue 27
## 2   Brown 25
## 3   Green 30
## 4   Hazel 30
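aggregate() also accepts a formula, which labels the output columns with the variable names rather than Group.1 and x. The data frame below is made up (it is not the survey data) and includes one expensive haircut to show how a single outlier pulls the mean but not the median.

```r
# Made-up data: five students, one unusually expensive haircut
toy <- data.frame(
  eyecolor = c("Blue", "Blue", "Brown", "Brown", "Brown"),
  haircut  = c(20, 30, 15, 25, 80)  # 80 is the outlier
)

means   <- aggregate(haircut ~ eyecolor, data = toy, FUN = mean)
medians <- aggregate(haircut ~ eyecolor, data = toy, FUN = median)

means    # the Brown mean is 40, pulled up by the outlier
medians  # the Brown median is 25
```

With the survey loaded, the equivalent call would be aggregate(haircut ~ eyecolor, data = df, FUN = median).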
- On average, how many states have students in this class visited? Are there outliers in these data?
- On average, how many hours a week do students in this class plan to study? Are there outliers in these data?
- Make a histogram of the haircut variable (how much you spent on your last haircut). Based on the histogram, do you think you spend more or less than the typical student in this class? Is this distribution skewed left, skewed right, or symmetric?
- Of all the brown-eyed people in the class, how many answered that they went to sleep around midnight?
- Can you compute the average study hours for students in this class grouped by whether they want to abolish the penny?
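Some of these questions ask about a subset of the class. One common approach is to combine conditions with & and count the rows that satisfy both. The data frame and its coffee column below are made up for illustration; swap in df and the relevant survey columns.

```r
# Made-up data: the coffee column is hypothetical, not a survey variable
toy <- data.frame(
  eyecolor = c("Brown", "Brown", "Blue", "Hazel", "Brown"),
  coffee   = c("Yes", "No", "Yes", "Yes", "Yes")
)

# TRUE for rows where both conditions hold
is_match <- toy$eyecolor == "Brown" & toy$coffee == "Yes"

sum(is_match)    # TRUE counts as 1, so this counts the matching rows: 2
toy[is_match, ]  # the matching rows themselves
```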