In this lab we begin working in RStudio to examine data sets. In solving the exercises below, we practice how to
Golden Rule of Write-ups: Place statistics in context using complete sentences. In almost every question, our goal is to learn something about a particular distribution, and our answers should reflect that. Which answer is more interesting?
1. The mean is 2.654763; or
2. On average, Hitchman says “cripes” about 2.6 times per class.
Answer 2 is so much better, of course, because it gives the number context, it tells a story. Make an effort to put all your answers in context.
In this lab we conduct exploratory data analysis, aka EDA, on various data sets. EDA is the process of visualizing and otherwise exploring data to help us decide on statistical techniques appropriate for the task we are trying to accomplish.
The process of conducting EDA depends on the types of data.
Variable(s) | Useful tools |
---|---|
1 numerical | mean, median, 5 number summary, box plot, histogram |
1 categorical | table, bar plot |
1 numerical and 1 categorical | means and medians by group, side-by-side box plots |
2 numerical | scatter plot, correlation coefficient |
The Basic Data Visualization tutorial on our course resource page has example code for generating plots that might be appropriate in the questions below.
The Data Visualization with ggplot tutorial on our course resource page has example code for generating superb graphics using the ggplot2 package, which is a part of the tidyverse package.
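As a rough illustration of the table above, the sketch below runs one base R tool from each row. It uses R's built-in `mtcars` data rather than the lab's data, and the choice of columns (`mpg`, `cyl`, `wt`) is only an assumption made for the example.

```r
# Illustration only: built-in mtcars data, not the lab's survey data.
cars <- mtcars
cars$cyl <- factor(cars$cyl)   # treat number of cylinders as categorical

# 1 numerical variable
mpg_mean <- mean(cars$mpg)
mpg_five <- fivenum(cars$mpg)  # min, lower hinge, median, upper hinge, max
hist(cars$mpg, main = "Fuel economy", xlab = "Miles per gallon")
boxplot(cars$mpg, main = "Fuel economy", ylab = "Miles per gallon")

# 1 categorical variable
cyl_counts <- table(cars$cyl)
barplot(cyl_counts, main = "Cars by cylinder count",
        xlab = "Cylinders", ylab = "Count")

# 1 numerical and 1 categorical
mpg_by_cyl <- tapply(cars$mpg, cars$cyl, median)   # median mpg per group
boxplot(mpg ~ cyl, data = cars,
        xlab = "Cylinders", ylab = "Miles per gallon")

# 2 numerical
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
wt_mpg_cor <- cor(cars$wt, cars$mpg)   # correlation coefficient
```

The same calls should carry over to the lab data once the column names are swapped in.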
Open RStudio. Congratulations!
Create a New Project entitled Lab1. We’ll do this together.
Download the Lab 1 template file that I emailed you, and save it in your Lab1 project folder.
In RStudio, open the Lab 1 template file into your session.
Load the `tidyverse` package into your session by entering this command at the console prompt:
library(tidyverse)
At the console prompt run these lines of code (copy and paste!) to import the data sets for this lab into your session.
survey <- read.csv("https://mphitchman.com/stats/data/student_surveys_23_24.csv")
iris <- iris
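Once a data frame is in your session, a few base R functions give a quick look at its size and contents. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption for illustration); replacing `df` with `survey` or `iris` should work the same way.

```r
# Stand-in data frame for illustration; use survey or iris in the lab.
df <- mtcars

n_obs  <- nrow(df)   # number of observations (rows)
n_vars <- ncol(df)   # number of variables (columns)
dim(df)              # rows and columns together

names(df)            # the variable names
str(df)              # the type of each variable, with a few sample values
head(df)             # the first six rows
```

`View(df)` in RStudio opens the full data frame in a spreadsheet-style tab, which can help when looking for a particular row.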
(Optional) Create a new script in RStudio to use as a place to write down and test out commands you want to use in your work.
To execute a line of code in your script, place your cursor anywhere on that line and click Run.
Alternatively, you can use a keyboard shortcut:
The Class Survey Data. You should have already loaded this data frame into your session and named it `survey`.
Answer the following questions in your lab report. In each part, show
your code work, if there is any, by entering the code into the
appropriate gray code chunk. Then type your written response in the
space below the Written response heading.
- How many observations (rows) and how many variables (columns) does `survey` have? Can you find the row that represents your responses?
- The `class.dread` variable records responses to the question: “On a scale of 1 to 5 how much do you dread being in this class?” (1 being ‘not at all’, and 5 being ‘totally dread!’) Make a plot to summarize the distribution of responses to this question, and describe its key features. Were a lot of students dreading this class at the start of the term? What proportion of the respondents answered 4 or 5?
Code tips: In base R, `table(survey$class.dread)` summarizes the results, and `barplot(table(survey$class.dread))` will create a basic plot of the results, to which you should add a title, axis labels, and color as you see fit. If you have loaded the tidyverse, the following code makes a sharper basic plot, to which you can add features to make the plot more reader-friendly (title, etc.):
ggplot(survey)+geom_bar(aes(x=class.dread))
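For questions that ask for a proportion, one possible approach is to convert a table of counts to proportions, or to take the mean of a logical condition. The sketch below does this on the built-in `mtcars` data (the `gear` column is just a stand-in chosen for illustration) and also shows one way to add a title, axis labels, and color to a base R bar plot.

```r
# Illustration only: gear counts from the built-in mtcars data.
gears <- mtcars$gear

gear_counts <- table(gears)             # counts for each value
gear_props  <- prop.table(gear_counts)  # the same table as proportions

# The mean of a TRUE/FALSE vector is the proportion of TRUE values.
prop_4_or_more <- mean(gears >= 4)

barplot(gear_counts,
        main = "Number of forward gears",
        xlab = "Gears",
        ylab = "Number of cars",
        col  = "steelblue")
```

If the variable contains missing values, `mean(x >= 4, na.rm = TRUE)` leaves them out of the calculation.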
- Describe the distribution of heights of the respondents in the survey with an appropriate plot and a sentence or two. Give your plot a title, and label the axes. In your discussion, pay attention to center, shape, and spread.
- Is there an association between how much respondents spend on haircuts and how many shoes they own? Make a plot to investigate this question, and describe what the plot reveals in a sentence or two.
- The `degree` variable records the school in which respondents intend to earn their major. Make a barplot to summarize the results of this question. What does this plot tell us about the makeup of this class in terms of where people plan to earn their major?
- The `achieve` variable records responses to which achievement students would most prefer to have: Olympic Gold, Academy Award, Nobel Prize, or President of the US. Of those respondents who plan to major in the school of business, what proportion selected “president” for their response to the `achieve` question? Answer this question by first building a two-way table for these two variables. In general, one can build a two-way table from two categorical variables in a data frame with the code `table(df$column1, df$column2)`.
- Is there a big difference in the average height of respondents according to where they plan to earn their degree?
Code Tips:
In base R:
aggregate(survey$height,by=list(survey$degree),FUN=mean)
If you have the tidyverse loaded:
survey %>% group_by(degree) %>% summarise(avg=mean(height))
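One thing to watch for: if a column has missing values, `mean()` returns `NA` unless told to skip them. Whether the survey columns have missing values is not stated here, so treat the following as a sketch under that assumption. It uses the built-in `airquality` data, whose `Ozone` column does contain missing readings.

```r
# Illustration only: built-in airquality data, which has missing Ozone values.
aq <- airquality

overall_raw   <- mean(aq$Ozone)                # NA, because some values are missing
overall_clean <- mean(aq$Ozone, na.rm = TRUE)  # mean of the non-missing values

# Group means with missing values removed; extra arguments are passed on to FUN.
by_month <- aggregate(aq$Ozone,
                      by = list(month = aq$Month),
                      FUN = mean, na.rm = TRUE)
by_month
```

With the tidyverse loaded, the matching change would be to write `mean(height, na.rm = TRUE)` inside `summarise()`.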
- Is there an association between the achievement chosen by respondents and how many hours a week a student plans to study? Support your answer with summary statistics and/or plots. You may want to construct side-by-side boxplots of `study`, split by the `achieve` variable.
Code Tips:
In base R: `boxplot(survey$study~survey$achieve)`
If the tidyverse is loaded:
ggplot(survey)+geom_boxplot(aes(x=study,y=achieve))
As usual, add titles, axis labels, and colors as you see fit to make the plots effective.
You may also want to compute the average study hours for each type of achievement, akin to something you did in part g.
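For questions of the form "of those in group A, what proportion chose B?", a two-way table can be converted to row proportions. The sketch below is one way to do this, shown on the built-in `mtcars` data; the `cyl` and `am` columns are stand-ins for two categorical variables and are not part of the lab.

```r
# Illustration only: two categorical variables from the built-in mtcars data.
# am is coded 0 = automatic, 1 = manual.
tab <- table(cylinders = mtcars$cyl, transmission = mtcars$am)
tab

# margin = 1 divides each row by its row total, so each row sums to 1.
row_props <- prop.table(tab, margin = 1)
row_props

# Of the 4-cylinder cars, the proportion with a manual transmission:
prop_manual_given_4cyl <- row_props["4", "1"]
```

Using `margin = 2` instead would give column proportions, which answer the reverse question.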
You have already entered the `iris` data set into your session. This famous set gives petal and sepal lengths and widths of three species of iris (measured in centimeters).
Answer the following questions in your lab report. In each part, show your code work, if there is any, by entering the code into the appropriate gray code chunk. Then type your written response in the space below the Written response heading.
- How many observations are in this data frame? How many variables?
- Which variables, if any, are categorical?
- Find the five number summary for the `Petal.Length` variable. What does the five number summary tell us about the distribution of petal lengths in the data?
- Now create side-by-side boxplots for the `Petal.Length` variable grouped by species. Based on these boxplots, does it seem likely that an iris's petal length is a good predictor of its species?
Code Tip:
If you want the five number summary, and not just the boxplot, the following code (which uses the tidyverse) finds the five number summary for petal length for those observations with species ‘setosa’.
fivenum((iris %>% filter(Species=="setosa"))$Petal.Length)
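If you would like the five number summary for every group at once, one base R option is to split the values by group and apply `fivenum()` to each piece. The sketch below uses the built-in `mtcars` data as a stand-in, with `mpg` as the numerical variable and `cyl` as the grouping variable (both chosen only for illustration).

```r
# Illustration only: five number summaries of mpg for each cylinder group.
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)   # a list with one vector per group

# sapply() runs fivenum() on each group and binds the results into a matrix.
five_by_cyl <- sapply(mpg_by_cyl, fivenum)
rownames(five_by_cyl) <- c("min", "lower hinge", "median", "upper hinge", "max")
five_by_cyl
```

Each column of the result is one group, so the groups can be compared by reading across a row.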
- Make a scatter plot of `Petal.Length` against `Petal.Width`. What does this plot tell us about the relationship between these two variables?
- Make a new scatter plot of `Petal.Length` against `Petal.Width`, this time coloring the dots according to the `Species` of the iris. What does this plot tell us about the role species plays in the relationship between the petal length and petal width of an iris?
- If you come across an iris of unknown species, and you measure its petal length to be 1.8 cm, and its petal width to be 0.4 cm, which species is it likely to be, given your data plots? How confident are you about this conclusion? Explain in a sentence.
- If you come across a second iris of unknown species, and you measure its petal length to be 5 cm, and its petal width to be 1.7 cm, which species is it likely to be, given your data plots? Are you more or less confident about your conclusion in this case than you were in the previous case? Explain in a sentence.
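One way to approach the last few parts is to color a scatter plot by group and then mark a new observation on top of it. The sketch below shows the idea in base R on the built-in `mtcars` data. The columns used and the "new car" measurements are hypothetical values made up for illustration, not part of the lab.

```r
# Illustration only: scatter plot colored by group, plus one new point.
cyl <- factor(mtcars$cyl)
palette_cyl <- c("darkorange", "steelblue", "forestgreen")  # one color per level
point_cols  <- palette_cyl[as.integer(cyl)]                 # a color for each car

plot(mtcars$wt, mtcars$mpg,
     col = point_cols, pch = 19,
     main = "Fuel economy vs. weight, by cylinders",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cyl),
       col = palette_cyl, pch = 19, title = "Cylinders")

# Hypothetical new observation, drawn as a large X on the existing plot.
new_car <- c(wt = 2.0, mpg = 30)
points(new_car["wt"], new_car["mpg"], pch = 4, cex = 2, lwd = 2)
```

Seeing which colored cluster the X lands in, and how far it sits from the other clusters, gives a visual sense of how confident a guess about its group can be.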