Lab 1 Objectives

In this lab we begin working in RStudio to examine data sets. In solving the exercises below, we practice how to

  1. import data sets into an RStudio session,
  2. determine summary statistics,
  3. present data visually,
  4. use R Markdown to write a lab report, and
  5. write solutions following the Golden Rule of Write-ups.

Golden Rule of Write-ups: Place statistics in context using complete sentences. In almost every question, our goal is to learn something about a particular distribution, and our answers should reflect that. Which answer is more interesting?
1. The mean is 2.654763; or
2. On average, Hitchman says “cripes” about 2.7 times per class.
Answer 2 is much better because it gives the number context and tells a story. Make an effort to put all your answers in context.

In this lab we conduct exploratory data analysis, aka EDA, on various data sets. EDA is the process of visualizing and otherwise exploring data to help us decide on statistical techniques appropriate for the task we are trying to accomplish.

The process of conducting EDA depends on the types of data.

Variable(s)                      Useful tools
1 numerical                      mean, median, five-number summary, box plot, histogram
1 categorical                    table, bar plot
1 numerical and 1 categorical    means and medians by group, side-by-side box plots
2 numerical                      scatter plot, correlation coefficient

The Basic Data Visualization tutorial on our course resource page has example code for generating plots that might be appropriate in the questions below.

The Data Visualization with ggplot tutorial on our course resource page has example code for generating superb graphics using the ggplot2 package, which is a part of the tidyverse package.
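
For quick orientation, here is one way the tools listed above map onto base R functions. This is a minimal sketch that uses R's built-in mtcars data set, which is not part of this lab; the object name car_data and the choice of columns are only for illustration.

```r
# Built-in data used purely for illustration (not one of the lab data sets).
car_data <- mtcars
car_data$cyl <- factor(car_data$cyl)   # treat cylinder count as categorical

# 1 numerical variable: numerical summaries, histogram, box plot
mean(car_data$mpg)
median(car_data$mpg)
five_num <- fivenum(car_data$mpg)      # min, lower hinge, median, upper hinge, max
five_num
hist(car_data$mpg, main = "Fuel economy", xlab = "Miles per gallon")
boxplot(car_data$mpg, main = "Fuel economy", ylab = "Miles per gallon")

# 1 categorical variable: table and bar plot
cyl_counts <- table(car_data$cyl)
cyl_counts
barplot(cyl_counts, main = "Cars by cylinder count",
        xlab = "Cylinders", ylab = "Number of cars")

# 1 numerical and 1 categorical: group means, side-by-side box plots
aggregate(mpg ~ cyl, data = car_data, FUN = mean)
boxplot(mpg ~ cyl, data = car_data, main = "Fuel economy by cylinder count",
        xlab = "Cylinders", ylab = "Miles per gallon")

# 2 numerical variables: scatter plot and correlation coefficient
plot(car_data$wt, car_data$mpg, main = "Fuel economy vs. weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
r <- cor(car_data$wt, car_data$mpg)
r
```

The same pattern should carry over to the lab data by swapping in the relevant data frame and column names.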

Getting Started

  1. Open RStudio. Congratulations!

  2. Create a New Project entitled Lab1. We’ll do this together.

  3. Download the Lab 1 template file that I emailed you, and save it in your Lab1 project folder.

  4. In RStudio, open the Lab 1 template file into your session.

  5. Load the tidyverse package into your session by entering this command at the console prompt:

    • library(tidyverse)
  6. At the console prompt run these lines of code (copy and paste!) to import the data sets for this lab into your session.

    • survey <- read.csv("https://mphitchman.com/stats/data/student_surveys_23_24.csv")
    • iris <- iris
  7. (Optional) Create a new script in RStudio to use as a place to write down and test out commands you want to use in your work.
    To execute a line of code in your script, place your cursor anywhere on that line and click Run.
    Alternatively, you can use a keyboard shortcut:

    • Command + Return on a Mac
    • Ctrl + Enter on a PC
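
Before moving on, it can help to confirm that the import steps worked. The sketch below uses only base R functions and checks the built-in iris data; the same calls should work on survey once the import command above has run with a working internet connection.

```r
# Copy the built-in iris data into the workspace, as in the import step.
iris <- iris

exists("iris")        # TRUE if an object with this name is available
is.data.frame(iris)   # TRUE if it is a data frame
dim(iris)             # number of rows, then number of columns
names(iris)           # column (variable) names
str(iris)             # type of each variable, with a few sample values
head(iris)            # first six rows
```

If read.csv() fails for survey, the usual culprits are a mistyped URL or no network access, so copy and paste the line rather than retyping it.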

Q1. Class Survey

The Class Survey Data. You should have already loaded this data frame into your session and named it survey.
Answer the following questions in your lab report. In each part, show your code work, if there is any, by entering the code into the appropriate gray code chunk. Then type your written response in the space below the Written response heading.

  a. How many observations (rows) and how many variables (columns) does survey have? Can you find the row that represents your responses?
  b. The class.dread variable records responses to the question: “On a scale of 1 to 5 how much do you dread being in this class?” (1 being ‘not at all’, and 5 being ‘totally dread!’) Make a plot to summarize the distribution of responses to this question, and describe its key features. Were a lot of students dreading this class at the start of the term? What proportion of the respondents answered 4 or 5?
    Code tips: In base R, table(survey$class.dread) summarizes the results, and barplot(table(survey$class.dread)) creates a basic plot of the results, to which you should add a title, axis labels, and color as you see fit. If you have loaded the tidyverse, the following code makes a sharper basic plot, to which you can add features (title, etc.) to make the plot more reader-friendly:
    ggplot(survey)+geom_bar(aes(x=class.dread))
  c. Describe the distribution of heights of the respondents in the survey with an appropriate plot and a sentence or two. Give your plot a title, and label the axes. In your discussion, pay attention to center, shape, and spread.
  d. Is there an association between how much respondents spend on haircuts and how many shoes they own? Make a plot to investigate this question, and describe what the plot reveals in a sentence or two.
  e. The degree variable records the school in which respondents intend to earn their major. Make a bar plot to summarize the results of this question. What does this plot tell us about the makeup of this class in terms of where people plan to earn their major?
  f. The achieve variable records which achievement students would most prefer to have: Olympic Gold, Academy Award, Nobel Prize, or President of the US. Of those respondents who plan to major in the school of business, what proportion selected “president” for their response to the achieve question? Answer this question by first building a two-way table for these two variables. In general, one can build a two-way table from two categorical variables in a data frame with the code table(df$column1,df$column2), where df stands for the name of the data frame.
  g. Is there a big difference in the average height of respondents according to where they plan to earn their degree?
    Code tips:
    In base R:
    aggregate(survey$height,by=list(survey$degree),FUN=mean)
    If you have the tidyverse loaded:
    survey %>% group_by(degree) %>% summarise(avg=mean(height))
  h. Is there an association between the achievement chosen by respondents and how many hours a week a student plans to study? Support your answer with summary statistics and/or plots. You may want to construct side-by-side box plots of study, split by the achieve variable.
    Code tips:
    In base R:
    boxplot(survey$study~survey$achieve)
    If the tidyverse is loaded:
    ggplot(survey)+geom_boxplot(aes(x=study,y=achieve))
    As usual, add titles, axis labels, and colors as you see fit to make the plots effective.
    You may also want to compute the average study hours for each type of achievement, akin to what you did in part g.
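
Several of the questions above ask for a proportion or a labeled plot. The sketch below shows one way to get these in base R. It runs on a small made-up data frame called toy, whose column names mirror the survey but whose values are invented for illustration, so none of the output says anything about the class.

```r
# Made-up responses for illustration only; NOT the class survey data.
toy <- data.frame(
  class.dread = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 1),
  degree  = c("business", "arts", "business", "science", "arts",
              "business", "science", "business", "arts", "science"),
  achieve = c("president", "nobel", "olympic", "nobel", "academy",
              "president", "olympic", "nobel", "president", "nobel")
)

# Bar plot of a 1-to-5 response with a title, axis labels, and color
barplot(table(toy$class.dread),
        main = "Dread of class (made-up data)",
        xlab = "Response (1 = not at all, 5 = totally dread)",
        ylab = "Number of respondents",
        col = "steelblue")

# Proportion answering 4 or 5: the mean of a TRUE/FALSE vector
prop_high <- mean(toy$class.dread >= 4)
prop_high

# Two-way table: rows are degree, columns are achieve
tab <- table(toy$degree, toy$achieve)
tab

# Row proportions: within each degree, the share choosing each achievement
row_props <- prop.table(tab, margin = 1)
row_props
```

If a column in the real data contains missing values, mean() returns NA unless na.rm = TRUE is added, for example mean(x >= 4, na.rm = TRUE).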

Q2. The Iris Data Set

You have already entered the iris data set into your session. This famous set gives petal and sepal lengths and widths of three species of iris (measured in centimeters).

Answer the following questions in your lab report. In each part, show your code work, if there is any, by entering the code into the appropriate gray code chunk. Then type your written response in the space below the Written response heading.

  a. How many observations are in this data frame? How many variables?
  b. Which variables, if any, are categorical?
  c. Find the five-number summary for the Petal.Length variable. What does the five-number summary tell us about the distribution of petal lengths in the data?
  d. Now create side-by-side box plots for the Petal.Length variable grouped by species. Based on these box plots, does an iris's petal length seem to be a good predictor of its species?
    Code tip:
    If you want the five-number summary, and not just the box plot, the following code (which uses the tidyverse) finds the five-number summary for petal length for those observations with species ‘setosa’.
    fivenum((iris %>% filter(Species=="setosa"))$Petal.Length)
  e. Make a scatter plot of Petal.Length against Petal.Width. What does this plot tell us about the relationship between these two variables?
  f. Make a new scatter plot of Petal.Length against Petal.Width, this time coloring the dots according to the species of the iris. What does this plot tell us about the role species plays in the relationship between the petal length and petal width of an iris?
  g. If you come across an iris of unknown species, and you measure its petal length to be 1.8 cm and its petal width to be 0.4 cm, which species is it likely to be, given your data plots? How confident are you about this conclusion? Explain in a sentence.
  h. If you come across a second iris of unknown species, and you measure its petal length to be 5 cm and its petal width to be 1.7 cm, which species is it likely to be, given your data plots? Are you more or less confident about your conclusion in this case than you were in the previous case? Explain in a sentence.
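
The grouped summaries and colored scatter plots asked for above can be produced in base R along the following lines. This sketch deliberately uses the sepal measurements so that the petal questions are left to you; the object name sepal_summaries and the color and legend choices are just one reasonable option.

```r
# Scatter plot of sepal length against sepal width, colored by species.
# as.integer() turns the three factor levels into the color codes 1, 2, 3.
plot(iris$Sepal.Width, iris$Sepal.Length,
     col = as.integer(iris$Species), pch = 19,
     main = "Sepal length vs. sepal width by species",
     xlab = "Sepal width (cm)", ylab = "Sepal length (cm)")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# Five-number summary of sepal length for every species in one call.
sepal_summaries <- tapply(iris$Sepal.Length, iris$Species, fivenum)
sepal_summaries
```

Replacing the Sepal columns with the Petal columns adapts the same code to the questions in this section.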