Getting Started

  1. Open RStudio. Congratulations!

  2. Create a new script in RStudio. To create a script, choose File -> New File -> R Script or, even faster, click the green + symbol in the upper left corner of your RStudio window and select R Script.
    Recall that a script is a place to write down the commands you want to use in your work.
    To execute lines of code in your script, place your cursor anywhere on that line of code and click Run.
    Alternatively, you can use a keyboard shortcut:

    • Command + Return on a Mac
    • Ctrl + Enter on a PC
  3. Record the code you use to answer the following questions in this script. Use the hash symbol (#) to add comments to your script.
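
As a minimal sketch of what a commented script might look like (the name scores and its values are made up for illustration and are not part of this lab):

```r
# Practice script
# Everything after a # on a line is a comment; R ignores it when the line runs.

# hypothetical data, made up for illustration
scores = c(2, 4, 6, 8)

mean(scores)  # average of the four values
```

Running the last line with the cursor on it (Run, or the keyboard shortcut) sends only that line to the console; the comment after the # is not executed.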

1. Basic Descriptive Statistics


Example: Determine the median of these data: 3.2, 3.7, 5.3, 0.7, 4.6, 6.2, 7.3, 1.2, 2.4, 5.2


Answer. We want to use the median() function, and we also need to enter the data with the c() function. Here’s one solution, as I might write it in my script:

# code for example
dist=c(3.2, 3.7, 5.3, 0.7, 4.6, 6.2, 7.3, 1.2, 2.4, 5.2)
median(dist)

The median of the vector of values I named dist is 4.15.
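
To see where this value comes from, it can help to sort the data. With 10 values there is no single middle value, so the median is the average of the 5th and 6th values in sorted order. A quick check of that reasoning (sorted_dist is just an illustrative name):

```r
dist = c(3.2, 3.7, 5.3, 0.7, 4.6, 6.2, 7.3, 1.2, 2.4, 5.2)

# put the values in increasing order
sorted_dist = sort(dist)
sorted_dist

# average the two middle values (positions 5 and 6 out of 10)
mean(sorted_dist[5:6])
```

The last line should agree with what median(dist) reports.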

In each case below, use RStudio as in the example above to answer the question. Record the code you use in your script, and record your answers on the worksheet provided in class.

  1. Find the standard deviation of the dist data set. Use sd() for standard deviation.
  2. Find the five-number summary of the dist data set. Use fivenum().
  3. Find the mean and standard deviation of the data below.
    4.5, 9.8, 3.4, 5.5, 3.9, 6.3, 7.9, 6.4, 11.7, 6.5, 6.2, 8.2

2. Hank Aaron

Hank Aaron is one of Major League Baseball’s greatest and most admired stars. He retired in 1976 as the all-time home run leader. Here are the home runs he hit in each of his seasons:
13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32, 44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10

Using the c() function, enter this data set into your script and call it hank, as I’ve suggested below (copy and paste!):

hank=c(13,27,26,44,30,39,40,34,45,44,24,32,44,39,29,44,38,47,34,40,20,12,10)

As you work through the questions below in RStudio, you may wish to consult the Descriptive Statistics tutorial on our course resource page.

  1. How many seasons did Hank Aaron play? Record the value as well as which RStudio command you ran to find it.
  2. How many home runs did Aaron hit in his career? Record the value as well as which RStudio command you ran to find it.
  3. What is the maximum number of home runs Aaron hit in a single season? Record the value as well as which RStudio command you ran to find it.
  4. Determine the five-number summary for this distribution, and plot the corresponding box plot in RStudio. Based on this box plot, would you consider Aaron’s distribution of home runs to be skewed right, skewed left, or symmetric? Explain briefly.

3. Cars Data

R comes with some built-in data sets, which are available in RStudio. One such data set is called mtcars. This data set gives information about Motor Trend road tests of various cars. Run the following code in your script.

df = mtcars

This loads the data frame called mtcars and gives it the working name df in our session. In the Help tab in the lower right window, enter mtcars in the search box to learn about this data set. Use RStudio to answer the following questions.
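
Before answering the questions, it can be useful to glance at the first few rows of the data frame. A small sketch using base R's head() function; the choice of 3 rows is arbitrary. Typing ?mtcars in the console is another way to open the same help page.

```r
df = mtcars

# show the first 3 rows; each row is one car, each column is one variable
head(df, 3)
```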

  1. How many observations are in this data frame? How many variables?
  2. What does the am variable tell us about a car? Is this variable categorical or numerical? Hint: Locate the Help tab in the plots and files pane, and search for mtcars.
  3. What is the average mpg for the 32 cars in this data frame? To determine the mean for a column in a data frame, we use the mean() function. We also need to tell R which data frame AND which column we’re interested in, and we use a dollar sign ($) in our code to do so:
mean(df$mpg)
  4. What is the median horsepower (hp) of the cars in this data frame?
  5. Note that the cyl variable records how many cylinders a car has. Run the code table(df$cyl). What information does this provide about our data frame?
  6. Describe the association between how many cylinders a car has and its fuel efficiency (as measured by mpg). One way to address this question is to compute the average mpg for cars in each group, which we can do with this code:
aggregate(df$mpg,by=list(df$cyl),FUN=mean)

A side-by-side box plot can be helpful as well when we’re exploring a possible association between a numerical and a categorical variable.

boxplot(df$mpg~df$cyl, horizontal=TRUE, xlab="mpg", ylab="cylinders", main="mpg, grouped by number of cylinders")
  7. Describe the association between the weight (wt) of a car and its mpg. Since both variables are numerical, a scatter plot of weight and mpg in RStudio can be helpful. The following base R code does the trick.
plot(x=df$wt,y=df$mpg)
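
The aggregate() and plot() calls above name their columns with df$ each time. Base R also accepts a formula together with a data argument, which some people find easier to read. Here is a sketch of the same two steps written that way; the axis labels and title are my own wording, with the weight units taken from the mtcars help page.

```r
df = mtcars

# mean mpg for each number of cylinders, written as a formula: response ~ group
aggregate(mpg ~ cyl, data = df, FUN = mean)

# scatter plot of mpg (vertical axis) against weight (horizontal axis), with labels
plot(mpg ~ wt, data = df,
     xlab = "weight (1000 lbs)", ylab = "miles per gallon",
     main = "mpg versus weight")
```

Either style gives the same group means and the same points; the formula version simply lets the column names stand on their own.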