Open RStudio. Congratulations!
Create a new script in RStudio. To create a
script, you can follow File -> New File -> R Script, or,
even faster, go to the upper left corner of your RStudio window and
click on the green + symbol. Then select R Script.
Recall that a script is a place to write down the commands you want to use in
your work.
To execute lines of code in your script, place your cursor
anywhere on that line of code and click Run.
Alternatively, you can use a keyboard shortcut: Ctrl+Enter on Windows or Linux, or Cmd+Return on a Mac.
Record the code you use to answer the following questions in this script. Use hashtags (#) to provide comments in your script.
Example: Determine the median of these data: 3.2, 3.7, 5.3, 0.7, 4.6, 6.2, 7.3, 1.2, 2.4, 5.2
Answer. We want to use the median() function, and we also need to input the data with the c() function. So here’s one solution, as I might write it in my script:
# code for example
dist=c(3.2, 3.7, 5.3, 0.7, 4.6, 6.2, 7.3, 1.2, 2.4, 5.2)
median(dist)
The median of the vector of values I named dist is 4.15.
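To see what median() is doing here, the following is a minimal sketch that reproduces the calculation by hand. With ten values, the median is the average of the 5th and 6th values once the data are sorted. The names sorted and middle_two are my own, chosen only for illustration.

```r
# the same data as in the example
dist=c(3.2, 3.7, 5.3, 0.7, 4.6, 6.2, 7.3, 1.2, 2.4, 5.2)
# sort the values from smallest to largest
sorted=sort(dist)
# with 10 values, the two middle values sit in positions 5 and 6
middle_two=sorted[c(5, 6)]
middle_two
# averaging the two middle values should agree with median(dist)
mean(middle_two)
median(dist)
```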
In each case below, use RStudio as in the example above to answer the question. Record the code you use in your script, and record your answers on the worksheet provided in class.
- Find the standard deviation of the dist data set. Use sd() for standard deviation.
- Find the five number summary of the dist data set. Use fivenum().
- Find the mean and standard deviation of the data below.
4.5, 9.8, 3.4, 5.5, 3.9, 6.3, 7.9, 6.4, 11.7, 6.5, 6.2, 8.2
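If you are unsure what these functions report, here is a small sketch on a made-up vector (the name toy and its values are hypothetical, not part of the worksheet). sd() returns the sample standard deviation, and fivenum() returns five values in order: minimum, lower hinge, median, upper hinge, maximum.

```r
# a made-up data set, just for illustration
toy=c(1, 2, 3, 4, 5)
# mean of the values
mean(toy)
# sample standard deviation (divides by n - 1, not n)
sd(toy)
# five number summary: min, lower hinge, median, upper hinge, max
fivenum(toy)
```

The same three calls work on any numeric vector you build with c().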
Using the c() function, enter the data set below into your script and call it hank, as I’ve suggested (copy and paste!). Each value is the number of home runs Hank Aaron hit in one season of his career.
hank=c(13,27,26,44,30,39,40,34,45,44,24,32,44,39,29,44,38,47,34,40,20,12,10)
As you work through the questions below in RStudio, you may wish to consult the Descriptive Statistics tutorial on our course resource page.
- How many seasons did Hank Aaron play? Record the value as well as which RStudio command you ran to find it.
- How many home runs did Aaron hit in his career? Record the value as well as which RStudio command you ran to find it.
- What is the maximum number of home runs Aaron hit in a single season? Record the value as well as which RStudio command you ran to find it.
- Determine the five number summary for this distribution, and plot the corresponding box plot in RStudio. Based on this box plot, would you consider Aaron’s distribution of home runs to be skewed right, skewed left, or symmetric? Explain briefly.
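As a hedged illustration of how skew shows up in a five number summary and a box plot, the sketch below uses a made-up vector (skewed_toy is a hypothetical name, and the values are invented). In this toy data set the maximum lies much farther from the median than the minimum does, so the plot stretches out to the right.

```r
# made-up data with a few unusually large values
skewed_toy=c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 14)
# five number summary: min, lower hinge, median, upper hinge, max
fivenum(skewed_toy)
# horizontal box plot of the same values
boxplot(skewed_toy, horizontal=TRUE, main="a right-skewed toy data set")
```

A left-skewed data set would show the opposite pattern, and a roughly symmetric one would have the median near the middle of both the box and the overall range.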
RStudio comes with some built-in data sets. One such data set is called mtcars. This data set gives information about Motor Trend Road Tests of various cars. Run the following code in your script.
df = mtcars
This loads the data frame called mtcars and gives it the working name df in our session. In the Help tab in the lower right window, enter mtcars in the search box to learn about this data set. Use RStudio to answer the following questions.
- How many observations are in this data frame? How many variables?
- What does the am variable tell us about a car? Is this variable categorical or numerical? Hint: Locate the Help tab in the plots and files pane, and search for mtcars.
- What is the average mpg for the 32 cars in this data frame? To determine the mean for a column in a data frame, we use the mean() function. We also need to tell RStudio which data frame AND which column we’re interested in, and we use a dollar sign ($) in our code:
mean(df$mpg)
- What is the median horsepower (hp) of the cars in this data frame?
- Note that the cyl variable records how many cylinders a car has. Run the code table(df$cyl). What information does this provide about our data frame?
- Describe the association between how many cylinders a car has and its fuel efficiency (as measured by mpg). One way to address this question is to compute the average mpg for cars in each group, which we can do with this code:
aggregate(df$mpg,by=list(df$cyl),FUN=mean)
A side-by-side box plot can be helpful as well when we’re exploring a possible association between a numerical and categorical variable.
boxplot(df$mpg~df$cyl, horizontal=TRUE, xlab="mpg", ylab="cylinders", main="mpg, grouped by number of cylinders")
- Describe the association between the weight (wt) of a car and its mpg. Since both variables are numerical, a scatter plot in RStudio with weight and mpg can be helpful. The following base R code can do the trick.
plot(x=df$wt,y=df$mpg)
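The commands in this list all follow the same pattern: the dollar sign pulls one column out of a data frame as a vector, and that vector is handed to a function. Here is a minimal sketch of that pattern on a made-up data frame (pets, kind, and weight are hypothetical names and values, not part of mtcars).

```r
# a made-up data frame with one categorical and one numerical column
pets=data.frame(kind=c("cat", "dog", "cat", "dog", "dog"),
                weight=c(4, 20, 5, 30, 25))
# the dollar sign extracts a single column as a vector
pets$weight
# any function that accepts a vector can then be applied to that column
mean(pets$weight)
# table() counts how many rows fall in each category
table(pets$kind)
# aggregate() applies FUN to weight separately within each kind
aggregate(pets$weight, by=list(pets$kind), FUN=mean)
```

The result of aggregate() has one row per group, with the group label in the column Group.1 and the group mean in the column x.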