Lab Objectives

  1. Practice data visualization with the package ggplot, part of the tidyverse package.
  2. Practice Exploratory Data Analysis (EDA) on a data matrix.
  3. Build, utilize, and critique a linear regression model.

Lab Instructions

Each group will submit solutions to the Lab 3 exercises in the form of an .pdf file created from an R Markdown file in RStudio. A template markdown file will be provided. As a class we will go through a way to convert your knitted document to a pdf file.

Golden Rule of Write-ups: Make an effort to put all your answers in context. Place statistics in context using complete sentences. In almost every question, our goal is to learn something about a particular distribution, and our answers should reflect that. Which answer is more interesting?

  • The mean is 2.654763; or
  • On average, Hitchman says “cripes” about 2.6 times per class.

The second answer is more useful, of course, because it gives the number context!

The Starbucks data set

We work with data on nutrition information for Starbucks beverages. The data set comes from Kaggle, an online data repository and host of swell data science challenges. I have added the data set to our course resource page and you can import it to your session by running this code at the console prompt.

df <- read.csv("https://mphitchman.com/stats/data/starbucks.csv")

Units for the variables:

variable units
TotalFat g
TransFat g
SaturatedFat g
Sodium mg
TotalCarbohydrates g
Cholesterol mg
DietaryFibre g
Sugars g
Protein g
VitaminA pct DV
VitaminC pct DV
Calcium pct DV
Iron pct DV
Caffeine mg

Question of Interest

We explore the relationship between sugar content and the number of calories in Starbucks beverages. Specifically, we ask whether sugar content is associated with higher calories in Starbucks beverages. How well does sugar content predict calorie count?

Motivation
Sugar content and its relationship to obesity in the United States is an important issue, and companies like soda manufacturers are posting information cautioning consumers on the health risks associated with sugar.

Q1 - Exploring the data

a) Based on the question of interest, which variable in the data set df is our response variable? Which variable is our explanatory variable?

b) Since we’re comparing two numerical variables a scatter plot is a good place to start in our exploratory data analysis. In the Q1b code chunck adapt the code below to first glimpse at the relationship between Calories and Sugars. Make sure that the explanatory variable you identified in Q1a is on the \(x\)-axis. Clearly label your axes and provide a title. In your written solution to part (b) write a sentence or two describing the direction, form, and strength of any association you see.

ggplot(df,aes(x=   ,y=   ))+
  geom_point(col="chocolate",size=.9)+
  ggtitle("add title")+
  xlab("add x-axis label")+
  ylab("add y-axis label")

c) Based on the plot of Sugars and Calories, does it look like it would be reasonable to consider a linear model for the association between sugar and calories? Explain briefly.

d) What is the correlation coefficient between Sugars and Calories? What does this number tell us about the strength and direction of the linear relationship between these two variables?

Recall, to find the correlation coefficient between Sugars and Calories we can run the code cor(df$Calories,df$Sugars).

e) Unrelated Question: Is there also an association between the amount of protein in a Starbucks drink and its calorie content? How does this association compare to the association between sugar and calories? Justify your response with visualizations and numerical measures of association.

Q2 - Build the model

Once we have determined that a linear model for Sugars vs Calories is reasonable we can build this model. The lm() command in R creates a linear model with the least squares line. To run the command we need to indicate the data set, the chosen response variable, and the chosen explanatory variable.

a) Create the linear model by running the code fit = lm(Calories~Sugars, data = df)
(Just run the code in the Q2a code chunk - you don’t need to supply any written answer here.)

This creates the model and saves it as fit, which you should see in your Environment tab. fit is an object in R that contains all the information we need about the linear model we just built.

In particular,

  • fit$coefficients gives us the slope and \(y\)-intercept of the least squares line,
  • fit$fitted.values gives us all the predicted \(\hat{y}\) values, and
  • fit$residuals gives us all the residual values.

b) Find the equation of the least squares line by running the code fit$coefficients in the Q2b code chunk. Record the equation of the line in your written response.

Pro tip: Typing $$\hat{y} = a + bx$$ will display (once knitted) a nicely typeset equation of a line. Of course, you will want to change a and b to your actual intercept and slope values.

Check: The code in Q2b gives the \(y\)-intercept and slope of the least squares line. You should find that the least squares regression line for the linear model has equation \[\hat{y}=37.5429 + 4.7426x,\] where \(\hat{y}\) denotes the predicted calories of a Starbucks drink with sugars equal to \(x\) grams.

c) In the Q2c code chunk adapt the following code to add the least-squares regression line to a scatter plot of the association between Calories (response variable) and Sugars (explanatory variable).

ggplot(df,aes(x=Sugars,y=Calories))+
  geom_point(col="chocolate",size=.9)+
  geom_smooth(method='lm', formula=y~x,col="seagreen",linewidth=.8,se=FALSE)+
  ggtitle("add title")+
  xlab("add x-axis label")+
  ylab("add y-axis label")

Q3 - Assess the model

Before we use the least-squares regression line to make any conclusions about Starbucks drinks, let’s see if we can determine how good a job our model actually does at explaining the variation in Calories.

a) What percentage of the variability in Calories is explained by its linear relationship with Sugars?
Hint: Think \(r^2\)!!

Note: We have two ways to find \(r^2\). The first approach is to square the correlation coefficient:

cor(df$Sugars, df$Calories)^2

The second way makes use of the summary(fit) command, which gives us other useful information about our model that we will want. Note that \(r^2\) is called the in the summary output.

A residual plot can be used to assess the linearity of the fit. Recall that the residual of a data point is the observed \(y\) value of the observation minus the predicted (from the line) \(y\)-value.

b) In the part (b) code chunk run the code below to create the residual plot. Then in your written response to part (b), comment on whether you think constant variance seems to be a reasonable assumption? Explain your answer.

ggplot(fit, aes(x = Sugars, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0,linetype="dashed",col="red",size=.6)+
  ylab("residual")+
  ggtitle("residual plot")

c) Use ggplot to create a histogram of the residuals, choosing appropriate bin widths, and labeling the plot appropriately. (I have provided basic code below that you can “embellish”.) Does the histogram of the residuals look like what we would expect for a normal distribution? Explain.

ggplot(fit) +
  geom_histogram(aes(x=.resid),bins=25,col="white",fill="wheat")

If constant variance seems reasonable, and the histogram of the residuals is reasonably bell-shaped, these are good indications that a linear model is a good model for the relationship between sugar content and calories in Starbucks beverages. Regardless of your answers above, in the next question we use the linear model in the next questions 8-).

Q4 - Use the model

a) Interpret the slope of the regression line in the context of the data by completing this sentence: “For every 10 grams of additional sugar, the expected increase in calories is… .”

b) If a Starbucks drink has 51 grams of sugar, about how many calories would the drink have, as predicted by the linear model? Use the equation of the line you found in Q2 to make this prediction.

c) The data set does show a Starbucks drink with exactly 51 grams of sugar (the no-whip White Chocolate Mocha with Soy Milk, of course). Determine the residual value for this drink. Does this drink lie above or below the least squares line?

Note, we can ask R to search the starbucks data set for all drinks having 51 grams of sugar by running this code:
df[df$Sugars==51,]

BONUS: Grouping by category

a) The first variable in df is Beverage_category. What are the different beverage categories? Use the table() command.

We can add a categorical variable to a scatter plot by using color. In a ggplot, to add color to the dots according to their beverage type, we add aes(col=Beverage_category) inside the geom_point() layer (since it’s the geom_point() layer that creates the dots).

b) Run the following code in your BONUSb code chunk. Then explain interesting features this use of color reveals about the association between sugar and calories for Starbucks beverages. For instance, which drink categories does our linear model do the worst job of predicting? Which categories tend to have more calories than what is predicted by the linear fit with sugar? Which categories tend to have fewer calories than what is predicted by the line? Does that seem reasonable?

ggplot(df,aes(x=Sugars,y=Calories))+
  geom_point(aes(col=Beverage_category))+
  geom_smooth(method='lm', formula=y~x,col="seagreen",size=.8,se=FALSE)+
  ggtitle("add title")+
  xlab("add x-axis label")+
  ylab("add y-axis label")

c) Build a linear model for the association between sugar and calories for just the Signature Espresso Drinks. Report the correlation coefficient, and the equation of the least squares regression line. How does this fit compare with the fit for all Starbucks beverages?