Activity Objectives

  1. Practice data visualization with the package ggplot, part of the tidyverse package.
  2. Practice Exploratory Data Analysis (EDA) on a data matrix.
  3. Build, utilize, and critique a linear regression model.

Activity Instructions

Each of you will submit your completed worksheet (passed out in class). Please follow the…

Golden Rule of Write-ups: Make an effort to put all your answers in context. Place statistics in context using complete sentences. In almost every question, our goal is to learn something about a particular distribution, and our answers should reflect that. Which answer is more interesting?

  • The mean is 2.654763; or
  • On average, Hitchman says “cripes” about 2.6 times per class.

The second answer is more useful, of course, because it gives the number context!

The Starbucks data set

We work with data on nutrition information for Starbucks beverages. The data set comes from Kaggle, an online data repository and host of swell data science challenges. I have added the data set to our course resource page and you can import it to your session by running this code at the console prompt, or in a new script.

df <- read.csv("https://mphitchman.com/stats/data/starbucks.csv")

Units for the variables:

variable units
TotalFat g
TransFat g
SaturatedFat g
Sodium mg
TotalCarbohydrates g
Cholesterol mg
DietaryFibre g
Sugars g
Protein g
VitaminA pct DV
VitaminC pct DV
Calcium pct DV
Iron pct DV
Caffeine mg

Question of Interest

We explore the relationship between sugar content and the number of calories in Starbucks beverages. Specifically, we ask whether sugar content is associated with higher calories in Starbucks beverages. How well does sugar content predict calorie count?

Motivation
Sugar content and its relationship to obesity in the United States is an important issue, and companies like soda manufacturers are posting information cautioning consumers on the health risks associated with sugar.

Q1 - Exploring the data

a) Based on the question of interest, which variable in the data set df is our response variable? Which variable is our explanatory variable?

b) Make a point plot to visualize association between sugar and calories in Starbucks beverages. Make sure that the explanatory variable you identified in Q1a is on the \(x\)-axis. Clearly label your axes and provide a title. In your written response to part (b) write a sentence or two describing the direction, form, and strength of any association you see.

c) Based on the plot of Sugars and Calories, does it look like it would be reasonable to consider a linear model for the association between sugar and calories? Explain briefly.

d) What is the correlation coefficient between Sugars and Calories? What does this number tell us about the strength and direction of the linear relationship between these two variables?

Recall, to find the correlation coefficient between Sugars and Calories we can run the code cor(df$Calories,df$Sugars).

e) Is there also an association between the amount of cholesterol in a Starbucks drink and its calorie content? How does this association compare to the association between sugar and calories? Justify your response with visualizations and numerical measures of association.

Q2 - Build the model

Once we have determined that a linear model for Sugars vs Calories is reasonable we can build this model. The lm() command in R creates a linear model with the least squares line. To run the command we need to indicate the data set, the chosen response variable, and the chosen explanatory variable.

a) Create the linear model by running the code fit = lm(Calories~Sugars, data = df)
(Just run the code in the Q2a code chunk - you don’t need to supply any written response here.)

This creates the model and saves it as fit, which you should see in your Environment tab. fit is an object in R that contains all the information we need about the linear model we just built.

In particular,

  • fit$coefficients gives us the slope and \(y\)-intercept of the least squares line,
  • fit$fitted.values gives us all the predicted \(\hat{y}\) values, and
  • fit$residuals gives us all the residual values.

b) Find the equation of the least squares line by running the code fit$coefficients in the Q2b code chunk. Record the equation of the line in your written response.

c) In the Q2c code chunk adapt the following code to add the least-squares regression line to a scatter plot of the association between Calories (response variable) and Sugars (explanatory variable).

ggplot(df,aes(x=Sugars,y=Calories))+
  geom_point(col="chocolate")+
  geom_smooth(method='lm', formula=y~x,col="seagreen",linewidth=.8,se=FALSE)+
  labs(title = "add title",
       x = "add x-axis label",
       y = "add y-axis label")

Q3 - Assess the model

Before we use the least-squares regression line to make any conclusions about Starbucks drinks, let’s see if we can determine how good a job our model actually does at explaining the variation in Calories.

a) What percentage of the variability in Calories is explained by its linear relationship with Sugars?
Hint: Think \(r^2\)!!

Note: We have two ways to find \(r^2\). The first approach is to square the correlation coefficient:

cor(df$Sugars, df$Calories)^2

The second way makes use of the summary(fit) command, which gives us other useful information about our model that we will want. Note that \(r^2\) is called the in the summary output.

A residual plot can be used to assess the linearity of the fit. Recall that the residual of a data point is the observed \(y\) value of the observation minus the predicted (from the line) \(y\)-value.

b) In the part (b) code chunk run the code below to create the residual plot. Then in your written response to part (b), comment on whether you think constant variability seems to be a reasonable assumption for this model. Explain your answer.

ggplot(fit, aes(x = Sugars, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0,linetype="dashed",col="red",linewidth=.6)+
  ylab("residual")+
  ggtitle("residual plot")

c) Use ggplot to create a histogram of the residuals, choosing appropriate bin widths, and labeling the plot appropriately. (I have provided basic code below that you should “embellish”.) Does the histogram of the residuals look like what we would expect for a normal distribution? Explain.

ggplot(fit) +
  geom_histogram(aes(x=.resid),bins=25,col="white",fill="wheat")

d) If constant variability seems reasonable, the value of \(r^2\) is fairly large (on the scale from 0 to 1), and the histogram of the residuals is reasonably bell-shaped, these are good indications that a linear model is a good model for the relationship between sugar content and calories in Starbucks beverages. Summarize your findings in Q3 - does a linear model seem to be a reasonable model to use here?

Q4 - Use the model

a) Interpret the slope of the regression line in the context of the data by completing this sentence: “For every 10 grams of additional sugar, the expected increase in calories is… .”

b) If a Starbucks drink has 47 grams of sugar, about how many calories would the drink have, as predicted by the linear model? Use the equation of the line you found in Q2 to make this prediction.

c) The data set does show a Starbucks drink with exactly 47 grams of sugar (the Frappuccino® Blended Coffee with Soy Milk, of course). Determine the residual value for this drink. Does this drink lie above or below the least squares line?

Note, we can ask R to search the starbucks data set for all drinks having 47 grams of sugar by running this code:
df[df$Sugars==47,]

Q5 - Grouping by category

a) The first variable in df is Beverage_category. What are the different beverage categories? Use the table() command.

We can add a categorical variable to a scatter plot by using color. In a ggplot, to add color to the dots according to their beverage type, we add aes(col=Beverage_category) inside the geom_point() layer (since it’s the geom_point() layer that creates the dots).

b) Recreate your plot from Q2 c), but this time color each dot according to the drink’s Beverage category. Then explain interesting features this use of color reveals about the association between sugar and calories for Starbucks beverages. For instance, which drink categories does our linear model do the worst job of predicting? Which categories tend to have more calories than what is predicted by the linear fit with sugar? Which categories tend to have fewer calories than what is predicted by the line? Does that seem reasonable?