Lab 2 Objectives

  1. Load data sets into R as data frames and conduct exploratory data analysis by finding summary statistics and with visualizations.

  2. Conduct statistical inference in R with real data sets loaded as data frames, and check whether the conditions for reliable inference are being met.

  3. Interpret the results of the inference.

Lab 2 Instructions

  1. Open RStudio in Posit Cloud.

  2. Download the Lab 2 template file from Blackboard and upload it into your project.

  3. Open the Lab 2 template file in your session.

  4. Open a new script in your RStudio session to use as “scratch paper” for trying out code to solve lab questions.

  5. Run the following three lines in your script to activate tidyverse commands and load the two data sets:

    1. survey - the class survey data we considered in Lab 1
    2. nc - information about 150 randomly chosen births in North Carolina (a data set provided by our text)

    library(tidyverse)
    survey <- read.csv("https://mphitchman.com/stats/data/student_surveys.csv")
    nc <- read.csv('https://www.openintro.org/data/csv/births.csv')

  6. In your final lab report be sure to:

  • indicate the first and last names of all group members
  • maintain the formatting of the lab report as you enter your code chunks and written responses. It may be easier to keep the formatting if you toggle to Visual mode.
  • Knit early and knit often. Remember, the knitted .html file will be your final report, and your lab grade will be based on completeness of solutions, correctness of code, and the clarity of the presentation in the output file.
  7. Remember the Golden Rule of Write-ups: Place statistics in context using complete sentences. Numbers without context are not statistics; they are just numbers. Use complete sentences. Use punctuation. Use capital letters as appropriate.

  8. Each group will submit one lab report (the final knitted .html file) as an attachment in an email to me.

  9. Each of you will submit to Blackboard your own two-paragraph summary of the lab, to include a brief description of the problems you tackled and what your group concluded, as well as how you in particular helped with the completion of the lab report. Also include a short paragraph on how well your group worked together on this project.

Q1 Inference on a proportion

According to Google, 45% of people in the US have brown eyes. Does our class survey data provide statistically significant evidence at the \(\alpha=.05\) level that \(p\), the proportion of all Linfield students with brown eyes, is different from .45? Let’s investigate this question by calculating a 95% confidence interval for \(p\), treating the eyecolor column of our class survey data as a simple random sample of all Linfield students.

  a. Use your Q1 a) code chunk to find the sample proportion of brown-eyed students in the survey sample. The length() and table() commands may be helpful. Also provide the sample proportion in the written response to this question. [Last write-up reminder to write complete sentences and provide context to your numbers.]
  b. Are the conditions for inference being met here? That is, check that the success-failure condition is met and that the sample size is less than 10% of the entire population of Linfield students.
  c. Use a code chunk to calculate the confidence interval for \(p\). Give your interval in the (low to high) format. In your written response to this part, state the interval and explain what it tells us about brown-eyed Linfield students.
  d. Do we have statistically significant evidence at the \(\alpha = .05\) level that the proportion of all Linfield students having brown eyes is different from .45? Explain. Hint: Is .45 within your confidence interval?
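
The calculations in this question can be sketched on made-up data. The vector eyes below is hypothetical (it is not the class survey), and the interval uses the usual normal-approximation formula \(\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}\), which is only appropriate when the success-failure condition holds.

```r
# Hypothetical eye-color responses for illustration only (NOT the class survey data).
eyes <- c(rep("brown", 30), rep("blue", 20), rep("green", 10), rep("hazel", 15))

n <- length(eyes)                 # sample size
counts <- table(eyes)             # number of responses in each category
p_hat <- counts[["brown"]] / n    # sample proportion of brown-eyed respondents

# Success-failure check: both counts should be at least 10.
successes <- n * p_hat
failures <- n * (1 - p_hat)

# 95% confidence interval from the normal approximation.
z_star <- qnorm(0.975)                  # critical value, about 1.96
se <- sqrt(p_hat * (1 - p_hat) / n)     # standard error of p_hat
ci <- c(low = p_hat - z_star * se, high = p_hat + z_star * se)
ci
```

With real data, replace eyes with the relevant data frame column and compare the hypothesized proportion against the resulting interval.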

Q2 Inference on a mean

Question: Do Linfield students get less than 8 hours of sleep a night on average?

Treat the sleep column of the class survey data as a simple random sample of all Linfield students. Does this sample provide statistically significant evidence at the \(\alpha=.05\) level that the population of all Linfield students sleeps less than 8 hours a night on average?

  a. State hypotheses for the hypothesis test to address the question of whether the population of all Linfield students sleeps less than 8 hours a night on average.
  b. If we were asked to crank out this test by hand, we would need the following summary statistics from the data: the sample size \(n\), the sample mean \(\bar{x}\), and the sample standard deviation \(s\). In a code chunk, find each of these values for the sleep column, and summarize these results in your written response.
  c. Rather than do the significance test by hand, we let R crunch the numbers via the t.test() command. Run this test in the Q2 c) code chunk per the instructions below.

In the code below, you will want to input:

  • x = the data frame column of interest that is our sample
  • mu = the value of the population mean assumed in the null hypothesis
  • alternative = choose one (and enter with quotes) consistent with your hypotheses from Q2 a):
    • “two.sided”,
    • “less”, or
    • “greater”
t.test(x= ,mu= , alternative= )
  d. Record the P-value of this test. Are the results statistically significant at the \(\alpha = .05\) level? State your conclusion in the context of the original research question about sleep for Linfield students.
  e. Check: Is the test reliable? Reliability red flags include poorly gathered data that is not representative of the population, small sample size, and strong skewness or outliers in the sample data. Here \(n\) is quite large, so our method is reliable if these data don’t show crazy outliers. Make a histogram of the sample data to check the shape of the data set, and state your conclusion about the reliability of the test.
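
As a rough illustration of the workflow in this question, here is a sketch on a different, made-up scenario. The vector commute, the null value 20, and the "greater" alternative are all assumptions chosen for the example; they are not the values to use for the sleep question.

```r
# Hypothetical commute times in minutes (simulated, NOT the class survey data).
set.seed(42)
commute <- round(rnorm(40, mean = 23, sd = 6), 1)

# Summary statistics needed for a by-hand test.
n <- length(commute)
x_bar <- mean(commute)
s <- sd(commute)

# One-sample t test of H0: mu = 20 against HA: mu > 20.
result <- t.test(x = commute, mu = 20, alternative = "greater")
result$p.value

# By-hand test statistic, which should match result$statistic.
t_by_hand <- (x_bar - 20) / (s / sqrt(n))

# Histogram to look for strong skew or extreme outliers.
hist(commute, main = "Hypothetical commute times", xlab = "Minutes")
```

The alternative argument should always match the direction stated in the alternative hypothesis, so choose it from your own hypotheses rather than copying it from this sketch.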

Q3 Birth Weight and smoking

Our textbook comes with many data sets; one of them provides information about a random sample of 150 births in North Carolina.

In this data set we have observations on 9 different variables, some categorical and some numerical. The meaning of each variable is as follows.

Variable     Description
f_age        father’s age in years
m_age        mother’s age in years
weeks        length of pregnancy in weeks
premature    whether the birth was classified as premature (premie) or full term
visits       number of hospital visits during pregnancy
gained       weight gained by mother during pregnancy
weight       weight of the baby at birth in pounds
sex_baby     sex of the baby, “female” or “male”
smoke        status of mother as a “smoker” or “nonsmoker”

If you haven’t already, load the data into your session by running

nc <- read.csv('https://www.openintro.org/data/csv/births.csv')

Q3 Part I. Exploratory Analysis

Explore the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

  a. Use ggplot to make side-by-side boxplots of smoke and weight. Use colors and text to make your plot clear and informative to the reader. What does the plot highlight about the relationship between these two variables?
  b. The boxplots show how the medians of the two distributions compare. We can also compare the means using the following data verbs (thanks, tidyverse!): group the data by the smoke variable, then summarize each group with the mean function to calculate the mean weight within it. Is there an observed difference in average birth weight between these two groups?
nc |> 
  group_by(smoke) |> 
  summarize(mean(weight))
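
For the side-by-side boxplots requested in the first part of this exercise, the general ggplot pattern can be sketched on a toy data frame. The data frame toy and its columns group and score are hypothetical stand-ins, not columns of nc.

```r
library(ggplot2)

# Hypothetical data frame: a numeric response measured in two groups.
toy <- data.frame(
  group = rep(c("control", "treatment"), each = 6),
  score = c(5.1, 6.3, 5.8, 6.0, 5.5, 6.7, 6.9, 7.4, 6.2, 7.8, 7.1, 6.6)
)

# Side-by-side boxplots: categorical variable on x, numeric variable on y.
p <- ggplot(toy, aes(x = group, y = score, fill = group)) +
  geom_boxplot() +
  labs(
    title = "Score by group (hypothetical data)",
    x = "Group",
    y = "Score"
  ) +
  theme(legend.position = "none")  # the x-axis already labels the groups

print(p)
```

Mapping the grouping variable to both x and fill colors each box by group; the legend is dropped here because it would repeat the axis labels.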

Q3 Part II: Inference

You saw in (b) that there is an observed difference in birth weight for our two groups. But is this easily explained by randomness in sampling, or does it indicate a genuine association (i.e., is there genuinely a difference in average birth weight between these two groups of moms)?

  c. Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group sizes using the same code as above but replacing mean with length. You may also want to plot the distribution of weights in each group to check for extreme outliers.
  d. Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
  e. Use the t.test() function to run the significance test. We are testing the difference of two means (birth weights) found in the weight column, and the two groups (moms who smoke vs. moms who don’t) are determined by the data frame’s smoke column. The R code for splitting weights by smoking habit in a t test is below (fill in the appropriate alternative hypothesis).
t.test(nc$weight ~ nc$smoke, alternative = )

Then write a short paragraph sharing your findings. Be precise, and express your findings in the context of birth weights and smoking habits.
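
The steps in this part (group sizes, then a two-sample t test split by a grouping column) can be sketched on a toy data frame. The names toy, group, and value are hypothetical, and the sketch uses the data = argument of t.test() rather than the nc$ prefix; both forms of the formula interface do the same thing. Note that t.test() runs the Welch (unequal variances) version of the test by default.

```r
library(dplyr)

# Hypothetical data frame (NOT the nc data): a numeric response in two groups.
toy <- data.frame(
  group = rep(c("A", "B"), each = 8),
  value = c(7.1, 6.8, 7.5, 8.0, 6.9, 7.3, 7.7, 7.2,
            6.5, 6.9, 6.2, 7.0, 6.4, 6.8, 6.1, 6.6)
)

# Group sizes and means, using length() in the same way as mean().
sizes <- toy |>
  group_by(group) |>
  summarize(n = length(value), mean_value = mean(value))
sizes

# Two-sample t test; the formula reads "value split by group".
# With a character grouping column the groups are ordered alphabetically,
# so a one-sided alternative would refer to (mean of A) - (mean of B).
result <- t.test(value ~ group, data = toy, alternative = "two.sided")
result$p.value
result$conf.int   # 95% confidence interval for the difference in means
```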

Q4 Length of pregnancies

Use the nc data to determine a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that you are doing inference on a single population parameter here.
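
One way to get a confidence interval for a single mean is to read it off the t.test() output. The sketch below uses a small hypothetical vector x (not the nc data) and checks the result against the by-hand formula \(\bar{x} \pm t^{*}\, s/\sqrt{n}\) with \(n-1\) degrees of freedom.

```r
# Hypothetical measurements (NOT the nc data).
x <- c(38, 40, 39, 41, 37, 40, 39, 42, 38, 40, 36, 39)

# t.test() with no mu or alternative still reports a confidence interval.
ci <- t.test(x, conf.level = 0.95)$conf.int
ci

# The same interval by hand: x_bar +/- t* s / sqrt(n).
n <- length(x)
t_star <- qt(0.975, df = n - 1)
by_hand <- mean(x) + c(-1, 1) * t_star * sd(x) / sqrt(n)
by_hand
```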

Q5 Birth weight by sex

Use the nc data to conduct a hypothesis test on the following question: Is there a difference in birth weight between male and female babies?