Instructions

In this activity we use R to investigate the Central Limit Theorem (CLT) for proportions. In your R code you will practice the following tasks:

  • Create a population in which a given proportion of the elements have a particular feature.
  • Simulate drawing random samples from this population, and determine the sample proportions, \(\hat{p}\).
  • Approximate the sampling distribution for \(\hat{p}\).
  • Compare your approximate sampling distribution with the results of the Central Limit Theorem.
  • Use the CLT to estimate some probabilities.

Based on your code you will record answers to various questions on this worksheet. The completed worksheet is due in class on Tuesday, March 31.

Q1 Create a Population

  1. Think of a scenario (real or fantastic) in which you want to investigate the proportion of a population having a certain feature. Maybe it’s the proportion of all Skittles that are green, or the proportion of all dragons having 6-toed feet, or the proportion of all little league baseball players that wear their glove on their right hand. Record your population, and the feature you are interested in studying, on your worksheet.
  1. Decide on a (reasonable or fantastic) value for \(p\), the proportion of your population having your feature of interest. Be sure your value is between 0 and 1, preferably between .2 and .8. Record this value of \(p\) on your worksheet, and define this value in your script by running the line below, with your own value in place of the example.
p <- 0.3  # example value only; replace 0.3 with your chosen value (between 0 and 1)
  1. Now create your model population by running the following code in R:
pop_size <- 100000
# round() guards against floating-point error in the counts of 's' and 'f'
population <- rep(c('s', 'f'), round(pop_size * c(p, 1 - p)))

This code creates a vector called population in which every entry is either s or f (short for ‘success’ and ‘failure’).

  • The ‘s’ entries correspond to those elements in your population that have your feature of interest
  • The ‘f’ entries correspond to those elements that don’t have the feature.

The code above ensures that the proportion of your population having your feature of interest is your chosen value p.
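One way to confirm this claim is to count the entries directly. The sketch below is only an illustration: the value p = 0.3 is an assumed example (use your own), and round() is added defensively so the two counts are whole numbers.

```r
# Assumed example value; substitute your own p.
p <- 0.3
pop_size <- 100000
population <- rep(c('s', 'f'), round(pop_size * c(p, 1 - p)))

length(population)        # total number of elements in the population
table(population)         # how many 'f' and how many 's' entries
mean(population == 's')   # proportion of 's' entries; should match p
```

Here mean(population == 's') works because R treats TRUE as 1 and FALSE as 0, so the mean of the comparison is the proportion of successes.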

Q2 Drawing a sample

In your script, draw a sample of size n = 100 and store the result in a vector called draw by running this code:

n = 100
draw <- sample(population, n)

We are interested in p_hat, the proportion of elements in the sample having your feature of interest. We can ask R to find this sample proportion (of ‘s’) by running this code:

sum(draw == 's') / n

Record your sample proportion p_hat in your worksheet. Is it close to the population proportion \(p\)? Exactly the same?
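To see why a sample proportion rarely matches \(p\) exactly, it can help to draw more than one sample. The following is a minimal sketch, assuming an example value p = 0.3 and an arbitrary seed (set.seed(1) is there only to make the sketch reproducible; your worksheet answers should come from your own draw). It also shows that mean(draw == 's') gives the same number as sum(draw == 's') / n.

```r
set.seed(1)   # arbitrary seed, used only so this sketch is reproducible
p <- 0.3      # assumed example value; substitute your own p
pop_size <- 100000
population <- rep(c('s', 'f'), round(pop_size * c(p, 1 - p)))
n <- 100

draw <- sample(population, n)        # one random sample, without replacement
p_hat_sum  <- sum(draw == 's') / n   # the formula used in this question
p_hat_mean <- mean(draw == 's')      # the same quantity, written with mean()
c(p_hat_sum, p_hat_mean)

# A second sample from the same population generally gives a different p_hat.
draw2 <- sample(population, n)
mean(draw2 == 's')
```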

Q3 The sampling distribution for p_hat

We can approximate the sampling distribution for p_hat by visualizing the distribution of p_hats that we would obtain from many, many different random samples of size \(n\) from the population.

The following code will draw a random sample of size \(n = 100\) a whopping 100,000 different times, recording the sample proportion of successes in each draw in the vector we call results_100. Run this code in your script.

results_100 <- replicate(100000, sum(sample(population, n) == 's')/n)

Make a bar plot of this vector of results (barplot(table(results_100))). Describe the shape of the distribution. Is it bell-shaped? Based on this distribution, would you describe the sample proportion you obtained in Q2 as typical, fairly common but on the high side of typical, fairly common but on the low side of typical, or a very rare occurrence?

This bar plot is a nice approximation to the theoretical sampling distribution for p_hat: a picture of the different sample proportions that are possible, along with how likely each is to occur.
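One possible way to judge whether a particular sample proportion is typical or rare is to ask what share of the simulated proportions fall at or below it. The sketch below is illustrative only: p = 0.3 and my_p_hat = 0.36 are hypothetical values, and reps is set to 2,000 rather than 100,000 so that it runs quickly.

```r
set.seed(2)   # arbitrary seed, used only so this sketch is reproducible
p <- 0.3      # assumed example value; substitute your own p
pop_size <- 100000
population <- rep(c('s', 'f'), round(pop_size * c(p, 1 - p)))
n <- 100
reps <- 2000  # assumption: far fewer than 100,000 replications, for speed

# Simulated sample proportions, one per replication
results <- replicate(reps, mean(sample(population, n) == 's'))

my_p_hat <- 0.36  # hypothetical sample proportion from Q2

mean(results <= my_p_hat)                # share of simulated p_hats at or below yours
quantile(results, c(0.025, 0.5, 0.975))  # middle 95% range and median of the simulation
```

A share near 0.5 suggests a typical value, while a share very close to 0 or 1 suggests a rare one.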

Q4 The Central Limit Theorem

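According to the Central Limit Theorem, the sampling distribution for p_hat is nearly normal when the sample size is large enough (a common rule of thumb is that n*p and n*(1-p) are both at least 10).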

  1. In the case of your population, and your sample size of \(n = 100\), according to the CLT what is the mean and what is the standard deviation (also called standard error) of this sampling distribution? Record your answers on the worksheet.
  1. In Q3, we constructed a results vector that is meant to approximate the sampling distribution for p_hat. Determine the mean and standard deviation of the results_100 vector. How do they compare to your answers to part (a) of this question?
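The comparison in this question can be laid out side by side. The sketch below assumes an example value p = 0.3 and uses only 2,000 replications (an assumption made for speed), so its simulated values will be a little noisier than those from results_100. The standard error formula sqrt(p*(1-p)/n) is the one used later in this activity.

```r
set.seed(3)   # arbitrary seed, used only so this sketch is reproducible
p <- 0.3      # assumed example value; substitute your own p
pop_size <- 100000
population <- rep(c('s', 'f'), round(pop_size * c(p, 1 - p)))
n <- 100
reps <- 2000  # assumption: fewer replications than in Q3, for speed

results <- replicate(reps, mean(sample(population, n) == 's'))

clt_mean <- p                        # CLT: the sampling distribution is centered at p
clt_se   <- sqrt(p * (1 - p) / n)    # CLT: standard error of p_hat

# Theoretical values in the first row, simulated values in the second
rbind(CLT       = c(mean = clt_mean,      sd = clt_se),
      simulated = c(mean = mean(results), sd = sd(results)))
```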

Q5 Change the sample size

Now we essentially repeat questions 3 and 4 with a different sample size. Instead of drawing a sample of size 100, let’s draw a sample of size 400. Run the following code to generate a new results vector called results_400 (of 100,000 different sample proportions from samples of this new size):

n = 400
results_400 <- replicate(100000, sum(sample(population, n) == 's')/n)
  1. Make a bar plot of the results_400 vector. Does this visual resemble a bell curve? How does this visual compare to the earlier one in terms of center and spread?
  1. In the case of your population, and your sample size of \(n = 400\), according to the CLT what is the mean and what is the standard error of this sampling distribution? Record your answers on the worksheet.
  1. Determine the mean and standard deviation of the results_400 vector. How do they compare to your answers to part (b) of this question?
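The effect of the sample size on the spread can also be read off the standard error formula itself. The short sketch below assumes an example value p = 0.3, and se() is a hypothetical helper defined here only for illustration. Because n appears under a square root, quadrupling n from 100 to 400 cuts the standard error in half.

```r
p <- 0.3  # assumed example value; substitute your own p

# Hypothetical helper: CLT standard error of p_hat for sample size n
se <- function(p, n) sqrt(p * (1 - p) / n)

se_100 <- se(p, 100)
se_400 <- se(p, 400)

c(se_100 = se_100, se_400 = se_400)
se_100 / se_400   # ratio of the two standard errors
```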

Q6 Estimating Likelihoods

  1. Use the CLT to estimate the probability that more than half the voters in a sample of 100 likely voters are in favor of Candidate A, assuming that 45% of the population of likely voters actually favors Candidate A.

Feel free to use R as your calculator in this problem. Recall the code we considered in the last class. Your job here is to assign values for p, p_hat, and n, and then decide on the right use of pnorm().

p <- 
p_hat <- 
n <- 
SE <- sqrt(p*(1-p)/n)
z <- (p_hat - p) / SE

pnorm(z) # this is the probability of getting a sample proportion less than p_hat
1 - pnorm(z) # this is the probability of getting a sample proportion greater than p_hat
  1. Use the CLT to estimate the probability that more than half the voters in a sample of 500 likely voters are in favor of Candidate A, assuming that 45% of the population of likely voters actually favors Candidate A.
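As a pattern for filling in the template, here is a worked sketch with deliberately different, hypothetical numbers (p = 0.6, n = 50, and a cutoff of p_hat = 0.7), so it does not answer either part of this question. It also shows that pnorm() accepts lower.tail = FALSE as an alternative to subtracting from 1.

```r
# Hypothetical values chosen for illustration only; not the Candidate A numbers.
p     <- 0.6   # assumed population proportion
p_hat <- 0.7   # cutoff: we want the chance of a sample proportion above this
n     <- 50    # assumed sample size

SE <- sqrt(p * (1 - p) / n)   # standard error of p_hat under the CLT
z  <- (p_hat - p) / SE        # z-score of the cutoff

upper     <- 1 - pnorm(z)                  # P(sample proportion > p_hat)
upper_alt <- pnorm(z, lower.tail = FALSE)  # same probability, computed directly
c(upper, upper_alt)
```

Because the question asks for "more than" the cutoff, the upper tail is the one to use; a question asking for "less than" would use pnorm(z) instead.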