In this activity we use R to investigate the Central Limit Theorem for proportions. Your code in R will practice the following tasks:
Based on your code you will record answers to various questions on this worksheet. The completed worksheet is due in class on Tuesday, March 31.
- Think of a scenario (real or fantastic) in which you want to investigate the proportion of a population having a certain feature. Maybe it’s the proportion of all Skittles that are green, or the proportion of all dragons having 6-toed feet, or the proportion of all Little League baseball players who wear their glove on their right hand. Record your population, and the feature you are interested in studying, on your worksheet.
- Decide on a (reasonable or fantastic) value for \(p\), the proportion of your population having your feature of interest. Be sure your value is between 0 and 1, preferably between 0.2 and 0.8. Record this value of \(p\) on your worksheet, and define this value in your script by running the line below:
p <-    # your chosen value (between 0 and 1)
- Now create your model population by running the following code in R
pop_size = 100000
population <- rep( c('s', 'f'), times = round(pop_size*c(p, 1-p)) )  # round() keeps the counts whole; rep() truncates non-integer counts
This code creates a vector called population in which
every entry is either s or f (short for
‘success’ and ‘failure’).
The code above ensures that the proportion of your population having
your feature of interest is your chosen value p.
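As a quick sanity check, the composition of the model population can be confirmed directly. The sketch below assumes an example value of p = 0.3 (substitute your own) and wraps the counts in round(), an added safeguard because rep() truncates non-integer counts.

```r
# Assumed example value; replace with your own p.
p <- 0.3
pop_size <- 100000

# Build the population: 's' repeated p*pop_size times, then 'f' for the rest.
population <- rep(c('s', 'f'), times = round(pop_size * c(p, 1 - p)))

length(population)        # total number of individuals in the population
table(population)         # counts of 'f' and 's'
mean(population == 's')   # proportion of successes; should match p
```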
In your script, draw a sample of size n = 100 and store the results in a vector called draw by running this code:
n = 100
draw <- sample(population, n)
We are interested in p_hat, the proportion of elements in the sample having your feature of interest. We can ask R to find this sample proportion (of ‘s’) by running this code:
sum(draw == 's') / n
Record your sample proportion p_hat in your worksheet. Is it close to the population proportion \(p\)? Exactly the same?
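The same sample proportion can also be computed with mean(), since R treats TRUE as 1 and FALSE as 0. A minimal sketch, assuming p = 0.3 as an example value and an arbitrary seed (set.seed() is only there so the draw can be reproduced):

```r
set.seed(1)  # arbitrary seed, assumed here for reproducibility only

# Assumed example population with p = 0.3.
p <- 0.3
pop_size <- 100000
population <- rep(c('s', 'f'), times = round(pop_size * c(p, 1 - p)))

n <- 100
draw <- sample(population, n)   # sample without replacement

# mean() of a logical vector is the fraction of TRUE values,
# so this equals sum(draw == 's') / n.
p_hat <- mean(draw == 's')
p_hat
```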
We can approximate the sampling distribution for p_hat by visualizing the distribution of p_hats that we would obtain from many, many different random samples of size \(n\) from the population.
The following code will draw a random sample of size \(n = 100\) a whopping 100,000 different
times, recording the sample proportion of successes in each draw in the
vector we call results_100. Run this code in your
script.
results_100 <- replicate(100000, sum(sample(population, n) == 's')/n)
Make a bar plot of this vector of results (barplot(table(results_100))). Describe the shape of the distribution. Is it bell-shaped? Based on this distribution, would you describe the sample proportion you obtained in Q2 as typical, fairly common but on the high side of typical, fairly common but on the low side of typical, or a very rare occurrence?
This bar plot is a nice approximation to the theoretical sampling distribution for p_hat, a look at the different sample proportions that are possible along with how likely each is to occur.
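One way to judge whether a particular sample proportion is typical is to look at where the middle 95% of the simulated values fall. The sketch below assumes p = 0.3 as an example value and uses only 2,000 replications (instead of 100,000) so that it runs quickly; both choices, and the name results_small, are illustrative.

```r
set.seed(2)  # arbitrary seed, assumed for reproducibility

# Assumed example population with p = 0.3.
p <- 0.3
pop_size <- 100000
population <- rep(c('s', 'f'), times = round(pop_size * c(p, 1 - p)))

n <- 100
# Fewer replications than in the worksheet, to keep the run short.
results_small <- replicate(2000, sum(sample(population, n) == 's') / n)

# Sample proportions between these two cutoffs make up the middle 95%
# of the simulated draws; values outside them are comparatively rare.
middle_95 <- quantile(results_small, c(0.025, 0.975))
middle_95
```

A sample proportion that falls well inside this range could reasonably be called typical; one outside it would be unusual.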
According to the central limit theorem, the sampling distribution for p_hat is nearly normal.
- In the case of your population, and your sample size of \(n = 100\), according to the CLT what is the mean and what is the standard deviation (also called standard error) of this sampling distribution? Record your answers on the worksheet.
- In Q3, we constructed a results vector that is meant to approximate the sampling distribution for p_hat. Determine the mean and standard deviation of the results_100 vector. How do they compare to your answers to part (a) of this question?
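The comparison in this question can be sketched as follows. The example value p = 0.3, the seed, the name results_sim, and the reduced count of 5,000 replications are assumptions made for illustration; the standard error formula is the same one used in the pnorm() template later in this worksheet.

```r
set.seed(3)  # arbitrary seed, assumed for reproducibility

# Assumed example population with p = 0.3.
p <- 0.3
pop_size <- 100000
population <- rep(c('s', 'f'), times = round(pop_size * c(p, 1 - p)))

n <- 100
results_sim <- replicate(5000, sum(sample(population, n) == 's') / n)

# CLT values: the sampling distribution is centered at p,
# with standard error sqrt(p * (1 - p) / n).
clt_mean <- p
clt_se   <- sqrt(p * (1 - p) / n)

# Side-by-side comparison of simulated and theoretical values.
c(simulated = mean(results_sim), clt = clt_mean)
c(simulated = sd(results_sim),   clt = clt_se)
```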
Now we essentially repeat questions 3 and 4, with a different sample
size. Instead of drawing a sample of size 100, let’s draw a sample of
size 400. Run the following code to generate a new results vector called
results_400 (of 100000 different sample proportions from
samples of this new size)
n = 400
results_400 <- replicate(100000, sum(sample(population, n) == 's')/n)
- Make a bar plot of the results_400 vector. Does this visual resemble a bell curve? How does this visual compare to the earlier one in terms of center and spread?
- In the case of your population, and your sample size of \(n = 400\), according to the CLT what is the mean and what is the standard error of this sampling distribution? Record your answers on the worksheet.
- Determine the mean and standard deviation of the results_400 vector. How do they compare to your answers to part (b) of this question?
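Because the standard error is sqrt(p(1-p)/n), quadrupling the sample size should cut the spread of the sampling distribution in half while leaving its center unchanged. A short check, assuming p = 0.3 as an example value and a hypothetical helper named se_prop():

```r
# Hypothetical helper: CLT standard error of a sample proportion.
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.3  # assumed example value; replace with your own

se_100 <- se_prop(p, 100)
se_400 <- se_prop(p, 400)

se_100
se_400
se_100 / se_400   # ratio of the two standard errors
```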
- Use the CLT to estimate the probability that more than half the voters in a sample of 100 likely voters are in favor of Candidate A, assuming that 45% of the population of likely voters actually favors Candidate A.
Feel free to use R as your calculator in this problem. Recall the
code we considered in the last class. Your job here is to assign values
for p, p_hat, and n, and then
decide on the right use of pnorm().
p <-
p_hat <-
n <-
SE <- sqrt(p*(1-p)/n)
z <- (p_hat - p) / SE
pnorm(z) # this is the probability of getting a sample proportion less than p_hat
1 - pnorm(z) # this is the probability of getting a sample proportion greater than p_hat
- Use the CLT to estimate the probability that more than half the voters in a sample of 500 likely voters are in favor of Candidate A, assuming that 45% of the population of likely voters actually favors Candidate A.
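To see the pnorm() template in action without giving away the answers to the voter questions, here is a sketch for a made-up scenario: a population proportion of 0.6, samples of size 200, and the probability of a sample proportion above 0.65. All three numbers are assumptions chosen for illustration.

```r
# Made-up scenario (not one of the worksheet questions).
p     <- 0.60   # assumed population proportion
p_hat <- 0.65   # cutoff for the sample proportion
n     <- 200    # assumed sample size

SE <- sqrt(p * (1 - p) / n)   # standard error from the CLT
z  <- (p_hat - p) / SE        # z-score of the cutoff

# Probability of a sample proportion greater than p_hat.
upper_tail <- 1 - pnorm(z)

# Equivalent call that skips the z-score and asks for the upper tail directly.
upper_tail_direct <- pnorm(p_hat, mean = p, sd = SE, lower.tail = FALSE)

upper_tail
upper_tail_direct
```

Both calls give the same probability; the second form avoids computing z by hand and can be slightly more accurate when the tail probability is very small.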