Sampling Distribution for \(\hat{p}\)

Suppose we draw a simple random sample of size \(n\) from a population with the intention of estimating the proportion of the entire population having a certain feature.

In this setting,

  • \(p\) represents the population proportion of interest, and \(p\) is a parameter (a value describing the entire population).
  • \(n\) represents the size of a random sample.
  • \(\hat{p}\) represents the sample proportion. It is a statistic (since it is calculated from data), and it serves as an estimator for \(p\).

For instance, we may want to estimate the proportion \(p\) of all Oregon registered voters in favor of a certain ballot measure. This proportion is a parameter: it is unknown, and it describes the entire population. To estimate \(p\), we might ask a simple random sample of \(n = 500\) registered voters in Oregon whether they support the ballot measure. Let’s say \(x = 273\) voters in this sample do support the measure, which gives us a sample proportion \[\hat{p} = \frac{273}{500} = .546\] in favor of the measure. This value for \(\hat{p}\) serves as an estimate for \(p\), and we call it a point estimate because it’s a single number.
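In R, this point estimate is a single division:

273/500 # the point estimate, phat
## [1] 0.546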

If we gather a second sample of size \(n = 500\), we are likely to see a different sample proportion of voters in favor of the measure, though it may be fairly close to the first one.

The sampling distribution for \(\hat{p}\) is a theoretical distribution that describes the different possible values for \(\hat{p}\) from the different possible random samples, along with how likely it is for these values to occur.

We can get a sense of a sampling distribution by simulating a large number of random samples, as in the R activity Sampling Distributions.

Code for Colored Rectangles Activity

Here’s the code to gather a random sample of 100 of Hitchman’s rectangles.

# read in the rectangle data, then keep 100 randomly chosen rows
sample <- read.csv("https://mphitchman.com/stats/data/normal_act_data.csv")[sample(1:2000, 100), ]
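After running this line, a couple of quick checks can confirm the sample looks right:

head(sample) # view the first few sampled rectangles
nrow(sample) # should be 100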

Code for Sampling Distributions Activity

To generate your own sample in Q4 of the Sampling Distributions activity, I recommend copying the following code into a new script, and then running the code one line at a time. An explanation of this code is given below.

lin <- rep(c("f","nf"), times = c(612,1299))
s <- sample(lin, 50)
table(s)

This code does three things:

  • Line 1 creates a Linfield student body (called lin) with 612 first-gen students (labelled “f”), and 1299 non-first-gen students (labelled “nf”). Note: 612 is 32% of 1911, and 1299 would be the rest of the student body.
  • Line 2 creates a simple random sample of 50 students from lin and calls this sample s.
  • Line 3 gives a table of the results, from which you can read how many students in your sample are first-gen students.
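One more optional line: dividing your count of first-gen students by the sample size turns it into a sample proportion \(\hat{p}\):

sum(s == "f")/50 # the proportion of first-gen students in your sample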

The Central Limit Theorem for Proportions

The Sampling Distributions Activity suggests that the sampling distribution for \(\hat{p}\) looks rather bell-shaped, like a normal distribution. The Central Limit Theorem states that this is often the case, and the theorem also tells us the values of the mean and standard deviation of the sampling distribution for \(\hat{p}\). In the context of sampling, this standard deviation is usually called the standard error, denoted \(SE\) for short.

CLT for Proportions
When we collect a sufficiently large sample of \(n\) independent observations from a population with population proportion \(p\), the sampling distribution of \(\hat{p}\) will be nearly normal with
\[\begin{align*} &\text{mean}=p &&\text{standard error }(SE) = \sqrt{\frac{p(1-p)}{n}} \end{align*}\]

The sample size is typically considered sufficiently large when

  • the success-failure condition has been met: \(np \geq 10\) and \(n(1-p) \geq 10\), and
  • \(n\) does not exceed 10% of the population.
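As a quick illustration of the SE formula in R (the helper function name se_phat is just for this sketch):

se_phat <- function(p, n) sqrt(p*(1-p)/n) # the CLT standard error
se_phat(.4, 25) # about .098, matching the frog example below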

Example: Green-eyed Frogs

Suppose in a certain population of frogs, 40% of them have green eyes. Now, suppose we intend to gather a random sample of \(n = 25\) frogs from this population. What is the sampling distribution for \(\hat{p}\), the proportion of frogs in our (to be collected) sample that have green eyes?

Checking conditions of the CLT

First let’s check whether our sample size is large enough for the conditions of the CLT to be met.

The success-failure condition: Should we expect at least 10 “successes” and 10 “failures” in our sample? In the context of this example, “success” means the frog has green eyes.

  • Expected successes: 40% of 25, which is 0.4*25 = 10 (phew!)
  • Expected failures: 60% of 25, which is 0.6*25 = 15.

Since each number is at least 10, the success-failure condition is met.

The 10% Rule: We also want to check that our sample size \(n = 25\) is less than 10% of the population. “Experts” estimate around 1000 frogs in the population we’re studying, so \(n = 25\) is only 2.5% of the population, and this condition is also met.
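If you prefer to let R do this bookkeeping, here is a minimal sketch of both checks, using this example’s numbers:

n <- 25; p <- .4; N <- 1000 # sample size, population proportion, population size
(n*p >= 10) & (n*(1-p) >= 10) # success-failure condition: TRUE
n <= .10*N # 10% rule: TRUE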

Shape of the sampling distribution

The central limit theorem says that this sampling distribution is approximately normal! More than that, the center of the distribution is \(p\), the population proportion, which in this case is \(p = 0.4\); and the standard error (aka standard deviation) is given by the formula \[SE = \sqrt{p(1-p)/n} = \sqrt{(.4)(.6)/25}\approx .098.\]

Ok, so here’s a plot of the likely shape of the sampling distribution for \(\hat{p}\). Note that this bell curve is centered at 0.4, and the tick marks are spaced 0.098 (one standard error) apart. So, thinking of the 68-95-99.7 rule, about 68% of the time we should expect a sample proportion between .3 and .5; about 95% of the time, a sample proportion between .2 and .6; and about 99.7% of the time, a sample proportion between .1 and .7.
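If you’d like to draw this curve yourself, here is one way, using base R’s curve() (the axis limits and labels are just choices):

p <- .4
se <- sqrt(p*(1-p)/25) # about .098
curve(dnorm(x, mean = p, sd = se), from = p - 4*se, to = p + 4*se,
      xlab = "sample proportion", ylab = "", main = "sampling distribution for p-hat")
abline(v = p + (-3:3)*se, lty = 3) # dotted lines at the mean and 1, 2, 3 SEs out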

Let’s simulate the process of gathering a sample of \(n = 25\) frogs from such a population and recording the proportion in the sample having green eyes.

trials <- 100000
n <- 25
p <- .4
# a population of 1000 frogs: 400 green-eyed ("g") and 600 not ("ng")
population <- rep(c("g","ng"), times = c(400,600))
phats <- numeric(trials) # stores the sample proportion from each trial
for (i in 1:trials){
  # draw a sample of n frogs; record the proportion with green eyes
  phats[i] <- sum(sample(population, n) == "g")/n
}

The code above generates a vector called phats that stores the 100,000 sample proportions generated from 100,000 different random samples of size \(n = 25\). We can display a histogram of these 100,000 \(\hat{p}\)’s:
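One way to draw it (freq = FALSE puts the histogram on a density scale, so we can lay a normal curve on top in a moment):

hist(phats, freq = FALSE, breaks = 25,
     xlab = "sample proportion", main = "100,000 simulated values of p-hat")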

Finally, we superimpose the theoretical bell-curve approximation to the sampling distribution on the histogram of simulated sample proportions. Nice fit!
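If you are plotting along, one way to add the overlay, assuming the histogram was drawn on the density scale as above:

curve(dnorm(x, mean = .4, sd = .098), add = TRUE, lwd = 2)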

Using the CLT to estimate probabilities

Question: What is the probability that in a random sample of \(n = 25\) frogs from this population, at most 5 of them have green eyes?

First, we phrase this question with the notation of this section. Obtaining a sample of \(n = 25\) frogs with 5 green-eyed frogs corresponds to a sample proportion \[\hat{p} = \frac{5}{25} = 0.2.\]

This question is asking, then, for the probability that \(\hat{p} \leq 0.2\), or, in our notation, the question is asking for \(P(\hat{p} \leq .2)\). We can convert to \(Z\)-scores and use pnorm() to estimate this probability, thanks to the CLT. The \(Z\)-score for 0.2 is:

\[Z = \frac{0.2 - 0.4}{.098} = -2.04,\] so \[P(\hat{p}\leq 0.2) = P(Z \leq -2.04) = \texttt{pnorm}(-2.04)=.0207.\]
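In R:

(0.2 - 0.4)/.098 # the Z-score, about -2.04
pnorm(-2.04) # about .0207
pnorm(0.2, mean = .4, sd = .098) # nearly the same, skipping the Z-score step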

Conclusion: If 40% of the frog population has green eyes, and if we gather a random sample of \(n = 25\) frogs from this population, there is about a 2% chance that at most 5 of the frogs in the sample have green eyes.

We can check our theoretical estimate against the simulation results too. We can ask R to check (very quickly!) through the 100,000 sample proportions stored in the phats vector, count how many are less than or equal to .2, and then divide that count by 100,000:

sum(phats <= .2)/trials
## [1] 0.02792

Since the distribution for \(\hat{p}\) is discrete (if \(n = 25\), the only possibilities for \(\hat{p}\) are 0, 1/25, 2/25, 3/25, … , 24/25, 1), and a normal distribution is continuous, our estimate will only be approximate. In practice, one can employ a “continuity correction” to improve this estimate but we save that for another day.