C Discrete Random Variables in R
Here we investigate in R the common, named discrete random variables we encounter in MATH 340:
- binomial: binom
- geometric: geom
- negative binomial: nbinom
- hypergeometric: hyper
- Poisson: pois
We use four commands to work with the named distributions. For a distribution named ___:
- d___(x,...) | probability function, \(p(x)\)
- p___(q,...) | cumulative probability, \(P(X \leq q)\)
- q___(p,...) | quantiles, finds the smallest \(x\) such that \(P(X \leq x) \geq p\)
- r___(n,...) | random sample of size \(n\) from the distribution
We also discuss below how to build and analyze homemade discrete random variables in R.
C.1 Binomial binom
The Scene
Recall, \(X \sim \texttt{binom}(n,p)\) means \(X\) counts the number of successes in \(n\) independent, identical Bernoulli trials, when probability of success on any given trial is \(p\).
The space of \(X\) is \(x = 0, 1, \ldots, n\).
Probability function
For \(x = 0, 1, \ldots, n,\) \[p(x)=\binom{n}{x}p^x(1-p)^{n-x}.\]
The binomial distribution in R
dbinom() - probability function
dbinom(x,n,p) returns the probability \(P(X = x)\) for \(X \sim \texttt{binom}(n,p)\).
For instance, dbinom(2,5,.3) returns \[\binom{5}{2}(.3)^2(.7)^3.\]
## [1] 0.3087
As a check:
## [1] 0.3087
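The check can be done by evaluating the binomial formula directly:

```r
# C(5,2) * (.3)^2 * (.7)^3, by hand
choose(5, 2) * 0.3^2 * 0.7^3   # 0.3087
```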
pbinom() - cumulative probability
pbinom(q,n,p) returns the cumulative probability \(P(X \leq q)\) for \(X \sim \texttt{binom}(n,p)\): \[\sum_{x=0}^q p(x)=\sum_{x=0}^q\binom{n}{x}p^x(1-p)^{n-x}.\]
So pbinom(2,5,.3) returns \(P(X \leq 2)\) when \(X\) is \(\texttt{binom}(5,.3)\):
## [1] 0.83692
As a check:
## [1] 0.83692
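The cumulative probability can be confirmed by summing the individual binomial probabilities:

```r
# P(X <= 2) = p(0) + p(1) + p(2) for X ~ binom(5, .3)
sum(dbinom(0:2, size = 5, prob = 0.3))   # 0.83692
</imports>
```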
qbinom() - quantiles
Recall the definition of quantile (9.4): If \(0 < p < 1,\) the \(p\)th quantile of \(X,\) denoted \(\phi_p,\) is the smallest value such that \(P(X \leq \phi_p) \geq p\). In other words, \(\phi_p\) marks the smallest value at or below which one finds at least \(100p\) percent of the distribution of \(X\).
qbinom(q,n,p) returns the quantile \(\phi_q\) for \(X \sim \texttt{binom}(n,p)\).
For instance, what value marks the 95th percentile of the \(\texttt{binom}(100,.5)\) distribution?
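A sketch of the query using qbinom:

```r
# 95th percentile of the binom(100, .5) distribution
qbinom(0.95, size = 100, prob = 0.5)   # 58
```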
## [1] 58
So, if you flip a fair coin 100 times and count how many heads you get, at least 95% of the time you would flip 58 or fewer heads.
We can check this:
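A sketch of the check, computing \(P(X \leq 58)\):

```r
# P(X <= 58) for X ~ binom(100, .5); should be at least .95
pbinom(58, size = 100, prob = 0.5)
```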
## [1] 0.955687
rbinom() - sampling
rbinom(10,20,.4) will generate a vector that stores a random sample of size 10 drawn from a \(\texttt{binom}(20,.4)\) distribution.
## [1] 7 8 6 7 6 9 11 6 3 8
We can use r___ to run simulations, and to visualize the shape of a distribution. Two useful commands for summarizing data: table() presents the frequency table for the sample, and barplot(table()) is a quick way to visualize this frequency table.
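For instance, a sketch of such a simulation (a sample of size 1000 from binom(20, .4), the size matching the counts in the table below; results will vary run to run):

```r
# Simulate 1000 binomial counts and summarize
sim_data <- rbinom(1000, size = 20, prob = 0.4)
table(sim_data)            # frequency table of the sample
barplot(table(sim_data))   # quick visualization of its shape
```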
## sim_data
## 2 3 4 5 6 7 8 9 10 11 12 13 14
## 2 20 35 72 112 180 180 183 106 57 35 14 4
C.2 Geometric geom
The Scene
Let the random variable \(X\) denote the number of identical, independent Bernoulli trials (with probability of success \(p,\) probability of failure \(q = 1-p\)) up to and including the first success.
Then \(X\) is called a geometric random variable with parameter \(p\).
Notation: \(X\) is \(\texttt{geom}(p)\).
The space of \(X\) is \(x = 1, 2, \ldots\)
Probability function
For \(x = 1, 2, 3, \ldots,\) \[p(x)= q^{x-1}p.\]
NOTE: The geometric distribution in R counts failures, not total trials.
In R, geom counts the number of failures until the first success, not the total number of trials up to and including the first success. As with the binom distribution, we can use the d___, p___, q___, and r___ commands to determine probabilities for particular values of \(x,\) cumulative probabilities, quantiles, and random samples, respectively.
dgeom(4,.3) gives the probability of seeing 4 failures before the first success in a sequence of Bernoulli trials in which \(p = .3\):
## [1] 0.07203
pgeom(4,.3) gives the probability of seeing 4 or fewer failures before the first success in a sequence of Bernoulli trials in which \(p = .3\):
## [1] 0.83193
and the following line gives the probability of seeing more than 4 failures prior to the first success:
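A sketch of that computation, via the complement:

```r
# P(more than 4 failures) = 1 - P(4 or fewer failures), p = .3
1 - pgeom(4, prob = 0.3)   # 0.16807
```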
## [1] 0.16807
Example C.1
Roll a fair 6-sided die until a four comes up, and let \(X\) denote the number of rolls needed to see that first four. Repeat this game 10,000 times, and plot the frequency distribution for \(X\).
Strategy:
- Note that this game is a Bernoulli trial, where “success” means rolling a 4 and “failure” means not rolling a four. So \(p = 1/6,\) and \(q = 5/6\).
- Take a random sample of size 10000 from the geom distribution in R with the rgeom() method (which records the number of failures, not the number of trials).
- Add one to each value in the sample to get the number of trials.
- barplot the table!
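A sketch of the strategy above (axis labels and title are my additions; each run produces a different sample):

```r
# failures before first success, plus 1 = total rolls needed
rolls <- rgeom(10000, prob = 1/6) + 1
barplot(table(rolls),
        xlab = "rolls needed", ylab = "frequency",
        main = "Rolls until the first 4")
```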
OMG notice from the bar plot that one depressing game required 50 rolls to see my first 4.
C.3 Negative Binomial nbinom
The Scene
Again, we consider a sequence of Bernoulli trials (probability of success is \(p,\) probability of failure is \(q = 1-p\)).
We let \(X\) denote the number of trials in the sequence up to and including the \(r\)th success, where \(r \geq 1\) is a positive integer. Then \(X\) is called a negative binomial random variable with parameters \(p\) and \(r\).
Notation: \(X\) is \(\texttt{nbinom}(r,p)\)
The space of \(X\) is \(x = r, r+1, r+2, \ldots\)
Probability function
For \(x = r, r+1, r+2, \ldots ,\) \[p(x)= \binom{x-1}{r-1}p^{r}q^{x-r}.\]
Example C.2 A study indicates that an exploratory oil well drilled in a particular region should strike oil with probability 0.2. Find the probability that the third oil strike comes on the 10th well drilled.
Here, if \(X\) equals the number of wells drilled until the company gets its third strike, then \(X\) is \(\texttt{nbinom}(3,.2)\), and the answer to this question is \(P(X=10),\) which is \[P(X=10)=\binom{9}{2}(0.2)^{3}(0.8)^{7}.\]
## [1] 0.0604
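This can be checked by evaluating the formula directly:

```r
# C(9,2) * (.2)^3 * (.8)^7, by hand
choose(9, 2) * 0.2^3 * 0.8^7
```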
In R this distribution is accessed using nbinom, but this distribution, like geom, focuses on the number of failures, not total trials. If we want to know the probability that our third success occurs on the 10th trial, this is equivalent to the probability of having 10 - 3 = 7 failures before getting our third success, which can be computed in R as
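A sketch of that computation (dnbinom(x, size, prob) takes the number of failures x and the target number of successes size):

```r
# P(7 failures before the 3rd success), p = .2
dnbinom(7, size = 3, prob = 0.2)   # 0.06039798
```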
## [1] 0.06039798
Visualizing \(X \sim \texttt{nbinom}(3,.2)\)
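One way to sketch this distribution, shifting dnbinom's failure count over to total trials (the plotting range 3 to 30 is an arbitrary choice):

```r
# p(x) for x = total trials = failures + 3 successes
x <- 3:30
barplot(dnbinom(x - 3, size = 3, prob = 0.2),
        names.arg = x, xlab = "trials", ylab = "p(x)")
```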
C.4 Hypergeometric hyper
The Scene
A finite population has \(N\) elements, each of which possesses one of two possible characteristics. Say we have a jar of \(N\) marbles, each either red or black. Let’s say \(m\) of them are red and \(n\) of them are black (so \(m + n = N\)). We draw a sample of size \(k,\) and let \(X\) denote the number of red marbles in the sample.
Then \(X\) is called a hypergeometric random variable with parameters \(m,\) \(n,\) and \(k\).
Notation: \(X\) is \(\texttt{hyper}(m,n,k)\)
The space of \(X\) is \(x = 0,1,2,\ldots,k\) subject to the restriction that \(x \leq m\) and \(k - x \leq n\).
Probability function
The probability function is \[p(x)= \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}.\]
In R, use hyper.
Example C.3
A group of 6 seals and 4 pelicans hang at the beach, and they select a random subset of size 5 to play beach volleyball. Let \(X\) = the number of pelicans chosen.
Here, \(X\) is hypergeometric with parameters \(m = 4\) (4 pelicans), \(n = 6\) (6 seals) and \(k = 5\) (sample size).
The probability that \(X = 2\) is
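Computed from the probability function by hand:

```r
# C(4,2) * C(6,3) / C(10,5)
choose(4, 2) * choose(6, 3) / choose(10, 5)   # 0.4761905
```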
## [1] 0.4761905
We can also use the built-in command dhyper(x,m,n,k):
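A sketch with this example's parameters:

```r
# P(X = 2) for X ~ hyper(m = 4, n = 6, k = 5)
dhyper(2, m = 4, n = 6, k = 5)   # 0.4761905
```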
## [1] 0.4761905
Example C.4 (Good Potatoes Bad Potatoes in R)
A truck has 500 potatoes; 50 of them are bad and the rest are good. We sample 10. What is the probability that more than 3 are bad?
If \(X\) equals the number of bad potatoes in the sample, then \(X\) is hypergeometric with parameters \(m = 50,\) \(n=450,\) and \(k = 10\).
So \[P(X > 3) = 1 - P(X \leq 3)\]
which can be calculated with the cumulative probability command phyper:
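A sketch of the computation:

```r
# P(X > 3) = 1 - P(X <= 3) for X ~ hyper(m = 50, n = 450, k = 10)
1 - phyper(3, m = 50, n = 450, k = 10)
```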
## [1] 0.01186118
C.5 Poisson pois
The Scene
The Poisson probability distribution can provide a good model for the number of occurrences \(X\) of a rare event in time, space, or some other unit of measure. A Poisson random variable \(X\) has one parameter, \(\lambda,\) which is the average number of occurrences of the rare event in the indicated time (or space, etc.)
Notation: \(X\) is \(\texttt{Poisson}(\lambda)\).
The space of \(X\) is \(x = 0,1,2,\ldots,\) (countably infinite!)
Probability function
The probability function is \[p(x)=\frac{\lambda^x}{x!}e^{-\lambda}\]
In R, use pois.
Example C.5
Suppose \(X\) is Poisson(5). Determine \(P(X \geq 10)\).
Note: \(P(X \geq 10) = 1-P(X < 10)= 1-P(X \leq 9)\). So, using ppois() we have
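A sketch of the computation:

```r
# P(X >= 10) = 1 - P(X <= 9) for X ~ Poisson(5)
1 - ppois(9, lambda = 5)   # 0.03182806
```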
## [1] 0.03182806
Example C.6
The number \(X\) of typos on a page in a textbook follows a Poisson distribution with an average number of 2 typos per page. (a) If you pick a page at random, what is the probability it contains 0 typos? (b) According to this model, 99% of the pages have no more than how many typos?
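Both parts can be answered with dpois and qpois; a sketch:

```r
# (a) P(X = 0) for X ~ Poisson(2)
dpois(0, lambda = 2)      # 0.1353353, i.e. e^(-2)
# (b) smallest x with P(X <= x) >= .99
qpois(0.99, lambda = 2)   # 6
```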
## [1] 0.1353353
## [1] 6
Example C.7 (Rutherford/Geiger Data)
In a paper published in 1910 entitled “The Probability Variations in the Distribution of \(\alpha\)-particles”, Rutherford and Geiger reported data that counted the number of “scintillations” in 72 second intervals caused by radioactive decay of a quantity of the element polonium.
Here are the data:
results=rep(0:14,c(57,203,383,525,532,408,273,139,45,27,10,4,0,1,1))
trials=length(results)
table(results)
## results
## 0 1 2 3 4 5 6 7 8 9 10 11 13 14
## 57 203 383 525 532 408 273 139 45 27 10 4 1 1
barplot(table(results)/trials,
ylim=c(0,.25),
ylab="rel. freq",
xlab="scintillations",
main="Rutherford/Geiger Data")
Here’s the mean of the data (which gives average # of scintillations in 72 seconds):
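Presumably computed with mean(); a sketch (repeating the data definition from above so the chunk stands alone):

```r
# results as defined above
results <- rep(0:14, c(57,203,383,525,532,408,273,139,45,27,10,4,0,1,1))
mean(results)   # 3.871549
```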
## [1] 3.871549
Let’s compare the observed relative frequencies to the theoretical probabilities associated with a \(\texttt{Poisson}(3.87)\) distribution:
x | rel_freq | pois_prob |
---|---|---|
0 | 0.0219 | 0.0209 |
1 | 0.0778 | 0.0807 |
2 | 0.1469 | 0.1562 |
3 | 0.2013 | 0.2015 |
4 | 0.2040 | 0.1949 |
5 | 0.1564 | 0.1509 |
6 | 0.1047 | 0.0973 |
7 | 0.0533 | 0.0538 |
8 | 0.0173 | 0.0260 |
9 | 0.0104 | 0.0112 |
10 | 0.0038 | 0.0043 |
11 | 0.0015 | 0.0015 |
12 | 0.0000 | 0.0005 |
13 | 0.0004 | 0.0001 |
14 | 0.0004 | 0.0000 |
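The table above can be rebuilt with code along these lines (factor levels keep the empty count at \(x = 12\); \(\lambda = 3.87\) as stated):

```r
results <- rep(0:14, c(57,203,383,525,532,408,273,139,45,27,10,4,0,1,1))
trials <- length(results)
rel_freq <- as.vector(table(factor(results, levels = 0:14))) / trials
pois_prob <- dpois(0:14, lambda = 3.87)
round(data.frame(x = 0:14, rel_freq, pois_prob), 4)
```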
C.6 Homemade Discrete Random Variables
Let’s say a discrete random variable \(X\) has finite sample space and known probability function \(p(x)\). We often display this type of probability model via a table:
\[ \begin{array}{c|c|c|c|c|c} x & 5 & 6 & 7 & 8 & 9 \\ \hline p(x) & 0.1 & 0.1 & 0.3 & 0.4 & 0.1 \end{array} \] We can input this model into an R session by defining two vectors:
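For instance (the vector names X and Px match those used below):

```r
X  <- c(5, 6, 7, 8, 9)        # values of the random variable
Px <- c(.1, .1, .3, .4, .1)   # p(x) for each value
```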
We can check in R that the two conditions for a valid probability function have been met by this assignment:
- Each probability is non-negative: Px >= 0 returns TRUE TRUE TRUE TRUE TRUE
- The probabilities add to 1: sum(Px) returns 1
Expected Value of \(X\)
Recall if \(X\) is a discrete random variable with probability function \(p(x),\) then the expected value of \(X\) is \[E(X)=\sum_{\text{all }x}x\cdot p(x)\]
Having defined vectors \(X\) and \(Px\) in R, we calculate \(E(X)\) by running
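A sketch, with X and Px as defined above:

```r
# E(X) = sum over all x of x * p(x)
X  <- c(5, 6, 7, 8, 9)
Px <- c(.1, .1, .3, .4, .1)
sum(X * Px)   # 7.3
```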
## [1] 7.3
Note: For those who have taken vector calculus, sum(v*w) returns the dot product of v and w, aka the inner product. R has an alternative command for this dot product: v %*% w. So sum(v*w) and v %*% w do the same thing, but I prefer the first option to remind me that the expected value is obtained as a sum over all \(x\) of some things.
Variance of \(X\)
Recall the variance of \(X\) is \[V(X) = E[(X-\mu)^2],\] where \(\mu = E(X)\). Alternatively, the variance can be computed via \[V(X) = E(X^2)-\mu^2.\]
So we can compute the variance of \(X\) in R as follows:
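A sketch of the direct computation (X and Px as defined above):

```r
X  <- c(5, 6, 7, 8, 9)
Px <- c(.1, .1, .3, .4, .1)
mu <- sum(X * Px)       # E(X) = 7.3
sum((X - mu)^2 * Px)    # V(X) = 1.21
```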
## [1] 1.21
Or, alternatively, as follows:
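The shortcut formula, in the same setting:

```r
X  <- c(5, 6, 7, 8, 9)
Px <- c(.1, .1, .3, .4, .1)
sum(X^2 * Px) - sum(X * Px)^2   # E(X^2) - mu^2 = 1.21
```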
## [1] 1.21
Distribution Plots
R can offer some quick visualizations of probability distributions.
The following code will give the shape of the probability distribution (with a splash of color and a plot title):
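A sketch (the color and title are my choices, not the original ones):

```r
X  <- c(5, 6, 7, 8, 9)
Px <- c(.1, .1, .3, .4, .1)
barplot(Px, names.arg = X, col = "steelblue",
        xlab = "x", ylab = "p(x)",
        main = "Probability distribution of X")
```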
Sampling
The following code draws a sample of size 10 from our distribution using the weighted probabilities assigned by the probability function:
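A sketch using sample() with the prob argument:

```r
X  <- c(5, 6, 7, 8, 9)
Px <- c(.1, .1, .3, .4, .1)
sample(X, size = 10, replace = TRUE, prob = Px)
```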
## [1] 6 7 9 8 8 9 7 8 6 8
If we take a large sample, and make a relative frequency table of the results, it should be close to the probability table:
##
## 5 6 7 8 9
## 0.382 0.383 1.171 1.521 0.377
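A sketch of this check (a sample size of 10000 is my choice; each run varies slightly, but the relative frequencies should land near .1, .1, .3, .4, .1):

```r
X  <- c(5, 6, 7, 8, 9)
Px <- c(.1, .1, .3, .4, .1)
big <- sample(X, size = 10000, replace = TRUE, prob = Px)
table(big) / length(big)
```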
C.6.1 Discrete Uniform Distribution
Definition C.1 Suppose the space of \(X\) is a finite set \(S\) with size \(|S| = n\). The probability distribution defined by \[p(x) = \frac{1}{n}\] for all \(x \in S\) is called uniform.
In a uniform distribution, we will find over a large number of trials that each value comes up with about the same frequency.
Example C.8
Pick a random seal from the famous Eddington family: Otto, Ruth, Pluotika, Slarftel, Edgar and Bob.
To simulate the process of picking one seal at random from the family, a large number of times, we sample 1 element with replacement, a large number of times.
family=c("Bob", "Edgar", "Pluotika", "Otto", "Ruth", "Slarftel")
results=sample(family,10000,replace=TRUE)
The resulting frequency plot should look uniform:
library(ggplot2)
ggplot(data.frame(results)) +
  geom_bar(aes(x = results, fill = results)) +
  xlab("Name") +
  ggtitle("Pick a seal, any seal!") +
  theme(legend.position = "none")
Way to go Otto, you overachiever!