# A Sampling in R

## A.1 Data vectors

Use the `c()` command to enter an ordered list of elements. Separate entries with commas. The resulting object in R is called a **data vector**, or **vector**.

### vector types

We see vectors of three types: **numeric**, **character**, and **logical**.

A character vector consists of a list of strings. Strings are entered with quotes.

The vector `x` below is numeric. No quotes, just numbers.

A **logical** vector consists of a list of `TRUE` or `FALSE` elements (all caps!):

We can check the vector type with the `typeof()` command:

`## [1] "character"`

If you mix numbers and strings in a vector, R treats it as a character vector:

`## [1] "character"`
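As a minimal sketch of the three vector types, here are some example vectors. The names and values are illustrative assumptions, not the ones used to produce the output above. One thing to watch for: `typeof()` reports ordinary numeric vectors as `"double"`, which is R's storage type for them.

```r
# Hypothetical example vectors (names and values are assumptions)
s = c("red", "green", "blue")   # character: strings go in quotes
x = c(3, 1.5, -2)               # numeric: no quotes, just numbers
b = c(TRUE, FALSE, TRUE)        # logical: TRUE/FALSE in all caps
m = c(1, "two", 3)              # mixed: the numbers are coerced to strings

typeof(s)      # "character"
typeof(x)      # "double" (the storage type of a numeric vector)
typeof(b)      # "logical"
typeof(m)      # "character"
is.numeric(x)  # TRUE
```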

We may wish to place data vectors into a two-dimensional structure such as a matrix or a data frame.

### matrices

Create a matrix from a vector with the `matrix()` command, specifying how many rows, and whether we enter the data in the matrix by row, or by column.

```
## [,1] [,2] [,3]
## [1,] "a" "a" "a"
## [2,] "b" "b" "b"
## [3,] "c" "c" "c"
## [4,] "d" "d" "d"
```

```
## [,1] [,2] [,3] [,4]
## [1,] "a" "b" "c" "d"
## [2,] "a" "b" "c" "d"
## [3,] "a" "b" "c" "d"
```
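One way to produce matrices like the two above is sketched here. The input vector `v` is an assumption; any 12-element vector repeating "a" through "d" would behave the same way.

```r
# Assumed input: the letters a-d repeated three times (12 entries)
v = rep(c("a", "b", "c", "d"), 3)

m1 = matrix(v, nrow = 4)                 # 4 rows, filled column by column (the default)
m2 = matrix(v, nrow = 3, byrow = TRUE)   # 3 rows, filled row by row

m1
m2
```

R works out the number of columns from the length of the vector and the number of rows.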

### data frames

A data frame links related vectors as columns in an array via the `data.frame()` command.

```
a = c("McMinnville","Denver","Minneapolis","Charleston")
x = c(45.21,39.74,44.98,32.78)
y = c(123.19,104.99,93.26,79.93)
df = data.frame(city = a, lat = x, long = y)
df
```

```
## city lat long
## 1 McMinnville 45.21 123.19
## 2 Denver 39.74 104.99
## 3 Minneapolis 44.98 93.26
## 4 Charleston 32.78 79.93
```

Data frames are the most common way to manage related data vectors in R.

### common vector commands

Here’s a vector of Hank Aaron’s home run totals in each of his MLB seasons:

With `hr` loaded into your session, you can refer to it by name when you want to extract features of it. Here are some commonly used commands on numeric vectors:

- `length(hr)`, number of elements in the vector (number of seasons Hank played)
- `sum(hr)`, the sum of the vector (total career home runs)
- `mean(hr)`, the mean of the vector (average HR per season)
- `max(hr)`, the max (best HR total in a season)
- `sd(hr)`, standard deviation
- `diff(hr)` returns a vector whose elements are the differences between consecutive elements in the vector `hr`
- `cumsum(hr)` returns a vector whose elements are the cumulative sums of the vector `hr`
- `rev(hr)` returns the vector in reverse order

Behold:

```
## [1] 14 -1 18 -14 9 1 -6 11 -1 -20 8 12 -5 -10 15 -6 9 -13 6
## [20] -20 -8 -2
```

```
## [1] 13 40 66 110 140 179 219 253 298 342 366 398 442 481 510 554 592 639 673
## [20] 713 733 745 755
```
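The commands listed above can be tried on `hr` as follows. The season totals are the ones printed in the "extracting elements" section of this appendix; the two outputs above correspond to `diff(hr)` and `cumsum(hr)`.

```r
# Hank Aaron's home runs by season (values as printed later in this appendix)
hr = c(13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
       44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10)

length(hr)   # 23 seasons
sum(hr)      # 755 career home runs
mean(hr)     # average HR per season
max(hr)      # 47, his best season
sd(hr)       # standard deviation
diff(hr)     # season-to-season changes (22 values)
cumsum(hr)   # running career total after each season
rev(hr)      # seasons in reverse order
```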

### comparison operators

We compare things in R with various comparison operators, each one returning TRUE or FALSE:

- equal to `==`
- not equal to `!=`
- less than `<`
- less than or equal to `<=`
- greater than `>`
- greater than or equal to `>=`

A few examples:

`## [1] TRUE`

Use **double equal signs** `==` to see whether two things are equal:

`## [1] TRUE`

`## [1] FALSE`

```
x = 3 # this defines the variable
x^2+3*x == 12 #this asks whether x^2 + 3*x equals 12 for the currently stored value of x (x=3 in this case)
```

`## [1] FALSE`

Logical vectors arise when we give R a proposition involving a vector:

`## [1] FALSE TRUE FALSE TRUE`
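Here is a short sketch of comparisons of this kind. The specific expressions and the vector `y` are hypothetical, chosen only to illustrate the operators; they are not necessarily the ones behind the outputs above.

```r
3 < 5         # TRUE
2 + 2 == 4    # TRUE
"a" != "a"    # FALSE

# A proposition about a vector is tested element by element
y = c(1, 5, 2, 8)   # hypothetical vector
y > 3               # FALSE TRUE FALSE TRUE
```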

### checking membership

The `%in%` command can be used to ask whether a particular element is contained in a vector.

`## [1] FALSE`

`## [1] TRUE`
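A minimal sketch of a membership check, using a hypothetical vector `kids` (an assumption for illustration):

```r
kids = c("Will", "Lucas", "Mike", "Dustin")   # hypothetical vector

"Max" %in% kids              # FALSE
"Mike" %in% kids             # TRUE
c("Will", "Max") %in% kids   # TRUE FALSE: one answer per element on the left
```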

### `sum()` and `which()`

The `sum()` command on a numeric vector adds the elements of the vector, as we saw above with `sum(hr)`.

The `sum()` command on a logical vector returns the number of TRUE elements in the vector.

`## [1] 4`

We can thus easily count the number of elements in a vector meeting some condition:

`## [1] 8`

8 seasons with at least 40 HR?!! Of course! 8! Ok, which seasons?

`## [1] 4 7 9 10 13 16 18 20`

The `which()` command returns the indices of the vector at which the condition being tested has been met. So Hank hit 40 or more HR in seasons 4, 7, 9, 10, 13, 16, 18, and 20.
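The counts and positions above can be reproduced along these lines. The small logical vector in the first line is a hypothetical example; `hr` holds the season totals printed in this appendix.

```r
hr = c(13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
       44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10)

sum(c(TRUE, FALSE, TRUE, TRUE, TRUE))   # 4: each TRUE counts as 1 (hypothetical vector)

hr >= 40          # a logical vector, one entry per season
sum(hr >= 40)     # 8 seasons with at least 40 HR
which(hr >= 40)   # 4 7 9 10 13 16 18 20
```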

### extracting elements

Recall Hank Aaron’s home runs by season:

`## [1] 13 27 26 44 30 39 40 34 45 44 24 32 44 39 29 44 38 47 34 40 20 12 10`

We can **extract an element** of a vector by indicating its [position]:

`## [1] 26`

Or we can specify several elements:

`## [1] 13 26 30`
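The outputs above are consistent with the following extractions (a sketch; the exact indices used originally are an assumption inferred from the printed values):

```r
hr = c(13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
       44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10)

hr[3]            # 26: the third element
hr[c(1, 3, 5)]   # 13 26 30: several positions at once
hr[hr >= 45]     # 45 47: elements meeting a condition
```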

### comparing vectors

We can count the number of positions in which two vectors of the same length agree

`## [1] 1`

We can find the position(s) at which they agree

`## [1] 4`

and list the matching value(s):

`## [1] 8`
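A sketch of these three steps, using two hypothetical vectors `p` and `q` chosen so that they agree only in position 4:

```r
p = c(2, 4, 6, 8)   # hypothetical vectors of equal length
q = c(1, 3, 5, 8)

sum(p == q)     # 1: number of positions where they agree
which(p == q)   # 4: the position(s) of agreement
p[p == q]       # 8: the matching value(s)
```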

### vector arithmetic

We can do element-wise arithmetic on two vectors of equal length, such as addition, subtraction, multiplication, division, and exponentiation:

Operation | Result |
---|---|
`v + w` | 0, 5, 8 |
`v - w` | -2, -3, -2 |
`v * w` | -1, 4, 15 |
`v / w` | -1, 0.25, 0.6 |
`v^w` | -1, 1, 243 |

We also have scalar multiplication,

`## [1] -8 8 24`

and, don’t tell your vector calc prof, but you can add a scalar to each element in a vector:

`## [1] 7 9 11`
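The table and the two outputs above are consistent with `v = c(-1, 1, 3)`, `w = c(1, 4, 5)`, and the scalar 8. These values are inferred from the printed results rather than stated in the text, so treat them as an assumption:

```r
v = c(-1, 1, 3)   # inferred from the results above
w = c(1, 4, 5)

v + w   #  0  5  8
v - w   # -2 -3 -2
v * w   # -1  4 15
v / w   # -1.00  0.25  0.60
v^w     # -1  1  243

8 * v   # -8  8 24   scalar multiplication
v + 8   #  7  9 11   the scalar is added to every element
```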

### concatenate vectors

The `c()` command allows you to concatenate vectors:

`## [1] 1 2 3 4 5 6`

We can add an element to a vector `A` via concatenation:

`## [1] "Will" "Lucas" "Mike" "Dustin" "Eleven"`

Notice, the vector `A` currently has 5 elements. We can also add a 6th element to `A` directly:

`## [1] "Will" "Lucas" "Mike" "Dustin" "Eleven" "Max"`
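The three outputs in this section could come from commands like these. The starting contents of `A` and the two numeric vectors are assumptions inferred from the printed results:

```r
c(1:3, 4:6)   # 1 2 3 4 5 6: two vectors joined into one

A = c("Will", "Lucas", "Mike", "Dustin")   # assumed starting vector
A = c(A, "Eleven")                          # append via concatenation
A

A[6] = "Max"   # assign directly to position 6, one past the current end
A
```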

## A.2 Special vectors

### consecutive integers

**The integers 1 to n** can be entered by typing `1:n`. For instance, we could define a 20 sided die by entering

More generally, entering `a:b` creates a vector of consecutive integers starting with `a` and ending with `b` (even if `a` is greater than or equal to `b`).

`## [1] 8 7 6 5 4 3 2`
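For example (the name `die` is an assumption; the second line matches the output above):

```r
die = 1:20   # a 20 sided die: the integers 1 through 20
length(die)  # 20

8:2          # 8 7 6 5 4 3 2: counts down when a > b
```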

### letters

```
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
```

`LETTERS` is the capitalized version of the `letters` vector.

### `rep()`

The `rep()` command lets us build a vector with lots of repeated elements.

**Example A.1** Let’s say we want to create a **bag of skittles** with this color distribution: 40 red, 30 orange, 25 yellow, 60 green, and 20 purple.

The `rep()` command lets us do this quickly:

- first enter the distinct items (as a vector with the `c()` command!),
- then enter how many times each occurs (as a vector!):
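Following those two steps, the bag for Example A.1 can be built like this (the name `skittles` is the one used for this bag in the sampling section):

```r
skittles = rep(c("red", "orange", "yellow", "green", "purple"),
               c(40, 30, 25, 60, 20))

length(skittles)   # 175 skittles in the bag
table(skittles)    # counts per color (listed alphabetically)
```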

## A.3 Sampling

We use the `sample(x,...)` command to sample from vector `x`.

For instance, we can draw a random sample of size 2 from `hr`:

`## [1] 44 13`

Here’s another example. Let’s grab 20 skittles at random from the bag `skittles` we created in example A.1 and count how many orange ones we get:

```
## grab
## green orange red yellow
## 8 4 5 3
```

The `table()` command counts how many of each color :).

We could have found the orange count directly with

`## [1] 4`
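A sketch of the skittles draw. The call to `set.seed()` is an addition (an assumption, not part of the original example) so that the draw is repeatable; your counts will differ from the ones printed above.

```r
set.seed(1)   # assumption: fix the random seed so the draw can be repeated

skittles = rep(c("red", "orange", "yellow", "green", "purple"),
               c(40, 30, 25, 60, 20))

grab = sample(skittles, 20)   # 20 skittles, drawn without replacement
table(grab)                   # how many of each color we drew
sum(grab == "orange")         # the orange count, directly
```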

### `sample()` options

Typically, we provide the `sample()` command with 3 or 4 arguments, in this order:

- `x`, the vector we sample from
- `size`, the size of the sample
- `replace`, whether you sample with or without replacement (default = FALSE)
- `prob`, custom probabilities for the sampling of elements (default = equal probability for all elements in `x`)

If you enter your arguments in the order `x=, size=, replace=, prob=` then you do not need to specify the variable names.

If you do not specify their value, the `sample()` command assumes the following **default values**:

- `size` = the length of the vector
- `replace` = FALSE
- `prob` is set so all elements in the vector have equal probability of being chosen.

Here are handy special cases, illustrated with this vector:

#### permutations

Use `sample(x)` to generate a *random permutation* of `x`:

`## [1] "hedgehog" "rabbit" "cat" "dog"`

#### repeated sampling of 1 element

Use `sample()` to simulate picking one element of `animals` \(n\) times by setting `size = n` and `replace = TRUE`.

Example: Draw one animal from the set 1000 different times and summarize the picks with a table.

```
## picks
## cat dog hedgehog rabbit
## 238 232 262 268
```

And the winner is… rabbit!

Or, since we fear rabbits and love dogs, we can do repeated sampling of a single element with **custom probabilities**:

```
## picks2
## cat dog hedgehog rabbit
## 228 378 276 118
```

Nice!
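The special cases above can be sketched as follows. The order of `animals`, the seed, and especially the weights in `prob` are assumptions; the text does not give the probabilities behind the second table, so the ones here merely favor dogs and disfavor rabbits.

```r
set.seed(2)   # assumption: fixed seed for repeatability

animals = c("cat", "dog", "hedgehog", "rabbit")   # assumed order

sample(animals)   # a random permutation of the four animals

# Pick one animal 1000 times: size = 1000 with replacement
picks = sample(animals, size = 1000, replace = TRUE)
table(picks)

# Same, but with hypothetical custom probabilities (must line up with animals)
picks2 = sample(animals, size = 1000, replace = TRUE,
                prob = c(0.25, 0.40, 0.25, 0.10))
table(picks2)
```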

Remember, the default option for `sample()` is to sample without replacement, and with equal probabilities.

### sample without replacement

Task

Pick 4 students at random from a class of 9 to race around Taylor Hall. (Assumes we have numbered the students 1-9.)

`## [1] 9 1 4 3`
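One way to carry out this task (the variable name `racers` is an assumption):

```r
racers = sample(1:9, 4)   # 4 distinct students; replace = FALSE is the default
racers
```

Because the sampling is without replacement, no student can be picked twice.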

### sample with replacement

Task

On twelve consecutive days, ask one student at random, in a class of size 9, to write a solution on the board.

`## [1] 8 5 1 5 2 8 5 4 9 8 4 5`
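A sketch for this task (the name `board` is an assumption). With replacement, the same student can be called on more than once, as student 5 was in the output above:

```r
board = sample(1:9, 12, replace = TRUE)   # one student per day for 12 days
board
```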

### sample with custom probabilities

Task

Roll a weighted 6 sided die with the following probability distribution 100 times and summarize the results.

\(x\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(p(x)\) | .2 | .1 | .05 | .4 | .1 | .15 |

```
## rolls
## 1 2 3 4 5 6
## 18 8 7 45 7 15
```
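The weighted die can be rolled like this. The name `rolls` matches the table header above; the probabilities come from the table in the task, and the seed is an added assumption.

```r
set.seed(3)   # assumption: fixed seed for repeatability

p = c(0.2, 0.1, 0.05, 0.4, 0.1, 0.15)   # p(x) for faces 1 through 6
sum(p)                                   # should be 1

rolls = sample(1:6, 100, replace = TRUE, prob = p)
table(rolls)
```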

### example: Lefties

Task

8% of a population is left-handed. Draw a random sample of 45 people from the population and record the number of lefties.

One approach: build a large population with these features and then draw 45 people from it without replacement.

```
pop=rep(c("L","R"),c(800,9200)) # a population of 10,000 people, 800 of them lefties
table(sample(pop,size=45))
```

```
##
## L R
## 3 42
```

A second approach: sample with replacement 45 times from a “two-sided” die with customized probabilities

```
##
## L R
## 2 43
```
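The second approach might look like this (a sketch; the name `handed` and the seed are assumptions):

```r
set.seed(4)   # assumption: fixed seed for repeatability

# 45 draws from a "two-sided die": L with probability .08, R with probability .92
handed = sample(c("L", "R"), size = 45, replace = TRUE, prob = c(0.08, 0.92))
table(handed)
sum(handed == "L")   # number of lefties in the sample
```

Unlike the first approach, this one never builds the population; each person is left-handed independently with probability .08.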

A third approach: Use a binomial distribution (later)

## A.4 Repeated sampling

Let’s say we have a huge urn full of orange and blue marbles, and 42% of them are orange. We can use repeated sampling to approximate the **sampling distribution** for the number of orange marbles we would draw in a sample of, say, 50 marbles.
The sampling distribution tells us what sorts of orange marble counts we should expect, and how often we should expect them.

*Repeated sampling* can estimate this sampling distribution. Here are two methods for achieving repeated sampling in R.

### using a `for` loop

The code below creates a vector called *orange_counts* that, eventually, after the for loop has completed, has 10000 entries. Each entry in this vector gives the number of orange marbles drawn from the urn from a random sample of 50 marbles.

```
colors=c("orange","blue")
orange_counts=c() #a vector for storing the results of each trial
for (i in 1:10000){
orange_counts[i]=sum(sample(colors,50,replace=TRUE,prob=c(.42,.58))=="orange")
}
```

We know that `table(orange_counts)` would display the counts of each of the unique values occurring in `orange_counts`. We can visualize these counts with a `barplot()`:
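A sketch of the loop together with the plot. Two small departures from the code above, both assumptions on my part: the seed is fixed, and the results vector is created at full length with `numeric(10000)` instead of being grown from `c()`. The axis labels are also illustrative.

```r
set.seed(5)   # assumption: fixed seed for repeatability

colors = c("orange", "blue")
orange_counts = numeric(10000)   # room for the result of each trial
for (i in 1:10000) {
  draw = sample(colors, 50, replace = TRUE, prob = c(.42, .58))
  orange_counts[i] = sum(draw == "orange")
}

# Bar heights show how often each orange count occurred in 10000 trials
barplot(table(orange_counts),
        xlab = "orange marbles in a sample of 50",
        ylab = "number of trials")
```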

### using `replicate()`

The `replicate()` command essentially does the above for loop for us :) The command `replicate(n,expr)` will evaluate `expr` `n` times, and store the results.

```
colors=c("orange","blue")
orange_counts =
replicate(10000,
sum(sample(colors,50,replace=TRUE,prob=c(.42,.58)) == "orange")
)
```

Again, we can summarize the *frequency* with which each value of `orange_counts` occurs with `table()`, and visualize this frequency table with a `barplot()`:

In addition, we can calculate summary statistics to put numbers to qualitative descriptions of the distribution of values in `orange_counts` such as center and spread. These statistics help us answer the question of what sorts of orange counts to expect.

statistic | command | result |
---|---|---|
mean | `mean(orange_counts)` | 21.0736 |
standard deviation | `sd(orange_counts)` | 3.49096 |
five number summary | `fivenum(orange_counts)` | 8, 19, 21, 23, 35 |
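The statistics in the table can be recomputed as below. The seed is an added assumption, so the values will be close to, but not the same as, those in the table. As a sanity check, the last two lines compute the mean and standard deviation that the binomial distribution (mentioned earlier as a later topic) gives for 50 draws with a 42% chance of orange.

```r
set.seed(6)   # assumption: fixed seed for repeatability

colors = c("orange", "blue")
orange_counts = replicate(10000,
  sum(sample(colors, 50, replace = TRUE, prob = c(.42, .58)) == "orange"))

mean(orange_counts)      # close to 21
sd(orange_counts)        # close to 3.5
fivenum(orange_counts)   # min, lower hinge, median, upper hinge, max

50 * 0.42                # 21: binomial mean n*p
sqrt(50 * 0.42 * 0.58)   # about 3.49: binomial sd sqrt(n*p*(1-p))
```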

## A.5 Summary of R commands

### defining vectors

Command | Description | Example |
---|---|---|
`c()` | List the elements | `x = c("a","c","c","z","z","z")` |
`a:b` | Consecutive integers from `a` to `b` | `8:4` returns the vector 8, 7, 6, 5, 4 |
`rep()` | Build a vector from a frequency table | `rep(c("y","n"),c(3,2))` returns y, y, y, n, n |
`seq()` | Arithmetic progression (first,last,step) | `seq(0,1,.2)` returns 0, 0.2, 0.4, 0.6, 0.8, 1 |

### summarizing vectors

Command | Description |
---|---|
`typeof(x)` | the vector type of `x` (usually `character`, `logical`, or, for numeric vectors, `double` or `integer`) |
`length(x)` | the length of `x` (how many elements it has) |
`table(x)` | the frequency table (which values occur in `x` along with how often each value occurs) |

### sampling from vectors

Sampling Options | Example with `x = 1:6` |
---|---|
permutation of `x` | `sample(x) =` 4, 1, 6, 5, 3, 2 |
sample \(n\) elements without replacement | `sample(x,3) =` 3, 5, 4 |
sample \(n\) elements with replacement | `sample(x,5,replace=T) =` 4, 6, 6, 3, 4 |
sample with custom probabilities | `sample(x,10,replace=T,prob= c(0,.2,0,.5,.1,.2)) =` 4, 4, 5, 2, 4, 2, 4, 4, 5, 4 |