I encourage you to reproduce all of the results on this page in your own RStudio session.
In MATH 140 we use R to manage, explore, and analyze data.
We will either enter data manually in R (as vectors or data frames), or import data in R via spreadsheet files.
A vector is an ordered list of values or characters.
Use the `c()` command to input a vector.
Run this code in your session (copy, paste, and run):
score = c(219, 223, 186, 256, 295, 211, 175, 318)
You have created a vector called `score` that contains eight numbers. In fact, `score` is a record of my last eight Yahtzee game scores. After you run the line above in your own session, you will see the value `score` in your Environment tab. Check that it’s there!
Note: The letter ‘c’ stands for combine, or concatenate, and the ‘c’ also reminds me to separate the entries in the `c()` command with commas.
Having named the vector, we can refer to it by name when we want to do anything with it. For instance:
command | result | description |
---|---|---|
`length(score)` | 8 | the number of entries in the vector |
`sum(score)` | 1883 | the sum of the scores |
`mean(score)` | 235.375 | the mean of the scores |
`max(score)` | 318 | the maximum score |
`sd(score)` | 50.5764979 | the standard deviation of the scores |
More on finding descriptive statistics below.
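As a quick check on the table above, here is a small sketch that rebuilds the mean from `sum()` and `length()`. The name `by.hand` is my own, introduced only for this illustration.

```r
score = c(219, 223, 186, 256, 295, 211, 175, 318)

# the mean is the sum of the entries divided by the number of entries
by.hand = sum(score) / length(score)
by.hand
## [1] 235.375

# the built-in command should agree
mean(score)
## [1] 235.375
```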
A data frame is a collection of related vectors bound together as columns in a spreadsheet:
We can either create a data frame in R by assembling defined vectors, or we import a spreadsheet file as a data frame. We look at both options below.
Pro Tip: Do not enter data by hand if you don’t have to. It’s much, much easier (and less prone to error) to load saved spreadsheets into R than it is to retype the entries by hand in R.
Using the `data.frame()` command to create a data frame from vectors.
As the spreadsheet pictured above shows, I have two other bits of information about those eight Yahtzee games I played: whether I scored a Yahtzee during the game (vector `yahtzee`), and my mood at the start of each game (vector `mood`):
score = c(219, 223, 186, 256, 295, 211, 175, 318)
yahtzee=c("no","no","no","no","yes","no","no","yes")
mood=c("content","pensive","anxious","pensive","content","content","pensive","anxious")
Note: If I’m not entering numerical values, I need to use quotes around each entry in the vector.
Create a data frame:
games=data.frame(mood,score,yahtzee)
## mood score yahtzee
## 1 content 219 no
## 2 pensive 223 no
## 3 anxious 186 no
## 4 pensive 256 no
## 5 content 295 yes
## 6 content 211 no
## 7 pensive 175 no
## 8 anxious 318 yes
Note: Each row gives three pieces of information about a single game. For instance, at the start of game 3 I felt anxious, my game 3 score was 186, and I did not roll a Yahtzee during the game. Sad.
Note: This data frame has two categorical variables, `mood` and `yahtzee`, and one numerical variable, `score`. When entering categorical data in a vector, we must use quotes around each entry. When entering numerical data we do not use quotes.
The data frame `games` should appear in your Environment tab after running the lines above. My Environment tab looks like this:
Notice that the Environment tab indicates whether a vector is categorical (`chr`) or numerical (`num` or `int`).
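If you would rather check the variable types from the console than from the Environment tab, here is a minimal sketch using `class()` and `str()`. I set `stringsAsFactors = FALSE` explicitly so that text columns stay as character vectors; that is already the default in R 4.0 and later, so on older versions your results may differ without it.

```r
mood = c("content","pensive","anxious","pensive","content","content","pensive","anxious")
score = c(219, 223, 186, 256, 295, 211, 175, 318)
yahtzee = c("no","no","no","no","yes","no","no","yes")

# keep text columns as character vectors rather than factors
games = data.frame(mood, score, yahtzee, stringsAsFactors = FALSE)

class(games$score)   # a numerical column
## [1] "numeric"
class(games$mood)    # a categorical column, stored as text
## [1] "character"
str(games)           # a compact summary of every column and its type
```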
color=sample(c("red","blue","purple"),size=100,replace=TRUE,prob=c(.2,.2,.6))
roll=sample(1:4,size=100,replace=TRUE)
df=data.frame(color,roll)
dim(df)
## [1] 100 2
Unpacking the above code:
- The code creates a data frame `df` that has two columns, called `color` and `roll`, and 100 rows.
- Both columns are built with the `sample()` command in R.
- `color` is created by drawing 100 times from the set of colors `c("red","blue","purple")`, where, for each draw, the probability of selecting “red” is 0.2, the probability of selecting “blue” is 0.2, and the probability of selecting “purple” is 0.6.
- `roll` is created by drawing 100 times from the set of numbers 1 through 4 (all numbers equally likely to be drawn).
- The last line, `dim(df)`, returns the number of rows (100) and columns (2) in `df`.

head(df,8)
## color roll
## 1 purple 4
## 2 blue 2
## 3 purple 3
## 4 purple 4
## 5 purple 1
## 6 red 3
## 7 blue 2
## 8 purple 4
I hear you asking, “Why would you want to build this `df`?”
Imagine wanting to simulate the following game: I have 10 4-sided dice in a box, 6 of them are purple, 2 are red, and 2 are blue. I want to pick a die at random from the box, roll it once, and record the roll; and I want to play this game 100 times and record the results.
And now I hear you asking “Again, why would I want to do this?”
OK, suppose a campus community is, politically, 20% Democrat (blue), 20% Republican (red), and 60% independent (purple). We pick 100 people at random and ask them their opinion on some question that has 4 possible answers. The above `df` simulates what sort of results we should expect if each person’s answer is random, and independent of their political affiliation. Seeing what sorts of results to expect if political affiliation is not a factor helps us to investigate whether real data we gather suggests political affiliation is a factor in one’s views on this question.
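To see the simulated answers broken down by group, here is a sketch that rebuilds the data frame and cross-tabulates it. The `set.seed()` line and the seed value 140 are my additions (any number works) so that the random draws come out the same each time you run the code; the name `counts` is also just for illustration.

```r
set.seed(140)  # fix the random number generator so the draws are repeatable

color = sample(c("red","blue","purple"), size=100, replace=TRUE, prob=c(.2,.2,.6))
roll = sample(1:4, size=100, replace=TRUE)
df = data.frame(color, roll)

# rows are the groups (colors), columns are the four possible answers (rolls)
counts = table(df$color, df$roll)
counts

# the row totals give the number of people drawn from each group
rowSums(counts)
```

If answers really are independent of group, the four counts within each row should be roughly equal, give or take random variation.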
OK, this ends the brief introduction to building data frames in R from scratch. Again, generally data frames are more easily entered in R by importing them from spreadsheet files.
A common file type is the .csv file, which stands for comma separated values. With RStudio, such data files are at our fingertips! We can import a .csv file from the web using the `read.csv()` command. (We can also import other types of spreadsheets with other, similar commands.)
The following code imports the file called softball.csv from our course resource page, and it names the resulting data frame `wildcats` in our RStudio session.
wildcats <- read.csv("https://www.mphitchman.com/stats/data/softball.csv")
What does the `wildcats` data set look like? Here’s a look at the first 5 rows:
head(wildcats,5)
## Year Coach W L T District NCAA
## 1 1999 Laura Kenow 20 17 0 3rd
## 2 2000 Laura Kenow 24 10 0 4th
## 3 2001 Laura Kenow 24 17 0 2nd
## 4 2002 Jackson Vaughan 21 17 0 3rd
## 5 2003 Jackson Vaughan 27 11 0 2nd
In `wildcats` each row gives info about a season, and the variables include the year, the coach, wins, losses, and how well the team did in districts and nationals.
Enter the `read.csv()` line in your session and click on `wildcats` in your Environment tab. The data frame will appear in the upper left pane of your session in spreadsheet form.
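The `read.csv()` command also works on files saved on your own computer. Here is a self-contained sketch that writes a small data frame to a temporary .csv file and reads it back in. The file location from `tempfile()` and the name `games2` are placeholders of mine; in practice you would give the path to your own file.

```r
# a small data frame to save (my Yahtzee scores again)
games = data.frame(score = c(219, 223, 186, 256, 295, 211, 175, 318),
                   yahtzee = c("no","no","no","no","yes","no","no","yes"))

path = tempfile(fileext = ".csv")          # a throwaway file location
write.csv(games, path, row.names = FALSE)  # save without the row numbers
games2 = read.csv(path)                    # read the file back in as a data frame

dim(games2)
## [1] 8 2
```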
The dollar sign `$` is the key to extracting columns from a data frame. We get to a column by entering `dfname$colname`.
For instance, `wildcats$W` will extract the `W` column of the `wildcats` data frame, and `games$score` will extract the `score` column of the `games` data frame (my Yahtzee games defined above).
To find the average number of wins per season from 1999 to 2024, we can use the `mean()` command on the `W` column of the `wildcats` data frame:
mean(wildcats$W)
## [1] 35.26923
The average number of losses per season?
mean(wildcats$L)
## [1] 9.346154
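As an aside, double square brackets give another way to pull out a column, which is handy when the column name is stored in a variable. This is only a sketch on a cut-down version of my Yahtzee data; the variable `col` is a name I made up for the illustration.

```r
games = data.frame(score = c(219, 223, 186, 256, 295, 211, 175, 318))

games$score          # extract the column by name with $
games[["score"]]     # the same column, using double brackets

identical(games$score, games[["score"]])
## [1] TRUE

col = "score"        # a column name stored in a variable
mean(games[[col]])
## [1] 235.375
```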
Generally we will work with a data frame that we load from a spreadsheet, such as `wildcats`, or a data frame that we input via vectors, such as `games`, but R also comes with several built-in data frames. Try entering the following word in the console prompt.
faithful
What you see displayed is a data frame! What on earth do these numbers mean?
This data frame gives information about eruptions of the Old Faithful Geyser. Search on ‘faithful’ in the Help tab to read all about it!
You can type `data()` at the console prompt in RStudio to see a list of all the data sets that come with the program.
Here are some common commands for describing data.
function | description |
---|---|
`length()` | returns the number of entries in the vector entered in the parentheses |
`mean()` | returns the mean of the data |
`median()` | returns the median |
`max()` | returns the maximum of the data |
`min()` | returns the minimum |
`sum()` | returns the sum of the data |
`var()` | returns the variance of the data |
`sd()` | returns the standard deviation of the data |
`summary()` | returns 6 statistics: min, Q1, median, mean, Q3, max |
`fivenum()` | returns the 5 number summary: min, Q1, median, Q3, max |
`table()` | returns the frequency of each unique value of the data |
Don’t you want to know the distribution of colored dice randomly chosen in Example 2 above!?
table(df$color)
##
## blue purple red
## 21 60 19
The average of a bunch of numbers (need `c()`!!):
mean(c(4,5,8,5,4,3,6,7,3,8,9,12,11))
## [1] 6.538462
The 5 number summary for the `W` column of the `wildcats` data frame.
fivenum(wildcats$W)
## [1] 12.0 31.0 37.5 39.0 51.0
How many games did Linfield softball win between 1999 and 2024?
sum(wildcats$W)
## [1] 917
Who has coached Linfield softball during this period, and how many seasons has each coach coached?
table(wildcats$Coach)
##
## Jackson Vaughan Laura Kenow
## 23 3
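One more example, this time comparing `fivenum()` and `summary()` on my eight Yahtzee scores. This is a sketch to show that the two commands need not report the same quartiles, since they use slightly different rules for Q1 and Q3.

```r
score = c(219, 223, 186, 256, 295, 211, 175, 318)

fivenum(score)   # min, Q1, median, Q3, max
## [1] 175.0 198.5 221.0 275.5 318.0

summary(score)   # min, Q1, median, mean, Q3, max

# summary() gets its quartiles from quantile(), which interpolates between entries
quantile(score, c(0.25, 0.75))
```

For these scores `fivenum()` reports Q1 as 198.5, while `quantile()` (and so `summary()`) computes 204.75. With larger data sets the two usually land close together.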
Extracting a particular entry in a vector:
m = c(10,12,14,16,9,13)
m[3] # returns the third element in the vector
## [1] 14
Extracting a particular entry in a data frame:
games[2,3] # returns the element in row 2, column 3 of the games data frame
## [1] "no"
Extracting specific rows (by row numbers), say rows 2 through 5:
games[2:5,]
## mood score yahtzee
## 2 pensive 223 no
## 3 anxious 186 no
## 4 pensive 256 no
## 5 content 295 yes
Note: The comma is necessary. In general, the values before the comma (2:5) specify the row numbers, and the values after the comma specify the column numbers. By leaving the column entry blank, R takes all the columns.
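Here is a short sketch of a few more row and column selections on the `games` data frame, rebuilt here so the code runs on its own. The rows need not be consecutive, and columns can be picked by name as well as by number.

```r
mood = c("content","pensive","anxious","pensive","content","content","pensive","anxious")
score = c(219, 223, 186, 256, 295, 211, 175, 318)
yahtzee = c("no","no","no","no","yes","no","no","yes")
games = data.frame(mood, score, yahtzee)

games[c(1,8), ]          # rows 1 and 8, all columns
games[, 2]               # all rows, column 2 (the scores)
games[c(1,8), "score"]   # rows 1 and 8 of the score column
## [1] 219 318
```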
We filter the Yahtzee `games` data frame built above to just those rows meeting various conditions.
List all games in which my mood was anxious
subset(games,mood=="anxious")
## mood score yahtzee
## 3 anxious 186 no
## 8 anxious 318 yes
List all games with score less than 200
subset(games, score < 200)
## mood score yahtzee
## 3 anxious 186 no
## 7 pensive 175 no
Let’s not talk about those games :).
List all games meeting two conditions (say score at least 200 and mood content)
subset(games,score>=200&mood=="content")
## mood score yahtzee
## 1 content 219 no
## 5 content 295 yes
## 6 content 211 no
Note: We may want to save a filtered data frame as a new one, which we do by giving it a name:
fun.games=subset(games,score>=200&mood=="content")
Running that line would add the data frame fun.games to my RStudio session, and it will appear in my Environment tab. Check!
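Under the hood, a condition like `score >= 200` is just a vector of TRUE and FALSE values, one per row. The sketch below (with `keep` as my own made-up name) builds that vector directly and uses it in the square brackets, which picks out the same rows as the `subset()` command.

```r
mood = c("content","pensive","anxious","pensive","content","content","pensive","anxious")
score = c(219, 223, 186, 256, 295, 211, 175, 318)
yahtzee = c("no","no","no","no","yes","no","no","yes")
games = data.frame(mood, score, yahtzee)

# TRUE for each row that meets both conditions
keep = games$score >= 200 & games$mood == "content"
keep
## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

games[keep, ]   # same rows as subset(games, score>=200 & mood=="content")
sum(keep)       # TRUE counts as 1, so this counts the matching games
## [1] 3
```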
In the `wildcats` data frame, suppose we want to focus on wins and losses by season. We can create a new data frame from the old one that has just these columns:
catWL <- wildcats[c("Year","W","L")]
head(catWL)
## Year W L
## 1 1999 20 17
## 2 2000 24 10
## 3 2001 24 17
## 4 2002 21 17
## 5 2003 27 11
## 6 2004 37 9
Suppose we want to add a column to the `wildcats` data frame that specifies the total number of games played, something we can calculate from existing columns (games played = wins + losses + ties). The following code creates the `GP` column and tells R how to build it:
wildcats$GP = wildcats$W+wildcats$L+wildcats$T
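The same idea works on any data frame. As a sketch, here I add a column to my Yahtzee data that flags the games in which I beat my own average; the column name `above.avg` is one I invented for this example.

```r
games = data.frame(score = c(219, 223, 186, 256, 295, 211, 175, 318),
                   yahtzee = c("no","no","no","no","yes","no","no","yes"))

# TRUE when a game's score is above my mean score of 235.375
games$above.avg = games$score > mean(games$score)

sum(games$above.avg)   # number of above-average games
## [1] 3
```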
R has some built-in graphing functions for histograms, boxplots, scatter plots, etc.
A basic histogram - of Linfield wins by season.
hist(wildcats$W)
When presenting data, we have a responsibility to make it readable, so we can do better than the above. We can supply titles, axis labels, annotations, colors, etc to help us visualize the data in a more useful way.
hist(wildcats$W,
main="Wins by Season, Linfield Softball 1999-2024",
xlab="Wins",
col="purple",
breaks=10
)
We specify the bins in a histogram with the `breaks` option. If we have too few bins for the data, we lose all sorts of information, and if we have too many bins we might have a harder time seeing the shape of the distribution through the noise.
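When `breaks` is a single number, R treats it as a suggestion and picks nearby "nice" bin edges. To control the bins exactly, we can hand `breaks` a vector of bin edges instead. Here is a sketch using my Yahtzee scores with bins of width 50; saving the result under the name `h` (my choice) lets us look at the bin counts.

```r
score = c(219, 223, 186, 256, 295, 211, 175, 318)

h = hist(score,
         breaks=seq(150, 350, by=50),   # bin edges: 150, 200, 250, 300, 350
         main="My last eight Yahtzee scores",
         xlab="Score",
         col="purple")

h$counts   # the number of scores in each bin
## [1] 2 3 2 1
```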
A boxplot is a visual representation of the 5 number summary, and R loves to make them!
Here’s a boxplot in its simplest form.
boxplot(wildcats$W)
By default a boxplot is presented vertically, but we can ask for a horizontal boxplot as well, and we can add a splash of color.
boxplot(wildcats$W,
main="Linfield softball wins by season (1999-2024)",
col="red",
horizontal = TRUE)
AND, we can add text to a graph. For instance, we can add the values in the 5 number summary to the boxplot, which gives more precision to the plot!
boxplot(wildcats$W,
main="Linfield softball wins by season (1999-2024)",
col="red",
horizontal = TRUE)
v<-fivenum(wildcats$W)
text(x=c(v[1:5]),y=c(rep(1.3,5)),c(v[1:5]),cex=.7,col="purple")
For categorical variables it can be useful to make barplots of counts. If we’re considering a vector of values we use the `barplot()` command in conjunction with the `table()` command:
barplot(table(wildcats$District),main="Finish at Districts",ylab="count",xlab="finish",col="#F5D7FB")
If you type `colors()` at the console prompt you can get a list of names of built-in colors.
You can find color codes on-line too, such as the one I used in the barplot code. Here’s one site for this: https://htmlcolorcodes.com/.
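Here is a self-contained version of the barplot recipe that you can run without loading any files, using the moods from my Yahtzee games. It is only a sketch; the name `counts` is mine, and saving the table first simply makes it easy to inspect the counts before plotting them.

```r
mood = c("content","pensive","anxious","pensive","content","content","pensive","anxious")

counts = table(mood)   # frequency of each mood: anxious 2, content 3, pensive 3
counts

barplot(counts,
        main="Mood at the start of each game",
        ylab="count",
        xlab="mood",
        col="#F5D7FB")
```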
A scatter plot is a useful way to visualize the relationship between two numeric variables. Here’s a scatter plot in its simplest form.
plot(x=wildcats$Year,y=wildcats$W)
We can change the shape of points in a plot with the `pch` option, and change the scale of the y-axis with the `ylim` option.
plot(x=wildcats$Year,y=wildcats$W,
main="Linfield softball wins by season",
ylab="wins",
xlab="season",
col="brown3",
ylim=c(0,60),
pch=16)
A quick guide to pch and point shapes (google “R pch options” for more details):
As an aside, we can create a line plot as well, by specifying `type="l"` (that’s an ‘l’ for ‘line’, not the number 1).
plot(x=wildcats$Year,y=wildcats$W,
main="Linfield softball wins by season",
ylab="wins",
xlab="season",
col="purple",
ylim=c(0,60),
type="l")
Let’s investigate the `faithful` data that comes in R.
dim(faithful) # This gives the number of rows (observations) and columns (variables)
## [1] 272 2
This data frame has 272 rows and 2 columns. What are the variables?
colnames(faithful)
## [1] "eruptions" "waiting"
The Help tab (in the lower right pane) gives information about this data frame (type faithful in the search box). The variable `eruptions` gives the duration of an eruption (in minutes), and the second variable, `waiting`, gives the waiting period (in minutes) until the next eruption.
We can ask for the first 7 rows of the data frame:
head(faithful,7)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
## 7 4.700 88
We can extract the entry in row 6, column 2:
faithful[6,2]
## [1] 55
We can extract the subset of all the data for which the eruption value is at least 5 minutes:
subset(faithful, eruptions >= 5)
## eruptions waiting
## 76 5.067 76
## 149 5.100 96
## 151 5.033 77
## 168 5.000 88
Only 4 of the 272 eruptions lasted at least 5 minutes. Seems like a rare event!
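If we only want the count, and not the rows themselves, we can add up a TRUE/FALSE condition. A small sketch using the built-in `faithful` data:

```r
sum(faithful$eruptions >= 5)             # eruptions lasting at least 5 minutes
## [1] 4

nrow(subset(faithful, eruptions >= 5))   # the same count, via subset()
## [1] 4

mean(faithful$eruptions >= 5)            # the proportion of all 272 eruptions
## [1] 0.01470588
```

Taking the mean of a TRUE/FALSE vector gives the proportion of TRUEs, so about 1.5% of the recorded eruptions lasted 5 minutes or more.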
A histogram of eruption duration:
hist(faithful$eruptions,
main="Old Faithful Eruptions",
xlab="Duration (minutes)",
col="wheat"
)
A scatter plot:
plot(x=faithful$eruptions, y=faithful$waiting,
main="Old Faithful Eruptions",
ylab="Wait time (minutes)",
xlab="Duration (minutes)",
pch=16,
col="wheat"
)
Below are some bits of code. Can you see what information each one will provide?
max(wildcats$W)
## [1] 51
ANSWER: The maximum number of wins in a season over this time period.
Which season was this? Here’s a way to extract the entire season (row) with the most wins from the data frame:
subset(wildcats, W == max(wildcats$W))
## Year Coach W L T District NCAA GP
## 13 2011 Jackson Vaughan 51 3 0 1st 1st 54
Wow! What a season! 51 wins against just 3 losses, first in districts, and National Champions!
table(wildcats$NCAA)
##
## 1st 2nd 3rd 4th
## 19 2 2 2 1
ANSWER: It specifies how well Linfield did at Nationals over the 1999-2024 time period. According to the table, Linfield has finished 1st twice, 2nd twice, 3rd twice, and 4th once. In the other 19 seasons they didn’t finish as high as 4th.
sum(wildcats$District=="1st")
## [1] 15
ANSWER: It gives the number of seasons from 1999-2024 in which Linfield was 1st in the District
wildcats$winpct <- wildcats$W/wildcats$GP
ANSWER: It creates a new column in the data frame `wildcats` called `winpct`. This column records the winning percentage for the team each season (wins divided by games played).
max(wildcats$winpct)
## [1] 0.9444444
ANSWER: It returns the best winning percentage for a season between 1999 and 2024.
Which season was this?
subset(wildcats, winpct == max(wildcats$winpct))
## Year Coach W L T District NCAA GP winpct
## 13 2011 Jackson Vaughan 51 3 0 1st 1st 54 0.9444444
Not surprisingly, the best winning percentage occurred in the magical 2011 season in which the team went 51-3.
plot(x=wildcats$L, y=wildcats$W,cex=.7,
main="Linfield Softball (1999-2024)",
xlab="Losses",
ylab="Wins")
ANSWER: Scatterplot of wins vs losses. This plot makes it clear that 2020 was a strange season. This visual deviation from the trend indicates that something “different” was going on. Hmm… global pandemic?
fivenum(wildcats$W[wildcats$Coach=="Jackson Vaughan"])
## [1] 12.0 33.0 38.0 40.5 51.0
ANSWER: This code gives the 5 number summary for the Linfield softball wins for just the seasons in the data set in which the coach was Jackson Vaughan. More generally, consider this code, which makes use of the `aggregate()` function:
aggregate(wildcats$W, by=list(wildcats$Coach), FUN = median)
## Group.1 x
## 1 Jackson Vaughan 38
## 2 Laura Kenow 24
The `aggregate()` function applies a function to a variable that has been grouped by a categorical variable. In the code above we have chosen to determine the median number of wins for each coach over this time period.
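The `aggregate()` function also accepts a formula of the form `variable ~ group`, which some people find easier to read. Here is a sketch on my Yahtzee data that finds my mean score for each starting mood; the name `by.mood` is my own.

```r
games = data.frame(
  mood = c("content","pensive","anxious","pensive","content","content","pensive","anxious"),
  score = c(219, 223, 186, 256, 295, 211, 175, 318))

# mean score, grouped by mood
by.mood = aggregate(score ~ mood, data = games, FUN = mean)
by.mood
##      mood    score
## 1 anxious 252.0000
## 2 content 241.6667
## 3 pensive 218.0000
```

With only two or three games per mood, I wouldn't read much into these differences!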
`x` is a vector of three values, and `xbar` is the mean of the values in `x`. What is `v`?
x = c(10,12,17)
xbar = mean(x)
v = sum((x - mean(x))^2)/(length(x)-1)
v
## [1] 13
ANSWER: This code reproduces the formula for the variance of a set of numbers: \[s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\overline{x})^2.\] Convinced? Let’s check that this really gives the variance of the data set `x`:
var(x)
## [1] 13
Yup.
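The same check works for the standard deviation, which is the square root of the variance. A quick sketch, with `s` as my name for the by-hand value:

```r
x = c(10,12,17)

# standard deviation by hand: the square root of the variance formula
s = sqrt(sum((x - mean(x))^2)/(length(x)-1))
s
## [1] 3.605551

sd(x)   # the built-in command agrees
## [1] 3.605551
```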