An Introduction to R
Background
The purpose of these course notes is to introduce participants to basic programming, graphics and statistics in R.
The notes, which are released under a CC BY 4.0 license, were written by Joe Thorley.
R, S and S-Plus
R and S-PLUS are both implementations for the S language which is a language for ‘programming with data’. R is free and open source.
Installation
Download the the most recent version of the base distribution binary for your platform from http://cran.r-project.org/ and install using the default options. Then start up R.
R Basics
Calculations
R can be used just like a calculator. An expression is entered at the >
prompt and the result printed (the print
command is redundant).
3 + 4
## [1] 7
3 * 4
## [1] 12
print(1/4)
## [1] 0.25
To faciliate cut and paste the R code in this document is not preceded by the >
prompt and any output is preceded by two comment characters ##
(comments are introduced below).
The [1]
at the start of each result indicates that the result is the first element in a vector (vectors are introduced below).
Exercise 1 What is nine raised to the power of three?
Exercise 2 And nine raised to the power of 0.5?
Exercise 3 97 out of 284 eggs survive. What is the mortality expressed as a percentage?
Exercise 4 What is the result of 3 * 4 + 5
? And 5 + 3 * 4
? What does this indicate?
Precedence
In R, if one operator has precedence over another then it is evaluated first. The order in which operators are evaluated can be controlled using brackets.
2 * 3 + 4
## [1] 10
2 * (3 + 4)
## [1] 14
Exercise 5 Does the *
operator have precedence over the ^
operator?
Demonstrate with an example.
Functions
Most of R’s functionality comes from its functions. A function takes zero, one or multiple arguments, depending on the function, and returns a value. To call a function enter it’s name followed by a pair of brackets - include any arguments in the brackets.
log(10)
## [1] 2.302585
To find out more about a function called function_name
type ?function_name
. To search for the functions associated with a topic type ??topic
or ??"multiple topics"
.
Exercise 6 Which function calculates cumulative sums? And what arguments does it take?
Arguments
The R Documentation for log
indicates that the function requires an argument x
that is a vector of
numeric (real) or complex numbers and an argument base
which is the base of the logarithm.
If undefined the value of base
is set to be exp(1)
, i.e., log
calculates natural logarithms by default.
When calling a function its arguments can be specified using positional and/or named matching.
log(x = 10, base = 2)
## [1] 3.321928
log(base = 2, x = 10)
## [1] 3.321928
log(10, 2)
## [1] 3.321928
log(2, 10)
## [1] 0.30103
Exercise 7 What arguments does the function sqrt
take?
Assignment
The result of an expression can be assigned to an object using the <-
operator. The object can then be used in subsequent expressions. To save finger strokes type alt-.
x <- 3 + 4
x
## [1] 7
x/2
## [1] 3.5
y <- x * x
y
## [1] 49
Exercise 8 Set x
to be 7
. What is the value of x^x
. Save the value in a object called i
. If you assign the value 20
to the object x
does the value of i
change? What does this indicate about how R assigns values to objects?
Workspace
When a value is assigned to an object, the object can be used in subsequent calculations because it is stored in the workspace. The workspace is an area of memory associated with the current session.
The ls()
function lists the objects in the workspace. Calling rm(x)
deletes object x
from the workspace. Typing rm(list = ls())
removes all objects.
Exercise 9 Using the rm
function and the assignment operator arrange your workspace so that it contains three objects x
, y
and z
with values of 3
, 5
and 7
, respectively.
Vectors
A vector is a string of values.
Generation
Vectors can be created using the c
(concatenate) and seq
(sequence) functions.
c(1, 3, 5, 7, 13)
## [1] 1 3 5 7 13
seq(from = 1, to = 11, by = 2)
## [1] 1 3 5 7 9 11
seq(1, 6)
## [1] 1 2 3 4 5 6
The length
function returns the number of values in a vector.
Exercise 10 Use the seq
function to generate a vector of 30
evenly spaced numbers from 0
to 1
. Confirm its length.
Vectorized calculations
In R, there are no scalars. Instead single values are considered vectors of length 1
. Even simple calculations are designed to be performed on vectors.
x <- seq(3, 5)
y <- c(1, 2, 2)
x - y
## [1] 2 2 3
x/y
## [1] 3.0 2.0 2.5
If the vectors in an expression are of different lengths the shorter vector is recycled.
x <- seq(1, 4)
y <- seq(1, 2)
x/y
## [1] 1 1 3 2
Exercise 11 Create the vector that is the square root of all the whole numbers from 1
to 100
.
Exercise 12 Create a vector that gives the deviation of the values in the vector seq(1, 10)
from their mean.
Classes
The vectors you have dealt with so far have been of class numeric
which have real numbers as their permissible values.
x <- c(1.5, 2.7)
class(x)
## [1] "numeric"
Another important vector class is character
which has text values.
x <- c("txt", "one", "1", "1.9")
class(x)
## [1] "character"
Exercise 13 Try combining character and numeric values together in the same vector. What happens?
Missing values
Missing values are represented by NA
. Functions such as min
, max
and mean
that require knowledge of
all the input values return an NA
if one or more values are missing.
This behaviour can be altered by setting the na.rm
argument to be TRUE
.
x <- c(1, 2, 3, NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 2
Factors
Factors represent categorical variables.
Vectors can be coerced to class factor
using the as.factor
function.
x <- c("blue", "green", "blue", "red", "green")
x <- as.factor(x)
class(x)
## [1] "factor"
Exercise 14 Coerce the factor x
to a vector of class numeric using as.numeric
. What happens?
Data Frames
Data frames are the fundamental data structure in R. A data frame is a two-dimensional data set where the columns are variables (vectors) of equal length.
data
R packages often include data sets. To see the data sets in the current search path (introducted later) type data()
. To print a summary of the trees
data frame type summary(trees)
. Type trees
to print the data frame itself. To get more information on trees
type ?trees
.
Exercise 15 What are the data in the ToothGrowth
data set?
Referencing columns
When referencing a column (vector) in a data frame both the data frame and column name must be specified.
The $
operator is used to separate the data frame and column name.
trees$Girth
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
The $
operator can also be used to reference a new column in a data frame. In the following example the girth (circumference) is being used to estimate the cross-sectional area.
trees$Area <- trees$Girth^2/(4 * pi)
summary(trees)
## Girth Height Volume Area
## Min. : 8.30 Min. :63 Min. :10.20 Min. : 5.482
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40 1st Qu.: 9.717
## Median :12.90 Median :76 Median :24.20 Median :13.242
## Mean :13.25 Mean :76 Mean :30.17 Mean :14.726
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30 3rd Qu.:18.552
## Max. :20.60 Max. :87 Max. :77.00 Max. :33.770
Exercise 16 Add a column Diameter
to trees
bearing in mind that pi
is the ratio of the circumference (girth) to the diameter.
RStudio
Thus far you have been using R through its Graphical User Interface (GUI). However a number of third-party programs provide more powerful Integrated Development Environments (IDEs) for interfacing with R. I recommend RStudio - the desktop version of which can be downloaded from http://rstudio.org/download/desktop. Like R, RStudio is free and open source. The remainder of this course is taught assuming you are using RStudio.
Console
By default the pane in the bottom left of RStudio is the R console. You can enter expressions in the console in the same way you have been entering expressions in the R console in the R GUI.
Workspace
The first window in the top right pane provides an interface for viewing and manipulating the objects in the workspace.
History
The second window in the top right panel allows you to view and copy the expressions you have executed at the console.
Packages
Everything in the S language is an object - even the functions and operators. With the exception of the workspace, all of the available objects are contained within packages.
To see all the packages that are currently loaded on the search path type search()
.
search()
## [1] ".GlobalEnv" "package:knitr" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
.GlobalEnv
is the workspace. To see all the objects in a package type ls("package:package_name")
. For example
ls("package:stats")
For a detailed description of how R searches and finds objects see http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/.
To see all the packages that are currently installed on your computer go to the Packages window in the bottom right pane of RStudio. Packages that are currently loaded are selected.
You can load an installed package by selecting it or by typing library(package_name)
at the console.
Exercise 17 Load the package foreign
. What arguments does the
function read.dbf
in the
foreign
package take?
To view the packages on the Comprehensive R Archive Network (CRAN) website that are currently available to install on your computer go to http://cran.r-project.org/web/packages/available_packages_by_name.html.
You can install a package from the CRAN site onto your computer by clicking Install on the Packages window and entering its name or alternatively by typing install.packages("package_name")
?
Exercise 18 Install the package ggplot2
onto your computer and then load it. What arguments does the function ggplot
take.
Other packages are available from GitHub.
To install a package from a GitHub repository you need to first load the devtools
package and then use the install_github
function.
The following code demonstrates how to install (and load) the developmental version of the
yesno
package from https://github.com/poissonconsulting/yesno.
install.packages("devtools")
library(devtools)
install_github("poissonconsulting/yesno")
library(yesno)
Exercise 19 Install the latest version of the newdata
package from GitHub at https://github.com/poissonconsulting/newdata.
Working Directory
Each R session is associated with a folder on your computer that is referred to as the working directory. To view the contents of the working directory go to the Files window in the bottom right pane. To change the working directory use the Choose Directory… suboption of the Set Working Directory option from the Session menu in the global toolbar. Alternatively use the functions getwd()
and setwd()
.
Exercise 20 Create a new folder on your desktop called RCourse and make this folder the working directory. To ensure you have been successful confirm that the output of getwd()
refers to your RCourse folder.
Projects
Instead of setting the working directory each time you restart RStudio you can associate the contents of a folder with a project - then all you have to do is to select the project.
To create a new project use the New Project command (available on the File menu of the global toolbar).
Exercise 21 Create a new project called RCourse in the RCourse folder on your desktop. As the folder already exists choose the Create project from: Existing Directory option. To confirm you were successful quit RStudio and double-click the RCourse.Rproj file in the RCourse folder - RStudio should restart with RCourse as the working directory.
Transferring Data
Comma Separated Files
The simplest way to input and output data is in the form of comma separated files. Comma separated files, which have the suffix .csv, are recognised by almost all statistical and spreadsheet programs including .
The following code creates a data.frame
stocks, prints the first 6 rows using head()
and saves it to the working directory as a csv file called StocksMissing
.
stocks <- data.frame(time = 1:10, X = seq(-2, 3, length.out = 10), Y = 3:12,
Z = -10:-1)
head(stocks)
write.csv(stocks, "StocksMissing.csv", row.names = FALSE)
Exercise 22 Run the line of code above and then open the StocksMissing csv file in a spreadsheet program and replace the first five values in the X
column with NA
. Now use the read.csv
function to import the StocksMissing csv file into the workspace. Use the na.omit
function to delete the
missing values. What happens to the data frame when the missing values are deleted?
Manipulating Data
The dplyr package provides functions for manipulating data frames. Four key functions are:
- filter: chose a subset of rows based on conditions
- select: chose a subset of columns based on names
- arrange: sort rows
- mutate: add new columns
Exercise 23 What does each of the following lines of R code do to the ToothGrowth data frame?
install.packages("dplyr")
library(dplyr)
ToothGrowth <- filter(datasets::ToothGrowth, len < 30)
ToothGrowth <- mutate(datasets::ToothGrowth, len2 = len * 2)
ToothGrowth <- select(datasets::ToothGrowth, supp, dose)
ToothGrowth <- arrange(datasets::ToothGrowth, dose, supp, len)
The code fragment datasets::
indicates that the following object must be taken from the datasets package. This can be useful for ensuring R uses the correct object if multiple objects with the same name exist in the search path. Later I use it to remind the reader that a particular package must be loaded to run the code.
If the functions were nested the code would look like the following which is almost unreadable.
ToothGrowth <- arrange(select(mutate(filter(datasets::ToothGrowth, len <
30), len2 = len * 2), len2, supp, dose), dose, supp, len2)
To improve readability, the same operations can be joined in left to right order using the forward-pipe operator %>%
(pronounced ‘then’) in the magrittr
package
to
install.packages("magrittr")
library(magrittr)
ToothGrowth <- datasets::ToothGrowth %>% filter(len < 30) %>% mutate(len2 = len *
2) %>% select(len2, supp, dose) %>% arrange(dose, supp, len2)
The dplyr
functions together with those in the tidyr
package can be used to
to clean and tidy data where
data cleaning focusses on observations and data tidying focusses on variables
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
- Multiple tables include a column(s) that allow them to be linked.
The following code uses the tidyr::gather
function to convert the a data frame from wide to long format so that it satisfies the first condition.
install.packages("tidyr")
library(tidyr)
dayCount <- data.frame(day = 1:3, year1 = seq(-3, -4, length.out = 3),
year2 = 4:6, year3 = -1:1)
dayCount
## day year1 year2 year3
## 1 1 -3.0 4 -1
## 2 2 -3.5 5 0
## 3 3 -4.0 6 1
dayCount <- gather(dayCount, year, count, -day)
dayCount
## day year count
## 1 1 year1 -3.0
## 2 2 year1 -3.5
## 3 3 year1 -4.0
## 4 1 year2 4.0
## 5 2 year2 5.0
## 6 3 year2 6.0
## 7 1 year3 -1.0
## 8 2 year3 0.0
## 9 3 year3 1.0
The complement to tidyr::gather()
is the tidyr::spread
function.
spread(dayCount, year, count)
## day year1 year2 year3
## 1 1 -3.0 4 -1
## 2 2 -3.5 5 0
## 3 3 -4.0 6 1
Exercise 24 What happens if you filter negative prices from the gathered dayCount
data frame before spreading?
Programming
Scripts
Rather than entering expressions into R one by one at the command line,
a more efficient method is to type the expressions into a text file
called a script and then send all of them to R at the same time.
R scripts are indicated by the suffix .R
or less commonly .r
.
Open a new script using the R Script suboption of the New option in the File menu. Next paste the following expressions into the script.
# this is a comment!
graphics.off()
rm(list = ls())
summary(datasets::trees)
Now save the script in your working directory as trees.R
. Finally type Command-A to select the entire script then Command-Enter to send it to the console.
Congratulations you’ve written and executed your first script. The first line is a comment - R ignores it. The second line closes any open graphics windows while the second line removes all objects from the workspace so that the script is starting with a blank slate.
Coding
For some reason some people still work in farenheit. The following equation converts a temperature in farenheit (\(F\)) to celsius (\(C\)).
\[ C = (F - 32) * 5 / 9\]
Exercise 25 Append code to the trees.R
script to convert 45 farenheit to celcius.
Note The correct answer is 7.22.
Exercise 26 Now use the code to convert the following temperatures in farenheit to celcius: 0 , 32, 64 and 100.
For further information on R coding style see http://stat405.had.co.nz/r-style.html.
Functions
R’s power stems from its extensibility - users can easily write their own functions.
The following code defines a function farenheit_to_kelvin
which converts a temperature in farenheit to kelvins. The inspiration for this example came from Software Carpentry’s novice R bootcamp material.
farenheit_to_kelvin <- function(farenheit) {
kelvin <- ((farenheit - 32) * 5/9) + 273.15
return(kelvin)
}
The first line tells R that farenheit_to_kelvin
is a function that takes a single argument named farenheit
. The squiggly brackets tell R where the function
definition starts and ends. The first line of code in the function definition
converts farenheit
to kelvin
while the second and last line tells the
function to return the value of kelvin
.
Exercise 27 Paste the code into a script and then execute it. What happens?
Now type farenheit_to_kelvin
. What happens? Finally type
farenheit_to_kelvin(c(0, 32))
. What happens now?
Graphics
The ggplot2
library provides a family of functions for producing high-quality graphics within
a unified conceptual framework.
Geoms
The following code produces a scatterplot of Volume
on the
y-axis against Girth
on the x-axis for the trees
data frame.
The first line of code loads the ggplot2
library while
the second line specifies the data and relationships of interest. The third line indicates the method (geom) for representing the relationships among the data while the last line plots the ggplot object (in this case gp
). Note: the same plot can be produced using the ggplot2::qplot
wrapper function thus
qplot(Girth, Volume, data = trees)
.
library(ggplot2)
gp <- ggplot(data = trees, aes(x = Girth, y = Volume))
gp <- gp + geom_point()
print(gp)
Exercise 28 What happens if you replace geom_point
with geom_line
?
Exercise 29 And what happens if you add both geoms to the gp
object?
The available geoms are documented at http://docs.ggplot2.org/current/.
Windows
By default plots are printed to the Plots window in the bottom right pane of RStudio. To create a new graphics window use the
windows()
function, where by default the width and height are both 7 inches. If working with the Mac or Linux Operating systems use the quartz()
or X11()
function, respectively.
The most recent ggplot object to have been created or modified or plotted can be saved in the working
directory using the ggsave
function.
For example to save it as a 4 by 4 inch 600 dots per inch (dpi) Portable Network Graphics (png) file
called trees type ggsave("trees.png",width=4,height=4,dpi=600)
. 600 dpi png files are recommended for inclusion in word documents.
If the number of open windows becomes unwieldy the command graphics.off()
can be used to close all the graphics windows.
Scales
Whereas geoms define the method for representing relationships between data, scales control the mapping from data to visual properties such as colour, shape, position and size. These visual properties are referred to as aesthetics.
For example consider the msleep
dataset in the ggplot2 library.
gp <- ggplot(data = msleep, aes(x = bodywt, y = brainwt, size = sleep_total,
colour = vore))
gp <- gp + geom_point()
gp <- gp + scale_x_continuous(name = "Body Weight (kg)", trans = "log10")
gp <- gp + scale_y_continuous(name = "Brain Weight (kg)", trans = "log10")
gp <- gp + scale_size_continuous(name = "Sleep (hr)")
gp <- gp + scale_color_discrete(name = "Diet")
gp
## Warning: Removed 27 rows containing missing values (geom_point).
The above code overides the default positional scales for the x and y data with log10
scales and provides meaningful names. The following code replaces the default discrete color scale with manual values where the color is matched to the diet type.
gp <- gp + scale_colour_manual(name = "Diet", values = c(carni = "red",
herbi = "green3", omni = "brown", insecti = "black"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
Exercise 30 Create a new script called ToothGrowth.R
and produce a plot of the ToothGrowth
data for inclusion in a word report.
The following ggplot2 functions might be helpful.
geom_jitter(position = position_jitter(width = 0.1))
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_jitter
expand_limits(x = c(0, 2.5))
## mapping: x = ~x
## geom_blank: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Facets
One of the most powerful features of the ggplot2 package is its ability to facet plots. As an example consider the following plot. What does it suggest to you about the relationship between diet and sleep?
gp <- ggplot(data = msleep, aes(x = bodywt, y = brainwt, size = sleep_total))
gp <- gp + facet_grid(. ~ vore)
gp <- gp + geom_point()
gp <- gp + scale_x_continuous(name = "Body Weight (kg)", trans = "log10")
gp <- gp + scale_y_continuous(name = "Brain Weight (kg)", trans = "log10")
gp <- gp + scale_size_continuous(name = "Sleep (hr)")
print(gp)
## Warning: Removed 27 rows containing missing values (geom_point).
Exercise 31 Modify your ToothGrowth.R
script so that it produces a second plot facetted by supp
.
Themes
The appearance of non-data elements in a ggplot plot is controlled by the theme system. To see the settings for the current theme type . To globally change to the black and white theme use the following code:
theme_set(theme_bw())
Alternatively theme parameters can be set for specific ggplot objects using the function. The following code modifies the theme for the current object so that grid lines are not plotted
gp <- gp + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Exercise 32 Alter your ToothGrowth.R
script so that it produces a third plot using the black and white theme without grid lines.
Linear Models
Linear regression
The basic linear regression model can be expressed as \[ y_{i} = \alpha +\beta x_{i} + \epsilon_{i} \] where \(\alpha\) is the intercept and \(\beta\) is the slope and the error terms (\(\epsilon_{i}\)) are independent and normally distributed with equal variance.
The following code fits the basic linear regression model where Volume
is the response and Girth
is the predictor for the trees
data set and prints a summary of the model.
mod <- lm(Volume ~ Girth, data = trees)
summary(mod)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
The model sumary indicates (among other things) that the intercept is -36.9 and the slope is 5.1.
Fitted values
A general approach for plotting the fitted values for any model is to
generate the predicted values for a new data set that covers the range of values of interest and then add them to the plot. New data sets can be easily generated using the new_data
function of the newdata
package.
library(newdata)
newdata <- new_data(trees, "Girth")
newdata$Volume <- predict(mod, newdata = newdata)
gp <- ggplot(data = trees, aes(x = Girth, y = Volume))
gp <- gp + geom_point()
gp <- gp + geom_line(data = newdata)
print(gp)
Exercise 33 Step through the above code line by line and document what each line is doing.
Confidence intervals
95% confidence intervals can added to the plot using the same approach.
pred <- data.frame(predict(mod, newdata, interval = "conf"))
newdata <- cbind(newdata, pred)
gp <- gp + geom_line(data = newdata, aes(y = lwr), linetype = "dashed")
gp <- gp + geom_line(data = newdata, aes(y = upr), linetype = "dashed")
print(gp)
Prediction intervals
Prediction intervals can be calculated by setting
the interval
argument in the predict
function to be "pred"
.
Roughly speaking, confidence intervals represent the uncertainty surrounding the underlying relationship whereas prediction intervals represent the uncertainty surrounding new individual observations.
Exercise 34 Modify your trees.R
script so that it fits a linear model to Volume
against Girth
and plots the observed data with
confidence and prediction intervals.
Model Fit
A linear regression model is only valid if its residuals are consistent with the assumed error structure. A diagnostic plot can be produced using the following code.
qplot(.fitted, .stdresid, data = mod) + geom_hline(yintercept = 0) + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Exercise 35 What does the previous diagnostic plot suggest to you?
The previous code works because the fortify
function has been defined for lm
objects. For further examples of diagnostic plots see http://docs.ggplot2.org/current/fortify.lm.html.
The relationship between Volume
and Girth
is expected to be allometric because the cross-sectional area at an given point scales to the square of the girth (circumference).
Exercise 36 Try fitting an allometric relationship by log
transforming
Volume
and Girth
and then using the same basic linear regression model.
Does the diagnostic plot suggest an improvement in
the model?
ANCOVA
The basic ANCOVA (analysis of covariance) model includes one categorical and one continous predictor variable without interactions. It can be expressed algebraically as follows \[ y_{ij} = \alpha_{i} + \beta x_{j} + \epsilon_{ij}\] where \(\alpha_{i}\) is the intercept for the \(i_{th}\) group mean and \(\beta\) is the slope and the error terms (\(\epsilon_{ij}\)) are independent and normally distributed with equal variance.
The following code fits the basic ANCOVA model to the ToothGrowth
data and prints a summary table.
mod <- lm(len ~ supp + dose, data = ToothGrowth)
summary(mod)
##
## Call:
## lm(formula = len ~ supp + dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2725 1.2824 7.231 1.31e-09 ***
## suppVC -3.7000 1.0936 -3.383 0.0013 **
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
The model’s coefficients indicate that the slope of the line is 9.3 and that VC
is 3.7 less than OJ
.
Exercise 37 Are the model’s residuals consistent with the assumed error structure?
The fitted values for both levels of supp
can be added to the scatterplot of
len
against dose
using the same general approach as for the tree
data set.
newdata <- new_data(ToothGrowth, c("dose", "supp"))
newdata$len <- predict(mod, newdata = newdata)
gp <- ggplot(data = ToothGrowth, aes(x = dose, y = len, colour = supp))
gp <- gp + geom_point(position = position_jitter(width = 0.1))
gp <- gp + geom_line(data = newdata, aes(group = supp))
print(gp)
Exercise 38 Modify your script so that it adds the predicted line for both levels of supp
to the scatter plot of the ToothGrowth
data using the previous code.
The following code fits three additional models corresponding to the ANOVA model with just supp
, the linear regression model with just dose
and the ANCOVA model with an interaction between dose
and supp
.
mod2 <- lm(len ~ supp, data = ToothGrowth)
mod3 <- lm(len ~ dose, data = ToothGrowth)
mod4 <- lm(len ~ dose * supp, data = ToothGrowth)
Exercise 39 Modify your script so that it fits all four competing models and then each model’s predictions (along with the raw data) in separate windows.
Exercise 40 Which model provides the best fit to the data?
Functions you might find useful are anova
and AIC
.
Model Formulae and Functions
This section lists the various functions and formulae required to fit a range of linear, generalized linear, generalized additive, and generalized additive mixed models.
lm(y~1) # one sample t-test
lm(y~x) # basic linear regression
lm(y~fac) # one-way ANOVA
lm(y~x-1) # basic linear regression through the origin (with no intercept term)
lm(y~x+I(x^2)) # polynomial (quadratic) regression
lm(y~x1+x2) # multiple regression
lm(y~x1+x2+x1:x2) # multiple regression with interaction
lm(y~fac1+fac2) # two-way ANOVA
lm(y~x1+fac2) # ANCOVA
nlme::lme(y~x1,random=~1|fac) # linear mixed model
glm(y~x1,family=binomial) # logistic regression
mgcv::gam(y~s(x),family=poisson) # generalized additive model
mgcv::gamm(y~s(x),random=~1|fac,family=Gamma) # generalized additive mixed model
Additional Analyses
Exercise 41 Analyse the chickwts
data set. To what extent is diet
a predictor of body weight? Represent your model fit graphically.
Exercise 42 Analyse the msleep
data set in the ggplot2
package. To what extent are body weight and diet predictors of brain weight? Are carnivores brainier than herbivores?
R Course Cheat Sheet
Operator | Description |
---|---|
- |
Subtraction. Usage: x - y |
; |
Separate lines. Usage: x <- 1; y <- 2 |
:: |
Specify package of object. Usage: datasets::trees |
?? |
Search help files. Usage: ??topic |
? |
Help file. Usage: ?function_name |
( |
Bracket. Usage: x*(y+z) |
* |
Multiplication. Usage: x * y |
/ |
Division. Usage: x / y |
^ |
Power. Usage: x^y |
+ |
Addition. Usage: x + y |
<- |
Assignment. Shortcut: alt-. Usage: x <- y |
$ |
Reference column in data.frame. Useage: x$y |
magrittr::%>% |
Forward pipe operations. Useage: x %>% sum() |
Math Function | Description |
---|---|
cumsum |
Cumulative sum. Usage: cumsum(x) |
exp |
Exponential. Usage: exp(x) |
log |
Logarithm. Usage: log(x, base = 10) |
mean |
Average. Usage: mean(x, na.rm = TRUE) |
sqrt |
Square-root. Usage: sqrt(x) |
Vector Function | Description |
---|---|
as.factor |
Coerce to factor. Useage: as.factor(x) |
as.numeric |
Coerce to numeric. Useage: as.factor(x) |
c |
Concatenate. Usage: c(1, -3, 5) |
length |
Length. Usage: length(seq(2, 5)) |
seq |
Sequence. Usage: seq(1, 4) |
data.frame Function | Description |
---|---|
cbind |
Combine two data.frames by columns. |
data.frame |
Create data.frame. |
data |
List available data sets. |
newdata::new_data |
Dummy data set. |
dplyr::arrange |
Sort rows. |
dplyr::filter |
Chose a subset of rows based on conditions. |
dplyr::mutate |
Add new columns. |
dplyr::select |
Chose a subset of columns based on names. |
head |
Chose first six rows. |
na.omit |
Delete missing values. |
read.csv |
Input from csv file. |
tidyr::gather |
Convert from wide to long format |
tidyr::spread |
Convert from long to wide format |
write.csv |
Output as csv file. |
Graphics Function | Description |
---|---|
ggplot2::element_blank |
Blank value for a theme element. |
ggplot2::expand_limits |
Expand scale limits to include specified values. |
ggplot2::facet_grid |
Break plot into facets by variables. |
ggplot2::geom_hline |
Represent values with horizontal lines. |
ggplot2::geom_jitter |
Represent values with points plus random spread. |
ggplot2::geom_line |
Represent values with lines. |
ggplot2::geom_point |
Represent values with points. |
ggplot2::geom_smooth |
Represent values with smoothers. |
ggplot2::ggplot |
Set data.frame and aesthetic mappings for ggplot graphic object. |
ggplot2::ggsave |
Save current plot. |
ggplot2::qplot |
Quick wrapper for ggplot and geom_point. |
ggplot2::scale_color_discrete |
Set discrete color scale attributes. |
ggplot2::scale_color_manual |
Set manual color scale attributes. |
ggplot2::scale_size_continuous |
Set continuous size scale attributes. |
ggplot2::scale_x_continuous |
Set continuous x position scale attributes. |
ggplot2::scale_y_continuous |
Set continuous y position scale attributes. |
ggplot2::theme_get |
Get current theme for graphics. |
ggplot2::theme_set |
Set current theme for graphics. |
ggplot2::theme |
Set theme element for current graphic object. |
graphics.off |
Close all graphics windows. |
windows |
New graphics window. Use quartz() or X11() for OSX or linux |
Model Function | Description |
---|---|
AIC |
Akaike’s Information Criterion for one or more models. |
anova |
Analysis of variance tables for one or more models. |
lm |
Fit linear model. Useage: lm(y ~ x, data = data) |
predict |
Predict response values for model. |
General Function | Description |
---|---|
class |
Class of object. Usage: class(x) |
devtools::install_github |
Install package from GitHub. |
getwd |
Get working directory. Useage: getwd() . |
install.packages |
Install package(s) from CRAN. Useage: install.packages('pkg_name') |
library |
Load installed package. Useage: library(package_name) |
ls |
Objects in workspace or package. Usage: ls(); ls(2) |
print |
Print to console or draw graphics object in window |
rm |
Remove object(s) from workspace. Usage: rm(x) ; rm(list = ls()) |
search |
Loaded packages. Useage: search() |
setwd |
Set working directory. Useage: setwd('path_to_dir') |
summary |
Summary of object. Useage: summary(x) |