Ggplot2 Tutorial Slides

hello ggplot2!
Dr. Jennifer (Jenny) Bryan

Department of Statistics and Michael Smith Laboratories
University of British Columbia
jenny@stat.ubc.ca
@JennyBryan
https://github.com/jennybc
http://www.stat.ubc.ca/~jenny/
thanks to ...
Vancouver R Users Group
Tavis Rudd and Tilman Holschuh -- admin
Rob Balshaw and the
BC Centres for Disease Control -- host
Casey Shannon, Nick Fishbane -- helpers + content
please see this GitHub repository for all references,

examples worked with live coding, etc.
https://github.com/jennybc/ggplot2-tutorial
these slides just remind me to discuss some Big Ideas
by putting them in a Big Font
stackoverflow is your friend

use tags!
A picture is worth
a thousand words
1986 Challenger space shuttle disaster

Favorite example of Edward Tufte
http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg
A picture is worth a thousand words
Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space
Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989),
pp. 945-957. Access via JSTOR.
Edward Tufte
http://www.edwardtufte.com
BOOK:
Visual Explanations: Images and Quantities, Evidence and
Narrative
Ch. 5 deals with the Challenger disaster
That chapter is available for $7 as a downloadable booklet:
http://www.edwardtufte.com/tufte/books_textb

Always, always, always plot the data.
Replace (or complement) typical tables of
data or statistical results with figures that
are more compelling and accessible.
Whenever possible, generate figures that
overlay / juxtapose observed data and
analytical results, e.g. the fit.
base or traditional graphics

vs
lattice package
ships with R, but must load
library(lattice)
vs
ggplot2 package
must be installed and loaded
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
Two main goals for statistical graphics

To facilitate comparisons.
To identify trends.
lattice and ggplot2 achieve these
goals with less fuss
[Figure: "Assignment 1: Best Set of Graphs"; panels for 1962, 1977, 1992, 2007; x axis: Income per person (GDP/capita, inflation-adjusted $), log scale]
base
[Figure: base graphics, one plot per year ("Year of 1950" through "Year of 1985", every five years); Life Expectancy at Birth (yrs) vs. Income per Person]
lattice
multi-panel conditioning
lifeExp ~ gdpPercap | continent * year
[Figure: lattice, one panel per continent (Africa, Americas, Asia, Europe) and year (1962, 1977, 1992, 2007); y axis from 30 to 80]
ggplot2
faceting
ggplot(...) + ... +
facet_wrap(~ continent)
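A minimal sketch of faceting, assuming ggplot2 is installed. The data frame `gap_toy` and every value in it are made up for illustration; only the column names follow the formulas shown on these slides.

```r
library(ggplot2)

# Made-up toy data; column names mimic the ones used on the slides
gap_toy <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe"), each = 3),
  gdpPercap = c(500, 900, 1500, 3000, 6000, 9000,
                800, 2000, 7000, 8000, 15000, 30000),
  lifeExp   = c(42, 47, 52, 60, 66, 71, 50, 60, 70, 68, 74, 79)
)

# One panel per continent, with a log10 x axis
p_facet <- ggplot(gap_toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)

print(p_facet)
```

Swapping `~ continent` for another categorical variable changes what the panels are split by.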
[Figure: ggplot2; Life expectancy at birth (years) vs. Income per person (GDP/capita, inflation-adjusted $), log x axis; years 1962, 1977, 1992, 2007; continents Africa, Americas, Asia, Europe, Oceania]
lattice
groups and superposition

lifeExp ~ gdpPercap | year, groups = country
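A small sketch of how such a formula is used with lattice, which ships with R. The data frame `gap_toy` and its values are invented here; only the column names follow the slide.

```r
library(lattice)

# Made-up toy data: two countries, two years, two observations each
gap_toy <- data.frame(
  country   = rep(c("A", "B"), times = 4),
  year      = rep(c(1962, 2007), each = 4),
  gdpPercap = c(600, 2500, 900, 4000, 1200, 9000, 2000, 20000),
  lifeExp   = c(45, 60, 48, 63, 55, 72, 58, 78)
)

# "| factor(year)" gives one panel per year (conditioning);
# "groups = country" superposes the countries within each panel
p_lat <- xyplot(lifeExp ~ gdpPercap | factor(year), data = gap_toy,
                groups = country, auto.key = TRUE,
                scales = list(x = list(log = 10)))

print(p_lat)
```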
ggplot2
aesthetic mapping
ggplot(...) + ... +
aes(fill = country)
TO DO:
add similar eye candy for overlaying, e.g. a smooth fit
week one ....

[Figure: quality of output vs. time invested, with curves for base and for ggplot2 / lattice]
* figure is totally fabricated but, I claim, still true
after you've climbed the steepest part of the learning curve ...

[Figure: quality of output vs. time invested, with curves for base and for ggplot2 / lattice]
* figure is totally fabricated but, I claim, still true
use data.frames
use factors
be the boss of your factors
keep your data tidy
reshape your data
if you are struggling with a plot,

ask yourself:
am I breaking one or more of these rules?
often that is the real, hidden reason for struggle
master read.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
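A minimal sketch of a read.table() call that sets a few of these arguments explicitly. The inline text stands in for a file on disk, and its contents are made up.

```r
# Made-up content; in practice this would be a file path passed as `file`
txt <- "country continent lifeExp
Canada Americas 80.6
Japan Asia 82.6
Kenya Africa 54.1"

# header = TRUE: the first line holds the column names
# colClasses: say up front what each column should be
dat <- read.table(text = txt, header = TRUE,
                  colClasses = c("character", "factor", "numeric"))

str(dat)
```

Stating `colClasses` up front means the result does not depend on the default for `stringsAsFactors`.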
master reorder()
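A small sketch of what reorder() does to factor levels. The data frame `d` and its values are made up for illustration.

```r
# Made-up toy data
d <- data.frame(
  continent = factor(c("Africa", "Asia", "Europe", "Africa", "Asia", "Europe")),
  lifeExp   = c(50, 70, 60, 54, 68, 62)
)

# Levels are alphabetical by default
levels(d$continent)

# Reorder the levels by the median of lifeExp within each continent
d$continent <- reorder(d$continent, d$lifeExp, FUN = median)
levels(d$continent)
```

Plotting functions generally follow the level order, so this is one way to control the order of panels, boxes or legend entries.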
Tidy Data
from Wickham's Tidy Data, Journal of Statistical Software

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This is Codd's 3rd normal form (Codd 1990), but with the constraints framed in statistical
language, and the focus put on a single dataset rather than the many connected datasets
common in relational databases. Messy data is any other arrangement of the data.

messy or tidy

Data structure
Most statistical datasets are rectangular tables made up of rows and columns. The columns
are almost always labelled and the rows are sometimes labelled. Table 1 provides some data
about an imaginary experiment in a format commonly seen in the wild. The table has two
columns and three rows, and both rows and columns are labelled.

              treatmenta  treatmentb
John Smith    NA          2
Jane Doe      16          11
Mary Johnson  3           1
Table 1: Typical presentation dataset.

There are many ways to structure the same underlying data. Table 2 shows the same data
as Table 1, but the rows and columns have been transposed. The data is the same, but the
layout is different. Our vocabulary of rows and columns is simply not rich enough to describe
why the two tables represent the same data. In addition to appearance, we need a way to
describe the underlying semantics, or meaning, of the values displayed in table.

            John Smith  Jane Doe  Mary Johnson
treatmenta  NA          16        3
treatmentb  2           11        1
Table 2: The same data as in Table 1 but structured differently.

Data semantics
A dataset is a collection of values, usually either numbers (if quantitative) or strings
(if qualitative). Values are organised in two ways. Every value belongs to a variable and an ...

name          trt  result
John Smith    a    NA
Jane Doe      a    16
Mary Johnson  a    3
John Smith    b    2
Jane Doe      b    11
Mary Johnson  b    1
Table 3: The same data as in Table 1 but with variables in columns and observations in rows.

In this experiment, the missing value represents an observation that should have been made,
but wasn't, so it's important to keep it. Structural missing values, which represent
measurements that can't be made (e.g. the count of pregnant males) can be ...

For a given dataset, it's usually easy to figure out what are observations and what are
variables, but it is surprisingly difficult to precisely define variables and observations ...
For example, if the columns in the Table 1 were height and weight we would ...
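As a rough sketch of the difference, Table 1 can be typed in as a data frame and rearranged into the layout of Table 3 with base R alone. The object names `messy` and `tidy` are just labels chosen here.

```r
# Table 1: one row per person, one column per treatment
messy <- data.frame(
  name       = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)

# Table 3: one row per observation, with columns name, trt, result
tidy <- data.frame(
  name   = rep(messy$name, times = 2),
  trt    = rep(c("a", "b"), each = nrow(messy)),
  result = c(messy$treatmenta, messy$treatmentb)
)

tidy
```

In the tidy layout, treatment is an ordinary variable, so it can be mapped to colour or used for faceting like any other column.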
from White et al.'s Nine simple ways ...
Examples of how to restructure two common issues with tabular data. (a) Each cell should only contain a ...
reshape your data
data has a tendency to get shorter and wider, but

tall and thin often better for analysis + visualization
from Wickham's Tidy Data, Journal of Statistical Software
see also reshape2
melt

(a) Raw data
row  a  b  c
a    1  4  7
b    2  5  8
c    3  6  9

(b) Molten data
row  column  value
a    a       1
b    a       2
c    a       3
a    b       4
b    b       5
c    b       6
a    c       7
b    c       8
c    c       9

Table 5: A simple example of melting. (a) is melted with one colvar, row, yielding the molten dataset
(b). The information in each table is exactly the same, just stored in a different way.
from Wickham's Tidy Data, Journal of Statistical Software
see also reshape2
cast
[Table 5 again: (a) Raw data and (b) Molten data, the same 3 x 3 example as on the melt slide]

typical usage pattern:
[Figure: Table 5 once more, with "melt" and "cast" labelling the two directions between (a) Raw data and (b) Molten data]
melt to facilitate analysis and

visualization
cast to make compact tables
that are nicer for eyeballs
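The slides point to reshape2 for melting and casting. The sketch below uses base R stand-ins instead, so it runs without extra packages, and applies them to the 3 x 3 example from Table 5. It illustrates the idea and is not the reshape2 API.

```r
# Raw data from Table 5, as a matrix with named dimensions
raw_tab <- matrix(1:9, nrow = 3,
                  dimnames = list(row = c("a", "b", "c"),
                                  column = c("a", "b", "c")))

# "melt" (base R stand-in): one row per (row, column, value) triple
molten <- as.data.frame(as.table(raw_tab), responseName = "value",
                        stringsAsFactors = FALSE)
molten

# "cast" (base R stand-in): back to the compact 3 x 3 layout
recast <- xtabs(value ~ row + column, data = molten)
recast
```

The molten form is the one to hand to plotting and modelling functions. The cast form is the one to show to people.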
religion  income   freq
Agnostic  <$10k    27
Agnostic  $10-20k  34
Agnostic  $20-30k  60
Agnostic  $30-40k  81
in addition to:
reshape2
see also:
plyr
dplyr
ggplot2
we will not use the qplot() function

no training wheels
you're here ...
I assume you want to ride this bike
data, in data.frame form

aesthetic: map variables into properties people can
perceive visually ... position, color, line type?
geom: specifics of what people see ... points? lines?
scale: map data values into computer values
stat: summarization/transformation of data
facet: juxtapose related mini-plots of data subsets
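One way to see these pieces together is a single plot that names each of them. This is a sketch, assuming ggplot2 is installed; it uses the mpg dataset that ships with ggplot2, and the axis label text is my own wording.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +           # data + aesthetic mappings
  geom_point(aes(colour = factor(cyl))) +             # geom: points, coloured by cyl
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                           # stat: a fitted straight line
  scale_x_continuous(name = "Engine displacement (litres)") +  # scale for x
  facet_wrap(~ year)                                  # facet: one panel per year

print(p)
```

Each `+` adds one component, so a plot can be built up and inspected one layer at a time.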
3 Mastering the grammar

... complex by adding a smooth line and faceting. While working through these examples you will be introduced to all six components of the grammar, ... then defined more precisely in Section 3.5. The chapter concludes ... Section 3.6, which describes how the various components map to data ... in R.

Fuel economy data
... the fuel economy dataset, mpg, a sample of which is illustrated in ... It records make, model, class, engine size, transmission and fuel economy for a selection of US cars in 1999 and 2008. It contains the 38 models ... updated every year, an indicator that the car was a popular model. These models include popular cars like the Audi A4, Honda Civic, Hyundai ..., Nissan Maxima, Toyota Camry and Volkswagen Jetta. This data comes from the EPA fuel economy website, http://fueleconomy.gov.

manufacturer  model       disp  year  cyl  cty  hwy  class
audi          a4          1.8   1999  4    18   29   compact
audi          a4          1.8   1999  4    21   29   compact
audi          a4          2.0   2008  4    20   31   compact
audi          a4          2.0   2008  4    21   30   compact
audi          a4          2.8   1999  6    16   26   compact
audi          a4          2.8   1999  6    18   26   compact
audi          a4          3.1   2008  6    18   27   compact
audi          a4 quattro  1.8   1999  4    18   26   compact
audi          a4 quattro  1.8   1999  4    16   25   compact
audi          a4 quattro  2.0   2008  4    20   28   compact
The first 10 cars in the mpg dataset, included in the ggplot2 package. cty and hwy record miles per gallon (mpg) for city and highway driving, respectively, and displ is the engine displacement in litres.

... dataset suggests many interesting questions. How are engine size and ...

[Figure 3.1: hwy vs. displ, legend factor(cyl) with levels 4, 5, 6, 8]
Fig. 3.1: A scatterplot of engine displacement in litres (displ) vs. average highway miles per gallon (hwy). Points are coloured according to number of cylinders. This plot summarises the most important factor governing fuel economy: engine size.

Mapping aesthetics to data
What precisely is a scatterplot? You have probably seen many before and have even drawn some by hand. A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value. In Figure 3.1 displ is mapped to horizontal position, hwy to vertical position and cyl to colour. Size and shape are not mapped to variables, but remain at their (constant) default values.
Once we have these mappings we can create a new dataset that records this information. Table 3.2 shows the first 10 rows of the data behind Figure 3.1.

x    y   colour
1.8  29  4
1.8  29  4
2.0  31  4
2.0  30  4
2.8  26  6
2.8  26  6
3.1  27  6
1.8  26  4
1.8  25  4
2.0  28  4
Table 3.2: First 10 rows from mpg rearranged into the format required for a scatterplot. This data frame contains all the data to be displayed on the plot.

This new dataset is a result of applying the aesthetic mappings to the original data. We can create many different types of plots using this data. The scatterplot uses points, but were we instead to draw lines we would get a line plot. If we used bars, we'd get a bar plot. Neither of those examples makes sense for this data, but we could still draw them, as in Figure 3.2. In ggplot2 we can produce many plots that don't make sense, yet are grammatically valid. This is no different than English, where we can create senseless but grammatical sentences like the angry rock barked like a comma.

scaling:
mapping data units to computer units

... but it might be polar coordinates, or a spherical projection ...
The process for mapping the colour is a little more complicated ... a non-numeric result: colours. However, colours can be thought ... three components, corresponding to the three types of colour ... the human eye. These three cell types give rise to a three-dimensional ... space. Scaling then involves mapping the data values to ... There are many ways to do this, but here since cyl is a categorical ... map values to evenly spaced hues on the colour wheel, as ... A different mapping is used when the variable is continuous ...
The result of these conversions is Table 3.4, which contains ... have meaning to the computer. As well as aesthetics that ... to variable, we also include aesthetics that are constant. ... the aesthetics for each point are completely specified and R ... for other aesthetics filled in: the points will be filled circles ... a 1-mm diameter.

x      y      colour   size  shape
0.037  0.531  #FF6C91  1     19
0.037  0.531  #FF6C91  1     19
0.074  0.594  #FF6C91  1     19
0.074  0.562  #FF6C91  1     19
0.222  0.438  #00C1A9  1     19
0.222  0.438  #00C1A9  1     19
0.278  0.469  #00C1A9  1     19
0.037  0.438  #FF6C91  1     19
0.037  0.406  #FF6C91  1     19
0.074  0.500  #FF6C91  1     19
Table 3.4: Simple dataset with variables mapped into aesthetic space. ... colours is intimidating, but this is the form that R uses internally ...
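The x and y values in Table 3.4 are consistent with a plain linear rescaling of displ and hwy onto the interval from 0 to 1. The sketch below reproduces a few of them in base R. The helper `rescale01()` is a name made up here, and the two ranges are an assumption (the minimum and maximum of displ and hwy over the full mpg dataset). This is not a claim about how ggplot2 does the scaling internally.

```r
# Hypothetical helper: linear map from the data range `from` onto [0, 1]
rescale01 <- function(x, from) (x - from[1]) / (from[2] - from[1])

# ASSUMPTION: ranges of displ and hwy over the whole mpg dataset
displ_range <- c(1.6, 7.0)
hwy_range   <- c(12, 44)

# A few rows of Table 3.2, in data units
displ <- c(1.8, 2.0, 3.1)
hwy   <- c(29, 31, 27)

# The same rows in "computer units", rounded as in Table 3.4
scaled <- data.frame(
  x = round(rescale01(displ, displ_range), 3),
  y = round(rescale01(hwy, hwy_range), 3)
)
scaled
```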
base graphics cause a figure to exist as a side effect

ggplot2 (and lattice) construct the figure as an R object
obviously you'll need to print it to see it
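A short sketch of what "the figure is an R object" means in practice, assuming ggplot2 is installed and using its built-in mpg dataset.

```r
library(ggplot2)

# Nothing is drawn here: the plot is only stored in an object
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
class(p)

# The object can be modified before anything is drawn
p2 <- p + facet_wrap(~ class)

# Drawing happens when the object is printed
print(p2)
```

At the interactive prompt, typing `p2` prints it automatically. Inside a function, a loop, or a sourced script, the explicit `print()` is what makes the figure appear.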
this tutorial consisted largely of live

coding ... see the repo for indicative content
https://github.com/jennybc/ggplot2-tutorial
saving figures to file
do not save figures mouse-y style

not self-documenting
not reproducible
http://cache.desktopnexus.com/thumbnails/180681-bigthumbnail.jpg
most correct method:
pdf("awesome_figure.pdf")
plot(1:10)
dev.off()
postscript(), svg(), png(), tiff(), ....
fine for everyday use:

plot(1:10)
dev.print(pdf, "awesome_figure.pdf")
postscript(), svg(), png(), tiff(), ....
next slide from here:

Data Visualization with R & ggplot2
Karthik Ram
September 2, 2013
If the plot is on your screen

ggsave("/path/to/figure/filename.png")
If your plot is assigned to an object

ggsave(filename = "/path/to/figure/filename.png", plot = plot1)
Specify a size
ggsave(file = "/path/to/figure/filename.png", width = 6, height = 4)
or any format (pdf, png, eps, svg, jpg)

ggsave(file = "/path/to/figure/filename.eps")
ggsave(file = "/path/to/figure/filename.jpg")
ggsave(file = "/path/to/figure/filename.pdf")
p <- ggplot(...) + ...

p  # delete or comment this out if non-interactive
ggsave(filename = "path/to/figure/filename.png", plot = p)
Use this workflow if the script might be run noninteractively.

Why? If you do not specify the plot explicitly, the
default is to draw the last interactively drawn plot.
That won't exist in a non-interactive session and
your plot files will be blank.
This can be frustrating. Ask me how I know.
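A sketch of the same workflow as a complete script that can be run non-interactively, assuming ggplot2 is installed. The output path comes from `tempfile()` only to keep the example self-contained, and PDF is used because that device is available in any R installation.

```r
library(ggplot2)

# Build the plot and keep it in an object
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Illustrative output location; a real script would use a project path
out_file <- tempfile(fileext = ".pdf")

# Pass the plot explicitly so the result does not depend on last_plot()
ggsave(filename = out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```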
p <- ggplot(...) + ...

ggsave("filename.png", plot = p, scale = 0.8)
Adjust the "scale" parameter to get multiple

versions of a plot destined for different targets, e.g.,
for use in a presentation vs. a poster vs. a manuscript.
scale < 1 makes the various plot elements bigger
relative to the plotting area
scale > 1 makes them smaller
YMMV but try scale = 0.8 for posters/slides
