
hello ggplot2!

Dr. Jennifer (Jenny) Bryan


Department of Statistics and Michael Smith Laboratories
University of British Columbia
jenny@stat.ubc.ca
@JennyBryan
https://github.com/jennybc
http://www.stat.ubc.ca/~jenny/

thanks to ...
Vancouver R Users Group
Tavis Rudd and Tilman Holschuh -- admin
Rob Balshaw and the
BC Centres for Disease Control -- host
Casey Shannon, Nick Fishbane -- helpers + content

please see this GitHub repository for all references,
examples worked with live coding, etc.
https://github.com/jennybc/ggplot2-tutorial
these slides just remind me to discuss some Big Ideas
by putting them in a Big Font

stackoverflow is your friend


use tags!


A picture is worth
a thousand words

1986 Challenger space shuttle disaster


Favorite example of Edward Tufte

http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg


Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space
Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989),
pp. 945-957. Access via JSTOR.

Edward Tufte
http://www.edwardtufte.com
BOOK:
Visual Explanations: Images and Quantities, Evidence and
Narrative
Ch. 5 deals with the Challenger disaster
That chapter is available for $7 as a downloadable booklet:
http://www.edwardtufte.com/tufte/books_textb

A picture is worth a thousand words


Always, always, always plot the data.
Replace (or complement) typical tables of
data or statistical results with figures that
are more compelling and accessible.
Whenever possible, generate figures that
overlay / juxtapose observed data and
analytical results, e.g. the fit.
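
A minimal sketch of overlaying observed data and a fit, using base graphics and R's built-in `cars` dataset. The dataset and the straight-line model are illustrative choices, not taken from the slides:

```r
# Built-in dataset: stopping distance vs. speed for 50 cars
fit <- lm(dist ~ speed, data = cars)

# Observed data as points ...
plot(dist ~ speed, data = cars,
     xlab = "Speed", ylab = "Stopping distance")

# ... with the fitted line drawn on top of the same axes
abline(fit, col = "red", lwd = 2)
```

Seeing the points and the line together makes it easy to judge where the fit describes the data well and where it does not.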

base or traditional graphics


vs
lattice package
ships with R, but must load
library(lattice)

vs
ggplot2 package
must be installed and loaded
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)

Two main goals for statistical graphics


To facilitate comparisons.
To identify trends.
lattice and ggplot2 achieve these
goals with less fuss

base

Assignment 1: Best Set of Graphs
[Figure: Life Expectancy at Birth (yrs) vs. Income per Person, one panel per year, "Year of 1950" through "Year of 1985"]

lattice

multi-panel conditioning
lifeExp ~ gdpPercap | continent * year

[Figure: Life Expectancy at Birth (yrs) vs. Income per person (GDP/capita, inflation-adjusted $, log scale), with panels for each continent (Africa, Americas, Asia, Europe) and year (1962, 1977, 1992, 2007)]

ggplot2
facetting
ggplot(...) + ... +
facet_wrap(~ continent)

[Figure: Life expectancy at birth (years) vs. Income per person (GDP/capita, inflation-adjusted $, log scale); panels labelled 1962, 1977, 1992, 2007; legend: Africa, Americas, Asia, Europe, Oceania]

lattice

groups and superposition


lifeExp ~ gdpPercap | year, groups = country
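
A hedged sketch of conditioning and superposition in lattice. It uses the built-in `mtcars` data as a stand-in for the data in the figures, so the variable names (`mpg`, `wt`, `gear`, `cyl`) are illustrative assumptions:

```r
library(lattice)  # ships with R, but must be loaded

p_lattice <- xyplot(
  mpg ~ wt | factor(gear),  # y ~ x | conditioning variable: one panel per gear
  data = mtcars,
  groups = cyl,             # superpose the cyl groups within each panel
  auto.key = TRUE           # add a legend for the groups
)

print(p_lattice)            # a lattice plot is an object; print it to draw it
```

The part of the formula after `|` controls the panels, while `groups` controls what is overlaid within a panel.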

ggplot2

aesthetic mapping
ggplot(...) + ... +
aes(fill = country)
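
A minimal ggplot2 sketch combining an aesthetic mapping with faceting. It uses the built-in `mtcars` data as a stand-in, and maps `colour` rather than `fill` because the default point shape has no fill; both are illustrative assumptions:

```r
library(ggplot2)

p_gg <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +  # aesthetic mapping: cyl -> colour
  facet_wrap(~ gear)                       # faceting: one panel per gear

print(p_gg)
```

Mapping a variable inside `aes()` lets ggplot2 choose the colours and build the legend; faceting splits the same plot across subsets of the data.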

TO DO:
add similar eye candy for overlaying, e.g. a smooth fit

week one ....

[Figure: sketch of quality of output vs. time invested, with one curve for base and one for ggplot2 / lattice]
* figure is totally fabricated but, I claim, still true

after you've climbed the steepest part of the learning curve ...

[Figure: the same sketch of quality of output vs. time invested, for base and ggplot2 / lattice]
* figure is totally fabricated but, I claim, still true

use data.frames
use factors
be the boss of your factors
keep your data tidy
reshape your data

if you are struggling with a plot, ask yourself:
am I breaking one or more of these rules?
often that is the real, hidden reason for struggle
use data.frames
use factors
be the boss of your factors
keep your data tidy
reshape your data

master read.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
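
A small sketch of the arguments that tend to matter most in practice. The file contents are made up for illustration, and treating "." as a missing-value code is an assumption about how such a file might be encoded:

```r
# Write a tiny tab-delimited file so the example is self-contained (toy data)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tgroup\tscore",
             "1\tcontrol\t4.2",
             "2\ttreated\t.",
             "3\ttreated\t5.1"), tmp)

dat <- read.table(
  tmp,
  header = TRUE,                # first line holds the column names
  sep = "\t",                   # fields are tab-delimited
  na.strings = c("NA", "."),    # also treat "." as missing
  colClasses = c("integer", "factor", "numeric")  # state the types explicitly
)

str(dat)
```

Stating `colClasses` and `na.strings` up front means the data arrive as a data.frame with the intended types, instead of needing repair after import.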

master reorder()
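
A minimal sketch of `reorder()`, which sets factor levels according to a summary of another variable. The data are made up for illustration:

```r
# Toy data (made up for illustration)
dat <- data.frame(
  site  = factor(c("north", "south", "east", "north", "south", "east")),
  count = c(5, 20, 11, 7, 24, 9)
)

levels(dat$site)  # default level order is alphabetical

# Reorder the levels of site by the median count within each site
dat$site <- reorder(dat$site, dat$count, FUN = median)

levels(dat$site)  # now ordered from smallest to largest median count
```

Plotting functions usually draw categories in level order, so reordering the factor is often the cleanest way to get a sensibly ordered axis or legend.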

from Wickham's Tidy Data, Journal of Statistical Software

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

This is Codd's 3rd normal form (Codd 1990), but with the constraints framed in statistical
language, and the focus put on a single dataset rather than the many connected datasets
common in relational databases. Messy data is any other arrangement of the data.

Data structure

Most statistical datasets are rectangular tables made up of rows and columns. The columns
are almost always labelled and the rows are sometimes labelled. Table 1 provides some data
about an imaginary experiment in a format commonly seen in the wild. The table has two
columns and three rows, and both rows and columns are labelled.

messy

Table 1: Typical presentation dataset.

              treatmenta  treatmentb
John Smith    -           2
Jane Doe      16          11
Mary Johnson  3           1

There are many ways to structure the same underlying data. Table 2 shows the same data
as Table 1, but the rows and columns have been transposed. The data is the same, but the
layout is different. Our vocabulary of rows and columns is simply not rich enough to describe
why the two tables represent the same data. In addition to appearance, we need a way to
describe the underlying semantics, or meaning, of the values displayed in table.

Table 2: The same data as in Table 1 but structured differently.

            John Smith  Jane Doe  Mary Johnson
treatmenta  -           16        3
treatmentb  2           11        1

Data semantics

A dataset is a collection of values, usually either numbers (if quantitative) or strings
(if qualitative). Values are organised in two ways. Every value belongs to a variable and an ...

tidy

Table 3: The same data as in Table 1 but with variables in columns and observations in rows.

name          trt  result
John Smith    a    -
Jane Doe      a    16
Mary Johnson  a    3
John Smith    b    2
Jane Doe      b    11
Mary Johnson  b    1

In this experiment, the missing value represents an observation that should have been made,
but wasn't, so it's important to keep it. Structural missing values, which represent
measurements that can't be made (e.g. the count of pregnant males) can b...

For a given dataset, it's usually easy to figure out what are observations and w...,
but it is surprisingly difficult to precisely define variables and observation...
For example, if the columns in the Table 1 were height and weight we would ...

from White et al.'s Nine simple ways ...

[Figure caption, truncated:] Examples of how to restructure two common issues with tabular data. (a) Each cell should only contain a

reshape your data

data has a tendency to get shorter and wider, but


tall and thin often better for analysis + visualization
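
A base-R sketch of going from short-and-wide to tall-and-thin, using the treatment values from the Tidy Data example. The object names `messy` and `tidy` are illustrative:

```r
# Short and wide: one column per treatment
messy <- data.frame(
  name       = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)

# Tall and thin: one row per (name, treatment) observation
tidy <- data.frame(
  name   = rep(messy$name, times = 2),
  trt    = rep(c("a", "b"), each = nrow(messy)),
  result = c(messy$treatmenta, messy$treatmentb)
)

tidy

# With one variable per column, summaries by group become one-liners
aggregate(result ~ trt, data = tidy, FUN = mean)
```

Once treatment is a variable in its own column, it can be used directly for grouping, colouring or faceting.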

from Wickham's Tidy Data, Journal of Statistical Software

see also reshape2

melt

(a) Raw data

row  a  b  c
a    1  4  7
b    2  5  8
c    3  6  9

(b) Molten data

row  column  value
a    a       1
b    a       2
c    a       3
a    b       4
b    b       5
c    b       6
a    c       7
b    c       8
c    c       9

Table 5: A simple example of melting. (a) is melted with one colvar, row, yielding the molten dataset
(b). The information in each table is exactly the same, just stored in a different way.

from Wickham's Tidy Data, Journal of Statistical Software

see also reshape2

cast

[Figure: the same Table 5 as on the melt slide, showing (a) Raw data and (b) Molten data]

Table 5: A simple example of melting. (a) is melted with one colvar, row, yielding the molten dataset
(b). The information in each table is exactly the same, just stored in a different way.

typical usage pattern:

[Figure: Table 5 once more, with "melt" labelling the step from (a) Raw data to (b) Molten data and "cast" labelling the step back]
melt to facilitate analysis and


visualization
cast to make compact tables
that are nicer for eyeballs
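
A hedged sketch of the melt / cast round trip with reshape2, using the small table from the melting example. It assumes the reshape2 package is installed, and the object names are illustrative:

```r
library(reshape2)

# (a) Raw data: compact, nice for eyeballs
raw <- data.frame(row = c("a", "b", "c"),
                  a = 1:3, b = 4:6, c = 7:9)

# melt: keep "row" as the identifier, stack the other columns
molten <- melt(raw, id.vars = "row",
               variable.name = "column", value.name = "value")
molten  # (b) Molten data: one row per (row, column) pair

# cast: spread the molten data back into a compact table
wide <- dcast(molten, row ~ column, value.var = "value")
wide
```

The molten form is what analysis and plotting functions generally want; the cast form is what goes into a report.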


religion  income   freq
Agnostic  <$10k    27
Agnostic  $10-20k  34
Agnostic  $20-30k  60
Agnostic  $30-40k  81

in addition to:
reshape2
see also:
plyr
dplyr

ggplot2

we will not use the qplot() function
no training wheels
you're here ...
I assume you want to ride this bike

data, in data.frame form


aesthetic: map variables into properties people can
perceive visually ... position, color, line type?
geom: specifics of what people see ... points? lines?
scale: map data values into computer values
stat: summarization/transformation of data
facet: juxtapose related mini-plots of data subsets
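
One way to see the components side by side is a single plot that uses each of them. This is a sketch on the `mpg` data that ships with ggplot2; the particular geoms, palette and facet variable are illustrative choices:

```r
library(ggplot2)

p <- ggplot(mpg,                                    # data: a data.frame
            aes(x = displ, y = hwy)) +              # aesthetic: variables -> x and y position
  geom_point(aes(colour = factor(cyl))) +           # geom: points; another aesthetic: colour
  stat_smooth(method = "lm", se = FALSE,
              colour = "black") +                   # stat: a fitted summary of the data
  scale_colour_brewer(palette = "Dark2",
                      name = "cylinders") +         # scale: data values -> actual colours
  facet_wrap(~ year)                                # facet: one mini-plot per year

print(p)
```

Each line can be changed independently: swap the geom, the scale or the facet variable and the rest of the specification stays as it is.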

3 Mastering the grammar

... complex by adding a smooth line and faceting. While working through ... examples you will be
introduced to all six components of the grammar, ... then defined more precisely in Section 3.5.
The chapter concludes ... Section 3.6, which describes how the various components map to data ... in R.

... the fuel economy dataset, mpg, a sample of which is illustrated in ... It records make, model,
class, engine size, transmission and fuel economy for a selection of US cars in 1999 and 2008. It
contains the 38 models that were updated every year, an indicator that the car was a popular model.
These models include popular cars like the Audi A4, Honda Civic, Hyundai ..., Nissan Maxima,
Toyota Camry and Volkswagen Jetta. This data comes from the EPA fuel economy website,
http://fueleconomy.gov.

manufacturer  model       disp  year  cyl  cty  hwy  class
audi          a4          1.8   1999  4    18   29   compact
audi          a4          1.8   1999  4    21   29   compact
audi          a4          2.0   2008  4    20   31   compact
audi          a4          2.0   2008  4    21   30   compact
audi          a4          2.8   1999  6    16   26   compact
audi          a4          2.8   1999  6    18   26   compact
audi          a4          3.1   2008  6    18   27   compact
audi          a4 quattro  1.8   1999  4    18   26   compact
audi          a4 quattro  1.8   1999  4    16   25   compact
audi          a4 quattro  2.0   2008  4    20   28   compact

The first 10 cars in the mpg dataset, included in the ggplot2 package. cty and hwy record miles per
gallon (mpg) for city and highway driving, respectively, and displ is the engine displacement in litres.

This dataset suggests many interesting questions. How are engine size and ...

Mapping aesthetics to data

[Figure: scatterplot of hwy vs. displ, with a colour legend for factor(cyl): 4, 5, 6, 8]

Fig. 3.1: A scatterplot of engine displacement in litres (displ) vs. average highway miles per
gallon (hwy). Points are coloured according to number of cylinders. This plot summarises the most
important factor governing fuel economy: engine size.

What precisely is a scatterplot? You have probably seen many before and have even drawn some by
hand. A scatterplot represents each observation as a point, positioned according to the value of
two variables. As well as a horizontal and vertical position, each point also has a size, a colour
and a shape. These attributes are called aesthetics, and are the properties that can be perceived
on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value. In Figure
3.1 displ is mapped to horizontal position, hwy to vertical position and cyl to colour. Size and
shape are not mapped to variables, but remain at their (constant) default values.

Once we have these mappings we can create a new dataset that records this information. Table 3.2
shows the first 10 rows of the data behind Figure 3.1.

x    y   colour
1.8  29  4
1.8  29  4
2.0  31  4
2.0  30  4
2.8  26  6
2.8  26  6
3.1  27  6
1.8  26  4
1.8  25  4
2.0  28  4

Table 3.2: First 10 rows from mpg rearranged into the format required for a scatterplot. This data
frame contains all the data to be displayed on the plot.

This new dataset is a result of applying the aesthetic mappings to the original data. We can create
many different types of plots using this data. The scatterplot uses points, but were we instead to
draw lines we would get a line plot. If we used bars, we'd get a bar plot. Neither of those examples
makes sense for this data, but we could still draw them, as in Figure 3.2. In ggplot2 we can produce
many plots that don't make sense, yet are grammatically valid. This is no different than English,
where we can create senseless but grammatical sentences like the angry rock barked like a comma.

scaling: mapping data units to computer units

... but it might be polar coordinates, or a spherical projection ...

The process for mapping the colour is a little more complicated ... a non-numeric result: colours.
However, colours can be thought of as having three components, corresponding to the three types of
colour-detecting cells in the human eye. These three cell types give rise to a three-dimensional
colour space. Scaling then involves mapping the data values to points in this space. There are many
ways to do this, but here since cyl is a categorical variable we map values to evenly spaced hues on
the colour wheel, as ... A different mapping is used when the variable is continuous.

The result of these conversions is Table 3.4, which contains values that have meaning to the
computer. As well as aesthetics that have been mapped to variable, we also include aesthetics that
are constant. ... the aesthetics for each point are completely specified and R ...

x      y      colour   size  shape
0.037  0.531  #FF6C91  1     19
0.037  0.531  #FF6C91  1     19
0.074  0.594  #FF6C91  1     19
0.074  0.562  #FF6C91  1     19
0.222  0.438  #00C1A9  1     19
0.222  0.438  #00C1A9  1     19
0.278  0.469  #00C1A9  1     19
0.037  0.438  #FF6C91  1     19
0.037  0.406  #FF6C91  1     19
0.074  0.500  #FF6C91  1     19

Table 3.4: Simple dataset with variables mapped into aesthetic space. The description of colours is
intimidating, but this is the form that R uses internally. ... for other aesthetics are filled in:
the points will be filled circles ... with a 1-mm diameter.
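
The effect of scaling can be inspected directly. A minimal sketch, assuming a reasonably recent ggplot2 in which `layer_data()` is available; it returns the data for one layer after the scales have been applied:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point()

# Data for the first (and only) layer, after scaling
scaled <- layer_data(p, 1)

# cyl has become a colour string; size and shape carry their default values
head(scaled[, c("x", "y", "colour", "size", "shape")], 10)
```

Note that `x` and `y` stay in data units in this output; the conversion to positions on the device happens later, when the plot is drawn.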

base graphics cause a figure to exist as a side effect


ggplot2 (and lattice) construct the figure as an R object
obviously you'll need to print it to see it
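
A short sketch of the difference, using `plot(1:10)` from later in these slides and the `mpg` data that ships with ggplot2:

```r
library(ggplot2)

# base: the figure appears as a side effect; nothing useful is returned
base_result <- plot(1:10)
is.null(base_result)

# ggplot2: this line only builds an object; nothing is drawn yet
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
class(p)

# printing the object is what draws the figure
print(p)
```

At the interactive prompt, typing `p` prints it automatically, which is why the distinction is easy to miss until the same code runs inside a function, a loop or a script.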

this tutorial consisted largely of live coding ... see the repo for indicative content
https://github.com/jennybc/ggplot2-tutorial

saving figures to file

do not save figures mouse-y style


not self-documenting
not reproducible
http://cache.desktopnexus.com/thumbnails/180681-bigthumbnail.jpg

most correct method:

pdf("awesome_figure.pdf")
plot(1:10)
dev.off()

postscript(), svg(), png(), tiff(), ....

fine for everyday use:


plot(1:10)
dev.print(pdf, "awesome_figure.pdf")

postscript(), svg(), png(), tiff(), ....

next slide from here:


Data Visualization with R & ggplot2
Karthik Ram

September 2, 2013


If the plot is on your screen


ggsave("/path/to/figure/filename.png")

If your plot is assigned to an object


ggsave("/path/to/figure/filename.png", plot = plot1)

Specify a size
ggsave(file = "/path/to/figure/filename.png", width = 6, height = 4)

or any format (pdf, png, eps, svg, jpg)


ggsave(file = "/path/to/figure/filename.eps")
ggsave(file = "/path/to/figure/filename.jpg")
ggsave(file = "/path/to/figure/filename.pdf")


p <- ggplot(...) + ...


p # delete or comment this out if non-interactive
ggsave("path/to/figure/filename.png", plot = p)

Use this workflow if the script might be run non-interactively.

Why? If you do not specify the plot explicitly, the
default is to draw the last interactively drawn plot.
That won't exist in a non-interactive session and
your plot files will be blank.
This can be frustrating. Ask me how I know.

p <- ggplot(...) + ...


ggsave("filename.png", plot = p, scale = 0.8)

Adjust the "scale" parameter to get multiple versions of a plot destined for different targets,
e.g., for use in a presentation vs. a poster vs. a manuscript.
scale < 1 makes the various plot elements bigger
relative to the plotting area
scale > 1 makes them smaller
YMMV but try scale = 0.8 for posters/slides
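
A hedged sketch of saving two versions of one plot with different `scale` values. The file names and the temporary output directory are illustrative, and PDF is used only because that device is available in any R installation:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

out_dir <- tempdir()  # illustrative location; use your own figure directory
paper_file  <- file.path(out_dir, "figure_manuscript.pdf")
slides_file <- file.path(out_dir, "figure_slides.pdf")

# Always pass the plot explicitly so this also works non-interactively
ggsave(paper_file,  plot = p, width = 6, height = 4, scale = 1)
ggsave(slides_file, plot = p, width = 6, height = 4, scale = 0.8)

file.exists(c(paper_file, slides_file))
```

Because `scale` multiplies the requested width and height, the second file is drawn on a smaller canvas, so text and points take up relatively more of it.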
