thanks to ...
Vancouver R Users Group
Tavis Rudd and Tilman Holschuh -- admin
Rob Balshaw and the
BC Centres for Disease Control -- host
Casey Shannon, Nick Fishbane -- helpers + content
A picture is worth
a thousand words
http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg
Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space
Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989),
pp. 945-957. Access via JSTOR.
Edward Tufte
http://www.edwardtufte.com
BOOK:
Visual Explanations: Images and Quantities, Evidence and
Narrative
Ch. 5 deals with the Challenger disaster
That chapter is available for $7 as a downloadable booklet:
http://www.edwardtufte.com/tufte/books_textb
vs
ggplot2 package
must be installed and loaded
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
[Figure: scatterplot panels for the years 1962, 1977, 1992 and 2007; log-scale axis with ticks at 10^2.5, 10^3.5 and 10^4.5]
base
[Figure: one scatterplot panel per year, labelled "Year of 1950" to "Year of 1985"; x-axis: Income per Person]
lattice
multi-panel conditioning
[Figure: lattice scatterplot panels by continent (Africa, Americas, Asia, Europe) and year (1962, 1977, 1992, 2007)]
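A minimal sketch of multi-panel conditioning in lattice. The data frame `toy` and its values are made up for illustration and are not the data behind the slide's figure; `xyplot()` and the `y ~ x | g` formula are standard lattice.

```r
library(lattice)

# Hypothetical toy data: two continents, three observations each
toy <- data.frame(
  continent = factor(rep(c("Africa", "Europe"), each = 3)),
  gdpPercap = c(500, 1200, 3000, 9000, 20000, 35000),
  lifeExp   = c(45, 52, 60, 70, 76, 80)
)

# "| continent" asks for one panel per level of the factor;
# the x-axis is drawn on a log10 scale
p_lattice <- xyplot(lifeExp ~ gdpPercap | continent, data = toy,
                    scales = list(x = list(log = 10)))
print(p_lattice)
```

The conditioning variable goes after the `|` in the formula, so changing the panels means changing the formula, not the data.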
ggplot2
faceting
ggplot(...) + ... +
facet_wrap(~ continent)
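A filled-in version of the `ggplot(...) + ... + facet_wrap(~ continent)` template, as a sketch. The data frame `toy` is hypothetical; the slide does not show which data or layers were used, so `geom_point()` and `scale_x_log10()` are assumptions.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  continent = factor(rep(c("Africa", "Europe"), each = 3)),
  gdpPercap = c(500, 1200, 3000, 9000, 20000, 35000),
  lifeExp   = c(45, 52, 60, 70, 76, 80)
)

p_facet <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +          # log-scale x-axis
  facet_wrap(~ continent)    # one panel per continent
print(p_facet)
```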
[Figure: ggplot2 scatterplot panels; labels include the continents Africa, Americas, Asia, Europe and Oceania and the years 1962, 1977, 1992 and 2007; log-scale x-axis with ticks at 1000 and 10000]
lattice
ggplot2
aesthetic mapping
ggplot(...) + ... +
aes(fill = country)
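A sketch of the same idea in both packages, using a made-up data frame `toy`. In ggplot2 the grouping variable is mapped to an aesthetic; in lattice the usual counterpart is the `groups` argument. The choice of `shape = 21` is an assumption: it is one of the point shapes that has a fill.

```r
library(ggplot2)
library(lattice)

# Hypothetical toy data: two countries observed in two years
toy <- data.frame(
  country = factor(c("A", "A", "B", "B")),
  year    = c(1962, 2007, 1962, 2007),
  lifeExp = c(40, 55, 68, 79)
)

# ggplot2: map country to the fill aesthetic
p_aes <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_point(aes(fill = country), shape = 21, size = 3)

# lattice: superpose the countries within one panel via groups
p_groups <- xyplot(lifeExp ~ year, data = toy, groups = country,
                   auto.key = TRUE)

print(p_aes)
print(p_groups)
```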
TO DO:
add similar eye candy for overlaying, e.g. a smooth fit
[Figure: quality of output (y-axis) vs. time invested (x-axis); curve label: base]
* figure is totally fabricated but, I claim, still true
use data.frames
use factors
be the boss of your factors
keep your data tidy
reshape your data
master read.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
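A small sketch of taking control of a few of these arguments. The inline text stands in for a tab-delimited file and is hypothetical; the `text` argument lets the example run without a file on disk.

```r
# Hypothetical tab-delimited content (\t is a tab)
raw_text <- "country\tcontinent\tlifeExp
Canada\tAmericas\t80.7
Japan\tAsia\tNA
Kenya\tAfrica\t54.1"

dat <- read.table(text = raw_text,
                  header = TRUE,            # first line holds column names
                  sep = "\t",               # fields are tab-separated
                  quote = "",               # do not treat ' or " as quotes
                  comment.char = "",        # do not treat # as a comment
                  na.strings = "NA",        # how missing values are written
                  stringsAsFactors = TRUE)  # character columns become factors
str(dat)
```

Setting `quote` and `comment.char` explicitly guards against apostrophes or `#` characters inside fields silently changing how lines are split.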
master reorder()
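A minimal sketch of being the boss of a factor with `reorder()`. The data frame `toy` is made up; the idea is to order factor levels by a summary of another variable instead of alphabetically.

```r
# Hypothetical toy data
toy <- data.frame(
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
  lifeExp   = c(50, 54, 70, 74, 78, 80)
)

levels(toy$continent)  # alphabetical by default

# Reorder levels by mean lifeExp, highest first
# (reorder() sorts ascending, so the mean is negated)
toy$continent <- reorder(toy$continent, toy$lifeExp,
                         FUN = function(x) -mean(x))
levels(toy$continent)
```

Plots and model summaries follow level order, so reordering the factor is usually cleaner than rearranging the output afterwards.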
Tidy data
from Wickham's "Tidy Data", Journal of Statistical Software

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This is Codd's 3rd normal form (Codd 1990), but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. Messy data is any other arrangement of the data.

messy
name          treatmenta  treatmentb
John Smith    -           2
Jane Doe      16          11
Mary Johnson  3           1
Table 1: Typical presentation dataset.

tidy
name          trt  result
John Smith    a    -
Jane Doe      a    16
Mary Johnson  a    3
John Smith    b    2
Jane Doe      b    11
Mary Johnson  b    1

In this experiment, the missing value represents an observation that should have been made, but wasn't, so it's important to keep it.
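The messy-to-tidy step for the treatment example can be sketched in base R. The values follow the paper's Table 1, with the missing value written as `NA`; building the tidy frame by hand like this is only one of several possible approaches.

```r
# Messy layout: one column per treatment
messy <- data.frame(
  name       = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)

# Tidy layout: one row per (name, treatment) observation
tidy <- data.frame(
  name   = rep(messy$name, times = 2),
  trt    = rep(c("a", "b"), each = nrow(messy)),
  result = c(messy$treatmenta, messy$treatmentb)
)
tidy
```

The missing observation stays in the tidy data as an explicit `NA` row.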
Examples of how to restructure two common issues with tabular data. (a) Each cell should only contain a [...]
melt: (a) Raw data to (b) Molten data
cast: (b) Molten data to (a) Raw data

(a) Raw data
row  a  b  c
a    1  4  7
b    2  5  8
c    3  6  9

(b) Molten data
row  column  value
a    a       1
b    a       2
c    a       3
a    b       4
b    b       5
c    b       6
a    c       7
b    c       8
c    c       9

Table 5: A simple example of melting. (a) is melted with one colvar, row, yielding the molten dataset (b). The information in each table is exactly the same, just stored in a different way.
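The Table 5 example can be reproduced with reshape2, as a sketch. The object names `raw`, `molten` and `recast` are made up for illustration; `melt()` and `dcast()` are the reshape2 functions for the two directions.

```r
library(reshape2)

# (a) Raw data
raw <- data.frame(row = c("a", "b", "c"),
                  a = c(1, 2, 3),
                  b = c(4, 5, 6),
                  c = c(7, 8, 9))

# melt: wide to long, keeping "row" as the identifier variable
molten <- melt(raw, id.vars = "row",
               variable.name = "column", value.name = "value")
molten

# cast: long back to wide, one column per level of "column"
recast <- dcast(molten, row ~ column, value.var = "value")
recast
```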
religion  income   freq
Agnostic  <$10k    27
Agnostic  $10-20k  34
Agnostic  $20-30k  60
Agnostic  $30-40k  81
in addition to:
reshape2
see also:
plyr
dplyr
ggplot2
3 Mastering the grammar

[...] complex by adding a smooth line and faceting. While working through [...] examples you will be introduced to all six components of the grammar, [...] then defined more precisely in Section 3.5. The chapter concludes [...] Section 3.6, which describes how the various components map to data [...] in R.

Fuel economy data

[...] the fuel economy dataset, mpg, a sample of which is illustrated in [...]. It records make, model, class, engine size, transmission and fuel [...] for a selection of US cars in 1999 and 2008. It contains the 38 models [...] updated every year, an indicator that the car was a popular model. [...] models include popular cars like the Audi A4, Honda Civic, Hyundai [...], Nissan Maxima, Toyota Camry and Volkswagen Jetta. This dataset suggests many interesting questions. How are engine size and [...]

model       displ  year  cyl  cty  hwy  class
a4          1.8    1999  4    18   29   compact
a4          1.8    1999  4    21   29   compact
a4          2.0    2008  4    20   31   compact
a4          2.0    2008  4    21   30   compact
a4          2.8    1999  6    16   26   compact
a4          2.8    1999  6    18   26   compact
a4          3.1    2008  6    18   27   compact
a4 quattro  1.8    1999  4    18   26   compact
a4 quattro  1.8    1999  4    16   25   compact
a4 quattro  2.0    2008  4    20   28   compact
The first 10 cars in the mpg dataset, included in the ggplot2 package. cty and hwy record miles per gallon (mpg) for city and highway driving, respectively, and displ is the engine displacement in litres.

[Figure 3.1: A scatterplot of engine displacement in litres (displ) vs. average highway miles per gallon (hwy). Points are coloured according to number of cylinders (legend: factor(cyl), with levels 4, 5, 6, 8). This plot summarises the most important factor governing fuel economy: engine size.]

What precisely is a scatterplot? You have probably seen many before and have even drawn some by hand. A scatterplot represents each observation as a point (•), positioned according to the value of two variables. As well as a horizontal and vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value. In Figure 3.1 displ is mapped to horizontal position, hwy to vertical position and cyl to colour. Size and shape are not mapped to variables, but remain at their (constant) default values.

Once we have these mappings we can create a new dataset that records this information. Table 3.2 shows the first 10 rows of the data behind Figure 3.1.

x    y   colour
1.8  29  4
1.8  29  4
2.0  31  4
2.0  30  4
2.8  26  6
2.8  26  6
3.1  27  6
1.8  26  4
1.8  25  4
2.0  28  4
Table 3.2: First 10 rows from mpg rearranged into the format required for a scatterplot. This data frame contains all the data to be displayed on the plot.

This new dataset is a result of applying the aesthetic mappings to the original data. We can create many different types of plots using this data. The scatterplot uses points, but were we instead to draw lines we would get a line plot. If we used bars, we'd get a bar plot. Neither of those examples makes sense for this data, but we could still draw them, as in Figure 3.2. In ggplot2 we can produce many plots that don't make sense, yet are grammatically valid. This is no different than English, where we can create senseless but grammatical sentences like the angry rock barked like a comma.

scaling:
mapping data units to computer units

[...] it might be polar coordinates, or a spherical projection [...] The process for mapping the colour is a little more complicated [...] a non-numeric result: colours. However, colours can be [...] three components, corresponding to the three types of colour [...] the human eye. These three cell types give rise to a three [...] space. Scaling then involves mapping the data values to [...] There are many ways to do this, but here since cyl is a categorical [...] map values to evenly spaced hues on the colour wheel, as [...] A different mapping is used when the variable is continuous. The result of these conversions is Table 3.4, which [...] have meaning to the computer. As well as aesthetics that [...] mapped to variable, we also include aesthetics that are constant. [...] the aesthetics for each point are completely specified and R [...]

x      y      colour  size  shape
0.037  0.531  [...]   1     19
0.037  0.531  [...]   1     19
0.074  0.594  [...]   1     19
0.074  0.562  [...]   1     19
0.222  0.438  [...]   1     19
0.222  0.438  [...]   1     19
0.278  0.469  [...]   1     19
0.037  0.438  [...]   1     19
0.037  0.406  [...]   1     19
0.074  0.500  [...]   1     19
Table 3.4: Simple dataset with variables mapped into aesthetic space. [...] colours is intimidating, but this is the form that R uses internally. [...] for other aesthetics are filled in: the points will be filled circles [...] a 1-mm diameter.
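The mapping and scaling steps can be sketched in R. This assumes that the x and y values in the scaled table come from a linear rescaling to [0, 1] over the range of the full mpg dataset; the helper `rescale01()` is hypothetical and is not a ggplot2 function.

```r
library(ggplot2)

first10 <- as.data.frame(head(mpg, 10))

# Aesthetic mapping: displ -> x, hwy -> y, cyl -> colour
mapped <- data.frame(x = first10$displ,
                     y = first10$hwy,
                     colour = first10$cyl)

# Hypothetical helper: linear rescaling of v to [0, 1] over the range rng
rescale01 <- function(v, rng) (v - rng[1]) / (rng[2] - rng[1])

# Scaling: data units to unit-interval positions
scaled <- data.frame(
  x = round(rescale01(mapped$x, range(mpg$displ)), 3),
  y = round(rescale01(mapped$y, range(mpg$hwy)), 3)
)
scaled

# The plot itself: the same three mappings, with ggplot2 doing the scaling
p_mpg <- ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point()
print(p_mpg)
```

Under that assumption the first row comes out as x = 0.037 and y = 0.531, which matches the first row of the scaled table.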
pdf("awesome_figure.pdf")
plot(1:10)
dev.off()
September 2, 2013
Karthik Ram
Specify a size
ggsave(filename = "/path/to/figure/filename.png", width = 6,
       height = 4)
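A runnable variant, as a sketch: it passes the plot object explicitly instead of relying on the last plot displayed, and writes to a temporary directory. The plot `p` and the path `out_file` are made up for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Hypothetical output location inside the session's temp directory
out_file <- file.path(tempdir(), "awesome_figure.png")

# width and height are in inches by default; units is spelled out for clarity
ggsave(filename = out_file, plot = p,
       width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

The file type is inferred from the extension, so changing `.png` to `.pdf` switches the output device.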