
Biostatistics

Lecture 1. Brief history, basic concepts and descriptive statistics Xinhai Li

Biological statistics

Li, Xinhai

Phone: 64807898
Email: lixh@ioz.ac.cn
Homepage: http://people.gucas.ac.cn/~LiXinhai
Blog: http://blog.sciencenet.cn/u/lixinhai
Miniblog: http://weibo.com/lixinhaiblog


How to learn statistics in this class
No preview needed before the class
Focus on listening and thinking (3 hours / week) at class
Don't take notes (wasting your time)
Intensive review (1-2 hours / week) after the class
Do the homework (1 hour / week) after the class

Textbooks
Sokal, R. R. and F. J. Rohlf. 1995. Biometry: the principles and practice of statistics in biological research. Third Edition. W. H. Freeman and Co.: New York. 887 pp.
Zar, J. H. 1999. Biostatistical Analysis. Fourth Edition. Prentice Hall: New Jersey, 663 pp.
From 1976 (the earliest year indexed) to mid 1997 (the date the search was performed) the following counts were obtained: Darwin (all publications, e.g. On the Origin of Species) = 7,111. Sokal and Rohlf Biometry = 31,757.

Biostatistics or biometry
"Biostatistics" and "biometry" are sometimes used interchangeably.
"Biometry" is more often used for biological or agricultural applications.
"Biostatistics" is more often used for medical applications.

What is statistics?
http://teeky.org/search-engine-optimization/determine-success-via-website-statistics/

Statistics is the science of collection, analysis, interpretation, and presentation of data.
Descriptive statistics are numerical estimates that organize, sum up or present the data.
Inferential statistics is the process of inferring from a sample to the population.

Statistical errors in publications


Underwood (1981) found statistical errors in 78% of the papers he surveyed in marine ecology. Hurlbert (1984) reported that in two separate surveys 26% and 48% of the ecological papers surveyed showed the statistical error of pseudoreplication (Krebs 1999).

Charles J. Krebs. 1999. Ecological Methodology, 2nd ed. Addison-Wesley Educational Publishers, Inc.

50% of the medical literature has statistical flaws (Altman et al. 1991).

Serious statistical errors were found in 40% of 164 articles published in a psychiatry journal (McGuigan 1995) (Ercan et al. 2007).

Ilker Ercan, Berna Yazıcı, Yaning Yang, Guven Özkaya, Sengul Cangur, Bulent Ediz, Ismet Kan. Misusage Of Statistics In Medical Research. Eur J Gen Med 2007; 4(3):128-134

Contents
Brief history, basic concepts, and descriptive statistics
Probability distribution
Hypothesis testing
Analysis of variance (ANOVA)
Simple linear regression and correlation
Analysis of covariance (ANCOVA)
Nonparametric statistics
Multivariate analysis
Generalized linear model
Common mistakes

Statistical software R
http://cran.r-project.org

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

In 1995, R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand.

Since mid-1997 there has been a core group (the R Core Team) who can modify the R source code archive.

It is free software distributed under a GNU-style copyleft, and an official part of the GNU project (GNU S).

It had over 2100 packages in 2010.

Citation
R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. ISBN: 3-900051-07-0. http://www.R-project.org.

Today's contents
Introduction to biological statistics
History
Data in biology
Descriptive statistics

History
John Graunt (1620-1674, British) and William Petty (1623-1687, British): developed early human statistical and census methods that later provided a framework for modern demography based on life table, mean value, census, longevity, and mortality.

Blaise Pascal (1623-1662, French), Pierre de Fermat (1601-1665, French), and Jacques Bernoulli (1654-1705, Swiss): probability theory (binomial coefficients)

Abraham de Moivre (1667-1754, French): combined statistics with probability theory; approximated the normal distribution through the expansion of the binomial distribution

Carl Friedrich Gauss (1777-1855, German): least squares, normal distribution

Adolphe Quetelet (1796-1874, Belgian): significance of constancy of large numbers (rate of criminal events)

Florence Nightingale (1820-1910, British): graphic presentation of statistics

Emergence of statistics in 1800s
Laplace wrote a book describing how to compute the future positions of planets and comets on the basis of a few observations from earth.
Napoleon: "I find no mention of God in your treatise, Mr. Laplace."
Laplace replied: "I had no need for that hypothesis."
The observations of planets and comets from this earthly platform did not fit the predicted positions exactly. Laplace and his fellow scientists attributed this to errors in the observations, sometimes due to perturbations in the earth's atmosphere, other times due to human error.
By the end of the nineteenth century, the errors had mounted instead of diminishing. As measurements became more and more precise, more and more error cropped up.

Gaps between Darwinism and genetics in early 1900s

Core Evolution Concepts
Population: Organisms that share a common gene pool (Species = actually or potentially interbreeding organisms)
Variation: Modifications of forms are produced by chance via mutations, genetic coding errors of individual organisms
Natural Selection: Reproduction & survival of organisms whose heritable traits are better suited to existing environmental conditions
Retention: Persistence within a population of the selected variation(s) over successive generations

Mendel's law of segregation
By carrying out the monohybrid crosses, Mendel determined that the 2 alleles for each character segregate during gamete production.

Neo-Darwinian Modern evolutionary synthesis in 1930s
Sir Ronald A. Fisher (1890-1962, British) developed several basic statistical methods in support of his work The Genetical Theory of Natural Selection.

Sewall G. Wright (1889-1988, American) used statistics in the development of modern population genetics.

John B. S. Haldane (1892-1964, British) reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics in his book The Causes of Evolution.

Francis Galton
http://www.sil.si.edu/digitalcollections/hst/scientific-identity/fullsize/SIL14-G001-05a.jpg

Francis Galton (1822-1911, British) (father of biometry and eugenics): regression, correlation
African explorer and elected Fellow of the Royal Geographical Society
Creator of the first weather maps and establisher of the meteorological theory of anticyclones
Coined term "eugenics" and phrase "nature versus nurture"
Developed statistical concepts of correlation and regression to the mean
Discovered that fingerprints were an index of personal identity and persuaded Scotland Yard to adopt a fingerprinting system
First to utilize the survey as a method for data collection
Produced over 340 papers and books throughout his lifetime
Knighted in 1909

Galton, F. (1869/1892/1962). Hereditary Genius: An Inquiry into its Laws and Consequences. Macmillan/Fontana, London.
Galton, F. (1883/1907/1973). Inquiries into Human Faculty and its Development. AMS Press, New York.

Karl Pearson
http://www.economics.soton.ac.uk/staff/aldrich/New%20Folder/kpreader1.htm

Karl Pearson (1857-1936, British): continued in the tradition of Galton and laid the foundation for much of descriptive statistics.

In 1884, Pearson became Professor of Applied Mathematics and Mechanics at University College London.
In 1901 Pearson, Weldon and Galton founded Biometrika, a Journal for the Statistical Study of Biological Problems.
In 1907, Pearson took over a research unit founded by Galton and reconstituted it as the Francis Galton Laboratory of National Eugenics.
In 1911, Pearson founded the world's first university statistics department at University College London.
method of moments
chi-square
correlation

Ronald A. Fisher
http://en.wikipedia.org/wiki/Image:RonaldFisher.jpg

Sir Ronald Aylmer Fisher (1890-1962), an English statistician, evolutionary biologist, and geneticist.

He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science"[1] and Richard Dawkins described him as "the greatest biologist since Darwin".[2] (from Wikipedia)

In 1933 he became a Professor of Eugenics at University College London.
In 1943 he was offered the Balfour Chair of Genetics at Cambridge University.

Analysis of variance
Maximum likelihood
Fisher information

Fisher, R.A. 1925. Statistical Methods for Research Workers
Fisher, R.A. 1935. The design of experiments
[1] Hald, Anders (1998). A History of Mathematical Statistics. New York: Wiley.
[2] Dawkins, Richard (1995). River out of Eden.

Society and publications in early years
In 1901, Pearson, Weldon and Galton founded Biometrika, a Journal for the Statistical Study of Biological Problems.

By the 1940s, the application of statistics to biological questions had begun to have a profound impact on the scientific community.

The biometrics section of the American Statistical Association began to publish the Biometrics Bulletin in 1945.

In 1947, the International Biometric Society (IBS) was established. Shortly thereafter, the IBS began publishing Biometrics.

A story of statistics in industry
http://www.census.gov/history/www/census_then_now/notable_alumni/w_edwards_deming.html

In 1980, the NBC television network aired a documentary entitled "If Japan Can, Why Can't We?" The documentary was really a description of the influence one man had on Japanese industry, W. Edwards Deming.

Deming's major point about quality control is that the output of a production line is variable, because that is the nature of all human activity. What the customer wants is not a perfect product but a reliable product.

A story of statistics and industry
Deming's quality control
Deming proposed that the production line be seen as a stream of activities that start with raw material and end with finished product.
Each activity can be measured, so each activity has its own variability due to environmental causes.
Instead of waiting for the final product to exceed arbitrary limits of variability, the managers should be looking at the variability of each of these activities.
The most variable of the activities is the one that should be addressed. Once that variability is reduced, there will be another activity that is "most variable," and it should then be addressed.
Thus, quality control becomes a continuous process, where the most variable aspect of the production line is constantly being worked on.

Data
A datum is one observation about the variable being measured.
Data are a collection of observations.
A population consists of all subjects about whom the study is being conducted.
A sample is a sub-group of the population being examined.

Parameters vs. Statistics
A parameter is a numerical quantity measuring some aspect of a population of scores.
For example, the mean is a measure of central tendency
Usually use Greek letters
A statistic computed in samples is used to estimate parameters

Quantity             Parameter   Statistic
Mean                 μ           M
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r

Variables
[Diagram: variables divided into quantitative and qualitative, with branches labeled ordinal and interval or ratio]
Nominal variable
classification data, e.g., male/female, 0/1, etc.
Ordinal variable
ordered, but differences between values are not important
e.g., Likert scales, rank on a scale of 1..5 (degree of satisfaction); restaurant ratings
Interval scale variable
ordered, constant scale, but no natural zero
differences make sense, but ratios do not (e.g., 30-20 = 20-10, but 20 degrees is not twice as hot as 10 degrees)
e.g., temperature (C, F), dates
Ratio scale variable
ordered, constant scale, natural zero
e.g., height, weight, age, length

Derived variables

Ratio
Sex ratio

Index
S&P 500 index (stock
market)

Rate
Growth rate

Accuracy and precision of data
[Figure: panels labeled Accuracy, Precision, Inaccuracy]

Accuracy of data
Mean square error for estimating population mean (μ) using sample mean (M)

$$\mathrm{MSE}(M) = E[(M - \mu)^2] = \mathrm{Var}(M) + [E(M) - \mu]^2$$

Accuracy: MSE(M)
Precision: Var(M)
Bias: E(M) - μ
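
The split of mean square error into precision and bias can be checked numerically. The sketch below is only an illustration: the population mean, standard deviation, sample size, and number of replicates are all assumed values, not taken from the lecture.

```r
# Simulated check of MSE = variance + bias^2 (all settings are assumptions)
set.seed(1)
mu <- 10                      # assumed population mean
n  <- 5                       # assumed sample size
# 10,000 sample means, each from n draws of a normal population with sd = 2
M <- replicate(10000, mean(rnorm(n, mean = mu, sd = 2)))

mse      <- mean((M - mu)^2)          # accuracy
variance <- mean((M - mean(M))^2)     # precision (divides by the number of replicates)
bias     <- mean(M) - mu              # bias

c(mse = mse, variance = variance, bias_squared = bias^2)
all.equal(mse, variance + bias^2)     # the decomposition holds for the simulated values
```

Because the sample mean is an unbiased estimator, the bias term here should be close to zero, so nearly all of the error comes from variance.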

Summarizing Data

Frequency Distribution
Cumulative Distributions
Relative Frequency Distribution
Percent Frequency Distribution
Bar Graph
Histogram
Pie Chart
Dot Plot

Frequency Distribution for Qualitative data
A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes.

The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.

Frequency Distribution
Guests staying at Holiday Inn were asked to rate the quality of their accommodations as being poor, below average, average, above average, or excellent:

Rating          Frequency
Poor                2
Below Average       3
Average             5
Above Average       9
Excellent           1
Total              20
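
A frequency distribution like this one can be produced with table() in R. The sketch below rebuilds the 20 individual ratings from the frequencies in the table; the variable names are illustrative.

```r
# Rebuild the 20 individual ratings from the frequencies in the table
levels_in_order <- c("Poor", "Below Average", "Average", "Above Average", "Excellent")
ratings <- rep(levels_in_order, times = c(2, 3, 5, 9, 1))
# A factor with explicit levels keeps the classes in rating order, not alphabetical order
ratings <- factor(ratings, levels = levels_in_order)

freq <- table(ratings)   # frequency distribution
freq
sum(freq)                # total number of guests
```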

An example for quantitative data:


Hudson Auto Repair
Sample of Parts Cost for 50 Tune-ups

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73


Frequency Distribution
Guidelines for selecting number of classes
Use between 5 and 20 classes
Data sets with a larger number of elements usually require a larger number of classes
Smaller data sets usually require fewer classes
Use classes of equal width
Approximate class width = (Largest Data Value - Smallest Data Value) / Number of Classes

Frequency Distribution
For Hudson Auto Repair, if we choose six classes:

Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10

Parts Cost ($)   Frequency
50-59                2
60-69               13
70-79               16
80-89                7
90-99                7
100-109              5
Total               50
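
The same frequency distribution can be obtained in R by grouping the 50 parts costs with cut(). This is a sketch; the break points and class labels are chosen to match the six classes in the table.

```r
# Parts costs ($) for the 50 tune-ups
cost <- c(91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
          71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
          104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
          85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
          62, 82, 98, 101, 79, 105, 79, 69, 62, 73)

# Approximate class width for six classes
(max(cost) - min(cost)) / 6

# Six classes of width 10; right = FALSE makes each class include its lower limit
breaks  <- seq(50, 110, by = 10)
classes <- cut(cost, breaks = breaks, right = FALSE,
               labels = c("50-59", "60-69", "70-79", "80-89", "90-99", "100-109"))
freq <- table(classes)
freq
```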

Relative Frequency Distribution
The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.

Percent Frequency Distribution
The percent frequency of a class is the relative frequency multiplied by 100.

A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.

Relative Frequency and Percent Frequency Distributions
Holiday Inn Quality Ratings

Rating          Relative Frequency   Percent Frequency
Poor                  .10                  10
Below Average         .15                  15
Average               .25                  25
Above Average         .45                  45
Excellent             .05                   5
Total                1.00                 100

For example, the relative frequency of Excellent is 1/20 = .05.
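
Relative and percent frequencies follow directly from the class counts. A short sketch in R, using the Holiday Inn frequencies (2, 3, 5, 9, 1) from the earlier frequency table:

```r
# Frequencies from the Holiday Inn ratings table
freq <- c(Poor = 2, "Below Average" = 3, Average = 5, "Above Average" = 9, Excellent = 1)

rel_freq <- freq / sum(freq)   # relative frequency: class count / total count
pct_freq <- 100 * rel_freq     # percent frequency: relative frequency * 100

data.frame(Frequency = freq, Relative = rel_freq, Percent = pct_freq)
```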

Relative Frequency and Percent Frequency Distributions
Hudson Auto Repair

Parts Cost ($)   Relative Frequency   Percent Frequency
50-59                 .04                   4
60-69                 .26                  26
70-79                 .32                  32
80-89                 .14                  14
90-99                 .14                  14
100-109               .10                  10
Total                1.00                 100

For example, the relative frequency of the 50-59 class is 2/50 = .04.

Relative Frequency and Percent Frequency Distributions
Insights gained from the percent frequency distribution
Only 4% of the parts costs are in the $50-59 class.
30% of the parts costs are under $70.
The greatest percentage (32% or almost one-third) of the parts costs are in the $70-79 class.
10% of the parts costs are $100 or more.

Our class
students <- read.csv('D:/ioz/statistics/2012/students.csv', header=TRUE)
students$ID <- as.character(students$ID)   # keep the long IDs as text
head(students)
#   order              ID name visits             email
#      33 201028007610020        2093           163.com
#       3 201028016215017         222           163.com
#     111 201028016215018          99           163.com
#       4 201028016215019         130           163.com
#      25 201128000206033         130           163.com
#      56 201128000206061         282 mails.gucas.ac.cn
nrow(students)   # 115
family.name <- substr(students$name, 1, 1)   # first character of each name
length(unique(family.name))   # 62
f.name <- table(family.name)[table(family.name) > 1]   # family names shared by more than one student
f.name <- as.table(f.name)
barplot(f.name, ylab='Number')
[Bar plot of family names that occur more than once; y-axis: Number]

Our class
email <- table(students$email)[table(students$email) > 2]
class(email) # array
email <- as.table(email)
barplot(email, ylab='Number')
[Bar plot of email domains: 126.com, 163.com, gmail.com, mails.gucas.ac.cn, qq.com, sina.com; y-axis: Number]

Our class
hist(students$visits, freq=T, nclass=15, xlab='Times')

[Histogram of students$visits; x-axis: Times (0 to 2000); y-axis: Frequency]

Bar Graph
barplot()
[Bar plot; y-axis: Number]

A bar graph is a graphical device for depicting qualitative data.
Specify the labels that are used for each of the classes on one axis (usually the horizontal axis).
A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).
Use a bar of fixed width drawn above each class label.
The bars are separated to emphasize the fact that each class is a separate category.

Histogram
Another common graphical presentation of quantitative data is a histogram.
The variable of interest is placed on the horizontal axis.
A rectangle is drawn above each class interval with its height corresponding to the interval's frequency, relative frequency, or percent frequency.
Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

R code
hist(rnorm(100),nclass=6)

Pie Chart
[Pie chart: Holiday Inn Quality Ratings]

R code
x=sample(1:100,6,replace=TRUE)
names(x)=c('A','B','C','D','E','F')
pie(x)

The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.
First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.

Dot Plot
One of the simplest graphical summaries of data is a dot plot.
A horizontal axis shows the range of data values.
Then each data value is represented by a dot placed above the axis.
[Dot plot: Tune-up Parts Cost; x-axis: Cost ($), 50 to 110]

Cumulative Distributions
Cumulative frequency distribution - shows the number of items with values less than or equal to the upper limit of each class.
Cumulative relative / percent frequency distribution - shows the proportion / percentage of items with values less than or equal to the upper limit of each class.

R code
x=seq(-5,5,by=0.1)
plot(x,pnorm(x,mean=0,sd=1),type='l')   # cumulative distribution of the standard normal, plotted against x

Cumulative Distributions
Hudson Auto Repair

Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59               2                       .04                              4
≤ 69              15                       .30                             30
≤ 79              31                       .62                             62
≤ 89              38                       .76                             76
≤ 99              45                       .90                             90
≤ 109             50                      1.00                            100

For example, for the ≤ 69 class: 2 + 13 = 15 and 15/50 = .30.
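
Cumulative distributions are running totals of the class frequencies. A brief sketch in R with cumsum(), using the Hudson Auto Repair class frequencies (2, 13, 16, 7, 7, 5):

```r
# Class frequencies from the Hudson Auto Repair frequency distribution
freq <- c("50-59" = 2, "60-69" = 13, "70-79" = 16,
          "80-89" = 7, "90-99" = 7, "100-109" = 5)

cum_freq <- cumsum(freq)              # running total of the class frequencies
cum_rel  <- cum_freq / sum(freq)      # cumulative relative frequency
cum_pct  <- 100 * cum_rel             # cumulative percent frequency

data.frame(cum_freq, cum_rel, cum_pct)
```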

Stem-and-Leaf Display
A stem-and-leaf display shows both the rank order and shape of the distribution of the data.
It is similar to a histogram on its side, but it has the advantage of showing the actual data values.
The first digits of each data item are arranged to the left of a vertical line.
To the right of the vertical line we record the last digit for each item in rank order.
Each line in the display is referred to as a stem.
Each digit on a stem is a leaf.

Example: Leaf Unit = 0.1
If we have data with values such as
8.6 11.7 9.4 9.1 10.2 11.0 8.8

a stem-and-leaf display of these data will be

Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
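
The display can be built by hand in R by splitting each value into a stem (the integer part) and a leaf (the first decimal digit). This is a sketch for a leaf unit of 0.1 only. Base R also offers stem(), whose layout may differ from the one shown on the slide.

```r
x <- c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)

stems  <- floor(x)                    # digits to the left of the vertical line
leaves <- round((x - stems) * 10)     # last digit, in units of 0.1

# Collect the sorted leaves belonging to each stem
display <- tapply(leaves, stems, function(l) paste(sort(l), collapse = " "))
cat(sprintf("%2s | %s\n", names(display), display), sep = "")
```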

Example: Leaf Unit = 10
If we have data with values such as

1806 1717 1974 1791 1682 1910 1838

a stem-and-leaf display of these data will be

Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7

The 82 in 1682 is rounded down to 80 and is represented as an 8.

Probability density function (PDF)
A probability density function (pdf) is a function that represents a probability distribution in terms of integrals.
Formally, a probability distribution has density f(x), such that the probability of the interval [a, b] is given by

$$\int_a^b f(x)\,dx$$

Intuitively, if a probability distribution has density f(x), then the infinitesimal interval [x, x + dx] has probability f(x) dx.

x=seq(-5,5,by=0.1)
plot(x,dnorm(x,mean=0,sd=1),type='l')   # density of the standard normal, plotted against x

The total area under the graph is 1:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
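
Both integrals can be evaluated numerically in R for the standard normal density. In this sketch the interval [-1, 1] is an arbitrary choice for illustration.

```r
# Probability of the interval [a, b] as the integral of the density
a <- -1; b <- 1                                   # illustrative interval (an assumption)
p_interval <- integrate(dnorm, lower = a, upper = b)$value
p_interval
pnorm(b) - pnorm(a)                               # same probability from the cumulative distribution

# Total area under the density curve
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_area
```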

Descriptive statistics

Are the scores generally high or generally low?
Where the center of the distribution tends to be located
Three measures of central tendency
Mode
Median
Mean

Mode

The most frequently occurring score
Report the mode when using a nominal scale: the most frequently occurring category
Based on the simple frequency of each score
If you have a rectangular distribution, do not report the mode
Unimodal, bimodal, multimodal, antimode

Example of Mode
Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
In this case the data have two modes: 5 and 7
Both measurements are repeated twice

Example of Mode
Measurements x: 3, 5, 1, 1, 4, 7, 3, 8, 3
Mode: 3
Notice that it is possible for a data set not to have any mode.
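
Base R has no built-in function for the statistical mode (mode() reports an object's storage type instead). A small helper can be sketched as follows. The name find_modes is hypothetical, and treating a data set with no repeated values as having no mode is one convention among several.

```r
# Return every value that attains the highest frequency.
# find_modes is an illustrative helper name, not a base R function.
find_modes <- function(x) {
  counts <- table(x)
  if (all(counts == 1)) return(numeric(0))   # no value repeats: report no mode
  as.numeric(names(counts)[counts == max(counts)])
}

find_modes(c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4))   # two modes
find_modes(c(3, 5, 1, 1, 4, 7, 3, 8, 3))      # one mode
find_modes(c(1, 2, 3, 4))                     # no mode
```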


Median
Score at the 50th percentile
For a normal distribution the median is the same as the mode
Arrange scores from lowest to highest; if there is an odd number of scores the median is the one in the middle, if there is an even number of scores then average the two scores in the middle
Used with an ordinal scale and when the distribution is skewed

Example of Median
Measurements x:          3, 5, 5, 1, 7, 2, 6, 7, 0, 4
Measurements ranked x:   0, 1, 2, 3, 4, 5, 5, 6, 7, 7
Median: (4+5)/2 = 4.5
Notice that only the two central values are used in the computation.
The median is not sensitive to extreme values: if the largest value, 7, is replaced by 40, the median is still 4.5.
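
The median of the ten measurements, and its insensitivity to an extreme value, can be checked in R. In this sketch, replacing the largest ranked value with 40 is an assumed reading of the slide's example.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)
sort(x)                      # ranked measurements
median(x)                    # even number of scores: average of the two middle values

# Replace the largest ranked value with an extreme value (40)
x_extreme <- replace(sort(x), length(x), 40)
median(x_extreme)            # unchanged
mean(x); mean(x_extreme)     # the mean, by contrast, is pulled toward the extreme value
```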


Mean
Score at the exact mathematical center of distribution (average)
Used with interval and ratio scales, and when the distribution is symmetrical and unimodal
Not accurate when distribution is skewed because it is pulled towards the tail

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Uses of the Mean


Describes scores
Deviations from the mean give the error of our estimate of each score, with total error equal to zero
Predict scores
Describe a score's location
Describe the population mean (μ), which is a parameter

Deviations around the Mean

The score minus the mean
Include plus or minus sign
Sum of deviations from the mean always equals zero: Σ(X - M) = 0
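
A quick numerical check in R that signed deviations from the mean sum to zero. The five scores are illustrative values chosen for this sketch.

```r
x <- c(21, 25, 24, 30, 34)      # illustrative scores (an assumption)
deviations <- x - mean(x)       # signed deviations: score minus mean
deviations
sum(deviations)                 # zero, up to floating-point rounding
```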


Range

Report the difference between the lowest and highest scores
Semi-interquartile range used with the median: one half the distance between the scores at the 25th and 75th percentiles

Measures of Variability

Extent to which the scores differ from each other or how spread out the scores are
Tells us how accurately the measure of central tendency describes the distribution
Shape of the distribution

Why do we care about variability?
Where would you rather vacation, LA Bungalows, where the mean temperature is 24 degrees, or Sahara Condos, where the mean temperature is also 24 degrees?
LA temperature range:
day = 26
night = 22

Sahara temperature range:
day = 40
night = 8

Variance
Uses the deviation from the mean
Remember, the sum of the deviations always equals zero, so you have to square each of the deviations
$S_X^2$ = sum of squared deviations divided by the number of scores
Provides information about the relative variability

Some Limits
It isn't the average deviation
Interpretation doesn't make sense because:
Number is too large
And it is a squared value

The standard deviation (SD)

Take the square root of the variance
$S_X$
Uses the same units of measurement as the raw scores
How much scores deviate below and above the mean

The standard deviation (SD)
Standard deviation ~ the mean of deviations from the mean (sort of)
σ (lowercase sigma) is the population standard deviation.
S is the sample standard deviation.
ŝ (s-hat) is the sample estimate of σ.

The deviation (definitional) formula for the population standard deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

The larger the standard deviation the more variability there is in the scores
The standard deviation is somewhat less sensitive to extreme outliers than the range (as N increases)

The deviation (definitional) formula for the sample standard deviation

$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$$

What's the difference between this formula and the population standard deviation?
In the first case, all the Xs represent the entire population. In the second case, the Xs represent a sample.

Standard Deviation: Example

$X$      $X - \bar{X}$    $(X - \bar{X})^2$
21       -5.8             33.64
25       -1.8              3.24
24       -2.8              7.84
30        3.2             10.24
34        7.2             51.84
Mean     26.8     0       21.36

$$S = \sqrt{\frac{106.8}{5}} = \sqrt{21.36} = 4.62$$
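The worked example can be reproduced in R. This is only a sketch; the variable names are illustrative. Note that it divides by N, as the slide does, so it will not match R's `sd()`, which divides by N - 1.

```r
x <- c(21, 25, 24, 30, 34)   # the five scores from the example

dev <- x - mean(x)           # deviations from the mean (26.8)
ss  <- sum(dev^2)            # sum of squared deviations
S   <- sqrt(ss / length(x))  # divide by N, then take the square root

round(dev, 1)
ss
round(S, 2)
```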

Calculating S using the raw-score formula

$$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$$

To calculate $\sum X^2$ you square all the scores first and then sum them

To calculate $(\sum X)^2$ you sum all the scores first and then square them
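A short R sketch of the distinction between the two sums, reusing the five scores from the standard deviation example. The variable names are illustrative. It also checks that the raw-score formula gives the same S as the definitional formula.

```r
x <- c(21, 25, 24, 30, 34)
N <- length(x)

sum_x2   <- sum(x^2)   # square each score first, then sum
sum_x_sq <- sum(x)^2   # sum the scores first, then square

# Raw-score formula for S (N in the denominator)
S_raw <- sqrt((sum_x2 - sum_x_sq / N) / N)

# Definitional formula for comparison
S_def <- sqrt(sum((x - mean(x))^2) / N)

sum_x2; sum_x_sq
S_raw; S_def
```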

Population and sample variance and standard deviation

When we have data from the entire population, we use $\mu$ (not $\bar{X}$) to compute $\sigma_X$ using the same formula
Variance and standard deviation of the sample are biased estimates of the population values


Estimating the population standard deviation from a sample

S, the sample standard deviation, is usually a little smaller than the population standard deviation. Why?

The sample mean minimizes the sum of squared deviations (SS). Therefore, if the sample mean differs at all from the population mean, then the SS from the sample will be an underestimate of the SS from the population

Therefore, statisticians alter the formula of the sample standard deviation by subtracting 1 from N
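The argument above can be illustrated with a small R sketch. The five scores are treated as a sample, and the population mean `mu` is a hypothetical value assumed only for this illustration: the SS taken about the sample mean is smaller than the SS taken about any other value.

```r
x <- c(21, 25, 24, 30, 34)   # treat these scores as a sample

# Sum of squared deviations about an arbitrary centre
ss_about <- function(centre) sum((x - centre)^2)

ss_sample_mean <- ss_about(mean(x))   # about the sample mean (26.8)

mu <- 28                              # hypothetical population mean (assumption)
ss_pop_mean <- ss_about(mu)           # about the assumed population mean

ss_sample_mean; ss_pop_mean
```

Because the sample SS is too small on average, dividing by N - 1 rather than N compensates for the shortfall.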

Formulas for s-hat (estimated)

Definitional formula:
$$\hat{s} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

Raw-score formula:
$$\hat{s} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$
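As a quick check, both formulas can be compared with R's built-in `sd()`, which uses N - 1 in the denominator. This is a sketch on the five example scores; the variable names are illustrative.

```r
x <- c(21, 25, 24, 30, 34)
N <- length(x)

# Definitional formula with N - 1
s_def <- sqrt(sum((x - mean(x))^2) / (N - 1))

# Raw-score formula with N - 1
s_raw <- sqrt((sum(x^2) - sum(x)^2 / N) / (N - 1))

# Built-in estimate
s_builtin <- sd(x)

s_def; s_raw; s_builtin
```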


The estimated variance

The standard deviation squared

$$\hat{s}^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$

The variance is not a very useful descriptive statistic, but it is a very important value you will use in other techniques (e.g., the analysis of variance)

For a standard normal distribution
Sample mean is a good estimate of population mean
The estimate of the population variance and standard deviation tells us how spread out the scores are
About 68% of the scores are within +1 and -1 $S_X$ of the mean
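The 68% figure can be checked from the normal cumulative distribution function in R. The variable names are illustrative; the value for two standard deviations is included only for comparison.

```r
# Proportion of a normal distribution within 1 SD of the mean
p_within_1sd <- pnorm(1) - pnorm(-1)

# For comparison: proportion within 2 SD of the mean
p_within_2sd <- pnorm(2) - pnorm(-2)

round(p_within_1sd, 4)
round(p_within_2sd, 4)
```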


Standard error

The standard error of a sample of size n is the sample's standard deviation divided by $\sqrt{n}$. It therefore estimates the standard deviation of the sample mean, that is, how far the sample mean is likely to fall from the population mean.

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$
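A minimal R sketch of the standard error of the mean, using the five example scores. Base R has no built-in standard error function, so it is computed directly from `sd()`.

```r
x <- c(21, 25, 24, 30, 34)

# Sample standard deviation (N - 1 denominator) divided by sqrt(n)
se <- sd(x) / sqrt(length(x))

round(se, 3)
```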

Coefficient of variation

In probability theory and statistics, the coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$, here expressed as a percentage:

$$CV = \frac{\sigma}{\mu} \times 100$$
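A small R sketch of the CV, with a helper function `cv()` that is defined here for illustration (it is not a base R function). Because the CV is a ratio, rescaling the data, for example changing units, leaves it unchanged.

```r
# Coefficient of variation in percent (illustrative helper, not base R)
cv <- function(v) sd(v) / mean(v) * 100

x <- c(21, 25, 24, 30, 34)

cv_x      <- cv(x)        # CV of the original scores
cv_scaled <- cv(10 * x)   # same scores in units ten times smaller

round(cv_x, 2)
round(cv_scaled, 2)
```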


Skewness
Symmetrical distribution
Symmetric
Left tail is the mirror image of the right tail
Examples: heights and weights of people
[Figure: histogram of a symmetric distribution; y-axis: relative frequency (0 to .35)]

Skewness
Asymmetrical distribution
Moderately Skewed Left
A longer tail to the left
Example: exam scores

[Figure: histogram of a distribution moderately skewed left; y-axis: relative frequency (0 to .35)]

Skewness
Asymmetrical distribution
Examples: income, populations of countries
[Figure: frequency (y-axis) against value (x-axis)]


Skewness
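A measure of skewness based on the 3rd moment about the mean

$$\text{skewness} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{(N - 1)\, s^3}$$

Substituting $s = \sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 / (N - 1)}$:

$$\text{skewness} = \frac{(N - 1)^{3/2} \sum_{i=1}^{N} (x_i - \bar{x})^3}{(N - 1) \left( \sum_{i=1}^{N} (x_i - \bar{x})^2 \right)^{3/2}}$$

Rough approximations:

$$\text{skewness} \approx \frac{\text{mean} - \text{mode}}{s} \approx \frac{3\,(\text{mean} - \text{median})}{s}$$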
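A sketch of the moment-based measure (sum of cubed deviations divided by (N - 1) times s cubed) alongside the median-based approximation, 3 (mean - median) / s, in R. The five example scores are reused and the variable names are illustrative. With so few scores the two values need not be close, but both have the same sign here.

```r
x <- c(21, 25, 24, 30, 34)
n <- length(x)
s <- sd(x)                      # sample SD (N - 1 denominator)

# Moment-based skewness: cubed deviations over (N - 1) * s^3
skew_moment <- sum((x - mean(x))^3) / ((n - 1) * s^3)

# Median-based approximation: 3 * (mean - median) / s
skew_pearson <- 3 * (mean(x) - median(x)) / s

skew_moment; skew_pearson       # both positive: longer tail to the right
```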



Skewness
[Figure: frequency (y-axis) against value (x-axis)]


Skewed Right - Positive Skewness

Number of Music CDs of Spring 1998 Stat 250 Students

[Figure: histogram; x-axis: number of music CDs (0 to 400), y-axis: frequency]

Kurtosis

Measures of Kurtosis

Kurtosis is a measure of the flatness or peakedness of a Distribution

Normal Kurtosis - Mesokurtic

Flat Kurtosis - Platykurtic

Peaked Kurtosis - Leptokurtic

A Measure of Kurtosis based on the 4th moment about the Mean


Kurtosis

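$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N - 1)\, s^4}$$

A normal distribution has a kurtosis of about 3, so the comparisons below use the excess kurtosis (kurtosis - 3):

If less than 0 = Platykurtic
More than 0 = Leptokurtic
If 0 = Mesokurtic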
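A minimal R sketch of the fourth-moment measure (sum of fourth-power deviations divided by (N - 1) times s to the fourth), applied to two made-up data sets. The helper `kurt()` and the vectors `flat` and `peaked` are hypothetical. Subtracting 3 gives the excess kurtosis that is compared with 0.

```r
# Kurtosis: fourth-power deviations over (N - 1) * s^4 (illustrative helper)
kurt <- function(v) sum((v - mean(v))^4) / ((length(v) - 1) * sd(v)^4)

flat   <- 1:9                       # evenly spread values, no peak
peaked <- c(-10, rep(0, 7), 10)     # most values at the centre, two far out

excess_flat   <- kurt(flat) - 3     # negative: platykurtic
excess_peaked <- kurt(peaked) - 3   # positive: leptokurtic

excess_flat; excess_peaked
```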

Kurtosis

[Figure: frequency (y-axis) against value (x-axis) for three curves: k > 3, k = 3, and k < 3]

Describing data
           Statistic (mean based)                       Statistic (non-mean based)
Center     Mean                                         Mode, median
Spread     Variance, SD (standard deviation), SE, CV    Range, interquartile range
Skew       Skewness                                     --
Peaked     Kurtosis                                     --
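The non-mean-based statistics in the table can be sketched in R as follows. The data are made up, with one extreme value added to show that the median is less affected by it than the mean. Base R has no function for the statistical mode (`mode()` returns the storage type), so it is taken from a frequency table here.

```r
# Hypothetical scores with one extreme value (90)
x <- c(21, 25, 24, 30, 34, 25, 90)

# Center
med_x  <- median(x)
tab    <- table(x)                                # frequency of each value
mode_x <- as.numeric(names(tab)[which.max(tab)])  # most frequent value

# Spread
range_x <- diff(range(x))   # maximum - minimum
iqr_x   <- IQR(x)           # Q3 - Q1 (R's default quantile method)

mean(x); med_x; mode_x
range_x; iqr_x
```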

R code
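x = rnorm(100)   # 100 random draws from a standard normal distribution
mean(x)
sd(x)            # standard deviation (N - 1 denominator)
var(x)           # variance (N - 1 denominator)
min(x)
max(x)
median(x)
range(x)         # minimum and maximum
quantile(x)      # 0%, 25%, 50%, 75% and 100% quantiles
summary(x)       # minimum, quartiles, mean and maximum
skewness = sum((x-mean(x))^3/sqrt(var(x))^3)/length(x); skewness
kurtosis = sum((x-mean(x))^4/var(x)^2)/length(x) - 3; kurtosis   # excess kurtosis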

SAS Example

/****************************************************************/
/*          SAS SAMPLE LIBRARY                                  */
/*                                                              */
/*    NAME: UNIVAR                                              */
/*   TITLE: Simple Descriptive Statistics using PROC UNIVARIATE */
/* PRODUCT: SAS                                                 */
/*  SYSTEM: ALL                                                 */
/*    KEYS: DESCRIPTIVE STATISTICS,                             */
/*   PROCS: UNIVARIATE                                          */
/*    DATA:                                                     */
/*                                                              */
/*     REF:                                                     */
/*    MISC:                                                     */
/*    DESC: INPUT A SMALL DATA SET USING THE CARDS STATEMENT.   */
/*          RUN UNIVARIATE USING THE FREQ, PLOT AND NORMAL      */
/*          PROC OPTIONS. ANALYZE THE VARIABLE POP AND          */
/*          RETAIN THE VARIABLE STATE USING THE ID STATEMENT.   */
/*          NO OTHER OPTIONS ARE USED.                          */
/*                                                              */
/****************************************************************/
OPTIONS LS=75 NODATE;
DATA STATEPOP;
   INPUT STATE $ POP @@;
   LABEL POP ='1970 CENSUS POPULATION IN MILLIONS';
   CARDS;
ALA 3.44 ALASKA 0.30 ARIZ 1.77 ARK 1.92 CALIF 19.95
COLO 2.21 CONN 3.03 DEL 0.55 FLA 6.79 GA 4.59
HAW 0.77 IDAHO 0.71 ILL 11.01 IND 5.19 IOWA 2.83
KAN 2.25 KY 3.22 LA 3.64 ME 0.99 MD 3.92
MASS 5.69 MICH 8.88 MINN 3.81 MISS 2.22 MO 4.68
MONT 0.69 NEB 1.48 NEV 0.49 NH 0.74 NJ 7.17
NM 1.02 NY 18.24 NC 5.08 ND 0.62 OHIO 10.65
OKLA 2.56 ORE 2.09 PA 11.79 RI 0.95 SC 2.59
SD 0.67 TENN 3.92 TEXAS 11.2 UTAH 1.06 VT 0.44
VA 4.65 WASH 3.41 W.VA 1.74 WIS 4.42 WYO 0.33
;
PROC UNIVARIATE FREQ PLOT NORMAL;
   VAR POP; ID STATE;
run;

SAS results
The SAS System

The UNIVARIATE Procedure


Variable: POP (1970 CENSUS POPULATION IN MILLIONS)

Moments

N                          50    Sum Weights                 50
Mean                   4.0472    Sum Observations        202.36
Std Deviation      4.32931867    Variance            18.7430002
Skewness           2.05521839    Kurtosis            4.54561679
Uncorrected SS      1737.3984    Corrected SS        918.407008
Coeff Variation    106.970712    Std Error Mean      0.61225812
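Several of these moments can be cross-checked in R from the same 50 values. This is only a sketch and the variable names are illustrative. The skewness and kurtosis are left out because SAS applies its own sample-size corrections, so the simple moment formulas from the R code slide would not be expected to reproduce those two numbers exactly.

```r
# 1970 census populations (millions), in the order of the SAS data lines
pop <- c( 3.44,  0.30,  1.77,  1.92, 19.95,
          2.21,  3.03,  0.55,  6.79,  4.59,
          0.77,  0.71, 11.01,  5.19,  2.83,
          2.25,  3.22,  3.64,  0.99,  3.92,
          5.69,  8.88,  3.81,  2.22,  4.68,
          0.69,  1.48,  0.49,  0.74,  7.17,
          1.02, 18.24,  5.08,  0.62, 10.65,
          2.56,  2.09, 11.79,  0.95,  2.59,
          0.67,  3.92, 11.20,  1.06,  0.44,
          4.65,  3.41,  1.74,  4.42,  0.33)

n      <- length(pop)
pop_sd <- sd(pop)                   # Std Deviation (N - 1 denominator)
pop_se <- pop_sd / sqrt(n)          # Std Error Mean
pop_cv <- 100 * pop_sd / mean(pop)  # Coeff Variation (percent)

c(N = n, Sum = sum(pop), Mean = mean(pop), SD = pop_sd,
  Variance = var(pop), SE = pop_se, CV = pop_cv)
```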

Basic statistics

Basic Statistical Measures

Location                  Variability

Mean      4.047200        Std Deviation           4.32932
Median    2.710000        Variance               18.74300
Mode      3.920000        Range                  19.65000
                          Interquartile Range     3.69000


Quantiles

Quantile        Estimate

100% Max        19.950
99%             19.950
95%             11.790
90%             10.830
75% Q3           4.680
50% Median       2.710
25% Q1           0.990


Assignment
Be familiar with the following terms:
Probability density function (PDF)
Deviation
Variance
Standard deviation
Standard error
Range
Mode
Quantile
Coefficient of variation

Download and install R on your laptop


Plot histograms using
hist(rnorm(100), nclass=6)
