
STAT 509

STATISTICS FOR ENGINEERS


Spring, 2014

Lecture Notes

Joshua M. Tebbs
Department of Statistics
University of South Carolina

Contents

3 Modeling Random Behavior
  3.1 Introduction
  3.2 Probability
    3.2.1 Sample spaces and events
    3.2.2 Unions and intersections
    3.2.3 Axioms and rules
    3.2.4 Conditional probability and independence
    3.2.5 Additional probability rules
  3.3 Random variables and distributions
  3.4 Discrete random variables
    3.4.1 Binomial distribution
    3.4.2 Geometric distribution
    3.4.3 Negative binomial distribution
    3.4.4 Hypergeometric distribution
    3.4.5 Poisson distribution
  3.5 Continuous random variables
    3.5.1 Exponential distribution
    3.5.2 Gamma distribution
    3.5.3 Normal distribution
  3.6 Reliability and lifetime distributions
    3.6.1 Weibull distribution
    3.6.2 Reliability functions

4 Statistical Inference
  4.1 Populations and samples
  4.2 Parameters and statistics
  4.3 Point estimators and sampling distributions
  4.4 Sampling distributions involving Ȳ
    4.4.1 Central Limit Theorem
    4.4.2 t distribution
    4.4.3 Normal quantile-quantile (qq) plots
  4.5 Confidence intervals for a population mean
    4.5.1 Known population variance σ²
    4.5.2 Unknown population variance σ²
    4.5.3 Sample size determination
  4.6 Confidence interval for a population proportion p
  4.7 Confidence interval for a population variance σ²
  4.8 Confidence intervals for the difference of two population means μ1 − μ2: Independent samples
    4.8.1 Equal variance case: σ1² = σ2²
    4.8.2 Unequal variance case: σ1² ≠ σ2²
  4.9 Confidence interval for the difference of two population proportions p1 − p2: Independent samples
  4.10 Confidence interval for the ratio of two population variances σ2²/σ1²: Independent samples
  4.11 Confidence intervals for the difference of two population means μ1 − μ2: Dependent samples (Matched-pairs)
  4.12 One-way analysis of variance
    4.12.1 Overall F test
    4.12.2 Follow-up analysis: Tukey pairwise confidence intervals

6 Linear regression
  6.1 Introduction
  6.2 Simple linear regression
    6.2.1 Least squares estimation
    6.2.2 Model assumptions and properties of least squares estimators
    6.2.3 Estimating the error variance
    6.2.4 Inference for β0 and β1
    6.2.5 Confidence and prediction intervals for a given x = x0
  6.3 Multiple linear regression
    6.3.1 Introduction
    6.3.2 Matrix representation
    6.3.3 Estimating the error variance
    6.3.4 The hat matrix
    6.3.5 Analysis of variance for linear regression
    6.3.6 Inference for individual regression parameters
    6.3.7 Confidence and prediction intervals for a given x = x0
  6.4 Model diagnostics (residual analysis)

7 Factorial Experiments
  7.1 Introduction
  7.2 Example: A 2² experiment with replication
  7.3 Example: A 2⁴ experiment without replication

3 Modeling Random Behavior

Complementary reading: Chapter 3 (VK); Sections 3.1-3.5 and 3.9.

3.1 Introduction

TERMINOLOGY: Statistics is the development and application of theory and methods to the collection, design, analysis, and interpretation of observed information from planned and unplanned studies.

"Statisticians get to play in everyone else's back yard." (John Tukey, Princeton)
Here are some examples where statistics could be used:
1. In a reliability (time to event) study, an engineer is interested in quantifying the
time until failure for a jet engine fan blade.
2. In an agricultural study in Iowa, researchers want to know which of four fertilizers
(which vary in their nitrogen contents) produces the highest corn yield.
3. In a clinical trial, physicians want to determine which of two drugs is more effective
for treating HIV in the early stages of the disease.
4. In a public health study, epidemiologists want to know whether smoking is linked
to a particular demographic class in high school students.
5. A food scientist is interested in determining how different feeding schedules (for
pigs) could affect the spread of salmonella during the slaughtering process.
6. A pharmacist posits that administering caffeine to premature babies in the ICU at
Richland Hospital will reduce the incidence of necrotizing enterocolitis.
7. A research dietician wants to determine if academic achievement is related to body
mass index (BMI) among African American students in the fourth grade.

8. An economist, as part of President Obama's re-election campaign, is trying to forecast the monthly unemployment and under-employment rates for 2012.
REMARK: Statisticians use their skills in mathematics and computing to formulate statistical models and analyze data for a specific problem at hand. These models are then used to estimate important quantities of interest (to the researcher), to test the validity of important conjectures, and to predict future behavior. Being able to identify and model sources of variability is an important part of this process.
TERMINOLOGY: A deterministic model is one that makes no attempt to explain variability. For example, in chemistry, the ideal gas law states that

PV = nRT,

where P = pressure of a gas, V = volume, n = the amount of substance of gas (number of moles), R = the universal gas constant, and T = temperature. In circuit analysis, Ohm's law states that

V = IR,

where V = voltage, I = current, and R = resistance.
- In both of these models, the relationship among the variables is completely determined, without any ambiguity.
- In real life, this is rarely true, for the obvious reason: there is natural variation that arises in the measurement process.
- For example, a common electrical engineering experiment involves setting up a simple circuit with a known resistance R. For a given current I, different students will then calculate the voltage V.
- With a sample of n = 20 students, conducting the experiment in succession, we might very well get 20 different measured voltages!
- A deterministic model is too simplistic for real life; it does not acknowledge the inherent variability that arises in the measurement process.

A probabilistic (or stochastic or statistical) model might look like

V = IR + ε,

where ε is a random term that accounts for measurement error.
PREDICTION: Statistical models can also be used to predict future outcomes. For example, suppose that I am trying to predict

Y = MATH 141 final course percentage

for incoming freshmen enrolled in MATH 141. For each freshman student, I will record the following variables:

x1 = SAT MATH score
x2 = high school GPA.
A deterministic model would be

Y = f(x1, x2),

for some function f : ℝ² → [0, 100]. This model suggests that for a student with values x1 and x2, we could compute Y exactly if the function f were known. A statistical model for Y might look something like this:
Y = β0 + β1x1 + β2x2 + ε,

where ε is a random term that accounts for not only measurement error (e.g., incorrect student information, grading errors, etc.) but also

(a) all of the other variables not accounted for (e.g., major, leisure habits, natural ability, etc.) and
(b) the error induced by assuming a linear relationship between Y and {x1, x2} when, in fact, it may not be linear.

In this example, with certain (probabilistic) assumptions on ε and a mathematically sensible way to estimate the unknown β0, β1, and β2 (i.e., coefficients of the linear function), we can produce point predictions of Y on a student-by-student basis; we can also characterize numerical uncertainty with our predictions.
3.2 Probability

3.2.1 Sample spaces and events

TERMINOLOGY: Probability is a measure of one's belief in the occurrence of a future event. Here are some events to which we may wish to assign a probability:
- tomorrow's temperature exceeding 80 degrees
- manufacturing a defective part
- concluding one fertilizer is superior to another when it isn't
- the NASDAQ losing 5 percent of its value
- you being diagnosed with prostate/cervical cancer in the next 20 years.
TERMINOLOGY: Many real-life phenomena can be envisioned as a random experiment. The set of all possible outcomes for a given random experiment is called the sample space, denoted by S. The number of outcomes in S is denoted by nS.
Example 3.1. In each of the following random experiments, we write out a corresponding sample space.
(a) The Michigan state lottery calls for a three-digit integer to be selected:
S = {000, 001, 002, ..., 998, 999}.
The size of the set of all possible outcomes is nS = 1000.
(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):
S = {0, 1}.
The size of the set of all possible outcomes is nS = 2.

(c) Four equally qualified applicants (a, b, c, d) are competing for two positions. If the
positions are identical (so that selection order does not matter), then
S = {ab, ac, ad, bc, bd, cd}.
The size of the set of all possible outcomes is nS = 6. If the positions are different (e.g.,
project leader, assistant project leader, etc.), then
S = {ab, ba, ac, ca, ad, da, bc, cb, bd, db, cd, dc}.
In this case, the size of the set of all possible outcomes is nS = 12.
TERMINOLOGY: Suppose that S is a sample space for a random experiment. We say that A is an event in S if A ⊆ S.
GOAL: We would like to develop a mathematical framework so that we can assign probability to an event A. This will quantify how likely the event is. The probability that
the event A occurs is denoted by P (A).
INTUITIVE: Suppose that a sample space S contains nS < ∞ outcomes, each of which is equally likely. If the event A contains nA outcomes, then

P(A) = nA/nS.

This is called an equiprobability model. Its main requirement is that all outcomes in S are equally likely.

Important: If the outcomes in S are not equally likely, then this result is not applicable.

Example 3.2. In the random experiments from Example 3.1, we use the previous result
to assign probabilities to events (if applicable).
(a) The Michigan state lottery calls for a three-digit integer to be selected:
S = {000, 001, 002, ..., 998, 999}.

The size of the set of all possible outcomes is nS = 1000. Let the event
A = {000, 005, 010, 015, ..., 990, 995}
= {winning number is a multiple of 5}.
There are nA = 200 outcomes in A. It is reasonable to assume that each outcome in S
is equally likely. Therefore,
P(A) = 200/1000 = 0.20.

(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):


S = {0, 1}.
The size of the set of all possible outcomes is nS = 2. However, is it reasonable to assume
that each outcome in S (0 = negative, 1 = positive) is equally likely? The prevalence of
chlamydia among college age students is much less than 50 percent (in SC, this prevalence
is probably somewhere between 5-12 percent). Therefore, it would be illogical to assign
probabilities using an equiprobability model.
(c) Four equally qualified applicants (a, b, c, d) are competing for two positions. If the
positions are identical (so that selection order does not matter), then
S = {ab, ac, ad, bc, bd, cd}.
The size of the set of all possible outcomes is nS = 6. If A is the event that applicant d
is selected for one of the two positions, then
A = {ad, bd, cd}
= {applicant d is chosen}.
There are nA = 3 outcomes in A. If each of the 4 applicants has the same chance of
being selected (an assumption), then each of the nS = 6 outcomes in S is equally likely.
Therefore,
P(A) = 3/6 = 0.50.


INTERPRETATION: In general, what does P(A) really measure? There are two main interpretations:

- P(A) measures the likelihood that A will occur on any given experiment.
- If the experiment is performed many times, then P(A) can be interpreted as the percentage of times that A will occur over the long run. This is called the relative frequency interpretation.

If we are using the former interpretation, then it is common to use a decimal representation; e.g., P(A) = 0.50. If we are using the latter, it is commonly accepted to say something like "the event A will occur 50 percent of the time." This gives the impression that A will occur, on average, 1 out of every 2 times the experiment is performed. This does not mean that the event will occur exactly 1 out of every 2 times the experiment is performed.
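The relative frequency interpretation is easy to see by simulation. Here is a small R sketch (our own illustration, not from the original notes) that repeats the lottery experiment of Example 3.2(a) many times and tracks how often the event A = {winning number is a multiple of 5} occurs:

> draws <- sample(0:999, 100000, replace = TRUE)  ## 100,000 simulated lottery numbers
> mean(draws %% 5 == 0)                           ## relative frequency of A; close to P(A) = 0.20

The relative frequency settles near 0.20 as the number of simulated experiments grows.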

3.2.2 Unions and intersections

TERMINOLOGY: The null event, denoted by ∅, is an event that contains no outcomes. The null event has probability P(∅) = 0.
TERMINOLOGY: The union of two events A and B is the set of all outcomes in either set or both. We denote the union of two events A and B by

A ∪ B = {ω : ω ∈ A or ω ∈ B}.

TERMINOLOGY: The intersection of two events A and B is the set of all outcomes in both sets. We denote the intersection of two events A and B by

A ∩ B = {ω : ω ∈ A and ω ∈ B}.
TERMINOLOGY: If the events A and B contain no common outcomes, we say the events are mutually exclusive. In this case,

P(A ∩ B) = P(∅) = 0.

Example 3.3. A primary computer system is backed up by two secondary systems. They operate independently of each other; i.e., the failure of one has no effect on any of the others. We are interested in the readiness of these three systems, which we can envision as an experiment. The sample space for this experiment can be described as

S = {yyy, yny, yyn, ynn, nyy, nny, nyn, nnn},

where y means "the system is ready" and n means "the system is not ready." Define

A = {primary system is operable} = {yyy, yny, yyn, ynn}
B = {first backup system is operable} = {yyy, yyn, nyy, nyn}.

The union of A and B is

A ∪ B = {yyy, yny, yyn, ynn, nyy, nyn}.

In words, A ∪ B occurs if the primary system or the first backup system is operable (or both). The intersection of A and B is

A ∩ B = {yyy, yyn}

and occurs if the primary system and the first backup system are both operable.
Important: Can we compute P(A), P(B), P(A ∪ B), and P(A ∩ B) in this example? For example, if we wrote

P(A ∪ B) = nA∪B/nS = 6/8 = 0.75,

we would be making the assumption that each outcome in S is equally likely! This would only be true if each system (primary and both backups) functions with probability equal to 1/2. This is extremely unlikely! Therefore, we cannot compute probabilities in this example without additional information about the system-specific failure rates.
Example 3.4. Hemophilia is a sex-linked hereditary blood defect of males characterized by delayed clotting of the blood, which makes it difficult to control bleeding. When a woman is a carrier of classical hemophilia, there is a 50 percent chance that a male child

will inherit this disease. If a carrier gives birth to two males (not twins), what is the probability that either will have the disease? That both will have the disease?
Solution. We can envision the process of having two male children as an experiment with sample space

S = {++, +−, −+, −−},

where + means the male offspring has the disease and − means the male does not have the disease. To compute the probabilities requested in this problem, we will assume that each outcome in S is equally likely. Define the events:

A = {first child has disease} = {++, +−}
B = {second child has disease} = {++, −+}.

The union and intersection of A and B are, respectively,

A ∪ B = {either child has disease} = {++, +−, −+}
A ∩ B = {both children have disease} = {++}.

The probability that either male child will have the disease is

P(A ∪ B) = nA∪B/nS = 3/4 = 0.75.

The probability that both male children will have the disease is

P(A ∩ B) = nA∩B/nS = 1/4 = 0.25.

3.2.3 Axioms and rules

KOLMOGOROV AXIOMS: For any sample space S, a probability P must satisfy

(1) 0 ≤ P(A) ≤ 1, for any event A
(2) P(S) = 1
(3) If A1, A2, ..., An are pairwise mutually exclusive events, then

P(A1 ∪ A2 ∪ ··· ∪ An) = P(A1) + P(A2) + ··· + P(An).


- The term "pairwise mutually exclusive" means that Ai ∩ Aj = ∅, for all i ≠ j.
- The event A1 ∪ A2 ∪ ··· ∪ An, in words, is read "at least one Ai occurs."

3.2.4 Conditional probability and independence

IDEA: In some situations, we may be fortunate enough to have prior knowledge about
the likelihood of other events related to the event of interest. We can then incorporate
this information into a probability calculation.
TERMINOLOGY: Let A and B be events in a sample space S with P(B) > 0. The conditional probability of A, given that B has occurred, is

P(A|B) = P(A ∩ B)/P(B).

Similarly,

P(B|A) = P(A ∩ B)/P(A).

Example 3.5. In a company, 36 percent of the employees have a degree from an SEC university, 22 percent of the employees that have a degree from the SEC are also engineers, and 30 percent of the employees are engineers. An employee is selected at random.

(a) Compute the probability that the employee is an engineer and is from the SEC.
(b) Compute the conditional probability that the employee is from the SEC, given that s/he is an engineer.
Solution: Define the events

A = {employee is an engineer}
B = {employee is from the SEC}.


From the information in the problem, we are given: P(A) = 0.30, P(B) = 0.36, and P(A|B) = 0.22. In part (a), we want P(A ∩ B). Note that

0.22 = P(A|B) = P(A ∩ B)/P(B) = P(A ∩ B)/0.36.

Therefore,

P(A ∩ B) = 0.22(0.36) = 0.0792.

In part (b), we want P(B|A). From the definition of conditional probability:

P(B|A) = P(A ∩ B)/P(A) = 0.0792/0.30 = 0.264.

IMPORTANT: Note that, in this example, the conditional probability P(B|A) and the unconditional probability P(B) are not equal.

- In other words, knowledge that A has occurred has changed the likelihood that B occurs.
- In other situations, it might be that the occurrence (or non-occurrence) of a companion event has no effect on the probability of the event of interest. This leads us to the definition of independence.

TERMINOLOGY: When the occurrence or non-occurrence of B has no effect on whether or not A occurs, and vice versa, we say that the events A and B are independent. Mathematically, we define A and B to be independent if and only if

P(A ∩ B) = P(A)P(B).
Note that if A and B are independent,

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A)

and

P(B|A) = P(B ∩ A)/P(A) = P(B)P(A)/P(A) = P(B).

Note: These results only apply if A and B are independent. In other words, if A and B are not independent, then these rules do not apply.

Example 3.6. In an engineering system, two components are placed in series; that is, the system is functional as long as both components are. Each component is functional with probability 0.95. Define the events

A1 = {component 1 is functional}
A2 = {component 2 is functional}

so that P(A1) = 0.95 and P(A2) = 0.95. The probability that the system is functional is given by P(A1 ∩ A2).

- If the components operate independently, then A1 and A2 are independent events so that

P(A1 ∩ A2) = P(A1)P(A2) = 0.95(0.95) = 0.9025.

- If the components do not operate independently (e.g., failure of one component wears on the other), then we cannot compute P(A1 ∩ A2) without additional knowledge.

EXTENSION: The notion of independence extends to any finite collection of events A1, A2, ..., An. Mutual independence means that the probability of the intersection of any sub-collection of A1, A2, ..., An equals the product of the probabilities of the events in the sub-collection. For example, if A1, A2, A3, and A4 are mutually independent, then

P(A1 ∩ A2) = P(A1)P(A2)
P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3)
P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

3.2.5 Additional probability rules

TERMINOLOGY: Suppose that S is a sample space and that A is an event. The complement of A, denoted by Ā, is the set of all outcomes in S not in A. That is,

Ā = {ω ∈ S : ω ∉ A}.

1. Complement rule: Suppose that A is an event. Then

P(Ā) = 1 − P(A).

2. Additive law: Suppose that A and B are two events. Then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

3. Multiplicative law: Suppose that A and B are two events. Then

P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).

4. Law of Total Probability (LOTP): Suppose that A and B are two events. Then

P(A) = P(A|B)P(B) + P(A|B̄)P(B̄).

5. Bayes' Rule: Suppose that A and B are two events. Then

P(B|A) = P(A|B)P(B)/P(A) = P(A|B)P(B)/[P(A|B)P(B) + P(A|B̄)P(B̄)].

Example 3.7. The probability that train 1 is on time is 0.95. The probability that train 2 is on time is 0.93. The probability that both are on time is 0.90. Define the events

A1 = {train 1 is on time}
A2 = {train 2 is on time}.

We are given that P(A1) = 0.95, P(A2) = 0.93, and P(A1 ∩ A2) = 0.90.

(a) What is the probability that train 1 is not on time?

P(Ā1) = 1 − P(A1) = 1 − 0.95 = 0.05.

(b) What is the probability that at least one train is on time?

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2) = 0.95 + 0.93 − 0.90 = 0.98.

(c) What is the probability that train 1 is on time given that train 2 is on time?

P(A1|A2) = P(A1 ∩ A2)/P(A2) = 0.90/0.93 ≈ 0.968.

(d) What is the probability that train 2 is on time given that train 1 is not on time?

P(A2|Ā1) = P(Ā1 ∩ A2)/P(Ā1) = [P(A2) − P(A1 ∩ A2)]/[1 − P(A1)] = (0.93 − 0.90)/(1 − 0.95) = 0.60.

(e) Are A1 and A2 independent events?

Answer: They are not independent because

P(A1 ∩ A2) ≠ P(A1)P(A2).

Equivalently, note that P(A1|A2) ≠ P(A1). In other words, knowledge that A2 has occurred changes the likelihood that A1 occurs.
Example 3.8. An insurance company classifies people as "accident-prone" and "non-accident-prone." For a fixed year, the probability that an accident-prone person has an accident is 0.4, and the probability that a non-accident-prone person has an accident is 0.2. The population is estimated to be 30 percent accident-prone. Define the events

A = {policy holder has an accident}
B = {policy holder is accident-prone}.

We are given that P(B) = 0.3, P(A|B) = 0.4, and P(A|B̄) = 0.2.

(a) What is the probability that a new policy-holder will have an accident?
Solution: By the Law of Total Probability,

P(A) = P(A|B)P(B) + P(A|B̄)P(B̄) = 0.4(0.3) + 0.2(0.7) = 0.26.

(b) Suppose that the policy-holder does have an accident. What is the probability that s/he was accident-prone?
Solution: We want P(B|A). By Bayes' Rule,

P(B|A) = P(A|B)P(B)/[P(A|B)P(B) + P(A|B̄)P(B̄)] = 0.4(0.3)/[0.4(0.3) + 0.2(0.7)] ≈ 0.46.
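The LOTP/Bayes arithmetic above is easy to script. Below is a minimal R sketch of Example 3.8 (the object names pB, pAgB, and pAgBc are ours, introduced only for illustration):

> pB    <- 0.3                      ## P(B), proportion accident-prone
> pAgB  <- 0.4                      ## P(A|B)
> pAgBc <- 0.2                      ## P(A | B complement)
> pA <- pAgB*pB + pAgBc*(1 - pB)    ## Law of Total Probability
> pA
[1] 0.26
> pAgB*pB/pA                        ## Bayes' Rule: P(B|A)
[1] 0.4615385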

3.3 Random variables and distributions

TERMINOLOGY: A random variable Y is a variable whose value is determined by chance. The distribution of a random variable consists of two parts:

1. the set of all possible values of Y (called the support)
2. a function that describes how to assign probabilities to events involving Y.
NOTATION: By convention, we denote random variables by upper case letters towards the end of the alphabet; e.g., W, X, Y, Z, etc. A possible value of Y (i.e., a value in the support) is denoted generically by the lower case version y. In words, the symbol

P(Y = y)

is read, "the probability that the random variable Y equals the value y." The symbol

FY(y) ≡ P(Y ≤ y)

is read, "the probability that the random variable Y is less than or equal to the value y." This probability is called the cumulative distribution function of Y and will be discussed later.
TERMINOLOGY: If a random variable Y can assume only a finite (or countable) number of values, we call Y a discrete random variable. If it makes more sense to envision Y as assuming values in an interval of numbers, we call Y a continuous random variable.

Example 3.9. Classify the following random variables as discrete or continuous and specify the support of each random variable.

V = number of unbroken eggs in a randomly selected carton (dozen)
W = pH of an aqueous solution
X = length of time between accidents at a factory
Y = whether or not you pass this class
Z = number of aircraft arriving tomorrow at CAE.


- The random variable V is discrete. It can assume values in {v : v = 0, 1, 2, ..., 12}.
- The random variable W is continuous. It most certainly assumes values in {w : −∞ < w < ∞}. Of course, with most solutions, it is more likely that W is not negative (although this is possible) and not larger than, say, 15 (a very reasonable upper bound). However, the choice of {w : −∞ < w < ∞} is not mathematically incongruous with these practical constraints.
- The random variable X is continuous. It can assume values in {x : x > 0}. The key feature here is that a time cannot be negative. In theory, it is possible that X can be very large.
- The random variable Y is discrete. It can assume values in {y : y = 0, 1}, where I have arbitrarily labeled "1" for passing and "0" for failing. Random variables that can assume exactly 2 values (e.g., 0, 1) are called binary.

- The random variable Z is discrete. It can assume values in {z : z = 0, 1, 2, ...}. I have allowed for the possibility of a very large number of aircraft arriving.

3.4 Discrete random variables

TERMINOLOGY: Suppose that Y is a discrete random variable. The function

pY(y) = P(Y = y)

is called the probability mass function (pmf) for Y. The pmf pY(y) is a function that assigns probabilities to each possible value of Y.
PROPERTIES: A pmf pY(y) for a discrete random variable Y satisfies the following:

1. 0 < pY(y) < 1, for all possible values of y
2. The sum of the probabilities, taken over all possible values of Y, must equal 1; i.e.,

∑ (all y) pY(y) = 1.

Example 3.10. A mail-order computer business has six telephone lines. Let Y denote the number of lines in use at a specific time. Suppose that the probability mass function (pmf) of Y is given by

y        0     1     2     3     4     5     6
pY(y)  0.10  0.15  0.20  0.25  0.20  0.06  0.04

- Figure 3.1 (left) displays pY(y), the probability mass function (pmf) of Y.
- The height of the bar above y is equal to pY(y) = P(Y = y).
- If y is not equal to 0, 1, 2, 3, 4, 5, 6, then pY(y) = 0.
Figure 3.1: PMF (left) and CDF (right) of Y in Example 3.10.


Figure 3.1 (right) displays the cumulative distribution function (cdf) of Y, FY(y) = P(Y ≤ y).

- FY(y) is a nondecreasing function.
- 0 ≤ FY(y) ≤ 1; this makes sense since FY(y) = P(Y ≤ y) is a probability!
- The cdf FY(y) in this example (Y is discrete) takes a "step" at each possible value of Y and stays constant otherwise.
- The height of the step at a particular y is equal to pY(y) = P(Y = y).
Here is the table I gave you above, but now I have added the values of the cumulative distribution function:

y        0     1     2     3     4     5     6
pY(y)  0.10  0.15  0.20  0.25  0.20  0.06  0.04
FY(y)  0.10  0.25  0.45  0.70  0.90  0.96  1.00


(a) What is the probability that exactly two lines are in use?

pY(2) = P(Y = 2) = 0.20.

(b) What is the probability that at most two lines are in use?

P(Y ≤ 2) = P(Y = 0) + P(Y = 1) + P(Y = 2) = pY(0) + pY(1) + pY(2) = 0.10 + 0.15 + 0.20 = 0.45.

Note: This is also equal to FY(2) = 0.45.

(c) What is the probability that at least five lines are in use?

P(Y ≥ 5) = P(Y = 5) + P(Y = 6) = pY(5) + pY(6) = 0.06 + 0.04 = 0.10.

It is also important to note that in part (c), we could have computed

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − FY(4) = 1 − 0.90 = 0.10.
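All of these pmf/cdf calculations can be reproduced in R by storing the table as two vectors; a small sketch (the names y and py are ours):

> y  <- 0:6
> py <- c(0.10, 0.15, 0.20, 0.25, 0.20, 0.06, 0.04)  ## pmf from the table
> sum(py[y <= 2])     ## (b) P(Y <= 2)
[1] 0.45
> sum(py[y >= 5])     ## (c) P(Y >= 5)
[1] 0.1
> cumsum(py)          ## the cdf values F_Y(0), F_Y(1), ..., F_Y(6)
[1] 0.10 0.25 0.45 0.70 0.90 0.96 1.00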

TERMINOLOGY: Let Y be a discrete random variable with pmf pY(y). The expected value of Y is given by

μ = E(Y) = ∑ (all y) y pY(y).

The expected value for a discrete random variable Y is simply a weighted average of the possible values of Y. Each value y is weighted by its probability pY(y). In statistical applications, μ = E(Y) is commonly called the population mean.
Example 3.11. In Example 3.10, we examined the distribution of Y, the number of lines in use at a specified time. The probability mass function (pmf) of Y is given by

y        0     1     2     3     4     5     6
pY(y)  0.10  0.15  0.20  0.25  0.20  0.06  0.04


The expected value of Y is

μ = E(Y) = ∑ (all y) y pY(y)
         = 0(0.10) + 1(0.15) + 2(0.20) + 3(0.25) + 4(0.20) + 5(0.06) + 6(0.04)
         = 2.64.

- Interpretation: On average, we would expect 2.64 lines in use at the specified time.
- Interpretation: Over the long run, if we observed many values of Y at this specified time (and this pmf was applicable each time), then the average of these Y observations would be close to 2.64.
- Interpretation: Place an "×" at μ = 2.64 in Figure 3.1 (left). This represents the "balance point" of the probability mass function.
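This weighted average is a one-line computation in R, and the long-run interpretation can be checked by simulation (a sketch; y and py are the support and pmf vectors from the table above):

> y  <- 0:6
> py <- c(0.10, 0.15, 0.20, 0.25, 0.20, 0.06, 0.04)
> sum(y*py)          ## E(Y)
[1] 2.64
> mean(sample(y, 100000, replace = TRUE, prob = py))  ## average of many simulated Y's; close to 2.64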
FUNCTIONS: Let Y be a discrete random variable with pmf pY(y). Suppose that g is a real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = ∑ (all y) g(y) pY(y).

PROPERTIES OF EXPECTATIONS: Let Y be a discrete random variable with pmf pY(y). Suppose that g, g1, g2, ..., gk are real-valued functions, and let c be any real constant. Expectations satisfy the following (linearity) properties:

(a) E(c) = c
(b) E[cg(Y)] = cE[g(Y)]
(c) E[g1(Y) + g2(Y) + ··· + gk(Y)] = E[g1(Y)] + E[g2(Y)] + ··· + E[gk(Y)].

Note: These rules are also applicable if Y is continuous (coming up).
Example 3.12. In a one-hour period, the number of gallons of a certain toxic chemical that is produced at a local plant, say Y, has the following pmf:

y       0    1    2    3
pY(y)  0.2  0.3  0.3  0.2

Figure 3.2: PMF (left) and CDF (right) of Y in Example 3.12.


(a) Compute the expected number of gallons produced during a one-hour period.
Solution: The expected value of Y is

μ = E(Y) = ∑ (all y) y pY(y) = 0(0.2) + 1(0.3) + 2(0.3) + 3(0.2) = 1.5.

We would expect 1.5 gallons of the toxic chemical to be produced per hour (on average).

(b) The cost (in hundreds of dollars) to produce Y gallons is given by the cost function g(Y) = 3 + 12Y + 2Y². What is the expected cost in a one-hour period?
Solution: We want to compute E[g(Y)]. We first compute E(Y²):

E(Y²) = ∑ (all y) y² pY(y) = 0²(0.2) + 1²(0.3) + 2²(0.3) + 3²(0.2) = 3.3.

Therefore,

E[g(Y)] = E(3 + 12Y + 2Y²) = 3 + 12E(Y) + 2E(Y²) = 3 + 12(1.5) + 2(3.3) = 27.6.

The expected hourly cost is $2,760.00.

TERMINOLOGY: Let Y be a discrete random variable with pmf pY(y) and expected value E(Y) = μ. The population variance of Y is given by

σ² ≡ var(Y) ≡ E[(Y − μ)²] = ∑ (all y) (y − μ)² pY(y).

The population standard deviation of Y is the positive square root of the variance:

σ = √σ² = √var(Y).

FACTS: The population variance σ² satisfies the following:

(a) σ² ≥ 0. σ² = 0 if and only if the random variable Y has a degenerate distribution; i.e., all the probability mass is located at one support point.
(b) The larger (smaller) σ² is, the more (less) spread in the possible values of Y about the population mean μ = E(Y).
(c) σ² is measured in (units)² and σ is measured in the original units.
COMPUTING FORMULA: Let Y be a random variable with population mean E(Y) = μ. An alternative computing formula for the population variance is

var(Y) = E[(Y − μ)²] = E(Y²) − [E(Y)]².

This formula is easy to remember and makes subsequent calculations easier.
Example 3.13. In Example 3.12, we examined the pmf for the number of gallons of a certain toxic chemical that is produced at a local plant, denoted by Y. The pmf of Y is

y       0    1    2    3
pY(y)  0.2  0.3  0.3  0.2

In Example 3.12, we computed

E(Y) = 1.5
E(Y²) = 3.3.

The population variance of Y is

σ² = var(Y) = E(Y²) − [E(Y)]² = 3.3 − (1.5)² = 1.05.

The population standard deviation of Y is

σ = √σ² = √1.05 ≈ 1.025.
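As a check, the same moment calculations in R (a sketch with our own object names):

> y  <- 0:3
> py <- c(0.2, 0.3, 0.3, 0.2)
> EY  <- sum(y*py)     ## E(Y) = 1.5
> EY2 <- sum(y^2*py)   ## E(Y^2) = 3.3
> EY2 - EY^2           ## var(Y)
[1] 1.05
> sqrt(EY2 - EY^2)     ## sigma
[1] 1.024695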

3.4.1 Binomial distribution

BERNOULLI TRIALS: Many experiments can be envisioned as consisting of a sequence of trials, where

1. each trial results in a "success" or a "failure,"
2. the trials are independent, and
3. the probability of success, denoted by p, 0 < p < 1, is the same on every trial.
Example 3.14. We give examples of situations that can be thought of as observing Bernoulli trials.

- When circuit boards used in the manufacture of Blu-ray players are tested, the long-run percentage of defective boards is 5 percent.
  - circuit board = trial
  - defective board is observed = "success"
  - p = P("success") = P(defective board) = 0.05.
- Ninety-eight percent of all air traffic radar signals are correctly interpreted the first time they are transmitted.
  - radar signal = trial
  - signal is correctly interpreted = "success"
  - p = P("success") = P(correct interpretation) = 0.98.

- Albino rats used to study the hormonal regulation of a metabolic pathway are injected with a drug that inhibits body synthesis of protein. The probability that a rat will die from the drug before the study is complete is 0.20.
  - rat = trial
  - dies before study is over = "success"
  - p = P("success") = P(dies early) = 0.20.

TERMINOLOGY: Suppose that n Bernoulli trials are performed. Define

Y = the number of successes (out of n trials performed).

We say that Y has a binomial distribution with number of trials n and success probability p. Shorthand notation is Y ~ b(n, p).

PMF: If Y ~ b(n, p), then the probability mass function of Y is given by

pY(y) = (n choose y) p^y (1 − p)^(n−y), y = 0, 1, 2, ..., n,

and pY(y) = 0 otherwise.

MEAN/VARIANCE: If Y ~ b(n, p), then

E(Y) = np
var(Y) = np(1 − p).
Example 3.15. In an agricultural study, it is determined that 40 percent of all plots respond to a certain treatment. Four plots are observed. In this situation, we interpret

- plot of land = trial
- plot responds to treatment = "success"
- p = P("success") = P(responds to treatment) = 0.4.

Figure 3.3: PMF (left) and CDF (right) of Y ~ b(n = 4, p = 0.4) in Example 3.15.
If the Bernoulli trial assumptions hold (independent plots, same response probability for each plot), then

Y = the number of plots which respond ~ b(n = 4, p = 0.4).

(a) What is the probability that exactly two plots respond?

P(Y = 2) = pY(2) = (4 choose 2)(0.4)²(1 − 0.4)^(4−2) = 6(0.4)²(0.6)² = 0.3456.

(b) What is the probability that at least one plot responds?

P(Y ≥ 1) = 1 − P(Y = 0) = 1 − (4 choose 0)(0.4)⁰(1 − 0.4)^(4−0) = 1 − 1(1)(0.6)⁴ = 0.8704.

(c) What are E(Y) and var(Y)?

E(Y) = np = 4(0.4) = 1.6
var(Y) = np(1 − p) = 4(0.4)(0.6) = 0.96.

Example 3.16. An electronics manufacturer claims that 10 percent of its power supply units need servicing during the warranty period. To investigate this claim, technicians at a testing laboratory purchase 30 units and subject each one to an accelerated testing protocol to simulate use during the warranty period. In this situation, we interpret

- power supply unit = trial
- supply unit needs servicing during warranty period = "success"
- p = P("success") = P(supply unit needs servicing) = 0.1.

If the Bernoulli trial assumptions hold (independent units, same probability of needing service for each unit), then

Y = the number of units requiring service during warranty period ~ b(n = 30, p = 0.1).
b(n = 30, p = 0.1).

Note: Instead of computing probabilities by hand, we will use R.

BINOMIAL R CODE: Suppose that Y ~ b(n, p).

pY(y) = P(Y = y)    FY(y) = P(Y ≤ y)
dbinom(y,n,p)       pbinom(y,n,p)

(a) What is the probability that exactly five of the 30 power supply units require servicing during the warranty period?

pY(5) = P(Y = 5) = (30 choose 5)(0.1)⁵(1 − 0.1)^(30−5)
= dbinom(5,30,0.1) = 0.1023048.

(b) What is the probability that at most five of the 30 power supply units require servicing during the warranty period?

FY(5) = P(Y ≤ 5) = ∑ (y = 0 to 5) (30 choose y)(0.1)^y (1 − 0.1)^(30−y)
= pbinom(5,30,0.1) = 0.9268099.
Figure 3.4: PMF (left) and CDF (right) of Y ~ b(n = 30, p = 0.1) in Example 3.16.
(c) What is the probability at least five of the 30 power supply units require service?

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − ∑ (y = 0 to 4) (30 choose y)(0.1)^y (1 − 0.1)^(30−y)
= 1-pbinom(4,30,0.1) = 0.1754949.

(d) What is P(2 ≤ Y ≤ 8)?

P(2 ≤ Y ≤ 8) = ∑ (y = 2 to 8) (30 choose y)(0.1)^y (1 − 0.1)^(30−y).

One way to get this in R is to use the command:

> sum(dbinom(2:8,30,0.1))
[1] 0.8142852

The dbinom(2:8,30,0.1) command creates a vector containing pY(2), pY(3), ..., pY(8), and the sum command adds them. Another way to calculate this probability in R is

> pbinom(8,30,0.1)-pbinom(1,30,0.1)
[1] 0.8142852
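These exact answers can also be sanity-checked by simulation with rbinom; a small sketch (our own check, not part of the original example):

> sims <- rbinom(100000, 30, 0.1)   ## 100,000 simulated values of Y ~ b(30, 0.1)
> mean(sims >= 2 & sims <= 8)       ## relative frequency of {2 <= Y <= 8}; close to 0.8143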
3.4.2 Geometric distribution

NOTE: The geometric distribution also arises in experiments involving Bernoulli trials:

1. Each trial results in a "success" or a "failure."
2. The trials are independent.
3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.

TERMINOLOGY: Suppose that Bernoulli trials are continually observed. Define

Y = the number of trials to observe the first success.

We say that Y has a geometric distribution with success probability p. Shorthand notation is Y ~ geom(p).

PMF: If Y ~ geom(p), then the probability mass function of Y is given by

pY(y) = (1 − p)^(y−1) p, y = 1, 2, 3, ...,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ geom(p), then

E(Y) = 1/p
var(Y) = (1 − p)/p².

Example 3.17. Biology students are checking the eye color of fruit flies. For each fly, the probability of observing white eyes is p = 0.25. In this situation, we interpret

- fruit fly = trial
- fly has white eyes = "success"
- p = P("success") = P(white eyes) = 0.25.
Figure 3.5: PMF (left) and CDF (right) of Y ~ geom(p = 0.25) in Example 3.17.
If the Bernoulli trial assumptions hold (independent flies, same probability of white eyes for each fly), then

Y = the number of flies needed to find the first white-eyed fly ~ geom(p = 0.25).

(a) What is the probability the first white-eyed fly is observed on the fifth fly checked?

pY(5) = P(Y = 5) = (1 − 0.25)^(5−1)(0.25) = (0.75)⁴(0.25) ≈ 0.079.

(b) What is the probability the first white-eyed fly is observed before the fourth fly is examined? Note: For this to occur, we must observe the first white-eyed fly ("success") on either the first, second, or third fly.

FY(3) = P(Y ≤ 3) = P(Y = 1) + P(Y = 2) + P(Y = 3)
= (1 − 0.25)^(1−1)(0.25) + (1 − 0.25)^(2−1)(0.25) + (1 − 0.25)^(3−1)(0.25)
= 0.25 + 0.1875 + 0.140625 ≈ 0.578.

GEOMETRIC R CODE: Suppose that Y ~ geom(p).

pY(y) = P(Y = y)    FY(y) = P(Y ≤ y)
dgeom(y-1,p)        pgeom(y-1,p)

(Note: R parameterizes the geometric distribution by the number of failures before the first success, which is why y-1 appears in these commands.)

> dgeom(5-1,0.25) ## Part (a)
[1] 0.07910156
> pgeom(3-1,0.25) ## Part (b)
[1] 0.578125

3.4.3 Negative binomial distribution

NOTE: The negative binomial distribution also arises in experiments involving Bernoulli trials:

1. Each trial results in a "success" or a "failure."
2. The trials are independent.
3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.

TERMINOLOGY: Suppose that Bernoulli trials are continually observed. Define

Y = the number of trials to observe the rth success.

We say that Y has a negative binomial distribution with waiting parameter r and success probability p. Shorthand notation is Y ~ nib(r, p).

REMARK: Note that the negative binomial distribution is a mere generalization of the geometric. If r = 1, then the nib(r, p) distribution reduces to the geom(p).

PMF: If Y ~ nib(r, p), then the probability mass function of Y is given by

pY(y) = (y−1 choose r−1) p^r (1 − p)^(y−r), y = r, r + 1, r + 2, ...,

and pY(y) = 0 otherwise.

MEAN/VARIANCE: If Y ~ nib(r, p), then

E(Y) = r/p
var(Y) = r(1 − p)/p².

Example 3.18. At an automotive paint plant, 15 percent of all batches sent to the lab for chemical analysis do not conform to specifications. In this situation, we interpret

- batch = trial
- batch does not conform = "success"
- p = P("success") = P(not conforming) = 0.15.

If the Bernoulli trial assumptions hold (independent batches, same probability of nonconforming for each batch), then

Y = the number of batches needed to find the third nonconforming batch ~ nib(r = 3, p = 0.15).

(a) What is the probability the third nonconforming batch is observed on the tenth batch sent to the lab?

pY(10) = P(Y = 10) = (10−1 choose 3−1)(0.15)³(1 − 0.15)^(10−3) = (9 choose 2)(0.15)³(0.85)⁷ ≈ 0.039.

(b) What is the probability that no more than two nonconforming batches will be observed among the first 30 batches sent to the lab? Note: This means the third nonconforming batch must be observed on the 31st batch, the 32nd, the 33rd, etc.

P(Y ≥ 31) = 1 − P(Y ≤ 30) = 1 − ∑ (y = 3 to 30) (y−1 choose 2)(0.15)³(0.85)^(y−3) ≈ 0.151.

Figure 3.6: PMF (left) and CDF (right) of Y ~ nib(r = 3, p = 0.15) in Example 3.18.
NEGATIVE BINOMIAL R CODE: Suppose that Y ~ nib(r, p).

pY(y) = P(Y = y)     FY(y) = P(Y ≤ y)
dnbinom(y-r,r,p)     pnbinom(y-r,r,p)

(Again, R counts the number of failures, y − r, rather than the number of trials y.)

> dnbinom(10-3,3,0.15) ## Part (a)
[1] 0.03895012
> 1-pnbinom(30-3,3,0.15) ## Part (b)
[1] 0.1514006

3.4.4 Hypergeometric distribution

SETTING: Consider a population of N objects and suppose that each object belongs to one of two dichotomous classes: Class 1 and Class 2. For example, the objects (classes) might be people (infected/not), parts (conforming/not), plots of land (respond to treatment/not), etc. In the population of interest, we have

N = total number of objects
r = number of objects in Class 1
N − r = number of objects in Class 2.

Envision taking a sample of n objects from the population (objects are selected at random and without replacement). Define

Y = the number of objects in Class 1 (out of the n selected).

We say that Y has a hypergeometric distribution and write Y ~ hyper(N, n, r).
PMF: If Y ~ hyper(N, n, r), then the probability mass function of Y is given by

pY(y) = (r choose y)(N−r choose n−y) / (N choose n), for y ≤ r and n − y ≤ N − r,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ hyper(N, n, r), then

E(Y) = n(r/N)
var(Y) = n(r/N)((N − r)/N)((N − n)/(N − 1)).

Example 3.19. A supplier ships parts to a company in lots of 100 parts. The company has an acceptance sampling plan which adopts the following acceptance rule:

"...sample 5 parts at random and without replacement. If there are no defectives in the sample, accept the entire lot; otherwise, reject the entire lot."

In this example, the population size is N = 100. The sample size is n = 5. Define the random variable

Y = the number of defectives in the sample ~ hyper(N = 100, n = 5, r).


Figure 3.7: PMF (left) and CDF (right) of Y ~ hyper(N = 100, n = 5, r = 10) in Example 3.19.
(a) If r = 10, what is the probability that the lot will be accepted? Note: The lot will be accepted only if Y = 0.

pY(0) = P(Y = 0) = (10 choose 0)(90 choose 5) / (100 choose 5) = 1(43949268)/75287520 ≈ 0.584.

(b) If r = 10, what is the probability that at least 3 of the 5 parts sampled are defective?

P(Y ≥ 3) = 1 − P(Y ≤ 2)
= 1 − [(10 choose 0)(90 choose 5)/(100 choose 5) + (10 choose 1)(90 choose 4)/(100 choose 5) + (10 choose 2)(90 choose 3)/(100 choose 5)]
≈ 1 − (0.584 + 0.339 + 0.070) = 0.007.


HYPERGEOMETRIC R CODE: Suppose that Y ~ hyper(N, n, r).

pY(y) = P(Y = y)     FY(y) = P(Y ≤ y)
dhyper(y,r,N-r,n)    phyper(y,r,N-r,n)

> dhyper(0,10,100-10,5) ## Part (a)
[1] 0.5837524
> 1-phyper(2,10,100-10,5) ## Part (b)
[1] 0.006637913

3.4.5 Poisson distribution

NOTE: The Poisson distribution is commonly used to model counts, such as

1. the number of customers entering a post office in a given hour
2. the number of α-particles discharged from a radioactive substance in one second
3. the number of machine breakdowns per month
4. the number of insurance claims received per day
5. the number of defects on a piece of raw material.
TERMINOLOGY: In general, we define

Y = the number of occurrences over a unit interval of time (or space).

A Poisson distribution for Y emerges if these occurrences obey the following rules:

(i) The numbers of occurrences in non-overlapping intervals (of time or space) are independent random variables.
(ii) The probability of an occurrence in a sufficiently short interval is proportional to the length of the interval.
(iii) The probability of 2 or more occurrences in a sufficiently short interval is zero.

We say that Y has a Poisson distribution and write Y ~ Poisson(λ). A process that produces occurrences according to these rules is called a Poisson process.

Figure 3.8: PMF (left) and CDF (right) of Y ~ Poisson(λ = 2.5) in Example 3.20.
PMF: If Y ~ Poisson(λ), then the probability mass function of Y is given by

pY(y) = λ^y e^(−λ) / y!, y = 0, 1, 2, ...,

and pY(y) = 0 otherwise.

MEAN/VARIANCE: If Y ~ Poisson(λ), then

E(Y) = λ
var(Y) = λ.

Example 3.20. Let Y denote the number of times per month that a detectable amount of radioactive gas is recorded at a nuclear power plant. Suppose that Y follows a Poisson distribution with mean λ = 2.5 times per month.

(a) What is the probability that there are exactly three times a detectable amount of gas is recorded in a given month?

P(Y = 3) = pY(3) = (2.5)³e^(−2.5)/3! = 15.625e^(−2.5)/6 ≈ 0.214.


(b) What is the probability that there are no more than four times a detectable amount of gas is recorded in a given month?

P(Y ≤ 4) = ∑ (y = 0 to 4) (2.5)^y e^(−2.5)/y!
= (2.5)⁰e^(−2.5)/0! + (2.5)¹e^(−2.5)/1! + (2.5)²e^(−2.5)/2! + (2.5)³e^(−2.5)/3! + (2.5)⁴e^(−2.5)/4!
≈ 0.891.

POISSON R CODE: Suppose that Y ~ Poisson(λ).

pY(y) = P(Y = y)    FY(y) = P(Y ≤ y)
dpois(y,λ)          ppois(y,λ)

> dpois(3,2.5) ## Part (a)
[1] 0.213763
> ppois(4,2.5) ## Part (b)
[1] 0.891178

3.5 Continuous random variables

RECALL: A random variable Y is called continuous if it can assume any value in an interval of real numbers.

- Contrast this with a discrete random variable, whose values can be counted.
- For example, if Y = time (measured in seconds), then the set of all possible values of Y is {y : y > 0}.
- If Y = temperature (measured in deg C), the set of all possible values of Y (ignoring absolute zero and physical upper bounds) might be described as {y : −∞ < y < ∞}.

Neither of these sets of values can be counted.

IMPORTANT: Assigning probabilities to events involving continuous random variables is different than in discrete models. We do not assign positive probability to specific values (e.g., Y = 3, etc.) like we did with discrete random variables. Instead, we assign positive probability to events which are intervals (e.g., 2 < Y < 4, etc.).

TERMINOLOGY: Every continuous random variable we will discuss in this course has a probability density function (pdf), denoted by fY(y). This function has the following characteristics:

1. fY(y) ≥ 0, that is, fY(y) is nonnegative.
2. The area under any pdf is equal to 1, that is,

∫ (from −∞ to ∞) fY(y) dy = 1.

3. If y0 is a specific value of interest, then the cumulative distribution function (cdf) of Y is given by

FY(y0) = P(Y ≤ y0) = ∫ (from −∞ to y0) fY(y) dy.

4. If y1 and y2 are specific values of interest (y1 < y2), then

P(y1 ≤ Y ≤ y2) = ∫ (from y1 to y2) fY(y) dy = FY(y2) − FY(y1).

5. If y0 is a specific value, then P(Y = y0) = 0. In other words, in continuous probability models, specific points are assigned zero probability (see #4 above; this will make perfect mathematical sense). An immediate consequence of this is that if Y is continuous,

P(y1 ≤ Y ≤ y2) = P(y1 ≤ Y < y2) = P(y1 < Y ≤ y2) = P(y1 < Y < y2)

and each is equal to

∫ (from y1 to y2) fY(y) dy.

This is not true if Y has a discrete distribution, because positive probability is assigned to specific values of Y.
Figure 3.9: PDF (left) and CDF (right) of Y in Example 3.21.


IMPORTANT: Evaluating a pdf at a specic value y0 , that is, computing fY (y0 ), does
not give you a probability! This simply gives you the height of the pdf fY (y) at y = y0 .
Example 3.21. Suppose that Y has the pdf

3y 2 , 0 < y < 1
fY (y) =
0, otherwise.
Find the cumulative distribution function (cdf) of Y .
Solution. For 0 < y < 1,

FY (y) =

3t2 dt

fY (t)dt =

y


= t3 = y 3 .

0

Therefore, the cdf of Y is

0,

FY (y) =

y<0

y3, 0 y < 1

1, y 1.

(a) Calculate P(Y < 0.3).

Method 1: PDF:

P(Y < 0.3) = ∫ (from 0 to 0.3) 3y² dy = (0.3)³ − 0³ = 0.027.

Method 2: CDF:

P(Y < 0.3) = FY(0.3) = (0.3)³ = 0.027.

(b) Calculate P(Y > 0.8).

Method 1: PDF:

P(Y > 0.8) = ∫ (from 0.8 to 1) 3y² dy = 1³ − (0.8)³ = 0.488.

Method 2: CDF:

P(Y > 0.8) = 1 − P(Y ≤ 0.8) = 1 − FY(0.8) = 1 − (0.8)³ = 0.488.

(c) Calculate P(0.3 < Y < 0.8).

Method 1: PDF:

P(0.3 < Y < 0.8) = ∫ (from 0.3 to 0.8) 3y² dy = (0.8)³ − (0.3)³ = 0.485.

Method 2: CDF:

P(0.3 < Y < 0.8) = FY(0.8) − FY(0.3) = (0.8)³ − (0.3)³ = 0.485.
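Each of these integrals can be checked numerically in R with integrate(); a minimal sketch (the function name f is ours):

> f <- function(y) 3*y^2        ## pdf from Example 3.21
> integrate(f, 0, 0.3)$value    ## (a)
[1] 0.027
> integrate(f, 0.8, 1)$value    ## (b)
[1] 0.488
> integrate(f, 0.3, 0.8)$value  ## (c)
[1] 0.485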

TERMINOLOGY: Let Y be a continuous random variable with pdf fY(y). The expected value (or mean) of Y is given by

μ = E(Y) = ∫ (from −∞ to ∞) y fY(y) dy.

NOTE: The limits of the integral in this definition, while technically correct, will always be the lower and upper limits corresponding to the nonzero part of the pdf.


FUNCTIONS: Let Y be a continuous random variable with pdf fY(y). Suppose that g is a real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = ∫ (from −∞ to ∞) g(y) fY(y) dy.

TERMINOLOGY: Let Y be a continuous random variable with pdf fY(y) and expected value E(Y) = μ. The population variance of Y is given by

σ² ≡ var(Y) ≡ E[(Y − μ)²] = ∫ (from −∞ to ∞) (y − μ)² fY(y) dy.

The computing formula is still

var(Y) = E(Y²) − [E(Y)]².

The population standard deviation of Y is the positive square root of the variance:

σ = √σ² = √var(Y).

Example 3.22. Find the mean and variance of Y in Example 3.21.
Solution. The mean of Y is

μ = E(Y) = ∫ (from −∞ to ∞) y fY(y) dy = ∫ (from 0 to 1) y · 3y² dy = ∫ (from 0 to 1) 3y³ dy = 3/4.

To find var(Y), we will use the computing formula var(Y) = E(Y²) − [E(Y)]². We already have E(Y) = 3/4. The second moment is

E(Y²) = ∫ (from 0 to 1) y² fY(y) dy = ∫ (from 0 to 1) y² · 3y² dy = ∫ (from 0 to 1) 3y⁴ dy = 3/5.

Therefore,

σ² = var(Y) = E(Y²) − [E(Y)]² = 3/5 − (3/4)² = 0.0375.

The population standard deviation is

σ = √0.0375 ≈ 0.194.


QUANTILES: Suppose that Y is a continuous random variable with cdf FY(y) and let 0 < p < 1. The pth quantile of the distribution of Y, denoted by φp, solves

FY(φp) = P(Y ≤ φp) = ∫ (from −∞ to φp) fY(y) dy = p.

The median of the distribution of Y is the p = 0.5 quantile. That is, the median φ0.5 solves

FY(φ0.5) = P(Y ≤ φ0.5) = ∫ (from −∞ to φ0.5) fY(y) dy = 0.5.

NOTE: Another name for the pth quantile is the 100pth percentile.

REMARK: When Y is discrete, there are some potential problems with the definition that φp solves FY(φp) = P(Y ≤ φp) = p. The reason is that there may be many values of φp that satisfy this equation. By convention, in discrete distributions, the pth quantile φp is taken to be the smallest value satisfying FY(φp) = P(Y ≤ φp) ≥ p.
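For Example 3.21, FY(y) = y³ on 0 < y < 1, so the pth quantile solves φp³ = p, i.e., φp = p^(1/3). A quick numerical illustration in R using uniroot (our own sketch, not from the original notes):

> Fy <- function(y) y^3                            ## cdf from Example 3.21
> uniroot(function(y) Fy(y) - 0.5, c(0, 1))$root   ## median; approximately 0.794
> 0.5^(1/3)                                        ## closed-form answer
[1] 0.7937005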

3.5.1 Exponential distribution

TERMINOLOGY: A random variable Y is said to have an exponential distribution with parameter λ > 0 if its pdf is given by

fY(y) = λe^(−λy), y > 0,

and fY(y) = 0 otherwise. Shorthand notation is Y ~ exponential(λ). Important: The exponential distribution is used to model the distribution of positive quantities (e.g., lifetimes, etc.).
MEAN/VARIANCE: If Y ~ exponential(λ), then

E(Y) = 1/λ
var(Y) = 1/λ².

CDF: Suppose that Y ~ exponential(λ). Then, the cdf of Y exists in closed form and is given by

FY(y) = 0 for y ≤ 0, and FY(y) = 1 − e^(−λy) for y > 0.
Figure 3.10: Exponential pdfs with different values of λ (λ = 1, 1/2, 1/5).

Example 3.23. Extensive experience with fans of a certain type used in diesel engines has suggested that the exponential distribution provides a good model for time until failure (i.e., lifetime). Suppose that the lifetime of a fan, denoted by Y (measured in 10000s of hours), follows an exponential distribution with λ = 0.4.

(a) What is the probability that a fan lasts longer than 30,000 hours?

Method 1: PDF:

P(Y > 3) = ∫ (from 3 to ∞) 0.4e^(−0.4y) dy = e^(−0.4(3)) = e^(−1.2) ≈ 0.301.

Figure 3.11: PDF (left) and CDF (right) of Y ~ exponential(λ = 0.4) in Example 3.23.
Method 2: CDF:

P(Y > 3) = 1 − P(Y ≤ 3) = 1 − FY(3) = 1 − [1 − e^(−0.4(3))] = e^(−1.2) ≈ 0.301.

(b) What is the probability that a fan will last between 20,000 and 50,000 hours?

Method 1: PDF:

P(2 < Y < 5) = ∫ (from 2 to 5) 0.4e^(−0.4y) dy = e^(−0.4(2)) − e^(−0.4(5)) = e^(−0.8) − e^(−2) ≈ 0.314.

Method 2: CDF:

P(2 < Y < 5) = FY(5) − FY(2) = [1 − e^(−0.4(5))] − [1 − e^(−0.4(2))] = e^(−0.8) − e^(−2) ≈ 0.314.
MEMORYLESS PROPERTY: Suppose that Y ~ exponential(λ), and let r and s be positive constants. Then

P(Y > r + s | Y > r) = P(Y > s).

If Y measures time (e.g., time to failure, etc.), then the memoryless property says that the distribution of additional lifetime (s time units beyond time r) is the same as the original distribution of the lifetime. In other words, the fact that Y has "made it" to time r has been "forgotten." For example, in Example 3.23,

P(Y > 5 | Y > 2) = P(Y > 3) ≈ 0.301.
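The memoryless property is easy to verify numerically with pexp (a sketch using λ = 0.4 from Example 3.23):

> (1 - pexp(5, 0.4))/(1 - pexp(2, 0.4))  ## P(Y > 5 | Y > 2)
[1] 0.3011942
> 1 - pexp(3, 0.4)                       ## P(Y > 3); identical, as the property claims
[1] 0.3011942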

POISSON RELATIONSHIP: Suppose that we are observing occurrences over time according to a Poisson distribution with rate λ. Define the random variable

W = the time until the first occurrence.

Then, W ~ exponential(λ). It is also true that the time between any two occurrences in a Poisson process follows this same exponential distribution (these are called interarrival times).
Example 3.24. Suppose that customers arrive at a check-out according to a Poisson process with mean λ = 12 per hour. What is the probability that we will have to wait longer than 10 minutes to see the first customer? Note: 10 minutes is 1/6th of an hour.
Solution. The time until the first arrival, say W, follows an exponential distribution with λ = 12, so the cdf of W, for w > 0, is

FW(w) = 1 − e^(−12w).

The desired probability is

P(W > 1/6) = 1 − P(W ≤ 1/6) = 1 − F_W(1/6) = 1 − [1 − e^{−12(1/6)}] = e^{−2} ≈ 0.135.
EXPONENTIAL R CODE: Suppose that Y ∼ exponential(λ).

F_Y(y) = P(Y ≤ y)   pexp(y,λ)
φ_p                 qexp(p,λ)

> 1-pexp(1/6,12)  ## Example 3.24
[1] 0.1353353
> qexp(0.9,12)  ## 0.9 quantile
[1] 0.1918821

NOTE: The command qexp(0.9,12) gives the 0.90 quantile (90th percentile) of the exponential(λ = 12) distribution. In Example 3.24, this means that 90 percent of the waiting times will be less than approximately 0.192 hours (only 10 percent will exceed this value).

3.5.2 Gamma distribution

TERMINOLOGY: The gamma function is a real function of t, defined by

Γ(t) = ∫_0^∞ y^{t−1} e^{−y} dy,

for all t > 0. The gamma function satisfies the recursive relationship

Γ(α) = (α − 1)Γ(α − 1), for α > 1.

Therefore, if α is an integer, then Γ(α) = (α − 1)!.
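R evaluates the gamma function directly via gamma(); e.g., Γ(4) = 3! = 6 and Γ(1/2) = √π:

> gamma(4)
[1] 6
> gamma(0.5)  ## sqrt(pi)
[1] 1.772454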
Figure 3.12: Gamma pdfs with different values of α and λ (α = 1.5, λ = 1/2; α = 2, λ = 1/3; α = 2.5, λ = 1/5).

TERMINOLOGY: A random variable Y is said to have a gamma distribution with parameters α > 0 and λ > 0 if its pdf is given by

f_Y(y) = [λ^α/Γ(α)] y^{α−1} e^{−λy} for y > 0, and f_Y(y) = 0 otherwise.
Shorthand notation is Y ∼ gamma(α, λ).
By changing the values of α and λ, the gamma pdf can assume many shapes. This makes the gamma distribution popular for modeling positive random variables (it is more flexible than the exponential).
Note that when α = 1, the gamma pdf reduces to the exponential(λ) pdf.


MEAN/VARIANCE: If Y ∼ gamma(α, λ), then

E(Y) = α/λ   and   var(Y) = α/λ².

CDF: The cdf of a gamma random variable does not exist in closed form. Therefore,
probabilities involving gamma random variables and gamma quantiles must be computed
numerically (e.g., using R, etc.).
GAMMA R CODE: Suppose that Y ∼ gamma(α, λ).

F_Y(y) = P(Y ≤ y)   pgamma(y,α,λ)
φ_p                 qgamma(p,α,λ)

Example 3.25. When a certain transistor is subjected to an accelerated life test, the lifetime Y (in weeks) is well modeled by a gamma distribution with α = 4 and λ = 1/6.
(a) Find the probability that a transistor will last at least 50 weeks.

P(Y ≥ 50) = 1 − P(Y < 50) = 1 − F_Y(50)
          = 1-pgamma(50,4,1/6)
          = 0.0337734.

(b) Find the probability that a transistor will last between 12 and 24 weeks.

P(12 ≤ Y ≤ 24) = F_Y(24) − F_Y(12)
               = pgamma(24,4,1/6)-pgamma(12,4,1/6)
               = 0.4236533.

(c) Twenty percent of the transistor lifetimes will be below which time? Note: I am asking for the 0.20 quantile (20th percentile) of the lifetime distribution.

> qgamma(0.2,4,1/6)
[1] 13.78072

PAGE 48

STAT 509, J. TEBBS

F(y)
0.0

0.00

0.2

0.01

0.4

f(y)

0.02

0.6

0.03

0.8

1.0

0.04

CHAPTER 3

10

20

30

40

50

60

70

10

20

30

40

50

60

70

Figure 3.13: PDF (left) and CDF (right) of Y ∼ gamma(α = 4, λ = 1/6) in Example 3.25.
3.5.3 Normal distribution

TERMINOLOGY: A random variable Y is said to have a normal distribution if its pdf is given by

f_Y(y) = [1/(σ√(2π))] e^{−(1/2)[(y−μ)/σ]²},  −∞ < y < ∞.

Shorthand notation is Y ∼ N(μ, σ²). Another name for the normal distribution is the Gaussian distribution.
MEAN/VARIANCE: If Y ∼ N(μ, σ²), then

E(Y) = μ   and   var(Y) = σ².
REMARK: The normal distribution serves as a very good model for a wide range of measurements; e.g., reaction times, fill amounts, part dimensions, weights/heights, measures of intelligence/test scores, economic indicators, etc.
Figure 3.14: Normal pdfs with different values of μ and σ² (μ = 0, σ = 1; μ = 2, σ = 2; μ = 1, σ = 3).

CDF: The cdf of a normal random variable does not exist in closed form. Probabilities
involving normal random variables and normal quantiles can be computed numerically
(e.g., using R, etc.).
NORMAL R CODE: Suppose that Y ∼ N(μ, σ²).

F_Y(y) = P(Y ≤ y)   pnorm(y,μ,σ)
φ_p                 qnorm(p,μ,σ)

Example 3.26. The time it takes for a driver to react to the brake lights on a decelerating vehicle is critical in helping to avoid rear-end collisions. A recently published study suggests that this time during in-traffic driving, denoted by Y (measured in seconds), follows a normal distribution with mean μ = 1.5 and variance σ² = 0.16.
Figure 3.15: PDF (left) and CDF (right) of Y ∼ N(μ = 1.5, σ² = 0.16) in Example 3.26.
(a) What is the probability that reaction time is less than 1 second?

P(Y < 1) = F_Y(1)
         = pnorm(1,1.5,sqrt(0.16))
         = 0.1056498.

(b) What is the probability that reaction time is between 1.1 and 2.5 seconds?

P(1.1 ≤ Y ≤ 2.5) = F_Y(2.5) − F_Y(1.1)
                 = pnorm(2.5,1.5,sqrt(0.16))-pnorm(1.1,1.5,sqrt(0.16))
                 = 0.835135.
(c) Five percent of all reaction times will exceed which time? Note: I am asking for
the 0.95 quantile (95th percentile) of the reaction time distribution.
> qnorm(0.95,1.5,sqrt(0.16))
[1] 2.157941


EMPIRICAL RULE: For any N(μ, σ²) distribution,
about 68% of the distribution is between μ − σ and μ + σ
about 95% of the distribution is between μ − 2σ and μ + 2σ
about 99.7% of the distribution is between μ − 3σ and μ + 3σ.
This is also called the "68-95-99.7% rule." This rule allows us to make statements like this (referring to Example 3.26, where μ = 1.5 and σ = 0.4):
About 68 percent of all reaction times will be between 1.1 and 1.9 seconds.
About 95 percent of all reaction times will be between 0.7 and 2.3 seconds.
About 99.7 percent of all reaction times will be between 0.3 and 2.7 seconds.
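These percentages come straight from the standard normal cdf; a quick check in R:

> pnorm(1)-pnorm(-1)  ## within 1 standard deviation
[1] 0.6826895
> pnorm(2)-pnorm(-2)  ## within 2 standard deviations
[1] 0.9544997
> pnorm(3)-pnorm(-3)  ## within 3 standard deviations
[1] 0.9973002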

TERMINOLOGY: A random variable Z is said to have a standard normal distribution if its pdf is given by

f_Z(z) = [1/√(2π)] e^{−z²/2},  −∞ < z < ∞.

Shorthand notation is Z ∼ N(0, 1). A standard normal distribution is a special normal distribution, that is, a normal distribution with mean μ = 0 and variance σ² = 1. The variable Z is called a standard normal random variable.
RESULT: If Y ∼ N(μ, σ²), then

Z = (Y − μ)/σ ∼ N(0, 1).

The result says that Z follows a standard normal distribution; i.e., Z ∼ N(0, 1). In this context, Z is called the standardized value of Y.
Important: Therefore, any normal random variable Y ∼ N(μ, σ²) can be converted to a standard normal random variable Z by applying this transformation.


IMPLICATION: Any probability calculation involving a normal random variable Y ∼ N(μ, σ²) can be transformed into a calculation involving Z ∼ N(0, 1). More specifically, if Y ∼ N(μ, σ²), then

P(y₁ < Y < y₂) = P[(y₁ − μ)/σ < Z < (y₂ − μ)/σ] = F_Z[(y₂ − μ)/σ] − F_Z[(y₁ − μ)/σ].
Q: Why is this important?
Because it is common for textbooks (like yours) to table the cumulative distribution function F_Z(z) for various values of z. See Table 1 (pp 593-594, VK). Therefore, the preceding standardization result makes it possible to find probabilities involving normal random variables "by hand" without using calculus or software.
Because R will compute normal probabilities directly, I view hand calculation to be unnecessary and outdated. See the examples in the text (Section 3.5) for illustrations of this approach; a quick numerical check of the equivalence is given below.
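For instance, referring to Example 3.26, standardizing inside pnorm gives the same answer as supplying μ and σ directly:

> pnorm(1,1.5,0.4)  ## P(Y < 1) directly
[1] 0.1056498
> pnorm((1-1.5)/0.4)  ## P(Z < -1.25) after standardizing
[1] 0.1056498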

3.6 Reliability and lifetime distributions

TERMINOLOGY: Reliability analysis deals with failure time (i.e., lifetime, time-to-event) data. For example,

T = time from start of product service until failure
T = time from sale of a product until a warranty claim
T = number of hours in use/cycles until failure.

We call T a lifetime random variable if it measures the time to an event; e.g., failure, death, eradication of some infection/condition, etc. Engineers are often involved with reliability studies in practice, because reliability is related to product quality.

NOTE: There are many well-known lifetime distributions, including
exponential
Weibull
lognormal
others: gamma, inverse Gaussian, Gompertz-Makeham, Birnbaum-Saunders, extreme value, log-logistic, etc.
The normal (Gaussian) distribution is rarely used to model lifetime variables.

3.6.1 Weibull distribution

TERMINOLOGY: A random variable T is said to have a Weibull distribution with parameters β > 0 and η > 0 if its pdf is given by

f_T(t) = (β/η)(t/η)^{β−1} e^{−(t/η)^β} for t > 0, and f_T(t) = 0 otherwise.
Shorthand notation is T ∼ Weibull(β, η). We call

β = shape parameter
η = scale parameter.

By changing the values of β and η, the Weibull pdf can assume many shapes. The Weibull distribution is very popular among engineers in reliability applications.
Note that when β = 1, the Weibull pdf reduces to the exponential(λ = 1/η) pdf.
MEAN/VARIANCE: If T ∼ Weibull(β, η), then

E(T) = η Γ(1 + 1/β)
var(T) = η² { Γ(1 + 2/β) − [Γ(1 + 1/β)]² }.

Figure 3.16: Weibull pdfs with different values of β and η (β = 2, η = 5; β = 2, η = 10; β = 3, η = 10).

CDF: Suppose that T ∼ Weibull(β, η). Then, the cdf of T exists in closed form and is given by

F_T(t) = 0 for t ≤ 0, and F_T(t) = 1 − e^{−(t/η)^β} for t > 0.

Example 3.27. Suppose that the lifetime of a rechargeable battery, denoted by T (measured in hours), follows a Weibull distribution with parameters β = 2 and η = 10.
(a) What is the mean time to failure?

E(T) = 10 Γ(3/2) ≈ 8.862 hours.

(b) What is the probability that a battery is still functional at time t = 20?

P(T ≥ 20) = 1 − P(T < 20) = 1 − F_T(20) = 1 − [1 − e^{−(20/10)²}] ≈ 0.018.

Figure 3.17: PDF (left) and CDF (right) of T ∼ Weibull(β = 2, η = 10) in Example 3.27.
(c) What is the probability that a battery is still functional at time t = 20, given that the battery is functional at time t = 10?

P(T ≥ 20 | T ≥ 10) = P(T ≥ 20 and T ≥ 10)/P(T ≥ 10) = P(T ≥ 20)/P(T ≥ 10)
                   = [1 − F_T(20)]/[1 − F_T(10)] = e^{−(20/10)²}/e^{−(10/10)²} ≈ 0.050.

(d) What is the 99th percentile of this lifetime distribution? We set

F_T(φ_{0.99}) = 1 − e^{−(φ_{0.99}/10)²} = 0.99.

Solving for φ_{0.99} gives φ_{0.99} ≈ 21.460 hours. Only one percent of the battery lifetimes will exceed this value.
WEIBULL R CODE: Suppose that T ∼ Weibull(β, η).

F_T(t) = P(T ≤ t)   pweibull(t,β,η)
φ_p                 qweibull(p,β,η)


> 10*gamma(3/2)  ## Part (a)
[1] 8.86227
> 1-pweibull(20,2,10)  ## Part (b)
[1] 0.01831564
> (1-pweibull(20,2,10))/(1-pweibull(10,2,10))  ## Part (c)
[1] 0.04978707
> qweibull(0.99,2,10)  ## Part (d)
[1] 21.45966

3.6.2 Reliability functions

DESCRIPTION: We now describe some different, but equivalent, ways of defining the distribution of a (continuous) lifetime random variable T.
The cumulative distribution function (cdf):

F_T(t) = P(T ≤ t).

This can be interpreted as the proportion of units that have failed by time t.
The survivor function:

S_T(t) = P(T > t) = 1 − F_T(t).

This can be interpreted as the proportion of units that have not failed by time t; e.g., the unit is still functioning, a warranty claim has not been made, etc.
The probability density function (pdf):

f_T(t) = (d/dt) F_T(t) = −(d/dt) S_T(t).

Also, recall that

F_T(t) = ∫_0^t f_T(u) du   and   S_T(t) = ∫_t^∞ f_T(u) du.
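For a distribution R already knows, the survivor function is just one minus the cdf. A minimal sketch for the Weibull(β = 2, η = 10) case from Example 3.27:

> S <- function(t) 1 - pweibull(t, 2, 10)  ## survivor function S_T(t)
> S(20)  ## matches P(T >= 20) from Example 3.27(b)
[1] 0.01831564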

PAGE 57

CHAPTER 3

STAT 509, J. TEBBS

TERMINOLOGY: The hazard function is defined as

h_T(t) = lim_{ε→0} P(t ≤ T < t + ε | T ≥ t)/ε.

The hazard function is not a probability; rather, it is a probability rate. Therefore, it is possible that a hazard function may exceed one.
REMARK: The hazard function (or hazard rate) is a very important characteristic of a lifetime distribution. It indicates the way the risk of failure varies with time. Distributions with increasing hazard functions are seen in units for which some kind of aging or "wear out" takes place. Certain types of units (e.g., electronic devices) may display a decreasing hazard function, at least in the early stages of their lifetimes.
NOTE: It is insightful to note that

h_T(t) = lim_{ε→0} P(t ≤ T < t + ε | T ≥ t)/ε
       = lim_{ε→0} P(t ≤ T < t + ε)/[ε P(T ≥ t)]
       = [1/P(T ≥ t)] lim_{ε→0} [F_T(t + ε) − F_T(t)]/ε = f_T(t)/S_T(t).

We can therefore describe the distribution of the continuous lifetime random variable T by using either f_T(t), F_T(t), S_T(t), or h_T(t).
Example 3.28. In this example, we find the hazard function for T ∼ Weibull(β, η). Recall that the pdf of T is

f_T(t) = (β/η)(t/η)^{β−1} e^{−(t/η)^β} for t > 0, and f_T(t) = 0 otherwise.

The cdf of T is

F_T(t) = 0 for t ≤ 0, and F_T(t) = 1 − e^{−(t/η)^β} for t > 0.

The survivor function of T is

S_T(t) = 1 − F_T(t) = 1 for t ≤ 0, and S_T(t) = e^{−(t/η)^β} for t > 0.

Figure 3.18: Weibull hazard functions with η = 1. Upper left: β = 3. Upper right: β = 1.5. Lower left: β = 1. Lower right: β = 0.5.

Therefore, the hazard function, for t > 0, is

h_T(t) = f_T(t)/S_T(t) = [(β/η)(t/η)^{β−1} e^{−(t/η)^β}] / e^{−(t/η)^β} = (β/η)(t/η)^{β−1}.

Plots of Weibull hazard functions are given in Figure 3.18. It is easy to show that
h_T(t) is increasing if β > 1 ("wear out"; the population of units gets weaker with aging)
h_T(t) is constant if β = 1 (constant hazard; exponential distribution)
h_T(t) is decreasing if β < 1 ("infant mortality"; the population of units gets stronger with aging).
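The closed form above is easy to code up directly; a small sketch (the function name hweibull is ours, not built into R):

> hweibull <- function(t, beta, eta) (beta/eta) * (t/eta)^(beta-1)
> hweibull(c(1, 5, 10), beta=2, eta=10)  ## increasing when beta > 1
[1] 0.02 0.10 0.20
> hweibull(c(1, 5, 10), beta=1, eta=10)  ## constant when beta = 1
[1] 0.1 0.1 0.1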


Example 3.29. We consider the data in Example 3.23 of Vining and Kowalski (pp 162). The data are times, denoted by T (measured in months), to the first failure for 20 electric carts used for internal delivery and transportation in a large manufacturing facility.

 0.9   1.5   2.3   3.2   3.9   5.0   6.2   7.5   8.3  10.4
11.1  12.6  15.0  16.3  19.3  22.6  24.8  31.5  38.1  53.0

From these data, maximum likelihood estimates of β and η are computed to be

β̂ ≈ 1.110   and   η̂ ≈ 15.271.

Note: Maximum likelihood estimation is a mathematical technique used to find values of β and η that "most closely agree" with the observed data. R computes these estimates automatically. In Figure 3.19, we display the (estimated) PDF f_T(t), CDF F_T(t), survivor function S_T(t), and hazard function h_T(t) based on these estimates.
REMARK: Note that the estimate β̂ ≈ 1.110 is larger than 1. This suggests that there is "wear out" taking place among the carts; that is, the population of carts gets weaker as time passes.
(a) Using the estimated Weibull(β̂ ≈ 1.110, η̂ ≈ 15.271) distribution as a model for future cart lifetimes, find the probability that a cart will still be functioning after t = 36 months.

P(T ≥ 36) = 1 − P(T < 36) = 1 − F_T(36) = 1 − [1 − e^{−(36/15.271)^{1.110}}] ≈ 0.075.

(b) Use the estimated distribution to find the 90th percentile of the cart lifetimes. We set

F_T(φ_{0.90}) = 1 − e^{−(φ_{0.90}/15.271)^{1.110}} = 0.90.

Solving for φ_{0.90} gives φ_{0.90} ≈ 32.373 months. Only ten percent of the cart lifetimes will exceed this value.

Figure 3.19: Weibull functions with β̂ ≈ 1.110 and η̂ ≈ 15.271. Upper left: PDF. Upper right: CDF. Lower left: Survivor function. Lower right: Hazard function.

> 1-pweibull(36,1.110,15.271)  ## Part (a)
[1] 0.07497392
> qweibull(0.90,1.110,15.271)  ## Part (b)
[1] 32.37337

TERMINOLOGY: A quantile-quantile plot (qq plot) is a graphical display that can help assess the appropriateness of a distribution. Here is how the plot is constructed:
On the vertical axis, we plot the observed data, ordered from low to high.

Figure 3.20: Weibull qq plot for the electric cart data in Example 3.29. The observed data are plotted versus the theoretical quantiles from a Weibull distribution with β̂ ≈ 1.110 and η̂ ≈ 15.271.
On the horizontal axis, we plot the (ordered) theoretical quantiles from the distribution (model) assumed for the observed data.
Our intuition should suggest the following:
If the observed data agree with the distribution's theoretical quantiles, then the qq plot should look like a straight line (the distribution is a good choice).
If the observed data do not agree with the theoretical quantiles, then the qq plot should have curvature in it (the distribution is not a good choice).
Interpretation: The Weibull qq plot in Figure 3.20 looks like a straight line. This suggests that a Weibull distribution is a good fit for the electric cart lifetime data.
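A Weibull qq plot like Figure 3.20 can be built from scratch in R; a minimal sketch, assuming the 20 failure times are stored in a vector named cart (our name, not from the text):

> n <- length(cart)
> plot(qweibull(ppoints(n), shape=1.110, scale=15.271), sort(cart),
+      xlab="Weibull percentiles", ylab="Observed values")

Here ppoints(n) supplies the standard plotting positions at which the theoretical quantiles are evaluated.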
4 Statistical Inference

Complementary reading: Chapter 4 (VK); Sections 4.1-4.8. Read also Sections 3.6-3.7.

4.1 Populations and samples

OVERVIEW : This chapter is about statistical inference. This deals with making
(probabilistic) statements about a population of individuals based on information that
is contained in a sample taken from the population.
Example 4.1. Suppose that we wish to study the performance of lithium batteries used in a certain calculator. The purpose of our study is to determine the mean lifetime of these batteries so that we can place a limited warranty on them in the future. Since this type of battery has not been used in this calculator before, no one (except the Oracle) can tell us the distribution of Y, the battery's lifetime. In fact, not only is the distribution not known, but all parameters which index this distribution aren't known either.
TERMINOLOGY : A population refers to the entire group of individuals (e.g., parts,
people, batteries, etc.) about which we would like to make a statement (e.g., proportion
defective, median weight, mean lifetime, etc.).
It is generally accepted that the entire population cannot be measured. It is too large and/or it would be too time consuming to do so.
To draw inferences (make probabilistic statements) about a population, we therefore
observe a sample of individuals from the population.
We will assume that the sample of individuals constitutes a random sample.
Mathematically, this means that all observations are independent and follow the
same probability distribution. Informally, this means that each sample (of the same
size) has the same chance of being selected. Our hope is that a random sample of
individuals is representative of the entire population of individuals.

PAGE 63

CHAPTER 4

STAT 509, J. TEBBS

NOTATION: We will denote a random sample of observations by

Y_1, Y_2, ..., Y_n.

That is, Y_1 is the value of Y for the first individual in the sample, Y_2 is the value of Y for the second individual in the sample, and so on. The sample size tells us how many individuals are in the sample and is denoted by n. Statisticians refer to the set of observations Y_1, Y_2, ..., Y_n generically as "data." Lower case notation y_1, y_2, ..., y_n is used when citing numerical values (or when referring to realizations of the upper case versions).
BATTERY DATA: Consider the following random sample of n = 50 battery lifetimes y_1, y_2, ..., y_50 (measured in hours):

4285  2066  2584  1009   318  1429   981  1402  1137   414
 564   604    14  4152   737   852  1560  1786   520   396
1278   209   349   478  3032  1461   701  1406   261    83
 205   602  3770   726  3894  2662   497    35  2778  1379
3920  1379    99   510   582   308  3367    99   373   454

In Figure 4.1, we display a histogram of the battery lifetime data. We see that the (empirical) distribution of the battery lifetimes is skewed towards the high side.
Which continuous probability distribution seems to display the same type of pattern that we see in the histogram?
An exponential(λ) model seems reasonable here (based on the histogram shape). What is λ?
In this example, λ is called a (population) parameter. It describes the theoretical distribution which is used to model the entire population of battery lifetimes.
In general, (population) parameters which index probability distributions (like the exponential) are unknown.
All of the probability distributions that we discussed in Chapter 3 are meant to describe (model) population behavior.
Figure 4.1: Histogram of battery lifetime data (measured in hours).

4.2 Parameters and statistics

TERMINOLOGY: A parameter is a numerical quantity that describes a population. In general, population parameters are unknown. Some very common examples are:

μ = population mean
σ² = population variance
p = population proportion.
CONNECTION: All of the probability distributions that we talked about in Chapter 3 were indexed by population (model) parameters. For example,
the N(μ, σ²) distribution is indexed by two parameters, the population mean μ and the population variance σ².
the Poisson(λ) distribution is indexed by one parameter, the population mean λ.
the Weibull(β, η) distribution is indexed by two parameters, the shape parameter β and the scale parameter η.
the b(n, p) distribution is indexed by one parameter, the population proportion of "successes" p.

TERMINOLOGY: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a population. The sample mean is

Ȳ = (1/n) Σ_{i=1}^{n} Y_i.

The sample variance is

S² = [1/(n − 1)] Σ_{i=1}^{n} (Y_i − Ȳ)².

The sample standard deviation is the positive square root of the sample variance; i.e.,

S = √(S²) = √{ [1/(n − 1)] Σ_{i=1}^{n} (Y_i − Ȳ)² }.

Important: Unlike their population analogues, these quantities can be computed from a sample of data Y_1, Y_2, ..., Y_n.
TERMINOLOGY: A statistic is a numerical quantity that can be calculated from a sample of data. Some very common examples are:

Ȳ = sample mean
S² = sample variance
p̂ = sample proportion.

For example, with the battery lifetime data (a random sample of n = 50 lifetimes),

ȳ = 1274.14 hours
s² = 1505156 (hours)²
s ≈ 1226.85 hours.

> mean(battery)  ## sample mean
[1] 1274.14
> var(battery)  ## sample variance
[1] 1505156
> sd(battery)  ## sample standard deviation
[1] 1226.848
SUMMARY: The table below succinctly summarizes the salient differences between a population and a sample (a parameter and a statistic):

Group of individuals        Numerical quantity   Status
Population (Not observed)   Parameter            Unknown
Sample (Observed)           Statistic            Calculated from sample data

Statistical inference deals with making (probabilistic) statements about a population of individuals based on information that is contained in a sample taken from the population. We do this by
(a) estimating unknown population parameters with sample statistics
(b) quantifying the uncertainty (variability) that arises in the estimation process.
These are both necessary to construct confidence intervals and to perform hypothesis tests, two important exercises discussed in this chapter.

4.3 Point estimators and sampling distributions

NOTATION: To keep our discussion as general as possible (as the material in this subsection can be applied to many situations), we will let θ denote a population parameter. For example, θ could denote a population mean, a population variance, a population proportion, a Weibull or gamma model parameter, etc. It could also denote a parameter in a regression context (Chapters 6-7).

TERMINOLOGY: A point estimator θ̂ is a statistic that is used to estimate a population parameter θ. Common examples of point estimators are:

Ȳ, a point estimator for μ (population mean)
S², a point estimator for σ² (population variance)
S, a point estimator for σ (population standard deviation).

CRUCIAL POINT: It is important to note that, in general, an estimator θ̂ is a statistic, so it depends on the sample of data Y_1, Y_2, ..., Y_n.
The data Y_1, Y_2, ..., Y_n come from the sampling process; e.g., different random samples will yield different data sets Y_1, Y_2, ..., Y_n.
In this light, because the sample values Y_1, Y_2, ..., Y_n will vary from sample to sample, the value of θ̂ will too! It therefore makes sense to think about all possible values of θ̂; that is, the distribution of θ̂.
TERMINOLOGY: The distribution of an estimator θ̂ (a statistic) is called its sampling distribution. A sampling distribution describes mathematically how θ̂ would vary in repeated sampling. We will study many sampling distributions in this chapter.
TERMINOLOGY: We say that θ̂ is an unbiased estimator of θ if and only if

E(θ̂) = θ.

In other words, the mean of the sampling distribution of θ̂ is equal to θ. Note that unbiasedness is a characteristic describing the center of a sampling distribution. This deals with accuracy.
RESULT: Mathematics shows that when Y_1, Y_2, ..., Y_n is a random sample,

E(Ȳ) = μ   and   E(S²) = σ².

That is, Ȳ and S² are unbiased estimators of their population analogues.
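Unbiasedness is easy to see by simulation; a quick Monte Carlo sketch (the N(1.5, 0.16) population and n = 10 are our choices, purely for illustration):

> s2 <- replicate(10000, var(rnorm(10, mean=1.5, sd=0.4)))  ## 10000 sample variances
> mean(s2)  ## long-run average; should be close to sigma^2 = 0.16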

GOAL: Not only do we desire to use point estimators θ̂ which are unbiased, but we would also like for them to have small variability. In other words, when θ̂ misses θ, we would like for it to not miss by much. This deals with precision.
MAIN POINT: Accuracy and precision are the two main mathematical characteristics that arise when evaluating the quality of a point estimator θ̂. We desire point estimators θ̂ which are unbiased (perfectly accurate) and have small variance (highly precise).
TERMINOLOGY: The standard error of a point estimator θ̂ is equal to

se(θ̂) = √var(θ̂).

In other words, the standard error is equal to the standard deviation of the sampling distribution of θ̂. An estimator's standard error measures the amount of variability in the point estimator θ̂. Therefore,

smaller se(θ̂)  ⟺  θ̂ more precise.

4.4 Sampling distributions involving Ȳ

NOTE: This subsection summarizes Sections 3.6-3.7 (VK).
Result 1: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution. The sample mean Ȳ has the following sampling distribution:

Ȳ ∼ N(μ, σ²/n).

This result reminds us that

E(Ȳ) = μ.

That is, the sample mean Ȳ is an unbiased estimator of the population mean μ. This result also shows that the standard error of Ȳ (as a point estimator) is

se(Ȳ) = √var(Ȳ) = √(σ²/n) = σ/√n.
Figure 4.2: Braking time example. Population distribution: Y ∼ N(μ = 1.5, σ² = 0.16). Also depicted are the sampling distributions of Ȳ when n = 5 and n = 25.

Example 4.2. In Example 3.26 (notes), we examined the distribution of

Y = time (in seconds) to react to brake lights during in-traffic driving.

We assumed that

Y ∼ N(μ = 1.5, σ² = 0.16).

We call this the population distribution, because it describes the distribution of values of Y for all individuals in the population (here, in-traffic drivers).
Question. Suppose that we take a random sample of n = 5 drivers with times Y_1, Y_2, ..., Y_5. What is the distribution of the sample mean Ȳ?
Solution. If the sample size is n = 5, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ∼ N(μ, σ²/n) = N(1.5, 0.032).

This distribution describes the values of Ȳ we would expect to see in repeated sampling, that is, if we repeatedly sampled n = 5 individuals from this population of in-traffic drivers and calculated the sample mean Ȳ each time.
Question. Suppose that we take a random sample of n = 25 drivers with times Y_1, Y_2, ..., Y_25. What is the distribution of the sample mean Ȳ?
Solution. If the sample size is n = 25, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ∼ N(μ, σ²/n) = N(1.5, 0.0064).

The sampling distribution of Ȳ when n = 5 and when n = 25 is shown in Figure 4.2.

4.4.1 Central Limit Theorem

Result 2: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a population distribution with mean μ and variance σ² (not necessarily a normal distribution). When the sample size n is large, we have

Ȳ ∼ AN(μ, σ²/n).

The symbol AN is read "approximately normal." This result is called the Central Limit Theorem (CLT).
Result 1 guarantees that when the underlying population distribution is N(μ, σ²), the sample mean Ȳ ∼ N(μ, σ²/n) exactly.
The Central Limit Theorem (Result 2) says that even if the population distribution is not normal (Gaussian), the sampling distribution of the sample mean Ȳ will be approximately normal (Gaussian) when the sample size is sufficiently large.

Example 4.3. The time to death for rats injected with a toxic substance, denoted by Y (measured in days), follows an exponential distribution with λ = 1/5. That is, Y ∼ exponential(λ = 1/5).
Figure 4.3: Rat death times. Population distribution: Y ∼ exponential(λ = 1/5). Also depicted are the sampling distributions of Ȳ when n = 5 and n = 25.

This is the population distribution, that is, this distribution describes the time to death for all rats in the population.
In Figure 4.3, I have shown the exponential(1/5) population distribution (solid curve). I have also depicted the theoretical sampling distributions of Ȳ when n = 5 and when n = 25.
Main point: Notice how the sampling distribution of Ȳ begins to (albeit distantly) resemble a normal distribution when n = 5. When n = 25, the sampling distribution of Ȳ looks very much to be normal (Gaussian). This is precisely what is conferred by the CLT. The larger the sample size n, the better a normal (Gaussian) distribution approximates the true sampling distribution of Ȳ.

Example 4.4. When a batch of a certain chemical product is prepared, the amount of a particular impurity in the batch (measured in grams) is a random variable Y with the following population parameters:

μ = 4.0 g
σ² = (1.5 g)².

Suppose that n = 50 batches are prepared (independently). What is the probability that the sample mean impurity amount Ȳ is greater than 4.2 grams?
Solution. With n = 50, μ = 4, and σ² = (1.5)², the CLT says that

Ȳ ∼ AN(μ, σ²/n) = AN(4, 0.045).

Therefore,

P(Ȳ > 4.2) = 1 − P(Ȳ < 4.2) ≈ 1-pnorm(4.2,4,sqrt(0.045)) = 0.1728893.

Important: Note that in making this (approximate) probability calculation, we never made an assumption about the underlying population distribution shape.
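A simulation shows how well the CLT approximation works here. The sketch below assumes (purely for illustration; the example does not specify a distribution) a gamma-shaped population with the stated mean and variance:

> shape <- 4^2/1.5^2; rate <- 4/1.5^2  ## gamma parameters matching mu = 4, sigma = 1.5
> ybars <- replicate(10000, mean(rgamma(50, shape, rate)))
> mean(ybars > 4.2)  ## proportion of sample means above 4.2; compare to 0.1729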

4.4.2 t distribution

Result 3: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution. Result 1 says the sample mean Ȳ has the following sampling distribution:

Ȳ ∼ N(μ, σ²/n).

If we standardize Ȳ, we obtain

Z = (Ȳ − μ)/(σ/√n) ∼ N(0, 1).

Replacing the population standard deviation σ with the sample standard deviation S, we get a new sampling distribution:

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1),

a t distribution with degrees of freedom ν = n − 1.


Figure 4.4: Probability density functions of N(0, 1), t(2), and t(10).

FACTS: The t distribution has the following characteristics:
It is continuous and symmetric about 0 (just like the standard normal distribution).
It is indexed by a value ν called the degrees of freedom.
In practice, ν is often an integer (related to the sample size).
As ν → ∞, t(ν) → N(0, 1); thus, when ν becomes larger, the t(ν) and the N(0, 1) distributions look more alike.
When compared to the standard normal distribution, the t distribution, in general, is less peaked and has more probability (area) in the tails.
The t pdf formula is complicated and is unnecessary for our purposes. R will compute probabilities and quantiles from the t distribution.

t R CODE: Suppose that T ∼ t(ν).

F_T(t) = P(T ≤ t)   pt(t,ν)
φ_p                 qt(p,ν)

Example 4.5. Hollow pipes are to be used in an electrical wiring project. In testing 1-inch pipes, the data below were collected by a design engineer. The data are measurements of Y, the outside diameter of this type of pipe (measured in inches). These n = 25 pipes were randomly selected and measured, all in the same location.

1.296  1.320  1.311  1.298  1.315
1.305  1.278  1.294  1.311  1.290
1.284  1.287  1.289  1.292  1.301
1.298  1.287  1.302  1.304  1.301
1.313  1.315  1.306  1.289  1.291

From their extensive experience, the manufacturers of this pipe claim that the population distribution is normal (Gaussian) and that the mean outside diameter is μ = 1.29 inches. Under this assumption (which may or may not be true), calculate the value of

t = (ȳ − μ)/(s/√n).

Solution. We use R to find the sample mean ȳ and the sample standard deviation s:

> mean(pipes)  ## sample mean
[1] 1.29908
> sd(pipes)  ## sample standard deviation
[1] 0.01108272

With n = 25, we have

t = (1.29908 − 1.29)/(0.01108272/√25) ≈ 4.096.
Figure 4.5: t(24) probability density function. A mark at t = 4.096 has been added.

Analysis. If the manufacturer's claim is true (that is, if μ = 1.29 inches), then

t = (Ȳ − μ)/(S/√n)

comes from a t(24) distribution. The t(24) pdf is displayed above in Figure 4.5.
Key question: Does t = 4.096 seem like a value you would expect to see from this distribution? If not, what might this suggest? Recall that t was computed under the assumption that μ = 1.29 inches (the manufacturer's claim).
Question. The value t = 4.096 is what percentile of the t(24) distribution?
> pt(4.096,24)
[1] 0.9997934
Answer: t = 4.096 is approximately the 99.98th percentile of the t(24) distribution.
4.4.3 Normal quantile-quantile (qq) plots

IMPORTANT: Result 3 says that if Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution, then

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1).

An obvious question therefore arises: What if Y_1, Y_2, ..., Y_n are non-normal (i.e., non-Gaussian)?
Answer: The t distribution result still approximately holds, even if the underlying population distribution is not perfectly normal. The approximation improves when
the sample size n is larger
the population distribution is more symmetric (not highly skewed).
Because the normality assumption (for the population distribution) is not absolutely critical for this t sampling distribution result to hold, we say that Result 3 is robust to the normality assumption.
REMARK : Robustness is a nice property; it assures us that the underlying assumption
of normality is not an absolute requirement for the t distribution result to hold. Other
sampling distribution results (coming up) are not always robust to normality departures.
TERMINOLOGY: Just as we used Weibull qq plots to assess the Weibull model assumption in the last chapter, we can use a normal quantile-quantile (qq) plot to assess the normal distribution assumption. The plot is constructed as follows:
On the vertical axis, we plot the observed data, ordered from low to high.
On the horizontal axis, we plot the (ordered) theoretical quantiles from the distribution (model) assumed for the observed data (here, normal).
ILLUSTRATION: Figure 4.6 shows the normal qq plot for the pipe diameter data in Example 4.5. The ordered data do not match up perfectly with the normal quantiles, but the plot doesn't set off any serious alarms (insofar as a departure from normality is concerned); R code for producing such a plot is sketched below.
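A normal qq plot comes built into R; for the pipes data from Example 4.5:

> qqnorm(pipes)  ## normal qq plot of the observed data
> qqline(pipes)  ## reference line through the first and third quartiles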
Figure 4.6: Normal qq plot for the pipe diameter data in Example 4.5. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

4.5 Confidence intervals for a population mean μ

4.5.1 Known population variance σ²

SETTING: To get things started, we will assume that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) population distribution. We will assume that
the population variance σ² is known (largely unrealistic), and
the goal is to estimate the population mean μ.

We already know that Ȳ is an unbiased (point) estimator for μ, that is,

E(Ȳ) = μ.

However, reporting Ȳ alone does not acknowledge that there is variability attached to this estimator. For example, in Example 4.5, with the n = 25 measured pipes, reporting

ȳ ≈ 1.299 in

as an estimate of the population mean μ does not account for the fact that
the 25 pipes measured were drawn randomly from a population of all pipes, and
different samples would give different sets of pipes (and different values of ȳ).
In other words, using a point estimator only ignores important information; namely, how variable the population of pipes is.
REMEDY: To avoid this problem (i.e., to account for the uncertainty in the sampling procedure), we therefore pursue the topic of interval estimation (also known as "confidence intervals"). The main difference between a point estimate and an interval estimate is that
a point estimate is a "one-shot guess" at the value of the parameter; this ignores the variability in the estimate.
an interval estimate (i.e., confidence interval) is an interval of values. It is formed by taking the point estimate and then adjusting it downwards and upwards to account for the point estimate's variability. The end result is an interval estimate.
DERIVATION: We start our discussion by revisiting Result 1 in the last subsection. Recall that if Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution, then the sampling distribution of Ȳ is

Ȳ ∼ N(μ, σ²/n)

Figure 4.7: N(0, 1) pdf. The upper 0.025 and lower 0.025 areas have been shaded. The associated quantiles are z_{0.025} ≈ 1.96 and −z_{0.025} ≈ −1.96, respectively.

and therefore

Z = (Ȳ − μ)/(σ/√n) ∼ N(0, 1).

We now introduce new notation that identifies quantiles from this distribution.
NOTATION: Let z_{α/2} denote the upper α/2 quantile from the N(0, 1) distribution. Because the N(0, 1) distribution is symmetric about z = 0, we know that −z_{α/2} is the lower α/2 quantile. For example, if α = 0.05 (see Figure 4.7), we know that

z_{0.05/2} = z_{0.025} ≈ 1.96
−z_{0.05/2} = −z_{0.025} ≈ −1.96.

To see where these values come from, you can use Table 1 (VK, pp 593-594) or R:

> qnorm(0.975,0,1)  ## z_{0.025} ## upper 0.025 quantile
[1] 1.959964
> qnorm(0.025,0,1)  ## -z_{0.025} ## lower 0.025 quantile
[1] -1.959964

DERIVATION: In general, for any value of α, 0 < α < 1, we can write

1 − α = P(−z_{α/2} < Z < z_{α/2})
      = P(−z_{α/2} < (Ȳ − μ)/(σ/√n) < z_{α/2})
      = P(−z_{α/2} σ/√n < Ȳ − μ < z_{α/2} σ/√n)
      = P(z_{α/2} σ/√n > μ − Ȳ > −z_{α/2} σ/√n)
      = P(Ȳ + z_{α/2} σ/√n > μ > Ȳ − z_{α/2} σ/√n)
      = P(Ȳ − z_{α/2} σ/√n < μ < Ȳ + z_{α/2} σ/√n).
We call

(Ȳ − z_{α/2} σ/√n,  Ȳ + z_{α/2} σ/√n)

a 100(1 − α) percent confidence interval for the population mean μ. This is sometimes written (more succinctly) as

Ȳ ± z_{α/2} σ/√n.

Note the form of the interval:

point estimate ± quantile × standard error.

Many confidence intervals we will study follow this same general form.
Here is how we interpret this interval: We say

"We are 100(1 − α) percent confident that the population mean μ is in this interval."

Unfortunately, the word "confident" does not mean "probability." The term "confidence" in confidence interval means that if we were able to sample from the population over and over again, each time computing a 100(1 − α) percent confidence interval for μ, then 100(1 − α) percent of the intervals we would compute would contain the population mean μ.
That is, "confidence" refers to the long-term behavior of many intervals, not a probability for a single interval. Because of this, we call 100(1 − α) the confidence level. Typical confidence levels are

90 percent (α = 0.10)  ⟹  z_{0.05} ≈ 1.645
95 percent (α = 0.05)  ⟹  z_{0.025} ≈ 1.96
99 percent (α = 0.01)  ⟹  z_{0.005} ≈ 2.576.

The length of the 100(1 − α) percent confidence interval Ȳ ± z_{α/2} σ/√n is equal to

2 z_{α/2} σ/√n.

Therefore,
the larger the sample size n, the smaller the interval length.
the larger the population variance σ², the larger the interval length.
the larger the confidence level 100(1 − α), the larger the interval length.
Clearly, shorter confidence intervals are preferred. They are more informative!

Example 4.6. Civil engineers have found that the ability to see and read a sign at night depends in part on its "surround luminance"; i.e., the light intensity near the sign. The data below are n = 30 measurements of the random variable Y, the surround luminance (in candela per m²). The 30 measurements constitute a random sample from all signs in a large metropolitan area.
10.9   1.7   9.5   2.9   9.1   3.2
 9.1   7.4  13.3  13.1   6.6  13.7
 1.5   6.3   7.4   9.9  13.6  17.3
 3.6   4.9  13.1   7.8  10.3  10.3
 9.6   5.7   2.6  15.1   2.9  16.2

Based on past experience, the engineers assume a normal population distribution (for the population of all signs) with known population variance σ² = 20.
Question. Find a 90 percent confidence interval for μ, the mean surround luminance.
Solution. We first use R to calculate the sample mean ȳ:

> mean(intensity)  ## sample mean
[1] 8.62

For a 90 percent confidence level; i.e., with α = 0.10, we use

z_{0.10/2} = z_{0.05} ≈ 1.645.

This can be determined from Table 1 (VK, pp 593-594) or from R:

> qnorm(0.95,0,1)  ## z_{0.05} ## upper 0.05 quantile
[1] 1.644854

With n = 30 and σ² = 20, a 90 percent confidence interval for the mean surround luminance μ is

ȳ ± z_{α/2} σ/√n = 8.62 ± 1.645 (√20/√30) = (7.28, 9.96) candela/m².

Interpretation: We are 90 percent confident that the mean surround luminance μ for all signs in the population is between 7.28 and 9.96 candela/m².
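The same interval can be computed in one line in R (intensity is the data vector used above):

> mean(intensity) + c(-1,1)*qnorm(0.95)*sqrt(20/30)  ## gives approximately (7.28, 9.96)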
Further analysis: Recall that the engineers claimed that the population of all light
intensities is described by a normal (Gaussian) distribution. In Figure 4.8, we display a
normal qq plot to check this assumption. The qq plot might cause some concern about
the normality assumption. There is some mild evidence of disagreement with the normal
model (although with a small sample, it is always hard to be sure).
Figure 4.8: Normal qq plot for the light intensity data in Example 4.6. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

Is this a serious cause for concern? Probably not. Recall that even if the population distribution (here, the distribution of all light intensity measurements in the city) is not perfectly normal, we still have

Ȳ ∼ AN(μ, σ²/n),

for n large, by the Central Limit Theorem. Therefore, our 90 percent confidence interval is still approximately valid. A sample of size n = 30 is pretty large. In other words, at n = 30, the CLT approximation above is usually "kicking in" rather well unless the underlying population distribution is grossly skewed (and I mean very grossly). This is not the case here.
4.5.2 Unknown population variance σ²

SETTING: We will continue to assume that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) population distribution.
Our goal is the same; namely, to write a 100(1 − α) percent confidence interval for the population mean μ.
However, we will no longer make the (rather unrealistic) assumption that the population variance σ² is known.
RECALL: If you look back in the notes at the known σ² case, you will see that to derive a 100(1 − α) percent confidence interval for μ, we started with the following distributional result:

Z = (Ȳ − μ)/(σ/√n) ∼ N(0, 1).

This led us to the following confidence interval formula:

Ȳ ± z_{α/2} σ/√n.

The obvious problem is that, because σ² is now unknown, we cannot calculate the interval. Not to worry; we just need a different starting point. Recall that

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1),

where S is the sample standard deviation (a point estimator for the population standard deviation σ). This result is all we need; in fact, it is straightforward to reproduce the known σ² derivation and tailor it to this (now more realistic) case. A 100(1 − α) percent confidence interval for μ is given by

Ȳ ± t_{n−1,α/2} S/√n.

The symbol t_{n−1,α/2} denotes the upper α/2 quantile from a t distribution with ν = n − 1 degrees of freedom. This value can be easily obtained from R. For those of you that like probability tables, VK tables the t distributions in Table 2 (pp 595).

We see that the interval again has the same form:

point estimate ± quantile × standard error.

We interpret the interval in the same way:

"We are 100(1 − α) percent confident that the population mean μ is in this interval."

Example 4.7. Acute exposure to cadmium produces respiratory distress and kidney and
liver damage (and possibly death). For this reason, the level of airborne cadmium dust
and cadmium oxide fume in the air, denoted by Y (measured in milligrams of cadmium
per m3 of air), is closely monitored. A random sample of n = 35 measurements from a
large factory are given below:

0.044 0.030

0.052

0.044

0.046

0.020

0.066

0.052 0.049

0.030

0.040

0.045

0.039

0.039

0.039 0.057

0.050

0.056

0.061

0.042

0.055

0.037 0.062

0.062

0.070

0.061

0.061

0.058

0.053 0.060

0.047

0.051

0.054

0.042

0.051

Based on past experience, engineers assume a normal population distribution (for the population of all cadmium measurements).
Question. Find a 99 percent confidence interval for μ, the mean level of airborne cadmium.
Solution. We first use R to calculate the sample mean ȳ and the sample standard deviation s:

> mean(cadmium)  ## sample mean
[1] 0.04928571
> sd(cadmium)  ## sample standard deviation
[1] 0.0110894

For a 99 percent confidence level; i.e., with α = 0.01, we use

t_{34,0.01/2} = t_{34,0.005} ≈ 2.728.

Note that Table 2 (VK, pp 595) is not helpful here; the ν = 34 degrees of freedom quantiles are not listed. From R, we get:

> qt(0.995,34)  ## t_{34,0.005} ## upper 0.005 quantile
[1] 2.728394

With n = 35, ȳ ≈ 0.049, and s ≈ 0.011, a 99 percent confidence interval for the population mean level of airborne cadmium is

ȳ ± t_{n−1,α/2} s/√n = 0.049 ± 2.728 (0.011/√35) = (0.044, 0.054) mg/m³.

Interpretation: We are 99 percent confident that the population mean level of airborne cadmium μ is between 0.044 and 0.054 mg/m³.
NOTE: It is possible to implement the t interval procedure entirely in R:

> # Calculate t interval directly
> t.test(cadmium,conf.level=0.99)$conf.int
[1] 0.04417147 0.05439996

Further analysis: Recall that the engineers claimed that the population of all airborne cadmium concentrations is described by a normal (Gaussian) distribution. In Figure 4.9, we display the normal qq plot to check this assumption. The qq plot looks pretty supportive of the normal assumption, although there is mild evidence of a slight departure in the upper tail.
ROBUSTNESS: The t confidence interval is based on the population distribution being normal (Gaussian). However, this interval is robust to departures from normality; i.e., even if the population distribution is non-normal (non-Gaussian), we can still use the t confidence interval and get approximately valid results.
Figure 4.9: Normal qq plot for the cadmium data in Example 4.7. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

4.5.3 Sample size determination

MOTIVATION: In the planning stages of an experiment or investigation, it is often of interest to determine how many individuals are needed to write a confidence interval with a given level of precision. For example, we might want to construct a 95 percent confidence interval for a population mean μ so that the interval length is no more than 5 units (e.g., days, inches, dollars, etc.). Of course, collecting data almost always costs money! Therefore, one must be cognizant not only of the statistical issues associated with sample size determination, but also of the practical issues like cost, time spent in data collection, personnel training, etc.

SETTING: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) population, where σ² is known. In this (known σ²) situation, recall that a 100(1 − α) percent confidence interval for μ is given by

Ȳ ± z_{α/2} σ/√n.

The quantity B = z_{α/2} σ/√n is called the margin of error.
FORMULA: In the setting described above, it is possible to determine the sample size n necessary once we specify these three pieces of information:
the value of σ² (or an educated guess at its value; e.g., from past information, etc.)
the confidence level, 100(1 − α)
the margin of error, B.
This is true because

B = z_{α/2} σ/√n  ⟹  n = (z_{α/2} σ / B)².

A small R helper for this computation is sketched below.
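A minimal sketch (the helper name n.mean is ours, not built into R):

> n.mean <- function(sigma, B, conf=0.95) ceiling((qnorm(1-(1-conf)/2)*sigma/B)^2)
> n.mean(sigma=8, B=2)  ## the setting of Example 4.8 below; gives 62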

Example 4.8. In a biomedical experiment, we would like to estimate the population mean remaining life μ of healthy rats that are given a certain dose of a toxic substance. Suppose that we would like to write a 95 percent confidence interval for μ with a margin of error equal to B = 2 days. From past studies, remaining rat lifetimes have been approximated by a normal distribution with standard deviation σ = 8 days. How many rats should we use for the experiment?
Solution. With z_{0.05/2} = z_{0.025} ≈ 1.96, B = 2, and σ = 8, the desired sample size to estimate μ is

n = (z_{α/2} σ / B)² = (1.96 × 8 / 2)² ≈ 61.46.

We would sample n = 62 rats to achieve these goals.

> qnorm(0.975,0,1)
[1] 1.959964

4.6 Confidence interval for a population proportion p

SITUATION: We now switch gears and focus on a new parameter: the population proportion p. This parameter emerges when the characteristic we measure on each individual is binary (i.e., only 2 outcomes possible). Here are some examples:

p = proportion of defective circuit boards
p = proportion of customers who are satisfied
p = proportion of payments received on time
p = proportion of HIV positives in SC.

To start our discussion, we need to recall the Bernoulli trial assumptions for each individual in the sample:
1. each individual results in a "success" or a "failure,"
2. the individuals are independent, and
3. the probability of "success," denoted by p, 0 < p < 1, is the same for every individual.
In our examples above,
"success" ⟺ circuit board defective
"success" ⟺ customer satisfied
"success" ⟺ payment received on time
"success" ⟺ HIV positive individual.
RECALL: If the individual success/failure statuses in the sample adhere to the Bernoulli trial assumptions, then

Y = the number of successes out of n sampled individuals

follows a binomial distribution, that is, Y ∼ b(n, p). The statistical problem at hand is to use the information in Y to estimate p.

POINT ESTIMATOR: A natural point estimator for p, the population proportion, is

p̂ = Y/n,

the sample proportion. This statistic is simply the proportion of "successes" in the sample (out of n individuals).
PROPERTIES: Fairly simple arguments can be used to show the following results:

E(p̂) = p   and   se(p̂) = √[p(1 − p)/n].

The first result says that the sample proportion p̂ is an unbiased estimator of the population proportion p. The second (standard error) result quantifies the precision of p̂ as an estimator of p.
SAMPLING DISTRIBUTION: Knowing the sampling distribution of p̂ is critical if we are going to formalize statistical inference procedures for p. In this situation, we appeal to an approximate result (conferred by the CLT) which says that

p̂ ∼ AN[p, p(1 − p)/n],

when the sample size n is large.
RESULT: An approximate 100(1 − α) percent confidence interval for p is given by

p̂ ± z_{α/2} √[p̂(1 − p̂)/n].

This interval should be used only when the sample size n is large. A common rule of thumb (to use this interval formula) is to require

np̂ ≥ 5   and   n(1 − p̂) ≥ 5.

Under these conditions, the CLT should adequately approximate the true sampling distribution of p̂, thereby making the confidence interval formula above approximately valid.

Note again the form of the interval:

point estimate ± quantile × standard error.

We interpret the interval in the same way:

"We are 100(1 − α) percent confident that the population proportion p is in this interval."

The value z_{α/2} is the upper α/2 quantile from the N(0, 1) distribution.

Example 4.9. One source of water pollution is gasoline leakage from underground storage tanks. In Pennsylvania, a random sample of n = 74 gasoline stations is selected and the tanks are inspected; 10 stations are found to have at least one leaking tank.
Question. Calculate a 95 percent confidence interval for p, the population proportion of gasoline stations with at least one leaking tank.
Solution. In this situation, we interpret

gasoline station = "individual trial"
at least one leaking tank = "success"
p = population proportion of stations with at least one leaking tank.

For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. The sample proportion of stations with at least one leaking tank is

p̂ = 10/74 ≈ 0.135.

Therefore, an approximate 95 percent confidence interval for p is

0.135 ± 1.96 √[0.135(1 − 0.135)/74] = (0.057, 0.213).

Interpretation: We are 95 percent confident that the population proportion of stations in Pennsylvania with at least one leaking tank is between 0.057 and 0.213.
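The same interval is one line in R (using the Wald formula above):

> phat <- 10/74
> phat + c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/74)  ## gives approximately (0.057, 0.213)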

CLT approximation check: We have

np̂ = 74 (10/74) = 10   and   n(1 − p̂) = 74 (1 − 10/74) = 64.

Both of these are larger than 5 ⟹ we can feel comfortable in using this interval formula.
QUESTION: Suppose that we would like to write a 100(1 − α) percent confidence interval for p, a population proportion. We know that

p̂ ± z_{α/2} √[p̂(1 − p̂)/n]

is an approximate 100(1 − α) percent confidence interval for p. What sample size n should we use?
SAMPLE SIZE DETERMINATION: To determine the necessary sample size, we first need to specify two pieces of information:
the confidence level 100(1 − α)
the margin of error:

B = z_{α/2} √[p̂(1 − p̂)/n].

A small problem arises. Note that B depends on p̂. Unfortunately, p̂ can only be calculated once we know the sample size n. We overcome this problem by replacing p̂ with p₀, an a priori guess at its value. The last expression becomes

B = z_{α/2} √[p₀(1 − p₀)/n].

Solving this equation for n, we get

n = (z_{α/2}/B)² p₀(1 − p₀).

This is the desired sample size n to find a 100(1 − α) percent confidence interval for p with a prescribed margin of error (roughly) equal to B. I say "roughly," because there may be additional uncertainty arising from our use of p₀ (our best guess).

CONSERVATIVE APPROACH: If there is no sensible guess for p available, use p₀ = 0.5. In this situation, the resulting value for n will be as large as possible. Put another way, using p₀ = 0.5 gives the most conservative solution (i.e., the largest sample size, n). This is true because

n = n(p₀) = (z_{α/2}/B)² p₀(1 − p₀),

when viewed as a function of p₀, is maximized when p₀ = 0.5.


Example 4.10. You have been asked to estimate the proportion of raw material (in a certain manufacturing process) that is being "scrapped"; e.g., the material is so defective that it cannot be reworked. If this proportion is larger than 10 percent, this will be deemed (by management) to be an unacceptable continued operating cost and a substantial process overhaul will be performed. Past experience suggests that the scrap rate is about 5 percent, but recent information suggests that this rate may be increasing.
Question. You would like to write a 95 percent confidence interval for p, the population proportion of raw material that is to be scrapped, with a margin of error equal to B = 0.02. How many pieces of material should you ask to be sampled?
Solution. For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. In providing an initial guess, we have options; we could use

p₀ = 0.05 (historical scrap rate)
p₀ = 0.10 ("critical mass" value)
p₀ = 0.50 (most conservative choice).
For these choices, we have

n = (1.96/0.02)² × 0.05(1 − 0.05) ≈ 457
n = (1.96/0.02)² × 0.10(1 − 0.10) ≈ 865
n = (1.96/0.02)² × 0.50(1 − 0.50) ≈ 2401.

As we can see, the "guessed" value of p₀ has a substantial impact on the final sample size calculation.
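The same computations in R, wrapping the formula in a small helper (the function name n.prop is ours):

> n.prop <- function(p0, B=0.02, conf=0.95) ceiling((qnorm(1-(1-conf)/2)/B)^2 * p0*(1-p0))
> sapply(c(0.05, 0.10, 0.50), n.prop)
[1]  457  865 2401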
4.7 Confidence interval for a population variance σ²

MOTIVATION: In many situations, one is concerned not with the mean of an underlying (continuous) population distribution, but with the variance σ² instead. If σ² is excessively large, this could point to a potential problem with a manufacturing process, for example, where there is too much variation in the measurements produced. In a laboratory setting, chemical engineers might wish to estimate the variance σ² attached to a measurement system (e.g., scale, caliper, etc.). In field trials, agronomists are often interested in comparing the variability levels for different cultivars or genetically-altered varieties. In clinical trials, physicians are often concerned if there are substantial differences in the variation levels of patient responses at different clinic sites.
NEW RESULT: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution. The quantity

Q = (n − 1)S²/σ² ∼ χ²(n − 1),

a χ² distribution with ν = n − 1 degrees of freedom.


FACTS: The χ² distribution has the following characteristics:
• It is continuous, skewed to the right, and always positive.
• It is indexed by a value ν called the degrees of freedom. In practice, ν is often
an integer (related to the sample size).
• The χ² pdf formula is unnecessary for our purposes. R will compute probabilities
and quantiles from the χ² distribution.

χ² R CODE: Suppose that Q ~ χ²(ν).

F_Q(q) = P(Q ≤ q):  pchisq(q,ν)
pth quantile:       qchisq(p,ν)

NOTE: Table 3 (VK, pp 596-597) catalogues some quantiles from some χ² distributions.
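As a quick sketch of how these two base R functions relate, the quantile function
inverts the cdf:

> qchisq(0.95,10)   ## upper 0.05 quantile of the chi-square(10) distribution
[1] 18.30704
> pchisq(18.30704,10)   ## recovers the probability
[1] 0.95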
Figure 4.10: χ² probability density functions with different degrees of freedom (df = 5, 10, 20).

GOAL: Suppose that Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) distribution. We
would like to write a 100(1−α) percent confidence interval for σ².
NOTATION: Let χ²_{n−1,α/2} denote the lower α/2 quantile and let χ²_{n−1,1−α/2} denote the
upper α/2 quantile of the χ²(n−1) distribution; i.e., χ²_{n−1,α/2} and χ²_{n−1,1−α/2} satisfy

P(Q < χ²_{n−1,α/2}) = α/2
P(Q > χ²_{n−1,1−α/2}) = α/2,

respectively. Note that, unlike the N(0, 1) and t distributions, the χ² distribution is
not symmetric. Therefore, different notation is needed to identify the quantiles of χ²
distributions (this is nothing to get worried about).


DERIVATION: Because Q ~ χ²(n−1), we write

1 − α = P( χ²_{n−1,α/2} < Q < χ²_{n−1,1−α/2} )
      = P( χ²_{n−1,α/2} < (n−1)S²/σ² < χ²_{n−1,1−α/2} )
      = P( 1/χ²_{n−1,α/2} > σ²/[(n−1)S²] > 1/χ²_{n−1,1−α/2} )
      = P( (n−1)S²/χ²_{n−1,α/2} > σ² > (n−1)S²/χ²_{n−1,1−α/2} )
      = P( (n−1)S²/χ²_{n−1,1−α/2} < σ² < (n−1)S²/χ²_{n−1,α/2} ).

This argument shows that

( (n−1)S²/χ²_{n−1,1−α/2} , (n−1)S²/χ²_{n−1,α/2} )

is a 100(1−α) percent confidence interval for the population variance σ². We
interpret the interval in the same way.
We are 100(1−α) percent confident that the population variance σ² is in
this interval.
IMPORTANT: A 100(1−α) percent confidence interval for the population standard
deviation σ arises from simply taking the square root of the endpoints of the σ² interval.
That is,

( √[(n−1)S²/χ²_{n−1,1−α/2}] , √[(n−1)S²/χ²_{n−1,α/2}] )

is a 100(1−α) percent confidence interval for the population standard deviation σ. In
practice, this interval may be preferred over the σ² interval, because standard deviation
is a measure of variability in terms of the original units (e.g., dollars, inches, days, etc.).
The variance is measured in squared units (e.g., dollars², in², days²).
Example 4.11. Indoor swimming pools are noted for their poor acoustical properties.
Suppose your goal is to design a pool in such a way that


• the population mean time that it takes for a low-frequency sound to die out is
μ = 1.3 seconds
• the population standard deviation for the distribution of die-out times is σ = 0.6
seconds.
Computer simulations of a preliminary design are conducted to see whether these
standards are being met; here are data from n = 20 independently-run simulations. The data
are obtained on the time (in seconds) it takes for the low-frequency sound to die out.

1.34 2.56

1.28

2.25

1.84

2.35

0.77

1.84

1.80

2.44

0.86 1.29

0.12

1.87

0.71

2.08

0.71

0.30

0.54

1.48

Question. Find a 95 percent confidence interval for the population standard deviation of
times σ. What does this interval suggest about whether the preliminary design conforms
to specifications (with respect to variability)?
Solution. For 95 percent confidence (i.e., using α = 0.05), we need to use

χ²_{n−1,α/2} = χ²_{19,0.025} ≈ 8.907
χ²_{n−1,1−α/2} = χ²_{19,0.975} ≈ 32.852.

I got these quantiles from R:

> qchisq(0.025,19) ## chi^2_{19,0.025}
[1] 8.906516
> qchisq(0.975,19) ## chi^2_{19,0.975}
[1] 32.85233

We also need to calculate the sample variance s². From R, we get s² ≈ 0.555.

> var(sounds) ## sample variance
[1] 0.554666

Figure 4.11: Normal qq plot for the swimming pool sound time data in Example 4.11.
The observed data (sample quantiles) are plotted versus the theoretical quantiles from a
normal distribution. The line added passes through the first and third theoretical quartiles.

A 95 percent confidence interval for σ² is

[ (n−1)s²/χ²_{n−1,1−α/2} , (n−1)s²/χ²_{n−1,α/2} ] = [ 19(0.555)/32.852 , 19(0.555)/8.907 ]
                                                  = (0.321, 1.184).

Interpretation: We are 95 percent confident that the population variance σ² is between
0.321 and 1.184 sec².
A 95 percent confidence interval for σ (which is what we originally wanted) is

( √0.321 , √1.184 ) = (0.567, 1.088).
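For readers who prefer to let R do the arithmetic, here is a small sketch that computes
both intervals directly; it assumes the n = 20 die-out times are stored in the vector
sounds used above, and it reproduces the hand calculation up to rounding:

> n = length(sounds); s2 = var(sounds)
> (n-1)*s2/qchisq(c(0.975,0.025),n-1)        ## interval for sigma^2, approx (0.321, 1.184)
> sqrt((n-1)*s2/qchisq(c(0.975,0.025),n-1))  ## interval for sigma, approx (0.567, 1.088)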


Interpretation: We are 95 percent confident that the population standard deviation σ
is between 0.567 and 1.088 sec. From this interval, it appears as though the pool design
specifications are being met (with respect to the population standard deviation level).
Note that σ = 0.6 is contained in the confidence interval for σ.
Major warning: Unlike the z and t confidence intervals for a population mean μ,
the χ² interval for σ² (and for σ) is not robust to departures from normality. If the
underlying population distribution is non-normal (non-Gaussian), then the confidence
interval formulas for σ² and σ are not to be used. Therefore, it is very important to
check the normality assumption with these interval procedures (e.g., use a qq-plot).
Analysis: With only n = 20 measurements, it is somewhat hard to tell, but the qq-plot
in Figure 4.11 looks fairly straight. Small sample sizes make interpreting qq-plots more
difficult (e.g., the analyst may look for patterns that are not really there).

4.8 Confidence intervals for the difference of two population means μ₁ − μ₂: Independent samples

REMARK : In practice, it is very common to compare the same characteristic (mean,


proportion, variance) from two dierent distributions. For example, we may wish to
compare
the mean starting salaries of male and female engineers (compare 1 and 2 )
the proportion of scrap produced from two manufacturing processes (compare p1
and p2 )
the variance of sound levels from two indoor swimming pool designs (compare 12
and 22 ).
Our previous work is applicable only for a single distribution (i.e., a single mean , a single
proportion p, and a single variance 2 ). We therefore need to extend these procedures to
handle two distributions. We start with comparing two means.


TWO-SAMPLE PROBLEM: Suppose that we have two independent samples:

Sample 1:  Y₁₁, Y₁₂, ..., Y₁ₙ₁ ~ N(μ₁, σ₁²) random sample
Sample 2:  Y₂₁, Y₂₂, ..., Y₂ₙ₂ ~ N(μ₂, σ₂²) random sample.

GOAL: Construct a 100(1−α) percent confidence interval for the difference of population
means μ₁ − μ₂.
POINT ESTIMATORS: We define the statistics

Ȳ₁₊ = (1/n₁) Σ_{j=1}^{n₁} Y₁ⱼ = sample mean for sample 1
Ȳ₂₊ = (1/n₂) Σ_{j=1}^{n₂} Y₂ⱼ = sample mean for sample 2
S₁² = [1/(n₁−1)] Σ_{j=1}^{n₁} (Y₁ⱼ − Ȳ₁₊)² = sample variance for sample 1
S₂² = [1/(n₂−1)] Σ_{j=1}^{n₂} (Y₂ⱼ − Ȳ₂₊)² = sample variance for sample 2.

4.8.1 Equal variance case: σ₁² = σ₂²

GOAL: We want to write a confidence interval for μ₁ − μ₂, but how this interval is
constructed depends on the values of σ₁² and σ₂². In particular, we consider two cases:
• σ₁² = σ₂²; that is, the two population variances are equal
• σ₁² ≠ σ₂²; that is, the two population variances are not equal.
We first consider the equal variance case. Addressing this case requires us to start with
the following (sampling) distribution result:

T = [(Ȳ₁₊ − Ȳ₂₊) − (μ₁ − μ₂)] / √[ S_p² (1/n₁ + 1/n₂) ] ~ t(n₁ + n₂ − 2),

where

S_p² = [ (n₁−1)S₁² + (n₂−1)S₂² ] / (n₁ + n₂ − 2).
Figure 4.12: Two normal distributions with σ₁² = σ₂².

Some comments are in order:
• For this sampling distribution to hold (exactly), we need
  – the two samples to be independent
  – the two population distributions to be normal (Gaussian)
  – the two population distributions to have the same variance; i.e., σ₁² = σ₂².
• The statistic S_p² is called the pooled sample variance estimator of the common
population variance, say, σ². Algebraically, it is simply a weighted average of the
two sample variances S₁² and S₂² (where the weights are functions of the sample
sizes n₁ and n₂).
• The sampling distribution T ~ t(n₁ + n₂ − 2) should suggest to you that confidence
interval quantiles will come from this t distribution; note that this distribution
depends on the sample sizes from both samples.
• In particular, because T ~ t(n₁ + n₂ − 2), we can find the value t_{n₁+n₂−2,α/2} that
satisfies

P( −t_{n₁+n₂−2,α/2} < T < t_{n₁+n₂−2,α/2} ) = 1 − α.
Substituting T into the last expression and performing algebraic manipulations, we
obtain

(Ȳ₁₊ − Ȳ₂₊) ± t_{n₁+n₂−2,α/2} √[ S_p² (1/n₁ + 1/n₂) ].

This is a 100(1−α) percent confidence interval for the mean difference μ₁ − μ₂.
We see that the interval again has the same form:

point estimate ± quantile × standard error,

with point estimate Ȳ₁₊ − Ȳ₂₊, quantile t_{n₁+n₂−2,α/2}, and standard error
√[ S_p² (1/n₁ + 1/n₂) ].
We interpret the interval in the same way.
We are 100(1−α) percent confident that the population mean difference
μ₁ − μ₂ is in this interval.
Important: In two-sample situations, it is often of interest to see if the means μ₁
and μ₂ are different.
• If the confidence interval for μ₁ − μ₂ includes 0, this does not suggest that the
means μ₁ and μ₂ are different.
• If the confidence interval for μ₁ − μ₂ does not include 0, it does.

Example 4.12. In the vicinity of a nuclear power plant, environmental engineers from
the EPA would like to determine if there is a difference between the mean weight in
fish (of the same species) from two locations. Independent samples are taken from each
location and the following weights (in ounces) are observed:

Figure 4.13: Boxplots of fish weights by location in Example 4.12.

Location 1:  21.9  18.5  12.3  16.7  21.0  15.1  18.2  23.0  36.8  26.6
Location 2:  22.0  20.6  15.4  17.9  24.4  15.6  11.4  17.5
Question. Construct a 90 percent confidence interval for the mean difference μ₁ − μ₂.
Here, μ₁ (μ₂) denotes the population mean weight of all fish at location 1 (2).
Solution. In order to visually assess the equal variance assumption, we use boxplots
to display the data in each sample; see Figure 4.13. The equal variance assumption,
based on the figure, looks reasonable; note that the spread in each distribution looks
roughly the same (save the outlier in Location 1).
NOTE: Instead of resorting to hand calculation (as we have often done in previous
examples), we will use R to calculate the confidence interval directly:
> t.test(loc.1,loc.2,conf.level=0.90,var.equal=TRUE)$conf.int
[1] -1.940438  7.760438

Therefore, a 90 percent confidence interval for the mean difference μ₁ − μ₂ is
(−1.940, 7.760) oz.
Interpretation: We are 90 percent confident that the mean difference μ₁ − μ₂ is between
−1.940 and 7.760 oz. Note that this interval includes 0. Therefore, we do not have
sufficient evidence that the population mean fish weights μ₁ and μ₂ are different.
ROBUSTNESS: Some comments are in order about the robustness properties of the
two-sample confidence interval

(Ȳ₁₊ − Ȳ₂₊) ± t_{n₁+n₂−2,α/2} √[ S_p² (1/n₁ + 1/n₂) ]

for the mean difference μ₁ − μ₂.
• We should only use this interval if there is strong evidence that the population
variances σ₁² and σ₂² are equal (or at least close). Otherwise, we should use a
different interval (coming up).
• Like the one-sample t interval for μ, the two-sample t interval (and the unequal
variance version coming up) is robust to normality departures. This means that we
can feel comfortable with the interval even if the underlying population distributions
are not perfectly normal (Gaussian).

4.8.2 Unequal variance case: σ₁² ≠ σ₂²

REMARK: When σ₁² ≠ σ₂², the problem of constructing a 100(1−α) percent confidence
interval for μ₁ − μ₂ becomes more difficult theoretically. However, we can still write an
approximate confidence interval.


FORMULA: An approximate 100(1−α) percent confidence interval for μ₁ − μ₂ is given by

(Ȳ₁₊ − Ȳ₂₊) ± t_{ν,α/2} √( S₁²/n₁ + S₂²/n₂ ),

where the degrees of freedom ν is calculated as

ν = ( S₁²/n₁ + S₂²/n₂ )² / [ (S₁²/n₁)²/(n₁−1) + (S₂²/n₂)²/(n₂−1) ].

This interval is always approximately valid, as long as
• the two samples are independent
• the two population distributions are approximately normal (Gaussian).
No one in their right mind would calculate this interval by hand (particularly
nasty is the formula for ν). R will produce the interval on request; a small sketch of
the ν calculation follows.
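Still, the ν formula (often called the Welch-Satterthwaite formula) is just arithmetic;
the following sketch computes it directly for two hypothetical data vectors y1 and y2
(names assumed for illustration; t.test(y1, y2, var.equal=FALSE) does all of this
automatically):

> ## Welch-Satterthwaite degrees of freedom (sketch)
> v1 = var(y1)/length(y1)   ## S_1^2/n_1
> v2 = var(y2)/length(y2)   ## S_2^2/n_2
> nu = (v1+v2)^2/(v1^2/(length(y1)-1) + v2^2/(length(y2)-1))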

Example 4.13. You are part of a recycling project that is examining how much paper is
being discarded (not recycled) by employees at two large plants. These data are obtained
on the amount of white paper thrown out per year by employees (data are in hundreds
of pounds). Samples of employees at each plant were randomly selected.

Plant 1:  3.01  2.58  3.04  1.75  2.87  2.57  2.51  2.93  2.85  3.09
          1.43  3.36  3.18  2.74  2.25  1.95  3.68  2.29  1.86  2.63
          2.83  2.04  2.23  1.92  3.02

Plant 2:  3.79  2.08  3.66  1.53  4.07  4.31  2.62  4.52  3.80  5.30
          3.41  0.82  3.03  1.95  6.45  1.86  1.87  3.78  2.74  3.81

Question. Are there differences in the mean amounts of white paper discarded by
employees at the two plants? Answer this question by finding a 95 percent confidence
interval for the mean difference μ₁ − μ₂. Here, μ₁ (μ₂) denotes the population mean
amount of white paper discarded per employee at Plant 1 (2).
Solution. In order to visually assess the equal variance assumption, we again use
boxplots to display the data in each sample; see Figure 4.14. For these data, the equal
variance assumption would be highly suspect; the spread in the distribution of Plant 2
values is much larger than that of Plant 1.

Figure 4.14: Boxplots of discarded white paper amounts (in 100s lb) in Example 4.13.

We again use R to calculate the (unequal variance) confidence interval for μ₁ − μ₂:

> t.test(plant.1,plant.2,conf.level=0.95,var.equal=FALSE)$conf.int
[1] -1.35825799 -0.01294201

Therefore, a 95 percent confidence interval for the mean difference μ₁ − μ₂ is
(−1.358, −0.013), in hundreds of pounds.
Interpretation: We are 95 percent confident that the mean difference μ₁ − μ₂ is between
−135.8 and −1.3 lb. This interval does not include 0. Therefore, we have evidence that
the population mean weights (μ₁ and μ₂) at the two plants are different.
REMARK: In this subsection, we have presented two confidence intervals for μ₁ − μ₂. One
assumes σ₁² = σ₂² (equal variance assumption) and one assumes σ₁² ≠ σ₂² (unequal
variance assumption). If you are unsure about which interval to use, go with the
unequal variance interval. The penalty for using it when σ₁² = σ₂² is much smaller
than the penalty for using the equal variance interval when σ₁² ≠ σ₂².

4.9 Confidence interval for the difference of two population proportions p₁ − p₂: Independent samples

NOTE: We also can extend our confidence interval procedure for a single population
proportion p to two populations. Define

p₁ = population proportion of successes in Population 1
p₂ = population proportion of successes in Population 2.

For example, we might want to compare the proportion of
• defective circuit boards for two different suppliers
• satisfied customers before and after a product design change (e.g., Facebook, etc.)
• on-time payments for two classes of customers
• HIV positives for individuals in two demographic classes.

POINT ESTIMATORS: We assume that there are two independent random samples of
individuals (one sample from each population to be compared). Define

Y₁ = number of successes in Sample 1 (out of n₁ individuals) ~ b(n₁, p₁)
Y₂ = number of successes in Sample 2 (out of n₂ individuals) ~ b(n₂, p₂).

The point estimators for p₁ and p₂ are the sample proportions, defined by

p̂₁ = Y₁/n₁
p̂₂ = Y₂/n₂.

GOAL: We would like to write a 100(1−α) percent confidence interval for p₁ − p₂, the
difference of two population proportions.
IMPORTANT: To accomplish this goal, we need the following distributional result.
When the sample sizes n₁ and n₂ are large,

Z = [(p̂₁ − p̂₂) − (p₁ − p₂)] / √[ p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂ ] ~ AN(0, 1).

If this sampling distribution holds approximately, then

(p̂₁ − p̂₂) ± z_{α/2} √[ p̂₁(1−p̂₁)/n₁ + p̂₂(1−p̂₂)/n₂ ]

is an approximate 100(1−α) percent confidence interval for p₁ − p₂.
For the Z sampling distribution to hold approximately (and therefore for the interval
above to be useful), we need
• the two samples to be independent
• the sample sizes n₁ and n₂ to be large; common rules of thumb are to require
nᵢp̂ᵢ ≥ 5 and nᵢ(1−p̂ᵢ) ≥ 5 for each sample i = 1, 2.
Under these conditions, the CLT should adequately approximate the true sampling
distribution of Z, thereby making the confidence interval formula above approximately valid.
Note again the form of the interval:

point estimate ± quantile × standard error,

with point estimate p̂₁ − p̂₂, quantile z_{α/2}, and standard error
√[ p̂₁(1−p̂₁)/n₁ + p̂₂(1−p̂₂)/n₂ ].

We interpret the interval in the same way.
We are 100(1−α) percent confident that the population proportion
difference p₁ − p₂ is in this interval.
The value z_{α/2} is the upper α/2 quantile from the N(0, 1) distribution.
Important: In two-sample situations, it is often of interest to see if the proportions
p₁ and p₂ are different.
• If the confidence interval for p₁ − p₂ includes 0, this does not suggest that the
proportions p₁ and p₂ are different.
• If the confidence interval for p₁ − p₂ does not include 0, it does.

Example 4.14. A programmable lighting control system is being designed. The purpose
of the system is to reduce electricity consumption costs in buildings. The system
eventually will entail the use of a large number of transceivers (a device comprised of
both a transmitter and a receiver). Two types of transceivers are being considered. In
life testing, 200 transceivers (randomly selected) were tested for each type.

Transceiver 1:  20 failures were observed (out of 200)
Transceiver 2:  14 failures were observed (out of 200).

Question. Define p₁ (p₂) to be the population proportion of Transceiver 1 (Transceiver 2)
failures. Write a 95 percent confidence interval for p₁ − p₂. Is there a significant
difference between the failure rates p₁ and p₂?
Solution. For 95 percent confidence, we need z₀.₀₅/₂ = z₀.₀₂₅ ≈ 1.96. The sample
proportions of defective transceivers are

p̂₁ = 20/200 = 0.10
p̂₂ = 14/200 = 0.07.

Therefore, an approximate 95 percent confidence interval for p₁ − p₂ is

(0.10 − 0.07) ± 1.96 √[ 0.10(1−0.10)/200 + 0.07(1−0.07)/200 ] = (−0.025, 0.085).

Interpretation: We are 95 percent confident that the difference of the population failure
rates for the two transceivers is between −0.025 and 0.085. Because this interval includes
0, we do not have sufficient evidence that the two failure rates p₁ and p₂ are different.
CLT approximation check: We have

n₁p̂₁ = 200(20/200) = 20
n₁(1−p̂₁) = 200(1 − 20/200) = 180
n₂p̂₂ = 200(14/200) = 14
n₂(1−p̂₂) = 200(1 − 14/200) = 186.

All of these quantities are larger than 5 ⟹ we can feel comfortable in using this interval
formula.
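As a check on the hand calculation, here is a minimal base-R sketch of the same
Wald-type interval (no special packages; qnorm supplies z₀.₀₂₅):

> p1.hat = 20/200; p2.hat = 14/200
> se = sqrt(p1.hat*(1-p1.hat)/200 + p2.hat*(1-p2.hat)/200)
> (p1.hat-p2.hat) + c(-1,1)*qnorm(0.975)*se  ## approx (-0.025, 0.085)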

4.10 Confidence interval for the ratio of two population variances σ₂²/σ₁²: Independent samples

IMPORTANCE: You will recall that when we wrote a confidence interval for μ₁ − μ₂,
the difference of the population means (with independent samples), we proposed two
different intervals:
• one interval that assumed σ₁² = σ₂²
• one interval that assumed σ₁² ≠ σ₂².
We now propose a confidence interval procedure that can be used to determine which
assumption is more appropriate. This confidence interval is used to compare the
population variances in two independent samples.
TWO-SAMPLE PROBLEM: Suppose that we have two independent samples:

Sample 1:  Y₁₁, Y₁₂, ..., Y₁ₙ₁ ~ N(μ₁, σ₁²) random sample
Sample 2:  Y₂₁, Y₂₂, ..., Y₂ₙ₂ ~ N(μ₂, σ₂²) random sample.

GOAL: Construct a 100(1−α) percent confidence interval for the ratio of population
variances σ₂²/σ₁².

IMPORTANT: To accomplish this, we need the following sampling distribution result:

Q = (S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁ − 1, n₂ − 1),

an F distribution with (numerator) ν₁ = n₁ − 1 and (denominator) ν₂ = n₂ − 1 degrees
of freedom.
FACTS: The F distribution has the following characteristics:
• continuous, skewed right, and always positive
• indexed by two degree of freedom parameters ν₁ and ν₂; these are usually integers
and are often related to sample sizes
• The F pdf formula is complicated and is unnecessary for our purposes. R will
compute F probabilities and quantiles from the F distribution.

F R CODE: Suppose that Q ~ F(ν₁, ν₂).

F_Q(q) = P(Q ≤ q):  pf(q,ν₁,ν₂)
pth quantile:       qf(p,ν₁,ν₂)

NOTATION: Let F_{n₁−1,n₂−1,α/2} and F_{n₁−1,n₂−1,1−α/2} denote the lower and upper quantiles,
respectively, of the F(n₁−1, n₂−1) distribution; i.e., these values satisfy

P(Q < F_{n₁−1,n₂−1,α/2}) = α/2
P(Q > F_{n₁−1,n₂−1,1−α/2}) = α/2,

respectively. Similar to the χ² distribution, the F distribution is not symmetric.
Therefore, different notation is needed to identify the quantiles of F distributions.
Figure 4.15: F probability density functions with different degrees of freedom (df = 5,5; 5,10; 10,10).

DERIVATION: Because

Q = (S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁ − 1, n₂ − 1),

we can write

1 − α = P( F_{n₁−1,n₂−1,α/2} < Q < F_{n₁−1,n₂−1,1−α/2} )
      = P( F_{n₁−1,n₂−1,α/2} < (S₁²/σ₁²)/(S₂²/σ₂²) < F_{n₁−1,n₂−1,1−α/2} )
      = P( (S₂²/S₁²) F_{n₁−1,n₂−1,α/2} < σ₂²/σ₁² < (S₂²/S₁²) F_{n₁−1,n₂−1,1−α/2} ).

This shows that

( (S₂²/S₁²) F_{n₁−1,n₂−1,α/2} , (S₂²/S₁²) F_{n₁−1,n₂−1,1−α/2} )

is a 100(1−α) percent confidence interval for the ratio of the population variances
σ₂²/σ₁². We interpret the interval in the same way.
We are 100(1−α) percent confident that the ratio σ₂²/σ₁² is in this interval.

• If the confidence interval for σ₂²/σ₁² includes 1, this does not suggest that the
variances σ₁² and σ₂² are different.
• If the confidence interval for σ₂²/σ₁² does not include 1, it does.
Example 4.15. We consider again the recycling project in Example 4.13 that examined
the amount of white paper discarded per employee at two large plants. The data
(presented in Example 4.13) were obtained on the amount of white paper thrown out per
year by employees (data are in hundreds of pounds). Samples of employees at each plant
(n₁ = 25 and n₂ = 20) were randomly selected. The boxplots in Figure 4.14 did suggest
that the population variances may be different.
Question. Find a 95 percent confidence interval for σ₂²/σ₁², the ratio of the population
variances. Here, σ₁² (σ₂²) denotes the population variance of the amount of white paper
discarded by employees at Plant 1 (Plant 2).
Solution. We use R to get the sample variances s₁² and s₂²:

> var(plant.1) ## sample variance (Plant 1)
[1] 0.3071923
> var(plant.2) ## sample variance (Plant 2)
[1] 1.878411

That is, s₁² ≈ 0.307 and s₂² ≈ 1.878. We also use R to get the necessary F quantiles:

> qf(0.025,24,19) ## F_{24,19,0.025} ## lower 0.025 quantile
[1] 0.4264113
> qf(0.975,24,19) ## F_{24,19,0.975} ## upper 0.025 quantile
[1] 2.452321
The lower bound of the 95 percent confidence interval for σ₂²/σ₁² is

(s₂²/s₁²) F_{24,19,0.025} = (1.878/0.307) × 0.4264113 ≈ 2.607.

The upper bound of the 95 percent confidence interval for σ₂²/σ₁² is

(s₂²/s₁²) F_{24,19,0.975} = (1.878/0.307) × 2.452321 ≈ 14.995.
Figure 4.16: Normal quantile-quantile (qq) plots for employee recycle data for two plants
in Example 4.15.
Interpretation: We are 95 percent confident that the ratio of the population variances
σ₂²/σ₁² is between 2.607 and 14.995. This interval does not include 1. Therefore, we
have sufficient evidence that the population variances (σ₁² and σ₂²) at the two plants are
different.
Discussion: This finding supports our use of the unequal-variance interval for μ₁ − μ₂
in Example 4.13. Some statisticians recommend to use this equal/unequal variance test
before deciding which confidence interval to use for μ₁ − μ₂. Some statisticians (including
your authors) do not.
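Both endpoints can also be obtained in one line, since qf is vectorized; this sketch
assumes the plant.1 and plant.2 vectors used above:

> (var(plant.2)/var(plant.1))*qf(c(0.025,0.975),24,19)  ## approx (2.607, 14.995)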
Major warning: Like the χ² interval for a single population variance σ², the two-sample F
interval for the ratio of two variances is not robust to departures from normality. If the
underlying population distributions are non-normal (non-Gaussian), then this interval
should not be used.
Discussion: Figure 4.16 (above) displays the normal qq plots for data from the two
plants. I am somewhat worried about the normal assumption for Plant 2.
4.11 Confidence intervals for the difference of two population means μ₁ − μ₂: Dependent samples (Matched-pairs)

Example 4.16. Creatine is an organic acid that helps to supply energy to cells in the
body, primarily muscle. Because of this, it is commonly used by those who are weight
training to gain muscle mass. Does it really work? Suppose that we are designing an
experiment involving USC male undergraduates who exercise/lift weights regularly.
Design 1 (Independent samples): Recruit 30 students who are representative of the
population of USC male undergraduates who exercise/lift weights. For a single weight
training session, we will
• assign 15 students to take creatine
• assign 15 students an innocuous substance that looks like creatine (but has no
positive/negative effect on performance).
For each student, we will record

Y = maximum bench press weight (MBPW).
We will then have two samples of data (with n₁ = 15 and n₂ = 15):

Sample 1 (Creatine):  Y₁₁, Y₁₂, ..., Y₁ₙ₁
Sample 2 (Control):   Y₂₁, Y₂₂, ..., Y₂ₙ₂.

To compare the population means

μ₁ = population mean MBPW for students taking creatine
μ₂ = population mean MBPW for students not taking creatine,

we could construct a two-sample t confidence interval for μ₁ − μ₂ using

(Ȳ₁₊ − Ȳ₂₊) ± t_{n₁+n₂−2,α/2} √[ S_p² (1/n₁ + 1/n₂) ]

or

(Ȳ₁₊ − Ȳ₂₊) ± t_{ν,α/2} √( S₁²/n₁ + S₂²/n₂ ),

depending on our underlying assumptions about σ₁² and σ₂².


Design 2 (Matched Pairs): Recruit 15 students who are representative of the
population of USC male undergraduates who exercise/lift weights.
• Each student will be assigned first to take either creatine or the control substance.
For each student, we will then record his value of Y (MBPW).
• After a period of recovery (e.g., 1 week), we will then have each student take the
other treatment (creatine/control) and record his value of Y again (but now on
the other treatment).
In other words, for each individual student, we will measure Y under both conditions.
NOTE: In Design 2, because MBPW measurements are taken on the same student, the
difference between the measurements (creatine/control) should be less variable than the
difference between a creatine measurement on one student and a control measurement
on a different student.
• In other words, the student-to-student variation inherent in the latter difference
is not present in the difference between MBPW measurements taken on the same
individual student.
MATCHED PAIRS: In general, by obtaining a pair of measurements on a single individual
(e.g., student, raw material, machine, etc.), where one measurement corresponds to
Treatment 1 and the other measurement corresponds to Treatment 2, you eliminate
variation among the individuals. This allows you to compare the two experimental
conditions (e.g., creatine/control, biodegradability treatments, operators, etc.) under more
homogeneous conditions where only variation within individuals is present (that is, the
variation arising from the difference in the two experimental conditions).

Table 4.1: Creatine example. Sources of variation in the two independent sample and
matched pairs designs.

Design                      Sources of Variation
Two Independent Samples     among students, within students
Matched Pairs               within students

ADVANTAGE: When you remove extra variability, this enables you to do a better job at
comparing the two experimental conditions (treatments). By "better job," I mean you
can more precisely estimate the difference between the treatments (excess variability
that naturally arises among individuals is not getting in the way). This gives you a better
chance of identifying a difference between the treatments if one really exists.
NOTE: In matched pairs experiments, it is important to randomize the order in which
treatments are assigned. This may eliminate common patterns that may be seen when
always following, say, Treatment 1 with Treatment 2. In practice, the experimenter could
flip a fair coin to determine which treatment is applied first.
IMPLEMENTATION: Data from matched pairs experiments are analyzed by examining
the difference in responses of the two treatments. Specifically, compute

Dⱼ = Y₁ⱼ − Y₂ⱼ,

for each individual j = 1, 2, ..., n. After doing this, we have essentially created a one-sample
problem, where our data are

D₁, D₂, ..., Dₙ,

the so-called data differences. The one-sample 100(1−α) percent confidence interval

D̄ ± t_{n−1,α/2} (S_D/√n),

where D̄ and S_D are the sample mean and sample standard deviation of the differences,
respectively, is an interval estimate for

μ_D = mean difference between the 2 treatments.

Table 4.2: Creatine data. Maximum bench press weight (in lbs) for creatine and control
treatments with 15 students. Note: These are not real data.

Student j   Creatine MBPW   Control MBPW   Difference (Dⱼ = Y₁ⱼ − Y₂ⱼ)
 1          230             200             30
 2          140             155            −15
 3          215             205             10
 4          190             190              0
 5          200             170             30
 6          230             225              5
 7          220             200             20
 8          255             260             −5
 9          220             240            −20
10          200             195              5
11           90             110            −20
12          130             105             25
13          255             230             25
14           80              85             −5
15          265             255             10

INTERPRETATION: The parameter μ_D describes the difference in means for the two
treatment groups. If there are no differences between the two treatments, μ_D = 0.
• If the confidence interval for μ_D includes 0, this does not suggest that the two treatment
means are different.
• If the confidence interval for μ_D does not include 0, it does.
To analyze the creatine data, let's compute a 95 percent confidence interval for μ_D:

> t.test(diff,conf.level=0.95)$conf.int
[1] -3.227946 15.894612

Interpretation: We are 95 percent confident that the mean difference in MBPW
is between −3.2 and 15.9 lb. Because this interval includes 0, this does not suggest
that taking creatine leads to a larger mean MBPW.
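In the call above, diff is the vector of the 15 differences from Table 4.2. Here is a
minimal sketch of how it could be formed, assuming the two columns are stored in
hypothetical vectors creatine and control:

> diff = creatine - control   ## D_j = Y_1j - Y_2j; note this name masks base R's diff() function
> t.test(diff,conf.level=0.95)$conf.int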
4.12 One-way analysis of variance

REVIEW: So far in this chapter, we have discussed confidence intervals for a single
population mean μ and for the difference of two population means μ₁ − μ₂. When there
are two means, we have recently seen that the design of the experiment/study completely
determines how the data are to be analyzed.
• When the two samples are independent, this is called a (two) independent-sample
design.
• When the two samples are obtained on the same individuals (so that the samples
are dependent), this is called a matched pairs design.
Confidence interval procedures for μ₁ − μ₂ depend on the design of the study.
TERMINOLOGY: More generally, the purpose of an experiment is to investigate
differences between or among two or more treatments. In a statistical framework, we do
this by comparing the population means of the responses to each treatment.
• In order to detect treatment mean differences, we must try to control the effects
of error so that any variation we observe can be attributed to the effects of the
treatments rather than to differences among the individuals.
BLOCKING: Designs involving meaningful grouping of individuals, that is, blocking,
can help reduce the effects of experimental error by identifying systematic components
of variation among individuals.
• The matched pairs design for comparing two treatments is an example of such a
design. In this situation, individuals themselves are treated as blocks.
The analysis of data from experiments involving blocking will not be covered in this
course (see, e.g., STAT 506, STAT 525, and STAT 706). We focus herein on a simpler
setting, that is, a one-way classification model. This is an extension of the two
independent-sample design to more than two treatments.

ONE-WAY CLASSIFICATION: Consider an experiment to compare t ≥ 2 treatments
set up as follows:
• We obtain a random sample of individuals and randomly assign them to treatments.
Samples corresponding to the treatment groups are independent (i.e., the
individuals in each treatment sample are unrelated).
• In observational studies (where no treatment is physically applied to individuals),
individuals are inherently different to begin with. We therefore simply take
random samples from each treatment population.
• We do not attempt to group individuals according to some other factor (e.g., location,
gender, weight, variety, etc.). This would be an example of blocking.
MAIN POINT: In a one-way classification design, the only way in which individuals are
"classified" is by the treatment group assignment. Hence, such an arrangement is called
a one-way classification. When individuals are thought to be "basically alike" (other
than the possible effect of treatment), experimental error consists only of the variation
among the individuals themselves, that is, there are no other systematic sources of
variation.
Example 4.17. Four types of mortars: (1) ordinary cement mortar (OCM), (2) polymer
impregnated mortar (PIM), (3) resin mortar (RM), and (4) polymer cement mortar (PCM),
were subjected to a compression test to measure strength (MPa). Here are the strength
measurements taken on different mortar specimens (36 in all).

OCM:  51.45  42.96  41.11  48.06  38.27  38.88  42.74  49.62
PIM:  64.97  64.21  57.39  52.79  64.87  53.27  51.24  55.87  61.76  67.15
RM:   48.95  62.41  52.11  60.45  58.07  52.16  61.71  61.06  57.63  56.80
PCM:  35.28  38.59  48.64  50.99  51.52  52.85  46.75  48.31

Side-by-side boxplots of the data are in Figure 4.17.

Figure 4.17: Boxplots of strength measurements (MPa) for four mortar types.

In this example,
• Treatment = mortar type (OCM, PIM, RM, and PCM). There are t = 4 treatment
groups.
• Individuals = mortar specimens.
• This is an example of an observational study, not an experiment. That is, we do
not physically apply a treatment here; instead, the mortar specimens are inherently
different to begin with. We simply take random samples of each mortar type.
QUERY: An initial question that we might have is the following:
Are the treatment (mortar type) population means equal? Or, are the treatment
population means different?

This question can be answered by performing a hypothesis test, that is, by testing

H₀: μ₁ = μ₂ = μ₃ = μ₄
versus
Hₐ: the population means μᵢ are not all equal.

GOAL: We now develop a statistical procedure that allows us to test this type of
hypothesis in a one-way classification model.

4.12.1 Overall F test

NOTATION: Let t denote the number of treatments to be compared. Define

Y_{ij} = response on the jth individual in the ith treatment group,

for i = 1, 2, ..., t and j = 1, 2, ..., nᵢ.
• nᵢ is the number of replications for treatment i. When n₁ = n₂ = ··· = nₜ = n, we
say the design is balanced; otherwise, the design is unbalanced.
• Let N = n₁ + n₂ + ··· + nₜ denote the total number of individuals measured. If the
design is balanced, then N = nt.
• Define

Ȳᵢ₊ = (1/nᵢ) Σ_{j=1}^{nᵢ} Y_{ij}
Sᵢ² = [1/(nᵢ−1)] Σ_{j=1}^{nᵢ} (Y_{ij} − Ȳᵢ₊)²
Ȳ₊₊ = (1/N) Σ_{i=1}^{t} Σ_{j=1}^{nᵢ} Y_{ij}.

The statistics Ȳᵢ₊ and Sᵢ² are simply the sample mean and sample variance, respectively,
of the ith sample. The statistic Ȳ₊₊ is the sample mean of all the data
(across all t treatment groups).

STATISTICAL HYPOTHESIS: Our goal is to develop a procedure to test

H₀: μ₁ = μ₂ = ··· = μₜ
versus
Hₐ: the population means μᵢ are not all equal.

• The null hypothesis H₀ says that there is no treatment difference, that is, all
treatment population means are the same.
• The alternative hypothesis Hₐ says that a difference among the t population
means exists somewhere (but does not specify how the means are different).
When performing a hypothesis test, the goal is to decide which hypothesis is more
supported by the observed data.
ASSUMPTIONS: We have independent random samples from t ≥ 2 normal distributions,
each of which has the same variance (but possibly different means):

Sample 1:  Y₁₁, Y₁₂, ..., Y₁ₙ₁ ~ N(μ₁, σ²)
Sample 2:  Y₂₁, Y₂₂, ..., Y₂ₙ₂ ~ N(μ₂, σ²)
  ⋮
Sample t:  Yₜ₁, Yₜ₂, ..., Yₜₙₜ ~ N(μₜ, σ²).

The procedure we develop is formulated by deriving two estimators for σ². These two
estimators are formed by (1) looking at the variance of the observations within samples,
and (2) looking at the variance of the sample means across the t samples.
WITHIN ESTIMATOR: To estimate σ² within samples, we take a weighted average
(weighted by the sample sizes) of the t sample variances; that is, we pool all variance
estimates together to form one estimate. Define

SS_res = (n₁−1)S₁² + (n₂−1)S₂² + ··· + (nₜ−1)Sₜ² = Σ_{i=1}^{t} Σ_{j=1}^{nᵢ} (Y_{ij} − Ȳᵢ₊)².

We call SS_res the residual sum of squares. Mathematics shows that

E(SS_res/σ²) = N − t  ⟹  E(MS_res) = σ²,

where

MS_res = SS_res/(N − t).

IMPORTANT: MS_res is an unbiased estimator of σ² regardless of whether or not H₀ is
true. We call MS_res the residual mean squares.
ACROSS ESTIMATOR: To derive the across-sample estimator, we assume a common
sample size n₁ = n₂ = ··· = nₜ = n (to simplify notation). Recall that if a sample
arises from a normal population, then the sample mean is also normally distributed, i.e.,

Ȳᵢ₊ ~ N(μᵢ, σ²/n).

NOTE: If all the treatment population means are equal, that is,

H₀: μ₁ = μ₂ = ··· = μₜ = μ, say,

is true, then

Ȳᵢ₊ ~ N(μ, σ²/n).

If H₀ is true, then the t sample means Ȳ₁₊, Ȳ₂₊, ..., Ȳₜ₊ are a random sample of size t
from a normal distribution with mean μ and variance σ²/n. The sample variance of this
"random sample" is given by

[1/(t−1)] Σ_{i=1}^{t} (Ȳᵢ₊ − Ȳ₊₊)²

and has expectation

E{ [1/(t−1)] Σ_{i=1}^{t} (Ȳᵢ₊ − Ȳ₊₊)² } = σ²/n.

Therefore,

MS_trt = SS_trt/(t−1) = [1/(t−1)] Σ_{i=1}^{t} n(Ȳᵢ₊ − Ȳ₊₊)²

is an unbiased estimator of σ²; i.e., E(MS_trt) = σ², when H₀ is true.



TERMINOLOGY: We call SS_trt the treatment sums of squares and MS_trt the
treatment mean squares. MS_trt is our second point estimator for σ². Recall that MS_trt
is an unbiased estimator of σ² only when H₀: μ₁ = μ₂ = ··· = μₜ is true (this is
important!). If we have different sample sizes, we simply adjust MS_trt to

MS_trt = SS_trt/(t−1) = [1/(t−1)] Σ_{i=1}^{t} nᵢ(Ȳᵢ₊ − Ȳ₊₊)².

This is still an unbiased estimator for σ² when H₀ is true.

MOTIVATION: When H₀ is true (i.e., the treatment means are the same), then

E(MS_trt) = σ²  and  E(MS_res) = σ².

These two facts suggest that when H₀ is true,

F = MS_trt/MS_res ≈ 1.

When H₀ is not true (i.e., the treatment means are different), then

E(MS_trt) > σ²  and  E(MS_res) = σ².

These two facts suggest that when H₀ is not true,

F = MS_trt/MS_res > 1.

SAMPLING DISTRIBUTION: When H₀ is true, the F statistic

F = MS_trt/MS_res ~ F(t − 1, N − t).

DECISION: We reject H₀ and conclude the treatment population means are different
if the F statistic is far out in the right tail of the F(t − 1, N − t) distribution. Why?
Because a large value of F is not consistent with H₀ being true! Large values of F (far
out in the right tail) are more consistent with Hₐ.
Figure 4.18: The F(3, 32) probability density function. This is the distribution of F in
Example 4.17 if H₀ is true. An arrow at F = 16.848 has been added.

MORTAR DATA: We now use R to calculate the F statistic for the strength/mortar
type data in Example 4.17.

> anova(lm(strength ~ mortar.type))
Analysis of Variance Table

            Df  Sum Sq Mean Sq F value    Pr(>F)
mortar.type  3 1520.88  506.96  16.848 9.576e-07 ***
Residuals   32  962.86   30.09

CONCLUSION: F = 16.848 is not an observation we would expect from the F(3, 32)
distribution (the distribution of F when H₀ is true); see Figure 4.18. Therefore, we reject
H₀ and conclude the population mean strengths for the four mortar types are different.
In other words, the evidence from the data suggests that Hₐ is true; not H₀.

TERMINOLOGY: As we have just seen (from the recent R analysis), it is common to
display one-way classification results in an ANOVA table. The form of the ANOVA
table for the one-way classification is given below:

Source       df     SS          MS                       F
Treatments   t−1    SS_trt      MS_trt = SS_trt/(t−1)    F = MS_trt/MS_res
Residuals    N−t    SS_res      MS_res = SS_res/(N−t)
Total        N−1    SS_total

It is easy to show that

SS_total = SS_trt + SS_res.

• SS_total measures how observations vary about the overall mean, without regard to
treatments; that is, it measures the total variation in all the data. SS_total can be
partitioned into two components:
  – SS_trt measures how much of the total variation is due to the treatments
  – SS_res measures what is left over, which we attribute to inherent variation
among the individuals.
• Degrees of freedom (df) add down.
• Mean squares (MS) are formed by dividing sums of squares by the corresponding
degrees of freedom.

TERMINOLOGY: The probability value (p-value) for a hypothesis test measures
how much evidence we have against H₀. It is important to remember the following:

the smaller the p-value ⟹ the more evidence against H₀.

MORTAR DATA: For the strength/mortar type data in Example 4.17 (from the R output),
we see that

p-value = 0.0000009576.

This is obviously quite small, which suggests that we have an enormous amount of
evidence against H₀.
• In this example, this p-value is calculated as the area to the right of F = 16.848 on
the F(3, 32) probability density function.
• Therefore, this probability is interpreted as follows: If H₀ is true, this is the
probability that we would get a test statistic equal to or larger than F = 16.848.
Since this is extremely unlikely (p-value = 0.0000009576), this strongly suggests
that H₀ is not true.
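As a sketch, the same p-value can be recovered directly in R from the F(3, 32)
distribution (pf is the F cdf in base R):

> 1 - pf(16.848, 3, 32)  ## right-tail area; approx 9.576e-07, matching the ANOVA output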
P-VALUE RULES: Probability values are used in more general hypothesis test settings
in statistics (not just in one-way classification).
Q: How low does a p-value have to get before we reject H₀?
A: Unfortunately, there is no "right" answer to this question. What is commonly done
is the following.
• First choose a significance level α that is "small." This represents the probability
that we will reject a true H₀, that is,

α = P(Reject H₀ | H₀ true).

Common values of α chosen beforehand are α = 0.10, α = 0.05 (the most common),
and α = 0.01.
• The smaller the α is chosen to be, the more evidence one requires to reject H₀.
This is a true statement because of the following well-known decision rule:

p-value < α ⟹ reject H₀.

• Therefore, the value of α chosen by the experimenter determines how low the p-value
must get before H₀ is ultimately rejected.
For the strength/mortar type data, there is no ambiguity in our decision. For other
situations (e.g., p-value = 0.052), the decision may not be as clear cut.
4.12.2 Follow-up analysis: Tukey pairwise confidence intervals

RECALL: In a one-way classification, the overall F test is used to test the hypotheses:

H₀: μ₁ = μ₂ = ··· = μₜ
versus
Hₐ: the population means μᵢ are not all equal.

QUESTION: If we do reject H₀, have we really learned anything that is all that
relevant? All we have learned is that at least one of the population treatment means
is different. We have no idea which one(s) or how many. In this light, rejecting H₀ is
largely an uninformative conclusion.
FOLLOW-UP ANALYSES: If H₀ is rejected, that is, we conclude at least one population
treatment mean μᵢ is different, the obvious game becomes determining which one(s) and
how they are different. To do this, we will construct Tukey pairwise confidence
intervals for all population treatment mean differences μᵢ − μᵢ′, 1 ≤ i < i′ ≤ t. If there
are t treatments, then there are

(t choose 2) = t(t−1)/2

pairwise confidence intervals to construct. For example, with the strength/mortar type
example, there are t = 4 treatments and 6 pairwise intervals:
μ₁ − μ₂,  μ₁ − μ₃,  μ₁ − μ₄,  μ₂ − μ₃,  μ₂ − μ₄,  μ₃ − μ₄,

where

μ₁ = population mean strength for mortar type OCM
μ₂ = population mean strength for mortar type PIM
μ₃ = population mean strength for mortar type RM
μ₄ = population mean strength for mortar type PCM.
PROBLEM: If we construct multiple confidence intervals (here, 6 of them), and if we
construct each one at the 100(1−α) percent confidence level, then the overall confidence
level for all 6 intervals will be less than 100(1−α) percent. In statistics, this is known
as the multiple comparisons problem.
GOAL: Construct confidence intervals for all pairwise differences μᵢ − μᵢ′, 1 ≤ i < i′ ≤ t,
and have our overall confidence level still be at 100(1−α) percent.
SOLUTION: Simply increase the confidence level associated with each individual interval!
Tukey's method is designed to do this. Even better, R automates the construction
of Tukey's intervals. The intervals are of the form:

(Ȳᵢ₊ − Ȳᵢ′₊) ± q_{t,N−t,α} √[ MS_res (1/nᵢ + 1/nᵢ′) ],

where q_{t,N−t,α} is the Tukey quantile which gives an overall confidence level of 100(1−α)
percent (overall for the set of all possible pairwise intervals).
MORTAR DATA: For the strength/mortar type data in Example 4.17, the R output
below gives all pairwise intervals. Note that the overall confidence level is 95 percent.

> TukeyHSD(aov(lm(strength ~ mortar.type)),conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lm(strength ~ mortar.type))

$mortar.type
            diff       lwr       upr     p adj
PCM-OCM  2.48000 -4.950955  9.910955 0.8026758
PIM-OCM 15.21575  8.166127 22.265373 0.0000097
RM-OCM  12.99875  5.949127 20.048373 0.0001138
PIM-PCM 12.73575  5.686127 19.785373 0.0001522
RM-PCM  10.51875  3.469127 17.568373 0.0016850
RM-PIM  -2.21700 -8.863448  4.429448 0.8029266

ANALYSIS: In the R output, the columns labeled lwr and upr give, respectively, the
lower and upper limits of the pairwise confidence intervals.

• We are (at least) 95 percent confident that the difference between the population
mean strengths for the PCM and OCM mortars is between −4.95 and 9.91 MPa.
  – Note that this confidence interval includes 0, which suggests that these two
population means are not different.
  – An equivalent finding is that the adjusted p-value, given in the p adj column, is large.
• We are (at least) 95 percent confident that the difference between the population
mean strengths for the PIM and OCM mortars is between 8.17 and 22.27 MPa.
  – Note that this confidence interval does not include 0, which suggests that
these two population means are different.
  – An equivalent finding is that the adjusted p-value, given in the p adj column, is small.
• Interpretations for the remaining 4 confidence intervals are formed similarly.
The main point is this:
• If a pairwise confidence interval (for two population means) includes 0, then
these population means are declared not to be different.
• If a pairwise interval does not include 0, then the population means are
declared to be different.
• The conclusions we make for all possible pairwise comparisons are at the
100(1−α) percent confidence level.
Therefore, for the strength/mortar type data, the following pairs of population
means are declared to be different:

PIM-OCM  RM-OCM  PIM-PCM  RM-PCM.

The following pairs of population means are declared to be not different:

PCM-OCM  RM-PIM.
6 Linear regression

Complementary reading: Chapter 6 (VK); Sections 6.1-6.4.

6.1 Introduction

IMPORTANCE: A problem that arises in engineering, economics, medicine, and other
areas is that of investigating the relationship between two (or more) variables. In such
settings, the goal is to model a continuous random variable Y as a function of one or more
independent variables, say, x₁, x₂, ..., x_k. Mathematically, we can express this model as

Y = g(x₁, x₂, ..., x_k) + ε,

where g: ℝᵏ → ℝ. This is called a regression model.
• The presence of the (random) error ε conveys the fact that the relationship between
the dependent variable Y and the independent variables x₁, x₂, ..., x_k through g is
not deterministic. Instead, the term ε absorbs all variation in Y that is not
explained by g(x₁, x₂, ..., x_k).
LINEAR MODELS: In this course, we will consider models of the form

Y = β₀ + β₁x₁ + β₂x₂ + ··· + β_k x_k + ε,

where g(x₁, x₂, ..., x_k) = β₀ + β₁x₁ + ··· + β_k x_k; that is, g is a linear function of
β₀, β₁, ..., β_k. We call this a linear regression model.
• The response variable Y is assumed to be random (but we do get to observe its
value).
• The regression parameters β₀, β₁, ..., β_k are assumed to be fixed and unknown.
• The independent variables x₁, x₂, ..., x_k are assumed to be fixed (not random).
• The error term ε is assumed to be random (and not observed).

DESCRIPTION: More precisely, we call a regression model a linear regression model
if the regression parameters enter the g function in a linear fashion. For example, each
of these models is a linear regression model:

Y = β₀ + β₁x + ε,                     with g(x) = β₀ + β₁x
Y = β₀ + β₁x + β₂x² + ε,              with g(x) = β₀ + β₁x + β₂x²
Y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε,    with g(x₁, x₂) = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂.

Main point: The term "linear" does not refer to the shape of the regression function g.
It refers to how the regression parameters β₀, β₁, ..., β_k enter the g function.

6.2 Simple linear regression

TERMINOLOGY: A simple linear regression model includes only one independent
variable x and is of the form

Y = β₀ + β₁x + ε.

The regression function

g(x) = β₀ + β₁x

is a straight line with intercept β₀ and slope β₁. If E(ε) = 0, then

E(Y) = E(β₀ + β₁x + ε) = β₀ + β₁x + E(ε) = β₀ + β₁x.

Therefore, we have these interpretations for the regression parameters β₀ and β₁:
• β₀ quantifies the mean of Y when x = 0.
• β₁ quantifies the change in E(Y) brought about by a one-unit change in x.


Example 6.1. As part of a waste removal project, a new compression machine for
processing sewage sludge is being studied. In particular, engineers are interested in the
following variables:

Y = moisture content of compressed pellets (measured as a percent)
x = machine filtration rate (kg-DS/m/hr).

Engineers collect n = 20 observations of (x, Y); the data are given below.
Obs     x       Y        Obs     x       Y
 1    125.3   77.9       11    159.5   79.9
 2     98.2   76.8       12    145.8   79.0
 3    201.4   81.5       13     75.1   76.7
 4    147.3   79.8       14    151.4   78.2
 5    145.9   78.2       15    144.2   79.5
 6    124.7   78.3       16    125.0   78.1
 7    112.2   77.5       17    198.8   81.5
 8    120.2   77.0       18    132.5   77.0
 9    161.2   80.1       19    159.6   79.0
10    178.9   80.2       20    110.7   78.6

Table 6.1: Sewage data. Moisture (Y, measured as a percentage) and machine filtration
rate (x, measured in kg-DS/m/hr). There are n = 20 observations.
Figure 6.1 displays the data in a scatterplot. This is the most common graphical display
for bivariate data like those seen above. From the plot, we see that
• the variables Y and x are positively related, that is, an increase in x tends to be
associated with an increase in Y.
• the variables Y and x are linearly related, although there is a large amount of
variation that is unexplained.
• this is an example where a simple linear regression model may be adequate.
Figure 6.1: Scatterplot of pellet moisture Y (measured as a percentage) as a function of
machine filtration rate x (measured in kg-DS/m/hr).

6.2.1 Least squares estimation

TERMINOLOGY: When we say "fit a regression model," we mean that we would like
to estimate the regression parameters in the model with the observed data. Suppose that
we collect (xᵢ, Yᵢ), i = 1, 2, ..., n, and postulate the simple linear regression model

Yᵢ = β₀ + β₁xᵢ + εᵢ,

for each i = 1, 2, ..., n. Our first goal is to estimate β₀ and β₁. Formal assumptions for
the error terms εᵢ will be given later.
LEAST SQUARES: A widely-accepted method of estimating the model parameters β₀
and β₁ is least squares. The method of least squares says to choose the values of β₀
and β₁ that minimize

Q(β₀, β₁) = Σ_{i=1}^{n} [Yᵢ − (β₀ + β₁xᵢ)]².

Denote the least squares estimators by b₀ and b₁, respectively, that is, the values of β₀
and β₁ that minimize Q(β₀, β₁). A two-variable minimization argument can be used to
find closed-form expressions for b₀ and b₁. Taking partial derivatives of Q(β₀, β₁) and
setting each equal to zero, we obtain

∂Q(β₀, β₁)/∂β₀ = −2 Σ_{i=1}^{n} (Yᵢ − β₀ − β₁xᵢ) = 0
∂Q(β₀, β₁)/∂β₁ = −2 Σ_{i=1}^{n} (Yᵢ − β₀ − β₁xᵢ)xᵢ = 0.

Solving for β₀ and β₁ gives the least squares estimators

b₀ = Ȳ − b₁x̄
b₁ = Σ_{i=1}^{n} (xᵢ − x̄)(Yᵢ − Ȳ) / Σ_{i=1}^{n} (xᵢ − x̄)² = SS_xy/SS_xx.
In real life, it is rarely necessary to calculate b₀ and b₁ by hand, although VK (Example
6.4, pp 379) does give an example of hand calculation. R automates the entire model
fitting process and subsequent analysis.
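That said, the closed-form expressions are easy to verify directly; this sketch computes
b₀ and b₁ from the moisture and filtration.rate vectors used below:

> Sxy = sum((filtration.rate-mean(filtration.rate))*(moisture-mean(moisture)))
> Sxx = sum((filtration.rate-mean(filtration.rate))^2)
> b1 = Sxy/Sxx
> b0 = mean(moisture) - b1*mean(filtration.rate)
> c(b0, b1)  ## should reproduce 72.959 and 0.041 (to 3 dp)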
Example 6.1 (continued). We now use R to calculate the equation of the least squares
regression line for the sewage sludge data in Example 6.1. Here is the output:

> fit = lm(moisture~filtration.rate)
> fit

Call:
lm(formula = moisture ~ filtration.rate)

Coefficients:
    (Intercept)  filtration.rate
       72.95855          0.04103

From the output, we see that the least squares estimates (to 3 dp) for the sewage data are

b₀ = 72.959
b₁ = 0.041.
Figure 6.2: Scatterplot of pellet moisture Y (measured as a percentage) as a function of
filtration rate x (measured in kg-DS/m/hr). The least squares line has been added.

Therefore, the equation of the least squares line that relates moisture percentage Y to
the filtration rate x is

Ŷ = 72.959 + 0.041x,

or, in other words,

Moisture-hat = 72.959 + 0.041 × (Filtration rate).

NOTE: Your authors call the least squares line the prediction equation. This is
because we can predict the value of Y (moisture) for any value of x (filtration rate). For
example, when the filtration rate is x = 150 kg-DS/m/hr, we would predict the moisture
percentage to be

Ŷ(150) = 72.959 + 0.041(150) ≈ 79.109.
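The same prediction is available from R's predict function applied to the fitted lm
object (a sketch; fit is the object created above):

> predict(fit, newdata=data.frame(filtration.rate=150))  ## approx 79.11 (the hand calculation uses rounded coefficients)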
6.2.2 Model assumptions and properties of least squares estimators

INTEREST: We wish to investigate the properties of b₀ and b₁ as estimators of the true
regression parameters β₀ and β₁ in the simple linear regression model

Yᵢ = β₀ + β₁xᵢ + εᵢ,

for i = 1, 2, ..., n. To do this, we need assumptions on the error terms εᵢ. Specifically, we
will assume throughout that
• E(εᵢ) = 0, for i = 1, 2, ..., n
• var(εᵢ) = σ², for i = 1, 2, ..., n, that is, the variance is constant
• the random variables εᵢ are independent
• the random variables εᵢ are normally distributed.
IMPLICATION: Under these assumptions,

Yᵢ ~ N(β₀ + β₁xᵢ, σ²).

Fact 1. The least squares estimators b₀ and b₁ are unbiased estimators of β₀ and β₁,
respectively, that is,

E(b₀) = β₀
E(b₁) = β₁.
Fact 2. The least squares estimators b₀ and b₁ have the following sampling distributions:

b₀ ~ N(β₀, c₀₀σ²)  and  b₁ ~ N(β₁, c₁₁σ²),

where

c₀₀ = 1/n + x̄²/SS_xx  and  c₁₁ = 1/SS_xx.

Knowing these sampling distributions is critical if we want to write confidence intervals
and perform hypothesis tests for β₀ and β₁.
6.2.3 Estimating the error variance σ²

GOAL: In the simple linear regression model

Yᵢ = β₀ + β₁xᵢ + εᵢ,

where εᵢ ~ N(0, σ²), we now turn our attention to estimating σ², the error variance.
TERMINOLOGY: In the simple linear regression model, define the ith fitted value by

Ŷᵢ = b₀ + b₁xᵢ,

where b₀ and b₁ are the least squares estimators. Each observation has its own fitted
value. Geometrically, an observation's fitted value is the (perpendicular) projection of
its Y value, upward or downward, onto the least squares line.
TERMINOLOGY: We define the ith residual by

eᵢ = Yᵢ − Ŷᵢ.

Each observation has its own residual. Geometrically, an observation's residual is the
vertical distance (i.e., length) between its Y value and its fitted value.
• If an observation's Y value is above the least squares regression line, its residual is
positive.
• If an observation's Y value is below the least squares regression line, its residual is
negative.

INTERESTING: In the simple linear regression model (provided that the model includes
an intercept term β₀), we have the following algebraic result:

Σ_{i=1}^{n} eᵢ = Σ_{i=1}^{n} (Yᵢ − Ŷᵢ) = 0,

that is, the sum of the residuals (from a least squares fit) is equal to zero.
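This zero-sum property is easy to check numerically on the sewage fit (a sketch;
residuals() is base R and fit is the lm object from before):

> sum(residuals(fit))  ## essentially zero, up to floating-point rounding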

Obs     x       Y      Ŷ = b₀ + b₁x   e = Y − Ŷ     Obs     x       Y      Ŷ = b₀ + b₁x   e = Y − Ŷ
 1    125.3   77.9       78.100        −0.200       11    159.5   79.9       79.503         0.397
 2     98.2   76.8       76.988        −0.188       12    145.8   79.0       78.941         0.059
 3    201.4   81.5       81.223         0.277       13     75.1   76.7       76.040         0.660
 4    147.3   79.8       79.003         0.797       14    151.4   78.2       79.171        −0.971
 5    145.9   78.2       78.945        −0.745       15    144.2   79.5       78.876         0.624
 6    124.7   78.3       78.075         0.225       16    125.0   78.1       78.088         0.012
 7    112.2   77.5       77.563        −0.062       17    198.8   81.5       81.116         0.384
 8    120.2   77.0       77.891        −0.891       18    132.5   77.0       78.396        −1.396
 9    161.2   80.1       79.573         0.527       19    159.6   79.0       79.508        −0.508
10    178.9   80.2       80.299        −0.099       20    110.7   78.6       77.501         1.099

Table 6.2: Sewage data. Fitted values and residuals from the least squares fit.
SEWAGE DATA: In Table 6.2, I have used R to calculate the fitted values and residuals for each of the n = 20 observations in the sewage sludge data set.
TERMINOLOGY: We define the residual sum of squares by
SSres = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (Yi − Ŷi)².

Fact 3. In the simple linear regression model,
MSres = SSres/(n − 2)
is an unbiased estimator of σ², that is, E(MSres) = σ². The quantity
σ̂ = √MSres = √(SSres/(n − 2))
estimates σ and is called the residual standard error.
SEWAGE DATA: For the sewage data in Example 6.1, we use R to calculate MSres:
> fitted.values = predict(fit)
> residuals = moisture-fitted.values
> # Calculate MS_res
> sum(residuals^2)/18
[1] 0.4426659

For the sewage data, an (unbiased) estimate of the error variance σ² is
MSres ≈ 0.443.
The residual standard error is
σ̂ = √MSres = √0.4426659 ≈ 0.6653.

This estimate can also be seen in the following R output:

> summary(fit)
lm(formula = moisture ~ filtration.rate)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     72.958547   0.697528 104.596  < 2e-16 ***
filtration.rate  0.041034   0.004837   8.484 1.05e-07 ***

Residual standard error: 0.6653 on 18 degrees of freedom
Multiple R-squared: 0.7999,  Adjusted R-squared: 0.7888
F-statistic: 71.97 on 1 and 18 DF,  p-value: 1.052e-07

6.2.4 Inference for β0 and β1

INTEREST: In the simple linear regression model
Yi = β0 + β1xi + εi,
the regression parameters β0 and β1 are unknown. It is therefore of interest to construct confidence intervals and perform hypothesis tests for these parameters.
• In practice, inference for the slope parameter β1 is of primary interest because of its connection to the independent variable x in the model.
• Inference for β0 is less meaningful, unless one is explicitly interested in the mean of Y when x = 0. We will not pursue this.

CONFIDENCE INTERVAL FOR β1: Under our model assumptions, the following sampling distribution arises:
t = (b1 − β1)/√(MSres/SSxx) ∼ t(n − 2).
This result can be used to derive a 100(1 − α) percent confidence interval for β1, which is given by
b1 ± t_{n−2,α/2} √(MSres/SSxx).
The value t_{n−2,α/2} is the upper α/2 quantile from the t(n − 2) distribution. Note the form of the interval:
point estimate ± quantile × standard error,
where the point estimate is b1, the quantile is t_{n−2,α/2}, and the standard error is √(MSres/SSxx). We interpret the interval in the same way.


We are 100(1) percent condent that the population regression slope
1 is in this interval.
When interpreting the interval, of particular interest to us is the value 1 = 0.
If 1 = 0 is in the condence interval, this suggests that Y and x are not
linearly related.
If 1 = 0 is not in the condence interval, this suggests that Y and x are
linearly related.

HYPOTHESIS TEST FOR β1: If our interest was to test
H0: β1 = β1,0
versus
Ha: β1 ≠ β1,0,
where β1,0 is a fixed value (often, β1,0 = 0), we would focus our attention on
t = (b1 − β1,0)/√(MSres/SSxx).

REASONING: If H0: β1 = β1,0 is true, then t arises from a t(n − 2) distribution. We can therefore judge the amount of evidence against H0 by comparing t to this distribution. R automatically calculates t and produces a p-value for the test above. Remember that small p-values are evidence against H0.
Example 6.1 (continued). We now use R to test
H0: β1 = 0 versus Ha: β1 ≠ 0,
for the sewage sludge data in Example 6.1. Note that β1,0 = 0.

> summary(fit)
lm(formula = moisture ~ filtration.rate)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     72.958547   0.697528 104.596  < 2e-16 ***
filtration.rate  0.041034   0.004837   8.484 1.05e-07 ***

ANALYSIS: Figure 6.3 shows the t(18) distribution, that is, the distribution of t when H0: β1 = 0 is true for the sewage sludge example. Clearly, t = 8.484 is not an expected outcome from this distribution (p-value = 0.000000105)! In other words, there is strong evidence that the moisture percentage is linearly related to machine filtration rate.
ANALYSIS: A 95 percent confidence interval for β1 is calculated as follows:
b1 ± t18,0.025 se(b1) = 0.0410 ± 2.1009(0.0048) = (0.0309, 0.0511).
We are 95 percent confident that the population regression slope β1 is between 0.0309 and 0.0511. Note that this interval does not include 0.
> qt(0.975,18)
[1] 2.100922
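For reference, R's built-in confint() function computes this interval directly, a one-line alternative to the hand calculation above:

confint(fit, "filtration.rate", level = 0.95)   # matches (0.0309, 0.0511)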
Figure 6.3: t(18) probability density function. A mark at t = 8.484 has been added.

6.2.5 Confidence and prediction intervals for a given x = x0

INTEREST: Consider the simple linear regression model
Yi = β0 + β1xi + εi,
where εi ∼ N(0, σ²). We are often interested in using the fitted model to learn about the response variable Y at a certain setting for the independent variable, x = x0 say. For example, in our sewage sludge example, we might be interested in the moisture percentage Y when the filtration rate is x = 150 kg-DS/m/hr. Two potential goals arise:
• We might be interested in estimating the mean response of Y when x = x0. This mean response is denoted by E(Y|x0). This value is the mean of the following probability distribution:
Y(x0) ∼ N(β0 + β1x0, σ²).
• We might be interested in predicting a new response Y when x = x0. This predicted response is denoted by Y(x0). This value is a new outcome from
Y(x0) ∼ N(β0 + β1x0, σ²).
In the first problem, we are interested in estimating the mean of the response variable Y at a certain value of x. In the second problem, we are interested in predicting the value of a new random variable Y at a certain value of x. Conceptually, the second problem is far more difficult than the first.
GOALS: We would like to create 100(1 − α) percent intervals for the mean E(Y|x0) and for the new value Y(x0). The former is called a confidence interval (since it is for a mean response) and the latter is called a prediction interval (since it is for a new random variable).
POINT ESTIMATOR/PREDICTOR: To construct either interval, we start with the same quantity:
Ŷ(x0) = b0 + b1x0,
where b0 and b1 are the least squares estimates from the fit of the model.
• In the confidence interval for E(Y|x0), we call Ŷ(x0) a point estimator.
• In the prediction interval for Y(x0), we call Ŷ(x0) a point predictor.
The primary difference in the intervals arises in assessing the variability of Ŷ(x0).
CONFIDENCE INTERVAL: A 100(1 − α) percent confidence interval for the mean E(Y|x0) is given by
Ŷ(x0) ± t_{n−2,α/2} √( MSres [ 1/n + (x0 − x̄)²/SSxx ] ).
PREDICTION INTERVAL: A 100(1 − α) percent prediction interval for the new response Y(x0) is given by
Ŷ(x0) ± t_{n−2,α/2} √( MSres [ 1 + 1/n + (x0 − x̄)²/SSxx ] ).
COMPARISON: The two intervals are identical except for the extra "1" in the standard error part of the prediction interval. This extra "1" arises from the additional uncertainty associated with predicting a new response from the N(β0 + β1x0, σ²) distribution. Therefore, a 100(1 − α) percent prediction interval for Y(x0) will be wider than the corresponding 100(1 − α) percent confidence interval for E(Y|x0).
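The sketch below (assuming the sewage data vectors and the fitted object fit from before) shows how these formulas reproduce R's predict() intervals at x0 = 150:

# Hand calculation of the confidence and prediction intervals at x0 = 150
x = filtration.rate; n = length(x)
MSres = sum(resid(fit)^2)/(n - 2)
SSxx = sum((x - mean(x))^2)
x0 = 150
yhat0 = sum(coef(fit)*c(1, x0))                        # point estimate/predictor
se.ci = sqrt(MSres*(1/n + (x0 - mean(x))^2/SSxx))      # for the mean E(Y|x0)
se.pi = sqrt(MSres*(1 + 1/n + (x0 - mean(x))^2/SSxx))  # for a new Y(x0); note the extra 1
yhat0 + c(-1, 1)*qt(0.975, n - 2)*se.ci   # matches predict(..., interval = "confidence")
yhat0 + c(-1, 1)*qt(0.975, n - 2)*se.pi   # matches predict(..., interval = "prediction")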
INTERVAL LENGTH: The length of both intervals clearly depends on the value of x0. In fact, the standard error of Ŷ(x0) will be smallest when x0 = x̄ and will get larger the farther x0 is from x̄ in either direction. This implies that the precision with which we estimate E(Y|x0) or predict Y(x0) decreases the farther we get away from x̄. This makes intuitive sense: we would expect to have the most confidence in our fitted model near the "center" of the observed data.
TERMINOLOGY: It is sometimes desired to estimate E(Y|x0) or predict Y(x0) based on the fit of the model for values of x0 outside the range of x values used in the experiment/study. This is called extrapolation and can be very dangerous. In order for our inferences to be valid, we must believe that the straight line relationship holds for x values outside the range where we have observed data. In some situations, this may be reasonable. In others, we may have no theoretical basis for making such a claim without data to support it.
Example 6.1 (continued). In our sewage sludge example, suppose that we are interested in estimating E(Y|x0) and predicting a new Y(x0) when the filtration rate is x0 = 150 kg-DS/m/hr.
• E(Y|x0) denotes the mean moisture percentage for compressed pellets when the machine filtration rate is x0 = 150 kg-DS/m/hr. In other words, if we were to repeat the experiment over and over again, each time using a filtration rate of x0 = 150 kg-DS/m/hr, then E(Y|x0) denotes the mean value of Y (moisture percentage) that would be observed.
• Y(x0) denotes a possible value of Y for a single run of the machine when the filtration rate is set at x0 = 150 kg-DS/m/hr.
R automates the calculation of confidence and prediction intervals, as seen below.

> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="confidence")
       fit      lwr      upr
  79.11361 78.78765 79.43958

> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="prediction")
       fit     lwr      upr
  79.11361 77.6783 80.54893

Note that the point estimate (point prediction) is easily calculated:
Ŷ(x0 = 150) = 72.959 + 0.041(150) ≈ 79.11361.
• A 95 percent confidence interval for E(Y|x0 = 150) is (78.79, 79.44). When the filtration rate is x0 = 150 kg-DS/m/hr, we are 95 percent confident that the mean moisture percentage is between 78.79 and 79.44 percent.
• A 95 percent prediction interval for Y(x0 = 150) is (77.68, 80.55). When the filtration rate is x0 = 150 kg-DS/m/hr, we are 95 percent confident that the moisture percentage for a single run of the experiment will be between 77.68 and 80.55 percent.
• Figure 6.4 shows 95 percent confidence bands for E(Y|x0) and 95 percent prediction bands for Y(x0). These are not simultaneous bands (i.e., these are not bands for the entire population regression function).

Figure 6.4: Scatterplot of pellet moisture Y (measured as a percentage) as a function of machine filtration rate x (measured in kg-DS/m/hr). The least squares regression line has been added. Ninety-five percent confidence/prediction bands have been added.

6.3 Multiple linear regression

6.3.1 Introduction

PREVIEW: We have already considered the simple linear regression model
Yi = β0 + β1xi + εi,
for i = 1, 2, ..., n, where εi ∼ N(0, σ²). We now extend this basic model to include multiple independent variables x1, x2, ..., xk. This is much more realistic because, in practice, Y often depends on many different factors (i.e., not just one). Specifically, we consider models of the form
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
for i = 1, 2, ..., n. We call this a multiple linear regression model.
• There are now p = k + 1 regression parameters β0, β1, ..., βk. These are unknown and are to be estimated with the observed data.
• Schematically, we can envision the observed data as follows:
Individual   Y     x1    x2   ···   xk
    1        Y1    x11   x12  ···   x1k
    2        Y2    x21   x22  ···   x2k
    ⋮         ⋮     ⋮     ⋮          ⋮
    n        Yn    xn1   xn2  ···   xnk

• Each of the n individuals contributes a response Y and a value of each of the independent variables x1, x2, ..., xk.
• We continue to assume that εi ∼ N(0, σ²).
• We also assume that the independent variables x1, x2, ..., xk are fixed and measured without error.

PREVIEW: To fit the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
we again use the method of least squares. Simple computing formulae for the least squares estimators are no longer available (as they were in simple linear regression). This is hardly a big deal because we will use computing to automate all analyses. For instructional purposes, it is advantageous to express multiple linear regression models in terms of matrices and vectors. This streamlines notation and makes the presentation easier.
6.3.2 Matrix representation

MATRIX REPRESENTATION: Consider the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
for i = 1, 2, ..., n. Define
Y = (Y1, Y2, ..., Yn)′,  β = (β0, β1, ..., βk)′,  ε = (ε1, ε2, ..., εn)′,
and let X be the n × p matrix whose ith row is (1, xi1, xi2, ..., xik):

        [ 1  x11  x12  ···  x1k ]
        [ 1  x21  x22  ···  x2k ]
    X = [ ⋮   ⋮    ⋮          ⋮ ]
        [ 1  xn1  xn2  ···  xnk ]

With these definitions, the model above can be expressed equivalently as
Y = Xβ + ε.
In this equivalent representation,
• Y is an n × 1 (random) vector of responses
• X is an n × p (fixed) matrix of independent variable measurements (p = k + 1)
• β is a p × 1 (fixed) vector of unknown population regression parameters
• ε is an n × 1 (random) vector of unobserved errors.

LEAST SQUARES: The notion of least squares is the same as it was in the simple linear regression model. To fit a multiple linear regression model, we want to find the values of β0, β1, ..., βk that minimize
Q(β0, β1, ..., βk) = Σ_{i=1}^{n} [Yi − (β0 + β1xi1 + β2xi2 + ··· + βkxik)]²,
or, in matrix notation, the value of β that minimizes
Q(β) = (Y − Xβ)′(Y − Xβ).

Because Q(β) is a scalar function of the p = k + 1 elements of β, it is possible to use calculus to determine the values of the p elements that minimize it. Formally, we can take p partial derivatives, one with respect to each of β0, β1, ..., βk, and set these equal to zero. Using the calculus of matrices, the resulting system of p equations (in p unknowns) can be written as
X′Xβ = X′Y.
These are called the normal equations. Provided that X′X is full rank, the (unique) solution is
b = (X′X)⁻¹X′Y = (b0, b1, b2, ..., bk)′.
This is the least squares estimator of β. The fitted regression model is
Ŷ = Xb,
or, equivalently,
Ŷi = b0 + b1xi1 + b2xi2 + ··· + bkxik,
for i = 1, 2, ..., n.
TECHNICAL NOTE: For the least squares estimator
b = (X′X)⁻¹X′Y
to be unique, we need X to be of full column rank; i.e., r(X) = p = k + 1. This will occur when there are no linear dependencies among the columns of X. If r(X) < p, then X′X does not have a unique inverse, and the normal equations cannot be solved uniquely.
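For a concrete check in R, one can solve the normal equations directly and compare to lm(). The sketch below assumes fit is any fitted lm object; model.matrix() and model.response() are standard functions for extracting X and Y from a fit:

# Sketch: solve (X'X) b = X'Y directly and compare to lm()'s estimates
X = model.matrix(fit)                   # the n x p design matrix
Y = model.response(model.frame(fit))    # the response vector
b = solve(t(X) %*% X, t(X) %*% Y)       # solves the normal equations
cbind(b, coef(fit))                     # the two columns agree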
Example 6.2. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study from the LaTrobe Valley of Victoria, Australia, samples of cheddar cheese were analyzed for their chemical composition and were subjected to taste tests. For each specimen, the taste Y was obtained by combining the scores from several tasters. Data were collected on the following variables:
Y  = taste score (TASTE)
x1 = concentration of acetic acid (ACETIC)
x2 = concentration of hydrogen sulfide (H2S)
x3 = concentration of lactic acid (LACTIC).

Variables ACETIC and H2S were both measured on the log scale. The variable LACTIC has not been transformed. Table 6.3 contains concentrations of the various chemicals in n = 30 specimens of cheddar cheese and the observed taste score.
Table 6.3: Cheese data. ACETIC, H2S, and LACTIC are independent variables. The response variable is TASTE.

Specimen  TASTE  ACETIC    H2S  LACTIC     Specimen  TASTE  ACETIC    H2S  LACTIC
    1      12.3   4.543  3.135    0.86         16     40.9   6.365  9.588    1.74
    2      20.9   5.159  5.043    1.53         17     15.9   4.787  3.912    1.16
    3      39.0   5.366  5.438    1.57         18      6.4   5.412  4.700    1.49
    4      47.9   5.759  7.496    1.81         19     18.0   5.247  6.174    1.63
    5       5.6   4.663  3.807    0.99         20     38.9   5.438  9.064    1.99
    6      25.9   5.697  7.601    1.09         21     14.0   4.564  4.949    1.15
    7      37.3   5.892  8.726    1.29         22     15.2   5.298  5.220    1.33
    8      21.9   6.078  7.966    1.78         23     32.0   5.455  9.242    1.44
    9      18.1   4.898  3.850    1.29         24     56.7   5.855  10.20    2.01
   10      21.0   5.242  4.174    1.58         25     16.8   5.366  3.664    1.31
   11      34.9   5.740  6.142    1.68         26     11.6   6.043  3.219    1.46
   12      57.2   6.446  7.908    1.90         27     26.5   6.458  6.962    1.72
   13       0.7   4.477  2.996    1.06         28      0.7   5.328  3.912    1.25
   14      25.9   5.236  4.942    1.30         29     13.4   5.802  6.685    1.08
   15      54.9   6.151  6.752    1.52         30      5.5   6.176  4.787    1.25
MODEL: Researchers postulate that each of the three chemical composition variables x1, x2, and x3 is important in describing the taste and consider the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi,
for i = 1, 2, ..., 30. We now use R to fit this model using the method of least squares:
> fit = lm(taste~acetic+h2s+lactic)
> fit
Coefficients:
(Intercept)   acetic     h2s   lactic
    -28.877    0.328   3.912   19.670

This output gives the values of the least squares estimates
b = (b0, b1, b2, b3)′ = (−28.877, 0.328, 3.912, 19.670)′.
Therefore, the fitted least squares regression model is
Ŷ = −28.877 + 0.328x1 + 3.912x2 + 19.670x3,
or, in other words,
predicted TASTE = −28.877 + 0.328 ACETIC + 3.912 H2S + 19.670 LACTIC.

6.3.3 Estimating the error variance

GOAL: Consider the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
for i = 1, 2, ..., n, where εi ∼ N(0, σ²). We have just seen how to estimate β0, β1, ..., βk via least squares and how to automate this procedure in R. Our next task is to estimate the error variance σ².


TERMINOLOGY: Define the residual sum of squares by
SSres = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} ei².
In matrix notation, we can write this as
SSres = (Y − Ŷ)′(Y − Ŷ) = (Y − Xb)′(Y − Xb) = e′e.
• The n × 1 vector Ŷ = Xb contains the least squares fitted values.
• The n × 1 vector e = Y − Ŷ contains the least squares residuals.

RESULT: In the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
where εi ∼ N(0, σ²),
MSres = SSres/(n − p)
is an unbiased estimator of σ², that is,
E(MSres) = σ².
The quantity
σ̂ = √MSres = √(SSres/(n − p))
estimates σ and is called the residual standard error.
CHEESE DATA: For the cheese data in Example 6.2, we use R to calculate MSres:
> fitted.values = predict(fit)
> residuals = taste-fitted.values
> # Calculate MS_res
> sum(residuals^2)/26
[1] 102.6299
6.3.4 The hat matrix

TERMINOLOGY: Consider the linear regression model Y = Xβ + ε and define
H = X(X′X)⁻¹X′.
H is called the hat matrix. Important quantities in linear regression can be written as functions of the hat matrix.
• The vector of fitted values can be written as
Ŷ = Xb = X(X′X)⁻¹X′Y = HY.
• The vector of residuals can be written as
e = Y − Ŷ = Y − HY = (I − H)Y,
where I is the n × n identity matrix.
• Interestingly, it turns out that Ŷ and e are orthogonal vectors in Rⁿ.
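A short sketch verifying these facts numerically (again assuming fit is any fitted lm object):

# Sketch: compute the hat matrix and check its properties
X = model.matrix(fit)
Y = model.response(model.frame(fit))
H = X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% Y - fitted(fit)))                    # ~0: HY gives the fitted values
max(abs((diag(nrow(H)) - H) %*% Y - resid(fit)))   # ~0: (I - H)Y gives the residuals
sum(fitted(fit) * resid(fit))                      # ~0: Y-hat and e are orthogonal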

6.3.5 Analysis of variance for linear regression

IDENTITY: Algebraically, it can be shown that
Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} (Yi − Ŷi)²,
that is, SStotal = SSreg + SSres.
• SStotal is the total sum of squares. SStotal is the numerator of the sample variance of Y1, Y2, ..., Yn. It measures the total variation in the response data.
• SSreg is the regression sum of squares. SSreg measures the variation in the response data explained by the linear regression model.
• SSres is the residual sum of squares. SSres measures the variation in the response data not explained by the linear regression model.

Table 6.4: Analysis of variance table for linear regression.

Source       df      SS        MS                       F
Regression   k       SSreg     MSreg = SSreg/k          F = MSreg/MSres
Residual     n − p   SSres     MSres = SSres/(n − p)
Total        n − 1   SStotal

ANOVA TABLE: We can combine all of this information to produce an analysis of variance (ANOVA) table. Such tables are standard in regression analysis.
• The degrees of freedom (df) add down.
  – SStotal can be viewed as a statistic that has "lost" a degree of freedom for having to estimate the overall mean of Y with the sample mean Ȳ. Recall that n − 1 is our divisor in the sample variance S².
  – There are k degrees of freedom associated with SSreg because there are k independent variables.
  – The degrees of freedom for SSres can be thought of as the divisor needed to create an unbiased estimator of σ². Recall that
  MSres = SSres/(n − p) = SSres/(n − k − 1)
  is an unbiased estimator of σ².
• The sums of squares (SS) also add down. This follows from the algebraic identity noted earlier.
• Mean squares (MS) are the sums of squares divided by their degrees of freedom.
• The F statistic is formed by taking the ratio of MSreg and MSres. More on this in a moment.
COEFFICIENT OF DETERMINATION: Since
SStotal = SSreg + SSres,
the proportion of the total variation in the data explained by the linear regression model is
R² = SSreg/SStotal.
This statistic is called the coefficient of determination. Clearly,
0 ≤ R² ≤ 1.
The larger the R², the better the regression model explains the variability in the data.
IMPORTANT: It is critical to understand what R² does and does not measure. Its value is computed under the assumption that the multiple linear regression model is correct, and it assesses how much of the variation in the data may be attributed to that relationship rather than to inherent variation.
• If R² is small, it may be that there is a lot of random inherent variation in the data, so that, although the multiple linear regression model is reasonable, it can explain only so much of the observed overall variation.
• Alternatively, R² may be close to 1 (e.g., in a simple linear regression model fit), but this may not be the best model. In fact, R² could be very high, but ultimately not relevant because it assumes the simple linear regression model is correct. In reality, a better model may exist (e.g., a quadratic model, etc.).
F STATISTIC: The F statistic in the ANOVA table is used to test
H0: β1 = β2 = ··· = βk = 0
versus
Ha: at least one of the βj is nonzero.
In other words, F tests whether or not at least one of the independent variables x1, x2, ..., xk is important in describing the response Y. If H0 is rejected, we do not know which one or how many of the βj's are nonzero; only that at least one is.
SAMPLING DISTRIBUTION: When H0: β1 = β2 = ··· = βk = 0 is true,
F = MSreg/MSres ∼ F(k, n − p).

Therefore, we can gauge the evidence against H0 by comparing F to this distribution. Values of F far out in the right (upper) tail are evidence against H0. R automatically produces the value of F and the corresponding p-value. Recall that small p-values are evidence against H0 (the smaller the p-value, the more evidence).
Example 6.2 (continued). For the cheese data in Example 6.2, we fit the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi,
for i = 1, 2, ..., 30. The ANOVA table, obtained using SAS, is shown below.
                     Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square   F Value   Pr > F
Regression         3  4994.50861  1664.83620     16.22   <.0001
Residual          26  2668.37806   102.62993
Corrected Total   29  7662.88667
The F statistic is used to test
H0: β1 = β2 = β3 = 0
versus
Ha: at least one of the βj is nonzero.
ANALYSIS: Based on the F statistic (F = 16.22) and the corresponding probability value (p-value < 0.0001), we have strong evidence to reject H0. See also Figure 6.5.
Interpretation: We conclude that at least one of the independent variables (ACETIC, H2S, LACTIC) is important in describing taste.
NOTE: The coefficient of determination is
R² = SSreg/SStotal = 4994.51/7662.89 ≈ 0.652.
Interpretation: About 65.2 percent of the variability in the taste data is explained by the linear regression model that includes ACETIC, H2S, and LACTIC. The remaining 34.8 percent of the variability in the taste data is explained by other sources.
Figure 6.5: F(3, 26) probability density function. A mark at F = 16.22 has been added.

IMPORTANT: If we fit this model in R, we get the following:

fit = lm(taste~acetic+h2s+lactic)
anova(fit)
Response: taste
          Df  Sum Sq Mean Sq F value    Pr(>F)
acetic     1 2314.14 2314.14 22.5484 6.528e-05 ***
h2s        1 2147.11 2147.11 20.9209 0.0001035 ***
lactic     1  533.26  533.26  5.1959 0.0310870 *
Residuals 26 2668.38  102.63

NOTE: The convention used by R is to split up the regression sum of squares
SSreg = 4994.50861
into sums of squares for each of the three independent variables ACETIC, H2S, and LACTIC, as they are added sequentially to the model (these are called sequential sums of squares). The sequential sums of squares for the independent variables add to the SSreg for the model (up to rounding error), that is,
SSreg = 4994.51 = 2314.14 + 2147.11 + 533.26
                = SS(ACETIC) + SS(H2S) + SS(LACTIC).
In words,
• SS(ACETIC) is the sum of squares added by ACETIC when compared to a model that includes only an intercept term.
• SS(H2S) is the sum of squares added by H2S when compared to a model that includes an intercept term and ACETIC.
• SS(LACTIC) is the sum of squares added by LACTIC when compared to a model that includes an intercept term, ACETIC, and H2S.
In other words, we can use the sequential sums of squares to assess the impact of adding independent variables ACETIC, H2S, and LACTIC to the model in sequence. One consequence is that the individual sums of squares depend on the order in which the variables enter the model (see the sketch below).
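Because the sums of squares are sequential, reordering the variables in the formula changes the individual SS values, though their total (SSreg) and the residual SS are unchanged. A quick sketch using the cheese data vectors:

# Sequential SS depend on the order of entry
anova(lm(taste ~ acetic + h2s + lactic))   # SS(ACETIC), then SS(H2S | ACETIC), etc.
anova(lm(taste ~ lactic + h2s + acetic))   # same SSreg total, different individual SS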

6.3.6 Inference for individual regression parameters

IMPORTANCE: Consider our multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
for i = 1, 2, ..., n, where εi ∼ N(0, σ²). Confidence intervals and hypothesis tests for βj can help us assess the importance of using the independent variable xj in a model with the other independent variables. That is, inference regarding βj is always conditional on the other variables being included in the model.
CONFIDENCE INTERVALS: A 100(1 − α) percent confidence interval for βj, for j = 0, 1, 2, ..., k, is given by
bj ± t_{n−p,α/2} √(σ̂²cjj),
where
σ̂² = MSres = SSres/(n − p)
is our unbiased estimate of σ² and
cjj = (X′X)⁻¹_jj
is the corresponding diagonal element of the (X′X)⁻¹ matrix.
HYPOTHESIS TESTS: Hypothesis tests for
H0: βj = 0 versus Ha: βj ≠ 0
can be performed by examining the p-value output provided in R.
• If H0: βj = 0 is not rejected, then xj is not important in describing Y in the presence of the other independent variables.
• If H0: βj = 0 is rejected, this means that xj is important in describing Y even after including the effects of the other independent variables.

> summary(fit)
Call: lm(formula = taste ~ acetic + h2s + lactic)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -28.877     19.735  -1.463  0.15540
acetic         0.328      4.460   0.074  0.94193
h2s            3.912      1.248   3.133  0.00425 **
lactic        19.670      8.629   2.279  0.03109 *

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518,  Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.810e-06


OUTPUT: The Estimate output gives the values of the least squares estimates:
b0 ≈ −28.877,  b1 ≈ 0.328,  b2 ≈ 3.912,  b3 ≈ 19.670.
Therefore, the fitted least squares regression model is
Ŷ = −28.877 + 0.328x1 + 3.912x2 + 19.670x3,
or, in other words,
predicted TASTE = −28.877 + 0.328 ACETIC + 3.912 H2S + 19.670 LACTIC.
The Std. Error output gives the estimated standard errors
se(b0) = √(σ̂²c00) = √(σ̂²(X′X)⁻¹₀₀) ≈ 19.735
se(b1) = √(σ̂²c11) = √(σ̂²(X′X)⁻¹₁₁) ≈ 4.460
se(b2) = √(σ̂²c22) = √(σ̂²(X′X)⁻¹₂₂) ≈ 1.248
se(b3) = √(σ̂²c33) = √(σ̂²(X′X)⁻¹₃₃) ≈ 8.629,
where
σ̂² = MSres = SSres/(30 − 4) = (10.13)² ≈ 102.63
is the square of the Residual standard error. The t value output gives the t statistics
t = (b0 − 0)/se(b0) = −1.463
t = (b1 − 0)/se(b1) = 0.074
t = (b2 − 0)/se(b2) = 3.133
t = (b3 − 0)/se(b3) = 2.279.
These t statistics can be used to test H0: βj = 0 versus Ha: βj ≠ 0, for j = 0, 1, 2, 3.
Two-sided probability values are in Pr(>|t|). At the α = 0.05 level,
• we do not reject H0: β0 = 0 (p-value = 0.155). Interpretation: In the model which includes all three independent variables, the intercept term β0 is not statistically different from zero.

• we do not reject H0: β1 = 0 (p-value = 0.942). Interpretation: ACETIC does not significantly add to a model that includes H2S and LACTIC.
• we reject H0: β2 = 0 (p-value = 0.004). Interpretation: H2S does significantly add to a model that includes ACETIC and LACTIC.
• we reject H0: β3 = 0 (p-value = 0.031). Interpretation: LACTIC does significantly add to a model that includes ACETIC and H2S.
CONFIDENCE INTERVALS: Ninety-five percent confidence intervals for the regression parameters β0, β1, β2, and β3, respectively, are
b0 ± t26,0.025 se(b0) = −28.877 ± 2.056(19.735) = (−69.45, 11.70)
b1 ± t26,0.025 se(b1) = 0.328 ± 2.056(4.460) = (−8.84, 9.50)
b2 ± t26,0.025 se(b2) = 3.912 ± 2.056(1.248) = (1.35, 6.48)
b3 ± t26,0.025 se(b3) = 19.670 ± 2.056(8.629) = (1.93, 37.41).
The conclusions reached from interpreting these intervals are the same as those reached using the hypothesis test p-values. Note that the β2 and β3 intervals do not include zero. Those for β0 and β1 do.
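All four intervals can also be obtained in a single call with confint(), using the cheese fit:

confint(fit, level = 0.95)   # one row each for (Intercept), acetic, h2s, lactic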

6.3.7 Confidence and prediction intervals for a given x = x0

GOALS: We would like to create 100(1 − α) percent intervals for the mean E(Y|x0) and for the new value Y(x0). As in the simple linear regression case, the former is called a confidence interval (since it is for a mean response) and the latter is called a prediction interval (since it is for a new random variable).
CHEESE DATA: Suppose that we are interested in estimating E(Y|x0) and predicting a new Y(x0) when ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, so that
x0 = (5.5, 6.0, 1.4)′.

We use R to compute the following:

> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="confidence")
       fit      lwr      upr
  23.93552 20.04506 27.82597

> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="prediction")
       fit      lwr      upr
  23.93552 2.751379 45.11966
Note that the point estimate/prediction is
Ŷ(x0) = b0 + b1x10 + b2x20 + b3x30
      = −28.877 + 0.328(5.5) + 3.912(6.0) + 19.670(1.4) ≈ 23.936.
• A 95 percent confidence interval for E(Y|x0) is (20.05, 27.83). When ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, we are 95 percent confident that the mean taste rating is between 20.05 and 27.83.
• A 95 percent prediction interval for Y(x0) is (2.75, 45.12). When ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, we are 95 percent confident that the taste rating for a new cheese specimen will be between 2.75 and 45.12.

6.4 Model diagnostics (residual analysis)

IMPORTANCE: We now discuss certain diagnostic techniques for linear regression. The term "diagnostics" refers to the process of checking the model assumptions. This is an important exercise because if the model assumptions are violated, then our analysis (and all subsequent interpretations) could be compromised.
MODEL ASSUMPTIONS: We first recall the model assumptions on the error terms in the linear regression model
Yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + εi,
for i = 1, 2, ..., n. Specifically, we have made the following assumptions:
• E(εi) = 0, for i = 1, 2, ..., n
• var(εi) = σ², for i = 1, 2, ..., n, that is, the variance is constant
• the random variables εi are independent
• the random variables εi are normally distributed.
RESIDUALS: In checking our model assumptions, we first have to deal with the obvious problem; namely, the error terms εi in the model are never observed. However, from the fit of the model, we can calculate the residuals
ei = Yi − Ŷi,
where the ith fitted value is
Ŷi = b0 + b1xi1 + b2xi2 + ··· + bkxik.
We can think of the residuals e1, e2, ..., en as "proxies" for the error terms ε1, ε2, ..., εn, and, therefore, we can use the residuals to check our model assumptions instead.
QQ PLOT FOR NORMALITY: To check the normality assumption (for the errors) in linear regression, it is common to display the qq-plot of the residuals.
• Recall that if the plotted points follow a straight line (approximately), this supports the normality assumption.
• Substantial deviation from linearity is not consistent with the normality assumption.
• The plot in Figure 6.6 supports the normality assumption for the errors in the multiple linear regression model for the cheese data.

RESIDUAL PLOT: By the phrase "residual plot," I mean the plot of the residuals (on the vertical axis) versus the predicted values (on the horizontal axis). This plot is simply the scatterplot of the residuals and the predicted values.
Figure 6.6: Cheese data. Normal qq-plot of the least squares residuals.

• Advanced linear model arguments show that if the model does a good job at describing the data, then the residuals and fitted values are independent.
• This means that a plot of the residuals versus the fitted values should reveal no noticeable patterns; that is, the plot should appear to be random in nature (e.g., a random scatter of points).
• On the other hand, if there are definite (non-random) patterns in the residual plot, this suggests that the model is inadequate in some way or it could point to a violation of the model assumptions.
• The plot in Figure 6.7 does not suggest any obvious model inadequacies! It looks completely random in appearance.

Figure 6.7: Cheese data. Residual plot for the multiple linear regression model fit. A horizontal line at zero has been added.

COMMON VIOLATIONS: Although there are many ways to violate the statistical assumptions associated with linear regression, the most common violations are
• non-constant variance (heteroscedasticity)
• misspecifying the true regression function
• correlated observations over time.
Example 6.3. An electric company is interested in modeling peak hour electricity
demand (Y ) as a function of total monthly energy usage (x). This is important for
planning purposes because the generating system must be large enough to meet the
maximum demand imposed by customers. Data for n = 53 residential customers for a
given month are shown in Figure 6.8.
Figure 6.8: Electricity data. Left: Scatterplot of peak demand (Y, measured in kWh) versus monthly usage (x, measured in kWh) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.
Problem: There is a clear problem with non-constant variance here. Note how the residual plot "fans out" like the bell of a trumpet. This violation may have been missed by looking at the scatterplot alone, but the residual plot highlights it.
Remedy: A common course of action to handle non-constant variance is to apply a transformation to the response variable Y. Common transformations are logarithmic (ln Y), square-root (√Y), and inverse (1/Y).


ELECTRICITY DATA: A square root transformation is commonly applied to address non-constant variance. Consider the simple linear regression model
Wi = β0 + β1xi + εi,
for i = 1, 2, ..., 53, where Wi = √Yi. It is straightforward to fit this transformed model in R as before. We simply regress W on x (instead of regressing Y on x).

> fit.2 = lm(sqrt(peak.demand) ~ monthly.usage)
Figure 6.9: Electricity data. Left: Scatterplot of the square root of peak demand (√Y) versus monthly usage (x, measured in kWh) with the least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit with transformed response.
> fit.2
Coefficients:
  (Intercept)  monthly.usage
     0.580831       0.000953

ANALYSIS: Figure 6.9 above shows the scatterplot (left) and the residual plot (right) from fitting the transformed model. The "fanning out" shape that we saw previously (in the untransformed model) is now largely absent. The fitted transformed model is
Ŵ = 0.580831 + 0.000953x,
or, in other words,
predicted √(Peak demand) = 0.580831 + 0.000953 × Monthly usage.
Further analyses can be carried out with the transformed model; e.g., testing whether peak demand (on the square root scale) is linearly related to monthly usage, etc.
Figure 6.10: Windmill data. Left: Scatterplot of DC output Y versus wind velocity (x, measured in mph) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.
Example 6.4. A research engineer is investigating the use of a windmill to generate electricity. He has collected data on the direct current (DC) output Y from his windmill and the corresponding wind velocity (x, measured in mph). Data for n = 25 observation pairs are shown in Figure 6.10.
Problem: There is a clear quadratic relationship between DC output and wind velocity, so a simple linear regression model fit (as shown above) is inappropriate. The residual plot shows a pronounced quadratic pattern; this pattern is not accounted for in fitting a straight line model.
Remedy: Fit a multiple linear regression model with two independent variables: wind velocity x and its square x², that is, consider the quadratic regression model
Yi = β0 + β1xi + β2xi² + εi,
for i = 1, 2, ..., 25. It is straightforward to fit a quadratic model in R. We simply regress Y on x and x².
Figure 6.11: Windmill data. Left: Scatterplot of DC output Y versus wind velocity (x, measured in mph) with least squares quadratic regression curve superimposed. Right: Residual plot for the quadratic regression model fit.
> wind.velocity.sq = wind.velocity^2
> fit.2 = lm(DC.output ~ wind.velocity + wind.velocity.sq)
> fit.2
Coefficients:
      (Intercept)     wind.velocity  wind.velocity.sq
         -1.15590           0.72294          -0.03812

The fitted quadratic regression model is
Ŷ = −1.15590 + 0.72294x − 0.03812x²,
or, in other words,
predicted DC output = −1.15590 + 0.72294 × Wind velocity − 0.03812 × (Wind velocity)².
Note that the residual plot from the quadratic model fit, shown above, now looks quite good. The quadratic trend has disappeared (because the model now incorporates it).
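Incidentally, the same quadratic fit can be specified without creating the squared variable by hand, using R's I() function inside the formula (I() prevents ^ from being interpreted as formula syntax):

fit.2 = lm(DC.output ~ wind.velocity + I(wind.velocity^2))   # equivalent fit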
Figure 6.12: Global temperature data. Left: Time series plot of the temperature Y measured one time per year. The independent variable x is year, measured as 1900, 1901, ..., 1997. A simple linear regression model fit has been superimposed. Right: Residual plot from the simple linear regression model fit.
Example 6.5. The data in Figure 6.12 (left) are temperature readings (in deg C) on land-air average temperature anomalies, collected once per year from 1900-1997. To emphasize that the data are collected over time, I have used straight lines to connect the observations; this is called a time series plot.
• Unfortunately, it is all too common that people fit linear regression models to time series data and then blindly use them for prediction purposes.
• It takes neither a meteorologist nor an engineering degree to know that temperature observations collected over time are probably correlated. Not surprisingly, residuals from a simple linear regression display clear correlation over time.
• Regression techniques (as we have learned in this chapter) are generally not appropriate when analyzing time series data for this reason. More advanced modeling techniques are needed.
7 Factorial Experiments

Complementary reading: Chapter 7 (VK); Sections 7.1-7.2.

7.1 Introduction

REMARK: In engineering experiments, particularly those carried out in industrial settings, there are often several factors of interest, and the goal is to assess the effects of these factors on a continuous response Y (e.g., yield, lifetime, fill weights, etc.). A factorial treatment structure is an efficient way of defining treatments in these types of experiments.
• One example of a factorial treatment structure uses k factors, where each factor has two levels. This is called a 2ᵏ factorial experiment.
• Factorial experiments are often used in the early stages of experimental work. For this reason, factorial experiments are also called factor screening experiments.
Example 7.1. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem in the final part, as it can lead to nonrecoverable failure. A test is run at the parts producer to determine the effect of k = 4 factors on cracks: pouring temperature (A), titanium content (B), heat treatment method (C), and amount of grain refiner used (D).
• Factor A has 2 levels: low temperature and high temperature
• Factor B has 2 levels: low content and high content
• Factor C has 2 levels: Method 1 and Method 2
• Factor D has 2 levels: low amount and high amount.
The response variable in the experiment is
Y = length of largest crack (in mm) induced in a piece of sample material.

NOTE: In this example, there are 4 factors, each with 2 levels. Thus, there are
2 × 2 × 2 × 2 = 2⁴ = 16
different treatment combinations. These are listed here:

a1b1c1d1   a1b2c1d1   a2b1c1d1   a2b2c1d1
a1b1c1d2   a1b2c1d2   a2b1c1d2   a2b2c1d2
a1b1c2d1   a1b2c2d1   a2b1c2d1   a2b2c2d1
a1b1c2d2   a1b2c2d2   a2b1c2d2   a2b2c2d2

For example, the treatment combination a1b1c1d1 holds each factor at its low level, the treatment combination a1b1c2d2 holds Factors A and B at their low level and Factors C and D at their high level, and so on.
TERMINOLOGY: In a 2ᵏ factorial experiment, one replicate of the experiment uses 2ᵏ runs, one at each of the 2ᵏ treatment combinations.
• Therefore, in Example 7.1, one replicate of the experiment would require 16 runs (one at each treatment combination listed above).
• Two replicates would require 32 runs, three replicates would require 48 runs, and so on.
TERMINOLOGY: There are different types of effects of interest in factorial experiments: main effects and interaction effects. For example, in a 2⁴ factorial experiment,
• there is 1 effect that does not depend on any of the factors.
• there are 4 main effects: A, B, C, and D.
• there are 6 two-way interaction effects: AB, AC, AD, BC, BD, and CD.
• there are 4 three-way interaction effects: ABC, ABD, ACD, and BCD.
• there is 1 four-way interaction effect: ABCD.

OBSERVATION: Note that 1 + 4 + 6 + 4 + 1 = 16. In other words, with 16 observations (from one 2⁴ replicate), we can estimate the 4 main effects and we can estimate all of the 11 interaction effects. We will have 1 observation left to estimate the overall mean of Y, that is, the effect that depends on none of the 4 factors.
GENERALIZATION: In a 2ᵏ factorial experiment, there is/are
• (k choose 0) = 1 overall mean (the mean of Y ignoring all factors)
• (k choose 1) = k main effects
• (k choose 2) = k(k − 1)/2 two-way interaction effects
• (k choose 3) three-way interaction effects, and so on.
Note that
(k choose 0) + (k choose 1) + (k choose 2) + ··· + (k choose k) = Σ_{j=0}^{k} (k choose j) = 2ᵏ,
and additionally that (k choose 0), (k choose 1), ..., (k choose k) are the entries in the (k + 1)th row of Pascal's Triangle. Observe also that 2ᵏ grows quickly in size as k increases. For example, if there are k = 10 factors (A, B, C, D, E, F, G, H, I, and J, say), then performing just one replicate of the experiment would require 2¹⁰ = 1024 runs! In real life, rarely would this type of experiment be possible.
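These counts are easy to reproduce in R with the choose() function; e.g., for k = 4:

choose(4, 0:4)        # 1 4 6 4 1: mean, main effects, 2-way, 3-way, 4-way
sum(choose(4, 0:4))   # 16 = 2^4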

7.2 Example: A 2² experiment with replication

NOTE: We first consider 2ᵏ factorial experiments where k = 2, that is, there are only two factors, denoted by A and B. This is called a 2² experiment. We illustrate with an agricultural example.
Example 7.2. Predicting corn yield prior to harvest is useful for making feed supply and marketing decisions. Corn must have an adequate amount of nitrogen (Factor A) and phosphorus (Factor B) for profitable production and also for environmental concerns.

Table 7.1: Corn yield data (bushels/plot).

Treatment combination   Yield (Y)             Treatment sample mean
a1b1                    35, 26, 25, 33, 31    30
a1b2                    39, 33, 41, 31, 36    36
a2b1                    37, 27, 35, 27, 34    32
a2b2                    49, 39, 39, 47, 46    44
Experimental design: In a 2 × 2 = 2² factorial experiment, two levels of nitrogen (a1 = 10 and a2 = 15) and two levels of phosphorus (b1 = 2 and b2 = 4) were used. Applications of nitrogen and phosphorus were measured in pounds per plot. Twenty small (quarter acre) plots were available for experimentation, and the four treatment combinations a1b1, a1b2, a2b1, and a2b2 were randomly assigned to plots. Note that there are 5 replications.
Response: The response variable is
Y = yield per plot (measured in # bushels).
Side-by-side boxplots of the data in the table above are in Figure 7.1.
Naive analysis: One silly way to analyze these data would be to simply regard each of the combinations a1b1, a1b2, a2b1, and a2b2 as a "treatment" and perform a one-way ANOVA with t = 4 treatment groups like we did in Chapter 4. This would produce the following ANOVA table:

> anova(lm(yield ~ treatment))
Analysis of Variance Table

          Df Sum Sq Mean Sq F value    Pr(>F)
treatment  3    575  191.67  9.5833 0.0007362 ***
Residuals 16    320   20.00
Figure 7.1: Boxplots of corn yields (bushels/plot) for four treatment groups.

OMNIBUS CONCLUSION: The value F = 9.5833 is not what we would expect from an F(3, 16) distribution, the distribution of F when
H0: μ11 = μ12 = μ21 = μ22
is true (p-value ≈ 0.0007). Therefore, we would conclude that at least one of the factorial treatment population means is different.
REMARK: As we have discussed before in one-way classification experiments, an overall F test provides very little information. With a factorial treatment structure, it is possible to explore the data further; in particular, we can learn about the main effects due to nitrogen (Factor A) and due to phosphorus (Factor B). We can also learn about the interaction between nitrogen and phosphorus.


PARTITION: Let us first recall the treatment sum of squares from the one-way ANOVA:
SStrt = 575.
The way we learn more about specific effects is to partition SStrt into the following pieces: SSA, SSB, and SSAB. By "partition," I mean that we will write
SStrt = SSA + SSB + SSAB.
In words,
• SSA is the sum of squares due to the main effect of A (nitrogen)
• SSB is the sum of squares due to the main effect of B (phosphorus)
• SSAB is the sum of squares due to the interaction effect of A and B (nitrogen and phosphorus).
We can use R to write this partition in a richer ANOVA table (mathematical details omitted):
> fit = lm(yield ~ nitrogen*phosphorus)
> anova(fit)
Analysis of Variance Table

                    Df Sum Sq Mean Sq F value    Pr(>F)
nitrogen             1    125     125    6.25 0.0236742 *
phosphorus           1    405     405   20.25 0.0003635 ***
nitrogen:phosphorus  1     45      45    2.25 0.1530877
Residuals           16    320      20

The F statistics FA, FB, and FAB can be used to test for main effects and an interaction effect, respectively. Each F statistic above has an F(1, 16) distribution under the assumption that the associated effect is zero. Small p-values (e.g., p-value < 0.05) indicate that the effect is nonzero. Effects with large p-values can be treated as not significant.

PAGE 179

STAT 509, J. TEBBS

0.0

0.2

0.4

PDF

0.6

0.8

1.0

CHAPTER 7

Figure 7.2: F (1, 16) probability density function. An at FAB = 2.25 has been added.

ANALYSIS: When analyzing data from an experiment with a 2² factorial treatment structure, the first task is to judge whether the two factors interact, here, whether or not the nitrogen/phosphorus contribution is real. From the ANOVA table, we see that
FAB = 2.25 (p-value ≈ 0.153).
This value of FAB is not all that unreasonable when compared to the F(1, 16) distribution, the distribution of FAB when nitrogen and phosphorus do not interact. In other words, we do not have substantial evidence that nitrogen and phosphorus interact.
NOTE: An interaction plot is a graphical display that can help us assess (visually) whether two factors interact. In this plot, the levels of Factor A (say) are marked on the horizontal axis. The sample means of the treatments are plotted against the levels of A, and the points corresponding to the same level of Factor B are joined by straight lines.
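R's built-in interaction.plot() function produces this type of display. A minimal sketch, assuming yield, nitrogen, and phosphorus are stored as in the ANOVA code above (with the factor levels coded a1/a2 and b1/b2):

# Interaction plot with nitrogen on the horizontal axis,
# one line per level of phosphorus (as in Figure 7.3)
interaction.plot(x.factor = nitrogen, trace.factor = phosphorus,
                 response = yield, ylab = "Mean yield (bushels/plot)")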
Figure 7.3: Interaction plot for nitrogen and phosphorus in Example 7.2.

• If Factors A and B do not interact at all, the interaction plot should display parallel lines. That is, the effect of one factor stays constant across the levels of the other factor. This is essentially what it means to have "no interaction."
• If the interaction plot displays a departure from parallelism (including an overwhelming case where the lines intersect), then this is visual evidence of interaction. That is, the effect of one factor depends on the levels of the other factor.
• The F test that uses FAB provides numerical evidence of interaction. The interaction plot provides visual evidence.
CONCLUSION: We do not have strong evidence that nitrogen and phosphorus interact. The FAB statistic is not significant and the interaction plot does not show a substantial departure from parallelism.

GENERAL STRATEGY: The following are guidelines for analyzing data from 2² factorial experiments. Start by looking at whether or not the interaction contribution is significant. This can be done by using an interaction plot and an F test that uses FAB.
• If the interaction is significant, then formal analysis of main effects is not all that meaningful because their interpretations depend on the interaction. In this situation, the best approach is to just redo the entire analysis as a one-way ANOVA with 4 treatments. Tukey pairwise confidence intervals can help you formulate an ordering among the 4 treatment population means.
• If the interaction is not significant, I prefer to refit the model without the interaction term present and then examine the main effects. This can be done numerically by examining the sizes of FA and FB, respectively.
ANALYSIS: Here is the ANOVA table for the corn yield data, leaving out the nitrogen/phosphorus interaction term:

> fit = lm(yield ~ nitrogen + phosphorus)
> anova(fit)
Analysis of Variance Table

           Df Sum Sq Mean Sq F value   Pr(>F)
nitrogen    1    125  125.00  5.8219 0.027403 *
phosphorus  1    405  405.00 18.8630 0.000442 ***
Residuals  17    365   21.47
Comparing this to the ANOVA table with interaction, note that the interaction sum of squares, SSAB = 45, has now been absorbed into the residual sum of squares.
• The main effect of nitrogen (Factor A) is significant in describing yield (FA = 5.8219, p-value ≈ 0.0274).
• The main effect of phosphorus (Factor B) is strongly significant in describing yield (FB = 18.8630, p-value = 0.0004).
Figure 7.4: Left: Side by side boxplots for nitrogen (Factor A). Right: Side by side boxplots for phosphorus (Factor B).
CONFIDENCE INTERVALS: A 95 percent confidence interval for μA1 − μA2, the difference in means for the two levels of nitrogen (Factor A), is given by
(ȲA1 − ȲA2) ± t17,0.025 √( MSres (1/10 + 1/10) ).
A 95 percent confidence interval for μB1 − μB2, the difference in means for the two levels of phosphorus (Factor B), is given by
(ȲB1 − ȲB2) ± t17,0.025 √( MSres (1/10 + 1/10) ).
The R code online can be used to calculate these intervals:
95% CI for μA1 − μA2: (−9.37, −0.62) bushels/plot
95% CI for μB1 − μB2: (−13.37, −4.63) bushels/plot
Note that neither of these intervals includes zero. This is expected because FA and FB are both significant at the α = 0.05 level.
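Since that code is not reproduced here, a sketch of the nitrogen interval, assuming nitrogen is coded as a factor with levels "a1" and "a2" and fit is the no-interaction model above:

# Hand calculation of the 95% CI for mu_A1 - mu_A2 (sketch)
MSres = sum(resid(fit)^2)/17
diff.A = mean(yield[nitrogen == "a1"]) - mean(yield[nitrogen == "a2"])
diff.A + c(-1, 1)*qt(0.975, 17)*sqrt(MSres*(1/10 + 1/10))   # close to the interval above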


REGRESSION: In Example 7.2, there were two levels of nitrogen (a1 = 10 and a2 = 15) and two levels of phosphorus (b1 = 2 and b2 = 4) used in the experiment. These levels, which we generically called "low" and "high" when analyzing the data using ANOVA, are actually numerical in nature (measured in pounds per plot). In this light, there is nothing to prevent us from fitting the following multiple linear regression model using the numerical values of nitrogen and phosphorus:
Yi = β0 + β1xi1 + β2xi2 + εi,
where the independent variables are
x1 = nitrogen amount (10 or 15 pounds)
x2 = phosphorus amount (2 or 4 pounds).
Doing so in R gives the following output:

> fit = lm(yield ~ nitrogen + phosphorus)
> summary(fit)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.5000     6.1297   1.550 0.139599
nitrogen      1.0000     0.4144   2.413 0.027403 *
phosphorus    4.5000     1.0361   4.343 0.000442 ***

Residual standard error: 4.634 on 17 degrees of freedom
Multiple R-squared: 0.5922,  Adjusted R-squared: 0.5442
F-statistic: 12.34 on 2 and 17 DF,  p-value: 0.0004886

This output gives the values of the least squares estimates
b = (b0, b1, b2)′ = (9.5, 1.0, 4.5)′.

Therefore, the fitted least squares regression model for the corn yield data is
Ŷ = 9.5 + 1.0x1 + 4.5x2,
or, in other words,
predicted YIELD = 9.5 + 1.0 NITROGEN + 4.5 PHOSPHORUS.
This equation can subsequently be used to make predictions about future yields based on given values of nitrogen and phosphorus. In doing so, be careful about extrapolation; for example, you would not want to make a prediction when x1 = 25 and x2 = 10. These values are not representative of those used in the actual experiment, so this model may not be a good description of yield for these values of nitrogen and phosphorus.
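For example, a prediction at values inside the design region (a sketch, using the hypothetical in-range amounts x1 = 12 and x2 = 3):

# Predicted yield at nitrogen = 12, phosphorus = 3 (hypothetical but in-range values)
predict(fit, data.frame(nitrogen = 12, phosphorus = 3))   # 9.5 + 1.0(12) + 4.5(3) = 35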
INTERESTING: I have constructed below the analysis of variance table for the multiple linear regression fit:

> anova(fit)
Analysis of Variance Table

Response: yield
           Df Sum Sq Mean Sq F value   Pr(>F)
nitrogen    1    125  125.00  5.8219 0.027403 *
phosphorus  1    405  405.00 18.8630 0.000442 ***
Residuals  17    365   21.47

You will note that this table is identical to the two-way ANOVA table (without interaction) given earlier. This is no coincidence! In fact, the two-way ANOVA model (without interaction) and the multiple linear regression model
Yi = β0 + β1xi1 + β2xi2 + εi
are actually identical models. Therefore, fitting each one gives the same analysis and the same conclusions.

7.3 Example: A 2⁴ experiment without replication

Example 7.3. A chemical product is produced in a pressure vessel. A factorial experiment is carried out to study the factors thought to influence the filtration rate of this product. The four factors are temperature (A), pressure (B), concentration of formaldehyde (C), and stirring rate (D). Each factor is present at two levels (e.g., "low" and "high"). A 2⁴ experiment is performed with one replication; the data are shown below.

Run   Run label   Filtration rate (Y, gal/hr)
 1    a1b1c1d1     45
 2    a2b1c1d1     71
 3    a1b2c1d1     48
 4    a2b2c1d1     65
 5    a1b1c2d1     68
 6    a2b1c2d1     60
 7    a1b2c2d1     80
 8    a2b2c2d1     65
 9    a1b1c1d2     43
10    a2b1c1d2    100
11    a1b2c1d2     45
12    a2b2c1d2    104
13    a1b1c2d2     75
14    a2b1c2d2     86
15    a1b2c2d2     70
16    a2b2c2d2     96

NOTATION: When discussing factorial experiments, it is common to use the symbol "−" to denote the low level of a factor and the symbol "+" to denote the high level. For example, the first row of the table above indicates that each factor (A, B, C, and D) is run at its low level. The response Y for this run is 45 gal/hr.


NOTE: In this experiment, there are k = 4 factors, so there are 15 effects to estimate:
• the 4 main effects: A, B, C, and D
• the 6 two-way interactions: AB, AC, AD, BC, BD, and CD
• the 4 three-way interactions: ABC, ABD, ACD, and BCD
• the 1 four-way interaction: ABCD.
In this $2^4$ experiment, we have 16 values of Y and 15 effects to estimate; the one remaining degree of freedom is used to estimate the overall mean. This leaves us with no observations (and therefore no degrees of freedom) to estimate error or perform statistical tests. This is an obvious problem! Why? Because we have no way to judge which main effects are significant, and we cannot learn about how these factors interact.
TERMINOLOGY: A single replicate of a $2^k$ factorial experiment is called an unreplicated factorial. With only one replicate, as in Example 7.3, there is no internal error estimate, so we cannot perform statistical tests to judge significance. What do we do?
• One approach to the analysis of an unreplicated factorial is to assume that certain higher-order interactions are negligible and then combine their mean squares to estimate the error.
• This is an appeal to the sparsity of effects principle; that is, most systems are dominated by some of the main effects and low-order interactions, and most high-order interactions are negligible.
• To learn about which effects may be negligible, we can fit the full ANOVA model and obtain the SS attached to each of these 15 effects (see below).
• Effects with large SS can be retained. Effects with small SS can be discarded. A smaller model with only the large effects can then be fit. This smaller model will have an error estimate formed by taking all of the effects with small SS and combining them together.

ANALYSIS: Here is the R output summarizing the fit of the full model:


> # Fit full model
> fit = lm(filtration ~ A*B*C*D)
> anova(fit)
Analysis of Variance Table

Response: filtration
          Df  Sum Sq Mean Sq F value Pr(>F)
A          1 1870.56 1870.56
B          1   39.06   39.06
C          1  390.06  390.06
D          1  855.56  855.56
A:B        1    0.06    0.06
A:C        1 1314.06 1314.06
B:C        1   22.56   22.56
A:D        1 1105.56 1105.56
B:D        1    0.56    0.56
C:D        1    5.06    5.06
A:B:C      1   14.06   14.06
A:B:D      1   68.06   68.06
A:C:D      1   10.56   10.56
B:C:D      1   27.56   27.56
A:B:C:D    1    7.56    7.56
Residuals  0    0.00

Warning message: ANOVA F-tests on an essentially perfect fit are unreliable

NOTE: From this table, it is easy to see that the effects
A, C, D, AC, AD
are far more significant than the others. For example, the smallest SS in this set is 390.06 (Factor C), which is over 5 times larger than the largest remaining SS (68.06). As a next step, we therefore consider fitting a smaller model with these 5 effects only. This will free up 10 degrees of freedom that can be used to estimate the error variance.
ANALYSIS: Here is the R output (shown below, after Figure 7.5) summarizing the fit of the smaller model that includes only A, C, D, AC, and AD:
[Figure 7.5: Left: Interaction plot for temperature (Factor A) and concentration of formaldehyde (Factor C). Right: Interaction plot for temperature (Factor A) and stirring rate (Factor D). Both panels plot filtration rate (gal/hr) against Factor A.]
> # Fit smaller model
> fit = lm(filtration ~ A + C + D + A:C + A:D)
> anova(fit)
Analysis of Variance Table

Response: filtration
          Df  Sum Sq Mean Sq F value    Pr(>F)
A          1 1870.56 1870.56  95.865 1.928e-06 ***
C          1  390.06  390.06  19.990  0.001195 **
D          1  855.56  855.56  43.847 5.915e-05 ***
A:C        1 1314.06 1314.06  67.345 9.414e-06 ***
A:D        1 1105.56 1105.56  56.659 1.999e-05 ***
Residuals 10  195.13   19.51

NOTE: It is clear that these five effects are each significant (note that the p-values are all very close to zero). Interaction plots for temperature (Factor A) and concentration of formaldehyde (Factor C), and for temperature (Factor A) and stirring rate (Factor D), are in Figure 7.5. These plots depict the strong pairwise interactions that exist.
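Plots like those in Figure 7.5 can be drawn with the base R function interaction.plot(); this is a sketch assuming the design data frame from the earlier sketch (or equivalent variables A, C, D, and filtration):

> # Left panel of Figure 7.5: temperature (A) by concentration (C)
> with(design, interaction.plot(A, C, filtration,
+      xlab = "Factor.A", ylab = "Filtration rate (gal/hr)"))
> # Right panel of Figure 7.5: temperature (A) by stirring rate (D)
> with(design, interaction.plot(A, D, filtration,
+      xlab = "Factor.A", ylab = "Filtration rate (gal/hr)"))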


REGRESSION: In Example 7.3, there were no numerical values attached to the levels of temperature (Factor A), concentration of formaldehyde (Factor C), and stirring rate (Factor D). Therefore, if we want to fit a regression model (e.g., for prediction purposes), we can use the following variables with arbitrary numerical codings assigned:
$x_1$ = temperature (−1 = low; +1 = high)
$x_2$ = concentration of formaldehyde (−1 = low; +1 = high)
$x_3$ = stirring rate (−1 = low; +1 = high).
With these values, we can fit the multiple linear regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}x_{i2} + \beta_5 x_{i1}x_{i3} + \epsilon_i.$$
Doing so in R gives the following output:
> fit = lm(filtration ~ temp + conc + stir.rate + temp:conc + temp:stir.rate)
> summary(fit)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      70.062      1.104  63.444 2.30e-14 ***
temp             10.812      1.104   9.791 1.93e-06 ***
conc              4.938      1.104   4.471  0.00120 **
stir.rate         7.313      1.104   6.622 5.92e-05 ***
temp:conc        -9.062      1.104  -8.206 9.41e-06 ***
temp:stir.rate    8.312      1.104   7.527 2.00e-05 ***

Residual standard error: 4.417 on 10 degrees of freedom
Multiple R-squared: 0.966,     Adjusted R-squared: 0.9489
F-statistic: 56.74 on 5 and 10 DF,  p-value: 5.14e-07

This output gives the values of the least squares estimates

$$\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \\ \hat{\beta}_4 \\ \hat{\beta}_5 \end{pmatrix} = \begin{pmatrix} 70.062 \\ 10.812 \\ 4.938 \\ 7.313 \\ -9.062 \\ 8.312 \end{pmatrix}.$$

Therefore, the fitted least squares regression model for the filtration rate data is
$$\hat{Y} = 70.062 + 10.812x_1 + 4.938x_2 + 7.313x_3 - 9.062x_1x_2 + 8.312x_1x_3,$$
or, in other words,
$$\widehat{\text{FILT}} = 70.062 + 10.812\,\text{TEMP} + 4.938\,\text{CONC} + 7.313\,\text{STIR} - 9.062\,\text{TEMP*CONC} + 8.312\,\text{TEMP*STIR}.$$
This fitted regression model can be used to compute confidence intervals for the mean response or prediction intervals for future filtration rates at given values of temperature, concentration, and stirring rate.
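For example, here is a minimal sketch of a 95% prediction interval, assuming fit is the regression object above and the −1/+1 codings are used:

> # 95% prediction interval for a new run at high temperature,
> # low concentration, and high stirring rate
> new.run = data.frame(temp = 1, conc = -1, stir.rate = 1)
> predict(fit, newdata = new.run, interval = "prediction", level = 0.95)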
ANOVA TABLE: I have constructed the analysis of variance table for the multiple linear regression fit in this example:

> anova(fit)
Analysis of Variance Table

Response: filtration
               Df  Sum Sq Mean Sq F value    Pr(>F)
temp            1 1870.56 1870.56  95.865 1.928e-06 ***
conc            1  390.06  390.06  19.990  0.001195 **
stir.rate       1  855.56  855.56  43.847 5.915e-05 ***
temp:conc       1 1314.06 1314.06  67.345 9.414e-06 ***
temp:stir.rate  1 1105.56 1105.56  56.659 1.999e-05 ***
Residuals      10  195.13   19.51

Note that this table is identical to the ANOVA table for the smaller model fit above.
REMARK: In this chapter, we have only just scratched the surface when it comes to discussing factorial treatment structures. Specialized courses in experimental design, such as STAT 506, would delve into more advanced designs and analysis techniques. For example, a design that often arises in industrial experiments is that of running a $2^k$ factorial experiment in fewer than $2^k$ runs; these are called fractional factorial experiments. In these experiments, the engineer acknowledges a priori that the highest order interactions are negligible, and the goal is to assess main effects and lower order interactions only. For those who are interested, Section 7.3 (VK) introduces this concept. An illustrative example is Example 7.7 on pp 499 (VK).
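To give a flavor of the idea only (this construction is my own illustration, not the analysis in VK), a half fraction of the $2^4$ design, a $2^{4-1}$ in 8 runs, can be generated by aliasing D with the ABC interaction:

> # 8-run half fraction with defining relation I = ABCD (equivalently, D = ABC)
> half = expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
> half$D = half$A * half$B * half$C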