
Class Notes STAT 473-573

shuva@math.niu.edu
Office: 361 F
815-753-6829
1. The prerequisites for this course are STAT 350, STAT 301 and
MATH 211.
2. The prescribed text for this course is
APPLIED LINEAR STATISTICAL MODELS, 5th EDITION, by
Kutner et al.
3. For this course we plan on covering chapters 1, 2, 3, 4, 6, 7,
9, 10 and 11. However, if time permits we can start chapter 8
or 12.
4. The first midterm exam will be on Sep 28 and the second one
will be on Nov 18.
5. Reading Exercise: Please go through Sections 1.1, 1.2, 1.3,
1.4, 1.5 on your own.
A brief Introduction

- Experiment
- Sample Space (set of all possible outcomes)
- Events (subsets of the Sample Space)
- Probability
- Random Variable: univariate and bivariate/multivariate
- Distribution of a Random Variable
- Discrete and Continuous Random Variables
- The distribution of a Random Variable is characterized by a set of
  Parameters
- Parameters and Parametric models
- Normal, Binomial, Uniform, Poisson, ...
- What if the parameter is unknown?
- Inference: we infer the parameter's value from the data at hand.
- Estimation and Hypothesis Testing.
- Estimation comes in two kinds: Point Estimation and Interval
  Estimation.
1. The title of this course is STATISTICAL METHODS AND
MODELS.
2. We will use STATISTICAL METHODS to construct
STATISTICAL MODELS. (How is that different from a
MATHEMATICAL MODEL?)
3. What is a MODEL: We intend to reconstruct a real-life
phenomenon using statistical/mathematical tools and
techniques.
4. A MODEL is an approximation of the phenomenon we intend
to study.
5. For this course we shall keep ourselves confined only to
LINEAR MODELS. (How are they different from NONLINEAR ones?)
How does modeling a phenomenon help?
1. Prediction.
2. Understanding the mechanism by which the data is produced.
We need DATA to model.
Refrigeration Equipment Data

- An object is produced in lots of varying sizes.
- What is the optimum lot size for producing this part?
- This question can be answered only if we can successfully study the
  relationship between lot size and the labor hours required to
  produce the lot.
- So the question is: how can we successfully study the
  relationship between the lot size and the labor hours required
  to produce the lot?
- The first thing we need to build a statistical model is DATA.
Data Description

- Number of variables: usually denoted by p. Here p = 2.
- Number of observations: usually denoted by n. Here n = 25.
- The observations are independent (VERY IMPORTANT). The
  observations do not influence each other.
- Variable names: Lot Size and Work Hours.
- Problem: Study how the Lot Size explains the Work Hours.
  That is, study the change in Work Hours when the Lot Size
  changes.
- Dependent Variable: Work Hours.
- Independent/Explanatory Variable/Covariate: Lot Size.
- One dependent and one independent variable. (Can we have
  more than one independent variable? Can we have more than
  one dependent variable?)

- Most people use statistics the way a drunk uses a lamp post:
  more for support than illumination.
- Let the DATA speak for itself.
- Exploratory Data Analysis.
- Is there a quick or heuristic way of determining if a relationship
  exists between the two variables? And if yes, what is its nature?
- Yes: the scatter plot. The dependent variable goes on the Y axis
  and the independent variable on the X axis.
- We can roughly expect 3 kinds of trends:
- Linear trend
- Nonlinear trend
- No trend
[Figure: scatter plot of y against x showing a linear trend]
[Figure: two scatter plots of y against x, both showing linear trends]
[Figure: scatter plot of y against x showing a nonlinear trend]
[Figure: two scatter plots of y against x, both showing nonlinear trends]
[Figure: scatter plot of y against x showing no trend]

- Now, based on the scatter plot, we need to model the data.
- Since we shall be working only with linear models, we would
  be interested in those data sets where the scatter plot shows a
  linear trend.
- If the scatter plot does not show a linear trend, does it mean that
  we can never use a linear model there?
Scatter Plot: Refrigeration Equipment Data
How do we construct the model

- It would be best if our model could pass through each
  observation. (Would it really be best?)
- But since we want to fit a straight line, that won't be possible. So
  what do we do?
- We could either say
  Y = \beta_0 + \beta_1 X,
  which would be a deterministic model and would not be able to
  explain the scattering, or
  Y = \beta_0 + \beta_1 X + \varepsilon,
  where \varepsilon is a random variable.
Why the epsilon?

- As statisticians we choose to model these deviations as
  observations from a random variable.
- Our model is
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  for i = 1, ..., n
- Here Y_i is the i-th observation of the response variable.
- \beta_0 and \beta_1 are the model parameters.
- The X_i's are known constants, i.e., X_i is the i-th value of the
  predictor variable.
- The \varepsilon_i are i.i.d. N(0, \sigma^2). Is \sigma^2 known?
The fitted model

- So we want to draw a straight line on the scatter plot. And we
  want to do it in the best possible way.
- What is best?
- We want to draw a straight line which minimizes the sum of
  squares of the distances between the observed and the fitted values.
- The sum of squares of the distances between the observed and the
  fitted values is:
  \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2
The fitted model

- In the last slide we said: the straight line which minimizes the
  sum of squares of the distances between the observed and the
  fitted values.
- A straight line has two parameters: slope \beta_1 and intercept \beta_0.
- How do we choose the slope \beta_1 and intercept \beta_0?
- As stated before, we choose the (\beta_0, \beta_1) which minimizes
  \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2
- Calculus: maxima and minima in two variables. Pg 17.

Least Squares Estimated Values

b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

b_0 = \bar{Y} - b_1 \bar{X}
Now that we have the estimated values of the parameters, we use the
data to compute them and plot the fitted regression line on the scatter
plot. (See reg2.png)
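A minimal sketch of this step in R (assuming the data sits in a data
frame named refrig with columns lot and hr; the names are
illustrative, not from the notes):

fit <- lm(hr ~ lot, data = refrig)   # least squares fit of Work Hours on Lot Size
coef(fit)                            # b0 (intercept) and b1 (slope)
plot(refrig$lot, refrig$hr, xlab = "Lot Size", ylab = "Work Hours")
abline(fit)                          # overlay the fitted regression line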
The Gauss Markov Theorem

- Do these LSEs have any special statistical properties? Yes:
- Unbiased
- Minimum variance among all unbiased linear estimators
- BLUE (Best Linear Unbiased Estimator) -- not the color blue :(


Properties of the fitted model

- What is the fitted regression equation here?
  \hat{hr} = 62.366 + 3.5702 \cdot lot
- There are six properties in all (see pages 23-24). We shall
  check to see if these properties hold for our data set.
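A quick way to check some of these properties numerically,
continuing the hypothetical refrig/fit objects from above:

e <- resid(fit)                        # residuals e_i = Y_i - Yhat_i
sum(e)                                 # residuals sum to 0
sum(refrig$lot * e)                    # sum of X_i * e_i is 0
sum(fitted(fit) * e)                   # sum of Yhat_i * e_i is 0
# the fitted line passes through (Xbar, Ybar):
predict(fit, data.frame(lot = mean(refrig$lot))) - mean(refrig$hr)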
Interpretation of \beta_0 and \beta_1
Our model is as follows:
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  for all i = 1, ..., n.

- When X_i = 0 we find E(Y_i) = \beta_0. So \beta_0 is the population mean
  of the observed variable Y when X = 0.
- \beta_1 is the change in the population mean value of the observed
  variable when the independent variable undergoes a unit
  change.
Recall that (b_0, b_1) are the estimates of (\beta_0, \beta_1). So b_0 + b_1 X is
the estimate of the mean response of the variable Y when the
value of X is fixed at a certain level.

- What are fitted values? The fitted value of the i-th observation
  is
  \hat{Y}_i = b_0 + b_1 X_i
- The residual for the i-th observation is e_i = Y_i - \hat{Y}_i
- An example: page 23.
- What is E(Y_i)? What is E(Y_i - \hat{Y}_i)?
- Error Sum of Squares or Residual Sum of Squares:
  SSE = \sum_{i=1}^{n} e_i^2
Estimation of \sigma^2

- Suppose I have a random sample of size n drawn from a
  population whose mean is \mu and variance is \sigma^2.
- Suggest an unbiased estimator for \sigma^2.
- What were the two main assumptions required to answer the
  above question?
- Now consider our Y_i's for all i = 1, ..., n. What is the
  distribution of, say, Y_5?
- Does this random sample of size n satisfy the above two
  assumptions? Which ones does it violate?
- So what would be the estimator of \sigma^2 here?
  \hat{\sigma}^2 = MSE = \frac{SSE}{n - 2}
- E(MSE) = \sigma^2
- \sqrt{MSE} = \sqrt{\frac{SSE}{n - 2}}
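In R (same hypothetical fit as before), MSE can be computed
directly and matches what lm reports:

n   <- length(resid(fit))
SSE <- sum(resid(fit)^2)       # error (residual) sum of squares
MSE <- SSE / (n - 2)           # unbiased estimator of sigma^2
sqrt(MSE)                      # equals summary(fit)$sigma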
Chapter 1: Important Terms and Concepts:
1. explanatory variable/covariate/independent variable
2. response variable/observations/dependent variable
3. scatter plot
4. linear model
5. simple linear regression
6. slope \beta_1, its estimate b_1, and its interpretation
7. intercept \beta_0, its estimate b_0, and its interpretation
8. method of least squares
9. fitted model and its properties (see pages 23-24); fitted value
10. Gauss Markov Theorem
11. residual
12. error sum of squares or residual sum of squares
13. error mean square or residual mean square
14. estimate of \sigma^2
The section in Chapter 1 which will not be included in the syllabus
is 1.8.
Chapter 2

[Figure: two scatter plots of y against x showing a linear trend]

- From the scatter plot above we know a linear model is
  appropriate.
- But we are not sure whether \beta_1 should be nonzero.
- What do we do now? That is, which model should we fit:
  with or without \beta_1?
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i   OR   Y_i = \beta_0 + \varepsilon_i
- By fitting a model we mean estimating its parameters. In our
  case we estimate (\beta_0, \beta_1) or (\beta_0, \beta_1, \sigma^2) accordingly as \sigma^2 is
  known or not.
- Now if we cannot infer much from the scatter plot and we
  want to use some analytical tools, then one way to see if a
  linear relationship exists at all would be to test
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
Hypothesis Testing

- We need an H_0 and an H_1.
- T: Test statistic. (This should not contain any unknown
  parameters.)
- Distribution of T under H_0.
- R: Critical Region or the Rejection Region (usually in terms of
  the test statistic T).
- Now compute the observed value of T, also written as T*.
- Check to see if T* belongs to R.
- Or compute the P-value and compare it with \alpha.

- Our hypotheses are:
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
- Recall the estimate for \beta_1 is b_1:
  b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
      = \sum_{i=1}^{n} k_i Y_i,  where  k_i = \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2}
- This b_1 shall be our test statistic.
- Notice it is a linear combination of the Y_i's.
More about b_1

- E(b_1) = ?
- Var(b_1) = ?
- Sampling Distribution: what shall be the distribution of b_1?
- What would be the distribution of
  TS = \frac{b_1 - \beta_1}{\sqrt{Var(b_1)}}
- Can TS be a test statistic? Does it contain any unknown
  parameter?
- Estimated variance of b_1, that is, the estimate of Var(b_1): s(b_1)^2.
- What is the distribution of
  \frac{Z}{\sqrt{\chi^2_n / n}}
  where Z ~ N(0, 1) and \chi^2_n are independent?
- We know that (n - 2) MSE / \sigma^2 \sim \chi^2_{n-2}.
- And we also know the above random variable is independent
  of b_0 and b_1.
- What would be the distribution of
  \frac{b_1 - \beta_1}{s(b_1)}
- Example 1 (what is the degrees of freedom of our test
  statistic?) and Example 2 (pages 46, 48).
Confidence interval for \beta_1

- Is the t distribution symmetric?
- t(\alpha/2; n-2) denotes the (\alpha/2)100-th percentile of the t
  distribution with n - 2 degrees of freedom.
- t(\alpha/2; n-2) = -t(1 - \alpha/2; n-2)
- Now
  P\left( -t(1-\alpha/2; n-2) \le \frac{b_1 - \beta_1}{s(b_1)} \le t(1-\alpha/2; n-2) \right) = 1 - \alpha
- Thus the 100(1-\alpha)% C.I. for \beta_1 is
  \left( b_1 - t(1-\alpha/2; n-2)\, s(b_1),\; b_1 + t(1-\alpha/2; n-2)\, s(b_1) \right)
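A sketch of this interval in R (hypothetical fit from before; "lot"
is the assumed predictor name):

b1    <- coef(fit)[2]
s.b1  <- summary(fit)$coefficients[2, "Std. Error"]   # s(b1)
alpha <- 0.05
tval  <- qt(1 - alpha/2, df = df.residual(fit))       # t(1 - alpha/2; n-2)
c(b1 - tval * s.b1, b1 + tval * s.b1)                 # 95% C.I. for beta1
confint(fit, "lot", level = 0.95)                     # same interval from R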
ANOVA: Analysis of Variance

- Observe the data set Y_1, ..., Y_n with n observations.
- The sample variance of this data set is
  s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
- Is s^2 always 0? When will it be 0?
- If the answer to the above question is no, then the question is:
  why NOT?
- Because there is variation in the data.
- What causes this variation?
- Recall, we had assumed that
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
- What do we see? The variation is caused by:
- a part explained by the X_i's;
- the random errors \varepsilon_i.

- Y_i = \bar{Y} + (Y_i - \bar{Y})
- What is \bar{Y}?
- Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)
Thus
(Y_i - \bar{Y})^2 = (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2

- Which finally gives us (verify):
  \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- So we have,
  \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- SSR (Regression Sum of Squares) = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
- MSR = SSR / df(SSR)
- SSE (Error Sum of Squares) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- MSE = SSE / df(SSE)
- SSTO (Total Sum of Squares) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
- SSTO = SSR + SSE

The expected values

- E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2
- E(MSE) = \sigma^2
- \frac{E(MSR)}{E(MSE)} = \frac{\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} = 1  when \beta_1 = 0.
- Now we know the ratio will be greater than 1 when \beta_1 is
  not equal to 0. Thus we can use
  F = \frac{MSR}{MSE}
  as a test statistic for testing
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
- What is the distribution of F under H_0?
F distribution

- What is the distribution of \sum_{i=1}^{n} Z_i^2, where the Z_i are
  i.i.d. N(0, 1)?
- What is the distribution of
  \frac{Z}{\sqrt{\chi^2_n / n}}
  where Z and \chi^2_n are independent?
- F = \frac{\chi^2_m / m}{\chi^2_n / n},  where \chi^2_m and \chi^2_n are independent.
- Our question was: F ~ ?
- Now
  F = \frac{(SSR/\sigma^2)/1}{(SSE/\sigma^2)/(n-2)}
- If we know that, under H_0: \beta_1 = 0, SSR/\sigma^2 and SSE/\sigma^2 are
  independent and are distributed as \chi^2_1 and \chi^2_{n-2}, then
  F \sim F(1, n-2)

- Since F is always greater than 0, what shall be the critical
  region?
- Recall that
  \frac{E(MSR)}{E(MSE)} = \frac{\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} = 1  when \beta_1 = 0.
- Critical region: { F > c }
- Although the alternative hypothesis is two-sided, the critical
  region is one-sided.
- See the example on page 71 of the textbook.
- So we have established another method of testing
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
- Are the two tests, one using a T statistic and one using an F
  statistic, equivalent? YES.
- Notice that
  {X^2 > 4} = {X > 2} \cup {X < -2}
- Since F = T^2, therefore
  {F > c} = {T > \sqrt{c}} \cup {T < -\sqrt{c}}
- See page 71: F(0.95; 1, 23) = 4.28 = (2.069)^2 = (t(0.975; 23))^2
ANOVA Table

Source of Variation | SS   | df  | MS              | E(MS)  | F*      | P-value
Regression          | SSR  | 1   | MSR = SSR/1     | E(MSR) | MSR/MSE | <0.0001*
Error               | SSE  | n-2 | MSE = SSE/(n-2) | E(MSE) |         |
Total               | SSTO | n-1 |                 |        |         |
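In R the whole table comes from anova(), and the equivalence of the
t and F tests can be verified numerically (hypothetical fit from
before; n = 25 as in the refrigeration data):

anova(fit)                                        # SSR, SSE, MSR, MSE, F*, P-value
tstar <- summary(fit)$coefficients[2, "t value"]
fstar <- anova(fit)[1, "F value"]
all.equal(tstar^2, fstar)                         # F* = (t*)^2
qt(0.975, 23)^2                                   # = qf(0.95, 1, 23) = 4.28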
Inference on \beta_0

[Figure: two scatter plots of y against x showing a linear trend with a nonzero intercept]

- From the scatter plot above we know a linear model is
  appropriate.
- But we are not sure whether \beta_0 should be nonzero.
- What do we do now? That is, which model should we fit:
  with or without \beta_0?
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i   OR   Y_i = \beta_1 X_i + \varepsilon_i
- By fitting a model we mean estimating its parameters. In our
  case we estimate (\beta_0, \beta_1) or (\beta_0, \beta_1, \sigma^2) accordingly as \sigma^2 is
  known or not.
- Now if we cannot infer much from the scatter plot and we
  want to use some analytical tools, then one way to decide
  would be to test
  H_0: \beta_0 = 0
  H_1: \beta_0 \neq 0
Inference on \beta_0

- Recall
  b_0 = \bar{Y} - b_1 \bar{X} = \frac{1}{n} \sum_{i=1}^{n} Y_i - \bar{X} \sum_{i=1}^{n} k_i Y_i = \sum_{i=1}^{n} \left( \frac{1}{n} - \bar{X} k_i \right) Y_i
- E(b_0) = \beta_0
- \sigma^2(b_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right]

- Sampling Distribution: what shall be the distribution of b_0?
  (Assuming that \sigma^2 is known.)
- What would be the distribution of
  TS = \frac{b_0 - \beta_0}{\sqrt{Var(b_0)}}
- Can TS be a test statistic? Does it contain any unknown
  parameter?
- Estimated variance of b_0, that is, the estimate of Var(b_0): s(b_0)^2.
- s(b_0)^2 = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right]

- Recall that
  \frac{Z}{\sqrt{\chi^2_n / n}} \sim t_n
  where Z ~ N(0, 1) and \chi^2_n are independent.
- We know that (n - 2) MSE / \sigma^2 \sim \chi^2_{n-2}.
- And we also know the above random variable is independent
  of b_0 and b_1.
- What would be the distribution of
  \frac{b_0 - \beta_0}{s(b_0)}
Confidence interval for \beta_0

- Is the t distribution symmetric?
- t(\alpha/2; n-2) denotes the (\alpha/2)100-th percentile of the t
  distribution with n - 2 degrees of freedom.
- t(\alpha/2; n-2) = -t(1 - \alpha/2; n-2)
- Now
  P\left( -t(1-\alpha/2; n-2) \le \frac{b_0 - \beta_0}{s(b_0)} \le t(1-\alpha/2; n-2) \right) = 1 - \alpha
- Thus the 100(1-\alpha)% C.I. for \beta_0 is
  \left( b_0 - t(1-\alpha/2; n-2)\, s(b_0),\; b_0 + t(1-\alpha/2; n-2)\, s(b_0) \right)
Interval estimation of E(Y_h)

- For a particular value of X, say X = X_h, we are interested in
  the value of the corresponding Y_h, i.e., the dependent variable
  corresponding to this value of the independent variable.
- Now it won't be possible to get the value of Y_h itself, so the next
  best thing would be E(Y_h), and this is in a way better because
  it actually gives us the mean of all the possible
  values of Y_h.
- Recall Y_h = \beta_0 + \beta_1 X_h + \varepsilon_h. Thus
  E(Y_h) = \beta_0 + \beta_1 X_h
- What would be an estimator of E(Y_h)? Well, from what we
  already have, we can say that
  \hat{Y}_h = b_0 + b_1 X_h would be an estimator of E(Y_h).

- E(\hat{Y}_h) = \beta_0 + \beta_1 X_h
- Variance of \hat{Y}_h:
  \sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- Estimate of \sigma^2(\hat{Y}_h) = ?
  MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]

Sampling distribution of \hat{Y}_h

- What would be the distribution of
  TS = \frac{\hat{Y}_h - (\beta_0 + \beta_1 X_h)}{\sqrt{Var(\hat{Y}_h)}}
- Can TS be a test statistic? Does it contain any unknown
  parameter?
- Estimated variance of \hat{Y}_h, that is, the estimate of Var(\hat{Y}_h):
  s(\hat{Y}_h)^2 = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- Recall that
  \frac{Z}{\sqrt{\chi^2_n / n}} \sim t_n
  where Z ~ N(0, 1) and \chi^2_n are independent.
- We know that (n - 2) MSE / \sigma^2 \sim \chi^2_{n-2}.
- And we also know the above random variable is independent
  of b_0 and b_1.
- What would be the distribution of
  \frac{\hat{Y}_h - (\beta_0 + \beta_1 X_h)}{\sqrt{MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
Confidence interval for E(Y_h)

- Is the t distribution symmetric?
- t(\alpha/2; n-2) denotes the (\alpha/2)100-th percentile of the t
  distribution with n - 2 degrees of freedom.
- t(\alpha/2; n-2) = -t(1 - \alpha/2; n-2)
- Now
  P\left( -t(1-\alpha/2; n-2) \le \frac{\hat{Y}_h - (\beta_0 + \beta_1 X_h)}{s(\hat{Y}_h)} \le t(1-\alpha/2; n-2) \right) = 1 - \alpha
- Thus the 100(1-\alpha)% C.I. for \beta_0 + \beta_1 X_h is
  \left( b_0 + b_1 X_h - t(1-\alpha/2; n-2)\, s(\hat{Y}_h),\; b_0 + b_1 X_h + t(1-\alpha/2; n-2)\, s(\hat{Y}_h) \right)
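In R this interval is a one-liner (hypothetical fit; X_h = 65 is an
arbitrary illustrative level):

predict(fit, newdata = data.frame(lot = 65),
        interval = "confidence", level = 0.95)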
Prediction of new observation
Now suppose we have X = X_new and want to find the
corresponding value of Y_new. Since
Y_new = \beta_0 + \beta_1 X_new + \varepsilon_new
and \varepsilon_new is unobserved, we shall never be able to get the exact
value of Y_new. So here we propose an interval estimate of Y_new.
Recall \hat{Y}_new = b_0 + b_1 X_new.
Now consider Y_new - \hat{Y}_new.

- E(Y_new - \hat{Y}_new) = ?
- Var(Y_new - \hat{Y}_new) = \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- s^2(Y_new - \hat{Y}_new) = MSE + MSE \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- What is the distribution of
  \frac{Y_new - \hat{Y}_new}{\sqrt{\sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
- If \sigma^2 is unknown we shall replace it by the MSE.
- So now the question is: what is the distribution of
  \frac{Y_new - \hat{Y}_new}{\sqrt{MSE + MSE \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
  = \frac{Y_new - \hat{Y}_new}{\sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
- Observe that MSE is independent of the numerator. Why?
- Thus the 100(1-\alpha)% prediction interval for Y_new is
  \left( \hat{Y}_new - t(1-\alpha/2; n-2)\, s(Y_new - \hat{Y}_new),\; \hat{Y}_new + t(1-\alpha/2; n-2)\, s(Y_new - \hat{Y}_new) \right)
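The R call is the same as for E(Y_h) except for the interval type
(hypothetical fit; X_new = 65 is illustrative). The resulting
interval is wider, since its variance carries the extra sigma^2 term:

predict(fit, newdata = data.frame(lot = 65),
        interval = "prediction", level = 0.95)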
R^2 and r

\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

- SSR (Regression Sum of Squares) = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
- SSE (Error Sum of Squares) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- SSTO (Total Sum of Squares) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
- SSTO = SSR + SSE
- Thus,
  1 = \frac{SSR}{SSTO} + \frac{SSE}{SSTO}
  \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}
  R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}
- What happens if all the observations are on a straight line?
  SSE = 0, i.e., R^2 = 1.
- R^2 = 0. What does it imply? See the explanation on page 74.
A brief note on R^2 and r.
R^2 is defined as the proportionate reduction of the total variation
associated with the use of the predictor variable X.

- R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} = \frac{SSTO - SSE}{SSTO}
- Thus the larger the value of SSR (or the smaller the value of
  SSE), the closer R^2 is to 1.
- If our model was
  Y_i = \beta_0 + \varepsilon_i
  we can show that, based on this model, the total variation
  (a constant times the estimate of \sigma^2) will be
  \sum_{i=1}^{n} (y_i - \bar{y})^2.
- However, if we use the model which uses the predictor
  variable X,
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
  we can show that, based on this model, the remaining variation
  (a constant times the estimate of \sigma^2) will be
  \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
- Thus the absolute reduction in total variation will be
  \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
- Now the proportionate reduction (or the percentage decrease)
  in the total variation will be
  \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
Limitations of R^2
On page 75 of the textbook the authors speak of 3
misunderstandings. Here we go over them.

- I: Recall that we had proved that E(\hat{Y}_h) = \beta_0 + \beta_1 X_h. This
  implies that the fitted regression equation \hat{Y}_h = b_0 + b_1 X_h is
  an unbiased estimator of the mean value of Y when the level
  of X is fixed at X_h. Now merely being unbiased does not
  mean much. We would want to make sure that the variance
  of this estimator is not too large. We have
  s.e.^2(\hat{Y}_h) = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
  So if X_h = 100, then \hat{Y}_h = b_0 + b_1 \cdot 100, which would be an
  unbiased estimator of \beta_0 + \beta_1 \cdot 100, but the s.e.^2 would be
  203.72, which we can see is quite large. (See page 55,
  Example 2.)
Using a 100(1-\alpha)% C.I. for testing a hypothesis at level \alpha

- The Critical Region for rejecting H_0: \beta_1 = \beta_{10} against
  H_a: \beta_1 \neq \beta_{10} is
  \{ b_1 > \beta_{10} + t(1-\alpha/2, n-2)\, se(b_1) \} \cup \{ b_1 < \beta_{10} - t(1-\alpha/2, n-2)\, se(b_1) \}
- The above is equivalent to
  \{ b_1 - t(1-\alpha/2, n-2)\, se(b_1) > \beta_{10} \} \cup \{ b_1 + t(1-\alpha/2, n-2)\, se(b_1) < \beta_{10} \}
- The null hypothesis will NOT be rejected if
  b_1 - t(1-\alpha/2, n-2)\, se(b_1) < \beta_{10} < b_1 + t(1-\alpha/2, n-2)\, se(b_1)
- Thus under the null hypothesis, i.e., when H_0: \beta_1 = \beta_{10} is
  true, we have
  P\left( b_1 - t(1-\alpha/2, n-2)\, se(b_1) < \beta_{10} < b_1 + t(1-\alpha/2, n-2)\, se(b_1) \right) = 1 - \alpha
- Recall the 100(1-\alpha)% C.I. for \beta_1 is
  \left( b_1 - t(1-\alpha/2, n-2)\, se(b_1),\; b_1 + t(1-\alpha/2, n-2)\, se(b_1) \right)
- That is,
  P\left( b_1 - t(1-\alpha/2, n-2)\, se(b_1) \le \beta_1 \le b_1 + t(1-\alpha/2, n-2)\, se(b_1) \right) = 1 - \alpha
- So if we are given a 100(1-\alpha)% confidence interval for, say,
  \beta_1 and we want to use it to test the null hypothesis
  H_0: \beta_1 = \beta_{10} against the two-sided alternative, then what
  do we do?
- We check to see if \beta_{10} lies outside the interval; if so, then we
  reject the null.
Chapter 2

- Sections not included for the midterm: 2.6, 2.8 and 2.11.
- Important terms and concepts:
- Inference (testing + interval estimation) concerning \beta_1
- T test
- ANOVA table + F test
- Inference (testing + interval estimation) concerning \beta_0
- Interval estimation of E(Y_h)
- Prediction interval for Y_new
- R^2 and r
Chapter 3

- We assumed that our observations were generated as follows:
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  i = 1, ..., n,
  where the \varepsilon_i were assumed to be i.i.d. Normal(0, \sigma^2).
- Essentially we had assumed that:
- The regression function is linear. (Model assumption)
- The error terms have a constant variance. (Error assumption)
- The error terms are independent. (Error assumption)
- The error terms are normally distributed. (Error assumption)
- The question is how do we know, merely based on the data
  set, that the above four assumptions are valid.

- This question can be answered by studying the residuals.
- What are the residuals?
- Recall that
  e_i = Y_i - \hat{Y}_i  and  \hat{Y}_i = b_0 + b_1 X_i
- Also recall (from the six properties of the regression on pages
  23 and 24):
  \sum_{i=1}^{n} e_i = 0
  \sum_{i=1}^{n} X_i e_i = 0

- Question: Are the residuals independent? Why?
- Now E(e_i) = 0 for all i = 1, ..., n.
- Var(e_i) = \sigma^2 (1 - h_{ii})
- What is h_{ii}? It's the i-th diagonal element of the hat matrix
  H. What is the hat matrix? Later.
- Moral of the story: the e_i's do not have a constant variance!
- So it becomes difficult to use them together. If we want to
  compare one with another we would want to make sure they
  are standardised.

- So consider
  e^*_i = \frac{e_i - \bar{e}}{\sqrt{Var(e_i)}} = \frac{e_i}{\sqrt{Var(e_i)}} = \frac{e_i}{\sqrt{\sigma^2 (1 - h_{ii})}}
  This is going to make it mean 0 and variance 1. But what if the
  variance \sigma^2 is unknown?

- Then we shall use
  e^*_i = \frac{e_i - \bar{e}}{\sqrt{\widehat{Var}(e_i)}} = \frac{e_i}{\sqrt{\widehat{Var}(e_i)}} = \frac{e_i}{\sqrt{MSE (1 - h_{ii})}}
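A sketch of this computation in R (hypothetical fit from before);
note that R's built-in rstandard() returns the same quantity:

MSE   <- sum(resid(fit)^2) / df.residual(fit)
h     <- hatvalues(fit)                      # diagonal elements h_ii of H
estar <- resid(fit) / sqrt(MSE * (1 - h))    # standardised residuals
plot(fitted(fit), estar); abline(h = 0)      # residual plot against fitted values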
Nonlinearity of Regression Function

- Is the regression function linear?
- An initial answer to this question is provided by the scatter plot.
- But this is not always as effective as the residual plot (we
  shall soon see an example).
- What is a residual plot? A plot of the residuals vs X, OR of the
  residuals vs the fitted values.
- Let us consider the following examples.


[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: semi-studentised residuals vs fitted values]

- Here the data was generated from a linear model:
  Y = 2.4 + 1.4 X + Normal(0, 2)
- From the scatter plot we observe a strict linear trend.
- Now observe the plot of the semistudentized residuals against x.
  What do we see there?
- Notice that the plot is scattered around the y = 0 line, and,
  more importantly, we do NOT observe any pattern!
- If the functional form of the regression model is incorrect, the
  residual plots constructed by using the model will often
  display a pattern.
- Later we shall see that this pattern can be used to determine
  a more appropriate model.
- More examples ...


[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: semi-studentised residuals vs fitted values]

- Here the data was generated from the model
  Y = X^2 + Normal(0, 2)
- From the scatter plot we observe that it's not linear at all. So
  we can guess that a linear fit won't work.
- Now observe the plot of the semistudentized residuals against x.
  What do we see there?
- Notice that the plot shows a distinct pattern (in this case a
  quadratic pattern), implying that the functional form of the
  population regression equation is not linear.
- More examples ...


[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: semi-studentised residuals vs fitted values]

- Here the data was generated from a linear model with no slope:
  Y = 5 + Normal(0, 2)
- From the scatter plot we observe that it is linear with no
  slope. So we can guess that a linear fit with a slope won't work.
- Now observe the plot of the semistudentized residuals against x.
  What do we see there?
- The fitted model:

   Estimate     t value  Pr(>|t|)
b0 5.0134539    170.324  <2e-16 ***
b1 0.0001085      0.065   0.948
---
R-squared: 0.000153

- In the past examples it was pretty obvious from the scatter plot
  itself what to expect.
- Let us take a look at the following example.


[Figure: scatter plot of dependent vs independent variable]
[Figure: residuals vs predictor]
[Figure: residuals vs fitted values]
The fitted model:

b0  -1.727  0.13500
b1   6.484  0.00064 ***
R-squared: 0.8751
[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: scatter plot with the fitted regression line overlaid]

- Here the data was generated from the model
  Y = 5 + 3X + X^2 + Normal(0, 2).
- Now observe the plot of the semistudentized residuals against x
  and compare it with R^2. What can we conclude?
Nonconstancy of error variance

- Scatter plot of the residuals against the X_i's or the fitted values.
- If, however, the number of observations is small, a scatter
  plot of the absolute values of the residuals against the X_i's or
  the fitted values would be a good thing to do.
- If the variances are systematically increasing: a funnel
  opening outwards.
- If the variances are systematically decreasing: a funnel
  closing.
- If the variances vary arbitrarily, we won't be able to make that
  out from the residual plot.
Nonindependence of error terms

- How do we verify the assumption that the random errors are
  independent?
- When we say non-independent, we mean that the errors are
  generated from an autoregressive time series model, that is:
  \varepsilon_i = \rho \varepsilon_{i-1} + u_i,  where the u_i are i.i.d. N(0, \sigma^2)
- Plot the residuals against the time sequence instead of the
  predicted values or the X_i's themselves.
- Observe the plot. If it does not display any pattern then the
  errors are independent.
- However, if there is a pattern then we can assume
  non-independence.
- What kind of pattern are we looking for?
- We expect the residuals in the sequence plot to fluctuate in a
  more or less random pattern around 0. Lack of randomness
  can take the form of too much or too little alternation of points
  around the zero line. Read the Comment on page 110.
- Now, merely based on a graph we can only draw conclusions
  subjectively.
- In order to be more precise we need a hypothesis test. This
  brings us to the Durbin-Watson test.

- H_0: The errors are not autocorrelated
  H_1: The errors are autocorrelated
- Test statistic:
  d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}
- We shall talk more about this shortly.
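The statistic itself is easy to compute in R from the residuals of a
hypothetical fit, assuming they are stored in time order:

e <- resid(fit)                  # residuals, in time order
d <- sum(diff(e)^2) / sum(e^2)   # Durbin-Watson statistic
d                                # values near 2 suggest no autocorrelation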


Nonnormality of error terms

- Distribution plots: box plot, histogram, etc.
- Comparison of frequencies (the 95% rule for the t distribution).
- Q-Q Plot:
- Step 1: Arrange the standardised residuals from the smallest to
  the largest, e_{[i]}. These go on the Y axis.
- Step 2: On the X axis plot the points z_{(i)}, where z_{(i)} is
  defined as the point on the scale of the standard normal curve
  such that the area under the curve to the left of this point
  z_{(i)} is
  \frac{i - 0.375}{n + 0.25}
- If the plot appears to be a straight line then we can assume
  normality.
- Why z_{(i)}? Because E(e_{[i]}) is very close to \sqrt{MSE}\, z_{(i)}.
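These two steps can be carried out by hand in R (hypothetical fit);
qqnorm() automates the same idea:

e <- sort(resid(fit))                    # ordered residuals e_[i]
n <- length(e)
p <- (seq_len(n) - 0.375) / (n + 0.25)   # areas to the left of z_(i)
z <- qnorm(p)                            # the points z_(i)
plot(z, e)                               # roughly straight line => normality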
Outliers

- What are outliers?
- Extreme observations.
- How to detect outliers?
- Plot the semistudentized residuals against X_i or \hat{Y}_i. This plot
  should be centered around the line y = 0. Now look for the
  points which lie outside the band between y = -4 and y = 4. This
  is a rough-and-ready way to detect outliers.
- What causes outliers to occur in the residual scatter plot?
- Mostly errors in recording the observations. But they
  may also be due to interaction with another predictor variable
  which is not present in the model.
Why do we need to detect outliers?

- LSEs are significantly modified by the presence of outliers.
  This causes a wrong fit,
- especially if the number of observations is small. This in turn
  leads us to erroneous conclusions, as it alters the residual plot
  as well.
Exploratory Residual Analysis

- MODEL ASSUMPTIONS
- The regression function is linear:
  plot of residuals against predictor variables or fitted values.
- The error variances are constant:
  plot of residuals against predictor variables or fitted values.
- The error terms are independent:
  plot of residuals against the time sequence.
- The error terms are normally distributed:
  normal probability plot of residuals.
- OUTLIER DETECTION
- Plot of residuals against predictor variables or fitted values.

Tests
In the last few slides we saw a bunch of plots to verify the four
model assumptions. But plots can be interpreted subjectively, so in
order to check whether the assumptions hold we need to carry out
hypothesis tests.

- To check for the independence of the error terms: Durbin-Watson
  test.
- To check for the constancy of the variance of the error terms:
  Brown-Forsythe or Breusch-Pagan test.
- To check for the normality of the error terms: correlation test
  for the Q-Q plot, or the Kolmogorov-Smirnov test, and a bunch of
  others.
- To check for the linearity of the regression function: F test for
  Lack of Fit.
- To detect outliers: remove the point, refit the regression
  equation and construct a prediction interval.
Remedy

- So far we have been discussing how to detect whether the
  model assumptions are violated.
- Now we shall talk about remedies.
- There are two ways of fixing the problem:
- Change the model (from linear to something more complex).
- Transform the data set.
- Why should we prefer one over the other? Complex models
  may lead to a better understanding as to how the data is
  being generated, but estimating the model parameters might
  turn out to be very challenging. Whereas, if we can transform
  the data successfully, then we can get away with building a
  relatively simpler model (perhaps with fewer parameters to
  estimate and using simpler techniques).
Model

- Regression function is not linear: consider a polynomial or a
  nonlinear regression model. How do we determine its
  functional form? The scatter plot, or techniques which shall be
  discussed later. Instead of the usual linear form we could have
  E(Y) = \beta_0 + \beta_1 X + \beta_2 X^2  OR  E(Y) = \beta_0 X^{\beta_1}.
- Variances of the error terms are not constant: use weighted
  least squares instead of the usual method.
- That is, weight each observation Y_i by 1/\sigma_i^2, thus making
  the variances of the observations comparable.
- Non-independence of errors: use a dependent structure
  instead of the usual i.i.d. assumption, which will in turn alter
  the theoretical properties of the estimates of the model
  parameters.
- Non-normality of the error terms: use the proper distribution
  instead of the normal distribution.
Data Transformation
As we discussed earlier, using a complex model can be very
challenging. Whereas, if we can transform the data successfully,
then we can get away with building the simple regression model. So
we could:

- Transform X.
- Transform Y: power transformation or Box-Cox
  transformation.
- Transform both X and Y.

Remember: Why are we transforming? So that we don't have to
use a complex model, i.e., we can use a simple linear model. That
is, all four model assumptions are valid.
Transforming X
Suppose the scatter plot shows a nonlinear trend; here we could
either transform X or Y. If we have sufficient reason to
believe that the error terms are normally distributed and the
variance is constant, then we should transform X rather than Y.
Why so? Because if we transform Y, say to \sqrt{Y}, then the
distribution of the errors shall change and the variances shall not
necessarily be constant any more.

- Step 1: Draw the scatter plot.
- Step 2: Does this plot look linear?
- Step 3: Yes? B-)) No? Make a good guess!
- Step 4: Transform the X.
- Step 5: Draw the scatter plot again.
- Step 6: Build the model.
- Step 7: Residual analysis. What do the plots say?
- Note: As we shall soon see, in Step 4 this transformation does
  not have to be unique; a couple of different ones may do the
  job. We select any one which fits.
- See the comment on page 132 - very important.
- The transformation only on X is essentially used when the
  problem lies in the linearity assumption of the regression
  function.
- Now if the assumptions of constant variance and normality are
  violated, what do we do then?
- These are issues related to the error term. So the only way to
  fix them would be transforming the Y values. Usually both these
  problems are addressed together by using only one
  transformation.
- How do we know that there is a problem? Use the residual
  plot or observe the scatter plot carefully. If it seems like it is
  spreading out, then that shows the variance is increasing. Take
  a look at figure 3.15 in the textbook.
Transforming Y
Once we know that transforming Y is necessary, how do we do it?

- We could just guess from the scatter plot like we did for X,
  but there is another method which is less heuristic.
- The Box-Cox Transformation, or the Power Transformation. How
  does it work?
- We would be happiest if we could fit the model
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
- But that is not to be :(( So we need to transform the Y
  variable, that is, come up with a \lambda and use Y^\lambda instead
  of Y; that is, now we have to fit
  Y_i^\lambda = \beta_0 + \beta_1 X_i + \varepsilon_i
Box Cox Transformation

- Step 1: Choose 20 (?) uniformly separated points from the
  interval [-2, 2], e.g. [-2.0, -1.8, -1.6, ..., 1.6, 1.8, 2.0].
- Step 2: For each value of \lambda make the following
  transformation, where K_2 = \left( \prod_{i=1}^{n} Y_i \right)^{1/n}
  is the geometric mean of the Y_i:
  If \lambda \neq 0:
  W_i = \frac{1}{\lambda K_2^{\lambda - 1}} (Y_i^\lambda - 1)
  If \lambda = 0:
  W_i = K_2 \log_e(Y_i)
- Step 3: So now, from the original data set consisting of the
  Y_i's, we have a new data set consisting of the W_i's instead.
  Now we build a regression model with these (X_i, W_i). The number
  of regression models shall be the same as the number of different
  values of \lambda.
- Step 4: Now calculate the SSE for each of the regression
  models. Plot the SSE vs \lambda. Select that \lambda where the SSE
  is minimised.
- Step 5: If a non-zero \lambda is chosen, then our transformation
  will be Y^\lambda; otherwise, if \lambda = 0, the transformation will
  be \log_e(Y).
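A minimal sketch of this grid search in R (hypothetical refrig data
from before; the grid and the tolerance used to detect lambda = 0 are
implementation choices, not from the notes):

lambdas <- seq(-2, 2, by = 0.2)
K2 <- exp(mean(log(refrig$hr)))          # geometric mean of the Y_i
sse <- sapply(lambdas, function(lam) {
  W <- if (abs(lam) < 1e-8) K2 * log(refrig$hr)
       else (refrig$hr^lam - 1) / (lam * K2^(lam - 1))
  sum(resid(lm(W ~ refrig$lot))^2)       # SSE of the regression of W on X
})
plot(lambdas, sse, type = "b")           # pick the lambda minimising SSE
lambdas[which.min(sse)]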
Transforming both X and Y

- When unequal variances are present but the regression relation
  is linear, a transformation on Y may not be sufficient.
- Such a transformation on Y may stabilise the error variance, but
  it might also change the linear relationship to a nonlinear one.
- A transformation on X shall also be required.
- So first we transform Y and then we transform X.

Post Transformation

- Once we are done with the transformation, we fit the
  linear regression to the transformed data and check the
  residual plot.
- If we are happy with the residual plot, then job done.
  Otherwise we try out a different transformation till we arrive
  at the desired result.
Confidence Intervals

- The 100(1 - \alpha/2)% C.I. for \beta_0 is
  b_0 \pm t(1 - \alpha/4; n-2)\, s\{b_0\}
- The 100(1 - \alpha/2)% C.I. for \beta_1 is
  b_1 \pm t(1 - \alpha/4; n-2)\, s\{b_1\}
- What will be a 100(1 - \alpha)% region for (\beta_0, \beta_1)?
- Let A denote the complement of the region
  (b_0 - t(1-\alpha/4; n-2)\, s\{b_0\},\; b_0 + t(1-\alpha/4; n-2)\, s\{b_0\}).
  Thus P(A) = \alpha/2.
- Let B denote the complement of the region
  (b_1 - t(1-\alpha/4; n-2)\, s\{b_1\},\; b_1 + t(1-\alpha/4; n-2)\, s\{b_1\}).
  Thus P(B) = \alpha/2.
- P(A \cup B) = P(A) + P(B) - P(A \cap B)
- P(A^c \cap B^c) \ge 1 - P(A) - P(B)
- Now our required region is, say, R:
  (b_0 \pm t(1-\alpha/4; n-2)\, s\{b_0\}) \times (b_1 \pm t(1-\alpha/4; n-2)\, s\{b_1\})
- P(R) = P(A^c \cap B^c) \ge 1 - P(A) - P(B)
- P(R) = P(A^c \cap B^c) \ge 1 - \alpha
Regression through the origin

- If our model were
  Y_i = \beta_1 X_i + \varepsilon_i
- E(Y_i) = \beta_1 X_i
- The LSE of \beta_1 is
  b_1 = \sum_{i=1}^{n} X_i Y_i \Big/ \sum_{i=1}^{n} X_i^2
- The unbiased estimator of \sigma^2 is
  s^2 = MSE = \sum_{i=1}^{n} e_i^2 \Big/ (n - 1)
- Table 4.1, page 162.

Multiple Regression

- So far we had been studying simple linear regression.
- Now we move to multiple linear regression.
- Why was it simple: it had only one covariate, that is,
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  for all i = 1 to n
- In a multiple regression setting we have
  Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + ... + \beta_{p-1} X_{(p-1)i} + \varepsilon_i
  that is,
  Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j X_{ji} + \varepsilon_i,  for all i = 1 to n
Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \varepsilon_{n \times 1}
where
Y_{n \times 1} = the vector of n observations,
Y = (Y_1, ..., Y_n)'
X_{n \times p} = the design matrix.
The i-th row of the above matrix is
X_i = (1, X_{1i}, X_{2i}, ..., X_{(p-1)i})
\varepsilon_{n \times 1} = the vector of n errors, that is,
\varepsilon = (\varepsilon_1, ..., \varepsilon_n)'
and the \varepsilon_i are i.i.d. N(0, \sigma^2).
Now that we can write
Y = X\beta + \varepsilon
we have
E(Y_{n \times 1}) = X\beta
Note we are taking the expectation of a vector.
V(Y) = \sigma^2 I_{n \times n}
We derived the variance of a vector. The Least Squares
Estimates, here, are
\hat{\beta} = b = (X'X)^{-1} X'Y
E(\hat{\beta}) = \beta
Fitted value
\hat{Y} = X\hat{\beta}
\hat{Y} = X (X'X)^{-1} X'Y
or
\hat{Y} = HY,  where  H = X (X'X)^{-1} X'
It is to be noted that the matrix H is idempotent and symmetric.


Residuals in multiple regression
e = Y - \hat{Y} = Y - HY = (I - H) Y
E(e) = 0. Why?
V(e) = \sigma^2 (I - H). Why?
s^2(e) = MSE (I - H)
ANOVA in multiple regression setting
Recall that
SSTO = SSE + SSR
where
SSTO = Y'Y - \frac{1}{n} Y'JY = Y'\left[ I - \frac{1}{n} J \right] Y;
as usual, it has n - 1 degrees of freedom.
SSE = e'e = (Y - Xb)'(Y - Xb) = Y'(I - H) Y;
it has n - p degrees of freedom.
SSR = b'X'Y - \frac{1}{n} Y'JY = Y'\left[ H - \frac{1}{n} J \right] Y;
it has p - 1 degrees of freedom. Here J is an n x n matrix of 1s.
(For a numerical example see page 243; for the ANOVA table see page
225.)
Thus we have
MSE = SSE/(n - p)
(n - p) MSE / \sigma^2 \sim \chi^2_{n-p}
MSR = SSR/(p - 1)
As before it can be shown that, if all the \beta_i (i \ge 1) are zero,
then E(MSR) = \sigma^2; otherwise E(MSR) > \sigma^2.
Now that we have MSE and MSR we can construct an ANOVA table
as before. Here,
H_0: \beta_1 = \beta_2 = ... = \beta_{p-1} = 0
H_a: at least one \beta_i is non-zero
The test statistic is
F^* = \frac{MSR}{MSE}
The rejection region is
F^* \ge F(1 - \alpha; p-1, n-p)
R^2: Coefficient of multiple determination
As before,
R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}

- It measures the proportionate reduction of the total variation
  in Y associated with the use of the set of X variables
  X_1, ..., X_{p-1}.
- 0 \le R^2 \le 1. Now R^2 = 0 when all b_k = 0 for
  k = 1, ..., p-1, and R^2 = 1 when \hat{Y}_i = Y_i for all
  i = 1, ..., n.
- As we increase the number of covariates, i.e. the X_i's (say from
  p = 5 to p = 10), the SSE will decrease, and thus R^2 will
  increase. But this does not mean that the model has a better fit.
- This is why we have the adjusted R^2:
  R^2_{adj} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)}
Inference about the regression parameters in multiple
regression
E(b) = \beta. Why?
V(b) = \sigma^2 (X'X)^{-1}. Why?
s^2(b) = MSE (X'X)^{-1}
We know that
b \sim MVN(\beta, \sigma^2 (X'X)^{-1})
So we know that each b_k, where k = 0, 1, ..., (p-1), is normally
distributed. Hence, as before,
\frac{b_k - \beta_k}{s\{b_k\}} \sim t(n-p),  for all k = 0, 1, 2, ..., p-1
Hence the interval estimate of \beta_k with (1 - \alpha) confidence
coefficient is
\left( b_k - t(1-\alpha/2, n-p)\, s\{b_k\},\; b_k + t(1-\alpha/2, n-p)\, s\{b_k\} \right)
Tests for \beta_k, where k = 0, 1, 2, ..., p-1. In order to test
H_0: \beta_k = 0
H_a: \beta_k \neq 0
we use
t^* = \frac{b_k}{s\{b_k\}}
as our test statistic, and our critical region is
|t^*| > t(1 - \alpha/2; n-p)

Joint confidence intervals. If we wish to find the joint confidence
intervals of g of the p parameters, the confidence limits with family
confidence coefficient 1 - \alpha are:
b_k \pm t(1 - \alpha/2g; n-p)\, s\{b_k\}
Interval Estimation of E(Y_h). Recall
E(Y) = X\beta
Thus
E(Y_h) = X_h' \beta

General Linear Test Approach

Recall that the model under consideration is
Y = X\beta + \varepsilon
\hat{\beta} = (X'X)^{-1} X'Y
In order to test
H_0: \beta_j = 0
H_a: \beta_j \neq 0
we use the t-test, where the test statistic is
\frac{\hat{\beta}_j - \beta_j}{\sqrt{MSE \, C_{jj}}} \sim t_{(n-p)}
where C_{jj} is the j-th diagonal element of (X'X)^{-1}.
Also we had
SST = SSR + SSE,
and as p increases SSE decreases (or SSR increases) and vice versa.
Here
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
and it does not depend on either p, the number of parameters in
the model, or the values of the covariates in the model (i.e., the
actual values of the X_i's).
Now let's, for a moment, get back to the SLR model setting, that
is,
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,  i = 1, ..., n
Now in this setting the full model is
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
and the reduced model is
y_i = \beta_0 + \varepsilon_i
Under the full model,
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
Under the reduced model,
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2
Observe that, under the reduced model, SSE = SST. Since
SSE decreases as p increases, whether adding any new variable had
any effect can be found out by comparing
SSE(F) with SSE(R).
We also know that
SSE(F) \le SSE(R)
So in order to test
H_0: \beta_1 = 0
H_a: \beta_1 \neq 0
we use the following test statistic:
F^* = \frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F) / df_F}
If H_0 is true we know
F^* = \frac{\frac{SST - SSE}{(n-1) - (n-2)}}{\frac{SSE}{n-2}} = \frac{MSR}{MSE} \sim F(1, n-2)
Here (that is, in the case when p = 2) we find that the General
Linear Test is identical to the ANOVA test.
When p = 2, we have
SST = SSR(X_1) + SSE(X_1)
When p = 3, we have
SST = SSR(X_1, X_2) + SSE(X_1, X_2)
When p = 4, we have
SST = SSR(X_1, X_2, X_3) + SSE(X_1, X_2, X_3)
EXTRA SUM OF SQUARES:
SSR(X_2 | X_1) = SSR(X_1, X_2) - SSR(X_1)
             = SSE(X_1) - SSE(X_1, X_2)
The EXTRA SUM OF SQUARES is the measure of the marginal
effect of adding the new variable to the existing model. Similarly
we can define:
SSR(X_3 | X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3)
or
SSR(X_3, X_2 | X_1) = SSE(X_1) - SSE(X_1, X_2, X_3)
SST = SSR(X_1) + SSE(X_1)
SSR(X_2 | X_1) = SSE(X_1) - SSE(X_1, X_2)
SST = SSR(X_1) + SSR(X_2 | X_1) + SSE(X_1, X_2)
SST = SSR(X_1, X_2) + SSE(X_1, X_2)
Comparing the two we get
SSR(X_1, X_2) = SSR(X_1) + SSR(X_2 | X_1)
Since
SSR(X_2 | X_1) = SSE(X_1) - SSE(X_1, X_2)
SSR(X_3 | X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3)
we can write
SST = SSR(X_1) + SSE(X_1)
    = SSR(X_1) + SSR(X_2 | X_1) + SSE(X_1, X_2)
    = SSR(X_1) + SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2) + SSE(X_1, X_2, X_3)
Comparing this with
SST = SSR(X_1, X_2, X_3) + SSE(X_1, X_2, X_3)
we can write
SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2)
The df of SSR is p - 1, so the df of SSR(X_1 | X_2, X_3) is 1, and
that of SSR(X_1, X_2 | X_3) is 2. Now that we have these SSRs we can
also define the corresponding MSRs:
MSR(X_2, X_3 | X_1) = SSR(X_2, X_3 | X_1) / 2
Thus we decompose the total SSR into smaller components. What
is the use of all this? Well, it gives us an idea as to how the
reduction in variation takes place and how each covariate is
responsible for bringing about this change; in other words, the
contribution of each covariate becomes more explicit.
Consider the following (full) model:
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i
H_0: \beta_3 = 0
H_a: \beta_3 \neq 0
SSE(F) = SSE(X_1, X_2, X_3)
SSE(R) = SSE(X_1, X_2)
F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}
    = \frac{SSE(X_1, X_2) - SSE(X_1, X_2, X_3)}{(n-3) - (n-4)} \div \frac{SSE(X_1, X_2, X_3)}{n-4}
    = \frac{SSR(X_3 | X_1, X_2)}{1} \div \frac{SSE(X_1, X_2, X_3)}{n-4}
    = \frac{MSR(X_3 | X_1, X_2)}{MSE(X_1, X_2, X_3)}
This is known as the partial F-test.

H_0: \beta_2 = \beta_3 = 0
H_a: H_0 is not true.
Full Model: same as before, thus SSE(F) = SSE(X_1, X_2, X_3).
Reduced Model: Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i, thus
SSE(R) = SSE(X_1).
