
Class Notes STAT 473-573

shuva@math.niu.edu
Office: 361 F
815-753-6829
1. The prerequisites for this course are STAT 350, STAT 301 and
MATH 211.
2. The prescribed text for this course is
APPLIED LINEAR STATISTICAL MODELS, 5th EDITION, by
Kutner et al.
3. For this course we plan on covering chapters 1, 2, 3, 4, 6, 7,
9, 10 and 11. However, if time permits we can start chapter 8
or 12.
4. The first midterm exam will be on Sep 28 and the second one
will be on Nov 18.
5. Reading Exercise: Please go through Sections 1.1, 1.2, 1.3,
1.4, 1.5 on your own.
A brief Introduction

- Experiment
- Sample Space (set of all possible outcomes)
- Events (subsets of the Sample Space)
- Probability
- Random Variable: univariate and bivariate/multivariate
- Distribution of a Random Variable
- Discrete and Continuous Random Variables
- The distribution of a Random Variable is characterized by a set of
  Parameters
- Parameters and Parametric models
- Normal, Binomial, Uniform, Poisson, ...
- What if the parameter is unknown?
- Inference: we infer the parameter's value from the data at hand.
- Estimation and Hypothesis Testing.
- Estimation comes in two kinds: Point Estimation and Interval
  Estimation.
1. The title of this course is STATISTICAL METHODS AND
MODELS.
2. We will use STATISTICAL METHODS to construct
STATISTICAL MODELS. (How is that different from a
MATHEMATICAL MODEL?)
3. What is a MODEL: We intend to reconstruct a real-life
phenomenon using statistical/mathematical tools and
techniques.
4. A MODEL is an approximation of the phenomenon we intend
to study.
5. For this course we shall keep ourselves confined only to
LINEAR MODELS. (How are they different from NONLINEAR ones?)
How does modeling a phenomenon help?
1. Prediction.
2. Understanding the mechanism by which the data is produced.
We need DATA to model.
Refrigeration Equipment Data

- An object is produced in lots of varying sizes.
- What is the optimum lot size for producing this part?
- This question can be answered only if we can successfully study the
  relationship between lot size and the labor hours required to
  produce the lot.
- So the question is: how can we successfully study the
  relationship between the lot size and the labor hours required
  to produce the lot?
- The first thing we need to build a statistical model is DATA.
Data Description

- Number of variables: usually denoted by p. Here p = 2.
- Number of observations: usually denoted by n. Here n = 25.
- The observations are independent (VERY IMPORTANT). The
  observations do not influence each other.
- Variable names: Lot Size and Work Hours.
- Problem: Study how the Lot Size explains the Work Hours.
  That is, study the change in Work Hours when the Lot Size
  changes.
- Dependent Variable: Work Hours.
- Independent/Explanatory Variable/Covariate: Lot Size.
- One dependent and one independent variable. (Can we have
  more than one independent variable? Can we have more than
  one dependent variable?)

- Most people use statistics the way a drunk uses a lamp post:
  more for support than illumination.
- Let the DATA speak for itself.
- Exploratory Data Analysis.
- Is there a quick or heuristic way of determining if a relationship
  exists between the two variables? And if yes, what is its nature?
- Yes: the scatter plot. The dependent variable goes on the Y axis
  and the independent variable on the X axis.
- We can roughly expect 3 kinds of trends:
- Linear trend
- Nonlinear trend
- No trend
[Figure: scatter plot of y against x showing a linear trend]
[Figure: two scatter plots of y against x, both showing linear trends]
[Figure: scatter plot of y against x showing a nonlinear trend]
[Figure: two scatter plots of y against x, both showing nonlinear trends]
[Figure: scatter plot of y against x showing no trend]

- Now, based on the scatter plot, we need to model the data.
- Since we shall be working only with linear models, we would
  be interested in those data sets where the scatter plot shows a
  linear trend.
- If the scatter plot does not show a linear trend, does it mean that
  we can never use a linear model there?
Scatter Plot: Refrigeration Equipment Data
How do we construct the model

- It would be best if our model could pass through each
  observation. (Would it really be best?)
- But since we want to fit a straight line, that won't be possible. So
  what do we do?
- We could either say
  Y = \beta_0 + \beta_1 X,
  which would be a deterministic model and would not be able to
  explain the scattering, or
  Y = \beta_0 + \beta_1 X + \varepsilon,
  where \varepsilon is a random variable.
Why the epsilon?

- As statisticians we choose to model these deviations as
  observations from a random variable.
- Our model is
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  for i = 1, ..., n
- Here Y_i is the i-th observation of the response variable.
- \beta_0 and \beta_1 are the model parameters.
- The X_i's are known constants, i.e., X_i is the i-th value of the
  predictor variable.
- The \varepsilon_i are i.i.d. N(0, \sigma^2). Is \sigma^2 known?
The fitted model

- So we want to draw a straight line on the scatter plot. And we
  want to do it in the best possible way.
- What is best?
- We want to draw a straight line which minimizes the sum of
  squares of the distances between the observed and the fitted values.
- The sum of squares of the distances between the observed and the
  fitted values is:
  \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2
The fitted model

- In the last slide we said: the straight line which minimizes the
  sum of squares of the distances between the observed and the
  fitted values.
- A straight line has two parameters: slope \beta_1 and intercept \beta_0.
- How do we choose the slope \beta_1 and intercept \beta_0?
- As stated before, we choose the (\beta_0, \beta_1) which minimizes
  \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2
- Calculus: maxima and minima in two variables. Pg 17.

Least Squares Estimated Values

b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

b_0 = \bar{Y} - b_1 \bar{X}
Now that we have the estimated values of the parameters, we use the
data to compute them and plot the fitted regression line on the scatter
plot. (See reg2.png)
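A minimal sketch of this step in R (assuming the data sits in a data
frame named refrig with columns lot and hr; the names are
illustrative, not from the notes):

fit <- lm(hr ~ lot, data = refrig)   # least squares fit of Work Hours on Lot Size
coef(fit)                            # b0 (intercept) and b1 (slope)
plot(refrig$lot, refrig$hr, xlab = "Lot Size", ylab = "Work Hours")
abline(fit)                          # overlay the fitted regression line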
The Gauss Markov Theorem

- Do these LSEs have any special statistical properties? Yes:
- Unbiased
- Minimum variance among all unbiased linear estimators
- BLUE (Best Linear Unbiased Estimator) -- not the color blue :(


Properties of the fitted model

- What is the fitted regression equation here?
  \hat{hr} = 62.366 + 3.5702 \cdot lot
- There are six properties in all (see pages 23-24). We shall
  check to see if these properties hold for our data set.
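A quick way to check some of these properties numerically,
continuing the hypothetical refrig/fit objects from above:

e <- resid(fit)                        # residuals e_i = Y_i - Yhat_i
sum(e)                                 # residuals sum to 0
sum(refrig$lot * e)                    # sum of X_i * e_i is 0
sum(fitted(fit) * e)                   # sum of Yhat_i * e_i is 0
# the fitted line passes through (Xbar, Ybar):
predict(fit, data.frame(lot = mean(refrig$lot))) - mean(refrig$hr)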
Interpretation of \beta_0 and \beta_1
Our model is as follows:
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  for all i = 1, ..., n.

- When X_i = 0 we find E(Y_i) = \beta_0. So \beta_0 is the population mean
  of the observed variable Y when X = 0.
- \beta_1 is the change in the population mean value of the observed
  variable when the independent variable undergoes a unit
  change.
Recall that (b_0, b_1) are the estimates of (\beta_0, \beta_1). So b_0 + b_1 X is
the estimate of the mean response of the variable Y when the
value of X is fixed at a certain level.

- What are fitted values? The fitted value of the i-th observation
  is
  \hat{Y}_i = b_0 + b_1 X_i
- The residual for the i-th observation is e_i = Y_i - \hat{Y}_i
- An example: page 23.
- What is E(Y_i)? What is E(Y_i - \hat{Y}_i)?
- Error Sum of Squares or Residual Sum of Squares:
  SSE = \sum_{i=1}^{n} e_i^2
Estimation of \sigma^2

- Suppose I have a random sample of size n drawn from a
  population whose mean is \mu and variance is \sigma^2.
- Suggest an unbiased estimator for \sigma^2.
- What were the two main assumptions required to answer the
  above question?
- Now consider our Y_i's for all i = 1, ..., n. What is the
  distribution of, say, Y_5?
- Does this random sample of size n satisfy the above two
  assumptions? Which ones does it violate?
- So what would be the estimator of \sigma^2 here?
  \hat{\sigma}^2 = MSE = \frac{SSE}{n - 2}
- E(MSE) = \sigma^2
- \sqrt{MSE} = \sqrt{\frac{SSE}{n - 2}}
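In R (same hypothetical fit as before), MSE can be computed
directly and matches what lm reports:

n   <- length(resid(fit))
SSE <- sum(resid(fit)^2)       # error (residual) sum of squares
MSE <- SSE / (n - 2)           # unbiased estimator of sigma^2
sqrt(MSE)                      # equals summary(fit)$sigma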
Chapter 1: Important Terms and Concepts:
1. explanatory variable/covariate/independent variable
2. response variable/observations/dependent variable
3. scatter plot
4. linear model
5. simple linear regression
6. slope \beta_1, its estimate b_1, and its interpretation
7. intercept \beta_0, its estimate b_0, and its interpretation
8. method of least squares
9. fitted model and its properties (see pages 23-24); fitted value
10. Gauss Markov Theorem
11. residual
12. error sum of squares or residual sum of squares
13. error mean square or residual mean square
14. estimate of \sigma^2
The section in Chapter 1 which will not be included in the syllabus
is 1.8.
Chapter 2

[Figure: two scatter plots of y against x showing a linear trend]

- From the scatter plot above we know a linear model is
  appropriate.
- But we are not sure whether \beta_1 should be nonzero.
- What do we do now? That is, which model should we fit:
  with or without \beta_1?
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i   OR   Y_i = \beta_0 + \varepsilon_i
- By fitting a model we mean estimating its parameters. In our
  case we estimate (\beta_0, \beta_1) or (\beta_0, \beta_1, \sigma^2) accordingly as \sigma^2 is
  known or not.
- Now if we cannot infer much from the scatter plot and we
  want to use some analytical tools, then one way to see if a
  linear relationship exists at all would be to test
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
Hypothesis Testing

- We need an H_0 and an H_1.
- T: Test statistic. (This should not contain any unknown
  parameters.)
- Distribution of T under H_0.
- R: Critical Region or the Rejection Region (usually in terms of
  the test statistic T).
- Now compute the observed value of T, also written as T*.
- Check to see if T* belongs to R.
- Or compute the P-value and compare it with \alpha.

- Our hypotheses are:
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
- Recall the estimate for \beta_1 is b_1:
  b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
      = \sum_{i=1}^{n} k_i Y_i,  where  k_i = \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2}
- This b_1 shall be our test statistic.
- Notice it is a linear combination of the Y_i's.
More about b_1

- E(b_1) = ?
- Var(b_1) = ?
- Sampling Distribution: what shall be the distribution of b_1?
- What would be the distribution of
  TS = \frac{b_1 - \beta_1}{\sqrt{Var(b_1)}}
- Can TS be a test statistic? Does it contain any unknown
  parameter?
- Estimated variance of b_1, that is, the estimate of Var(b_1): s(b_1)^2.
- What is the distribution of
  \frac{Z}{\sqrt{\chi^2_n / n}}
  where Z ~ N(0, 1) and \chi^2_n are independent?
- We know that (n - 2) MSE / \sigma^2 \sim \chi^2_{n-2}.
- And we also know the above random variable is independent
  of b_0 and b_1.
- What would be the distribution of
  \frac{b_1 - \beta_1}{s(b_1)}
- Example 1 (what is the degrees of freedom of our test
  statistic?) and Example 2 (pages 46, 48).
Confidence interval for \beta_1

- Is the t distribution symmetric?
- t(\alpha/2; n-2) denotes the (\alpha/2)100-th percentile of the t
  distribution with n - 2 degrees of freedom.
- t(\alpha/2; n-2) = -t(1 - \alpha/2; n-2)
- Now
  P\left( -t(1-\alpha/2; n-2) \le \frac{b_1 - \beta_1}{s(b_1)} \le t(1-\alpha/2; n-2) \right) = 1 - \alpha
- Thus the 100(1-\alpha)% C.I. for \beta_1 is
  \left( b_1 - t(1-\alpha/2; n-2)\, s(b_1),\; b_1 + t(1-\alpha/2; n-2)\, s(b_1) \right)
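A sketch of this interval in R (hypothetical fit from before; "lot"
is the assumed predictor name):

b1    <- coef(fit)[2]
s.b1  <- summary(fit)$coefficients[2, "Std. Error"]   # s(b1)
alpha <- 0.05
tval  <- qt(1 - alpha/2, df = df.residual(fit))       # t(1 - alpha/2; n-2)
c(b1 - tval * s.b1, b1 + tval * s.b1)                 # 95% C.I. for beta1
confint(fit, "lot", level = 0.95)                     # same interval from R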
ANOVA: Analysis of Variance

- Observe the data set Y_1, ..., Y_n with n observations.
- The sample variance of this data set is
  s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
- Is s^2 always 0? When will it be 0?
- If the answer to the above question is no, then the question is:
  why NOT?
- Because there is variation in the data.
- What causes this variation?
- Recall, we had assumed that
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
- What do we see? The variation is caused by:
- a part explained by the X_i's;
- the random errors \varepsilon_i.

- Y_i = \bar{Y} + (Y_i - \bar{Y})
- What is \bar{Y}?
- Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)
Thus
(Y_i - \bar{Y})^2 = (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2

- Which finally gives us (verify):
  \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- So we have,
  \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- SSR (Regression Sum of Squares) = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
- MSR = SSR / df(SSR)
- SSE (Error Sum of Squares) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- MSE = SSE / df(SSE)
- SSTO (Total Sum of Squares) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
- SSTO = SSR + SSE

The expected values

- E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2
- E(MSE) = \sigma^2
- \frac{E(MSR)}{E(MSE)} = \frac{\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} = 1  when \beta_1 = 0.
- Now we know the ratio will be greater than 1 when \beta_1 is
  not equal to 0. Thus we can use
  F = \frac{MSR}{MSE}
  as a test statistic for testing
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
- What is the distribution of F under H_0?
F distribution

- What is the distribution of \sum_{i=1}^{n} Z_i^2, where the Z_i are
  i.i.d. N(0, 1)?
- What is the distribution of
  \frac{Z}{\sqrt{\chi^2_n / n}}
  where Z and \chi^2_n are independent?
- F = \frac{\chi^2_m / m}{\chi^2_n / n},  where \chi^2_m and \chi^2_n are independent.
- Our question was: F ~ ?
- Now
  F = \frac{(SSR/\sigma^2)/1}{(SSE/\sigma^2)/(n-2)}
- If we know that, under H_0: \beta_1 = 0, SSR/\sigma^2 and SSE/\sigma^2 are
  independent and are distributed as \chi^2_1 and \chi^2_{n-2}, then
  F \sim F(1, n-2)

- Since F is always greater than 0, what shall be the critical
  region?
- Recall that
  \frac{E(MSR)}{E(MSE)} = \frac{\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} = 1  when \beta_1 = 0.
- Critical region: { F > c }
- Although the alternative hypothesis is two-sided, the critical
  region is one-sided.
- See the example on page 71 of the textbook.
- So we have established another method of testing
  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0
- Are the two tests, one using a T statistic and one using an F
  statistic, equivalent? YES.
- Notice that
  {X^2 > 4} = {X > 2} \cup {X < -2}
- Since F = T^2, therefore
  {F > c} = {T > \sqrt{c}} \cup {T < -\sqrt{c}}
- See page 71: F(0.95; 1, 23) = 4.28 = (2.069)^2 = (t(0.975; 23))^2
ANOVA Table

Source of Variation | SS   | df  | MS              | E(MS)  | F*      | P-value
Regression          | SSR  | 1   | MSR = SSR/1     | E(MSR) | MSR/MSE | <0.0001*
Error               | SSE  | n-2 | MSE = SSE/(n-2) | E(MSE) |         |
Total               | SSTO | n-1 |                 |        |         |
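In R the whole table comes from anova(), and the equivalence of the
t and F tests can be verified numerically (hypothetical fit from
before; n = 25 as in the refrigeration data):

anova(fit)                                        # SSR, SSE, MSR, MSE, F*, P-value
tstar <- summary(fit)$coefficients[2, "t value"]
fstar <- anova(fit)[1, "F value"]
all.equal(tstar^2, fstar)                         # F* = (t*)^2
qt(0.975, 23)^2                                   # = qf(0.95, 1, 23) = 4.28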
Inference on \beta_0

[Figure: two scatter plots of y against x showing a linear trend with a nonzero intercept]

- From the scatter plot above we know a linear model is
  appropriate.
- But we are not sure whether \beta_0 should be nonzero.
- What do we do now? That is, which model should we fit:
  with or without \beta_0?
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i   OR   Y_i = \beta_1 X_i + \varepsilon_i
- By fitting a model we mean estimating its parameters. In our
  case we estimate (\beta_0, \beta_1) or (\beta_0, \beta_1, \sigma^2) accordingly as \sigma^2 is
  known or not.
- Now if we cannot infer much from the scatter plot and we
  want to use some analytical tools, then one way to decide
  would be to test
  H_0: \beta_0 = 0
  H_1: \beta_0 \neq 0
Inference on \beta_0

- Recall
  b_0 = \bar{Y} - b_1 \bar{X} = \frac{1}{n} \sum_{i=1}^{n} Y_i - \bar{X} \sum_{i=1}^{n} k_i Y_i = \sum_{i=1}^{n} \left( \frac{1}{n} - \bar{X} k_i \right) Y_i
- E(b_0) = \beta_0
- \sigma^2(b_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right]

- Sampling Distribution: what shall be the distribution of b_0?
  (Assuming that \sigma^2 is known.)
- What would be the distribution of
  TS = \frac{b_0 - \beta_0}{\sqrt{Var(b_0)}}
- Can TS be a test statistic? Does it contain any unknown
  parameter?
- Estimated variance of b_0, that is, the estimate of Var(b_0): s(b_0)^2.
- s(b_0)^2 = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right]

- Recall that
  \frac{Z}{\sqrt{\chi^2_n / n}} \sim t_n
  where Z ~ N(0, 1) and \chi^2_n are independent.
- We know that (n - 2) MSE / \sigma^2 \sim \chi^2_{n-2}.
- And we also know the above random variable is independent
  of b_0 and b_1.
- What would be the distribution of
  \frac{b_0 - \beta_0}{s(b_0)}
Confidence interval for \beta_0

- Is the t distribution symmetric?
- t(\alpha/2; n-2) denotes the (\alpha/2)100-th percentile of the t
  distribution with n - 2 degrees of freedom.
- t(\alpha/2; n-2) = -t(1 - \alpha/2; n-2)
- Now
  P\left( -t(1-\alpha/2; n-2) \le \frac{b_0 - \beta_0}{s(b_0)} \le t(1-\alpha/2; n-2) \right) = 1 - \alpha
- Thus the 100(1-\alpha)% C.I. for \beta_0 is
  \left( b_0 - t(1-\alpha/2; n-2)\, s(b_0),\; b_0 + t(1-\alpha/2; n-2)\, s(b_0) \right)
Interval estimation of E(Y_h)

- For a particular value of X, say X = X_h, we are interested in
  the value of the corresponding Y_h, i.e., the dependent variable
  corresponding to this value of the independent variable.
- Now it won't be possible to get the value of Y_h itself, so the next
  best thing would be E(Y_h), and this is in a way better because
  it actually gives us the mean of all the possible
  values of Y_h.
- Recall Y_h = \beta_0 + \beta_1 X_h + \varepsilon_h. Thus
  E(Y_h) = \beta_0 + \beta_1 X_h
- What would be an estimator of E(Y_h)? Well, from what we
  already have, we can say that
  \hat{Y}_h = b_0 + b_1 X_h would be an estimator of E(Y_h).

- E(\hat{Y}_h) = \beta_0 + \beta_1 X_h
- Variance of \hat{Y}_h:
  \sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- Estimate of \sigma^2(\hat{Y}_h) = ?
  MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]

Sampling distribution of \hat{Y}_h

- What would be the distribution of
  TS = \frac{\hat{Y}_h - (\beta_0 + \beta_1 X_h)}{\sqrt{Var(\hat{Y}_h)}}
- Can TS be a test statistic? Does it contain any unknown
  parameter?
- Estimated variance of \hat{Y}_h, that is, the estimate of Var(\hat{Y}_h):
  s(\hat{Y}_h)^2 = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- Recall that
  \frac{Z}{\sqrt{\chi^2_n / n}} \sim t_n
  where Z ~ N(0, 1) and \chi^2_n are independent.
- We know that (n - 2) MSE / \sigma^2 \sim \chi^2_{n-2}.
- And we also know the above random variable is independent
  of b_0 and b_1.
- What would be the distribution of
  \frac{\hat{Y}_h - (\beta_0 + \beta_1 X_h)}{\sqrt{MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
Confidence interval for E(Y_h)

- Is the t distribution symmetric?
- t(\alpha/2; n-2) denotes the (\alpha/2)100-th percentile of the t
  distribution with n - 2 degrees of freedom.
- t(\alpha/2; n-2) = -t(1 - \alpha/2; n-2)
- Now
  P\left( -t(1-\alpha/2; n-2) \le \frac{\hat{Y}_h - (\beta_0 + \beta_1 X_h)}{s(\hat{Y}_h)} \le t(1-\alpha/2; n-2) \right) = 1 - \alpha
- Thus the 100(1-\alpha)% C.I. for \beta_0 + \beta_1 X_h is
  \left( b_0 + b_1 X_h - t(1-\alpha/2; n-2)\, s(\hat{Y}_h),\; b_0 + b_1 X_h + t(1-\alpha/2; n-2)\, s(\hat{Y}_h) \right)
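In R this interval is a one-liner (hypothetical fit; X_h = 65 is an
arbitrary illustrative level):

predict(fit, newdata = data.frame(lot = 65),
        interval = "confidence", level = 0.95)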
Prediction of new observation
Now suppose we have X = X_new and want to find the
corresponding value of Y_new. Since
Y_new = \beta_0 + \beta_1 X_new + \varepsilon_new
and \varepsilon_new is unobserved, we shall never be able to get the exact
value of Y_new. So here we propose an interval estimate of Y_new.
Recall \hat{Y}_new = b_0 + b_1 X_new.
Now consider Y_new - \hat{Y}_new.

- E(Y_new - \hat{Y}_new) = ?
- Var(Y_new - \hat{Y}_new) = \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- s^2(Y_new - \hat{Y}_new) = MSE + MSE \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
- What is the distribution of
  \frac{Y_new - \hat{Y}_new}{\sqrt{\sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
- If \sigma^2 is unknown we shall replace it by the MSE.
- So now the question is: what is the distribution of
  \frac{Y_new - \hat{Y}_new}{\sqrt{MSE + MSE \left[ \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
  = \frac{Y_new - \hat{Y}_new}{\sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(X_new - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]}}
- Observe that MSE is independent of the numerator. Why?
- Thus the 100(1-\alpha)% prediction interval for Y_new is
  \left( \hat{Y}_new - t(1-\alpha/2; n-2)\, s(Y_new - \hat{Y}_new),\; \hat{Y}_new + t(1-\alpha/2; n-2)\, s(Y_new - \hat{Y}_new) \right)
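The R call is the same as for E(Y_h) except for the interval type
(hypothetical fit; X_new = 65 is illustrative). The resulting
interval is wider, since its variance carries the extra sigma^2 term:

predict(fit, newdata = data.frame(lot = 65),
        interval = "prediction", level = 0.95)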
R^2 and r

\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

- SSR (Regression Sum of Squares) = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
- SSE (Error Sum of Squares) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
- SSTO (Total Sum of Squares) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
- SSTO = SSR + SSE
- Thus,
  1 = \frac{SSR}{SSTO} + \frac{SSE}{SSTO}
  \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}
  R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}
- What happens if all the observations are on a straight line?
  SSE = 0, i.e., R^2 = 1.
- R^2 = 0. What does it imply? See the explanation on page 74.
A brief note on R^2 and r.
R^2 is defined as the proportionate reduction of the total variation
associated with the use of the predictor variable X.

- R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} = \frac{SSTO - SSE}{SSTO}
- Thus the larger the value of SSR (or the smaller the value of
  SSE), the closer R^2 is to 1.
- If our model was
  Y_i = \beta_0 + \varepsilon_i
  we can show that, based on this model, the total variation
  (a constant times the estimate of \sigma^2) will be
  \sum_{i=1}^{n} (y_i - \bar{y})^2.
- However, if we use the model which uses the predictor
  variable X,
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
  we can show that, based on this model, the remaining variation
  (a constant times the estimate of \sigma^2) will be
  \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
- Thus the absolute reduction in total variation will be
  \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
- Now the proportionate reduction (or the percentage decrease)
  in the total variation will be
  \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
Limitations of R^2
On page 75 of the textbook the authors speak of 3
misunderstandings. Here we go over them.

- I: Recall that we had proved that E(\hat{Y}_h) = \beta_0 + \beta_1 X_h. This
  implies that the fitted regression equation \hat{Y}_h = b_0 + b_1 X_h is
  an unbiased estimator of the mean value of Y when the level
  of X is fixed at X_h. Now merely being unbiased does not
  mean much. We would want to make sure that the variance
  of this estimator is not too large. We have
  s.e.^2(\hat{Y}_h) = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
  So if X_h = 100, then \hat{Y}_h = b_0 + b_1 \cdot 100, which would be an
  unbiased estimator of \beta_0 + \beta_1 \cdot 100, but the s.e.^2 would be
  203.72, which we can see is quite large. (See page 55,
  Example 2.)
Using a 100(1-\alpha)% C.I. for testing a hypothesis at level \alpha

- The Critical Region for rejecting H_0: \beta_1 = \beta_{10} against
  H_a: \beta_1 \neq \beta_{10} is
  \{ b_1 > \beta_{10} + t(1-\alpha/2, n-2)\, se(b_1) \} \cup \{ b_1 < \beta_{10} - t(1-\alpha/2, n-2)\, se(b_1) \}
- The above is equivalent to
  \{ b_1 - t(1-\alpha/2, n-2)\, se(b_1) > \beta_{10} \} \cup \{ b_1 + t(1-\alpha/2, n-2)\, se(b_1) < \beta_{10} \}
- The null hypothesis will NOT be rejected if
  b_1 - t(1-\alpha/2, n-2)\, se(b_1) < \beta_{10} < b_1 + t(1-\alpha/2, n-2)\, se(b_1)
- Thus under the null hypothesis, i.e., when H_0: \beta_1 = \beta_{10} is
  true, we have
  P\left( b_1 - t(1-\alpha/2, n-2)\, se(b_1) < \beta_{10} < b_1 + t(1-\alpha/2, n-2)\, se(b_1) \right) = 1 - \alpha
- Recall the 100(1-\alpha)% C.I. for \beta_1 is
  \left( b_1 - t(1-\alpha/2, n-2)\, se(b_1),\; b_1 + t(1-\alpha/2, n-2)\, se(b_1) \right)
- That is,
  P\left( b_1 - t(1-\alpha/2, n-2)\, se(b_1) \le \beta_1 \le b_1 + t(1-\alpha/2, n-2)\, se(b_1) \right) = 1 - \alpha
- So if we are given a 100(1-\alpha)% confidence interval for, say,
  \beta_1 and we want to use it to test the null hypothesis
  H_0: \beta_1 = \beta_{10} against the two-sided alternative, then what
  do we do?
- We check to see if \beta_{10} lies outside the interval; if so, then we
  reject the null.
Chapter 2

- Sections not included for the midterm: 2.6, 2.8 and 2.11.
- Important terms and concepts:
- Inference (testing + interval estimation) concerning \beta_1
- T test
- ANOVA table + F test
- Inference (testing + interval estimation) concerning \beta_0
- Interval estimation of E(Y_h)
- Prediction interval for Y_new
- R^2 and r
Chapter 3

- We assumed that our observations were generated as follows:
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  i = 1, ..., n,
  where the \varepsilon_i were assumed to be i.i.d. Normal(0, \sigma^2).
- Essentially we had assumed that:
- The regression function is linear. (Model assumption)
- The error terms have a constant variance. (Error assumption)
- The error terms are independent. (Error assumption)
- The error terms are normally distributed. (Error assumption)
- The question is how do we know, merely based on the data
  set, that the above four assumptions are valid.

- This question can be answered by studying the residuals.
- What are the residuals?
- Recall that
  e_i = Y_i - \hat{Y}_i  and  \hat{Y}_i = b_0 + b_1 X_i
- Also recall (from the six properties of the regression on pages
  23 and 24):
  \sum_{i=1}^{n} e_i = 0
  \sum_{i=1}^{n} X_i e_i = 0

- Question: Are the residuals independent? Why?
- Now E(e_i) = 0 for all i = 1, ..., n.
- Var(e_i) = \sigma^2 (1 - h_{ii})
- What is h_{ii}? It's the i-th diagonal element of the hat matrix
  H. What is the hat matrix? Later.
- Moral of the story: the e_i's do not have a constant variance!
- So it becomes difficult to use them together. If we want to
  compare one with another we would want to make sure they
  are standardised.

- So consider
  e^*_i = \frac{e_i - \bar{e}}{\sqrt{Var(e_i)}} = \frac{e_i}{\sqrt{Var(e_i)}} = \frac{e_i}{\sqrt{\sigma^2 (1 - h_{ii})}}
  This is going to make it mean 0 and variance 1. But what if the
  variance \sigma^2 is unknown?

- Then we shall use
  e^*_i = \frac{e_i - \bar{e}}{\sqrt{\widehat{Var}(e_i)}} = \frac{e_i}{\sqrt{\widehat{Var}(e_i)}} = \frac{e_i}{\sqrt{MSE (1 - h_{ii})}}
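A sketch of this computation in R (hypothetical fit from before);
note that R's built-in rstandard() returns the same quantity:

MSE   <- sum(resid(fit)^2) / df.residual(fit)
h     <- hatvalues(fit)                      # diagonal elements h_ii of H
estar <- resid(fit) / sqrt(MSE * (1 - h))    # standardised residuals
plot(fitted(fit), estar); abline(h = 0)      # residual plot against fitted values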
Nonlinearity of Regression Function

- Is the regression function linear?
- An initial answer to this question is provided by the scatter plot.
- But this is not always as effective as the residual plot (we
  shall soon see an example).
- What is a residual plot? A plot of the residuals vs X, OR of the
  residuals vs the fitted values.
- Let us consider the following examples.


[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: semi-studentised residuals vs fitted values]

- Here the data was generated from a linear model:
  Y = 2.4 + 1.4 X + Normal(0, 2)
- From the scatter plot we observe a strict linear trend.
- Now observe the plot of the semistudentized residuals against x.
  What do we see there?
- Notice that the plot is scattered around the y = 0 line, and,
  more importantly, we do NOT observe any pattern!
- If the functional form of the regression model is incorrect, the
  residual plots constructed by using the model will often
  display a pattern.
- Later we shall see that this pattern can be used to determine
  a more appropriate model.
- More examples ...


[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: semi-studentised residuals vs fitted values]

- Here the data was generated from the model
  Y = X^2 + Normal(0, 2)
- From the scatter plot we observe that it's not linear at all. So
  we can guess that a linear fit won't work.
- Now observe the plot of the semistudentized residuals against x.
  What do we see there?
- Notice that the plot shows a distinct pattern (in this case a
  quadratic pattern), implying that the functional form of the
  population regression equation is not linear.
- More examples ...


[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: semi-studentised residuals vs fitted values]

- Here the data was generated from a linear model with no slope:
  Y = 5 + Normal(0, 2)
- From the scatter plot we observe that it is linear with no
  slope. So we can guess that a linear fit with a slope won't work.
- Now observe the plot of the semistudentized residuals against x.
  What do we see there?
- The fitted model:

   Estimate     t value  Pr(>|t|)
b0 5.0134539    170.324  <2e-16 ***
b1 0.0001085      0.065   0.948
---
R-squared: 0.000153

- In the past examples it was pretty obvious from the scatter plot
  itself what to expect.
- Let us take a look at the following example.


[Figure: scatter plot of dependent vs independent variable]
[Figure: residuals vs predictor]
[Figure: residuals vs fitted values]
The fitted model:

b0  -1.727  0.13500
b1   6.484  0.00064 ***
R-squared: 0.8751
[Figure: scatter plot of dependent vs independent variable]
[Figure: semi-studentised residuals vs predictor]
[Figure: scatter plot with the fitted regression line overlaid]

- Here the data was generated from the model
  Y = 5 + 3X + X^2 + Normal(0, 2).
- Now observe the plot of the semistudentized residuals against x
  and compare it with R^2. What can we conclude?
Nonconstancy of error variance

- Scatter plot of the residuals against the X_i's or the fitted values.
- If, however, the number of observations is small, a scatter
  plot of the absolute values of the residuals against the X_i's or
  the fitted values would be a good thing to do.
- If the variances are systematically increasing: a funnel
  opening outwards.
- If the variances are systematically decreasing: a funnel
  closing.
- If the variances vary arbitrarily, we won't be able to make that
  out from the residual plot.
Nonindependence of error terms

- How do we verify the assumption that the random errors are
  independent?
- When we say non-independent, we mean that the errors are
  generated from an autoregressive time series model, that is:
  \varepsilon_i = \rho \varepsilon_{i-1} + u_i,  where the u_i are i.i.d. N(0, \sigma^2)
- Plot the residuals against the time sequence instead of the
  predicted values or the X_i's themselves.
- Observe the plot. If it does not display any pattern then the
  errors are independent.
- However, if there is a pattern then we can assume
  non-independence.
- What kind of pattern are we looking for?
- We expect the residuals in the sequence plot to fluctuate in a
  more or less random pattern around 0. Lack of randomness
  can take the form of too much or too little alternation of points
  around the zero line. Read the Comment on page 110.
- Now, merely based on a graph we can only draw conclusions
  subjectively.
- In order to be more precise we need a hypothesis test. This
  brings us to the Durbin-Watson test.

- H_0: The errors are not autocorrelated
  H_1: The errors are autocorrelated
- Test statistic:
  d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}
- We shall talk more about this shortly.
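The statistic itself is easy to compute in R from the residuals of a
hypothetical fit, assuming they are stored in time order:

e <- resid(fit)                  # residuals, in time order
d <- sum(diff(e)^2) / sum(e^2)   # Durbin-Watson statistic
d                                # values near 2 suggest no autocorrelation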


Nonnormality of error terms

- Distribution plots: box plot, histogram, etc.
- Comparison of frequencies (the 95% rule for the t distribution).
- Q-Q Plot:
- Step 1: Arrange the standardised residuals from the smallest to
  the largest, e_{[i]}. These go on the Y axis.
- Step 2: On the X axis plot the points z_{(i)}, where z_{(i)} is
  defined as the point on the scale of the standard normal curve
  such that the area under the curve to the left of this point
  z_{(i)} is
  \frac{i - 0.375}{n + 0.25}
- If the plot appears to be a straight line then we can assume
  normality.
- Why z_{(i)}? Because E(e_{[i]}) is very close to \sqrt{MSE}\, z_{(i)}.
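These two steps can be carried out by hand in R (hypothetical fit);
qqnorm() automates the same idea:

e <- sort(resid(fit))                    # ordered residuals e_[i]
n <- length(e)
p <- (seq_len(n) - 0.375) / (n + 0.25)   # areas to the left of z_(i)
z <- qnorm(p)                            # the points z_(i)
plot(z, e)                               # roughly straight line => normality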
Outliers

- What are outliers?
- Extreme observations.
- How to detect outliers?
- Plot the semistudentized residuals against X_i or \hat{Y}_i. This plot
  should be centered around the line y = 0. Now look for the
  points which lie outside the band between y = -4 and y = 4. This
  is a rough-and-ready way to detect outliers.
- What causes outliers to occur in the residual scatter plot?
- Mostly errors in recording the observations. But they
  may also be due to interaction with another predictor variable
  which is not present in the model.
Why do we need to detect outliers?

- LSEs are significantly modified by the presence of outliers.
  This causes a wrong fit,
- especially if the number of observations is small. This in turn
  leads us to erroneous conclusions, as it alters the residual plot
  as well.
Exploratory Residual Analysis

- MODEL ASSUMPTIONS
- The regression function is linear:
  plot of residuals against predictor variables or fitted values.
- The error variances are constant:
  plot of residuals against predictor variables or fitted values.
- The error terms are independent:
  plot of residuals against the time sequence.
- The error terms are normally distributed:
  normal probability plot of residuals.
- OUTLIER DETECTION
- Plot of residuals against predictor variables or fitted values.

Tests
In the last few slides we saw a bunch of plots to verify the four
model assumptions. But plots can be interpreted subjectively, so in
order to check whether the assumptions hold we need to carry out
hypothesis tests.

- To check for the independence of the error terms: Durbin-Watson
  test.
- To check for the constancy of the variance of the error terms:
  Brown-Forsythe or Breusch-Pagan test.
- To check for the normality of the error terms: correlation test
  for the Q-Q plot, or the Kolmogorov-Smirnov test, and a bunch of
  others.
- To check for the linearity of the regression function: F test for
  Lack of Fit.
- To detect outliers: remove the point, refit the regression
  equation and construct a prediction interval.
Remedy

- So far we have been discussing how to detect whether the
  model assumptions are violated.
- Now we shall talk about remedies.
- There are two ways of fixing the problem:
- Change the model (from linear to something more complex).
- Transform the data set.
- Why should we prefer one over the other? Complex models
  may lead to a better understanding as to how the data is
  being generated, but estimating the model parameters might
  turn out to be very challenging. Whereas, if we can transform
  the data successfully, then we can get away with building a
  relatively simpler model (perhaps with fewer parameters to
  estimate and using simpler techniques).
Model

- Regression function is not linear: consider a polynomial or a
  nonlinear regression model. How do we determine its
  functional form? The scatter plot, or techniques which shall be
  discussed later. Instead of the usual linear form we could have
  E(Y) = \beta_0 + \beta_1 X + \beta_2 X^2  OR  E(Y) = \beta_0 X^{\beta_1}.
- Variances of the error terms are not constant: use weighted
  least squares instead of the usual method.
- That is, weight each observation Y_i by 1/\sigma_i^2, thus making
  the variances of the observations comparable.
- Non-independence of errors: use a dependent structure
  instead of the usual i.i.d. assumption, which will in turn alter
  the theoretical properties of the estimates of the model
  parameters.
- Non-normality of the error terms: use the proper distribution
  instead of the normal distribution.
Data Transformation
As we discussed earlier, using a complex model can be very
challenging. Whereas, if we can transform the data successfully,
then we can get away with building the simple regression model. So
we could:

- Transform X.
- Transform Y: power transformation or Box-Cox
  transformation.
- Transform both X and Y.

Remember: Why are we transforming? So that we don't have to
use a complex model, i.e., we can use a simple linear model. That
is, all four model assumptions are valid.
Transforming X
Suppose the scatter plot shows a nonlinear trend; here we could
either transform X or Y. If we have sufficient reason to
believe that the error terms are normally distributed and the
variance is constant, then we should transform X rather than Y.
Why so? Because if we transform Y, say to \sqrt{Y}, then the
distribution of the errors shall change and the variances shall not
necessarily be constant any more.

- Step 1: Draw the scatter plot.
- Step 2: Does this plot look linear?
- Step 3: Yes? B-)) No? Make a good guess!
- Step 4: Transform the X.
- Step 5: Draw the scatter plot again.
- Step 6: Build the model.
- Step 7: Residual analysis. What do the plots say?
- Note: As we shall soon see, in Step 4 this transformation does
  not have to be unique; a couple of different ones may do the
  job. We select any one which fits.
- See the comment on page 132 - very important.
- The transformation only on X is essentially used when the
  problem lies in the linearity assumption of the regression
  function.
- Now if the assumptions of constant variance and normality are
  violated, what do we do then?
- These are issues related to the error term. So the only way to
  fix them would be transforming the Y values. Usually both these
  problems are addressed together by using only one
  transformation.
- How do we know that there is a problem? Use the residual
  plot or observe the scatter plot carefully. If it seems like it is
  spreading out, then that shows the variance is increasing. Take
  a look at figure 3.15 in the textbook.
Transforming Y
Once we know that transforming Y is necessary, how do we do it?

- We could just guess from the scatter plot like we did for X,
  but there is another method which is less heuristic.
- The Box-Cox Transformation, or the Power Transformation. How
  does it work?
- We would be happiest if we could fit the model
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
- But that is not to be :(( So we need to transform the Y
  variable, that is, come up with a \lambda and use Y^\lambda instead
  of Y; that is, now we have to fit
  Y_i^\lambda = \beta_0 + \beta_1 X_i + \varepsilon_i
Box Cox Transformation

- Step 1: Choose 20 (?) uniformly separated points from the
  interval [-2, 2], e.g. [-2.0, -1.8, -1.6, ..., 1.6, 1.8, 2.0].
- Step 2: For each value of \lambda make the following
  transformation, where K_2 = \left( \prod_{i=1}^{n} Y_i \right)^{1/n}
  is the geometric mean of the Y_i:
  If \lambda \neq 0:
  W_i = \frac{1}{\lambda K_2^{\lambda - 1}} (Y_i^\lambda - 1)
  If \lambda = 0:
  W_i = K_2 \log_e(Y_i)
- Step 3: So now, from the original data set consisting of the
  Y_i's, we have a new data set consisting of the W_i's instead.
  Now we build a regression model with these (X_i, W_i). The number
  of regression models shall be the same as the number of different
  values of \lambda.
- Step 4: Now calculate the SSE for each of the regression
  models. Plot the SSE vs \lambda. Select that \lambda where the SSE
  is minimised.
- Step 5: If a non-zero \lambda is chosen, then our transformation
  will be Y^\lambda; otherwise, if \lambda = 0, the transformation will
  be \log_e(Y).
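A minimal sketch of this grid search in R (hypothetical refrig data
from before; the grid and the tolerance used to detect lambda = 0 are
implementation choices, not from the notes):

lambdas <- seq(-2, 2, by = 0.2)
K2 <- exp(mean(log(refrig$hr)))          # geometric mean of the Y_i
sse <- sapply(lambdas, function(lam) {
  W <- if (abs(lam) < 1e-8) K2 * log(refrig$hr)
       else (refrig$hr^lam - 1) / (lam * K2^(lam - 1))
  sum(resid(lm(W ~ refrig$lot))^2)       # SSE of the regression of W on X
})
plot(lambdas, sse, type = "b")           # pick the lambda minimising SSE
lambdas[which.min(sse)]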
Transforming both X and Y

- When unequal variances are present but the regression relation
  is linear, a transformation on Y may not be sufficient.
- Such a transformation on Y may stabilise the error variance, but
  it might also change the linear relationship to a nonlinear one.
- A transformation on X shall also be required.
- So first we transform Y and then we transform X.

Post Transformation

- Once we are done with the transformation, we fit the
  linear regression to the transformed data and check the
  residual plot.
- If we are happy with the residual plot, then job done.
  Otherwise we try out a different transformation till we arrive
  at the desired result.
Confidence Intervals

- The 100(1 - \alpha/2)% C.I. for \beta_0 is
  b_0 \pm t(1 - \alpha/4; n-2)\, s\{b_0\}
- The 100(1 - \alpha/2)% C.I. for \beta_1 is
  b_1 \pm t(1 - \alpha/4; n-2)\, s\{b_1\}
- What will be a 100(1 - \alpha)% region for (\beta_0, \beta_1)?
- Let A denote the complement of the region
  (b_0 - t(1-\alpha/4; n-2)\, s\{b_0\},\; b_0 + t(1-\alpha/4; n-2)\, s\{b_0\}).
  Thus P(A) = \alpha/2.
- Let B denote the complement of the region
  (b_1 - t(1-\alpha/4; n-2)\, s\{b_1\},\; b_1 + t(1-\alpha/4; n-2)\, s\{b_1\}).
  Thus P(B) = \alpha/2.
- P(A \cup B) = P(A) + P(B) - P(A \cap B)
- P(A^c \cap B^c) \ge 1 - P(A) - P(B)
- Now our required region is, say, R:
  (b_0 \pm t(1-\alpha/4; n-2)\, s\{b_0\}) \times (b_1 \pm t(1-\alpha/4; n-2)\, s\{b_1\})
- P(R) = P(A^c \cap B^c) \ge 1 - P(A) - P(B)
- P(R) = P(A^c \cap B^c) \ge 1 - \alpha
Regression through the origin

- If our model were
  Y_i = \beta_1 X_i + \varepsilon_i
- E(Y_i) = \beta_1 X_i
- The LSE of \beta_1 is
  b_1 = \sum_{i=1}^{n} X_i Y_i \Big/ \sum_{i=1}^{n} X_i^2
- The unbiased estimator of \sigma^2 is
  s^2 = MSE = \sum_{i=1}^{n} e_i^2 \Big/ (n - 1)
- Table 4.1, page 162.

Multiple Regression

- So far we had been studying simple linear regression.
- Now we move to multiple linear regression.
- Why was it simple: it had only one covariate, that is,
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  for all i = 1 to n
- In a multiple regression setting we have
  Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + ... + \beta_{p-1} X_{(p-1)i} + \varepsilon_i
  that is,
  Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j X_{ji} + \varepsilon_i,  for all i = 1 to n
Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \varepsilon_{n \times 1}
where
Y_{n \times 1} = the vector of n observations,
Y = (Y_1, ..., Y_n)'
X_{n \times p} = the design matrix.
The i-th row of the above matrix is
X_i = (1, X_{1i}, X_{2i}, ..., X_{(p-1)i})
\varepsilon_{n \times 1} = the vector of n errors, that is,
\varepsilon = (\varepsilon_1, ..., \varepsilon_n)'
and the \varepsilon_i are i.i.d. N(0, \sigma^2).
Now that we can write
Y = X\beta + \varepsilon
we have
E(Y_{n \times 1}) = X\beta
Note we are taking the expectation of a vector.
V(Y) = \sigma^2 I_{n \times n}
We derived the variance of a vector. The Least Squares
Estimates, here, are
\hat{\beta} = b = (X'X)^{-1} X'Y
E(\hat{\beta}) = \beta
Fitted value
\hat{Y} = X\hat{\beta}
\hat{Y} = X (X'X)^{-1} X'Y
or
\hat{Y} = HY,  where  H = X (X'X)^{-1} X'
It is to be noted that the matrix H is idempotent and symmetric.


Residuals in multiple regression
e = Y - \hat{Y} = Y - HY = (I - H) Y
E(e) = 0. Why?
V(e) = \sigma^2 (I - H). Why?
s^2(e) = MSE (I - H)
ANOVA in multiple regression setting
Recall that
SSTO = SSE + SSR
where
SSTO = Y'Y - \frac{1}{n} Y'JY = Y'\left[ I - \frac{1}{n} J \right] Y;
as usual, it has n - 1 degrees of freedom.
SSE = e'e = (Y - Xb)'(Y - Xb) = Y'(I - H) Y;
it has n - p degrees of freedom.
SSR = b'X'Y - \frac{1}{n} Y'JY = Y'\left[ H - \frac{1}{n} J \right] Y;
it has p - 1 degrees of freedom. Here J is an n x n matrix of 1s.
(For a numerical example see page 243; for the ANOVA table see page
225.)
Thus we have
MSE = SSE/(n - p)
(n - p) MSE / \sigma^2 \sim \chi^2_{n-p}
MSR = SSR/(p - 1)
As before it can be shown that, if all the \beta_i (i \ge 1) are zero,
then E(MSR) = \sigma^2; otherwise E(MSR) > \sigma^2.
Now that we have MSE and MSR we can construct an ANOVA table
as before. Here,
H_0: \beta_1 = \beta_2 = ... = \beta_{p-1} = 0
H_a: at least one \beta_i is non-zero
The test statistic is
F^* = \frac{MSR}{MSE}
The rejection region is
F^* \ge F(1 - \alpha; p-1, n-p)
R^2: Coefficient of multiple determination
As before,
R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}

- It measures the proportionate reduction of the total variation
  in Y associated with the use of the set of X variables
  X_1, ..., X_{p-1}.
- 0 \le R^2 \le 1. Now R^2 = 0 when all b_k = 0 for
  k = 1, ..., p-1, and R^2 = 1 when \hat{Y}_i = Y_i for all
  i = 1, ..., n.
- As we increase the number of covariates, i.e. the X_i's (say from
  p = 5 to p = 10), the SSE will decrease, and thus R^2 will
  increase. But this does not mean that the model has a better fit.
- This is why we have the adjusted R^2:
  R^2_{adj} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)}
Inference about the regression parameters in multiple
regression
E(b) = \beta. Why?
V(b) = \sigma^2 (X'X)^{-1}. Why?
s^2(b) = MSE (X'X)^{-1}
We know that
b \sim MVN(\beta, \sigma^2 (X'X)^{-1})
So we know that each b_k, where k = 0, 1, ..., (p-1), is normally
distributed. Hence, as before,
\frac{b_k - \beta_k}{s\{b_k\}} \sim t(n-p),  for all k = 0, 1, 2, ..., p-1
Hence the interval estimate of \beta_k with (1 - \alpha) confidence
coefficient is
\left( b_k - t(1-\alpha/2, n-p)\, s\{b_k\},\; b_k + t(1-\alpha/2, n-p)\, s\{b_k\} \right)
Tests for \beta_k, where k = 0, 1, 2, ..., p-1. In order to test
H_0: \beta_k = 0
H_a: \beta_k \neq 0
we use
t^* = \frac{b_k}{s\{b_k\}}
as our test statistic, and our critical region is
|t^*| > t(1 - \alpha/2; n-p)

Joint confidence intervals. If we wish to find the joint confidence
intervals of g of the p parameters, the confidence limits with family
confidence coefficient 1 - \alpha are:
b_k \pm t(1 - \alpha/2g; n-p)\, s\{b_k\}
Interval Estimation of E(Y_h). Recall
E(Y) = X\beta
Thus
E(Y_h) = X_h' \beta

General Linear Test Approach

Recall that the model under consideration is
Y = X\beta + \varepsilon
\hat{\beta} = (X'X)^{-1} X'Y
In order to test
H_0: \beta_j = 0
H_a: \beta_j \neq 0
we use the t-test, where the test statistic is
\frac{\hat{\beta}_j - \beta_j}{\sqrt{MSE \, C_{jj}}} \sim t_{(n-p)}
where C_{jj} is the j-th diagonal element of (X'X)^{-1}.
Also we had
SST = SSR + SSE,
and as p increases SSE decreases (or SSR increases) and vice versa.
Here
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
and it does not depend on either p, the number of parameters in
the model, or the values of the covariates in the model (i.e., the
actual values of the X_i's).
Now let's, for a moment, get back to the SLR model setting, that
is,
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,  i = 1, ..., n
Now in this setting the full model is
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
and the reduced model is
y_i = \beta_0 + \varepsilon_i
Under the full model,
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
Under the reduced model,
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2
Observe that, under the reduced model, SSE = SST. Since
SSE decreases as p increases, whether adding any new variable had
any effect can be found out by comparing
SSE(F) with SSE(R).
We also know that
SSE(F) \le SSE(R)
So in order to test
H_0: \beta_1 = 0
H_a: \beta_1 \neq 0
we use the following test statistic:
F^* = \frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F) / df_F}
If H_0 is true we know
F^* = \frac{\frac{SST - SSE}{(n-1) - (n-2)}}{\frac{SSE}{n-2}} = \frac{MSR}{MSE} \sim F(1, n-2)
Here (that is, in the case when p = 2) we find that the General
Linear Test is identical to the ANOVA test.
When p = 2, we have
SST = SSR(X_1) + SSE(X_1)
When p = 3, we have
SST = SSR(X_1, X_2) + SSE(X_1, X_2)
When p = 4, we have
SST = SSR(X_1, X_2, X_3) + SSE(X_1, X_2, X_3)
EXTRA SUM OF SQUARES:
SSR(X_2 | X_1) = SSR(X_1, X_2) - SSR(X_1)
             = SSE(X_1) - SSE(X_1, X_2)
The EXTRA SUM OF SQUARES is the measure of the marginal
effect of adding the new variable to the existing model. Similarly
we can define:
SSR(X_3 | X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3)
or
SSR(X_3, X_2 | X_1) = SSE(X_1) - SSE(X_1, X_2, X_3)
SST = SSR(X_1) + SSE(X_1)
SSR(X_2 | X_1) = SSE(X_1) - SSE(X_1, X_2)
SST = SSR(X_1) + SSR(X_2 | X_1) + SSE(X_1, X_2)
SST = SSR(X_1, X_2) + SSE(X_1, X_2)
Comparing the two we get
SSR(X_1, X_2) = SSR(X_1) + SSR(X_2 | X_1)
Since
SSR(X_2 | X_1) = SSE(X_1) - SSE(X_1, X_2)
SSR(X_3 | X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3)
we can write
SST = SSR(X_1) + SSE(X_1)
    = SSR(X_1) + SSR(X_2 | X_1) + SSE(X_1, X_2)
    = SSR(X_1) + SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2) + SSE(X_1, X_2, X_3)
Comparing this with
SST = SSR(X_1, X_2, X_3) + SSE(X_1, X_2, X_3)
we can write
SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2)
The df of SSR is p - 1, so the df of SSR(X_1 | X_2, X_3) is 1, and
that of SSR(X_1, X_2 | X_3) is 2. Now that we have these SSRs we can
also define the corresponding MSRs:
MSR(X_2, X_3 | X_1) = SSR(X_2, X_3 | X_1) / 2
Thus we decompose the total SSR into smaller components. What
is the use of all this? Well, it gives us an idea as to how the
reduction in variation takes place and how each covariate is
responsible for bringing about this change; in other words, the
contribution of each covariate becomes more explicit.
Consider the following (full) model:
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i
H_0: \beta_3 = 0
H_a: \beta_3 \neq 0
SSE(F) = SSE(X_1, X_2, X_3)
SSE(R) = SSE(X_1, X_2)
F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}
    = \frac{SSE(X_1, X_2) - SSE(X_1, X_2, X_3)}{(n-3) - (n-4)} \div \frac{SSE(X_1, X_2, X_3)}{n-4}
    = \frac{SSR(X_3 | X_1, X_2)}{1} \div \frac{SSE(X_1, X_2, X_3)}{n-4}
    = \frac{MSR(X_3 | X_1, X_2)}{MSE(X_1, X_2, X_3)}
This is known as the partial F-test.

H_0: \beta_2 = \beta_3 = 0
H_a: H_0 is not true.
Full Model: same as before, thus SSE(F) = SSE(X_1, X_2, X_3).
Reduced Model: Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i, thus
SSE(R) = SSE(X_1).
