
INTRODUCTION.

In today's industry there is no shortage of "information". The purpose of this course is to explain in some detail the technique of extracting, from masses of data, the main features of the relationship hidden or implied in the tabulated figures. Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other or the others.
RELATIONS BETWEEN VARIABLES.

The concept of a relation between two variables, such as between family income and family expenditures, is a familiar one. We can distinguish between a functional relation and a statistical relation.
Functional relation between two variables.
A functional relation between two variables is expressed by a mathematical formula
Y = f(X)
where X is the independent variable and Y the dependent one.
Example: Sales in rands and number of units sold. If the selling price is R5, the relation is expressed by
Y = 5X
The number of units sold and the rand sales during three recent periods were as follows:
period   number of units sold   rand sales
1        5                      25
2        10                     50
3        15                     75
These observations are plotted in the figure below.

Notice that all the points fall directly on the line of functional relationship.

Statistical relation between two variables.

A statistical relation, unlike a functional relation, is not a perfect one. In general, the observations for a statistical relation do not fall directly on the curve of relationship. For a certain set of observations the following plot is drawn.

However, this relation is not a perfect one. There is a scattering of points, suggesting that some of the variation of Y is not accounted for by X. The plotted line indicates the general tendency by which Y varies with changes in X. The scattering of points around the line represents variation in Y which is not associated with X, and which is usually considered to be of a random nature.

REGRESSION MODELS AND THEIR USES.


Basic concepts
A regression model is a formal means of expressing the two essential ingredients
of a statistical relation:
1. A tendency of the dependent variable to vary with the independent variable or
variables in a systematic fashion.
2. A scattering of observations around the curve of statistical relationship.
These two characteristics are embodied in a regression model by postulating that:
1. In the population of observations associated with the sampled process, there is a probability distribution of Y for each level of X.
2. The means of these probability distributions vary in some systematic fashion with X [E(Y) = h(X)].
Regression analysis serves three major purposes:
1. description
2. control
3. prediction

SIMPLE LINEAR REGRESSION.

We will start with the case where there is only one independent variable and the regression function is linear. In this case we consider the following model:
(2.1) Y_i = β_0 + β_1 X_i + ε_i
where:
Y_i is the value of the response in the ith trial;
β_0 and β_1 are parameters;
X_i is a known constant, namely the value of the independent variable in the ith trial;
ε_i is a random error term with mean E(ε_i) = 0 and variance σ²(ε_i) = σ², and ε_i and ε_j are uncorrelated, so the covariance cov(ε_i, ε_j) = 0 for all i, j such that i ≠ j.
This model is said to be simple, linear in the parameters, and linear in the independent variable.
Let us notice that:
1. The observed value of Y in the ith trial is the sum of two components: the constant term β_0 + β_1 X_i and the random term ε_i. Hence Y_i is a random variable.
2. Since E(ε_i) = 0, it follows that
(2.2) E(Y_i) = E(β_0 + β_1 X_i + ε_i) = β_0 + β_1 X_i + E(ε_i) = β_0 + β_1 X_i
Therefore the regression function for simple linear regression is

(2.3) E(Y) = β_0 + β_1 X
3. The observed value of Y in the ith trial differs from the value of the regression function by the error term amount ε_i.

4. The error terms ε_i are assumed to have constant variance σ². It therefore follows that the variance of the response Y_i is:
(2.4) σ²(Y_i) = σ².

Meaning of regression parameters

The parameters  o and  1 in regression model (2.1) are called regression coefficients.
 1 is the slope of the regression line. It indicates the change in the mean of the probability
distribution of Y per unit increase in X. The parameter  o is the Y intercept of the regression
line. If the scope of the model includes X  0,  o gives the mean of the probability
distribution of Y at X  0. When the scope of the model does not cover X  0,
 o does not have any particular meaning as a separate term in the regression model.

ESTIMATION OF REGRESSION FUNCTION.

We shall denote the (X, Y) observations for the first trial by (X_1, Y_1), for the second trial by (X_2, Y_2), and in general for the ith trial by (X_i, Y_i), where i = 1, 2, ..., n. To find estimators of β_0 and β_1 we use the method of least squares. According to the method of least squares, the estimators of β_0 and β_1 are those values b_0 and b_1, respectively, that minimize the criterion Q given by:
(2.8) Q = Σ_{i=1}^{n} (Y_i − β_0 − β_1 X_i)²
To find them we differentiate Q with respect to β_0 and β_1 and solve the system of equations
∂Q/∂β_0 = 0 and ∂Q/∂β_1 = 0.
From these we get the following:

(2.9a) Σ Y_i = n b_0 + b_1 Σ X_i
(2.9b) Σ X_i Y_i = b_0 Σ X_i + b_1 Σ X_i²
The equations (2.9a) and (2.9b) are called the normal equations; b_0 and b_1 are called point estimators of β_0 and β_1, respectively. Solving them, one gets the following solutions:
(2.10a) b_1 = [Σ X_i Y_i − (Σ X_i)(Σ Y_i)/n] / [Σ X_i² − (Σ X_i)²/n] = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²
(2.10b) b_0 = (1/n)(Σ Y_i − b_1 Σ X_i) = Ȳ − b_1 X̄

where Ȳ and X̄ have the usual meaning.
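As a quick numerical check of (2.10a) and (2.10b), the Python/NumPy sketch below (not part of the original notes) evaluates the least squares estimates directly from the formulas, using the ten-observation data set that appears in the worked example later in this course.

```python
import numpy as np

# Data from the worked example later in these notes (n = 10 runs).
X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

# (2.10a): b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# (2.10b): b0 = Ybar - b1 * Xbar
b0 = Y.mean() - b1 * X.mean()

print(b0, b1)   # 10.0 and 2.0 for this data set
```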

Properties of least squares estimators.

An important theorem, called the Gauss-Markov theorem, states


(2.11) THEOREM: Under the conditions of model (2.1), the least squares estimators b_0 and b_1 in (2.10) are unbiased and have minimum variance among all unbiased linear estimators. Hence
E(b_0) = β_0
E(b_1) = β_1

Point estimation of mean response.

Estimated regression function. Given sample estimators b_0 and b_1 of the parameters in the regression function (2.3):
E(Y) = β_0 + β_1 X
it is natural to estimate the regression function as follows:
(2.12) Ŷ = b_0 + b_1 X
where Ŷ is the value of the estimated regression function at the level X of the independent variable. For the observations in the sample, we call Ŷ_i the fitted value for the ith observation, given by:
(2.13) Ŷ_i = b_0 + b_1 X_i,  i = 1, 2, 3, ..., n.

Residuals.

The ith residual is the difference between the observed value Y_i and the corresponding fitted value Ŷ_i. Denoting this residual by e_i, we can write
(2.16) e_i = Y_i − Ŷ_i = Y_i − b_0 − b_1 X_i
We need to distinguish between the model error term value ε_i = Y_i − E(Y_i) and the residual e_i = Y_i − Ŷ_i. The former involves the vertical deviation of Y_i from the unknown population regression line, and hence is unknown. The residual, on the other hand, is the observed vertical deviation of Y_i from the fitted regression line. Residuals are useful for studying the inferences in regression analysis.

Properties of fitted line.

1. The sum of the residuals is zero:
(2.17) Σ_{i=1}^{n} e_i = 0
Proof: Recall the normal equation (2.9a),
Σ Y_i = n b_0 + b_1 Σ X_i
Hence
Σ_{i=1}^{n} e_i = Σ Y_i − n b_0 − b_1 Σ X_i = 0
2. The sum of the squared residuals, Σ_{i=1}^{n} e_i², is a minimum (it is the least squares criterion Q).
3. The sum of the observed values Y_i equals the sum of the fitted values Ŷ_i:
(2.18) Σ_{i=1}^{n} Y_i = Σ_{i=1}^{n} Ŷ_i
This condition is implicit in the first normal equation:
Σ Y_i = n b_0 + b_1 Σ X_i = Σ(b_0 + b_1 X_i) = Σ Ŷ_i.
4. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the independent variable in the ith trial:
(2.19) Σ_{i=1}^{n} X_i e_i = 0
This follows from the second normal equation (2.9b):
Σ_{i=1}^{n} X_i e_i = Σ X_i(Y_i − b_0 − b_1 X_i) = Σ X_i Y_i − b_0 Σ X_i − b_1 Σ X_i² = 0
5. Σ_{i=1}^{n} Ŷ_i e_i = 0
6. The regression line always goes through the point (X̄, Ȳ).

PROBLEMS

Prove properties 5 and 6 stated above.
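Properties 1, 3, 4 and 5 are also easy to verify numerically (up to floating-point rounding). A minimal sketch (not part of the original notes), again using the example data set:

```python
import numpy as np

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X          # fitted values (2.13)
e = Y - Y_hat                # residuals (2.16)

print(np.sum(e))                   # property 1: sum of residuals is ~0
print(np.sum(Y), np.sum(Y_hat))    # property 3: equal sums
print(np.sum(X * e))               # property 4: sum of X_i * e_i is ~0
print(np.sum(Y_hat * e))           # property 5: sum of Yhat_i * e_i is ~0
```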

ESTIMATION OF THE ERROR TERM VARIANCE σ²

Let us notice that the Y_i come from different probability distributions with different means, depending on the level of X_i. To estimate the error variance we start with
(2.21) SSE = Σ_{i=1}^{n}(Y_i − Ŷ_i)² = Σ(Y_i − b_0 − b_1 X_i)² = Σ e_i²
where SSE stands for error sum of squares or residual sum of squares.
The sum of squares SSE has n − 2 degrees of freedom associated with it (two are lost because both β_0 and β_1 had to be estimated in obtaining Ŷ_i).
Hence the appropriate mean square, denoted by MSE, is
(2.22) MSE = SSE/(n − 2) = Σ(Y_i − Ŷ_i)²/(n − 2) = Σ(Y_i − b_0 − b_1 X_i)²/(n − 2) = Σ e_i²/(n − 2)
where MSE stands for error mean square or residual mean square.
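For completeness, here is how (2.21) and (2.22) look in code for the same example data set (a short sketch, not part of the original notes):

```python
import numpy as np

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(X)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

SSE = np.sum(e ** 2)      # (2.21) error (residual) sum of squares
MSE = SSE / (n - 2)       # (2.22) error mean square, n - 2 degrees of freedom
print(SSE, MSE)           # 60.0 and 7.5 for this data set
```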

Theorem: MSE is an unbiased estimator of σ² for the regression model (2.1).

Proof: We know that


SSE  ∑Y i − Y i  2  ∑Y i − b o − b 1 X i  2 
 ∑Y i − Y  b 1 X − b 1 X i  2 
 ∑ Y 2i − n Y  2  ∑ X 2i − nX 2 b 21  2nXb 1 Y − 2 ∑ X i Y i b 1
ESSE  E∑ Y 2i − n Y  2  ∑ X 2i − nX 2 b 21  2nXb 1 Y − 2 ∑ X i Y i b 1  
 ∑ EY 2i  − nE Y  2  ∑ X 2i − nX 2 Eb 21  2nXEb 1 Y  − 2 ∑ X i EY i b 1  
 ∑ EY 2i  − nE Y  2   ∑X i − X 2 Eb 21  2nXEb 1 Y  − 2 ∑ X i EY i b 1 

∑ EY 2i   ∑ 2   o   1 X i  2   n 2  n 2o  2n o  1 X   21 ∑ X 2i


nE Y  2   n n   o   1 X 2    2  n 2o  2n o  1 X  n 21 X 2
2

∑X i − X 2 Eb 21  ∑X i − X 2 2
  21   2   21 ∑X i − X 2
∑X i −X 2
  2   21 ∑ X 2i − n 21 X 2  ∑X i − X 2 Eb 21

X i −X X i −X X i −X


Eb 1 Y   1
E∑ Yi ∑ Yj  1
E ∑ YiYj  ∑ Y 2i 
n
∑X i −X 2
n
i≠j ∑X i −X 2 ∑X i −X 2
5
X i −X X i −X
 1
∑ EY i Y j   1
∑ EY 2i 
n
i≠j ∑X i −X 2
n
∑X i −X 2
X i −X X i −X
 1
∑  o   1 X i  o   1 X j   1
∑  2   o   1 X i  2  
n
i≠j ∑X i −X 2
n
∑X i −X 2
X i −X X i −X
 1
∑  o   1 X i  o   1 X j   ∑  o   1 X i  2 
n
i≠j ∑X i −X 2 ∑X i −X 2
X i −X X i −X
 1n ∑ 2  1
∑  o   1 X i  o   1 X j  
∑X i −X 2 n
i,j ∑X i −X 2
X i −X X i −X
 1
∑∑  o   1 X i  o   1 X j   1
∑ o   1 X j  ∑  o   1 X i
n
∑ j i X i −X 2
n
j i ∑X i −X 2
X i −XX i X i −XX i
 n n o   1 ∑ X j  1 ∑
1
   o   1 X 1 ∑    o   1 X 1
i ∑X i −X 2 i ∑X i −X 2
2nXEb 1 Y   2nX o   1 X 1  2n o  1 X  2n 21 X 2  2nXEb 1 Y 
X j −X X j −X X i −X
EY i b 1   E ∑ YjYi  ∑ EY j Y i  EY 2i 
∑X i −X 2 j,,j≠i ∑X i −X 2 ∑X i −X 2
X j −X X i −X
∑  o   1 X i  o   1 X j    2   o   1 X i  2  
j≠i ∑X i −X 2 ∑X i −X 2
X j −X X i −X X i −X
∑  o   1 X i  o   1 X j    o   1 X i  2  2 
j≠i ∑X i −X 2 ∑X i −X 2 ∑X i −X 2
X j −X X i −X
∑  o   1 X i  o   1 X j   2 
j ∑X i −X 2 ∑X i −X 2
X j −X X i −X
  o   1 X i  ∑  o   1 X j   2 
j ∑ X i −X 2 ∑ X i −X 2

X j −X X j −XX j X i −X


  o   1 X i   o ∑  1 ∑  2 
j ∑X i −X 2 j ∑X i −X 2 ∑X i −X 2
X i −X
  o   1 X i  1  2
∑ X i −X 2

X i −X
2 ∑ X i EY i b 1   2 ∑ X i  o   1 X i  1  2 
∑X i −X 2
X i −XX i
 2 ∑ X i  o   1 X i  1  2 ∑ 2 
∑X i −X 2
 2 o  1 ∑ X i  2 21 ∑ X 2i  2 2 
 2n o  1 X  2 21 ∑ X 2i  2 2  2n o  1 X  2 21 ∑ X 2i  2 2  2 ∑ X i EY i b 1 

Hence
ESSE  E∑ Y 2i − n Y  2  ∑ X 2i − nX 2 b 21  2nXb 1 Y − 2 ∑ X i Y i b 1  
 ∑ EY 2i  − nE Y  2  ∑ X 2i − nX 2 Eb 21  2nXEb 1 Y  − 2 ∑ X i EY i b 1  
 ∑ EY 2i  − nE Y  2   ∑X i − X 2 Eb 21  2nXEb 1 Y  − 2 ∑ X i EY i b 1  
 n 2  n 2o  2n o  1 X   21 ∑ X 2i −  2 − n 2o − 2n o  1 X − n 21 X 2   2 
 21 ∑ X 2i − n 21 X 2  2n o  1 X  2n 21 X 2 − 2n o  1 X − 2 21 ∑ X 2i − 2 2 
 n − 2 2
We used the following
X ∑X i − X  0
j

6
∑X j −XX j −X ∑X i −X ∑X j −XX j −XX i −X
X j −XX j

j j j
  
j ∑X i −X 2 ∑X i −X 2 ∑X i −X 2
∑X j −XX j −X
j
 1
∑X i −X 2
and ∑ X 2i − nX 2   ∑X i − X 2  ∑ X 2i − nX 2

Finally
ESSE n−2 2
EMSE  E SSE
n−2
 n−2
 n−2
 2

There are a number of alternative formulas for SSE:

(2.24a) SSE = Σ Y_i² − b_0 Σ Y_i − b_1 Σ X_i Y_i
(2.24b) SSE = Σ(Y_i − Ȳ)² − [Σ(X_i − X̄)(Y_i − Ȳ)]² / Σ(X_i − X̄)²
(2.24c) SSE = Σ Y_i² − (Σ Y_i)²/n − [Σ X_i Y_i − (Σ X_i)(Σ Y_i)/n]² / [Σ X_i² − (Σ X_i)²/n]

PROBLEMS

Prove (2.24a), (2.24b) and (2.24c) stated above.


Solution:
SSE = Σ(Y_i − Ŷ_i)² = Σ(Y_i² − 2Y_iŶ_i + Ŷ_i²) = Σ[Y_i² − Y_iŶ_i − Ŷ_i(Y_i − Ŷ_i)]
    = Σ Y_i² − Σ Y_i(b_0 + b_1X_i) − Σ(b_0 + b_1X_i)(Y_i − Ŷ_i)
    = Σ Y_i² − b_0Σ Y_i − b_1Σ X_iY_i − b_0Σ(Y_i − Ŷ_i) − b_1Σ X_i(Y_i − Ŷ_i)
    = Σ Y_i² − b_0Σ Y_i − b_1Σ X_iY_i
since Σ(Y_i − Ŷ_i) = Σ e_i = 0 and Σ X_i(Y_i − Ŷ_i) = Σ X_ie_i = 0. This proves
(2.24a) SSE = Σ Y_i² − b_0 Σ Y_i − b_1 Σ X_i Y_i.

For (2.24b), start from (2.21), SSE = Σ(Y_i − b_0 − b_1X_i)², and substitute (2.10a) and (2.10b):
SSE = Σ[Y_i − (1/n)(Σ Y_i − b_1Σ X_i) − b_1X_i]²
    = Σ[(Y_i − Ȳ) − b_1(X_i − X̄)]²
    = Σ(Y_i − Ȳ)² − 2b_1Σ(X_i − X̄)(Y_i − Ȳ) + b_1²Σ(X_i − X̄)²
    = Σ(Y_i − Ȳ)² − 2[Σ(X_i − X̄)(Y_i − Ȳ)]²/Σ(X_i − X̄)² + [Σ(X_i − X̄)(Y_i − Ȳ)]²/Σ(X_i − X̄)²
    = Σ(Y_i − Ȳ)² − [Σ(X_i − X̄)(Y_i − Ȳ)]²/Σ(X_i − X̄)²
which is
(2.24b) SSE = Σ(Y_i − Ȳ)² − [Σ(X_i − X̄)(Y_i − Ȳ)]² / Σ(X_i − X̄)².

For (2.24c), notice that
Σ(X_i − X̄)(Y_i − Ȳ) = Σ(X_iY_i − X_iȲ − X̄Y_i + X̄Ȳ) = Σ X_iY_i − (Σ X_i)(Σ Y_i)/n
and that, for any Z_i,
Σ(Z_i − Z̄)² = Σ(Z_i² − 2Z_iZ̄ + Z̄²) = Σ Z_i² − Z̄Σ Z_i − Z̄Σ(Z_i − Z̄) = Σ Z_i² − (1/n)(Σ Z_i)(Σ Z_i) = Σ Z_i² − (Σ Z_i)²/n
Using this identity for Z_i = Y_i and for Z_i = X_i in (2.24b) gives
SSE = Σ Y_i² − (Σ Y_i)²/n − [Σ X_iY_i − (Σ X_i)(Σ Y_i)/n]² / [Σ X_i² − (Σ X_i)²/n]
which is (2.24c).
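The equivalence of the three formulas is easy to confirm numerically. A short sketch (not part of the original notes), using the example data set introduced earlier:

```python
import numpy as np

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(X)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

sse_def = np.sum((Y - b0 - b1 * X) ** 2)                       # (2.21)
sse_a = np.sum(Y**2) - b0 * np.sum(Y) - b1 * np.sum(X * Y)     # (2.24a)
sxy = np.sum((X - X.mean()) * (Y - Y.mean()))
sse_b = np.sum((Y - Y.mean())**2) - sxy**2 / np.sum((X - X.mean())**2)   # (2.24b)
sse_c = (np.sum(Y**2) - np.sum(Y)**2 / n
         - (np.sum(X*Y) - np.sum(X)*np.sum(Y)/n)**2
           / (np.sum(X**2) - np.sum(X)**2 / n))                # (2.24c)

print(sse_def, sse_a, sse_b, sse_c)   # all equal (60.0 for this data set)
```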

NORMAL ERROR REGRESSION MODEL

The normal error model is as follows:

(2.25) Y_i = β_0 + β_1 X_i + ε_i
where:
Y_i is the value of the response in the ith trial;
β_0 and β_1 are parameters;
X_i is a known constant, namely the value of the independent variable in the ith trial;
ε_i are independent N(0, σ²), i = 1, 2, ..., n.

PROBLEMS

Find the distribution of Y_i and ε_i under the normal error model.

Solution: Since Y_i is a linear transformation of the normally distributed random variable ε_i, Y_i has a normal distribution.
E(Y_i) = E(β_0 + β_1X_i + ε_i) = β_0 + β_1X_i + E(ε_i) = β_0 + β_1X_i
Var(Y_i) = Var(β_0 + β_1X_i + ε_i) = Var(ε_i) = σ²
Hence under the normal model Y_i is N(β_0 + β_1X_i, σ²).
Alternatively: since ε_i is N(0, σ²), the corresponding density function is given by
f_{ε_i}(z) = f_{N(0,σ²)}(z) = [1/√(2πσ²)] exp(−z²/(2σ²))
Since Y_i = β_0 + β_1X_i + ε_i,
f_{Y_i}(y_i) = f_{ε_i}(y_i − β_0 − β_1X_i) = f_{N(0,σ²)}(y_i − β_0 − β_1X_i) = [1/√(2πσ²)] exp[−(y_i − β_0 − β_1X_i)²/(2σ²)]
More formally,
F_{Y_i}(y) = P(Y_i ≤ y) = P(β_0 + β_1X_i + ε_i ≤ y) = P(ε_i ≤ y − β_0 − β_1X_i) = F_{ε_i}(y − β_0 − β_1X_i)
Hence
f_{Y_i}(y) = ∂/∂y F_{Y_i}(y) = ∂/∂y F_{ε_i}(y − β_0 − β_1X_i) = f_{ε_i}(y − β_0 − β_1X_i)
           = f_{N(0,σ²)}(y − β_0 − β_1X_i) = [1/√(2πσ²)] exp[−(y − β_0 − β_1X_i)²/(2σ²)]
Hence under the normal model Y_i is N(β_0 + β_1X_i, σ²).

Estimation of parameters by method of maximum likelihood.

The likelihood function for the normal error model (2.25), given the sample observations Y_1, ..., Y_n, is:
(2.26) L(β_0, β_1, σ²) = ∏_{i=1}^{n} [1/√(2πσ²)] exp[−(Y_i − β_0 − β_1X_i)²/(2σ²)]
The values of β_0, β_1 and σ² which maximize this likelihood function are the maximum likelihood estimators. These are:
(2.27)
parameter   maximum likelihood estimator
β_0         b_0, the same as (2.10b)
β_1         b_1, the same as (2.10a)
σ²          σ̂² = Σ_{i=1}^{n}(Y_i − Ŷ_i)²/n
Thus, the maximum likelihood estimator σ̂² is biased, and ordinarily MSE as given in (2.22) is used instead.

PROBLEMS.

Question 1.
The results of a certain experiments are shown below
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
X i 5.5 4.8 4.7 3.9 4.5 6.2 6.0 5.2 4.7 4.3 4.9 5.4 5.0 6.3 4.6 4.3 5.0 5.9 4.1 4.7
Y i 3.1 2.3 3.0 1.9 2.5 3.7 3.4 2.6 2.8 1.6 2.0 2.9 2.3 3.2 1.8 1.4 2.0 3.8 2.2 1.5
Summary calculation results are:
∑ X i  100. 0, ∑ Y i  50. 0, ∑ X 2i  509. 12, ∑ Y 2i  134. 84, ∑ X i Y i  257. 66.
a) Obtain the least squares estimates of  o and  1 , and state the estimated regression
function.
b) Obtain the point estimate for mean Y when X score is 5.0
c) What is the point estimate of change in the mean response when the X score increases
by one .
Question 2
For the following set of data:

9
X i 30 20 60 80 40 50 60 30 70 60
Y i 73 50 128 170 87 108 135 69 148 132
1) Obtain the estimated regression function. 2) Interpret b o and b 1 .
Question 3.
Prove theorem 2.11.
Question 4.
Prove the following statements:
1) The sum of the observed values Y_i equals the sum of the fitted values Ŷ_i:
Σ_{i=1}^{n} Y_i = Σ_{i=1}^{n} Ŷ_i
2) The regression line always goes through the point (X̄, Ȳ).
Question 5.
Prove the following statements:
1) The sum of the residuals is zero:
Σ_{i=1}^{n} e_i = 0
2) The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the independent variable in the ith trial:
Σ_{i=1}^{n} X_i e_i = 0
Throughout the remainder of this course, unless otherwise stated, we assume that the normal error model (2.25) is applicable.
The model is
(3.1) Y_i = β_0 + β_1 X_i + ε_i
where
β_0 and β_1 are parameters;
X_i are known constants;
ε_i are independent N(0, σ²).

Inferences concerning  1 .

Tests concerning  1 are of the interest, particularly one of the form


Ho : 1  0
Ha : 1 ≠ 0
The reason for this is that  1  0 indicates that there is no linear association between X
and Y. To develop appropriate test first we have to find out the sampling distribution of
b 1 - our point estimator of  1 .

Sampling distribution of b 1 :

The point estimator of b 1 was given in (2.10a) as follows:


∑X i −XY i − Y 
3. 2 b1 
∑X i −X 2
Theorem: For the model (3.1), the sampling distribution of b 1 is normal
with:
3. 3a Eb 1    1
2
3. 3b Varb 1   2
∑X i −X
Proof: Let us notice that b 1 is a linear combination of Y i

10
∑X i −XY i − Y  X i −X ∑X i −X
b1   ∑ Y i − Y
∑X i −X 2 ∑X i −X 2 ∑X i −X 2
but ∑X 1 − X  ∑ X i − nX  ∑ X i − ∑ X i  0
hence
X i −X
3. 5 b 1  ∑ Yi
∑X i −X 2
X i −X
According to our model are constants (since X i are fixed), and we
∑X i −X 2
know that linear combination of independent normally distributed random
variables is normally distributed. The unbiasedness of the point estimator b 1
stated earlier in the Gauss-Markov theorem. We denote the quantities in (3.5) by
X i −X
3. 5a k i 
∑X i −X
2

The variance of b 1 we calculate using the fact that Y i are independent random
variables , each with the variance  2 .
2
X i −X
Hence Varb 1   Var∑ k i Y i   ∑ k 2i VarY i   ∑
2
k 2i  ∑
2

∑X i −X 2
∑X i −X 2
 2 2  
2 1
.
∑X i −X 2 ∑ i −X 2
X

Estimated variance of the sampling distribution of b 1 by replacing the parameter


 2 with the unbiased estimator of  2 namely

, MSE
∑Y i −Y i  2
3. 9 s 2 b 1   MSE
 .
∑X i −X 2 n−2 ∑X i −X 2
Let us notice that k i have a number of properties:
1. ∑ k i  0
2. ∑ k i X i  1
Proof of 2
∑ k i X i  ∑ k i X i − X ∑ k i  ∑ k i X i − ∑ k i X  ∑ k i X i − X 
X i −X X −XX i −X
∑ X i − X  ∑ i  1.
∑X i −X 2 2
∑X i −X
3. ∑ k 2i  1
∑X i −X 2
Sampling distribution of (b_1 − β_1)/s(b_1)

From probability theory we know that:

If Z_1, ..., Z_n are independent random variables, each N(0, 1), then Σ Z_i² is χ²(n).
If Z is N(0, 1), V is χ²(n), and Z and V are independent, then T = Z/√(V/n) has a t distribution with n degrees of freedom.
If Z_1, ..., Z_n are independent random variables, each N(0, 1), then Z̄ and (Z_1 − Z̄, ..., Z_n − Z̄) are independent.
Using the above theorems one can show that
(3.10) (b_1 − β_1)/s(b_1) is distributed as t(n − 2) for the model (3.1).
The reason for the n − 2 degrees of freedom is that the two parameters (β_0 and β_1) need to be estimated, hence two degrees of freedom are lost here.

Confidence interval for  1 .

Since b 1 −  1 /sb 1  follows a t distribution, we can make the following

11
probability statement:
(3.12) Pt/2; n − 2  b 1 −  1 /sb 1   t1 − /2; n − 2  1 − 
where t/2; n − 2 denotes /2100 percentile of the t distribution with
n-2 degrees of freedom. Since t distribution is symmetric we know that
(3.13) t/2; n − 2  −t1 − /2; n − 2
Therefore we get the following confidence interval
(3.14) Pb 1 − t1 − /2; n − 2sb 1    1  b 1  t1 − /2; n − 2sb 1   1 − .
or we can rewrite it as
(3.15) b 1  t1 − /2; n − 2sb 1 
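The interval (3.15) is straightforward to compute once s(b_1) is available; the percentile t(1 − α/2; n − 2) can be read from tables or obtained from software. A sketch in Python with SciPy (not part of the original notes), reproducing the 95% interval of the worked example further on:

```python
import numpy as np
from scipy import stats

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(X)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)

s_b1 = np.sqrt(MSE / np.sum((X - X.mean()) ** 2))    # (3.9)
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)           # t(1 - alpha/2; n - 2)
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)        # (3.15): about (1.89, 2.11)
```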

Tests concerning  1

The hypothesis are as follows:


(3.16) H o :  1  0
Ha : 1 ≠ 0
b 1 − 1
Since sb is distributed as t(n-2) the test statistics is
1
(3.17) t ∗  sbb 1 
1
The decision rule (at the level of significance ) is
(3.17a) if |t ∗ |  t1 − /2; n − 2, conclude H o
if |t ∗ |  t1 − /2; n − 2, conclude H a

Remark:Many computer packages commonly report the


P-value together with the value of the test statistics. In this way,
one can conduct a test at any desired level of significance by
comparing the P-value with the specific level . Users of
computers need to be careful to ascertain whether one sided
or two sided P-values are furnished.
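As an illustration of the remark, the sketch below (not part of the original notes) computes t* and its two-sided P-value for the example data set; a one-sided P-value would simply drop the factor of 2.

```python
import numpy as np
from scipy import stats

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(X)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
s_b1 = np.sqrt(MSE / np.sum((X - X.mean()) ** 2))

t_star = b1 / s_b1                                  # (3.17)
p_two_sided = 2 * stats.t.sf(abs(t_star), n - 2)    # two-sided P-value
print(t_star, p_two_sided)                          # about 42.58 and a very small P-value
```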

Sampling distribution of b_0

The point estimator b_0 was given in (2.10b) as follows:

(3.19) b_0 = Ȳ − b_1X̄
For the model (3.1) (normal error model), the sampling distribution of b_0 is normal with the following parameters:
(3.20a) E(b_0) = β_0
(3.20b) Var(b_0) = σ² ΣX_i² / [n Σ(X_i − X̄)²] = σ² [1/n + X̄²/Σ(X_i − X̄)²]

An estimator of Var(b_0) is obtained by replacing σ² by its point estimator MSE:
(3.21) s²(b_0) = MSE ΣX_i² / [n Σ(X_i − X̄)²] = MSE [1/n + X̄²/Σ(X_i − X̄)²]

The same arguments as for b_1 imply that

(3.22) (b_0 − β_0)/s(b_0) is distributed as t(n − 2) for the model (3.1).
The confidence limits for β_0 are calculated in the same manner as those for β_1. The confidence limits for β_0 are as follows:
(3.23) b_0 ± t(1 − α/2; n − 2)·s(b_0)

Interval estimation of E(Y_h)

Let X_h denote the level of X for which we wish to estimate the mean response. X_h may be a value which occurred in the sample, or it may be some other value of the independent variable within the scope of the model. The mean response when X = X_h is denoted by E(Y_h). Formula (2.12) gives us the point estimator Ŷ_h of E(Y_h):
(3.27) Ŷ_h = b_0 + b_1X_h
We now consider the sampling distribution of Ŷ_h.
For model (3.1), the sampling distribution of Ŷ_h is normal, with
(3.28a) E(Ŷ_h) = E(Y_h)
(3.28b) Var(Ŷ_h) = σ² [1/n + (X_h − X̄)²/Σ(X_i − X̄)²]
When MSE is substituted for σ² in (3.28b), we obtain s²(Ŷ_h), the estimated variance of Ŷ_h:
(3.30) s²(Ŷ_h) = MSE [1/n + (X_h − X̄)²/Σ(X_i − X̄)²]

Statement:
(3.31) (Ŷ_h − E(Y_h))/s(Ŷ_h) is distributed as t(n − 2) for model (3.1).
A 1 − α confidence interval for E(Y_h) is given by:
(3.32) Ŷ_h ± t(1 − α/2; n − 2)·s(Ŷ_h)

Prediction of a new observation.

The basic idea of a prediction interval is to choose a range in the distribution of Y wherein most of the observations will fall, and to declare that the next observation will fall in this range.
In general, when the regression parameters are known, the 1 − α prediction limits for Y_h(new) are:
(3.33) E(Y_h) ± z(1 − α/2)·σ
In the case when the parameters are unknown, the 1 − α prediction limits are
(3.35) Ŷ_h ± t(1 − α/2; n − 2)·s(Y_h(new))
where
(3.37) s²(Y_h(new)) = s²(Ŷ_h) + MSE
Using (3.30) one gets
(3.37a) s²(Y_h(new)) = MSE [1 + 1/n + (X_h − X̄)²/Σ(X_i − X̄)²]

Analysis of variance approach to the regression analysis.

Basic notations. The analysis of variance is based on the partitioning of sums of squares
and degrees of freedom associated with the response variable Y. To illustrate this approach
we will consider the following data

Run i   X_i   Y_i
1 30 73
2 20 50
3 60 128
4 80 170
5 40 87
6 50 108
7 60 135
8 30 69
9 70 148
10 60 132
The first figure shows Ȳ together with the observed values Y_i.

The variation of the Y_i is measured in terms of the deviations:

(3.41) Y_i − Ȳ
The measure of total variation, denoted by SSTO, is equal to
(3.42) SSTO = Σ(Y_i − Ȳ)²
Here SSTO stands for total sum of squares. If SSTO = 0, all observations are the same.
When we use the regression approach, the variation reflecting the uncertainty in the data is that of the Y observations around the regression line:
(3.43) Y_i − Ŷ_i
These deviations are shown in the next figure. The measure of variation in the data with the regression model is the sum of the squared deviations (3.43):
(3.44) SSE = Σ(Y_i − Ŷ_i)²
If SSE = 0, all observations fall on the fitted regression line.

In our case
SSTO = 13660 and SSE = 60.
What accounts for this difference? The difference is another sum of squares:
(3.45) SSR = Σ(Ŷ_i − Ȳ)²
where SSR stands for regression sum of squares. These deviations are shown in the next figure.

Each deviation is simply the difference between the fitted value on the regression line and the mean of the fitted values, Ȳ (recall from (2.18) that the mean of the fitted values Ŷ_i is Ȳ).

Formal development of the partitioning. Consider the deviation Y_i − Ȳ. We can decompose this deviation as follows:

(3.47) Y_i − Ȳ = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i)
       total deviation = deviation of fitted regression value around mean + deviation around regression line
The next figure shows this decomposition for one of the observations.

It is a very interesting property that the sums of these squared deviations have the same relationship:
(3.48) Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)²
or, using the notation in (3.42), (3.44) and (3.45):
(3.48a) SSTO = SSR + SSE
Proof:
Σ(Y_i − Ȳ)² = Σ[(Ŷ_i − Ȳ) + (Y_i − Ŷ_i)]²
            = Σ[(Ŷ_i − Ȳ)² + (Y_i − Ŷ_i)² + 2(Ŷ_i − Ȳ)(Y_i − Ŷ_i)]
            = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)² + 2Σ(Ŷ_i − Ȳ)(Y_i − Ŷ_i)
The last term on the right is zero, since:
2Σ(Ŷ_i − Ȳ)(Y_i − Ŷ_i) = 2ΣŶ_i(Y_i − Ŷ_i) − 2ȲΣ(Y_i − Ŷ_i)
and
2ȲΣ(Y_i − Ŷ_i) = 2ȲΣ e_i = 0 (by (2.17))
and
2ΣŶ_i(Y_i − Ŷ_i) = 2ΣŶ_ie_i = 2Σ(b_0 + b_1X_i)e_i = 2b_0Σ e_i + 2b_1Σ X_ie_i = 0 (by (2.17) and (2.19))
Hence (3.48) follows.

Computational formulas.

The definitional formulas for SSTO, SSR and SSE presented above are often not convenient for hand computation. Useful formulas for SSTO and SSR are:
(3.49) SSTO = Σ Y_i² − (Σ Y_i)²/n = Σ Y_i² − nȲ²
(3.50a) SSR = b_1[Σ X_iY_i − (Σ X_i)(Σ Y_i)/n] = b_1 Σ(X_i − X̄)(Y_i − Ȳ)
or
(3.50b) SSR = b_1² Σ(X_i − X̄)²

PROBLEMS.

1. Calculate SSTO, SSR and SSE for our data.

2. Prove (3.49), (3.50a) and (3.50b).
Solution:
SSTO = Σ(Y_i − Ȳ)², SSE = Σ(Y_i − Ŷ_i)², SSR = Σ(Ŷ_i − Ȳ)², and SSTO = SSR + SSE.
SSTO = Σ Y_i² − nȲ² follows from the identity Σ(Z_i − Z̄)² = Σ Z_i² − (Σ Z_i)²/n used earlier.
For SSR:
SSR = Σ(Ŷ_i² − 2Ŷ_iȲ + Ȳ²) = Σ Ŷ_i² − 2nȲ² + nȲ² = Σ Ŷ_i² − nȲ²
    = Σ(b_0 + b_1X_i)² − nȲ²
    = Σ(b_0² + 2b_0b_1X_i + b_1²X_i²) − nȲ²
    = n(Ȳ − b_1X̄)² + 2(Ȳ − b_1X̄)b_1nX̄ + b_1²Σ X_i² − nȲ²
    = nȲ² − 2nb_1X̄Ȳ + nb_1²X̄² + 2nb_1X̄Ȳ − 2nb_1²X̄² + b_1²Σ X_i² − nȲ²
    = b_1²Σ X_i² − nb_1²X̄² = b_1²[Σ X_i² − (Σ X_i)²/n] = b_1²Σ(X_i − X̄)²
or, more directly,
SSR = Σ(Ŷ_i − Ȳ)² = Σ(b_0 + b_1X_i − Ȳ)² = Σ(Ȳ − b_1X̄ + b_1X_i − Ȳ)² = Σ[b_1(X_i − X̄)]² = b_1²Σ(X_i − X̄)²
so we get (3.50b). Furthermore,
SSR = b_1²Σ(X_i − X̄)² = b_1[b_1Σ(X_i − X̄)²] = b_1Σ(X_i − X̄)(Y_i − Ȳ) = b_1[Σ X_iY_i − (Σ X_i)(Σ Y_i)/n]
so we get (3.50a), where we used
b_1 = [Σ X_iY_i − (Σ X_i)(Σ Y_i)/n]/Σ(X_i − X̄)²
b_0 = Ȳ − b_1X̄

Degrees of freedom.

Corresponding to the partitioning of the total sum of squares SSTO there is a partitioning of the associated degrees of freedom.
SSE has n − 2 degrees of freedom. SSR has one degree of freedom. Therefore SSTO has n − 1 degrees of freedom.

Mean squares.

We are interested in the regression mean square, denoted by MSR:

(3.51) MSR = SSR/1 = SSR
and in the error mean square, MSE, defined in (2.22):
(3.52) MSE = SSE/(n − 2)
In our example we have SSR = 13600 and SSE = 60, hence MSR = 13600/1 = 13600 and MSE = 60/8 = 7.5.

Analysis of variance table.

Basic table. The breakdowns of the total sum of squares and associated degrees of freedom are displayed in the form of an analysis of variance table (ANOVA table):

source of variation    SS                      df      MS                   E(MS)
regression             SSR = Σ(Ŷ_i − Ȳ)²       1       MSR = SSR/1          σ² + β_1²Σ(X_i − X̄)²
error                  SSE = Σ(Y_i − Ŷ_i)²     n − 2   MSE = SSE/(n − 2)    σ²
total                  SSTO = Σ(Y_i − Ȳ)²      n − 1

In our case the table is as follows:

source of variation    SS       df    MS
regression             13600    1     13600
error                  60       8     7.5
total                  13660    9

Modified table. Sometimes an ANOVA table showing one additional element of decomposition is utilized. Recall that by (3.49):
SSTO = Σ Y_i² − (Σ Y_i)²/n = Σ Y_i² − nȲ²
In the modified ANOVA table, the total uncorrected sum of squares, denoted by SSTOU, is defined as:

(3.53) SSTOU = Σ Y_i²

and the correction for the mean sum of squares, denoted by SS(correction for mean), is defined as:

(3.54) SS(correction for mean) = nȲ²

source of variation    SS                                   df      MS
regression             SSR = Σ(Ŷ_i − Ȳ)²                    1       MSR = SSR/1
error                  SSE = Σ(Y_i − Ŷ_i)²                  n − 2   MSE = SSE/(n − 2)
total                  SSTO = Σ(Y_i − Ȳ)²                   n − 1
correction for mean    SS(correction for mean) = nȲ²        1
total, uncorrected     SSTOU = Σ Y_i²                       n

F test of  1  0 versus  1 ≠ 0.
ANOVA provides us with battery of highly useful tests for regression model.
For a simple regression case considered here, the analysis of variance provides us with
test for:
(3.58) Ho : 1  0
Ha : 1 ≠ 0
Test statistics:
(3.59) F ∗  MSR
MSE
The decision rule:
(3.61) If F ∗  F1 − ; 1, n − 2, conclude H o
If F ∗  F1 − ; 1, n − 2, conclude H a
where F1 − ; 1, n − 2 is 1 − 100 percentile of the appropriate F distribution.
Equivalence of F test and t test.
For a given  level, the F test of  1  0 versus  1 ≠ 0 is equivalent algebraically to
the two-tailed t test. To see this, recall from (3.50b) that
SSR  b 21 ∑X i − X 2
Thus, we can write:
b 21 ∑X i −X 2
F ∗  MSR
MSE
 SSR1
SSEn−2
 MSE
Since s b 1  
2 MSE
we have
∑X i −X 2

18
b 21 ∑X i −X 2 b 21 b 21 2
(3.62) F∗    s b
 b1
sb 1 
 t ∗  2
1
MSE MSE 2

∑Xi−X2
Corresponding to the relation between t ∗ and F ∗ , we have the following
relation between the required percentiles of the t and F distributions in the tests:
t1 − /2; n − 2 2  F1 − ; 1; n − 2.
Thus, at given  level, we can use either t test or the F test for testing
of  1  0 versus  1 ≠ 0. Whenever one test leads to H o, so will the other, and
correspondingly to H a . The t test is more flexible since it can be used for one-sided
alternatives involving  1 ( 1  0 or  1  0 while the test F cannot.
Descriptive measures of association between X and Y in the regression model.

Coefficient of determination.

We saw earlier that SSTO measures the variation in the observations Y_i, or the uncertainty in predicting Y, when the independent variable X is not taken into account. SSE measures the variation in the Y_i when a regression model utilizing the independent variable X is employed. A natural measure of the effect of X in reducing the variation in the Y_i (the uncertainty in predicting Y) is
(3.69) r² = (SSTO − SSE)/SSTO = SSR/SSTO = 1 − SSE/SSTO

The measure r² is called the coefficient of determination. Since 0 ≤ SSE ≤ SSTO, it follows that
(3.70) 0 ≤ r² ≤ 1
We may interpret r² as the proportionate reduction of total variation associated with the use of the independent variable. Thus, the larger r² is, the more the total variation of Y is reduced by introducing the independent variable X. The limiting values of r² occur as follows:
1. If all observations fall on the fitted regression line, SSE = 0 and r² = 1. In this case the independent variable accounts for all the variation in the observations Y_i.
2. If the slope of the fitted regression line is zero (b_1 = 0), so that Ŷ_i = Ȳ, then SSE = SSTO and r² = 0. In this case there is no linear association between X and Y in the sample data, and the independent variable is of no help in reducing the variation in the observations Y_i with linear regression.

Coefficient of correlation

The square root of r²:

(3.71) r = ±√(r²)
is called the coefficient of correlation. A plus or minus sign is attached to this measure according to whether the slope of the fitted regression line is positive or negative. The range is
(3.72) −1 ≤ r ≤ 1
r does not have such a clear-cut interpretation as r².
A direct computational formula for r, which automatically furnishes the proper sign, is:
(3.73) r = Σ(X_i − X̄)(Y_i − Ȳ) / [Σ(X_i − X̄)² Σ(Y_i − Ȳ)²]^{1/2}

EXAMPLES
For our data we get the following

X_i   Y_i   X_iY_i   X_i²   Ŷ_i = b_0 + b_1X_i   e_i = Y_i − Ŷ_i   e_i²
30 73 2190 900 70 3 9
20 50 1000 400 50 0 0
60 128 7680 3600 130 -2 4
80 170 13600 6400 170 0 0
40 87 3480 1600 90 -3 9
50 108 5400 2500 110 -2 4
60 135 8100 3600 130 5 25
30 69 2070 900 70 -1 1
70 148 10360 4900 150 -2 4
60 132 7920 3600 130 2 4
500 1100 61800 28400 1100 0 60 ←TOTALS
Therefore
X̄ = (1/n)Σ X_i = 500/10 = 50
Ȳ = (1/n)Σ Y_i = 1100/10 = 110

To calculate b_0 and b_1 we use the following formulas:

b_1 = [Σ X_iY_i − (Σ X_i)(Σ Y_i)/n]/[Σ X_i² − (Σ X_i)²/n] = [61800 − 500·1100/10]/[28400 − 500²/10] = 2.0
b_0 = Ȳ − b_1X̄ = 110 − 2·50 = 10
So
Ŷ_i = 10 + 2·X_i
In our case
SSE = Σ(Y_i − Ŷ_i)² = Σ e_i² = 60
MSE = SSE/(n − 2) = 60/8 = 7.5
s²(b_1) = MSE/Σ(X_i − X̄)² = MSE/[Σ X_i² − (Σ X_i)²/n] = 7.5/(28400 − 500²/10) = 7.5/3400 = 0.002206
s(b_1) = √0.002206 = 0.046968

The 95% confidence interval for β_1 is
(b_1 − t(1 − α/2; n − 2)·s(b_1), b_1 + t(1 − α/2; n − 2)·s(b_1)) = (2 − 2.306·0.046968, 2 + 2.306·0.046968) = (1.89, 2.11)
where t(1 − α/2; n − 2) = t(0.975; 8) = 2.306.
Test for β_1:
H_0: β_1 = 0
H_a: β_1 ≠ 0
The test statistic is
t* = b_1/s(b_1) = 2/0.046968 = 42.582
The decision rule in our case is
If |t*| ≤ t(1 − α/2; n − 2), conclude H_0
If |t*| > t(1 − α/2; n − 2), conclude H_a
In our case |t*| = 42.582 > 2.306 = t(0.975; 8) = t(1 − α/2; n − 2), so we conclude H_a (β_1 ≠ 0), which means that there is a linear association between X and Y.
The P-value for our calculated value of the test statistic is
P(t(8) > t* = 42.58) < 0.0005

Confidence interval for  o


First we calculate
∑ X 2i
s 2 b o   MSE  7. 5 103400
28400
  6. 264 7
n ∑X i −X 2
0ne can use also
2
50 2
s 2 b o   MSE 1n  X
  7. 5 101    6. 264 7
∑X i −X 2 3400

so sb o   s 2 b o   6. 2647  2. 502 9


The 90% confidence interval for  o is
b o − t1 − /2; n − 2sb o , b o  t1 − /2; n − 2sb o  
10 − 1. 860  2. 5029, 10  1. 860  2. 5029  5. 34, 14. 66
where t1 − /2; n − 2  t0. 95; 8  1. 860

Confidence interval for E(Y h )


We would like to find the confidence interval for mean response corresponding to the
level of explanatory variable denoted by X h .
 X h −X 2
s 2 Y h   MSE 1n  
∑X i −X2

Let X h  55, then


 55−50 2
s 2 Y 55   7. 5 10
1
 3400   0. 805 15
Hence
sY 55   0. 80515  0. 897 3
Using the regression we get
Y 55  b o  b 1 X 55  10  2  55  120
The
 90% confidence formean  EY 55  is given by 
Y h − t1 − /2; n − 2sY h , Y h  t1 − /2; n − 2sY h  
120 − 1. 860  0. 8973, 120  1. 860  0. 8973  118. 3, 121. 7
where t1 − /2; n − 2  t0. 95; 8  1. 860.

Prediction of a new observation Y h(new) .


We know that 
s 2 Y hnew   s 2 Y h   MSE
Let X hnew  55
We get 
s 2 Y 55new   s 2 Y 55   MSE  0. 80515  7. 5  8. 305 2
and
sY 55new   8. 305 2  2. 881 9
The
 confidence interval is given by 
Y h − t1 − /2; n − 2sY hnew , Y h  t1 − /2; n − 2sY hnew 
For 90% confidence level we get
120 − 1. 860  2. 8819, 120 − 1. 860  2. 8819  114. 6, 125. 4
where t1 − /2; n − 2  t0. 95; 8  1. 860.
Analysis of variance table.
Basic table. The breakdowns of the total sum of squares and associated degrees of freedom are displayed in the form of an analysis of variance table (ANOVA table):

source of variation    SS                      df      MS                   E(MS)
regression             SSR = Σ(Ŷ_i − Ȳ)²       1       MSR = SSR/1          σ² + β_1²Σ(X_i − X̄)²
error                  SSE = Σ(Y_i − Ŷ_i)²     n − 2   MSE = SSE/(n − 2)    σ²
total                  SSTO = Σ(Y_i − Ȳ)²      n − 1

In our case the table is as follows:

source of variation    SS       df    MS
regression             13600    1     13600
error                  60       8     7.5
total                  13660    9

F test of  1  0 versus  1 ≠ 0.
The hypothesis.
Ho : 1  0
Ha : 1 ≠ 0
Test statistics:
F ∗  MSR
MSE
 13600
7.5
 1813. 3
The decision rule:
If F ∗  F1 − ; 1, n − 2, conclude H o
If F ∗  F1 − ; 1, n − 2, conclude H a
If   0. 05 then F1 − ; 1; n − 2  F0. 95; 1; 8  5. 32
Since F ∗  1813. 3  5. 32 we conclude H a , that  1 ≠ 0.
Coefficient of determination.
r 2  SSTO−SSE
SSTO
 SSTO
SSR
 1 − SSTO
SSE
 1 − 13660
60
 0. 995 61
Coefficient of correlation
r   r 2  0. 99561  0. 997 8
(a plus sign since the slope of the fitted regression line is positive)

PROBLEMS.

Question 1.
The results of a certain experiments are shown below
i 1 2 3 4 5 6 7 8 9
Xi 7 6 5 1 5 4 7 3 4
Y i 97 86 78 10 75 62 101 39 53
i 10 11 12 13 14 15 16 17 18
Xi 2 8 5 2 5 7 1 4 5
Y i 33 118 65 25 71 105 17 49 68

Summary calculational results are: ∑ Y i  1152, ∑ X i  81, ∑Y i − Y  2  16504,


∑X i − X 2  74. 5, ∑Y i − Y X i − X  1098.
1) Obtain the estimated regression function. 2) Plot the estimated regression function and
the data. 3) Interpret b o and b 1 . 4) Find the 90% confidence interval for:  o ,  1 , and
interpret them. 5)Test the H o :  1  0 versus H a :  1 ≠ 0 using t ∗ and ANOVA.
using   0. 05

22
6) Find 90% confidence intervals for mean of response variable corresponding
to the level of the explanatory equal to 5.
7)Find 95% prediction limits for new observation of the response variable
corresponding to the level of the explanatory equal to 5.
8) Obtain the residuals e i . 9) Estimate  2 and .
Question 2.
The results of a certain experiments are shown below
i 1 2 3 4 5 6 7 8 9 10
Xi 1 0 2 0 3 1 0 1 2 0
Y i 16 9 17 12 22 13 8 15 19 11
1) Obtain the estimated regression function. 2) Plot the estimated regression function and the data. 3) Interpret b_0 and b_1. 4) Find the 95% confidence intervals for β_0 and β_1, and interpret them. 5) Test H_0: β_1 = 0 versus H_a: β_1 ≠ 0 using t* and ANOVA, with α = 0.05. 6) Find a 95% confidence interval for the mean of the response variable corresponding to the level of the explanatory variable equal to 3.
7) Find 90% prediction limits for a new observation of the response variable corresponding to the level of the explanatory variable equal to 3.
8) Obtain the residuals e_i. 9) Estimate σ² and σ. 10) Compute Σ e_i².
Question 3.
In a test of the alternatives H_0: β_1 ≤ 0 versus H_a: β_1 > 0, a student concluded H_0. Does this conclusion imply that there is no linear association between X and Y? Explain.
Explain.
Question 4.
Show that b_0 as defined in (3.19) is an unbiased estimator of β_0.
Question 5.
Obtain the likelihood function for the sample observations Y 1 , . . . , Y n
given X 1 , . . . , X n if the normal model is assumed to be applicable.
Question 6.
The following data were obtained in the study of solution concentration.
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Xi 9 9 9 7 7 7 5 5 5 3 3 3 1 1 1
Y i 0.07 0.09 0.08 0.16 0.17 0.21 0.49 0.58 0.53 1.22 1.15 1.07 2.84 2.57 3.1
Summary calculational results are: ∑ X i  75, ∑ Y i  14. 33, ∑ X 2i  495
∑ Y 2i  29. 2117, ∑ X i Y i  32. 77.
1) Fit a linear regression function 2) Perform an F test to determine whether or not there
is lack of fit of a linear regression function. Use   0. 05.
Question 7.
The following data were obtained in a certain study.
i 1 2 3 4 5 6 7 8 9 10 11 12
Xi 1 1 1 2 2 2 2 4 4 4 5 5
Y i 6.2 5.8 6 9.7 9.8 10.3 10.2 17.8 17.9 18.3 21.9 22.1
Summary calculational results are: ∑ X i  33, ∑ Y i  156, ∑ X 2i  117
∑ Y 2i  2448. 5, ∑ X i Y i  534.
1) Fit a linear regression function 2) Perform an F test to determine whether or not there
is lack of fit of a linear regression function. Use   0. 05.
RESIDUALS
A residual e_i, as defined in (2.16), is
e_i = Y_i − Ŷ_i
As such, it may be regarded as the observed error, in distinction to the unknown true error ε_i in the regression model:
(4.2) ε_i = Y_i − E(Y_i)
Properties of residuals
The mean of the n residuals e_i is, by (2.17),
(4.3) ē = (1/n)Σ e_i = 0
where ē denotes the mean of the residuals.
The variance of the n residuals is defined as follows:
(4.4) Σ(e_i − ē)²/(n − 2) = Σ e_i²/(n − 2) = SSE/(n − 2) = MSE
If the model is appropriate, MSE is an unbiased estimator of the variance of the error terms, σ².
Standardized residuals.
Since the standard deviation of the error terms ε_i is σ, which is estimated by √MSE, we define the standardized residual as follows:
(4.5) (e_i − ē)/√MSE = e_i/√MSE

Departures from model to be studied by residuals

The following are six important departures from model (3.1), the simple linear regression model with normal errors:
1) The regression function is not linear
2) The error terms do not have the constant variance
3) The error terms are not independent.
4) The model fits all but one or few outlier observations
5) The error terms are not normally distributed
6) One or several important independent variables have been omitted from
the model.

Graphic analysis of residuals.

We take up now some informal ways in which graphs of residuals can be analyzed
to provide information on whether any of the six types of departures from the simple
linear regression model (3.1) are present.
To study this we plot residuals against explanatory variable X. The following prototype
residual plots are considered:

Figure (a) shows the prototype situation of a residual plot against X when the linear model is appropriate. The residuals should tend to fall within a horizontal band centered around 0, displaying no systematic tendency to be positive or negative.
Figure (b) shows a prototype situation of a departure from the linear regression model indicating the need for a curvilinear regression function. Here the residuals tend to vary in a systematic fashion between being positive and negative. A similar interpretation applies if the pattern is convex (like x²).
Figure (c) shows a prototype situation in which the error terms do not have constant variance (the error variance increases with X). A similar picture, but oriented differently, arises when the error variance decreases with X.
Whenever data are obtained in a time sequence it is a good idea to plot the residuals against time, even though time has not been explicitly incorporated as a variable in the model. The purpose is to see whether there is any correlation between the error terms over time. An example of a time-related effect is shown in figure (d). If the errors are time independent, we would expect the residuals to follow the pattern of figure (a).

Presence of outliers

Outliers are extreme observations. In a residual plot, they are points that lie far beyond the scatter of the remaining residuals, perhaps four or more standard deviations from zero. The following figure presents standardized residuals and contains one outlier, which is circled.

Outliers can create great difficulty. When we encounter one, our first suspicion is that the observation resulted from a mistake or other extraneous effect, and hence should be discarded. On the other hand, outliers may convey significant information. A safe rule is to discard an outlier only if there is direct evidence that it represents an error in recording, a miscalculation, a malfunction of equipment, or a similar type of circumstance.

Nonnormality of error terms.

The normality of the error terms can be studied informally by examining the residuals in a variety of graphic ways. One can construct a histogram of the residuals and see whether gross departures from normality are shown by it. Another possibility is to determine whether, say, about 68 percent of the standardized residuals e_i/√MSE fall between −1 and 1, or about 90 percent fall between −1.64 and 1.64. (If the sample size is small, the corresponding t values would be used.)
Another possibility is to prepare a normal probability plot of the residuals. This is a plot of the ordered residuals against their expected values under normality. To find the expected value of the ith smallest residual under normality we use the expression
(4.6) √MSE · z[(i − 0.375)/(n + 0.25)]
where z(A), as usual, denotes the A·100 percentile of the standard normal distribution (F_{N(0,1)}(z(A)) = A).
One method of assessing the linearity of the normal probability plot is to calculate the coefficient of correlation (3.73) relating the residuals e_i to their expected values under normality. A high value of the coefficient of correlation, say 0.9 or more, is indicative of normality.
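The sketch below (not part of the original notes) computes the expected values (4.6) for the ordered residuals of the example regression and the correlation used to assess linearity of the normal probability plot; plotting the two arrays against each other gives the plot itself.

```python
import numpy as np
from scipy import stats

# Residuals from the fitted regression of the example data set in these notes.
X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(X)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)
MSE = np.sum(e ** 2) / (n - 2)

# Expected values of the ordered residuals under normality, formula (4.6)
ranks = np.arange(1, n + 1)
expected = np.sqrt(MSE) * stats.norm.ppf((ranks - 0.375) / (n + 0.25))
ordered = np.sort(e)

# Correlation between ordered residuals and expected values; near 1 suggests normality
print(np.corrcoef(ordered, expected)[0, 1])
```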

F test for lack of fit.

The following test is used for determining whether or not a specified regression function adequately fits the data. The test assumes that the observations Y for a given X are 1) independent, 2) normally distributed, and 3) that the distributions of Y have the same variance σ². The lack of fit test requires repeat observations at one or more X levels. Repeated trials at the same level of the independent variable, of the type described, are called replications. The resulting observations are called replicates.

Decomposition of SSE

Pure error component. The basic idea for the first component of SSE rests on the fact that there are replications at some levels of X. Let us denote the ith observation at the jth level of X by Y_ij, where i = 1, 2, ..., n_j (n_j is the number of observations at the jth level of X) and j = 1, 2, ..., c (c is the number of observed levels of X). Let
Ȳ_j = (1/n_j) Σ_{i=1}^{n_j} Y_ij  -  the sample mean of Y at the jth level of X.
The sum of squared deviations of Y at the jth level of X is equal to
(4.8) Σ_{i=1}^{n_j} (Y_ij − Ȳ_j)²
Then we add these sums of squares over all levels of X and denote the total by SSPE:
(4.9) SSPE = Σ_{j=1}^{c} Σ_{i=1}^{n_j} (Y_ij − Ȳ_j)²
SSPE stands for pure error sum of squares. The degrees of freedom associated with SSPE are n − c (where n = Σ_{j=1}^{c} n_j is the total number of observations).
The pure error mean square MSPE is given by:
(4.11) MSPE = SSPE/(n − c)
Lack of fit component. The second component of SSE is:
(4.12) SSLF = SSE − SSPE
where SSLF denotes the lack of fit sum of squares. It can be shown that
(4.13) SSLF = Σ_{j=1}^{c} n_j (Ȳ_j − Ŷ_j)²
where Ŷ_j denotes the fitted value when X = X_j. There are c − 2 degrees of freedom associated with SSLF. Thus, the lack of fit mean square is
(4.14) MSLF = SSLF/(c − 2)

F test.

Test statistic:
(4.15) F* = MSLF/MSPE
The hypotheses:
(4.17) H_0: E(Y) = β_0 + β_1X
       H_a: E(Y) ≠ β_0 + β_1X
The decision rule:
(4.18) If F* ≤ F(1 − α; c − 2, n − c), conclude H_0
       If F* > F(1 − α; c − 2, n − c), conclude H_a

Example: The following data were obtained in a study of solution concentration.


i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Xi 9 9 9 7 7 7 5 5 5 3 3 3 1 1 1
Y i 0.07 0.09 0.08 0.16 0.17 0.21 0.49 0.58 0.53 1.22 1.15 1.07 2.84 2.57 3.1
Summary calculational results are: ∑ X i  75, ∑ Y i  14. 33, ∑ X 2i  495
∑ Y 2i  29. 2117, ∑ X i Y i  32. 77.
1) Fit a linear regression function
2) Perform an F test to determine whether or not there is lack of fit
of a linear regression function. Use   0. 05.
Solution:1) We have
 ∑ Xi∑ Yi
∑ XiYi− n 32.77− 7514.33
b1   15
 − 0. 324
 ∑ Xi2 495−
75 2
∑ X 2i − n
15

and
b o  Y − b 1 X  1n ∑ Y i − b 1 1n ∑ X i  151  14. 33  0. 324  151  75  2. 575 3
Therefore

Y i  2. 575 3 − 0. 324  X i
2) F test for lack of fit.
We have c  5 levels for X and 3 replicates for each level (hence each n j  3
and n  15.
Hence
nj
Yj  1
nj ∑ Y i,j - the sample mean at j − th level of X.
i1
Y1  1
3
3. 1  2. 57  2. 84  2. 836 7 at level X  1
Y2  1
3Y
1. 07  1. 15  1. 22  1. 146 7 at level X  3
Y3  1
3
0. 53  0. 58  0. 49  0. 533 33 at level X  5

27
Y4  1
3
0. 21  0. 17  0. 16  0. 18 at level X  7
Y4  1
3
0. 08  0. 09  0. 07  0. 0 8 at level X  9
c nj
SSPE  ∑ ∑Y i,j − Y j  2  2. 84 − 2. 8367 2  2. 57 − 2. 8367 2 
j1 i1
 3. 1 − 2. 8367 2  1. 22 − 1. 1467 2  1. 15 − 1. 1467 2 
 1. 07 − 1. 1467 2  0. 49 − 0. 53333 2  0. 58 − 0. 5333 2 
 0. 53 − 0. 5333 2  0. 16 − 0. 18 2  0. 17 − 0. 18 2 
 0. 21 − 0. 18 2  0. 07 − 0. 08 2  0. 09 − 0. 08 2  0. 08 − 0. 08 2  0. 157 4
MSPE  SSPEn−c  15−5  0. 0 157 4
0.157 4
n 
SSE  ∑Y i − Y i  2  ∑ Y 2i − b o ∑ Y i − b 1 ∑ X i Y i 
i1
 29. 2117 − 2. 575 30  14. 33  0. 324  32. 77  2. 925 1
SSLF  SSE − SSPE  2. 925 1 − 0. 1574  2. 767 7
MSLF  SSLF c−2
 2.7677
5−2
 0. 922 57
The hypothesis
H o: EY   o   1 X
H a: EY ≠  o   1 X
Test statistics
F ∗  MSPE
MSLF
 0. 922 57
0.0 157 4
 58. 613
The decision rule
If F ∗  F1 − ; c − 2, n − c conclude H o
If F ∗  F1 − ; c − 2, n − c conclude H a
F1 − ; c − 2; n − c  F0. 95; 3; 10  3. 71
Since F ∗  58. 613  F1 − ; c − 2; n − c  3. 71 we conclude H a .
PROBLEMS
Question 1.
Distinguish between:
1) residual and standardized residual
2) E i   0 and e  0
3) error term and residual.
Question 2.
The fitted values and residuals are
i 1 2 3 4 5 6 7 8 9 10

Ŷ_i 2.92 2.33 2.25 1.58 2.08 3.51 3.34 2.67 2.25 1.91
e i 0.18 -0.03 0.75 0.32 0.42 0.19 0.06 -0.07 0.55 -0.31
i 11 12 13 14 15 16 17 18 19 20

Ŷ_i 2.42 2.84 2.50 3.59 2.16 1.91 2.50 3.26 1.74 2.25
e i -0.42 0.06 -0.20 -0.39 -0.36 -0.51 -0.50 0.54 0.46 -0.75

a) Plot the residuals e_i against the fitted values Ŷ_i. What departures from
regression model (3.1) can be studied from this plot? What are your findings?
b) Prepare a normal probability plot. Also calculate the coefficient of correlation
between the ordered residuals and their expected values. What is your conclusion?
Question 3.
The following data were obtained in the study of solution concentration.

i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Xi 8 8 8 6 6 6 4 4 4 2 2 2 0 0 0
Y i 0.65 0.87 0.79 0.15 0.18 0.22 0.51 0.57 0.55 1.23 1.17 1.09 2.91 2.61 3.11
a) Fit a linear regression function
b) Perform an F test to determine whether or not there is lack of fit
of a linear regression function. Use α = 0.025.
c) Does the test in part b) indicate what regression function is appropriate when it
leads to conclusion that lack of fit of a linear regression exists? Explain.
Question 4.
The following data were obtained in a study of the relation between diastolic
blood pressure (Y) and age (X) for boys 5 to 13 years old
i 1 2 3 4 5 6 7 8
Xi 5 8 11 7 13 12 12 6
Y i 63 67 74 64 75 69 90 60
a) Assuming regression model (3.1) is appropriate, obtain the estimated
regression function and plot the residuals e i against X i . What does your
residual plot show?
b) Omit observation 7 from the data and obtain the estimated regression
line based on the remaining seven observations. Compare this estimated
regression function to that obtained in part a). What can you conclude
about the effect of the observation 7?
c) Using your fitted regression function in part b) obtain a 99 percent
confidence interval for a new observation Y at X = 12. Does observation
Y 7 fall outside this prediction interval? What is the significance of this?
Question 5.
The following data were obtained in a certain study.
i 1 2 3 4 5 6 7 8 9 10 11 12
Xi 1 1 1 2 2 3 3 3 3 5 5 5
Y i 4.8 4.9 5.1 7.9 8.3 10.9 10.8 11.3 11.1 16.5 17.3 17.1
Summary calculational results are: ∑ X i  34, ∑ Y i  126, ∑ X 2i  122
∑ Y 2i  1554. 66, ∑ X i Y i  434.
1) Fit a linear regression function
2) Perform an F test to determine whether or not there is lack of fit
of a linear regression function. Use   0. 05.

WEIGHTED LEAST SQUARES

General approach
The least squares criterion (2.8):
Q = Σ_{i=1}^{n} (Y_i − β_0 − β_1X_i)²
weights each observation equally. There are times when some observations should receive greater weight and others smaller weight. The weighted least squares criterion for simple linear regression is
(5.31) Q_w = Σ_{i=1}^{n} w_i(Y_i − β_0 − β_1X_i)²

where w_i is the given weight of the ith observation. Minimizing Q_w with respect to β_0 and β_1 leads to the normal equations
(5.32) Σ w_iY_i = b_0 Σ w_i + b_1 Σ w_iX_i
       Σ w_iX_iY_i = b_0 Σ w_iX_i + b_1 Σ w_iX_i²
Solving, we get:
(5.33a) b_1 = [Σ w_iX_iY_i − (Σ w_iX_i)(Σ w_iY_i)/Σ w_i] / [Σ w_iX_i² − (Σ w_iX_i)²/Σ w_i]
(5.33b) b_0 = (Σ w_iY_i − b_1 Σ w_iX_i)/Σ w_i
Note that if all the weights are equal, so that w_i is the same for all i, the normal equations (5.32) for weighted least squares reduce to the ones for unweighted least squares, and the estimates reduce to (2.10).
A very popular choice of weights is
(5.35) w_i = 1/X_i².
This particular choice of weights corresponds to the case where the error term variance increases with X (σ_i² = kX_i²), which is frequently encountered in business and economics.
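The closed-form solutions (5.33a)-(5.33b) are easy to implement directly. A sketch (not part of the original notes), shown here on the example data set of these notes with equal weights and with the popular choice (5.35):

```python
import numpy as np

def wls_fit(X, Y, w):
    """Weighted least squares estimates (5.33a)-(5.33b) for simple regression."""
    sw = np.sum(w)
    b1 = (np.sum(w * X * Y) - np.sum(w * X) * np.sum(w * Y) / sw) / \
         (np.sum(w * X ** 2) - np.sum(w * X) ** 2 / sw)
    b0 = (np.sum(w * Y) - b1 * np.sum(w * X)) / sw
    return b0, b1

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

print(wls_fit(X, Y, np.ones_like(X)))   # equal weights: same as ordinary least squares
print(wls_fit(X, Y, 1.0 / X ** 2))      # the popular choice (5.35), w_i = 1/X_i^2
```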
PROBLEMS
Question 1.
Data from a study of the relation between the size of a bid in million rands (X) and
the cost to the firm of preparing the bid in thousands rands (Y) for 12 recent bids
are presented in table below:
i 1 2 3 4 5 6 7 8 9 10 11 12
X i 2.13 1.21 11.0 6.0 5.6 6.91 2.97 3.35 10.39 1.1 4.36 8.0
Y i 15.5 11.1 62.6 35.4 24.9 28.1 15.0 23.2 42.0 10 20 47.5
The scatter plot strongly suggests that the error variance increases with X.
Fit the weighted least squares regression line using weights w_i = 1/X_i².
Question 2.
Data from a study of computer-assisted learning by 12 students, showing the total
number of responses in completing a lesson (X) and the cost of computer time
(Y, in cents), follow
i 1 2 3 4 5 6 7 8 9 10 11 12
X i 16 14 22 10 14 17 10 13 19 12 18 11
Y i 77 70 85 50 62 70 52 63 88 57 81 54
Fit the weighted least squares regression line using weights w_i = 1/X_i².

MATRICES.
The matrix A has elements denoted by a_{i,k}, where k refers to the column and i to the row:
A = (a_{i,k})
Example:

Column 1 Column 2 Column 3

Row 1 6 5 1
Row 2 13 7 2
Row 3 -1 5 2
A matrix with r rows and k columns will be referred to as an r×k matrix.
The transpose of a matrix A, denoted by A′, is the matrix that is obtained by interchanging the corresponding columns and rows of the matrix A.
For example, let A be the 3×2 matrix given by
A =
[ 2  5 ]
[ 7 10 ]
[ 3  4 ]
that is, a_{1,1} = 2, a_{1,2} = 5, a_{2,1} = 7, a_{2,2} = 10, a_{3,1} = 3, a_{3,2} = 4. Then
A′ =
[ 2  7  3 ]
[ 5 10  4 ]
that is, writing A′ = (b_{i,j}),
b_{1,1} = a_{1,1}, b_{1,2} = a_{2,1}, b_{1,3} = a_{3,1}, b_{2,1} = a_{1,2}, b_{2,2} = a_{2,2}, b_{2,3} = a_{3,2}
Two matrices A and B are said to be equal if they have the same dimension and all
corresponding elements are equal.
Let us notice that if Y is n1 matrix then
Y1
Y2
Y and Y ′  Y1 Y2 . . . Yn
:
Yn
Adding or subtracting two matrices requires that they have the same dimension. The sum of two matrices is another matrix whose elements each consist of the sum of the corresponding elements of the two matrices, e.g.:
[ 1 4 ]   [ 2 3 ]   [ 1+2  4+3 ]   [ 3 7 ]
[ 2 5 ] + [ 1 4 ] = [ 2+1  5+4 ] = [ 3 9 ]
[ 3 6 ]   [ 1 3 ]   [ 3+1  6+3 ]   [ 4 9 ]
The difference of two matrices is another matrix whose elements each consist of the difference of the corresponding elements of the two matrices, e.g.:
[ 1 4 ]   [ 2 3 ]   [ 1−2  4−3 ]   [ −1 1 ]
[ 2 5 ] − [ 1 4 ] = [ 2−1  5−4 ] = [  1 1 ]
[ 3 6 ]   [ 1 3 ]   [ 3−1  6−3 ]   [  2 3 ]
Multiplication of a matrix A by a matrix B. The product AB is only defined when the number of columns in A equals the number of rows in B. In general, if A has dimension r×c and B has dimension c×s, then
A =                                B =
[ a_{1,1} a_{1,2} ... a_{1,c} ]    [ b_{1,1} b_{1,2} ... b_{1,s} ]
[ a_{2,1} a_{2,2} ... a_{2,c} ]    [ b_{2,1} b_{2,2} ... b_{2,s} ]
[   :       :           :    ]    [   :       :           :    ]
[ a_{r,1} a_{r,2} ... a_{r,c} ]    [ b_{c,1} b_{c,2} ... b_{c,s} ]
and AB is the r×s matrix with elements
(AB)_{i,j} = Σ_{k=1}^{c} a_{i,k} b_{k,j}
(we multiply the elements of the ith row of A by the corresponding elements of the jth column of B and put the sum of the products as the (i, j)th element of the resulting matrix).
The determinant of A will be denoted by either |A| or det(A).
det [ a b ]
    [ c d ]  =  ad − cb
det [ a b c ]
    [ d e f ]  =  aei − afh − bdi + cdh + bfg − ceg
    [ g h i ]
A formula for det(A) in the case of higher dimensions is given in Theorem 19.
Special types of matrices
Symmetric matrix: if A = A′ then A is said to be a symmetric matrix.
Diagonal matrix: a diagonal matrix is a square matrix whose off-diagonal elements are all zeros, e.g.:
C =
[ 4 0  0 0 ]
[ 0 1  0 0 ]
[ 0 0 −1 0 ]
[ 0 0  0 7 ]
so if C = (c_{i,j}) then c_{i,j} = 0 for i ≠ j.
Identity matrix: the identity matrix, denoted by I, is a diagonal matrix whose diagonal elements are all equal to 1 and all other elements equal to zero, e.g.
I_{4×4} =
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 1 ]
Inverse matrix; Let A be a square matrix. If there exists a matrix B such that AB  I,
then B is called the inverse of A, denoted by A −1 . Also if AB  I, then it can be shown
that BA  I. When there exists a matrix B such that AB  BA  I, the matrix A is
said to be nonsingular; in the contrary case, A is said to be singular.
Theorem 1
If a matrix has an inverse, the inverse is unique.
Theorem 2
If A has an inverse, then A −1 has an inverse and (A −1  −1  A.
Theorem 3
If A and B are nonsingular matrices, then AB has an inverse and (AB) −1  B −1 A −1 .
This can be extended to any finite number of matrices.
Theorem 4
If A is a nonsingular matrix and k is a nonzero constant, then
(kA) −1  1k A −1

Theorem 5
If A and B are mn matrices, and a, b are scalars, then
aA ′  Aa ′  A ′ a  aA ′

32
and
aA bB ′  aA ′  bB ′

Theorem 6
If A is any matrix, then $(A')' = A$.
Theorem 7
Let A and B be any matrices such that AB is defined; then $(AB)' = B'A'$.
This can be extended to any finite number of matrices.
Theorem 8
If D is a diagonal matrix, then $D = D'$.
Theorem 9
If A is any matrix, then AA ′ and A ′ A are symmetric.
Theorem 10
If A is a nonsingular matrix, then A′ and $A^{-1}$ are nonsingular and $(A^{-1})' = (A')^{-1}$.
The matrices A, B, C discussed in the remaining theorems of this section are assumed to be of size n×n.
Theorem 11
detA  detA ′ 
Corollary 12
Any theorem about detA that is true for rows (columns) of a matrix A
is true for columns (rows).
Theorem 13
If two rows (columns) of a matrix are interchanged, the determinant for the matrix
changes sign .
Theorem 14
If each element of the i-th row of an n×n matrix A contains a common factor k, then we may write det(A) = k·det(B), where the rows of B are the same as the rows of A except that the factor k has been removed from each element of the i-th row.
Theorem 15
If each element of a row of the matrix A is zero, then det(A) = 0.
Theorem 16
If two rows of a matrix A are identical, then det(A) = 0.
Theorem 17
The determinant of a matrix is not changed if the elements of the i − th row are
multiplied by a scalar k and the results are added to the corresponding elements of the
h − th row, h ≠ i.
Theorem 18
If A and B are nn matrices, then
detAB  detAdetB
this result can be extended to any finite number of matrices.
Let A be any mn matrix. From this matrix, if one deletes any set r  m rows and
any set s  n columns, the matrix of the remaining elements is a submatrix of A.
If A is an nn matrix and if the i − th row and j − th column are deleted, the determinant
of the remaining matrix, denoted by m ij , is called the minor of a ij . We call A ij the
cofactor of the element a ij , where
A ij  −1 ij m i,j
Theorem 19
Let $A_{ij}$ be the cofactor of $a_{ij}$; then
$$\det(A) = \sum_{j=1}^{n} a_{ij}A_{ij}$$
for any i.
Theorem 20
Let A be nonsingular (this is equivalent to det(A) ≠ 0). Then
$$A^{-1} = \frac{1}{\det(A)}\,[A_{ij}]'$$
where $[A_{ij}]$ is the matrix of cofactors.
Example:
Let
$$A = \begin{pmatrix} 1 & 0 & 1 \\ 2 & 1 & 2 \\ 0 & 4 & 6 \end{pmatrix}$$
Then det(A) = 6.
To find the inverse we have to calculate the cofactors:
$$A_{1,1} = (-1)^{1+1}\det\begin{pmatrix} 1 & 2 \\ 4 & 6 \end{pmatrix} = -2, \quad A_{1,2} = (-1)^{1+2}\det\begin{pmatrix} 2 & 2 \\ 0 & 6 \end{pmatrix} = -12, \quad A_{1,3} = (-1)^{1+3}\det\begin{pmatrix} 2 & 1 \\ 0 & 4 \end{pmatrix} = 8$$
$$A_{2,1} = (-1)^{2+1}\det\begin{pmatrix} 0 & 1 \\ 4 & 6 \end{pmatrix} = 4, \quad A_{2,2} = (-1)^{2+2}\det\begin{pmatrix} 1 & 1 \\ 0 & 6 \end{pmatrix} = 6, \quad A_{2,3} = (-1)^{2+3}\det\begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} = -4$$
$$A_{3,1} = (-1)^{3+1}\det\begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix} = -1, \quad A_{3,2} = (-1)^{3+2}\det\begin{pmatrix} 1 & 1 \\ 2 & 2 \end{pmatrix} = 0, \quad A_{3,3} = (-1)^{3+3}\det\begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix} = 1$$
Hence
$$[A_{i,j}] = \begin{pmatrix} -2 & -12 & 8 \\ 4 & 6 & -4 \\ -1 & 0 & 1 \end{pmatrix} \qquad\text{and}\qquad [A_{i,j}]' = \begin{pmatrix} -2 & 4 & -1 \\ -12 & 6 & 0 \\ 8 & -4 & 1 \end{pmatrix}$$
Therefore
$$A^{-1} = \frac{1}{6}[A_{i,j}]' = \frac{1}{6}\begin{pmatrix} -2 & 4 & -1 \\ -12 & 6 & 0 \\ 8 & -4 & 1 \end{pmatrix} = \begin{pmatrix} -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{6} \\ -2 & 1 & 0 \\ \tfrac{4}{3} & -\tfrac{2}{3} & \tfrac{1}{6} \end{pmatrix}$$
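As a quick numerical check of the cofactor calculation, the following NumPy fragment (illustrative only, not part of the original notes) recovers the same inverse:

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 2.0],
              [0.0, 4.0, 6.0]])
A_inv = np.linalg.inv(A)                    # reproduces the matrix obtained from the cofactors
print(np.round(A_inv, 4))
print(np.allclose(A @ A_inv, np.eye(3)))    # True: A A^{-1} = I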

RANK OF MATRICES
An nm matrix A is said to be of the rank r if the size of the largest nonsingular
square submatrix of A is r. This definition is rather difficult to apply directly to find out
the rank of any given matrix.
Each of the following operations is called an elementary transformation of a matrix A;
1) The interchange of two rows (or two columns) of A.
2) The multiplication of the elements of a row (or a column) of A by the same
nonzero scalar k.
3) The addition of the elements of a row (or a column) of A, after they have been
multiplied by the scalar k, to the corresponding elements of another row (or another
column) of A.
We define the inverse of an elementary transformation as the transformation that restores
the resulting matrix to the original form of A.
Theorem 21
The inverse of an elementary transformation of a matrix is an elementary transformation
of the same type.
Theorem 22
The size and rank of a matrix are not altered by an elementary transformation of a matrix.

Two matrices that have the same size and rank are said to be equivalent

Theorem 23
Every elementary matrix has an inverse of the same type.
Theorem 24
Any nonsingular matrix can be written as the product of elementary matrices.
Theorem 25
The size and rank of the matrix A are not altered by premultiplying or postmultiplying
A by elementary matrices.
Theorem 26
If matrices A and B are nonsingular, then for any matrix C the matrices AC, CB, and ACB all have the same rank as C (provided that all the multiplications are defined).
Theorem 27
If A is an mn matrix of rank r, then there exist nonsingular matrices
P and Q such that PAQ is equal to
I I 0
I, I 0 , ,
0 0 0
depending on n,m,r.
Theorem 28
Two matrices A and B of the same size are equivalent if and only if B can be obtained from A by premultiplying and postmultiplying A by a finite number of elementary matrices.
Theorem 29
The rank of the product of matrices A and B cannot exceed the rank of either A or B.
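In practice the rank can be obtained numerically rather than by elementary transformations done by hand. A minimal NumPy sketch (illustrative only, arbitrary matrix):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],    # second row is twice the first, so the rank drops
              [0.0, 1.0, 1.0]])
print(np.linalg.matrix_rank(A))   # 2: the largest nonsingular square submatrix is 2x2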
PROBLEMS
Question 1.
a) Find the determinant of
$$A = \begin{pmatrix} 1 & -1 & 0 \\ 2 & 1 & 1 \\ 1 & 0 & 3 \end{pmatrix}$$
b) Find A −1
Question 2
Suppose that you want to perform the following operations on a 3×3 matrix A:
1) Interchange the first and third row
2) Interchange the first and third columns
3) Multiply the first row by -2 and add the result to the third row
4) Multiply the first column by -2 and add the result to the third column
5) Multiply the second row by -2 and add the result to the third row
6) Multiply the second column by -2 and add the result to the third column
7) Multiply the second row by 1/2
8) Multiply the third column by 7
If
$$A = \begin{pmatrix} 0 & 4 & 2 \\ 4 & 2 & 0 \\ 2 & 0 & 1 \end{pmatrix}$$
find the resulting matrix after performing the eight transformations.
Question 3
For the matrices A and X
$$A = \begin{pmatrix} 1 & 2 & -1 & 1 \\ 2 & 1 & 0 & 3 \\ 0 & -3 & 2 & 1 \\ -3 & 0 & -1 & -5 \end{pmatrix} \qquad X = \begin{pmatrix} 3 \\ 4 \\ 1 \\ 1 \end{pmatrix}$$
find AX and X′A.

REGRESSION EXAMPLE.

In regression analysis, one basic matrix is the vector Y (n×1), consisting of the n observations on the dependent variable:
$$(6.4)\qquad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$
Note that the transpose Y′ is the row vector
$$(6.5)\qquad Y' = \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_n \end{pmatrix}$$
Another basic matrix in regression analysis is the X (n×2) matrix, which is defined as follows for simple regression analysis:
$$(6.6)\qquad X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}$$
The matrix X consists of a column of 1's and a column containing the n values of the independent variable X. Note that the transpose of X is
$$(6.7)\qquad X' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix}$$
The regression model
$$Y_i = E(Y_i) + \varepsilon_i \qquad i = 1, 2, \ldots, n$$
can be written compactly in matrix notation. First, let us define the (n×1) vector of mean responses
$$(6.9)\qquad E(Y) = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_n) \end{pmatrix}$$
and the vector of error terms
$$(6.10)\qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
Using Y in (6.4) we can write
$$Y = E(Y) + \varepsilon$$
because
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_n) \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \begin{pmatrix} E(Y_1)+\varepsilon_1 \\ E(Y_2)+\varepsilon_2 \\ \vdots \\ E(Y_n)+\varepsilon_n \end{pmatrix}$$
Thus, the observations vector Y equals the sum of two vectors, one containing the expected values and the other containing the error terms.
Let us define the vector β of the regression coefficients as follows:
$$(6.13)\qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$
Then the product Xβ, where X is defined in (6.6), is
$$(6.14)\qquad X\beta = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \beta_0+\beta_1X_1 \\ \beta_0+\beta_1X_2 \\ \vdots \\ \beta_0+\beta_1X_n \end{pmatrix}$$
Since $\beta_0 + \beta_1X_i = E(Y_i)$, we see that Xβ is the vector of expected values $E(Y_i)$ for the simple linear regression model, i.e. $E(Y) = X\beta$, where E(Y) is defined in (6.9).
Another product frequently needed is Y′Y:
$$(6.15)\qquad Y'Y = \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_n \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \sum Y_i^2$$
Note that Y′Y is a 1×1 matrix, or a scalar. In this compact way we can write the sum of squares $Y'Y = \sum Y_i^2$.
We will also need X′X:
$$(6.16)\qquad X'X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix}\begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix} = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix}$$
and X′Y:
$$(6.17)\qquad X'Y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \sum Y_i \\ \sum X_iY_i \end{pmatrix}$$
The principal inverse matrix in regression analysis is the inverse of X′X. Its determinant is
$$\det\begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix} = n\sum X_i^2 - \left(\sum X_i\right)^2 = n\left(\sum X_i^2 - \frac{(\sum X_i)^2}{n}\right) = n\sum (X_i-\bar{X})^2$$
Hence
$$(6.26)\qquad (X'X)^{-1} = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix}^{-1} = \begin{pmatrix} \dfrac{\sum X_i^2}{n\sum (X_i-\bar{X})^2} & \dfrac{-\sum X_i}{n\sum (X_i-\bar{X})^2} \\[2mm] \dfrac{-\sum X_i}{n\sum (X_i-\bar{X})^2} & \dfrac{n}{n\sum (X_i-\bar{X})^2} \end{pmatrix}$$
Since $\sum X_i = n\bar{X}$ we can simplify (6.26):
$$(6.27)\qquad (X'X)^{-1} = \begin{pmatrix} \dfrac{\sum X_i^2}{n\sum (X_i-\bar{X})^2} & \dfrac{-\bar{X}}{\sum (X_i-\bar{X})^2} \\[2mm] \dfrac{-\bar{X}}{\sum (X_i-\bar{X})^2} & \dfrac{1}{\sum (X_i-\bar{X})^2} \end{pmatrix}$$

RANDOM VECTORS AND MATRICES

A random vector or a random matrix contains elements which are random variables.
Expectation of a random vector or matrix.
Suppose we have the following observation vector:
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}$$
The expected value of Y is a vector, denoted by E(Y), defined as follows:
$$E(Y) = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ E(Y_3) \end{pmatrix}$$
In general we use the notation
$$(6.41)\qquad E(Y) = [E(Y_i)] \qquad i = 1, 2, \ldots, n$$
and for a random matrix Y of dimension n×p the expectation is
$$(6.42)\qquad E(Y) = [E(Y_{i,k})] \qquad i = 1, 2, \ldots, n;\ k = 1, 2, \ldots, p$$

Variance-covariance matrix of a random vector.

Consider the random vector
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}$$
Each random variable (coordinate) has a variance $\sigma^2(Y_i)$, and any two of them have a covariance $\sigma(Y_i, Y_j)$. We can assemble these in a matrix called the variance-covariance matrix of Y, denoted by $\sigma^2(Y)$:
$$(6.43)\qquad \sigma^2(Y) = \begin{pmatrix} \sigma^2(Y_1) & \sigma(Y_1,Y_2) & \sigma(Y_1,Y_3) \\ \sigma(Y_2,Y_1) & \sigma^2(Y_2) & \sigma(Y_2,Y_3) \\ \sigma(Y_3,Y_1) & \sigma(Y_3,Y_2) & \sigma^2(Y_3) \end{pmatrix}$$
Let us notice that
$$\sigma^2(Y) = E\left[\begin{pmatrix} Y_1-E(Y_1) \\ Y_2-E(Y_2) \\ Y_3-E(Y_3) \end{pmatrix}\begin{pmatrix} Y_1-E(Y_1) & Y_2-E(Y_2) & Y_3-E(Y_3) \end{pmatrix}\right]$$
It follows readily that
$$(6.44)\qquad \sigma^2(Y) = E\left[(Y-E(Y))(Y-E(Y))'\right]$$
In general, if Y is the n×1 vector $(Y_1, Y_2, \ldots, Y_n)'$, we can write
$$(6.45)\qquad \sigma^2(Y) = \begin{pmatrix} \sigma^2(Y_1) & \sigma(Y_1,Y_2) & \cdots & \sigma(Y_1,Y_n) \\ \sigma(Y_2,Y_1) & \sigma^2(Y_2) & \cdots & \sigma(Y_2,Y_n) \\ \vdots & \vdots & & \vdots \\ \sigma(Y_n,Y_1) & \sigma(Y_n,Y_2) & \cdots & \sigma^2(Y_n) \end{pmatrix}$$
Note again that $\sigma^2(Y)$ is a symmetric matrix.
Basic theorems

Let W be a random vector obtained by premultiplying the random vector Y by a constant matrix A:
$$(6.46)\qquad W = AY$$
Then
$$(6.47)\qquad E(A) = A$$
$$(6.48)\qquad E(W) = E(AY) = A\,E(Y)$$
$$(6.49)\qquad \sigma^2(W) = \sigma^2(AY) = A\,\sigma^2(Y)\,A'$$
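Results (6.48) and (6.49) can be illustrated by simulation. The following sketch is not part of the original notes and uses arbitrary assumed values for E(Y) and σ²(Y); it compares the simulated mean and covariance of W = AY with A E(Y) and A σ²(Y) A′.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])                       # assumed E(Y)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])      # assumed sigma^2(Y)
A = np.array([[1.0, -1.0], [1.0, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of Y
W = Y @ A.T                                            # each row is A y

print(W.mean(axis=0), A @ mu)                  # close to A E(Y)            (6.48)
print(np.cov(W.T), A @ Sigma @ A.T)            # close to A sigma^2(Y) A'   (6.49)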

PROBLEMS
Question 1.
For the matrices below obtain AB,A-B,AC,AB ′ ,B ′ A
1 4 1 3
3 8 1
A 2 6 B 1 4 C
5 4 0
3 8 2 5
Question 2.
For the matrices below obtain AC,A-C,B ′ A,AC ′
2 1 6 3 8
3 5 9 8 6
A B C
5 7 3 5 1
4 8 1 2 4
State the dimension of each resulting matrix.
Question 3.
Find the inverse of each of the following matrices:
$$A = \begin{pmatrix} 2 & 4 \\ 3 & 1 \end{pmatrix} \qquad B = \begin{pmatrix} 4 & 3 & 2 \\ 6 & 5 & 10 \\ 10 & 1 & 6 \end{pmatrix}$$
Question 4.
Consider the following functions of the random variables
Y 1 , Y 2 and Y 3 :
W1  Y1  Y2  Y3
W2  Y1 − Y2
W3  Y1 − Y2 − Y3
a) State above in matrix notation
W1
b) Find the expectation of the random vector W  W2
W3
c) Find the variance-covariance matrix of W.

Example:
a) Let us return to the case based on n = 3 observations. Suppose that the error terms have constant variance, $\sigma^2(\varepsilon_i) = \sigma^2$, and are uncorrelated, so $\sigma(\varepsilon_i, \varepsilon_j) = 0$ for i ≠ j. We can then write the variance-covariance matrix of the random vector ε as follows (using the fact that $E(\varepsilon_i) = 0$, i = 1, 2, 3):
$$\sigma^2(\varepsilon) = E\left[\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{pmatrix}\begin{pmatrix} \varepsilon_1 & \varepsilon_2 & \varepsilon_3 \end{pmatrix}\right] = \begin{pmatrix} E(\varepsilon_1\varepsilon_1) & E(\varepsilon_1\varepsilon_2) & E(\varepsilon_1\varepsilon_3) \\ E(\varepsilon_2\varepsilon_1) & E(\varepsilon_2\varepsilon_2) & E(\varepsilon_2\varepsilon_3) \\ E(\varepsilon_3\varepsilon_1) & E(\varepsilon_3\varepsilon_2) & E(\varepsilon_3\varepsilon_3) \end{pmatrix} = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix} = \sigma^2 I$$
b) Let us consider W = AY where
$$W = \begin{pmatrix} W_1 \\ W_2 \end{pmatrix} \qquad A = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \qquad Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$$
Then
$$W = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} Y_1 - Y_2 \\ Y_1 + Y_2 \end{pmatrix} \qquad E(W) = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} E(Y_1) \\ E(Y_2) \end{pmatrix} = \begin{pmatrix} E(Y_1) - E(Y_2) \\ E(Y_1) + E(Y_2) \end{pmatrix}$$
and
$$\sigma^2(W) = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} \sigma^2(Y_1) & \sigma(Y_1,Y_2) \\ \sigma(Y_2,Y_1) & \sigma^2(Y_2) \end{pmatrix}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} \sigma^2(Y_1)+\sigma^2(Y_2)-2\sigma(Y_1,Y_2) & \sigma^2(Y_1)-\sigma^2(Y_2) \\ \sigma^2(Y_1)-\sigma^2(Y_2) & \sigma^2(Y_1)+\sigma^2(Y_2)+2\sigma(Y_1,Y_2) \end{pmatrix}$$

SIMPLE LINEAR REGRESSION MODEL IN MATRIX TERMS

We are going to rewrite the regression model (3.1):
$$(6.50)\qquad Y_i = \beta_0 + \beta_1X_i + \varepsilon_i \qquad i = 1, 2, \ldots, n$$
This implies
$$(6.51)\qquad \begin{aligned} Y_1 &= \beta_0 + \beta_1X_1 + \varepsilon_1 \\ Y_2 &= \beta_0 + \beta_1X_2 + \varepsilon_2 \\ &\ \ \vdots \\ Y_n &= \beta_0 + \beta_1X_n + \varepsilon_n \end{aligned}$$
and in matrix terms
$$(6.52)\qquad \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
Now we can write
$$(6.53)\qquad \underset{n\times 1}{Y} = \underset{n\times 2}{X}\ \underset{2\times 1}{\beta} + \underset{n\times 1}{\varepsilon}$$
since
$$X\beta + \varepsilon = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \begin{pmatrix} \beta_0+\beta_1X_1+\varepsilon_1 \\ \beta_0+\beta_1X_2+\varepsilon_2 \\ \vdots \\ \beta_0+\beta_1X_n+\varepsilon_n \end{pmatrix}$$
With respect to the error terms, model (3.1) assumes that $E(\varepsilon_i) = 0$ and $\sigma^2(\varepsilon_i) = \sigma^2$, i = 1, 2, ..., n, and that the $\varepsilon_i$ are independent normal random variables. The condition $E(\varepsilon_i) = 0$ in matrix terms is:
$$(6.54)\qquad E(\varepsilon) = \begin{pmatrix} E(\varepsilon_1) \\ E(\varepsilon_2) \\ \vdots \\ E(\varepsilon_n) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = 0$$
The condition that the error terms have constant variance and that all $\sigma(\varepsilon_i, \varepsilon_j) = 0$ for i ≠ j (by independence) is expressed in matrix terms through the variance-covariance matrix:
$$(6.55)\qquad \sigma^2(\varepsilon) = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2\begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = \sigma^2 I$$
Thus the normal error model (3.1) in matrix terms is:
$$(6.56)\qquad \underset{n\times 1}{Y} = \underset{n\times 2}{X}\ \underset{2\times 1}{\beta} + \underset{n\times 1}{\varepsilon}$$
where:
β is the vector of parameters,
X is a matrix of known constants, namely the values of the independent variable,
ε is a vector of independent normal random variables with E(ε) = 0 and $\sigma^2(\varepsilon) = \sigma^2 I$.

LEAST SQUARES ESTIMATION OF REGRESSION PARAMETERS

The normal equations (2.9):
$$(6.57)\qquad \begin{aligned} nb_0 + b_1\sum X_i &= \sum Y_i \\ b_0\sum X_i + b_1\sum X_i^2 &= \sum X_iY_i \end{aligned}$$
in matrix terms are:
$$(6.58)\qquad X'Xb = X'Y$$
where b is the vector of the least squares regression coefficients:
$$(6.58a)\qquad b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$$
One can verify this, since from (6.16) and (6.17)
$$X'X = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix} \qquad X'Y = \begin{pmatrix} \sum Y_i \\ \sum X_iY_i \end{pmatrix}$$
so that (6.58) becomes
$$\begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix}\begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} \sum Y_i \\ \sum X_iY_i \end{pmatrix} \qquad\text{or}\qquad \begin{pmatrix} nb_0 + b_1\sum X_i \\ b_0\sum X_i + b_1\sum X_i^2 \end{pmatrix} = \begin{pmatrix} \sum Y_i \\ \sum X_iY_i \end{pmatrix}$$

Estimated regression coefficients.


To obtain the estimated regression coefficients from the normal equations
$$X'Xb = X'Y$$
by matrix methods, we premultiply both sides by the inverse of X′X (we assume it exists):
$$(X'X)^{-1}X'Xb = (X'X)^{-1}X'Y$$
so that, since $(X'X)^{-1}X'X = I$ and $Ib = b$, we find
$$(6.59)\qquad b = (X'X)^{-1}X'Y$$
The estimators $b_0$ and $b_1$ in b are the same as those given earlier in (2.10a) and (2.10b).
Example:
Let us find estimated regression coefficients for the following data:
Xi Yi XiYi X 2i
30 73 2190 900
20 50 1000 400
60 128 7680 3600
80 170 13600 6400
40 87 3480 1600
50 108 5400 2500
60 135 8100 3600
30 69 2070 900
70 148 10360 4900
60 132 7920 3600
500 1100 61800 28400 ←TOTALS
Therefore
n  10, ∑ Y i  1100, ∑ X i  500, ∑ X 2i  28400, ∑ X i Y i  61800
Let us now use (6.26):
$$(X'X)^{-1} = \begin{pmatrix} \dfrac{\sum X_i^2}{n\sum (X_i-\bar{X})^2} & \dfrac{-\sum X_i}{n\sum (X_i-\bar{X})^2} \\[2mm] \dfrac{-\sum X_i}{n\sum (X_i-\bar{X})^2} & \dfrac{n}{n\sum (X_i-\bar{X})^2} \end{pmatrix}$$
to evaluate $(X'X)^{-1}$. We have
$$n\sum (X_i-\bar{X})^2 = n\sum X_i^2 - \left(\sum X_i\right)^2 = 10(28400) - (500)^2 = 34\,000$$
Therefore
$$(X'X)^{-1} = \begin{pmatrix} \tfrac{28400}{34000} & \tfrac{-500}{34000} \\[1mm] \tfrac{-500}{34000} & \tfrac{10}{34000} \end{pmatrix}$$
We also use (6.17) to evaluate X′Y:
$$X'Y = \begin{pmatrix} \sum Y_i \\ \sum X_iY_i \end{pmatrix} = \begin{pmatrix} 1100 \\ 61800 \end{pmatrix}$$
hence
$$b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = (X'X)^{-1}X'Y = \begin{pmatrix} \tfrac{28400}{34000} & \tfrac{-500}{34000} \\[1mm] \tfrac{-500}{34000} & \tfrac{10}{34000} \end{pmatrix}\begin{pmatrix} 1100 \\ 61800 \end{pmatrix} = \begin{pmatrix} 10 \\ 2 \end{pmatrix}$$
so $b_0 = 10$ and $b_1 = 2$.
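The arithmetic above can be reproduced with a few lines of NumPy (an illustrative check, not part of the original notes):

import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

X = np.column_stack([np.ones(len(x)), x])   # design matrix (6.6)
b = np.linalg.inv(X.T @ X) @ (X.T @ y)      # (6.59)
print(b)                                     # [10.  2.]  ->  b0 = 10, b1 = 2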
To reduce rounding error in the calculations we can write
$$(X'X)^{-1} = \frac{1}{n\sum (X_i-\bar{X})^2}\begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{pmatrix}$$
In our case
$$(X'X)^{-1} = \frac{1}{34000}\begin{pmatrix} 28400 & -500 \\ -500 & 10 \end{pmatrix}$$
and
$$b = \frac{1}{n\sum (X_i-\bar{X})^2}\begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{pmatrix}\begin{pmatrix} \sum Y_i \\ \sum X_iY_i \end{pmatrix} = \frac{1}{34000}\begin{pmatrix} 28400 & -500 \\ -500 & 10 \end{pmatrix}\begin{pmatrix} 1100 \\ 61800 \end{pmatrix} = \begin{pmatrix} 10 \\ 2 \end{pmatrix}$$

ANALYSIS OF VARIANCE RESULTS

Fitted values and residuals.



Let the vector of the fitted values $\hat{Y}_i$ be denoted by $\hat{Y}$:
$$(6.64)\qquad \hat{Y} = \begin{pmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{pmatrix}$$
and the vector of residuals $e_i = Y_i - \hat{Y}_i$ be denoted by e:
$$(6.65)\qquad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$
In matrix notation we have
$$(6.66)\qquad \hat{Y} = Xb$$
because
$$\hat{Y} = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}\begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} b_0+b_1X_1 \\ b_0+b_1X_2 \\ \vdots \\ b_0+b_1X_n \end{pmatrix}$$
Similarly:
$$(6.67)\qquad e = Y - \hat{Y} = Y - Xb$$

Sums of squares.

To see how the sums of squares are expressed in matrix notation, we begin with SSTO. We know that
$$(6.68)\qquad SSTO = \sum Y_i^2 - n\bar{Y}^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$$
We also know from (6.15) that $Y'Y = \sum Y_i^2$.
Let $\mathbf{1}$ denote the n×1 column vector of 1's. Using this we have
$$(6.69)\qquad \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y = \frac{1}{n}\begin{pmatrix} Y_1 & \cdots & Y_n \end{pmatrix}\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\begin{pmatrix} 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = \frac{1}{n}\left(\sum Y_i\right)\left(\sum Y_i\right) = \frac{(\sum Y_i)^2}{n}$$
Hence
$$(6.70a)\qquad SSTO = Y'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y$$
In the same way we obtain that $SSE = \sum e_i^2$ in matrix terms is
$$(6.70b)\qquad SSE = e'e = (Y - Xb)'(Y - Xb)$$
which can be shown to equal
$$(6.70c)\qquad SSE = Y'Y - b'X'Y$$
For SSR = SSTO − SSE in matrix terms we have
$$(6.70d)\qquad SSR = b'X'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y$$

Example.

For the data from the previous example we have
$$Y'Y = \sum Y_i^2 = 134\,660$$
and
$$b = \begin{pmatrix} 10 \\ 2 \end{pmatrix} \qquad X'Y = \begin{pmatrix} 1100 \\ 61800 \end{pmatrix}$$
Hence
$$b'X'Y = \begin{pmatrix} 10 & 2 \end{pmatrix}\begin{pmatrix} 1100 \\ 61800 \end{pmatrix} = 134\,600$$
and
$$SSE = Y'Y - b'X'Y = 134\,660 - 134\,600 = 60$$
Furthermore, with $Y' = (73, 50, 128, 170, 87, 108, 135, 69, 148, 132)$ we have
$$Y'\mathbf{1} = \mathbf{1}'Y = \sum Y_i = 1100$$
so
$$\frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y = \frac{1}{10}(1100)(1100) = 121\,000$$
and finally
$$SSR = b'X'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y = 134\,600 - 121\,000 = 13\,600$$
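These quantities can be verified numerically; the sketch below (illustrative only, not part of the original notes) evaluates the matrix expressions (6.70) for the same data:

import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(y)

X = np.column_stack([np.ones(n), x])
b = np.linalg.inv(X.T @ X) @ (X.T @ y)
ones = np.ones(n)

correction = (y @ ones) ** 2 / n            # (1/n) Y'11'Y
SSTO = y @ y - correction                   # (6.70a)
SSE  = y @ y - b @ (X.T @ y)                # (6.70c)
SSR  = b @ (X.T @ y) - correction           # (6.70d)
print(SSTO, SSE, SSR)                       # 13660.0  60.0  13600.0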

SUMS OF SQUARES AS QUADRATIC FORMS

The ANOVA sums of squares can be shown to be quadratic forms.


An example of a quadratic form of the observations $Y_i$ when n = 2 is:
$$(6.71)\qquad 5Y_1^2 + 6Y_1Y_2 + 4Y_2^2$$
Note that this expression is a second-degree polynomial containing terms involving the squares of the observations and the cross product. We can express (6.71) in matrix terms as follows:
$$(6.71a)\qquad \begin{pmatrix} Y_1 & Y_2 \end{pmatrix}\begin{pmatrix} 5 & 3 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = Y'AY$$
where A is a symmetric matrix of coefficients.
In general, a quadratic form is defined as:
$$(6.72)\qquad Y'AY = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{i,j}Y_iY_j \qquad\text{where } a_{i,j} = a_{j,i}$$
A is a symmetric n×n matrix and is called the matrix of the quadratic form.
The ANOVA sums of squares SSTO, SSR and SSE are all quadratic forms. To see this, we re-express the matrix forms of these sums of squares in (6.70) using
$$(6.73)\qquad \underset{n\times 1}{\mathbf{1}}\ \underset{1\times n}{\mathbf{1}'} = \underset{n\times n}{J}$$
where J is the n×n matrix all of whose elements are 1. Also, the transpose of b in (6.59) can be obtained using the facts $(A+B)' = A'+B'$, $(AB)' = B'A'$ and $(A^{-1})' = (A')^{-1}$:
$$(6.74)\qquad b' = \left[(X'X)^{-1}X'Y\right]' = Y'X(X'X)^{-1}$$
by noting that X′X is a symmetric matrix, so that it equals its transpose. Hence
$$(6.75a)\qquad SSTO = Y'\left[I - \tfrac{1}{n}J\right]Y$$
$$(6.75b)\qquad SSR = Y'\left[X(X'X)^{-1}X' - \tfrac{1}{n}J\right]Y$$
$$(6.75c)\qquad SSE = Y'\left[I - X(X'X)^{-1}X'\right]Y$$
Each of these sums of squares can now be seen to be of the form Y′AY. It can be shown that the three A matrices
$$(6.76a)\qquad I - \tfrac{1}{n}J$$
$$(6.76b)\qquad X(X'X)^{-1}X' - \tfrac{1}{n}J$$
$$(6.76c)\qquad I - X(X'X)^{-1}X'$$
are symmetric. Hence SSTO, SSR, and SSE are quadratic forms, with the matrices of the quadratic forms given in (6.76).

INFERENCES IN REGRESSION ANALYSIS

Regression coefficients
The variance-covariance matrix of b,
$$(6.77)\qquad \sigma^2(b) = \begin{pmatrix} \sigma^2(b_0) & \sigma(b_0,b_1) \\ \sigma(b_1,b_0) & \sigma^2(b_1) \end{pmatrix}$$
is
$$(6.78)\qquad \underset{2\times 2}{\sigma^2(b)} = \sigma^2\,(X'X)^{-1}$$
or, using (6.27),
$$(6.78a)\qquad \sigma^2(b) = \begin{pmatrix} \dfrac{\sigma^2\sum X_i^2}{n\sum (X_i-\bar{X})^2} & \dfrac{-\bar{X}\sigma^2}{\sum (X_i-\bar{X})^2} \\[2mm] \dfrac{-\bar{X}\sigma^2}{\sum (X_i-\bar{X})^2} & \dfrac{\sigma^2}{\sum (X_i-\bar{X})^2} \end{pmatrix}$$
When MSE is substituted for $\sigma^2$ in (6.78a) we have
$$(6.79)\qquad s^2(b) = MSE\,(X'X)^{-1} = \begin{pmatrix} \dfrac{MSE\sum X_i^2}{n\sum (X_i-\bar{X})^2} & \dfrac{-\bar{X}\,MSE}{\sum (X_i-\bar{X})^2} \\[2mm] \dfrac{-\bar{X}\,MSE}{\sum (X_i-\bar{X})^2} & \dfrac{MSE}{\sum (X_i-\bar{X})^2} \end{pmatrix}$$
where s²(b) is the estimated variance-covariance matrix of b. In (6.78a) we can recognize the variance of $b_0$ (3.20b), the variance of $b_1$ (3.3b), and the covariance of $b_0$ and $b_1$.
Estimation.
Mean response.
To estimate the mean response at $X_h$, let us define the vector
$$(6.81)\qquad \underset{2\times 1}{X_h} = \begin{pmatrix} 1 \\ X_h \end{pmatrix} \qquad\text{or}\qquad X_h' = \begin{pmatrix} 1 & X_h \end{pmatrix}$$
The fitted value in matrix notation is
$$(6.82)\qquad \hat{Y}_h = X_h'b$$
since
$$X_h'b = \begin{pmatrix} 1 & X_h \end{pmatrix}\begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = b_0 + b_1X_h = \hat{Y}_h$$
The variance of $\hat{Y}_h$, given earlier in (3.28b), is in matrix notation:
$$(6.83)\qquad \sigma^2(\hat{Y}_h) = \sigma^2\,X_h'(X'X)^{-1}X_h = X_h'\,\sigma^2(b)\,X_h$$
where σ²(b) is the variance-covariance matrix of the regression coefficients in (6.78). Therefore $\sigma^2(\hat{Y}_h)$ is a function of the variances $\sigma^2(b_0)$, $\sigma^2(b_1)$ and of the covariance $\sigma(b_0, b_1)$.
The estimated variance of $\hat{Y}_h$, given earlier in (3.30), is in matrix notation:
$$(6.84)\qquad s^2(\hat{Y}_h) = MSE\,X_h'(X'X)^{-1}X_h = X_h'\,s^2(b)\,X_h$$
where s²(b) is the estimated variance-covariance matrix of the regression coefficients in (6.79).

Prediction of new observation
The estimated variance $s^2(Y_{h(new)})$, given earlier in (3.37), is in matrix notation:
$$(6.85)\qquad s^2(Y_{h(new)}) = MSE + s^2(\hat{Y}_h) = MSE + X_h'\,s^2(b)\,X_h = MSE\left[1 + X_h'(X'X)^{-1}X_h\right]$$
Example:
1) For the data of the previous examples,
$$(X'X)^{-1} = \begin{pmatrix} 10 & 500 \\ 500 & 28400 \end{pmatrix}^{-1} = \frac{1}{34000}\begin{pmatrix} 28400 & -500 \\ -500 & 10 \end{pmatrix} = \begin{pmatrix} 0.835\,294 & -0.014\,706 \\ -0.014\,706 & 0.000\,294\,1 \end{pmatrix}$$
We found earlier that MSE = 7.5. Hence
$$s^2(b) = MSE\,(X'X)^{-1} = 7.5\begin{pmatrix} 0.835\,294 & -0.014\,706 \\ -0.014\,706 & 0.000\,294\,1 \end{pmatrix} = \begin{pmatrix} 6.264\,706 & -0.110\,294\,1 \\ -0.110\,294\,1 & 0.002\,205\,9 \end{pmatrix}$$
Thus $s^2(b_0) = 6.264\,706$ and $s^2(b_1) = 0.002\,206$.
2) For the same data we will find $s^2(\hat{Y}_h)$ when $X_h = 55$. We define $X_h' = \begin{pmatrix} 1 & 55 \end{pmatrix}$ and, using (6.84), we get
$$s^2(\hat{Y}_{55}) = X_h'\,s^2(b)\,X_h = \begin{pmatrix} 1 & 55 \end{pmatrix}\begin{pmatrix} 6.264\,706 & -0.110\,294\,1 \\ -0.110\,294\,1 & 0.002\,205\,9 \end{pmatrix}\begin{pmatrix} 1 \\ 55 \end{pmatrix} = 0.8052$$
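A numerical check of s²(b) and s²(Ŷ_h) for this example (an illustrative sketch, not part of the original notes; MSE = 7.5 is taken from the earlier calculation):

import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
X = np.column_stack([np.ones(len(x)), x])
MSE = 7.5                                    # value found earlier in the notes

s2_b = MSE * np.linalg.inv(X.T @ X)          # (6.79)
print(np.round(s2_b, 6))                     # diagonal: 6.264706 and 0.002206

Xh = np.array([1.0, 55.0])
s2_Yh = Xh @ s2_b @ Xh                       # (6.84)
print(round(s2_Yh, 4))                       # 0.8052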

WEIGHTED LEAST SQUARES.

The regression results for weighted least squares can be stated in matrix algebra using the following notation:
$$(6.88)\qquad \underset{n\times n}{W} = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}$$
is the diagonal matrix containing the weights $w_i$. The weighted normal equations (5.32) can be written as
$$(6.89)\qquad (X'WX)b = X'WY$$
and the weighted least squares estimators are:
$$(6.90)\qquad \underset{2\times 1}{b} = (X'WX)^{-1}X'WY$$
The variance-covariance matrix of the weighted least squares estimators is:
$$(6.91)\qquad \underset{2\times 2}{\sigma^2(b)} = \sigma^2(X'WX)^{-1}$$
and the estimated variance-covariance matrix is:
$$(6.92)\qquad \underset{2\times 2}{s^2(b)} = MSE_w\,(X'WX)^{-1}$$
where $MSE_w$ is based on the weighted squared deviations:
$$(6.92a)\qquad MSE_w = \frac{\sum w_i(Y_i - \hat{Y}_i)^2}{n-2}$$
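A minimal NumPy sketch of (6.90)–(6.92a), illustrative only and not part of the original notes. For brevity it uses only the first few (X, Y) pairs of Question 1 above and the weights $w_i = 1/X_i^2$ from that problem:

import numpy as np

x = np.array([2.13, 1.21, 11.0, 6.0, 5.6, 6.91], dtype=float)   # first few X values of Question 1
y = np.array([15.5, 11.1, 62.6, 35.4, 24.9, 28.1], dtype=float)

X = np.column_stack([np.ones(len(x)), x])
W = np.diag(1.0 / x**2)                              # (6.88) with w_i = 1/X_i^2

b_w = np.linalg.inv(X.T @ W @ X) @ (X.T @ W @ y)     # (6.90)
resid = y - X @ b_w
MSE_w = (resid**2 * np.diag(W)).sum() / (len(y) - 2) # (6.92a)
s2_bw = MSE_w * np.linalg.inv(X.T @ W @ X)           # (6.92)
print(b_w, MSE_w)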
Residuals
For the analysis of residuals it is useful to recognize that each residual $e_i$ can be expressed as a linear combination of the observations $Y_i$. It can be shown that e, defined in (6.65), equals
$$(6.93)\qquad \underset{n\times 1}{e} = \underset{n\times n}{(I - H)}\ \underset{n\times 1}{Y}$$
where
$$(6.93a)\qquad \underset{n\times n}{H} = X(X'X)^{-1}X'$$
Note from (6.76c) that the matrix I − H is the matrix of the quadratic form (6.75c) for $SSE = \sum e_i^2$. The square n×n matrix H is called the hat matrix and plays an important role in regression analysis.
The variance-covariance matrix of e can be derived by means of (6.49), $\sigma^2(W) = A\,\sigma^2(Y)\,A'$. Since e = (I − H)Y, then
$$\sigma^2(e) = (I-H)\,\sigma^2(Y)\,(I-H)'$$
Now $\sigma^2(Y) = \sigma^2(\varepsilon) = \sigma^2 I$ for the normal error model. The matrix I − H has two special properties: it is symmetric and idempotent, i.e. (I − H)(I − H) = I − H. Hence
$$\sigma^2(e) = \sigma^2(I-H)\,I\,(I-H) = \sigma^2(I-H)$$
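The properties of the hat matrix used above are easy to confirm numerically (an illustrative sketch with arbitrary data, not part of the original notes):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(len(x)), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T          # (6.93a)
I = np.eye(len(x))

print(np.allclose(H, H.T))                    # H (and hence I-H) is symmetric
print(np.allclose((I - H) @ (I - H), I - H))  # I-H is idempotent, so sigma^2(e) = sigma^2 (I-H)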
PROBLEMS.
Question 1.
Assume that the normal regression model is applicable.
For the following data given by:
i 1 2 3 4 5
Xi 8 4 0 -4 -8
Y i 7.8 9 10.2 11 11.7
using matrix method find:
1) Y ′ Y 2) X ′ X 3) X ′ Y
4) b 5) ANOVA table
6) covariance-variance matrix s 2 b
Question 2.
Find the matrix A of the quadratic form:
3Y 21  10Y 1 Y 2  6Y 22
Question 3.
Assume that the normal regression model is applicable.
For the following data given by:
i 1 2 3 4 5 6
Xi 4 1 2 3 3 4
Y i 16 5 10 15 13 22
using matrix methods find:
1) Y′Y  2) X′X  3) X′Y  4) b  5) the ANOVA table
6) the variance-covariance matrix s²(b)
7) $\hat{Y}$  8) $s^2(Y_{h(new)})$ when $X_h = 3.5$
Question 4.
For the matrix
$$A = \begin{pmatrix} 1 & 0 & 4 \\ 0 & 3 & 0 \\ 4 & 0 & 9 \end{pmatrix}$$
find the quadratic form of the observations $Y_1$, $Y_2$ and $Y_3$.
Question 5.
The results of a certain experiments are shown below
i 1 2 3 4 5 6 7 8 9
Xi 7 6 5 1 5 4 7 3 4
Y i 97 86 78 10 75 62 101 39 53
i 10 11 12 13 14 15 16 17 18
Xi 2 8 5 2 5 7 1 4 5
Y i 33 118 65 25 71 105 17 49 68

Summary calculational results are: ∑ Y i  1152, ∑ X i  81, ∑Y i − Y  2  16504,


∑X i − X 2  74. 5, ∑Y i − Y X i − X  1098.
Find: 1) Y ′ Y 2) X ′ X 3) X ′ Y
4) b 5) ANOVA table
Question 6.
The results of a certain experiments are shown below
i 1 2 3 4 5 6 7 8 9 10
Xi 1 0 2 0 3 1 0 1 2 0
Y i 16 9 17 12 22 13 8 15 19 11
Summary calculational results are: ∑ X i  10, ∑ Y i  142, ∑ X 2i  20
∑ Y 2i  2194, ∑ X i Y i  182.
Find: 1) Y ′ Y 2) X ′ X 3) X ′ Y
4) b 5) ANOVA table
Question 7.
The following data were obtained in a certain study.
i 1 2 3 4 5 6 7 8 9 10 11 12
Xi 1 1 1 2 2 2 2 4 4 4 5 5
Y i 6.2 5.8 6 9.7 9.8 10.3 10.2 17.8 17.9 18.3 21.9 22.1
Summary calculational results are: ∑ X i  33, ∑ Y i  156, ∑ X 2i  117
∑ Y 2i  2448. 5, ∑ X i Y i  534.
Find: 1) Y ′ Y 2) X ′ X 3) X ′ Y
4) b 5) ANOVA table
Question 8.
Consider the simple linear regression model. Prove that
$$SSR = b'X'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y$$
Question 9.
1) Define the quadratic form.
2) Find the matrix A of the quadratic form $3Y_1^2 + 10Y_1Y_2 + 6Y_2^2$.
3) Find the quadratic form for SSR.

MULTIPLE REGRESSION MODELS

Multiple regression analysis is one of the most widely used of all statistical tools. In many practical situations a number of key independent variables affect the response variable in important and distinctive ways. Furthermore, in such cases one will find that predictions of the response variable based on a model containing only a single independent variable are too imprecise to be useful. A more complex model, containing additional independent variables, is more helpful in providing sufficiently precise predictions of the response variable.

First-order model with two independent variables.

When there are two independent variables $X_1$ and $X_2$, the model
$$(7.1)\qquad Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \varepsilon_i$$
is called a first-order model with two independent variables. It is linear in
parameters and linear in the independent variables. Y i denotes the response in the
i − th trial, and X i,1 and X i,2 are the values of the two independent variables in the
i-th trial. The parameters of the model are $\beta_0$, $\beta_1$ and $\beta_2$, and $\varepsilon_i$ is the random error term. Assuming that $E(\varepsilon_i) = 0$, the regression function for model (7.1) is:
$$(7.2)\qquad E(Y) = \beta_0 + \beta_1X_1 + \beta_2X_2$$
Analogous to simple linear regression, where the regression function $E(Y) = \beta_0 + \beta_1X$ is a line, the regression function (7.2) is a plane.
[Figure: the response plane for the first-order model with two independent variables.]
Note that a point on the response plane corresponds to the mean response EY
at the given combination of levels of X 1 and X 2 . Figure above also shows
a series of observations Y i corresponding to given levels of the two independent
variables X i,1 , X i,2 . Note that each vertical rule in picture above represents the
difference between Y i and the mean EY i . Hence, the vertical distance from Y i to
the response plane represents the error term  i  Y i − EY i . Frequently the regression
function in multiple regression is called a regression surface or a response surface.
Meaning of regression coefficients.

Let us consider the meaning of the regression parameters in multiple regression


function (7.2). The parameter  o is the Y intercept of the regression plane. If the
scope of the model includes X 1  0, X 2  0,  o gives the mean response at X 1  0,
X 2  0. Otherwise  o does not have any particular meaning as a separate
term in the regression model. The parameter $\beta_1$ indicates the change in the mean response per unit increase in $X_1$ when $X_2$ is held constant. Likewise, $\beta_2$ indicates the change in the mean response per unit increase in $X_2$ when $X_1$ is held constant.
The parameters  1 and  2 are frequently called partial regression coefficients
because they reflect the partial effect of one of independent variables when the other
independent variable is included in the model and is held constant. We can readily
establish the meaning of  1 and  2 by calculus, taking partial derivatives of the
response surface (7.2) with respect to X 1 and X 2 in turn:
∂EY ∂EY
∂X 1
 1 ∂X 2
 2.

First-order model with more than two independent variables.

We now consider the case where there are p−1 independent variables $X_1, \ldots, X_{p-1}$. The model
$$(7.5)\qquad Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \cdots + \beta_{p-1}X_{i,p-1} + \varepsilon_i$$
is called a first-order model with p−1 independent variables. It can also be written:
$$(7.5a)\qquad Y_i = \beta_0 + \sum_{k=1}^{p-1}\beta_kX_{i,k} + \varepsilon_i$$
Assuming that $E(\varepsilon_i) = 0$, the response function for model (7.5) is:
$$(7.6)\qquad E(Y_i) = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \cdots + \beta_{p-1}X_{i,p-1}$$
This response function is a hyperplane, i.e. a plane in more than two dimensions.

General linear regression model

In general, the variables $X_1, \ldots, X_{p-1}$ in a regression model do not have to represent different independent variables. The general linear regression model is:
$$(7.7)\qquad Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \cdots + \beta_{p-1}X_{i,p-1} + \varepsilon_i$$
where:
$\beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}$ are parameters,
$X_{i,1}, X_{i,2}, \ldots, X_{i,p-1}$ are known constants,
$\varepsilon_i$ are independent $N(0, \sigma^2)$,
$i = 1, 2, \ldots, n$.
The response function for model (7.7) is
$$(7.8)\qquad E(Y_i) = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \cdots + \beta_{p-1}X_{i,p-1}$$
This implies that the observations $Y_i$ are independent normal variables with mean $E(Y_i)$ given by (7.8) and with constant variance $\sigma^2$.

GENERAL LINEAR REGRESSION MODEL IN MATRIX TERMS

It is a remarkable property of matrix algebra that the results for the general
linear regression model (7.7) appear exactly the same in matrix notation
as those for the simple linear regression model (6.56).
To express the general linear regression model in matrix terms, we need to define the following matrices:
$$(7.17a)\qquad \underset{n\times 1}{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} \qquad (7.17b)\qquad \underset{n\times p}{X} = \begin{pmatrix} 1 & X_{1,1} & X_{1,2} & \cdots & X_{1,p-1} \\ 1 & X_{2,1} & X_{2,2} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n,1} & X_{n,2} & \cdots & X_{n,p-1} \end{pmatrix}$$
$$(7.17c)\qquad \underset{p\times 1}{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix} \qquad (7.17d)\qquad \underset{n\times 1}{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
Note that the Y and ε vectors are the same as for simple regression. The β vector contains additional regression parameters, and the X matrix contains a column of 1's as well as a column of the n values for each of the p−1 X variables in the regression model. The row subscript of each element $X_{i,k}$ in the X matrix identifies the trial, and the column subscript identifies the X variable.
In matrix terms, the general linear regression model (7.7) is:
$$(7.18)\qquad \underset{n\times 1}{Y} = \underset{n\times p}{X}\ \underset{p\times 1}{\beta} + \underset{n\times 1}{\varepsilon}$$
where:
Y is a vector of observations,
β is a vector of parameters,
X is a matrix of constants,
ε is a vector of independent normal random variables with expectation E(ε) = 0 and variance-covariance matrix $\sigma^2(\varepsilon) = \sigma^2 I$.
Consequently, the random vector Y has expectation
$$(7.18a)\qquad E(Y) = X\beta$$
and the variance-covariance matrix of Y is:
$$(7.18b)\qquad \sigma^2(Y) = \sigma^2 I$$

LEAST SQUARES ESTIMATORS

Let us denote the vector of estimated regression coefficients $b_0, b_1, \ldots, b_{p-1}$ as b:
$$(7.19)\qquad \underset{p\times 1}{b} = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{pmatrix}$$
The least squares normal equations for the general linear regression model (7.18) are:
$$(7.20)\qquad \underset{p\times p}{X'X}\ \underset{p\times 1}{b} = \underset{p\times n}{X'}\ \underset{n\times 1}{Y}$$
and the least squares estimators are:
$$(7.21)\qquad \underset{p\times 1}{b} = (X'X)^{-1}X'Y$$
For model (7.18), these least squares estimators are also maximum likelihood estimators and have all the properties stated before: they are unbiased, minimum variance unbiased, and sufficient.

ANALYSIS OF VARIANCE RESULTS



Let the vector of thefitted values Y i be denoted by Yand the vector of the
residual terms e i  Y i − Y i be denoted by e :

Y1

Y2
(7.22a) Y
n1 :

Yn
e1
e2
(7.22b) e
n1 :
en
The fitted values are represented by
(7.23) Y  Xb
and the residual terms by:
(7.24) e  Y −Y  Y − Xb

Sums of squares and mean squares

The sums of squares for the analysis of variance are:
$$(7.25)\qquad SSTO = Y'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y$$
$$(7.26)\qquad SSR = b'X'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y$$
$$(7.27)\qquad SSE = e'e = (Y-Xb)'(Y-Xb) = Y'Y - b'X'Y$$
where $\mathbf{1}$ is an n×1 vector of 1's. SSTO, as usual, has n−1 degrees of freedom associated with it. SSE has n−p degrees of freedom, since p parameters need to be estimated in the regression function of model (7.18). Finally, SSR has p−1 degrees of freedom, representing the number of X variables $X_1, \ldots, X_{p-1}$.
$$(7.28)\qquad MSR = \frac{SSR}{p-1} \qquad\qquad (7.29)\qquad MSE = \frac{SSE}{n-p}$$
The expectation of MSE is $\sigma^2$, as for simple regression. If p − 1 = 2 we have
$$E(MSR) = \sigma^2 + \tfrac{1}{2}\left[\beta_1^2\sum(X_{i,1}-\bar{X}_1)^2 + \beta_2^2\sum(X_{i,2}-\bar{X}_2)^2 + 2\beta_1\beta_2\sum(X_{i,1}-\bar{X}_1)(X_{i,2}-\bar{X}_2)\right]$$
Note that if both $\beta_1$ and $\beta_2$ equal zero, $E(MSR) = \sigma^2$; otherwise $E(MSR) > \sigma^2$.

ANOVA table for the general linear regression model (7.18):

Source of variation   SS                               df      MS
Regression            SSR = b'X'Y − (1/n)Y'11'Y        p − 1   MSR = SSR/(p−1)
Error                 SSE = Y'Y − b'X'Y                n − p   MSE = SSE/(n−p)
Total                 SSTO = Y'Y − (1/n)Y'11'Y         n − 1

F test for regression relation

To test whether there is a regression relation between the dependent variable Y and the set of X variables $X_1, \ldots, X_{p-1}$, we test the alternatives:
$$(7.30a)\qquad H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \qquad\text{vs.}\qquad H_a: \text{not all } \beta_k\ (k = 1, 2, \ldots, p-1) \text{ equal zero}$$
We use the test statistic
$$(7.30b)\qquad F^* = \frac{MSR}{MSE}$$
The decision rule is
$$(7.30c)\qquad \text{If } F^* \le F(1-\alpha;\ p-1,\ n-p), \text{ conclude } H_0; \qquad \text{if } F^* > F(1-\alpha;\ p-1,\ n-p), \text{ conclude } H_a$$
Note that when p − 1 = 1, this test reduces to the F test in (3.61).
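In practice the critical value F(1−α; p−1, n−p) is read from a table or computed with software. The sketch below illustrates the decision rule (7.30c) with SciPy; the MSR, MSE, n and p values are hypothetical, not taken from the notes:

import numpy as np
from scipy.stats import f

MSR, MSE = 201.1, 3.5        # hypothetical mean squares
n, p, alpha = 8, 3, 0.05     # hypothetical sample size and number of parameters

F_star = MSR / MSE                               # (7.30b)
F_crit = f.ppf(1 - alpha, p - 1, n - p)          # F(1-alpha; p-1, n-p)
print(F_star, F_crit, "conclude Ha" if F_star > F_crit else "conclude Ho")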

Coefficient of multiple determination

The coefficient of multiple determination, denoted by $R^2$, is defined as follows:
$$(7.31)\qquad R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
It measures the proportionate reduction of the total variation in Y associated with the use of the set of X variables $X_1, \ldots, X_{p-1}$. Just as for $r^2$ we have
$$(7.32)\qquad 0 \le R^2 \le 1$$
$R^2$ assumes the value 0 when all $b_k = 0$ (k = 1, ..., p−1) and takes on the value 1 when all observations fall directly on the fitted response surface, i.e. when $\hat{Y}_i = Y_i$ for all i.

Coefficient of multiple correlation

The coefficient of multiple correlation R is the positive square root of $R^2$:
$$(7.34)\qquad R = \sqrt{R^2}$$

INFERENCES ABOUT REGRESSION PARAMETERS

The least squares estimators in b are unbiased:
$$(7.35)\qquad E(b) = \beta$$
The variance-covariance matrix $\sigma^2(b)$,
$$(7.36)\qquad \sigma^2(b) = \begin{pmatrix} \sigma^2(b_0) & \sigma(b_0,b_1) & \cdots & \sigma(b_0,b_{p-1}) \\ \sigma(b_1,b_0) & \sigma^2(b_1) & \cdots & \sigma(b_1,b_{p-1}) \\ \vdots & \vdots & & \vdots \\ \sigma(b_{p-1},b_0) & \sigma(b_{p-1},b_1) & \cdots & \sigma^2(b_{p-1}) \end{pmatrix}$$
is given by:
$$(7.37)\qquad \sigma^2(b) = \sigma^2(X'X)^{-1}$$
The estimated variance-covariance matrix $s^2(b)$,
$$(7.38)\qquad s^2(b) = \begin{pmatrix} s^2(b_0) & s(b_0,b_1) & \cdots & s(b_0,b_{p-1}) \\ s(b_1,b_0) & s^2(b_1) & \cdots & s(b_1,b_{p-1}) \\ \vdots & \vdots & & \vdots \\ s(b_{p-1},b_0) & s(b_{p-1},b_1) & \cdots & s^2(b_{p-1}) \end{pmatrix}$$
is given by
$$(7.39)\qquad s^2(b) = MSE\,(X'X)^{-1}$$
From s²(b) one can obtain $s^2(b_0)$, $s^2(b_1)$, or whatever other variance or covariance is needed.

Interval estimation of $\beta_k$

For the normal error model (7.18), we have:
$$(7.40)\qquad \frac{b_k - \beta_k}{s(b_k)} \sim t(n-p) \qquad k = 0, 1, 2, \ldots, p-1$$
Hence the confidence limits for $\beta_k$ with confidence coefficient 1 − α are:
$$(7.41)\qquad b_k \pm t(1-\alpha/2;\ n-p)\,s(b_k)$$
Note that $s(b_k)$ is obtained from (7.38) and (7.39).

Tests for  k

Tests for  k are set up in the usual fashion. To test:


Ho : k  0
(7.42a)
Ha : k ≠ 0
we may use the test statistics
(7.42b) t ∗  sbb k 
k
and the decision rule is
If |t ∗ |  t1 − /2, n − p, conclude H o
(7.42c)
If |t ∗ |  t1 − /2, , n − p, conclude H a
Number of degrees of freedom is n − p.
As with simple regression, to test whether or not  k  0 in multiple
regression models can also be conducted by means of an F test.

Joint inferences

The boundary of the joint confidence region for all p of the $\beta_k$ regression parameters (k = 0, 1, ..., p−1) with confidence coefficient 1 − α is:
$$(7.43)\qquad \frac{(b-\beta)'X'X(b-\beta)}{p\,MSE} = F(1-\alpha;\ p,\ n-p)$$
The region defined by this boundary is generally difficult to obtain and interpret. The Bonferroni joint confidence intervals, in contrast, are easy to obtain and interpret. If g parameters are to be estimated jointly (where g ≤ p), the confidence limits with family confidence coefficient 1 − α are:
$$(7.44)\qquad b_k \pm B\,s(b_k)$$
where
$$(7.44a)\qquad B = t(1-\alpha/2g;\ n-p)$$

INFERENCES ABOUT MEAN RESPONSE

Interval estimation of EY h 

57
For given values of X 1 , X 2 , . . . , X p denoted by X h,1 , X h,2 , . . . , X h,p , the mean
response is denoted by EY h . We define the vector X h :
1
X h,1
(7.45) Xh  X h,2
:
X h,p−1
so the mean response to be estimated is:
(7.46) EY h   X ′h  
The estimated
 mean response corresponding to X h denoted by Y h , is:
(7.47) Y h  X ′h b
This estimator  is unbiased:
(7.48) EY h   EX ′h b  X ′h Eb  X ′h  EY h 
and its variance  is: 2 ′ ′ −1
(7.49)  Y h    X h X X X h  X ′h  2 bX h
2

Note that the variance  2 Y h is a function of the variances  2 b k  of the regression
coefficients and of the covariances b k , b j  between pairs of the regression

coefficients, just
 as in simple regression. The estimated variance s 2
 Y h  is given by
(7.50) s 2 Y h   MSEX ′h X ′ X −1 X h  X ′h s 2 bX h
The 1 −  confidence
  h  are:
limits for EY
(7.51) Y h  t1 − /2; n − psY h 

F test for lack of fit.

To test whether the response function


(7.54) EY   0   1 X 1 . . .  p−1 X p−1
is an appropriate response surface for the data at hand requires repeat observations,
as for simple regression analysis. Thus, with two independent variables repeat
observations require that X 1 and X 2 each remain at given levels from trial to trial.
The procedures described in case of simple linear regression of F test for lack of fit
are applicable to multiple regression. Once the ANOVA table has been calculated, SSE is decomposed into a pure error component and a lack of fit component. The pure error sum of squares SSPE is obtained by first calculating, for each replicate group, the sum
of squared deviations of Y observations around the group mean, where a replicate
group has the same values for the X 1 , X 2 . . . . X p−1 variables. Suppose that there are c
replicate groups with distinct sets of levels for X variables, and let the mean
observation of the Y for the jth group be denoted by Y j . Then the sum of squares for
the jth group is given by (4.8), and the pure error sum of squares is the sum of these
sums, as given in (4.9). The lack of fit sum of squares SSLF  SSE − SSPE as
indicated in (4.12). The number of degrees of freedom associated with SSPE
is n − c, and the number of degrees of freedom associated with SSLF is
n − p − n − c  c − p.
The F test is conducted as described in (4.15), but with the degrees of freedom modified
to those just stated.
Therefore:
$$MSPE = \frac{SSPE}{n-c}$$
The second component of SSE is
$$SSLF = SSE - SSPE$$
where SSLF denotes the lack of fit sum of squares. Thus the lack of fit mean square is
$$MSLF = \frac{SSLF}{c-p}$$
The test statistic is
$$F^* = \frac{MSLF}{MSPE}$$
The hypotheses are
$$H_0: E(Y) = \beta_0 + \beta_1X_1 + \cdots + \beta_{p-1}X_{p-1} \qquad\qquad H_a: E(Y) \ne \beta_0 + \beta_1X_1 + \cdots + \beta_{p-1}X_{p-1}$$
The decision rule is
$$\text{If } F^* \le F(1-\alpha;\ c-p,\ n-c), \text{ conclude } H_0; \qquad \text{if } F^* > F(1-\alpha;\ c-p,\ n-c), \text{ conclude } H_a$$

PREDICTIONS OF NEW OBSERVATIONS

Prediction of new observation $Y_{h(new)}$

The prediction limits with 1 − α confidence coefficient for a new observation $Y_{h(new)}$ corresponding to $X_h$, the specified values of the X variables, are:
$$(7.55)\qquad \hat{Y}_h \pm t(1-\alpha/2;\ n-p)\,s(Y_{h(new)})$$
where
$$(7.55a)\qquad s^2(Y_{h(new)}) = MSE + X_h'\,s^2(b)\,X_h = MSE\left[1 + X_h'(X'X)^{-1}X_h\right]$$

Prediction of the mean of m new observations at $X_h$

When m new observations are to be selected at $X_h$ and their mean $\bar{Y}_{h(new)}$ is to be predicted, the 1 − α prediction limits are:
$$(7.56)\qquad \hat{Y}_h \pm t(1-\alpha/2;\ n-p)\,s(\bar{Y}_{h(new)})$$
where
$$(7.56a)\qquad s^2(\bar{Y}_{h(new)}) = \frac{MSE}{m} + s^2(\hat{Y}_h) = \frac{MSE}{m} + X_h'\,s^2(b)\,X_h = MSE\left[\frac{1}{m} + X_h'(X'X)^{-1}X_h\right]$$

Predictions of g new observations

Simultaneous prediction limits for g new observations at g different levels of $X_h$ with family confidence coefficient 1 − α are given by:
$$(7.57)\qquad \hat{Y}_h \pm S\,s(Y_{h(new)})$$
where
$$(7.57a)\qquad S^2 = g\,F(1-\alpha;\ g,\ n-p)$$
and $s^2(Y_{h(new)})$ is given by (7.55a).
Alternatively, the Bonferroni simultaneous prediction limits can be used. For g predictions with family confidence coefficient 1 − α they are:
$$(7.58)\qquad \hat{Y}_h \pm B\,s(Y_{h(new)})$$
where
$$(7.58a)\qquad B = t(1-\alpha/2g;\ n-p)$$

Least squares estimators in case with two independent variables

The sum of squares to be minimized is:
$$Q = \sum_i\left(Y_i - b_0 - b_1X_{i,1} - b_2X_{i,2}\right)^2$$
Differentiating with respect to $b_0$,
$$\frac{\partial Q}{\partial b_0} = 2\sum_i\left(-Y_i + b_0 + b_1X_{i,1} + b_2X_{i,2}\right)$$
and setting $\partial Q/\partial b_0 = 0$ gives the first normal equation:
$$nb_0 + b_1\sum X_{i,1} + b_2\sum X_{i,2} = \sum Y_i$$
Next,
$$\frac{\partial Q}{\partial b_1} = 2\sum_i\left(-X_{i,1}Y_i + X_{i,1}b_0 + b_1X_{i,1}^2 + b_2X_{i,1}X_{i,2}\right)$$
and setting $\partial Q/\partial b_1 = 0$ gives the second normal equation:
$$b_0\sum X_{i,1} + b_1\sum X_{i,1}^2 + b_2\sum X_{i,1}X_{i,2} = \sum X_{i,1}Y_i$$
Finally,
$$\frac{\partial Q}{\partial b_2} = 2\sum_i\left(-X_{i,2}Y_i + X_{i,2}b_0 + b_1X_{i,1}X_{i,2} + b_2X_{i,2}^2\right)$$
and setting $\partial Q/\partial b_2 = 0$ gives the third normal equation:
$$b_0\sum X_{i,2} + b_1\sum X_{i,1}X_{i,2} + b_2\sum X_{i,2}^2 = \sum X_{i,2}Y_i$$

EXAMPLE 1:

Multiple regression with two independent variables


Let us consider the following data:
i Y i X i,1 X i,2
1 162 274 2450
2 120 180 3254
3 223 375 3802
4 131 205 2838
5 67 86 2347
6 169 265 3782
7 81 98 3008
8 192 330 2450
9 116 195 2137
10 55 53 2560
11 252 430 4020
12 232 372 4427
13 144 236 2660
14 103 157 2088
15 212 370 2605
The linear model in use is:
$$(7.59)\qquad Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \varepsilon_i$$

Basic calculations

The Y vector and X matrix are assembled from the table above: Y is the 15×1 vector of responses, and X is the 15×3 matrix consisting of a column of 1's followed by the columns of $X_{i,1}$ and $X_{i,2}$ values.
1)
$$X'X = \begin{pmatrix} 15 & 3626 & 44\,428 \\ 3626 & 1\,067\,614 & 11\,419\,181 \\ 44\,428 & 11\,419\,181 & 139\,063\,428 \end{pmatrix}$$
2)
$$X'Y = \begin{pmatrix} 2259 \\ 647\,107 \\ 7\,096\,619 \end{pmatrix}$$
3)
$$(X'X)^{-1} = \begin{pmatrix} 1.246\,348\,416 & 2.129\,664\,176\times 10^{-4} & -4.156\,712\,541\times 10^{-4} \\ 2.129\,664\,176\times 10^{-4} & 7.732\,903\,033\times 10^{-6} & -7.030\,251\,792\times 10^{-7} \\ -4.156\,712\,541\times 10^{-4} & -7.030\,251\,792\times 10^{-7} & 1.977\,185\,133\times 10^{-7} \end{pmatrix}$$

Algebraic equivalents

Note that X′X is
$$(7.63)\qquad X'X = \begin{pmatrix} n & \sum X_{i,1} & \sum X_{i,2} \\ \sum X_{i,1} & \sum X_{i,1}^2 & \sum X_{i,1}X_{i,2} \\ \sum X_{i,2} & \sum X_{i,1}X_{i,2} & \sum X_{i,2}^2 \end{pmatrix}$$
so in our case
$$\begin{pmatrix} 15 & 3626 & 44\,428 \\ 3626 & 1\,067\,614 & 11\,419\,181 \\ 44\,428 & 11\,419\,181 & 139\,063\,428 \end{pmatrix} = \begin{pmatrix} n & \sum X_{i,1} & \sum X_{i,2} \\ \sum X_{i,1} & \sum X_{i,1}^2 & \sum X_{i,1}X_{i,2} \\ \sum X_{i,2} & \sum X_{i,1}X_{i,2} & \sum X_{i,2}^2 \end{pmatrix}$$
Also
$$(7.64)\qquad X'Y = \begin{pmatrix} \sum Y_i \\ \sum Y_iX_{i,1} \\ \sum Y_iX_{i,2} \end{pmatrix}$$
hence in our case
$$\begin{pmatrix} 2259 \\ 647\,107 \\ 7\,096\,619 \end{pmatrix} = \begin{pmatrix} \sum Y_i \\ \sum Y_iX_{i,1} \\ \sum Y_iX_{i,2} \end{pmatrix}$$

Estimated regression function

Using (7.21) with the calculations above we get
$$b = (X'X)^{-1}X'Y = \begin{pmatrix} 3.452\,611\,738 \\ 0.496\,004\,976 \\ 9.199\,080\,488\times 10^{-3} \end{pmatrix} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix}$$
and the estimated regression function is
$$\hat{Y} = 3.453 + 0.496X_1 + 0.00920X_2$$

Fitted values and residuals

The fitted values are
$$\hat{Y} = Xb = (161.896,\ 122.667,\ 224.429,\ 131.241,\ 67.699,\ 169.685,\ 79.732,\ 189.672,\ 119.832,\ 53.291,\ 253.715,\ 228.691,\ 144.979,\ 100.533,\ 210.938)'$$
and the residuals are
$$e = Y - \hat{Y} = (0.104,\ -2.667,\ -1.429,\ -0.241,\ -0.699,\ -0.685,\ 1.268,\ 2.328,\ -3.832,\ 1.709,\ -1.715,\ 3.309,\ -0.979,\ 2.467,\ 1.062)'$$
Analysis of variance

To test whether Y is related to $X_1$ and $X_2$ we construct the ANOVA table using (7.25)–(7.29). The basic quantities needed are
$$Y'Y = \sum Y_i^2 = 394\,107$$
$$\frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y = \frac{(\sum Y_i)^2}{n} = \frac{(2259)^2}{15} = 340\,205.4$$
Thus
$$SSTO = Y'Y - \frac{1}{n}Y'\mathbf{1}\mathbf{1}'Y = 394\,107 - 340\,205.4 = 53\,901.6$$
and
$$SSE = Y'Y - b'X'Y = 394\,107 - \begin{pmatrix} 3.452\,611\,738 & 0.496\,004\,976 & 9.199\,080\,488\times 10^{-3} \end{pmatrix}\begin{pmatrix} 2259 \\ 647\,107 \\ 7\,096\,619 \end{pmatrix} \approx 56.89$$
Finally we have
$$SSR = SSTO - SSE = 53\,901.6 - 56.89 = 53\,844.71$$
The ANOVA table:

Source of variation   SS                df          MS
Regression            SSR = 53844.71    p − 1 = 2   MSR = 26922.36
Error                 SSE = 56.89       n − p = 12  MSE = 4.741
Total                 SSTO = 53901.6    n − 1 = 14

Test of regression relation

The hypotheses are
$$H_0: \beta_1 = \beta_2 = 0 \qquad\text{vs.}\qquad H_a: \text{not both } \beta_1 \text{ and } \beta_2 \text{ equal zero}$$
We use the test statistic
$$F^* = \frac{MSR}{MSE} = \frac{26\,922.36}{4.741} \approx 5679$$
The decision rule is: if $F^* \le F(1-\alpha;\ p-1,\ n-p)$, conclude $H_0$; if $F^* > F(1-\alpha;\ p-1,\ n-p)$, conclude $H_a$. Taking α = 0.05, from the table we get $F(0.95;\ 2,\ 12) = 3.89$. Since $F^* \approx 5679 > 3.89$, we conclude $H_a$: Y is related to $X_1$ and $X_2$.

Coefficient of multiple determination

In our case we have
$$R^2 = \frac{SSR}{SSTO} = \frac{53\,844.71}{53\,901.6} = 0.9989$$
Thus, when $X_1$ and $X_2$ are included in the model, the total variation in Y is reduced by 99.9 percent.

Algebraic expression for SSE

The error sum of squares for the case of two independent variables is:
$$(7.66)\qquad SSE = Y'Y - b'X'Y = \sum Y_i^2 - \begin{pmatrix} b_0 & b_1 & b_2 \end{pmatrix}\begin{pmatrix} \sum Y_i \\ \sum Y_iX_{i,1} \\ \sum Y_iX_{i,2} \end{pmatrix} = \sum Y_i^2 - b_0\sum Y_i - b_1\sum Y_iX_{i,1} - b_2\sum Y_iX_{i,2}$$

Estimation of regression parameters.

We will use the simultaneous Bonferroni confidence limits given in (7.44), with α = 0.10. First we need to estimate s²(b):
$$(7.67)\qquad s^2(b) = MSE\,(X'X)^{-1} = \begin{pmatrix} 5.908 & 1.0095\times 10^{-3} & -1.9703\times 10^{-3} \\ 1.0095\times 10^{-3} & 3.6654\times 10^{-5} & -3.3324\times 10^{-6} \\ -1.9703\times 10^{-3} & -3.3324\times 10^{-6} & 9.3719\times 10^{-7} \end{pmatrix}$$
The two elements we require are
$$s^2(b_1) = 3.6654\times 10^{-5} \quad\text{or}\quad s(b_1) = 0.006054 \qquad\qquad s^2(b_2) = 9.3719\times 10^{-7} \quad\text{or}\quad s(b_2) = 0.000968$$
Next, for g = 2, from the t table
$$B = t(1 - 0.10/(2\cdot 2);\ 12) = t(0.975;\ 12) = 2.179$$
and finally, for $\beta_1$,
$$0.4960 - 2.179(0.006054) \le \beta_1 \le 0.4960 + 2.179(0.006054) \qquad\text{or}\qquad 0.483 \le \beta_1 \le 0.509$$
and for $\beta_2$ simultaneously
$$0.009199 - 2.179(0.000968) \le \beta_2 \le 0.009199 + 2.179(0.000968) \qquad\text{or}\qquad 0.0071 \le \beta_2 \le 0.0113$$
With family confidence coefficient 0.90 we conclude that $\beta_1$ falls between 0.483 and 0.509 and that $\beta_2$ falls between 0.0071 and 0.0113.

Estimation of mean response.

Suppose that we would like to estimate the expected (mean) value of Y when $X_{h,1} = 220$ and $X_{h,2} = 2500$. We define
$$X_h = \begin{pmatrix} 1 \\ 220 \\ 2500 \end{pmatrix}$$
The point estimate of the mean response, by (7.47), is
$$\hat{Y}_h = X_h'b = \begin{pmatrix} 1 & 220 & 2500 \end{pmatrix}\begin{pmatrix} 3.452\,611\,738 \\ 0.496\,004\,976 \\ 9.199\,080\,488\times 10^{-3} \end{pmatrix} = 135.57$$
The estimated variance, by (7.50) and using the results in (7.67), is
$$s^2(\hat{Y}_h) = X_h'\,s^2(b)\,X_h = 0.46638 \qquad\text{so}\qquad s(\hat{Y}_h) = 0.68292$$
Assume that the confidence coefficient for the interval estimate of $E(Y_h)$ is to be 0.95. We then need $t(1-\alpha/2;\ n-p) = t(0.975;\ 12) = 2.179$, and
$$135.57 - 2.179(0.68292) \le E(Y_h) \le 135.57 + 2.179(0.68292) \qquad\text{or}\qquad 134.1 \le E(Y_h) \le 137.1$$
Thus, with confidence coefficient 0.95 we estimate that the mean of Y at the levels $X_1 = 220$ and $X_2 = 2500$ is somewhere between 134.1 and 137.1.
Prediction limits for new observations.
Suppose that we would like to predict Y at two sets of levels of the independent variables:
           A       B
X_h,1     220     375
X_h,2    2500    3500
In this case g = 2. To determine which simultaneous prediction intervals are best here, we compute S as given in (7.57a) and B as given in (7.58a), assuming a 0.90 family confidence coefficient:
$$S^2 = g\,F(1-\alpha;\ g,\ n-p) = 2F(0.90;\ 2,\ 12) = 2(2.81) = 5.62 \qquad\text{so}\qquad S = \sqrt{5.62} = 2.371$$
and
$$B = t(1-\alpha/2g;\ n-p) = t(0.975;\ 12) = 2.179$$
Hence the Bonferroni limits are more efficient here (they give shorter intervals).
For level A we have $X_A' = (1,\ 220,\ 2500)$, so as before
$$\hat{Y}_A = X_A'b = 135.57 \qquad s^2(\hat{Y}_A) = X_A'\,s^2(b)\,X_A = 0.46638 \qquad MSE = 4.7403$$
Hence, by (7.55a),
$$s^2(Y_{A(new)}) = MSE + s^2(\hat{Y}_A) = 4.7403 + 0.46638 = 5.2067 \qquad\text{or}\qquad s(Y_{A(new)}) = 2.2818$$
For level B we have $X_B' = (1,\ 375,\ 3500)$, so
$$\hat{Y}_B = X_B'b = 221.65 \qquad s^2(\hat{Y}_B) = X_B'\,s^2(b)\,X_B = 0.76026$$
Hence
$$s^2(Y_{B(new)}) = MSE + s^2(\hat{Y}_B) = 4.7403 + 0.76026 = 5.5006 \qquad\text{or}\qquad s(Y_{B(new)}) = 2.3453$$
We found above that B = 2.179. The simultaneous Bonferroni prediction intervals with family confidence coefficient 0.90, $\hat{Y}_h \pm B\,s(Y_{h(new)})$, are therefore
$$135.57 - 2.179(2.2818) \le Y_{A(new)} \le 135.57 + 2.179(2.2818) \qquad\text{or}\qquad 130.6 \le Y_{A(new)} \le 140.5$$
$$221.65 - 2.179(2.3453) \le Y_{B(new)} \le 221.65 + 2.179(2.3453) \qquad\text{or}\qquad 216.5 \le Y_{B(new)} \le 226.8$$

STANDARDIZED REGRESSION COEFFICIENTS.

Standardized regression coefficients have been proposed to facilitate


comparisons between regression coefficients. It is difficult to compare regression coefficients directly because of differences in the units involved. When considering the fitted response function
$$\hat{Y} = 200 + 20000X_1 + 0.2X_2$$
one may be tempted to conclude that $X_1$ is the only important independent variable and that $X_2$ has little effect on the dependent variable. Suppose, however, that the units are:
Y in rands
$X_1$ in thousands of rands
$X_2$ in cents
In that event, the effect on the mean response of a R1000 increase in $X_1$ (one unit of $X_1$) when $X_2$ is constant would be exactly the same as the effect of a R1000 increase in $X_2$ (100 000 cents) when $X_1$ is constant: both increase the mean response by 20 000.
Standardized regression coefficients, also called beta coefficients, are defined as follows:
$$(7.69)\qquad B_k = b_k\left[\frac{\sum(X_{i,k}-\bar{X}_k)^2}{\sum(Y_i-\bar{Y})^2}\right]^{1/2} = b_k\,\frac{s_k}{s_y}$$
where $s_k$ and $s_y$ are the standard deviations of the $X_k$ and Y observations, respectively.
For our example
$$B_1 = 0.496\left[\frac{191\,089}{53\,902}\right]^{1/2} = 0.934 \qquad\qquad B_2 = 0.00920\left[\frac{7\,473\,616}{53\,902}\right]^{1/2} = 0.108$$

PROBLEMS

Question 1.
The following data were obtained in certain experiment:
i Y i X i,1 X i,2
1 64 4 2
2 73 4 4
3 61 4 2
4 76 4 4
5 72 6 2
6 80 6 4
7 71 6 2
8 83 6 4
9 83 8 2
10 89 8 4
11 86 8 2
12 93 8 4
13 88 10 2
14 95 10 4
15 94 10 2
16 100 10 4
Assume that regression model (7.1) with independent normal errors is appropriate.
1) Find the estimated regression coefficients.
2)Test whether there is a regression relation using   0. 01.
3) Estimate  1 and  2 jointly by the Bonferroni procedure using
99 percent family confidence coefficient.
4) Obtain an interval estimate of EY h  when X h,1  5 and X h,2  4.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation using   0. 01.
6) Obtain the residuals.
Question 2.
The following data were obtained in certain experiment:
i Yi X i,1 X i,2 i Yi X i,1 X i,2
1 58 7 5.11 11 121 17 11.02
2 152 18 16.72 12 112 12 9.51
3 41 5 3.20 13 50 6 3.79
4 93 14 7.03 14 82 12 6.45
5 101 11 10.98 15 48 8 4.60
6 38 5 4.04 16 127 15 13.86
7 203 23 22.07 17 140 17 13.03
8 78 9 7.03 18 155 21 15.21
9 117 16 10.62 19 39 6 3.64
10 44 5 4.76 20 90 11 9.57
1) Find the estimated regression coefficients.
2)Test whether there is a regression relation using   0. 01.
3) Estimate  1 and  2 jointly by the Bonferroni procedure using
99 percent family confidence coefficient.
4) Obtain an interval estimate of EY h  when X h,1  5 and X h,2  3. 2.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation using   0. 01.
6) Obtain the residuals.
7) Calculate the coefficient of multiple determination R 2 .
8) Obtain the simultaneous interval estimates for five levels of X :
1 2 3 4 5
X1 5 6 10 14 20
X 2 3.2 4.8 7.0 10.0 18.0
using 95 percent confidence coefficient.
Question 3.
Consider the multiple regression model:
$$Y_i = \beta_1X_{i,1} + \beta_2X_{i,2} + \varepsilon_i$$
where the $\varepsilon_i$ are independent normally distributed random errors, $N(0, \sigma^2)$.
1) Derive the least squares estimators of $\beta_1$ and $\beta_2$.
2) Obtain the maximum likelihood estimators of $\beta_1$ and $\beta_2$.
Question 4.
A pharmaceutical company testing a new pain-killing drug tests the drug on 20 people
suffering from arthritis. The time elapsed, in minutes, from taking the drug until noticeable
relief in pain is detected, is to be predicted from dosage (in grams) and the age
of patient (in years). The results are given below
i Time (Y i ) Dosage (X i,1 ) Age (X i,2 )
1 11 2 59
2 3 2 57
3 20 2 22
4 25 2 12
5 27 2 18
6 15 5 40
7 10 5 64
8 34 5 27
9 14 5 54
10 34 5 22
11 35 7 33
12 28 7 49
13 23 7 29
14 21 7 32
15 33 7 20
16 27 10 43
17 8 10 61
18 3 10 69
19 12 10 62
20 14 10 61
1) Find the estimated regression coefficients.
2)Test whether there is a regression relation using   0. 01.
3) Estimate  1 and  2 jointly by the Bonferroni procedure using
99 percent family confidence coefficient.
4) Obtain an interval estimate of EY h  when X h,1  6 and X h,2  45.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation using   0. 01.
6) Obtain the residuals.
7) Calculate the coefficient of multiple determination R 2 .
8) Obtain the simultaneous interval estimates for five levels of X :
1 2 3 4 5
X1 5 6 2 4 5
X 2 41 39 52 61 75
using 95 percent confidence coefficient.
Question 5.
A large discount department store chain advertises on television (X 1 ),
on the radio (X 2 ), and in newspapers (X 3 ). A sample of 12 of its stores in
a certain area showed the following advertising expenditures and revenues
during a given month. ( All figures are in thousands of rands)
i Revenues (Y i ) X i,1 X i,2 X i,3
1 84 13 5 2
2 84 13 7 1
3 80 8 6 3
4 50 9 5 3
5 20 9 3 1
6 68 13 5 1
7 34 12 7 2
8 30 10 3 2
9 54 8 5 2
10 40 10 5 3
11 57 5 6 2
12 46 5 7 2
1) Find the estimated regression coefficients.
2)Test whether there is a regression relation using   0. 01.
3) Estimate  1 ,  2 and  3 jointly by the Bonferroni procedure using
99 percent family confidence coefficient.
4) Obtain an interval estimate of EY h  when X h,1  11, X h,2  6 and X h,3  2.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation using   0. 01.
6) Obtain the residuals.
7) Calculate the coefficient of multiple determination R 2 .
8) Obtain the simultaneous interval estimates for two levels of X :
i 1 2
X 1 11 15
X2 7 9
X3 2 3
using 95 percent confidence coefficient.
Question 6.
Assume that the normal regression model is applicable.
For the following data given by:
i 1 2 3 4 5
Xi 8 4 0 -4 -8
Y i 7.8 9 10.2 11 11.7
using matrix method find:
1) Y ′ Y
2) X ′ X
3) X ′ Y
4) b
5) Test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \ne 0$ using ANOVA, with α = 0.05.
6) the variance-covariance matrix s²(b)
MULTICOLLINEARITY AND ITS EFFECTS

In multiple regression analysis the relation between the independent variables and
the dependent one is of prime interest. Questions that are frequently asked include:
1. What is the relative importance of the effects of the different independent variables ?
2. What is the magnitude of the effect of a given independent variable on the dependent
one?
3. Can any independent variable be dropped from the model because it has little or no
effect on the dependent one ?
4. Should any independent variables not yet included in the model be considered
for possible inclusion ?
If the independent variables included in the model are uncorrelated among themselves
and uncorrelated with any other independent variables that are related to the dependent
variable but omitted from the model, relatively simple answers can be given.
Unfortunately, in many situations the independent variables tend to be correlated
among themselves and with other variables that are related to the dependent variable but
are not included in the model. When the independent variables are correlated among
themselves, intercorrelation or multicollinearity among them is said to exist.

Example of uncorrelated independent variables.

The table below contains data from a small-scale experiment on the effect of crew size (X_1)
and level of bonus pay (X_2) on crew productivity score (Y). It is easy to show that X_1
and X_2 are uncorrelated here, that is r²_12 = 0, where r²_12 denotes the coefficient of
simple determination between X_1 and X_2. We will use the notation SSR(X_1, X_2)
and SSE(X_1, X_2) to indicate explicitly that both independent variables are in the model,
SSR(X_1) and SSE(X_1) to show that only the one independent variable X_1 is in the model
(the case of simple linear regression), and SSR(X_2) and SSE(X_2) in the case of X_2 only.
Trial i X i,1 X i,2 Y i
1 4 2 42
2 4 2 39
3 4 3 48
4 4 3 51
5 6 2 49
6 6 2 53
7 6 3 61
8 6 3 60

ANOVA tables I

a) Regression of Y on X_1 and X_2

Ŷ = 0.375 + 5.375 X_1 + 9.250 X_2

Source of variation    SS                         df    MS
Regression             SSR(X_1, X_2) = 402.250     2    MSR(X_1, X_2) = 201.125
Error                  SSE(X_1, X_2) = 17.625      5    MSE(X_1, X_2) = 3.525
Total                  SSTO = 419.875              7

b) Regression of Y on X_1

Ŷ = 23.500 + 5.375 X_1

Source of variation    SS                         df    MS
Regression             SSR(X_1) = 231.125          1    MSR(X_1) = 231.125
Error                  SSE(X_1) = 188.750          6    MSE(X_1) = 31.458
Total                  SSTO = 419.875              7

c) Regression of Y on X_2

Ŷ = 27.250 + 9.250 X_2

Source of variation    SS                         df    MS
Regression             SSR(X_2) = 171.125          1    MSR(X_2) = 171.125
Error                  SSE(X_2) = 248.750          6    MSE(X_2) = 41.458
Total                  SSTO = 419.875              7

An important feature to note is that the regression coefficients for X_1 and
X_2 are the same whether only the given independent variable is included
in the model or both independent variables are included. This is a result of
the two independent variables being uncorrelated.
Another important feature is related to the sums of squares. Note from a)
that the error sum of squares when both X_1 and X_2 are included in the model is
SSE(X_1, X_2) = 17.625. When only X_1 is included in the model, the error sum
of squares is SSE(X_1) = 188.750. We may ascribe the difference:
SSE(X_1) − SSE(X_1, X_2) = 188.750 − 17.625 = 171.125
to the effect of X_2. We shall denote this difference by SSR(X_2 ∣ X_1):
(8.1) SSR(X_2 ∣ X_1) = SSE(X_1) − SSE(X_1, X_2)
In our case this is equal to SSR(X_2). The reason is that X_1 and X_2
are uncorrelated. The story is the same for the other independent variable. Let
(8.2) SSR(X_1 ∣ X_2) = SSE(X_2) − SSE(X_1, X_2)
In our example we have
SSR(X_1 ∣ X_2) = SSE(X_2) − SSE(X_1, X_2) = 248.750 − 17.625 = 231.125
and again it is equal to SSR(X_1).
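The check below is a minimal numerical sketch of this point (it is not part of the original notes): it refits the crew productivity data with numpy's least squares routine and verifies that SSR(X_2 ∣ X_1) = SSE(X_1) − SSE(X_1, X_2) equals SSR(X_2). The helper name sse and the reliance on numpy are illustrative assumptions.

import numpy as np

X1 = np.array([4, 4, 4, 4, 6, 6, 6, 6], dtype=float)
X2 = np.array([2, 2, 3, 3, 2, 2, 3, 3], dtype=float)
Y  = np.array([42, 39, 48, 51, 49, 53, 61, 60], dtype=float)

def sse(design, y):
    # Error sum of squares of y regressed (by least squares) on the given design matrix.
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b
    return resid @ resid

n = len(Y)
ones = np.ones(n)
sse_x1   = sse(np.column_stack([ones, X1]), Y)        # SSE(X1), about 188.750
sse_x1x2 = sse(np.column_stack([ones, X1, X2]), Y)    # SSE(X1, X2), about 17.625
ssto = np.sum((Y - Y.mean())**2)
ssr_x2 = ssto - sse(np.column_stack([ones, X2]), Y)   # SSR(X2), about 171.125

print("SSR(X2 | X1) =", sse_x1 - sse_x1x2)   # equals SSR(X2) because X1 and X2 are uncorrelated
print("SSR(X2)      =", ssr_x2)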

Example of correlated independent variables.

Table III.1 below contains data from a study of the relation of body fat (Y) to triceps
skinfold thickness (X_1) and thigh circumference (X_2), based on a sample of 20 healthy
females.
Table III.1
X_i,1     X_i,2     Y_i
19.50     43.10     11.90
24.70 48.80 22.80
30.70 51.90 18.70
29.80 54.30 20.10
19.10 42.20 12.90
25.60 53.90 21.70
31.40 58.50 27.10
27.90 52.10 25.40
22.10 49.90 21.30
25.50 53.50 19.30
31.10 56.60 25.40
30.40 56.70 27.20
18.70 46.50 11.70
19.70 44.20 17.80
14.60 42.70 12.80
29.50 54.40 23.90
27.70 53.30 22.60
30.20 58.60 25.40
22.70 48.20 14.80
25.20 51.00 21.10
The triceps skinfold thickness (X_1) and thigh circumference (X_2) are highly
correlated, as a scatter plot of the data suggests. The coefficient of simple correlation
between these two variables is equal to 0.92.

ANOVA tables II

a) Regression of Y on X_1 and X_2

Ŷ = −19.174 + 0.224 X_1 + 0.6594 X_2

Source of variation    SS                        df    MS
Regression             SSR(X_1, X_2) = 385.44     2    MSR(X_1, X_2) = 192.72
Error                  SSE(X_1, X_2) = 109.95    17    MSE(X_1, X_2) = 6.47
Total                  SSTO = 495.39             19

b) Regression of Y on X_1

Ŷ = −1.496 + 0.8572 X_1

Source of variation    SS                        df    MS
Regression             SSR(X_1) = 352.27          1    MSR(X_1) = 352.27
Error                  SSE(X_1) = 143.12         18    MSE(X_1) = 7.95
Total                  SSTO = 495.39             19

c) Regression of Y on X_2

Ŷ = −23.634 + 0.8566 X_2

Source of variation    SS                        df    MS
Regression             SSR(X_2) = 381.97          1    MSR(X_2) = 381.97
Error                  SSE(X_2) = 113.42         18    MSE(X_2) = 6.30
Total                  SSTO = 495.39             19
Note first that the regression coefficient for X_1 is not the same in a) and b).
Thus, the effect ascribed to X_1 by the fitted response function varies here, depending
upon whether X_1 alone or both X_1 and X_2 are included in the model.
The reason is the correlation between X_1 and X_2.

Effect of multicollinearity on regression sums of squares

Note from table II a) that the error sum of squares when both X_1 and X_2
are included in the model is SSE(X_1, X_2) = 109.95. When only X_2 is included
in the model, the error sum of squares is SSE(X_2) = 113.42, as seen from table II c).
Using (8.2) we obtain
SSR(X_1 ∣ X_2) = SSE(X_2) − SSE(X_1, X_2) = 3.47
When we fit a regression function containing only X_1, we also obtain a measure of
the reduction in the variation of Y associated with X_1, namely SSR(X_1). For our example,
table II b) indicates that SSR(X_1) = 352.27, which is not the same as
SSR(X_1 ∣ X_2) = 3.47. The reason for the large difference is the high positive
correlation between X_1 and X_2.
The story is the same for the other independent variable.
The important conclusion is: when the independent (explanatory) variables are
correlated, there is no unique sum of squares which can be ascribed to an independent
variable as reflecting its effect in reducing the total variation in Y. The reduction
in the total variation ascribed to an independent variable must be viewed in the
context of the other independent variables included in the model whenever the
independent variables are correlated.
The terms SSRX 1 ∣ X 2  and SSRX 2 ∣ X 1  are called extra sums of squares,
since they indicate the additional or extra reduction in the error sum of squares
achieved by introducing an additional independent variable.

Simultaneous test on regression coefficients.

Let us consider the first-order regression model with two independent variables:
(8.4) Y_i = β_o + β_1 X_i,1 + β_2 X_i,2 + ε_i        (full model)
If the test on β_1 indicates that it is zero, the regression model (8.4) would become:
Y_i = β_o + β_2 X_i,2 + ε_i
If the test on β_2 indicates that it is zero, the regression model (8.4) would become:
Y_i = β_o + β_1 X_i,1 + ε_i
However, if the separate tests indicate that β_1 = 0 and β_2 = 0, that does not
necessarily imply that:
Y_i = β_o + ε_i
since neither of the tests considers the alternative that not both β_1 and β_2 equal zero.
The proper test for the existence of a regression relation:
H_o: β_1 = β_2 = 0
H_a: not both β_1 and β_2 equal to zero
is the F test of (7.30)
F* = MSR / MSE

Note: It is possible that a set of independent variables is related to the dependent
variable, yet all of the individual tests on the regression coefficients lead to the
conclusion that they are equal to zero.

Decomposition of SSR into extra sums of squares.

We defined the extra sum of squares SSR(X_1 ∣ X_2) in (8.2):
(8.7a) SSR(X_1 ∣ X_2) = SSE(X_2) − SSE(X_1, X_2)
Likewise, we defined in (8.1):
(8.7b) SSR(X_2 ∣ X_1) = SSE(X_1) − SSE(X_1, X_2)
These extra sums of squares reflect the reduction in the error sum of squares obtained by adding
an independent variable to the model, given that another independent variable is already
in the model.
Any reduction in the error sum of squares is, of course, equal to the same increase in
the regression sum of squares, since always:
SSTO = SSR + SSE, so SSE = SSTO − SSR.
Hence, an extra sum of squares can also be thought of as the increase in the regression
sum of squares achieved by introducing the new variable. We can state
(8.8a) SSR(X_1 ∣ X_2) = SSR(X_1, X_2) − SSR(X_2)
and
(8.8b) SSR(X_2 ∣ X_1) = SSR(X_1, X_2) − SSR(X_1)
Proof:
SSR(X_1 ∣ X_2) = SSE(X_2) − SSE(X_1, X_2) = [SSTO − SSR(X_2)] − [SSTO − SSR(X_1, X_2)]
             = SSR(X_1, X_2) − SSR(X_2)
The same argument applies to (8.8b).
Extension to three or more X variables is straightforward. We can define:
(8.9) SSR(X_3 ∣ X_1, X_2) = SSE(X_1, X_2) − SSE(X_1, X_2, X_3)
SSR(X_3 ∣ X_1, X_2) measures the reduction in the error sum of squares which is
achieved by introducing X_3 into the regression model when X_1 and X_2 are already in
the model.

Decomposition of SSR

We can now obtain a variety of decompositions of the regression sum of
squares SSR. Let us consider the case of three X variables (X_1, X_2, X_3). We begin
with the following equality for the variable X_1:
(8.10) SSTO = SSR(X_1) + SSE(X_1)
where the notation now shows explicitly that X_1 is in the model (Y_i = β_o + β_1 X_i,1 + ε_i).
From (8.7b) we have:
SSE(X_1) = SSR(X_2 ∣ X_1) + SSE(X_1, X_2)
Using this in (8.10) one gets:
(8.10a) SSTO = SSR(X_1) + SSR(X_2 ∣ X_1) + SSE(X_1, X_2)
From (8.9) we have
SSE(X_1, X_2) = SSR(X_3 ∣ X_1, X_2) + SSE(X_1, X_2, X_3)
and hence
(8.10b) SSTO = SSR(X_1) + SSR(X_2 ∣ X_1) + SSR(X_3 ∣ X_1, X_2) + SSE(X_1, X_2, X_3)
For multiple regression with three independent variables, using our notation, we
can write the equivalent of (8.10):
(8.11) SSTO = SSR(X_1, X_2, X_3) + SSE(X_1, X_2, X_3)
Hence
(8.12) SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2 ∣ X_1) + SSR(X_3 ∣ X_1, X_2)
Thus, the regression sum of squares has been decomposed into marginal components,
each associated with one degree of freedom. Of course, the order of the independent
variables is arbitrary. For instance:
(8.13) SSR(X_1, X_2, X_3) = SSR(X_3) + SSR(X_1 ∣ X_3) + SSR(X_2 ∣ X_1, X_3)
We can define extra sums of squares for two or more independent variables at a time
and obtain still other decompositions. We can define
(8.14) SSR(X_2, X_3 ∣ X_1) = SSE(X_1) − SSE(X_1, X_2, X_3)
Thus, SSR(X_2, X_3 ∣ X_1) represents the reduction in the error sum of
squares which is achieved by introducing X_2 and X_3 into the regression model already
containing X_1. There are two degrees of freedom associated with SSR(X_2, X_3 ∣ X_1),
and also
(8.14a) SSR(X_2, X_3 ∣ X_1) = SSR(X_2 ∣ X_1) + SSR(X_3 ∣ X_1, X_2)
With SSR(X_2, X_3 ∣ X_1) we can make use of the decomposition
(8.15) SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2, X_3 ∣ X_1).

Uses of extra sums of squares for tests concerning regression coefficients.

ANOVA table with decomposition of SSR for three independent variables:

Source of variation    SS                      df      MS
Regression             SSR(X_1, X_2, X_3)       3      MSR(X_1, X_2, X_3)
  X_1                  SSR(X_1)                 1      MSR(X_1)
  X_2 ∣ X_1            SSR(X_2 ∣ X_1)           1      MSR(X_2 ∣ X_1)
  X_3 ∣ X_1, X_2       SSR(X_3 ∣ X_1, X_2)      1      MSR(X_3 ∣ X_1, X_2)
Error                  SSE(X_1, X_2, X_3)     n − 4    MSE(X_1, X_2, X_3)
Total                  SSTO                   n − 1

In the case of two independent variables we have:

Source of variation    SS                      df      MS
Regression             SSR(X_1, X_2)            2      MSR(X_1, X_2)
  X_1                  SSR(X_1)                 1      MSR(X_1)
  X_2 ∣ X_1            SSR(X_2 ∣ X_1)           1      MSR(X_2 ∣ X_1)
Error                  SSE(X_1, X_2)          n − 3    MSE(X_1, X_2)
Total                  SSTO                   n − 1

Coefficients of partial determination

A natural measure of the effect of X in reducing the variation in Y (the uncertainty
in predicting Y) is
r² = (SSTO − SSE)/SSTO = SSR/SSTO = 1 − SSE/SSTO
The measure r² is called the coefficient of determination.

A coefficient of partial determination measures the marginal contribution of one
X variable when all the others are already included in the model.

Two independent variables.

Let us consider the first-order multiple regression model with two independent variables
Y_i = β_o + β_1 X_i,1 + β_2 X_i,2 + ε_i.
SSE(X_2) measures the variation in Y when only X_2 is included in the model. SSE(X_1, X_2)
measures the variation in Y when both X_1 and X_2 are included in the model. Hence the
relative marginal reduction in the variation in Y associated with X_1, when X_2 is already
in the model, is:
[SSE(X_2) − SSE(X_1, X_2)] / SSE(X_2)
This measure is the coefficient of partial determination between Y and X_1, given
that X_2 is in the model. We denote it by r²_Y1.2:
(8.16) r²_Y1.2 = [SSE(X_2) − SSE(X_1, X_2)] / SSE(X_2) = 1 − SSE(X_1, X_2)/SSE(X_2)
Using (8.7a) we get
(8.16a) r²_Y1.2 = SSR(X_1 ∣ X_2) / SSE(X_2)
The coefficient of partial determination between Y and X_2, given that X_1 is in the
model, is defined by
(8.17) r²_Y2.1 = [SSE(X_1) − SSE(X_1, X_2)] / SSE(X_1) = 1 − SSE(X_1, X_2)/SSE(X_1) = SSR(X_2 ∣ X_1) / SSE(X_1)
For our example given in table III.1 we have:
r²_Y1.2 = (113.42 − 109.95)/113.42 = 0.03059
and
r²_Y2.1 = (143.12 − 109.95)/143.12 = 0.23176

General case

The generalization of the coefficient of partial determination to three or more
independent variables in the model is as follows:
(8.18a) r²_Y1.2,3 = SSR(X_1 ∣ X_2, X_3) / SSE(X_2, X_3)     (X_1 to be added when X_2 and X_3 are in the model)
(8.18b) r²_Y2.1,3 = SSR(X_2 ∣ X_1, X_3) / SSE(X_1, X_3)
(8.18c) r²_Y3.1,2 = SSR(X_3 ∣ X_1, X_2) / SSE(X_1, X_2)
Note that in the subscripts to r², the entries to the left of the dot show, in turn,
the variable taken as the response (dependent) variable and then the variable being added.
The entries to the right of the dot show the variables already in the model.
The coefficient of partial determination can take values between 0 and 1.
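The following hypothetical helper (a sketch, not from the notes) computes a coefficient of partial determination directly from data by fitting the two regressions in its definition. The names sse and partial_r2, and the use of numpy, are assumptions made here for illustration.

import numpy as np

def sse(X, y):
    # Error sum of squares of y regressed on an intercept plus the columns of X.
    design = np.hstack([np.ones((len(y), 1)), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ b
    return r @ r

def partial_r2(X, y, k):
    # Coefficient of partial determination between y and column k of X,
    # given all the other columns of X: SSR(Xk | rest) / SSE(rest).
    others = np.delete(X, k, axis=1)
    return (sse(others, y) - sse(X, y)) / sse(others, y)

# With the table III.1 figures quoted above, (113.42 - 109.95)/113.42 is about 0.031,
# which is what partial_r2 would return for X_1 given X_2 on that data set.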

Coefficient of partial correlation.

The square root of a coefficient of partial determination is called the
coefficient of partial correlation and is denoted by r_Y1.2, r_Y1.2,3, ...,
depending on the model.

TESTING HYPOTHESES CONCERNING REGRESSION COEFFICIENTS
IN MULTIPLE REGRESSION

We have already discussed how to conduct two types of tests concerning the
regression coefficients in a multiple regression model. We will summarize these
tests and then take up some additional types of tests.

Test whether all β_k = 0

This is the overall F test (7.30).
The hypotheses are
(8.21) H_o: β_1 = β_2 = ... = β_p−1 = 0
       H_a: not all β_k (k = 1, 2, ..., p − 1) equal 0
and the test statistic is
(8.22) F* = [SSR(X_1, ..., X_p−1)/(p − 1)] ÷ [SSE(X_1, ..., X_p−1)/(n − p)] = MSR/MSE
Decision rule:
If F* ≤ F(1 − α; p − 1, n − p), conclude H_o
If F* > F(1 − α; p − 1, n − p), conclude H_a

Test whether a single β_k = 0

This is the partial F test.
The hypotheses are
(8.23) H_o: β_k = 0
       H_a: β_k ≠ 0
and the test statistic is
(8.24) F* = [SSR(X_k ∣ X_1, ..., X_k−1, X_k+1, ..., X_p−1)/1] ÷ [SSE(X_1, ..., X_p−1)/(n − p)]
          = MSR(X_k ∣ X_1, ..., X_k−1, X_k+1, ..., X_p−1) / MSE
Decision rule:
If F* ≤ F(1 − α; 1, n − p), conclude H_o
If F* > F(1 − α; 1, n − p), conclude H_a
An equivalent test statistic is
(8.25) t* = b_k / s(b_k)
and the decision rule (7.42c) is:
If |t*| ≤ t(1 − α/2; n − p), conclude H_o
If |t*| > t(1 − α/2; n − p), conclude H_a

Test whether some regression coefficients equal zero.

Sometimes we wish to determine whether or not some of the regression coefficients equal
zero. The approach is that of the general linear test.
For the general multiple regression model:
(8.29) Y_i = β_o + β_1 X_i,1 + ... + β_p−1 X_i,p−1 + ε_i
we wish to test
(8.30) H_o: β_q = β_q+1 = ... = β_p−1 = 0
       H_a: not all β_k (k = q, q + 1, ..., p − 1) equal 0
where, for convenience, we arrange the model so that the last p − q coefficients
are the ones to be tested. We first fit the full model and obtain SSE(X_1, ..., X_p−1).
Then we fit the reduced model:
(8.31) Y_i = β_o + β_1 X_i,1 + ... + β_q−1 X_i,q−1 + ε_i        (reduced model)
and obtain SSE(X_1, ..., X_q−1). Finally we use the general linear test statistic
(8.32) F* = {[SSE(X_1, ..., X_q−1) − SSE(X_1, ..., X_p−1)] / [(n − q) − (n − p)]} ÷ [SSE(X_1, ..., X_p−1)/(n − p)]
or equivalently
(8.32a) F* = [SSR(X_q, X_q+1, ..., X_p−1 ∣ X_1, ..., X_q−1)/(p − q)] ÷ MSE(X_1, ..., X_p−1)
Decision rule:
If F* ≤ F(1 − α; p − q, n − p), conclude H_o
If F* > F(1 − α; p − q, n − p), conclude H_a

Correlation transformation.

Least squares results can be sensitive to rounding of the data in intermediate stages of
the calculations. Most of these errors arise when the inverse of X′X is
calculated. Some variables have substantially different magnitudes, so that the entries in
the X′X matrix may cover a wide range, say from 15 to 49 000 000. A remedy for this
condition is to transform the variables and thereby reparameterize the regression model.
The transformation which we consider is called the correlation transformation.
It makes all entries in the X′X matrix for the transformed variables fall between
−1 and 1 inclusive. We shall illustrate the correlation transformation for the case
of two independent variables. The basic regression model we assume is:
(11.1) Y_i = β_o + β_1 X_i,1 + β_2 X_i,2 + ε_i
The first step is to use the deviations X_i,1 − X̄_1 instead of X_i,1 and X_i,2 − X̄_2 instead of X_i,2.
To use deviations we have to rewrite model (11.1):
Y_i = (β_o + β_1 X̄_1 + β_2 X̄_2) + β_1 (X_i,1 − X̄_1) + β_2 (X_i,2 − X̄_2) + ε_i
or
(11.2) Y_i = β′_o + β_1 (X_i,1 − X̄_1) + β_2 (X_i,2 − X̄_2) + ε_i
where
(11.2a) β′_o = β_o + β_1 X̄_1 + β_2 X̄_2
It can be shown that the least squares estimator of β′_o is always Ȳ. Hence, we can
rewrite (11.2) as follows:
(11.3) Y_i − Ȳ = β_1 (X_i,1 − X̄_1) + β_2 (X_i,2 − X̄_2) + ε_i
The second step in developing the correlation transformation is to express
each deviation in units of its standard deviation:
(11.4) (Y_i − Ȳ)/s_Y,   (X_i,1 − X̄_1)/s_1,   (X_i,2 − X̄_2)/s_2
where
(11.5a) s_Y = √[ Σ(Y_i − Ȳ)² / (n − 1) ]
(11.5b) s_1 = √[ Σ(X_i,1 − X̄_1)² / (n − 1) ]
(11.5c) s_2 = √[ Σ(X_i,2 − X̄_2)² / (n − 1) ]
The final step in obtaining the correlation transformation is to use the following
functions of the standardized variables in (11.4):
(11.6a) Y′_i = (1/√(n − 1)) (Y_i − Ȳ)/s_Y
(11.6b) X′_i,1 = (1/√(n − 1)) (X_i,1 − X̄_1)/s_1
(11.6c) X′_i,2 = (1/√(n − 1)) (X_i,2 − X̄_2)/s_2
The regression model with the transformed variables Y′, X′_1 and X′_2 is a simple
extension of model (11.3):
(11.7) Y′_i = β′_1 X′_i,1 + β′_2 X′_i,2 + ε′_i
It is easy to show that the new parameters β′_1 and β′_2 and the original parameters
β_o, β_1 and β_2 in (11.1) are related as follows:
(11.8a) β_1 = (s_Y/s_1) β′_1
(11.8b) β_2 = (s_Y/s_2) β′_2
(11.8c) β_o = Ȳ − β_1 X̄_1 − β_2 X̄_2
Thus, the new regression coefficients β′_1 and β′_2 and the original regression
coefficients β_1 and β_2 are related by simple scaling factors involving ratios
of standard deviations.

X′X matrix for transformed variables

The X matrix for the transformed variables in model (11.7) is

        X′_1,1   X′_1,2
        X′_2,1   X′_2,2
X =       :        :
        X′_n,1   X′_n,2

The X′X matrix is:

(11.9) X′X = | ΣX′_i,1²          Σ X′_i,1 X′_i,2 |
             | Σ X′_i,1 X′_i,2   ΣX′_i,2²        |

Let us consider the elements of this matrix. First, we have
ΣX′_i,1² = Σ [ (X_i,1 − X̄_1) / (√(n − 1) s_1) ]² = [ Σ(X_i,1 − X̄_1)² / (n − 1) ] / s_1² = 1
Similarly:
ΣX′_i,2² = 1
Finally:
Σ X′_i,1 X′_i,2 = Σ (X_i,1 − X̄_1)(X_i,2 − X̄_2) / [ (n − 1) s_1 s_2 ]
             = Σ(X_i,1 − X̄_1)(X_i,2 − X̄_2) / { [Σ(X_i,1 − X̄_1)²]^1/2 [Σ(X_i,2 − X̄_2)²]^1/2 } = r_1,2
where r_1,2 is the coefficient of correlation between X_1 and X_2 by (3.73).
Therefore the X′X matrix for the transformed variables, denoted by r_XX, is
(11.10) r_XX = X′X = | 1       r_1,2 |
                     | r_1,2   1     |
and is called the correlation matrix of the independent variables.
Hence
(11.11) r_XX⁻¹ = [1/(1 − r_1,2²)] | 1        −r_1,2 |
                                  | −r_1,2    1     |
and
(11.12) X′Y = | r_Y,1 |
              | r_Y,2 |
where r_Y,1 and r_Y,2 are the coefficients of correlation between Y and X_1
and between Y and X_2 respectively. Hence, the estimated regression coefficients
for the reparameterized model (11.7) are
(11.13) b′ = | b′_1 | = [1/(1 − r_1,2²)] | 1        −r_1,2 | | r_Y,1 |
             | b′_2 |                    | −r_1,2    1     | | r_Y,2 |
           = [1/(1 − r_1,2²)] | r_Y,1 − r_1,2 r_Y,2 |
                              | r_Y,2 − r_1,2 r_Y,1 |
The return to the estimated regression coefficients for the original model
is accomplished by employing the relations in (11.8):
(11.14a) b_1 = (s_Y/s_1) b′_1
(11.14b) b_2 = (s_Y/s_2) b′_2
(11.14c) b_o = Ȳ − b_1 X̄_1 − b_2 X̄_2
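The sketch below (not from the notes; numpy assumed, simulated data) applies the correlation transformation (11.6) to a two-predictor data set, fits the no-intercept model (11.7) by least squares, and recovers b_0, b_1 and b_2 via (11.14).

import numpy as np

rng = np.random.default_rng(2)
n = 25
X1 = rng.normal(10, 2, n)
X2 = 0.8*X1 + rng.normal(0, 1, n)          # deliberately correlated predictors
Y = 5 + 1.2*X1 - 0.7*X2 + rng.normal(0, 1, n)

def transform(v):
    # Correlation transformation (11.6): centre, divide by s, and scale by 1/sqrt(n-1).
    return (v - v.mean()) / (np.sqrt(n - 1) * v.std(ddof=1))

Yp, X1p, X2p = transform(Y), transform(X1), transform(X2)

Xp = np.column_stack([X1p, X2p])           # design for model (11.7), no intercept column
bp, *_ = np.linalg.lstsq(Xp, Yp, rcond=None)
b1 = (Y.std(ddof=1) / X1.std(ddof=1)) * bp[0]
b2 = (Y.std(ddof=1) / X2.std(ddof=1)) * bp[1]
b0 = Y.mean() - b1*X1.mean() - b2*X2.mean()
print(b0, b1, b2)   # agrees with an ordinary least squares fit of Y on X1 and X2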

Variance inflation factors and other methods of detecting multicollinearity

Informal methods.

Indications of the presence of serious multicollinearity are given by the following
diagnostics:
1. Large changes in the estimated regression coefficients when a variable is added
or deleted, or when an observation is altered or deleted.
2. Nonsignificant results in individual tests on the regression coefficients for
important independent variables.
3. Estimated regression coefficients with an algebraic sign that is the opposite
of that expected from theoretical considerations or prior experience.
4. Large coefficients of correlation between pairs of independent variables in the
correlation matrix r_XX.
5. Wide confidence intervals for the regression coefficients representing important
independent variables.

Variance inflation factors.

We know that the variance-covariance matrix of the estimated regression
coefficients is:
(11.27) σ²(b) = σ² (X′X)⁻¹
To reduce roundoff errors in calculating (X′X)⁻¹, we noted that it is desirable to
first transform the variables by means of the correlation transformation. The
variance-covariance matrix of the estimated standardized regression coefficients is
(11.28) σ²(b′) = (σ′)² r_XX⁻¹
where r_XX is the matrix of pairwise simple correlation coefficients among
the independent variables, as illustrated in (11.10) for p − 1 = 2 independent
variables, and (σ′)² is the error term variance for the transformed model.
Note from (11.28) that the variance of b′_k (k = 1, ..., p − 1) is equal to the product of
the error term variance (σ′)² and the kth diagonal element of the matrix r_XX⁻¹. This
second factor is called the variance inflation factor (VIF). It can be shown that the
variance inflation factor for b′_k, denoted by VIF_k, is
(11.29) VIF_k = (1 − R_k²)⁻¹     k = 1, 2, ..., p − 1
where R_k² is the coefficient of multiple determination when X_k is regressed on the
p − 2 other X variables in the model. Hence, we have
(11.30) σ²(b′_k) = (σ′)² VIF_k = (σ′)² / (1 − R_k²)
The variance inflation factor VIF_k is equal to 1 when R_k² = 0,
i.e., when X_k is not linearly related to the other X variables. When R_k² ≠ 0,
VIF_k is greater than 1, indicating an inflated variance for b′_k.

Identification of outlying observations.

Frequently in regression applications the data set contains some
observations which are outlying or extreme, i.e., observations which are well
separated from the remainder of the data. These outlying observations may
involve large residuals and often have a dramatic effect on the fitted least squares
regression function.

Use of the hat matrix for identifying outlying X observations.

We know (6.93) that the least squares residuals can be expressed as a linear
combination of the observations Y_i by means of the hat matrix:
(11.44) e = (I − H)Y
The hat matrix H is given by (6.93a):
(11.45) H = X(X′X)⁻¹X′
Similarly, the fitted values Ŷ_i can be expressed as linear combinations of the
observations Y_i by:
(11.46) Ŷ = HY
Further, we noted that
(11.47) σ²(e) = σ²(I − H)
so that the variance of the residual e_i, denoted by σ²(e_i), is:
(11.48) σ²(e_i) = σ²(1 − h_i,i)
where h_i,i is the ith element of the main diagonal of the hat
matrix and is equal to:
(11.49) h_i,i = X_i′(X′X)⁻¹X_i
where X_i is the column vector
(11.49a) X_i = (1, X_i,1, X_i,2, ..., X_i,p−1)′
The diagonal element h_i,i of the hat matrix is called the leverage of the ith observation.
A large leverage value h_i,i indicates that the ith observation is distant from the
center of the X observations. The mean leverage value is
(11.51) h̄ = Σ h_i,i / n = p/n
Hence, leverage values greater than 2p/n are considered by this rule to indicate outlying
observations with regard to the X values.
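The leverage rule can be sketched as follows (simulated data; numpy assumed; the helper name leverages is an assumption made here): compute the diagonal of H and flag observations with h_i,i > 2p/n.

import numpy as np

def leverages(X_design):
    # Diagonal of the hat matrix H = X (X'X)^{-1} X' for a full-rank design matrix.
    XtX_inv = np.linalg.inv(X_design.T @ X_design)
    return np.einsum('ij,jk,ik->i', X_design, XtX_inv, X_design)

rng = np.random.default_rng(4)
n = 20
X1 = rng.normal(25, 5, n)
X2 = rng.normal(50, 5, n)
X1[0], X2[0] = 60, 90                      # one deliberately extreme point in the X space
X_design = np.column_stack([np.ones(n), X1, X2])
h = leverages(X_design)
p = X_design.shape[1]
print(np.where(h > 2*p/n)[0])              # indices of high-leverage observations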

Use of studentized deleted residuals for identifying outlying Y observations.

We know from (11.48) that:
σ²(e_i) = σ²(1 − h_i,i)
An unbiased estimator of this variance is:
(11.54) s²(e_i) = MSE(1 − h_i,i)
The ratio of e_i to s(e_i) is called the studentized residual and will be denoted by e*_i:
(11.55) e*_i = e_i / s(e_i)
Let Ŷ_i(i) denote the fitted value for the ith case obtained from the regression fitted to the
observations excluding the ith one (this observation is deleted). The residual
(11.56) d_i = Y_i − Ŷ_i(i)
is called a deleted residual.
The studentized deleted residual, denoted by d*_i, is:
(11.57) d*_i = d_i / s(d_i)
Fortunately, the studentized deleted residuals d*_i can be calculated without having
to fit the regression function with the ith observation omitted. It can be shown that:
(11.58) d*_i = e_i [ (n − p − 1) / ( SSE(1 − h_i,i) − e_i² ) ]^1/2
and they follow the t distribution with n − p − 1 degrees of freedom.
To identify outlying Y observations, we examine the studentized deleted
residuals for large absolute values and use the appropriate t distribution to
ascertain how far in the tails such outlying values fall.
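A minimal sketch of formula (11.58) on simulated data (not from the notes; numpy assumed, and the helper name studentized_deleted is ours): one Y value is perturbed so that it shows up with the largest studentized deleted residual.

import numpy as np

def studentized_deleted(X_design, y):
    # Studentized deleted residuals via (11.58), without refitting the model n times.
    n, p = X_design.shape
    H = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
    h = np.diag(H)
    e = y - H @ y                          # ordinary residuals, e = (I - H) y
    sse = e @ e
    return e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))

rng = np.random.default_rng(5)
n = 20
x = rng.normal(size=n)
y = 1 + 2*x + rng.normal(size=n)
y[3] += 8                                  # one deliberately outlying Y value
d_star = studentized_deleted(np.column_stack([np.ones(n), x]), y)
print(np.argmax(np.abs(d_star)))           # compare |d*_i| with t(1 - alpha/2; n - p - 1)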

PROBLEMS

Question 1
The following data were obtained in a certain experiment:
i Y i X i,1 X i,2
1 64 4 2
2 73 4 4
3 61 4 2
4 76 4 4
5 72 6 2
6 80 6 4
7 71 6 2
8 83 6 4
9 83 8 2
10 89 8 4
11 86 8 2
12 93 8 4
13 88 10 2
14 95 10 4
15 94 10 2
16 100 10 4
a) Fit the first-order simple linear regression model relating Y to X_1.
State the fitted regression function.
b) Find the estimated regression coefficients for the full model (Y on X_1 and X_2).
c) Does SSR(X_1) equal SSR(X_1 ∣ X_2) here? If not, is the difference substantial?
d) Calculate the coefficient of simple correlation between X_1 and X_2.
The diagonal elements of the hat matrix are:
i      1      2      3      4      5      6      7      8
h_i,i  0.237  0.237  0.237  0.237  0.137  0.137  0.137  0.137
i      9      10     11     12     13     14     15     16
h_i,i  0.137  0.137  0.137  0.137  0.237  0.237  0.237  0.237
e) Identify any outlying X observations using the hat matrix method.
f) Obtain the studentized deleted residuals and identify any outlying Y observations.
Question 2.
The following data were obtained in a certain experiment:
i Y i X i,1 X i,2 i Y i X i,1 X i,2
1 58 7 5.11 11 121 17 11.02
2 152 18 16.72 12 112 12 9.51
3 41 5 3.20 13 50 6 3.79
4 93 14 7.03 14 82 12 6.45
5 101 11 10.98 15 48 8 4.60
6 38 5 4.04 16 127 15 13.86
7 203 23 22.07 17 140 17 13.03
8 78 9 7.03 18 155 21 15.21
9 117 16 10.62 19 39 6 3.64
10 44 5 4.76 20 90 11 9.57
1) Find the estimated regression coefficients for the full model (Y on X_1 and X_2).
2) Fit the first-order simple linear regression model relating Y to X_2.
3) Does SSR(X_2) equal SSR(X_2 ∣ X_1) here? If not, is the difference substantial?
4) Calculate the coefficient of simple correlation between X_1 and X_2.
The diagonal elements of the hat matrix are:
i      1      2      3      4      5      6      7      8      9      10
h_i,i  0.91   0.194  0.131  0.268  0.149  0.141  0.429  0.067  0.135  0.165
i      11     12     13     14     15     16     17     18     19     20
h_i,i  0.179  0.059  0.110  0.156  0.095  0.128  0.97   0.230  0.112  0.073
5) Identify any outlying X observations using the hat matrix method.
6) Obtain the studentized deleted residuals and identify any outlying Y observations.
Question 3.
For a certain experiment the first-order regression model with two independent
variables was used. The calculated diagonal elements of the hat matrix are:
i 1 2 3 4 5 6 7 8
h i,i 0.237 0.237 0.237 0.237 0.137 0.137 0.137 0.137
i 9 10 11 12 13 14 15 16
h i,i 0.137 0.137 0.137 0.137 0.237 0.237 0.237 0.237
1) Describe use of hat matrix for identifying outlying X observations.
2) Identify any outlying X observations using the hat matrix method.

SELECTION OF INDEPENDENT VARIABLES

The all-possible-regressions procedure.

The all-possible-regressions selection procedure calls for an examination of all possible
regression models involving the potential X variables and identifying "good" subsets according
to some criterion. The following criteria can be used:

R²_p criterion
The R²_p criterion calls for an examination of the coefficient of multiple determination R²
in order to select one or several subsets of X variables. We show the number of parameters
in the regression model as a subscript of R². Thus, R²_p indicates that there are p parameters,
or p − 1 predictor variables, in the regression equation on which R²_p is based.
R²_p is a ratio of sums of squares:
(12.1) R²_p = SSR_p/SSTO = 1 − SSE_p/SSTO
and the denominator is constant for all possible regressions, so R²_p varies inversely with the
error sum of squares. But we know that SSE_p can never increase as additional independent
variables are included in the model. Thus, R²_p will be a maximum when all p − 1 potential X
variables are included in the regression model. The reason for using the R²_p criterion with the
all-possible-regressions approach therefore cannot be to maximize R²_p. Rather, the intent
is to find the point where adding more X variables is not worthwhile because it leads to
only a very small increase in R²_p. Often, that point is reached when only a limited number of X
variables is included in the regression model. Clearly, the determination of where
diminishing returns set in is a judgmental one.
Example 12.1. The following table contains the R²_p values for all possible regression models
for a certain data set (54 observations) with 4 potential X variables:
X variables       p    df    SSE_p     R²_p    MSE_p    C_p
none 1 53 3.9728 0 0.0750 1721.6
X1 2 52 3.4960 0.120 0.0672 1510.7
X2 2 52 2.5762 0.352 0.0495 1100.1
X3 2 52 2.2154 0.442 0.0426 939.0
X4 2 52 1.8777 0.527 0.0361 788.3
X1, X2 3 51 2.2324 0.438 0.0438 948.6
X1, X3 3 51 1.4073 0.646 0.0276 580.3
X1, X4 3 51 1.8759 0.528 0.0368 789.5
X2, X3 3 51 0.7431 0.813 0.0146 283.7
X2, X4 3 51 1.3922 0.650 0.0273 573.5
X3, X4 3 51 1.2455 0.687 0.0244 508.0
X1, X2, X3 4 50 0.1099 0.972 0.0022 3.1
X1, X2, X4 4 50 1.3905 0.650 0.0278 574.8
X1, X3, X4 4 50 1.1157 0.719 0.0223 452.1
X2, X3, X4 4 50 0.4653 0.883 0.0093 161.7
X 1 , X 2 , X 3 , X 4 5 49 0.1098 0.972 0.0022 5.0
Using R²_p one notices that the subset (X_1, X_2, X_3) appears to be a reasonable choice,
since there is little increase in R²_p after these three X variables are included
in the model.
MSE_p or R²_a criterion.
Since R²_p does not take account of the number of parameters in the model, and since
max(R²_p) can never decrease as p increases, the use of the adjusted coefficient of multiple
determination R²_a,
(12.2) R²_a = 1 − [(n − 1)/(n − p)] (SSE_p/SSTO) = 1 − MSE_p / [SSTO/(n − 1)],
has been suggested as a criterion which takes the number of parameters in the model into
account through the degrees of freedom. It can be seen from (12.2) that R²_a increases if and
only if MSE_p decreases, since SSTO/(n − 1) is fixed for the given Y observations. Hence, R²_a
and MSE_p are equivalent criteria. We shall consider here the criterion MSE_p. MSE_p
can, indeed, increase as p increases, when the reduction in SSE_p becomes so small that it is
not sufficient to offset the loss of an additional degree of freedom. Users of the MSE_p
criterion either seek the subset of X variables that minimizes MSE_p, or one or several
subsets for which MSE_p is so close to the minimum that adding more variables is not
worthwhile. Using the table from example 12.1 one can see that the subset (X_1, X_2, X_3)
appears to be best using the MSE_p (R²_a) criterion too.
C_p criterion.
This criterion is concerned with the total mean squared error of the n fitted values for each
of the various subset regression models. The mean squared error concept involves a bias
component and a random error component. Here, the mean squared error pertains to the fitted
values for the regression model employed. The bias component for the ith fitted value is:
(12.3) E(Ŷ_i) − E(Y_i)
where E(Ŷ_i) is the expectation of the ith fitted value for the given regression model and E(Y_i)
is the true mean response. The random error component for Ŷ_i is simply σ²(Ŷ_i), its variance.
The mean squared error for Ŷ_i is then the sum of the squared bias and the variance:
(12.4) [E(Ŷ_i) − E(Y_i)]² + σ²(Ŷ_i)
The total mean squared error for all n fitted values is the sum of the n individual mean
squared errors:
(12.5) Σ_i [E(Ŷ_i) − E(Y_i)]² + Σ_i σ²(Ŷ_i)
The criterion measure, denoted by Γ_p, is simply the total mean squared error divided by σ²,
the true error variance:
(12.6) Γ_p = (1/σ²) { Σ_i [E(Ŷ_i) − E(Y_i)]² + Σ_i σ²(Ŷ_i) }
The model which includes all p − 1 potential X variables is assumed to have been carefully
chosen so that MSE(X_1, ..., X_p−1) is an unbiased estimator of σ². It can then be shown
that an estimator of Γ_p is C_p:
(12.7) C_p = SSE_p / MSE(X_1, ..., X_p−1) − (n − 2p)
where SSE_p is the error sum of squares for the fitted subset regression model with p
parameters (i.e., with p − 1 predictor variables).
When there is no bias in the regression model with p − 1 predictor variables, so that
E(Ŷ_i) ≡ E(Y_i), the expected value of C_p is approximately p:
(12.8) E[ C_p ∣ E(Ŷ_i) ≡ E(Y_i) ] ≃ p
Thus, when the C_p values for all possible regression models are plotted against p, those models with
little bias will tend to fall near the line C_p = p. Models with substantial bias will tend to fall
considerably above this line.
In using the C_p criterion, one seeks to identify subsets of X variables for which
(1) the C_p value is small, and
(2) the C_p value is near p.
Sets of X variables with small C_p values have a small total mean squared error, and when the
C_p value is also near p, the bias of the regression model is small. The regression model
based on the subset of X variables with the smallest C_p value may involve substantial bias.
In that case, one may at times prefer a regression model based on a somewhat larger subset
of X variables for which the C_p value is slightly larger but which does not involve a
substantial bias component.
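The three criteria can be computed for every subset with a short script such as the sketch below (simulated data with four candidate predictors; not from the notes; numpy assumed; the full-model MSE is used as the estimator of σ² in C_p).

import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n = 54
X = rng.normal(size=(n, 4))
Y = 1 + 2*X[:, 0] - 1.5*X[:, 1] + 0.8*X[:, 2] + rng.normal(size=n)  # X4 is irrelevant here

def sse(cols):
    # SSE of Y regressed on an intercept plus the listed columns of X.
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(design, Y, rcond=None)
    r = Y - design @ b
    return r @ r

ssto = np.sum((Y - Y.mean())**2)
mse_full = sse([0, 1, 2, 3]) / (n - 5)      # full model with p = 5 parameters

for k in range(5):
    for cols in combinations(range(4), k):
        p = k + 1                           # parameters: intercept plus k predictors
        sse_p = sse(list(cols))
        r2_p = 1 - sse_p / ssto
        mse_p = sse_p / (n - p)
        c_p = sse_p / mse_full - (n - 2*p)
        print(cols, round(r2_p, 3), round(mse_p, 4), round(c_p, 1))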
Identification of "best" subsets by use of an algorithm.
In those occasional cases when the pool of potential X variables contains many variables,
an automatic search procedure that develops the subset of X variables to be included
in the regression model sequentially may be helpful. Such procedures were developed
to economize on computational effort, as compared with the all-possible-regressions approach,
while arriving at a reasonably good subset of independent variables.
Stepwise Regression.
Stepwise regression uses t statistics (and the related p-values) to determine the importance
(or significance) of the independent (explanatory) variables in various regression models.
In this context the t statistic indicates that an independent variable is significant at the
α level if and only if the related p-value is less than α. This implies that we
can reject H_o: β_j = 0 in favor of H_a: β_j ≠ 0 with the probability of a Type I error equal
to α. Before beginning the stepwise procedure we choose a value of α_entry, which we call
"the probability of a Type I error related to entering an independent variable into the regression
model". We also choose α_stay, which we call "the probability of a Type I error related to
retaining an independent variable that was previously entered into the model".
The SAS default values are α_entry = 0.5 and α_stay = 0.15.
The stepwise regression is then performed as follows:
Step 1. The stepwise procedure considers all possible one-independent-variable regression
models of the form
y = β_o + β_1 x_j + ε
Each model includes a different potential independent variable. For each model
the t statistic (and p-value) related to testing H_o: β_1 = 0 versus H_a: β_1 ≠ 0
is calculated. If the largest absolute t statistic is significant (that is, the
corresponding smallest p-value < α_entry), the corresponding variable, say X_1, is included in
the model; we then consider the model
y = β_o + β_1 x_1 + ε
and report for step one: variable entered.
If the t statistic does not indicate that X_1 is significant at the α_entry level, then
the stepwise procedure terminates by choosing the model
y = β_o + ε
Step 2. The stepwise procedure considers all possible two-independent-variable models
(with the one variable already selected in the first step). For each model the t statistic related
to testing H_o: β_2 = 0 versus H_a: β_2 ≠ 0 is calculated. The variable with the largest
absolute t statistic is included if it is significant (corresponding p-value < α_entry).
Further steps. The stepwise procedure continues by adding independent variables one at
a time: the candidate with the largest t statistic (in absolute value) is added if and only if that
t statistic is significant (the corresponding p-value < α_entry). After adding an independent variable the stepwise
procedure checks all the independent variables already included in the model and removes
any that is not significant at the α_stay level. The stepwise procedure terminates when all
independent variables not in the model are insignificant at the α_entry level or when the variable to
be added to the model is the one just removed from it.
Notice: in some packages the F statistic (t²) is reported instead of the t statistic.
MAXR SAS PROCEDURE.
This method does not settle on a single model. Instead, it looks for the "best"
one-variable model, the "best" two-variable model, and so forth.
The MAXR method begins by finding the one-variable model that produces
the highest R². Then the variable that would yield the greatest increase in
R² is added. Once the two-variable model is obtained, each variable in the model is compared with each
variable not in the model. For each comparison, MAXR determines whether removing one variable
and replacing it with the other would increase R². After comparing all possible
switches, the one that produces the largest increase in R² is made.
Comparisons then begin again, and the procedure continues until no switch could increase R².
The two-variable model thus achieved is considered the "best" two-variable model the
technique can find. Another variable is then added to the model, and the comparing and
switching process is repeated to find the "best" three-variable model, and so forth.
The difference between the stepwise technique and the maximum R improvement (MAXR) method is that
all switches are evaluated before any switch is made in the MAXR method. In the stepwise
method, the "worst" variable may be removed without considering what adding the "best"
remaining variable might accomplish.

AUTOCORRELATION IN TIME SERIES DATA

The basic regression models considered so far have assumed that the random
error terms ε_i are either uncorrelated random variables or independent normal
random variables. Many regression applications involve time series data. For such
data, the assumption of uncorrelated or independent error terms is often not
appropriate. Error terms that are correlated over time are said to be autocorrelated or
serially correlated.

First-order autoregressive error model.

Simple linear regression.

The simple linear regression model with the random error terms following
a first-order autoregressive process is:
(13.1) Y_t = β_o + β_1 X_t + ε_t
       ε_t = ρ ε_t−1 + u_t
where:
ρ is a parameter such that |ρ| < 1
u_t are independent N(0, σ²)
Each error term in model (13.1) consists of a fraction of the previous error term
plus a new disturbance term u_t. The parameter ρ is called the autocorrelation parameter.
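A minimal simulation sketch of model (13.1) may make the error structure concrete (it is not from the notes; the parameter values are invented and numpy is assumed available).

import numpy as np

rng = np.random.default_rng(7)
n, rho, sigma = 100, 0.7, 1.0
u = rng.normal(0, sigma, n)                # independent disturbances u_t
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + u[t]       # epsilon_t = rho * epsilon_{t-1} + u_t
x = np.linspace(0, 10, n)
y = 2 + 0.5 * x + eps                      # Y_t = beta_0 + beta_1 X_t + epsilon_t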

Multiple regression

The multiple regression model with the random errors following a first-order
autoregressive process is:
(13.2) Y_t = β_o + β_1 X_t,1 + ... + β_p−1 X_t,p−1 + ε_t
       ε_t = ρ ε_t−1 + u_t
where:
ρ is a parameter such that |ρ| < 1
u_t are independent N(0, σ²)

Durbin-Watson test for autocorrelation

The Durbin-Watson test assumes the first-order autoregressive error model (13.1)
or (13.2). Because correlated errors in applications tend to show positive serial
correlation, the usual test alternatives are:
(13.3) H_o: ρ = 0
       H_a: ρ > 0
The test statistic is
(13.4) D = Σ_{t=2..n} (e_t − e_t−1)² / Σ_{t=1..n} e_t²
where n is the number of observations.
The critical values d_L and d_U are given in tables, with the decision rule:
       If D > d_U, conclude H_o
(13.5) If D < d_L, conclude H_a
       If d_L ≤ D ≤ d_U, the test is inconclusive
PROBLEMS

Question 1
For the following data:
t 1 2 3 4 5 6 7 8
X t 2.052 2.026 2.002 1.949 1.942 1.887 1.986 2.053
Y t 102.9 101.5 100.8 98.0 97.3 93.5 97.5 102.2
t 9 10 11 12 13 14 15 16
X t 2.102 2.113 2.058 2.060 2.035 2.080 2.102 2.150
Y t 105.0 107.2 105.1 103.9 103.0 104.8 105.0 107.2
1) Fit a simple linear regression line.
2) Conduct a formal test for positive autocorrelation using   0. 05.
Question 2.
The fitted values and residuals of a regression analysis are given below:
i      1      2      3      4      5      6      7
Ŷ_i    2.92   2.33   2.25   1.58   2.08   3.51   3.34
e_i    0.18  -0.03   0.75   0.32   0.42   0.19   0.06
i      8      9      10     11     12     13     14     15
Ŷ_i    2.42   2.84   2.50   3.59   2.16   1.91   2.50   3.26
e_i   -0.42   0.06  -0.20  -0.39  -0.36  -0.51  -0.50   0.54
Assume that the simple linear regression model with the random
terms following a first-order autoregressive process is appropriate.
Conduct a formal test for positive autocorrelation using   0. 05.
Question 3
The following data were obtained in a certain experiment:
i    Y_i    X_i,1    X_i,2
1 64 4 2
2 73 4 4
3 61 4 2
4 76 4 4
5 72 6 2
6 80 6 4
7 71 6 2
8 83 6 4
9 83 8 2
10 89 8 4
11 86 8 2
12 93 8 4
13 88 10 2
14 95 10 4
15 94 10 2
16 100 10 4
The data summary is given below in matrix form:

        | 16   112    48 |                  |  99/80   −7/80   −3/16 |
X′X  =  | 112  864   336 |      (X′X)⁻¹  =  | −7/80     1/80    0    |
        | 48   336   160 |                  | −3/16     0       1/16 |

        | 1308 |
X′Y  =  | 9510 |          Y′Y = 108896
        | 3994 |
Assume that the first-order regression model with independent normal errors is appropriate.
1) Find the estimated regression coefficients.
2) Obtain an ANOVA table and use it to test whether there is a regression
relation using α = 0.05.
3) Estimate β_1 and β_2 jointly by the Bonferroni procedure using an
80 percent family confidence coefficient.
Standard normal reference values:
(1/√(2π)) ∫ from −1 to 1 exp(−x²/2) dx = 0.68269
(1/√(2π)) ∫ from −2 to 2 exp(−x²/2) dx = 0.9545
(1/√(2π)) ∫ from −3 to 3 exp(−x²/2) dx = 0.9973
(1/√(2π)) ∫ from −4 to 4 exp(−x²/2) dx = 0.99994