Overview of Classical Linear Regression
Dr Raju Chinthalapati
Email: v.l.r.chinthalapati@soton.ac.uk
Office hours: Wednesday (15.00-16.00), Thursday (14.00-15.00), Friday (13.00-14.00) in 2/3063
Learning Objectives
• To understand what regression analysis can do.
What is regression?
• Regression is probably the single most important tool at the
econometrician’s disposal.
• We have some intuition that the beta on this fund is positive, and we therefore want to find whether there appears to be a relationship between X and Y given the data that we have. The first stage would be to form a scatter plot of the two variables.
Graph (Scatter Diagram)
[Scatter plot: excess return on fund XXX (vertical axis, 0 to 45) against excess return on the market portfolio (horizontal axis, 0 to 25).]
How to model this?
• We need to use what we call the Line of Best Fit.
• We can use the general equation for a straight line to get the line
that best “fits” the data.
y = a + bx
Determining the Regression Coefficients
• So how do we determine what α and β are?
Ordinary Least Squares
• The most common method used to fit a line to the data is
known as OLS (ordinary least squares).
[Diagram: for a given $x_i$, the actual value $y_i$, the fitted value $\hat{y}_i$ on the line, and the residual $\hat{u}_i = y_i - \hat{y}_i$ shown as the vertical distance between them.]
How does OLS work?
• We choose the fitted line so as to minimise $e_1^2 + e_2^2 + e_3^2 + e_4^2 + e_5^2$, i.e. to minimise $\sum_{t=1}^{5} e_t^2$ (for a sample of five observations here). This quantity is known as the residual sum of squares, or RSS.
Simple regression model: The Mechanism
• To begin with, we will draw the fitted line so as to minimize
the sum of the squares of the residuals, RSS. This is
described as the least squares criterion.
$$RSS = \sum_{i=1}^{n} e_i^2 = e_1^2 + \dots + e_n^2$$
• Can we just minimize the following instead?
$$\sum_{i=1}^{n} e_i = e_1 + \dots + e_n$$
No: positive and negative residuals would cancel out, so even a poorly fitting line could produce a very small sum. Squaring the residuals before summing avoids this.
Deriving linear regression coefficients
• This sequence shows how the regression coefficients for a simple
regression model are derived.
[Scatter diagram of the three observations (1, 3), (2, 5) and (3, 6).]
Deriving linear regression coefficients
• We will determine the values of b1 and b2 that minimize
RSS:
Deriving linear regression coefficients
• Given our choice of b1 and b2, the residuals for the three observations are $e_1 = 3 - b_1 - b_2$, $e_2 = 5 - b_1 - 2b_2$ and $e_3 = 6 - b_1 - 3b_2$.
Deriving linear regression coefficients
• For a minimum, the partial derivatives of RSS with respect to b1
and b2 should be zero. (We should also check a second-order
condition.)
$$RSS = e_1^2 + e_2^2 + e_3^2 = (3 - b_1 - b_2)^2 + (5 - b_1 - 2b_2)^2 + (6 - b_1 - 3b_2)^2$$
$$= 70 + 3b_1^2 + 14b_2^2 - 28b_1 - 62b_2 + 12b_1b_2$$
$$\frac{\partial RSS}{\partial b_1} = 6b_1 + 12b_2 - 28 = 0, \qquad \frac{\partial RSS}{\partial b_2} = 12b_1 + 28b_2 - 62 = 0$$
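Solving this pair of first-order conditions completes the three-point example (a worked step added here for clarity; it is not shown on the slide):
$$6b_1 + 12b_2 = 28, \quad 12b_1 + 28b_2 = 62 \;\;\Rightarrow\;\; b_2 = \tfrac{3}{2} = 1.5, \quad b_1 = \tfrac{5}{3} \approx 1.67,$$
so the fitted line is $\hat{Y} = 1.67 + 1.5X$.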
Deriving linear regression coefficients
• When we establish the expression for RSS, we do so as a function
of b1 and b2. At this stage, b1 and b2 are not specific values.
Our task is to determine the particular values that minimize
RSS.
$$RSS = e_1^2 + e_2^2 + e_3^2 = (3 - b_1 - b_2)^2 + (5 - b_1 - 2b_2)^2 + (6 - b_1 - 3b_2)^2$$
Deriving linear regression coefficients
• The residual for the first observation is defined as $e_1 = Y_1 - b_1 - b_2X_1$.
Deriving linear regression coefficients
• Similarly we define the residuals for the remaining observations; for the last observation, $e_n = Y_n - b_1 - b_2X_n$.
Deriving linear regression coefficients
$$RSS = e_1^2 + \dots + e_n^2 = (Y_1 - b_1 - b_2X_1)^2 + \dots + (Y_n - b_1 - b_2X_n)^2$$
$$= Y_1^2 + b_1^2 + b_2^2X_1^2 - 2b_1Y_1 - 2b_2X_1Y_1 + 2b_1b_2X_1 + \dots + Y_n^2 + b_1^2 + b_2^2X_n^2 - 2b_1Y_n - 2b_2X_nY_n + 2b_1b_2X_n$$
$$= \sum Y_i^2 + nb_1^2 + b_2^2\sum X_i^2 - 2b_1\sum Y_i - 2b_2\sum X_iY_i + 2b_1b_2\sum X_i$$
Deriving linear regression coefficients
$$RSS = \sum Y_i^2 + nb_1^2 + b_2^2\sum X_i^2 - 2b_1\sum Y_i - 2b_2\sum X_iY_i + 2b_1b_2\sum X_i$$
$$\frac{\partial RSS}{\partial b_1} = 0 \;\Rightarrow\; 2nb_1 - 2\sum Y_i + 2b_2\sum X_i = 0 \;\Rightarrow\; nb_1 = \sum Y_i - b_2\sum X_i \;\Rightarrow\; b_1 = \bar{Y} - b_2\bar{X}$$
Deriving linear regression coefficients
$$RSS = \sum Y_i^2 + nb_1^2 + b_2^2\sum X_i^2 - 2b_1\sum Y_i - 2b_2\sum X_iY_i + 2b_1b_2\sum X_i$$
$$\frac{\partial RSS}{\partial b_2} = 0 \;\Rightarrow\; 2b_2\sum X_i^2 - 2\sum X_iY_i + 2b_1\sum X_i = 0$$
$$\Rightarrow\; b_2\sum X_i^2 - \sum X_iY_i + (\bar{Y} - b_2\bar{X})\sum X_i = 0 \;\Rightarrow\; b_2\sum X_i^2 - \sum X_iY_i + (\bar{Y} - b_2\bar{X})n\bar{X} = 0$$
$$\Rightarrow\; b_2\sum X_i^2 - \sum X_iY_i + n\bar{X}\bar{Y} - nb_2\bar{X}^2 = 0 \;\Rightarrow\; b_2\left(\sum X_i^2 - n\bar{X}^2\right) = \sum X_iY_i - n\bar{X}\bar{Y}$$
$$\Rightarrow\; b_2 = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}$$
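A minimal Python sketch of these two formulas (my own illustration, not part of the slides; the function name ols_coefficients is mine):

```python
import numpy as np

def ols_coefficients(x, y):
    """Return (b1, b2) for the fitted line Y = b1 + b2*X, using the formulas derived above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b2 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
    b1 = y.mean() - b2 * x.mean()
    return b1, b2

# The three-observation example from the previous slides:
print(ols_coefficients([1, 2, 3], [3, 5, 6]))   # approximately (1.67, 1.5)
```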
Deriving linear regression coefficients
• We might use the alternative expression for $b_2$:
$$b_2 = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
since
$$\sum (X_i - \bar{X})^2 = \sum \left(X_i^2 - 2\bar{X}X_i + \bar{X}^2\right) = \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2 = \sum X_i^2 - 2\bar{X}(n\bar{X}) + n\bar{X}^2 = \sum X_i^2 - n\bar{X}^2$$
(and, similarly, $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_iY_i - n\bar{X}\bar{Y}$).
Deriving linear regression coefficients
• These are the particular values of b1 and b2 that minimize RSS, and we should distinguish them from other possible values by giving them special names, for example $b_1^{OLS}$ and $b_2^{OLS}$. (A matrix expression of OLS appears later in these slides.)
Simple model with no intercept
$$e_i = Y_i - \hat{Y}_i = Y_i - b_2X_i$$
$$RSS = \sum (Y_i - b_2X_i)^2 = \sum Y_i^2 - 2b_2\sum X_iY_i + b_2^2\sum X_i^2$$
$$\frac{dRSS}{db_2} = 2b_2\sum X_i^2 - 2\sum X_iY_i = 0 \;\Rightarrow\; b_2^{OLS} = \frac{\sum X_iY_i}{\sum X_i^2}$$
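A tiny Python check of the no-intercept estimator (my own sketch, not from the slides), reusing the three illustrative observations from earlier:

```python
import numpy as np

# Slope for the no-intercept model Y_i = b2 * X_i + e_i:  b2 = sum(X_i * Y_i) / sum(X_i^2)
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 6.0])
b2 = (x * y).sum() / (x ** 2).sum()
print(b2)   # 31/14, approximately 2.21
```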
An Example from Finance
• Suppose that we have the following data on the excess returns on
a fund manager’s portfolio (fund XXX) together with the excess
returns on a market index:
Year t    Excess return (Ri - Rf): Y    Market premium (Rm - Rf): X
1         17.8                          13.7
2         39.0                          23.2
3         12.8                           6.9
4         24.2                          16.8
5         17.2                          12.3
• We have some intuition that the beta on this fund is positive, and we therefore want to find whether there appears to be a relationship between X and Y given the data that we have. The first stage would be to form a scatter plot of the two variables.
Estimation of coefficients for CAPM example
• In the CAPM example used above, plugging the 5 observations into the formulae given above leads to the estimates
$$\hat{\alpha} = -1.74 \quad \text{and} \quad \hat{\beta} = 1.64$$
We then have the fitted line: $\hat{y}_t = -1.74 + 1.64 x_t$
• Question: If an analyst tells you that she expects the market to
yield a return 20% higher than the risk-free rate next year, what
would you expect the return on fund XXX to be?
$$\hat{\beta}^{OLS} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
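As a quick check (my own sketch, not part of the slides), the estimates and the answer to the question can be reproduced in a few lines of Python from the five observations given earlier:

```python
import numpy as np

x = np.array([13.7, 23.2, 6.9, 16.8, 12.3])   # market premium (Rm - Rf)
y = np.array([17.8, 39.0, 12.8, 24.2, 17.2])   # excess return on fund XXX (Ri - Rf)

beta_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)        # approximately -1.74 and 1.64

# Expected fund excess return if the market premium is 20%:
print(alpha_hat + beta_hat * 20)  # approximately 31.1%
```

Plugging $x_t = 20$ into the fitted line gives $\hat{y} = -1.74 + 1.64 \times 20 \approx 31.1$, so we would expect the fund to return roughly 31% in excess of the risk-free rate.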
Accuracy of Intercept Estimate
• Care needs to be exercised when considering the intercept
estimate, particularly if there are no or few observations
close to the y-axis:
The Population and the Sample
• The population is the total collection of all objects or people
to be studied.
The DGP and the PRF
• The population regression function (PRF) is a description of the
model that is thought to be generating the actual data and the
true relationship between the variables (i.e. the true values of α
and β).
• The PRF is $y_t = \alpha + \beta x_t + u_t$
• We usually make the following set of assumptions about the ut's (the unobservable error terms):
1. E(ut) = 0: the errors have zero mean
2. Var(ut) = σ²: the variance of the errors is constant and finite over all values of xt
3. Cov(ui, uj) = 0: the errors are statistically independent of one another
4. Cov(ut, xt) = 0: there is no relationship between the error and the corresponding x variate
The Assumptions Underlying the CLRM
• An alternative assumption to 4., which is slightly stronger,
is that the xt’s are non-stochastic or fixed in repeated
samples, or E(ut|xt)=0 for all t.
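To make these assumptions concrete, here is a minimal simulation sketch (my own illustration, with arbitrary parameter values, not from the slides) that generates data from a PRF whose errors satisfy assumptions 1 through 4:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta, sigma = 1.0, 0.5, 2.0        # illustrative "true" values, chosen arbitrarily
T = 200

x = rng.uniform(0, 25, size=T)            # regressor values (non-stochastic from OLS's point of view)
u = rng.normal(0.0, sigma, size=T)        # E(u_t) = 0, constant variance, independent draws, unrelated to x
y = alpha + beta * x + u                  # the PRF / data-generating process: y_t = alpha + beta*x_t + u_t
```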
Properties of the OLS Estimator
• If assumptions 1 through 4 hold, then the estimators $\hat{\alpha}$ and $\hat{\beta}$ determined by OLS are known as Best Linear Unbiased Estimators (BLUE). What does the acronym stand for?
• Consistent
The least squares estimators $\hat{\alpha}$ and $\hat{\beta}$ are consistent. That is, the estimates will converge to their true values as the sample size increases to infinity. We need the assumptions Cov$(x_t, u_t) = 0$ and Var$(u_t) = \sigma^2 < \infty$ to prove this. Consistency implies that
$$\lim_{T \to \infty} \Pr\left[\,|\hat{\beta} - \beta| > \delta\,\right] = 0 \quad \forall\, \delta > 0$$
• Unbiased
The least squares estimates of $\alpha$ and $\beta$ are unbiased. That is, $E(\hat{\alpha}) = \alpha$ and $E(\hat{\beta}) = \beta$. Thus on average the estimated values will be equal to the true values. To prove this also requires the assumption that $E(u_t) = 0$. Unbiasedness is a stronger condition than consistency. (Proof)
• Efficient
An efficient estimator is one that has the smallest variance within the class of linear unbiased estimators.
• Recall that the estimators of $\alpha$ and $\beta$ from the sample data ($\hat{\alpha}$ and $\hat{\beta}$) are given by
$$\hat{\beta} = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
• What we need is some measure of the reliability or precision of the estimators ($\hat{\alpha}$ and $\hat{\beta}$). The precision of an estimate is given by its standard error. Given assumptions 1 through 4 above, the standard errors can be shown to be
$$SE(\hat{\alpha}) = s\sqrt{\frac{\sum x_t^2}{T\sum (x_t - \bar{x})^2}} = s\sqrt{\frac{\sum x_t^2}{T\sum x_t^2 - T^2\bar{x}^2}}, \qquad SE(\hat{\beta}) = s\sqrt{\frac{1}{\sum (x_t - \bar{x})^2}} = s\sqrt{\frac{1}{\sum x_t^2 - T\bar{x}^2}}$$
where $s$ is the standard error of the regression, defined below.
Estimating the Variance of the Disturbance Term
• The variance of the random variable $u_t$ is given by
$$\mathrm{Var}(u_t) = E\left[(u_t - E(u_t))^2\right]$$
which, since $E(u_t) = 0$, reduces to
$$\mathrm{Var}(u_t) = E(u_t^2)$$
• We could estimate this using the average of the squared residuals $\hat{u}_t^2$:
$$\hat{\sigma}^2 = \frac{1}{T}\sum \hat{u}_t^2$$
• But this is a biased estimator of $\sigma^2$.
Estimating the Variance of the Disturbance Term
• An unbiased estimator of $\sigma^2$ is given by
$$s^2 = \frac{\sum \hat{u}_t^2}{T-2}$$
where $\sum \hat{u}_t^2$ is the residual sum of squares (RSS) and $T$ is the sample size, so the standard error of the regression is
$$s = \sqrt{\frac{RSS}{T-2}}$$
Some Comments on the Standard Error Estimators
• Both SE($\hat{\alpha}$) and SE($\hat{\beta}$) depend on $s^2$ (or $s$). The greater the variance $s^2$, the more dispersed the errors are about their mean value and therefore the more dispersed y will be about its mean value.
• The sum of the squares of the x's about their mean appears in both formulae. The larger this sum of squares, the smaller the coefficient variances.
$$SE(\hat{\alpha}) = s\sqrt{\frac{\sum x_t^2}{T\sum (x_t - \bar{x})^2}} = s\sqrt{\frac{\sum x_t^2}{T\sum x_t^2 - T^2\bar{x}^2}}, \qquad SE(\hat{\beta}) = s\sqrt{\frac{1}{\sum (x_t - \bar{x})^2}} = s\sqrt{\frac{1}{\sum x_t^2 - T\bar{x}^2}}$$
Some Comments on the Standard Error Estimators
[Two scatter diagrams of y against x.]
Some Comments on the Standard Error Estimators
• The larger the sample size, T, the smaller will be the coefficient
variances.
$$SE(\hat{\alpha}) = s\sqrt{\frac{\sum x_t^2}{T\sum (x_t - \bar{x})^2}} = s\sqrt{\frac{\sum x_t^2}{T\sum x_t^2 - T^2\bar{x}^2}}, \qquad SE(\hat{\beta}) = s\sqrt{\frac{1}{\sum (x_t - \bar{x})^2}} = s\sqrt{\frac{1}{\sum x_t^2 - T\bar{x}^2}}$$
• The term $\sum x_t^2$ appears in $SE(\hat{\alpha})$. The reason is that $\sum x_t^2$ measures how far the points are away from the y-axis.
Example: How to Calculate the Parameters and
Standard Errors
• Assume we have the following data calculated from a regression of y on
a single variable x and a constant over 22 observations.
• Data:
$$\sum x_t y_t = 830102, \quad T = 22, \quad \bar{x} = 416.5, \quad \bar{y} = 86.65, \quad \sum x_t^2 = 3919654, \quad RSS = 130.6$$
$$\hat{y}_t = -59.12 + 0.35 x_t$$
Example (cont’d)
• SE(regression):
$$s = \sqrt{\frac{RSS}{T-2}} = \sqrt{\frac{130.6}{22-2}} = 2.55$$
$$SE(\hat{\alpha}) = s\sqrt{\frac{\sum x_t^2}{T\sum x_t^2 - T^2\bar{x}^2}} = 2.55 \times \sqrt{\frac{3919654}{(22 \times 3919654) - (22 \times 416.5)^2}} = 3.35$$
$$SE(\hat{\beta}) = s\sqrt{\frac{1}{\sum x_t^2 - T\bar{x}^2}} = 2.55 \times \sqrt{\frac{1}{3919654 - 22 \times (416.5)^2}} = 0.0079$$
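These calculations are easy to verify numerically. A short Python sketch (mine, not from the slides) reproduces the three values from the summary statistics, up to rounding:

```python
import numpy as np

# Summary statistics from the example slide.
sum_xy, T, xbar, ybar = 830102.0, 22, 416.5, 86.65
sum_x2, RSS = 3919654.0, 130.6

s = np.sqrt(RSS / (T - 2))                                       # approximately 2.55
se_alpha = s * np.sqrt(sum_x2 / (T * sum_x2 - T**2 * xbar**2))   # approximately 3.35
se_beta = s * np.sqrt(1.0 / (sum_x2 - T * xbar**2))              # approximately 0.0079
print(s, se_alpha, se_beta)
```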
• We use goodness of fit statistics to assess how well the sample regression function (SRF) fits the data, for example:
$$R^2 = 1 - \frac{RSS}{\sum_t (y_t - \bar{y})^2}$$
Defining R²
• Define the total sum of squares (TSS) as a measure of the total sample variation:
$$TSS = \sum_t (y_t - \bar{y})^2$$
• We can split the TSS into two parts, the part which we have explained
(known as the explained sum of squares, ESS) and the part which we
did not explain using the model (the RSS).
$$\sum_t (y_t - \bar{y})^2 = \sum_t \left[(y_t - \hat{y}_t) + (\hat{y}_t - \bar{y})\right]^2 = \sum_t \left[\hat{u}_t + (\hat{y}_t - \bar{y})\right]^2 = \sum_t \hat{u}_t^2 + 2\sum_t \hat{u}_t(\hat{y}_t - \bar{y}) + \sum_t (\hat{y}_t - \bar{y})^2$$
Let $ESS = \sum_t (\hat{y}_t - \bar{y})^2$. Then
$$TSS = RSS + 2\sum_t \hat{u}_t(\hat{y}_t - \bar{y}) + ESS$$
The cross-product term is zero (a consequence of the OLS normal equations when an intercept is included), so
$$TSS = RSS + ESS \qquad \text{and} \qquad R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
The Limit Cases: R² = 0 and R² = 1
[Two diagrams of $y_t$ against $x_t$ illustrating the limit cases R² = 0 and R² = 1.]
Problems with R²
• R² is defined in terms of variation about the mean of y, so that if a model is reparameterised (rearranged) and the dependent variable changes, R² will change.
• R² never falls if more regressors are added to the regression, for example:
Regression 1: $y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$
Regression 2: $y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 x_{4t} + u_t$
• R² quite often takes on values of 0.9 or higher for time series regressions.
Adjusted R²
• In order to get around these problems, a modification is often made which takes into account the loss of degrees of freedom associated with adding extra variables. This is known as $\bar{R}^2$, or adjusted R²:
$$\bar{R}^2 = 1 - \left[\frac{T-1}{T-k}\,(1 - R^2)\right]$$
where k is the number of regressors (including the constant)
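A small Python sketch (my own, not from the slides; the function name is illustrative) computing both statistics from the actual values, the fitted values and the number of regressors k:

```python
import numpy as np

def r_squared(y, y_hat, k):
    """Return (R^2, adjusted R^2); k = number of regressors including the constant."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    T = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (T - 1) / (T - k) * (1.0 - r2)
    return r2, r2_adj
```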
• In matrix notation, the regression model (here with a constant and two explanatory variables) can be written as
$$y = X\beta + u$$
• That is:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix} = \begin{bmatrix} 1 & x_{21} & x_{31} \\ 1 & x_{22} & x_{32} \\ \vdots & \vdots & \vdots \\ 1 & x_{2T} & x_{3T} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_T \end{bmatrix}$$
How to calculate the parameters in this case?
• Previously, we took the residual sum of squares, and minimised
it w.r.t. 𝛼 and 𝛽.
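For completeness, the standard result (stated here without derivation, since the slides only pose the question) is that minimising $RSS = (y - X\beta)'(y - X\beta)$ with respect to the vector $\beta$ gives
$$\hat{\beta}^{OLS} = (X'X)^{-1}X'y$$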
• To establish unbiasedness, first substitute the true model $Y_i = \beta_1 + \beta_2 X_i + u_i$ into the expression for $\hat{\beta}_2^{OLS}$:
$$\hat{\beta}_2^{OLS} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X})Y_i - \sum (X_i - \bar{X})\bar{Y}}{\sum (X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X})Y_i - \bar{Y}\left(\sum X_i - n\bar{X}\right)}{\sum (X_i - \bar{X})^2}$$
$$= \frac{\sum (X_i - \bar{X})Y_i}{\sum (X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X})(\beta_1 + \beta_2 X_i + u_i)}{\sum (X_i - \bar{X})^2} = \frac{\beta_1 \sum (X_i - \bar{X}) + \beta_2 \sum (X_i - \bar{X})X_i + \sum (X_i - \bar{X})u_i}{\sum (X_i - \bar{X})^2}$$
Unbiasedness of OLS
Given $\sum (X_i - \bar{X}) = 0$, we have
$$\hat{\beta}_2^{OLS} = \frac{\beta_2 \sum (X_i - \bar{X})X_i + \sum (X_i - \bar{X})u_i}{\sum (X_i - \bar{X})^2}$$
Given $\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2 = \sum (X_i - \bar{X})X_i$, we have
$$\hat{\beta}_2^{OLS} = \beta_2 + \frac{\sum (X_i - \bar{X})u_i}{\sum (X_i - \bar{X})^2}$$
Given $E(u_i) = 0$ (and treating the $X_i$ as non-stochastic), we have
$$E(\hat{\beta}_2^{OLS}) = E(\beta_2) + E\left[\frac{\sum (X_i - \bar{X})u_i}{\sum (X_i - \bar{X})^2}\right] = E(\beta_2) = \beta_2$$
Q.E.D.
Unbiasedness of OLS
• To prove $E(\hat{\beta}_1) = \beta_1$:
Averaging $Y_i = \beta_1 + \beta_2 X_i + u_i$ over all $i$, we have
$$\frac{\sum Y_i}{n} = \beta_1 + \beta_2 \frac{\sum X_i}{n} + \frac{\sum u_i}{n}, \qquad \text{thus} \quad \bar{Y} = \beta_1 + \beta_2 \bar{X} + \bar{u}$$
Given $\hat{\beta}_1^{OLS} = \bar{Y} - \hat{\beta}_2^{OLS}\bar{X}$,
$$\hat{\beta}_1^{OLS} = \beta_1 + \beta_2 \bar{X} + \bar{u} - \hat{\beta}_2^{OLS}\bar{X} = \beta_1 + (\beta_2 - \hat{\beta}_2^{OLS})\bar{X} + \bar{u}$$
Therefore $E(\hat{\beta}_1^{OLS}) = E(\beta_1) + E[(\beta_2 - \hat{\beta}_2^{OLS})\bar{X}] + E(\bar{u}) = E(\beta_1) = \beta_1$, since $E(\hat{\beta}_2^{OLS}) = \beta_2$ and $E(\bar{u}) = 0$.
Q.E.D.
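A small Monte Carlo sketch (my own illustration, with arbitrary parameter values, not from the slides) of the unbiasedness results above: across many samples drawn from the PRF, the OLS estimates average out to the true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, sigma, n, reps = 1.0, 0.5, 2.0, 50, 5000   # arbitrary illustrative values

b1_hats, b2_hats = [], []
for _ in range(reps):
    x = rng.uniform(0, 10, size=n)
    u = rng.normal(0.0, sigma, size=n)          # E(u_i) = 0
    y = beta1 + beta2 * x + u                   # the PRF
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b1 = y.mean() - b2 * x.mean()
    b1_hats.append(b1)
    b2_hats.append(b2)

print(np.mean(b1_hats), np.mean(b2_hats))       # close to the true values (1.0, 0.5)
```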