Fall 2008
Y_i = a_0 + b_0 X_i + \epsilon_i \,, \quad 1 \le i \le n \,, \qquad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma_0^2)    (1)

where the constant real parameters (a_0, b_0, \sigma_0^2) are unknown. The parameters (a, b) are estimated by maximum likelihood, which in the case of normally distributed iid errors as assumed here is equivalent to choosing (a, b) in terms of \{(X_i, Y_i)\}_{i=1}^n by least squares, i.e. to minimize \sum_{i=1}^n (Y_i - a - b X_i)^2, which results in:

\hat b = \frac{s.cov(X, Y)}{s.var(X)} = \frac{s_{XY}}{s_X^2} \,, \qquad \hat a = \bar Y - \hat b \bar X    (2)
where

\bar X = \frac{1}{n} \sum_{i=1}^n X_i \,, \qquad \bar Y = \frac{1}{n} \sum_{i=1}^n Y_i

s_X^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 \,, \qquad s_Y^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar Y)^2

and

s_{XY} = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)
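As a numerical sanity check (not part of the original notes), the closed-form least squares estimates in (2) can be computed directly from these summary statistics and compared against an off-the-shelf fit; the values of n, a_0, b_0, \sigma_0 below are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from model (1); a0, b0, sigma0 are hypothetical values.
n, a0, b0, sigma0 = 50, 2.0, 1.5, 0.8
X = rng.uniform(0, 10, n)
Y = a0 + b0 * X + rng.normal(0, sigma0, n)

# Summary statistics exactly as defined in the text (1/(n-1) convention).
Xbar, Ybar = X.mean(), Y.mean()
sX2 = ((X - Xbar) ** 2).sum() / (n - 1)
sXY = ((Y - Ybar) * (X - Xbar)).sum() / (n - 1)

# Least squares estimates from equation (2).
b_hat = sXY / sX2
a_hat = Ybar - b_hat * Xbar

# Cross-check against numpy's degree-1 polynomial least squares fit.
b_np, a_np = np.polyfit(X, Y, 1)
assert np.allclose([a_hat, b_hat], [a_np, b_np])
```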
The predictors and residuals for the observed responses Y_i are given respectively by

\mathrm{Predictor}_i = \hat Y_i = \hat a + \hat b X_i \,, \qquad \mathrm{Residual}_i = \hat\epsilon_i = Y_i - \hat Y_i

\hat\sigma^2 = \mathrm{MRSS} = \frac{1}{n-2} \sum_{i=1}^n (Y_i - \hat Y_i)^2
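A short sketch of these quantities on simulated data (parameter values chosen for illustration only) also exhibits two standard algebraic facts about least squares residuals: they sum to zero and are orthogonal to the X values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = rng.uniform(0, 5, n)
Y = 1.0 + 2.0 * X + rng.normal(0, 0.5, n)   # hypothetical a0 = 1, b0 = 2

# Least squares fit via equation (2).
b_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
a_hat = Y.mean() - b_hat * X.mean()

# Predictors and residuals as defined above.
Y_hat = a_hat + b_hat * X
resid = Y - Y_hat

# MRSS = SSE/(n-2), the unbiased estimator of sigma^2 in this model.
sigma2_hat = (resid ** 2).sum() / (n - 2)

# Residuals from the fitted line sum to zero and are orthogonal to X.
assert abs(resid.sum()) < 1e-8
assert abs((resid * X).sum()) < 1e-8
```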
Confidence intervals for estimated parameters are all based on the fact that the least squares estimates \hat a, \hat b and the corresponding predictors of (the mean of) Y_i are linear combinations of the independent normally distributed variables \epsilon_j, j = 1, \ldots, n, and the general formula, for any sequence of constants u_j, j = 1, \ldots, n,

\sum_{j=1}^n u_j \epsilon_j \sim N\Big( 0 \,, \; \sigma^2 \sum_{j=1}^n u_j^2 \Big)    (3)

We use this formula below with various choices for the vector u = \{u_j\}_{j=1}^n.
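Formula (3) can be illustrated by Monte Carlo (a check added here, not in the original notes): for a fixed arbitrary coefficient vector u, the simulated variance of \sum_j u_j \epsilon_j should match \sigma^2 \sum_j u_j^2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 10, 1.5, 200_000
u = rng.uniform(-1, 1, n)           # arbitrary fixed constants u_j

# Monte Carlo check of (3): sum_j u_j eps_j ~ N(0, sigma^2 * sum_j u_j^2).
eps = rng.normal(0, sigma, (reps, n))
combos = eps @ u                    # one linear combination per replication

assert abs(combos.mean()) < 0.05
assert np.isclose(combos.var(), sigma ** 2 * (u ** 2).sum(), rtol=0.05)
```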
Under the model (1), with true parameters (a_0, b_0, \sigma_0^2), we first calculate from (2) that

\hat b - b_0 = \frac{1}{(n-1) s_X^2} \sum_{j=1}^n (Y_j - a_0 - b_0 X_j)(X_j - \bar X) = \sum_{j=1}^n c_j \epsilon_j

(since \sum_{j=1}^n (X_j - \bar X) = 0), where

c_j = \frac{X_j - \bar X}{(n-1) s_X^2}

Since

\sum_{j=1}^n c_j^2 = \frac{1}{(n-1)^2 s_X^4} \sum_{j=1}^n (X_j - \bar X)^2 = \frac{1}{(n-1) s_X^2}

formula (3) gives

\hat b - b_0 = \sum_{j=1}^n \frac{X_j - \bar X}{(n-1) s_X^2} \, \epsilon_j \sim N\Big( 0 \,, \; \frac{\sigma^2}{(n-1) s_X^2} \Big)
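Both steps of this calculation can be verified numerically, a sketch with an illustrative fixed design X (values of n, a_0, b_0, \sigma_0 are hypothetical): the algebraic collapse of \sum_j c_j^2, and the Monte Carlo variance of \hat b.

```python
import numpy as np

rng = np.random.default_rng(3)
n, b0, sigma0, reps = 25, 0.7, 1.2, 100_000
X = rng.uniform(0, 10, n)                  # fixed design, held constant below
sX2 = np.var(X, ddof=1)
c = (X - X.mean()) / ((n - 1) * sX2)       # the coefficients c_j

# sum_j c_j^2 collapses to 1/((n-1) sX^2), as derived in the text.
assert np.isclose((c ** 2).sum(), 1 / ((n - 1) * sX2))

# Monte Carlo: b_hat - b0 = sum_j c_j eps_j, variance sigma0^2/((n-1) sX^2).
eps = rng.normal(0, sigma0, (reps, n))
b_hat = b0 + eps @ c
assert np.isclose(b_hat.var(), sigma0 ** 2 / ((n - 1) * sX2), rtol=0.05)
```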
Next, using

\hat a - a_0 = \bar Y - \hat b \bar X - a_0 = \frac{1}{n} \sum_{j=1}^n \big( \epsilon_j - (\hat b - b_0) \bar X \big) = \sum_{j=1}^n u_j \epsilon_j    (4)

where

u_j = \frac{1}{n} - \frac{\bar X (X_j - \bar X)}{(n-1) s_X^2} \,, \qquad \sum_{j=1}^n u_j^2 = \frac{1}{n} + \frac{\bar X^2}{(n-1) s_X^2}

we find

\hat a - a_0 = \sum_{j=1}^n \Big( \frac{1}{n} - \frac{\bar X (X_j - \bar X)}{(n-1) s_X^2} \Big) \epsilon_j \sim N\Big( 0 \,, \; \sigma^2 \Big\{ \frac{1}{n} + \frac{\bar X^2}{(n-1) s_X^2} \Big\} \Big)    (5)
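The identity \sum_j u_j^2 = 1/n + \bar X^2 / ((n-1) s_X^2), which drives the variance in (5), holds for any design; a quick numerical check on an arbitrary simulated X:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = rng.uniform(-3, 7, n)                  # arbitrary illustrative design
Xbar = X.mean()
sX2 = np.var(X, ddof=1)

# Coefficients u_j from (4): u_j = 1/n - Xbar (X_j - Xbar) / ((n-1) sX^2).
u = 1 / n - Xbar * (X - Xbar) / ((n - 1) * sX2)

# Their squared sum equals the variance factor appearing in (5).
assert np.isclose((u ** 2).sum(), 1 / n + Xbar ** 2 / ((n - 1) * sX2))
```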
Similarly,

\hat a - a_0 + (\hat b - b_0) X_i = \sum_{j=1}^n \Big( \frac{1}{n} + \frac{(X_i - \bar X)(X_j - \bar X)}{(n-1) s_X^2} \Big) \epsilon_j \sim N\Big( 0 \,, \; \sigma^2 \Big\{ \frac{1}{n} + \frac{(X_i - \bar X)^2}{(n-1) s_X^2} \Big\} \Big)    (6)
and finally

Y_i - \hat a - \hat b X_i = \sum_{j=1}^n \Big( \delta_{ji} - \frac{1}{n} - \frac{(X_i - \bar X)(X_j - \bar X)}{(n-1) s_X^2} \Big) \epsilon_j \sim N\Big( 0 \,, \; \sigma^2 \Big\{ 1 - \frac{1}{n} - \frac{(X_i - \bar X)^2}{(n-1) s_X^2} \Big\} \Big)    (7)

where in the last display we have used the Kronecker delta \delta_{ji}, defined equal to 1 if i = j and equal to 0 otherwise.
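The variance factor in (7) is again a squared-sum identity in the coefficient vector; a sketch checking it for one arbitrary residual index i on a simulated design (n and i are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n, i = 15, 3                               # i: index of the residual checked
X = rng.uniform(0, 6, n)
Xbar = X.mean()
sX2 = np.var(X, ddof=1)

# Weights from (7): delta_ji - 1/n - (X_i - Xbar)(X_j - Xbar)/((n-1) sX^2).
delta = np.eye(n)[i]
w = delta - 1 / n - (X[i] - Xbar) * (X - Xbar) / ((n - 1) * sX2)

# Squared sum equals the variance factor 1 - 1/n - (X_i - Xbar)^2/((n-1) sX^2).
assert np.isclose((w ** 2).sum(),
                  1 - 1 / n - (X[i] - Xbar) ** 2 / ((n - 1) * sX2))
```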
A further item of theoretical background is the expression of the sum of squared errors in a form allowing us to find that it is independent of \hat b and is distributed as a constant times \chi^2_{n-2}. For that, note first that

\mathrm{SSE} = \sum_{j=1}^n (Y_j - \hat a - \hat b X_j)^2 = \sum_{j=1}^n \big( (Y_j - \bar Y) - \hat b (X_j - \bar X) \big)^2    (8)

We used this equation in class to show, by expanding the square in the last summation, that

\mathrm{SSE} = (n-1) \, s_Y^2 \, (1 - r^2) \,, \qquad r = \frac{s_{XY}}{s_X s_Y} = \hat b \, \frac{s_X}{s_Y}
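Both identities in the last display are purely algebraic and hold for any data set; a short numerical confirmation on simulated data (the generating parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 35
X = rng.uniform(0, 8, n)
Y = 0.5 + 1.3 * X + rng.normal(0, 1.0, n)   # illustrative parameters

sX, sY = np.std(X, ddof=1), np.std(Y, ddof=1)
sXY = np.cov(X, Y, ddof=1)[0, 1]
r = sXY / (sX * sY)
b_hat = sXY / sX ** 2
a_hat = Y.mean() - b_hat * X.mean()

SSE = ((Y - a_hat - b_hat * X) ** 2).sum()

# The two identities below equation (8).
assert np.isclose(SSE, (n - 1) * sY ** 2 * (1 - r ** 2))
assert np.isclose(r, b_hat * sX / sY)
```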
Continuing with the formula (8) for SSE, we find via (4) that with u_j = c_j = (X_j - \bar X)/((n-1) s_X^2),

\mathrm{SSE} = \sum_{j=1}^n \big( \epsilon_j - \bar\epsilon - (\hat b - b_0)(X_j - \bar X) \big)^2 = \sum_{j=1}^n (\epsilon_j - \bar\epsilon)^2 - \frac{1}{(n-1) s_X^2} \sum_{j=1}^n \epsilon_j (X_j - \bar X) \sum_{k=1}^n (X_k - \bar X) \epsilon_k

= e' \Big( I - \frac{1}{n} \mathbf{1}\mathbf{1}' - (n-1) s_X^2 \, c c' \Big) e    (9)

where e = (\epsilon_1, \ldots, \epsilon_n)' and \mathbf{1} denotes the vector of n ones, and

\frac{\mathrm{SSE}}{\sigma_0^2} = \frac{(n-2) \, \hat\sigma^2}{\sigma_0^2} \sim \chi^2_{n-2}    (10)
An important aspect of this proof is the observation that the matrix M = I - \frac{1}{n} \mathbf{1}\mathbf{1}' - (n-1) s_X^2 \, c c' is a projection matrix, that is, is symmetric and idempotent, which means that M^2 = M, and the quadratic form (9) is equal to (Me)'(Me). The independence of (\hat a, \hat b) and Me is confirmed by checking that the covariances are 0:

\mathrm{Cov}(Me, \mathbf{1}'e) = \sigma_0^2 \, M\mathbf{1} = 0 \,, \qquad \mathrm{Cov}(Me, c'e) = \sigma_0^2 \, Mc = 0
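Every property claimed for M can be checked mechanically; a sketch on a simulated design (sizes and parameters are illustrative) confirming symmetry, idempotence, M\mathbf{1} = Mc = 0, and that the quadratic form (9) reproduces the SSE from an actual fit:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
X = rng.uniform(0, 4, n)
sX2 = np.var(X, ddof=1)
c = (X - X.mean()) / ((n - 1) * sX2)
one = np.ones(n)

# M = I - (1/n) 11' - (n-1) sX^2 cc'
M = np.eye(n) - np.outer(one, one) / n - (n - 1) * sX2 * np.outer(c, c)

# Symmetric, idempotent, and annihilates both 1 and c.
assert np.allclose(M, M.T)
assert np.allclose(M @ M, M)
assert np.allclose(M @ one, 0)
assert np.allclose(M @ c, 0)

# e' M e reproduces SSE for data generated from model (1).
a0, b0, sigma0 = 1.0, 2.0, 0.7             # hypothetical true parameters
e = rng.normal(0, sigma0, n)
Y = a0 + b0 * X + e
b_hat = np.cov(X, Y, ddof=1)[0, 1] / sX2
a_hat = Y.mean() - b_hat * X.mean()
SSE = ((Y - a_hat - b_hat * X) ** 2).sum()
assert np.isclose(e @ M @ e, SSE)
```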
The independence of (\hat a, \hat b) and \hat\sigma^2 then immediately implies, by definition of the t-distribution, that

\frac{(\hat b - b_0) \, s_X \sqrt{n-1}}{\hat\sigma} \sim t_{n-2} \,, \qquad \frac{\hat a - a_0}{\hat\sigma \big\{ \frac{1}{n} + \frac{\bar X^2}{(n-1) s_X^2} \big\}^{1/2}} \sim t_{n-2}    (11)
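The slope pivot in (11) implies a standard error \hat\sigma / (s_X \sqrt{n-1}) for \hat b, which should agree with what standard software reports; a sketch comparing against `scipy.stats.linregress` (the simulated data and the choice b_0 = 0, which makes the pivot equal the usual significance t-statistic, are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, a0, b0, sigma0 = 30, 1.0, 0.0, 1.0      # b0 = 0: slope t-stat under the null
X = rng.uniform(0, 10, n)
Y = a0 + b0 * X + rng.normal(0, sigma0, n)

sX = np.std(X, ddof=1)
b_hat = np.cov(X, Y, ddof=1)[0, 1] / sX ** 2
a_hat = Y.mean() - b_hat * X.mean()
sigma_hat = np.sqrt(((Y - a_hat - b_hat * X) ** 2).sum() / (n - 2))

# t-statistic for b from (11); its denominator is scipy's slope stderr.
t_b = (b_hat - b0) * sX * np.sqrt(n - 1) / sigma_hat
res = stats.linregress(X, Y)
assert np.isclose(res.stderr, sigma_hat / (sX * np.sqrt(n - 1)))
assert np.isclose(res.slope / res.stderr, t_b)
```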
b_0 \in \hat b \pm t_{n-2, \alpha/2} \, \frac{\hat\sigma}{s_X \sqrt{n-1}}    (12)

a_0 \in \hat a \pm t_{n-2, \alpha/2} \, \hat\sigma \Big\{ \frac{1}{n} + \frac{\bar X^2}{(n-1) s_X^2} \Big\}^{1/2}    (13)

a_0 + b_0 X_0 \in \hat a + \hat b X_0 \pm t_{n-2, \alpha/2} \, \hat\sigma \Big\{ \frac{1}{n} + \frac{(X_0 - \bar X)^2}{(n-1) s_X^2} \Big\}^{1/2}    (14)

Y_i \in \hat a + \hat b X_i \pm t_{n-2, \alpha/2} \, \hat\sigma \Big\{ 1 - \frac{1}{n} - \frac{(X_i - \bar X)^2}{(n-1) s_X^2} \Big\}^{1/2}    (15)
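Intervals (12)-(14) are straightforward to compute; a sketch on simulated data (generating parameters and the evaluation points are illustrative) that also exhibits the familiar geometry of (14): the confidence band for the mean response is narrowest at \bar X and widens as X_0 moves away from it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, alpha = 30, 0.05
X = rng.uniform(0, 10, n)
Y = 2.0 + 1.5 * X + rng.normal(0, 1.0, n)   # illustrative parameters

Xbar = X.mean()
sX2 = np.var(X, ddof=1)
b_hat = np.cov(X, Y, ddof=1)[0, 1] / sX2
a_hat = Y.mean() - b_hat * Xbar
sigma_hat = np.sqrt(((Y - a_hat - b_hat * X) ** 2).sum() / (n - 2))
tq = stats.t.ppf(1 - alpha / 2, n - 2)      # t_{n-2, alpha/2}

def clm_halfwidth(x0):
    """Half-width of interval (14) for the mean response at x0."""
    return tq * sigma_hat * np.sqrt(1 / n + (x0 - Xbar) ** 2 / ((n - 1) * sX2))

# Interval (12) for b0 and interval (13) for a0.
hw_b = tq * sigma_hat / np.sqrt((n - 1) * sX2)
hw_a = tq * sigma_hat * np.sqrt(1 / n + Xbar ** 2 / ((n - 1) * sX2))
b_ci = (b_hat - hw_b, b_hat + hw_b)
a_ci = (a_hat - hw_a, a_hat + hw_a)

# The band (14) is narrowest at Xbar and widens away from it.
assert clm_halfwidth(Xbar) < clm_halfwidth(Xbar + 3)
assert b_ci[0] < b_hat < b_ci[1] and a_ci[0] < a_hat < a_ci[1]
```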
The confidence intervals (12) and (13) are exactly as used by SAS in determining p-values for the significance of coefficients a, b (in testing the respective null hypotheses that b = 0 or that a = 0). The interval (14) is what SAS calculates in generating the CLM upper and lower confidence limits that it calculates and plots at location X_i either within PROC GPLOT or PROC REG. The interval (15) is only a retrospective prediction interval within which we should have found Y_i with respect to its predictor \hat Y_i, but it is NOT what SAS calculates in generating the CLI upper and lower individual-observation prediction limits at location X_i either within PROC GPLOT or PROC REG. Prediction intervals are meant to capture not the observations already seen but rather any new observations Y_i' which would be collected at the previous locations X_i or at brand-new locations X_0. So we discuss next the corrected formula for the prediction interval which SAS actually calculates.
We are interested sometimes, especially as part of diagnostic checking and model-building, in making prediction intervals for values Y_0 (not yet observed) corresponding to values X_0 which were not in fact part of the dataset. The thing to understand is that formula (15) is not applicable to this situation because it refers to observations Y_i which were already used as part of the model-fitting that resulted in \hat a, \hat b. If Y_0 = a_0 + b_0 X_0 + \epsilon_0 with a_0, b_0 the same as before but with \epsilon_0 \sim N(0, \sigma_0^2) independent of all