Introduction
In this chapter we examine the relationship between
interval variables via a mathematical equation.
The motivation for using the technique:
Forecast the value of a dependent variable (y) from
the values of the independent variables (x1, x2, ..., xk).
Analyze the specific relationships between the
independent variables and the dependent variable.
The Model
The model has a deterministic component and a probabilistic component.
[Figure: house cost plotted against house size, showing the line House cost = 25000 + 75(Size). Building a house costs about $75 per square foot.]
The Model
However, house costs vary even among houses of the same size.
Since costs behave unpredictably, a random error term is added to the deterministic line.
[Figure: house cost against house size.]
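The two components of the house-cost model can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the error standard deviation `sigma` is an assumed value chosen only to make the scatter visible.

```python
import random

def deterministic_cost(size_sqft):
    """Deterministic component: House cost = 25000 + 75(Size)."""
    return 25_000 + 75 * size_sqft

def simulated_cost(size_sqft, rng, sigma=10_000):
    """Deterministic component plus a random error term.

    sigma is a hypothetical value; the slides do not give one.
    """
    return deterministic_cost(size_sqft) + rng.gauss(0, sigma)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
costs = [simulated_cost(2_000, rng) for _ in range(5)]

print("deterministic:", deterministic_cost(2_000))
print("simulated:", [round(c) for c in costs])
```

Five houses of the same size get five different simulated costs, which is the behavior the probabilistic component is meant to capture.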
The Model
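The first-order linear model
$$y = \beta_0 + \beta_1 x + \varepsilon$$
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line
$\varepsilon$ = error variable
[Figure: the line $y = \beta_0 + \beta_1 x$ against x, with y-intercept $\beta_0$ and slope $\beta_1$ = Rise/Run.]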
[Figure: scatter plot of the four points (1, 2), (2, 4), (3, 1.5), (4, 3.2).]
$$b_1 = \frac{\operatorname{cov}(X,Y)}{s_x^2}$$
$$\hat{y} = b_0 + b_1 x$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
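As a rough sketch, the slope (sample covariance divided by the sample variance of x) and the intercept (mean of y minus slope times mean of x) can be computed from raw data as follows. The function name `least_squares` is hypothetical, and the data are the four points (1, 2), (2, 4), (3, 1.5), (4, 3.2) that appear in this section.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample covariance and sample variance, both with divisor n - 1.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
b0, b1 = least_squares(xs, ys)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```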
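Independent variable x (odometer reading), dependent variable y (price)
$$\bar{x} = 36{,}009.45; \quad s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = 43{,}528{,}690$$
$$\bar{y} = 14{,}822.823; \quad \operatorname{cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} = -2{,}712{,}511$$
where n = 100.
$$b_1 = \frac{\operatorname{cov}(X,Y)}{s_x^2} = \frac{-2{,}712{,}511}{43{,}528{,}690} = -.06232$$
$$b_0 = \bar{y} - b_1 \bar{x} = 14{,}822.82 - (-.06232)(36{,}009.45) = 17{,}067$$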
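The arithmetic for the two coefficients can be checked with a few lines of Python. This is only a sketch using the summary statistics quoted in the example; the covariance is entered as negative, which is an assumption consistent with the negative slope reported for the fitted line.

```python
# Summary statistics quoted in the example (n = 100).
x_bar = 36_009.45
y_bar = 14_822.823
var_x = 43_528_690     # s_x^2
cov_xy = -2_712_511    # cov(X, Y), assumed negative (see lead-in)

b1 = cov_xy / var_x          # slope
b0 = y_bar - b1 * x_bar      # intercept

print(f"b1 = {b1:.5f}, b0 = {b0:.0f}")
```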
$$\hat{y} = 17{,}067 - .0623x$$
ANOVA
             df   SS         MS         F        Significance F
Regression    1   16734111   16734111   182.11   0.0000
Residual     98    9005450      91892
Total        99   25739561

P-values of the two coefficients: 0.0000 and 0.0000
[Figure: price against odometer reading with the fitted line $\hat{y} = 17{,}067 - .0623x$. The line meets the price axis at 17,067, in a region with no data.]
The intercept is b0 = $17,067.
Do not interpret the intercept as the price of cars that have not been driven:
the sample has no data near x = 0.
The Normality of $\varepsilon$
From the first three assumptions we have:
y is normally distributed with mean $E(y) = \beta_0 + \beta_1 x$ and a constant standard deviation $\sigma_\varepsilon$.
The standard deviation remains constant, but the mean changes with x:
$E(y|x_1) = \beta_0 + \beta_1 x_1$
$E(y|x_2) = \beta_0 + \beta_1 x_2$
$E(y|x_3) = \beta_0 + \beta_1 x_3$
[Figure: the distribution of y at x1, x2 and x3.]
SSE
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
A shortcut formula
$$SSE = (n-1)\left(s_Y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2}\right)$$
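Example 17.3
Solution
$$s_Y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} = 259{,}996 \quad \text{(calculated before)}$$
$$SSE = (n-1)\left(s_Y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2}\right) = 99\left(259{,}996 - \frac{(-2{,}712{,}511)^2}{43{,}528{,}690}\right) = 9{,}005{,}450$$
$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{9{,}005{,}450}{98}} = 303.13$$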
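The shortcut calculation of SSE and the standard error of estimate can be reproduced with a short Python sketch. The function names are hypothetical. Because the inputs are rounded summary statistics, the computed SSE agrees with the 9,005,450 above only to within rounding.

```python
import math

def sse_shortcut(n, var_y, cov_xy, var_x):
    """SSE = (n - 1) * (s_Y^2 - cov(X, Y)^2 / s_x^2)."""
    return (n - 1) * (var_y - cov_xy ** 2 / var_x)

def standard_error_of_estimate(sse, n):
    """s_epsilon = sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))

sse = sse_shortcut(n=100, var_y=259_996, cov_xy=-2_712_511, var_x=43_528_690)
s_eps = standard_error_of_estimate(sse, n=100)

print(f"SSE ~ {sse:,.0f}, s_epsilon ~ {s_eps:.2f}")
```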
Linear relationship.
Different inputs (x) yield
different outputs (y).
No linear relationship.
Different inputs (x) yield
the same output (y).
The test statistic is
$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$
where
$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$$
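$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}} = \frac{303.1}{\sqrt{(99)(43{,}528{,}690)}} = .00462$$
$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-.0623 - 0}{.00462} = -13.49$$
The rejection region is $t > t_{.025}$ or $t < -t_{.025}$ with $\nu = n - 2 = 98$.
Approximately, $t_{.025} = 1.984$.
Since t = -13.49 < -1.984, the hypothesis $\beta_1 = 0$ is rejected.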
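The slope test can be sketched in Python as follows. The variable names are illustrative, and the critical value 1.984 is the approximation quoted in the text, not one computed here.

```python
import math

n = 100
s_eps = 303.1          # standard error of estimate
var_x = 43_528_690     # s_x^2
b1 = -0.0623           # estimated slope
beta1_null = 0.0       # slope under the null hypothesis
t_crit = 1.984         # approximate t_.025 with 98 degrees of freedom (from the text)

s_b1 = s_eps / math.sqrt((n - 1) * var_x)   # standard error of the slope
t_stat = (b1 - beta1_null) / s_b1
reject = abs(t_stat) > t_crit               # two-tail rejection region

print(f"s_b1 = {s_b1:.5f}, t = {t_stat:.2f}, reject H0: {reject}")
```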
Coefficient of determination
To measure the strength of the linear relationship we
use the coefficient of determination.
$$R^2 = \frac{[\operatorname{cov}(X,Y)]^2}{s_x^2 s_y^2} \quad \text{or} \quad R^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2}$$
Coefficient of determination
To understand the significance of this coefficient, note:
The overall variability in y is explained in part by the regression model.
It remains, in part, unexplained (the error).
Coefficient of determination
[Figure: two data points (x1, y1) and (x2, y2).]
Total variation in y = $(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2$
Coefficient of determination
R2 measures the proportion of the variation in y
that is explained by the variation in x.
$$R^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2} = \frac{\sum (y_i - \bar{y})^2 - SSE}{\sum (y_i - \bar{y})^2} = \frac{SSR}{\sum (y_i - \bar{y})^2}$$
Coefficient of determination,
Example
Example 17.5
Find the coefficient of determination for Example 17.2;
what does this statistic tell you about the model?
Solution
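Solving by hand:
$$R^2 = \frac{[\operatorname{cov}(x,y)]^2}{s_x^2 s_y^2} = \frac{[-2{,}712{,}511]^2}{(43{,}528{,}690)(259{,}996)} = .6501$$
About 65% of the variation in price is explained by the variation in odometer reading.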
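Both forms of the coefficient of determination can be checked numerically. The sketch below uses the summary statistics of the example for the covariance form, and the SSE and total sum of squares from the ANOVA table for the second form; the variable names are illustrative.

```python
# Form 1: squared covariance over the product of the variances.
cov_xy = -2_712_511
var_x = 43_528_690
var_y = 259_996
r2_from_cov = cov_xy ** 2 / (var_x * var_y)

# Form 2: one minus the unexplained share of the total variation.
sse = 9_005_450          # residual sum of squares
ss_total = 25_739_561    # total sum of squares
r2_from_sse = 1 - sse / ss_total

print(f"{r2_from_cov:.4f} {r2_from_sse:.4f}")
```

The two results agree to four decimals, which is as close as the rounded inputs allow.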
Coefficient of determination
Using the computer
From the regression output we have
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
             df   SS         MS         F        Significance F
Regression    1   16734111   16734111   182.11   0.0000
Residual     98    9005450      91892
Total        99   25739561

            Coefficients   Standard Error   t Stat   P-value
Intercept   17067          169              100.97   0.0000
Odometer    -0.0623        0.0046           -13.49   0.0000
ANOVA
             df   SS         MS         F       Significance F
Regression    1   0.10563    0.10563    26.51   0.0000
Residual     58   0.231105   0.003985
Total        59   0.336734

            Coefficients   Standard Error   t Stat   P-value
Intercept   0.0128         0.0082           1.56     0.1245
TSE         0.8877         0.1724           5.15     0.0000

The TSE coefficient is a measure of the stock's market-related risk. In this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.
R2 is a measure of the total market-related risk embedded in the Nortel stock. Specifically, 31.37% of the variation in Nortel's return is explained by the variation in the TSE's returns.
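The 31.37% figure and the t statistics can be recovered from the other entries of the output. This is a small sketch under the usual definitions (R2 as regression sum of squares over total sum of squares, t as coefficient over standard error); the dictionary layout is illustrative.

```python
ss_regression = 0.10563
ss_total = 0.336734
r2 = ss_regression / ss_total   # share of variation explained

# (coefficient, standard error) pairs read from the output.
coefficients = {"Intercept": (0.0128, 0.0082), "TSE": (0.8877, 0.1724)}
t_stats = {name: b / se for name, (b, se) in coefficients.items()}

print(f"R2 = {r2:.4f}")
for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}")
```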
Point Prediction
Example 17.7
Predict the selling price of a three-year-old Taurus
with 40,000 miles on the odometer (Example 17.2).
A point prediction: $\hat{y} = 17{,}067 - .0623(40{,}000) = 14{,}575$
Interval Estimates
Two intervals can be used to discover how closely the
predicted value will match the true value of y.
The prediction interval predicts y for a given value of x.
The confidence interval estimates the average y for a given x.
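The prediction interval
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$
The confidence interval
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$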
Interval Estimates,
Example
Example 17.7 - continued
Provide an interval estimate for the bidding price on
a Ford Taurus with 40,000 miles on the odometer.
Two types of predictions are required:
A prediction for a specific car
An estimate for the average price per car
Interval Estimates,
Example
Solution
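A prediction interval provides the price estimate for a
single car:
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$
Approximately, $t_{.025,98} = 1.984$.
$$[17{,}067 - .0623(40{,}000)] \pm 1.984(303.1)\sqrt{1 + \frac{1}{100} + \frac{(40{,}000 - 36{,}009)^2}{(100-1)43{,}528{,}690}} = 14{,}575 \pm 605$$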
Interval Estimates,
Example
Solution continued
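A confidence interval provides the estimate of the
mean price per car for a Ford Taurus with a 40,000-mile
reading on the odometer.
The confidence interval (95%) =
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
$$[17{,}067 - .0623(40{,}000)] \pm 1.984(303.1)\sqrt{\frac{1}{100} + \frac{(40{,}000 - 36{,}009)^2}{(100-1)43{,}528{,}690}} = 14{,}575 \pm 70$$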
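The two interval calculations for the 40,000-mile car can be sketched together in Python. The variable names are illustrative, and the critical value 1.984 is taken from the text instead of being computed.

```python
import math

n = 100
x_bar = 36_009         # mean odometer reading, rounded as in the worked example
var_x = 43_528_690     # s_x^2
s_eps = 303.1          # standard error of estimate
t_crit = 1.984         # approximate t_.025 with 98 degrees of freedom
b0, b1 = 17_067, -0.0623
x_g = 40_000           # odometer reading of interest

y_hat = b0 + b1 * x_g
# Term shared by both intervals: 1/n + (x_g - x_bar)^2 / ((n - 1) s_x^2).
shared = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * var_x)

pred_half = t_crit * s_eps * math.sqrt(1 + shared)   # single car
conf_half = t_crit * s_eps * math.sqrt(shared)       # mean price

print(f"prediction interval: {y_hat:.0f} +/- {pred_half:.0f}")
print(f"confidence interval: {y_hat:.0f} +/- {conf_half:.0f}")
```

The only difference between the two is the extra 1 under the square root, which is why the interval for a single car is so much wider than the interval for the mean.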
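The confidence interval
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$
widens as $x_g$ moves away from $\bar{x}$.
At $x_g = \bar{x} \pm 1$, $x_g - \bar{x} = \pm 1$:
$$\hat{y}(x_g = \bar{x} \pm 1) \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{1^2}{(n-1)s_x^2}}$$
At $x_g = \bar{x} \pm 2$, $x_g - \bar{x} = \pm 2$:
$$\hat{y}(x_g = \bar{x} \pm 2) \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{2^2}{(n-1)s_x^2}}$$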
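To see how the distance between the given x and the sample mean affects the interval, the half-width can be evaluated at several distances. This is an illustrative sketch: the function name is hypothetical, the defaults are the statistics of the car example, and the distances are arbitrary.

```python
import math

def conf_half_width(dist, n=100, var_x=43_528_690, s_eps=303.1, t_crit=1.984):
    """Half-width of the confidence interval when x_g is `dist` away from x-bar."""
    return t_crit * s_eps * math.sqrt(1 / n + dist ** 2 / ((n - 1) * var_x))

distances = (0, 5_000, 10_000, 20_000)
widths = [conf_half_width(d) for d in distances]

for d, w in zip(distances, widths):
    print(f"|x_g - x_bar| = {d:>6}: half-width = {w:.1f}")
```

The half-width is smallest at the sample mean and grows with the distance in either direction, since only the squared distance enters the formula.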
Coefficient of Correlation
The coefficient of correlation is used to measure the
strength of association between two variables.
The coefficient values range between -1 and 1.
If r = -1 (negative association) or r = +1 (positive
association) every point falls on the regression line.
If r = 0 there is no linear pattern.
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
where r is the sample coefficient of correlation, calculated by
$$r = \frac{\operatorname{cov}(x,y)}{s_x s_y}$$
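As a sketch, the sample coefficient of correlation and the t statistic built from it can be computed for the odometer and price data. The inputs are the summary statistics quoted earlier, with the covariance taken as negative to match the negative slope; the variable names are illustrative.

```python
import math

n = 100
cov_xy = -2_712_511
var_x = 43_528_690     # s_x^2
var_y = 259_996        # s_y^2

# r = cov(x, y) / (s_x * s_y)
r = cov_xy / math.sqrt(var_x * var_y)
# t = r * sqrt((n - 2) / (1 - r^2))
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.4f}, t = {t_stat:.2f}")
```

The magnitude of r matches the Multiple R of 0.8063 in the regression output, and the t statistic matches the t statistic of the slope to two decimals.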
[Table: correlation matrix including the Japanese Index; the off-diagonal correlation shown is 0.4911.]