Advanced Statistics
Summer Term 2011
(April 5, 2011 - May 17, 2011)
Tuesdays, 14.15 - 15.45 and 16.00 - 17.30
Room: J 498
Contents
1 Introduction
1.1 Syllabus
1.2 Why Advanced Statistics?
2
2.1
2.2
2.3
2.4
3
3.1
3.2
3.3
3.4
4
4.1
4.2
4.3
4.4
5 Methods of Estimation
5.1 Sampling, Estimators, Limit Theorems
5.2 Properties of Estimators
5.3 Methods of Estimation
5.3.1 Least-Squares Estimators
5.3.2 Method-of-moments Estimators
5.3.3 Maximum-Likelihood Estimators
6 Hypothesis Testing
6.1 Basic Terminology
6.2 Classical Testing Procedures
6.2.1 Wald Test
6.2.2 Likelihood-Ratio Test
6.2.3 Lagrange-Multiplier Test
In English:
Chiang, A. (1984). Fundamental Methods of Mathematical Economics, 3rd edition. McGraw-Hill, Singapore.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. 1. John Wiley & Sons, New York.
Feller, W. (1971). An Introduction to Probability Theory and its Applications, Vol. 2. John Wiley & Sons, New York.
Garthwaite, P.H., Jolliffe, I.T. and B. Jones (2002). Statistical Inference, 3rd edition. Oxford University Press, Oxford.
Mood, A.M., Graybill, F.A. and D.C. Boes (1974). Introduction to the Theory of Statistics, 3rd edition. McGraw-Hill, Tokyo.
1. Introduction
1.1 Syllabus
Aim of this course:
Consolidation of
probability calculus
statistical inference
(on the basis of previous Bachelor courses)
Preparatory course to Econometrics, Empirical Economics
1
Web-site:
http://www1.wiwi.uni-muenster.de/oeew/
Study Courses summer term 2011
Advanced Statistics
Style:
Lecture is based on slides
Slides are downloadable as PDF-files from the web-site
References:
See Contents
2
Class teacher:
Dipl.-Mathem. Marc Lammerding
(see personal web-site)
Now:
Course in Advanced Statistics
(probability calculus and mathematical statistics)
Preliminaries:
BA courses
Mathematics
Statistics I
Statistics II
The slides for the BA courses Statistics I+II are downloadable
from the web-site
(in German)
Later courses based on Advanced Statistics:
All courses belonging to the three modules Econometrics
and Empirical Economics
(Econometrics I+II, Analysis of Time Series, ...)
7
Preliminaries:
Repetition of the notions
random experiment
outcome (sample point) and sample space
event
probability
In economics:
Random experiments (according to Def. 2.1) are rare
(historical data, trials are not controllable)
Modern discipline: Experimental Economics
11
Obviously:
The number of elements in $\Omega$ can be either (1) finite, or (2) infinite but countable, or (3) infinite and uncountable
Now:
Definition of the notion "Event" based on mathematical sets
Definition 2.3: (Event)
An event of a random experiment is a subset of the sample space $\Omega$. We say "the event A occurs" if the random experiment has an outcome $\omega \in A$.
13
Remarks:
Events are typically denoted by $A, B, C, \ldots$ or $A_1, A_2, \ldots$
$A = \Omega$ is called the sure event
(since for every sample point $\omega$ we have $\omega \in A$)
$A = \emptyset$ (empty set) is called the impossible event
(since for every $\omega$ we have $\omega \notin A$)
If the event A is a subset of the event B ($A \subset B$) we say that the occurrence of A implies the occurrence of B
(since for every $\omega \in A$ we also have $\omega \in B$)
Obviously:
Events are represented by mathematical sets
$\Longrightarrow$ application of set operations to events
14
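Intersection:
$\bigcap_{i=1}^{n} A_i$ occurs, if all $A_i$ occur
Union:
$\bigcup_{i=1}^{n} A_i$ occurs, if at least one $A_i$ occurs
Set difference:
$C = A \setminus B$ occurs, if A occurs and B does not occur
Complement:
$C = \Omega \setminus A \equiv \bar{A}$ occurs, if A does not occur
The events A and B are called disjoint, if $A \cap B = \emptyset$
(both events cannot occur simultaneously)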
15
Now:
For any arbitrary event A we are looking for a number P (A)
which represents the probability that A occurs
Formally:
$P : A \longmapsto P(A)$
($P(\cdot)$ is a set function)
Question:
Which properties should the probability function (set function) $P(\cdot)$ have?
16
Easy to check:
The three axioms imply several additional properties and rules
when computing with probabilities
Theorem 2.5: (General properties)
The Kolmogorov axioms imply the following properties:
Probability of the complementary event:
$P(\bar{A}) = 1 - P(A)$
Probability of the impossible event:
$P(\emptyset) = 0$
Range of probabilities:
$0 \le P(A) \le 1$
18
Next:
General rules when computing with probabilities
Theorem 2.6: (Calculation rules)
The Kolmogorov-axioms imply the following calculation rules
(A, B, C are arbitrary events):
Notice:
If B implies A (i.e. if $B \subset A$) it follows that
$P(A \setminus B) = P(A) - P(B)$
21
24
Example 1:
Consider the experiment of tossing a single coin (H = Head, T = Tail). Let the rv X represent the "Number of Heads"
We have
$\Omega = \{H, T\}$
The random variable X can take on two values:
$X(T) = 0, \quad X(H) = 1$
25
Example 2:
Consider the experiment of tossing a coin three times. Let X represent the "Number of Heads"
We have
$$\Omega = \{\underbrace{(H,H,H)}_{=\omega_1}, \underbrace{(H,H,T)}_{=\omega_2}, \ldots, \underbrace{(T,T,T)}_{=\omega_8}\}$$
The rv X is defined by
$X(\omega)$ = number of H in $\omega$
Obviously:
X relates distinct $\omega$'s to the same number, e.g.
$X((H,H,T)) = X((H,T,H)) = X((T,H,H)) = 2$
26
Example 3:
Consider the experiment of randomly selecting 1 person from a group of people. Let X represent the person's status of employment
We have
$$\Omega = \{\underbrace{\text{employed}}_{=\omega_1}, \underbrace{\text{unemployed}}_{=\omega_2}\}$$
X can be defined as
$X(\omega_1) = 1, \quad X(\omega_2) = 0$
27
Example 4:
Consider the experiment of measuring tomorrow's price of a specific stock. Let X denote the stock price
We have $\Omega = [0, \infty)$, i.e. X is defined by
$X(\omega) = \omega$
Conclusion:
The random variable X can take on distinct values with specific probabilities
28
Question:
How can we determine these specific probabilities and how
can we calculate with them?
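Simplifying notation: ($a, b, x \in \mathbb{R}$)
$P(X = a) \equiv P(\{\omega \mid X(\omega) = a\})$
$P(a < X < b) \equiv P(\{\omega \mid a < X(\omega) < b\})$
$P(X \le x) \equiv P(\{\omega \mid X(\omega) \le x\})$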
Solution:
We can compute these probabilities via the so-called cumulative distribution function of X
29
Intuitively:
The cumulative distribution function of the random variable
X characterizes the probabilities according to which the possible values x are distributed along the real line
(the so-called distribution of X)
Example:
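Consider the experiment of tossing a coin three times. Let X represent the "Number of Heads"
We have
$$\Omega = \{\underbrace{(H,H,H)}_{=\omega_1}, \underbrace{(H,H,T)}_{=\omega_2}, \ldots, \underbrace{(T,T,T)}_{=\omega_8}\}$$
$$F_X(x) = \begin{cases} 0.000 & \text{for } x < 0 \\ 0.125 & \text{for } 0 \le x < 1 \\ 0.5 & \text{for } 1 \le x < 2 \\ 0.875 & \text{for } 2 \le x < 3 \\ 1 & \text{for } x \ge 3 \end{cases}$$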
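The cdf values of the three-coin example can be reproduced by brute-force enumeration. The following is a minimal sketch, assuming a fair coin (all 8 outcomes equally likely); the names `omega`, `X`, `pmf` and `cdf` are illustrative and not part of the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space of three coin tosses: 8 outcomes, assumed equally likely.
omega = list(product("HT", repeat=3))

def X(outcome):
    """Random variable: number of heads in one outcome."""
    return outcome.count("H")

# Probability function P(X = k), accumulated outcome by outcome.
pmf = {}
for w in omega:
    pmf[X(w)] = pmf.get(X(w), Fraction(0)) + Fraction(1, len(omega))

def cdf(x):
    """F_X(x) = P(X <= x), obtained by summing the probability function."""
    return sum(p for k, p in pmf.items() if k <= x)

for x in (-1, 0, 1, 2, 3):
    print(x, float(cdf(x)))
```

Exact fractions are used so that the step heights of the cdf can be compared with the tabulated values without rounding issues.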
Remarks:
In practice, it will be sufficient to only know the cdf FX of X
In many situations, it will appear impossible to exactly specify the sample space $\Omega$ or the explicit function $X : \Omega \to \mathbb{R}$.
However, often we may derive the cdf $F_X$ from other factual considerations
32
General properties of FX :
$F_X(x)$ is a monotone, nondecreasing function
We have
$$\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F_X(x) = 1$$
33
Summary:
Via the cdf FX (x) we can answer the following question:
What is the probability that the random variable X takes
on a value that does not exceed x?
Now:
Consider the question:
What is the value which X does not exceed with a prespecified probability $p \in (0, 1)$?
$\Longrightarrow$ quantile function of X
34
Special quantiles:
Median: p = 0.5
Quartiles: p = 0.25, 0.5, 0.75
Quintiles: p = 0.2, 0.4, 0.6, 0.8
Deciles: p = 0.1, 0.2, . . . , 0.9
Now:
Consideration of two distinct classes of random variables
(discrete vs. continuous rvs)
36
Reason:
Each class requires a specific mathematical treatment
Mathematical tools for analyzing discrete rvs:
Finite and infinite sums
Mathematical tools for analyzing continuous rvs:
Differential- and integral calculus
Remarks:
Some rvs are partly discrete and partly continuous
Such rvs are not treated in this course
37
and
$$\sum_{j=1}^{J,\ldots} P(X = x_j) = 1.$$
38
or
Remarks:
The discrete density function $f_X(\cdot)$ takes on strictly positive values only for elements of the support of X. For realizations of X that do not belong to the support of X, i.e. for $x \notin \mathrm{supp}(X)$, we have $f_X(x) = 0$:
$$f_X(x) = \begin{cases} P(X = x_j) > 0 & \text{for } x = x_j \in \mathrm{supp}(X) \\ 0 & \text{for } x \notin \mathrm{supp}(X) \end{cases}$$
40
$$\sum_{x_j \in \mathrm{supp}(X)} f_X(x_j) = 1$$
$$\sum_{x_j \in A} f_X(x_j)$$
41
Example:
Consider the experiment of tossing a coin three times and let X = "Number of Heads"
(see slide 31)
Obviously: X is discrete and has the support
$\mathrm{supp}(X) = \{0, 1, 2, 3\}$
The discrete density function of X is given by
$$f_X(x) = \begin{cases} P(X = 0) = 0.125 & \text{for } x = 0 \\ P(X = 1) = 0.375 & \text{for } x = 1 \\ P(X = 2) = 0.375 & \text{for } x = 2 \\ P(X = 3) = 0.125 & \text{for } x = 3 \\ 0 & \text{for } x \notin \mathrm{supp}(X) \end{cases}$$
42
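$$F_X(x) = \begin{cases} 0.000 & \text{for } x < 0 \\ 0.125 & \text{for } 0 \le x < 1 \\ 0.5 & \text{for } 1 \le x < 2 \\ 0.875 & \text{for } 2 \le x < 3 \\ 1 & \text{for } x \ge 3 \end{cases}$$
Obviously:
The cdf $F_X(\cdot)$ can be obtained from $f_X(\cdot)$:
$$F_X(x) = P(X \le x) = \sum_{\{x_j \in \mathrm{supp}(X) \mid x_j \le x\}} \overbrace{f_X(x_j)}^{=P(X = x_j)}$$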
43
Conclusion:
The cdf of a discrete random variable X is a step function with steps at the points $x_j \in \mathrm{supp}(X)$. The height of the step at $x_j$ is given by
$$F_X(x_j) - \lim_{x \to x_j,\; x < x_j} F_X(x) = P(X = x_j) = f_X(x_j),$$
i.e. the step height is equal to the value of the discrete density function at $x_j$
(relationship between cdf and discrete density function)
44
Now:
Definition of continuous random variables
Intuitively:
In contrast to discrete random variables, continuous random
variables can take on an uncountable number of values
(e.g. every real number on a given interval)
In fact:
Definition of a continuous random variable is quite technical
45
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt \quad \text{for all } x \in \mathbb{R}.$$
Remarks:
The cdf $F_X(\cdot)$ of a continuous random variable X is a primitive function of the pdf $f_X(\cdot)$
$F_X(x) = P(X \le x)$ is equal to the area under the pdf $f_X(\cdot)$ between the limits $-\infty$ and $x$
46
[Figure: pdf $f_X(t)$; the area under the curve up to $x$ equals $P(X \le x) = F_X(x)$]
47
for all $x \in \mathbb{R}$
$$\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$$
48
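$$f_X(x) = \begin{cases} 0, & \text{for } x \notin [0, 10] \\ 0.1, & \text{for } x \in [0, 10] \end{cases}$$
For $x < 0$:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{-\infty}^{x} 0\,dt = 0$$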
49
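For $x \in [0, 10]$:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \underbrace{\int_{-\infty}^{0} 0\,dt}_{=0} + \int_{0}^{x} 0.1\,dt = [0.1 \cdot t]_0^x = 0.1 \cdot x - 0.1 \cdot 0 = 0.1 \cdot x$$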
50
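For $x > 10$:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \underbrace{\int_{-\infty}^{0} 0\,dt}_{=0} + \underbrace{\int_{0}^{10} 0.1\,dt}_{=1} + \underbrace{\int_{10}^{x} 0\,dt}_{=0} = 1$$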
51
Now:
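Interval probabilities, i.e. (for $a, b \in \mathbb{R}$, $a < b$)
$P(X \in (a, b]) = P(a < X \le b)$
We have
$$\begin{aligned} P(a < X \le b) &= P(\{\omega \mid a < X(\omega) \le b\}) \\ &= P(\{\omega \mid X(\omega) > a\} \cap \{\omega \mid X(\omega) \le b\}) \\ &= 1 - P(\overline{\{\omega \mid X(\omega) > a\} \cap \{\omega \mid X(\omega) \le b\}}) \\ &= 1 - P(\overline{\{\omega \mid X(\omega) > a\}} \cup \overline{\{\omega \mid X(\omega) \le b\}}) \\ &= 1 - P(\{\omega \mid X(\omega) \le a\} \cup \{\omega \mid X(\omega) > b\}) \end{aligned}$$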
52
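$$\begin{aligned} &= 1 - [P(X \le a) + P(X > b)] \\ &= 1 - [F_X(a) + (1 - P(X \le b))] \\ &= 1 - [F_X(a) + 1 - F_X(b)] \\ &= F_X(b) - F_X(a) \\ &= \int_{-\infty}^{b} f_X(t)\,dt - \int_{-\infty}^{a} f_X(t)\,dt \\ &= \int_{a}^{b} f_X(t)\,dt \end{aligned}$$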
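The identity $P(a < X \le b) = F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt$ can be checked numerically. The sketch below assumes the pdf that is 0.1 on $[0, 10]$ and 0 elsewhere; the interval limits `a`, `b` and the helper `integrate` (a plain midpoint rule) are illustrative choices, not taken from the slides.

```python
def pdf(x):
    """pdf assumed for this sketch: 0.1 on [0, 10], 0 elsewhere."""
    return 0.1 if 0.0 <= x <= 10.0 else 0.0

def cdf(x):
    """Closed-form cdf obtained by integrating the pdf piecewise."""
    if x < 0.0:
        return 0.0
    if x <= 10.0:
        return 0.1 * x
    return 1.0

def integrate(f, a, b, steps=100_000):
    """Midpoint rule, used here as a numerical stand-in for the integral."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

a, b = 2.0, 7.5  # hypothetical interval limits
print(cdf(b) - cdf(a))       # interval probability via the cdf
print(integrate(pdf, a, b))  # the same probability via the pdf
```

Both printed values should agree up to numerical error, which mirrors the last two lines of the derivation.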
53
[Figure: $P(a < X \le b)$ as the area under the pdf between $a$ and $b$]
54
for all $a \in \mathbb{R}$
Proof:
$$P(X = a) = \lim_{b \to a} P(a < X \le b) = \lim_{b \to a} \int_{a}^{b} f_X(x)\,dx = \int_{a}^{a} f_X(x)\,dx = 0$$
Conclusion:
The probability that a continuous random variable X takes
on a single explicit value is always zero
55
[Figure: pdf $f_X(x)$ with the points $b_1$, $b_2$, $b_3$ marked on the x-axis]
56
Notice:
This does not imply that the event {X = a} cannot occur
Consequence:
Since for continuous random variables we always have $P(X = a) = 0$ for all $a \in \mathbb{R}$, it follows that
$$P(a < X < b) = P(a \le X < b) = P(a \le X \le b) = P(a < X \le b) = F_X(b) - F_X(a)$$
(when computing interval probabilities for continuous rvs, it
does not matter if the interval is open or closed)
57
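$$E(X) = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} x_j \cdot P(X = x_j), & \text{if X is discrete} \\ \int_{-\infty}^{+\infty} x \cdot f_X(x)\,dx, & \text{if X is continuous} \end{cases}.$$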
58
Remarks:
The expectation of the random variable X is approximately
equal to the sum of all realizations each weighted by the
probability of its occurrence
Instead of $E(X)$ we often write $\mu_X$
There exist random variables that do not have an expectation
(see class)
59
60
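$$f_X(x) = \begin{cases} P(X = 0) = 6/36 & \text{for } x = 0 \\ P(X = 1) = 10/36 & \text{for } x = 1 \\ P(X = 2) = 8/36 & \text{for } x = 2 \\ P(X = 3) = 6/36 & \text{for } x = 3 \\ P(X = 4) = 4/36 & \text{for } x = 4 \\ P(X = 5) = 2/36 & \text{for } x = 5 \\ 0 & \text{for } x \notin \mathrm{supp}(X) \end{cases}$$
This gives
$$E(X) = 0 \cdot \frac{6}{36} + 1 \cdot \frac{10}{36} + 2 \cdot \frac{8}{36} + 3 \cdot \frac{6}{36} + 4 \cdot \frac{4}{36} + 5 \cdot \frac{2}{36} = \frac{70}{36} = 1.9444$$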
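The weighted sum for the discrete expectation can be evaluated exactly. This is a minimal sketch using the probabilities 6/36, 10/36, 8/36, 6/36, 4/36, 2/36 for $x = 0, \ldots, 5$; the helper `expectation` and the variable names are illustrative assumptions.

```python
from fractions import Fraction

# Discrete density: P(X = x) for x = 0, ..., 5.
pmf = {0: Fraction(6, 36), 1: Fraction(10, 36), 2: Fraction(8, 36),
       3: Fraction(6, 36), 4: Fraction(4, 36), 5: Fraction(2, 36)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x_j) * P(X = x_j) over the support."""
    return sum(g(x) * p for x, p in pmf.items())

total = sum(pmf.values())                           # should be 1
mean = expectation(pmf)                             # E(X)
second_moment = expectation(pmf, lambda x: x ** 2)  # E(X^2)
variance = second_moment - mean ** 2                # E(X^2) - [E(X)]^2

print(total, mean, float(mean), variance)
```

Passing a function `g` to the same helper also covers expectations of transformed variables $g(X)$, such as the second moment used for the variance here.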
61
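$$f_X(x) = \begin{cases} \frac{x}{4}, & \text{for } 1 \le x \le 3 \\ 0, & \text{elsewise} \end{cases}$$
$$E(X) = \int_{-\infty}^{+\infty} x \cdot f_X(x)\,dx = \int_{-\infty}^{1} 0\,dx + \int_{1}^{3} x \cdot \frac{x}{4}\,dx + \int_{3}^{+\infty} 0\,dx$$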
62
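$$= \int_{1}^{3} \frac{x^2}{4}\,dx = \frac{1}{4} \cdot \left[\frac{1}{3} x^3\right]_1^3 = \frac{1}{4} \cdot \left(\frac{27}{3} - \frac{1}{3}\right) = \frac{26}{12} = 2.1667$$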
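The continuous expectation can be cross-checked by numerical integration. The sketch assumes the pdf $f_X(x) = x/4$ on $[1, 3]$ (0 elsewhere); `integrate` is a simple midpoint rule chosen for illustration, not a method prescribed by the course.

```python
def pdf(x):
    """pdf assumed for this sketch: x/4 on [1, 3], 0 elsewhere."""
    return x / 4.0 if 1.0 <= x <= 3.0 else 0.0

def integrate(f, a, b, steps=200_000):
    """Midpoint rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# The pdf vanishes outside [1, 3], so integrating over [1, 3] suffices.
total = integrate(pdf, 1.0, 3.0)                      # should be close to 1
mean = integrate(lambda x: x * pdf(x), 1.0, 3.0)      # E(X)

print(total, mean)
```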
Frequently:
Random variable X plus discrete density or pdf fX is known
We have to find the expectation of the transformed random
variable
Y = g(X)
63
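$$E(Y) = E[g(X)] = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} g(x_j) \cdot P(X = x_j), & \text{if X is discrete} \\ \int_{-\infty}^{+\infty} g(x) \cdot f_X(x)\,dx, & \text{if X is continuous} \end{cases}.$$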
64
Remarks:
All functions considered in this course are Baire-functions
For the special case g(x) = x (the identity function) Theorem
2.15 coincides with Definition 2.14
Next:
Some important rules for calculating expected values
65
$E[g_1(X)] \le E[g_2(X)]$.
Proof: Class
66
Now:
Consider the random variable X (discrete or continuous) and
the explicit function $g(x) = [x - E(X)]^2$
$\Longrightarrow$ variance and standard deviation of X
Definition 2.17: (Variance, standard deviation)
For any random variable X the variance, denoted by Var(X), is
defined as the expected quadratic distance between X and its
expectation E(X); that is
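$\mathrm{Var}(X) = E[(X - E(X))^2]$.
The standard deviation of X is the positive square root of the variance:
$SD(X) = +\sqrt{\mathrm{Var}(X)}$.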
67
Remark:
Setting $g(X) = [X - E(X)]^2$ in Theorem 2.15 (on slide 64) yields the following explicit formulas for discrete and continuous random variables:
$$\mathrm{Var}(X) = E[g(X)] = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} [x_j - E(X)]^2 \cdot P(X = x_j) \\ \int_{-\infty}^{+\infty} [x - E(X)]^2 \cdot f_X(x)\,dx \end{cases}$$
68
69
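$$P[g(X) \ge k] \le \frac{E[g(X)]}{k}.$$
Special case:
Consider
$g(x) = [x - E(X)]^2$ and $k = r^2 \cdot \mathrm{Var}(X)$ $(r > 0)$
$$P\{[X - E(X)]^2 \ge r^2 \cdot \mathrm{Var}(X)\} \le \frac{\mathrm{Var}(X)}{r^2 \cdot \mathrm{Var}(X)} = \frac{1}{r^2}$$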
71
Now:
$$P\{[X - E(X)]^2 \ge r^2 \cdot \mathrm{Var}(X)\} = P\{|X - E(X)| \ge r \cdot SD(X)\}$$
It follows that
$$P\{|X - E(X)| < r \cdot SD(X)\} \ge 1 - \frac{1}{r^2}$$
(specific Chebyshev inequality)
72
Remarks:
The specific Chebyshev inequality provides a minimal probability of the event that any arbitrary random variable X takes on a value from the following interval:
$[E(X) - r \cdot SD(X),\; E(X) + r \cdot SD(X)]$
For example, for $r = 3$ we have
$$P\{|X - E(X)| < 3 \cdot SD(X)\} \ge 1 - \frac{1}{3^2} = \frac{8}{9}$$
which is equivalent to
$P\{E(X) - 3 \cdot SD(X) < X < E(X) + 3 \cdot SD(X)\} \ge 0.8889$
or
$P\{X \in (E(X) - 3 \cdot SD(X),\; E(X) + 3 \cdot SD(X))\} \ge 0.8889$
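The Chebyshev bound can be compared with exact probabilities for a concrete distribution. The sketch below uses, purely as an illustration, the "Number of Heads in three tosses" probabilities (0.125, 0.375, 0.375, 0.125); the function names and the chosen values of `r` are assumptions.

```python
import math

# Illustrative discrete distribution (three fair coin tosses).
pmf = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
sd = math.sqrt(var)

def prob_within(r):
    """Exact P(|X - E(X)| < r * SD(X)) for the distribution above."""
    return sum(p for x, p in pmf.items() if abs(x - mean) < r * sd)

def chebyshev_bound(r):
    """Lower bound 1 - 1/r^2 from the specific Chebyshev inequality."""
    return 1.0 - 1.0 / r ** 2

for r in (1.5, 2.0, 3.0):
    print(r, prob_within(r), chebyshev_bound(r))
```

The exact probabilities are typically well above the bound, which reflects that the inequality holds for every distribution with finite variance and is therefore conservative.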
73
$E[g(X)] \ge g(E[X])$.
Remarks:
If the function g is concave (i.e. if $g''(x) \le 0$ for all x) then Jensen's inequality states that $E[g(X)] \le g(E[X])$
Notice that in general we have
$E[g(X)] \ne g(E[X])$
74
Example:
Consider the random variable X and the function $g(x) = x^2$
We have $g''(x) = 2 \ge 0$ for all x, i.e. g is convex
It follows from Jensen's inequality that
$$\underbrace{E[g(X)]}_{=E(X^2)} \ge \underbrace{g(E[X])}_{=[E(X)]^2}$$
i.e.
$E(X^2) - [E(X)]^2 \ge 0$
This implies
$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 \ge 0$
(the variance of an arbitrary rv cannot be negative)
75
Now:
Consider the random variable X with expectation $E(X) = \mu_X$, the integer number $n \in \mathbb{N}$ and the functions
$g_1(x) = x^n$
$g_2(x) = [x - \mu_X]^n$
Relations:
$\mu'_1 = E(X) = \mu_X$
(the 1st moment coincides with E(X))
$\mu_1 = E[X - \mu_X] = E(X) - \mu_X = 0$
(the 1st central moment is always equal to 0)
$\mu_2 = E[(X - \mu_X)^2] = \mathrm{Var}(X)$
(the 2nd central moment coincides with Var(X))
77
Remarks:
The first four moments of a random variable X are important
measures of the probability distribution
(expectation, variance, skewness, kurtosis)
The moments of a random variable X play an important role
in theoretical and applied statistics
In some cases, when all moments are known, the cdf of a
random variable X can be determined
78
Question:
Can we find a function that gives us a representation of all
moments of a random variable X?
$$m_X(t) = E\left[e^{tX}\right].$$
79
Remarks:
The moment generating function mX (t) is a function in t
There are rvs X for which mX (t) does not exist
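If $m_X(t)$ exists it can be calculated as
$$m_X(t) = E\left[e^{tX}\right] = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} e^{t x_j} \cdot P(X = x_j), & \text{if X is discrete} \\ \int_{-\infty}^{+\infty} e^{tx} \cdot f_X(x)\,dx, & \text{if X is continuous} \end{cases}$$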
80
Question:
Why is mX (t) called the moment generating function?
Answer:
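Consider the nth derivative of $m_X(t)$ with respect to t:
$$\frac{d^n}{dt^n} m_X(t) = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} (x_j)^n \cdot e^{t x_j} \cdot P(X = x_j) & \text{for discrete X} \\ \int_{-\infty}^{+\infty} x^n \cdot e^{tx} \cdot f_X(x)\,dx & \text{for continuous X} \end{cases}$$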
81
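$$\frac{d^n}{dt^n} m_X(0) = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} (x_j)^n \cdot P(X = x_j) & \text{for discrete X} \\ \int_{-\infty}^{+\infty} x^n \cdot f_X(x)\,dx & \text{for continuous X} \end{cases} = E(X^n) = \mu'_n$$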
(see Definition 2.21(a) on slide 76)
82
Example:
Let X be a continuous random variable with pdf
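$$f_X(x) = \begin{cases} 0, & \text{for } x < 0 \\ \lambda e^{-\lambda x}, & \text{for } x \ge 0 \end{cases}$$
$$m_X(t) = E\left[e^{tX}\right] = \int_{-\infty}^{+\infty} e^{tx} \cdot f_X(x)\,dx = \int_{0}^{+\infty} \lambda e^{(t - \lambda)x}\,dx = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda$$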
83
It follows that
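$$m'_X(t) = \frac{\lambda}{(\lambda - t)^2} \quad \text{and thus} \quad m'_X(0) = E(X) = \frac{1}{\lambda}$$
and
$$m''_X(t) = \frac{2\lambda}{(\lambda - t)^3} \quad \text{and thus} \quad m''_X(0) = E(X^2) = \frac{2}{\lambda^2}$$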
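The statement that derivatives of the moment generating function at $t = 0$ yield the moments can be checked with finite differences. This sketch assumes the exponential-type mgf $m_X(t) = \lambda/(\lambda - t)$ for $t < \lambda$ with a hypothetical rate `lam = 2.0`; the difference quotients are a numerical illustration, not the analytical derivation.

```python
lam = 2.0  # hypothetical parameter value, chosen only for illustration

def mgf(t):
    """m_X(t) = lam / (lam - t), defined only for t < lam."""
    if t >= lam:
        raise ValueError("mgf is not defined for t >= lam")
    return lam / (lam - t)

def first_derivative(f, t, h=1e-3):
    """Central difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def second_derivative(f, t, h=1e-3):
    """Central difference approximation of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h ** 2

print(first_derivative(mgf, 0.0), 1.0 / lam)          # approximates E(X)
print(second_derivative(mgf, 0.0), 2.0 / lam ** 2)    # approximates E(X^2)
```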
Now:
Important result on moment generating functions
84
Example:
Suppose that a random variable X has the moment generating function
$$m_X(t) = \frac{1}{1 - t}$$
$$f_X(x) = \begin{cases} 0, & \text{for } x < 0 \\ e^{-x}, & \text{for } x \ge 0 \end{cases}$$
86
Central result:
The distribution of a random variable X is (essentially) determined by fX (x) or FX (x)
FX (x) can be determined by fX (x)
(cf. slide 46)
fX (x) can be determined by FX (x)
(cf. slide 48)
Question:
How many different distributions are known to exist?
88
Answer:
Infinitely many
But:
In practice, there are some important parametric families of distributions that provide good models for representing real-world random phenomena
These families of distributions are described in detail in all
textbooks on mathematical statistics
(see e.g. Mosler & Schmid (2008), Mood et al. (1974))
89
90
Remark:
The most important family of distributions of all is the normal distribution
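$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in \mathbb{R}.$$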
91
[Figure: pdfs $f_X(x)$ of the normal distributions N(0,1), N(5,1), N(5,3) and N(5,5)]
92
Remarks:
The special normal distribution $N(0, 1)$ is called the standard normal distribution, the pdf of which is denoted by $\varphi(x)$
The properties as well as calculation rules for normally distributed random variables are important pre-conditions for
this course
(see Wilfling (2011), Section 3.4)
93
95
for i = 1, . . . , n.
Then $\mathbf{X} = (X_1, \ldots, X_n)'$ is called an n-dimensional random variable or an n-dimensional random vector.
Remark:
In the literature random vectors are often denoted by
$\mathbf{X} = (X_1, \ldots, X_n)$
or more simply by
$X_1, \ldots, X_n$
96
or
(X, Y )
or
X, Y
or
$\mathbf{x} = (x, y)' \in \mathbb{R}^2$
Now:
Characterization of the probability distribution of the random
vector X
97
$y \to +\infty$
Remark:
Analogous properties hold for the n-dimensional cdf $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$
99
Now:
Joint discrete versus joint continuous random vectors
Definition 3.3: (Joint discrete random vector)
The random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is defined to be a joint discrete random vector if it can assume only a finite (or a countably infinite) number of realizations $\mathbf{x} = (x_1, \ldots, x_n)'$ such that
$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) > 0$
and
$\sum \ldots$
$\int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} \ldots$
Example:
Consider X = (X, Y )0 with joint pdf
$$f_{X,Y}(x, y) = \begin{cases} x + y, & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases}$$
[Figure: surface plot of $f(x, y)$ for $x, y \in [0, 1]$]
102
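$$F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv = \int_{0}^{y} \int_{0}^{x} (u + v)\,du\,dv = \ldots$$
$$= \begin{cases} 0, & \text{for } x < 0 \text{ or } y < 0 \\ 0.5(x^2 y + x y^2), & \text{for } x \in [0, 1],\; y \in [0, 1] \\ 0.5(x^2 + x), & \text{for } x \in [0, 1],\; y > 1 \\ 0.5(y^2 + y), & \text{for } x > 1,\; y \in [0, 1] \\ 1, & \text{for } x > 1,\; y > 1 \end{cases}$$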
(Proof: Class)
103
Remarks:
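If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a joint continuous random vector, then
$$\frac{\partial^n F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n} = f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$$
$$\int_{a_n^u}^{a_n^o} \cdots \int_{a_1^u}^{a_1^o} \ldots$$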
104
In this course:
Emphasis on joint continuous random vectors
Analogous results for joint discrete random vectors
(see Mood, Graybill, Boes (1974), Chapter IV)
Now:
Determination of the distribution of a single random variable $X_i$ from the joint distribution of the random vector $(X_1, \ldots, X_n)'$
$\Longrightarrow$ marginal distribution
105
106
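$$f_{X_1}(x_1) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_2 \cdots dx_n$$
$$f_{X_2}(x_2) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1\,dx_3 \cdots dx_n$$
$$\vdots$$
$$f_{X_n}(x_n) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_{n-1}$$
are called marginal pdfs of the one-dimensional (univariate) random variables $X_1, \ldots, X_n$.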
107
Example:
Consider the bivariate pdf
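$$f_{X,Y}(x, y) = \begin{cases} 40(x - 0.5)^2 y^3 (3 - 2x - y), & \text{for } x, y \in [0, 1] \\ 0, & \text{elsewise} \end{cases}$$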
108
[Figure: surface plot of $f(x, y)$ for $x, y \in [0, 1]$]
109
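$$f_X(x) = \int_{0}^{1} 40(x - 0.5)^2 y^3 (3 - 2x - y)\,dy = 40(x - 0.5)^2 \int_{0}^{1} (3y^3 - 2xy^3 - y^4)\,dy$$
$$= 40(x - 0.5)^2 \left[\frac{3 - 2x}{4} y^4 - \frac{1}{5} y^5\right]_0^1 = 40(x - 0.5)^2 \left(\frac{3}{4} - \frac{2x}{4} - \frac{1}{5}\right)$$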
[Figure: marginal pdf $f(x)$ plotted for $x \in [0, 1]$]
111
$$f_Y(y) = \int_{0}^{1} 40(x - 0.5)^2 y^3 (3 - 2x - y)\,dx = 40 y^3 \int_{0}^{1} (x - 0.5)^2 (3 - 2x - y)\,dx = -\frac{10}{3} y^3 (y - 2)$$
112
[Figure: marginal pdf $f(y)$ plotted for $y \in [0, 1]$]
113
Remarks:
When considering the marginal instead of the joint distributions, we are faced with an information loss
(the joint distribution uniquely determines all marginal distributions, but the converse does not hold in general)
Besides the respective univariate marginal distributions, there
are also multivariate distributions which can be obtained from
the joint distribution of $\mathbf{X} = (X_1, \ldots, X_n)'$
114
Example:
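For $n = 5$ consider $\mathbf{X} = (X_1, \ldots, X_5)'$ with joint pdf $f_{X_1,\ldots,X_5}$
Then the marginal pdf of $\mathbf{Z} = (X_1, X_3, X_5)'$ obtains as
$$f_{X_1,X_3,X_5}(x_1, x_3, x_5) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_5}(x_1, x_2, x_3, x_4, x_5)\,dx_2\,dx_4$$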
115
116
117
Remark:
Conditional densities of random vectors are defined analogously, e.g.
$$f_{X_1,X_2,X_4 \mid X_3 = x_3, X_5 = x_5}(x_1, x_2, x_4) = \frac{f_{X_1,X_2,X_3,X_4,X_5}(x_1, x_2, x_3, x_4, x_5)}{f_{X_3,X_5}(x_3, x_5)}$$
118
Example:
Consider the bivariate pdf
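$$f_{X,Y}(x, y) = \begin{cases} 40(x - 0.5)^2 y^3 (3 - 2x - y), & \text{for } x, y \in [0, 1] \\ 0, & \text{elsewise} \end{cases}$$
$$f_Y(y) = -\frac{10}{3} y^3 (y - 2)$$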
119
It follows that
$$f_{X \mid Y = y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{40(x - 0.5)^2 y^3 (3 - 2x - y)}{-\frac{10}{3} y^3 (y - 2)} = \frac{12(x - 0.5)^2 (3 - 2x - y)}{2 - y}$$
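The marginal and conditional densities of this example can be verified numerically. The sketch assumes the joint pdf $40(x - 0.5)^2 y^3 (3 - 2x - y)$ on the unit square (0 elsewhere); the evaluation point `y0` and the midpoint-rule helper `integrate` are hypothetical choices for illustration.

```python
def joint_pdf(x, y):
    """Joint pdf assumed on the unit square, 0 elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 40.0 * (x - 0.5) ** 2 * y ** 3 * (3.0 - 2.0 * x - y)
    return 0.0

def marginal_y(y):
    """Closed-form marginal pdf of Y: -(10/3) * y^3 * (y - 2)."""
    return -(10.0 / 3.0) * y ** 3 * (y - 2.0)

def conditional_x_given_y(x, y):
    """Closed-form conditional pdf of X given Y = y."""
    return 12.0 * (x - 0.5) ** 2 * (3.0 - 2.0 * x - y) / (2.0 - y)

def integrate(f, a, b, steps=20_000):
    """Midpoint rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

y0 = 0.7  # hypothetical conditioning value
numeric_marginal = integrate(lambda x: joint_pdf(x, y0), 0.0, 1.0)
conditional_mass = integrate(lambda x: conditional_x_given_y(x, y0), 0.0, 1.0)

print(numeric_marginal, marginal_y(y0))  # should agree
print(conditional_mass)                  # should be close to 1
```

Integrating the joint pdf over $x$ recovers the marginal of $Y$, and the conditional density integrates to one for the fixed value of $y$.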
120
[Figure: conditional density $f_{X \mid Y = y}(x)$ plotted for $x \in [0, 1]$]
121
[Figure: conditional density $f_{X \mid Y = y}(x)$ plotted for $x \in [0, 1]$, second panel]
122
Now:
Combine the concepts joint distribution and conditional
distribution to define the notion stochastic independence
(for two random variables first)
for all $x, y \in \mathbb{R}$.
123
Remarks:
Alternatively, stochastic independence can be defined via the
cdfs:
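X and Y are stochastically independent, if and only if
$F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$
for all $x, y \in \mathbb{R}$.
If X and Y are independent, the conditional densities coincide with the marginal densities:
$$f_{X \mid Y = y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x) \cdot f_Y(y)}{f_Y(y)} = f_X(x)$$
$$f_{Y \mid X = x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{f_X(x) \cdot f_Y(y)}{f_X(x)} = f_Y(y)$$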
Now:
Extension to n random variables
Definition 3.8: (Stochastic independence [II])
Let $(X_1, \ldots, X_n)'$ be a continuous random vector with joint pdf $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ and joint cdf $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$. $X_1, \ldots, X_n$ are defined to be stochastically independent, if and only if for all $(x_1, \ldots, x_n)' \in \mathbb{R}^n$
$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \cdot \ldots \cdot f_{X_n}(x_n)$
or
$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n)$.
125
Remarks:
For discrete random vectors we define: $X_1, \ldots, X_n$ are stochastically independent, if and only if for all $(x_1, \ldots, x_n)' \in \mathbb{R}^n$
$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdot \ldots \cdot P(X_n = x_n)$
or
$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n)$
In the case of independence, the joint distribution results
from the marginal distributions
If X1, . . . , Xn are stochastically independent and g1, . . . , gn are
continuous functions, then Y1 = g1(X1), . . . , Yn = gn(Xn) are
also stochastically independent
126
127
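$$E[g(X_1, \ldots, X_n)] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \ldots, x_n) \cdot f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$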
128
Remarks:
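For a discrete random vector $(X_1, \ldots, X_n)'$ the analogous definition is
$$E[g(X_1, \ldots, X_n)] = \sum g(x_1, \ldots, x_n) \cdot P(X_1 = x_1, \ldots, X_n = x_n)$$
(summation over all realizations with positive probability)
$$\int_{-\infty}^{+\infty} x f_X(x)\,dx$$
$$\int_{-\infty}^{+\infty} [x - E(X)]^2 f_X(x)\,dx$$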
129
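$$\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \ldots$$
$$\mathrm{Corr}(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)} \sqrt{\mathrm{Var}(X_2)}}$$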
Now:
Expectation and variances of random vectors
Definition 3.10: (Expected vector, covariance matrix)
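Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be a random vector. The expected vector of $\mathbf{X}$ is defined to be
$$E(\mathbf{X}) = \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix}.$$
The covariance matrix of $\mathbf{X}$ is defined to be
$$\mathrm{Cov}(\mathbf{X}) = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Var}(X_n) \end{pmatrix}.$$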
131
Remark:
Obviously, the covariance matrix is symmetric by definition
Now:
Expected vectors and covariance matrices under linear transformations of random vectors
Let
$\mathbf{X} = (X_1, \ldots, X_n)'$ be an n-dimensional random vector
$\mathbf{A}$ be an $(m \times n)$ matrix of real numbers
$\mathbf{b}$ be an $(m \times 1)$ column vector of real numbers
132
Obviously:
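$\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$ is an $(m \times 1)$ random vector:
$$\mathbf{Y} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$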
133
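$$E(\mathbf{Y}) = E(\mathbf{A}\mathbf{X} + \mathbf{b}) = \mathbf{A} E(\mathbf{X}) + \mathbf{b}$$
The covariance matrix of $\mathbf{Y}$ is given by
$$\mathrm{Cov}(\mathbf{Y}) = \begin{pmatrix} \mathrm{Var}(Y_1) & \mathrm{Cov}(Y_1, Y_2) & \cdots & \mathrm{Cov}(Y_1, Y_m) \\ \mathrm{Cov}(Y_2, Y_1) & \mathrm{Var}(Y_2) & \cdots & \mathrm{Cov}(Y_2, Y_m) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(Y_m, Y_1) & \mathrm{Cov}(Y_m, Y_2) & \cdots & \mathrm{Var}(Y_m) \end{pmatrix} = \mathbf{A}\,\mathrm{Cov}(\mathbf{X})\,\mathbf{A}'$$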
(Proof: Class)
134
Remark:
Cf. the analogous results for univariate variables:
$E(a \cdot X + b) = a \cdot E(X) + b$
$\mathrm{Var}(a \cdot X + b) = a^2 \cdot \mathrm{Var}(X)$
Up to now:
Expected values for unconditional distributions
Now:
Expected values for conditional distributions
(cf. Definition 3.6, Slide 117)
135
$$E[g(X, Y) \mid X = x] = \int_{-\infty}^{+\infty} g(x, y) \cdot f_{Y \mid X = x}(y)\,dy$$
136
Remarks:
An analogous definition applies to a discrete random vector
(X, Y )0
Definition 3.11 naturally extends to higher-dimensional distributions
For g(x, y) = y we obtain the special case E[g(X, Y )|X = x] =
E(Y |X = x)
Note that E[g(X, Y )|X = x] is a function of x
137
Example:
Consider the joint pdf
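$$f_{X,Y}(x, y) = \begin{cases} x + y, & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases}$$
$$f_{Y \mid X = x}(y) = \frac{x + y}{x + 0.5}$$
$$E(Y \mid X = x) = \int_{0}^{1} y \cdot \frac{x + y}{x + 0.5}\,dy = \frac{1}{x + 0.5} \cdot \left(\frac{x}{2} + \frac{1}{3}\right)$$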
138
Remarks:
Consider the function g(x, y) = g(y)
(i.e. g does not depend on x)
Denote h(x) = E[g(Y )|X = x]
We calculate the unconditional expectation of the transformed variable h(X)
We have
139
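$$\begin{aligned} E[h(X)] &= \int_{-\infty}^{+\infty} h(x) \cdot f_X(x)\,dx \\ &= \int_{-\infty}^{+\infty} \left[\int_{-\infty}^{+\infty} g(y) \cdot f_{Y \mid X = x}(y)\,dy\right] f_X(x)\,dx \\ &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(y) \cdot f_{Y \mid X = x}(y) \cdot f_X(x)\,dy\,dx \\ &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(y) \cdot f_{X,Y}(x, y)\,dy\,dx \\ &= E[g(Y)] \end{aligned}$$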
140
Theorem 3.12:
Let (X, Y )0 be an arbitrary discrete or continuous random vector.
Then
141
Theorem 3.13:
Let $(X, Y)'$ be an arbitrary discrete or continuous random vector and $g_1(\cdot)$, $g_2(\cdot)$ two unidimensional functions. Then
1. $E[g_1(Y) + g_2(Y) \mid X = x] = E[g_1(Y) \mid X = x] + E[g_2(Y) \mid X = x]$,
2. $E[g_1(Y) \cdot g_2(X) \mid X = x] = g_2(x) \cdot E[g_1(Y) \mid X = x]$.
3. If X and Y are stochastically independent we have
142
Finally:
Moment generating function for random vectors
Definition 3.14: (Joint moment generating function)
Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an arbitrary discrete or continuous random vector. The joint moment generating function of $\mathbf{X}$ is defined to be
$$m_{X_1,\ldots,X_n}(t_1, \ldots, t_n) = E\left[e^{t_1 X_1 + \ldots + t_n X_n}\right]$$
if this expectation exists for all $t_1, \ldots, t_n$ with $-h < t_j < h$ for an arbitrary value $h > 0$ and for all $j = 1, \ldots, n$.
143
Remarks:
Via the joint moment generating function mX1,...,Xn (t1, . . . , tn)
we can derive the following mathematical objects:
the marginal moment generating functions mX1 (t1), . . . ,
mXn (tn)
the moments of the marginal distributions
the so-called joint moments
144
145
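$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{pmatrix},$$
if for $\mathbf{x} = (x_1, \ldots, x_n)' \in \mathbb{R}^n$ its joint pdf is given by
$$f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-n/2} \, [\det(\boldsymbol{\Sigma})]^{-1/2} \cdot \exp\left\{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}.$$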
146
Remarks:
See Chiang (1984, p. 92) for a definition and the properties
of the determinant det(A) of the matrix A
Notation:
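$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
$\boldsymbol{\mu}$ is a column vector with $\mu_1, \ldots, \mu_n \in \mathbb{R}$
$\boldsymbol{\Sigma}$ is a regular, positive definite, symmetric $(n \times n)$ matrix
Role of the parameters:
$E(\mathbf{X}) = \boldsymbol{\mu}$ and $\mathrm{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}$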
147
$$\varphi(\mathbf{x}) = (2\pi)^{-n/2} \cdot \exp\left\{-\frac{1}{2} \mathbf{x}' \mathbf{x}\right\}$$
Cf. the analogy to the univariate pdf in Definition 2.24, Slide 91
Properties of the N (, ) distribution:
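Partial vectors (marginal distributions) of $\mathbf{X}$ also have multivariate normal distributions, i.e. if
$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}\right)$$
then
$\mathbf{X}_1 \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$
$\mathbf{X}_2 \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$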
148
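Thus, all univariate variables of $\mathbf{X} = (X_1, \ldots, X_n)'$ have univariate normal distributions:
$X_1 \sim N(\mu_1, \sigma_1^2)$
$X_2 \sim N(\mu_2, \sigma_2^2)$
...
$X_n \sim N(\mu_n, \sigma_n^2)$
The conditional distributions are also (univariately or multivariately) normal:
$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim N\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}\right)$$
Linear transformations:
Let $\mathbf{A}$ be an $(m \times n)$ matrix, $\mathbf{b}$ an $(m \times 1)$ vector of real numbers and $\mathbf{X} = (X_1, \ldots, X_n)' \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then
$\mathbf{A}\mathbf{X} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$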
149
Example:
Consider
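$$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim N\left(\begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2 \end{pmatrix}\right)$$
and
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$
Then
$$\mathbf{A}\boldsymbol{\mu} + \mathbf{b} = \begin{pmatrix} 3 \\ 6 \end{pmatrix} \quad \text{and} \quad \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' = \begin{pmatrix} 11 & 24 \\ 24 & 53 \end{pmatrix}$$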
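The parameters of the transformed vector $\mathbf{A}\mathbf{X} + \mathbf{b}$ can be recomputed with a few lines of plain matrix arithmetic. The sketch uses the example values $\boldsymbol{\mu} = (0, 1)'$, $\boldsymbol{\Sigma}$ with entries 1, 0.5, 0.5, 2, $\mathbf{A}$ with rows (1, 2) and (3, 4), and $\mathbf{b} = (1, 2)'$; the helpers `matmul` and `transpose` are illustrative, not part of any library.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*M)]

mu = [[0.0], [1.0]]                 # expected vector of X (column vector)
sigma = [[1.0, 0.5], [0.5, 2.0]]    # covariance matrix of X
A = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0], [2.0]]

# E(Y) = A mu + b
mean_y = [[m[0] + c[0]] for m, c in zip(matmul(A, mu), b)]
# Cov(Y) = A Sigma A'
cov_y = matmul(matmul(A, sigma), transpose(A))

print(mean_y)
print(cov_y)
```

The printed mean vector and covariance matrix can be compared entry by entry with the values stated in the example.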
150
Now:
Consider the bivariate case (n = 2), i.e.
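$$\mathbf{X} = (X, Y)', \quad E(\mathbf{X}) = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \quad \mathrm{Cov}(\mathbf{X}) = \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}$$
We have
$\sigma_{XY} = \sigma_{YX} = \mathrm{Cov}(X, Y) = \sigma_X \cdot \sigma_Y \cdot \mathrm{Corr}(X, Y) = \sigma_X \cdot \sigma_Y \cdot \rho$
The joint pdf follows from Definition 3.15 with $n = 2$
$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)} \left[\frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2}\right]\right\}$$
(Derivation: Class)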
151
[Figure: surface plot of the bivariate normal pdf $f(x, y)$ for $x, y \in [-2, 2]$]
152
[Figure: surface plot of the bivariate normal pdf $f(x, y)$ for $x, y \in [-2, 2]$, second parameter setting]
153
Remarks:
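The marginal distributions are given by
$X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$
$\Longrightarrow$ interesting result for the normal distribution:
$$X \mid Y = y \sim N\left(\mu_X + \rho \frac{\sigma_X}{\sigma_Y} (y - \mu_Y),\; \sigma_X^2 (1 - \rho^2)\right)$$
$$Y \mid X = x \sim N\left(\mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X),\; \sigma_Y^2 (1 - \rho^2)\right)$$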
(Proof: Class)
154
Example:
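Consider as given $X_1, \ldots, X_n$ with $f_{X_1,\ldots,X_n}$
Consider the functions
$$g_1(X_1, \ldots, X_n) = \sum_{i=1}^{n} X_i \quad \text{and} \quad g_2(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$$
$$Y_1 = \sum_{i=1}^{n} X_i \quad \text{and} \quad Y_2 = \frac{1}{n} \sum_{i=1}^{n} X_i$$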
Remark:
From the joint distribution $f_{Y_1,\ldots,Y_k}$ we can derive the k marginal distributions $f_{Y_1}, \ldots, f_{Y_k}$
(cf. Chapter 3, Slides 106, 107)
156
157
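$$E(Y) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \ldots, x_n) \cdot f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$
or
$$E(Y) = \int_{-\infty}^{+\infty} y \cdot f_Y(y)\,dy$$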
Now:
Calculation rules for expected values, variances, covariances
of sums of random variables
Setting:
$X_1, \ldots, X_n$ are given continuous or discrete random variables with joint density $f_{X_1,\ldots,X_n}$
The (transforming) function $g : \mathbb{R}^n \to \mathbb{R}$ is given by
$$g(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i$$
160
$$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i)$$
and
$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathrm{Cov}(X_i, X_j).$$
161
Implications:
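For given constants $a_1, \ldots, a_n \in \mathbb{R}$ we have
$$E\left(\sum_{i=1}^{n} a_i \cdot X_i\right) = \sum_{i=1}^{n} a_i \cdot E(X_i)$$
(why?)
If $\mathrm{Cov}(X_i, X_j) = 0$ for all $i \ne j$ (e.g. under stochastic independence), we have
$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$$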
162
Now:
Calculating the covariance of two sums of random variables
Theorem 4.2: (Covariance of two sums)
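Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be two sets of random variables and let $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$ be two sets of constants. Then
$$\mathrm{Cov}\left(\sum_{i=1}^{n} a_i \cdot X_i,\; \sum_{j=1}^{m} b_j \cdot Y_j\right) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i \cdot b_j \cdot \mathrm{Cov}(X_i, Y_j).$$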
163
Implications:
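The variance of a weighted sum of random variables is given by
$$\begin{aligned} \mathrm{Var}\left(\sum_{i=1}^{n} a_i \cdot X_i\right) &= \mathrm{Cov}\left(\sum_{i=1}^{n} a_i \cdot X_i,\; \sum_{j=1}^{n} a_j \cdot X_j\right) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} a_i \cdot a_j \cdot \mathrm{Cov}(X_i, X_j) \\ &= \sum_{i=1}^{n} a_i^2 \cdot \mathrm{Var}(X_i) + \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} a_i \cdot a_j \cdot \mathrm{Cov}(X_i, X_j) \\ &= \sum_{i=1}^{n} a_i^2 \cdot \mathrm{Var}(X_i) + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i \cdot a_j \cdot \mathrm{Cov}(X_i, X_j) \end{aligned}$$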
164
165
Setting:
Let $X_1, X_2$ be both continuous or both discrete random variables with joint density $f_{X_1,X_2}$
Let $g : \mathbb{R}^2 \to \mathbb{R}$ be defined as $g(x_1, x_2) = x_1 \cdot x_2$
Find the expectation of
$Y = g(X_1, X_2) = X_1 \cdot X_2$
Theorem 4.3: (Expectation of a product)
For the random variables X1, X2 we have
Implication:
If X1 and X2 are stochastically independent, we have
167
169
Example 1:
Consider $n = 1$ (i.e. consider $X_1 \equiv X$ with cdf $F_X$) and $k = 1$ (i.e. $g_1 \equiv g$ and $Y_1 \equiv Y$)
Consider the function
$g(x) = a \cdot x + b, \quad b \in \mathbb{R},\; a > 0$
170
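$$F_Y(y) = P(Y \le y) = P(a \cdot X + b \le y) = P\left(X \le \frac{y - b}{a}\right) = F_X\left(\frac{y - b}{a}\right)$$
$$f_Y(y) = F'_Y(y) = F'_X\left(\frac{y - b}{a}\right) \cdot \frac{1}{a} = \frac{1}{a} \cdot f_X\left(\frac{y - b}{a}\right)$$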
(cf. Slide 48)
171
Example 2:
Consider n = 1 and k = 1 and the function
$g(x) = e^x$
The cdf of $Y = g(X) = e^X$ is given by
$$F_Y(y) = P(Y \le y) = P(e^X \le y) = P[X \le \ln(y)] = F_X[\ln(y)]$$
If X is continuous, the pdf of Y is given by
$$f_Y(y) = F'_Y(y) = \frac{d}{dy} F_X[\ln(y)] = \frac{f_X[\ln(y)]}{y}$$
172
Now:
Consider n = 2 and k = 2, i.e. consider X1 and X2 with joint
density fX1,X2 (x1, x2)
Consider the functions
$g_1(x_1, x_2) = x_1 + x_2$ and $g_2(x_1, x_2) = x_1 - x_2$
173
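$$f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, y_1 - x_1)\,dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y_1 - x_2, x_2)\,dx_2$$
and
$$f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, x_1 - y_2)\,dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y_2 + x_2, x_2)\,dx_2$$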
174
Implication:
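If $X_1$ and $X_2$ are independent, then
$$f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} f_{X_1}(x_1) \cdot f_{X_2}(y_1 - x_1)\,dx_1$$
$$f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} f_{X_1}(x_1) \cdot f_{X_2}(x_1 - y_2)\,dx_1$$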
Example:
Let X1 and X2 be independent random variables both with
pdf
$$f_{X_1}(x) = f_{X_2}(x) = \begin{cases} 1, & \text{for } x \in [0, 1] \\ 0, & \text{elsewise} \end{cases}$$
Now:
Analogous results for the product and the ratio of two random variables
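$$f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} \frac{1}{|x_1|} \, f_{X_1,X_2}\left(x_1, \frac{y_1}{x_1}\right) dx_1$$
and
$$f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} |x_2| \, f_{X_1,X_2}(y_2 \cdot x_2, x_2)\,dx_2$$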
Motivation:
Consider as given the random variables X1, . . . , Xn with joint
pdf fX1,...,Xn
Again, find the joint distribution of Y1, . . . , Yk where Yj =
gj (X1, . . . , Xn) for j = 1, . . . , k
177
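$$m_{Y_1,\ldots,Y_k}(t_1, \ldots, t_k) = E\left[e^{t_1 Y_1 + \ldots + t_k Y_k}\right] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} e^{t_1 g_1(x_1,\ldots,x_n) + \ldots + t_k g_k(x_1,\ldots,x_n)} \cdot f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$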
Example:
Consider $n = 1$ and $k = 1$ where the random variable $X_1 \equiv X$ has a standard normal distribution
Consider the function $g_1(x) \equiv g(x) = x^2$
Find the distribution of $Y = g(X) = X^2$
The moment generating function of Y is given by
$$m_Y(t) = E\left[e^{tY}\right] = E\left[e^{tX^2}\right] = \int_{-\infty}^{+\infty} e^{tx^2} \cdot f_X(x)\,dx$$
179
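$$= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \, e^{tx^2 - \frac{1}{2}x^2}\,dx = \ldots = \frac{1}{\sqrt{1 - 2t}} \quad \text{for } t < \frac{1}{2}$$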
Now:
Distribution of sums of independent random variables
Preliminaries:
Consider the moment generating function of such a sum
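Let $X_1, \ldots, X_n$ be independent random variables and let $Y = \sum_{i=1}^{n} X_i$
$$m_Y(t) = E\left[e^{tY}\right] = E\left[e^{t \sum_{i=1}^{n} X_i}\right] = E\left[e^{tX_1} \cdot e^{tX_2} \cdot \ldots \cdot e^{tX_n}\right] = E\left[e^{tX_1}\right] \cdot E\left[e^{tX_2}\right] \cdot \ldots \cdot E\left[e^{tX_n}\right]$$
[Theorem 3.13(c)]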
181
$$= \prod_{i=1}^{n} m_{X_i}(t)$$
Hopefully:
The distribution of the sum $Y = \sum_{i=1}^{n} X_i$ may be identified from the moment generating function of the sum $m_Y(t)$
182
Example 1:
Assume that $X_1, \ldots, X_n$ are independent and identically distributed exponential random variables with parameter $\lambda > 0$
The moment generating function of each $X_i$ $(i = 1, \ldots, n)$ is given by
$$m_{X_i}(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda$$
(cf. Mood, Graybill, Boes (1974), pp. 540/541)
So the moment generating function of the sum $Y = \sum_{i=1}^{n} X_i$ is given by
$$m_Y(t) = m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^n$$
183
184
Example 2:
Assume that $X_1, \ldots, X_n$ are independent random variables and that $X_i \sim N(\mu_i, \sigma_i^2)$
Furthermore, let $a_1, \ldots, a_n \in \mathbb{R}$ be constants
Then the distribution of the weighted sum is given by
$$Y = \sum_{i=1}^{n} a_i \cdot X_i \sim N\left(\sum_{i=1}^{n} a_i \cdot \mu_i,\; \sum_{i=1}^{n} a_i^2 \cdot \sigma_i^2\right)$$
(Proof: Class)
185
186
Resort:
There are constructive methods by which it is generally possible (under rather mild conditions) to find the distributions
of transformed random variables
$\Longrightarrow$ transformation theorems
Here:
We restrict attention to the simplest case where n = 1, k = 1,
i.e. we consider the transformation Y = g(X)
For multivariate extensions (i.e. for $n \ge 1$, $k \ge 1$) see Mood, Graybill, Boes (1974), pp. 203-212
187
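$$f_Y(y) = \begin{cases} \left|\dfrac{d g^{-1}(y)}{dy}\right| \cdot f_X\left(g^{-1}(y)\right), & \text{for } y \in W \\ 0, & \text{elsewise} \end{cases}$$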
188
Remark:
The transformation $g : D \to W$ with $y = g(x)$ is called one-to-one, if for every $y \in W$ there exists exactly one $x \in D$ with $y = g(x)$
Example:
Suppose X has the pdf
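$$f_X(x) = \begin{cases} \theta x^{-\theta - 1}, & \text{for } x \in [1, +\infty) \\ 0, & \text{elsewise} \end{cases}$$
$$\frac{d g^{-1}(y)}{dy} = e^y,$$
i.e. the derivative is continuous and nonzero for all $y \in [0, +\infty)$
$$f_Y(y) = \begin{cases} e^y \cdot \theta (e^y)^{-\theta - 1}, & \text{for } y \in [0, +\infty) \\ 0, & \text{elsewise} \end{cases} = \begin{cases} \theta e^{-\theta y}, & \text{for } y \in [0, +\infty) \\ 0, & \text{elsewise} \end{cases}$$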
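The transformation formula $f_Y(y) = |d g^{-1}(y)/dy| \cdot f_X(g^{-1}(y))$ can be evaluated directly in code. The sketch assumes $g(x) = \ln(x)$, which is inferred from the derivative $e^y$ of the inverse function, a Pareto-type pdf $\theta x^{-\theta - 1}$ on $[1, +\infty)$, and a hypothetical parameter value `theta = 2.5`.

```python
import math

theta = 2.5  # hypothetical parameter value, chosen only for illustration

def pdf_x(x):
    """Assumed pdf of X: theta * x^(-theta - 1) on [1, +inf), 0 elsewhere."""
    return theta * x ** (-theta - 1.0) if x >= 1.0 else 0.0

def g_inverse(y):
    """Inverse of the assumed transformation g(x) = ln(x)."""
    return math.exp(y)

def d_g_inverse(y):
    """Derivative of the inverse transformation with respect to y."""
    return math.exp(y)

def pdf_y(y):
    """pdf of Y = g(X) via the transformation formula, for y in W = [0, +inf)."""
    if y < 0.0:
        return 0.0
    return abs(d_g_inverse(y)) * pdf_x(g_inverse(y))

for y in (0.0, 0.5, 1.0, 2.0):
    print(y, pdf_y(y), theta * math.exp(-theta * y))
```

Both printed columns should coincide, i.e. under these assumptions $Y$ follows an exponential-type density with parameter $\theta$.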
190
5. Methods of Estimation
Setting:
Let X be a random variable (or let X be a random vector)
representing a random experiment
We are interested in the actual distribution of X (or X)
Notice:
In practice the actual distribution of X is a priori unknown
191
Therefore:
Collect information on the unknown distribution by repeatedly observing the random experiment (and thus the random
variable X)
random sample
statistic
estimator
192
193
194
Remarks:
We assume that, in principle, the random experiment can be
repeated as often as desired
We call the realizations x1, . . . , xn of the random sample
X1, . . . , Xn the observed or the concrete sample
Considering the random sample $X_1, \ldots, X_n$ as a random vector, we see that its joint density is given by
$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$
(since the $X_i$'s are independent; cf. Definition 3.8, Slide 125)
195
[Figure: the random experiment X is repeated n times; the sampling variables X1, X2, ..., Xn (random variables) take the possible realizations x1, x2, ..., xn (realizations of the 1st, 2nd, ..., n-th experiment)]
196
Now:
Consider functions of the sampling variables X1, . . . , Xn
statistic
estimator
Definition 5.2: (Statistic)
Let $X_1, \ldots, X_n$ be a random sample from $X$ and let $g : \mathbb{R}^n \to \mathbb{R}$
be a real-valued function with n arguments that does not contain
any unknown parameters. Then the random variable
T = g(X1, . . . , Xn)
is called a statistic.
197
Examples:
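Sample mean:
$$\overline{X} = g_1(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Sample variance:
$$S^2 = g_2(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$
Sample standard deviation:
$$S = g_3(X_1, \ldots, X_n) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2 }$$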
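Once a concrete sample has been observed, these three statistics become plain numbers. A minimal sketch using only the Python standard library follows; the divisor $n$ matches the definitions on this slide, and the `observed` values are hypothetical.

```python
import math

def sample_mean(xs):
    """Arithmetic mean of the observed sample."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with divisor n, as defined on this slide."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

observed = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical realizations x_1, ..., x_n
print(sample_mean(observed), sample_variance(observed), sample_std(observed))
```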
198
Remarks:
All these concepts can be extended to the multivariate case
The statistic T = g(X1, . . . , Xn) is a function of random variables and hence it is itself a random variable
a statistic has a distribution
(and, in particular, an expectation and a variance)
Purposes of statistics:
Statistics provide information on the distribution of X
Statistics are central tools for
estimating parameters
hypothesis-testing on parameters
199
[Figure: random sample (X1, ..., Xn) and statistic g(X1, ..., Xn); after measurement: sample realization (x1, ..., xn) and realization of the statistic g(x1, ..., xn)]
200
Now:
Let X be a random variable with unknown cdf FX (x)
We may be interested in one or several unknown parameters
of X
Let $\theta$ denote this unknown vector of parameters, e.g.
$$\theta = \begin{pmatrix} E(X) \\ Var(X) \end{pmatrix}$$
Remarks:
Example:
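Let $X \sim N(\mu, \sigma^2)$ with unknown parameters $\mu$ and $\sigma^2$
The vector of parameters to be estimated is given by
$$\theta = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} E(X) \\ Var(X) \end{pmatrix}$$
Estimators of $\mu$ and $\sigma^2$ are given by
$$\hat\mu = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad \text{and} \qquad \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat\mu)^2$$
and an estimator of $\theta$ is given by
$$\hat\theta = \begin{pmatrix} \hat\mu \\ \hat\sigma^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{n} \sum_{i=1}^{n} X_i \\ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat\mu)^2 \end{pmatrix}$$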
203
Question:
Why do we need this seemingly complicated concept of an
estimator in the form of a random variable?
Answer:
To establish a comparison between alternative estimators of
the parameter vector $\theta$
Example:
Let $\theta = Var(X)$ denote the unknown variance of $X$
204
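$$\hat\theta_1(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$
$$\hat\theta_2(X_1, \ldots, X_n) = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$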
Question:
Which estimator is better and for what reasons?
properties (goodness criteria) of point estimators
(see Section 5.2)
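Because both variance estimators are random variables, they can be compared by repeated sampling. The following is a hedged sketch, assuming normally distributed data and NumPy; the sample size, the true variance and the number of replications are illustrative. It averages each estimator over many simulated samples and compares the averages with the true variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, sigma2 = 5, 200_000, 4.0     # illustrative values (assumptions)

# Each row is one random sample of size n
x = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(reps, n))
est_n = x.var(axis=1, ddof=0)         # estimator with divisor n
est_n1 = x.var(axis=1, ddof=1)        # estimator with divisor n - 1

print("average of the divisor-n estimator:    ", est_n.mean())
print("average of the divisor-(n-1) estimator:", est_n1.mean())
print("true variance:", sigma2, " (n-1)/n * true variance:", (n - 1) / n * sigma2)
```

The divisor-$n$ estimator averages to roughly $(n-1)/n$ times the true variance, which is the kind of systematic deviation that the goodness criteria of Section 5.2 make precise.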
205
Notice:
Some of these criteria qualify estimators in terms of their
properties when the sample size becomes large
($n \to \infty$, large-sample properties)
Therefore:
Explanation of the concept of stochastic convergence:
Central-limit theorem
Weak law of large numbers
Convergence in probability
Convergence in distribution
206
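$$\overline{X}_n \sim N\left( \mu, \frac{\sigma^2}{n} \right)$$
and
$$\sqrt{n} \, \frac{\overline{X}_n - \mu}{\sigma} \sim N(0, 1).$$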
Next:
Generalization to the multivariate case
207
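$$\overline{\mathbf{X}}_n \sim N\left( \boldsymbol{\mu}, \frac{1}{n} \boldsymbol{\Sigma} \right)$$
and
$$\sqrt{n} \left( \overline{\mathbf{X}}_n - \boldsymbol{\mu} \right) \sim N(\mathbf{0}, \boldsymbol{\Sigma}).$$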
208
Remarks:
A multivariate random sample from the random vector X
arises naturally by replacing all univariate random variables
in Definition 5.1 (Slide 194) by corresponding multivariate
random vectors
Note the formal analogy to the univariate case in Theorem
5.4
(be aware of matrix-calculus rules!)
Next:
Famous theorem on the arithmetic sample mean
209
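$$E(X_i) = \mu < \infty,$$
$$Var(X_i) = \sigma^2 < \infty.$$
Consider the random variable
$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Then, for any $\varepsilon > 0$,
$$\lim_{n \to \infty} P\left( \left| \overline{X}_n - \mu \right| > \varepsilon \right) = 0.$$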
210
Remarks:
Theorem 5.6 is known as the weak law of large numbers
Irrespective of how small we choose $\varepsilon > 0$, the probability
that $\overline{X}_n$ deviates more than $\varepsilon$ from its expectation $\mu$ tends
to zero when the sample size increases
Notice the analogy between a sequence of independent and
identically distributed random variables and the definition of
a random sample from X on Slide 194
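The weak law of large numbers can be made visible by estimating $P(|\overline{X}_n - \mu| > \varepsilon)$ for growing $n$. The sketch below assumes uniform(0, 1) data (so $\mu = 0.5$), an arbitrary tolerance `eps`, and NumPy; the helper name `exceed_prob` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, eps, reps = 0.5, 0.05, 5_000     # illustrative values (assumptions)

def exceed_prob(n):
    """Monte Carlo estimate of P(|mean_n - mu| > eps) for uniform(0, 1) draws."""
    means = rng.random((reps, n)).mean(axis=1)
    return np.mean(np.abs(means - mu) > eps)

probs = {n: exceed_prob(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:5d}: estimated exceedance probability = {p:.4f}")
```

The estimated probabilities should shrink towards zero as $n$ increases.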
Next:
The first important concept of limiting behaviour
211
or
$$Y_n \xrightarrow{p} \theta.$$
Remarks:
Specific case: Weak law of large numbers
$$\text{plim} \, \overline{X}_n = \mu$$
or
$$\overline{X}_n \xrightarrow{p} \mu$$
212
Typically (but not necessarily) a sequence of random variables converges in probability to a real constant
For multivariate sequences of random vectors $\mathbf{Y}_1, \mathbf{Y}_2, \ldots$
Definition 5.7 has to be applied to the respective corresponding elements
The concept of convergence in probability is important for
assessing estimators
Next:
Alternative concepts of stochastic convergence
213
$$Y_n \xrightarrow{d} Z.$$
Remarks:
Specific case: central-limit theorem
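$$Y_n = \sqrt{n} \, \frac{\overline{X}_n - \mu}{\sigma} \xrightarrow{d} U \sim N(0, 1)$$
$$\text{plim} \, \frac{X_n}{Y_n} = \frac{a}{b} \quad (\text{for } b \neq 0).$$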
Remark:
There is a property similar to Slutsky's theorem that holds
for the convergence in distribution
$$h(X_n) \xrightarrow{d} h(Z).$$
Next:
Connection of both convergence concepts
216
Theorem 5.11: (Cramér Theorem)
Let $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ be sequences of random variables,
let $Z$ be a random variable and $a \in \mathbb{R}$ a constant. Assume that
$\text{plim} \, X_n = a$ and $Y_n \xrightarrow{d} Z$. Then
(a) $X_n + Y_n \xrightarrow{d} a + Z$,
(b) $X_n \cdot Y_n \xrightarrow{d} a \cdot Z$.
Example:
Let $X_1, \ldots, X_n$ be a random sample from $X$ with $E(X) = \mu$
and $Var(X) = \sigma^2$
217
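$$\text{plim} \, S_n^2 = \text{plim} \, \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \overline{X}_n \right)^2 = \sigma^2$$
$$\text{plim} \, S_n^{*2} = \text{plim} \, \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X}_n \right)^2 = \sigma^2$$
$$\text{plim} \, g_1\left( S_n^2 \right) = \text{plim} \, \frac{S_n^2}{\sigma^2} = g_1(\sigma^2) = 1$$
$$\text{plim} \, g_1\left( S_n^{*2} \right) = \text{plim} \, \frac{S_n^{*2}}{\sigma^2} = g_1(\sigma^2) = 1$$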
218
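$$\text{plim} \, g_2\left( S_n^2 \right) = \text{plim} \, \frac{\sigma}{S_n} = g_2(\sigma^2) = 1$$
$$\text{plim} \, g_2\left( S_n^{*2} \right) = \text{plim} \, \frac{\sigma}{S_n^{*}} = g_2(\sigma^2) = 1$$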
219
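Now, Cramér's theorem yields
$$g_2\left( S_n^2 \right) \cdot \sqrt{n} \, \frac{\overline{X}_n - \mu}{\sigma} = \frac{\sigma}{S_n} \cdot \sqrt{n} \, \frac{\overline{X}_n - \mu}{\sigma} = \sqrt{n} \, \frac{\overline{X}_n - \mu}{S_n} \xrightarrow{d} 1 \cdot U = U \sim N(0, 1)$$
Analogously, Cramér's theorem yields
$$\sqrt{n} \, \frac{\overline{X}_n - \mu}{S_n^{*}} \xrightarrow{d} U \sim N(0, 1)$$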
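The limiting result for the studentized mean can be checked by simulation. This is a rough sketch, assuming exponential (hence clearly non-normal) data, NumPy, and illustrative values for `n`, `reps` and `lam`. If the limit distribution is standard normal, about 95% of the simulated statistics should fall within $\pm 1.96$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, lam = 200, 20_000, 1.0      # illustrative values (assumptions)

x = rng.exponential(scale=1.0 / lam, size=(reps, n))   # E(X) = 1/lam
mu = 1.0 / lam

# sqrt(n) * (sample mean - mu) / S_n, with S_n based on the divisor n - 1
t_stat = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

coverage = np.mean(np.abs(t_stat) <= 1.96)
print("share of statistics within +/- 1.96:", coverage)
```

For finite $n$ the share is only approximately 0.95, since the result is asymptotic.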
220
and
$$\hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$
221
Important questions:
Are there reasonable criteria according to which we can select
a good estimator?
How can we construct good estimators?
First goodness property of point estimators:
Concept of repeated sampling:
Draw several random samples from X
Consider the estimator for each random sample
An average of the estimates should be close to the
unknown parameter
(no systematic bias)
unbiasedness of an estimator
222
$$E\left[ \hat\theta(X_1, \ldots, X_n) \right] = \theta.$$
The bias of the estimator is defined as
$$Bias(\hat\theta) = E(\hat\theta) - \theta.$$
Remarks:
Definition 5.12 easily generalizes to the multivariate case
The bias of an unbiased estimator is equal to zero
223
Now:
Important and very general result
Theorem 5.13: (Unbiased estimators of E(X) and Var(X))
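Let $X_1, \ldots, X_n$ be a random sample from $X$ where $X$ may be
arbitrarily distributed with unknown expectation $\mu = E(X)$ and
unknown variance $\sigma^2 = Var(X)$. Then the estimators
$$\hat\mu(X_1, \ldots, X_n) = \overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
and
$$\hat\sigma^2(X_1, \ldots, X_n) = S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$
are unbiased for $\mu$ and $\sigma^2$, respectively.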
Remarks:
Proof: Class
Note that no explicit distribution of X is required
Unbiasedness does, in general, not carry over to parameter
transformations.
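For example, $S = \sqrt{S^2}$ is in general not an unbiased estimator of $\sigma$, although $S^2$ is unbiased for $\sigma^2$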
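Let $\hat\theta_1$ and $\hat\theta_2$ be two unbiased estimators of the unknown parameter $\theta$. $\hat\theta_1$ is defined to be relatively more efficient than $\hat\theta_2$ if
$$Var(\hat\theta_1) \leq Var(\hat\theta_2)$$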
226
Example:
Assume $\theta = E(X)$
Consider the estimators
$$\hat\theta_1(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$$
$$\hat\theta_2(X_1, \ldots, X_n) = \frac{X_1}{2} + \frac{1}{2(n-1)} \sum_{i=2}^{n} X_i$$
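For this example the two variances can be computed exactly. The sketch below assumes independent sampling variables with common finite variance $\sigma^2$ (set to 1 without loss of generality) and uses exact rational arithmetic from the standard library; the function names are hypothetical. Both estimators are unbiased for $\theta = E(X)$, so comparing variances is a comparison of relative efficiency.

```python
from fractions import Fraction

def var_theta1(n, sigma2=Fraction(1)):
    """Variance of the sample mean: sigma^2 / n."""
    return sigma2 / n

def var_theta2(n, sigma2=Fraction(1)):
    """Variance of X_1/2 + (1/(2(n-1))) * sum_{i=2}^n X_i for independent X_i."""
    # Var(X_1/2) = sigma^2/4; the second term is a sum of n-1 independent terms
    return sigma2 / 4 + (n - 1) * sigma2 / (2 * (n - 1)) ** 2

for n in (2, 3, 10, 100):
    print(n, var_theta1(n), var_theta2(n))
```

The second variance simplifies to $\sigma^2 n / (4(n-1))$, and $\sigma^2/n \leq \sigma^2 n/(4(n-1))$ is equivalent to $(n-2)^2 \geq 0$, so the sample mean is never less efficient, with equality only for $n = 2$.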
$$MSE(\hat\theta) = E\left[ (\hat\theta - \theta)^2 \right] = Var(\hat\theta) + \left[ Bias(\hat\theta) \right]^2$$
Remarks:
If an estimator $\hat\theta$ is unbiased, then its MSE is equal to the
variance of the estimator
The MSE of an estimator $\hat\theta$ depends on the value of the
unknown parameter $\theta$
228
Next:
Comparison of alternative estimators via their MSEs
Definition 5.16: (MSE efficiency)
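Let $\hat\theta_1$ and $\hat\theta_2$ be two alternative estimators of the unknown
parameter $\theta$. $\hat\theta_1$ is defined to be more MSE efficient than $\hat\theta_2$ if
$$MSE(\hat\theta_1) \leq MSE(\hat\theta_2)$$
for all possible parameter values of $\theta$ and
$$MSE(\hat\theta_1) < MSE(\hat\theta_2)$$
for at least one possible parameter value of $\theta$.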
229
[Figure: MSE curves of the estimators $\hat\theta_1(X_1, \ldots, X_n)$ and $\hat\theta_2(X_1, \ldots, X_n)$]
230
Remarks:
Frequently, two estimators of $\theta$ are not comparable with respect
to MSE efficiency since their respective MSE curves cross
There is no general mathematical principle for constructing
MSE efficient estimators
However, there are methods for finding the estimator with
uniformly minimum variance among all unbiased estimators
(restriction to the class of all unbiased estimators)
These specific methods are not discussed here
(Rao-Blackwell Theorem, Lehmann-Scheffé Theorem)
Here, we consider only one important result
231
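$$Var(\hat\theta) \geq CR(\theta) \equiv \left[ n \cdot E\left( \left( \frac{\partial \ln f_X(X; \theta)}{\partial \theta} \right)^2 \right) \right]^{-1}.$$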
232
Remarks:
The value $CR(\theta)$ is the minimal variance that any unbiased
estimator can take on
goodness criterion for unbiased estimators
If for an unbiased estimator $\hat\theta(X_1, \ldots, X_n)$
$$Var(\hat\theta) = CR(\theta),$$
then $\hat\theta$ is called UMVUE
(Uniformly Minimum-Variance Unbiased Estimator)
233
Example:
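Assume that $X \sim N(\mu, \sigma^2)$ with known $\sigma^2$ (e.g. $\sigma^2 = 1$)
Consider the following two estimators of $\mu$:
$$\hat\mu_n(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$$
$$\hat\mu_n^{*}(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i + \frac{2}{n}$$
$\hat\mu_n$ is (weakly) consistent for $\mu$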
(Theorem 5.6, Slide 210: weak law of large numbers)
235
$$\hat\mu_n^{*} \sim N\left( \mu + 2/n, \; \sigma^2/n \right)$$
(linear transformation of the normal distribution)
236
[Figure omitted: plot with the true parameter value $\mu = 0$ marked on the horizontal axis]
237
[Figure omitted: plot with the true parameter value $\mu = 0$ marked on the horizontal axis]
238
Remarks:
Sufficient (but not necessary) condition for consistency:
$$\lim_{n \to \infty} E(\hat\theta_n) = \theta$$
(asymptotic unbiasedness)
$$\lim_{n \to \infty} Var(\hat\theta_n) = 0$$
Possible properties of an estimator:
consistent and unbiased
inconsistent and unbiased
consistent and biased
inconsistent and biased
239
Next:
Application of the central-limit theorem to estimators
asymptotic normality of an estimator
Definition 5.19: (Asymptotic normality)
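An estimator $\hat\theta_n(X_1, \ldots, X_n)$ of the parameter $\theta$ is called asymptotically normal if there exist (1) a sequence of real constants
$\theta_1, \theta_2, \ldots$ and (2) a function $V(\theta)$ such that
$$\sqrt{n} \left( \hat\theta_n - \theta_n \right) \xrightarrow{d} U \sim N(0, V(\theta)).$$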
240
Remarks:
Alternative notation:
$$\hat\theta_n \overset{appr.}{\sim} N(\theta_n, V(\theta)/n)$$
241
Remarks:
There are further methods
(e.g. the Generalized Method-of-Moments, GMM)
Here: focus on ML estimation
243
n
X
i=1
244
Remark:
The LS-method is central to the linear regression model
(cf. the courses Econometrics I + II)
245
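$$\hat\mu'_p = \frac{1}{n} \sum_{i=1}^{n} X_i^p \qquad \text{and} \qquad \hat\mu_p = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \hat\mu'_1 \right)^p$$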
247
Remarks:
The theoretical moments $\mu'_p$ and $\mu_p$ had already been introduced in Definition 2.21 (Slide 76)
The sample moments $\hat\mu'_p$ and $\hat\mu_p$ are estimators of the theoretical
moments $\mu'_p$ and $\mu_p$
The arithmetic sample mean is the 1st sample moment of
$X_1, \ldots, X_n$
The sample variance is the 2nd central sample moment of
$X_1, \ldots, X_n$
248
General setting:
Based on the random sample $X_1, \ldots, X_n$ from $X$ estimate the
$r$ unknown parameters $\theta_1, \ldots, \theta_r$
249
250
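$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x > 0 \\ 0 & \text{elsewise} \end{cases}$$
$$E(X) = \frac{1}{\lambda}, \qquad Var(X) = \frac{1}{\lambda^2}$$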
251
2. This implies
$$\lambda = \frac{1}{\mu'_1}$$
3. Method-of-moments estimator for $\lambda$:
$$\hat\lambda(X_1, \ldots, X_n) = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} X_i}$$
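The estimator based on the first moment is simple to compute. The following is a minimal sketch, assuming NumPy and simulated exponential data with an arbitrarily chosen true rate; the function name `mom_rate` is hypothetical.

```python
import numpy as np

def mom_rate(sample):
    """Method-of-moments estimate of the exponential parameter: 1 / sample mean."""
    return 1.0 / np.mean(sample)

rng = np.random.default_rng(5)
true_lam = 2.5                                   # illustrative value (assumption)
sample = rng.exponential(scale=1.0 / true_lam, size=50_000)

print("method-of-moments estimate:", mom_rate(sample))
print("true parameter:            ", true_lam)
```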
252
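2. This implies
$$\lambda = \sqrt{\frac{1}{\mu_2}}$$
3. Method-of-moments estimator for $\lambda$:
$$\hat\lambda(X_1, \ldots, X_n) = \sqrt{ \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2} }$$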
Remarks:
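Method-of-moments estimators are (weakly) consistent since
$$\text{plim} \, \hat\theta_1 = \text{plim} \, h_1(\hat\mu_1, \ldots, \hat\mu_r, \hat\mu'_1, \ldots, \hat\mu'_r)$$
$$= h_1(\text{plim} \, \hat\mu_1, \ldots, \text{plim} \, \hat\mu_r, \text{plim} \, \hat\mu'_1, \ldots, \text{plim} \, \hat\mu'_r)$$
$$= h_1(\mu_1, \ldots, \mu_r, \mu'_1, \ldots, \mu'_r)$$
$$= \theta_1$$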
In general, method-of-moments estimators are not unbiased
Method-of-moments estimators typically are asymptotically
normal
The asymptotic variances are often hard to determine
254
255
Example:
Consider an urn containing black and white balls
The ratio of numbers is known to be 3 : 1
It is not known if the black or the white balls are more numerous
Draw n balls with replacement
Let X denote the number of black balls in the sample
Discrete density of $X$:
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
(binomial distribution)
256
Probabilities P(X = x) for n = 3:
x             0        1        2        3
p = 0.25    27/64    27/64     9/64     1/64
p = 0.75     1/64     9/64    27/64    27/64
Intuitive estimation:
We estimate p by that value which ex-ante maximizes the
probability of observing the actual realization x
$$\hat{p} = \begin{cases} 0.25 & \text{for } x = 0, 1 \\ 0.75 & \text{for } x = 2, 3 \end{cases}$$
Maximum-Likelihood (ML) estimation
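The intuitive ML rule for the urn example can be reproduced by enumerating the binomial probabilities. This sketch assumes $n = 3$ draws (consistent with probabilities expressed in 64ths) and the two admissible values $p = 1/4$ and $p = 3/4$; it uses exact fractions from the standard library, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial(n, p) random variable, computed exactly."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def ml_estimate(x, n, candidates):
    """Return the candidate value of p that maximizes the probability of observing x."""
    return max(candidates, key=lambda p: binom_pmf(x, n, p))

n = 3                                          # assumed number of draws
candidates = [Fraction(1, 4), Fraction(3, 4)]  # the only two admissible values of p
for x in range(n + 1):
    probs = [binom_pmf(x, n, p) for p in candidates]
    print(x, probs, ml_estimate(x, n, candidates))
```

For each possible realization $x$ the function picks the value of $p$ under which that realization is more probable.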
257
Next:
Formalization of the ML estimation technique
Notions:
Likelihood-, Loglikelihood function
ML estimator
Definition 5.21: (Likelihood function)
The likelihood function of $n$ random variables $X_1, \ldots, X_n$ is defined to be the joint density of the $n$ random variables, say
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$, which is considered to be a function of
the parameter vector $\theta$.
258
Remarks:
If $X_1, \ldots, X_n$ is a random sample from the continuous random
variable $X$ with pdf $f_X(x, \theta)$, then
$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$
$$L(\theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$
259
$$L(\theta) = \prod_{i=1}^{n} P(X = x_i; \theta)$$
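$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{1}{2} \left( (x_i - \mu)/\sigma \right)^2} = \left( \frac{1}{2 \pi \sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}$$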
260
262
Example:
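Let $X_1, \ldots, X_n$ be a random sample from $X \sim N(\mu, \sigma^2)$ with
the likelihood function
$$L(\mu, \sigma^2) = \left( \frac{1}{2 \pi \sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}$$
$$\ln[L(\mu, \sigma^2)] = -\frac{n}{2} \ln(2 \pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$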
263
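$$\frac{\partial \ln[L(\mu, \sigma^2)]}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)$$
and
$$\frac{\partial \ln[L(\mu, \sigma^2)]}{\partial \sigma^2} = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2 \sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2$$
Setting these equal to zero, solving the system of equations
and replacing the realizations by the random variables yields
the ML estimators
$$\hat\mu(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i = \overline{X}$$
$$\hat\sigma^2(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$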
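That the closed-form estimators solve the first-order conditions can be verified numerically. The sketch below assumes NumPy and simulated normal data with arbitrary parameters; the function name `score` is hypothetical. It evaluates both partial derivatives of the normal loglikelihood at the ML estimates.

```python
import numpy as np

def score(mu, sigma2, x):
    """Partial derivatives of the normal loglikelihood w.r.t. mu and sigma^2."""
    n = x.size
    d_mu = np.sum(x - mu) / sigma2
    d_sigma2 = -n / (2.0 * sigma2) + np.sum((x - mu) ** 2) / (2.0 * sigma2**2)
    return d_mu, d_sigma2

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, scale=2.0, size=500)     # simulated data (illustrative)

mu_hat = x.mean()                                # closed-form ML estimate of mu
sigma2_hat = np.mean((x - mu_hat) ** 2)          # closed-form ML estimate of sigma^2 (divisor n)

print("ML estimates:", mu_hat, sigma2_hat)
print("score at the ML estimates:", score(mu_hat, sigma2_hat, x))
```

Both derivatives should vanish up to floating-point rounding.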
264
265
3. Asymptotic normality:
$$\sqrt{n} \left( \hat\theta_n - \theta \right) \xrightarrow{d} U \sim N(0, V(\theta))$$
4. Asymptotic efficiency:
$V(\theta)$ coincides with the Cramér-Rao lower bound
5. Direct computation (numerical methods)
6. Quasi-ML estimation:
ML estimators computed on the basis of normally distributed random samples are robust even if the random
sample actually is not normally distributed
(robustness against distribution misspecification)
266
6. Hypothesis Testing
Setting:
Let X represent the random experiment under consideration
Let X have the unknown cdf FX (x)
We are interested in an unknown parameter $\theta$ in the distribution of $X$
Now:
Testing of a statistical hypothesis on the unknown $\theta$ on the
basis of a random sample $X_1, \ldots, X_n$
267
Example 1:
In our local pub the glasses are said to contain 0.4 litres
of beer. We suspect that in many cases the glasses actually
contain less than 0.4 litres of beer
Let $X$ represent the process of filling a glass of beer
Let $\theta = E(X)$ denote the expected amount of beer filled in
one glass
On the basis of a random sample $X_1, \ldots, X_n$ we would like
to test
$\theta = 0.4$
versus
$\theta < 0.4$
268
Example 2:
We know from past data that the risk of a specific stock
(measured by the standard deviation of the stock return) has
been equal to 25%. Now, there is a change in the managerial
board of the firm. Does this change affect the risk of the
stock?
Let X represent the stock return
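Let $\theta = \sqrt{Var(X)}$ denote the standard deviation of the stock return
On the basis of a random sample $X_1, \ldots, X_n$ we would like to test
$\theta = 0.25$
versus
$\theta \neq 0.25$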
269
$H_0 : \theta \in \Theta_0$
versus
$H_1 : \theta \in \Theta \setminus \Theta_0 = \Theta_1$
Types of hypotheses:
If $|\Theta_0| = 1$ (i.e. $\Theta_0 = \{\theta_0\}$) and $H_0 : \theta = \theta_0$, then $H_0$ is called
simple
Otherwise H0 is called composite
An analogous terminology applies to H1
271
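The test
$H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$
is called a two-sided test
The tests
$H_0 : \theta \leq \theta_0$ versus $H_1 : \theta > \theta_0$
and
$H_0 : \theta \geq \theta_0$ versus $H_1 : \theta < \theta_0$
are called one-sided tests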
272
Next:
Consider the general testing problem
$H_0 : \theta \in \Theta_0$
versus
$H_1 : \theta \in \Theta_1 = \Theta \setminus \Theta_0$
General procedure:
Based on a random sample X1, . . . , Xn from X decide on
whether to reject H0 in favor of H1 or not
Explicit procedure:
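Select an appropriate test statistic $T(X_1, \ldots, X_n)$ and determine an appropriate critical region $K \subset \mathbb{R}$
Decision: reject $H_0$ if $T(x_1, \ldots, x_n) \in K$, do not reject $H_0$ if $T(x_1, \ldots, x_n) \notin K$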
Notice:
T (X1, . . . , Xn) is a random variable
the decision is random
possibility of wrong decisions
Types of errors:
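                     Reality: H0 true     Reality: H0 false
Test rejects H0      type I error         correct decision
Test accepts H0      correct decision     type II error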
Conclusion:
Type I error: test rejects H0 when H0 is true
Type II error: test accepts H0 when H0 is false
274
275
Question:
When does a hypothesis test of the form
$H_0 : \theta \in \Theta_0$
versus
$H_1 : \theta \in \Theta_1 = \Theta \setminus \Theta_0$
277
Remark:
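Using the power function $G(\theta)$ of a test, we can express the probabilities of the type I error as
$G(\theta)$ for all $\theta \in \Theta_0$
and the probabilities of the type II error as
$1 - G(\theta)$ for all $\theta \in \Theta_1$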
Question:
What should an ideal test look like?
Intuitively:
A test would be ideal if the probabilities of both the type I
and the type II errors were constantly equal to zero
the test would yield the correct decision with probability 1
278
Example:
For a given $\theta_0$ consider the testing problem
$H_0 : \theta \leq \theta_0$
versus
$H_1 : \theta > \theta_0$
Unfortunately:
It can be shown mathematically that, in general, such an
ideal test does not exist
Way out:
For the selected test statistic T (X1, . . . , Xn) consider the
maximal type-I-error probability
$$\alpha = \max_{\theta \in \Theta_0} \{ P(T(X_1, \ldots, X_n) \in K) \} = \max_{\theta \in \Theta_0} \{ G(\theta) \}$$
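The relation between the power function and the maximal type-I-error probability can be illustrated for a concrete test. This is a sketch under explicit assumptions that are not part of the slides: a normal sample with known $\sigma$, the one-sided z-test of $H_0 : \theta \leq \theta_0$ versus $H_1 : \theta > \theta_0$ with test statistic $\sqrt{n}(\overline{X} - \theta_0)/\sigma$, and the illustrative values in the function signature. It uses only the standard library.

```python
from statistics import NormalDist

def power(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """G(theta) = P(test rejects H0) for the one-sided z-test of H0: theta <= theta0."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha)      # reject if the z-statistic exceeds z_crit
    shift = (n ** 0.5) * (theta - theta0) / sigma   # mean of the z-statistic under theta
    return 1.0 - NormalDist().cdf(z_crit - shift)

for theta in (-0.4, -0.2, 0.0, 0.2, 0.4):
    print(f"theta = {theta:+.1f}: G(theta) = {power(theta):.4f}")
```

On $\Theta_0$ the power function is increasing in $\theta$, so the maximal type-I-error probability is attained at the boundary $\theta = \theta_0$, where it equals `alpha`.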
281
Therefore:
It is crucial how to formulate H0 and H1
We formulate our research hypothesis in H1
(hoping that, for a concrete sample, our test rejects H0)
Example:
Consider Example 1 on Slide 268
If, for a concrete sample, our test rejects H0 we can be quite
sure that (on average) the glasses contain less than 0.4 litres
of beer
If our test accepts H0 we cannot make a statistically significant statement
(the data are not inconsistent with H0)
283
Setting:
Let X1, . . . , Xn be a random sample from X
Let $\theta \in \mathbb{R}$ be an unknown parameter
Let $L(\theta) = L(\theta; x_1, \ldots, x_n)$ denote the likelihood function
284
$H_0 : g(\theta) = q$
versus
$H_1 : g(\theta) \neq q$
285
286
Previous knowledge:
Equivariance property of the ML estimator (Slide 265)
$g(\hat\theta_{ML})$ is the ML estimator of $g(\theta)$
Asymptotic normality (Slide 266)
$$g(\hat\theta_{ML}) - g(\theta) \xrightarrow{d} U \sim N\left( 0, Var(g(\hat\theta_{ML})) \right)$$
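$$W = \frac{\left[ g(\hat\theta_{ML}) - q \right]^2}{\widehat{Var}\left[ g(\hat\theta_{ML}) \right]} \xrightarrow{d} U \sim \chi^2_1 \quad (\text{under } H_0)$$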
287
Test decision:
Reject $H_0$ at the significance level $\alpha$ if $W > \chi^2_{1; 1-\alpha}$
Remarks:
The Wald test is a pure test against H0
(it is not necessary to exactly specify H1)
The Wald principle can be applied to any consistent, asymptotically normally distributed estimator
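A Wald test can be sketched for a simple concrete model. The assumptions here are not from the slides: an exponential sample with unknown parameter $\theta$, $g(\theta) = \theta$, the ML estimator $1/\overline{x}$, and its variance estimated by the inverse Fisher information $\hat\theta^2 / n$. The critical value uses the fact that a $\chi^2_1$ variable is the square of a standard normal variable. The function name `wald_test` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def wald_test(sample, q, alpha=0.05):
    """Wald test of H0: theta = q for the parameter theta of an exponential sample."""
    n = len(sample)
    theta_ml = 1.0 / np.mean(sample)          # ML estimator of theta
    var_hat = theta_ml**2 / n                 # inverse Fisher information at theta_ml
    w = (theta_ml - q) ** 2 / var_hat
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2   # chi^2_{1; 1-alpha} quantile
    return w, crit, w > crit

rng = np.random.default_rng(7)
sample = rng.exponential(scale=1.0 / 2.0, size=400)   # simulated data, true theta = 2

print("H0: theta = 2 ->", wald_test(sample, q=2.0))
print("H0: theta = 3 ->", wald_test(sample, q=3.0))
```

Each call returns the statistic, the critical value, and whether $H_0$ is rejected.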
288
[Figure: Wald test; $\ln[L(\theta)]$ and $g(\theta)$ plotted against $\theta$, with $\hat\theta_{ML}$ marked]
289
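$\max_{\theta} L(\theta)$ $(= L(\hat\theta_{ML}))$
$$\lambda = \frac{L(\hat\theta_{H_0})}{L(\hat\theta_{ML})}$$
Properties of $\lambda$:
$0 \leq \lambda \leq 1$
If $H_0$ is true, then $\lambda$ should be close to one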
290
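LR test statistic:
$$LR = 2 \left\{ \ln[L(\hat\theta_{ML})] - \ln[L(\hat\theta_{H_0})] \right\} \xrightarrow{d} U \sim \chi^2_1 \quad (\text{under } H_0)$$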
Remarks:
The LR test verifies if the distance in the loglikelihood functions, $\ln[L(\hat\theta_{ML})] - \ln[L(\hat\theta_{H_0})]$, is significantly larger than 0
The LR test does not require the computation of any asymptotic variance
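The LR statistic only needs the loglikelihood evaluated at two points. The sketch below again assumes an exponential sample with parameter $\theta$ and a simple null hypothesis $H_0 : \theta = \theta_0$, so that the restricted estimate is $\theta_0$ itself; these choices and the function names are illustrative, not taken from the slides.

```python
import numpy as np
from statistics import NormalDist

def loglik(theta, sample):
    """Loglikelihood of an exponential sample: n * ln(theta) - theta * sum(x)."""
    sample = np.asarray(sample, dtype=float)
    return sample.size * np.log(theta) - theta * sample.sum()

def lr_test(sample, theta0, alpha=0.05):
    """LR test of the simple hypothesis H0: theta = theta0."""
    theta_ml = 1.0 / np.mean(sample)                      # unrestricted ML estimate
    lr = 2.0 * (loglik(theta_ml, sample) - loglik(theta0, sample))
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2   # chi^2_{1; 1-alpha} quantile
    return lr, crit, lr > crit

rng = np.random.default_rng(8)
sample = rng.exponential(scale=1.0 / 2.0, size=400)   # simulated data, true theta = 2

print("H0: theta = 2 ->", lr_test(sample, theta0=2.0))
print("H0: theta = 3 ->", lr_test(sample, theta0=3.0))
```

No variance estimate appears anywhere in the computation.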
292
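[Figure: LR test; $\ln[L(\theta)]$ and $g(\theta)$ plotted against $\theta$, with $\hat\theta_{ML}$, $\hat\theta_{H_0}$ and the distance LR between $\ln[L(\hat\theta_{ML})]$ and $\ln[L(\hat\theta_{H_0})]$ marked]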
293
$$\left. \frac{\partial \ln[L(\theta)]}{\partial \theta} \right|_{\theta = \hat\theta_{ML}} = 0$$
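LM test statistic:
$$LM = \left[ \left. \frac{\partial \ln[L(\theta)]}{\partial \theta} \right|_{\theta = \hat\theta_{H_0}} \right]^2 \cdot \widehat{Var}(\hat\theta_{H_0}) \xrightarrow{d} U \sim \chi^2_1 \quad (\text{under } H_0)$$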
Test decision:
Reject $H_0$ at the significance level $\alpha$ if $LM > \chi^2_{1; 1-\alpha}$
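The LM principle needs only the restricted estimate. The following is a hedged sketch, assuming an exponential sample with parameter $\theta$ and the simple null hypothesis $H_0 : \theta = \theta_0$. The statistic is computed as the squared slope of the loglikelihood at $\theta_0$ divided by the Fisher information of the sample, $n/\theta_0^2$, which is one common way of estimating the variance term. The function name `lm_test` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def lm_test(sample, theta0, alpha=0.05):
    """LM (score) test of H0: theta = theta0 for an exponential sample."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    score = n / theta0 - sample.sum()        # d ln L / d theta evaluated at theta0
    fisher_info = n / theta0**2              # Fisher information of the sample at theta0
    lm = score**2 / fisher_info
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2   # chi^2_{1; 1-alpha} quantile
    return lm, crit, lm > crit

rng = np.random.default_rng(9)
sample = rng.exponential(scale=1.0 / 2.0, size=400)   # simulated data, true theta = 2

print("H0: theta = 2 ->", lm_test(sample, theta0=2.0))
print("H0: theta = 3 ->", lm_test(sample, theta0=3.0))
```

If $H_0$ is true, the slope of the loglikelihood at $\theta_0$ should be close to zero, so a large value of the statistic speaks against $H_0$.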
295
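[Figure: LM test; $\ln[L(\theta)]$ and $g(\theta)$ plotted against $\theta$, with $\hat\theta_{H_0}$, $\hat\theta_{ML}$ and the slope LM at $\hat\theta_{H_0}$ marked]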
296
Remarks:
The test statistics of both the Wald and the LM tests contain estimated variances (of $g(\hat\theta_{ML})$ and $\hat\theta_{H_0}$, respectively)
These unknown variances can be estimated consistently by
the so-called Fisher information
Many econometric tests are based on these three construction principles
The three tests are asymptotically equivalent, i.e. for large
sample sizes they produce identical test decisions
The three principles can be extended to the testing of hypotheses on a parameter vector $\theta$
If $\theta \in \mathbb{R}^m$, then all 3 test statistics asymptotically have a $\chi^2_m$ distribution
under $H_0$
297
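[Figure: comparison of the three tests; $\ln L(\theta)$ and $g(\theta)$ plotted against $\theta$, with $\hat\theta_{H_0}$, $\hat\theta_{ML}$, the loglikelihood distance LR and the slope LM marked]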
298