Fortstat English PDF

Slides
Advanced Statistics
Summer Term 2011
(April 5, 2011 – May 17, 2011)
Tuesdays, 14.15–15.45 and 16.00–17.30
Room: J 498
Prof. Dr. Bernd Wilfling

Westfälische Wilhelms-Universität Münster
Contents
1 Introduction
1.1 Syllabus
1.2 Why Advanced Statistics?
2 Random Variables, Distribution Functions, Expectation, Moment Generating Functions
2.1 Basic Terminology
2.2 Random Variable, Cumulative Distribution Function, Density Function
2.3 Expectation, Moments and Moment Generating Functions
2.4 Special Parametric Families of Univariate Distributions
3 Joint and Conditional Distributions, Stochastic Independence
3.1 Joint and Marginal Distribution
3.2 Conditional Distribution and Stochastic Independence
3.3 Expectation and Joint Moment Generating Functions
3.4 The Multivariate Normal Distribution
4 Distributions of Functions of Random Variables
4.1 Expectations of Functions of Random Variables
4.2 Cumulative-distribution-function Technique
4.3 Moment-generating-function Technique
4.4 General Transformations
5 Methods of Estimation
5.1 Sampling, Estimators, Limit Theorems
5.2 Properties of Estimators
5.3 Methods of Estimation
5.3.1 Least-Squares Estimators
5.3.2 Method-of-moments Estimators
5.3.3 Maximum-Likelihood Estimators
6 Hypothesis Testing
6.1 Basic Terminology
6.2 Classical Testing Procedures
6.2.1 Wald Test
6.2.2 Likelihood-Ratio Test
6.2.3 Lagrange-Multiplier Test
References and Related Reading

In German:
Mosler, K. und F. Schmid (2008). Wahrscheinlichkeitsrechnung und schließende Statistik (3. Auflage). Springer Verlag, Heidelberg.
Schira, J. (2009). Statistische Methoden der VWL und BWL – Theorie und Praxis (3. Auflage). Pearson Studium, München.
Wilfling, B. (2010). Statistik I. Skript zur Vorlesung Deskriptive Statistik im Wintersemester 2010/2011 an der Westfälischen Wilhelms-Universität Münster.
Wilfling, B. (2011). Statistik II. Skript zur Vorlesung Wahrscheinlichkeitsrechnung und schließende Statistik im Sommersemester 2011 an der Westfälischen Wilhelms-Universität Münster.
In English:
Chiang, A. (1984). Fundamental Methods of Mathematical Economics, 3. edition. McGraw-Hill, Singapore.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. 1. John
Wiley & Sons, New York.
Feller, W. (1971). An Introduction to Probability Theory and its Applications, Vol. 2. John
Wiley & Sons, New York.
Garthwaite, P.H., Jolliffe, I.T. and B. Jones (2002). Statistical Inference, 3. edition. Oxford
University Press, Oxford.
Mood, A.M., Graybill, F.A. and D.C. Boes (1974). Introduction to the Theory of Statistics,
3. edition. McGraw-Hill, Tokyo.
1. Introduction
1.1 Syllabus
Aim of this course:
Consolidation of
probability calculus
statistical inference
(on the basis of previous Bachelor courses)
Preparatory course to Econometrics, Empirical Economics
1
Web-site:
http://www1.wiwi.uni-muenster.de/oeew/
Study → Courses summer term 2011 → Advanced Statistics
Style:
Lecture is based on slides
Slides are downloadable as PDF-files from the web-site
References:
See Contents
2
How to get prepared for the exam:

Courses
Class in Advanced Statistics
(Thu, 14.00–16.00 and 16.00–18.00, J 498,
April 7, 2011 – May 19, 2011)
Auxiliary material to be used in the exam:

Pocket calculator (non-programmable)
All course-slides and solutions to class-exercises
No textbooks
3
Class teacher:
Dipl.-Mathem. Marc Lammerding
(see personal web-site)
1.2 Why Advanced Statistics?

Contents of the BA course Statistics II:
Random experiments, events, probability
Random variables, distributions
Samples, statistics
Estimators
Tests of hypothesis
Aim of the BA course Statistics II:
Elementary understanding of statistical concepts
(sampling, estimation, hypothesis-testing)
Now:
Course in Advanced Statistics
(probability calculus and mathematical statistics)
Aim of this course:

Better understanding of distribution theory
How can we find good estimators?
How can we construct good tests of hypothesis?
Preliminaries:
BA courses
Mathematics
Statistics I
Statistics II
The slides for the BA courses Statistics I+II are downloadable
from the web-site
(in German)
Later courses based on Advanced Statistics:
All courses belonging to the three modules Econometrics
and Empirical Economics
(Econometrics I+II, Analysis of Time Series, ...)
7
2. Random Variables, Distribution Functions, Expectation, Moment Generating Functions

Aim of this section:
Mathematical definition of the concepts
random variable
(cumulative) distribution function
(probability) density function
expectation and moments
moment generating function
Preliminaries:
Repetition of the notions
random experiment
outcome (sample point) and sample space
event
probability
(see Wilfling (2011), Chapter 2)
2.1 Basic Terminology

Definition 2.1: (Random experiment)
A random experiment is an experiment
(a) for which we know in advance all conceivable outcomes that

it can take on, but
(b) for which we do not know in advance the actual outcome

that it eventually takes on.
Random experiments are performed in controllable trials.

10
Examples of random experiments:

Drawing of lottery numbers
Roulette, tossing a coin, rolling a die
Technical experiments
(testing the hardness of lots from steel production etc.)
In economics:
Random experiments (according to Def. 2.1) are rare
(historical data, trials are not controllable)
Modern discipline: Experimental Economics
11
Definition 2.2: (Sample point, sample space)

Each conceivable outcome $\omega$ of a random experiment is called a
sample point. The totality of conceivable outcomes (or sample
points) is defined as the sample space and is denoted by $\Omega$.
Examples:
Random experiment of rolling a single die:
$\Omega = \{1, 2, 3, 4, 5, 6\}$
Random experiment of tossing a coin until HEAD shows up:
$\Omega = \{H, TH, TTH, TTTH, TTTTH, \ldots\}$
Random experiment of measuring tomorrow's exchange rate
between the euro and the US dollar:
$\Omega = [0, \infty)$
12
Obviously:
The number of elements in $\Omega$ can be either (1) finite or (2)
infinite, but countable or (3) infinite and uncountable
Now:
Definition of the notion 'event' based on mathematical sets
Definition 2.3: (Event)
An event of a random experiment is a subset of the sample space
$\Omega$. We say 'the event $A$ occurs' if the random experiment has
an outcome $\omega \in A$.
13
Remarks:
Events are typically denoted by A, B, C, . . . or A1, A2, . . .
$A = \Omega$ is called the sure event
(since for every sample point $\omega$ we have $\omega \in A$)
$A = \emptyset$ (empty set) is called the impossible event
(since for every $\omega$ we have $\omega \notin A$)
If the event $A$ is a subset of the event $B$ ($A \subset B$) we say that
the occurrence of $A$ implies the occurrence of $B$
(since for every $\omega \in A$ we also have $\omega \in B$)
Obviously:
Events are represented by mathematical sets
$\longrightarrow$ application of set operations to events
14
Combining events (set operations):

Intersection: $\bigcap_{i=1}^{n} A_i$ occurs, if all $A_i$ occur
Union: $\bigcup_{i=1}^{n} A_i$ occurs, if at least one $A_i$ occurs
Set difference: $C = A \setminus B$ occurs, if $A$ occurs and $B$ does not occur
Complement: $C = \Omega \setminus A \equiv \bar{A}$ occurs, if $A$ does not occur
The events $A$ and $B$ are called disjoint, if $A \cap B = \emptyset$
(both events cannot occur simultaneously)
15
Now:
For any arbitrary event A we are looking for a number P (A)
which represents the probability that A occurs
Formally:
$P : A \longrightarrow P(A)$
($P(\cdot)$ is a set function)
Question:
Which properties should the probability function (set function) $P(\cdot)$ have?
16
Definition 2.4: (Kolmogorov-axioms)

The following axioms for $P(\cdot)$ are called Kolmogorov-axioms:
Nonnegativity: $P(A) \geq 0$ for every $A$

Standardization: $P(\Omega) = 1$
Additivity: For two disjoint events $A$ and $B$ (i.e. for $A \cap B = \emptyset$)
$P(\cdot)$ satisfies
$P(A \cup B) = P(A) + P(B)$
17
Easy to check:
The three axioms imply several additional properties and rules
when computing with probabilities
Theorem 2.5: (General properties)
The Kolmogorov-axioms imply the following properties:
Probability of the complementary event:
$P(\bar{A}) = 1 - P(A)$
Probability of the impossible event:
$P(\emptyset) = 0$
Range of probabilities:
$0 \leq P(A) \leq 1$
18
Next:
General rules when computing with probabilities
Theorem 2.6: (Calculation rules)
The Kolmogorov-axioms imply the following calculation rules
(A, B, C are arbitrary events):
Addition rule (I):

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
(probability that A or B occurs)
19
Addition rule (II):

$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C)$
(probability that A or B or C occurs)
Probability of the difference event:

$P(A \setminus B) = P(A \cap \bar{B}) = P(A) - P(A \cap B)$
20
Notice:
If $B$ implies $A$ (i.e. if $B \subset A$) it follows that
$P(A \setminus B) = P(A) - P(B)$
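As an illustration, the rules of Theorems 2.5 and 2.6 can be checked on a small finite sample space. The following Python sketch assumes the single-die sample space from Definition 2.2 with equally likely outcomes. This equal-likelihood assumption, the helper `prob` and the events `A`, `B`, `C` are chosen here for illustration only and are not part of the slides.

```python
from fractions import Fraction

# Sample space of rolling a single die; equally likely outcomes are ASSUMED.
OMEGA = frozenset(range(1, 7))


def prob(event):
    """P(A) = |A| / |Omega| for an event A that is a subset of OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))


# Three illustrative (hypothetical) events.
A = frozenset({2, 4, 6})  # "even number"
B = frozenset({4, 5, 6})  # "at least four"
C = frozenset({1, 2})     # "at most two"

# Probability of the complementary event
complement_ok = prob(OMEGA - A) == 1 - prob(A)
# Addition rule (I)
addition_1_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Addition rule (II)
addition_2_ok = prob(A | B | C) == (
    prob(A) + prob(B) + prob(C)
    - prob(A & B) - prob(B & C) - prob(A & C)
    + prob(A & B & C)
)
# Probability of the difference event
difference_ok = prob(A - B) == prob(A) - prob(A & B)
```

Exact fractions are used instead of floating-point numbers so that the equalities can be compared without rounding error.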
21
2.2 Random Variable, Cumulative Distribution Function, Density Function
Frequently:
Instead of being interested in a concrete sample point $\omega \in \Omega$
itself, we are rather interested in a number depending on $\omega$
Examples:
Profit in euro when playing roulette
Profit earned when selling a stock
Monthly salary of a randomly selected person

Intuitive meaning of a random variable:
Rule translating the abstract $\omega$ into a number
22
Definition 2.7: (Random variable [rv])

A random variable, denoted by $X$ or $X(\cdot)$, is a mathematical
function of the form
$X : \Omega \longrightarrow \mathbb{R}, \quad \omega \longmapsto X(\omega)$.
Remarks:
A random variable relates each sample point $\omega \in \Omega$ to a real
number
Intuitively:
A random variable X characterizes a number that is a priori
unknown
23
When the random experiment is carried out, the random

variable X takes on the value x
x is called realization or value of the random variable X after
the random experiment has been carried out
Random variables are denoted by capital letters, realizations
are denoted by small letters
The rv X describes the situation ex ante, i.e. before carrying
out the random experiment
The realization x describes the situation ex post, i.e. after
having carried out the random experiment
24
Example 1:
Consider the experiment of tossing a single coin (H = Head,
T = Tail). Let the rv $X$ represent the 'Number of Heads'
We have
$\Omega = \{H, T\}$
The random variable $X$ can take on two values:
$X(T) = 0, \quad X(H) = 1$
25
Example 2:
Consider the experiment of tossing a coin three times. Let
$X$ represent the 'Number of Heads'
We have
$\Omega = \{\underbrace{(H, H, H)}_{=\omega_1}, \underbrace{(H, H, T)}_{=\omega_2}, \ldots, \underbrace{(T, T, T)}_{=\omega_8}\}$
The rv $X$ is defined by
$X(\omega) =$ number of H in $\omega$
Obviously:
$X$ relates distinct $\omega$'s to the same number, e.g.
$X((H, H, T)) = X((H, T, H)) = X((T, H, H)) = 2$
26
Example 3:
Consider the experiment of randomly selecting 1 person from
a group of people. Let $X$ represent the person's status of
employment
We have
$\Omega = \{\underbrace{\text{employed}}_{=\omega_1}, \underbrace{\text{unemployed}}_{=\omega_2}\}$
$X$ can be defined as
$X(\omega_1) = 1, \quad X(\omega_2) = 0$
27
Example 4:
Consider the experiment of measuring tomorrow's price of a
specific stock. Let $X$ denote the stock price
We have $\Omega = [0, \infty)$, i.e. $X$ is defined by
$X(\omega) = \omega$
Conclusion:
The random variable X can take on distinct values with specific probabilities
28
Question:
How can we determine these specific probabilities and how
can we calculate with them?
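Simplifying notation: ($a, b, x \in \mathbb{R}$)
$P(X = a) \equiv P(\{\omega \mid X(\omega) = a\})$
$P(a < X < b) \equiv P(\{\omega \mid a < X(\omega) < b\})$
$P(X \leq x) \equiv P(\{\omega \mid X(\omega) \leq x\})$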
Solution:
We can compute these probabilities via the so-called cumulative distribution function of X
29
Intuitively:
The cumulative distribution function of the random variable
X characterizes the probabilities according to which the possible values x are distributed along the real line
(the so-called distribution of X)
Definition 2.8: (Cumulative distribution function [cdf])

The cumulative distribution function of a random variable X,
denoted by $F_X$, is defined to be the function
$F_X : \mathbb{R} \longrightarrow [0, 1], \quad x \longmapsto F_X(x) = P(\{\omega \mid X(\omega) \leq x\}) = P(X \leq x)$.
30
Example:
Consider the experiment of tossing a coin three times. Let
$X$ represent the 'Number of Heads'
We have
$\Omega = \{\underbrace{(H, H, H)}_{=\omega_1}, \underbrace{(H, H, T)}_{=\omega_2}, \ldots, \underbrace{(T, T, T)}_{=\omega_8}\}$
For the probabilities of X we find
P (X = 0) = P ({(T, T, T )}) = 1/8

P (X = 1) = P ({(T, T, H), (T, H, T ), (H, T, T )}) = 3/8
P (X = 2) = P ({(T, H, H), (H, T, H), (H, H, T )}) = 3/8
P (X = 3) = P ({(H, H, H)}) = 1/8
31
Thus, the cdf is given by
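$F_X(x) = \begin{cases} 0.000 & \text{for } x < 0 \\ 0.125 & \text{for } 0 \leq x < 1 \\ 0.5 & \text{for } 1 \leq x < 2 \\ 0.875 & \text{for } 2 \leq x < 3 \\ 1 & \text{for } x \geq 3 \end{cases}$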
Remarks:
In practice, it will be sufficient to only know the cdf FX of X
In many situations, it will appear impossible to exactly specify
the sample space $\Omega$ or the explicit function $X : \Omega \longrightarrow \mathbb{R}$.
However, often we may derive the cdf $F_X$ from other factual
considerations
32
General properties of FX :
FX (x) is a monotone, nondecreasing function
We have
$\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$
$F_X$ is continuous from the right; that is,

$\lim_{z \to x,\; z > x} F_X(z) = F_X(x)$
33
Summary:
Via the cdf FX (x) we can answer the following question:
What is the probability that the random variable X takes
on a value that does not exceed x?
Now:
Consider the question:
What is the value which $X$ does not exceed with a
prespecified probability $p \in (0, 1)$?
$\longrightarrow$ quantile function of $X$
34
Definition 2.9: (Quantile function)

Consider the rv $X$ with cdf $F_X$. For every $p \in (0, 1)$ the quantile
function of $X$, denoted by $Q_X(p)$, is defined as
$Q_X : (0, 1) \longrightarrow \mathbb{R}, \quad p \longmapsto Q_X(p) = \min\{x \mid F_X(x) \geq p\}$.
The value of the quantile function $x_p = Q_X(p)$ is called the $p$th

quantile of $X$.
Remarks:
The $p$th quantile $x_p$ of $X$ is defined as the smallest number
$x$ satisfying $F_X(x) \geq p$
In other words: The $p$th quantile $x_p$ is the smallest value that
$X$ does not exceed with probability at least $p$
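The cdf and the quantile function can be made concrete with the three-coin-toss example. The following Python sketch assumes a fair coin, so that all eight outcomes are equally likely; the function names `cdf` and `quantile` are chosen here for illustration. For a discrete rv the minimum in Definition 2.9 is attained at a point of the support, which is why the search is restricted to the values that $X$ actually takes.

```python
from fractions import Fraction
from itertools import product

# All 8 outcomes of tossing a coin three times (fair coin ASSUMED).
outcomes = list(product("HT", repeat=3))


def X(omega):
    """Random variable: number of heads in the outcome omega."""
    return omega.count("H")


def cdf(x):
    """F_X(x) = P(X <= x), computed by counting favourable outcomes."""
    favourable = sum(1 for omega in outcomes if X(omega) <= x)
    return Fraction(favourable, len(outcomes))


def quantile(p):
    """Q_X(p) = min{x : F_X(x) >= p} for 0 < p < 1."""
    values = sorted({X(omega) for omega in outcomes})
    return next(x for x in values if cdf(x) >= p)
```

For instance, `cdf(1)` reproduces the value 0.5 from the table above, and `quantile(0.5)` returns the median of the number of heads.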
35
Special quantiles:
Median: p = 0.5
Quartiles: p = 0.25, 0.5, 0.75
Quintiles: p = 0.2, 0.4, 0.6, 0.8
Deciles: p = 0.1, 0.2, . . . , 0.9
Now:
Consideration of two distinct classes of random variables
(discrete vs. continuous rvs)
36
Reason:
Each class requires a specific mathematical treatment
Mathematical tools for analyzing discrete rvs:
Finite and infinite sums
Mathematical tools for analyzing continuous rvs:
Differential- and integral calculus
Remarks:
Some rvs are partly discrete and partly continuous
Such rvs are not treated in this course
37
Definition 2.10: (Discrete random variable)

A random variable X will be defined to be discrete if it can take
on either
(a) only a finite number of values $x_1, x_2, \ldots, x_J$ or
(b) an infinite, but countable number of values $x_1, x_2, \ldots$
each with strictly positive probability; that is, if for all $j = 1, \ldots, J, \ldots$ we have
$P(X = x_j) > 0$ and $\sum_{j=1}^{J, \ldots} P(X = x_j) = 1$.
38
Examples of discrete variables:

Count variables ($X$ = 'Number of . . .')
Encoded qualitative variables
Further definitions:
Definition 2.11: (Support of a discrete random variable)

The support of a discrete rv X, denoted by supp(X), is defined
to be the totality of all values that X can take on with a strictly
positive probability:
supp(X) = {x1, . . . , xJ }
or
supp(X) = {x1, x2, . . .}.

39
Definition 2.12: (Discrete density function)

For a discrete random variable X the function
fX (x) = P (X = x)
is defined to be the discrete density function of X.
Remarks:
The discrete density function $f_X(\cdot)$ takes on strictly positive
values only for elements of the support of $X$. For realizations
of $X$ that do not belong to the support of $X$, i.e. for $x \notin \mathrm{supp}(X)$, we have $f_X(x) = 0$:
$f_X(x) = \begin{cases} P(X = x_j) > 0 & \text{for } x = x_j \in \mathrm{supp}(X) \\ 0 & \text{for } x \notin \mathrm{supp}(X) \end{cases}$
40
The discrete density function $f_X(\cdot)$ has the following properties:

$f_X(x) \geq 0$ for all $x$
$\sum_{x_j \in \mathrm{supp}(X)} f_X(x_j) = 1$
For any arbitrary set $A \subset \mathbb{R}$ the probability of the event

$\{\omega \mid X(\omega) \in A\} = \{X \in A\}$ is given by
$P(X \in A) = \sum_{x_j \in A} f_X(x_j)$
41
Example:
Consider the experiment of tossing a coin three times and
let X = Number of Heads
(see slide 31)
Obviously: X is discrete and has the support
supp(X) = {0, 1, 2, 3}
The discrete density function of X is given by
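$f_X(x) = \begin{cases} P(X = 0) = 0.125 & \text{for } x = 0 \\ P(X = 1) = 0.375 & \text{for } x = 1 \\ P(X = 2) = 0.375 & \text{for } x = 2 \\ P(X = 3) = 0.125 & \text{for } x = 3 \\ 0 & \text{for } x \notin \mathrm{supp}(X) \end{cases}$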
42
The cdf of X is given by (see slide 32)
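$F_X(x) = \begin{cases} 0.000 & \text{for } x < 0 \\ 0.125 & \text{for } 0 \leq x < 1 \\ 0.5 & \text{for } 1 \leq x < 2 \\ 0.875 & \text{for } 2 \leq x < 3 \\ 1 & \text{for } x \geq 3 \end{cases}$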
Obviously:
The cdf $F_X(\cdot)$ can be obtained from $f_X(\cdot)$:
$F_X(x) = P(X \leq x) = \sum_{\{x_j \in \mathrm{supp}(X) \mid x_j \leq x\}} \underbrace{f_X(x_j)}_{=P(X = x_j)}$
43
Conclusion:
The cdf of a discrete random variable $X$ is a step function
with steps at the points $x_j \in \mathrm{supp}(X)$. The height of the
step at $x_j$ is given by
$F_X(x_j) - \lim_{x \to x_j,\; x < x_j} F_X(x) = P(X = x_j) = f_X(x_j)$,
i.e. the step height is equal to the value of the discrete density
function at $x_j$
(relationship between cdf and discrete density function)
44
Now:
Definition of continuous random variables
Intuitively:
In contrast to discrete random variables, continuous random
variables can take on an uncountable number of values
(e.g. every real number on a given interval)
In fact:
Definition of a continuous random variable is quite technical
45
Definition 2.13: (Continuous rv, probability density function)

A random variable $X$ is called continuous if there exists a function
$f_X : \mathbb{R} \longrightarrow [0, \infty)$ such that the cdf of $X$ can be written as
$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$ for all $x \in \mathbb{R}$.
The function fX (x) is called the probability density function (pdf)

of X.
Remarks:
The cdf $F_X(\cdot)$ of a continuous random variable $X$ is a primitive function (antiderivative) of the pdf $f_X(\cdot)$
$F_X(x) = P(X \leq x)$ is equal to the area under the pdf $f_X(\cdot)$
between the limits $-\infty$ and $x$
46
[Figure: cdf $F_X(\cdot)$ and pdf $f_X(\cdot)$; the area under $f_X(t)$ up to $x$ equals $P(X \leq x) = F_X(x)$]
47
Properties of the pdf $f_X(\cdot)$:

1. A pdf $f_X(\cdot)$ cannot take on negative values, i.e.
$f_X(x) \geq 0$ for all $x \in \mathbb{R}$
2. The area under a pdf is equal to one, i.e.

$\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$
3. If the cdf $F_X(x)$ is differentiable we have

$f_X(x) = F_X'(x) \equiv dF_X(x)/dx$
48
Example: (Uniform distribution over [0, 10])

Consider the random variable $X$ with pdf
$f_X(x) = \begin{cases} 0 & \text{for } x \notin [0, 10] \\ 0.1 & \text{for } x \in [0, 10] \end{cases}$
Derivation of the cdf FX :

For x < 0 we have
$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{-\infty}^{x} 0\,dt = 0$
49
For $x \in [0, 10]$ we have

$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \underbrace{\int_{-\infty}^{0} 0\,dt}_{=0} + \int_{0}^{x} 0.1\,dt$
$= [0.1 \cdot t]_0^x$
$= 0.1 \cdot x - 0.1 \cdot 0$
$= 0.1 \cdot x$
50
For x > 10 we have

$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \underbrace{\int_{-\infty}^{0} 0\,dt}_{=0} + \underbrace{\int_{0}^{10} 0.1\,dt}_{=1} + \underbrace{\int_{10}^{x} 0\,dt}_{=0}$
$= 1$
51
Now:
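Interval probabilities, i.e. (for $a, b \in \mathbb{R}$, $a < b$)
$P(X \in (a, b]) = P(a < X \leq b)$
We have
$P(a < X \leq b) = P(\{\omega \mid a < X(\omega) \leq b\})$
$= P(\{\omega \mid X(\omega) > a\} \cap \{\omega \mid X(\omega) \leq b\})$
$= 1 - P(\overline{\{\omega \mid X(\omega) > a\} \cap \{\omega \mid X(\omega) \leq b\}})$
$= 1 - P(\overline{\{\omega \mid X(\omega) > a\}} \cup \overline{\{\omega \mid X(\omega) \leq b\}})$
$= 1 - P(\{\omega \mid X(\omega) \leq a\} \cup \{\omega \mid X(\omega) > b\})$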
52
$= 1 - [P(X \leq a) + P(X > b)]$
$= 1 - [F_X(a) + (1 - P(X \leq b))]$
$= 1 - [F_X(a) + 1 - F_X(b)]$
$= F_X(b) - F_X(a)$
$= \int_{-\infty}^{b} f_X(t)\,dt - \int_{-\infty}^{a} f_X(t)\,dt = \int_{a}^{b} f_X(t)\,dt$
53
[Figure: interval probability $P(a < X \leq b)$ as the area under the pdf $f_X(x)$ between the limits $a$ and $b$]
54
Important result for a continuous rv X:

$P(X = a) = 0$ for all $a \in \mathbb{R}$
Proof:
$P(X = a) = \lim_{b \to a} P(a < X \leq b) = \lim_{b \to a} \int_{a}^{b} f_X(x)\,dx = \int_{a}^{a} f_X(x)\,dx = 0$
Conclusion:
The probability that a continuous random variable X takes
on a single explicit value is always zero
55
[Figure: probability of a single value, showing the pdf $f_X(x)$ with interval limits $b_1$, $b_2$, $b_3$]
56
Notice:
This does not imply that the event {X = a} cannot occur
Consequence:
Since for continuous random variables we always have $P(X = a) = 0$ for all $a \in \mathbb{R}$, it follows that
$P(a < X < b) = P(a \leq X < b) = P(a \leq X \leq b) = P(a < X \leq b) = F_X(b) - F_X(a)$
(when computing interval probabilities for continuous rvs, it
does not matter if the interval is open or closed)
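The relation between interval probabilities, the cdf and the pdf can be tried out on the uniform distribution over [0, 10] from the example above. The following Python sketch implements the piecewise cdf derived on the slides and compares $F_X(b) - F_X(a)$ with a simple midpoint-rule approximation of $\int_a^b f_X(t)\,dt$. The function names and the choice of the midpoint rule are illustrative assumptions, not part of the course material.

```python
def uniform_pdf(x, lower=0.0, upper=10.0):
    """pdf of the uniform distribution over [lower, upper]."""
    return 1.0 / (upper - lower) if lower <= x <= upper else 0.0


def uniform_cdf(x, lower=0.0, upper=10.0):
    """Piecewise cdf: 0 below the interval, linear inside, 1 above."""
    if x < lower:
        return 0.0
    if x > upper:
        return 1.0
    return (x - lower) / (upper - lower)


def interval_prob(a, b):
    """P(a < X <= b) = F_X(b) - F_X(a)."""
    return uniform_cdf(b) - uniform_cdf(a)


def interval_prob_numeric(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of the pdf over [a, b]."""
    width = (b - a) / steps
    return sum(uniform_pdf(a + (i + 0.5) * width) for i in range(steps)) * width
```

Both functions should agree up to the numerical error of the approximation, for example for the interval from 2 to 5.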
57
2.3 Expectation, Moments and Moment Generating Functions

Repetition:
Expectation of an arbitrary random variable X
Definition 2.14: (Expectation)
The expectation of the random variable X, denoted by E(X), is
defined by
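$E(X) = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} x_j \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} x \cdot f_X(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$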
58
Remarks:
The expectation of the random variable $X$ is, loosely speaking,
the sum of all realizations each weighted by the
probability of its occurrence
Instead of $E(X)$ we often write $\mu_X$
There exist random variables that do not have an expectation
(see class)
59
Example 1: (Discrete random variable)

Consider the experiment of tossing two dice. Let X represent the absolute difference of the two dice. What is the
expectation of X?
The support of X is given by
supp(X) = {0, 1, 2, 3, 4, 5}
60
The discrete density function of X is given by
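$f_X(x) = \begin{cases} P(X = 0) = 6/36 & \text{for } x = 0 \\ P(X = 1) = 10/36 & \text{for } x = 1 \\ P(X = 2) = 8/36 & \text{for } x = 2 \\ P(X = 3) = 6/36 & \text{for } x = 3 \\ P(X = 4) = 4/36 & \text{for } x = 4 \\ P(X = 5) = 2/36 & \text{for } x = 5 \\ 0 & \text{for } x \notin \mathrm{supp}(X) \end{cases}$
This gives
$E(X) = 0 \cdot \frac{6}{36} + 1 \cdot \frac{10}{36} + 2 \cdot \frac{8}{36} + 3 \cdot \frac{6}{36} + 4 \cdot \frac{4}{36} + 5 \cdot \frac{2}{36} = \frac{70}{36} = 1.9444$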
61
Example 2: (Continuous random variable)

Consider the continuous random variable X with pdf
$f_X(x) = \begin{cases} \frac{x}{4} & \text{for } 1 \leq x \leq 3 \\ 0 & \text{elsewise} \end{cases}$
To calculate the expectation we split up the integral:
$E(X) = \int_{-\infty}^{+\infty} x \cdot f_X(x)\,dx = \int_{-\infty}^{1} 0\,dx + \int_{1}^{3} x \cdot \frac{x}{4}\,dx + \int_{3}^{+\infty} 0\,dx$
62
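$= \int_{1}^{3} \frac{x^2}{4}\,dx = \frac{1}{4} \cdot \left[\frac{1}{3} x^3\right]_1^3 = \frac{1}{4} \cdot \left(\frac{27}{3} - \frac{1}{3}\right) = \frac{26}{12} = 2.1667$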
Frequently:
Random variable X plus discrete density or pdf fX is known
We have to find the expectation of the transformed random
variable
Y = g(X)
63
Theorem 2.15: (Expectation of a transformed rv)

Let X be a random variable with discrete density or pdf fX ().
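For any Baire-function $g : \mathbb{R} \longrightarrow \mathbb{R}$ the expectation of the transformed random variable $Y = g(X)$ is given by
$E(Y) = E[g(X)] = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} g(x_j) \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} g(x) \cdot f_X(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$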
64
Remarks:
All functions considered in this course are Baire-functions
For the special case g(x) = x (the identity function) Theorem
2.15 coincides with Definition 2.14
Next:
Some important rules for calculating expected values
65
Theorem 2.16: (Properties of expectations)

Let X be an arbitrary random variable (discrete or continuous),
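$c, c_1, c_2 \in \mathbb{R}$ constants and $g, g_1, g_2 : \mathbb{R} \longrightarrow \mathbb{R}$ functions. Then:
1. $E(c) = c$.
2. $E[c \cdot g(X)] = c \cdot E[g(X)]$.
3. $E[c_1 \cdot g_1(X) + c_2 \cdot g_2(X)] = c_1 \cdot E[g_1(X)] + c_2 \cdot E[g_2(X)]$.
4. If $g_1(x) \leq g_2(x)$ for all $x \in \mathbb{R}$ then
$E[g_1(X)] \leq E[g_2(X)]$.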
Proof: Class
66
Now:
Consider the random variable X (discrete or continuous) and
the explicit function $g(x) = [x - E(X)]^2$
$\longrightarrow$ variance and standard deviation of $X$
Definition 2.17: (Variance, standard deviation)
For any random variable $X$ the variance, denoted by $\mathrm{Var}(X)$, is
defined as the expected quadratic distance between $X$ and its
expectation $E(X)$; that is
$\mathrm{Var}(X) = E[(X - E(X))^2]$.
The standard deviation of $X$, denoted by $\mathrm{SD}(X)$, is defined to

be the (positive) square root of the variance:
$\mathrm{SD}(X) = +\sqrt{\mathrm{Var}(X)}$.
67
Remark:
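Setting $g(X) = [X - E(X)]^2$ in Theorem 2.15 (on slide 64)
yields the following explicit formulas for discrete and continuous random variables:
$\mathrm{Var}(X) = E[g(X)] = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} [x_j - E(X)]^2 \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} [x - E(X)]^2 \cdot f_X(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$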
68
Example: (Discrete random variable)

Consider again the experiment of tossing two dice with X
representing the absolute difference of the two dice (see Example 1 on slide 60). The variance is given by
$\mathrm{Var}(X) = (0 - 70/36)^2 \cdot 6/36 + (1 - 70/36)^2 \cdot 10/36$
$+ (2 - 70/36)^2 \cdot 8/36 + (3 - 70/36)^2 \cdot 6/36$
$+ (4 - 70/36)^2 \cdot 4/36 + (5 - 70/36)^2 \cdot 2/36$

$= 2.05247$
Notice:
The variance is an expectation by definition
$\longrightarrow$ the rules for expectations are applicable
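The two dice examples (expectation on slide 61, variance above) can be reproduced with a few lines of Python. The sketch below assumes two fair dice, enumerates all 36 equally likely outcomes and uses exact fractions; the helper `expectation` is a name chosen here for illustration and implements the discrete case of Theorem 2.15.

```python
from collections import Counter
from fractions import Fraction

# Count how often each absolute difference occurs among the 36 outcomes
# of two dice (fair dice ASSUMED).
counts = Counter(abs(i - j) for i in range(1, 7) for j in range(1, 7))

# Discrete density function f_X(x) = P(X = x) on the support {0, ..., 5}.
density = {x: Fraction(n, 36) for x, n in counts.items()}


def expectation(g=lambda x: x):
    """E[g(X)] = sum of g(x_j) * P(X = x_j) over the support of X."""
    return sum(g(x) * p for x, p in density.items())


mean = expectation()                                  # E(X)
variance = expectation(lambda x: (x - mean) ** 2)     # E[(X - E(X))^2]
# Equivalent form E(X^2) - [E(X)]^2 (rule 1 of Theorem 2.18):
variance_shortcut = expectation(lambda x: x ** 2) - mean ** 2
```

Converting `mean` and `variance` with `float(...)` gives the decimal values reported on the slides.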
69
Theorem 2.18: (Rules for variances)

Let X be an arbitrary random variable (discrete or continuous)
and $a, b \in \mathbb{R}$ real constants; then
1. $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$.
2. $\mathrm{Var}(a + b \cdot X) = b^2 \cdot \mathrm{Var}(X)$.
Proof: Class
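The proof is given in class. One possible derivation of rule 1, offered here only as a sketch, expands the square in Definition 2.17 and applies the linearity of expectations (Theorem 2.16), treating $E(X)$ as a constant:

```latex
\begin{align*}
\mathrm{Var}(X) &= E\big[(X - E(X))^2\big] \\
                &= E\big[X^2 - 2\,E(X)\,X + [E(X)]^2\big] \\
                &= E(X^2) - 2\,E(X)\,E(X) + [E(X)]^2 \\
                &= E(X^2) - [E(X)]^2 .
\end{align*}
```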
Next:
Two important inequalities dealing with expectations and
transformed random variables
70
Theorem 2.19: (Chebyshev inequality)

Let $X$ be an arbitrary random variable and $g : \mathbb{R} \longrightarrow \mathbb{R}^+$ a nonnegative function. Then, for every $k > 0$ we have
$P[g(X) \geq k] \leq \frac{E[g(X)]}{k}$.
Special case:
Consider
$g(x) = [x - E(X)]^2$ and $k = r^2 \cdot \mathrm{Var}(X)$ $(r > 0)$
Theorem 2.19 implies

$P\{[X - E(X)]^2 \geq r^2 \cdot \mathrm{Var}(X)\} \leq \frac{\mathrm{Var}(X)}{r^2 \cdot \mathrm{Var}(X)} = \frac{1}{r^2}$
71
Now:
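$P\{[X - E(X)]^2 \geq r^2 \cdot \mathrm{Var}(X)\} = P\{|X - E(X)| \geq r \cdot \mathrm{SD}(X)\}$

$= 1 - P\{|X - E(X)| < r \cdot \mathrm{SD}(X)\}$
It follows that
$P\{|X - E(X)| < r \cdot \mathrm{SD}(X)\} \geq 1 - \frac{1}{r^2}$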
(specific Chebyshev inequality)
72
Remarks:
The specific Chebyshev inequality provides a minimal probability of the event that any arbitrary random variable $X$ takes
on a value from the following interval:
$[E(X) - r \cdot \mathrm{SD}(X),\; E(X) + r \cdot \mathrm{SD}(X)]$
For example, for $r = 3$ we have
$P\{|X - E(X)| < 3 \cdot \mathrm{SD}(X)\} \geq 1 - \frac{1}{3^2} = \frac{8}{9}$
which is equivalent to
$P\{E(X) - 3 \cdot \mathrm{SD}(X) < X < E(X) + 3 \cdot \mathrm{SD}(X)\} \geq 0.8889$
or
$P\{X \in (E(X) - 3 \cdot \mathrm{SD}(X),\; E(X) + 3 \cdot \mathrm{SD}(X))\} \geq 0.8889$
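To see how conservative the specific Chebyshev inequality can be, the bound may be compared with exact probabilities for a concrete distribution. The following Python sketch reuses the discrete density of the absolute difference of two dice from slide 61; the function names `prob_within` and `chebyshev_bound` and the choice of this distribution are assumptions made here for illustration.

```python
from fractions import Fraction
from math import sqrt

# Discrete density of X = absolute difference of two dice (slide 61).
density = {
    0: Fraction(6, 36), 1: Fraction(10, 36), 2: Fraction(8, 36),
    3: Fraction(6, 36), 4: Fraction(4, 36), 5: Fraction(2, 36),
}

mean = sum(x * p for x, p in density.items())
variance = sum((x - mean) ** 2 * p for x, p in density.items())
sd = sqrt(variance)  # standard deviation as a float


def prob_within(r):
    """Exact P(|X - E(X)| < r * SD(X)) for the density above."""
    return sum(p for x, p in density.items() if abs(x - mean) < r * sd)


def chebyshev_bound(r):
    """Lower bound 1 - 1/r^2 from the specific Chebyshev inequality."""
    return 1 - 1 / r ** 2
```

For every `r > 0` the exact probability returned by `prob_within(r)` should be at least `chebyshev_bound(r)`; the bound holds for any distribution and is therefore often far from tight.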
73
Theorem 2.20: (Jensen inequality)

Let $X$ be a random variable with mean $E(X)$ and let $g : \mathbb{R} \longrightarrow \mathbb{R}$
be a convex function, i.e. for all $x$ we have $g''(x) \geq 0$; then
$E[g(X)] \geq g(E[X])$.
Remarks:
If the function $g$ is concave (i.e. if $g''(x) \leq 0$ for all $x$) then
Jensen's inequality states that $E[g(X)] \leq g(E[X])$
Notice that in general we have
$E[g(X)] \neq g(E[X])$
74
Example:
Consider the random variable $X$ and the function $g(x) = x^2$
We have $g''(x) = 2 \geq 0$ for all $x$, i.e. $g$ is convex
It follows from Jensen's inequality that
$\underbrace{E[g(X)]}_{=E(X^2)} \geq \underbrace{g(E[X])}_{=[E(X)]^2}$
i.e.
$E(X^2) - [E(X)]^2 \geq 0$
This implies
$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 \geq 0$
(the variance of an arbitrary rv cannot be negative)
75
Now:
Consider the random variable $X$ with expectation $E(X) = \mu_X$,
the integer number $n \in \mathbb{N}$ and the functions
$g_1(x) = x^n$
$g_2(x) = [x - \mu_X]^n$
Definition 2.21: (Moments, central moments)

(a) The $n$-th moment of $X$, denoted by $\mu'_n$, is defined as
$\mu'_n \equiv E[g_1(X)] = E(X^n)$.
(b) The $n$-th central moment of $X$ about $\mu_X$, denoted by $\mu_n$, is
defined as
$\mu_n \equiv E[g_2(X)] = E[(X - \mu_X)^n]$.
76
Relations:
$\mu'_1 = E(X) = \mu_X$
(the 1st moment coincides with $E(X)$)
$\mu_1 = E[X - \mu_X] = E(X) - \mu_X = 0$
(the 1st central moment is always equal to 0)
$\mu_2 = E[(X - \mu_X)^2] = \mathrm{Var}(X)$
(the 2nd central moment coincides with Var(X))
77
Remarks:
The first four moments of a random variable X are important
measures of the probability distribution
(expectation, variance, skewness, kurtosis)
The moments of a random variable X play an important role
in theoretical and applied statistics
In some cases, when all moments are known, the cdf of a
random variable X can be determined
78
Question:
Can we find a function that gives us a representation of all
moments of a random variable X?
Definition 2.22: (Moment generating function)

Let $X$ be a random variable with discrete density or pdf $f_X(\cdot)$.
The expected value of $e^{tX}$ is defined to be the moment generating function of $X$ if the expected value exists for every value
of $t$ in some interval $-h < t < h$, $h > 0$. That is, the moment
generating function of $X$, denoted by $m_X(t)$, is defined as
$m_X(t) = E\left[e^{tX}\right]$.
79
Remarks:
The moment generating function mX (t) is a function in t
There are rvs X for which mX (t) does not exist
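If $m_X(t)$ exists it can be calculated as
$m_X(t) = E\left[e^{tX}\right] = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} e^{t x_j} \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} e^{tx} \cdot f_X(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$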
80
Question:
Why is mX (t) called the moment generating function?
Answer:
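Consider the $n$th derivative of $m_X(t)$ with respect to $t$:
$\frac{d^n}{dt^n} m_X(t) = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} (x_j)^n \cdot e^{t x_j} \cdot P(X = x_j) & \text{for discrete } X \\ \int_{-\infty}^{+\infty} x^n \cdot e^{tx} \cdot f_X(x)\,dx & \text{for continuous } X \end{cases}$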
81
Now, evaluate the nth derivative at t = 0:
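$\frac{d^n}{dt^n} m_X(0) = \begin{cases} \sum_{\{x_j \in \mathrm{supp}(X)\}} (x_j)^n \cdot P(X = x_j) & \text{for discrete } X \\ \int_{-\infty}^{+\infty} x^n \cdot f_X(x)\,dx & \text{for continuous } X \end{cases}$
$= E(X^n) = \mu'_n$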
(see Definition 2.21(a) on slide 76)
82
Example:
Let X be a continuous random variable with pdf
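$f_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \lambda e^{-\lambda x} & \text{for } x \geq 0 \end{cases}$
(exponential distribution with parameter $\lambda > 0$)

We have
$m_X(t) = E\left[e^{tX}\right] = \int_{-\infty}^{+\infty} e^{tx} \cdot f_X(x)\,dx = \int_{0}^{+\infty} \lambda e^{(t - \lambda)x}\,dx = \frac{\lambda}{\lambda - t}$
for $t < \lambda$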
83
It follows that
$m_X'(t) = \frac{\lambda}{(\lambda - t)^2}$ and $m_X''(t) = \frac{2\lambda}{(\lambda - t)^3}$
and thus
$m_X'(0) = E(X) = \frac{1}{\lambda}$ and $m_X''(0) = E(X^2) = \frac{2}{\lambda^2}$
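The moment-generating property can also be checked numerically. The following Python sketch approximates the first two derivatives of $m_X(t) = \lambda/(\lambda - t)$ at $t = 0$ by central finite differences and compares them with $1/\lambda$ and $2/\lambda^2$. The value $\lambda = 2$, the step size `h` and the function names are illustrative assumptions; finite differences are used here only as a simple stand-in for the analytical derivatives on the slide.

```python
def mgf_exponential(t, lam):
    """m_X(t) = lam / (lam - t), which exists only for t < lam."""
    if t >= lam:
        raise ValueError("the mgf exists only for t < lam")
    return lam / (lam - t)


def first_derivative_at_zero(m, h=1e-3):
    """Central difference approximation of m'(0)."""
    return (m(h) - m(-h)) / (2 * h)


def second_derivative_at_zero(m, h=1e-3):
    """Central difference approximation of m''(0)."""
    return (m(h) - 2 * m(0.0) + m(-h)) / h ** 2


lam = 2.0  # illustrative parameter value


def m(t):
    """mgf of the exponential distribution with parameter lam."""
    return mgf_exponential(t, lam)


first_moment = first_derivative_at_zero(m)     # approximates E(X) = 1/lam
second_moment = second_derivative_at_zero(m)   # approximates E(X^2) = 2/lam^2
variance = second_moment - first_moment ** 2   # approximates 1/lam^2
```

Combining both moments via $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$ gives approximately $1/\lambda^2$.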
Now:
Important result on moment generating functions
84
Theorem 2.23: (Identification property)

Let $X$ and $Y$ be two random variables with densities $f_X(\cdot)$ and
$f_Y(\cdot)$, respectively. Suppose that $m_X(t)$ and $m_Y(t)$ both exist
and that $m_X(t) = m_Y(t)$ for all $t$ in the interval $-h < t < h$ for
some $h > 0$. Then the two cdfs $F_X(\cdot)$ and $F_Y(\cdot)$ are equal; that
is $F_X(x) = F_Y(x)$ for all $x$.
Remarks:
Theorem 2.23 states that there is a unique cdf FX (x) for a
given moment generating function mX (t)
$\longrightarrow$ if we can find $m_X(t)$ for $X$ then, at least theoretically, we
can find the distribution of $X$
We will make use of this property in Section 4
85
Example:
Suppose that a random variable X has the moment generating function
$m_X(t) = \frac{1}{1 - t}$ for $-1 < t < 1$
Then the pdf of $X$ is given by

$f_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ e^{-x} & \text{for } x \geq 0 \end{cases}$
(exponential distribution with parameter $\lambda = 1$)
86
2.4 Special Parametric Families of Univariate Distributions

Up to now:
General mathematical properties of arbitrary distributions
Distinction: discrete vs. continuous distributions
Consideration of
the cdf FX (x)
the discrete density or the pdf fX (x)
expectations of the form E[g(X)]
the moment generating function mX (t)
87
Central result:
The distribution of a random variable X is (essentially) determined by fX (x) or FX (x)
FX (x) can be determined by fX (x)
(cf. slide 46)
fX (x) can be determined by FX (x)
(cf. slide 48)
Question:
How many different distributions are known to exist?
88
Answer:
Infinitely many
But:
In practice, there are some important parametric families of
distributions that provide good models for representing real-world random phenomena
These families of distributions are described in detail in all
textbooks on mathematical statistics
(see e.g. Mosler & Schmid (2008), Mood et al. (1974))
89
Important families of discrete distributions

Bernoulli distribution
Binomial distribution
Geometric distribution
Poisson distribution
Important families of continuous distributions
Uniform or rectangular distribution
Exponential distribution
Normal distribution
90
Remark:
The most important family of distributions of all is the normal distribution
Definition 2.24: (Normal distribution)

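A continuous random variable $X$ is defined to be normally distributed with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$, denoted by $X \sim N(\mu, \sigma^2)$, if its pdf is given by
$f_X(x) = \frac{1}{\sqrt{2\pi} \cdot \sigma} \cdot e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in \mathbb{R}$.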
91
[Figure: pdfs $f_X(x)$ of the normal distributions N(0,1), N(5,1), N(5,3) and N(5,5)]
92
Remarks:
The special normal distribution $N(0, 1)$ is called the standard normal distribution, the pdf of which is denoted by $\varphi(x)$
The properties as well as calculation rules for normally distributed random variables are important prerequisites for
this course
(see Wilfling (2011), Section 3.4)
93
3. Joint and Conditional Distributions, Stochastic Independence
Aim of this section:
Multidimensional random variables (random vectors)
(joint and marginal distributions)
Stochastic (in)dependence and conditional distribution
Multivariate normal distribution
(definition, properties)
Literature:
Mood, Graybill, Boes (1974), Chapter IV, pp. 129-174
Wilfling (2011), Chapter 4
94
3.1 Joint and Marginal Distribution

Now:
Consider several random variables simultaneously
Applications:
Several economic applications
Statistical inference
95
Definition 3.1: (Random vector)

Let $X_1, \ldots, X_n$ be a set of $n$ random variables each representing
the same random experiment, i.e.
$X_i : \Omega \longrightarrow \mathbb{R}$ for $i = 1, \ldots, n$.
Then $\mathbf{X} = (X_1, \ldots, X_n)'$ is called an $n$-dimensional random variable or an $n$-dimensional random vector.
Remark:
In the literature random vectors are often denoted by
$\mathbf{X} = (X_1, \ldots, X_n)$
or more simply by
$X_1, \ldots, X_n$
96
For $n = 2$ it is common practice to write

$\mathbf{X} = (X, Y)'$ or $(X, Y)$ or $X, Y$
Realizations are denoted by small letters:

$\mathbf{x} = (x_1, \ldots, x_n)' \in \mathbb{R}^n$ or $\mathbf{x} = (x, y)' \in \mathbb{R}^2$
Now:
Characterization of the probability distribution of the random
vector X
97
Definition 3.2: (Joint cumulative distribution function)

Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an $n$-dimensional random vector. The
function
$F_{X_1,\ldots,X_n} : \mathbb{R}^n \longrightarrow [0, 1]$
defined by
$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n)$
is called the joint cumulative distribution function of $\mathbf{X}$.
Remark:
Definition 3.2 applies to discrete as well as to continuous
random variables X1, . . . , Xn
98
Some properties of the bivariate cdf (n = 2):

$F_{X,Y}(x, y)$ is monotone increasing in $x$ and $y$
$\lim_{x \to -\infty} F_{X,Y}(x, y) = 0$
$\lim_{x \to +\infty,\; y \to +\infty} F_{X,Y}(x, y) = 1$
$\lim_{y \to -\infty} F_{X,Y}(x, y) = 0$
Remark:
Analogous properties hold for the $n$-dimensional cdf $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$
99
Now:
Joint discrete versus joint continuous random vectors
Definition 3.3: (Joint discrete random vector)
The random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is defined to be a joint discrete random vector if it can assume only a finite (or a countably
infinite) number of realizations $\mathbf{x} = (x_1, \ldots, x_n)'$ such that
$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) > 0$
and
$\sum P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = 1$,
where the summation is over all possible realizations of $\mathbf{X}$.

100
Definition 3.4: (Joint continuous random vector)

The random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is defined to be a joint
continuous random vector if and only if there exists a nonnegative
function $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ such that
$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f_{X_1,\ldots,X_n}(u_1, \ldots, u_n)\,du_1 \ldots du_n$
for all $(x_1, \ldots, x_n)$. The function $f_{X_1,\ldots,X_n}$ is defined to be a joint

probability density function of $\mathbf{X}$.
Example:
Consider $\mathbf{X} = (X, Y)'$ with joint pdf
$f_{X,Y}(x, y) = \begin{cases} x + y & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0 & \text{elsewise} \end{cases}$
101
[Figure: surface plot of the joint pdf $f_{X,Y}(x, y)$ over $(x, y) \in [0, 1] \times [0, 1]$]
102
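The joint cdf can be obtained by

$$F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\, du\, dv = \int_{0}^{y} \int_{0}^{x} (u + v)\, du\, dv = \dots$$
$$= \begin{cases} 0.5(x^2 y + x y^2) & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0.5(x^2 + x) & \text{for } (x, y) \in [0, 1] \times [1, \infty) \\ 0.5(y^2 + y) & \text{for } (x, y) \in [1, \infty) \times [0, 1] \\ 1 & \text{for } (x, y) \in [1, \infty) \times [1, \infty) \end{cases}$$
(Proof: Class)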
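The closed form of the joint cdf can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the double integral is approximated by a midpoint rule on a grid whose size is an arbitrary choice.

```python
def joint_pdf(x, y):
    """Joint pdf of the example: f(x, y) = x + y on the unit square, 0 elsewhere."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0


def joint_cdf_numeric(x, y, steps=200):
    """Approximate F(x, y) by a midpoint rule (hypothetical helper)."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    # The density vanishes outside the unit square, so the range is clipped at 1.
    x_hi, y_hi = min(x, 1.0), min(y, 1.0)
    du, dv = x_hi / steps, y_hi / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        for j in range(steps):
            v = (j + 0.5) * dv
            total += joint_pdf(u, v)
    return total * du * dv


def joint_cdf_closed(x, y):
    """Closed form for x, y >= 0; clipping at 1 reproduces all four cases."""
    x_c, y_c = min(x, 1.0), min(y, 1.0)
    return 0.5 * (x_c**2 * y_c + x_c * y_c**2)


if __name__ == "__main__":
    for point in [(0.5, 0.5), (0.3, 2.0), (3.0, 3.0)]:
        print(point, joint_cdf_numeric(*point), joint_cdf_closed(*point))
```

Clipping both arguments at 1 is a compact way of writing the four cases: once an argument exceeds 1, no further probability mass is accumulated in that direction.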
103
Remarks:
If $X = (X_1, \dots, X_n)'$ is a joint continuous random vector, then
$$\frac{\partial^n F_{X_1,\dots,X_n}(x_1, \dots, x_n)}{\partial x_1 \cdots \partial x_n} = f_{X_1,\dots,X_n}(x_1, \dots, x_n)$$
The volume under the joint pdf represents probabilities:
$$P(a_1^u < X_1 \le a_1^o, \dots, a_n^u < X_n \le a_n^o) = \int_{a_n^u}^{a_n^o} \cdots \int_{a_1^u}^{a_1^o} f_{X_1,\dots,X_n}(u_1, \dots, u_n)\, du_1 \dots du_n$$
104
In this course:
Emphasis on joint continuous random vectors
Analogous results for joint discrete random vectors
(see Mood, Graybill, Boes (1974), Chapter IV)
Now:
Determination of the distribution of a single random variable $X_i$ from the joint distribution of the random vector $(X_1, \dots, X_n)'$
→ marginal distribution
105
Definition 3.5: (Marginal distribution)

Let $X = (X_1, \dots, X_n)'$ be a continuous random vector with joint cdf $F_{X_1,\dots,X_n}$ and joint pdf $f_{X_1,\dots,X_n}$. Then
$$F_{X_1}(x_1) = F_{X_1,\dots,X_n}(x_1, +\infty, +\infty, \dots, +\infty, +\infty)$$
$$F_{X_2}(x_2) = F_{X_1,\dots,X_n}(+\infty, x_2, +\infty, \dots, +\infty, +\infty)$$
$$\vdots$$
$$F_{X_n}(x_n) = F_{X_1,\dots,X_n}(+\infty, +\infty, +\infty, \dots, +\infty, x_n)$$
are called marginal cdfs while
106
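$$f_{X_1}(x_1) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\dots,X_n}(x_1, x_2, \dots, x_n)\, dx_2 \dots dx_n$$
$$f_{X_2}(x_2) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\dots,X_n}(x_1, x_2, \dots, x_n)\, dx_1\, dx_3 \dots dx_n$$
$$\vdots$$
$$f_{X_n}(x_n) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\dots,X_n}(x_1, x_2, \dots, x_n)\, dx_1\, dx_2 \dots dx_{n-1}$$
are called marginal pdfs of the one-dimensional (univariate) random variables $X_1, \dots, X_n$.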
107
Example:
Consider the bivariate pdf
$$f_{X,Y}(x, y) = \begin{cases} 40(x - 0.5)^2 y^3 (3 - 2x - y) & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0 & \text{elsewise} \end{cases}$$
108
[Figure: surface plot of the bivariate pdf $f_{X,Y}(x, y)$ over $(x, y) \in [0, 1] \times [0, 1]$]
109
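The marginal pdf of X is obtained, for $x \in [0, 1]$, as

$$f_X(x) = \int_0^1 40(x - 0.5)^2 y^3 (3 - 2x - y)\, dy$$
$$= 40(x - 0.5)^2 \int_0^1 (3y^3 - 2xy^3 - y^4)\, dy$$
$$= 40(x - 0.5)^2 \left[ \frac{3}{4} y^4 - \frac{2x}{4} y^4 - \frac{1}{5} y^5 \right]_0^1$$
$$= 40(x - 0.5)^2 \left( \frac{3}{4} - \frac{2x}{4} - \frac{1}{5} \right)$$
$$= -20x^3 + 42x^2 - 27x + 5.5$$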

110
[Figure: plot of the marginal pdf $f_X(x)$ for $x \in [0, 1]$]
111
The marginal pdf of Y is obtained, for $y \in [0, 1]$, as

$$f_Y(y) = \int_0^1 40(x - 0.5)^2 y^3 (3 - 2x - y)\, dx$$
$$= 40 y^3 \int_0^1 (x - 0.5)^2 (3 - 2x - y)\, dx$$
$$= -\frac{10}{3} y^3 (y - 2)$$
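Both marginal densities of this example can be verified numerically. The sketch below is an illustration only: the helper `integrate` is a hypothetical midpoint-rule routine with an arbitrary step count, and the closed forms used are $f_X(x) = -20x^3 + 42x^2 - 27x + 5.5$ and $f_Y(y) = -\frac{10}{3} y^3 (y - 2)$ on $[0, 1]$.

```python
def joint_pdf(x, y):
    """Bivariate pdf of the example, zero outside the unit square."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 40.0 * (x - 0.5) ** 2 * y**3 * (3.0 - 2.0 * x - y)
    return 0.0


def integrate(func, lower, upper, steps=2000):
    """Midpoint-rule approximation of a one-dimensional integral."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (k + 0.5) * h) for k in range(steps))


def marginal_x_numeric(x):
    """Integrate out y."""
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0)


def marginal_y_numeric(y):
    """Integrate out x."""
    return integrate(lambda x: joint_pdf(x, y), 0.0, 1.0)


def marginal_x_closed(x):
    return -20.0 * x**3 + 42.0 * x**2 - 27.0 * x + 5.5


def marginal_y_closed(y):
    return -(10.0 / 3.0) * y**3 * (y - 2.0)


if __name__ == "__main__":
    print(marginal_x_numeric(0.3), marginal_x_closed(0.3))
    print(marginal_y_numeric(0.8), marginal_y_closed(0.8))
    # Each marginal pdf should integrate to (approximately) one.
    print(integrate(marginal_x_closed, 0.0, 1.0), integrate(marginal_y_closed, 0.0, 1.0))
```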
112
[Figure: plot of the marginal pdf $f_Y(y)$ for $y \in [0, 1]$]
113
Remarks:
When considering the marginal instead of the joint distributions, we are faced with an information loss
(the joint distribution uniquely determines all marginal distributions, but the converse does not hold in general)
Besides the respective univariate marginal distributions, there are also multivariate distributions which can be obtained from the joint distribution of $X = (X_1, \dots, X_n)'$
114
Example:
For n = 5 consider $X = (X_1, \dots, X_5)'$ with joint pdf $f_{X_1,\dots,X_5}$
Then the marginal pdf of $Z = (X_1, X_3, X_5)'$ is obtained as
$$f_{X_1,X_3,X_5}(x_1, x_3, x_5) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f_{X_1,\dots,X_5}(x_1, x_2, x_3, x_4, x_5)\, dx_2\, dx_4$$
(integrate out the irrelevant components)
115
3.2 Conditional Distribution and Stochastic Independence

Now:
Distribution of a random variable X under the condition that
another random variable Y has already taken on the realization y
(conditional distribution of X given Y = y)
116
Definition 3.6: (Conditional distribution)

Let $X = (X, Y)'$ be a bivariate continuous random vector with joint pdf $f_{X,Y}(x, y)$. The conditional density of X given Y = y is defined to be
$$f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$
Analogously, the conditional density of Y given X = x is defined to be
$$f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
117
Remark:
Conditional densities of random vectors are defined analogously, e.g.
$$f_{X_1,X_2,X_4|X_3=x_3,X_5=x_5}(x_1, x_2, x_4) = \frac{f_{X_1,X_2,X_3,X_4,X_5}(x_1, x_2, x_3, x_4, x_5)}{f_{X_3,X_5}(x_3, x_5)}$$
118
Example:
Consider the bivariate pdf
$$f_{X,Y}(x, y) = \begin{cases} 40(x - 0.5)^2 y^3 (3 - 2x - y) & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0 & \text{elsewise} \end{cases}$$
with marginal pdf
$$f_Y(y) = -\frac{10}{3} y^3 (y - 2)$$
(cf. Slides 108-112)
119
It follows that
$$f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{40(x - 0.5)^2 y^3 (3 - 2x - y)}{-\frac{10}{3} y^3 (y - 2)} = \frac{12(x - 0.5)^2 (3 - 2x - y)}{2 - y}$$
120
[Figure: conditional density $f_{X|Y=0.01}(x)$ of X given Y = 0.01, plotted for $x \in [0, 1]$]
121
[Figure: conditional density $f_{X|Y=0.95}(x)$ of X given Y = 0.95, plotted for $x \in [0, 1]$]
122
Now:
Combine the concepts joint distribution and conditional
distribution to define the notion stochastic independence
(for two random variables first)
Definition 3.7: (Stochastic Independence [I])

Let $(X, Y)'$ be a bivariate continuous random vector with joint pdf $f_{X,Y}(x, y)$. X and Y are defined to be stochastically independent if and only if
$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$$
for all $x, y \in \mathbb{R}$.
123
Remarks:
Alternatively, stochastic independence can be defined via the cdfs:
X and Y are stochastically independent, if and only if
$$F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$$
for all $x, y \in \mathbb{R}$.
If X and Y are independent, we have
$$f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x) \cdot f_Y(y)}{f_Y(y)} = f_X(x)$$
$$f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{f_X(x) \cdot f_Y(y)}{f_X(x)} = f_Y(y)$$
If X and Y are independent and g and h are two continuous functions, then g(X) and h(Y) are also independent
124
Now:
Extension to n random variables
Definition 3.8: (Stochastic independence [II])
Let $(X_1, \dots, X_n)'$ be a continuous random vector with joint pdf $f_{X_1,\dots,X_n}(x_1, \dots, x_n)$ and joint cdf $F_{X_1,\dots,X_n}(x_1, \dots, x_n)$. $X_1, \dots, X_n$ are defined to be stochastically independent, if and only if for all $(x_1, \dots, x_n)' \in \mathbb{R}^n$
$$f_{X_1,\dots,X_n}(x_1, \dots, x_n) = f_{X_1}(x_1) \cdot \ldots \cdot f_{X_n}(x_n)$$
or
$$F_{X_1,\dots,X_n}(x_1, \dots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n).$$
125
Remarks:
For discrete random vectors we define: $X_1, \dots, X_n$ are stochastically independent, if and only if for all $(x_1, \dots, x_n)' \in \mathbb{R}^n$
$$P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \cdot \ldots \cdot P(X_n = x_n)$$
or
$$F_{X_1,\dots,X_n}(x_1, \dots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n)$$
In the case of independence, the joint distribution results from the marginal distributions
If $X_1, \dots, X_n$ are stochastically independent and $g_1, \dots, g_n$ are continuous functions, then $Y_1 = g_1(X_1), \dots, Y_n = g_n(X_n)$ are also stochastically independent
126
3.3 Expectation and Joint Moment Generating Functions

Now:
Definition of the expectation of a function
$$g : \mathbb{R}^n \to \mathbb{R}, \quad (x_1, \dots, x_n) \mapsto g(x_1, \dots, x_n)$$
of a continuous random vector $X = (X_1, \dots, X_n)'$
127
Definition 3.9: (Expectation of a function)

Let $(X_1, \dots, X_n)'$ be a continuous random vector with joint pdf $f_{X_1,\dots,X_n}(x_1, \dots, x_n)$ and $g : \mathbb{R}^n \to \mathbb{R}$ a real-valued continuous function. The expectation of the function g of the random vector is defined to be
$$E[g(X_1, \dots, X_n)] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \dots, x_n) \cdot f_{X_1,\dots,X_n}(x_1, \dots, x_n)\, dx_1 \dots dx_n.$$
128
Remarks:
For a discrete random vector $(X_1, \dots, X_n)'$ the analogous definition is
$$E[g(X_1, \dots, X_n)] = \sum g(x_1, \dots, x_n) \cdot P(X_1 = x_1, \dots, X_n = x_n),$$
where the summation is over all realizations of the vector
Definition 3.9 includes the expectation of a univariate random variable X:
Set n = 1 and g(x) = x
$$E(X_1) \equiv E(X) = \int_{-\infty}^{+\infty} x f_X(x)\, dx$$
Definition 3.9 includes the variance of X:
Set n = 1 and $g(x) = [x - E(X)]^2$
$$Var(X_1) \equiv Var(X) = \int_{-\infty}^{+\infty} [x - E(X)]^2 f_X(x)\, dx$$
129
Definition 3.9 includes the covariance of two variables:

Set n = 2 and $g(x_1, x_2) = [x_1 - E(X_1)] \cdot [x_2 - E(X_2)]$
$$Cov(X_1, X_2) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} [x_1 - E(X_1)][x_2 - E(X_2)]\, f_{X_1,X_2}(x_1, x_2)\, dx_1\, dx_2$$
Via the covariance we define the correlation coefficient:

$$Corr(X_1, X_2) = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)} \sqrt{Var(X_2)}}$$
General properties of expected values, variances, covariances and the correlation coefficient
→ Class
130
Now:
Expectation and variances of random vectors
Definition 3.10: (Expected vector, covariance matrix)
Let $X = (X_1, \dots, X_n)'$ be a random vector. The expected vector of X is defined to be
$$E(X) = \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix}.$$
The covariance matrix of X is defined to be
$$Cov(X) = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \dots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \dots & Cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \dots & Var(X_n) \end{pmatrix}$$
131
Remark:
Obviously, the covariance matrix is symmetric by definition
Now:
Expected vectors and covariance matrices under linear transformations of random vectors
Let
$X = (X_1, \dots, X_n)'$ be an n-dimensional random vector
A be an $(m \times n)$ matrix of real numbers
b be an $(m \times 1)$ column vector of real numbers
132
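Obviously:
Y = AX + b is an $(m \times 1)$ random vector:
$$Y = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = \begin{pmatrix} a_{11}X_1 + a_{12}X_2 + \dots + a_{1n}X_n + b_1 \\ a_{21}X_1 + a_{22}X_2 + \dots + a_{2n}X_n + b_2 \\ \vdots \\ a_{m1}X_1 + a_{m2}X_2 + \dots + a_{mn}X_n + b_m \end{pmatrix}$$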
133
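The expected vector of Y is given by
$$E(Y) = \begin{pmatrix} a_{11}E(X_1) + a_{12}E(X_2) + \dots + a_{1n}E(X_n) + b_1 \\ a_{21}E(X_1) + a_{22}E(X_2) + \dots + a_{2n}E(X_n) + b_2 \\ \vdots \\ a_{m1}E(X_1) + a_{m2}E(X_2) + \dots + a_{mn}E(X_n) + b_m \end{pmatrix} = A\,E(X) + b$$
The covariance matrix of Y (an $(m \times m)$ matrix, since Y has m components) is given by
$$Cov(Y) = \begin{pmatrix} Var(Y_1) & Cov(Y_1, Y_2) & \dots & Cov(Y_1, Y_m) \\ Cov(Y_2, Y_1) & Var(Y_2) & \dots & Cov(Y_2, Y_m) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(Y_m, Y_1) & Cov(Y_m, Y_2) & \dots & Var(Y_m) \end{pmatrix} = A\,Cov(X)\,A'$$
(Proof: Class)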
134
Remark:
Cf. the analogous results for univariate variables:
$E(a \cdot X + b) = a \cdot E(X) + b$
$Var(a \cdot X + b) = a^2 \cdot Var(X)$
Up to now:
Expected values for unconditional distributions
Now:
Expected values for conditional distributions
(cf. Definition 3.6, Slide 117)
135
Definition 3.11: (Conditional expected value of a function)

Let $(X, Y)'$ be a continuous random vector with joint pdf $f_{X,Y}(x, y)$ and let $g : \mathbb{R}^2 \to \mathbb{R}$ be a real-valued function. The conditional expected value of the function g given X = x is defined to be
$$E[g(X, Y)|X = x] = \int_{-\infty}^{+\infty} g(x, y) \cdot f_{Y|X=x}(y)\, dy.$$
136
Remarks:
An analogous definition applies to a discrete random vector
(X, Y )0
Definition 3.11 naturally extends to higher-dimensional distributions
For g(x, y) = y we obtain the special case E[g(X, Y )|X = x] =
E(Y |X = x)
Note that E[g(X, Y )|X = x] is a function of x
137
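Example:
Consider the joint pdf
$$f_{X,Y}(x, y) = \begin{cases} x + y & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0 & \text{elsewise} \end{cases}$$
The conditional distribution of Y given X = x is given by
$$f_{Y|X=x}(y) = \begin{cases} \dfrac{x + y}{x + 0.5} & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0 & \text{elsewise} \end{cases}$$
For g(x, y) = y the conditional expectation is given as
$$E(Y|X = x) = \int_0^1 y \cdot \frac{x + y}{x + 0.5}\, dy = \frac{1}{x + 0.5} \left( \frac{x}{2} + \frac{1}{3} \right)$$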
138
Remarks:
Consider the function g(x, y) = g(y)
(i.e. g does not depend on x)
Denote h(x) = E[g(Y )|X = x]
We calculate the unconditional expectation of the transformed variable h(X)
We have
139
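$$E\{E[g(Y)|X = x]\} = E[h(X)] = \int_{-\infty}^{+\infty} h(x) \cdot f_X(x)\, dx$$
$$= \int_{-\infty}^{+\infty} E[g(Y)|X = x] \cdot f_X(x)\, dx$$
$$= \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} g(y) \cdot f_{Y|X=x}(y)\, dy \right] f_X(x)\, dx$$
$$= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(y) \cdot f_{Y|X=x}(y) \cdot f_X(x)\, dy\, dx$$
$$= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(y) \cdot f_{X,Y}(x, y)\, dy\, dx$$
$$= E[g(Y)]$$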
140
Theorem 3.12:
Let (X, Y )0 be an arbitrary discrete or continuous random vector.
Then
E[g(Y )] = E {E[g(Y )|X = x]}

and, in particular,
E[Y ] = E {E[Y |X = x]} .
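Theorem 3.12 can be illustrated with the joint pdf $f_{X,Y}(x, y) = x + y$ on the unit square from the earlier example, for which $f_X(x) = x + 0.5$ and $E(Y|X = x) = (x/2 + 1/3)/(x + 0.5)$. The sketch below is an illustration only; the function names and the midpoint-rule step count are assumptions. It computes $E[Y]$ once directly and once via the conditional expectation.

```python
def integrate(func, lower, upper, steps=2000):
    """Midpoint-rule approximation of a one-dimensional integral."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (k + 0.5) * h) for k in range(steps))


def marginal_x(x):
    """f_X(x): integral of (x + y) over y in [0, 1]."""
    return x + 0.5


def marginal_y(y):
    """f_Y(y): integral of (x + y) over x in [0, 1]."""
    return y + 0.5


def cond_mean_y_given_x(x):
    """E(Y | X = x) from the example on the conditional expectation."""
    return (x / 2.0 + 1.0 / 3.0) / (x + 0.5)


def mean_y_direct():
    """E[Y] computed from the marginal pdf of Y."""
    return integrate(lambda y: y * marginal_y(y), 0.0, 1.0)


def mean_y_iterated():
    """E[E(Y | X)]: average the conditional mean over the marginal pdf of X."""
    return integrate(lambda x: cond_mean_y_given_x(x) * marginal_x(x), 0.0, 1.0)


if __name__ == "__main__":
    # Both values should be close to 7/12.
    print(mean_y_direct(), mean_y_iterated())
```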

Now:
Three important rules for conditional and unconditional expected values
141
Theorem 3.13:
Let $(X, Y)'$ be an arbitrary discrete or continuous random vector and $g_1(\cdot)$, $g_2(\cdot)$ two unidimensional functions. Then
1. $E[g_1(Y) + g_2(Y)|X = x] = E[g_1(Y)|X = x] + E[g_2(Y)|X = x]$,
2. $E[g_1(Y) \cdot g_2(X)|X = x] = g_2(x) \cdot E[g_1(Y)|X = x]$.
3. If X and Y are stochastically independent we have
$E[g_1(X) \cdot g_2(Y)] = E[g_1(X)] \cdot E[g_2(Y)]$.
142
Finally:
Moment generating function for random vectors
Definition 3.14: (Joint moment generating function)
Let $X = (X_1, \dots, X_n)'$ be an arbitrary discrete or continuous random vector. The joint moment generating function of X is defined to be
$$m_{X_1,\dots,X_n}(t_1, \dots, t_n) = E\left[ e^{t_1 X_1 + \dots + t_n X_n} \right]$$
if this expectation exists for all $t_1, \dots, t_n$ with $-h < t_j < h$ for some value h > 0 and for all j = 1, ..., n.
143
Remarks:
Via the joint moment generating function mX1,...,Xn (t1, . . . , tn)
we can derive the following mathematical objects:
the marginal moment generating functions mX1 (t1), . . . ,
mXn (tn)
the moments of the marginal distributions
the so-called joint moments
144
Important result: (cf. Theorem 2.23, Slide 85)
For any given joint moment generating function

mX1,...,Xn (t1, . . . , tn) there exists a unique joint cdf
FX1,...,Xn (x1, . . . , xn)
145
3.4 The Multivariate Normal Distribution

Now:
Extension of the univariate normal distribution
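Definition 3.15: (Multivariate normal distribution)
Let $X = (X_1, \dots, X_n)'$ be a continuous random vector. X is defined to have a multivariate normal distribution with parameters
$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{pmatrix},$$
if for $x = (x_1, \dots, x_n)' \in \mathbb{R}^n$ its joint pdf is given by
$$f_X(x) = (2\pi)^{-n/2}\, [\det(\Sigma)]^{-1/2} \cdot \exp\left\{ -\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\}.$$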
146
Remarks:
See Chang (1984, p. 92) for a definition and the properties of the determinant det(A) of the matrix A
Notation:
$X \sim N(\mu, \Sigma)$
$\mu$ is a column vector with $\mu_1, \dots, \mu_n \in \mathbb{R}$
$\Sigma$ is a regular, positive definite, symmetric $(n \times n)$ matrix
Role of the parameters:
$E(X) = \mu$ and $Cov(X) = \Sigma$
147
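Joint pdf of the multivariate standard normal distribution $N(0, I_n)$:
$$\phi(x) = (2\pi)^{-n/2} \cdot \exp\left\{ -\frac{1}{2} x'x \right\}$$
Cf. the analogy to the univariate pdf in Definition 2.24, Slide 91
Properties of the $N(\mu, \Sigma)$ distribution:
Partial vectors (marginal distributions) of X also have multivariate normal distributions, i.e. if
$$X = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$
then
$\mathbf{X}_1 \sim N(\boldsymbol{\mu}_1, \Sigma_{11})$
$\mathbf{X}_2 \sim N(\boldsymbol{\mu}_2, \Sigma_{22})$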
148
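Thus, all univariate variables of $X = (X_1, \dots, X_n)'$ have univariate normal distributions:
$X_1 \sim N(\mu_1, \sigma_1^2)$
$X_2 \sim N(\mu_2, \sigma_2^2)$
...
$X_n \sim N(\mu_n, \sigma_n^2)$
The conditional distributions are also (univariately or multivariately) normal:
$$\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim N\left( \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)$$
Linear transformations:
Let A be an $(m \times n)$ matrix, b an $(m \times 1)$ vector of real numbers and $X = (X_1, \dots, X_n)' \sim N(\mu, \Sigma)$. Then
$$AX + b \sim N(A\mu + b, A \Sigma A')$$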
149
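Example:
Consider
$$X \sim N(\mu, \Sigma) = N\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2 \end{pmatrix} \right)$$
Find the distribution of Y = AX + b where
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$
It follows that $Y \sim N(A\mu + b, A \Sigma A')$
In particular,
$$A\mu + b = \begin{pmatrix} 3 \\ 6 \end{pmatrix} \quad \text{and} \quad A \Sigma A' = \begin{pmatrix} 11 & 24 \\ 24 & 53 \end{pmatrix}$$
(the upper-left entry is $Var(X_1 + 2X_2) = 1 + 4 \cdot 2 + 2 \cdot 2 \cdot 0.5 = 11$)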
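The two parameters of the transformed distribution can be recomputed with a few lines of matrix arithmetic. The sketch below is an illustration using plain Python lists; the helper names `mat_mul` and `transpose` are hypothetical. It evaluates $A\mu + b$ and $A \Sigma A'$ for the matrices of this example.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


# Parameters of the example (column vectors are stored as n x 1 matrices).
A = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0], [2.0]]
mu = [[0.0], [1.0]]
sigma = [[1.0, 0.5], [0.5, 2.0]]

# E(Y) = A mu + b
mean_y = [[m[0] + c[0]] for m, c in zip(mat_mul(A, mu), b)]
# Cov(Y) = A Sigma A'
cov_y = mat_mul(mat_mul(A, sigma), transpose(A))

if __name__ == "__main__":
    print(mean_y)  # [[3.0], [6.0]]
    print(cov_y)   # [[11.0, 24.0], [24.0, 53.0]]
```

The diagonal entries can be cross-checked by hand: $Var(X_1 + 2X_2) = 1 + 8 + 2 = 11$ and $Var(3X_1 + 4X_2) = 9 + 32 + 12 = 53$.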
150
Now:
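Consider the bivariate case (n = 2), i.e.
$$X = (X, Y)', \quad E(X) = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}$$
We have
$$\sigma_{XY} = \sigma_{YX} = Cov(X, Y) = \sigma_X \cdot \sigma_Y \cdot Corr(X, Y) = \sigma_X \cdot \sigma_Y \cdot \rho$$
The joint pdf follows from Definition 3.15 with n = 2
$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\}$$
(Derivation: Class)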
151
[Figure: surface plot of $f_{X,Y}(x, y)$ for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$ and $\rho = 0$]
152
[Figure: surface plot of $f_{X,Y}(x, y)$ for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$ and $\rho = 0.9$]
153
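Remarks:
The marginal distributions are given by $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$
→ interesting result for the normal distribution:
If $(X, Y)'$ has a bivariate normal distribution, then X and Y are independent if and only if $\rho = Corr(X, Y) = 0$
The conditional distributions are given by
$$X | Y = y \sim N\left( \mu_X + \rho \frac{\sigma_X}{\sigma_Y} (y - \mu_Y),\ \sigma_X^2 \left( 1 - \rho^2 \right) \right)$$
$$Y | X = x \sim N\left( \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X),\ \sigma_Y^2 \left( 1 - \rho^2 \right) \right)$$
(Proof: Class)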
154
4. Distributions of Functions of Random Variables

Setup:
Consider as given the joint distribution of $X_1, \dots, X_n$
(i.e. consider as given $f_{X_1,\dots,X_n}$ and $F_{X_1,\dots,X_n}$)
Consider k functions
$g_1 : \mathbb{R}^n \to \mathbb{R}, \dots, g_k : \mathbb{R}^n \to \mathbb{R}$
Find the joint distribution of the k random variables
$Y_1 = g_1(X_1, \dots, X_n), \dots, Y_k = g_k(X_1, \dots, X_n)$
(i.e. find $f_{Y_1,\dots,Y_k}$ and $F_{Y_1,\dots,Y_k}$)
155
Example:
Consider as given $X_1, \dots, X_n$ with $f_{X_1,\dots,X_n}$
Consider the functions
$$g_1(X_1, \dots, X_n) = \sum_{i=1}^n X_i \quad \text{and} \quad g_2(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i$$
Find $f_{Y_1,Y_2}$ with $Y_1 = \sum_{i=1}^n X_i$ and $Y_2 = \frac{1}{n} \sum_{i=1}^n X_i$
Remark:
From the joint distribution $f_{Y_1,\dots,Y_k}$ we can derive the k marginal distributions $f_{Y_1}, \dots, f_{Y_k}$
(cf. Chapter 3, Slides 106, 107)
156
Aim of this chapter:

Techniques for finding the (marginal) distribution(s) of $(Y_1, \dots, Y_k)'$
157
4.1 Expectations of Functions of Random Variables

Simplification:
In a first step, we are not interested in the exact distributions,
but merely in certain expected values of Y1, . . . , Yk
Expectation two ways:
Consider as given the (continuous) random variables $X_1, \dots, X_n$ and the function $g : \mathbb{R}^n \to \mathbb{R}$
Consider the random variable $Y = g(X_1, \dots, X_n)$ and find the expectation $E[g(X_1, \dots, X_n)]$
158
Two ways of calculating E(Y):
$$E(Y) = \int_{-\infty}^{+\infty} y \cdot f_Y(y)\, dy$$
or
$$E(Y) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \dots, x_n) \cdot f_{X_1,\dots,X_n}(x_1, \dots, x_n)\, dx_1 \dots dx_n$$

It can be proved that
Both ways of calculating E(Y) are equivalent
→ choose the most convenient calculation
159
Now:
Calculation rules for expected values, variances, covariances
of sums of random variables
Setting:
$X_1, \dots, X_n$ are given continuous or discrete random variables with joint density $f_{X_1,\dots,X_n}$
The (transforming) function $g : \mathbb{R}^n \to \mathbb{R}$ is given by
$$g(x_1, \dots, x_n) = \sum_{i=1}^n x_i$$
160
In a first step, find the expectation and the variance of

$$Y = g(X_1, \dots, X_n) = \sum_{i=1}^n X_i$$
Theorem 4.1: (Expectation and variance of a sum)

For the given random variables $X_1, \dots, X_n$ we have
$$E\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n E(X_i)$$
and
$$Var\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i) + 2 \cdot \sum_{i=1}^n \sum_{j=i+1}^n Cov(X_i, X_j).$$
161
Implications:
For given constants $a_1, \dots, a_n \in \mathbb{R}$ we have
$$E\left( \sum_{i=1}^n a_i \cdot X_i \right) = \sum_{i=1}^n a_i \cdot E(X_i)$$
(why?)
For two random variables $X_1$ and $X_2$ we have
$$E(X_1 \pm X_2) = E(X_1) \pm E(X_2)$$
If $X_1, \dots, X_n$ are stochastically independent, it follows that $Cov(X_i, X_j) = 0$ for all $i \neq j$ and hence
$$Var\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i)$$
162
Now:
Calculating the covariance of two sums of random variables
Theorem 4.2: (Covariance of two sums)
Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be two sets of random variables and let $a_1, \dots, a_n$ and $b_1, \dots, b_m$ be two sets of constants. Then
$$Cov\left( \sum_{i=1}^n a_i \cdot X_i,\ \sum_{j=1}^m b_j \cdot Y_j \right) = \sum_{i=1}^n \sum_{j=1}^m a_i \cdot b_j \cdot Cov(X_i, Y_j).$$
163
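Implications:
The variance of a weighted sum of random variables is given by
$$Var\left( \sum_{i=1}^n a_i \cdot X_i \right) = Cov\left( \sum_{i=1}^n a_i \cdot X_i,\ \sum_{j=1}^n a_j \cdot X_j \right)$$
$$= \sum_{i=1}^n \sum_{j=1}^n a_i \cdot a_j \cdot Cov(X_i, X_j)$$
$$= \sum_{i=1}^n a_i^2 \cdot Var(X_i) + \sum_{i=1}^n \sum_{j=1, j \neq i}^n a_i \cdot a_j \cdot Cov(X_i, X_j)$$
$$= \sum_{i=1}^n a_i^2 \cdot Var(X_i) + 2 \cdot \sum_{i=1}^n \sum_{j=i+1}^n a_i \cdot a_j \cdot Cov(X_i, X_j)$$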
164
For two random variables $X_1$ and $X_2$ we have

$$Var(X_1 \pm X_2) = Var(X_1) + Var(X_2) \pm 2 \cdot Cov(X_1, X_2),$$
and if $X_1$ and $X_2$ are independent we have
$$Var(X_1 \pm X_2) = Var(X_1) + Var(X_2)$$
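The rule for the variance of a difference can be checked on a small discrete example. The sketch below is an illustration with made-up data: the four pairs are treated as the realizations of a discrete random vector $(X_1, X_2)'$ with probability 1/4 each, so all moments use the divisor n. The helper names are hypothetical.

```python
from statistics import fmean


def variance(values):
    """Variance of a discrete distribution with equal weights on `values`."""
    mean = fmean(values)
    return fmean([(v - mean) ** 2 for v in values])


def covariance(xs, ys):
    """Covariance of a discrete distribution with equal weights on the pairs."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])


# Made-up realizations, each pair (x1, x2) having probability 1/4.
x1 = [1.0, 2.0, 4.0, 7.0]
x2 = [2.0, 1.0, 5.0, 3.0]

difference = [a - b for a, b in zip(x1, x2)]
lhs = variance(difference)                                   # Var(X1 - X2)
rhs = variance(x1) + variance(x2) - 2 * covariance(x1, x2)   # right-hand side

if __name__ == "__main__":
    print(lhs, rhs)  # both equal 4.1875
```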
Finally:
Important result concerning the expectation of a product of
two random variables
165
Setting:
Let $X_1, X_2$ be both continuous or both discrete random variables with joint density $f_{X_1,X_2}$
Let $g : \mathbb{R}^2 \to \mathbb{R}$ be defined as $g(x_1, x_2) = x_1 \cdot x_2$
Find the expectation of
$$Y = g(X_1, X_2) = X_1 \cdot X_2$$
Theorem 4.3: (Expectation of a product)
For the random variables $X_1, X_2$ we have
$$E(X_1 \cdot X_2) = E(X_1) \cdot E(X_2) + Cov(X_1, X_2).$$

166
Implication:
If $X_1$ and $X_2$ are stochastically independent, we have
$$E(X_1 \cdot X_2) = E(X_1) \cdot E(X_2)$$

Remarks:
A formula for $Var(X_1 \cdot X_2)$ also exists
In many cases, there are no explicit formulas for expected
values and variances of other transformations (e.g. for ratios
of random variables)
167
4.2 The Cumulative-distribution-function Technique

Motivation:
Consider as given the random variables $X_1, \dots, X_n$ with joint density $f_{X_1,\dots,X_n}$
Find the joint distribution of $Y_1, \dots, Y_k$ where $Y_j = g_j(X_1, \dots, X_n)$ for j = 1, ..., k
The joint cdf of $Y_1, \dots, Y_k$ is defined to be
$$F_{Y_1,\dots,Y_k}(y_1, \dots, y_k) = P(Y_1 \le y_1, \dots, Y_k \le y_k)$$
168
Now, for each $y_1, \dots, y_k$ the event

$$\{Y_1 \le y_1, \dots, Y_k \le y_k\} = \{g_1(X_1, \dots, X_n) \le y_1, \dots, g_k(X_1, \dots, X_n) \le y_k\},$$
i.e. the latter event is an event described in terms of the given functions $g_1, \dots, g_k$ and the given random variables $X_1, \dots, X_n$
→ since the joint distribution of $X_1, \dots, X_n$ is assumed given, presumably the probability of the latter event can be calculated and consequently $F_{Y_1,\dots,Y_k}$ determined
169
Example 1:
Consider n = 1 (i.e. consider $X_1 \equiv X$ with cdf $F_X$) and k = 1 (i.e. $g_1 \equiv g$ and $Y_1 \equiv Y$)
Consider the function
$$g(x) = a \cdot x + b, \quad b \in \mathbb{R},\ a > 0$$
Find the distribution of
$$Y = g(X) = a \cdot X + b$$
170
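The cdf of Y is given by

$$F_Y(y) = P(Y \le y) = P[g(X) \le y] = P(a \cdot X + b \le y) = P\left( X \le \frac{y - b}{a} \right) = F_X\left( \frac{y - b}{a} \right)$$
If X is continuous, the pdf of Y is given by
$$f_Y(y) = F_Y'(y) = F_X'\left( \frac{y - b}{a} \right) \cdot \frac{1}{a} = \frac{1}{a} \cdot f_X\left( \frac{y - b}{a} \right)$$
(cf. Slide 48)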
171
Example 2:
Consider n = 1 and k = 1 and the function
$$g(x) = e^x$$
The cdf of $Y = g(X) = e^X$ is given, for y > 0, by
$$F_Y(y) = P(Y \le y) = P(e^X \le y) = P[X \le \ln(y)] = F_X[\ln(y)]$$
If X is continuous, the pdf of Y is given by
$$f_Y(y) = F_Y'(y) = F_X'[\ln(y)] \cdot \frac{1}{y} = \frac{f_X[\ln(y)]}{y}$$
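The result of Example 2 can be checked numerically for one concrete choice of X. The sketch below assumes that X is standard normal (an assumption made for illustration only; the example itself leaves the distribution of X open) and compares a finite-difference derivative of $F_Y(y) = F_X[\ln(y)]$ with the pdf $f_X[\ln(y)]/y$. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# Assumption for illustration: X follows a standard normal distribution.
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def cdf_y(y):
    """F_Y(y) = F_X(ln y) for y > 0, and 0 otherwise."""
    return STD_NORMAL.cdf(math.log(y)) if y > 0.0 else 0.0


def pdf_y(y):
    """f_Y(y) = f_X(ln y) / y for y > 0, and 0 otherwise."""
    return STD_NORMAL.pdf(math.log(y)) / y if y > 0.0 else 0.0


def numeric_derivative(func, point, h=1e-5):
    """Central finite difference approximation of func'(point)."""
    return (func(point + h) - func(point - h)) / (2.0 * h)


if __name__ == "__main__":
    for y in (0.5, 1.5, 3.0):
        print(y, numeric_derivative(cdf_y, y), pdf_y(y))
```

Under this assumption Y is lognormally distributed, which is why the pdf carries the extra factor 1/y.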
172
Now:
Consider n = 2 and k = 2, i.e. consider $X_1$ and $X_2$ with joint density $f_{X_1,X_2}(x_1, x_2)$
Consider the functions
$$g_1(x_1, x_2) = x_1 + x_2 \quad \text{and} \quad g_2(x_1, x_2) = x_1 - x_2$$
Find the distributions of the sum and the difference of two random variables
→ Derivation via the two-dimensional cdf-technique
173
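Theorem 4.4: (Distribution of a sum / difference)

Let $X_1$ and $X_2$ be two continuous random variables with joint pdf $f_{X_1,X_2}(x_1, x_2)$. Then the pdfs of $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$ are given by
$$f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, y_1 - x_1)\, dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y_1 - x_2, x_2)\, dx_2$$
and
$$f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, x_1 - y_2)\, dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y_2 + x_2, x_2)\, dx_2.$$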
174
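Implication:
If $X_1$ and $X_2$ are independent, then
$$f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} f_{X_1}(x_1) \cdot f_{X_2}(y_1 - x_1)\, dx_1$$
$$f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} f_{X_1}(x_1) \cdot f_{X_2}(x_1 - y_2)\, dx_1$$
Example:
Let $X_1$ and $X_2$ be independent random variables both with pdf
$$f_{X_1}(x) = f_{X_2}(x) = \begin{cases} 1 & \text{for } x \in [0, 1] \\ 0 & \text{elsewise} \end{cases}$$
Find the pdf of $Y = X_1 + X_2$
(Class)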
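A numerical evaluation of the convolution integral gives a way to check the result of this exercise. The sketch below is an illustration only: it approximates $f_Y(y) = \int f_{X_1}(x_1) f_{X_2}(y - x_1)\, dx_1$ by a midpoint rule and compares it with the triangular density ($f_Y(y) = y$ on $[0, 1]$ and $f_Y(y) = 2 - y$ on $(1, 2]$), which is the well-known pdf of the sum of two independent uniform variables on $[0, 1]$ and is not stated on the slide. Function names and the step count are assumptions.

```python
def uniform_pdf(x):
    """pdf of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def sum_pdf_numeric(y, steps=20000):
    """Midpoint-rule approximation of the convolution integral over x1 in [0, 1]."""
    h = 1.0 / steps
    # f_X1(x1) = 1 on [0, 1], so only f_X2(y - x1) remains in the integrand.
    return h * sum(uniform_pdf(y - (k + 0.5) * h) for k in range(steps))


def sum_pdf_triangular(y):
    """Triangular density of X1 + X2 for two independent uniforms on [0, 1]."""
    if 0.0 <= y <= 1.0:
        return y
    if 1.0 < y <= 2.0:
        return 2.0 - y
    return 0.0


if __name__ == "__main__":
    for y in (0.4, 1.0, 1.7, 2.5):
        print(y, sum_pdf_numeric(y), sum_pdf_triangular(y))
```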
175
Now:
Analogous results for the product and the ratio of two random variables
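Theorem 4.5: (Distribution of a product / ratio)

Let $X_1$ and $X_2$ be continuous random variables with joint pdf $f_{X_1,X_2}(x_1, x_2)$. Then the pdfs of $Y_1 = X_1 \cdot X_2$ and $Y_2 = X_1 / X_2$ are given by
$$f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} \frac{1}{|x_1|}\, f_{X_1,X_2}\left( x_1, \frac{y_1}{x_1} \right) dx_1$$
and
$$f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} |x_2| \cdot f_{X_1,X_2}(y_2 \cdot x_2, x_2)\, dx_2.$$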

176
4.3 The Moment-generating-function Technique
Motivation:
Consider as given the random variables X1, . . . , Xn with joint
pdf fX1,...,Xn
Again, find the joint distribution of Y1, . . . , Yk where Yj =
gj (X1, . . . , Xn) for j = 1, . . . , k
177
According to Definition 3.14, Slide 143, the joint moment generating function of the $Y_1, \dots, Y_k$ is defined to be
$$m_{Y_1,\dots,Y_k}(t_1, \dots, t_k) = E\left[ e^{t_1 Y_1 + \dots + t_k Y_k} \right]$$
$$= \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} e^{t_1 g_1(x_1,\dots,x_n) + \dots + t_k g_k(x_1,\dots,x_n)} \cdot f_{X_1,\dots,X_n}(x_1, \dots, x_n)\, dx_1 \dots dx_n$$

If $m_{Y_1,\dots,Y_k}(t_1, \dots, t_k)$ can be recognized as the joint moment generating function of some known joint distribution, it will follow that $Y_1, \dots, Y_k$ has that joint distribution by virtue of the identification property
(cf. Slide 145)
178
Example:
Consider n = 1 and k = 1 where the random variable $X_1 \equiv X$ has a standard normal distribution
Consider the function $g_1(x) \equiv g(x) = x^2$
Find the distribution of $Y = g(X) = X^2$
The moment generating function of Y is given by
$$m_Y(t) = E\left[ e^{tY} \right] = E\left[ e^{tX^2} \right] = \int_{-\infty}^{+\infty} e^{t x^2} \cdot f_X(x)\, dx$$
179
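$$= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{t x^2 - \frac{1}{2} x^2}\, dx = \dots = \left( \frac{\frac{1}{2}}{\frac{1}{2} - t} \right)^{1/2} \quad \text{for } t < \frac{1}{2}$$
This is the moment generating function of a gamma distribution with parameters $\lambda = \frac{1}{2}$ and $r = \frac{1}{2}$
(see Mood, Graybill, Boes (1974), pp. 540/541)
→ $Y = X^2 \sim \Gamma(0.5, 0.5)$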
180
Now:
Distribution of sums of independent random variables
Preliminaries:
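Consider the moment generating function of such a sum
Let $X_1, \dots, X_n$ be independent random variables and let $Y = \sum_{i=1}^n X_i$
The moment generating function of Y is given by

$$m_Y(t) = E\left[ e^{tY} \right] = E\left[ e^{t \sum_{i=1}^n X_i} \right] = E\left[ e^{tX_1} \cdot e^{tX_2} \cdot \ldots \cdot e^{tX_n} \right]$$
$$= E\left[ e^{tX_1} \right] \cdot E\left[ e^{tX_2} \right] \cdot \ldots \cdot E\left[ e^{tX_n} \right] \quad \text{[Theorem 3.13, part 3]}$$
$$= m_{X_1}(t) \cdot m_{X_2}(t) \cdot \ldots \cdot m_{X_n}(t)$$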
181
Theorem 4.6: (Moment generating function of a sum)

Let $X_1, \dots, X_n$ be stochastically independent random variables with existing moment generating functions $m_{X_1}(t), \dots, m_{X_n}(t)$ for all $t \in (-h, h)$, h > 0. Then the moment generating function of the sum $Y = \sum_{i=1}^n X_i$ is given by
$$m_Y(t) = \prod_{i=1}^n m_{X_i}(t)$$
for $t \in (-h, h)$.
Hopefully:
The distribution of the sum $Y = \sum_{i=1}^n X_i$ may be identified from the moment generating function of the sum $m_Y(t)$
182
Example 1:
Assume that $X_1, \dots, X_n$ are independent and identically distributed exponential random variables with parameter $\lambda > 0$
The moment generating function of each $X_i$ (i = 1, ..., n) is given by
$$m_{X_i}(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda$$
(cf. Mood, Graybill, Boes (1974), pp. 540/541)
So the moment generating function of the sum $Y = \sum_{i=1}^n X_i$ is given by
$$m_Y(t) = m_{\sum X_i}(t) = \prod_{i=1}^n m_{X_i}(t) = \left( \frac{\lambda}{\lambda - t} \right)^n$$
183
This is the moment generating function of a $\Gamma(n, \lambda)$ distribution

(cf. Mood, Graybill, Boes (1974), pp. 540/541)
→ the sum of n independent, identically distributed exponential random variables with parameter $\lambda$ has a $\Gamma(n, \lambda)$ distribution
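The identification step can be cross-checked numerically. The sketch below is an illustration under stated assumptions: the gamma density is written in the rate parameterization $f(y) = \lambda^n y^{n-1} e^{-\lambda y} / \Gamma(n)$ for y > 0, the mgf integral is truncated at a finite upper limit and evaluated by a midpoint rule, and all function names are hypothetical. It compares the numerically integrated mgf of the gamma distribution with the product of n exponential mgfs.

```python
import math


def gamma_pdf(y, n, lam):
    """Gamma density with shape n and rate lam (assumed parameterization)."""
    if y <= 0.0:
        return 0.0
    return lam**n * y ** (n - 1) * math.exp(-lam * y) / math.gamma(n)


def mgf_gamma_numeric(t, n, lam, upper=60.0, steps=60000):
    """Midpoint-rule approximation of E[exp(t Y)] for Y gamma distributed, t < lam."""
    h = upper / steps
    return h * sum(
        math.exp(t * y) * gamma_pdf(y, n, lam)
        for y in ((k + 0.5) * h for k in range(steps))
    )


def mgf_exponential(t, lam):
    """mgf of the exponential distribution with parameter lam, valid for t < lam."""
    return lam / (lam - t)


if __name__ == "__main__":
    n, lam, t = 3, 2.0, 0.5
    # Both numbers should agree up to the numerical integration error.
    print(mgf_gamma_numeric(t, n, lam), mgf_exponential(t, lam) ** n)
```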
184
Example 2:
Assume that $X_1, \dots, X_n$ are independent random variables and that $X_i \sim N(\mu_i, \sigma_i^2)$
Furthermore, let $a_1, \dots, a_n \in \mathbb{R}$ be constants
Then the distribution of the weighted sum is given by
$$Y = \sum_{i=1}^n a_i \cdot X_i \sim N\left( \sum_{i=1}^n a_i \cdot \mu_i,\ \sum_{i=1}^n a_i^2 \cdot \sigma_i^2 \right)$$
(Proof: Class)
185
4.4 General Transformations

Up to now:
Techniques that allow us, under special circumstances, to
find the distributions of the transformed variables
Y1 = g1(X1, . . . , Xn), . . . , Yk = gk (X1, . . . , Xn)
However:
These methods do not necessarily hit the mark
(e.g. if calculations get too complicated)
186
Remedy:
There are constructive methods by which it is generally possible (under rather mild conditions) to find the distributions of transformed random variables
→ transformation theorems
Here:
We restrict attention to the simplest case where n = 1, k = 1, i.e. we consider the transformation Y = g(X)
For multivariate extensions (i.e. for $n \ge 1$, $k \ge 1$) see Mood, Graybill, Boes (1974), pp. 203-212
187
Theorem 4.7: (Transformation theorem for densities)

Suppose X is a continuous random variable with pdf $f_X(x)$. Set $D = \{x : f_X(x) > 0\}$. Furthermore, assume that
(a) the transformation $g : D \to W$ with y = g(x) is a one-to-one transformation of D onto W,
(b) the derivative with respect to y of the inverse function $g^{-1} : W \to D$ with $x = g^{-1}(y)$ is continuous and nonzero for all $y \in W$.
Then Y = g(X) is a continuous random variable with pdf
$$f_Y(y) = \begin{cases} \left| \dfrac{d g^{-1}(y)}{dy} \right| \cdot f_X\left( g^{-1}(y) \right) & \text{for } y \in W \\ 0 & \text{elsewise} \end{cases}$$
188
Remark:
The transformation $g : D \to W$ with y = g(x) is called one-to-one, if for every $y \in W$ there exists exactly one $x \in D$ with y = g(x)
Example:
Suppose X has the pdf
$$f_X(x) = \begin{cases} \theta \cdot x^{-\theta - 1} & \text{for } x \in [1, +\infty) \\ 0 & \text{elsewise} \end{cases}$$
(Pareto distribution with parameter $\theta > 0$)

Find the distribution of Y = ln(X)
We have $D = [1, +\infty)$, g(x) = ln(x), $W = [0, +\infty)$
189
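Furthermore, g(x) = ln(x) is a one-to-one transformation of $D = [1, +\infty)$ onto $W = [0, +\infty)$ with inverse function
$$x = g^{-1}(y) = e^y$$
Its derivative with respect to y is given by
$$\frac{d g^{-1}(y)}{dy} = e^y,$$
i.e. the derivative is continuous and nonzero for all $y \in [0, +\infty)$
Hence, the pdf of Y = ln(X) is given by
$$f_Y(y) = \begin{cases} e^y \cdot \theta \cdot (e^y)^{-\theta - 1} & \text{for } y \in [0, +\infty) \\ 0 & \text{elsewise} \end{cases} = \begin{cases} \theta \cdot e^{-\theta y} & \text{for } y \in [0, +\infty) \\ 0 & \text{elsewise} \end{cases}$$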
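The outcome of the transformation theorem can be cross-checked with the cdf technique of Section 4.2. The sketch below is an illustration only: it uses the Pareto cdf $F_X(x) = 1 - x^{-\theta}$ for $x \ge 1$ (obtained by integrating the pdf $\theta x^{-\theta - 1}$), forms $F_Y(y) = F_X(e^y)$, and compares a finite-difference derivative with the density $\theta e^{-\theta y}$. The function names and the value of $\theta$ are arbitrary choices.

```python
import math


def pareto_cdf(x, theta):
    """F_X(x) = 1 - x^(-theta) for x >= 1, and 0 otherwise."""
    return 1.0 - x ** (-theta) if x >= 1.0 else 0.0


def log_pareto_cdf(y, theta):
    """cdf technique: F_Y(y) = P(ln X <= y) = F_X(e^y)."""
    return pareto_cdf(math.exp(y), theta)


def log_pareto_pdf(y, theta):
    """Transformation theorem: f_Y(y) = e^y * theta * (e^y)^(-theta - 1)."""
    return theta * math.exp(-theta * y) if y >= 0.0 else 0.0


def numeric_derivative(func, point, h=1e-5):
    """Central finite difference approximation of func'(point)."""
    return (func(point + h) - func(point - h)) / (2.0 * h)


if __name__ == "__main__":
    theta = 2.5  # arbitrary illustrative parameter value
    for y in (0.2, 0.7, 1.5):
        print(y, numeric_derivative(lambda v: log_pareto_cdf(v, theta), y), log_pareto_pdf(y, theta))
```

The resulting density is that of an exponential distribution with parameter $\theta$.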
190
5. Methods of Estimation
Setting:
Let X be a random variable (or let X be a random vector)
representing a random experiment
We are interested in the actual distribution of X (or X)
Notice:
In practice the actual distribution of X is a priori unknown
191
Therefore:
Collect information on the unknown distribution by repeatedly observing the random experiment (and thus the random variable X)
→ random sample
→ statistic
→ estimator
192
5.1 Sampling, Estimators, Limit Theorems

Setting:
Let X represent the random experiment under consideration
(X is a univariate random variable)
We intend to observe the random experiment (i.e. X) n times
Prior to the explicit realizations we may consider the potential
observations as a set of n random variables X1, . . . , Xn
193
Definition 5.1: (Random sample)

The random variables X1, . . . , Xn are defined to be a random
sample from X if
(a) each Xi, i = 1, . . . , n, has the same distribution as X,
(b) X1, . . . , Xn are stochastically independent.
The number n is called the sample size.
194
Remarks:
We assume that, in principle, the random experiment can be
repeated as often as desired
We call the realizations x1, . . . , xn of the random sample
X1, . . . , Xn the observed or the concrete sample
Considering the random sample $X_1, \dots, X_n$ as a random vector, we see that its joint density is given by
$$f_{X_1,\dots,X_n}(x_1, \dots, x_n) = \prod_{i=1}^n f_{X_i}(x_i)$$
(since the $X_i$'s are independent; cf. Definition 3.8, Slide 125)
195
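Model of a random sample
[Figure: the random experiment X gives rise to the random variables $X_1, X_2, \dots, X_n$; their possible realizations are $x_1$ (realization of the 1st experiment), $x_2$ (realization of the 2nd experiment), ..., $x_n$ (realization of the n-th experiment)]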
196
Now:
Consider functions of the sampling variables $X_1, \dots, X_n$
→ statistic
→ estimator
Definition 5.2: (Statistic)
Let $X_1, \dots, X_n$ be a random sample from X and let $g : \mathbb{R}^n \to \mathbb{R}$ be a real-valued function with n arguments that does not contain any unknown parameters. Then the random variable
$$T = g(X_1, \dots, X_n)$$
is called a statistic.
197
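Examples:
Sample mean:
$$\bar{X} = g_1(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i$$
Sample variance:
$$S^2 = g_2(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n \left( X_i - \bar{X} \right)^2$$
Sample standard deviation:
$$S = g_3(X_1, \dots, X_n) = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left( X_i - \bar{X} \right)^2 }$$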
198
Remarks:
All these concepts can be extended to the multivariate case
The statistic T = g(X1, . . . , Xn) is a function of random variables and hence it is itself a random variable
a statistic has a distribution
(and, in particular, an expectation and a variance)
Purposes of statistics:
Statistics provide information on the distribution of X
Statistics are central tools for
estimating parameters
hypothesis-testing on parameters
199
Random samples and statistics
[Figure: the sample $(X_1, \dots, X_n)$ is mapped to the statistic $g(X_1, \dots, X_n)$; after measurement, the sample realization $(x_1, \dots, x_n)$ is mapped to the realization of the statistic $g(x_1, \dots, x_n)$]
200
Now:
Let X be a random variable with unknown cdf $F_X(x)$
We may be interested in one or several unknown parameters of X
Let $\theta$ denote this unknown vector of parameters, e.g.
$$\theta = \begin{pmatrix} E(X) \\ Var(X) \end{pmatrix}$$
Frequently, the distribution family of X is known, e.g. $X \sim N(\mu, \sigma^2)$, but we do not know the specific parameters. Then
$$\theta = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix}$$
We will estimate the unknown parameter vector on the basis of statistics from a random sample $X_1, \dots, X_n$
201
Definition 5.3: (Estimator, estimate)

The statistic $\hat{\theta}(X_1, \dots, X_n)$ is called estimator (or point estimator) of the unknown parameter vector $\theta$. After having observed the concrete sample $x_1, \dots, x_n$, we call the realization of the estimator $\hat{\theta}(x_1, \dots, x_n)$ an estimate.
Remarks:
The estimator $\hat{\theta}(X_1, \dots, X_n)$ is a random variable or a random vector
→ an estimator has a (joint) distribution, an expected value (or vector) and a variance (or a covariance matrix)
The estimate $\hat{\theta}(x_1, \dots, x_n)$ is a number (or a vector of numbers)
202
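Example:
Let $X \sim N(\mu, \sigma^2)$ with unknown parameters $\mu$ and $\sigma^2$
The vector of parameters to be estimated is given by
$$\theta = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} E(X) \\ Var(X) \end{pmatrix}$$
Potential estimators of $\mu$ and $\sigma^2$ are
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \hat{\mu})^2$$
→ an estimator of $\theta$ is given by
$$\hat{\theta} = \begin{pmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{n} \sum_{i=1}^n X_i \\ \frac{1}{n - 1} \sum_{i=1}^n (X_i - \hat{\mu})^2 \end{pmatrix}$$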
203
Question:
Why do we need this seemingly complicated concept of an
estimator in the form of a random variable?
Answer:
To establish a comparison between alternative estimators of
the parameter vector
Example:
Let $\theta = Var(X)$ denote the unknown variance of X
204
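Two alternative estimators of $\theta$ are
$$\hat{\theta}_1(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n \left( X_i - \bar{X} \right)^2$$
$$\hat{\theta}_2(X_1, \dots, X_n) = \frac{1}{n - 1} \sum_{i=1}^n \left( X_i - \bar{X} \right)^2$$
Question:
Which estimator is better and for what reasons?
→ properties (goodness criteria) of point estimators
(see Section 5.2)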
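Because an estimator is a random variable, its expected value can be computed and compared with the parameter it targets. The sketch below is a toy illustration, not part of the slides: X is assumed to be uniformly distributed on the three values 0, 1, 2 (so that Var(X) = 2/3), the sample size is n = 3, and the expectation of each estimator is obtained exactly by enumerating all 27 equally likely samples. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def estimator_1(sample):
    """Sum of squared deviations from the sample mean, divided by n."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def estimator_2(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


def exact_expectation(estimator, support, n):
    """Exact expected value when the X_i are independent and uniform on `support`."""
    samples = list(product(support, repeat=n))
    return sum(estimator(s) for s in samples) / len(samples)


# Assumed toy distribution: P(X = 0) = P(X = 1) = P(X = 2) = 1/3, Var(X) = 2/3.
SUPPORT = [Fraction(0), Fraction(1), Fraction(2)]

if __name__ == "__main__":
    print(exact_expectation(estimator_1, SUPPORT, 3))  # 4/9
    print(exact_expectation(estimator_2, SUPPORT, 3))  # 2/3
```

In this toy setting the second estimator has expectation equal to Var(X), while the first one falls short by the factor (n - 1)/n; criteria of this kind are the subject of Section 5.2.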
205
Notice:
Some of these criteria qualify estimators in terms of their properties when the sample size becomes large
($n \to \infty$, large-sample-properties)
Therefore:
Explanation of the concept of stochastic convergence:
Central-limit theorem
Weak law of large numbers
Convergence in probability
Convergence in distribution
206
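Theorem 5.4: (Univariate central-limit theorem)

Let X be any arbitrary random variable with $E(X) = \mu$ and $Var(X) = \sigma^2$. Let $X_1, \dots, X_n$ be a random sample from X and let
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
denote the arithmetic sample mean. Then, for $n \to \infty$, we have
$$\bar{X}_n \overset{\text{appr.}}{\sim} N\left( \mu, \frac{\sigma^2}{n} \right) \quad \text{and} \quad \sqrt{n}\, \frac{\bar{X}_n - \mu}{\sigma} \overset{\text{appr.}}{\sim} N(0, 1).$$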
Next:
Generalization to the multivariate case
207
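Theorem 5.5: (Multivariate central-limit theorem)

Let $X = (X_1, \dots, X_m)'$ be any arbitrary random vector with $E(X) = \mu$ and $Cov(X) = \Sigma$. Let $\mathbf{X}_1, \dots, \mathbf{X}_n$ be a (multivariate) random sample from X and let
$$\bar{\mathbf{X}}_n = \frac{1}{n} \sum_{i=1}^n \mathbf{X}_i$$
denote the multivariate arithmetic sample mean. Then, for $n \to \infty$, we have
$$\bar{\mathbf{X}}_n \overset{\text{appr.}}{\sim} N\left( \mu, \frac{1}{n} \Sigma \right) \quad \text{and} \quad \sqrt{n} \left( \bar{\mathbf{X}}_n - \mu \right) \overset{\text{appr.}}{\sim} N(0, \Sigma).$$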
208
Remarks:
A multivariate random sample from the random vector X
arises naturally by replacing all univariate random variables
in Definition 5.1 (Slide 194) by corresponding multivariate
random vectors
Note the formal analogy to the univariate case in Theorem 5.4
(be aware of matrix-calculus rules!)
Next:
Famous theorem on the arithmetic sample mean
209
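Theorem 5.6: (Weak law of large numbers)

Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed random variables with
$$E(X_i) = \mu < \infty, \quad Var(X_i) = \sigma^2 < \infty.$$
Consider the random variable
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
(arithmetic sample mean). Then, for any $\epsilon > 0$ we have
$$\lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| \ge \epsilon \right) = 0.$$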
210
Remarks:
Theorem 5.6 is known as the weak law of large numbers
Irrespective of how small we choose $\varepsilon > 0$, the probability
that $\bar X_n$ deviates by more than $\varepsilon$ from its expectation $\mu$ tends
to zero when the sample size increases
Notice the analogy between a sequence of independent and
identically distributed random variables and the definition of
a random sample from X on Slide 194
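A small simulation can make the statement of Theorem 5.6 tangible. The following is only a sketch under assumptions that are not part of the slides: the sample comes from a normal distribution with $\mu = 0$ and $\sigma = 1$, the tolerance is $\varepsilon = 0.2$, and the helper name `exceed_frequency` is hypothetical. It estimates $P(|\bar X_n - \mu| > \varepsilon)$ by the share of simulated samples in which the sample mean misses $\mu$ by more than $\varepsilon$.

```python
import random


def exceed_frequency(n, eps, mu=0.0, sigma=1.0, reps=1000, seed=1):
    """Share of simulated samples whose mean deviates from mu by more than eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(mean - mu) > eps:
            hits += 1
    return hits / reps


# The relative frequency should shrink towards zero as n grows.
freqs = {n: exceed_frequency(n, eps=0.2) for n in (10, 100, 1000)}
for n, freq in freqs.items():
    print(n, freq)
```

The printed frequencies depend on the seed, but they should fall as n increases, which is what the weak law of large numbers asserts.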
Next:
The first important concept of limiting behaviour
211
Definition 5.7: (Convergence in probability)

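Let $Y_1, Y_2, \dots$ be a sequence of random variables. We say that
the sequence $Y_1, Y_2, \dots$ converges in probability to $\theta$, if for any
$\varepsilon > 0$ we have
$$\lim_{n \to \infty} P\left(|Y_n - \theta| > \varepsilon\right) = 0.$$
We denote convergence in probability by
$$\mathrm{plim}\, Y_n = \theta \quad \text{or} \quad Y_n \xrightarrow{p} \theta.$$
Remarks:
Specific case: Weak law of large numbers
$$\mathrm{plim}\, \bar X_n = \mu \quad \text{or} \quad \bar X_n \xrightarrow{p} \mu$$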
212
Typically (but not necessarily) a sequence of random variables converges in probability to a constant $\theta \in \mathbb{R}$
For multivariate sequences of random vectors $Y_1, Y_2, \dots$
Definition 5.7 has to be applied element by element
The concept of convergence in probability is important for
assessing estimators
Next:
Alternative concepts of stochastic convergence
213
Definition 5.8: (Convergence in distribution)

Let $Y_1, Y_2, \dots$ be a sequence of random variables and let Z also be
a random variable. We say that the sequence $Y_1, Y_2, \dots$ converges
in distribution to the distribution of Z if
$$\lim_{n \to \infty} F_{Y_n}(y) = F_Z(y) \quad \text{for any } y \in \mathbb{R} \text{ at which } F_Z \text{ is continuous.}$$
We denote convergence in distribution by
$$Y_n \xrightarrow{d} Z.$$
Remarks:
Specific case: central-limit theorem
$$Y_n = \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{d} U \sim N(0, 1)$$
In the case of convergence in distribution, the sequence of
rvs always converges to a limiting random variable
214
Theorem 5.9: (Rules for probability limits)

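Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be sequences of random variables
with $\mathrm{plim}\, X_n = a$ and $\mathrm{plim}\, Y_n = b$. Then
(a) $\mathrm{plim}\,(X_n \pm Y_n) = a \pm b$,
(b) $\mathrm{plim}\,(X_n \cdot Y_n) = a \cdot b$,
(c) $\mathrm{plim}\, \frac{X_n}{Y_n} = \frac{a}{b}$ (for $b \neq 0$).
(d) (Slutsky theorem) If $g: \mathbb{R} \to \mathbb{R}$ is a function that is continuous at
$a \in \mathbb{R}$, then
$$\mathrm{plim}\, g(X_n) = g(a).$$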
215
Remark:
There is a property similar to Slutsky's theorem that holds
for convergence in distribution
Theorem 5.10: (Rule for limiting distributions)

Let $X_1, X_2, \dots$ be a sequence of random variables and let Z be a
random variable such that $X_n \xrightarrow{d} Z$. If $h: \mathbb{R} \to \mathbb{R}$ is a continuous
function, then
$$h(X_n) \xrightarrow{d} h(Z).$$
Next:
Connection of both convergence concepts
216
Theorem 5.11: (Cramér theorem)
Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be sequences of random variables,
let Z be a random variable and $a \in \mathbb{R}$ a constant. Assume that
$\mathrm{plim}\, X_n = a$ and $Y_n \xrightarrow{d} Z$. Then
(a) $X_n + Y_n \xrightarrow{d} a + Z$,
(b) $X_n \cdot Y_n \xrightarrow{d} a \cdot Z$.
Example:
Let $X_1, \dots, X_n$ be a random sample from X with $E(X) = \mu$
and $\mathrm{Var}(X) = \sigma^2$
217
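It can be shown that
$$\mathrm{plim}\, S_n^2 = \mathrm{plim}\, \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)^2 = \sigma^2$$
$$\mathrm{plim}\, S_n^{*2} = \mathrm{plim}\, \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)^2 = \sigma^2$$
For $g_1(x) = x/\sigma^2$ Slutsky's theorem yields
$$\mathrm{plim}\, g_1\left(S_n^2\right) = \mathrm{plim}\, \frac{S_n^2}{\sigma^2} = g_1(\sigma^2) = 1$$
$$\mathrm{plim}\, g_1\left(S_n^{*2}\right) = \mathrm{plim}\, \frac{S_n^{*2}}{\sigma^2} = g_1(\sigma^2) = 1$$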
218
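For $g_2(x) = \sigma/\sqrt{x}$ Slutsky's theorem yields
$$\mathrm{plim}\, g_2\left(S_n^2\right) = \mathrm{plim}\, \frac{\sigma}{S_n} = g_2(\sigma^2) = 1$$
$$\mathrm{plim}\, g_2\left(S_n^{*2}\right) = \mathrm{plim}\, \frac{\sigma}{S_n^{*}} = g_2(\sigma^2) = 1$$
From the central-limit theorem we know that
$$\sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{d} U \sim N(0, 1)$$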
219
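Now, Cramér's theorem yields
$$g_2\left(S_n^2\right) \cdot \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \frac{\sigma}{S_n} \cdot \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \sqrt{n}\,\frac{\bar X_n - \mu}{S_n} \xrightarrow{d} 1 \cdot U = U \sim N(0, 1)$$
Analogously, Cramér's theorem yields
$$\sqrt{n}\,\frac{\bar X_n - \mu}{S_n^{*}} \xrightarrow{d} U \sim N(0, 1)$$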
220
5.2 Properties of Estimators

Content of Definition 5.3 (Slide 202):
An estimator is defined to be a statistic
(a function of the random sample)
→ there are several alternative estimators of the unknown
parameter vector $\theta$
Example:
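Assume that $X \sim N(0, \sigma^2)$ with unknown variance $\sigma^2$ and let
$X_1, \dots, X_n$ be a random sample from X
Alternative estimators of $\theta = \sigma^2$ are
$$\hat\theta_1 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2$$
and
$$\hat\theta_2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2$$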
221
Important questions:
Are there reasonable criteria according to which we can select
a good estimator?
How can we construct good estimators?
First goodness property of point estimators:
Concept of repeated sampling:
Draw several random samples from X
Consider the estimator $\hat\theta$ for each random sample
An average of the estimates should be close to the
unknown parameter $\theta$
(no systematic bias)
→ unbiasedness of an estimator
222
Definition 5.12: (Unbiasedness, bias)

An estimator $\hat\theta(X_1, \dots, X_n)$ of the unknown parameter $\theta$ is defined
to be an unbiased estimator if its expectation coincides with the
parameter to be estimated, i.e. if
$$E\left[\hat\theta(X_1, \dots, X_n)\right] = \theta.$$
The bias of the estimator is defined as
$$\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta.$$
Remarks:
Definition 5.12 easily generalizes to the multivariate case
The bias of an unbiased estimator is equal to zero
223
Now:
Important and very general result
Theorem 5.13: (Unbiased estimators of E(X) and Var(X))
Let $X_1, \dots, X_n$ be a random sample from X where X may be
arbitrarily distributed with unknown expectation $\mu = E(X)$ and
unknown variance $\sigma^2 = \mathrm{Var}(X)$. Then the estimators
$$\hat\mu(X_1, \dots, X_n) = \bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$$
and
$$\hat\sigma^2(X_1, \dots, X_n) = S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2$$
are always unbiased estimators of the parameters $\mu = E(X)$ and
$\sigma^2 = \mathrm{Var}(X)$, respectively.
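The repeated-sampling idea behind unbiasedness can be checked numerically. The sketch below is an illustration only and rests on assumptions not taken from the slides: normally distributed data with $\sigma^2 = 4$, sample size n = 5, and the hypothetical helper names `variance_estimators` and `average_estimates`. It averages the two variance estimators (divisors n and n - 1) over many simulated samples.

```python
import random


def variance_estimators(sample):
    """Return the two estimators with divisor n and divisor n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)


def average_estimates(n=5, sigma2=4.0, reps=40000, seed=7):
    """Average both estimators over many independent samples of size n."""
    rng = random.Random(seed)
    total_n, total_n1 = 0.0, 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
        est_n, est_n1 = variance_estimators(sample)
        total_n += est_n
        total_n1 += est_n1
    return total_n / reps, total_n1 / reps


avg_n, avg_n1 = average_estimates()
print(avg_n, avg_n1)
```

Under these assumptions the average of the divisor-(n - 1) estimator should be close to 4, while the divisor-n estimator should average roughly (n - 1)/n times 4, i.e. about 3.2.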
224
Remarks:
Proof: Class
Note that no explicit distribution of X is required
Unbiasedness does, in general, not carry over to parameter
transformations.
For example,
$$S = \sqrt{S^2}$$
is not an unbiased estimator of
$$\sigma = SD(X) = \sqrt{\mathrm{Var}(X)}$$
Question:
How can we compare two alternative unbiased estimators of
the parameter $\theta$?
225
Definition 5.14: (Relative efficiency)

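Let $\hat\theta_1$ and $\hat\theta_2$ be two unbiased estimators of the unknown parameter $\theta$.
$\hat\theta_1$ is defined to be relatively more efficient than $\hat\theta_2$ if
$$\mathrm{Var}(\hat\theta_1) \leq \mathrm{Var}(\hat\theta_2)$$
for all possible parameter values of $\theta$ and
$$\mathrm{Var}(\hat\theta_1) < \mathrm{Var}(\hat\theta_2)$$
for at least one possible parameter value of $\theta$.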
226
Example:
Assume $\theta = E(X)$
Consider the estimators
$$\hat\theta_1(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i$$
$$\hat\theta_2(X_1, \dots, X_n) = \frac{X_1}{2} + \frac{1}{2(n-1)}\sum_{i=2}^{n} X_i$$
Which estimator is relatively more efficient?

(Class)
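One possible way to approach this exercise is to write down both variances for an i.i.d. sample with $\mathrm{Var}(X) = \sigma^2$: $\mathrm{Var}(\hat\theta_1) = \sigma^2/n$ and $\mathrm{Var}(\hat\theta_2) = \sigma^2/4 + \sigma^2/(4(n-1))$. The sketch below evaluates these expressions with exact fractions; the function names and the normalisation $\sigma^2 = 1$ are assumptions made for illustration, and the derivation given in class may proceed differently.

```python
from fractions import Fraction


def var_theta1(n, sigma2=Fraction(1)):
    """Variance of (1/n) * sum of X_i for i.i.d. X_i with variance sigma2."""
    return sigma2 / n


def var_theta2(n, sigma2=Fraction(1)):
    """Variance of X_1/2 + (1/(2(n-1))) * sum_{i=2}^n X_i."""
    return sigma2 / 4 + sigma2 / (4 * (n - 1))


for n in (2, 3, 10, 100):
    print(n, var_theta1(n), var_theta2(n))
```

In this calculation the two variances coincide for n = 2, and the sample mean has the smaller variance for every n > 2.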
Question:
How can we compare two estimators if (at least) one estimator is biased?
227
Definition 5.15: (Mean-squared error)

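Let $\hat\theta$ be an estimator of the parameter $\theta$. The mean-squared
error of the estimator $\hat\theta$ is defined to be
$$\mathrm{MSE}(\hat\theta) = E\left[\left(\hat\theta - \theta\right)^2\right] = \mathrm{Var}(\hat\theta) + \left[\mathrm{Bias}(\hat\theta)\right]^2$$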
Remarks:
If an estimator is unbiased, then its MSE is equal to the
variance of the estimator
The MSE of an estimator $\hat\theta$ depends on the value of the
unknown parameter $\theta$
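The decomposition MSE = variance + squared bias can be illustrated with the two variance estimators (divisors n and n - 1). The sketch assumes a normally distributed sample, for which $\mathrm{Var}(S^2) = 2\sigma^4/(n-1)$ is a standard result; this distributional assumption, the normalisation $\sigma^2 = 1$ and the helper name `mse_parts` are not taken from the slides.

```python
from fractions import Fraction


def mse_parts(n, divisor, sigma2=Fraction(1)):
    """Return (variance, bias, mse) of sum((X_i - Xbar)^2) / divisor.

    Assumes a normal sample, so that Var(S^2) = 2 * sigma2^2 / (n - 1).
    """
    scale = Fraction(n - 1, divisor)  # estimator equals scale * S^2
    variance = scale ** 2 * 2 * sigma2 ** 2 / (n - 1)
    bias = scale * sigma2 - sigma2
    return variance, bias, variance + bias ** 2


n = 10
print("divisor n:    ", mse_parts(n, n))
print("divisor n - 1:", mse_parts(n, n - 1))
```

Under these assumptions the biased estimator (divisor n) has the smaller MSE, which shows that unbiasedness and a small MSE are different criteria.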
228
Next:
Comparison of alternative estimators via their MSEs
Definition 5.16: (MSE efficiency)
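Let $\hat\theta_1$ and $\hat\theta_2$ be two alternative estimators of the unknown
parameter $\theta$. $\hat\theta_1$ is defined to be more MSE efficient than $\hat\theta_2$ if
$$\mathrm{MSE}(\hat\theta_1) \leq \mathrm{MSE}(\hat\theta_2)$$
for all possible parameter values of $\theta$ and
$$\mathrm{MSE}(\hat\theta_1) < \mathrm{MSE}(\hat\theta_2)$$
for at least one possible parameter value of $\theta$.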
229
[Figure: Unbiased vs. biased estimator; the curves are labelled $\hat\theta_1(X_1, \dots, X_n)$ and $\hat\theta_2(X_1, \dots, X_n)$]
230
Remarks:
Frequently two estimators of $\theta$ are not comparable with respect
to MSE efficiency since their respective MSE curves cross
There is no general mathematical principle for constructing
MSE efficient estimators
However, there are methods for finding the estimator with
uniformly minimum variance among all unbiased estimators
→ restriction to the class of all unbiased estimators
These specific methods are not discussed here
(Rao-Blackwell theorem, Lehmann-Scheffé theorem)
Here, we consider only one important result
231
Theorem 5.17: (Cramér-Rao lower bound for variance)
Let $X_1, \dots, X_n$ be a random sample from X and let $\theta$ be a parameter to be estimated. Consider the joint density of the random
sample $f_{X_1,\dots,X_n}(x_1, \dots, x_n)$ and define the value
$$CR(\theta) \equiv \left( E\left[ \left( \frac{\partial \ln f_{X_1,\dots,X_n}(X_1, \dots, X_n)}{\partial\theta} \right)^2 \right] \right)^{-1}.$$
Under certain (regularity) conditions we have for any unbiased
estimator $\hat\theta(X_1, \dots, X_n)$
$$\mathrm{Var}(\hat\theta) \geq CR(\theta).$$
232
Remarks:
The value $CR(\theta)$ is the minimal variance that any unbiased
estimator can take on
→ goodness criterion for unbiased estimators
If for an unbiased estimator $\hat\theta(X_1, \dots, X_n)$
$$\mathrm{Var}(\hat\theta) = CR(\theta),$$
then $\hat\theta$ is called UMVUE
(Uniformly Minimum-Variance Unbiased Estimator)
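As a worked illustration of the bound, consider a case that is not discussed on the slides: a random sample from a Bernoulli distribution with success probability p. The sketch computes the expected squared score exactly and compares the resulting bound with the variance of the sample mean; the function name `cramer_rao_bernoulli` and the values p = 0.3, n = 20 are assumptions chosen for illustration.

```python
from fractions import Fraction


def cramer_rao_bernoulli(p, n):
    """Cramér-Rao bound for p in an i.i.d. Bernoulli(p) sample of size n."""
    # Score of one observation: d/dp ln f(x; p) with f(x; p) = p^x (1-p)^(1-x).
    score = {1: 1 / p, 0: -1 / (1 - p)}
    prob = {1: p, 0: 1 - p}
    info_one = sum(prob[x] * score[x] ** 2 for x in (0, 1))  # E[score^2]
    # For an i.i.d. sample the expected squared score of the joint density
    # is n times the single-observation value.
    return 1 / (n * info_one)


p, n = Fraction(3, 10), 20
bound = cramer_rao_bernoulli(p, n)
var_mean = p * (1 - p) / n  # variance of the sample mean
print(bound, var_mean)
```

In this example the variance of the sample mean equals the bound, so the sample mean attains the Cramér-Rao lower bound.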
233
Second goodness property of point estimators:

Consider an increasing sample size ($n \to \infty$)
Notation: $\hat\theta_n(X_1, \dots, X_n) = \hat\theta(X_1, \dots, X_n)$
Analysis of the asymptotic distribution properties of $\hat\theta_n$
→ consistency of an estimator
Definition 5.18: ((Weak) consistency)
The estimator $\hat\theta_n(X_1, \dots, X_n)$ is called (weakly) consistent for $\theta$
if it converges in probability to $\theta$, i.e. if
$$\mathrm{plim}\, \hat\theta_n(X_1, \dots, X_n) = \theta.$$
234
Example:
Assume that $X \sim N(\mu, \sigma^2)$ with known $\sigma^2$ (e.g. $\sigma^2 = 1$)
Consider the following two estimators of $\mu$:
$$\hat\mu_n(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i$$
$$\hat\mu_n^{*}(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i + \frac{2}{n}$$
$\hat\mu_n$ is (weakly) consistent for $\mu$
(Theorem 5.6, Slide 210: weak law of large numbers)
235
$\hat\mu_n^{*}$ is (weakly) consistent for $\mu$
(this follows from Theorem 5.9(a), Slide 215)
Exact distribution of $\hat\mu_n$:
$$\hat\mu_n \sim N(\mu, \sigma^2/n)$$
(linear transformation of the normal distribution)
Exact distribution of $\hat\mu_n^{*}$:
$$\hat\mu_n^{*} \sim N(\mu + 2/n, \sigma^2/n)$$
(linear transformation of the normal distribution)
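Because both exact distributions are normal, the probability that either estimator misses $\mu$ by more than a tolerance $\varepsilon$ can be computed directly. The sketch below does this for $\mu = 0$, $\sigma^2 = 1$ and $\varepsilon = 0.25$; the tolerance, the grid of sample sizes and the helper name `prob_outside` are illustrative assumptions.

```python
from statistics import NormalDist


def prob_outside(n, shift, eps, mu=0.0, sigma=1.0):
    """P(|estimator - mu| > eps) when estimator ~ N(mu + shift, sigma^2 / n)."""
    dist = NormalDist(mu + shift, sigma / n ** 0.5)
    return dist.cdf(mu - eps) + (1.0 - dist.cdf(mu + eps))


eps = 0.25
for n in (2, 10, 20, 200, 2000):
    unshifted = prob_outside(n, 0.0, eps)      # sample mean
    shifted = prob_outside(n, 2.0 / n, eps)    # sample mean plus 2/n
    print(n, round(unshifted, 6), round(shifted, 6))
```

Both probabilities tend to zero as n grows, in line with the consistency of both estimators, although the shifted estimator is biased for every finite n.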
236
[Figure: Pdfs of the estimator $\hat\mu_n$ for n = 2, 10, 20 ($\sigma^2 = 1$, $\mu = 0$)]
237
[Figure: Pdfs of the estimator $\hat\mu_n^{*}$ for n = 2, 10, 20 ($\sigma^2 = 1$, $\mu = 0$)]
238
Remarks:
Sufficient (but not necessary) condition for consistency:
$\lim_{n \to \infty} E(\hat\theta_n) = \theta$
(asymptotic unbiasedness)
$\lim_{n \to \infty} \mathrm{Var}(\hat\theta_n) = 0$
Possible properties of an estimator:
consistent and unbiased
inconsistent and unbiased
consistent and biased
inconsistent and biased
239
Next:
Application of the central-limit theorem to estimators
→ asymptotic normality of an estimator
Definition 5.19: (Asymptotic normality)
An estimator $\hat\theta_n(X_1, \dots, X_n)$ of the parameter $\theta$ is called asymptotically normal if there exist (1) a sequence of real constants
$\theta_1, \theta_2, \dots$ and (2) a function $V(\theta)$ such that
$$\sqrt{n}\left(\hat\theta_n - \theta_n\right) \xrightarrow{d} U \sim N(0, V(\theta)).$$
240
Remarks:
Alternative notation:
$$\hat\theta_n \overset{\text{appr.}}{\sim} N(\theta_n, V(\theta)/n)$$
The concept of asymptotic normality naturally extends to

multivariate settings
241
5.3 Methods of Estimation

Up to now:
Definitions + properties of estimators
Next:
Construction of estimators
Three classical methods:
Method of Least Squares (LS)
Method of Moments (MM)
Maximum-Likelihood method (ML)
242
Remarks:
There are further methods
(e.g. the Generalized Method-of-Moments, GMM)
Here: focus on ML estimation
243
5.3.1 Least-Squares Estimators

History:
Introduced by
A.M. Legendre (1752-1833)
C.F. Gauß (1777-1855)
Idea:
Approximate the (noisy) observations $x_1, \dots, x_n$ by functions
$g_i(\theta_1, \dots, \theta_m)$, $i = 1, \dots, n$, $m < n$ such that
$$S(x_1, \dots, x_n; \theta) = \sum_{i=1}^{n}\left[x_i - g_i(\theta)\right]^2 \to \min_{\theta}$$
The LS-estimator is then defined to be
$$\hat\theta(X_1, \dots, X_n) = \underset{\theta}{\mathrm{argmin}}\; S(X_1, \dots, X_n; \theta)$$
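The simplest special case may help to see the principle at work: if every observation is approximated by the same constant, $g_i(\theta) = \theta$, the sum of squares is minimized by the arithmetic mean. The sketch below checks this on made-up data by comparing the mean with a brute-force grid search; the data, the grid and the function name `sum_of_squares` are illustrative assumptions.

```python
def sum_of_squares(xs, theta):
    """LS criterion S(x; theta) for the constant model g_i(theta) = theta."""
    return sum((x - theta) ** 2 for x in xs)


xs = [2.1, 1.9, 2.4, 2.0, 1.6]            # hypothetical noisy observations
theta_hat = sum(xs) / len(xs)             # closed-form LS solution: the mean

# Brute-force check on a grid from 1.000 to 3.000 in steps of 0.001.
grid = [i / 1000 for i in range(1000, 3001)]
theta_grid = min(grid, key=lambda t: sum_of_squares(xs, t))

print(theta_hat, theta_grid)
```

For richer choices of $g_i$, such as a linear function of regressors, the same criterion leads to the regression estimators treated in the econometrics courses.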
244
Remark:
The LS-method is central to the linear regression model
(cf. the courses Econometrics I + II)
245
5.3.2 Method-of-moments Estimators

History:
Introduced by K. Pearson (1857-1936)
Definition 5.20: (Theoretical and sample moments)
(a) Let X be a random variable with expectation E(X). The
theoretical p-th moment of X, denoted by $\mu'_p$, is defined as
$$\mu'_p = E(X^p).$$
The theoretical p-th central moment of X, denoted by $\mu_p$, is
defined as
$$\mu_p = E\left\{[X - E(X)]^p\right\}.$$
246
(b) Let $X_1, \dots, X_n$ be a random sample from X and let $\bar X$ denote
the arithmetic sample mean. Then the p-th sample moment,
denoted by $\hat\mu'_p$, is defined as
$$\hat\mu'_p = \frac{1}{n}\sum_{i=1}^{n} X_i^p.$$
The p-th central sample moment, denoted by $\hat\mu_p$, is defined
as
$$\hat\mu_p = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X\right)^p.$$
247
Remarks:
The theoretical moments $\mu'_p$ and $\mu_p$ had already been introduced in Definition 2.21 (Slide 76)
The sample moments $\hat\mu'_p$ and $\hat\mu_p$ are estimators of the theoretical moments $\mu'_p$ and $\mu_p$
The arithmetic sample mean is the 1st sample moment of
$X_1, \dots, X_n$
The sample variance is the 2nd central sample moment of
$X_1, \dots, X_n$
248
General setting:
Based on the random sample $X_1, \dots, X_n$ from X estimate the
r unknown parameters $\theta_1, \dots, \theta_r$
Basic idea of the method of moments:

1. Express the r theoretical moments as functions of the r unknown parameters:
$\mu'_1 = g_1(\theta_1, \dots, \theta_r)$
...
$\mu'_r = g_r(\theta_1, \dots, \theta_r)$
249
2. Express the r unknown parameters as functions of the r theoretical moments:

$\theta_1 = h_1(\mu_1, \dots, \mu_r, \mu'_1, \dots, \mu'_r)$
...
$\theta_r = h_r(\mu_1, \dots, \mu_r, \mu'_1, \dots, \mu'_r)$
3. Replace the theoretical moments by the sample moments:
$\hat\theta_1(X_1, \dots, X_n) = h_1(\hat\mu_1, \dots, \hat\mu_r, \hat\mu'_1, \dots, \hat\mu'_r)$
...
$\hat\theta_r(X_1, \dots, X_n) = h_r(\hat\mu_1, \dots, \hat\mu_r, \hat\mu'_1, \dots, \hat\mu'_r)$
250
Example: (Exponential distribution)

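Let the random variable X have an exponential distribution
with parameter $\lambda > 0$ and pdf
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{for } x > 0 \\ 0, & \text{otherwise} \end{cases}$$
The expectation and the variance of X are given by
$$E(X) = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$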
251
Method-of-moments estimator via the expectation:

1. We know that
$$E(X) = \mu'_1 = \frac{1}{\lambda}$$
2. This implies
$$\lambda = \frac{1}{\mu'_1}$$
3. Method-of-moments estimator for $\lambda$:
$$\hat\lambda(X_1, \dots, X_n) = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} X_i}$$
252
Method-of-moments estimator via the variance:

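1. We know that
$$\mathrm{Var}(X) = \mu_2 = \frac{1}{\lambda^2}$$
2. This implies
$$\lambda = \sqrt{\frac{1}{\mu_2}}$$
3. Method-of-moments estimator for $\lambda$:
$$\hat\lambda(X_1, \dots, X_n) = \sqrt{\frac{1}{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2}}$$
→ Method-of-moments estimators of an unknown parameter
are not unique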
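The non-uniqueness is easy to see on simulated data. The following sketch draws an exponential sample with a true rate of $\lambda = 2$ and applies both method-of-moments estimators; the true value, the sample size, the seed and the function names `mm_via_mean` and `mm_via_variance` are assumptions made for illustration.

```python
import random


def mm_via_mean(xs):
    """Method-of-moments estimator based on E(X) = 1 / lambda."""
    return 1.0 / (sum(xs) / len(xs))


def mm_via_variance(xs):
    """Method-of-moments estimator based on Var(X) = 1 / lambda^2."""
    n = len(xs)
    mean = sum(xs) / n
    second_central = sum((x - mean) ** 2 for x in xs) / n
    return (1.0 / second_central) ** 0.5


rng = random.Random(42)
lam = 2.0
xs = [rng.expovariate(lam) for _ in range(20000)]
est_mean, est_var = mm_via_mean(xs), mm_via_variance(xs)
print(est_mean, est_var)
```

Both estimates should lie near 2 for a sample of this size, yet they generally differ from each other.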
253
Remarks:
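Method-of-moments estimators are (weakly) consistent since, for a continuous function $h_1$ (Slutsky's theorem),
$$\mathrm{plim}\, \hat\theta_1 = \mathrm{plim}\, h_1(\hat\mu_1, \dots, \hat\mu_r, \hat\mu'_1, \dots, \hat\mu'_r)$$
$$= h_1(\mathrm{plim}\, \hat\mu_1, \dots, \mathrm{plim}\, \hat\mu_r, \mathrm{plim}\, \hat\mu'_1, \dots, \mathrm{plim}\, \hat\mu'_r)$$
$$= h_1(\mu_1, \dots, \mu_r, \mu'_1, \dots, \mu'_r)$$
$$= \theta_1$$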
In general, method-of-moments estimators are not unbiased
Method-of-moments estimators typically are asymptotically
normal
The asymptotic variances are often hard to determine
254
5.3.3 Maximum-Likelihood Estimators

History:
Introduced by Ronald Fisher (1890-1962)
Basic idea behind ML estimation:
We estimate the unknown parameters $\theta_1, \dots, \theta_r$ in such a
manner that the likelihood of the observed sample $x_1, \dots, x_n$,
which we express as a function of the unknown parameters,
becomes maximal
255
Example:
Consider an urn containing black and white balls
The ratio of numbers is known to be 3 : 1
It is not known if the black or the white balls are more numerous
Draw n balls with replacement
Let X denote the number of black balls in the sample
Discrete density of X:
$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x \in \{0, 1, \dots, n\}, \; p \in \{0.25, 0.75\}$$
(binomial distribution)
256
$p \in \{0.25, 0.75\}$ is the parameter to be estimated

Consider a particular sample of size n = 3
→ potential realizations:
Number of black balls x      0       1       2       3
P(X = x; p = 0.25)         27/64   27/64    9/64    1/64
P(X = x; p = 0.75)          1/64    9/64   27/64   27/64
Intuitive estimation:
We estimate p by that value which ex-ante maximizes the
probability of observing the actual realization x:
$$\hat p = \begin{cases} 0.25, & \text{for } x = 0, 1 \\ 0.75, & \text{for } x = 2, 3 \end{cases}$$
→ Maximum-Likelihood (ML) estimation
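The urn example can be reproduced in a few lines. The sketch below recomputes the binomial probabilities for n = 3 with exact fractions and picks, for each possible realization x, the candidate value of p with the larger likelihood; the names `binom_pmf`, `candidates` and `ml_estimate` are illustrative.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for a binomial distribution with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


candidates = (Fraction(1, 4), Fraction(3, 4))  # the two admissible values of p
n = 3
ml_estimate = {}
for x in range(n + 1):
    likelihoods = {p: binom_pmf(x, n, p) for p in candidates}
    # ML principle: choose the parameter value with the larger likelihood.
    ml_estimate[x] = max(likelihoods, key=likelihoods.get)
    print(x, [str(likelihoods[p]) for p in candidates], float(ml_estimate[x]))
```

The loop selects p = 0.25 for x = 0, 1 and p = 0.75 for x = 2, 3.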
257
Next:
Formalization of the ML estimation technique
Notions:
Likelihood function, loglikelihood function
ML estimator
Definition 5.21: (Likelihood function)
The likelihood function of n random variables $X_1, \dots, X_n$ is defined to be the joint density of the n random variables, say
$f_{X_1,\dots,X_n}(x_1, \dots, x_n; \theta)$, which is considered to be a function of
the parameter vector $\theta$.
258
Remarks:
If $X_1, \dots, X_n$ is a random sample from the continuous random
variable X with pdf $f_X(x; \theta)$, then
$$f_{X_1,\dots,X_n}(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$
The likelihood function is often denoted by $L(\theta; x_1, \dots, x_n)$
or $L(\theta)$, i.e. in the above-mentioned case
$$L(\theta; x_1, \dots, x_n) = L(\theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$
259
If $X_1, \dots, X_n$ is a random sample from a discrete random variable
X, the likelihood function is given by
$$L(\theta; x_1, \dots, x_n) = P(X_1 = x_1, \dots, X_n = x_n; \theta) = \prod_{i=1}^{n} P(X = x_i; \theta)$$
(likelihood = probability that the observed sample occurs)

Example:
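Let $X_1, \dots, X_n$ be a random sample from $X \sim N(\mu, \sigma^2)$. Then
$\theta = (\mu, \sigma^2)'$ and the likelihood function is given by
$$L(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left((x_i - \mu)/\sigma\right)^2} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}$$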
260
Definition 5.22: (Maximum-likelihood estimator)

Let $L(\theta; x_1, \dots, x_n)$ be the likelihood function of the random variables $X_1, \dots, X_n$. If $\hat\theta$ [where $\hat\theta = \hat\theta(x_1, \dots, x_n)$ is a function of
the observations $x_1, \dots, x_n$] is the value of $\theta$ which maximizes
$L(\theta; x_1, \dots, x_n)$, then $\hat\theta(X_1, \dots, X_n)$ is the maximum-likelihood estimator of $\theta$.
Remarks:
We obtain the ML estimator via (1) maximizing the likelihood
function
$$L(\hat\theta; x_1, \dots, x_n) = \max_{\theta} L(\theta; x_1, \dots, x_n),$$
and (2) by replacing the realizations $x_1, \dots, x_n$ by the random
variables $X_1, \dots, X_n$
261
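It is often easier to maximize the loglikelihood function
$$\ln[L(\theta; x_1, \dots, x_n)]$$
($L(\theta)$ and $\ln[L(\theta)]$ have their maxima at the same value of
$\theta$)
We derive $\hat\theta = (\hat\theta_1, \dots, \hat\theta_r)'$ by solving the system of equations
$$\frac{\partial}{\partial\theta_1} \ln[L(\theta; x_1, \dots, x_n)] = 0$$
...
$$\frac{\partial}{\partial\theta_r} \ln[L(\theta; x_1, \dots, x_n)] = 0$$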
262
Example:
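Let $X_1, \dots, X_n$ be a random sample from $X \sim N(\mu, \sigma^2)$ with
the likelihood function
$$L(\mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}$$
The loglikelihood function is given by
$$L^{*}(\mu, \sigma^2) = \ln[L(\mu, \sigma^2)] = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$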
263
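The partial derivatives are given by
$$\frac{\partial L^{*}(\mu, \sigma^2)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)$$
and
$$\frac{\partial L^{*}(\mu, \sigma^2)}{\partial\sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2$$
Setting these equal to zero, solving the system of equations
and replacing the realizations by the random variables yields
the ML estimators
$$\hat\mu(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar X$$
$$\hat\sigma^2(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2$$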
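The first-order conditions can be verified numerically. The sketch below computes the two ML estimates for a small made-up data set and evaluates both partial derivatives of the loglikelihood at the estimates; the data and the function names `normal_ml` and `score` are illustrative assumptions.

```python
def normal_ml(xs):
    """ML estimates of mu and sigma^2 for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divisor n, not n - 1
    return mu_hat, sigma2_hat


def score(xs, mu, sigma2):
    """Partial derivatives of the normal loglikelihood w.r.t. mu and sigma^2."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    d_mu = sum(x - mu for x in xs) / sigma2
    d_sigma2 = -n / (2 * sigma2) + ss / (2 * sigma2 ** 2)
    return d_mu, d_sigma2


xs = [4.2, 5.1, 3.8, 6.0, 4.9]  # hypothetical observations
mu_hat, sigma2_hat = normal_ml(xs)
print(mu_hat, sigma2_hat, score(xs, mu_hat, sigma2_hat))
```

Both derivatives vanish (up to rounding) at the estimates, while they are non-zero at other parameter values.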
264
General properties of ML estimators:

Distributional assumptions are necessary
Under rather mild regularity conditions ML estimators have
nice properties:
1. If $\hat\theta$ is the ML estimator of $\theta$, then $g(\hat\theta)$ is the ML estimator
of $g(\theta)$
(equivariance property)
2. (Weak) consistency:
$$\mathrm{plim}\, \hat\theta_n = \theta$$
265
3. Asymptotic normality:
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \xrightarrow{d} U \sim N(0, V(\theta))$$
4. Asymptotic efficiency:
$V(\theta)$ coincides with the Cramér-Rao lower bound
5. Direct computation (numerical methods)
6. Quasi-ML estimation:
ML estimators computed on the basis of normally distributed random samples are robust even if the random
sample actually is not normally distributed
(robustness against distribution misspecification)
266
6. Hypothesis Testing
Setting:
Let X represent the random experiment under consideration
Let X have the unknown cdf FX (x)
We are interested in an unknown parameter $\theta$ in the distribution of X
Now:
Testing of a statistical hypothesis on the unknown $\theta$ on the
basis of a random sample $X_1, \dots, X_n$
267
Example 1:
In our local pub the glasses are said to contain 0.4 litres
of beer. We suspect that in many cases the glasses actually
contain less than 0.4 litres of beer
Let X represent the process of filling a glass of beer
Let $\theta = E(X)$ denote the expected amount of beer filled in
one glass
On the basis of a random sample $X_1, \dots, X_n$ we would like
to test
$\theta = 0.4$
versus
$\theta < 0.4$
268
Example 2:
We know from past data that the risk of a specific stock
(measured by the standard deviation of the stock return) has
been equal to 25%. Now, there is a change in the managerial
board of the firm. Does this change affect the risk of the
stock?
Let X represent the stock return
Let $\theta = \sqrt{\mathrm{Var}(X)} = SD(X)$ denote the standard deviation of
the return
On the basis of a random sample $X_1, \dots, X_n$ we would like
to test
$\theta = 0.25$
versus
$\theta \neq 0.25$
269
6.1 Basic Terminology

Definition 6.1: (Parameter test)
Let X be a random variable and let $\theta$ be an unknown parameter in
the distribution of X. A parameter test constitutes a statistical
procedure for deciding on a hypothesis concerning the unknown
parameter $\theta$ on the basis of a random sample $X_1, \dots, X_n$ from
X.
Statistical hypothesis-testing problem:
Let $\Theta$ denote the set of all possible parameter values
(i.e. $\theta \in \Theta$; we call $\Theta$ the parameter space)
Let $\Theta_0 \subset \Theta$ be a subset of the parameter space
270
Consider the following statements:

$H_0: \theta \in \Theta_0$
versus
$H_1: \theta \in \Theta \setminus \Theta_0 = \Theta_1$
$H_0$ is called the null hypothesis, $H_1$ is called the alternative
hypothesis
Types of hypotheses:
If $|\Theta_0| = 1$ (i.e. $\Theta_0 = \{\theta_0\}$) and $H_0: \theta = \theta_0$, then $H_0$ is called
simple
Otherwise $H_0$ is called composite
An analogous terminology applies to $H_1$
271
Types of hypothesis tests:

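Let $\theta_0 \in \Theta$ be a real constant. Then
$H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$
is called a two-sided test
The tests
$H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$
and
$H_0: \theta \geq \theta_0$ versus $H_1: \theta < \theta_0$
are called one-sided tests (right- and left-sided tests)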
272
Next:
Consider the general testing problem
$H_0: \theta \in \Theta_0$
versus
$H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0$
General procedure:
Based on a random sample $X_1, \dots, X_n$ from X decide on
whether to reject $H_0$ in favor of $H_1$ or not
Explicit procedure:
Select an appropriate test statistic $T(X_1, \dots, X_n)$ and determine an appropriate critical region $K \subset \mathbb{R}$
Decision:
$T(X_1, \dots, X_n) \in K \Longrightarrow$ reject $H_0$
$T(X_1, \dots, X_n) \notin K \Longrightarrow$ do not reject (accept) $H_0$
273
Notice:
$T(X_1, \dots, X_n)$ is a random variable
→ the decision is random
→ possibility of wrong decisions
Types of errors:
                          Reality
Decision based on test    H0 true             H0 false
reject H0                 type I error        correct decision
accept H0                 correct decision    type II error
Conclusion:
Type I error: test rejects H0 when H0 is true
Type II error: test accepts H0 when H0 is false
274
When do wrong decisions occur?

The type I error occurs if
$T(X_1, \dots, X_n) \in K$
when for the true parameter $\theta$ we have $\theta \in \Theta_0$
The type II error occurs if
$T(X_1, \dots, X_n) \notin K$,
when for the true parameter $\theta$ we have $\theta \in \Theta_1$
275
Question:
When does a hypothesis test of the form
$H_0: \theta \in \Theta_0$
versus
$H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0$
have good properties?

Intuitively:
A test is good if it possesses low probabilities of committing
type I and type II errors
Next:
Formal instrument for measuring type I and type II error
probabilities
276
Definition 6.2: (Power function of a test)

Consider a hypothesis test of the general form given on Slide 276
with the test statistic $T(X_1, \dots, X_n)$ and an appropriately chosen critical region K. The power function of the test, denoted
by $G(\theta)$, is defined to be the probability that the test rejects $H_0$
when $\theta$ is the true (unknown) parameter. Formally,
$$G: \Theta \to [0, 1]$$
with
$$G(\theta) = P(T(X_1, \dots, X_n) \in K).$$
277
Remark:
Using the power function of a test, we can express the probabilities of the type I error as
$G(\theta)$
for all $\theta \in \Theta_0$
and the probabilities of the type II error as
$1 - G(\theta)$
for all $\theta \in \Theta_1$
Question:
What should an ideal test look like?
Intuitively:
A test would be ideal if the probabilities of both the type I
and the type II errors were constantly equal to zero
→ the test would yield the correct decision with probability 1
278
Example:
For $\theta_0 \in \Theta$ consider the testing problem
$H_0: \theta \leq \theta_0$
versus
$H_1: \theta > \theta_0$
[Figure: Power function of an ideal test]
279
Unfortunately:
It can be shown mathematically that, in general, such an
ideal test does not exist
Way out:
For the selected test statistic $T(X_1, \dots, X_n)$ consider the
maximal type-I-error probability
$$\alpha = \max_{\theta \in \Theta_0} \{P(T(X_1, \dots, X_n) \in K)\} = \max_{\theta \in \Theta_0} \{G(\theta)\}$$
Now, fix the critical region K in such a way that $\alpha$ takes on
a prespecified small value
280
→ all type-I-error probabilities are less than or equal to $\alpha$
Frequently used $\alpha$-values: $\alpha = 0.01$, $\alpha = 0.05$, $\alpha = 0.1$
Definition 6.3: (Size of test)
Consider a hypothesis test of the general form given on Slide
276 with the test statistic $T(X_1, \dots, X_n)$ and an appropriately
chosen critical region K. The size of the test (also known as
the significance level of the test) is defined to be the maximal
type-I-error probability
$$\alpha = \max_{\theta \in \Theta_0} \{P(T(X_1, \dots, X_n) \in K)\} = \max_{\theta \in \Theta_0} \{G(\theta)\}.$$
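To see a power function and the size of a test in a concrete case, consider a left-sided test on a normal mean with known standard deviation, in the spirit of the beer example. Everything specific in this sketch is an assumption made for illustration: normally distributed fillings, a known $\sigma = 0.05$, n = 25, $\alpha = 0.05$, the rejection rule $\sqrt{n}(\bar X - \mu_0)/\sigma < z_\alpha$, and the function name `power_left_sided`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_left_sided(mu, mu0=0.4, sigma=0.05, n=25, alpha=0.05):
    """G(mu): probability of rejecting H0 when the true mean is mu.

    The test rejects when sqrt(n) * (Xbar - mu0) / sigma < z_alpha.
    """
    z_alpha = STD_NORMAL.inv_cdf(alpha)
    return STD_NORMAL.cdf(z_alpha - n ** 0.5 * (mu - mu0) / sigma)


# Size: maximal rejection probability over a grid of values with mu >= mu0.
null_grid = [0.4 + i / 1000 for i in range(0, 101)]
size = max(power_left_sided(mu) for mu in null_grid)

for mu in (0.36, 0.38, 0.40, 0.42):
    print(mu, round(power_left_sided(mu), 4))
print("size:", round(size, 4))
```

In this setting the maximum over the null hypothesis is attained at the boundary value 0.4, where the rejection probability equals $\alpha$, and the power rises as the true mean falls below 0.4.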
281
Implications of this test construction:

The probability of the test rejecting H0 when in fact H0 is
true (i.e. the type-I-error probability) is at most $\alpha$
→ if, for a concrete sample, the test rejects H0, we can be
quite sure that H0 is in fact false
(we say that H1 is statistically significant)
By contrast, we cannot control the type-II-error probability (i.e. the probability of the test accepting H0 when
in fact H0 is false)
→ if, for a concrete sample, the test accepts H0, then there
is no probability assessment of a potentially wrong decision
(acceptance of H0 simply means: the data are not inconsistent with H0)
282
Therefore:
It is crucial how to formulate H0 and H1
We formulate our research hypothesis in H1
(hoping that, for a concrete sample, our test rejects H0)
Example:
Consider Example 1 on Slide 268
If, for a concrete sample, our test rejects H0 we can be quite
sure that (on average) the glasses contain less than 0.4 litres
of beer
If our test accepts H0 we cannot make a statistically significant statement
(the data are not inconsistent with H0)
283
6.2 Classical Testing Procedures

Next:
Three general classical testing procedures based on the loglikelihood function of a random sample
Setting:
Let X1, . . . , Xn be a random sample from X
Let $\theta \in \mathbb{R}$ be an unknown parameter
Let $L(\theta) = L(\theta; x_1, \dots, x_n)$ denote the likelihood function
284
Let $\ln[L(\theta)]$ denote the loglikelihood function

Assume $g: \mathbb{R} \to \mathbb{R}$ to be a continuous function
Consider the testing problem:
$H_0: g(\theta) = q$
versus
$H_1: g(\theta) \neq q$
Fundamental to all three tests:

Maximum-Likelihood estimator $\hat\theta_{ML}$ of $\theta$
285
6.2.1 Wald Test

History:
Suggested by A. Wald (1902-1950)
Idea behind this test:
If $H_0: g(\theta) = q$ is true, then the random variable $g(\hat\theta_{ML}) - q$
should not be significantly different from zero
286
Previous knowledge:
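Equivariance property of the ML estimator (Slide 265)
→ $g(\hat\theta_{ML})$ is the ML estimator of $g(\theta)$
Asymptotic normality (Slide 266)
$$g(\hat\theta_{ML}) - g(\theta) \xrightarrow{d} U \sim N\left(0, \mathrm{Var}(g(\hat\theta_{ML}))\right)$$
The asymptotic variance $\mathrm{Var}(g(\hat\theta_{ML}))$ needs to be estimated
from the data
Wald test statistic:
$$W = \frac{\left[g(\hat\theta_{ML}) - q\right]^2}{\widehat{\mathrm{Var}}\left[g(\hat\theta_{ML})\right]} \xrightarrow{d} U \sim \chi^2_1 \quad (\text{under } H_0)$$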
287
Test decision:
Reject $H_0$ at the significance level $\alpha$ if $W > \chi^2_{1;1-\alpha}$
Remarks:
The Wald test is a pure test against H0
(it is not necessary to exactly specify H1)
The Wald principle can be applied to any consistent, asymptotically normally distributed estimator
288
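[Figure: Wald test statistic for $H_0: g(\theta) = 0$ versus $H_1: g(\theta) \neq 0$; curves $g(\theta)$ and $\ln[L(\theta)]$ with $\hat\theta_{ML}$ marked on the $\theta$-axis]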
289
6.2.2 Likelihood-Ratio Test (LR Test)

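Consider the likelihood function $L(\theta)$ at two points:
$$\max_{\{\theta:\, g(\theta) = q\}} L(\theta) \quad \left(= L(\hat\theta_{H_0})\right)$$
$$\max_{\theta \in \Theta} L(\theta) \quad \left(= L(\hat\theta_{ML})\right)$$
Consider the quantity
$$\lambda = \frac{L(\hat\theta_{H_0})}{L(\hat\theta_{ML})}$$
Properties of $\lambda$:
$0 \leq \lambda \leq 1$
If $H_0$ is true, then $\lambda$ should be close to one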
290
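LR test statistic:
$$LR = -2\ln(\lambda) = 2\left\{\ln\left[L(\hat\theta_{ML})\right] - \ln\left[L(\hat\theta_{H_0})\right]\right\} \xrightarrow{d} U \sim \chi^2_1 \quad (\text{under } H_0)$$
Properties of the LR test statistic:

$0 \leq LR < \infty$
If $H_0$ is true, then LR should be close to zero
Test decision:
Reject $H_0$ at the significance level $\alpha$ if $LR > \chi^2_{1;1-\alpha}$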
291
Remarks:
The LR test verifies if the distance in the loglikelihood functions, $\ln[L(\hat\theta_{ML})] - \ln[L(\hat\theta_{H_0})]$, is significantly larger than 0
The LR test does not require the computation of any asymptotic variance
292
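[Figure: LR test statistic for $H_0: g(\theta) = 0$ versus $H_1: g(\theta) \neq 0$; curves $g(\theta)$ and $\ln[L(\theta)]$, with $\ln[L(\hat\theta_{ML})]$, $\ln[L(\hat\theta_{H_0})]$ and their distance LR marked at $\hat\theta_{H_0}$ and $\hat\theta_{ML}$]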
293
6.2.3 Lagrange-Multiplier Test (LM Test)

History:
Named after J.L. Lagrange (1736-1813); the test builds on his multiplier method for constrained optimization
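For the ML estimator $\hat\theta_{ML}$ we have
$$\left.\frac{\partial \ln[L(\theta)]}{\partial\theta}\right|_{\theta = \hat\theta_{ML}} = 0$$
If $H_0: g(\theta) = q$ is true, then the slope of the loglikelihood
function at the point $\hat\theta_{H_0}$ should not be significantly different
from zero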
294
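LM test statistic:
$$LM = \left(\left.\frac{\partial \ln[L(\theta)]}{\partial\theta}\right|_{\theta = \hat\theta_{H_0}}\right)^2 \cdot \widehat{\mathrm{Var}}\left(\hat\theta_{H_0}\right) \xrightarrow{d} U \sim \chi^2_1 \quad (\text{under } H_0)$$
(equivalently, the squared slope of the loglikelihood divided by the estimated Fisher information at $\hat\theta_{H_0}$)
Test decision:
Reject $H_0$ at the significance level $\alpha$ if $LM > \chi^2_{1;1-\alpha}$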
295
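[Figure: LM test statistic for $H_0: g(\theta) = 0$ versus $H_1: g(\theta) \neq 0$; curves $g(\theta)$ and $\ln[L(\theta)]$, with the slope of the loglikelihood (LM) marked at $\hat\theta_{H_0}$ and the maximum at $\hat\theta_{ML}$]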
296
Remarks:
The test statistics of both the Wald and the LM tests contain estimated variances (of $g(\hat\theta_{ML})$ and of $\hat\theta_{H_0}$, respectively)
These unknown variances can be estimated consistently by means of
the so-called Fisher information
Many econometric tests are based on these three construction principles
The three tests are asymptotically equivalent, i.e. for large
sample sizes they produce identical test decisions
The three principles can be extended to the testing of hypotheses on a parameter vector $\theta$
If $\theta \in \mathbb{R}^m$, then all 3 test statistics have an asymptotic $\chi^2_m$ distribution
under $H_0$
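The three statistics can be compared on a simple case that is not part of the slides: a Bernoulli sample with k successes in n trials and the hypothesis $H_0: p = p_0$ (so g is the identity and $q = p_0$). The sketch uses $\widehat{\mathrm{Var}}(\hat p) = p(1-p)/n$, evaluated at the unrestricted estimate for the Wald statistic and at $p_0$ for the LM statistic; the data (k = 62, n = 100, $p_0 = 0.5$) and the function name `three_tests` are illustrative assumptions.

```python
from math import log
from statistics import NormalDist


def three_tests(k, n, p0):
    """Wald, LR and LM statistics for H0: p = p0 (requires 0 < k < n)."""
    p_hat = k / n  # unrestricted ML estimate

    def loglik(p):
        return k * log(p) + (n - k) * log(1 - p)

    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)
    lr = 2 * (loglik(p_hat) - loglik(p0))
    score_h0 = k / p0 - (n - k) / (1 - p0)     # slope of loglik at p0
    lm = score_h0 ** 2 * (p0 * (1 - p0) / n)   # squared slope times variance
    return wald, lr, lm


# The 0.95 quantile of chi-square(1) is the squared 0.975 normal quantile.
critical = NormalDist().inv_cdf(0.975) ** 2
wald, lr, lm = three_tests(k=62, n=100, p0=0.5)
print(round(wald, 3), round(lr, 3), round(lm, 3), round(critical, 3))
```

For these numbers the three statistics differ slightly but all exceed the critical value, so each test rejects $H_0$ at the 5% level.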
297
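[Figure: The 3 tests in one graph; curves $g(\theta)$ and $\ln L(\theta)$, with $\ln[L(\hat\theta_{ML})]$, $\ln[L(\hat\theta_{H_0})]$, LR and LM marked at $\hat\theta_{H_0}$ and $\hat\theta_{ML}$]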
298
