
CHAPTER 3

Convergence of Random Variables

3.1 Introduction
In probability and statistics, it is often necessary to consider the distribution of a random variable that is itself a function of several random variables, for example, $Y = g(X_1, \ldots, X_n)$; a simple example is the sample mean of random variables $X_1, \ldots, X_n$. Unfortunately, finding the distribution exactly is often very difficult or very time-consuming even if the joint distribution of the random variables is known exactly. In other cases, we may have only partial information about the joint distribution of $X_1, \ldots, X_n$, in which case it is impossible to determine the distribution of $Y$. However, when $n$ is large, it may be possible to obtain approximations to the distribution of $Y$ even when only partial information about $X_1, \ldots, X_n$ is available; in many cases, these approximations can be remarkably accurate.
The standard approach to approximating a distribution function is to consider the distribution function as part of an infinite sequence of distribution functions; we then try to find a limiting distribution for the sequence and use that limiting distribution to approximate the distribution of the random variable in question. This approach, of course, is very common in mathematics. For example, if $n$ is large compared to $x$, one might approximate $(1 + x/n)^n$ by $\exp(x)$ since
$$\lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n = \exp(x).$$
(However, this approximation may be very poor if $x/n$ is not close to 0.) A more interesting example is Stirling's approximation, which is used to approximate $n!$ for large values of $n$:
$$n! \approx \sqrt{2\pi}\exp(-n)\,n^{n+1/2} = s(n)$$
where the approximation holds in the sense that $n!/s(n) \to 1$ as $n \to \infty$.
Table 3.1 Comparison of n! and its Stirling approximation s(n).
n n! s(n)
1 1 0.92
2 2 1.92
3 6 5.84
4 24 23.51
5 120 118.02
6 720 710.08
In fact, Stirling's approximation is not too bad even for small $n$, as Table 3.1 indicates.
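For illustration, the comparison in Table 3.1 can be reproduced with a few lines of Python (a minimal sketch; the function name stirling and the range of n are arbitrary choices):

import math

def stirling(n):
    # s(n) = sqrt(2*pi) * exp(-n) * n^(n + 1/2)
    return math.sqrt(2 * math.pi) * math.exp(-n) * n ** (n + 0.5)

for n in range(1, 7):
    exact = math.factorial(n)
    approx = stirling(n)
    # the ratio n!/s(n) tends to 1 as n grows
    print(n, exact, round(approx, 2), round(exact / approx, 4))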
In a sense, Stirling's approximation shows that asymptotic approximations can be useful in a more general context. In statistical practice, asymptotic approximations (typically justified for large sample sizes) are very commonly used even in situations where the sample size is small. Of course, it is not always clear that the use of such approximations is warranted, but nonetheless there is a sufficiently rich set of examples where it is warranted to make the study of convergence of random variables worthwhile.
To motivate the notion of convergence of random variables, consider the following example. Suppose that $X_1, \ldots, X_n$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$ and define
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$$
to be their sample mean; we would like to look at the behaviour of the distribution of $\bar X_n$ when $n$ is large. First of all, it seems reasonable that $\bar X_n$ will be close to $\mu$ if $n$ is sufficiently large; that is, the random variable $\bar X_n$ should have a distribution that, for large $n$, is concentrated around $\mu$ or, more precisely,
$$P[|\bar X_n - \mu| \le \epsilon] \approx 1$$
when $\epsilon$ is small. (Note that $\mathrm{Var}(\bar X_n) = \sigma^2/n \to 0$ as $n \to \infty$.) This latter observation is, however, not terribly informative about the distribution of $\bar X_n$. However, it is also possible to look at the difference between $\bar X_n$ and $\mu$ on a magnified scale; we do this
by multiplying the difference $\bar X_n - \mu$ by $\sqrt{n}$ so that the mean and variance are constant. Thus define
$$Z_n = \sqrt{n}(\bar X_n - \mu)$$
and note that $E(Z_n) = 0$ and $\mathrm{Var}(Z_n) = \sigma^2$. We can now consider the behaviour of the distribution function of $Z_n$ as $n$ increases. If this sequence of distribution functions has a limit (in some sense) then we can use the limiting distribution function to approximate the distribution function of $Z_n$ (and hence of $\bar X_n$). For example, if we have
$$P(Z_n \le x) = P\left(\sqrt{n}(\bar X_n - \mu) \le x\right) \approx F_0(x)$$
then
$$P(\bar X_n \le y) = P\left(\sqrt{n}(\bar X_n - \mu) \le \sqrt{n}(y - \mu)\right) \approx F_0\left(\sqrt{n}(y - \mu)\right)$$
provided that $n$ is sufficiently large to make the approximation valid.
3.2 Convergence in probability and distribution
In this section, we will consider two different types of convergence for sequences of random variables: convergence in probability and convergence in distribution.
DEFINITION. Let $\{X_n\}$, $X$ be random variables. Then $\{X_n\}$ converges in probability to $X$ as $n \to \infty$ ($X_n \to_p X$) if for each $\epsilon > 0$,
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0.$$
If $X_n \to_p X$ then for large $n$ we have that $X_n \approx X$ with probability close to 1. Frequently, the limiting random variable $X$ is a constant; $X_n \to_p \theta$ (a constant) means that for large $n$ there is almost no variation in the random variable $X_n$. (A stronger form of convergence, convergence with probability 1, is discussed in section 3.7.)
DEFINITION. Let $\{X_n\}$, $X$ be random variables. Then $\{X_n\}$ converges in distribution to $X$ as $n \to \infty$ ($X_n \to_d X$) if
$$\lim_{n\to\infty} P(X_n \le x) = P(X \le x) = F(x)$$
for each continuity point $x$ of the distribution function $F(x)$.
It is important to remember that $X_n \to_d X$ implies convergence of distribution functions and not of the random variables themselves. For this reason, it is often convenient to replace $X_n \to_d X$ by $X_n \to_d F$ where $F$ is the distribution function of $X$, that is, the limiting distribution; for example, $X_n \to_d N(0, \sigma^2)$ means that $\{X_n\}$ converges in distribution to a random variable that has a Normal distribution (with mean 0 and variance $\sigma^2$).
If $X_n \to_d X$ then for sufficiently large $n$ we can approximate the distribution function of $X_n$ by that of $X$; thus, convergence in distribution is potentially useful for approximating the distribution function of a random variable. However, the statement $X_n \to_d X$ does not say how large $n$ must be in order for the approximation to be practically useful. To answer this question, we typically need a further result dealing explicitly with the approximation error as a function of $n$.
EXAMPLE 3.1: Suppose that $X_1, \ldots, X_n$ are i.i.d. Uniform random variables on the interval $[0,1]$ and define
$$M_n = \max(X_1, \ldots, X_n).$$
Intuitively, $M_n$ should be approximately 1 for large $n$. We will first show that $M_n \to_p 1$ and then find the limiting distribution of $n(1 - M_n)$. The distribution function of $M_n$ is
$$F_n(x) = x^n \quad \text{for } 0 \le x \le 1.$$
Thus for $0 < \epsilon < 1$,
$$P(|M_n - 1| > \epsilon) = P(M_n < 1 - \epsilon) = (1 - \epsilon)^n \to 0$$
as $n \to \infty$ since $|1 - \epsilon| < 1$. To find the limiting distribution of $n(1 - M_n)$, note that
$$P(n(1 - M_n) \le x) = P(M_n \ge 1 - x/n) = 1 - \left(1 - \frac{x}{n}\right)^n \to 1 - \exp(-x)$$
as $n \to \infty$ for $x \ge 0$. Thus $n(1 - M_n)$ has a limiting Exponential distribution with parameter 1. In this example, of course, there is no real advantage in knowing the limiting distribution of $n(1 - M_n)$ as its exact distribution is known.
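For illustration, a minimal simulation sketch of Example 3.1 in Python (the sample size, number of replications, and seed are arbitrary choices): $M_n$ should be close to 1, and $n(1 - M_n)$ should behave like an Exponential random variable with parameter 1.

import numpy as np

rng = np.random.default_rng(0)
n, replications = 200, 10000

# each row is a sample of n Uniform(0,1) random variables
u = rng.uniform(size=(replications, n))
m = u.max(axis=1)                 # M_n for each replication
t = n * (1.0 - m)                 # n(1 - M_n)

print("mean of M_n:", m.mean())                   # close to 1
print("P(n(1 - M_n) <= 1):", np.mean(t <= 1.0))   # approx 1 - exp(-1) = 0.632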
EXAMPLE 3.2: Suppose that $X_1, \ldots, X_n$ are i.i.d. random variables with
$$P(X_i = j) = \frac{1}{10} \quad \text{for } j = 0, 1, 2, \ldots, 9$$
and define
$$U_n = \sum_{k=1}^n \frac{X_k}{10^k}.$$
$U_n$ can be thought of as the first $n$ digits of a decimal representation of a number between 0 and 1 ($U_n = 0.X_1X_2X_3X_4\cdots X_n$). It turns out that $U_n$ tends in distribution to a Uniform distribution on the interval $[0,1]$. To see this, note that each outcome of $(X_1, \ldots, X_n)$ produces a unique value of $U_n$; these possible values are $j/10^n$ for $j = 0, 1, 2, \ldots, 10^n - 1$, and so it follows that
$$P(U_n = j/10^n) = \frac{1}{10^n} \quad \text{for } j = 0, 1, 2, \ldots, 10^n - 1.$$
If $j/10^n \le x < (j+1)/10^n$ then
$$P(U_n \le x) = \frac{j+1}{10^n}$$
and so
$$|P(U_n \le x) - x| \le 10^{-n} \to 0 \quad \text{as } n \to \infty$$
and so $P(U_n \le x) \to x$ for each $x$ between 0 and 1.
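For illustration, a short simulation sketch (with arbitrary choices of $n$, the number of replications, and the seed) comparing the distribution function of $U_n$ with that of the Uniform distribution:

import numpy as np

rng = np.random.default_rng(1)
n, replications = 6, 5000

digits = rng.integers(0, 10, size=(replications, n))   # X_1,...,X_n in {0,...,9}
weights = 10.0 ** -np.arange(1, n + 1)                  # 1/10, 1/100, ...
u_n = digits @ weights                                  # U_n = sum of X_k / 10^k

for x in (0.25, 0.5, 0.75):
    # P(U_n <= x) should be within 10^(-n) of x
    print(x, np.mean(u_n <= x))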
Some important results
We noted above that convergence in probability deals with convergence of the random variables themselves while convergence in distribution deals with convergence of the distribution functions. The following result shows that convergence in probability is stronger than convergence in distribution unless the limiting random variable is a constant, in which case the two are equivalent.
THEOREM 3.1 Let $\{X_n\}$, $X$ be random variables.
(a) If $X_n \to_p X$ then $X_n \to_d X$.
(b) If $X_n \to_d \theta$ (a constant) then $X_n \to_p \theta$.
Proof. (a) Let $x$ be a continuity point of the distribution function of $X$. Then for any $\epsilon > 0$,
$$P(X_n \le x) = P(X_n \le x, |X_n - X| \le \epsilon) + P(X_n \le x, |X_n - X| > \epsilon) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon)$$
where the latter inequality follows since $[X_n \le x, |X_n - X| \le \epsilon]$ implies $[X \le x + \epsilon]$. Similarly,
$$P(X \le x - \epsilon) \le P(X_n \le x) + P(|X_n - X| > \epsilon)$$
and so
$$P(X_n \le x) \ge P(X \le x - \epsilon) - P(|X_n - X| > \epsilon).$$
Thus putting the two inequalities for $P(X_n \le x)$ together, we have
$$P(X \le x - \epsilon) - P(|X_n - X| > \epsilon) \le P(X_n \le x) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon).$$
By hypothesis, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$ for any $\epsilon > 0$. Moreover, since $x$ is a continuity point of the distribution function of $X$, $P(X \le x \pm \epsilon)$ can be made arbitrarily close to $P(X \le x)$ by taking $\epsilon$ close to 0. Hence,
$$\lim_{n\to\infty} P(X_n \le x) = P(X \le x).$$
(b) Define $F(x)$ to be the distribution function of the degenerate random variable taking the single value $\theta$; thus, $F(x) = 0$ for $x < \theta$ and $F(x) = 1$ for $x \ge \theta$. Note that $F$ is continuous at all but one point. Then
$$P(|X_n - \theta| > \epsilon) = P(X_n > \theta + \epsilon) + P(X_n < \theta - \epsilon) \le 1 - P(X_n \le \theta + \epsilon) + P(X_n \le \theta - \epsilon).$$
However, since $X_n \to_d \theta$, it follows that $P(X_n \le \theta + \epsilon) \to 1$ and $P(X_n \le \theta - \epsilon) \to 0$ as $n \to \infty$, and so $P(|X_n - \theta| > \epsilon) \to 0$.
It is often difficult (if not impossible) to verify the convergence of a sequence of random variables using simply its definition. Theorems 3.2, 3.3 and 3.4 are sometimes useful for showing convergence in such cases.
THEOREM 3.2 (Continuous Mapping Theorem) Suppose that $g(x)$ is a continuous real-valued function.
(a) If $X_n \to_p X$ then $g(X_n) \to_p g(X)$.
(b) If $X_n \to_d X$ then $g(X_n) \to_d g(X)$.
The proofs will not be given here. The proof of (a) is sketched as an exercise. The proof of (b) is somewhat more technical; however, if we further assume $g$ to be strictly increasing or decreasing (so that $g$ has an inverse function), a simple proof of (b) can be given. (Also see Example 3.16 for a simple proof assuming more technical machinery.) The assumption of continuity can also be relaxed somewhat. For example, Theorem 3.2 will hold if $g$ has a finite or countable number of discontinuities provided that these discontinuity points are continuity points of the distribution function of $X$. For example, if $X_n \to_d \theta$ (a constant) and $g(x)$ is continuous at $x = \theta$ then $g(X_n) \to_d g(\theta)$.
THEOREM 3.3 (Slutsky's Theorem) Suppose that $X_n \to_d X$ and $Y_n \to_p \theta$ (a constant). Then
(a) $X_n + Y_n \to_d X + \theta$.
(b) $X_n Y_n \to_d \theta X$.
Proof. (a) Without loss of generality, let $\theta = 0$. (If $\theta \ne 0$ then $X_n + Y_n = (X_n + \theta) + (Y_n - \theta)$ and $Y_n - \theta \to_p 0$.) Let $x$ be a continuity point of the distribution function of $X$. Then
$$P(X_n + Y_n \le x) = P(X_n + Y_n \le x, |Y_n| \le \epsilon) + P(X_n + Y_n \le x, |Y_n| > \epsilon) \le P(X_n \le x + \epsilon) + P(|Y_n| > \epsilon).$$
Also,
$$P(X_n \le x - \epsilon) = P(X_n \le x - \epsilon, |Y_n| \le \epsilon) + P(X_n \le x - \epsilon, |Y_n| > \epsilon) \le P(X_n + Y_n \le x) + P(|Y_n| > \epsilon)$$
(since $[X_n \le x - \epsilon, |Y_n| \le \epsilon]$ implies $[X_n + Y_n \le x]$). Hence,
$$P(X_n \le x - \epsilon) - P(|Y_n| > \epsilon) \le P(X_n + Y_n \le x) \le P(X_n \le x + \epsilon) + P(|Y_n| > \epsilon).$$
Now take $x \pm \epsilon$ to be continuity points of the distribution function of $X$. Then
$$\lim_{n\to\infty} P(X_n \le x \pm \epsilon) = P(X \le x \pm \epsilon)$$
and the limit can be made arbitrarily close to $P(X \le x)$ by taking $\epsilon$ to 0. Since $P(|Y_n| > \epsilon) \to 0$ as $n \to \infty$, the conclusion follows.
(b) Again we will assume that $\theta = 0$. (To see that it suffices to consider this single case, note that $X_n Y_n = X_n(Y_n - \theta) + \theta X_n$. Since $\theta X_n \to_d \theta X$, the conclusion will follow from part (a) if we show that $X_n(Y_n - \theta) \to_p 0$.) We need to show that $X_n Y_n \to_p 0$.
Taking $\epsilon > 0$ and $M > 0$, we have
$$P(|X_n Y_n| > \epsilon) = P(|X_n Y_n| > \epsilon, |Y_n| \le 1/M) + P(|X_n Y_n| > \epsilon, |Y_n| > 1/M) \le P(|X_n Y_n| > \epsilon, |Y_n| \le 1/M) + P(|Y_n| > 1/M) \le P(|X_n| > \epsilon M) + P(|Y_n| > 1/M).$$
Since $Y_n \to_p 0$, $P(|Y_n| > 1/M) \to 0$ as $n \to \infty$ for any fixed $M > 0$. Now take $\epsilon$ and $M$ such that $\pm\epsilon M$ are continuity points of the distribution function of $X$; then $P(|X_n| > \epsilon M) \to P(|X| > \epsilon M)$ and the limit can be made arbitrarily close to 0 by making $M$ sufficiently large.
Since $Y_n \to_p \theta$ is equivalent to $Y_n \to_d \theta$ when $\theta$ is a constant, we could replace $Y_n \to_p \theta$ by $Y_n \to_d \theta$ in the statement of Slutsky's Theorem. We can also generalize this result as follows. Suppose that $g(x, y)$ is a continuous function and that $X_n \to_d X$ and $Y_n \to_p \theta$ for some constant $\theta$. Then it can be shown that $g(X_n, Y_n) \to_d g(X, \theta)$. In fact, this result is sometimes referred to as Slutsky's Theorem, with Theorem 3.3 a special case for $g(x, y) = x + y$ and $g(x, y) = xy$.
THEOREM 3.4 (The Delta Method) Suppose that
$$a_n(X_n - \theta) \to_d Z$$
where $\theta$ is a constant and $\{a_n\}$ is a sequence of constants with $a_n \to \infty$. If $g(x)$ is a function with derivative $g'(\theta)$ at $x = \theta$ then
$$a_n(g(X_n) - g(\theta)) \to_d g'(\theta)Z.$$
Proof. We'll start by assuming that $g$ is continuously differentiable at $\theta$. First, note that $X_n \to_p \theta$. (This follows from Slutsky's Theorem.) By a Taylor series expansion of $g(x)$ around $x = \theta$, we have
$$g(X_n) = g(\theta) + g'(\theta_n^*)(X_n - \theta)$$
where $\theta_n^*$ lies between $X_n$ and $\theta$; thus $|\theta_n^* - \theta| \le |X_n - \theta|$ and so $\theta_n^* \to_p \theta$. Since $g'(x)$ is continuous at $x = \theta$, it follows that $g'(\theta_n^*) \to_p g'(\theta)$. Now,
$$a_n(g(X_n) - g(\theta)) = g'(\theta_n^*)\,a_n(X_n - \theta) \to_d g'(\theta)Z$$
by Slutsky's Theorem. For the more general case (where $g$ is not necessarily continuously differentiable at $\theta$), note that
$$g(X_n) - g(\theta) = g'(\theta)(X_n - \theta) + R_n$$
where $R_n/(X_n - \theta) \to_p 0$. Thus
$$a_n R_n = a_n(X_n - \theta)\frac{R_n}{X_n - \theta} \to_p 0$$
and so the conclusion follows by Slutsky's Theorem.
A neater proof of the Delta Method is given in Example 3.17. Also note that if $g'(\theta) = 0$, we would have that $a_n(g(X_n) - g(\theta)) \to_p 0$. In this case, we may have
$$a_n^k(g(X_n) - g(\theta)) \to_d \text{some } V$$
for some $k \ge 2$; see Problem 3.10 for details.
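For illustration, a minimal numerical check of the Delta Method (assuming Exponential data with mean $\theta = 2$ and the arbitrary choice $g(x) = x^2$): the simulated standard deviation of $\sqrt{n}(g(\bar X_n) - g(\theta))$ should be close to $|g'(\theta)|\sigma$.

import numpy as np

rng = np.random.default_rng(2)
n, replications, theta = 400, 20000, 2.0

x = rng.exponential(scale=theta, size=(replications, n))
xbar = x.mean(axis=1)

# Delta Method: sqrt(n)*(g(xbar) - g(theta)) is approximately N(0, (g'(theta))^2 * sigma^2)
z = np.sqrt(n) * (xbar ** 2 - theta ** 2)
limit_sd = 2 * theta * theta      # |g'(theta)| * sigma, with sigma = theta for the Exponential
print("simulated sd:", round(z.std(), 3), "  limiting sd:", limit_sd)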
If $X_n \to_d X$ (or $X_n \to_p X$), it is tempting to say that $E(X_n) \to E(X)$; however, this statement is not true in general. For example, suppose that $P(X_n = 0) = 1 - n^{-1}$ and $P(X_n = n) = n^{-1}$. Then $X_n \to_p 0$ but $E(X_n) = 1$ for all $n$ (and so converges to 1). To ensure convergence of moments, additional conditions are needed; these conditions effectively bound the amount of probability mass in the distribution of $X_n$ concentrated near $\pm\infty$ for large $n$. The following result deals with the simple case where the random variables $\{X_n\}$ are uniformly bounded; that is, there exists a constant $M$ such that $P(|X_n| \le M) = 1$ for all $n$.
THEOREM 3.5 If $X_n \to_d X$ and $|X_n| \le M$ (finite) then $E(X)$ exists and $E(X_n) \to E(X)$.
Proof. For simplicity, assume that $X_n$ is nonnegative for all $n$; the general result will follow by considering the positive and negative parts of $X_n$. From Chapter 1, we have that
$$|E(X_n) - E(X)| = \left|\int_0^\infty (P(X_n > x) - P(X > x))\,dx\right| = \left|\int_0^M (P(X_n > x) - P(X > x))\,dx\right|$$
(since $P(X_n > M) = P(X > M) = 0$)
$$\le \int_0^M |P(X_n > x) - P(X > x)|\,dx \to 0$$
since $P(X_n > x) \to P(X > x)$ for all but a countable number of $x$'s and the interval of integration is bounded.
3.3 Weak Law of Large Numbers
An important result in probability theory is the Weak Law of Large Numbers (WLLN), which deals with the convergence of the sample mean to the population mean as the sample size increases. We start by considering the simple case where $X_1, \ldots, X_n$ are i.i.d. Bernoulli random variables with $P(X_i = 1) = \theta$ and $P(X_i = 0) = 1 - \theta$ so that $E(X_i) = \theta$. Define $S_n = X_1 + \cdots + X_n$, which has a Binomial distribution with parameters $n$ and $\theta$. We now consider the behaviour of $S_n/n$ as $n \to \infty$; $S_n/n$ represents the proportion of 1's in the $n$ Bernoulli trials. Our intuition tells us that for large $n$, this proportion should be approximately equal to $\theta$, the probability that any $X_i = 1$. Indeed, since the distribution of $S_n/n$ is known, it is possible to show the following law of large numbers:
$$\frac{S_n}{n} \to_p \theta \quad \text{as } n \to \infty.$$
In general, the WLLN applies to any sequence of independent, identically distributed random variables whose mean exists. The result can be stated as follows:
THEOREM 3.6 (Weak Law of Large Numbers) Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables with $E(X_i) = \mu$ (where $E(|X_i|) < \infty$). Then
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i \to_p \mu \quad \text{as } n \to \infty.$$
While this result certainly agrees with intuition, a rigorous proof of the result is certainly not obvious. However, before proving the WLLN, we will give a non-trivial application of it by proving that the sample median of i.i.d. random variables $X_1, \ldots, X_n$ converges in probability to the population median.
EXAMPLE 3.3: Suppose that $X_1, \ldots, X_n$ are i.i.d. random variables with a distribution function $F(x)$. Assume that the $X_i$'s have a unique median $\mu$ ($F(\mu) = 1/2$); in particular, this implies that for any $\epsilon > 0$, $F(\mu + \epsilon) > 1/2$ and $F(\mu - \epsilon) < 1/2$.
Let $X_{(1)}, \ldots, X_{(n)}$ be the order statistics of the $X_i$'s and define $Z_n = X_{(m_n)}$ where $\{m_n\}$ is a sequence of positive integers with $m_n/n \to 1/2$ as $n \to \infty$. For example, we could take $m_n = n/2$ if $n$ is even and $m_n = (n+1)/2$ if $n$ is odd; in this case, $Z_n$ is essentially the sample median of the $X_i$'s. We will show that $Z_n \to_p \mu$ as $n \to \infty$.
Take $\epsilon > 0$. Then we have
$$P(Z_n > \mu + \epsilon) = P\left(\frac{1}{n}\sum_{i=1}^n I(X_i > \mu + \epsilon) \ge \frac{n - m_n + 1}{n}\right)$$
and
$$P(Z_n < \mu - \epsilon) \le P\left(\frac{1}{n}\sum_{i=1}^n I(X_i > \mu - \epsilon) \le \frac{n - m_n}{n}\right).$$
By the WLLN, we have
$$\frac{1}{n}\sum_{i=1}^n I(X_i > \mu + \epsilon) \to_p 1 - F(\mu + \epsilon) < 1/2$$
and
$$\frac{1}{n}\sum_{i=1}^n I(X_i > \mu - \epsilon) \to_p 1 - F(\mu - \epsilon) > 1/2.$$
Since $m_n/n \to 1/2$, it follows that $P(Z_n > \mu + \epsilon) \to 0$ and $P(Z_n < \mu - \epsilon) \to 0$ as $n \to \infty$, and so $Z_n \to_p \mu$.
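For illustration, a small simulation sketch of Example 3.3 (assuming Exponential(1) data, whose median is $\mu = \ln 2$; the sample sizes and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
mu = np.log(2.0)      # median of the Exponential(1) distribution

for n in (20, 200, 2000):
    x = rng.exponential(size=(5000, n))
    z = np.median(x, axis=1)                    # sample medians Z_n
    print(n, np.mean(np.abs(z - mu) > 0.05))    # P(|Z_n - mu| > 0.05) shrinks toward 0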
Proving the WLLN
The key to proving the WLLN lies in finding a good bound for $P[|\bar X_n - \mu| > \epsilon]$; one such bound is Chebyshev's inequality.
THEOREM 3.7 (Chebyshev's inequality) Suppose that $X$ is a random variable with $E(X^2) < \infty$. Then for any $\epsilon > 0$,
$$P[|X| > \epsilon] \le \frac{E(X^2)}{\epsilon^2}.$$
Proof. The key is to write $X^2 = X^2 I(|X| \le \epsilon) + X^2 I(|X| > \epsilon)$. Then
$$E(X^2) = E[X^2 I(|X| \le \epsilon)] + E[X^2 I(|X| > \epsilon)] \ge E[X^2 I(|X| > \epsilon)] \ge \epsilon^2 P(|X| > \epsilon)$$
where the last inequality holds since $X^2 \ge \epsilon^2$ when $|X| > \epsilon$ and $E[I(|X| > \epsilon)] = P(|X| > \epsilon)$.
From the proof, it is quite easy to see that Chebyshev's inequality remains valid if $P[|X| > \epsilon]$ is replaced by $P[|X| \ge \epsilon]$.
Chebyshev's inequality is primarily used as a tool for proving various convergence results for sequences of random variables; for example, if $\{X_n\}$ is a sequence of random variables with $E(X_n^2) \to 0$ then Chebyshev's inequality implies that $X_n \to_p 0$. However, Chebyshev's inequality can also be used to give probability bounds for random variables. For example, let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Then by Chebyshev's inequality, we have
$$P[|X - \mu| \le k\sigma] \ge 1 - \frac{E[(X - \mu)^2]}{k^2\sigma^2} = 1 - \frac{1}{k^2}.$$
However, the bounds given by Chebyshev's inequality are typically very crude and are seldom of any practical use. Chebyshev's inequality can also be generalized in a number of ways; these generalizations are examined in Problem 3.8.
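For illustration, a short sketch (assuming $X$ is Normal and using scipy.stats.norm) comparing the Chebyshev bound $1/k^2$ with the exact tail probability $P[|X - \mu| \ge k\sigma]$, which shows how crude the bound typically is:

from scipy.stats import norm

for k in (1, 2, 3, 4):
    chebyshev = 1.0 / k ** 2     # Chebyshev bound on P(|X - mu| >= k*sigma)
    exact = 2 * norm.sf(k)       # exact tail probability when X is Normal
    print(k, round(chebyshev, 4), round(exact, 6))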
We will now sketch the proof of the WLLN. First of all, we will assume that $E(X_i^2) < \infty$. In this case, the WLLN follows trivially since (by Chebyshev's inequality)
$$P[|\bar X_n - \mu| > \epsilon] \le \frac{\mathrm{Var}(\bar X_n)}{\epsilon^2} = \frac{\mathrm{Var}(X_1)}{n\epsilon^2}$$
and the latter quantity tends to 0 as $n \to \infty$ for each $\epsilon > 0$.
How can the weak law of large numbers be proved if we assume only that $E(|X_i|) < \infty$? The answer is to write
$$X_k = U_{nk} + V_{nk}$$
where $U_{nk} = X_k$ if $|X_k| \le \delta n$ (for some $0 < \delta < 1$) and $U_{nk} = 0$ otherwise; it follows that $V_{nk} = X_k$ if $|X_k| > \delta n$ and 0 otherwise. Then
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n U_{ni} + \frac{1}{n}\sum_{i=1}^n V_{ni} = \bar U_n + \bar V_n$$
and so it suffices to show that $\bar U_n \to_p \mu$ and $\bar V_n \to_p 0$. First, we have
$$E[(\bar U_n - \mu)^2] = \frac{\mathrm{Var}(U_{n1})}{n} + (E(U_{n1}) - \mu)^2 \le \frac{E(U_{n1}^2)}{n} + (E(U_{n1}) - \mu)^2 \le \delta E[|U_{n1}|] + E^2(U_{n1} - X_1) \le \delta E[|X_1|] + E^2[|X_1|I(|X_1| > \delta n)]$$
and so by Chebyshev's inequality
$$P[|\bar U_n - \mu| > \epsilon] \le \frac{\delta E[|X_1|] + E^2[|X_1|I(|X_1| > \delta n)]}{\epsilon^2},$$
which can be made close to 0 by taking $n \to \infty$ and then $\delta$ to 0.
Second,
$$P[|\bar V_n| > \epsilon] \le P\left(\bigcup_{i=1}^n [V_{ni} \ne 0]\right) \le \sum_{i=1}^n P[V_{ni} \ne 0] = nP[|X_1| > \delta n]$$
and the latter can be shown to tend to 0 as $n \to \infty$ (for any $\delta > 0$). The details of the proof are left as exercises.
The WLLN can be strengthened to a strong law of large numbers (SLLN) by introducing another type of convergence known as convergence with probability 1 (or almost sure convergence). This is discussed in section 3.7.
The WLLN for Bernoulli random variables was proved by Jacob Bernoulli in 1713 and strengthened to random variables with finite variance by Chebyshev in 1867 using the inequality that bears his name. Chebyshev's result was extended by Khinchin in 1928 to sums of i.i.d. random variables with finite first moment.
3.4 Proving convergence in distribution
Recall that a sequence of random variables $\{X_n\}$ converges in distribution to a random variable $X$ if the corresponding sequence of distribution functions $\{F_n(x)\}$ converges to $F(x)$, the distribution function of $X$, at each continuity point of $F$. It is often difficult to verify this condition directly for a number of reasons. For example, it is often difficult to work with the distribution functions $\{F_n\}$. Also, in many cases, the distribution function $F_n$ may not be specified exactly but may belong to a wider class; we may know, for example, the mean and variance corresponding to $F_n$ but little else about $F_n$. (From a practical point of view, the cases where $F_n$ is not known exactly are most interesting; if $F_n$ is known exactly, there is really no reason to worry about a limiting distribution $F$ unless $F_n$ is difficult to work with computationally.)
For these reasons, we would like to have alternative methods for establishing convergence in distribution. Fortunately, there are several other sufficient conditions for convergence in distribution that are useful in practice for verifying that a sequence of random variables converges in distribution and determining the distribution of the limiting random variable (the limiting distribution).
Suppose that $X_n$ has density function $f_n$ (for $n \ge 1$) and $X$ has density function $f$. Then $f_n(x) \to f(x)$ (for all but a countable number of $x$) implies that $X_n \to_d X$. Similarly, if $X_n$ has frequency function $f_n$ and $X$ has frequency function $f$ then $f_n(x) \to f(x)$ (for all $x$) implies that $X_n \to_d X$. (This result is known as Scheff&eacute;'s Theorem.) The converse of this result is not true; in fact, a sequence of discrete random variables can converge in distribution to a continuous random variable (see Example 3.2) and a sequence of continuous random variables can converge in distribution to a discrete random variable.
If $X_n$ has moment generating function $m_n(t)$ and $X$ has moment generating function $m(t)$ then $m_n(t) \to m(t)$ (for all $|t| \le$ some $b > 0$) implies $X_n \to_d X$. Convergence of moment generating functions is actually quite strong (in fact, it implies that $E(X_n^k) \to E(X^k)$ for integers $k \ge 1$); convergence in distribution does not require convergence of moment generating functions.
It is also possible to substitute other generating functions for the moment generating function to prove convergence in distribution. For example, if $X_n$ has characteristic function $\varphi_n(t) = E[\exp(itX_n)]$ and $X$ has characteristic function $\varphi(t)$ then $\varphi_n(t) \to \varphi(t)$ (for all $t$) implies $X_n \to_d X$; in fact, $X_n \to_d X$ if, and only if, $\varphi_n(t) \to \varphi(t)$ for all $t$.
In addition to the methods described above, we can also use some of the results given earlier (for example, Slutsky's Theorem and the Delta Method) to help establish convergence in distribution.
EXAMPLE 3.4: Suppose that $\{X_n\}$ is a sequence of random variables where $X_n$ has Student's $t$ distribution with $n$ degrees of freedom. The density function of $X_n$ is
$$f_n(x) = \frac{\Gamma((n+1)/2)}{\sqrt{\pi n}\,\Gamma(n/2)}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}.$$
Stirling's approximation, which may be stated as
$$\lim_{y\to\infty}\frac{\sqrt{y}\,\Gamma(y)}{\sqrt{2\pi}\exp(-y)y^y} = 1,$$
allows us to approximate $\Gamma((n+1)/2)$ and $\Gamma(n/2)$ for large $n$. We then get
$$\lim_{n\to\infty}\frac{\Gamma((n+1)/2)}{\sqrt{\pi n}\,\Gamma(n/2)} = \frac{1}{\sqrt{2\pi}}.$$
Also
$$\lim_{n\to\infty}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} = \exp\left(-\frac{x^2}{2}\right)$$
and so
$$\lim_{n\to\infty} f_n(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$
where the limit is a standard Normal density function. Thus $X_n \to_d Z$ where $Z$ has a standard Normal distribution.
An alternative (and much simpler) approach is to note that the $t$ distributed random variable $X_n$ has the same distribution as $Z/\sqrt{V_n/n}$ where $Z$ and $V_n$ are independent random variables with $Z \sim N(0,1)$ and $V_n \sim \chi^2(n)$. Since $V_n$ can be thought of as a sum of $n$ i.i.d. $\chi^2$ random variables with 1 degree of freedom, it follows from the WLLN and the Continuous Mapping Theorem that $\sqrt{V_n/n} \to_p 1$ and hence from Slutsky's Theorem that $Z/\sqrt{V_n/n} \to_d N(0,1)$. The conclusion follows since $X_n$ has the same distribution as $Z/\sqrt{V_n/n}$.
EXAMPLE 3.5: Suppose that $U_1, \ldots, U_n$ are i.i.d. Uniform random variables on the interval $[0,1]$ and let $U_{(1)}, \ldots, U_{(n)}$ be their order statistics. Define $Z_n = U_{(m_n)}$ where $m_n \approx n/2$ in the sense that $\sqrt{n}(m_n/n - 1/2) \to 0$ as $n \to \infty$; note that we are requiring that $m_n/n$ converge to 1/2 at a faster rate than in Example 3.3. (Note that taking $m_n = n/2$ for $n$ even and $m_n = (n+1)/2$ for $n$ odd will satisfy this condition.) We will consider the asymptotic distribution of the sequence of random variables $\{\sqrt{n}(Z_n - 1/2)\}$ by computing its limiting density. The density of $\sqrt{n}(Z_n - 1/2)$ is
$$f_n(x) = \frac{n!}{(m_n - 1)!(n - m_n)!}\frac{1}{\sqrt{n}}\left(\frac{1}{2} + \frac{x}{\sqrt{n}}\right)^{m_n-1}\left(\frac{1}{2} - \frac{x}{\sqrt{n}}\right)^{n-m_n}$$
for $-\sqrt{n}/2 \le x \le \sqrt{n}/2$. We will show that $f_n(x)$ converges to a Normal density with mean 0 and variance 1/4. First, using Stirling's approximation (as in Example 3.4, noting that $n! = n\Gamma(n)$), we obtain
$$\frac{n!}{\sqrt{n}\,(m_n - 1)!(n - m_n)!} \approx \frac{2^n}{\sqrt{2\pi}}$$
in the sense that the ratio of the right-hand to left-hand side tends to 1 as $n \to \infty$. We also have
$$\left(\frac{1}{2} + \frac{x}{\sqrt{n}}\right)^{m_n-1}\left(\frac{1}{2} - \frac{x}{\sqrt{n}}\right)^{n-m_n} = \frac{1}{2^{n-1}}\left(1 - \frac{4x^2}{n}\right)^{m_n-1}\left(1 - \frac{2x}{\sqrt{n}}\right)^{n-2m_n+1}.$$
We now obtain
$$\left(1 - \frac{4x^2}{n}\right)^{m_n-1} \to \exp(-2x^2)$$
and
$$\left(1 - \frac{2x}{\sqrt{n}}\right)^{n-2m_n+1} \to 1$$
where, in both cases, we use the fact that $(1 + t/a_n)^{c_n} \to \exp(kt)$ if $a_n \to \infty$ and $c_n/a_n \to k$. Putting the pieces from above together, we get
$$f_n(x) \to \frac{2}{\sqrt{2\pi}}\exp(-2x^2)$$
for any $x$. Thus $\sqrt{n}(Z_n - 1/2) \to_d N(0, 1/4)$.
EXAMPLE 3.6: We can easily extend Example 3.5 to the case where $X_1, \ldots, X_n$ are i.i.d. random variables with distribution function $F$ and unique median $\mu$, where $F(x)$ is differentiable at $x = \mu$ with $F'(\mu) > 0$; if $F$ has a density $f$ then $F'(\mu) = f(\mu)$ typically. Defining $F^{-1}(t) = \inf\{x : F(x) \ge t\}$ to be the inverse of $F$, we note that the order statistic $X_{(k)}$ has the same distribution as $F^{-1}(U_{(k)})$ where $U_{(k)}$ is an order statistic from an i.i.d. sample of Uniform random variables on $[0,1]$, and also that $F^{-1}(1/2) = \mu$. Thus
$$\sqrt{n}(X_{(m_n)} - \mu) =_d \sqrt{n}\left(F^{-1}(U_{(m_n)}) - F^{-1}(1/2)\right)$$
and so by the Delta Method, we have
$$\sqrt{n}(X_{(m_n)} - \mu) \to_d N\left(0, (F'(\mu))^{-2}/4\right)$$
if $\sqrt{n}(m_n/n - 1/2) \to 0$. The limiting variance follows from the fact that $F^{-1}(t)$ is differentiable at $t = 1/2$ with derivative $1/F'(\mu)$. Note that existence of a density is not sufficient to imply existence of the derivative of $F(x)$ at $x = \mu$; however, if $F$ is continuous but not differentiable at $x = \mu$ then $\sqrt{n}(X_{(m_n)} - \mu)$ may still converge in distribution but the limiting distribution will be different.
As an illustration of the convergence of the distribution of the sample median to a Normal distribution, we will consider the density of the order statistic $X_{(n/2)}$ for i.i.d. Exponential random variables $X_1, \ldots, X_n$ with density
$$f(x) = \exp(-x) \quad \text{for } x \ge 0.$$
Figures 3.1, 3.2, and 3.3 give the densities of $X_{(n/2)}$ for $n = 10$, 50, and 100 respectively; the corresponding approximating Normal density is indicated with dotted lines.
[Figure 3.1 Density of $X_{(n/2)}$ for $n = 10$ Exponential random variables; the dotted line is the approximating Normal density.]
EXAMPLE 3.7: Suppose that $\{X_n\}$ is a sequence of Binomial random variables with $X_n$ having parameters $n$ and $\theta_n$ where $n\theta_n \to \lambda > 0$ as $n \to \infty$. The moment generating function of $X_n$ is
$$m_n(t) = (1 + \theta_n(\exp(t) - 1))^n = \left(1 + \frac{n\theta_n(\exp(t) - 1)}{n}\right)^n.$$
[Figure 3.2 Density of $X_{(n/2)}$ for $n = 50$ Exponential random variables; the dotted line is the approximating Normal density.]
Since $n\theta_n \to \lambda$, it follows that
$$\lim_{n\to\infty} m_n(t) = \exp[\lambda(\exp(t) - 1)]$$
where the limiting moment generating function is that of a Poisson distribution with parameter $\lambda$. Thus $X_n \to_d X$ where $X$ has a Poisson distribution with parameter $\lambda$. This result can be used to compute Binomial probabilities when $n$ is large and $\theta$ is small so that $n\theta \approx n\theta(1-\theta)$. For example, suppose that $X$ has a Binomial distribution with $n = 100$ and $\theta = 0.05$. Then using the Poisson approximation
$$P[a \le X \le b] \approx \sum_{a \le x \le b} \frac{\exp(-n\theta)(n\theta)^x}{x!}$$
we get, for example, $P[4 \le X \le 6] \approx 0.497$ compared to the exact probability $P[4 \le X \le 6] = 0.508$.
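For illustration, the two probabilities quoted above can be recomputed with scipy.stats (a minimal sketch):

from scipy.stats import binom, poisson

n, theta = 100, 0.05
a, b = 4, 6

exact = binom.cdf(b, n, theta) - binom.cdf(a - 1, n, theta)
approx = poisson.cdf(b, n * theta) - poisson.cdf(a - 1, n * theta)
print("exact:", round(exact, 3), " Poisson approximation:", round(approx, 3))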
[Figure 3.3 Density of $X_{(n/2)}$ for $n = 100$ Exponential random variables; the dotted line is the approximating Normal density.]
EXAMPLE 3.8: As in Example 3.2, define $U_n$ by
$$U_n = \sum_{k=1}^n \frac{X_k}{10^k}$$
where $X_1, X_2, \ldots$ are i.i.d. discrete random variables uniformly distributed on the integers $0, 1, \ldots, 9$. We showed in Example 3.2 that $U_n \to_d \mathrm{Unif}(0,1)$ by showing convergence of the distribution functions. In this example, we will do the same using moment generating functions. The moment generating function of each $X_k$ is
$$m(t) = \frac{1}{10}(1 + \exp(t) + \cdots + \exp(9t)) = \frac{\exp(10t) - 1}{10(\exp(t) - 1)}$$
and so the moment generating function of $U_n$ is
$$m_n(t) = \prod_{k=1}^n m(t/10^k) = \prod_{k=1}^n\left(\frac{\exp(t/10^{k-1}) - 1}{10(\exp(t/10^k) - 1)}\right) = \frac{\exp(t) - 1}{10^n(\exp(t/10^n) - 1)}.$$
Using the expansion $\exp(x) = 1 + x + x^2/2 + \cdots$, it follows that
$$\lim_{n\to\infty} 10^n(\exp(t/10^n) - 1) = t$$
and so
$$\lim_{n\to\infty} m_n(t) = \frac{1}{t}(\exp(t) - 1) = \int_0^1 \exp(tx)\,dx,$$
which is the moment generating function of the Uniform distribution on $[0,1]$. Thus we have shown (using moment generating functions) that $U_n \to_d \mathrm{Unif}(0,1)$.
3.5 Central Limit Theorems
In probability theory, central limit theorems (CLTs) establish conditions under which the distribution of a sum of random variables may be approximated by a Normal distribution. (We have seen already in Examples 3.4, 3.5, and 3.6 cases where the limiting distribution is Normal.) A wide variety of CLTs have been proved; however, we will consider CLTs only for sums and weighted sums of i.i.d. random variables.
THEOREM 3.8 (CLT for i.i.d. random variables) Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2 < \infty$ and define
$$S_n = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n (X_i - \mu) = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma}.$$
Then $S_n \to_d Z \sim N(0,1)$ as $n \to \infty$.
(In practical terms, the CLT implies that for large $n$, the distribution of $\bar X_n$ is approximately Normal with mean $\mu$ and variance $\sigma^2/n$.)
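For illustration, a minimal simulation sketch of Theorem 3.8 (assuming Exponential(1) summands, so $\mu = \sigma = 1$; the sample size, number of replications, and seed are arbitrary), comparing $P(S_n \le q)$ with $\Phi(q)$:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, replications = 50, 20000
mu, sigma = 1.0, 1.0      # mean and standard deviation of the Exponential(1) distribution

x = rng.exponential(size=(replications, n))
s = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma    # S_n for each replication

for q in (-1.0, 0.0, 1.0):
    # simulated P(S_n <= q) versus the standard Normal distribution function
    print(q, round(np.mean(s <= q), 3), round(norm.cdf(q), 3))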
Before discussing the proof of this CLT, we will give a little of the history behind the result. The French-born mathematician de Moivre is usually credited with proving the first CLT (in the 18th century); this CLT dealt with the special case that the $X_i$'s were Bernoulli random variables (that is, $P[X_i = 1] = \theta$ and $P[X_i = 0] = 1 - \theta$). His work (de Moivre, 1738) was not significantly improved until Laplace (1810) extended de Moivre's work to sums of independent bounded random variables. The Russian mathematician Chebyshev extended Laplace's work to sums of random variables with finite moments $E(|X_i|^k)$ for all $k \ge 1$. However, it was not until the early twentieth century that Markov and Lyapunov (who were students of Chebyshev) removed nearly all unnecessary moment restrictions. Finally, Lindeberg (1922) proved the CLT assuming only finite variances. It should be noted that most of the work subsequent to Laplace dealt with sums of independent (but not necessarily identically distributed) random variables; it turns out that this added generality does not pose any great technical complications.
Proving the CLT
We will consider two proofs of the CLT for i.i.d. random variables. In the first proof, we will assume the existence of the moment generating function of the $X_i$'s and show that the moment generating function of $S_n$ converges to the moment generating function of $Z$. Of course, assuming the existence of moment generating functions implies that $E(X_i^k)$ exists for all integers $k \ge 1$. The second proof will require only that $E(X_i^2)$ is finite and will show directly that $P[S_n \le x] \to P[Z \le x]$.
We can assume (without loss of generality) that $E(X_i) = 0$ and $\mathrm{Var}(X_i) = 1$. Let $m(t) = E[\exp(tX_i)]$ be the moment generating function of the $X_i$'s. Then
$$m(t) = 1 + \frac{t^2}{2} + \frac{t^3E(X_i^3)}{6} + \cdots$$
(since $E(X_i) = 0$ and $E(X_i^2) = 1$) and the moment generating function of $S_n$ is
$$m_n(t) = [m(t/\sqrt{n})]^n = \left[1 + \frac{t^2}{2n} + \frac{t^3E(X_i^3)}{6n^{3/2}} + \cdots\right]^n = \left[1 + \frac{t^2}{2n}\left(1 + \frac{tE(X_i^3)}{3n^{1/2}} + \frac{t^2E(X_i^4)}{12n} + \cdots\right)\right]^n = \left[1 + \frac{t^2}{2n}r_n(t)\right]^n$$
where $|r_n(t)| < \infty$ and $r_n(t) \to 1$ as $n \to \infty$ for each $|t| \le$ some $b > 0$. Thus
$$\lim_{n\to\infty} m_n(t) = \exp\left(\frac{t^2}{2}\right)$$
and the limit is the moment generating function of a standard Normal random variable. It should be noted that a completely rigorous proof of Theorem 3.8 can be obtained by replacing moment generating functions by characteristic functions in the proof above.
The second proof we give shows directly that
$$P[S_n \le x] \to \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{t^2}{2}\right)dt;$$
we will first assume that $E(|X_i|^3)$ is finite and then indicate what modifications are necessary if we assume only that $E(X_i^2)$ is finite. The method used in this proof may seem at first somewhat complicated but, in fact, is extremely elegant and is actually the method used by Lindeberg to prove his 1922 CLT.
The key to the proof lies in approximating $P[S_n \le x]$ by $E[f_\epsilon^+(S_n)]$ and $E[f_\epsilon^-(S_n)]$ where $f_\epsilon^+$ and $f_\epsilon^-$ are two bounded, continuous functions. In particular, we define $f_\epsilon^+(y) = 1$ for $y \le x$, $f_\epsilon^+(y) = 0$ for $y \ge x + \epsilon$ and $0 \le f_\epsilon^+(y) \le 1$ for $x < y < x + \epsilon$; we define $f_\epsilon^-(y) = f_\epsilon^+(y + \epsilon)$. If
$$g(y) = I(y \le x),$$
it is easy to see that
$$f_\epsilon^-(y) \le g(y) \le f_\epsilon^+(y)$$
and
$$P[S_n \le x] = E[g(S_n)].$$
Then if $Z$ is a standard Normal random variable, we have
$$P[S_n \le x] \le E[f_\epsilon^+(S_n)] = E[f_\epsilon^+(S_n)] - E[f_\epsilon^+(Z)] + E[f_\epsilon^+(Z)] \le |E[f_\epsilon^+(S_n)] - E[f_\epsilon^+(Z)]| + P[Z \le x + \epsilon]$$
and similarly,
$$P[S_n \le x] \ge P[Z \le x - \epsilon] - |E[f_\epsilon^-(S_n)] - E[f_\epsilon^-(Z)]|.$$
Thus we have
$$|E[f_\epsilon^+(S_n)] - E[f_\epsilon^+(Z)]| + P[Z \le x + \epsilon] \ge P[S_n \le x] \ge P[Z \le x - \epsilon] - |E[f_\epsilon^-(S_n)] - E[f_\epsilon^-(Z)]|;$$
since $P[Z \le x \pm \epsilon]$ can be made arbitrarily close to $P(Z \le x)$ (because $Z$ has a continuous distribution function), it suffices to show that $E[f_\epsilon^+(S_n)] \to E[f_\epsilon^+(Z)]$ and $E[f_\epsilon^-(S_n)] \to E[f_\epsilon^-(Z)]$ for suitable choices of $f_\epsilon^+$ and $f_\epsilon^-$. In particular, we will assume that $f_\epsilon^+$ (and hence $f_\epsilon^-$) has three bounded continuous derivatives.
Let $f$ be a bounded function (such as $f_\epsilon^+$ or $f_\epsilon^-$) with three bounded continuous derivatives and let $Z_1, Z_2, Z_3, \ldots$ be a sequence of i.i.d. standard Normal random variables that are also independent of $X_1, X_2, \ldots$; note that $n^{-1/2}(Z_1 + \cdots + Z_n)$ is also standard Normal. Now define random variables $T_{n1}, \ldots, T_{nn}$ where
$$T_{nk} = \frac{1}{\sqrt{n}}\sum_{j=1}^{k-1} Z_j + \frac{1}{\sqrt{n}}\sum_{j=k+1}^n X_j$$
(where a sum is taken to be 0 if the upper limit is less than the lower limit). Then
$$E[f(S_n)] - E[f(Z)] = \sum_{k=1}^n E[f(T_{nk} + n^{-1/2}X_k) - f(T_{nk} + n^{-1/2}Z_k)].$$
Expanding $f(T_{nk} + n^{-1/2}X_k)$ in a Taylor series around $T_{nk}$, we get
$$f(T_{nk} + n^{-1/2}X_k) = f(T_{nk}) + \frac{X_k}{\sqrt{n}}f'(T_{nk}) + \frac{X_k^2}{2n}f''(T_{nk}) + R_{X_k}$$
where $R_{X_k}$ is a remainder term (whose value will depend on the third derivative of $f$); similarly,
$$f(T_{nk} + n^{-1/2}Z_k) = f(T_{nk}) + \frac{Z_k}{\sqrt{n}}f'(T_{nk}) + \frac{Z_k^2}{2n}f''(T_{nk}) + R_{Z_k}.$$
Taking expected values (and noting that $T_{nk}$ is independent of both $X_k$ and $Z_k$), we get
$$E[f(S_n)] - E[f(Z)] = \sum_{k=1}^n [E(R_{X_k}) - E(R_{Z_k})].$$
We now try to find bounds for $R_{X_k}$ and $R_{Z_k}$; it follows that
$$|R_{X_k}| \le \frac{K}{6}\frac{|X_k|^3}{n^{3/2}} \quad\text{and}\quad |R_{Z_k}| \le \frac{K}{6}\frac{|Z_k|^3}{n^{3/2}}$$
where $K$ is an upper bound on $|f'''(y)|$, which (by assumption) is finite. Thus
$$|E[f(S_n)] - E[f(Z)]| \le \sum_{k=1}^n |E(R_{X_k}) - E(R_{Z_k})| \le \sum_{k=1}^n \left[E(|R_{X_k}|) + E(|R_{Z_k}|)\right] \le \frac{K}{6\sqrt{n}}\left(E[|X_1|^3] + E[|Z_1|^3]\right) \to 0$$
as $n \to \infty$ since both $E[|X_1|^3]$ and $E[|Z_1|^3]$ are finite.
Applying the previous result to $f_\epsilon^+$ and $f_\epsilon^-$ having three bounded continuous derivatives, it follows that $E[f_\epsilon^+(S_n)] \to E[f_\epsilon^+(Z)]$ and $E[f_\epsilon^-(S_n)] \to E[f_\epsilon^-(Z)]$. Now since
$$P[Z \le x \pm \epsilon] \to P[Z \le x]$$
as $\epsilon \to 0$, it follows from above that for each $x$,
$$P[S_n \le x] \to P[Z \le x].$$
We have, of course, assumed that $E[|X_i|^3]$ is finite; to extend the result to the case where we assume only that $E(X_i^2)$ is finite, it is necessary to find a more accurate bound on $|R_{X_k}|$. Such a bound is given by
$$|R_{X_k}| \le K^*\left[\frac{|X_k|^3}{6n^{3/2}}I(|X_k| \le \epsilon\sqrt{n}) + \frac{X_k^2}{n}I(|X_k| > \epsilon\sqrt{n})\right]$$
where $K^*$ is an upper bound on both $|f''(y)|$ and $|f'''(y)|$. It then can be shown that
$$\sum_{k=1}^n E[|R_{X_k}|] \to 0$$
and so $E[f(S_n)] \to E[f(Z)]$.
Using the CLT as an approximation theorem
In mathematics, a distinction is often made between limit theorems and approximation theorems; the former simply specifies the limit of a sequence while the latter provides an estimate or bound on the difference between an element of the sequence and its limit. For example, it is well-known that $\ln(1+x)$ can be approximated by $x$ when $x$ is small; a crude bound on the absolute difference $|\ln(1+x) - x|$ is $x^2/2$ when $x \ge 0$ and $x^2/[2(1+x)^2]$ when $x < 0$. For practical purposes, approximation theorems are more useful as they allow some estimate of the error made in approximating by the limit.
The CLT as stated here is not an approximation theorem. That is, it does not tell us how large $n$ should be in order for a Normal distribution to approximate the distribution of $S_n$. Nonetheless, with additional assumptions, the CLT can be restated as an approximation theorem. Let $F_n$ be the distribution function of $S_n$ and $\Phi$ be the standard Normal distribution function. To gain some insight into the factors affecting the speed of convergence of $F_n$ to $\Phi$, we will use Edgeworth expansions. Assume that $F_n$ is a continuous distribution function and that $E(X_i^4) < \infty$ and define
$$\gamma = \frac{E[(X_i - \mu)^3]}{\sigma^3} \quad\text{and}\quad \kappa = \frac{E[(X_i - \mu)^4]}{\sigma^4} - 3;$$
$\gamma$ and $\kappa$ are, respectively, the skewness and kurtosis of the distribution of $X_i$, both of which are 0 when $X_i$ is normally distributed. It is now possible to show that
$$F_n(x) = \Phi(x) - \phi(x)\left[\frac{\gamma}{6\sqrt{n}}p_1(x) + \frac{\kappa}{24n}p_2(x) + \frac{\gamma^2}{72n}p_3(x)\right] + r_n(x)$$
where $\phi(x)$ is the standard Normal density function and $p_1(x)$, $p_2(x)$, $p_3(x)$ are the polynomials
$$p_1(x) = x^2 - 1, \quad p_2(x) = x^3 - 3x, \quad p_3(x) = x^5 - 10x^3 + 15x;$$
the remainder term $r_n(x)$ satisfies $nr_n(x) \to 0$. From this expansion, it seems clear that the approximation error $|F_n(x) - \Phi(x)|$ depends on the skewness and kurtosis (that is, $\gamma$ and $\kappa$) of the $X_i$'s. The skewness and kurtosis are simple measures of how a particular distribution differs from normality; skewness is a measure of the asymmetry of a distribution ($\gamma = 0$ if the distribution is symmetric around its mean) while kurtosis is a measure of the thickness of the tails of a distribution ($\kappa > 0$ indicates heavier tails than a Normal distribution while $\kappa < 0$ indicates lighter tails). For example, a Uniform distribution has $\gamma = 0$ and $\kappa = -1.2$ while
[Figure 3.4 Density of the sum of 10 Uniform random variables; the dotted curve is the approximating Normal density.]
an Exponential distribution has $\gamma = 2$ and $\kappa = 6$. Thus we should expect convergence to occur more quickly for sums of Uniform random variables than for sums of Exponential random variables. Indeed, this is true; in fact, the distribution of a sum of as few as ten Uniform random variables is sufficiently close to a Normal distribution to allow generation of Normal random variables on a computer by summing Uniform random variables. To illustrate the difference in the accuracy of the Normal approximation, we consider the distribution of $X_1 + \cdots + X_{10}$ when the $X_i$'s are Uniform and Exponential; Figures 3.4 and 3.5 give the exact densities and their Normal approximations in these two cases.
The speed of convergence of the CLT (and hence the goodness of approximation) can often be improved by applying transformations to reduce the skewness and kurtosis of $\bar X_n$. Recall that if $\sqrt{n}(\bar X_n - \mu) \to_d Z$ then
$$\sqrt{n}(g(\bar X_n) - g(\mu)) \to_d g'(\mu)Z.$$
If $g$ is chosen so that the distribution of $g(\bar X_n)$ is more symmetric and has lighter tails than that of $\bar X_n$ then the CLT should provide a more accurate approximation for the distribution of $\sqrt{n}(g(\bar X_n) - g(\mu))$ than it does for the distribution of $\sqrt{n}(\bar X_n - \mu)$.
[Figure 3.5 Density of the sum of 10 Exponential random variables; the dotted curve is the approximating Normal density.]
Although the Edgeworth expansion above does not always hold when the $X_i$'s are discrete, the preceding comments regarding speed of convergence and accuracy of the Normal approximation are still generally true. However, when the $X_i$'s are discrete, there is a simple technique that can improve the accuracy of the Normal approximation. We will illustrate this technique for the Binomial distribution. Suppose that $X$ is a Binomial random variable with parameters $n$ and $\theta$; $X$ can be thought of as a sum of $n$ i.i.d. Bernoulli random variables so the distribution of $X$ can be approximated by a Normal distribution if $n$ is sufficiently large. More specifically, the distribution of
$$\frac{X - n\theta}{\sqrt{n\theta(1-\theta)}}$$
is approximately standard Normal for large $n$. Suppose we want to evaluate $P[a \le X \le b]$ for some integers $a$ and $b$. Ignoring the fact that $X$ is a discrete random variable and the Normal distribution is a continuous distribution, a naive application of the CLT gives
$$P[a \le X \le b] = P\left[\frac{a - n\theta}{\sqrt{n\theta(1-\theta)}} \le \frac{X - n\theta}{\sqrt{n\theta(1-\theta)}} \le \frac{b - n\theta}{\sqrt{n\theta(1-\theta)}}\right] \approx \Phi\left(\frac{b - n\theta}{\sqrt{n\theta(1-\theta)}}\right) - \Phi\left(\frac{a - n\theta}{\sqrt{n\theta(1-\theta)}}\right).$$
[Figure 3.6 Binomial distribution ($n = 40$, $\theta = 0.3$) and approximating Normal density.]
How can this approximation be improved? The answer is clear if we compare the exact distribution of $X$ to its Normal approximation. The distribution of $X$ can be conveniently represented as a probability histogram as in Figure 3.6 with the area of each bar representing the probability that $X$ takes a certain value. The naive Normal approximation given above merely integrates the approximating Normal density from $a = 8$ to $b = 17$; this probability is represented by the shaded area in Figure 3.7. It seems that the naive Normal approximation will underestimate the true probability, and Figures 3.7 and 3.8 suggest that a better approximation may be obtained by integrating from $a - 0.5 = 7.5$ to $b + 0.5 = 17.5$. This corrected Normal approximation is
$$P[a \le X \le b] = P[a - 0.5 \le X \le b + 0.5] \approx \Phi\left(\frac{b + 0.5 - n\theta}{\sqrt{n\theta(1-\theta)}}\right) - \Phi\left(\frac{a - 0.5 - n\theta}{\sqrt{n\theta(1-\theta)}}\right).$$
The correction used here is known as a continuity correction and can be applied generally to improve the accuracy of the Normal approximation for sums of discrete random variables. (In Figures 3.6, 3.7, and 3.8, $X$ has a Binomial distribution with parameters $n = 40$ and $\theta = 0.3$.)
[Figure 3.7 Naive Normal approximation of $P(8 \le X \le 17)$.]
[Figure 3.8 Normal approximation of $P(8 \le X \le 17)$ with continuity correction.]
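For illustration, a short sketch computing the exact probability and the two Normal approximations for the Binomial(40, 0.3) example above (using scipy.stats); the corrected approximation is typically much closer to the exact value.

import math
from scipy.stats import binom, norm

n, theta = 40, 0.3
a, b = 8, 17
mean, sd = n * theta, math.sqrt(n * theta * (1 - theta))

exact = binom.cdf(b, n, theta) - binom.cdf(a - 1, n, theta)
naive = norm.cdf((b - mean) / sd) - norm.cdf((a - mean) / sd)
corrected = norm.cdf((b + 0.5 - mean) / sd) - norm.cdf((a - 0.5 - mean) / sd)
print(round(exact, 4), round(naive, 4), round(corrected, 4))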
Some other Central Limit Theorems
CLTs can be proved under a variety of conditions; neither the assumption of independence nor that of identical distributions is necessary. In this section, we will consider two simple modifications of the CLT for sums of i.i.d. random variables. The first modification deals with weighted sums of i.i.d. random variables while the second deals with sums of independent but not identically distributed random variables with finite third moment.
THEOREM 3.9 (CLT for weighted sums) Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables with $E(X_i) = 0$ and $\mathrm{Var}(X_i) = 1$ and let $\{c_i\}$ be a sequence of constants. Define
$$S_n = \frac{1}{s_n}\sum_{i=1}^n c_i X_i \quad\text{where } s_n^2 = \sum_{i=1}^n c_i^2.$$
Then $S_n \to_d Z$, a standard Normal random variable, provided that
$$\max_{1\le i\le n}\frac{c_i^2}{s_n^2} \to 0$$
as $n \to \infty$.
What is the practical meaning of the negligibility condition on the constants $\{c_i\}$ given above? For each $n$, it is easy to see that $\mathrm{Var}(S_n) = 1$. Now writing
$$S_n = \sum_{i=1}^n \frac{c_i}{s_n}X_i = \sum_{i=1}^n Y_{ni}$$
and noting that $\mathrm{Var}(Y_{ni}) = c_i^2/s_n^2$, it follows that this condition implies that no single component of the sum $S_n$ contributes an excessive proportion of the variance of $S_n$. For example, the condition rules out situations where $S_n$ depends only on a negligible proportion of the $Y_{ni}$'s. An extreme example of this occurs when $c_1 = c_2 = \cdots = c_k = 1$ (for some fixed $k$) and all other $c_i$'s are 0; in this case,
$$\max_{1\le i\le n}\frac{c_i^2}{s_n^2} = \frac{1}{k},$$
which does not tend to 0 as $n \to \infty$ since $k$ is fixed. On the other hand, if $c_i = i$ then
$$s_n^2 = \sum_{i=1}^n i^2 = \frac{1}{6}n(2n^2 + 3n + 1)$$
and
$$\max_{1\le i\le n}\frac{c_i^2}{s_n^2} = \frac{6n^2}{n(2n^2 + 3n + 1)} \to 0$$
as $n \to \infty$ and so the negligibility condition holds. Thus if the $X_i$'s are i.i.d. random variables with $E(X_i) = 0$ and $\mathrm{Var}(X_i) = 1$, it follows that
$$\frac{1}{s_n}\sum_{i=1}^n iX_i \to_d Z$$
where $Z$ has a standard Normal distribution.
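For illustration, a minimal simulation sketch of the weighted-sum CLT with $c_i = i$ (assuming Uniform$(-\sqrt 3, \sqrt 3)$ summands so that $E(X_i) = 0$ and $\mathrm{Var}(X_i) = 1$; the sample size, number of replications, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n, replications = 200, 20000

c = np.arange(1, n + 1, dtype=float)      # weights c_i = i
s_n = np.sqrt(np.sum(c ** 2))

x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(replications, n))  # mean 0, variance 1
s = (x * c).sum(axis=1) / s_n

print("mean:", round(s.mean(), 3), " variance:", round(s.var(), 3))
print("P(S_n <= 1):", round(np.mean(s <= 1.0), 3))   # compare with Phi(1) = 0.841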
When the negligibility condition of Theorem 3.9 fails, it may still be possible to show that the weighted sum $S_n$ converges in distribution although the limiting distribution will typically be non-Normal.
EXAMPLE 3.9: Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables with common density function
$$f(x) = \frac{1}{2}\exp(-|x|)$$
(called a Laplace distribution) and define
$$S_n = \frac{1}{s_n}\sum_{k=1}^n \frac{X_k}{k}$$
where $s_n^2 = \sum_{k=1}^n k^{-2}$. Note that $s_n^2 \to \sum_{k=1}^\infty k^{-2} = \pi^2/6$ as $n \to \infty$ and the negligibility condition does not hold. However, it can be shown that $S_n \to_d S$. Since $s_n \to \pi/\sqrt{6}$, we will consider the limiting distribution of $V_n = s_n S_n$; if $V_n \to_d V$ then $S_n \to_d \sqrt{6}\,V/\pi = S$. The moment generating function of $X_i$ is
$$m(t) = \frac{1}{1 - t^2} \quad\text{(for } |t| < 1\text{)}$$
and so the moment generating function of $V_n$ is
$$m_n(t) = \prod_{k=1}^n m(t/k) = \prod_{k=1}^n\left(\frac{k^2}{k^2 - t^2}\right).$$
As $n \to \infty$, $m_n(t) \to m_V(t)$ where
$$m_V(t) = \prod_{k=1}^\infty\left(\frac{k^2}{k^2 - t^2}\right) = \Gamma(1+t)\Gamma(1-t)$$
for $|t| < 1$. The limiting moment generating function $m_V(t)$ is not immediately recognizable. However, note that
$$\int_{-\infty}^\infty \exp(tx)\frac{\exp(-x)}{(1 + \exp(-x))^2}\,dx = \int_0^1 u^t(1-u)^{-t}\,du = \frac{\Gamma(1+t)\Gamma(1-t)}{\Gamma(2)} = \Gamma(1+t)\Gamma(1-t)$$
and so the density of $V$ is
$$f_V(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2}$$
(this distribution is often called the Logistic distribution). The density of $S$ is thus
$$f_S(x) = \frac{\pi}{\sqrt{6}}\frac{\exp(-\pi x/\sqrt{6})}{(1 + \exp(-\pi x/\sqrt{6}))^2}.$$
The limiting density $f_S(x)$ is shown in Figure 3.9.
[Figure 3.9 Density of $S$; the dotted curve is a Normal density with the same mean and variance as $S$.]
Another useful CLT for sums of independent (but not identically distributed) random variables is the Lyapunov CLT. Like the CLT for weighted sums of i.i.d. random variables, this CLT depends on a condition that can be easily verified.
THEOREM 3.10 (Lyapunov CLT) Suppose that $X_1, X_2, \ldots$ are independent random variables with $E(X_i) = 0$, $E(X_i^2) = \sigma_i^2$ and $E(|X_i|^3) = \gamma_i < \infty$ and define
$$S_n = \frac{1}{s_n}\sum_{i=1}^n X_i \quad\text{where } s_n^2 = \sum_{i=1}^n \sigma_i^2.$$
If
$$\lim_{n\to\infty}\frac{1}{s_n^3}\sum_{i=1}^n \gamma_i = 0$$
then $S_n \to_d Z$, a standard Normal random variable.
It is possible to adapt the proof of the CLT for sums of i.i.d. random variables to the two CLTs given in this section. In the case of the CLT for weighted sums, the key modification lies in redefining $T_{nk}$ to be
$$T_{nk} = \frac{1}{s_n}\sum_{j=1}^{k-1}c_jZ_j + \frac{1}{s_n}\sum_{j=k+1}^n c_jX_j$$
where $Z_1, Z_2, \ldots$ are independent standard Normal random variables independent of the $X_i$'s. Then letting $f$ be a bounded function with three bounded derivatives, we have
$$E[f(S_n)] - E[f(Z)] = \sum_{k=1}^n E[f(T_{nk} + c_kX_k/s_n) - f(T_{nk} + c_kZ_k/s_n)].$$
The remainder of the proof is much the same as before and is left as an exercise. It is also possible to give a proof using moment generating functions assuming, of course, that the moment generating function of $X_i$ exists.
Multivariate Central Limit Theorem
CLTs for sums of random variables can be generalized to deal with sums of random vectors. For example, suppose that $\mathbf{X}_1, \mathbf{X}_2, \ldots$ are i.i.d. random vectors with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $C$; define
$$\bar{\mathbf{X}}_n = \frac{1}{n}\sum_{i=1}^n \mathbf{X}_i$$
to be the (coordinate-wise) sample mean of $\mathbf{X}_1, \ldots, \mathbf{X}_n$. The logical extension of the CLT for i.i.d. random variables is to consider the limiting behaviour of the distributions of $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu})$.
Before considering any multivariate CLT, we need to extend the notion of convergence in distribution to sequences of random vectors. This extension is fairly straightforward and involves the joint distribution functions of the random vectors; given random vectors $\{\mathbf{X}_n\}$ and $\mathbf{X}$, we say that $\mathbf{X}_n \to_d \mathbf{X}$ if
$$\lim_{n\to\infty} P[\mathbf{X}_n \le \mathbf{x}] = P[\mathbf{X} \le \mathbf{x}] = F(\mathbf{x})$$
at each continuity point $\mathbf{x}$ of the joint distribution function $F$ of $\mathbf{X}$. ($\mathbf{X} \le \mathbf{x}$ means that each coordinate of $\mathbf{X}$ is less than or equal to the corresponding coordinate of $\mathbf{x}$.) This definition, while simple enough, is difficult to work with analytically. Fortunately, convergence in distribution of random vectors can be cast in terms of their one-dimensional projections.
THEOREM 3.11 Suppose that $\{\mathbf{X}_n\}$ and $\mathbf{X}$ are random vectors. Then $\mathbf{X}_n \to_d \mathbf{X}$ if, and only if, $\mathbf{t}^T\mathbf{X}_n \to_d \mathbf{t}^T\mathbf{X}$ for all vectors $\mathbf{t}$.
Theorem 3.11 is called the Cram&eacute;r-Wold device. The proof of this result will not be given here. The result is extremely useful for proving multivariate CLTs since it essentially reduces multivariate CLTs to special cases of univariate CLTs. We will only consider a multivariate CLT for sums of i.i.d. random vectors but more general multivariate CLTs can also be deduced from appropriate univariate CLTs.
THEOREM 3.12 (Multivariate CLT) Suppose that $\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3, \ldots$ are i.i.d. random vectors with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $C$ and define
$$\mathbf{S}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathbf{X}_i - \boldsymbol{\mu}) = \sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}).$$
Then $\mathbf{S}_n \to_d \mathbf{Z}$ where $\mathbf{Z}$ has a multivariate Normal distribution with mean $\mathbf{0}$ and variance-covariance matrix $C$.
Proof. It suffices to show that $\mathbf{t}^T\mathbf{S}_n \to_d \mathbf{t}^T\mathbf{Z}$; note that $\mathbf{t}^T\mathbf{Z}$ is Normally distributed with mean 0 and variance $\mathbf{t}^TC\mathbf{t}$. Now
$$\mathbf{t}^T\mathbf{S}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{t}^T(\mathbf{X}_i - \boldsymbol{\mu}) = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i$$
where the $Y_i$'s are i.i.d. with $E(Y_i) = 0$ and $\mathrm{Var}(Y_i) = \mathbf{t}^TC\mathbf{t}$. Thus by the CLT for i.i.d. random variables,
$$\mathbf{t}^T\mathbf{S}_n \to_d N(0, \mathbf{t}^TC\mathbf{t})$$
and the theorem follows.
The definition of convergence in probability can be extended quite easily to random vectors. We will say that $\mathbf{X}_n \to_p \mathbf{X}$ if each coordinate of $\mathbf{X}_n$ converges in probability to the corresponding coordinate of $\mathbf{X}$. Equivalently, we can say that $\mathbf{X}_n \to_p \mathbf{X}$ if
$$\lim_{n\to\infty} P[\|\mathbf{X}_n - \mathbf{X}\| > \epsilon] = 0$$
where $\|\cdot\|$ is the Euclidean norm of a vector.
It is possible to generalize many of the results proved above. For example, suppose that $\mathbf{X}_n \to_d \mathbf{X}$; then if $g$ is a continuous real-valued function, $g(\mathbf{X}_n) \to_d g(\mathbf{X})$. (The same is true if $\to_d$ is replaced by $\to_p$.) This multivariate version of the Continuous Mapping Theorem can be used to obtain a generalization of Slutsky's Theorem. Suppose that $X_n \to_d X$ and $Y_n \to_p \theta$. By using the Cram&eacute;r-Wold device (Theorem 3.11) and Slutsky's Theorem, it follows that $(X_n, Y_n) \to_d (X, \theta)$. Thus if $g(x, y)$ is a continuous function, we have
$$g(X_n, Y_n) \to_d g(X, \theta).$$
EXAMPLE 3.10: Suppose that $\{\mathbf{X}_n\}$ is a sequence of random vectors with $\mathbf{X}_n \to_d \mathbf{Z}$ where $\mathbf{Z} \sim N_p(\mathbf{0}, C)$ and $C$ is non-singular. Define the function
$$g(\mathbf{x}) = \mathbf{x}^TC^{-1}\mathbf{x},$$
which is a continuous function of $\mathbf{x}$. Then we have
$$g(\mathbf{X}_n) = \mathbf{X}_n^TC^{-1}\mathbf{X}_n \to_d \mathbf{Z}^TC^{-1}\mathbf{Z} = g(\mathbf{Z}).$$
It follows from Chapter 2 that the random variable $\mathbf{Z}^TC^{-1}\mathbf{Z}$ has a $\chi^2$ distribution with $p$ degrees of freedom. Thus for large $n$, $\mathbf{X}_n^TC^{-1}\mathbf{X}_n$ is approximately $\chi^2$ with $p$ degrees of freedom.
It is also possible to extend the Delta Method to the multivariate case. Let $\{a_n\}$ be a sequence of constants tending to infinity and suppose that
$$a_n(\mathbf{X}_n - \boldsymbol{\theta}) \to_d \mathbf{Z}.$$
If $g(\mathbf{x}) = (g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))$ is a vector-valued function that is continuously differentiable at $\mathbf{x} = \boldsymbol{\theta}$, we have
$$a_n(g(\mathbf{X}_n) - g(\boldsymbol{\theta})) \to_d D(\boldsymbol{\theta})\mathbf{Z}$$
where $D(\boldsymbol{\theta})$ is a matrix of partial derivatives of $g$ with respect to $\mathbf{x}$ evaluated at $\mathbf{x} = \boldsymbol{\theta}$; more precisely, if $\mathbf{x} = (x_1, \ldots, x_p)$,
$$D(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{\partial}{\partial x_1}g_1(\boldsymbol{\theta}) & \cdots & \dfrac{\partial}{\partial x_p}g_1(\boldsymbol{\theta}) \\ \dfrac{\partial}{\partial x_1}g_2(\boldsymbol{\theta}) & \cdots & \dfrac{\partial}{\partial x_p}g_2(\boldsymbol{\theta}) \\ \vdots & & \vdots \\ \dfrac{\partial}{\partial x_1}g_k(\boldsymbol{\theta}) & \cdots & \dfrac{\partial}{\partial x_p}g_k(\boldsymbol{\theta}) \end{pmatrix}.$$
The proof of this result parallels that of the Delta Method given earlier and is left as an exercise.
EXAMPLE 3.11: Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. pairs of random variables with $E(X_i) = \mu_X > 0$, $E(Y_i) = \mu_Y > 0$, and $E(X_i^2)$ and $E(Y_i^2)$ both finite. By the multivariate CLT, we have
$$\sqrt{n}\left(\begin{pmatrix}\bar X_n \\ \bar Y_n\end{pmatrix} - \begin{pmatrix}\mu_X \\ \mu_Y\end{pmatrix}\right) \to_d \mathbf{Z} \sim N_2(\mathbf{0}, C)$$
where $C$ is the variance-covariance matrix of $(X_i, Y_i)$. We want to consider the asymptotic distribution of $\bar X_n/\bar Y_n$. Applying the Delta Method (with $g(x, y) = x/y$), we have
$$\sqrt{n}\left(\frac{\bar X_n}{\bar Y_n} - \frac{\mu_X}{\mu_Y}\right) \to_d D(\mu_X, \mu_Y)\mathbf{Z} \sim N\left(0, D(\mu_X, \mu_Y)\,C\,D(\mu_X, \mu_Y)^T\right)$$
where
$$D(\mu_X, \mu_Y) = \left(\frac{1}{\mu_Y}, -\frac{\mu_X}{\mu_Y^2}\right).$$
Letting $\mathrm{Var}(X_i) = \sigma_X^2$, $\mathrm{Var}(Y_i) = \sigma_Y^2$, and $\mathrm{Cov}(X_i, Y_i) = \sigma_{XY}$, it follows that the variance of the limiting Normal distribution is
$$D(\mu_X, \mu_Y)\,C\,D(\mu_X, \mu_Y)^T = \frac{\mu_Y^2\sigma_X^2 - 2\mu_X\mu_Y\sigma_{XY} + \mu_X^2\sigma_Y^2}{\mu_Y^4}.$$
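For illustration, a numerical check of Example 3.11 under one concrete (and arbitrary) assumption: $X_i$ Exponential with mean 2 and $Y_i$ Uniform$(1, 3)$, independent, so that $\sigma_{XY} = 0$. The simulated variance of $\sqrt{n}(\bar X_n/\bar Y_n - \mu_X/\mu_Y)$ should be close to the limiting variance given by the formula above.

import numpy as np

rng = np.random.default_rng(6)
n, replications = 500, 20000
mu_x, mu_y = 2.0, 2.0
var_x, var_y, cov_xy = 4.0, 1.0 / 3.0, 0.0   # X ~ Exponential(mean 2), Y ~ Uniform(1,3), independent

x = rng.exponential(scale=mu_x, size=(replications, n))
y = rng.uniform(1.0, 3.0, size=(replications, n))

ratio = x.mean(axis=1) / y.mean(axis=1)
z = np.sqrt(n) * (ratio - mu_x / mu_y)

limit_var = (mu_y**2 * var_x - 2 * mu_x * mu_y * cov_xy + mu_x**2 * var_y) / mu_y**4
print("simulated variance:", round(z.var(), 3), " limiting variance:", round(limit_var, 3))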
3.6 Some applications
In subsequent chapters, we will use many of the concepts and
results developed in this chapter to characterize the large sample
properties of statistical estimators. In this section, we will give some
applications of the concepts and results given so far in this chapter.
Variance stabilizing transformations
The CLT states that if $X_1, X_2, \ldots$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$ then
$$\sqrt{n}(\bar X_n - \mu) \to_d Z$$
where $Z$ has a Normal distribution with mean 0 and variance $\sigma^2$. For many distributions, the variance $\sigma^2$ depends only on the mean $\mu$ (that is, $\sigma^2 = V(\mu)$). In statistics, it is often desirable to find a function $g$ such that the limit distribution of $\sqrt{n}(g(\bar X_n) - g(\mu))$ does not depend on $\mu$. (We could then use this result to find an approximate confidence interval for $\mu$; see Chapter 7.) If $g$ is differentiable, we have
$$\sqrt{n}(g(\bar X_n) - g(\mu)) \to_d g'(\mu)Z$$
and $g'(\mu)Z$ is Normal with mean 0 and variance $[g'(\mu)]^2V(\mu)$; in order to make the limiting distribution independent of $\mu$, we need to find $g$ so that this variance is 1 (or some other constant). Thus, given $V(\mu)$, we would like to find $g$ such that
$$[g'(\mu)]^2V(\mu) = 1$$
or
$$g'(\mu) = \pm\frac{1}{V(\mu)^{1/2}}.$$
The function $g$ can be the solution of either of the two differential equations depending on whether one wants $g$ to be an increasing or a decreasing function of $\mu$; $g$ is called a variance stabilizing transformation.
EXAMPLE 3.12: Suppose that $X_1, \ldots, X_n$ are i.i.d. Bernoulli random variables with parameter $\theta$. Then
$$\sqrt{n}(\bar X_n - \theta) \to_d Z \sim N(0, \theta(1-\theta)).$$
To find $g$ such that
$$\sqrt{n}(g(\bar X_n) - g(\theta)) \to_d N(0, 1)$$
we solve the differential equation
$$g'(\theta) = \frac{1}{\sqrt{\theta(1-\theta)}}.$$
The general form of the solutions to this differential equation is
$$g(\theta) = \sin^{-1}(2\theta - 1) + c$$
where $c$ is an arbitrary constant that could be taken to be 0. (The solutions to the differential equation can also be written $g(\theta) = 2\sin^{-1}(\sqrt{\theta}) + c$.)
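For illustration, a simulation sketch of Example 3.12 (with arbitrary choices of $n$, the number of replications, and the $\theta$ values): the variance of $\sqrt{n}[g(\bar X_n) - g(\theta)]$ with $g(\theta) = 2\sin^{-1}(\sqrt{\theta})$ should stay near 1 across $\theta$, unlike that of $\sqrt{n}(\bar X_n - \theta)$.

import numpy as np

rng = np.random.default_rng(7)
n, replications = 400, 20000

def g(p):
    return 2.0 * np.arcsin(np.sqrt(p))   # variance stabilizing transformation

for theta in (0.1, 0.3, 0.5, 0.8):
    x = rng.binomial(1, theta, size=(replications, n))
    p_hat = x.mean(axis=1)
    raw = np.sqrt(n) * (p_hat - theta)        # variance approx theta*(1 - theta)
    stab = np.sqrt(n) * (g(p_hat) - g(theta)) # variance approx 1
    print(theta, round(raw.var(), 3), round(stab.var(), 3))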
Variance stabilizing transformations often improve the speed of convergence to normality; that is, the distribution of $g(\bar X_n)$ can be better approximated by a Normal distribution than that of $\bar X_n$ if $g$ is a variance stabilizing transformation. However, there may be other transformations that result in a better approximation by a Normal distribution.
A CLT for dependent random variables
Suppose that $\{U_i\}$ is an infinite sequence of i.i.d. random variables with mean 0 and variance $\sigma^2$ and define
$$X_i = \sum_{j=0}^p c_j U_{i-j}$$
where $c_0, \ldots, c_p$ are constants. Note that $X_1, X_2, \ldots$ are not necessarily independent since they can depend on common $U_i$'s. (In time series analysis, $\{X_i\}$ is called a moving average process.) A natural question to ask is whether a CLT holds for sample means $\bar X_n$ based on $X_1, X_2, \ldots$
We begin by noting that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i = \frac{1}{\sqrt{n}}\sum_{i=1}^n\sum_{j=0}^p c_jU_{i-j} = \frac{1}{\sqrt{n}}\sum_{j=0}^p c_j\sum_{i=1}^n U_{i-j} = \left(\sum_{j=0}^p c_j\right)\frac{1}{\sqrt{n}}\sum_{i=1}^n U_i + \sum_{j=1}^p c_jR_{nj}$$
where
$$R_{nj} = \frac{1}{\sqrt{n}}\left(\sum_{i=1}^n U_{i-j} - \sum_{i=1}^n U_i\right) = \frac{1}{\sqrt{n}}\left(U_{1-j} + \cdots + U_0 - U_{n-j+1} - \cdots - U_n\right).$$
Now $E(R_{nj}) = 0$, $\mathrm{Var}(R_{nj}) = E(R_{nj}^2) = 2j\sigma^2/n$ and so by Chebyshev's inequality $R_{nj} \to_p 0$ as $n \to \infty$ for $j = 1, \ldots, p$; thus,
$$\sum_{j=1}^p c_jR_{nj} \to_p 0.$$
Finally,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n U_i \to_d N(0, \sigma^2)$$
and so applying Slutsky's Theorem, we get
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \to_d Z$$
where $Z$ has a Normal distribution with mean 0 and variance
$$\sigma^2\left(\sum_{j=0}^p c_j\right)^2.$$
When $\sum_{j=0}^p c_j = 0$, the variance of the limiting Normal distribution is 0; this suggests that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \to_p 0$$
(if $\sum_{j=0}^p c_j = 0$). This is, in fact, the case. It follows from above that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i = \sum_{j=1}^p c_jR_{nj},$$
which tends to 0 in probability. An extension to infinite moving averages is considered in Problem 3.22.
In general, what conditions are necessary to obtain a CLT for sums of dependent random variables $\{X_i\}$? Loosely speaking, it may be possible to approximate the distribution of $X_1 + \cdots + X_n$ by a Normal distribution (for sufficiently large $n$) if both
• the dependence between $X_i$ and $X_{i+k}$ becomes negligible as $k \to \infty$ (for each $i$), and
• each $X_i$ accounts for a negligible proportion of the variance of $X_1 + \cdots + X_n$.
However, the conditions above provide only a very rough guideline for the possible existence of a CLT; much more specific technical conditions are typically needed to establish CLTs for sums of dependent random variables.
Monte Carlo integration
Suppose we want to evaluate the multi-dimensional integral
$$\int\cdots\int_B g(\mathbf{x})\,d\mathbf{x}$$
where the function $g$ is sufficiently complicated that the integral cannot be evaluated analytically. A variety of methods exist for evaluating integrals numerically. The most well-known of these involve deterministic approximations of the form
$$\int\cdots\int_B g(\mathbf{x})\,d\mathbf{x} \approx \sum_{i=1}^m a_ig(\mathbf{x}_i)$$
where $a_1, \ldots, a_m$, $\mathbf{x}_1, \ldots, \mathbf{x}_m$ are fixed points that depend on the method used; for a given function $g$, it is usually possible to give an explicit upper bound on the approximation error
$$\left|\int\cdots\int_B g(\mathbf{x})\,d\mathbf{x} - \sum_{i=1}^m a_ig(\mathbf{x}_i)\right|$$
and so the points $\{\mathbf{x}_i\}$ can be chosen to make this error acceptably small. However, as the dimension of the domain of integration $B$ increases, the number of points $m$ needed to obtain a given accuracy increases exponentially. An alternative is to use so-called Monte Carlo sampling; that is, we evaluate $g$ at random (as opposed to fixed) points. The resulting approximation is of the form
$$\sum_{i=1}^m A_ig(\mathbf{X}_i)$$
where the $\mathbf{X}_i$'s and possibly the $A_i$'s are random. One advantage of using Monte Carlo integration is the fact that the order of the approximation error depends only on $m$ and not on the dimension of $B$. Unfortunately, Monte Carlo integration does not give a guaranteed error bound; hence, for a given value of $m$, we can never be absolutely certain that the approximation error is sufficiently small.
Why does Monte Carlo integration work? Monte Carlo integration
exploits the fact that any integral can be expressed as the
expected value of some real-valued function of a random variable
or random vector. Since the WLLN says that sample means
approximate population means (with high probability) if the sample
size is sufficiently large, we can use the appropriate sample mean
to approximate any given integral. To illustrate, we will consider
evaluating the integral
\[ I = \int_0^1 g(x)\, dx. \]
Suppose that a random variable U has a Uniform distribution on
[0, 1]. Then
\[ E[g(U)] = \int_0^1 g(x)\, dx. \]
If U_1, U_2, … are i.i.d. Uniform random variables on [0, 1], the WLLN
says that
\[ \frac{1}{n} \sum_{i=1}^{n} g(U_i) \to_p E[g(U)] = \int_0^1 g(x)\, dx
   \quad \text{as } n \to \infty, \]
which suggests that ∫_0^1 g(x) dx may be approximated by the Monte
Carlo estimate
\[ \hat{I} = \frac{1}{n} \sum_{i=1}^{n} g(U_i) \]
if n is sufficiently large. (In practice, U_1, …, U_n are pseudo-random
variables and so are not truly independent.)
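A minimal Python sketch of this estimate (not from the text; the integrand g(x) = exp(−x²) and the sample sizes are purely illustrative choices):

# Simple Monte Carlo estimate of integral_0^1 g(x) dx (illustrative sketch).
import numpy as np

def mc_estimate(g, n, rng):
    """Return (1/n) * sum g(U_i) for U_1, ..., U_n i.i.d. Uniform(0,1)."""
    u = rng.random(n)
    return g(u).mean()

rng = np.random.default_rng(2)
g = lambda x: np.exp(-x**2)       # illustrative integrand
for n in [100, 10000, 1000000]:
    print(n, mc_estimate(g, n, rng))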
Generally, it is not possible to obtain a useful absolute bound on
the approximation error
\[ \left| \frac{1}{n} \sum_{i=1}^{n} g(U_i) - \int_0^1 g(x)\, dx \right| \]
since this error is random. However, if ∫_0^1 g²(x) dx < ∞, it is
possible to use the CLT to make a probability statement about
the approximation error. Defining
\[ \sigma_g^2 = \mathrm{Var}[g(U_i)]
   = \int_0^1 g^2(x)\, dx - \left( \int_0^1 g(x)\, dx \right)^2, \]
it follows that
\[ P\left[ \left| \frac{1}{n} \sum_{i=1}^{n} g(U_i) - \int_0^1 g(x)\, dx \right|
   < \frac{a \sigma_g}{\sqrt{n}} \right] \approx \Phi(a) - \Phi(-a) \]
where Φ(x) is the standard Normal distribution function.
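In practice σ_g is unknown, but it can be estimated by the sample standard deviation of the g(U_i), giving an approximate probabilistic error bound. A Python sketch (not from the text; the integrand and the choice a = 1.96, corresponding to roughly 95% coverage, are illustrative assumptions):

# Monte Carlo estimate with a CLT-based approximate error bound (sketch).
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.exp(-x**2)        # illustrative integrand, as before
n = 100000
gu = g(rng.random(n))
estimate = gu.mean()
sigma_g_hat = gu.std(ddof=1)       # plug-in estimate of sigma_g
half_width = 1.96 * sigma_g_hat / np.sqrt(n)
print(f"estimate = {estimate:.5f} +/- {half_width:.5f} (approx. 95%)")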
The simple Monte Carlo estimate of ∫_0^1 g(x) dx can be improved in
a number of ways. We will mention two such methods: importance
sampling and antithetic sampling.
Importance sampling exploits the fact that
\[ \int_0^1 g(x)\, dx = \int_0^1 \frac{g(x)}{f(x)} f(x)\, dx
   = E\left[ \frac{g(X)}{f(X)} \right] \]
where the random variable X has density f on [0, 1]. Thus if
X_1, …, X_n are i.i.d. random variables with density f, we can
approximate ∫_0^1 g(x) dx by
\[ \tilde{I} = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)}. \]
How do Î and Ĩ compare as estimates of I = ∫_0^1 g(x) dx? In simple
terms, the estimates can be assessed by comparing their variances
or, equivalently, Var[g(U_i)] and Var[g(X_i)/f(X_i)]. It can be shown
that Var[g(X_i)/f(X_i)] is minimized by sampling the X_i's from the
density
\[ f(x) = \frac{|g(x)|}{\int_0^1 |g(x)|\, dx}; \]
in practice, however, it may be difficult to generate random
variables with this density. However, a significant reduction in
variance can be obtained if the X_i's are sampled from a density
f that is approximately proportional to |g|. (More generally, we
could approximate the integral ∫_B g(x) dx by
\[ \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)} \]
where X_1, …, X_n are i.i.d. random variables with density f.)
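A short Python sketch of importance sampling on [0, 1] (not from the text): the density f(x) = 2(1 − x), sampled by inversion as X = 1 − √(1 − U), is an illustrative choice that is roughly proportional to the integrand used in Example 3.13 below.

# Importance sampling versus plain Monte Carlo (illustrative sketch).
import numpy as np

rng = np.random.default_rng(4)
g = lambda x: np.exp(-x) * np.cos(np.pi * x / 2)   # roughly 1 - x on [0,1]
f = lambda x: 2.0 * (1.0 - x)                      # density on [0,1]

n = 100000
u = rng.random(n)
x = 1.0 - np.sqrt(1.0 - u)        # inverse-CDF sampling from f
weights = g(x) / f(x)             # g(X_i)/f(X_i)
print("importance sampling estimate:", weights.mean())
print("plain Monte Carlo estimate:  ", g(rng.random(n)).mean())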
Antithetic sampling exploits the fact that
\[ \int_0^1 g(x)\, dx = \frac{1}{2} \int_0^1 \left( g(x) + g(1-x) \right) dx. \]
Hence if U_1, U_2, … are i.i.d. Uniform random variables on [0, 1], we
can approximate ∫_0^1 g(x) dx by
\[ \frac{1}{2n} \sum_{i=1}^{n} \left( g(U_i) + g(1 - U_i) \right). \]
Antithetic sampling is effective when g is a monotone function
(either increasing or decreasing); in this case, it can be shown that
Cov[g(U_i), g(1 − U_i)] ≤ 0. Now comparing the variances of
\[ \hat{I}_s = \frac{1}{2n} \sum_{i=1}^{2n} g(U_i)
   \quad \text{and} \quad
   \hat{I}_a = \frac{1}{2n} \sum_{i=1}^{n} \left( g(U_i) + g(1 - U_i) \right), \]
we obtain
\[ \mathrm{Var}(\hat{I}_s) = \frac{\mathrm{Var}[g(U_1)]}{2n} \]
and
\[ \mathrm{Var}(\hat{I}_a) = \frac{\mathrm{Var}[g(U_1) + g(1 - U_1)]}{4n}
   = \frac{\mathrm{Var}[g(U_1)] + \mathrm{Cov}[g(U_1), g(1 - U_1)]}{2n}. \]
Since Cov[g(U_1), g(1 − U_1)] ≤ 0 when g is monotone, it follows that
Var(Î_a) ≤ Var(Î_s).
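The variance comparison can be checked by simulation. In the following Python sketch (not from the text), the monotone integrand (the same one used in Example 3.13 below) and the values of n and the number of replications are illustrative choices; Î_s uses 2n uniforms while Î_a uses n antithetic pairs.

# Simple versus antithetic Monte Carlo for a monotone integrand (sketch).
import numpy as np

rng = np.random.default_rng(5)
g = lambda x: np.exp(-x) * np.cos(np.pi * x / 2)   # monotone decreasing
n, reps = 500, 2000
simple = np.empty(reps)
antithetic = np.empty(reps)
for r in range(reps):
    u2n = rng.random(2 * n)
    simple[r] = g(u2n).mean()                            # I_s: 2n evaluations
    un = rng.random(n)
    antithetic[r] = 0.5 * (g(un) + g(1.0 - un)).mean()   # I_a: n pairs
print("Var(I_s) estimate:", simple.var(ddof=1))
print("Var(I_a) estimate:", antithetic.var(ddof=1))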
EXAMPLE 3.13: Consider evaluating the integral
\[ \int_0^1 \exp(-x) \cos(\pi x / 2)\, dx; \]
the integrand is shown in Figure 3.10. This integral can be evaluated
in closed-form as
\[ \frac{2}{4 + \pi^2} \left( 2 + \pi \exp(-1) \right) \approx 0.4551. \]
We will evaluate this integral using four Monte Carlo approaches.
Let U_1, …, U_1000 be i.i.d. Uniform random variables on the interval
[0, 1] and define the following four Monte Carlo estimates of the
integral:
\[ \hat{I}_1 = \frac{1}{1000} \sum_{i=1}^{1000} \exp(-U_i) \cos(\pi U_i / 2) \]
\[ \hat{I}_2 = \frac{1}{1000} \sum_{i=1}^{500} \exp(-U_i) \cos(\pi U_i / 2)
   + \frac{1}{1000} \sum_{i=1}^{500} \exp(U_i - 1) \cos(\pi (1 - U_i)/2) \]
\[ \hat{I}_3 = \frac{1}{1000} \sum_{i=1}^{1000}
   \frac{\exp(-V_i) \cos(\pi V_i / 2)}{2(1 - V_i)} \]
[Figure 3.10: graph of the function g(x) = exp(−x) cos(πx/2) for 0 ≤ x ≤ 1; g decreases from 1 at x = 0 to 0 at x = 1.]
\[ \hat{I}_4 = \frac{1}{1000} \sum_{i=1}^{1000}
   \frac{\exp(-W_i) \cos(\pi W_i / 2)}{2 W_i} \]
where V_i = 1 − (1 − U_i)^{1/2} and W_i = U_i^{1/2}. Î_2 is an antithetic
sampling estimate of the integral while Î_3 and Î_4 are both
importance sampling estimates; note that the density of V_i is
f_V(x) = 2(1 − x) while the density of W_i is f_W(x) = 2x (for
0 ≤ x ≤ 1 in both cases). Each of the four estimates was evaluated
for 10 samples of U_1, …, U_1000 and the results are presented in Table
3.2.
A glance at Table 3.2 reveals that the estimates Î_2 and Î_3 are
the best while Î_4 is the clear loser; Î_2 comes the closest to the
true value 4 times while Î_3 comes the closest the other 6 times.
It is not surprising that Î_3 does so well; from Figure 3.10, we can
see that the integrand g(x) = exp(−x) cos(πx/2) is approximately
1 − x so that g(x)/f_V(x) is approximately constant (≈ 1/2) and so
Î_3 should be close to the optimal importance sampling estimate.
The fact that g(x) ≈ 1 − x also explains the success of the antithetic
sampling estimate Î_2: g(U_i) + g(1 − U_i) ≈ 1, which suggests that
Var(g(U_i) + g(1 − U_i)) ≈ 0.
Table 3.2 Monte Carlo estimates of ∫_0^1 exp(−x) cos(πx/2) dx.

   Î_1      Î_2      Î_3      Î_4
  0.4596   0.4522   0.4555   0.4433
  0.4706   0.4561   0.4563   0.5882
  0.4653   0.4549   0.4560   0.6569
  0.4600   0.4559   0.4559   0.4259
  0.4496   0.4551   0.4546   0.4907
  0.4412   0.4570   0.4527   0.4206
  0.4601   0.4546   0.4552   0.4289
  0.4563   0.4555   0.4549   0.4282
  0.4541   0.4538   0.4555   0.4344
  0.4565   0.4534   0.4556   0.4849
(In fact, Var(Î_2) = 0.97 × 10⁻⁶ while Var(Î_3) = 2.25 × 10⁻⁶ and
Var(Î_1) = 90.95 × 10⁻⁶; the variance of Î_4 is infinite.)
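The four estimates are easy to reproduce. The Python sketch below (not from the text) uses a different stream of pseudo-random numbers, so its values will not match Table 3.2 exactly, but their relative accuracy should be similar.

# Reproducing the four Monte Carlo estimates of Example 3.13 (sketch).
import numpy as np

rng = np.random.default_rng(6)
g = lambda x: np.exp(-x) * np.cos(np.pi * x / 2)
exact = 2.0 / (4.0 + np.pi**2) * (2.0 + np.pi * np.exp(-1.0))

u = rng.random(1000)
i1 = g(u).mean()                                    # simple Monte Carlo
u5 = u[:500]
i2 = (g(u5).sum() + g(1.0 - u5).sum()) / 1000.0     # antithetic sampling
v = 1.0 - np.sqrt(1.0 - u)                          # density f_V(x) = 2(1-x)
i3 = (g(v) / (2.0 * (1.0 - v))).mean()              # importance sampling
w = np.sqrt(u)                                      # density f_W(x) = 2x
i4 = (g(w) / (2.0 * w)).mean()                      # importance sampling
print("exact:", exact)
print("I1, I2, I3, I4:", i1, i2, i3, i4)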
3.7 Convergence with probability 1
Earlier in this chapter, we mentioned the existence of another
type of convergence for sequences of random variables, namely
convergence with probability 1. In the interest of completeness, we
will discuss this type of convergence although we will not make use
of it subsequently in the text; therefore, this section can be skipped
without loss of continuity.
DEFINITION. A sequence of random variables {X_n} converges
with probability 1 (or almost surely) to a random variable X
(X_n →_{wp1} X) if
\[ P\left( \left\{ \omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right\} \right) = 1. \]
What exactly does convergence with probability 1 mean? By
the definition above, if X_n →_{wp1} X then X_n(ω) → X(ω) for all
outcomes ω ∈ A with P(A) = 1. For a given ω ∈ A and ε > 0,
there exists a number n_ε(ω) such that |X_n(ω) − X(ω)| ≤ ε for all
n ≥ n_ε(ω). Now consider the sequence of sets {B_n(ε)} with
\[ B_n(\varepsilon) = \bigcup_{k=n}^{\infty} \left[ |X_k - X| > \varepsilon \right]; \]
{B_n(ε)} is a decreasing sequence of sets (that is, B_{n+1}(ε) ⊂ B_n(ε))
and its limit will contain all ω's lying in infinitely many of the
B_n(ε)'s. If ω ∈ A then, for n sufficiently large, |X_k(ω) − X(ω)| ≤ ε for
k ≥ n and so ω ∉ B_n(ε) (for n sufficiently large). Thus B_n(ε) ∩ A ↓ ∅
as n → ∞. Likewise, if ω ∉ A then ω will lie in infinitely many of
the B_n(ε)'s and so B_n(ε) ∩ A^c ↓ A^c. Thus
\[ P(B_n(\varepsilon)) = P(B_n(\varepsilon) \cap A) + P(B_n(\varepsilon) \cap A^c) \to 0 \]
as n → ∞, since P(B_n(ε) ∩ A) → 0 and P(B_n(ε) ∩ A^c) ≤ P(A^c) = 0.
Thus X_n →_{wp1} X implies that P(B_n(ε)) → 0 as n → ∞ for all
ε > 0. Conversely, if P(B_n(ε)) → 0 then, using the argument given
above, it follows that X_n →_{wp1} X. Thus X_n →_{wp1} X is equivalent
to
\[ \lim_{n\to\infty} P\left( \bigcup_{k=n}^{\infty} \left[ |X_k - X| > \varepsilon \right] \right) = 0 \]
for all ε > 0.
Using the condition given above, it is easy to see that if X_n →_{wp1}
X then X_n →_p X; this follows since
\[ \left[ |X_n - X| > \varepsilon \right] \subset \bigcup_{k=n}^{\infty} \left[ |X_k - X| > \varepsilon \right] \]
and so
\[ P(|X_n - X| > \varepsilon) \le P\left( \bigcup_{k=n}^{\infty} \left[ |X_k - X| > \varepsilon \right] \right) \to 0. \]
Note that if [|X_{n+1} − X| > ε] ⊂ [|X_n − X| > ε] for all n then
\[ P\left( \bigcup_{k=n}^{\infty} \left[ |X_k - X| > \varepsilon \right] \right) = P(|X_n - X| > \varepsilon) \]
in which case X_n →_p X implies that X_n →_{wp1} X.
EXAMPLE 3.14: Suppose that X_1, X_2, … are i.i.d. Uniform
random variables on the interval [0, 1] and define
\[ M_n = \max(X_1, \ldots, X_n). \]
In Example 3.1, we showed that M_n →_p 1. However, note that
1 ≥ M_{n+1}(ω) ≥ M_n(ω) for all ω and so
\[ \left[ |M_{n+1} - 1| > \varepsilon \right] = \left[ M_{n+1} < 1 - \varepsilon \right]
   \subset \left[ M_n < 1 - \varepsilon \right] = \left[ |M_n - 1| > \varepsilon \right]. \]
Thus
\[ P\left( \bigcup_{k=n}^{\infty} \left[ |M_k - 1| > \varepsilon \right] \right)
   = P(|M_n - 1| > \varepsilon) \to 0 \quad \text{as } n \to \infty \]
as shown in Example 3.1. Thus M_n →_{wp1} 1.
Example 3.14 notwithstanding, it is, in general, much more
difficult to prove convergence with probability 1 than it is to prove
convergence in probability. However, assuming convergence with
probability 1 rather than convergence in probability in theorems
(for example, in Theorems 3.2 and 3.3) can sometimes greatly
facilitate the proofs of these results.
EXAMPLE 3.15: Suppose that X_n →_{wp1} X and g(x) is a
continuous function. Then g(X_n) →_{wp1} g(X). To see this, let A
be the set of ω's for which X_n(ω) → X(ω). Since g is a continuous
function, X_n(ω) → X(ω) implies that g(X_n(ω)) → g(X(ω)); this
occurs for all ω's in the set A with P(A) = 1 and so g(X_n) →_{wp1} g(X).
We can also extend the WLLN to the so-called Strong Law of
Large Numbers (SLLN).
THEOREM 3.13 (Strong Law of Large Numbers) Suppose that
X_1, X_2, … are i.i.d. random variables with E(|X_i|) < ∞ and
E(X_i) = μ. Then
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \to_{wp1} \mu \]
as n → ∞.
The SLLN was proved by Kolmogorov (1930). Its proof is more
difficult than that of the WLLN but similar in spirit; see, for
example, Billingsley (1995) for details.
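The SLLN describes the behaviour of the sample mean along a single sample path. The following Python sketch (not from the text) tracks the running mean of one long sequence of Exponential(1) variables, an illustrative choice with mean 1:

# Running sample mean along one path, illustrating the SLLN (sketch).
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(1.0, size=1000000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
for n in [10, 1000, 100000, 1000000]:
    print(n, running_mean[n - 1])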
There is a very interesting connection between convergence with
probability 1 and convergence in distribution. It follows that
convergence with probability 1 implies convergence in distribution;
the converse is not true (since X_n →_d X means that the distribution
functions converge and so all the X_n's can be defined on different
sample spaces). However, there is a partial converse that is quite
useful technically.
Suppose that X_n →_d X; thus F_n(x) → F(x) for all continuity
points of F. Then it is possible to define random variables {X_n*}
and X* with X_n* ~ F_n and X* ~ F such that X_n* →_{wp1} X*.
Constructing these random variables is remarkably simple. Let U
be a Uniform random variable on the interval [0, 1] and define
\[ X_n^* = F_n^{-1}(U) \quad \text{and} \quad X^* = F^{-1}(U) \]
where F_n^{-1} and F^{-1} are the inverses of the distribution functions of
X_n and X respectively. It follows now that X_n* ~ F_n and X* ~ F.
Moreover, X_n* →_{wp1} X*; the proof of this fact is left as an exercise
but seems reasonable given that F_n(x) → F(x) for all but (at most)
a countable number of x's. This representation is due to Skorokhod
(1956).
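The construction is easy to carry out explicitly when the F_n have closed-form inverses. In the Python sketch below (not from the text), F_n is taken to be the Exponential distribution with rate 1 + 1/n and F the Exponential distribution with rate 1, an illustrative choice; a single Uniform draw U drives every X_n* = F_n^{-1}(U), and the coupled values converge as n grows.

# Quantile (Skorokhod-type) coupling with explicit inverse CDFs (sketch).
import numpy as np

rng = np.random.default_rng(9)
u = rng.random()                                 # one Uniform(0,1) draw
inv = lambda p, rate: -np.log(1.0 - p) / rate    # F^{-1} for Exponential(rate)

x_star = inv(u, 1.0)
for n in [1, 10, 100, 10000]:
    x_n_star = inv(u, 1.0 + 1.0 / n)             # X_n* = F_n^{-1}(U)
    print(n, x_n_star, "->", x_star)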
As mentioned above, the ability to construct these random
variables {X_n*} and X* is extremely useful from a technical point
of view. In the following examples, we give elementary proofs of
the Continuous Mapping Theorem (Theorem 3.2) and the Delta
Method (Theorem 3.4).
EXAMPLE 3.16: Suppose that X_n →_d X and g(x) is a
continuous function. We can then construct random variables {X_n*}
and X* such that X_n* =_d X_n and X* =_d X and X_n* →_{wp1} X*. Since
g is continuous, we have
\[ g(X_n^*) \to_{wp1} g(X^*). \]
However, g(X_n*) =_d g(X_n) and g(X*) =_d g(X) and so it follows
that
\[ g(X_n) \to_d g(X) \]
since g(X_n*) →_d g(X*).
EXAMPLE 3.17: Suppose that Z_n = a_n(X_n − θ) →_d Z where
a_n → ∞. Let g(x) be a function that is differentiable at x = θ.
We construct the random variables {Z_n*} and Z* having the same
distributions as {Z_n} and Z with Z_n* →_{wp1} Z*. We can also define
X_n* = a_n^{-1} Z_n* + θ, which will have the same distribution as X_n;
clearly, X_n* →_{wp1} θ. Thus
\[ a_n\left(g(X_n^*) - g(\theta)\right)
   = \left( \frac{g(X_n^*) - g(\theta)}{X_n^* - \theta} \right) a_n (X_n^* - \theta)
   \to_{wp1} g'(\theta) Z^* \]
since
\[ \frac{g(X_n^*) - g(\theta)}{X_n^* - \theta} \to_{wp1} g'(\theta). \]
Now since a_n(g(X_n*) − g(θ)) and a_n(g(X_n) − g(θ)) have the same
distribution, as do g'(θ)Z* and g'(θ)Z, it follows that
\[ a_n\left(g(X_n) - g(\theta)\right) \to_d g'(\theta) Z. \]
Note that we have required only existence (and not continuity) of
the derivative of g(x) at x = θ.
3.8 Problems and complements
3.1: (a) Suppose that {X_n^{(1)}}, …, {X_n^{(k)}} are sequences of random
variables with X_n^{(i)} →_p 0 as n → ∞ for each i = 1, …, k. Show
that
\[ \max_{1 \le i \le k} \left| X_n^{(i)} \right| \to_p 0 \]
as n → ∞.
(b) Find an example to show that the conclusion of (a) is not
necessarily true if the number of sequences k = k_n → ∞.
3.2: Suppose that X_1, …, X_n are i.i.d. random variables with a
distribution function F(x) satisfying
\[ \lim_{x\to\infty} x^{\alpha} (1 - F(x)) = \lambda > 0 \]
for some α > 0. Let M_n = max(X_1, …, X_n). We want to show
that n^{-1/α} M_n has a non-degenerate limiting distribution.
(a) Show that n[1 − F(n^{1/α} x)] → λ x^{-α} as n → ∞ for any x > 0.
(b) Show that
\[ P\left( n^{-1/\alpha} M_n \le x \right)
   = \left[ F(n^{1/\alpha} x) \right]^n
   = \left[ 1 - (1 - F(n^{1/\alpha} x)) \right]^n
   \to \exp\left( -\lambda x^{-\alpha} \right) \]
as n → ∞ for any x > 0.
(c) Show that P(n^{-1/α} M_n ≤ 0) → 0 as n → ∞.
(d) Suppose that the X_i's have a Cauchy distribution with
density function
\[ f(x) = \frac{1}{\pi (1 + x^2)}. \]
Find the value of α such that n^{-1/α} M_n has a non-degenerate
limiting distribution and give the limiting distribution function.
3.3: Suppose that X_1, …, X_n are i.i.d. Exponential random vari-
ables with parameter λ and let M_n = max(X_1, …, X_n). Show
that M_n − ln(n)/λ →_d V where
\[ P(V \le x) = \exp[-\exp(-\lambda x)] \]
for all x.
3.4: Suppose that X_1, …, X_n are i.i.d. nonnegative random vari-
ables with distribution function F. Define
\[ U_n = \min(X_1, \ldots, X_n). \]
(a) Suppose that F(x)/x → λ as x → 0. Show that nU_n →_d Exp(λ).
(b) Suppose that F(x)/x^α → λ as x → 0 for some α > 0. Find
the limiting distribution of n^{1/α} U_n.
3.5: Suppose that X_N has a Hypergeometric distribution (see
Example 1.13) with the following frequency function
\[ f_N(x) = \binom{M_N}{x} \binom{N - M_N}{r_N - x} \bigg/ \binom{N}{r_N} \]
for x = max(0, r_N + M_N − N), …, min(M_N, r_N). When the
population size N is large, it becomes somewhat difficult to
compute probabilities using f_N(x) so that it is desirable to find
approximations to the distribution of X_N as N → ∞.
(a) Suppose that r_N → r (finite) and M_N/N → θ for 0 < θ < 1.
Show that X_N →_d Bin(r, θ) as N → ∞.
(b) Suppose that r_N → ∞ with r_N M_N/N → λ > 0. Show that
X_N →_d Pois(λ) as N → ∞.
3.6: Suppose that {X_n} and {Y_n} are two sequences of random
variables such that
\[ a_n (X_n - Y_n) \to_d Z \]
for a sequence of numbers {a_n} with a_n → ∞ (as n → ∞).
(a) Suppose that X_n →_p θ. Show that Y_n →_p θ.
(b) Suppose that X_n →_p θ and g(x) is a function continuously
differentiable at x = θ. Show that
\[ a_n \left( g(X_n) - g(Y_n) \right) \to_d g'(\theta) Z. \]
3.7: (a) Let {X_n} be a sequence of random variables. Suppose that
E(X_n) → θ (where θ is finite) and Var(X_n) → 0. Show that
X_n →_p θ.
(b) A sequence of random variables {X_n} converges in proba-
bility to infinity (X_n →_p ∞) if for each M > 0,
\[ \lim_{n\to\infty} P(X_n \le M) = 0. \]
Suppose that E(X_n) → ∞ and Var(X_n) ≤ k E(X_n) for
some k < ∞. Show that X_n →_p ∞. (Hint: Use Chebyshev's
inequality to show that
\[ P\left[ |X_n - E(X_n)| > \varepsilon E(X_n) \right] \to 0 \]
for each ε > 0.)
3.8: (a) Let g be a nonnegative even function (g(x) = g(−x)) that
is increasing on [0, ∞) and suppose that E[g(X)] < ∞. Show
that
\[ P[|X| > \varepsilon] \le \frac{E[g(X)]}{g(\varepsilon)} \]
for any ε > 0. (Hint: Follow the proof of Chebyshev's inequality
making the appropriate changes.)
(b) Suppose that E[|X_n|^r] → 0 as n → ∞. Show that X_n →_p 0.
3.9: Suppose that X_1, …, X_n are i.i.d. Poisson random variables
with mean λ. By the CLT,
\[ \sqrt{n}(\bar{X}_n - \lambda) \to_d N(0, \lambda). \]
(a) Find the limiting distribution of √n(ln(X̄_n) − ln(λ)).
(b) Find a function g such that
\[ \sqrt{n}\left( g(\bar{X}_n) - g(\lambda) \right) \to_d N(0, 1). \]
3.10: Let {a_n} be a sequence of constants with a_n → ∞ and
suppose that
\[ a_n (X_n - \theta) \to_d Z \]
where θ is a constant.
(a) Let g be a function that is twice differentiable at θ and
suppose that g'(θ) = 0. Show that
\[ a_n^2 \left( g(X_n) - g(\theta) \right) \to_d \frac{1}{2}\, g''(\theta)\, Z^2. \]
(b) Now suppose that g is k times differentiable at θ with
g'(θ) = ⋯ = g^{(k-1)}(θ) = 0. Find the limiting distribution
of a_n^k (g(X_n) − g(θ)). (Hint: Expand g(X_n) in a Taylor series
around θ.)
3.11: The sample median of i.i.d. random variables is asymptoti-
cally Normal provided that the distribution function F has a
positive derivative at the median; when this condition fails, an
asymptotic distribution may still exist but will be non-Normal.
To illustrate this, let X_1, …, X_n be i.i.d. random variables with
density
\[ f(x) = \frac{1}{6} |x|^{-2/3} \quad \text{for } |x| \le 1. \]
(Notice that this density has a singularity at 0.)
(a) Evaluate the distribution function of X_i and its inverse (the
quantile function).
(b) Let M_n be the sample median of X_1, …, X_n. Find the
limiting distribution of n^{3/2} M_n. (Hint: use the extension of
the Delta Method in Problem 3.10 by applying the inverse
transformation from part (a) to the median of n i.i.d. Uniform
random variables on [0, 1].)
3.12: Suppose that X_1, …, X_n are i.i.d. random variables with
common density
\[ f(x) = \alpha x^{-\alpha - 1} \quad \text{for } x \ge 1 \]
where α > 0. Define
\[ S_n = \left( \prod_{i=1}^{n} X_i \right)^{1/n}. \]
(a) Show that ln(X_i) has an Exponential distribution.
(b) Show that S_n →_p exp(1/α). (Hint: Consider ln(S_n).)
(c) Suppose α = 10 and n = 100. Evaluate P(S_n > 1.12) using
an appropriate approximation.
3.13: Suppose that X_1, …, X_n are i.i.d. discrete random variables
with frequency function
\[ f(x) = \frac{x}{21} \quad \text{for } x = 1, 2, \ldots, 6. \]
(a) Let S_n = Σ_{k=1}^n k X_k. Show that
\[ \frac{S_n - E(S_n)}{\sqrt{\mathrm{Var}(S_n)}} \to_d N(0, 1). \]
(b) Suppose n = 20. Use a Normal approximation to evaluate
P(S_20 ≥ 1000).
(c) Suppose n = 5. Compute the exact distribution of S_n using
the probability generating function of S_n (see Problems 1.18
and 2.8).
3.14: Suppose that X_{n1}, X_{n2}, …, X_{nn} are independent random
variables with
\[ P(X_{ni} = 0) = 1 - p_n \quad\text{and}\quad P(X_{ni} \le x \mid X_{ni} \ne 0) = F(x). \]
Suppose that φ(t) = ∫_{−∞}^{∞} exp(tx) dF(x) < ∞ for t in a
neighbourhood of 0.
(a) Show that the moment generating function of X_{ni} is
\[ m_n(t) = p_n \phi(t) + (1 - p_n). \]
(b) Let S_n = Σ_{i=1}^n X_{ni} and suppose that n p_n → λ > 0 as
n → ∞. Show that S_n converges in distribution to a random
variable S that has a compound Poisson distribution. (Hint: See
Problem 2.7 for the moment generating function of a compound
Poisson distribution.)
3.15: Suppose that X_{n1}, X_{n2}, …, X_{nn} are independent Bernoulli
random variables with parameters θ_{n1}, …, θ_{nn} respectively.
Define S_n = X_{n1} + X_{n2} + ⋯ + X_{nn}.
(a) Show that the moment generating function of S_n is
\[ m_n(t) = \prod_{i=1}^{n} \left( 1 - \theta_{ni} + \theta_{ni} \exp(t) \right). \]
(b) Suppose that
\[ \sum_{i=1}^{n} \theta_{ni} \to \lambda > 0
   \quad\text{and}\quad
   \max_{1 \le i \le n} \theta_{ni} \to 0 \]
as n → ∞. Show that
\[ \ln m_n(t) = \lambda\left[ \exp(t) - 1 \right] + r_n(t) \]
where for each t, r_n(t) → 0 as n → ∞. (Hint: Use the fact that
ln(1 + x) = x − x²/2 + x³/3 − ⋯ for |x| < 1.)
(c) Deduce from part (b) that S_n →_d Pois(λ).
3.16: Suppose that {X_n} is a sequence of nonnegative continuous
random variables and suppose that X_n has hazard function
λ_n(x). Suppose that for each x, λ_n(x) → λ_0(x) as n → ∞
where ∫_0^∞ λ_0(x) dx = ∞. Show that X_n →_d X where
\[ P(X > x) = \exp\left[ -\int_0^x \lambda_0(t)\, dt \right]. \]
3.17: Suppose that X_1, …, X_n are independent nonnegative ran-
dom variables with hazard functions λ_1(x), …, λ_n(x) respec-
tively. Define U_n = min(X_1, …, X_n).
(a) Suppose that for some α > 0,
\[ \lim_{n\to\infty} \frac{1}{n^{\alpha}} \sum_{i=1}^{n} \lambda_i(t/n^{\alpha}) = \lambda_0(t) \]
for all t > 0 where ∫_0^∞ λ_0(t) dt = ∞. Show that n^α U_n →_d V
where
\[ P(V > x) = \exp\left[ -\int_0^x \lambda_0(t)\, dt \right]. \]
(b) Suppose that X_1, …, X_n are i.i.d. Weibull random variables
(see Example 1.19) with density function
\[ f(x) = \lambda \beta x^{\beta - 1} \exp\left( -\lambda x^{\beta} \right) \quad (x > 0) \]
where λ, β > 0. Let U_n = min(X_1, …, X_n) and find α such
that n^α U_n →_d V.
3.18: Suppose that X_n ~ χ²(n).
(a) Show that √X_n − √n →_d N(0, 1/2) as n → ∞. (Hint:
Recall that X_n can be represented as a sum of n i.i.d. random
variables.)
(b) Suppose that n = 100. Use the result in part (a) to
approximate P(X_n > 110).
3.19: Suppose that {X_n} is a sequence of random variables such
that X_n →_d X where E(X) is finite. We would like to
investigate sufficient conditions under which E(X_n) → E(X)
(assuming that E(X_n) is well-defined). Note that in Theorem
3.5, we indicated that this convergence holds if the X_n's are
uniformly bounded.
(a) Let δ > 0. Show that
\[ E\left( |X_n|^{1+\delta} \right) = (1 + \delta) \int_0^{\infty} x^{\delta} P(|X_n| > x)\, dx. \]
(b) Show that for any M > 0 and δ > 0,
\[ \int_0^{M} P(|X_n| > x)\, dx \le E(|X_n|)
   \le \int_0^{M} P(|X_n| > x)\, dx
   + \frac{1}{M^{\delta}} \int_M^{\infty} x^{\delta} P(|X_n| > x)\, dx. \]
(c) Again let δ > 0 and suppose that E(|X_n|^{1+δ}) ≤ K < ∞
for all n. Assuming that X_n →_d X, use the results of parts (a)
and (b) to show that E(|X_n|) → E(|X|) and E(X_n) → E(X)
as n → ∞. (Hint: Use the fact that
\[ \int_0^{M} \left| P(|X_n| > x) - P(|X| > x) \right| dx \to 0 \]
as n → ∞ for each finite M.)
3.20: A sequence of random variables {X_n} is said to be bounded
in probability if for every ε > 0, there exists M_ε < ∞ such that
\[ P(|X_n| > M_{\varepsilon}) < \varepsilon \quad \text{for all } n. \]
(a) If X_n →_d X, show that {X_n} is bounded in probability.
(b) If E(|X_n|^r) ≤ M < ∞ for some r > 0, show that {X_n} is
bounded in probability.
(c) Suppose that Y_n →_p 0 and {X_n} is bounded in probability.
Show that X_n Y_n →_p 0.
3.21: If {X_n} is bounded in probability, we often write X_n =
O_p(1). Likewise, if X_n →_p 0 then X_n = o_p(1). This useful
shorthand notation generalizes the big-oh and little-oh notation
that is commonly used for sequences of numbers to sequences
of random variables. If X_n = O_p(Y_n) (X_n = o_p(Y_n)) then
X_n/Y_n = O_p(1) (X_n/Y_n = o_p(1)).
(a) Suppose that X_n = O_p(1) and Y_n = o_p(1). Show that
X_n + Y_n = O_p(1).
(b) Let {a_n} and {b_n} be sequences of constants where a_n/b_n →
0 as n → ∞ (that is, a_n = o(b_n)) and suppose that X_n =
O_p(a_n). Show that X_n = o_p(b_n).
3.22: Suppose that {U_i} is an infinite sequence of i.i.d. random
variables with mean 0 and variance 1, and define {X_i} by
\[ X_i = \sum_{j=0}^{\infty} c_j U_{i-j} \]
where we assume that Σ_{j=0}^∞ |c_j| < ∞ to guarantee that the
infinite summation is well-defined.
(a) Define c̄_j = Σ_{k=j+1}^∞ c_k and define
\[ Z_i = \sum_{j=0}^{\infty} \bar{c}_j U_{i-j} \]
and assume that Σ_{j=0}^∞ c̄_j² < ∞ (so that Z_i is well-defined). Show
that
\[ X_i = \left( \sum_{j=0}^{\infty} c_j \right) U_i + Z_{i-1} - Z_i. \]
This decomposition is due to Beveridge and Nelson (1981).
(b) Using the decomposition in part (a), show that
\[ \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \to_d N(0, \sigma^2) \]
where
\[ \sigma^2 = \left( \sum_{j=0}^{\infty} c_j \right)^2. \]
3.23: Suppose that A_1, A_2, … is a sequence of events. We are some-
times interested in determining the probability that infinitely
many of the A_k's occur. Define the event
\[ B = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k. \]
It is possible to show that an outcome ω lies in B if, and only
if, it belongs to infinitely many of the A_k's. (To see this, first
suppose that an outcome ω lies in infinitely many of the A_k's.
Then it belongs to B_n = ∪_{k=n}^∞ A_k for each n ≥ 1 and hence
in B = ∩_{n=1}^∞ B_n. On the other hand, suppose that ω lies in B;
then it belongs to B_n for all n ≥ 1. If ω were in only a finite
number of A_k's, there would exist a number m such that A_k
did not contain ω for k ≥ m. Hence, ω would not lie in B_n for
n ≥ m and so ω would not lie in B. This is a contradiction, so ω
must lie in infinitely many of the A_k's.)
(a) Prove the first Borel-Cantelli Lemma: If Σ_{k=1}^∞ P(A_k) < ∞
then
\[ P(A_k \text{ infinitely often}) = P(B) = 0. \]
(Hint: note that B ⊂ B_n for any n and so P(B) ≤ P(B_n).)
(b) When the A_k's are mutually independent, we can streng-
then the first Borel-Cantelli Lemma. Suppose that
\[ \sum_{k=1}^{\infty} P(A_k) = \infty \]
for mutually independent events {A_k}. Show that
\[ P(A_k \text{ infinitely often}) = P(B) = 1; \]
this result is called the second Borel-Cantelli Lemma. (Hint:
Note that
\[ B^c = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c \]
and so, since the sets ∩_{k=n}^∞ A_k^c increase to B^c and the A_k's are
independent,
\[ P(B^c) = \lim_{n\to\infty} P\left( \bigcap_{k=n}^{\infty} A_k^c \right)
   = \lim_{n\to\infty} \prod_{k=n}^{\infty} (1 - P(A_k)). \]
Now use the fact that ln(1 − P(A_k)) ≤ −P(A_k).)
3.24: Suppose that {X_k} is an infinite sequence of identically
distributed random variables with E(|X_k|) < ∞.
(a) Show that for ε > 0,
\[ P\left[ \left| \frac{X_k}{k} \right| > \varepsilon \text{ infinitely often} \right] = 0. \]
(From this, it follows that X_n/n → 0 with probability 1 as
n → ∞.)
(b) Suppose that the X_k's are i.i.d. Show that X_n/n → 0 with
probability 1 if, and only if, E(|X_k|) < ∞.
3.25: Suppose that X_1, X_2, … are i.i.d. random variables with
E(X_i) = 0 and E(X_i^4) < ∞. Define
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
(a) Show that E[|X̄_n|^4] ≤ k/n² for some constant k. (Hint: To
evaluate
\[ \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{\ell=1}^{n}
   E\left[ X_i X_j X_k X_{\ell} \right], \]
note that most of the n⁴ terms in the fourfold summation are
exactly 0.)
(b) Using the first Borel-Cantelli Lemma, show that X̄_n →_{wp1} 0.
(This gives a reasonably straightforward proof of the SLLN,
albeit under much stronger than necessary conditions.)