
Bayesian Econometrics: Applications in Economics & Finance

Topic 1: Overview of Bayesian Econometric Methods

Daniel Buncic
Sveriges Riksbank

August 20, 2018


Lecture Notes Version: [ibe1a]

Homepage
www.danielbuncic.com
Outline

Introduction, Intuition and Background
– History of Probability
– Different views of Probability
– Bayesian view of Probability
– Bayesian statistical modelling
– Some General Examples
– Main differences between Bayesian and Frequentist views
– Relation to other Estimation Methods

Some Model Based Examples
– Example 1: Binary Model
– Example 2: Poisson Model
– Example 3: Normal Model
– Example 4: Regression Model

References
History of Probability

Origins of Probability Theory


• the notion of probability had its origins in the 17th and very early 18th centuries (see
Chapter 1 in DeGroot and Schervish (2010) and Chapter 3 in Spanos (1986))
• ‘classical’ definitions of probability were formulated by Jacob Bernoulli in Ars
Conjectandi (1713) and Abraham De Moivre in The Doctrine of Chances (1718)
• interest in probability theory arose as a consequence of the (notorious)
gambling activities in France at that time
• games of chance, and with them probability theory, became a popular subject of
study as a consequence
• one of the major obstacles in developing a mathematical theory of probability
has been to arrive at a definition of probability that is precise enough for use in
mathematics
• yet the definition needs to be comprehensive enough to be applicable to a wide range of
phenomena and views on probability

• a widely acceptable definition of probability took another two centuries to be
developed and was marked by much controversy
• the matter was finally resolved in the 20th century by treating probability
theory on an axiomatic basis
• in ”Grundbegriffe der Wahrscheinlichkeitsrechnung” (1933), Andrei
Kolmogorov outlined an axiomatic approach that forms the basis for the
modern theory of probability
• since then, the ideas have been refined somewhat and probability theory is now
part of a more general mathematical discipline, measure theory
• a historical overview of the sources of Kolmogorov’s axioms is given in Shafer
and Vovk (2006)

Definition (Kolmogorov’s Three Axioms of Probability)

Let (Ω, F, P) be a probability space, with Ω the sample space, F the event space and
P the probability measure, and let Ai denote the events in F.
1) P(Ai) ≥ 0 and P(Ai) is a real number (ie., P(Ai) ∈ R)
2) P(Ω) = 1 (needed for P(·) to be a valid probability measure)
3) if A1, A2, . . . , An is a sequence of n mutually exclusive events, ie., events with no
common elements (Ai ∩ Aj = ∅ for i ≠ j), then

P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An).

(adapted from the English translation of Kolmogorov (1933), page 2.)

Corollary (to Kolmogorov’s Three Axioms of Probability)

1) Let Aᶜ be the complement of event A, so that A ∪ Aᶜ = Ω. Then it follows that

P(A) + P(Aᶜ) = 1  ⇔  P(Aᶜ) = 1 − P(A).    (1)

2) If P(Aj) ≠ 0, then the conditional probability of Ai given Aj is defined as

P(Ai|Aj) = P(Ai ∩ Aj)/P(Aj).    (2)

3) Events Ai and Aj are said to be statistically independent if

P(Ai ∩ Aj) = P(Ai)P(Aj).    (3)

(adapted from the English translation of Kolmogorov (1933), pages 6-10.)
Different views of Probability

What is Probability?
There exists no agreement on a formal definition or interpretation of probability!
• all that we have is a set of rules (axioms) that define the mathematical
properties of probability and its building blocks

There are three common operational interpretations of probability:

1) relative frequency approach
– empirical frequencies in a large number of trials n have an ”on average” or
”long-run” interpretation; with nA the # of times event A is observed (see the
simulation sketch below):

P(A) = lim_(n→∞) nA/n.    (4)

2) combinatorial (classical) definition
– N mutually exclusive/equally likely outcomes, with NA the # of times event A can occur:

P(A) = NA/N.    (5)

3) subjective interpretation
– probability as meaning the ”degree of belief” in an event held by an individual,
– whatever way it is formulated, ie., based on past info, combinatorial probabilities, etc.
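A minimal Matlab sketch of the ”long-run” idea in (4) (all variable names are illustrative): simulate fair-coin tosses and track the running relative frequency nA/n, which settles near 0.5 as n grows.

% simulate n tosses of a fair coin and track the relative frequency of heads
rng(1);                             % fix the seed for reproducibility
n       = 1e5;                      % number of trials
tosses  = rand(n,1) < 0.5;          % 1 = heads, with probability 0.5
relfreq = cumsum(tosses)./(1:n)';   % n_A/n after each toss
disp(relfreq([10 100 1000 n]))      % converges towards P(A) = 0.5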


• the ”degree of belief” view is thus a quite general concept and allows for a
”subjective” as well as a ”classical” definition of probability
• the ”classical” definition of probability was formulated by Jacob Bernoulli in Ars
Conjectandi (1713) and Abraham De Moivre in The Doctrine of Chances (1718)
• the conditions for each trial or experiment have to remain the same for the
”relative frequency” view of probability to be valid
• what does lim_(n→∞) mean operationally? if only a finite number of trials can
ever be conducted, then the ”relative frequency” view of probability cannot
be made operational
• the ”frequentist” interpretation of probability cannot be applied to situations
where there is no precedent for the event that one is interested in, or when the
conditions of the experiment change
• the ”combinatorial” interpretation is not feasible when there is an infinite set
of possible equally likely outcomes

Example (Coin Tossing)

Suppose that you are interested in deducing the ”fairness” of a particular type of coin
that someone gave to you, ie., whether it has probability of heads = tails = 0.5.
The frequentist view of probability would take the long-run relative frequency of
heads as an indication of the probability of heads.
What happens if the coin that is used in the experiment is switched after n tosses to
some other coin? Can we still use the relative frequency view of probability to
analyse whether the coin is fair or not?
There are many arguments that point out that the conditions for an experiment of
this sort are very difficult to keep constant, so that there is a violation of the basic
assumptions behind the relative frequency definition of probability in (4).

Example (Fund performance)

Suppose that you have to decide where to invest a given amount of money and you
review a group of fund managers based on their past performance. Given the past
performance, you can create a ranking based on which manager had the highest
(risk adjusted) return.
• what are you implicitly assuming when you apply this sort of reasoning to
figure out where to invest your money?
• what set of conditions is violated when you think about a frequentist view of
the outcomes?
• you often see the caveat ”past performance is not an indication of future
performance” in the brochures or on the websites of fund managers. What does
that mean in this context?
Bayesian view of Probability

Thinking differently about probability

• rather than thinking of probabilities as the relative frequencies of outcomes in a
random experiment, think of probabilities as the ”degree of belief” in a
proposition
• this proposition does not need to be thought of within the context of a random
experiment or random variables (RVs) in general
– Mr X killed Mr Y, and the court has to assess how probable it is that this
proposition is true, given the evidence
– Sweden would have won the world cup, if ...
– the signature on a particular (individual) check is not genuine
– Mr Z has a genetic disease, given the result of a test he has taken
– etc.

• most people are likely to use probability within a relative frequency context of
observed outcomes from the past
• but probability is very often also based on the subjective view, that is, the
”degree of belief” view, where a probability can be assigned to a (personal or
individual) proposition without it ever having been observed
• ”degrees of belief” can be mapped onto probabilities if they satisfy simple
rules of consistency, which are known as the Cox axioms (Cox, 1946)
• the Cox axioms of probability ensure that if two people make the same prior
assumptions and are given the same data, then they will draw identical
conclusions
• this more general view of probability used to quantify beliefs is known as the
Bayesian viewpoint, or as the subjective interpretation of probability

Probabilistic modelling as an inversion problem

• the purpose of statistical modelling is fundamentally an inversion problem
(Robert (2007), page 8)
• we try to deduce the causes (as captured by the parameters of the probabilistic
data generating process) from the effects (which are contained in the sample
data)
• statistical methods allow us to draw from the sample data an inference about
the unknown parameter vector θ
• probabilistic modelling allows us to characterise the behaviour of (future or
out-of-sample) observations conditional on the model and the parameter vector θ.

• the inversion property of statistics is evident in the notion of the likelihood
function L(θ|x) and the probability density function (PDF) p(x|θ), with x a
vector of i.i.d. sample data from random variable X, where

L(θ|x) = p(x|θ) = ∏ᴺᵢ₌₁ p(xi|θ)

• thus, given θ, we can plot a PDF for different values of x,
• and given x, we can plot a likelihood function for different values of θ.

A general definition of the inversion problem is given by Bayes’ Theorem.

Theorem (Bayes’ Theorem)

If A and B are two events such that P(B) ≠ 0, then

P(A|B) = P(B|A)P(A)/P(B)    (6)

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)]    (7)

Bayes’ Theorem follows directly from the conditional probability corollary in (2).
By definition we have

P(B|A)P(A) = P(B ∩ A)

and also

P(B) = P(B ∩ A) + P(B ∩ Aᶜ)
     = P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)    (8)

which, together with (6), results in (7).

Bayes’ Theorem provides the basis for conducting Bayesian statistical inference.
Bayesian modelling combines, in an optimal way, all relevant sample information and
prior beliefs by using Bayes’ Theorem.
Bayesian statistical modelling

Definition (Likelihood Principle)

Suppose there are two experiments, E1 := (x1, θ, {p1(x1|θ)}) and
E2 := (x2, θ, {p2(x2|θ)}), where θ is the unknown parameter of interest (the same in
both experiments), and xi and pi(xi|θ) are the vectors of observed sample data and
corresponding PDFs for experiments i = 1, 2, respectively. Suppose that

L(θ|x1) = C(x1, x2) L(θ|x2), ie., L(θ|x1) ∝ L(θ|x2)

for all θ, where C(x1, x2) is constant with respect to θ; then the information content
in the two likelihoods L(θ|x1) and L(θ|x2) with respect to θ is the same.

(adapted from Casella and Berger (2001) pages 293-294.)

The Likelihood Principle states that all of the relevant information that is available in
a sample of data is contained in its likelihood function.

The Bayesian Statistical Model can then be defined as follows:

Definition (Bayesian Statistical Model)


A Bayesian statistical model is constructed from a parametric statistical model
p(x|θ) (the likelihood function) and a prior distribution on the parameters π(θ).
(adapted from Robert (2007) page 9.)


Optimal updating of prior beliefs and Bayes’ Theorem

• combining the definitions of the Likelihood Principle, Bayes’ Theorem and
the Bayesian Statistical Model, we get the following relation:

p(θ|x) = p(x|θ)π(θ)/p(x)
       = p(x|θ)π(θ) / ∫θ p(x|θ)π(θ)dθ    (9)
p(θ|x) ∝ L(θ|x)π(θ)

where
– ∝ is the ”proportional to” operator
– p(θ|x) is the posterior probability of θ after observing the data
– L(θ|x) ≡ p(x|θ) is the likelihood function, which is equal to the PDF of x given θ
– π(θ) is the prior
– p(x) ≠ 0 is the marginal density of the data, which does not depend on θ and is
assumed to exist

The relation in (9) is the key relation and describes how we proceed under a
Bayesian modelling approach.
This can be summarised as follows:
1) given your current knowledge (or the information available to you), form a prior
belief about the proposition of interest and express it with π(θ)
2) collect sample data and formulate a (parametric) statistical model, expressed
as the likelihood L(θ|x)
3) update your prior beliefs expressed by π(θ) with the new ”information” that
is available and contained in the likelihood L(θ|x), using the rule in (9) as:

p(θ|x) ∝ L(θ|x)π(θ)

This way of updating a rational person’s (personal or subjective) beliefs about the
parameter vector θ as new information becomes available is known as Bayesian
Updating.
• Cox (1946) and Savage (1954) proved that if p(x|θ) and π(θ) represent a
rational person’s beliefs about the data and the prior, then using Bayes’
Theorem to update one’s beliefs is optimal from a decision theoretic
perspective
• thus, from the viewpoint of decision theory, Bayesian Updating is the only
optimal way to update our beliefs about a proposition, and the updating
should be done using Bayes’ Theorem
• more background on the information/decision theoretic foundations of
Bayesian statistics can be found in Savage (1954), Cox (1946) and in the more
recent treatments in Bernardo and Smith (1994) and Robert (2007).
Some General Examples of Bayesian Probability

Example (Simple Scenario S)

Suppose you are interested in a scenario S that may arise due to (only) two different
parameter (or model) settings θ1 and θ2. Also, someone tells you (ie., the data via
the likelihood) that scenario S is more likely under θ2 than under θ1, so that the
conditional probabilities are P(S|θ1) = 0.1 and P(S|θ2) = 0.6.
Further, you assign equally likely prior probabilities for the θi, P(θ1) = P(θ2) = 0.5. Now,
P(S) = P(S|θ1)P(θ1) + P(S|θ2)P(θ2) = 0.1 × 0.5 + 0.6 × 0.5 = 0.35.
Using Bayes’ Theorem, we can work out the posterior beliefs as:

P(θ1|S) = P(S|θ1)P(θ1)/P(S) = (0.1 × 0.5)/0.35 = 1/7
P(θ2|S) = P(S|θ2)P(θ2)/P(S) = (0.6 × 0.5)/0.35 = 6/7

thus, after updating our beliefs, θ2 is 6 times more likely than θ1.
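The same arithmetic in a few lines of Matlab (a sketch of the numbers above; variable names are illustrative):

pS_given = [0.1 0.6];          % P(S|theta_1) and P(S|theta_2)
prior    = [0.5 0.5];          % prior P(theta_1) = P(theta_2) = 0.5
pS       = pS_given*prior';    % P(S) = 0.35 (law of total probability)
post     = pS_given.*prior/pS  % posterior beliefs [1/7, 6/7]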


Example (Individual Probability)

Suppose Bob has to take a test for some disease. Let D = 1 if Bob has the disease
and 0 otherwise, and let T = 1 if Bob returns a positive test result and 0 otherwise.
The test is 95% reliable, meaning that 95% of people who have the disease return
a positive test result (P(T = 1|D = 1) = 95%) and also 95% of people who do not
have the disease return a negative test result (P(T = 0|D = 0) = 95%).
Also, 1% of people that have similar characteristics to Bob (ie., age, etc.) have the
disease (P(D = 1) = 1%).
Suppose now that Bob gets a positive test result (T = 1); what is the probability that
Bob actually has the disease (D = 1)?
Frequentist view: P(D = 1|T = 1) = 95% (95% accuracy)

Example (Individual Probability cont.)

Bayesian view: use the prior P(D = 1) and update based on the test result. That is,

P(D = 1|T = 1) = P(T = 1, D = 1)/P(T = 1)
               = P(T = 1|D = 1)P(D = 1) / [P(T = 1|D = 1)P(D = 1) + P(T = 1|D = 0)P(D = 0)]
               = 0.95 × 0.01 / (0.95 × 0.01 + 0.05 × 0.99)
               = 0.161

Thus, despite the positive test result, the (posterior) probability that Bob has the
disease given that he tested positive is only 16.1%.
This result is very different from the typical frequentist interpretation.
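A minimal Matlab sketch reproducing this posterior calculation:

pT1_D1 = 0.95;  pT0_D0 = 0.95;  pD1 = 0.01;  % test reliability and prior
pT1_D0 = 1 - pT0_D0;                         % false positive rate = 0.05
pT1    = pT1_D1*pD1 + pT1_D0*(1 - pD1);      % P(T = 1) = 0.059
pD1_T1 = pT1_D1*pD1/pT1                      % = 0.1610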

Main differences between Bayesian and Frequentist views

Textbook style arguments of how views on probability differ

Frequentist
• all probabilities are in a repeated sampling context, not tailored to the individual
or to an individual proposition
• this becomes most evident when interpreting a (say) 95% confidence interval
(CI)
• a 95% CI does not mean that there is a 95% chance that the true parameter is
within this interval
• the 95% CI only holds in a repeated sampling context!, ie., as the # of trials → ∞
• Frequentists view the population parameter of interest as fixed, while the
estimate is a random variable

Bayesian

• a 95% interval constructed from the posterior density (or posterior probabilities)
can be interpreted directly, eg., ”there is a 16.1% chance that Bob has the
disease”
• Bayesians view the population parameter as a random variable itself
• the MLE point estimate, once the data have been observed, is fixed or
non-random given the sample data
Relation to other Estimation Methods

Other Estimation Methods

• the three main estimation methods that you have encountered so far are
– OLS
– GMM
– MLE
• each has a different set of assumptions attached to it
– OLS needs E(εtXt) = 0 for consistency
– GMM requires the instruments used in the moment conditions to be non-weak for
consistency
– MLE requires a correct specification of the model and hence of the likelihood
function
• we know that MLE is the most efficient one, as its variance attains the
Cramér-Rao lower bound of the inverse of the information matrix
• OLS and GMM can be as efficient under certain conditions

How does Bayesian estimation relate to any of these other methods?

• Bayesian estimation (in general) requires a fully specified parametric model
– we need to supply p(x|θ) = L(θ|x), that is, the density function (and hence the
likelihood function) of the data that we are analysing
– this is the same as MLE, thus the Bayesian approach is also a full information
approach
– any errors that can come into MLE will thus also be present in Bayesian estimation
• the advantage of the Bayesian approach over MLE is that there are two sources
of information
– the likelihood p(x|θ) = L(θ|x) and
– the prior π(θ)
– any identification problems that can occur in parametric models in general can be
dealt with by using priors as extra sources of information
– this is how DSGE models are estimated when weakly identified or unidentified
parameters are present in the model

• identification through priors is thus another constraint (or structure) that is
imposed on the model from outside
• it is an assumption that is imposed
– this may not be consistent with the data
– there is no way to use the data to test whether this prior is supported or not, as
one would normally do in a frequentist (MLE) set-up to see if a restriction is valid
• priors can thus be very powerful, but also potentially another source of error
that comes into the estimation procedure
• priors can solely determine the shape and mode of the posterior if the
likelihood function is flat (ie., if the data contain no information about the
model parameters)
• but, as the sample size goes to infinity, the Bayesian point estimator converges
to the MLE.
Some Model Based Examples
Example 1: Binary Model

Binary Model
Suppose our variable of interest is Y, which is binary, so that

p(y|θ) = θ^y (1 − θ)^(1−y),  y = 0, 1 and θ ∈ [0, 1]    (10)

and we observe the sample {yi}ᴺᵢ₌₁ with N = 130 and ∑ᴺᵢ₌₁ yi = 100, so that the
number of times yi = 1 is 100 and hence the number of times yi = 0 is 30.
Assuming that the data are i.i.d., we can form the joint density function for
y = (y1, . . . , yN) as

p(y|θ) = ∏ᴺᵢ₌₁ θ^(yi) (1 − θ)^(1−yi)    (11)
       = θ^(∑ᴺᵢ₌₁ yi) (1 − θ)^(∑ᴺᵢ₌₁ (1−yi))    (12)
       = θ^(N ȳ) (1 − θ)^(N(1−ȳ))    (13)

where ȳ = N⁻¹ ∑ᴺᵢ₌₁ yi.

Frequentist MLE
The MLE would maximise the likelihood function L(θ|y) = p(y|θ) w.r.t. θ.

Figure 1: Plot of L(θ|y) for θ ∈ [0, 1].

Analytically, we would get the MLE of θ by setting the derivative of L(θ|y) = p(y|θ)
w.r.t. θ to zero. As always with MLE, we actually work with the log-likelihood
ln(L(θ|y)) and set its derivative to zero, that is:

∂ ln(L(θ|y))/∂θ = 0
∂[N ȳ ln(θ) + N(1 − ȳ) ln(1 − θ)]/∂θ = 0
N ȳ/θ = N(1 − ȳ)/(1 − θ)
N ȳ − N ȳθ = θN − θN ȳ
θ̂MLE = ȳ = 100/130 = 0.7692.

The variance of θ̂MLE can be found as the inverse of the information matrix I(θ), where

I(θ) = −E[∂²[N ȳ ln(θ) + N(1 − ȳ) ln(1 − θ)]/∂θ²]
     = −E[−N ȳ θ⁻² − N(1 − ȳ)(1 − θ)⁻²]
     = N E(ȳ) θ⁻² + N(1 − E(ȳ))(1 − θ)⁻²,  with E(ȳ) = θ
     = N θ⁻¹ + N(1 − θ)⁻¹
     = N[θ⁻¹ + (1 − θ)⁻¹]
     = N[θ(1 − θ)]⁻¹

so Var(θ̂MLE) = I(θ)⁻¹ = θ(1 − θ)/N which, evaluated at θ̂MLE = 100/130, yields
0.001365.
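Numerically, with the sample information above (a Matlab sketch):

N = 130;  ybar = 100/130;                % sample size and sample mean
theta_mle = ybar                         % = 0.7692
var_mle   = theta_mle*(1 - theta_mle)/N  % = 0.001365 (inverse information)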


Bayesian
Under a Bayesian approach we need to specify a prior π(θ) and then compute the
posterior p(θ|y) from the relation

p(θ|y) = p(y|θ)π(θ)/p(y) = L(θ|y)π(θ)/p(y)    (14)

We know that θ ∈ [0, 1]. If we are uncertain about what the most likely value of θ is
before seeing the data, we can assign an uninformative prior, so that θ has equal
probability of falling anywhere in the interval [0, 1].
We can let θ ∼ Uniform(a, b), with a = 0 and b = 1.

Recall that a continuous Uniform RV X ∈ (a, b) has PDF

p(x|a, b) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise.    (15)

With a = 0 and b = 1 we get the Uniform prior for θ as π(θ) = 1. From (14) we can
then form the posterior

p(θ|y) = L(θ|y)π(θ)/p(y)    (16)
       = θ^(N ȳ)(1 − θ)^(N(1−ȳ)) / p(y)    (17)
       ∝ θ^(N ȳ)(1 − θ)^(N(1−ȳ))    (18)

where the right hand side of (18) is the kernel.
What kind of density is (18)?

A trick that we will frequently employ is to look at the numerator of (17) only,
without considering the marginal density of the data, p(y), at all.
• the right hand side of (18) is often referred to as the kernel of a density
• the kernel of a density determines its shape, and hence is all that matters
– but the kernel does not integrate to 1 as a proper PDF should, but to some other
constant, which we can call C, ie.,

∫θ θ^(N ȳ)(1 − θ)^(N(1−ȳ)) dθ = C ≠ 1.

– so the kernel tells us about the type of density we are dealing with
– and the kernel divided by C gives a proper density, so that it integrates to 1
• the posterior is a proper density, so it must integrate to 1, ie.,

∫θ p(θ|y)dθ = 1.

Because p(θ|y)p(y) = L(θ|y)π(θ), we thus know that

∫θ p(θ|y)p(y)dθ = ∫θ θ^(N ȳ)(1 − θ)^(N(1−ȳ)) dθ
p(y) ∫θ p(θ|y)dθ = ∫θ θ^(N ȳ)(1 − θ)^(N(1−ȳ)) dθ

and since ∫θ p(θ|y)dθ = 1, it follows that

p(y) = C.    (19)

• once we know what kind of posterior density we are dealing with, we can work
out the normalising constant C easily.

Recall that RV Z ∈ [0, 1] ⊂ R has a Beta(α, β) distribution if its PDF is

p(z|α, β) = [1/B(α, β)] z^(α−1)(1 − z)^(β−1),  ∀α, β > 0,    (20)

where both α and β are shape parameters, z^(α−1)(1 − z)^(β−1) is the kernel, and

B(α, β) = ∫z z^(α−1)(1 − z)^(β−1) dz = Γ(α)Γ(β)/Γ(α + β),  ∀α, β > 0,

with Γ(α) = (α − 1)Γ(α − 1) = (α − 1)! (! is the factorial operator), which is the
standard Gamma function from calculus.

Let us now compare the kernel of (18) with that of (20).

Note that these are the same!
We have θ = z, α − 1 = N ȳ and β − 1 = N(1 − ȳ), so that the posterior is

p(θ|y) = Beta(α, β) = Beta(N ȳ + 1, N(1 − ȳ) + 1)    (21)

and the term p(y) = C from (19) is
∫ θ^(α−1)(1 − θ)^(β−1) dθ = Γ(α)Γ(β)/Γ(α + β) = B(α, β).

From our earlier numerical values we have:
• α = N ȳ + 1 = 101
• β = N(1 − ȳ) + 1 = 31
• α + β = N ȳ + 1 + N(1 − ȳ) + 1 = N + 2 = 132

so that

C = Γ(101)Γ(31)/Γ(132)
  = 100! × 30!/131!
  = 2.9221e−032  (a pretty small number!)

[the Matlab command beta(101,31) computes what we need]

Why do we get this small number?
Recall that p(θ|y) = L(θ|y)π(θ)/p(y).
Under our setting we had π(θ) = 1, and with p(y) = C = 2.9221e−032 we get

p(θ|y) = L(θ|y)/p(y) = L(θ|y)/C

• thus, the posterior is just a re-scaled version of the likelihood function, with the
same shape as the likelihood function!
• this is due to the prior being π(θ) = 1
• the maximum of L(θ|y) was at 3.1696e−031 (see Figure 1)
• the maximum of the posterior (ie., at the mode of θ|y) thus has to be

max(p(θ|y)) = max(L(θ|y))/C = 3.1696e−031/2.9221e−032 ≈ 10.8469

which we can see from Figure 2 below.
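A quick numerical check of this re-scaling in Matlab (a sketch; the grid is illustrative):

N = 130;  ybar = 100/130;
theta = linspace(0.001, 0.999, 999);                   % grid over [0,1]
L     = theta.^(N*ybar).*(1 - theta).^(N*(1 - ybar));  % likelihood L(theta|y)
C     = beta(101, 31);                                 % p(y) = 2.9221e-032
post  = L/C;                                           % Beta(101,31) posterior
[max(L) max(post)]                                     % ~ [3.1696e-031, 10.8469]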


Figure 2: Plot of the posterior p(θ|y) and the prior π(θ) = 1.

What do we do with the posterior?

• note that the posterior contains all the probabilistic information about θ once
we have observed the data y!
• so we can compute any moments of interest, quantiles, tail probabilities, etc.
from the posterior.

We know that

p(θ|y) = Beta(α, β)    (22)

where α = N ȳ + 1 and β = N(1 − ȳ) + 1.
From the moments of a Beta distribution we know that, if θ|y ∼ Beta(α, β), then
• E(θ|y) = α/(α + β),
• Var(θ|y) = αβ/[(α + β)²(α + β + 1)], and
• mode(θ|y) = (α − 1)/(α + β − 2), if α, β > 1.

To get a Bayesian point estimate (at the centre) of θ|y from (22), we can look at
any of the three well known measures of central tendency, ie.,
1) the mode
2) the mean
3) or the median of the posterior (no closed form for the Beta PDF)
Which one we end up using depends on our loss function.
If a quadratic loss function is used, we get the standard result that the mean of the
posterior should be used, which is

E(θ|y) = α/(α + β)
       = (N ȳ + 1)/(N ȳ + 1 + N(1 − ȳ) + 1)
       = (N ȳ + 1)/(N + 2)
       = 101/132
       = 0.76515.

(see also Chapter 3 in Koop et al. (2007) for various types of loss functions)

If we use the posterior mode, then

mode(θ|y) = (α − 1)/(α + β − 2)
          = N ȳ/(N ȳ + N(1 − ȳ))
          = N ȳ/N
          = 100/130

so we get the same result as from θ̂MLE = 0.7692.

The posterior variance can easily be found from

Var(θ|y) = αβ/[(α + β)²(α + β + 1)]
         = (N ȳ + 1)(N(1 − ȳ) + 1)/[(N + 2)²(N + 3)]

which is 0.001351.
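The posterior moments can be verified in a couple of Matlab lines:

a = 101;  b = 31;                        % posterior Beta parameters
post_mean = a/(a + b)                    % = 101/132 = 0.76515
post_mode = (a - 1)/(a + b - 2)          % = 100/130 = 0.7692
post_var  = a*b/((a + b)^2*(a + b + 1))  % = 0.001351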


Use of priors
In the previous example we used a Uniform prior for θ
• so we effectively imposed no prior information on θ
• we may, however, want to express our prior beliefs more accurately through a
more informative prior
• all the prior in this example has to satisfy is the requirement that θ ∈ [0, 1] and
that θ is continuous, so any density that satisfies these conditions can be used, ie.,
– Beta, Minimax, Noncentral Beta, Standard Power distribution, etc. (see Leemis
and McQueston (2008) for more details on these distributions)
• the Beta distribution (and also the Noncentral Beta) is particularly interesting
because it is known to be a conjugate prior distribution for our Binary
model here.

Definition (Conjugate Prior)

A class P of prior distributions for θ is called a conjugate prior for the likelihood
based model p(y|θ) if
π(θ) ∈ P ⇒ p(θ|y) ∈ P,
that is, the prior and the posterior belong to the same class of distributions.
(adapted from Hoff (2009) page 38.)

Definition (Natural Conjugate Prior)

A prior distribution for θ is called a natural conjugate prior if it has the same
properties as a conjugate prior, with the additional requirement that the likelihood
function also belongs to the same class of distributions (ie., L(θ|y) = p(y|θ) ∈ P).
(see for example Koop (2003) page 18.)

Fact: Beta and Uniform densities

A Beta(α, β) density with α = β = 1 is a Uniform(a, b) density (on the interval
[0, 1]) with a = 0 and b = 1.
That is, a Beta distribution with α = β = 1 yields

p(θ|α, β) = [1/B(α = 1, β = 1)] θ^(1−1)(1 − θ)^(1−1) = 1,  ∀θ ∈ [0, 1]

where
B(α = 1, β = 1) = Γ(1)Γ(1)/Γ(2) = 1.

Using the more general (conjugate) Beta prior for the parameter of interest θ in this
set-up allows us to work with a model that still yields a Beta distribution as the
posterior.
• this is the main advantage of using conjugate priors: posteriors are available
in closed form (ie., analytically)
• the Beta distribution is very flexible and can take on many different shapes (see
Figure 3), and it approaches a Normal density as a limiting case as α, β → ∞
• it can thus replicate many different prior assumptions, where one can put flat
or uninformative priors as well as highly informative ones over a given region of
importance to the investigator
• for example, for AR(1) models, we can use the Beta prior to restrict the
parameter interval to [0, 1] with a peak at a value of, say, around 0.9, if we
know (or believe) a series has fairly high persistence
Figure 3: Beta density plots for different values of α and β; panel (a) shows general Beta
densities and panel (b) symmetric Beta densities (α = β).

The Beta prior in action

Using the same likelihood function as in (13) and combining it with a Beta prior for θ,
ie., θ ∼ Beta(α0, β0), where α0, β0 are the hyperparameters of the prior π(θ), we get:

p(θ|y) ∝ L(θ|y) × π(θ)
       ∝ θ^(N ȳ)(1 − θ)^(N(1−ȳ)) × θ^(α0−1)(1 − θ)^(β0−1)    [kernel of L(θ|y) × kernel of π(θ)]
       ∝ θ^(N ȳ+α0−1)(1 − θ)^(N(1−ȳ)+β0−1)
       ∝ θ^(ᾱ−1)(1 − θ)^(β̄−1)    [Beta(ᾱ, β̄) kernel]

θ|y ∼ Beta(ᾱ, β̄),    (23)

so the posterior distribution for θ will be a Beta(ᾱ, β̄) distribution, with

ᾱ = N ȳ + α0 and β̄ = N(1 − ȳ) + β0,

where ᾱ and β̄ are the posterior parameters of interest.
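A sketch of the updating formulas in Matlab, with illustrative hyperparameter choices α0 = β0 = 2 (a mildly informative prior centred at 0.5):

N = 130;  ybar = 100/130;
alpha0 = 2;  beta0 = 2;          % illustrative prior hyperparameters
abar = N*ybar + alpha0;          % posterior shape = 102
bbar = N*(1 - ybar) + beta0;     % posterior shape = 32
post_mean = abar/(abar + bbar)   % = 102/134 = 0.7612, shrunk towards the prior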



We can again compute moments, etc. of interest as before.

Note that we have to specify the prior hyperparameters α0 and β0 of π(θ), so this
creates another layer of flexibility as well as of ”ambiguity”.
The Beta conjugate prior result also holds if Y is a Binomial RV, ie., the number of
successes in a sequence of n independent trials of a binary outcome, with PDF:

p(y|θ, n) = (n choose y) θ^y (1 − θ)^(n−y),  y = 0, 1, 2, . . . , n,    (24)

as the kernel of (24) is still θ^y (1 − θ)^(n−y).

The same is true if Y is a Negative Binomial or Geometric RV (see Leemis and
McQueston (2008) for details regarding the PDFs).

Figure 4: Plots of different priors and posteriors for Binomial RV.

Some Model Based Examples
Example 2: Poisson Model

The Poisson Model

Suppose we have count data, such that random variable Y takes on the values
0, 1, 2, . . . . The density for observation i is a Poisson PDF, taking the form:

p(yi|θ) = θ^(yi) exp{−θ}/yi!,  ∀yi = 0, 1, 2, . . .

where {θ > 0} ∈ R. With an i.i.d. sample of size N, we can form the joint density
and likelihood as:

p(y|θ) = ∏ᴺᵢ₌₁ p(yi|θ)
       = ∏ᴺᵢ₌₁ θ^(yi) exp{−θ}/yi!
L(θ|y) = K θ^(N ȳ) exp{−θN},  where K = ∏ᴺᵢ₌₁ (1/yi!).

Frequentist MLE
The ML estimate is again obtained by setting the derivative of
ln(L(θ|y)) = ln(p(y|θ)) w.r.t. θ to zero, that is:

∂ ln(p(y|θ))/∂θ = 0
∂[N ȳ ln(θ) − N θ + ln(K)]/∂θ = 0
θN = N ȳ
θ̂MLE = ȳ

where ȳ is as before and K = ∏ᴺᵢ₌₁ (1/yi!) is a constant that does not depend on θ.

The variance of the ML estimate is I(θ)⁻¹, where

I(θ) = −E[∂²[N ȳ ln(θ) − N θ + ln(K)]/∂θ²]
     = −E[−N ȳ θ⁻²]
     = N E(ȳ) θ⁻²
     = N θ⁻¹

since E(ȳ) = N⁻¹ ∑ᴺᵢ₌₁ E(yi) = θ, because E(y) = θ = Var(y) for a Poisson RV.
So Var(θ̂MLE) = θ/N, where we would again replace θ by a consistent estimator such
as the MLE θ̂MLE to get an estimate of Var(θ̂MLE) as θ̂MLE/N.

Bayesian

Under a Bayesian approach, we again need to specify a prior π(θ) for θ.

• if we choose π(θ) = 1, this would again be an uninformative prior as before
• but {θ > 0} ∈ R, so ∫₀^∞ π(θ)dθ = ∞; the uninformative prior π(θ) = 1 thus
does not integrate to unity but rather to infinity, and is therefore known as an
improper prior
• we can still compute the posterior, but (because of the improper prior) we
cannot do model comparisons, such as Bayes factors etc.

The posterior would just be

p(θ|y) ∝ L(θ|y) × π(θ)
       ∝ θ^(N ȳ) exp{−θN} × ∏ᴺᵢ₌₁ (1/yi!) × 1    [kernel × some constant]
       ∝ θ^(N ȳ) exp{−θN}.    (25)

What does the kernel in (25) look like?

Recall that RV {Z > 0} ∈ R has a Gamma(α, β) density if its PDF is given by

p(z|α, β) = [1/(Γ(α)β^α)] z^(α−1) exp{−z/β},  ∀α, β > 0

where Γ(α) = ∫₀^∞ z^(α−1) exp{−z}dz = (α − 1)!, as before for the Beta distribution.
• in the Gamma class of densities, α and β are commonly referred to as the shape
and scale parameters

So the kernel in (25) is that of a Gamma(ᾱ, β̄) density with

ᾱ = N ȳ + 1 and β̄ = 1/N.

A Numerical Example

Suppose we have a sample of N = 100 and compute ∑ᴺᵢ₌₁ yi = 201 from the values
that we observe.
• then, the MLE of θ is just 2.01
• the posterior density p(θ|y) under π(θ) = 1 is proportional to θ^(N ȳ) exp{−θN}, ie.,

θ|y ∼ Gamma(N ȳ + 1, 1/N)

From the standard properties of a Gamma(α, β) RV we have:
• E(θ|y) = αβ
• Var(θ|y) = αβ²
• mode(θ|y) = (α − 1)β if α > 1, and 0 otherwise

If we are again interested in the mode as a point estimate, with posterior

θ|y ∼ Gamma(ᾱ, β̄)

we would compute

mode(θ|y) = (ᾱ − 1)β̄ = (N ȳ)/N = ȳ = 2.01

as our Bayesian point estimate of θ.

If we use the mean as the point estimate, it would be:

E(θ|y) = ᾱβ̄ = (N ȳ + 1)/N = 2.02.
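A Matlab sketch of these posterior quantities, including posterior draws via randg (relation 1) from the Gamma facts further below):

N = 100;  ybar = 2.01;              % sample information
abar = N*ybar + 1;  bbar = 1/N;     % Gamma(abar, bbar) posterior
post_mode = (abar - 1)*bbar         % = 2.01, equal to the MLE
post_mean = abar*bbar               % = 2.02
draws = bbar*randg(abar, 1e4, 1);   % posterior draws of theta|y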

Some Model Based Examples
Example 2: Poisson Model (adapted from Hoff (2009) pages 43-50)

Figure 5: Histogram of the observed Poisson (count) data (panel (a)) and posterior density
p(θ|y) (panel (b)) for N = 100 and ȳ = 2.01.

Results with a Conjugate prior

We saw from the previous result that using π(θ) = 1 is not only an uninformative
prior but also an improper one.
To stay in the Gamma class of posteriors, we need to specify a Gamma
conjugate prior for θ, that is:

π(θ) ∝ θ^(α0−1) exp{−θ/β0}    [kernel of Gamma(α0, β0)]

where we are again using the notation α0, β0 for the hyperparameters of the
Gamma(α0, β0) prior.

The posterior is then

p(θ|y) ∝ L(θ|y) × π(θ)
       ∝ θ^(N ȳ) exp{−θN} × θ^(α0−1) exp{−θ/β0}    [kernel of L(θ|y) × kernel of π(θ)]
       ∝ θ^(N ȳ+α0−1) exp{−θ(N + 1/β0)}.    (26)

The kernel of the posterior p(θ|y) in (26) can be recognised as a Gamma(ᾱ, β̄)
density with

ᾱ = N ȳ + α0 and β̄ = 1/(N + 1/β0).
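A sketch of this conjugate update in Matlab, with an illustrative prior choice α0 = 4, β0 = 0.5 (prior mean α0β0 = 2):

N = 100;  ybar = 2.01;
alpha0 = 4;  beta0 = 0.5;    % illustrative prior hyperparameters
abar = N*ybar + alpha0;      % = 205
bbar = 1/(N + 1/beta0);      % = 1/102
post_mean = abar*bbar        % = 2.0098, between the prior mean and the MLE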



Figure 6: Plots of prior, posterior and likelihood function for three different prior
hyperparameter values α0 and β0 , and N = 100, ȳ = 2.01.

Some Model Based Examples
Example 3: Normal Model

Normal Model (aka Location and Scale Model)

Suppose that we have an i.i.d. sample of size N from a Normal distribution, where
the PDF for observation i takes the form

p(yi|µ, σ²) = (2πσ²)^(−1/2) exp{−(yi − µ)²/(2σ²)},  ∀yi ∈ R

with (µ, σ²) ∈ R × R⁺.

The joint density and likelihood are respectively:

p(y|µ, σ²) = ∏ᴺᵢ₌₁ (2πσ²)^(−1/2) exp{−(yi − µ)²/(2σ²)}
L(µ, σ²|y) = (2πσ²)^(−N/2) exp{−(1/(2σ²)) ∑ᴺᵢ₌₁ (yi − µ)²}.

Frequentist MLE
The MLE for µ and σ² is obtained as before from

∂ ln L(µ, σ²|y)/∂µ = (1/σ²) ∑ᴺᵢ₌₁ (yi − µ) = 0
∑ᴺᵢ₌₁ yi = N µ
N⁻¹ ∑ᴺᵢ₌₁ yi = µ

so µ̂MLE = ȳ.

For σ² we get

∂ ln L(µ, σ²|y)/∂σ² = −(N/2)(σ²)⁻¹ + (1/2)(σ²)⁻² ∑ᴺᵢ₌₁ (yi − µ)² = 0
N(σ²)⁻¹ = (σ²)⁻² ∑ᴺᵢ₌₁ (yi − µ)²
N σ² = ∑ᴺᵢ₌₁ (yi − µ)²

so σ̂²MLE = N⁻¹ ∑ᴺᵢ₌₁ (yi − µ̂MLE)².

Let θ = (µ, σ²). The variance/covariance matrix of the MLE of θ is again obtained
from the inverse of the Fisher Information I(θ), where I(θ) = −E(H) and the
Hessian H is

H = ∂² ln(L(θ|y))/∂θ∂θ′

  = [ ∂² ln(L(θ|y))/∂µ∂µ      ∂² ln(L(θ|y))/∂σ²∂µ
      ∂² ln(L(θ|y))/∂µ∂σ²     ∂² ln(L(θ|y))/∂σ²∂σ² ]

  = [ −N/σ²                    −∑ᴺᵢ₌₁ (yi − µ)/σ⁴
      −∑ᴺᵢ₌₁ (yi − µ)/σ⁴       N/(2σ⁴) − ∑ᴺᵢ₌₁ (yi − µ)²/σ⁶ ].

Since Y ∼ Normal(µ, σ²), it follows that E(yi) = µ and E(yi − µ)² = σ², so that

I(θ) = −E(H)

     = [ N/σ²                      ∑ᴺᵢ₌₁ E(yi − µ)/σ⁴
         ∑ᴺᵢ₌₁ E(yi − µ)/σ⁴       −N/(2σ⁴) + ∑ᴺᵢ₌₁ E(yi − µ)²/σ⁶ ]

     = [ N/σ²     0
         0        −N/(2σ⁴) + N/σ⁴ ]

     = [ N/σ²     0
         0        N/(2σ⁴) ]

so that

Var(θ̂MLE) = I(θ)⁻¹ = [ σ²/N     0
                        0        2σ⁴/N ].

Bayesian
Under a Bayesian setting, we again need to specify a prior, but now for the joint
parameter vector θ, ie., π(θ) = π(µ, σ²). A few options exist here:
• one could use a conjugate prior for θ (we will see that later)
• another common one is the following combination for µ and σ² (related to
Jeffreys’ prior, which will be discussed later as well)
– assume µ and σ² are independent, hence

π(θ) = π(µ, σ²) = π(µ|σ²)π(σ²) = π(µ)π(σ²)

– then set a flat prior for µ, ie., π(µ) ∝ 1 (an uninformative and improper prior for µ)

– set also a flat prior for the log of σ², that is, let φ = ln σ², with π(φ) ∝ 1
– noting that π(σ²) = π(φ)|∂φ/∂σ²| = π(φ)(1/σ²) (from the RV transformation),
and with π(φ) ∝ 1, we get

π(θ) = π(µ, σ²)
     = π(µ|σ²)π(σ²)
     = π(µ)π(φ)(1/σ²)
     ∝ 1/σ².    (27)

Given π(θ) in (27) and the data density p(y|θ), the posterior becomes

p(θ|y) ∝ p(y|θ) × π(θ)
       ∝ (2πσ²)^(−N/2) exp{−(1/(2σ²)) ∑ᴺᵢ₌₁ (yi − µ)²} × (σ²)⁻¹
       ∝ (σ²)^(−(N/2+1)) exp{−(1/(2σ²)) ∑ᴺᵢ₌₁ (yi − µ)²}    (28)

where we can drop the terms involving (2π)^(−N/2), as they do not depend on
θ = (µ, σ²).
How do we use the posterior for the joint parameter vector θ in (28)?

Since we want to find the (marginal) posteriors for µ and σ², given the data, we need
to integrate out the other parameter that is not of interest to us, that is,

p(µ|y) = ∫σ² p(θ|y)dσ²  and  p(σ²|y) = ∫µ p(θ|y)dµ

in this setting.
Before we do that, note that

∑ᴺᵢ₌₁ (yi − µ)² = ∑ᴺᵢ₌₁ [(yi − ȳ) − (µ − ȳ)]²,  where ȳ = N⁻¹ ∑ᴺᵢ₌₁ yi
              = ∑ᴺᵢ₌₁ [(yi − ȳ)² − 2(yi − ȳ)(µ − ȳ) + (µ − ȳ)²]

              = ∑ᴺᵢ₌₁ (yi − ȳ)² − 2(µ − ȳ) ∑ᴺᵢ₌₁ (yi − ȳ) + N(µ − ȳ)²
              = SSE + N(µ − ȳ)²    (29)

since ∑ᴺᵢ₌₁ (yi − ȳ) = 0, and where SSE is the Sum of Squared Errors

SSE = ∑ᴺᵢ₌₁ (yi − ȳ)² = (N − 1)s² = νs²,

and

• s² is the standard (unbiased) estimate of the population variance, ie., the
sample variance, and
• ν is the degrees of freedom parameter.

Note that SSE is a function of the data only, and hence does not depend on the
parameter vector of interest θ!
Combining (28) and (29), we can re-express (28) as

p(θ|y) ∝ (σ²)^(−(N/2+1)) exp{−[SSE + N(µ − ȳ)²]/(2σ²)}    (30)

Rather than using the relation in (28) to integrate out the unwanted parameter, this
is commonly done with the relation in (30)!

Computing the marginal posterior for σ²

To get p(σ²|y) we integrate µ out of the joint posterior in (30) as follows:

p(σ²|y) = ∫µ p(µ, σ²|y)dµ
        ∝ ∫µ (σ²)^(−(N/2+1)) exp{−SSE/(2σ²)} exp{−N(µ − ȳ)²/(2σ²)} dµ
        ∝ (σ²)^(−(N/2+1)) exp{−SSE/(2σ²)} ∫µ exp{−N(µ − ȳ)²/(2σ²)} dµ

where the first exponential term does not depend on µ, and the remaining integrand
is the kernel of a Normal(ȳ, σ²/N) density.

Note that

∫µ exp{−N(µ − ȳ)²/(2σ²)} dµ = (2πσ²/N)^(1/2)

so that

p(σ²|y) ∝ (σ²)^(−(N/2+1)) exp{−SSE/(2σ²)} (2πσ²/N)^(1/2)
        ∝ (σ²)^(−((N−1)/2+1)) exp{−SSE/(2σ²)}.    (31)

The marginal posterior in (31) can be recognised as an Inverse Gamma density with
α = (N − 1)/2 and β = SSE/2, that is,

σ²|y ∼ InvGam((N − 1)/2, SSE/2).    (32)

A Quick Detour: Some Facts about Gamma RVs

Fact: Re-scaling a Gamma RV

If X ∼ Gamma(α, β) then Y = κX ∼ Gamma(α, κβ).

Proof:

p(y) = p(x)|∂x/∂y|
     = [Γ(α)]⁻¹ β^(−α) (y/κ)^(α−1) exp{−(y/κ)β⁻¹} κ⁻¹
     = [Γ(α)]⁻¹ (κβ)^(−α) y^(α−1) exp{−y(κβ)⁻¹},    (33)

which is the Gamma(α, κβ) density.

Fact: Re-scaling an Inverse Gamma RV

If X ∼ InvGam(α, β) then Y = κX ∼ InvGam(α, κβ).

Proof:

p(y) = p(x)|∂x/∂y|
     = [Γ(α)]⁻¹ β^α (y/κ)^(−(α+1)) exp{−β(y/κ)⁻¹} κ⁻¹
     = [Γ(α)]⁻¹ (κβ)^α y^(−(α+1)) exp{−(κβ)y⁻¹},    (34)

which is the InvGam(α, κβ) density.

Fact: Inverse of an Inverse Gamma RV

If X ∼ InvGam(α, β = 1) then Y = 1/X ∼ Gamma(α, β = 1).

Proof:

p(y) = p(x)|∂x/∂y|
     = [Γ(α)]⁻¹ (1/y)^(−(α+1)) exp{−(1/y)⁻¹} y⁻²
     = [Γ(α)]⁻¹ y^(α−1) exp{−y},    (35)

which is the Gamma(α, β = 1) density.

Note: The reason why we are using β = 1 in the Inverse of an Inverse Gamma RV
transformation is to avoid having to re-define the scale parameter as β = 1/b as was
done in the distributions notes.

Fact: Gamma and Chi-square


A Gamma(α = ν/2, β = 2) RV is a Chi-square RV with ν degrees of freedom (χ²ν).
In Matlab, 2*randg(ν/2, 1) ∼ Chi2(ν).


Summary and implementation in Matlab

If X ∼ Gamma(α, 1), then we have the following relations:

1) βX ∼ Gamma(α, β) [shape α, scale β], Matlab code: β*randg(α, T, 1)
2) 1/X ∼ InvGam(α, 1) [shape α, scale 1], Matlab code: 1./randg(α, T, 1)
3) β/X ∼ InvGam(α, β) [shape α, scale β], Matlab code: β./randg(α, T, 1)
4) 2X ∼ Chi2(2α) [shape α, scale 2], Matlab code: 2*randg(α, T, 1)

The last relation in 4) will generate a Chi2 RV with ν degrees of freedom if we draw
from 2*randg(α, T, 1) with α = ν/2 (ie., ν = 2α).
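As a usage example of relation 3), a sketch (with simulated data; all names are illustrative) of drawing from the marginal posterior σ²|y ∼ InvGam((N − 1)/2, SSE/2) derived in (32):

rng(1);
N   = 100;  y = 2 + randn(N,1);          % simulated sample, true sigma2 = 1
SSE = sum((y - mean(y)).^2);             % (N-1)*s^2
T   = 1e4;                               % number of posterior draws
sig2 = (SSE/2)./randg((N - 1)/2, T, 1);  % beta./randg(alpha,T,1), beta = SSE/2
[mean(sig2)  SSE/(N - 3)]                % Monte Carlo vs analytical posterior mean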


From (32) we had the marginal posterior for σ²

σ²|y ∼ InvGam((N − 1)/2, SSE/2),

where ν = N − 1 and SSE = νs². Using the results from above, we have the relations

(2σ²/SSE)|y ∼ InvGam((N − 1)/2, 1)
(SSE/(2σ²))|y ∼ Gamma((N − 1)/2, 1)
(SSE/σ²)|y ∼ Gamma((N − 1)/2, 2)
(SSE/σ²)|y ∼ Chi2(N − 1)

Setting SSE = (N − 1)s² we get

((N − 1)s²/σ²)|y ∼ Chi2(N − 1)
s²|y ∼ (σ²/(N − 1)) Chi2(N − 1).

Also, since E(z) = N − 1 when Z ∼ Chi2(N − 1), we get the standard result that

E(s²|y) = (σ²/(N − 1)) E[Chi2(N − 1)] = σ²,

so the sample variance s² is an unbiased estimator of the population variance.

To obtain a point estimate, we can again use the mean (or mode or median). Since
the marginal posterior for σ², given the data y, is

σ²|y ∼ InvGam(α = (N − 1)/2, β = SSE/2)

and E(z) = β/(α − 1) if RV Z ∼ InvGam(α, β), we get the posterior mean

E(σ²|y) = (SSE/2)/((N − 1)/2 − 1) = SSE/(N − 3),

while the posterior mode follows from mode(z) = β/(α + 1).

Recall, if RV Z ∼ InvGam(α, β), then mode(z) = β/(α + 1), so

mode(σ²|y) = (SSE/2)/((N − 1)/2 + 1) = SSE/(N + 1).

Notice here that mode(σ²|y) does not coincide with the maximiser of L(θ|y),
because π(θ) ∝ 1/σ² and not just a constant, so exact results differ because of the
prior on θ.
• but as N → ∞, this difference disappears and mode(σ²|y) → σ̂²MLE.

Marginal posterior for µ

Now, to get p(µ|y), we need to integrate σ² out of the posterior in (30), that is,

p(µ|y) = ∫σ² p(µ, σ²|y)dσ²
       ∝ ∫σ² (σ²)^(−(N/2+1)) exp{−[SSE + N(µ − ȳ)²]/(2σ²)} dσ²
       ∝ ∫σ² (σ²)^(−(α+1)) exp{−β/σ²} dσ²    (36)

where the integrand is the kernel of an InvGam(α, β) density, with α = N/2 and
β = [SSE + N(µ − ȳ)²]/2.

The InvGam kernel in (36) has to integrate to Γ(α)β^(−α) for it to correspond to a
proper density.

Thus (remember, the RV is now µ)

p(µ|y) ∝ Γ(α)β^(−α)
       ∝ Γ(N/2)([SSE + N(µ − ȳ)²]/2)^(−N/2)
       ∝ [SSE + N(µ − ȳ)²]^(−N/2)    (37)

where Γ(N/2) and the factor of 2 drop out, as they do not depend on µ.

With SSE = s²(N − 1) = ∑ᴺᵢ₌₁ (yi − ȳ)², also not depending on µ, we then get

p(µ|y) ∝ [SSE + N(µ − ȳ)²]^(−N/2)
       = [SSE + SSE · N(µ − ȳ)²/SSE]^(−N/2)    (38)
       = [SSE (1 + N(µ − ȳ)²/SSE)]^(−N/2)    (39)

       = SSE^(−N/2) [1 + N(µ − ȳ)²/SSE]^(−N/2)    (40)
       ∝ [1 + N(µ − ȳ)²/SSE]^(−N/2)
       = [1 + (1/ν) N(µ − ȳ)²/s²]^(−(ν+1)/2)
       = [1 + (1/ν)((µ − ȳ)/(s/√N))²]^(−(ν+1)/2)    (41)

where ν = N − 1.
The kernel in (41) is that of a (non-standard) Students’ t distribution with ν degrees
of freedom, location ȳ and scale s²/N.


Defining ω = √N(µ − ȳ)/s, we can compute the (standard) Students’ t distribution
for RV ω with PDF

p(ω|y) = p(µ|y)|∂µ/∂ω|
       ∝ [1 + ω²/ν]^(−(ν+1)/2) (s/√N)
       ∝ [1 + ω²/ν]^(−(ν+1)/2)

where E(ω|y) = 0, ∀ν > 1.

To get a Bayesian point estimate, we again look at some standard measures of
central tendency.
Given that µ|y ∼ Students(ν, ȳ, s²/N), we have

• E(µ|y) = ȳ
• mode(µ|y) = ȳ
• median(µ|y) = ȳ

We further have the standard result that, as ν → ∞,

t(ν, ȳ, s²/N) → Normal(ȳ, s²/N)  (in distribution)

so that, in large samples, µ|y is approximately Normal(ȳ, s²/N).

Note that since the prior on µ was ∝ 1, the posterior (Bayesian) point estimate is
the same as the MLE (µ̂MLE = ȳ).
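A sketch (with simulated data) of sampling µ|y by compounding: draw σ² from its InvGam posterior and then µ|σ², y ∼ Normal(ȳ, σ²/N); marginally this yields the Students’ t posterior in (41).

rng(1);
N = 100;  y = 2 + randn(N,1);              % simulated sample
ybar = mean(y);  SSE = sum((y - ybar).^2);
T    = 1e4;
sig2 = (SSE/2)./randg((N - 1)/2, T, 1);    % sigma2|y draws
mu   = ybar + sqrt(sig2/N).*randn(T,1);    % mu|sigma2,y draws
[mean(mu)  ybar]                           % posterior mean equals ybar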

Some Model Based Examples
Example 4: Plain Vanilla Regression Model

The Plain Vanilla Regression Model

Suppose we have an i.i.d. sample of size N from the regression model

yi = ∑ᵖⱼ₌₁ xij βj + εi,  ∀i = 1, . . . , N

where εi ∼ Normal(0, σ²).

In matrix form we have

y = Xβ + ε,    (42)

where ε ∼ MNormal(0(N×1), σ²I(N×N)), y is the (N×1) vector of observations on the
dependent variable, X is the (N×p) matrix of regressors with typical row
(xi1, xi2, . . . , xip), β is the (p×1) coefficient vector, and ε is the (N×1) error vector,

with (β, σ²) ∈ Rᵖ × R⁺.

Note that the first column of X is frequently a column of ones corresponding to the
intercept term in the regression model, but this is not important here.
In general, a random vector Z(k×1) is said to be Multivariate Normal distributed with
1) mean vector µ(k×1) and
2) co-variance matrix Σ(k×k)
if its PDF is

p(Z|µ, Σ) = (2π)^(−k/2) det(Σ)^(−1/2) exp{−(1/2)(Z − µ)′Σ⁻¹(Z − µ)},

ie., Z ∼ MNormal(µ, Σ).

Frequentist MLE
The likelihood function (joint density) is simply (where Σ = σ²I)

L(β, σ²|y, X) = (2π)^(−N/2) det(σ²I)^(−1/2) exp{−(1/2)(y − Xβ)′(σ²I)⁻¹(y − Xβ)}
             = (2πσ²)^(−N/2) exp{−(1/(2σ²))(y − Xβ)′(y − Xβ)}

with the MLE for β and σ² being obtained as before from

∂ ln(L(β, σ²|y, X))/∂β = 0
(1/σ²)X′(y − Xβ) = 0

so β̂MLE = (X′X)⁻¹(X′y)

and

∂ ln(L(β, σ²|y, X))/∂σ² = 0
−(N/2)(σ²)⁻¹ + (1/2)(σ²)⁻²(y − Xβ)′(y − Xβ) = 0

so that σ̂²MLE = N⁻¹(y − Xβ̂MLE)′(y − Xβ̂MLE).

The variance/covariance matrix for θ̂MLE = (β̂MLE, σ̂²MLE) is obtained as before from
the inverse of the information matrix, where I(θ) = −E(·|X), ie., conditional on X:

I(θ) = −E[∂² ln(L(β, σ²|y, X))/∂θ∂θ′ | X]

     = −E[ [ −σ⁻²(X′X)     −σ⁻⁴X′ε
             −σ⁻⁴ε′X       (N/2)σ⁻⁴ − σ⁻⁶ε′ε ] | X ]

     = [ σ⁻²(X′X)    0
         0           N/(2σ⁴) ]

yielding

Var(θ̂MLE) = I(θ)⁻¹ = [ σ²(X′X)⁻¹    0
                        0            2σ⁴/N ].

Bayesian
Under a Bayesian setting, using again the same (Jeffreys’) prior
π(θ) = π(β, σ²) ∝ 1/σ² as before in Example 3 (the Normal model), we get:

p(θ|y, X) ∝ (σ²)^(−(N/2+1)) exp{−(1/(2σ²))(y − Xβ)′(y − Xβ)}    (43)

and we can again seek an expression for (y − Xβ)′(y − Xβ) that relates y to the
OLS (and here also ML) estimate β̂ = (X′X)⁻¹(X′y).
Re-writing the term with A = y − Xβ̂ and B = X(β − β̂), we have

(y − Xβ)′(y − Xβ) = [(y − Xβ̂) − X(β − β̂)]′[(y − Xβ̂) − X(β − β̂)]
                  = (A − B)′(A − B)

                  = A′A − A′B − B′A + B′B

where A′A = (y − Xβ̂)′(y − Xβ̂) = SSE, B′B = (β − β̂)′X′X(β − β̂) and

A′B = (y − Xβ̂)′X(β − β̂)
    = [y − X(X′X)⁻¹(X′y)]′X(β − β̂)    [H = X(X′X)⁻¹X′ is the hat matrix]
    = [y − Hy]′X(β − β̂)
    = y′(I − H)′X(β − β̂).    (44)

Because the hat (or projection) matrix H = X(X′X)⁻¹X′ is a symmetric (and
idempotent) matrix, we have H = H′ (and HH = H), so that

(I − H)′X = X − HX
          = X − X(X′X)⁻¹X′X = 0.

Thus A′B = 0 and, similarly, B′A = 0. This yields the result

(y − Xβ)′(y − Xβ) = SSE + (β − β̂)′(X′X)(β − β̂),

where SSE is a scalar. Using this result, we can write (43) as

p(β, σ²|y, X) ∝ (σ²)^(−(N/2+1)) exp{−(1/(2σ²))[SSE + (β − β̂)′(X′X)(β − β̂)]}    (45)

and proceed to find p(σ²|y, X) as with the Normal model, ie., form

p(σ²|y, X) = ∫β p(β, σ²|y, X)dβ
           = (σ²)^(−(N/2+1)) exp{−SSE/(2σ²)}
             × ∫β exp{−(1/(2σ²))(β − β̂)′(X′X)(β − β̂)} dβ    (46)

where the integrand in (46) is the kernel of a Multivariate Normal with mean β̂ and
covariance matrix σ²(X′X)⁻¹, so it will integrate to

(2π)^(p/2) det(σ²(X′X)⁻¹)^(1/2) = (2πσ²)^(p/2) det(X′X)^(−1/2).

We then get, after dropping the (2π)^(p/2) and det(X′X)^(−1/2) constants:

p(σ²|y, X) ∝ (σ²)^(−(N/2+1)) exp{−SSE/(2σ²)} × (σ²)^(p/2)
           ∝ (σ²)^(−((N−p)/2+1)) exp{−SSE/(2σ²)}    (47)

where the expression in (47) is the kernel of an InvGam(α, β) density with
α = (N − p)/2 and β = SSE/2.
So the result for the marginal posterior p(σ²|y, X) from the regression model is
analogous to the Normal model result that we obtained earlier.

The marginal posterior for β is obtained by integrating σ² out of (46), ie., we compute

p(β|y, X) ∝ ∫σ² (σ²)^(−(N/2+1)) exp{−(1/(2σ²))[SSE + (β − β̂)′(X′X)(β − β̂)]} dσ².    (48)

As before, we can recognise the integrand in (48) to be the kernel of an InvGam(α, β)
density, where

α = N/2  and  β = [SSE + (β − β̂)′(X′X)(β − β̂)]/2,

so that

p(β|y, X) ∝ Γ(α)β^(−α)
          ∝ Γ(N/2)([SSE + (β − β̂)′(X′X)(β − β̂)]/2)^(−N/2)
          ∝ [SSE + (β − β̂)′(X′X)(β − β̂)]^(−N/2)
          = [SSE (1 + (β − β̂)′(X′X)(β − β̂)/SSE)]^(−N/2)    (49)
          ∝ [1 + (1/ν)(β − β̂)′(X′X)(β − β̂)/s²]^(−(ν+p)/2)

with Σ⁻¹ = X′X/s²,

p(β|y, X) ∝ [1 + (1/ν)(β − β̂)′Σ⁻¹(β − β̂)]^(−(ν+p)/2)    (50)

where ν = N − p, SSE = νs², Σ = s²(X′X)⁻¹ and p = dim(β), ie., the number
of regressors.
The expression in (50) can be recognised as a Multivariate Students’ t distribution,
denoted by Mt(ν, β̂, Σ), with ν degrees of freedom and with location and scale being
β̂ and Σ, respectively, ie.,

p(β|y, X) = [Γ((ν + p)/2)/(Γ(ν/2)(νπ)^(p/2))] det(Σ)^(−1/2)
            × [1 + (1/ν)(β − β̂)′Σ⁻¹(β − β̂)]^(−(ν+p)/2).
References

Bernardo, José M. and Adrian F. M. Smith (1994): Bayesian Theory, Wiley Series in Probability and Statistics, John
Wiley & Sons.
Casella, George and Roger L. Berger (2001): Statistical Inference, 2nd Edition, Duxbury Press.
Cox, Richard T. (1946): “Probability, frequency and reasonable expectation,” American journal of physics, 14(1), 1–13.
DeGroot, Morris H. and Mark J. Schervish (2010): Probability and Statistics, 4th Edition, Pearson.
Hoff, Peter D. (2009): A First Course in Bayesian Statistical Methods, Springer Verlag.
Kolmogorov, Andrei N. (1933): Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer.
Koop, Gary M. (2003): Bayesian Econometrics, John Wiley & Sons.
Koop, Gary M., Dale J. Poirier and Justin L. Tobias (2007): Bayesian Econometric Methods, Cambridge University
Press.
Leemis, Lawrence M. and Jacquelyn T. McQueston (2008): “Univariate distribution relationships,” The American
Statistician, 62(1), 45–53.
Robert, Christian P. (2007): The Bayesian Choice: From Decision-Theoretic Foundations to Computational
Implementation, Springer Verlag.
Savage, Leonard J. (1954): The Foundations of Statistics, John Wiley and Sons.
Shafer, Glenn and Vladimir Vovk (2006): “The Sources of Kolmogorov’s Grundbegriffe,” Statistical Science, 21(1),
70–98.
Spanos, Aris (1986): Statistical Foundations of Econometric Modelling, Cambridge University Press.
