
Cooperating Intelligent Systems

Bayesian networks
Chapter 14, AIMA
Inference
• Inference in the statistical setting means
computing the probability of different
outcomes given the available information:

P(Outcome | Information)

• We need an efficient method for doing
this, one that is more powerful than the
naïve Bayes model.
Bayesian networks
A Bayesian network is a directed graph in which
each node is annotated with quantitative
probability information:
1. A set of random variables, {X1,X2,X3,...}, makes up
the nodes of the network.
2. A set of directed links connects pairs of nodes,
parent → child.
3. Each node Xi has a conditional probability
distribution P(Xi | Parents(Xi)).
4. The graph is a directed acyclic graph (DAG).
The dentist network

(Figure: four nodes — Weather, Cavity, Toothache, Catch — with Cavity as the
parent of both Toothache and Catch, and Weather unconnected.)
The alarm network

(Figure: Burglary and Earthquake are the parents of Alarm; Alarm is the parent
of JohnCalls and MaryCalls.)

The burglar alarm responds to both earthquakes and burglars. Two neighbors,
John and Mary, have promised to call you when the alarm goes off. John always
calls when there is an alarm, and sometimes when there is no alarm. Mary
sometimes misses the alarm (she likes loud music).
The cancer network

(Figure: Age → Toxics; Age and Gender → Smoking; Toxics and Smoking → Cancer;
Cancer → Serum Calcium; Cancer and Genetic Damage → Lung Tumour.)

From Breese and Koller, 1997.
The cancer network

(Same network as above.)

P(A, G) = P(A) P(G)

P(C | S, T, A, G) = P(C | S, T)

P(SC, C, LT, GD) = P(SC | C) P(LT | C, GD) P(C) P(GD)

P(A, G, T, S, C, SC, LT, GD) =
P(A) P(G) P(T | A) P(S | A, G) P(C | T, S) P(GD) P(SC | C) P(LT | C, GD)

From Breese and Koller, 1997.
The product (chain) rule

P(X1 = x1 ∧ X2 = x2 ∧ ... ∧ Xn = xn) = P(x1, x2, ..., xn) = ∏ i=1..n P(xi | parents(Xi))

(This is for Bayesian networks; the general case comes later in this lecture.)
Bayes network node is a function

(Figure: a node C with parents A and B, here receiving the input ¬a, b.)

      a∧b   a∧¬b   ¬a∧b   ¬a∧¬b
Min   0.1   0.3    0.7    0.0
Max   1.5   1.1    1.9    0.9

P(C | ¬a, b) = U[0.7, 1.9]

(Figure: the uniform density over the interval [0.7, 1.9].)
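To make this concrete, here is a minimal sketch (not from the slides; names are
illustrative) of such a node in Python: the parent configuration selects an
interval from the Min/Max table above, and the node outputs a uniform
distribution over that interval.

import random

# (min, max) of the uniform interval for each parent configuration (a, b),
# taken from the Min/Max table above
interval = {
    (True,  True):  (0.1, 1.5),
    (True,  False): (0.3, 1.1),
    (False, True):  (0.7, 1.9),   # P(C | ¬a, b) = U[0.7, 1.9]
    (False, False): (0.0, 0.9),
}

def sample_C(a, b):
    lo, hi = interval[(a, b)]
    return random.uniform(lo, hi)

print(sample_C(False, True))      # one draw from U[0.7, 1.9]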
Bayes network node is a function

(Figure: a node with parents A and B.)

A BN node is a conditional distribution function:
• Inputs = parent values
• Output = a distribution over the node's values

Any type of function from values to distributions can be used.
Example: The alarm network

Note: each number in the tables represents a Boolean distribution, hence there
is a distribution output for every input.

(Figure: Burglary and Earthquake → Alarm; Alarm → JohnCalls and MaryCalls.)

P(B=b) = 0.001        P(E=e) = 0.002

B    E    P(A=a)
b    e    0.95
b    ¬e   0.94
¬b   e    0.29
¬b   ¬e   0.001

A    P(J=j)          A    P(M=m)
a    0.90            a    0.70
¬a   0.05            ¬a   0.01
Example: The alarm network

The probability of "no earthquake, no burglary, but alarm, and both Mary and
John make the call":

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(¬b) P(¬e) P(a | ¬b, ¬e) P(m | a) P(j | a)
= 0.999 · 0.998 · 0.001 · 0.70 · 0.90 ≈ 0.00063

(Network and CPTs as on the previous slide.)
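As a sanity check, the product above is easy to reproduce in code. This is a
minimal sketch using the CPT numbers from the slides; the variable names are
illustrative, not from the lecture.

P_b, P_e = 0.001, 0.002                                  # P(B=b), P(E=e)
P_a = {(True, True): 0.95, (True, False): 0.94,          # P(A=a | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}                          # P(J=j | A)
P_m = {True: 0.70, False: 0.01}                          # P(M=m | A)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(¬b) P(¬e) P(a | ¬b,¬e) P(m | a) P(j | a)
p = (1 - P_b) * (1 - P_e) * P_a[(False, False)] * P_m[True] * P_j[True]
print(p)                                                 # ≈ 0.00063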
Meaning of Bayesian network

The general chain rule (always true):

P(x1, x2, ..., xn) = P(x1 | x2, x3, ..., xn) P(x2, x3, ..., xn)
                   = P(x1 | x2, x3, ..., xn) P(x2 | x3, x4, ..., xn) P(x3, x4, ..., xn)
                   = ... = ∏ i=1..n P(xi | xi+1, ..., xn)

The Bayesian network chain rule:

P(x1, x2, ..., xn) = ∏ i=1..n P(xi | parents(Xi))

The BN is a correct representation of the domain iff each node is
conditionally independent of its predecessors, given its parents.
The alarm network

(Figure: the five alarm-network nodes, with the Bayesian network's edges drawn
in red alongside the additional dependencies of the "fully correct" network.)

The fully correct alarm network might look something like the figure.

The Bayesian network (red) assumes that some of the variables are independent
(or that the dependencies can be neglected since they are very weak).

The correctness of the Bayesian network of course depends on the validity of
these assumptions.

It is this sparse connection structure that makes the BN approach feasible
(roughly linear growth in complexity rather than exponential).
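A quick way to see the gain from the sparse structure: for the five Boolean
alarm-network variables, the full joint needs 2^5 − 1 = 31 independent numbers,
while the BN's CPTs need only 10 (one number per parent configuration per
node). A one-line check:

full_joint = 2 ** 5 - 1           # independent numbers in the full joint
bn_params = 1 + 1 + 4 + 2 + 2     # B, E, A|B,E, J|A, M|A
print(full_joint, bn_params)      # 31 10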
How to construct a BN?
• Add nodes in causal order ("causal" as
determined from expertise).
• Determine conditional independence using
either (or all) of the following semantics:
– Blocking/d-separation rule
– Non-descendant rule
– Markov blanket rule
– Experience/your beliefs
Path blocking & d-separation

(Figure: Cancer and Genetic Damage are the parents of Lung Tumour; Cancer is
the parent of Serum Calcium.)

Intuitively, knowledge about Serum Calcium influences our belief about Cancer
(if we don't know the value of Cancer), which in turn influences our belief
about Lung Tumour, etc.

However, if we are given the value of Cancer (i.e. C = true or false), then
knowledge of Serum Calcium will not tell us anything about Lung Tumour that we
don't already know.

We say that Cancer d-separates (direction-dependent separation) Serum Calcium
and Lung Tumour.
Path blocking & d-separation

Xi and Xj are d-separated if all paths between them are blocked.

Two nodes Xi and Xj are conditionally independent given a set Ω =
{X1, X2, X3, ...} of nodes if for every undirected path in the BN between Xi
and Xj there is some node Xk on the path with one of the following three
properties:

1. Xk ∈ Ω, and both arcs on the path lead out of Xk.
2. Xk ∈ Ω, and one arc on the path leads into Xk and one arc leads out.
3. Neither Xk nor any descendant of Xk is in Ω, and both arcs on the path
   lead into Xk.

(Figure: a path from Xi to Xj through nodes Xk1, Xk2, Xk3 illustrating the
three cases.)

Such an Xk blocks the path between Xi and Xj. If all paths are blocked,

P(Xi, Xj | Ω) = P(Xi | Ω) P(Xj | Ω)
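The three rules are mechanical enough to check in code. Below is a minimal
sketch (the helper names are my own, not from the slides) that tests whether a
single undirected path is blocked, applied to the Burglary–Alarm–Earthquake
path of the alarm network.

edges = {("Burglary", "Alarm"), ("Earthquake", "Alarm"),
         ("Alarm", "JohnCalls"), ("Alarm", "MaryCalls")}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    found, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for parent, child in edges:
            if parent == n and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def is_blocked(path, given):
    """Is this undirected path blocked by the evidence set `given`?"""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # Rule 3: a collider blocks unless it or one of its descendants is observed
            if node not in given and not (descendants(node) & given):
                return True
        else:
            # Rules 1 & 2: a fork or chain node blocks when it is observed
            if node in given:
                return True
    return False

path = ["Burglary", "Alarm", "Earthquake"]
print(is_blocked(path, set()))           # True: B and E are d-separated a priori
print(is_blocked(path, {"JohnCalls"}))   # False: evidence on a descendant unblocks the collider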
Non-descendants

A node X is conditionally independent of its
non-descendants (Z1j, ..., Znj), given its
parents (U1, ..., Um):

P(X, Z1j, ..., Znj | U1, ..., Um) =
P(X | U1, ..., Um) P(Z1j, ..., Znj | U1, ..., Um)
Markov blanket

(Figure: a node X with parents U1, ..., Um, children Y1, ..., Yn, and the
children's other parents Z1j, ..., Znj; the remaining nodes of the network are
X1, ..., Xk.)

A node is conditionally independent of all
other nodes in the network, given its
parents, children, and children's parents.
These constitute the node's Markov blanket.

P(X, X1, ..., Xk | U1, ..., Um, Z1j, ..., Znj, Y1, ..., Yn) =
P(X | U1, ..., Um, Z1j, ..., Znj, Y1, ..., Yn) P(X1, ..., Xk | U1, ..., Um, Z1j, ..., Znj, Y1, ..., Yn)
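The Markov blanket can be read directly off the edge list. A minimal sketch
using the cancer-network edges implied by the factorization earlier in the
lecture (the function name is illustrative):

edges = {("Age", "Toxics"), ("Age", "Smoking"), ("Gender", "Smoking"),
         ("Toxics", "Cancer"), ("Smoking", "Cancer"),
         ("Cancer", "SerumCalcium"),
         ("Cancer", "LungTumour"), ("GeneticDamage", "LungTumour")}

def markov_blanket(node):
    parents  = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    spouses  = {p for p, c in edges if c in children and p != node}
    return parents | children | spouses

print(markov_blanket("Cancer"))
# {'Toxics', 'Smoking', 'SerumCalcium', 'LungTumour', 'GeneticDamage'}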
Efficient representation of PDs

(Figure: a node C with parents A and B; what does P(C | a, b) look like?)

• Boolean → Boolean
• Boolean → Discrete
• Boolean → Continuous
• Discrete → Boolean
• Discrete → Discrete
• Discrete → Continuous
• Continuous → Boolean
• Continuous → Discrete
• Continuous → Continuous
Efficient representation of PDs
Boolean → Boolean: Noisy-OR, Noisy-AND
Boolean/Discrete → Discrete: Noisy-MAX
Bool./Discr./Cont. → Continuous: parametric distribution (e.g. Gaussian)
Continuous → Boolean: Logit/Probit
Noisy-OR example
Boolean → Boolean

P(E | C1, C2, C3):

C1       0    1    0    0    1     1     0     1
C2       0    0    1    0    1     0     1     1
C3       0    0    0    1    0     1     1     1
P(E=0)   1    0.1  0.1  0.1  0.01  0.01  0.01  0.001
P(E=1)   0    0.9  0.9  0.9  0.99  0.99  0.99  0.999

The effect (E) is off (false) when none of the causes are true. The
probability of the effect increases with the number of true causes.

P(E=0) = 10^(−#true causes)   (for this example)

Example from L.E. Sucar.
Noisy-OR general case
Boolean → Boolean

P(E=0 | C1, C2, ..., Cn) = ∏ i=1..n qi^Ci

where Ci = 1 if the cause is true and Ci = 0 if it is false.

The example on the previous slide used qi = 0.1 for all i.

(Figure: causes C1, C2, ..., Cn with parameters q1, q2, ..., qn feeding a
product node that outputs P(E | C1, ...). Image adapted from Laskey & Mahoney
1999.)
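A minimal sketch of this formula in Python; with qi = 0.1 for all i it
reproduces the P(E=1) row of the three-cause table on the previous slide (the
function name is illustrative):

def noisy_or(causes, q):
    """causes: list of 0/1 values; q: per-cause inhibition probabilities."""
    p_off = 1.0
    for c_i, q_i in zip(causes, q):
        p_off *= q_i ** c_i       # each active cause multiplies in its q_i
    return 1.0 - p_off            # P(E = 1 | causes)

q = [0.1, 0.1, 0.1]
print(noisy_or([0, 0, 0], q))     # 0.0
print(noisy_or([1, 0, 0], q))     # 0.9
print(noisy_or([1, 1, 0], q))     # ≈ 0.99
print(noisy_or([1, 1, 1], q))     # ≈ 0.999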


Noisy-MAX
Boolean → Discrete

(Figure: causes C1, C2, ..., Cn produce individual effect levels e1, e2, ..., en;
the observed effect is their MAX. Image adapted from Laskey & Mahoney 1999.)

The effect takes on the maximum value produced by the different causes.

Restrictions:
– Each cause must have an off state, which does not contribute to the effect.
– The effect is off when all causes are off.
– The effect must have consecutive escalating values: e.g., absent, mild,
  moderate, severe.

P(E = ek | C1, C2, ..., Cn) = ∏ i=1..n qi,k^Ci
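For completeness, here is a sketch of noisy-MAX. One hedge: it follows the
common convention in which the product of the qi,k gives the cumulative
probability P(E ≤ ek), from which P(E = ek) is obtained by differencing; the
parameter values below are made up purely for illustration.

def noisy_max_cdf(causes, q, k):
    """P(E <= e_k | causes); q[i][k] = P(cause i alone keeps the effect <= e_k)."""
    p = 1.0
    for i, c_i in enumerate(causes):
        p *= q[i][k] ** c_i
    return p

def noisy_max_pmf(causes, q, n_levels):
    cdf = [noisy_max_cdf(causes, q, k) for k in range(n_levels)]
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, n_levels)]

# Two causes, effect levels {absent, mild, severe} (illustrative parameters):
q = [[0.7, 0.9, 1.0],    # cause 1
     [0.4, 0.8, 1.0]]    # cause 2
print(noisy_max_pmf([1, 1], q, 3))   # [0.28, 0.44, 0.28]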
Parametric probability densities
Boolean/Discr./Continuous → Continuous

Use parametric probability densities, e.g. the normal distribution:

P(X) = 1/(σ√(2π)) · exp( −(x − µ)² / (2σ²) ) = N(µ, σ)

Gaussian networks (a = input to the node):

P(X) = 1/(σ√(2π)) · exp( −(x − α − βa)² / (2σ²) )
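A minimal sketch of such a linear-Gaussian node: the child's density is a
normal whose mean shifts linearly with the parent value a (parameter values
are illustrative):

import math

def linear_gaussian_pdf(x, a, alpha, beta, sigma):
    mu = alpha + beta * a        # the mean depends linearly on the parent value
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(linear_gaussian_pdf(x=2.0, a=1.0, alpha=0.5, beta=1.0, sigma=0.8))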

Probit & Logit
Continuous → Boolean

If the input is continuous but the output is boolean, use the probit or the
logit:

Logit:  P(A=a | x) = 1 / (1 + exp[ −2(µ − x)/σ ])

Probit: P(A=a | x) = ∫ from −∞ to x of 1/(σ√(2π)) · exp( −(t − µ)² / (2σ²) ) dt
(Figure: the logistic sigmoid, P(A|x) plotted against x for x from −8 to 8.)
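Both soft thresholds are one-liners. This sketch keeps the slide's logit
parameterization and writes the probit as the cumulative Gaussian via the
error function; the µ and σ values are illustrative.

import math

def logit(x, mu, sigma):
    # the slide's form; note that with this sign convention P decreases in x
    return 1.0 / (1.0 + math.exp(-2.0 * (mu - x) / sigma))

def probit(x, mu, sigma):
    # Phi((x - mu) / sigma): the Gaussian integrated from -infinity to x
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

for x in (-2.0, 0.0, 2.0):
    print(x, logit(x, mu=0.0, sigma=1.0), probit(x, mu=0.0, sigma=1.0))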
The cancer network

(Same network as before.)

Age: {1-10, 11-20, ...}             discrete
Gender: {M, F}                      discrete/boolean
Toxics: {Low, Medium, High}         discrete
Smoking: {No, Light, Heavy}         discrete
Cancer: {No, Benign, Malignant}     discrete
Serum Calcium: level                continuous
Lung Tumour: {Yes, No}              discrete/boolean
Inference in BN
Inference means computing P(X|e), where X is
a query (variable) and e is a set of evidence
variables (for which we know the values).

Examples:

P(Burglary | john_calls, mary_calls)


P(Cancer | age, gender, smoking, serum_calcium)
P(Cavity | toothache, catch)
Exact inference in BN

P(X | e) = P(X, e) / P(e) = α P(X, e) = α Σy P(X, e, y)

"Doable" for boolean variables: look up
entries in the conditional probability tables
(CPTs).
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(B | j, m) = α Σ E∈{e,¬e} Σ A∈{a,¬a} P(B, E, A, j, m)

Evidence variables = {J, M}
Query variable = B

(Network and CPTs as on the earlier alarm-network slide.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(B | j, m) = α Σ E∈{e,¬e} Σ A∈{a,¬a} P(B, E, A, j, m)

For B = b:
P(b, E, A, j, m)
= P(j, m | b, E, A) P(b, E, A)
= P(j | A) P(m | A) P(b, E, A)
= P(j | A) P(m | A) P(A | b, E) P(b, E)
= P(j | A) P(m | A) P(A | b, E) P(b) P(E)
= 10⁻³ · P(j | A) P(m | A) P(A | b, E) P(E)      [since P(b) = 0.001 = 10⁻³]

(Network and CPTs as before.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(B | j, m) = α Σ E∈{e,¬e} Σ A∈{a,¬a} P(B, E, A, j, m)

P(b, j, m) = 10⁻³ Σ A∈{a,¬a}, E∈{e,¬e} P(j | A) P(m | A) P(A | b, E) P(E)
= 10⁻³ [ P(j | a) P(m | a) P(a | b, e) P(e)
       + P(j | a) P(m | a) P(a | b, ¬e) P(¬e)
       + P(j | ¬a) P(m | ¬a) P(¬a | b, e) P(e)
       + P(j | ¬a) P(m | ¬a) P(¬a | b, ¬e) P(¬e) ]
= 0.5923 × 10⁻³

Similarly, P(¬b, j, m) = 1.491 × 10⁻³.

(Network and CPTs as before.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(b, j, m) = 0.5923 × 10⁻³
P(¬b, j, m) = 1.491 × 10⁻³

α = P(j, m)⁻¹ = [P(b, j, m) + P(¬b, j, m)]⁻¹ = [2.083 × 10⁻³]⁻¹

P(b | j, m) = α P(b, j, m) = 0.284
P(¬b | j, m) = α P(¬b, j, m) = 0.716

(Network and CPTs as before.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

Answer: about 28%, since P(b | j, m) = 0.284.

(Computation, network and CPTs as on the previous slide.)
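The whole enumeration fits in a few lines of Python. This is a minimal sketch
(variable and function names are my own) that reproduces P(b | j, m) ≈ 0.284 by
summing the BN joint over the hidden variables E and A and then normalizing:

from itertools import product

P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}
P_m = {True: 0.70, False: 0.01}

def joint(b, e, a, j=True, m=True):
    """BN chain rule: P(B, E, A, J, M) for one full assignment."""
    return ((P_b if b else 1 - P_b) *
            (P_e if e else 1 - P_e) *
            (P_a[(b, e)] if a else 1 - P_a[(b, e)]) *
            (P_j[a] if j else 1 - P_j[a]) *
            (P_m[a] if m else 1 - P_m[a]))

# P(B, j, m) for both values of B, summing out E and A
unnormalized = {b: sum(joint(b, e, a) for e, a in product([True, False], repeat=2))
                for b in (True, False)}
alpha = 1.0 / sum(unnormalized.values())
print(alpha * unnormalized[True])    # P(b | j, m)  ≈ 0.284
print(alpha * unnormalized[False])   # P(¬b | j, m) ≈ 0.716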
Use depth-first search

A lot of unnecessary repeated computation...


Complexity of exact inference
• By eliminating repeated calculations and
uninteresting paths we can speed up
inference considerably.
• Linear time complexity for singly
connected networks (polytrees).
• Exponential for multiply connected
networks.
– Clustering can improve this
Approximate inference in BN
• Exact inference is intractable in large
multiply connected BNs ⇒
use approximate inference:
Monte Carlo methods (random sampling).
– Direct sampling
– Rejection sampling
– Likelihood weighting
– Markov chain Monte Carlo
Markov chain Monte Carlo
1. Fix the evidence variables (E1, E2, ...) at their
given values.
2. Initialize the network with values for all other
variables, including the query variable.
3. Repeat the following many, many, many times:
a. Pick a non-evidence variable at random (query Xi or
hidden Yj)
b. Select a new value for this variable, conditioned on the
current values in the variable’s Markov blanket.

Monitor the values of the query variables.
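A minimal sketch of this procedure for the alarm-network query P(B | j, m).
Because the network is tiny, the sketch resamples each non-evidence variable by
renormalizing the full joint, which is equivalent to conditioning on the
variable's Markov blanket; the names and the burn-in length are illustrative
choices.

import random

P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    """P(b, e, a, JohnCalls=true, MaryCalls=true) via the BN chain rule."""
    return ((0.001 if b else 0.999) * (0.002 if e else 0.998) *
            (P_a[(b, e)] if a else 1 - P_a[(b, e)]) *
            (0.90 if a else 0.05) * (0.70 if a else 0.01))

def gibbs(n_samples, burn_in=1000):
    state = {"b": False, "e": False, "a": True}   # arbitrary start; j, m stay clamped
    count = 0
    for t in range(burn_in + n_samples):
        var = random.choice(list(state))          # pick a non-evidence variable
        p_true  = joint(**dict(state, **{var: True}))
        p_false = joint(**dict(state, **{var: False}))
        state[var] = random.random() < p_true / (p_true + p_false)
        if t >= burn_in and state["b"]:
            count += 1
    return count / n_samples

print(gibbs(100_000))   # should approach P(b | j, m) ≈ 0.28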
