
Cooperating Intelligent Systems

Bayesian networks
Chapter 14, AIMA
Inference
• Inference in the statistical setting means
computing the probability of different
outcomes given the available information:

P(Outcome | Information)

• We need an efficient method for doing
this, one that is more powerful than the
naïve Bayes model.
Bayesian networks
A Bayesian network is a directed graph in which
each node is annotated with quantitative
probability information:
1. A set of random variables, {X1,X2,X3,...}, makes up
the nodes of the network.
2. A set of directed links connects pairs of nodes,
parent → child.
3. Each node Xi has a conditional probability
distribution P(Xi | Parents(Xi)).
4. The graph is a directed acyclic graph (DAG).
The dentist network

(Figure: four nodes — Weather, Cavity, Toothache, Catch — with Cavity as the
parent of both Toothache and Catch, and Weather unconnected.)
The alarm network

(Figure: Burglary and Earthquake are the parents of Alarm; Alarm is the parent
of JohnCalls and MaryCalls.)

The burglar alarm responds to both earthquakes and burglars. Two neighbors,
John and Mary, have promised to call you when the alarm goes off. John always
calls when there is an alarm, and sometimes when there is no alarm. Mary
sometimes misses the alarm (she likes loud music).
The cancer network

(Figure: Age → Toxics; Age and Gender → Smoking; Toxics and Smoking → Cancer;
Cancer → Serum Calcium; Cancer and Genetic Damage → Lung Tumour.)

From Breese and Koller, 1997.
The cancer network

(Same network as above.)

P(A, G) = P(A) P(G)

P(C | S, T, A, G) = P(C | S, T)

P(SC, C, LT, GD) = P(SC | C) P(LT | C, GD) P(C) P(GD)

P(A, G, T, S, C, SC, LT, GD) =
P(A) P(G) P(T | A) P(S | A, G) P(C | T, S) P(GD) P(SC | C) P(LT | C, GD)

From Breese and Koller, 1997.
The product (chain) rule

P(X1 = x1 ∧ X2 = x2 ∧ ... ∧ Xn = xn) = P(x1, x2, ..., xn) = ∏ i=1..n P(xi | parents(Xi))

(This is for Bayesian networks; the general case comes later in this lecture.)
Bayes network node is a function

(Figure: a node C with parents A and B, here receiving the input ¬a, b.)

      a∧b   a∧¬b   ¬a∧b   ¬a∧¬b
Min   0.1   0.3    0.7    0.0
Max   1.5   1.1    1.9    0.9

P(C | ¬a, b) = U[0.7, 1.9]

(Figure: the uniform density over the interval [0.7, 1.9].)
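To make this concrete, here is a minimal sketch (not from the slides; names are
illustrative) of such a node in Python: the parent configuration selects an
interval from the Min/Max table above, and the node outputs a uniform
distribution over that interval.

import random

# (min, max) of the uniform interval for each parent configuration (a, b),
# taken from the Min/Max table above
interval = {
    (True,  True):  (0.1, 1.5),
    (True,  False): (0.3, 1.1),
    (False, True):  (0.7, 1.9),   # P(C | ¬a, b) = U[0.7, 1.9]
    (False, False): (0.0, 0.9),
}

def sample_C(a, b):
    lo, hi = interval[(a, b)]
    return random.uniform(lo, hi)

print(sample_C(False, True))      # one draw from U[0.7, 1.9]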
Bayes network node is a function

(Figure: a node with parents A and B.)

A BN node is a conditional distribution function:
• Inputs = parent values
• Output = a distribution over the node's values

Any type of function from values to distributions can be used.
Example: The alarm network

Note: each number in the tables represents a Boolean distribution, hence there
is a distribution output for every input.

(Figure: Burglary and Earthquake → Alarm; Alarm → JohnCalls and MaryCalls.)

P(B=b) = 0.001        P(E=e) = 0.002

B    E    P(A=a)
b    e    0.95
b    ¬e   0.94
¬b   e    0.29
¬b   ¬e   0.001

A    P(J=j)          A    P(M=m)
a    0.90            a    0.70
¬a   0.05            ¬a   0.01
Example: The alarm network

The probability of "no earthquake, no burglary, but alarm, and both Mary and
John make the call":

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(¬b) P(¬e) P(a | ¬b, ¬e) P(m | a) P(j | a)
= 0.999 · 0.998 · 0.001 · 0.70 · 0.90 ≈ 0.00063

(Network and CPTs as on the previous slide.)
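As a sanity check, the product above is easy to reproduce in code. This is a
minimal sketch using the CPT numbers from the slides; the variable names are
illustrative, not from the lecture.

P_b, P_e = 0.001, 0.002                                  # P(B=b), P(E=e)
P_a = {(True, True): 0.95, (True, False): 0.94,          # P(A=a | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}                          # P(J=j | A)
P_m = {True: 0.70, False: 0.01}                          # P(M=m | A)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(¬b) P(¬e) P(a | ¬b,¬e) P(m | a) P(j | a)
p = (1 - P_b) * (1 - P_e) * P_a[(False, False)] * P_m[True] * P_j[True]
print(p)                                                 # ≈ 0.00063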
Meaning of Bayesian network

The general chain rule (always true):

P(x1, x2, ..., xn) = P(x1 | x2, x3, ..., xn) P(x2, x3, ..., xn)
                   = P(x1 | x2, x3, ..., xn) P(x2 | x3, x4, ..., xn) P(x3, x4, ..., xn)
                   = ... = ∏ i=1..n P(xi | xi+1, ..., xn)

The Bayesian network chain rule:

P(x1, x2, ..., xn) = ∏ i=1..n P(xi | parents(Xi))

The BN is a correct representation of the domain iff each node is
conditionally independent of its predecessors, given its parents.
The alarm network

(Figure: the five alarm-network nodes, with the Bayesian network's edges drawn
in red alongside the additional dependencies of the "fully correct" network.)

The fully correct alarm network might look something like the figure.

The Bayesian network (red) assumes that some of the variables are independent
(or that the dependencies can be neglected since they are very weak).

The correctness of the Bayesian network of course depends on the validity of
these assumptions.

It is this sparse connection structure that makes the BN approach feasible
(roughly linear growth in complexity rather than exponential).
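A quick way to see the gain from the sparse structure: for the five Boolean
alarm-network variables, the full joint needs 2^5 − 1 = 31 independent numbers,
while the BN's CPTs need only 10 (one number per parent configuration per
node). A one-line check:

full_joint = 2 ** 5 - 1           # independent numbers in the full joint
bn_params = 1 + 1 + 4 + 2 + 2     # B, E, A|B,E, J|A, M|A
print(full_joint, bn_params)      # 31 10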
How to construct a BN?
• Add nodes in causal order ("causal" as
determined from expertise).
• Determine conditional independence using
either (or all) of the following semantics:
– Blocking/d-separation rule
– Non-descendant rule
– Markov blanket rule
– Experience/your beliefs
Path blocking & d-separation

(Figure: Cancer and Genetic Damage are the parents of Lung Tumour; Cancer is
the parent of Serum Calcium.)

Intuitively, knowledge about Serum Calcium influences our belief about Cancer
(if we don't know the value of Cancer), which in turn influences our belief
about Lung Tumour, etc.

However, if we are given the value of Cancer (i.e. C = true or false), then
knowledge of Serum Calcium will not tell us anything about Lung Tumour that we
don't already know.

We say that Cancer d-separates (direction-dependent separation) Serum Calcium
and Lung Tumour.
Path blocking & d-separation

Xi and Xj are d-separated if all paths between them are blocked.

Two nodes Xi and Xj are conditionally independent given a set Ω =
{X1, X2, X3, ...} of nodes if for every undirected path in the BN between Xi
and Xj there is some node Xk on the path with one of the following three
properties:

1. Xk ∈ Ω, and both arcs on the path lead out of Xk.
2. Xk ∈ Ω, and one arc on the path leads into Xk and one arc leads out.
3. Neither Xk nor any descendant of Xk is in Ω, and both arcs on the path
   lead into Xk.

(Figure: a path from Xi to Xj through nodes Xk1, Xk2, Xk3 illustrating the
three cases.)

Such an Xk blocks the path between Xi and Xj. If all paths are blocked,

P(Xi, Xj | Ω) = P(Xi | Ω) P(Xj | Ω)
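The three rules are mechanical enough to check in code. Below is a minimal
sketch (the helper names are my own, not from the slides) that tests whether a
single undirected path is blocked, applied to the Burglary–Alarm–Earthquake
path of the alarm network.

edges = {("Burglary", "Alarm"), ("Earthquake", "Alarm"),
         ("Alarm", "JohnCalls"), ("Alarm", "MaryCalls")}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    found, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for parent, child in edges:
            if parent == n and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def is_blocked(path, given):
    """Is this undirected path blocked by the evidence set `given`?"""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # Rule 3: a collider blocks unless it or one of its descendants is observed
            if node not in given and not (descendants(node) & given):
                return True
        else:
            # Rules 1 & 2: a fork or chain node blocks when it is observed
            if node in given:
                return True
    return False

path = ["Burglary", "Alarm", "Earthquake"]
print(is_blocked(path, set()))           # True: B and E are d-separated a priori
print(is_blocked(path, {"JohnCalls"}))   # False: evidence on a descendant unblocks the collider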
Non-descendants

A node X is conditionally independent of its
non-descendants (Z1j, ..., Znj), given its
parents (U1, ..., Um):

P(X, Z1j, ..., Znj | U1, ..., Um) =
P(X | U1, ..., Um) P(Z1j, ..., Znj | U1, ..., Um)
Markov blanket

(Figure: a node X with parents U1, ..., Um, children Y1, ..., Yn, and the
children's other parents Z1j, ..., Znj; the remaining nodes of the network are
X1, ..., Xk.)

A node is conditionally independent of all
other nodes in the network, given its
parents, children, and children's parents.
These constitute the node's Markov blanket.

P(X, X1, ..., Xk | U1, ..., Um, Z1j, ..., Znj, Y1, ..., Yn) =
P(X | U1, ..., Um, Z1j, ..., Znj, Y1, ..., Yn) P(X1, ..., Xk | U1, ..., Um, Z1j, ..., Znj, Y1, ..., Yn)
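The Markov blanket can be read directly off the edge list. A minimal sketch
using the cancer-network edges implied by the factorization earlier in the
lecture (the function name is illustrative):

edges = {("Age", "Toxics"), ("Age", "Smoking"), ("Gender", "Smoking"),
         ("Toxics", "Cancer"), ("Smoking", "Cancer"),
         ("Cancer", "SerumCalcium"),
         ("Cancer", "LungTumour"), ("GeneticDamage", "LungTumour")}

def markov_blanket(node):
    parents  = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    spouses  = {p for p, c in edges if c in children and p != node}
    return parents | children | spouses

print(markov_blanket("Cancer"))
# {'Toxics', 'Smoking', 'SerumCalcium', 'LungTumour', 'GeneticDamage'}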
Efficient representation of PDs

(Figure: a node C with parents A and B; what does P(C | a, b) look like?)

• Boolean → Boolean
• Boolean → Discrete
• Boolean → Continuous
• Discrete → Boolean
• Discrete → Discrete
• Discrete → Continuous
• Continuous → Boolean
• Continuous → Discrete
• Continuous → Continuous
Efficient representation of PDs
Boolean → Boolean: Noisy-OR, Noisy-AND
Boolean/Discrete → Discrete: Noisy-MAX
Bool./Discr./Cont. → Continuous: parametric distribution (e.g. Gaussian)
Continuous → Boolean: Logit/Probit
Noisy-OR example
Boolean → Boolean

P(E | C1, C2, C3):

C1       0    1    0    0    1     1     0     1
C2       0    0    1    0    1     0     1     1
C3       0    0    0    1    0     1     1     1
P(E=0)   1    0.1  0.1  0.1  0.01  0.01  0.01  0.001
P(E=1)   0    0.9  0.9  0.9  0.99  0.99  0.99  0.999

The effect (E) is off (false) when none of the causes are true. The
probability of the effect increases with the number of true causes.

P(E=0) = 10^(−#true causes)   (for this example)

Example from L.E. Sucar.
Noisy-OR general case
Boolean → Boolean

P(E=0 | C1, C2, ..., Cn) = ∏ i=1..n qi^Ci

where Ci = 1 if the cause is true and Ci = 0 if it is false.

The example on the previous slide used qi = 0.1 for all i.

(Figure: causes C1, C2, ..., Cn with parameters q1, q2, ..., qn feeding a
product node that outputs P(E | C1, ...). Image adapted from Laskey & Mahoney
1999.)
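A minimal sketch of this formula in Python; with qi = 0.1 for all i it
reproduces the P(E=1) row of the three-cause table on the previous slide (the
function name is illustrative):

def noisy_or(causes, q):
    """causes: list of 0/1 values; q: per-cause inhibition probabilities."""
    p_off = 1.0
    for c_i, q_i in zip(causes, q):
        p_off *= q_i ** c_i       # each active cause multiplies in its q_i
    return 1.0 - p_off            # P(E = 1 | causes)

q = [0.1, 0.1, 0.1]
print(noisy_or([0, 0, 0], q))     # 0.0
print(noisy_or([1, 0, 0], q))     # 0.9
print(noisy_or([1, 1, 0], q))     # ≈ 0.99
print(noisy_or([1, 1, 1], q))     # ≈ 0.999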


Noisy-MAX
Boolean → Discrete

(Figure: causes C1, C2, ..., Cn produce individual effect levels e1, e2, ..., en;
the observed effect is their MAX. Image adapted from Laskey & Mahoney 1999.)

The effect takes on the maximum value produced by the different causes.

Restrictions:
– Each cause must have an off state, which does not contribute to the effect.
– The effect is off when all causes are off.
– The effect must have consecutive escalating values: e.g., absent, mild,
  moderate, severe.

P(E = ek | C1, C2, ..., Cn) = ∏ i=1..n qi,k^Ci
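For completeness, here is a sketch of noisy-MAX. One hedge: it follows the
common convention in which the product of the qi,k gives the cumulative
probability P(E ≤ ek), from which P(E = ek) is obtained by differencing; the
parameter values below are made up purely for illustration.

def noisy_max_cdf(causes, q, k):
    """P(E <= e_k | causes); q[i][k] = P(cause i alone keeps the effect <= e_k)."""
    p = 1.0
    for i, c_i in enumerate(causes):
        p *= q[i][k] ** c_i
    return p

def noisy_max_pmf(causes, q, n_levels):
    cdf = [noisy_max_cdf(causes, q, k) for k in range(n_levels)]
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, n_levels)]

# Two causes, effect levels {absent, mild, severe} (illustrative parameters):
q = [[0.7, 0.9, 1.0],    # cause 1
     [0.4, 0.8, 1.0]]    # cause 2
print(noisy_max_pmf([1, 1], q, 3))   # [0.28, 0.44, 0.28]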
Parametric probability densities
Boolean/Discr./Continuous → Continuous

Use parametric probability densities, e.g. the normal distribution:

P(X) = 1/(σ√(2π)) · exp( −(x − µ)² / (2σ²) ) = N(µ, σ)

Gaussian networks (a = input to the node):

P(X) = 1/(σ√(2π)) · exp( −(x − α − βa)² / (2σ²) )
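A minimal sketch of such a linear-Gaussian node: the child's density is a
normal whose mean shifts linearly with the parent value a (parameter values
are illustrative):

import math

def linear_gaussian_pdf(x, a, alpha, beta, sigma):
    mu = alpha + beta * a        # the mean depends linearly on the parent value
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(linear_gaussian_pdf(x=2.0, a=1.0, alpha=0.5, beta=1.0, sigma=0.8))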

Probit & Logit
Continuous → Boolean

If the input is continuous but the output is boolean, use the probit or the
logit:

Logit:  P(A=a | x) = 1 / (1 + exp[ −2(µ − x)/σ ])

Probit: P(A=a | x) = ∫ from −∞ to x of 1/(σ√(2π)) · exp( −(t − µ)² / (2σ²) ) dt
(Figure: the logistic sigmoid, P(A|x) plotted against x for x from −8 to 8.)
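Both soft thresholds are one-liners. This sketch keeps the slide's logit
parameterization and writes the probit as the cumulative Gaussian via the
error function; the µ and σ values are illustrative.

import math

def logit(x, mu, sigma):
    # the slide's form; note that with this sign convention P decreases in x
    return 1.0 / (1.0 + math.exp(-2.0 * (mu - x) / sigma))

def probit(x, mu, sigma):
    # Phi((x - mu) / sigma): the Gaussian integrated from -infinity to x
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

for x in (-2.0, 0.0, 2.0):
    print(x, logit(x, mu=0.0, sigma=1.0), probit(x, mu=0.0, sigma=1.0))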
The cancer network

(Same network as before.)

Age: {1-10, 11-20, ...}             discrete
Gender: {M, F}                      discrete/boolean
Toxics: {Low, Medium, High}         discrete
Smoking: {No, Light, Heavy}         discrete
Cancer: {No, Benign, Malignant}     discrete
Serum Calcium: level                continuous
Lung Tumour: {Yes, No}              discrete/boolean
Inference in BN
Inference means computing P(X|e), where X is
a query (variable) and e is a set of evidence
variables (for which we know the values).

Examples:

P(Burglary | john_calls, mary_calls)


P(Cancer | age, gender, smoking, serum_calcium)
P(Cavity | toothache, catch)
Exact inference in BN

P(X | e) = P(X, e) / P(e) = α P(X, e) = α Σy P(X, e, y)

"Doable" for boolean variables: look up
entries in the conditional probability tables
(CPTs).
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(B | j, m) = α Σ E∈{e,¬e} Σ A∈{a,¬a} P(B, E, A, j, m)

Evidence variables = {J, M}
Query variable = B

(Network and CPTs as on the earlier alarm-network slide.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(B | j, m) = α Σ E∈{e,¬e} Σ A∈{a,¬a} P(B, E, A, j, m)

For B = b:
P(b, E, A, j, m)
= P(j, m | b, E, A) P(b, E, A)
= P(j | A) P(m | A) P(b, E, A)
= P(j | A) P(m | A) P(A | b, E) P(b, E)
= P(j | A) P(m | A) P(A | b, E) P(b) P(E)
= 10⁻³ · P(j | A) P(m | A) P(A | b, E) P(E)      [since P(b) = 0.001 = 10⁻³]

(Network and CPTs as before.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(B | j, m) = α Σ E∈{e,¬e} Σ A∈{a,¬a} P(B, E, A, j, m)

P(b, j, m) = 10⁻³ Σ A∈{a,¬a}, E∈{e,¬e} P(j | A) P(m | A) P(A | b, E) P(E)
= 10⁻³ [ P(j | a) P(m | a) P(a | b, e) P(e)
       + P(j | a) P(m | a) P(a | b, ¬e) P(¬e)
       + P(j | ¬a) P(m | ¬a) P(¬a | b, e) P(e)
       + P(j | ¬a) P(m | ¬a) P(¬a | b, ¬e) P(¬e) ]
= 0.5923 × 10⁻³

Similarly, P(¬b, j, m) = 1.491 × 10⁻³.

(Network and CPTs as before.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

P(b, j, m) = 0.5923 × 10⁻³
P(¬b, j, m) = 1.491 × 10⁻³

α = P(j, m)⁻¹ = [P(b, j, m) + P(¬b, j, m)]⁻¹ = [2.083 × 10⁻³]⁻¹

P(b | j, m) = α P(b, j, m) = 0.284
P(¬b | j, m) = α P(¬b, j, m) = 0.716

(Network and CPTs as before.)
Example: The alarm network
What is the probability of a burglary if both John and Mary call?

Answer: about 28%, since P(b | j, m) = 0.284.

(Computation, network and CPTs as on the previous slide.)
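The whole enumeration fits in a few lines of Python. This is a minimal sketch
(variable and function names are my own) that reproduces P(b | j, m) ≈ 0.284 by
summing the BN joint over the hidden variables E and A and then normalizing:

from itertools import product

P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}
P_m = {True: 0.70, False: 0.01}

def joint(b, e, a, j=True, m=True):
    """BN chain rule: P(B, E, A, J, M) for one full assignment."""
    return ((P_b if b else 1 - P_b) *
            (P_e if e else 1 - P_e) *
            (P_a[(b, e)] if a else 1 - P_a[(b, e)]) *
            (P_j[a] if j else 1 - P_j[a]) *
            (P_m[a] if m else 1 - P_m[a]))

# P(B, j, m) for both values of B, summing out E and A
unnormalized = {b: sum(joint(b, e, a) for e, a in product([True, False], repeat=2))
                for b in (True, False)}
alpha = 1.0 / sum(unnormalized.values())
print(alpha * unnormalized[True])    # P(b | j, m)  ≈ 0.284
print(alpha * unnormalized[False])   # P(¬b | j, m) ≈ 0.716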
Use depth-first search

A lot of unnecessary repeated computation...


Complexity of exact inference
• By eliminating repeated calculations and
uninteresting paths we can speed up
inference considerably.
• Linear time complexity for singly
connected networks (polytrees).
• Exponential for multiply connected
networks.
– Clustering can improve this
Approximate inference in BN
• Exact inference is intractable in large
multiply connected BNs ⇒
use approximate inference:
Monte Carlo methods (random sampling).
– Direct sampling
– Rejection sampling
– Likelihood weighting
– Markov chain Monte Carlo
Markov chain Monte Carlo
1. Fix the evidence variables (E1, E2, ...) at their
given values.
2. Initialize the network with values for all other
variables, including the query variable.
3. Repeat the following many, many, many times:
a. Pick a non-evidence variable at random (query Xi or
hidden Yj)
b. Select a new value for this variable, conditioned on the
current values in the variable’s Markov blanket.

Monitor the values of the query variables.
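A minimal sketch of this procedure for the alarm-network query P(B | j, m).
Because the network is tiny, the sketch resamples each non-evidence variable by
renormalizing the full joint, which is equivalent to conditioning on the
variable's Markov blanket; the names and the burn-in length are illustrative
choices.

import random

P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    """P(b, e, a, JohnCalls=true, MaryCalls=true) via the BN chain rule."""
    return ((0.001 if b else 0.999) * (0.002 if e else 0.998) *
            (P_a[(b, e)] if a else 1 - P_a[(b, e)]) *
            (0.90 if a else 0.05) * (0.70 if a else 0.01))

def gibbs(n_samples, burn_in=1000):
    state = {"b": False, "e": False, "a": True}   # arbitrary start; j, m stay clamped
    count = 0
    for t in range(burn_in + n_samples):
        var = random.choice(list(state))          # pick a non-evidence variable
        p_true  = joint(**dict(state, **{var: True}))
        p_false = joint(**dict(state, **{var: False}))
        state[var] = random.random() < p_true / (p_true + p_false)
        if t >= burn_in and state["b"]:
            count += 1
    return count / n_samples

print(gibbs(100_000))   # should approach P(b | j, m) ≈ 0.28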
