
Introduction to Discrete Probability Theory and Bayesian Networks

Dr Michael Ashcroft September 15, 2011

This document remains the property of Inatas. Reproduction in whole or in part without the written permission of Inatas is strictly forbidden.

Contents

1 Introduction to Discrete Probability
  1.1 Discrete Probability Spaces
      1.1.1 Sample Spaces, Outcomes and Events
      1.1.2 Probability Functions
  1.2 The probabilities of events
  1.3 Random Variables
  1.4 Combinations of events
  1.5 Conditional Probability
  1.6 Independence
  1.7 Conditional Independence
  1.8 The Chain Rule
  1.9 Bayes Theorem

2 Introduction to Bayesian Networks
  2.1 Bayesian Networks
  2.2 D-Separation, The Markov Blanket and Markov Equivalence
  2.3 Potentials
  2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm
  2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm
  2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

1 Introduction to Discrete Probability

1.1 Discrete Probability Spaces

A discrete probability space is a pair ⟨S, P⟩, where S is a sample space and P is a probability function.

1.1.1 Sample Spaces, Outcomes and Events

An outcome is a value that the stochastic system we are modeling can take. The sample space of our model is the set of all outcomes. An event is a subset of the sample space. (So events are also sets of outcomes.)

Example 1. The sample space corresponding to rolling a die is {1, 2, 3, 4, 5, 6}. The outcomes of this sample space are 1, 2, 3, 4, 5 and 6. The events of this sample space are the members of its power set: ∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {3, 4}, {3, 5}, {3, 6}, {4, 5}, {4, 6}, {5, 6}, {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {1, 4, 5}, {1, 4, 6}, {1, 5, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 4, 5}, {2, 4, 6}, {2, 5, 6}, {3, 4, 5}, {3, 4, 6}, {3, 5, 6}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {2, 4, 5, 6}, {3, 4, 5, 6}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}, {1, 3, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}

1.1.2 Probability Functions

A probability function is a function p from S to the real numbers, such that:

(i) 0 ≤ p(s) ≤ 1, for each s ∈ S.

(ii) Σ_{s∈S} p(s) = 1.

Example 2. If a six-sided die is fair, the sample space associated with the result of a single throw is {1, 2, 3, 4, 5, 6} and the probability function is:

p(1) = 1/6, p(2) = 1/6, p(3) = 1/6, p(4) = 1/6, p(5) = 1/6, p(6) = 1/6

Example 3. Now imagine a die that is not fair: It has twice the probability of coming up six as it does of coming up any other number. The probability function for such a die would be:

p(1) = 1/7, p(2) = 1/7, p(3) = 1/7, p(4) = 1/7, p(5) = 1/7, p(6) = 2/7
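The two dice above can be checked against the definition of a probability function directly. The sketch below (the names `fair_die`, `biased_die` and `is_probability_function` are ours, not from the text) uses exact fractions to avoid rounding issues:

```python
# A discrete probability function represented as a mapping from outcomes
# to values in [0, 1] that sum to 1; the dice are from Examples 2 and 3.
from fractions import Fraction

fair_die = {s: Fraction(1, 6) for s in range(1, 7)}
biased_die = {s: Fraction(1, 7) for s in range(1, 6)}
biased_die[6] = Fraction(2, 7)

def is_probability_function(p):
    """Check conditions (i) and (ii) from the definition."""
    return all(0 <= v <= 1 for v in p.values()) and sum(p.values()) == 1

assert is_probability_function(fair_die)
assert is_probability_function(biased_die)
```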

1.2 The probabilities of events

The probability of an event E is defined:

p(E) = Σ_{o∈E} p(o)

Example 4. Continuing example 2 (the fair die), the event E, that the roll of the die produces an odd number, is {1, 3, 5}. Therefore:

p(E) = Σ_{o∈E} p(o) = p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 1/2

Example 5. Likewise, for example 3 (the biased die), the event F, that the roll of the die produces 5 or 6, is {5, 6}. Therefore:

p(F) = Σ_{o∈F} p(o) = p(5) + p(6) = 1/7 + 2/7 = 3/7

Notice that an event represents the disjunctive claim that one of the outcomes that are its members occurred. So event E ({1, 3, 5}) represents the event that the roll of the die produced 1 or 3 or 5.
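The definition p(E) = Σ_{o∈E} p(o) is a one-line computation; a minimal sketch reproducing Examples 4 and 5 (the dice dictionaries are our own encoding of the examples):

```python
from fractions import Fraction

# The fair and biased dice of Examples 2 and 3.
fair_die = {s: Fraction(1, 6) for s in range(1, 7)}
biased_die = {s: Fraction(1, 7) for s in range(1, 6)}
biased_die[6] = Fraction(2, 7)

def p_event(p, event):
    """p(E) = sum of p(o) over the outcomes o in E."""
    return sum(p[o] for o in event)

assert p_event(fair_die, {1, 3, 5}) == Fraction(1, 2)  # Example 4
assert p_event(biased_die, {5, 6}) == Fraction(3, 7)   # Example 5
```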

1.3 Random Variables

Note: A random variable is neither random nor a variable! Often, we are interested in numerical values that are connected with our outcomes. We use random variables to model these. A random variable is a function from a sample space to the real numbers.

Example 6. Suppose a coin is flipped three times. Let X(t) be the random variable that equals the number of heads that appear when t is the outcome. Then X(t) takes the following values:

X(HHH) = 3
X(HHT) = X(HTH) = X(THH) = 2
X(HTT) = X(THT) = X(TTH) = 1
X(TTT) = 0

Notice that a random variable divides the sample space into a disjoint and exhaustive set of events, each mapped to a unique real number, r. Let us term the event mapped to r Er. The probability distribution of a random variable, X, on a sample space, S, is the set of ordered pairs ⟨r, p(X = r)⟩ for all r ∈ X(S), where p(X = r) = Σ_{o∈Er} p(o). This is the probability that an outcome o occurred such that X(o) = r, and is often characterized by saying that X took the value r. As we would expect:

(i) 0 ≤ p(Er) ≤ 1, for each r ∈ X(S).

(ii) Σ_{r∈X(S)} p(Er) = 1.

Placing this in table form gives us a familiar discrete probability distribution. Points to note about random variables:

1. A random variable and its probability distribution together constitute a probability space.

2. A function from the codomain of a random variable to the real numbers is itself a random variable.

1.4 Combinations of events

Some theorems:

1. p(Ē) = 1 − p(E)
2. p(E and F) = p(E, F) = p(E ∩ F)
3. p(E or F) = p(E ∪ F) = p(E) + p(F) − p(E ∩ F)

Theorem 1 should be obvious! Note, though, that combined with the definition of a probability distribution, it entails that p(∅) = 0. Let's look at an example for theorem 3.

Example 7. Returning to example 4 with the fair die, let the event E be {1, 3, 5} (that the roll of the die produces an odd number), and event F be {5, 6} (that the roll of the die produces 5 or 6). We want to calculate the probability that one of these events occurs, which is to say that the event E ∪ F occurs. Using theorem 3 we see:

p(E ∪ F) = p(E) + p(F) − p(E ∩ F)
         = (p(1) + p(3) + p(5)) + (p(5) + p(6)) − p(5)
         = p(1) + p(3) + p(5) + p(6)
         = 4/6 = 2/3

Which is as it should be!
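The theorems above can be checked numerically on the fair die with Python's set operations (`|` for union, `&` for intersection); this sketch uses our own helper `p`:

```python
from fractions import Fraction

fair_die = {s: Fraction(1, 6) for s in range(1, 7)}
p = lambda event: sum(fair_die[o] for o in event)

E, F = {1, 3, 5}, {5, 6}
# Theorem 3: p(E or F) = p(E) + p(F) - p(E and F)
assert p(E | F) == p(E) + p(F) - p(E & F) == Fraction(2, 3)
# Theorem 1: p(complement of E) = 1 - p(E)
assert p(set(range(1, 7)) - E) == 1 - p(E)
```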

1.5 Conditional Probability

The conditional probability of one event, E, given another, F, is denoted p(E|F), and defined:

p(E|F) = p(E ∩ F) / p(F)

Example 8. Continuing example 7, we can calculate the probability that the roll of the die produces 5 or 6 (event F) given that it produces an odd number (event E):

p(F|E) = p(E ∩ F) / p(E)
       = p(5) / (p(1) + p(3) + p(5))
       = (1/6) / (3/6)
       = 1/3

1.6 Independence

Two events, E and F, are independent if and only if p(E, F) = p(E)p(F). That is to say, the probability of both events occurring is simply the probability of the first event occurring multiplied by the probability that the second event occurs. Likewise, two random variables X and Y are independent if and only if p(X = r1, Y = r2) = p(X = r1)p(Y = r2), for all real numbers r1 and r2.

Independence is of great practical importance: It significantly simplifies working out complex probabilities. Where a number of events are independent, we can quickly calculate their joint probability distribution from their individual probabilities.

Example 9. Imagine we are examining the results of the (ordered) tosses of three coins. Given that the possible results of each coin are {H, T}, the sample space for our model will be {H, T}³. Let us define three random variables, X1, X2, X3. X1 maps outcomes to 1 if the first coin lands heads, and 0 otherwise. X2 and X3 do likewise for the second and third coins. Now assume we are given the following information:

1. p(X1 = 1) = 0.7
2. p(X2 = 1) = 0.4
3. p(X3 = 1) = 0.1

If we also know that these random variables are independent, then we can immediately calculate the joint probability distribution for the three random variables from these three values alone (remembering that p(Xi = 0) = 1 − p(Xi = 1)):

1. p(X1 = 1, X2 = 1, X3 = 1) = 0.7 × 0.4 × 0.1 = 0.028
2. p(X1 = 1, X2 = 1, X3 = 0) = 0.7 × 0.4 × 0.9 = 0.252
3. p(X1 = 1, X2 = 0, X3 = 1) = 0.7 × 0.6 × 0.1 = 0.042
4. p(X1 = 1, X2 = 0, X3 = 0) = 0.7 × 0.6 × 0.9 = 0.378
5. p(X1 = 0, X2 = 1, X3 = 1) = 0.3 × 0.4 × 0.1 = 0.012
6. p(X1 = 0, X2 = 1, X3 = 0) = 0.3 × 0.4 × 0.9 = 0.108
7. p(X1 = 0, X2 = 0, X3 = 1) = 0.3 × 0.6 × 0.1 = 0.018
8. p(X1 = 0, X2 = 0, X3 = 0) = 0.3 × 0.6 × 0.9 = 0.162

If we do not know these random variables are independent, we require much more information. In fact, we will need the values for each of the entries in the joint probability distribution. Notice that:

- Our storage requirements have jumped from linear in the number of random variables to exponential. (Very bad.)
- Our computational complexity has fallen from linear in the number of random variables to constant. (Good, but we could obtain this in the earlier case as well, if we kept the probabilities after we calculated them.)

Typically, the probability distributions that are of interest to us are such that this exponential storage complexity renders them intractable. Some methods for dealing with this, such as the naive Bayes classifier, simply assume independence among the random variables they are modeling. But this can lead to significantly lower accuracy from the model.
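The eight entries of Example 9's joint distribution come from a single loop over the value combinations; a sketch (the dictionary layout is our choice):

```python
from itertools import product

# Marginals from Example 9; independence lets us build the joint directly.
p1, p2, p3 = 0.7, 0.4, 0.1

joint = {}
for x1, x2, x3 in product((1, 0), repeat=3):
    joint[(x1, x2, x3)] = ((p1 if x1 else 1 - p1)
                           * (p2 if x2 else 1 - p2)
                           * (p3 if x3 else 1 - p3))

assert abs(joint[(1, 1, 1)] - 0.028) < 1e-12
assert abs(joint[(0, 0, 0)] - 0.162) < 1e-12
assert abs(sum(joint.values()) - 1.0) < 1e-12
```

Note the storage point from the text: the three marginals determine all 2³ entries here, but without independence all eight entries would have to be stored explicitly.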

1.7 Conditional Independence

Analogously to independence, we say that two events, E and F, are conditionally independent given another, G, if and only if p(G) ≠ 0 and one of the following holds:

1. p(E|F ∩ G) = p(E|G), where p(E|G) ≠ 0 and p(F|G) ≠ 0.
2. p(E|G) = 0 or p(F|G) = 0.

Example 10. Say we have 13 objects. Each object is either black (B) or white (W), each object has either a 1 or a 2 written on it, and each object is either a square (□) or a diamond (◇). The objects are:

B1□, B1□, B2□, B2□, B2□, B2□, B1◇, B2◇, B2◇
W1□, W2□, W1◇, W2◇

If we are interested in the characteristics of a randomly drawn object and assume all objects have an equal chance of being drawn, then using the techniques we have already looked at, we can see that the event, E1, that a randomly selected object has a 1 written on it is not independent of the event, E2, that such an object is square. But they are conditionally independent given the event, EB, that the object is black (and, in fact, also given the event that the object is white):

p(E1) = 5/13
p(E1|E2) = 3/8
p(E1|EB) = 3/9 = 1/3
p(E1|E2 ∩ EB) = 2/6 = 1/3

There is little more to say about conditional independence at this point, but soon it will take center stage as a means of obtaining the accuracy of using the full joint distribution of the random variables we are modeling while avoiding the complexity issues that accompany this.
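The four probabilities of Example 10 can be recomputed by counting; this sketch encodes each object as a (colour, number, shape) triple and uses our own helper `p` for conditional relative frequencies:

```python
from fractions import Fraction

# The 13 objects of Example 10.
objects = ([("B", 1, "sq")] * 2 + [("B", 2, "sq")] * 4
           + [("B", 1, "di")] + [("B", 2, "di")] * 2
           + [("W", 1, "sq"), ("W", 2, "sq"), ("W", 1, "di"), ("W", 2, "di")])

def p(pred, given=lambda o: True):
    """Probability of pred for a uniformly drawn object, given a condition."""
    pool = [o for o in objects if given(o)]
    return Fraction(sum(pred(o) for o in pool), len(pool))

is1 = lambda o: o[1] == 1
sq = lambda o: o[2] == "sq"
black = lambda o: o[0] == "B"

assert p(is1) == Fraction(5, 13)       # E1 and E2 are not independent...
assert p(is1, sq) == Fraction(3, 8)
assert p(is1, black) == Fraction(1, 3) # ...but given EB, E2 adds nothing:
assert p(is1, lambda o: sq(o) and black(o)) == Fraction(1, 3)
```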

1.8 The Chain Rule

The chain rule for events says that given n events, E1, E2, ..., En, defined on the same sample space S:

p(E1, E2, ..., En) = p(En | En−1, En−2, ..., E1) ... p(E2 | E1) p(E1)

Applied to random variables, this gives us that for n random variables, X1, X2, ..., Xn, defined on the same sample space S:

p(X1 = x1, X2 = x2, ..., Xn = xn) = p(Xn = xn | Xn−1 = xn−1, Xn−2 = xn−2, ..., X1 = x1) ... p(X2 = x2 | X1 = x1) p(X1 = x1)

It is straightforward to prove this rule using the rule for conditional probability.
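The chain rule holds for any joint distribution; a sketch checking it numerically on the three-coin joint of Example 9 (chosen only because its entries are easy to build; the helper `marg` is ours):

```python
from itertools import product

# Joint distribution over three binary variables (the coins of Example 9).
p1, p2, p3 = 0.7, 0.4, 0.1
joint = {(a, b, c): (p1 if a else 1 - p1) * (p2 if b else 1 - p2)
                    * (p3 if c else 1 - p3)
         for a, b, c in product((1, 0), repeat=3)}

def marg(fixed):
    """Marginal probability of a partial assignment, e.g. {0: 1, 1: 0}."""
    return sum(v for k, v in joint.items()
               if all(k[i] == x for i, x in fixed.items()))

# Chain rule: p(X1, X2, X3) = p(X3|X2, X1) p(X2|X1) p(X1)
chain = (marg({0: 1, 1: 0, 2: 1}) / marg({0: 1, 1: 0})   # p(X3=1 | X1=1, X2=0)
         * marg({0: 1, 1: 0}) / marg({0: 1})             # p(X2=0 | X1=1)
         * marg({0: 1}))                                  # p(X1=1)
assert abs(chain - joint[(1, 0, 1)]) < 1e-12
```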

1.9 Bayes Theorem

Bayes theorem is:

p(F|E) = p(E|F)p(F) / (p(E|F)p(F) + p(E|¬F)p(¬F))

Proof:

1. By the definition of conditional probability, p(F|E) = p(E ∩ F)/p(E) and p(E|F) = p(E ∩ F)/p(F).

2. Therefore, p(E ∩ F) = p(F|E)p(E) = p(E|F)p(F).

3. Therefore, p(F|E) = p(E|F)p(F)/p(E).

4. p(E) = p(E ∩ S) = p(E ∩ (F ∪ ¬F)) = p((E ∩ F) ∪ (E ∩ ¬F)).

5. (E ∩ F) and (E ∩ ¬F) are disjoint (otherwise there would be an x ∈ (F ∩ ¬F) = ∅), so p(E) = p((E ∩ F) ∪ (E ∩ ¬F)) = p(E|F)p(F) + p(E|¬F)p(¬F).

6. Therefore p(F|E) = p(E|F)p(F)/p(E) = p(E|F)p(F) / (p(E|F)p(F) + p(E|¬F)p(¬F)). (Bayes theorem)

Example 11. Suppose 1 person in 100000 has a particular rare disease. There exists a diagnostic test for this disease that is accurate 99% of the time when given to those who have the disease and 99.5% of the time when given to those who do not. Given this information, we can find the probability that someone who tests positive for the disease actually has the disease.

Let E be the event that someone tests positive for the disease and F be the event that a person has the disease. We want to find p(F|E). We know that p(F) = 1/100000 and so p(¬F) = 99999/100000. We also know that p(E|F) = 99/100, so p(¬E|F) = 1/100. Likewise we know that p(¬E|¬F) = 995/1000, so p(E|¬F) = 5/1000. So by Bayes theorem:

p(F|E) = p(E|F)p(F) / (p(E|F)p(F) + p(E|¬F)p(¬F))
       = (0.99)(0.00001) / ((0.99)(0.00001) + (0.005)(0.99999))
       ≈ 0.002

Notice that the result was not intuitively obvious. Most people, if told only the information we had available, assume that testing positive means a very high probability of having the disease.
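Example 11 is a direct plug-in to Bayes' theorem; a sketch (variable names are ours):

```python
# Example 11, computed directly from Bayes' theorem.
p_F = 1 / 100000          # prior probability of having the disease
p_E_given_F = 0.99        # test accuracy on those with the disease
p_E_given_notF = 0.005    # false positive rate (1 - 0.995)

p_F_given_E = (p_E_given_F * p_F
               / (p_E_given_F * p_F + p_E_given_notF * (1 - p_F)))

assert abs(p_F_given_E - 0.002) < 1e-4  # roughly 0.002, as in the text
```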

2 Introduction to Bayesian Networks

2.1 Bayesian Networks

A Bayesian Network is a model of a system, which in turn consists of a number of random variables. It consists of:


1. A directed acyclic graph (DAG), within which each random variable is represented by a node. The topology of this DAG must meet the Markov Condition: Each node must be conditionally independent of its non-descendants given its parents.

2. A set of conditional probability distributions, one for each node, which give the probability of the random variable represented by the given node taking particular values given the values the random variables represented by the node's parents take.

Examine the DAG in Figure 1 and the information in Table 1. From the chain rule, we know that the joint probability distribution of the random variables is p(A, B, C, D, E) = p(E|D, C, B, A)p(D|C, B, A)p(C|B, A)p(B|A)p(A). But given the conditional independencies present in P, we know that:

p(C|B ∩ A) = p(C|A)
p(D|C ∩ B ∩ A) = p(D|C ∩ B)
p(E|D ∩ C ∩ B ∩ A) = p(E|C)

So we know that p(A, B, C, D, E) = p(E|C)p(D|C, B)p(C|A)p(B|A)p(A). This may not seem a huge improvement, but it is. It means we can calculate the full joint distribution from the (normally much, much smaller) conditional probability tables associated with each node. As the networks get bigger, the advantages of such a method become crucial.

What we have done is pull the joint probability distribution apart by its conditional independencies. We now have a means of obtaining tractable calculations using the full joint distribution. It has been proven that every discrete probability distribution (and many continuous ones) can be represented by a Bayesian Network, and that every Bayesian Network represents some probability distribution. Of course, if there are no conditional independencies in the joint probability distribution, representing it with a Bayesian Network gains us nothing.
But in practice, while independence relationships between random variables in a system we are interested in modeling are rare (and assumptions regarding such independence dangerous), conditional independencies are plentiful.

Some important points about Bayesian Networks:

- Bayesian Networks provide much more information than simple classifiers (like neural networks, or support vector machines, etc). Most importantly, when used to predict the value a random variable will take, they return a probability distribution rather than simply specifying which value is most probable. Obviously, there are many advantages to this.


- Bayesian Networks have an easily understandable and informative physical interpretation (unlike neural networks, or support vector machines, etc, which are effectively black boxes to all but experts). We will see one advantage of this in the next section.

- We can use Bayesian Networks simply to model the correlations and conditional independencies between the random variables of systems. But generally we are interested in inferring the probability distributions of a subset of the random variables of the network given knowledge of the values taken by another (possibly empty) subset.

- Bayesian Networks can also be extended to Influence Diagrams, with decision and utility nodes, in order to perform automated decision making.
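The factorization p(A, B, C, D, E) = p(E|C)p(D|C, B)p(C|A)p(B|A)p(A) can be checked numerically. The CPT numbers below are invented for illustration (the text gives none); only the structure, matching the parent sets of Figure 1, matters:

```python
from itertools import product

# Hypothetical CPTs for five binary variables with parents as in the
# factorization above: B and C depend on A, D on B and C, E on C.
pA = lambda a: 0.6 if a else 0.4
pB = lambda b, a: (0.7 if b else 0.3) if a else (0.2 if b else 0.8)
pC = lambda c, a: (0.1 if c else 0.9) if a else (0.5 if c else 0.5)
pD = lambda d, b, c: (0.9 if d else 0.1) if (b and c) else (0.3 if d else 0.7)
pE = lambda e, c: (0.8 if e else 0.2) if c else (0.4 if e else 0.6)

# The joint is the product of the CPTs, as the factorization states.
joint = {v: pA(v[0]) * pB(v[1], v[0]) * pC(v[2], v[0])
            * pD(v[3], v[1], v[2]) * pE(v[4], v[2])
         for v in product((0, 1), repeat=5)}

assert abs(sum(joint.values()) - 1.0) < 1e-12  # a valid distribution

# Markov condition: p(E | A, B, C, D) depends only on C.
def cond_E(e, a, b, c, d):
    num = joint[(a, b, c, d, e)]
    den = sum(joint[(a, b, c, d, x)] for x in (0, 1))
    return num / den

# Two assignments that differ everywhere except C give the same answer.
assert abs(cond_E(1, 0, 1, 1, 0) - cond_E(1, 1, 0, 1, 1)) < 1e-12
```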

2.2 D-Separation, The Markov Blanket and Markov Equivalence

The Markov Condition also entails other conditional independencies. These conditional independencies have a graph-theoretic criterion called D-Separation (which we will not define, as it is difficult). Accordingly, when one set of random variables, X, is conditionally independent of another, Y, given a third, Z, then we will say that the nodes representing the random variables in X are D-Separated from Y by Z.

The most important case of D-Separation/Conditional Independence is: A node is D-Separated from the rest of the graph given its parents, its children, and the other parents of its children. Because of this, the parents, children and other parents of a node's children are called the Markov Blanket of the node.

This is important. Imagine we have a node, n (which is associated with a random variable), whose probability distribution we wish to predict and whose Markov Blanket is the set of nodes, M. If we know the values of (the random variables associated with) every node in M, then there is no more information to be had regarding the value taken by (the random variable associated with) n. In this way, if we are confident that we can always establish the values of some of the random variables our network is modeling, we can often see that certain of the random variables are superfluous, and we need not continue to include them in the network nor collect information on them. Since, in practice, collecting data on random variables can be costly, this can be very helpful.

We will also say that two DAGs are Markov Equivalent if they have the same D-Separations.


Figure 1: A DAG with five nodes

Node | Conditional independencies
A    | (none)
B    | C and E, given A
C    | B, given A
D    | A and E, given B and C
E    | A, B and D, given C

Table 1: Conditional independencies required of the random variables for the DAG in Figure 1 to be a Bayesian Network

Figure 2: The Markov Blanket of Node L

2.3 Potentials

Where V is a set of random variables {v1, ..., vn}, let V̄ be the Cartesian product of the co-domains of the random variables in V. So V̄ consists of all the possible combinations of values that the random variables of V can take. Let πV be a mapping V × V̄ → ℝ, such that πV(vi, x) = the ith term of x, where x ∈ V̄. I.e. πV gives us the value assigned to a particular member of V by a particular member of V̄. If W ⊆ V, let σW be a mapping V̄ → W̄, such that πW(x, σW(y)) = πV(x, y), for all x ∈ W, y ∈ V̄. So σW gives us the member of W̄ in which all the members of W are assigned the same values as in a particular member of V̄.

A potential is an ordered pair ⟨V, F⟩, where V is a set of random variables, and F is a mapping V̄ → ℝ. Given a set of potentials, {⟨V1, F1⟩, ..., ⟨Vn, Fn⟩}, the multiplication of these potentials is itself a potential, ⟨V, F⟩, where:

V = ∪ᵢ₌₁ⁿ Vi

F(x) = Πᵢ₌₁ⁿ Fi(σVi(x))

This is simpler than it appears. We call the set of random variables in a potential the potential's scheme. The scheme of a product of a set of potentials is the union of the schemes of the factors. Likewise, the value assigned by the function in the product potential to a particular value combination of the random variables is the product of the values assigned by the functions of the factors to the same value combination (for those random variables present in each factor).

Example 12. Take the multiplication of two potentials pot1 = ⟨{X1, X2}, f⟩ and pot2 = ⟨{X1, X3}, g⟩, where all random variables are binary:

x1 | x2 | f(X1 = x1, X2 = x2)
1  | 1  | 0.7
1  | 0  | 0.3
0  | 1  | 0.4
0  | 0  | 0.6

Table 2: pot1

x1 | x3 | g(X1 = x1, X3 = x3)
1  | 1  | 0.1
1  | 0  | 0.9
0  | 1  | 0.2
0  | 0  | 0.8

Table 3: pot2

Where pot3 = pot1 × pot2, we have:

x1 | x2 | x3 |           | h(X1 = x1, X2 = x2, X3 = x3)
1  | 1  | 1  | 0.7 × 0.1 | 0.07
1  | 1  | 0  | 0.7 × 0.9 | 0.63
1  | 0  | 1  | 0.3 × 0.1 | 0.03
1  | 0  | 0  | 0.3 × 0.9 | 0.27
0  | 1  | 1  | 0.4 × 0.2 | 0.08
0  | 1  | 0  | 0.4 × 0.8 | 0.32
0  | 0  | 1  | 0.6 × 0.2 | 0.12
0  | 0  | 0  | 0.6 × 0.8 | 0.48

Table 4: pot3

Given a potential ⟨V, F⟩, the marginalization out of some random variable v ∈ V from this potential is itself a potential, ⟨V′, F′⟩, where:

V′ = V \ {v}

F′(x) = Σ F(y), summing over those y ∈ V̄ with σV′(y) = x

Example 13. If pot4 is the result of marginalizing X1 out of pot1 from Example 12, then:

x2 | i(X2 = x2)
1  | 1.1
0  | 0.9

Table 5: pot4

Some points:

- Potentials are simply generalizations of probability distributions: the latter are necessarily the former, but not vice versa.
- In fact, a conditional probability table is a potential, not a distribution. Unlike distributions, potentials need not sum to 1.
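Multiplication and marginalization of potentials are short enough to sketch directly. This sketch represents a potential as a (scheme, table) pair over binary variables and reproduces Tables 4 and 5 (the representation is our choice, not from the text):

```python
from itertools import product

# pot1 and pot2 from Example 12.
pot1 = (("X1", "X2"), {(1, 1): 0.7, (1, 0): 0.3, (0, 1): 0.4, (0, 0): 0.6})
pot2 = (("X1", "X3"), {(1, 1): 0.1, (1, 0): 0.9, (0, 1): 0.2, (0, 0): 0.8})

def multiply(p, q):
    """Scheme is the union of the factors' schemes; each entry is the
    product of the factors' entries on the matching sub-assignments."""
    (ps, pt), (qs, qt) = p, q
    scheme = ps + tuple(v for v in qs if v not in ps)
    table = {}
    for vals in product((0, 1), repeat=len(scheme)):
        a = dict(zip(scheme, vals))
        table[vals] = (pt[tuple(a[v] for v in ps)]
                       * qt[tuple(a[v] for v in qs)])
    return scheme, table

def marginalize_out(p, var):
    """Sum the potential over one of its variables."""
    ps, pt = p
    i = ps.index(var)
    scheme = ps[:i] + ps[i + 1:]
    table = {}
    for vals, x in pt.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + x
    return scheme, table

pot3 = multiply(pot1, pot2)
assert abs(pot3[1][(1, 1, 1)] - 0.07) < 1e-12   # 0.7 * 0.1, as in Table 4
pot4 = marginalize_out(pot1, "X1")
assert abs(pot4[1][(1,)] - 1.1) < 1e-12          # as in Table 5
```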


2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm

Let Γ be a subset of the random variables in our network. Let f be a function that assigns to each random variable v ∈ Γ a particular value, f(v), from those that v can take. To obtain the probability that the random variables in Γ take the values assigned to them by f:

1. Perform a topological sort on the DAG. This gives us an ordering where all nodes occur before their descendants. From the definition of a DAG, this is always possible.

2. For each node, n, construct a bucket, bn. Also construct a null bucket, b∅.

3. For each conditional probability distribution in the network:

   (a) Create a list of the random variables present in the conditional probability distribution.
   (b) For each random variable v ∈ Γ, eliminate all rows corresponding to values other than f(v), and eliminate v from the associated list.
   (c) Associate this list with the resulting potential and place this potential in the bucket associated with the highest ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

4. Proceed in the given order through the buckets:

   (a) Create a new potential by multiplying all potentials in the bucket. Associate with this potential a list of random variables that includes all random variables on the lists associated with the original potentials in the bucket.
   (b) In this potential, marginalize out (i.e. sum over) the random variable associated with the bucket. Remove the random variable associated with the bucket from the associated list.
   (c) Place the resulting potential in the bucket associated with the highest ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

5. Multiply together the potentials in the null bucket (this is simply scalar multiplication).
To obtain the a posteriori probability that a subset of random variables, Γ, takes particular values given the observation that a second subset, Δ, has taken particular values, we run the algorithm twice: first on Γ ∪ Δ, then on Δ, and we divide the first result by the second.

Some points to note:

- The algorithm can be extended to obtain good estimates of error bars for our probability estimates, and wishing to do so is the main reason for using the algorithm.

- The complexity of the algorithm is dominated by the largest potential, which will be at least the size of the largest conditional probability table and which is, in practice, much, much smaller than the full joint distribution.

- When used to calculate a large number of probabilities (such as the a posteriori probability distributions for each unobserved random variable), the algorithm is relatively inefficient, since, if f is a function from the random variables in the network to the number of values each can take, it must be run f(v) − 1 times for each unobserved random variable, v.

- The algorithm can be run on the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm

The Junction Tree algorithm is the work horse of Bayesian Network inference algorithms, permitting efficient exact inference. It does not, though, permit the calculation of error bars for our probability estimates. Since the Junction Tree algorithm is a generalization of the Variable Elimination algorithm, there is hope that the extension to the latter that permits us to obtain such error bars can likewise be generalized so as to be utilized in the former. Whether, and if so how, this can be done is an open research question.

This algorithm utilizes a secondary structure formed from the Bayesian Network called a Junction Tree or Join Tree. We first show how to create this structure. Some definitions:

- A cluster is a maximal complete sub-graph.
- The weight of a node is the number of values its associated random variable has.
- The weight of a cluster is the product of the weights of its constituent nodes.

The Create (an Optimal) Junction Tree Algorithm:

1. Take a copy, G, of the DAG, join all unconnected parents and undirect all edges.

Figure 3: A simple Bayesian Network

2. While there are still nodes left in G:

   (a) Select a node, n, from G, such that n causes the least number of edges to be added in step 2b, breaking ties by choosing the node which induces the cluster with the least weight.
   (b) Form a cluster, C, from the node and its neighbors, adding edges as required.
   (c) If C is not a sub-graph of a previously stored cluster, store C as a clique.
   (d) Remove n from G.

3. Create n trees, each consisting of a single stored clique. Also create a set, S, of candidate sepsets, one for each pair of cliques (a sepset being the intersection of the two cliques). Repeat until n − 1 sepsets have been inserted into the forest:

   (a) Select from S the sepset, s, that has the largest number of variables in it, breaking ties by calculating the product of the number of values of the random variables in the sepsets, and choosing the sepset with the lowest. Further ties can be broken arbitrarily.
   (b) Delete s from S.
   (c) Insert s between its cliques X and Y only if X and Y are on different trees in the forest. (This merges their two trees into a larger tree, until you are left with a single tree: the Junction Tree.)

Before explaining how to perform inference using a Junction Tree, we require some definitions:

Evidence Potentials

Figure 4: The Junction Tree constructed from Figure 3, with cliques {B, E}, {D, E, G}, {G, I}, {F, H}, {C, D, F} and {A, C, D}, joined by sepsets {E}, {G}, {D}, {F} and {C, D}

Variable | Value 1 | Value 2 | Value 3 | Notes
A        | 1       | 1       | 1       | Nothing known
B        | 1       | 0       | 0       | Observed to be value 1
C        | 1       | 0       | 1       | Observed to not be value 2
D        | 0.75    | 0.2     | 0.05    | Soft evidence, with actual probabilities
E        | 150     | 40      | 10      | Soft evidence, assigns same probabilities as D

Table 6: Evidence potentials


An evidence potential has a singleton set of random variables, and maps the random variable's values to real numbers. If working with hard evidence, it will map 0 to values which evidence has ruled out, and 1 to all other values (where at least one value must be mapped to 1). Where all values are mapped to 1, nothing is known about the random variable. Where all values except one are mapped to 0, it is known that the random variable takes the specified value. If working with soft evidence, values can be mapped to any non-negative real number, but the sum of these must be non-zero. Such a potential assigns the values probabilities as specified by its normalization.

Message Pass

We pass a message from one clique, c1, to another, c2, via the intervening sepset, s, by:

1. Save the potential associated with s.
2. Marginalize a new potential for s, containing only those variables in s, out of c1.
3. Assign a new potential to c2, such that: pot(c2)new = pot(c2)old × pot(s)new / pot(s)old

Collect Evidence

When called on a clique, c, Collect Evidence does the following:

1. Marks c.
2. Calls Collect Evidence recursively on the unmarked neighbors of c, if any.
3. Passes a message from c to the clique that called Collect Evidence, if any.

Disperse Evidence

When called on a clique, c, Disperse Evidence does the following:

1. Marks c.
2. Passes a message to each of the unmarked neighbors of c, if any.
3. Calls Disperse Evidence recursively on the unmarked neighbors of c, if any.

To perform inference on a Junction Tree, we use the following algorithm:

1. Associate with each clique and sepset a potential, whose random variables are those of the clique/sepset, and which associates with all value combinations of these random variables the value 1.

2. For each node:


   (a) Associate with the node an evidence potential representing current knowledge.
   (b) Find a clique containing the node and its parents (it is certain to exist) and multiply in the node's conditional probability table to the clique's potential. (By "multiply in" is meant: multiply the node's conditional probability table and the clique's potential, and replace the clique's potential with the result.)
   (c) Multiply in the evidence potential associated with the node.

3. Pick an arbitrary root clique, and call Collect Evidence and then Disperse Evidence on this clique.

4. For each node you wish to obtain a posteriori probabilities for:

   (a) Select the smallest clique containing this node.
   (b) Create a copy of the potential associated with this clique.
   (c) Marginalize all other nodes out of the copy.
   (d) Normalize the resulting potential. This is the random variable's a posteriori probability distribution.

Some points to note:

- The complexity of the algorithm is dominated by the largest potential associated with a clique, which will be at least the size of, and probably much larger than, the largest conditional probability table. But it is, in practice, much smaller than the full joint distribution. When cliques are relatively small, the algorithm is comparatively efficient. There are also numerous techniques to improve efficiency available in the literature.

- A Junction Tree can be formed from the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

If the network is sufficiently complex, exact inference algorithms will become intractable. In such cases we turn to likelihood sampling. Using this algorithm, given a set of random variables, E, whose values we know (or are assuming), we can estimate a posteriori probabilities for the other random variables, U, in the network:

1. Perform a topological sort on the DAG.

2. Set all random variables in E to the value they are known/assumed to take.

3. For each random variable in U, create a score card, with a number for each value the random variable can take. Initially set all numbers to zero.

4. Repeat:

   (a) In the order generated in step 1, for each node in U, randomly assign values to each random variable using their conditional probability tables.
   (b) Given the values assigned, calculate p(E = e) from the conditional probability tables of the random variables in E. I.e., where Par(v) is the set of random variables associated with the parents of the node associated with random variable v, par(v) are the values these parents have been assigned, and E = {E1, ..., En}, calculate:

   p(E = e) = Π_{Ei∈E} p(Ei = ei | Par(Ei) = par(Ei))

   (c) For each random variable in U, add p(E = e) to the score for the value it was assigned in this sample.

5. For each random variable in U, normalize its score card. This is an estimate of the random variable's a posteriori probability distribution.

