Machine Learning 05: Graphical Models
 Graphical Models
 Bayesian Networks
 Bayesian Networks Definition
 Conditional Independence, Revisited
 Markov Blanket
 Probabilities in Bayes Nets
 Learning of Bayes Nets
 Learning CPTs from Fully Observed Data
 Estimate θ from partly observed data
 How can we learn Bayes Net graph structure?
 KullbackLeibler Divergence
 ChowLiu Algorithm
 Resources
Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017
Graphical Models
 Key Ideas:
 Conditional Independence assumptions useful to reduce # of parameters
 but Naive Bayes is extreme!
 Graphical models express sets of conditional independence assumptions via graph structure
 Graph structure plus associated parameters define joint probability distribution over sets of parameters
 Why Grapical Models:
 xxx
Conditional Independence
P(thunder  rain, light) = P(thunder  light)
Margin Independence
(Any i, j) P(X = x_i  Y = y_j) = P(X = x_i)
(Any i, j) P(Y = y_j  X = x_i) = P(Y = y_j)
Bayesian Networks
Bayes Nets are convinient representation for encoding dependencies / conditional independence.
Bayesian Networks Definition
 4 for
P(W  L, R) = P(W  Pa)
, as shown inW
column in the diagram  2 for
 2
 2
 1 for P(StormClouds  ?) = P(StormClouds)
Bayesian Networks Joint Distribution
The joint distribution over all variables:
P(X1 ... Xn) = ∏_i P(Xi  Pa(Xi))
in which, Pa(X) = Parents = Immediat Parents
So the joint distribution of the graph is:
In this case, for each node Xi
actually describes P(Xi  Pa(Xi))
.
According to chain rule:
P(S,L,R,T,W) = P(S) P(LS) P(RS,L) P(TS,L,R) P(WS,L,R,T)
But in a Bayes Net:
P(X1 ... Xn) = ∏_i P(Xi  Pa(Xi))
P(S,L,R,T,W) = P(S) P(LS) P(RS) P(TL) P(WL,R)
P(S=1,L=0,R=1,T=0,W=1) = P(S=1) P(L=0S=1) P(R=1S=1) P(T=0L=0) P(W=1L=0,R=1)
The last item P(W=1L=0,R=1)=0.2
according to the table above.
Learning a Bayes Net
Bayes Network for X1…4 with NO assumed conditional independencies
P() = P(X1)P(X2X1)P(X3X1,X2)P(X4X1,X2,X3)
Bayes Network for Naive Bayes
P(YX1...Xn) = P(X1...XnY) P(Y) / P(X1...Xn)
= P(X1Y) P(X2Y) ... P(XnY) P(Y)
What if variables are mix of dicrete and real values?
Bayes Network for a Hidden Markov Model
Bayesian Networks Definition
A Bayes network represents the joint probability distribution over a collection of random variables.
A Bayes network is a directed acyclic graph and a set of conditional probability distributions (CPD’s)
 Each node denotes a random variable
 Edges denote dependencies
 For each node
Xi
its CPD definesP(Xi  Pa(Xi))
 The joint distribution over all variables is defined to be:
Conditional Independence, Revisited
 We said:
 Each node is conditionally independent of its nondescendents, given its immediate parents.
 Does this rule give us all of the conditional independence relations implied by the Bayes network?
 No!
 E.g.,
X1
andX4
are conditionally indep given{X2, X3}
 But
X1
andX4
not conditionally indep givenX3
 For this, we need to understand Dseparation
Easy Network 1: Head to Tail
Easy Network 2: Tail to Tail
Easy Network 3: Head to Head
X
and Y
are conditionally independent given Z
, if and only if X
and Y
are Dseperated by Z
.
Suppose we have three sets of random variables: X, Y and Z
X
and Y
are Dseparated by Z
(and therefore conditionally indep, given Z
) iff every path from every variable in X
to every variable in Y
is blocked.
A path from variable A
to variable B
is blocked by Z
if it includes a node such that either:

arrows on the path meet either headtotail or tailtotail at the node and this node is in
Z

the arrows meet headtohead at the node, and neither the node, nor any of its descendants, is in
Z
Exercise:
X1
indep ofX3
givenX2
? YX3
indep ofX1
givenX2
? YX4
indep ofX1
givenX2
? YX4
indep ofX1
givenX3
? NX4
indep ofX1
given{X3, X2}
? YX4
indep ofX1
given{}
? Y
Another Exercise:
a
indep ofb
givenc
? Na
indep ofb
givenf
? Ya
indep ofb
given{}
? Y
Markov Blanket
It is actually a "sufficient condition rule": saying that to compute xi
, we need and only need to consider the nodes in purple.
Why Markov Blanket is Useful for Inference
According to Markov Blanket, we can get:
P(ha,c,h,g) = P(hc,j)
Solve the equation as following:
Probabilities in Bayes Nets
All following parts in this section are based on this graph:
Prob. of joint assignment: easy
Suppose we are interested in joint assignment <F=f,A=a,S=s,H=h,N=n>
P(f,a,s,h,n) = P(f)P(a)P(sf,a)P(hs)P(ns)
Prob. of marginal probabilities P(Xi): not so easy
How to calculate P(N=n)
?
P(N=1) = ΣfΣaΣhΣs P(N=1,f,a,h,s)
In this equation, we need calculate 16 values to get the marginal probabilities.
Generating a sample from joint distribution: easy
How can we generate random samples drawn according to P(F,A,S,H,N)
?
Any 21th century programming language has a primitive function you can call to generate random variables between 0 ~ 1 following uniform distribution.
To generate a random sample for roots of network ( F or A ):
 let θ = P(F=1) # look up from CPD
 r = random number drawn uniformly between 0 and 1
 if r<θ then output 1, else 0
To generate a random sample for S, given F, A:
 let θ = P(S=1F=f,A=a) # look up from CPD
 r = random number drawn uniformly between 0 and 1
 if r<θ then output 1, else 0
e.g., P(N=n)
by generating many samples from joint distribution, then count the fraction of samples for which N=n
Similarly, for anything else we care about: P(F=1H=1, N=0)
Gibbs Sampling
A commonly used method in research.
Approach:
 start with arbitrary initial values for unobserved X1(0) ,…,Xn(0) (and the observed Xn+1, ..., Xm)
 iterate for s=0 to a big number:
Eventually (after burnin), the collection of samples will constitute a sample of the true P(X1,…,Xn  Xn+1, ..., Xm)
Learning of Bayes Nets
 Four categories of learning problems
 Graph structure may be known/unknown
 Variable values may be fully observed / partly unobserved
 Easy case: learn parameters when graph structure is known, and training data is fully observed
 Interesting case: graph known, data partly observed
 Gruesome case: graph structure unknown, data partly unobserved
Learning CPTs from Fully Observed Data
 Example: Consider learning the parameter:
MLE
 MLE (Max Likelihood Estimate) is:
Remember why?
 Maximum likelihood estimate:
 In our case:
MAP
 Maximum likelihood estimate
 MAP estimate
Estimate θ from partly observed data
 What if FAHN observed, but not S?
 Can't calculate MLE:
 Let
X
be all observed variable values (over all examples)  Let
Z
be all unobserved variable values  Can’t calculate MLE:
 EM seeks the estimate:
EM Algorithm  Informally
EM is a general procedure for learning from partly observed data.
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S} in the example)
Begin with arbitrary choice for parameters θ:
Iterate until convergence:
• E Step: use X, θ to estimate the unobserved Z values
• M Step: use X values and estimated Z values to derive a better θ
Guaranteed to find local maximum.
Each iteration increases:
EM Algorithm  Precisely
EM is a general procedure for learning from partly observed data.
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})
Define
Iterate until convergence:
• E Step: Use X and current θ to calculate P(ZX,θ)
• M Step: Replace current θ by θ ← argmax Q(θ'θ)
Example using EM
EM seeks the estimate:
here, observed X = {F,A,H,N}, unobserved Z = {S}
E Step: Use X, θ, to Calculate P(ZX,θ)
Observed X = {F,A,H,N}, Unobserved Z = {S}
How? Bayes net inference problem.
Recall that MLE was:
More generally, Given observed set X, unobserved set Z of boolean values:
Using Unlabeled Data to Help Train Naïve Bayes Classifier
How can we learn Bayes Net graph structure?
In general case, open problem
 can require lots of data (else high risk of overfitting)
 can use Bayesian priors, or other kinds of prior assumptions about graph structure to constrain search
One key result:
 ChowLiu algorithm: finds “best” treestructured network
 What’s best?
 suppose P(X) is true distribution, T(X) is our treestructured network, where X =
 ChowLiu minimizes KullbackLeibler divergence.
 suppose P(X) is true distribution, T(X) is our treestructured network, where X =
KullbackLeibler Divergence
 KL(P(X)  T(X)) is a measure of the difference between distribution P(X) and T(X)
 It is asymetric, always greater or equal to 0
 It is 0 iff P(X) = T(X)
ChowLiu Algorithm
Key result: To minimize KL(P  T) over possible tree networks T approximating true P, it suffices to find the tree network T that maximizes the sum of mutual informations over its edges.
 for each pair of variables A,B, use data to estimate P(A,B), P(A), and P(B)
 for each pair A, B calculate mutual information

calculate the maximum spanning tree over the set of variables, using edge weights I(A,B) — given N vars, this costs only O(N2) time
 add arrows to edges to form a directedacyclic graph
 learn the CPD’s for this graph
E.g. Greedy Algorithm to find MaxSpanning Tree
Note that for each iteration, we need to check no loop is formed becaused we need to get a tree structure.
Also note that there should be many other edges in the graph like A—F, here is just a simple representation.
ChowLiu for improving Naïve Bayes
Can we use structure learning to modify Naïve Bayes cond. indep. assumptions?
If X2 = living in Pittburg, X3 = attending CMU, the assumption of cond. indep. is not perfect fit. By adding dependencies between X variables, and prune using ChowLiu Algorithm, the dependencies between X variables can be represented and learnt.
Resources
 Bishop Ch. 8