🎠

Probabilistic Graphical Models Intro

TOC

1. Why Probabilistic Graphical Models

1.1 Motivation and Representation

Assume we have some data points with labels. When we want to train an applicable model to predict new data points, we have two choices.
  • We can find a funtion which can directly map an input value to its class label, that is . Our target is to approximate using models, for example, linear regression, SVM, decision trees, ANNs, etc.
  • We can find the probability distribution over the variables and then use this distribution to answer queries about new data points, that is, approximate the joint probability function where are input variables (features). Then
However, the second method is computation costy because the Joint Probability Distribution is exponential to the number of states of each variable. What’s worse, when we want to inference the Joint Probability Distribution to handle new queries (calculate conditional probability, etc.), it’s also computation costy. Probabilistic Graphical Models (PGM) is a technique of compactly representing Joint Probability Distribution over random variables by exploiting the (conditional) independencies between the variables. What’s better, PGM also provides us methods for efficiently doing inference over these β€œjoint distributions”.
There are two kinds of PGM:
  • Bayesian Networks: A Bayesian Network consists of a directed acyclic graph and Conditional Probability Distributions (CPDs, directed arrows shown in the graph) associated with each of the node. Each CPD is of the form .
  • Markov Networks (Markov Random Field): A Markov Network consists of an undirected graph and are parameterized by factors. Factors represent how much two or more variables agree with each other.
Some classical Probabilistic Graphical Models are shown as below.
Fig1. Classical Probabilistic Graphical Models
Fig1. Classical Probabilistic Graphical Models
Usually, the variables in Markov Networks and Bayesian Networks are discrete. When variables are continuous (assumed following Gaussian Distribution), we put up with Gaussian Networks including Gaussian Bayesian Networks and Gaussian Markov Networks.

1.2 Bayesian Networks

Definition
Let be a directed acyclic graph (DAG), let be a set of random variables indexed by . There are two equivalent definitions of Bayesian Networks.
(Factorization Definition) is a Bayesian network w.r.t. if its joint pdf can be written as a product of the individual pdfs, conditional on their parent variables:
where is the set of parents of (nodes pointing directly to )
(Local Markov Property Definition) is a Bayesian network w.r.t. if it satisfies the local markov property: each variable is conditionallly independent of its non-descendants given its parent variables:
Local Markov Property is equivalent to that each node in is only dependent on its Markov Blanket, which is a set of nodes consists of its parents, its children, and any of other parents of its children.
Fig2. Markov Blanket of v
Fig2. Markov Blanket of v
D-Seperation
There are three basic components of Bayesian Networks which imply the (conditional) dependence relationship.
Fig3. Basic Components of Bayesian Networks
Fig3. Basic Components of Bayesian Networks
  • For fork, .
    • Using Chain Rule, we have . Using factorization definition, we havet . Therefore, and thus which means . Directed chain and inverted fork have similar proofs.
  • For directed chain, .
  • For inverted fork, .
Let be a trail from node to . A trail is a connected line ignoring directions. is said to be d-separated by a set of nodes if any of the following condition holds:
  • contains a fork , such that the middle node is in .
  • contains (but does not need to be entirely) a directed chain or , such that the middle node is in .
  • contains a fork , such that the middle node is not in and no descendant of is in .
The nodes and are d-separated by if all trails between them are d-separated. If and are not d-separated, they are d-connect.
is a Bayesian Network w.r.t. iff
where is a set which d-separated and .
Note that the Markov Blanket is the minimal set of nodes which d-separates node from all other other nodes.
In Fig1, Naive Bayesian (single model), GMM (mix model), HMM / Kalman Filter / Particle Filter (dynamic models) are all Bayesian Networks.

1.3 Markov Networks (Markov Random Fields)

Definition
Given an undirected graph (can be cyclic) , a set of random variables form a Markov random field w.r.t. iff they satisfy any of the following Markov Properties (which are equivalent for positive distributions):
  • Pairwise Markov Property: Any two non-adjacent variables are conditionally independent given all other variables:
  • Local Markov Property: A variable is conditionally independent of all other variables given its neighbors:
    • where is the set of neighbors of , and is the closed neighbourhood of .
  • Global Markov Property: Any two subsets of variables are conditionally independent given a separating subset:
    • where every path from a node in to a node in passes through .
Clique Factorization
Clique is a set of nodes where there is a path between any two nodes. Maximal Clique is a clique which cannot add one another node to persist as a clique. The joint pdf can be factorized over cliques of :
Here, is the set of cliques of . The definition is equivalent if only maximal cliques are used. is a multiplier to ensure is a pdf:
is potential function whose value should be positive and usually in the form of exponential functions where is energy function.
In Fig1, Conditional Random Field (CRF) and Restricted Boltzmann Machine (RBM) are Markov Networks.
Moralization
A Bayesian network can always be converted into an undirected network with normalization constant one. The convert process is called moralization. If we take a directed graph and add side edges to all parents of a given node (and remove their directionality), then the CPDs (seen as factors over a variable and its ancestors) factorize over the resulting undirected graph.
Fig4. cs228 notes: Moralization
Fig4. cs228 notes: Moralization
The converse is also possible, but may be computationally intractable and may produce a very large (even full connected) directed graph.
A general rule of thumb is to use Bayesian networks whenever possible, and only switch to MRFs if there is no natural way to model the problem with a directed graph.

2. Inference

2.1 Inference Targets

Assume we have the knowledge of joint probability distribution , then we can inference it to calculate:
  • Marginal Probability:
    • , where is the set of all variables and is a subset of .
  • Conditional Probability:
    • , where are subsets of .
  • MAP (Maximum a Posteriori):
    • , where are subsets of .
These inferences are computation costy using sum method while there are some other efficient inference methods based on PGM.
Inference Methods of PGM can be classified into two categories: Accurate Inference and Approximate Inference. Accurate Inference methods include: Variable Elimination (VE), Belief Propagation (BP), Junction Tree Algorithm, Sum-Product Algorithm, etc. Approximate inference methods include: Loop Belief Propagation, Monte Carlo Inference (Importance Sampling, MCMC), Variational Inference, etc.

2.2 Variable Elimination (VE)

The core idea of Variable Elimination (VE) is distribution property of multiplication.
Assume we are given a graphical model as product of factors (factorization of Bayesian networks and Markov random fields):
The variable elimination algorithm repeatedly performs two factor operations: product and marginalization
  • Factor Product: the product of two factors is simply defined as . For example, .
  • Marginalization: the operation β€œlocally” eliminates a set of variables from a factor. If we have a factor over two sets of variables , marginalizing produces a new factor where the sum is over all joint assignments to the set of variables . Not that (called as marginalized factor) does not necessarily correspond to a probability distribution, even if was a CPD.
The variable elimination needs an ordering in which variables will be β€œeliminated”. And different ordering may dramatically alter the running time of the variable elimination algorithm. However, it’s NP-hard to find the best ordering.
πŸ‘¨πŸ»β€πŸ’»
Pseudocode: VE Algorithm
Loop ordered according to :
Multiply all factors containing
Marginalize out to obtain a new factor
Replace the factors with
The running time of VE is where is the possible values of discrete variables, is the number of variables and is maximum size of any factor formed during the elinimation process. It is much more efficient than the sum up method whose running time is .
Clearly some ordering are more efficient than others. Unfortunately, choosing the optimal VE ordering is an NP-hard problem. However, in practice, we may resort to the following heuristics:
  • Min-neighbors: Choose a variable with the fewest dependent variables.
  • Min-weight: Choose variables to minimize the product of the cardinalities of its dependent variables.
  • Min-fill: Choose vertices to minimize the size of the factor that will be added to the graph.
These methods often result in reasonably good performance.

2.3 Belief Propagation (BP) & Junction Tree

Variable Elimination makes PGM inference efficient while it still has shortcomings: It may involve repeated computation. For example, after we query and want to ask for another query , we need to restart the algorithm from scratch. Note that when computing marginals, VE produces many intermediate factors as a side-product. These factors turn out to be the same as the ones that we need to answer other marginal queries. By caching them after a first run of VE, we can easily answer new marginal queries at essentially no additional cost. There are two algorithms acting as β€œVE + Cache”: belief propagation (BP), and junction tree method. The former one is applied for tree-structured graphs while the latter one is applicable to general networks.
As for a tree-structured graph, if we want to compute a marginal , the optimal ordering is to root the tree at and iterating through the nodes in post-order (from leaves to up). In this way, the largest clique formed during VE will have size 2. At each step, will eliminate involving computing the factor , where is the parent of in the tree. At a later step, will be eliminated and will be passed up the tree to the parent of in order to be multiplied by the factor before being marginalized out. The factor can be thought of as a message that sends to that summarizes all of the information from the subtree rooted at . An example is shown as below.
Fig5. cs228 notes: Message passing order when using VE to compute  on a small tree
Fig5. cs228 notes: Message passing order when using VE to compute on a small tree
At the end of VE, receives messages from all of its immediate children, marginalizes them out, and we then obtain the final marginal. When we want to compute other marginals, the messages can be reused. All the messages will be sent out after precisely steps, since each edge can receive messages only twice: once from , and once in the opposite direction.
Then main idea of BP has been shown as above. There are two variants of BP:
  • Sum-product message passing: used for marginal inference (as above), i.e. computing
  • Max-product message passing: used for MAP (maximum a posteriori) inference, i.e. computing
Sum-product message passing
While there is a node ready to transmit to , the message sent is defined as:
where refers to the set of nodes that are neighbors of , excluding . Note that is precisely the factor that would transmit to during a round of variable elimination with the goal of computing .
After we have computed all messages, we may answer any marginal query over in constant time using the equation
where refers to the set of nodes that are neighbors of .
Recall that a factor graph is a bipartite graph with edges going between variables and factors, with an edge signifying a factor depends on a variable. On factor graphs, there are two types of messages: variable-to-factor messages and factor-to-variable messages . Both messages require taking a product, but only the factor-to-variable messages require a sum.
Fig6. cs228 notes: Sum-product message passing for factor trees
Fig6. cs228 notes: Sum-product message passing for factor trees
For undirected graphs, the algorithm proceeds in the same way: as long as there is a factor (or variable) ready to transmit to a variable (or factor), send the appropriate factor-to-variable (or variable-to-factor) message as defined above.
Max-product message passing
As for the MAP inference
we can just replace in equation with .
For example, for a chain MRF: where
To compute the maximum value of , we simply replace sums with maxes:
As for the inference, we just need to keep back-pointers during the optimizaiton procedure.
Junction Tree Method
The core idea of the junction tree algorithm is to turn a graph into a tree of clusters that are amenable to the variable elimination algorithm. Then we simply perform message-passing on this tree.
Suppose we have an undirected graphical model (if the model is directed, we consider its moralized graph). A junction tree over is a tree whose nodes are associated with subsets of the graph vertices (i.e. sets of variables). The junction tree must satisfy the following properties:
  • Family preservation: For each factor , there is a cluster such taht .
  • Running intersection: For every pair of clusters , every cluster on the path between contains .
Below is an example of an MRF with graph and junction tree . MRF potentials are denoted using different colors; circles indicate nodes of the junction trees; rectangular nodes represent sepsets (separation sets), which are sets of variables shared by neighboring clusters.
Fig7. cs228 notes: A junction tree example of MRF
Fig7. cs228 notes: A junction tree example of MRF
An example of invalid junction tree which does not satisfy the running intersection property is shown as below left. When itself is a tree, we may define a cluster for each edge in the tree. An example is shown as below right.
notion image
Junction Tree Algorithm
Define the potential of each cluster as the product of all the factors in that have been assigned to . By the family preservation peoperty, this is well defined, and we may assume that our distribution is in the form
Then, at each step of the algorithm, we choose a pair of adjacent clusters in and compute a message whose scope is the sepset between the two clusters:
We choose only if has received messages from all of its neighbors except . This procedure will also terminate in exactly steps. After it terminates, we will define the belief of each cluster based on all the messages that it receives
After all the messages have been passed, beliefs will be proportional to the marginal probabilities over their scopes, i.e. . We may answer queries of the form for by marginalizing out the variable in its belief
To get the actual (normalized) probability, we divide by the partition function which is computed by summing all the beliefs in a cluster, .
The running time will be exponential in the size of the largest cluster thus we want small clusters. As for how to construct good junction trees:
  • By hand: Typically, our models will have a very regular structure, for which there will be an obvious solution. For example, very often our model is a grid, in which case clusters will be associated with pairs of adjacent rows (or columns) in the grid.
  • Using variable elimination: Running VE elimination algorithm implicitly generates a junction tree over the variables. Thus it is possible to use the heuristics to define this ordering.
Loopy Belief Propagation
The junction tree algorithm has a running time that is potentially exponential in the size of the largest cluster. For many graphs, it will be difficult to find a good junction tree, applying the algorithm will not be possible. In other cases, we may not need the exact solution that the junction tree algorithm provides but be satisfied with a quick approximate solution instead.
Loopy belief propagation (LBP) is another technique for performing inference on complex (non-tree structure) graphs. Unlike the junction tree algorithm, which attenpted to efficiently find the exact solution, LBP will form our first example of an approximate inference algorithm.
The main idea of LBP is to disregard loops in the graph and perform message passing anyway. In other words, given an ordering on the edges, at each time we iterate over a pair of adjacent variables in that order and simply perform the update:
Keep performing these updates for a fixed number of steps or until convergence. This heuristic approach often works surprisingly well in practice.
Fig10. cs228 notes: Marginals obtained via LBP v.s. JT algorithm
Fig10. cs228 notes: Marginals obtained via LBP v.s. JT algorithm
LBP is a special case of variational inference algorithms.

2.4 Sampling-based inference

Β 

2.5 Variational inference

Β 

References

Wikipedia: Bayesian Network.
ermongroup: CS228 Notes.
Β 

Loading Comments...