Lecture 6: Learning Partially Observed GM and the EM Algorithm
Introduction to estimating the parameters of partially observed graphical models from data using the Expectation-Maximization (EM) algorithm, known as the Baum-Welch algorithm in the HMM setting.
Introduction
In previous lectures, we introduced the concept of learning graphical models from data, where learning is the process of estimating the parameters, and in some cases the network structure, from the data. Lecture 5 introduced this concept in the setting of completely observed GMs. Maximum likelihood estimation (MLE) in the setting of fully observed nodes and globally independent parameters for each conditional probability distribution is straightforward because the likelihood function can be fully decomposed as a product of independent terms, one for each conditional probability distribution in the network. This means that we can maximize each local likelihood function independently of the rest of the network, and then combine the solutions to get an MLE solution.
In this lecture, we turn our attention to partially observed graphical models, an equally important class of models including Hidden Markov Models and Gaussian Mixture Models. As we will see, in the presence of partially observed data, we lose important properties of the likelihood function: its unimodality, closed-form representation, and decomposition as a product of likelihoods for the different parameters. As a result the learning problem becomes substantially more complex, and we turn to the Expectation-Maximization algorithm to enable estimation of our model parameters.
Partially Observed GMs
Directed but partially observed GM
First consider the case of a fully observed, directed graphical model:

$$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$

Compare this to the case of directed, but partially observed GMs. Suppose we now do not observe one of the variables, yet we would still like to write down the likelihood of the data. To do this, we marginalize (i.e. integrate or sum out) the unobserved variable:

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$

Unobserved variables:
- A variable can be unobserved or latent because it is:
  - an abstract or imaginary quantity meant to simplify the data generation process, e.g. speech recognition models, mixture models;
  - a real-world object that is difficult or impossible to measure, e.g. the temperature of a star, causes of disease, evolutionary ancestors;
  - a real-world object that was not measured due to missed samples, e.g. faulty sensors.
- Discrete latent variables can be used to partition or cluster data into sub-groups.
- Continuous latent variables (factors) can be used for dimensionality reduction (e.g. factor analysis).
Example: HMMs for Speech Recognition
Our phones are capable of recognizing speech patterns and converting them to text. The initial approaches to this problem were based on Hidden Markov Models (HMMs). We assume there is a latent state that generates the noisy signal of speech that can be “chunked” into different components, or phonemes. We create dictionaries of different phonemes for different languages, and then we try to infer the sequence of phonemes that generated the speech.

Example: A Bayesian Network for Biological Evolution

Probabilistic Inference
A GM $M$ describes a unique probability distribution $P$. Two typical tasks follow:

Task 1. How do we answer queries about $P$, e.g. $P(X_i \mid D)$?
- We use inference as a name for the process of computing answers to these queries.

Task 2. How do we estimate a plausible model $M$ from data $D$?
- We use learning as a name for the process of obtaining a point estimate of $M$. For the Bayesian approach, we seek $P(M \mid D)$, which is also an inference problem.
There are many approaches for inference in GMs. They can be divided into two classes:
- Exact inference algorithms, including the elimination algorithm, message-passing algorithms (sum-product, belief propagation), and the junction tree algorithm. These algorithms give the exact answer to a query and were covered in previous lectures.
- Approximate inference techniques, including stochastic simulation / sampling methods, Markov chain Monte Carlo (MCMC) methods, and variational algorithms. These algorithms give only an approximate answer to the inference query. We will cover these methods in future lectures.
Mixture Models
A density model $p(x)$ may be multi-modal, but we may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Consider a Gaussian mixture model (GMM):

- $Z$ is a latent class indicator vector:

$$p(z_n) = \text{multi}(z_n ; \pi) = \prod_k (\pi_k)^{z_n^k}$$

- $X$ is a conditional Gaussian variable with a class-specific mean/covariance:

$$p(x_n \mid z_n^k = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}$$

- The likelihood of a sample is obtained by marginalizing out $z_n$:

$$p(x_n \mid \pi, \mu, \Sigma) = \sum_{z_n} p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \Sigma) = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
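As a quick numerical illustration of this marginal likelihood, here is a minimal NumPy/SciPy sketch that evaluates $\log p(x_n \mid \pi, \mu, \Sigma)$ for a toy two-component mixture; the parameter values and function names are our own illustrative choices, not anything from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Per-sample log p(x_n) = log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log of each weighted component density, shape (N, K)
    log_terms = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])
    # marginalize the latent indicator z_n: sum over k in log space
    return logsumexp(log_terms, axis=1)

# toy two-component mixture in 2-D (illustrative values)
pis = np.array([0.4, 0.6])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
X = np.array([[0.1, -0.2], [2.8, 3.1]])
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```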
MLE Solution for a Fully-Observed Gaussian Mixture Model
- The data log-likelihood can be decomposed when our latent variable $Z$ is also observed:

$$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \Sigma) = \sum_n \sum_k z_n^k \log \pi_k - \frac{1}{2} \sum_n \sum_k z_n^k \left[ (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log|\Sigma_k| \right] + C$$

- Thus the MLE solution can be found separately for each parameter, e.g.

$$\hat{\pi}_{k,\text{MLE}} = \frac{\sum_n z_n^k}{N}, \qquad \hat{\mu}_{k,\text{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$$
We do not typically have knowledge of the latent class indicators $z_n$, however, which brings us to the partially observed setting.
Estimating the Parameters of a Partially-Observed GMM
What if we do not know $z_n$? The strategy is to iterate between two steps:
- Estimate some “missing” or “unobserved” data from observed data and a current estimate of the parameters.
- Using this “complete” data, find the maximum likelihood parameter estimates.
Thus we alternate between filling in the latent variables using our best guess (the posterior) and updating the parameters based on this guess. So what does the EM algorithm look like for a GMM? Recall that estimation was straightforward when the $z_n$ were observed; EM works with the expected version of that objective.
- The expected complete log likelihood:

$$\langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z \mid x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z \mid x)} = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left[ (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log|\Sigma_k| + C \right]$$

We aim to maximize $\langle \ell_c(\theta; x, z) \rangle$ iteratively using the following two steps:

- Expectation Step: Compute the expected value of the sufficient statistics of the latent variables (e.g. $z$) given our current estimate of the parameters (i.e., $\pi$ and $\mu$):

$$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}) = \frac{\pi_k^{(t)} \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k'} \pi_{k'}^{(t)} \mathcal{N}(x_n \mid \mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}$$

- Maximization Step: Compute the parameters given the current expected values of the hidden variables:

$$\pi_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}}{N}, \qquad \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}, \qquad \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$$
This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations. In the general formulation of the E-M algorithm they are replaced by their corresponding “sufficient statistics”.
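These two updates translate almost line-for-line into code. Below is a minimal NumPy/SciPy sketch of EM for a GMM, with function and variable names of our own choosing; a production implementation would add log-space computations, covariance regularization, and a convergence check on the log-likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture model. Returns (pi, mu, Sigma, tau)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]            # init means at random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)  # init covariances at data covariance

    for _ in range(n_iters):
        # E-step: responsibilities tau[n, k] = p(z_n^k = 1 | x_n, theta^t)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
        ])
        tau = dens / dens.sum(axis=1, keepdims=True)

        # M-step: fully observed MLE with z_n^k replaced by its expectation tau[n, k]
        Nk = tau.sum(axis=0)                                 # expected counts per component
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, tau
```

Note how the M-step is literally the fully observed MLE with the hard indicators $z_n^k$ replaced by the soft responsibilities $\tau_n^k$.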
Relationship to K-Means Clustering
Big picture: The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm. Suppose we have a dataset $\{x_1, \ldots, x_N\}$ that we wish to partition into $K$ clusters. Let $r_{nk} \in \{0, 1\}$ indicate whether point $x_n$ is assigned to cluster $k$, and let $\mu_k$ denote the centroid of cluster $k$. K-means minimizes the distortion objective

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2.$$

This objective is non-convex because we must minimize $J$ jointly over the assignments $r_{nk}$ and the centroids $\mu_k$. Instead, we alternate between two steps:

- Minimize $J$ with respect to $r_{nk}$, keeping $\mu_k$ fixed ("E-step").
- Minimize $J$ with respect to $\mu_k$, keeping $r_{nk}$ fixed ("M-step").
Before iterating, we randomly initialize the centroids $\mu_k$. Then:

- "E-step": We assign each data point to the closest cluster center using a hard assignment: $r_{nk} = 1$ if $k = \arg\min_j \|x_n - \mu_j\|^2$, and $r_{nk} = 0$ otherwise.
- "M-step": We set $\mu_k$ equal to the mean of all data points assigned to cluster $k$: $\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$. Recall that the weights are either $0$ or $1$.

Each iteration reduces the objective function $J$, so the procedure converges, though only to a local minimum (see the sketch below).
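For contrast with the GMM code above, here is a minimal NumPy K-means sketch (names are our own) in which the soft responsibilities become hard 0/1 assignments:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard-assignment counterpart of EM for a GMM: returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # random initial centroids
    for _ in range(n_iters):
        # "E-step": assign each point to its nearest centroid (r_nk is 0 or 1)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (N, K)
        assign = dists.argmin(axis=1)
        # "M-step": each centroid becomes the mean of the points assigned to it
        mu = np.stack([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                       for k in range(K)])
    return mu, assign
```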
Expectation-Maximization (E-M) Algorithm
Complete and Incomplete Log Likelihood
Complete Log Likelihood
The complete log likelihood is the likelihood we would use if both $x$ and $z$ were observed: $\ell_c(\theta; x, z) = \log p(x, z \mid \theta)$. It decomposes into a sum of local terms (as in the fully observed case), so the MLE is straightforward.
Incomplete Log Likelihood
If $z$ is unobserved, the objective becomes the log of a marginal probability: $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$. The sum inside the logarithm couples all of the parameters, so this objective no longer decomposes.
Expected Complete Log Likelihood
Define the expected complete log likelihood for any distribution $q(z)$ over the latent variables:

$$\langle \ell_c(\theta; x, z) \rangle_q = \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$$
The expected complete log likelihood can be used to create a lower bound on the incomplete log likelihood. The proof uses Jensen's inequality (for a concave function such as $\log$, $f(\mathbb{E}[x]) \geq \mathbb{E}[f(x)]$):

$$\ell(\theta; x) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)} \geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \langle \ell_c(\theta; x, z) \rangle_q + H_q$$

The second term $H_q = -\sum_z q(z \mid x) \log q(z \mid x)$ is the entropy of $q$ and does not depend on $\theta$.
Lower Bounds and Free Energy
For fixed data $x$, define a functional called the free energy, which by the bound above satisfies

$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \leq \ell(\theta; x).$$

We can perform the EM algorithm (more generally, an MM algorithm) as coordinate ascent on $F$:

- E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
- M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
E-step
The E-step is a maximization of $F$ over the distribution $q$ on the latent variables, given the data and the current parameters:

$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$

If we set $q^{t+1}$ to this posterior, the lower bound becomes tight: $F(q^{t+1}, \theta^t) = \langle \ell_c(\theta^t; x, z) \rangle_{q^{t+1}} + H_{q^{t+1}} = \ell(\theta^t; x)$.
M-step
The M-step is a maximization of $F$ over the parameters, given the data and the current $q$. As discussed previously, the free energy breaks into two terms, one of which (the entropy $H_q$) does not depend on $\theta$, so maximizing $F$ over $\theta$ is equivalent to maximizing the expected complete log likelihood:

$$\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta) = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}$$

If $q^{t+1}$ is the posterior computed in the E-step, this is isomorphic to fully observed MLE, with the sufficient statistics of the hidden variables replaced by their expected values under that posterior.
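Putting the two steps together yields the familiar monotonicity guarantee (a standard argument, stated here in the notation above): the E-step makes the bound tight, the M-step can only increase $F$, and $F$ always lower-bounds the incomplete log likelihood, so

$$\ell(\theta^{t+1}; x) \;\geq\; F(q^{t+1}, \theta^{t+1}) \;\geq\; F(q^{t+1}, \theta^{t}) \;=\; \ell(\theta^{t}; x).$$

Hence the incomplete log likelihood never decreases across EM iterations, which is the guarantee cited in the summary below.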
Example: Hidden Markov Models & Baum-Welch Algorithm

Hidden Markov Models (HMMs) are one of the common ways of modeling time series data. They are used in traditional speech recognition systems, natural language processing, computational biology, etc. HMMs allow us to deal with both observed and hidden events. Most common HMMs follow the first-order Markov assumption, which states that, when predicting the future, the past does not matter, only the present: for a sequence of states $y_1, y_2, \ldots, y_T$, we have $p(y_t \mid y_1, \ldots, y_{t-1}) = p(y_t \mid y_{t-1})$.
Hidden Markov Model Definitions
An HMM is specified by the following components:

- A set of $N$ states, $\mathbf{y} = y_1, y_2, \ldots, y_N$.
- Transition probabilities between any two states: $a_{ij} = p(y_t = j \mid y_{t-1} = i)$.
- Output observations, $\mathbf{x} = x_1, x_2, \ldots, x_T$.
- Emission probabilities, the probability of an observation being generated from a state: $b_i(k) = p(x_t = k \mid y_t = i)$.
- Start or prior probabilities, $\pi_i = p(y_1 = i)$, with $\sum_i \pi_i = 1$.
EM for Hidden Markov Models
For an HMM with hidden state sequences $\mathbf{y}$, observed sequences $\mathbf{x}$, and parameters $\theta = (\pi, A, B)$ (start, transition, and emission probabilities), EM takes the following form (the Baum-Welch algorithm):

- The complete log likelihood:

$$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right)$$

- The expected complete log likelihood:

$$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle \log a_{ij} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle \log b_{ik}$$

- Expectation Step: Fix $\theta$ and compute the marginal posteriors $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$ and $\xi_{n,t}^{ij} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$, which are obtained with the forward-backward algorithm.
- Maximization Step: Update $\theta$ by MLE with the counts replaced by these expected counts:

$$\pi_i^{\text{new}} = \frac{\sum_n \gamma_{n,1}^i}{N}, \qquad a_{ij}^{\text{new}} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{ij}}{\sum_n \sum_{t=2}^{T} \gamma_{n,t-1}^i}, \qquad b_{ik}^{\text{new}} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$$
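The marginals $\gamma$ and $\xi$ come from the forward-backward recursions. Below is a minimal Baum-Welch sketch for a single discrete observation sequence (NumPy; the names are our own, and scaling for long sequences, multiple sequences, and convergence checks are omitted for brevity):

```python
import numpy as np

def baum_welch(x, N, M, n_iters=50, seed=0):
    """EM for a discrete HMM. x: sequence of symbols in {0, ..., M-1},
    N: number of hidden states, M: number of observation symbols."""
    x = np.asarray(x)
    T = len(x)
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.dirichlet(np.ones(N), size=N)     # A[i, j] = p(y_t = j | y_{t-1} = i)
    B = rng.dirichlet(np.ones(M), size=N)     # B[i, k] = p(x_t = k | y_t = i)

    for _ in range(n_iters):
        # E-step: forward-backward recursions give gamma[t, i] and xi[t, i, j]
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        beta = np.zeros((T, N))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        evidence = alpha[-1].sum()                         # p(x | theta)
        gamma = alpha * beta / evidence                    # gamma[t, i] = p(y_t = i | x)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, x[1:]].T * beta[1:])[:, None, :]) / evidence   # xi[t, i, j]

        # M-step: MLE with observed counts replaced by expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[x == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```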
Example: EM for a general BN
For a general Bayesian network with hidden nodes, the same recipe applies: in the E-step, run inference on each data case to compute the posterior over the hidden nodes and accumulate expected sufficient statistics (expected counts) for every CPD; in the M-step, re-estimate each CPD from its expected counts, exactly as in the fully observed MLE. A small runnable instance of this recipe is sketched below.
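As a concrete, runnable instance (a small hypothetical network with one hidden class node $Z$ and binary observed children, i.e. a naive-Bayes-style BN; all function and variable names are our own):

```python
import numpy as np

def em_hidden_class_bn(X, K=2, n_iters=100, seed=0):
    """EM for a BN with a hidden class node Z and binary observed children X_1..X_d.
    CPDs: pi[k] = p(Z = k), mu[k, j] = p(X_j = 1 | Z = k)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, d))

    for _ in range(n_iters):
        # E-step: posterior over the hidden node for each data case (inference)
        log_post = np.log(pi) + X @ np.log(mu.T) + (1 - X) @ np.log(1 - mu.T)   # (N, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)            # q(z_n = k | x_n, theta)

        # expected sufficient statistics (expected counts) for each CPD
        Nk = post.sum(axis=0)                              # expected count of Z = k
        Njk = post.T @ X                                   # expected count of (Z = k, X_j = 1)

        # M-step: re-estimate each CPD from its expected counts, as in the fully observed case
        pi = Nk / N
        mu = (Njk + 1e-3) / (Nk[:, None] + 2e-3)           # light smoothing to avoid 0/1 probabilities
    return pi, mu, post
```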

Example: EM for conditional mixture models
We will model $p(y \mid x)$ as a mixture of experts, where each expert is responsible for a different region of the input space.

- A latent variable $Z$ selects the expert via a softmax gate (the softmax guarantees the gate outputs a valid distribution): $p(z^k = 1 \mid x) = \frac{\exp(\xi_k^T x)}{\sum_j \exp(\xi_j^T x)}$.
- Each expert is a linear regression; its output distribution has mean $\theta_k^T x$ and variance $\sigma_k^2$: $p(y \mid x, z^k = 1) = \mathcal{N}(y;\, \theta_k^T x,\, \sigma_k^2)$.
- The posterior expert responsibilities assigning points to experts are

$$\tau_n^k = p(z^k = 1 \mid x_n, y_n) = \frac{p(z^k = 1 \mid x_n)\, p_k(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x_n)\, p_j(y_n \mid x_n, \theta_j, \sigma_j^2)}$$

- We use the expected complete log likelihood as the surrogate objective:

$$\langle \ell_c(\theta; x, y, z) \rangle = \sum_n \sum_k \langle z_n^k \rangle \log \left[ p(z_n^k = 1 \mid x_n)\, \mathcal{N}(y_n;\, \theta_k^T x_n,\, \sigma_k^2) \right]$$
The resulting EM algorithm:
- E-step calculates the responsibilities
- M-step calculates linear regressions for each expert, each data point weighted by the expert’s responsibility (see homework)
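A minimal NumPy sketch of this EM loop (our own naming; for simplicity the gating parameters are updated with a few gradient steps on the expected complete log likelihood rather than the full IRLS update mentioned later):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def em_mixture_of_experts(X, y, K=2, n_iters=100, lr=0.5, seed=0):
    """EM for a mixture of K linear-regression experts with a softmax gate.
    X: (N, d) inputs (include a column of ones for a bias), y: (N,) targets."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    xi = rng.normal(scale=0.1, size=(d, K))       # gate: p(z = k | x) = softmax(x^T xi)_k
    theta = rng.normal(scale=0.1, size=(d, K))    # expert k: y ~ N(theta_k^T x, sigma2_k)
    sigma2 = np.ones(K)

    for _ in range(n_iters):
        # E-step: responsibilities proportional to gate weight times expert likelihood
        gate = softmax(X @ xi)                                          # (N, K)
        resid = y[:, None] - X @ theta                                  # (N, K)
        lik = np.exp(-0.5 * resid**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        tau = gate * lik
        tau /= tau.sum(axis=1, keepdims=True)

        # M-step: weighted least squares per expert, each point weighted by its responsibility
        for k in range(K):
            w = tau[:, k]
            theta[:, k] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
            sigma2[k] = np.sum(w * (y - X @ theta[:, k]) ** 2) / w.sum()
        # gate update: a few gradient steps on the expected complete log likelihood
        for _ in range(10):
            xi += lr * X.T @ (tau - softmax(X @ xi)) / N
    return xi, theta, sigma2, tau
```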
EM Variants
- Partially Hidden Variables: We can use EM when there are missing (hidden) variables in some cases and not in others. In the E-Step we can estimate the hidden variables for only the incomplete cases. The M-step then optimizes the log-likelihood on the complete data plus the expected likelihood of the incomplete data found using the E-step.
- Sparse EM: In this variant, you do not exactly re-compute the posterior probability on each data point under all models, because it is practically zero. Instead, you keep an “active list” which you update every once in a while.
- Generalized (Incomplete) EM: It might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that improves the likelihood a bit (e.g. gradient step). Recall the IRLS step in the mixture of experts model.
Summary
Learning the parameters of partially observed graphical models, or latent variable models, is considerably more difficult than for fully observed GMs. This is because the unobserved or latent variables tie the observed variables together via marginalization, and we can no longer decompose the likelihood function into a product of local terms, one per conditional probability distribution.
The EM Algorithm provides a way of maximizing the likelihood function for latent variable models. We can find the MLE parameters when the original problem can be broken into 2 steps: (1) Estimating or “filling in” the unobserved latent variables using the best guess (posterior) provided from our observed data and current parameters, and (2) updating the parameters based on this guess:
E-Step: $q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$

M-Step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
- EM Pros:
- No learning rate (step-size) parameter. It is possible that one or both minimizations can be computed analytically.
- Automatically enforces parameter constraints. Problems can be subdivided into discrete and continuous optimization problems.
- Very fast for low dimensions. Each minimization can be computed quickly and analytically.
- Each iteration guaranteed to improve the likelihood
- EM Cons
- Can get stuck in local minima. Even if both subproblems are convex, iteratively solving them need not reach a global optimum.
- Can be slower than conjugate gradient (especially near convergence). Each subproblem is solved from scratch in each iteration, as opposed to an iterative method that refines a solution: we solve for $\theta$ given the current $q$ while ignoring the current $\theta$, then solve for the best $q$ given $\theta$ while ignoring the current $q$.
- Requires a computationally expensive inference step.
- Is a maximum likelihood/MAP method
Additional Resources
- Jordan textbook, Ch. 11
- Koller textbook, Ch. 19.1-19.4
- Borman, The EM algorithm (A short tutorial)
- Variations on EM algorithm by Neal and Hinton
References
- Jurafsky, D., 2000. Speech & Language Processing.
- Barber, D., 2009. Bayesian Reasoning and Machine Learning.
- Jordan, M.I., 2003. An Introduction to Probabilistic Graphical Models. University of California, Berkeley.
- Koller, D. and Friedman, N., 2009. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press. [PDF]
- Borman, S., 2009. The Expectation Maximization Algorithm: A Short Tutorial. [PDF]
- Neal, R.M. and Hinton, G.E., 1999. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. In Learning in Graphical Models, pp. 355-368. MIT Press. [PDF]