Be able to work through multiple iterations of particle filtering.
Implement the Forward-Backward Algorithm for HMMs.
Implement particle filtering for a variety of Bayesian Networks.
Apply smoothing to HMM queries for each time step.
Hidden Markov Models are used to describe time-series or spatial-series data, i.e., situations that change dynamically over space or time, such as speech recognition. To model such situations, we look at the state at each point in time. Another important aspect of modeling the world is that not all states are observable. Rather, we may only be able to observe some proxy for the state, known as evidence. For example, if you are tracking the movement of an object on a grid, you may only be able to observe the general region it is located in.
\(X_t\) denotes the set of state variables at time \(t\), which are not observable, or hidden. \(E_t\) likewise denotes the set of observable evidence variables at time \(t\).
We define HMMs with three components:
\(P(X_0)\): Initial Distribution
\(P(X_t | X_{t - 1})\): Transition (Dynamics) Model
\(P(E_t | X_t)\): Sensor (Observation) Model
Consistent with Bayes Nets, we can represent the joint probability of an HMM as: \[P(X_0, E_1, X_1, ..., E_t, X_t) = P(X_0)\prod_t P(X_t | X_{t-1})P(E_t | X_t)\] Consider the example discussed in lecture. You want to know whether or not it rains each day. However, being quarantined in a windowless home, your only information about the outside world is whether your friend comes to your home with an umbrella. In this case, the time interval \(t\) is each day, \(X_t\) is the set of variables denoting whether it rains on day \(t\), and \(E_t\) is the set of variables denoting whether you see an umbrella on day \(t\).
HMMs specifically describe the process with a single discrete random variable per time step. If you happen to have more than one state variable, for instance \(Rain_t\) and \(Sunny_t\), to stay consistent with the HMM framework we combine them into a single variable whose value is a tuple. Although this is out of scope for this class, the reason for this specific structure is that it enables a simple matrix implementation of the inference tasks.
The following section is a brief overview of inference tasks that could be done on HMMs. More mathematical details about the filtering and smoothing algorithms will be discussed in the next section.
The goal of filtering is to find the belief state given all evidence to date: \(P(X_t | e_{1:t})\). In other words, we want to know what the current state is, given all of the current and past evidence. In the following picture, we query \(P(X_4 | e_{1:4})\). The belief state is \(X_4\).
The goal of prediction is to learn about some future state, given all of the current and past evidence: \(P(X_{t+k} | e_{1:t})\), where \(k > 0\).
The goal of smoothing is to calculate a posterior distribution over the past state, given all of the current and past evidence: \(P(X_k | e_{1:t})\), where \(1 \leq k < t\). The reason for doing smoothing is that we now can make much better estimates of the past states with more evidence after the fact, which plays an important role for learning tasks.
We might also want to make inferences about the whole sequence instead of a single state. Given all the observations up to time \(t\), we might want to find the sequence of states most likely to have produced them. To do so, we compute the most likely explanation: \(\operatorname{argmax}_{x_{1:t}} P(x_{1:t} | e_{1:t})\). One application of this inference task is speech recognition.
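The most likely explanation query is typically answered with the Viterbi algorithm, which replaces the forward algorithm's sum over previous states with a max and keeps back-pointers. Below is a minimal sketch, assuming (purely for illustration) that the models are stored as nested dicts `initial[x]`, `transition[xp][x]`, and `sensor[x][e]`:

```python
def viterbi(initial, transition, sensor, evidence):
    """Return the most likely state sequence x_{1:t} given evidence e_{1:t}."""
    states = list(initial)
    # Base case: m[x] = P(X_1 = x, e_1), with X_0 summed out
    # (X_0 appears in only one factor, so summing it first is exact).
    m = {x: sensor[x][evidence[0]] *
            sum(initial[x0] * transition[x0][x] for x0 in states)
         for x in states}
    back = []  # back[i][x] = best predecessor of state x at step i + 2
    for e in evidence[1:]:
        # For each state, record the predecessor that maximizes the product.
        ptr = {x: max(states, key=lambda xp: m[xp] * transition[xp][x])
               for x in states}
        m = {x: m[ptr[x]] * transition[ptr[x]][x] * sensor[x][e]
             for x in states}
        back.append(ptr)
    # Recover the sequence by following back-pointers from the best final state.
    x = max(states, key=lambda s: m[s])
    seq = [x]
    for ptr in reversed(back):
        x = ptr[x]
        seq.append(x)
    return list(reversed(seq))
```

The nested-dict representation is just one convenient choice; in practice the models are often matrices.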
The forward algorithm is designed to answer filtering queries. It is based on the following expansion of the term \(P(X_t | e_{1:t})\): \[\begin{align*}
P(X_t | e_{1:t}) &= P(X_t | e_t, e_{1:t-1}) && \text{expanding the shorthand } e_{1:t} \\
&= \alpha P(X_t, e_t | e_{1:t-1}) && \text{definition of conditional probability} \\
&= \alpha \sum_{x_{t-1}} P(X_t, e_t, x_{t-1} | e_{1:t-1}) && \text{summing over } x_{t-1} \\
&= \alpha \sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1})P(X_t | x_{t-1}, e_{1:t-1})P(e_t | X_t, x_{t-1}, e_{1:t-1}) && \text{chain rule} \\
&= \alpha \sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1})P(X_t | x_{t-1})P(e_t | X_t) && \text{conditional independence from the BN} \\
&= \alpha P(e_t | X_t) \sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1})P(X_t | x_{t-1})
\end{align*}\] Notice that the term \(P(x_{t-1} | e_{1:t-1})\) appears in the expression for \(P(X_t | e_{1:t})\). This suggests a recursive method of answering filtering queries, where we "unroll" a query as a combination of smaller versions of the same query. The cost of this procedure is \(O(|X|^2)\), where \(|X|\) is the number of states (you can verify this by solving a recurrence relation if interested). Quadratic time is prohibitive for problems with large state spaces, which leads us to the approximate methods we will discuss later.
Also note that filtering is essentially a two-step process: prediction and update. Notice that \(\sum_{x_{t - 1}} P(x_{t - 1} | e_{1:t-1})P(X_t | x_{t - 1}) = P(X_t | e_{1:t-1})\) represents the one-step prediction of the next state based on the current distribution at \(t-1\). This prediction is then multiplied by \(P(e_t | X_t)\) to update with the new evidence \(e_t\) just observed.
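The prediction and update steps above can be sketched as a single filtering function. This is one possible implementation, assuming (hypothetically) that the transition and sensor models are stored as nested dicts:

```python
def forward_step(belief, evidence, transition, sensor):
    """One filtering update: P(X_{t-1} | e_{1:t-1}) -> P(X_t | e_{1:t}).

    belief[x]         = P(X_{t-1} = x | e_{1:t-1})
    transition[xp][x] = P(X_t = x | X_{t-1} = xp)
    sensor[x][e]      = P(E_t = e | X_t = x)
    """
    states = list(belief)
    # Prediction: P(X_t | e_{1:t-1}) = sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1}) P(X_t | x_{t-1})
    predicted = {x: sum(belief[xp] * transition[xp][x] for xp in states)
                 for x in states}
    # Update: multiply by P(e_t | X_t), then normalize (this implements alpha).
    unnorm = {x: sensor[x][evidence] * predicted[x] for x in states}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}
```

Repeatedly calling `forward_step` with each new observation unrolls the recursion described above.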
Example
For a more concrete example, consider the rain and umbrella example discussed in 2.2. Initially, we assume the uniform distribution. In other words, \(P(X_0 = 0) = 0.5\) and \(P(X_0 = 1) = 0.5\). Also, consider the following transition and sensor models: \[\begin{array}{|c|c|c|}
\hline
X_t & X_{t-1} & P(X_t | X_{t-1}) \\ \hline
1 & 0 & 0.3 \\ \hline
1 & 1 & 0.7 \\ \hline
0 & 0 & 0.7 \\ \hline
0 & 1 & 0.3 \\ \hline
\end{array}
\qquad
\begin{array}{|c|c|c|}
\hline
X_t & E_{t} & P(E_t | X_t) \\ \hline
1 & 0 & 0.1 \\ \hline
1 & 1 & 0.9 \\ \hline
0 & 0 & 0.8 \\ \hline
0 & 1 & 0.2 \\ \hline
\end{array}\]
On day 1, we observe that your friend brought an umbrella, i.e., \(E_1 = 1\). First, we perform prediction: \[P(X_1=1) = \sum_{x_0} P(X_1 = 1 | x_0)P(x_0) = 0.3 \cdot 0.5 + 0.7 \cdot 0.5 = 0.5\] (Intuitively, what we’re doing here is considering all possibilities of \(x_0\), whether it rained on day 0, and computing the likelihood of it raining on day 1 using the transition model.)
Then, we can update this prediction with the new observation: \[P(X_1 = 1 | E_1 = 1) = \alpha P(E_1=1 | X_1=1)P(X_1=1) = \alpha \cdot 0.9 \cdot 0.5 = 0.45\alpha\] We can get the final posterior distribution for \(X_1\) by repeating this for \(P(X_1 = 0 | E_1 = 1)\) and normalizing to resolve \(\alpha\).
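The arithmetic above can be checked in a few lines of Python; the nested-dict encoding of the tables is just one convenient choice made here for illustration:

```python
# Checking the day-1 computation from the example.
p_x0 = {0: 0.5, 1: 0.5}                                # uniform prior P(X_0)
p_trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}   # p_trans[x0][x1] = P(X_1 = x1 | X_0 = x0)
p_sensor = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}  # p_sensor[x1][e] = P(E_1 = e | X_1 = x1)

# Prediction: P(X_1) = sum_{x_0} P(X_1 | x_0) P(x_0)
pred = {x1: sum(p_trans[x0][x1] * p_x0[x0] for x0 in (0, 1)) for x1 in (0, 1)}

# Update with E_1 = 1, before normalization: alpha * P(E_1 = 1 | X_1) P(X_1)
unnorm = {x1: p_sensor[x1][1] * pred[x1] for x1 in (0, 1)}

# Normalizing resolves alpha: P(X_1 = 1 | E_1 = 1) = 0.45 / 0.55
posterior = {x1: p / sum(unnorm.values()) for x1, p in unnorm.items()}
```

Normalizing gives \(P(X_1 = 1 | E_1 = 1) = 0.45 / 0.55 \approx 0.818\).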
Having calculated these forward queries for filtering, we can use these results to retrieve smoothing query results as well. This is the "backwards" part of the algorithm. Recall that smoothing queries take the form \(P(X_k | e_{1:t})\), where \(1 \leq k < t\).
We first observe that \(P(X_k | e_{1:t}) = \alpha P(e_{k + 1: t} | X_k) P(X_k | e_{1:k})\) (\(\alpha\) is not necessarily the same proportionality constant as is used in the forward pass). We derive this from the definition of conditional probability and the fact that \(e_{k + 1: t} \perp\!\!\!\perp e_{1:k} | X_k\). Notice that we have the second term from the forward algorithm. The first term is known as the "backwards message". \[\begin{align*} P(e_{k + 1: t} | X_k) = \sum_{x_{k + 1}}P(x_{k + 1} | X_k)P(e_{k + 1} | x_{k + 1}) P(e_{k+ 2: t}| x_{k + 1}) \end{align*}\] Where the first two terms in the sum can be obtained from the model itself, and the last term is recursive as in the forward algorithm.
As an exercise, try to derive the formula for the "backward message".
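For illustration, here is one way the backward message and its combination with the forward results might be sketched, again assuming nested-dict transition and sensor models (a representation chosen here for convenience, not prescribed by the notes):

```python
def backward_message(transition, sensor, future_evidence):
    """Compute b[x] = P(e_{k+1:t} | X_k = x) via the recursion above.

    future_evidence is the list [e_{k+1}, ..., e_t].
    """
    states = list(transition)
    b = {x: 1.0 for x in states}         # base case: no remaining evidence
    for e in reversed(future_evidence):  # fold in e_t, ..., e_{k+1} right-to-left
        b = {x: sum(transition[x][xn] * sensor[xn][e] * b[xn] for xn in states)
             for x in states}
    return b

def smooth(forward_belief, backward_msg):
    """P(X_k | e_{1:t}) = alpha * P(e_{k+1:t} | X_k) * P(X_k | e_{1:k})."""
    unnorm = {x: backward_msg[x] * forward_belief[x] for x in forward_belief}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}
```

For the umbrella model with \(t = 2\) and evidence \(e_1 = e_2 = 1\), combining the forward belief at \(k = 1\) with the backward message for \(e_2\) gives the smoothed estimate \(P(X_1 = 1 | e_{1:2}) \approx 0.883\), higher than the filtered estimate \(\approx 0.818\).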
In Particle Filtering, we use likelihood-weighted sampling to approximate queries rather than answer them exactly. The basic idea is as follows: we start with some initial distribution of samples (referred to as particles) and iteratively "move them around" according to the transition model to estimate the probability distribution of \(X_t\) at each timestep \(t\). To incorporate the observed evidence at each timestep, we also weight each sample using the sensor model. More concretely,
Our representation of \(P(X_t)\) is a list of \(N\) particles/samples. We approximate \(P(X_t = x)\) by \[\hat{P}(X_t=x) = \frac{\text{number of particles with value }x}{N}\]
Generally, \(N \ll |\text{domain}(X)|\), so many \(x\) will have \(\hat{P}(x) = 0\).
With more particles, we will have higher accuracy with respect to the actual values.
If \(N=1\), there is no need for weighting or the resample step below (unless the weight is 0).
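Estimating \(\hat{P}\) from a list of particles is just counting. A tiny sketch, where the grid-cell values are made up for illustration:

```python
from collections import Counter

# A hypothetical list of N = 8 particles over a much larger domain of grid cells.
particles = ["A3", "A3", "B1", "A3", "C2", "B1", "A3", "C2"]

counts = Counter(particles)
n = len(particles)
# P-hat(X_t = x) = (number of particles with value x) / N; unseen x implicitly get 0.
p_hat = {x: counts[x] / n for x in counts}
```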
Particle Filtering Algorithm: Starting with our belief of some distribution \(\hat{P}(X_t)\),
Propagate Forward
For each particle \(x_t\), draw a sample for \(X_{t+1}\) from \(P(X_{t+1} \mid x_t)\).
Observe (i.e., weight samples based on evidence)
Construct \(\hat{P}(X,e_{t+1})\). For each possible \(x\),
Weight \(w = P(e_{t+1} | x)\)
Compute \(\hat{P}(x, e_{t+1})\) by multiplying our current belief of \(X_{t+1}\)’s distribution (which doesn’t currently account for evidence) by \(w\): \(\hat{P}(x,e_{t+1}) = \hat{P}(x)P(e_{t+1} \mid x)\)
Normalize \(\hat{P}(X,e_{t+1})\) to get \(\hat{P}(X \mid e_{t+1}) = \frac{\hat{P}(X,e_{t+1})}{\sum_{x} \hat{P}(x, e_{t+1})}\). This is our updated belief which now incorporates observed evidence.
Resample
Resample \(N\) times from the new sample distribution \(\hat{P}(X \mid e_{t+1})\)
Note: We’re not picky on the \(P\) vs. \(\hat{P}\) notation - we just make the distinction here to explicitly show what’s a belief (i.e., our approximation) vs. a true value/probability distribution.
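Putting the three steps together, one iteration of particle filtering might be sketched as follows. Here `transition_sample` is a hypothetical helper, assumed to draw a successor state from the transition model, and the sensor model is again a nested dict:

```python
import random

def particle_filter_step(particles, evidence, transition_sample, sensor):
    """One propagate / observe / resample iteration over a list of particles.

    transition_sample(x) draws x_{t+1} ~ P(X_{t+1} | X_t = x)
    sensor[x][e]         = P(E_{t+1} = e | X_{t+1} = x)
    """
    # 1. Propagate forward: move each particle through the transition model.
    moved = [transition_sample(x) for x in particles]
    # 2. Observe: weight each particle by the likelihood of the evidence.
    weights = [sensor[x][evidence] for x in moved]
    # 3. Resample: draw N new particles in proportion to their weights
    #    (this also normalizes, so explicit division is unnecessary here).
    return random.choices(moved, weights=weights, k=len(particles))
```

Note that `random.choices` resamples with replacement, so particles with high weight tend to be duplicated while low-weight particles die out.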