Be able to work through multiple iterations of particle filtering.
Implement the Forward-Backward Algorithm for HMMs.
Implement particle filtering for a variety of Bayesian Networks.
Apply smoothing to HMM queries for each time step.
Hidden Markov Models are used to describe time-series or spatial-series data, i.e., situations that change dynamically over space or time, such as speech recognition. To model such situations, we look at the state at each point in time. Another important aspect of modeling the world is that not all states are observable. Rather, we may only be able to observe some proxy for the state, known as evidence. For example, if you are tracking the movement of an object on a grid, you may only be able to observe the general region it is located in.
\(X_t\) denotes the set of state variables at time \(t\), which are not observable, or hidden. \(E_t\) likewise denotes the set of observable evidence variables at time \(t\).
We define HMMs with three components:
\(P(X_0)\): Initial Distribution
\(P(X_t | X_{t - 1})\): Transition (Dynamics) Model
\(P(E_t | X_t)\): Sensor (Observation) Model
Consistent with Bayes Nets, we can represent the joint probability of an HMM as: \[P(X_0, E_1, X_1, ..., E_t, X_t) = P(X_0)\prod_t P(X_t | X_{t-1})P(E_t | X_t)\] Consider the example discussed in lecture. You want to know whether or not it rains each day. However, being quarantined in a windowless home, your only information about the outside world is whether your friend comes to your home with an umbrella. In this case, the time interval \(t\) is each day, \(X_t\) is the set of variables denoting whether it rains on day \(t\), and \(E_t\) is the set of variables denoting whether you see an umbrella on day \(t\).
HMMs specifically describe the process with a single discrete random variable per time step. If you happen to have more than one state variable, for instance \(Rain_t\) and \(Sunny_t\), to stay consistent with the HMM framework we combine them into a single variable whose value is a tuple. Although this is out of scope for this class, the reason for this specific structure is that it enables a simple matrix implementation of the inference tasks.
The following section is a brief overview of inference tasks that could be done on HMMs. More mathematical details about the filtering and smoothing algorithms will be discussed in the next section.
The goal of filtering is to find the belief state given all evidence to date: \(P(X_t | e_{1:t})\). In other words, we want to know what the current state is, given all of the current and past evidence. In the following picture, we query \(P(X_4 | e_{1:4})\). The belief state is \(X_4\).
The goal of prediction is to learn about some future state, given all of the current and past evidence: \(P(X_{t+k} | e_{1:t})\), where \(k > 0\).
The goal of smoothing is to calculate a posterior distribution over the past state, given all of the current and past evidence: \(P(X_k | e_{1:t})\), where \(1 \leq k < t\). The reason for doing smoothing is that we now can make much better estimates of the past states with more evidence after the fact, which plays an important role for learning tasks.
We might also want to make inferences about the whole sequence instead of a single state. Given all the observations up to time \(t\), we might want to find the sequence of states most likely to have produced them. To do so, we compute the most likely explanation: \(\operatorname{argmax}_{x_{1:t}} P(x_{1:t} | e_{1:t})\). One application of this inference task is speech recognition.
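The most likely explanation query is typically answered with the Viterbi algorithm, which replaces the forward algorithm's sum over previous states with a max and keeps back-pointers. Below is a minimal sketch, assuming (purely for illustration) that the models are stored as nested dicts `initial[x]`, `transition[xp][x]`, and `sensor[x][e]`:

```python
def viterbi(initial, transition, sensor, evidence):
    """Return the most likely state sequence x_{1:t} given evidence e_{1:t}."""
    states = list(initial)
    # Base case: m[x] = P(X_1 = x, e_1), with X_0 summed out
    # (X_0 appears in only one factor, so summing it first is exact).
    m = {x: sensor[x][evidence[0]] *
            sum(initial[x0] * transition[x0][x] for x0 in states)
         for x in states}
    back = []  # back[i][x] = best predecessor of state x at step i + 2
    for e in evidence[1:]:
        # For each state, record the predecessor that maximizes the product.
        ptr = {x: max(states, key=lambda xp: m[xp] * transition[xp][x])
               for x in states}
        m = {x: m[ptr[x]] * transition[ptr[x]][x] * sensor[x][e]
             for x in states}
        back.append(ptr)
    # Recover the sequence by following back-pointers from the best final state.
    x = max(states, key=lambda s: m[s])
    seq = [x]
    for ptr in reversed(back):
        x = ptr[x]
        seq.append(x)
    return list(reversed(seq))
```

The nested-dict representation is just one convenient choice; in practice the models are often matrices.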
The forward algorithm is designed to answer filtering queries. It is based on the following expansion of the term \(P(X_t | e_{1:t})\): \[\begin{align*}
P(X_t | e_{1:t}) &= P(X_t | e_t, e_{1:t-1}) && \text{expanding the shorthand } e_{1:t} \\
&= \alpha P(X_t, e_t | e_{1:t-1}) && \text{definition of conditional probability} \\
&= \alpha \sum_{x_{t-1}} P(X_t, e_t, x_{t-1} | e_{1:t-1}) && \text{summing over } x_{t-1} \\
&= \alpha \sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1})P(X_t | x_{t-1}, e_{1:t-1})P(e_t | X_t, x_{t-1}, e_{1:t-1}) && \text{chain rule} \\
&= \alpha \sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1})P(X_t | x_{t-1})P(e_t | X_t) && \text{conditional independence from the BN} \\
&= \alpha P(e_t | X_t) \sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1})P(X_t | x_{t-1})
\end{align*}\] Notice that the term \(P(x_{t-1} | e_{1:t-1})\) appears in the expression for \(P(X_t | e_{1:t})\). This suggests a recursive method of answering filtering queries, where we "unroll" a query as a combination of smaller versions of the same query. The cost of this procedure is \(O(|X|^2)\), where \(|X|\) is the number of states (you can verify this by solving a recurrence relation if interested). Quadratic time is prohibitive for problems with large state spaces, which leads us to the approximate methods we will discuss later.
Also note that filtering is essentially a two-step process: prediction and update. Notice that \(\sum_{x_{t - 1}} P(x_{t - 1} | e_{1:t-1})P(X_t | x_{t - 1}) = P(X_t | e_{1:t-1})\) represents the one-step prediction of the next state based on the current distribution at \(t-1\). This prediction is then multiplied by \(P(e_t | X_t)\) to update with the new evidence \(e_t\) just observed.
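The prediction and update steps above can be sketched as a single filtering function. This is one possible implementation, assuming (hypothetically) that the transition and sensor models are stored as nested dicts:

```python
def forward_step(belief, evidence, transition, sensor):
    """One filtering update: P(X_{t-1} | e_{1:t-1}) -> P(X_t | e_{1:t}).

    belief[x]         = P(X_{t-1} = x | e_{1:t-1})
    transition[xp][x] = P(X_t = x | X_{t-1} = xp)
    sensor[x][e]      = P(E_t = e | X_t = x)
    """
    states = list(belief)
    # Prediction: P(X_t | e_{1:t-1}) = sum_{x_{t-1}} P(x_{t-1} | e_{1:t-1}) P(X_t | x_{t-1})
    predicted = {x: sum(belief[xp] * transition[xp][x] for xp in states)
                 for x in states}
    # Update: multiply by P(e_t | X_t), then normalize (this implements alpha).
    unnorm = {x: sensor[x][evidence] * predicted[x] for x in states}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}
```

Repeatedly calling `forward_step` with each new observation unrolls the recursion described above.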
Example
For a more concrete example, consider the rain and umbrella example discussed in 2.2. Initially, we assume the uniform distribution. In other words, \(P(X_0 = 0) = 0.5\) and \(P(X_0 = 1) = 0.5\). Also, consider the following transition and sensor models: \[\begin{array}{|c|c|c|}
\hline
X_t & X_{t-1} & P(X_t | X_{t-1}) \\ \hline
1 & 0 & 0.3 \\ \hline
1 & 1 & 0.7 \\ \hline
0 & 0 & 0.7 \\ \hline
0 & 1 & 0.3 \\ \hline
\end{array}
\qquad
\begin{array}{|c|c|c|}
\hline
X_t & E_{t} & P(E_t | X_t) \\ \hline
1 & 0 & 0.1 \\ \hline
1 & 1 & 0.9 \\ \hline
0 & 0 & 0.8 \\ \hline
0 & 1 & 0.2 \\ \hline
\end{array}\]
On day 1, we observe that your friend brought an umbrella, i.e., \(E_1 = 1\). First, we perform prediction: \[P(X_1=1) = \sum_{x_0} P(X_1 = 1 | x_0)P(x_0) = 0.3 \cdot 0.5 + 0.7 \cdot 0.5 = 0.5\] (Intuitively, what we’re doing here is considering all possibilities of \(x_0\), whether it rained on day 0, and computing the likelihood of it raining on day 1 using the transition model.)
Then, we can update this prediction with the new observation: \[P(X_1 = 1 | E_1 = 1) = \alpha P(E_1=1 | X_1=1)P(X_1=1) = \alpha \cdot 0.9 \cdot 0.5 = 0.45\alpha\] We can get the final posterior distribution for \(X_1\) by repeating this for \(P(X_1 = 0 | E_1 = 1)\) and normalizing to resolve \(\alpha\).
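The arithmetic above can be checked in a few lines of Python; the nested-dict encoding of the tables is just one convenient choice made here for illustration:

```python
# Checking the day-1 computation from the example.
p_x0 = {0: 0.5, 1: 0.5}                                # uniform prior P(X_0)
p_trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}   # p_trans[x0][x1] = P(X_1 = x1 | X_0 = x0)
p_sensor = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}  # p_sensor[x1][e] = P(E_1 = e | X_1 = x1)

# Prediction: P(X_1) = sum_{x_0} P(X_1 | x_0) P(x_0)
pred = {x1: sum(p_trans[x0][x1] * p_x0[x0] for x0 in (0, 1)) for x1 in (0, 1)}

# Update with E_1 = 1, before normalization: alpha * P(E_1 = 1 | X_1) P(X_1)
unnorm = {x1: p_sensor[x1][1] * pred[x1] for x1 in (0, 1)}

# Normalizing resolves alpha: P(X_1 = 1 | E_1 = 1) = 0.45 / 0.55
posterior = {x1: p / sum(unnorm.values()) for x1, p in unnorm.items()}
```

Normalizing gives \(P(X_1 = 1 | E_1 = 1) = 0.45 / 0.55 \approx 0.818\).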
Having calculated these forward queries for filtering, we can use these results to retrieve smoothing query results as well. This is the "backwards" part of the algorithm. Recall that smoothing queries take the form \(P(X_k | e_{1:t})\), where \(1 \leq k < t\).
We first observe that \(P(X_k | e_{1:t}) = \alpha P(e_{k + 1: t} | X_k) P(X_k | e_{1:k})\) (\(\alpha\) is not necessarily the same proportionality constant as is used in the forward pass). We derive this from the definition of conditional probability and the fact that \(e_{k + 1: t} \perp\!\!\!\perp e_{1:k} | X_k\). Notice that we have the second term from the forward algorithm. The first term is known as the "backwards message". \[\begin{align*} P(e_{k + 1: t} | X_k) = \sum_{x_{k + 1}}P(x_{k + 1} | X_k)P(e_{k + 1} | x_{k + 1}) P(e_{k+ 2: t}| x_{k + 1}) \end{align*}\] Where the first two terms in the sum can be obtained from the model itself, and the last term is recursive as in the forward algorithm.
As an exercise, try to derive the formula for the "backward message".
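For illustration, here is one way the backward message and its combination with the forward results might be sketched, again assuming nested-dict transition and sensor models (a representation chosen here for convenience, not prescribed by the notes):

```python
def backward_message(transition, sensor, future_evidence):
    """Compute b[x] = P(e_{k+1:t} | X_k = x) via the recursion above.

    future_evidence is the list [e_{k+1}, ..., e_t].
    """
    states = list(transition)
    b = {x: 1.0 for x in states}         # base case: no remaining evidence
    for e in reversed(future_evidence):  # fold in e_t, ..., e_{k+1} right-to-left
        b = {x: sum(transition[x][xn] * sensor[xn][e] * b[xn] for xn in states)
             for x in states}
    return b

def smooth(forward_belief, backward_msg):
    """P(X_k | e_{1:t}) = alpha * P(e_{k+1:t} | X_k) * P(X_k | e_{1:k})."""
    unnorm = {x: backward_msg[x] * forward_belief[x] for x in forward_belief}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}
```

For the umbrella model with \(t = 2\) and evidence \(e_1 = e_2 = 1\), combining the forward belief at \(k = 1\) with the backward message for \(e_2\) gives the smoothed estimate \(P(X_1 = 1 | e_{1:2}) \approx 0.883\), higher than the filtered estimate \(\approx 0.818\).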
In Particle Filtering, we use likelihood-weighted sampling to approximate queries rather than answer them exactly. The basic idea is as follows: we start with some initial distribution of samples (referred to as particles) and iteratively "move them around" according to the transition model to estimate the probability distribution of \(X_t\) at each timestep \(t\). To incorporate the observed evidence at each timestep, we also weight each sample using the sensor model. More concretely,
Our representation of \(P(X_t)\) is a list of \(N\) particles/samples. We approximate \(P(X_t = x)\) by \[\hat{P}(X_t=x) = \frac{\text{number of particles with value }x}{N}\]
Generally, \(N \ll |\text{domain}(X)|\), so many \(x\) will have \(\hat{P}(x) = 0\).
With more particles, we will have higher accuracy with respect to the actual values.
If \(N=1\), there is no need for weighting or the resample step below (unless the weight is 0).
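Estimating \(\hat{P}\) from a list of particles is just counting. A tiny sketch, where the grid-cell values are made up for illustration:

```python
from collections import Counter

# A hypothetical list of N = 8 particles over a much larger domain of grid cells.
particles = ["A3", "A3", "B1", "A3", "C2", "B1", "A3", "C2"]

counts = Counter(particles)
n = len(particles)
# P-hat(X_t = x) = (number of particles with value x) / N; unseen x implicitly get 0.
p_hat = {x: counts[x] / n for x in counts}
```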
Particle Filtering Algorithm: Starting with our belief of some distribution \(\hat{P}(X_t)\),
Propagate Forward
For each particle \(x_t\), draw a sample for \(X_{t+1}\) from \(P(X_{t+1} \mid x_t)\).
Observe (i.e., weight samples based on evidence)
Construct \(\hat{P}(X,e_{t+1})\). For each possible \(x\),
Weight \(w = P(e_{t+1} | x)\)
Compute \(\hat{P}(x, e_{t+1})\) by multiplying our current belief of \(X_{t+1}\)’s distribution (which doesn’t currently account for evidence) by \(w\): \(\hat{P}(x,e_{t+1}) = \hat{P}(x)P(e_{t+1} \mid x)\)
Normalize \(\hat{P}(X,e_{t+1})\) to get \(\hat{P}(X \mid e_{t+1}) = \frac{\hat{P}(X,e_{t+1})}{\sum_{x} \hat{P}(x, e_{t+1})}\). This is our updated belief which now incorporates observed evidence.
Resample
Resample \(N\) times from the new sample distribution \(\hat{P}(X \mid e_{t+1})\)
Note: We’re not picky on the \(P\) vs. \(\hat{P}\) notation - we just make the distinction here to explicitly show what’s a belief (i.e., our approximation) vs. a true value/probability distribution.
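Putting the three steps together, one iteration of particle filtering might be sketched as follows. Here `transition_sample` is a hypothetical helper, assumed to draw a successor state from the transition model, and the sensor model is again a nested dict:

```python
import random

def particle_filter_step(particles, evidence, transition_sample, sensor):
    """One propagate / observe / resample iteration over a list of particles.

    transition_sample(x) draws x_{t+1} ~ P(X_{t+1} | X_t = x)
    sensor[x][e]         = P(E_{t+1} = e | X_{t+1} = x)
    """
    # 1. Propagate forward: move each particle through the transition model.
    moved = [transition_sample(x) for x in particles]
    # 2. Observe: weight each particle by the likelihood of the evidence.
    weights = [sensor[x][evidence] for x in moved]
    # 3. Resample: draw N new particles in proportion to their weights
    #    (this also normalizes, so explicit division is unnecessary here).
    return random.choices(moved, weights=weights, k=len(particles))
```

Note that `random.choices` resamples with replacement, so particles with high weight tend to be duplicated while low-weight particles die out.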