\( \def\v{\mathbf{v}} \def\w{\mathbf{w}} \def\x{\mathbf{x}} \def\D{\mathbf{D}} \def\V{\mathbf{V}} \def\S{\mathbf{S}} \def\F{\mathcal F} \def\bold#1{\bf #1} \)


MLSP Fall 2016: Homework 3
Expectation Maximization

Part I: EM and Shift-Invariant Models

In this problem we will consider shift-invariant mixtures of multi-variate multinomial distributions.

Consider data that have multiple discrete attributes. "Discrete" attributes are attributes that can take only one of a countable set of values. We will consider discrete attributes of a particular kind -- integers that have not only a natural rank ordering, but also a definite notion of distance.

Let $(X,Y)$ be the pair of discrete attributes defining any data instance. Since both $X$ and $Y$ are discrete, the probability distribution of $(X,Y)$ is a bi-variate multinomial.

We describe $(X,Y)$ as the outcome of generation by the following process:

The process has at its disposal several urns. Each urn has two sub-urns inside it. The first sub-urn represents a bi-variate multinomial: it contains balls, each marked with an $(X_1,Y_1)$ value. The second sub-urn represents a uni-variate multinomial: it contains balls, each marked with an $X_2$ value.

In the following explanation we will use the notation $P_x(X)$ to denote the probability that the random variable $x$ takes the value $X$.

We represent the content of the first sub-urn within each urn as the random variable pair $(x_1, y_1)$. The second sub-urn generates the random variable $x_2$.

Drawing procedure: at each draw, the drawing process performs the following operations:

1. Select an urn $z$ according to $P_z(Z)$.
2. From the first sub-urn of the selected urn, draw a ball and read off its $(X_1, Y_1)$ value; this draw is distributed as $P_{x_1,y_1}(X_1,Y_1|Z)$.
3. From the second sub-urn of the selected urn, draw a ball and read off its $X_2$ value; this draw is distributed as $P_{x_2}(X_2|Z)$.

Thus, the final observation is:

$(X,Y) = (X_1 + X_2, Y_1)$.
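To make the generative process concrete, here is a minimal sampling sketch in Python. All parameter values (two urns, $X_1 \in \{0,1,2\}$, $Y_1 \in \{0,1\}$, $X_2 \in \{0,1\}$) are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (not from the assignment): 2 urns,
# X1 in {0,1,2}, Y1 in {0,1}, X2 in {0,1}.
P_z = np.array([0.6, 0.4])                                    # P_z(Z)
P_x1y1 = rng.dirichlet(np.ones(6), size=2).reshape(2, 3, 2)   # P(X1,Y1|Z)
P_x2 = np.array([[0.7, 0.3], [0.2, 0.8]])                     # P(X2|Z)

def draw():
    """One draw from the shift-invariant mixture: pick an urn z,
    draw (X1, Y1) and X2 from its two sub-urns, observe (X1+X2, Y1)."""
    z = rng.choice(2, p=P_z)
    flat = rng.choice(6, p=P_x1y1[z].ravel())
    x1, y1 = divmod(flat, 2)
    x2 = rng.choice(2, p=P_x2[z])
    return x1 + x2, y1

samples = [draw() for _ in range(5)]
```

Note that only the sum $X_1 + X_2$ is observed, never $X_1$ and $X_2$ individually; this is what makes EM necessary in the problems below.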

Representing the output random variable as $(x,y)$, the probability that it takes a value $(X,Y)$ is given by $P_{x,y}(X,Y)$.

Problem 1.1

Give the expression for $P_{x,y}(X,Y)$ in terms of $P_z(Z)$, $P_{x_1,y_1}(X_1,Y_1|Z)$ and $P_{x_2}(X_2|Z)$.

Problem 1.2

You are given a histogram of counts $H(X,Y)$ obtained from a large number of observations. $H(X,Y)$ represents the number of times $(X,Y)$ was observed. Give the EM update rules to estimate $P_z(Z)$, $P_{x_1,y_1}(X_1,Y_1|Z)$ and $P_{x_2}(X_2|Z)$.

Problem 1.3

In this problem we will try to deblur a picture that has become blurry due to a slight left-to-right shake of the camera. You can download the actual picture from this link:

We model the picture as a histogram (the value of any pixel at a position $(X,Y)$, which ranges from 0-255, is viewed as the count of ``light elements'' at that position). We model this distribution as a shift-invariant mixture of one component (i.e. one large urn).

Assuming a very slight, strictly horizontal 20-pixel shake, we model that within the $X_2$ sub-urn $X_2$ can take integer values 0-19 (i.e. 20 values wide). The $X_1$ value in the $(X_1,Y_1)$ sub-urn can range from 0 to (width-of-picture - 20). $Y_1$ can take values in the range 0 to (height-of-picture - 1).
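Under this one-component model, the observed (normalized) picture is the sharp-image distribution convolved along $X$ with the shake kernel. A sketch of this forward model, with array names of our own choosing and a toy point-source image for illustration:

```python
import numpy as np

def forward_blur(P_x1y1, P_x2):
    """Forward model: P(X, Y) = sum_{X2} P_x2(X2) * P_x1y1(X - X2, Y).
    P_x1y1 has shape (W - K + 1, H); P_x2 has length K (here K = 20).
    Returns the blurred distribution of shape (W, H)."""
    W1, H = P_x1y1.shape
    K = len(P_x2)
    P_xy = np.zeros((W1 + K - 1, H))
    for x2 in range(K):
        # Each shake offset x2 shifts the whole sharp image right by x2.
        P_xy[x2:x2 + W1] += P_x2[x2] * P_x1y1
    return P_xy

# Toy check: a single point of light spread by a uniform 20-wide shake.
P_sharp = np.zeros((50, 10)); P_sharp[25, 5] = 1.0
P_shake = np.ones(20) / 20
blurred = forward_blur(P_sharp, P_shake)
```

The EM algorithm from Problem 1.2 inverts exactly this operation: given only `blurred` (the histogram $H(X,Y)$, after normalization), it recovers estimates of both factors.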

Estimate and plot $P_{x_2}(X_2)$ and $P_{x_1,y_1}(X_1, Y_1)$. You will need the solution to Problem 1.2 for this problem. If the solution to Problem 1.2 is incorrect, the solution to Problem 1.3 will not be considered or given any points.


Part II: Predicting the Election

In this problem we will track a number of opinion polls and try to estimate the true support for the candidates in a recent election.

The election is between four candidates. Public sentiment about the candidates fluctuates all the time. A number of opinion polls try to gauge public sentiment. However, since opinion polls are fundamentally noisy procedures (affected by factors such as the specific subset of people they poll, or the number of samples in their poll), each of them can be viewed as a noisy measurement of the true public sentiment. We will try to obtain a better estimate of the true sentiment, as well as the uncertainty of the estimate (which a pollster could use to establish a margin of error).

We will model the polls as the output of a linear Gaussian process as follows:
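A standard linear Gaussian state-space model (our assumption; the matrices below are placeholders, and the actual parameters are defined by the data provided for the problem) takes the form:

$$ S_t = A\,S_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, E) $$

$$ O_t = B\,S_t + \nu_t, \qquad \nu_t \sim \mathcal{N}(0, D) $$

Here $S_t$ is the true (hidden) public sentiment at time $t$, $O_t$ is the poll measurement, $\epsilon_t$ is the Gaussian innovation, and $\nu_t$ is the Gaussian measurement noise.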

Our objective is to use the measurements $O_{0:t}$ (i.e. all measurements from time 0 to $t$) to estimate the true sentiment $S_t$ at time $t$.

Problem 2.1

Write out the Kalman filtering equations to estimate $S_t$ at each time $t$.

Problem 2.2

Implement the Kalman filter (you must submit the code). Run it on the provided data series (which comprises only the sequence of observations $O_0,\cdots,O_T$) and predict the true $S_t$ at each $t$. Plot the estimated $S_t$ as a function of time (this will be a single plot with 3 curves). Submit both the plot and the estimated state at every time. Also submit the final state uncertainty (i.e. the covariance matrix of the state).
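For reference, the generic Kalman prediction/update recursions can be sketched as follows. The transition matrix `A`, observation matrix `B`, and covariances `E` and `D` are placeholders (the actual values come from the data description), and the synthetic random-walk data at the bottom is purely illustrative:

```python
import numpy as np

def kalman_filter(O, A, B, E, D, s0, P0):
    """Standard Kalman recursions. O: (T, m) observations; A: state
    transition; B: observation matrix; E: innovation covariance;
    D: observation-noise covariance; s0, P0: initial state mean and
    covariance. Returns filtered state means and final covariance."""
    s, P = s0, P0
    means = []
    for o in O:
        # Predict: propagate the state estimate through the dynamics.
        s_pred = A @ s
        P_pred = A @ P @ A.T + E
        # Update: correct with the new observation via the Kalman gain.
        K = P_pred @ B.T @ np.linalg.inv(B @ P_pred @ B.T + D)
        s = s_pred + K @ (o - B @ s_pred)
        P = (np.eye(len(s)) - K @ B) @ P_pred
        means.append(s)
    return np.array(means), P

# Illustrative run on synthetic data: random-walk state, identity observation.
rng = np.random.default_rng(1)
T, n = 50, 3
A = B = np.eye(n)
E = 0.01 * np.eye(n); D = 0.1 * np.eye(n)
truth = np.cumsum(rng.normal(0, 0.1, (T, n)), axis=0)
O = truth + rng.normal(0, 0.3, (T, n))
est, P_final = kalman_filter(O, A, B, E, D, np.zeros(n), np.eye(n))
```

`P_final` here is the final state uncertainty the problem asks you to report.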

Problem 2.3

You won't be scored on this, but compare the final estimate (at the final instant) with the true voting percentages in the 2016 presidential election. You can get these from the RealClearPolitics webpage on 25th November or later for the final count (or close to it).

Data for the problem

The data here includes the following:

N.B.: Please remember that this is only a homework problem and may not in any way be indicative of reality. Our model is unrealistic -- it's unlikely that either the noise or the innovation is Gaussian. We're also not explicitly handling other factors that affect the polling, or the constraint that the samples are strictly non-negative (you can't have a negative percent of the population voting for anyone). Various other factors are being ignored (although, in principle, all of these could be included in the model). Nonetheless, we believe the computational exercise itself is interesting and should tell you something of the power of MLSP techniques.


Due date

The assignment is due on 30 Nov 2016. The solutions must be emailed to Bhiksha, Chiyu and Anurag. Please use the format given here for your submissions.