Lecture 5: Parameter Estimation in Fully Observed Bayesian Networks

Introduction to the problem of Parameter Estimation in fully observed Bayesian Networks

Introduction

Last class, we presented two major types of graphical-model tasks, inference and learning, and focused mainly on inference. In this lecture, we introduce the learning task and explore learning techniques for fully observed Bayesian networks.

Learning Graphical Models

Before we get started, let us build some intuition about what learning is and what goal it is trying to achieve.

The goal

The goal: Given a set of independent samples (assignments of random variables), find the best (or the most likely) Bayesian Network (both DAG and CPDs).

As shown in the figure below, we are given a set of independent samples (assignments of binary random variables). Assuming the model is a DAG, learning the directed links (causal relationships) between nodes is called structure learning, while learning the conditional probability distributions is called parameter learning.


Overview

As listed below, there’re several learning scenarios we are interested in:

  1. Completely observed GMs:
    • 1.1. directed
    • 1.2. undirected
  2. Partially or unobserved GMs:
    • 2.1. directed
    • 2.2. undirected (an open research topic)

Some useful estimation principles are also listed here:

  1. Maximum likelihood estimation (MLE)
  2. Bayesian estimation
  3. Maximum conditional likelihood
  4. Maximum “Margin”
  5. Maximum entropy

We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.

Parameter Learning

In this section, we are going to introduce Parameter Learning, and some useful related models.

When doing parameter learning, we assume that the graphical model G itself is known and fixed; G could come either from expert design or as an intermediate outcome of iterative structure learning.

The goal of parameter learning is to estimate parameters from a dataset of N independent, identically distributed (i.i.d.) training samples D = \{x_1, \cdots, x_N\}.

In general, each training case \mathbf{x}_n = (x_{n,1}, \cdots, x_{n, M}) is a vector of M values, one per node. Depending on the model, the elements of \mathbf{x}_n could be all known (no missing values, no hidden variables) if the model is completely observable, or partially known (\exists i such that x_{n,i} is not observed) if the model is partially observable.

In this class, we mainly consider learning parameters for a BN that is completely observable with a given structure.

Exponential Family Distributions

There are various density estimation tasks that can be viewed as single-node GMs (see the supplementary section for examples), and they are in fact instances of Exponential Family Distributions.

Exponential Family Distributions are building blocks for general GMs, and have nice properties allowing us to easily do MLE and Bayesian estimation.

Definition

For a numeric random variable X,

\displaystyle p_X(x|\eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\}=\frac{1}{Z(\eta)}h(x)\exp\{\eta^T T(x)\}

The function T(x) is a sufficient statistic (explained in the sufficiency section below).

The function A(\eta) = \log{Z(\eta)} is the log normalizer.
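To make the definition concrete, here is a minimal Python sketch (not part of the lecture) that writes the Bernoulli distribution in exponential-family form, with \eta = \log(p/(1-p)), T(x) = x, A(\eta) = \log(1+e^\eta), and h(x) = 1, and checks it against the direct density:

```python
import math

def bernoulli_direct(x, p):
    # Direct density: p^x (1-p)^(1-x)
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    # Exponential-family form: h(x) exp{eta * T(x) - A(eta)}
    eta = math.log(p / (1 - p))        # natural parameter
    T = x                              # sufficient statistic
    A = math.log(1 + math.exp(eta))    # log normalizer, A = -log(1-p)
    h = 1.0                            # base measure
    return h * math.exp(eta * T - A)

for p in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_direct(x, p) - bernoulli_expfam(x, p)) < 1e-12
```

Expanding A(\eta) = \log(1 + e^\eta) = -\log(1-p) recovers the factor (1-p) of the direct density, which is why the two forms agree exactly.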

Examples

Many probability distributions in fact belong to this family, including the Bernoulli, multinomial, Gaussian, Poisson, and gamma distributions.

Example 1: Multivariate Gaussian Distribution

For a continuous vector random variable X \in R^k:

\begin{aligned} p_X(x|\mu,\Sigma) &= \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right\} \\ &= \frac{1}{(2\pi)^{k/2}} \exp\left\{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}xx^T)+\mu^T\Sigma^{-1}x-\frac{1}{2}\mu^T\Sigma^{-1}\mu-\frac{1}{2}\log{|\Sigma|}\right\} \end{aligned}

Exponential family representation:

\begin{aligned} \eta &= [\Sigma^{-1}\mu;-\frac{1}{2}\mathrm{vec}(\Sigma^{-1})] = [\eta_1;\mathrm{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu, \ \eta_2 = -\frac{1}{2}\Sigma^{-1} \\ T(x) &= [x;\mathrm{vec}(xx^T)] \\ A(\eta) &= \frac{1}{2}\mu^T\Sigma^{-1}\mu+\frac{1}{2}\log{|\Sigma|}=-\frac{1}{4}\eta_1^T\eta_2^{-1}\eta_1-\frac{1}{2}\log{|-2\eta_2|} \\ h(x) &= (2\pi)^{-k/2} \end{aligned}

After this conversion, we see that the Multivariate Gaussian Distribution indeed belongs to exponential family.

Note that a k-dimensional multivariate Gaussian distribution has a (k + k^2)-dimensional natural parameter (and sufficient statistic). However, since \Sigma must be symmetric and PSD, the parameters actually have fewer degrees of freedom.

Example 2: Multinomial Distribution

For a binary vector random variable x\sim multinomial(x \vert \pi),

\displaystyle p(x|\pi) = \pi_1^{x_1}\pi_2^{x_2} \cdots \pi_K^{x_K}= \exp\{\sum_k x_k\ln\pi_k\}
\displaystyle =\exp\{\sum_{k=1}^{K-1}x_k\ln\pi_k+(1-\sum_{k=1}^{K-1}x_k)\ln(1-\sum_{k=1}^{K-1}\pi_k)\}
\displaystyle =\exp\left\{\sum_{k=1}^{K-1}x_k\ln\left(\frac{\pi_k}{1-\sum_{k=1}^{K-1}\pi_k}\right)+\ln(1-\sum_{k=1}^{K-1}\pi_k)\right\}

Exponential family representation:

\displaystyle \eta = [\ln(\pi_k/\pi_K);0]
\displaystyle T(x) = [x]
\displaystyle A(\eta) = -\ln(1-\sum_{k=1}^{K-1}\pi_k) = \ln(\sum_{k=1}^{K}e^{\eta_k})
\displaystyle h(x) = 1

Properties of exponential family

Moment generating property

The q^{th} derivative of the log normalizer A(\eta) gives the q^{th} centered moment.

\displaystyle \frac{dA(\eta)}{d\eta} = \text{mean}
\displaystyle \frac{d^2A(\eta)}{d\eta^2} = \text{variance}
\displaystyle \cdots

So we can take advantage of this to compute the moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta).

Note: when the sufficient statistic is a stacked vector, partial derivatives need to be considered.

Moment vs. canonical parameters

Applying the moment generating property, the moment parameter \mu can be derived from the natural (canonical) parameter by:

\displaystyle \frac{dA(\eta)}{d\eta} = \mu

Note that A(\eta) is convex since

\displaystyle \frac{d^2A(\eta)}{d\eta^2} = Var[T(x)]>0

Thus we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

\displaystyle \eta = \psi(\mu)

So a distribution in the exponential family can be parameterized not only by \eta (the canonical parameterization) but also by \mu (the moment parameterization).

MLE for Exponential Family

For i.i.d. data, we have the log-likelihood:

\displaystyle \ell(\eta;D) = \log \prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\}
\displaystyle =\sum_n\log h(x_n) + \left(\eta^T\sum_n T(x_n)\right)-NA(\eta)

Take the derivatives and set to zero:

\displaystyle \frac{\partial\ell}{\partial\eta} = \sum_n T(x_n)-N\frac{\partial A(\eta)}{\partial\eta} = 0

and

\displaystyle \frac{\partial A(\eta)}{\partial\eta}=\frac{1}{N}\sum_n T(x_n)
\displaystyle \tilde{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)

This amounts to moment matching. We can then infer the canonical parameters via \tilde{\eta}_{MLE}=\psi(\tilde{\mu}_{MLE}).
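A small numerical sketch of moment matching (illustrative; the sampler and parameter values are assumptions, not from the lecture): for the Poisson, T(x) = x, so the MLE is just the sample mean, and \eta_{MLE} = \psi(\mu_{MLE}) = \log\mu_{MLE}:

```python
import math
import random

random.seed(0)
lam_true = 4.0

def sample_poisson(lam):
    # Simple inversion (Knuth) Poisson sampler; a sketch, not optimized
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

data = [sample_poisson(lam_true) for _ in range(20000)]

# Moment matching: mu_MLE = (1/N) sum_n T(x_n) = sample mean
mu_mle = sum(data) / len(data)
# Canonical parameter via the inverse mapping eta = psi(mu) = log(mu)
eta_mle = math.log(mu_mle)

assert abs(mu_mle - lam_true) < 0.1  # close to the true mean lambda
```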

Sufficiency

We can see from the above that for p(x|\theta), T(x) is sufficient for \theta if there is no information in X regarding \theta beyond that in T(x). We can then throw away X for the purpose of inference w.r.t. \theta. This can be expressed in both the Bayesian and the frequentist views.

Bayesian view:

\displaystyle p(\theta|T(x),x)=p(\theta|T(x))

Frequentist view:

\displaystyle p(x|T(x),\theta)=p(x|T(x))

The Neyman factorization theorem:

T(x) is sufficient for \theta if

\displaystyle p(x,T(x),\theta) = \psi_1(T(x),\theta)\psi_2(x,T(x))
\displaystyle p(x|\theta) = g(T(x),\theta)h(x,T(x))

Examples

Gaussian:

\displaystyle \eta = [\Sigma^{-1}\mu;-\frac{1}{2}\mathrm{vec}(\Sigma^{-1})] = [\eta_1;\mathrm{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu, \ \eta_2 = -\frac{1}{2}\Sigma^{-1}
\displaystyle T(x) = [x;\mathrm{vec}(xx^T)]
\displaystyle A(\eta)=\frac{1}{2}\mu^T\Sigma^{-1}\mu+\frac{1}{2}\log{|\Sigma|}=-\frac{1}{4}\eta_1^T\eta_2^{-1}\eta_1-\frac{1}{2}\log{|-2\eta_2|}
\displaystyle h(x)=(2\pi)^{-k/2}
\displaystyle \mu_{MLE}=\frac{1}{N}\sum_n T_1(x_n)=\frac{1}{N}\sum_n x_n

Multinomial:

\displaystyle \eta = [\ln(\pi_k/\pi_K);0]
\displaystyle T(x) = [x]
\displaystyle A(\eta) = -\ln(1-\sum_{k=1}^{K-1}\pi_k) = \ln(\sum_{k=1}^{K}e^{\eta_k})
\displaystyle h(x) = 1
\displaystyle \mu_{MLE}=\frac{1}{N}\sum_n x_n

Poisson:

\displaystyle \eta = \log \lambda
\displaystyle T(x) = x
\displaystyle A(\eta) =\lambda = e^\eta
\displaystyle h(x)=\frac{1}{x!}
\displaystyle \mu_{MLE}=\frac{1}{N}\sum_n x_n

Generalized Linear Models

There are various conditional density estimation tasks that can be viewed as two-node GMs (see supplementary section for examples). Many are instances of Generalized Linear Models.

A generalized linear model (GLM) is a flexible generalization of ordinary linear regression that relates the linear model to the response variable via a link function, and allows error distributions other than the normal distribution. For example, both linear regression and logistic regression are unified under the GLM framework.


Commonality

We model the expectation of y: E_p[y] = \mu = f(\theta^T\mathbf{x})

The observed input \mathbf{x} is assumed to enter the model via a linear combination of its elements, \xi=\theta^T\mathbf{x}. The conditional mean \mu is represented as a function f(\xi) of \xi, where f is known as the response function (its inverse is the link function). The observed output y is assumed to follow an exponential family distribution with conditional mean \mu.

Example 1: Linear Regression

Let’s take a quick recap on Linear Regression.

Assume that the target variable and the inputs are related by:

\displaystyle y_i = \theta^T x_i + \epsilon_i

where \epsilon_i is an error term capturing unmodeled effects or random noise.

Now assume that \epsilon_i follows a Gaussian distribution N(0, \sigma^2); then

\displaystyle p(y_i|x_i;\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(y_i -\theta^T\mathbf x_i)^2}{2\sigma^2})

We can use the LMS algorithm, a gradient ascent/descent approach, to estimate the parameters.

Example 2: Logistic Regression (sigmoid classifier, perceptron, etc.)

The conditional distribution is a Bernoulli:

\displaystyle p(y|x) = \mu(x)^y (1-\mu(x))^{1-y}

where \mu is the logistic function

\displaystyle \mu(x) = \frac{1}{1+e^{-\theta^T x}}

We can use a brute-force gradient method as in linear regression. But we can also apply generic results by observing that p(y \vert x) is an exponential family distribution, i.e., a generalized linear model.
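A minimal gradient-ascent sketch for logistic regression (the synthetic data and learning rate are illustrative assumptions), using the log-likelihood gradient X^T(y-\mu):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known logistic model (an illustrative assumption)
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(2000, 2))
y = (rng.random(2000) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

# Gradient ascent on the log-likelihood: gradient = X^T (y - mu)
theta = np.zeros(2)
lr = 0.1 / len(y)
for _ in range(5000):
    mu = 1 / (1 + np.exp(-X @ theta))
    theta += lr * X.T @ (y - mu)

assert np.allclose(theta, theta_true, atol=0.3)  # recovers the generating parameters
```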

More examples: parameterizing graphical models

Markov random fields

\displaystyle p(\mathbf x) = \frac{1}{Z}\exp\{ -\sum_{c\in C} \phi_c(\mathbf x_c)\} = \frac{1}{Z}\exp\{-H(\mathbf x)\}
\displaystyle p(X)=\frac{1}{Z}\exp\{\sum_{i,j\in N_i}\theta_{ij}X_iX_j+\sum_i\theta_{i0}X_i\}

Restricted Boltzmann Machines

\displaystyle p(x,h|\theta) = \exp\{\sum_i \theta_i \phi_i(x_i)+\sum_j \theta_j \phi_j(h_j)+\sum_{i,j} \theta_{i,j} \phi_{i,j} (x_{i},h_j )-A(\theta)\}

Conditional Random Fields

\displaystyle p_\theta(y \vert x) = \frac{1}{Z(\theta,x)}\exp\{\sum_c\theta_cf_c(x,y_c)\}

MLE for GLIMs

Example canonical response functions

Log-likelihood

\displaystyle \ell = \sum \limits_n \log h(y_n)+\sum \limits_n(\theta^Tx_ny_n-A(\eta_n))

Derivative is

\displaystyle \frac{d\ell}{d \theta}=\sum_n\left(x_ny_n-\frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\right)=\sum_nx_n(y_n-\mu_n)=X^T(y-\mu)

which is a fixed point function since μ\mu is a function of θ\theta.

Iteratively Reweighted Least Squares (IRLS)

Recall that the Hessian matrix is H=-X^T WX and that \theta^*=(X^TX)^{-1} X^T y in least-mean-square optimization. Using the Newton-Raphson method with cost function \ell:

\displaystyle \theta^{t+1}=\theta^t-H^{-1}\nabla_\theta\ell=(X^TW^tX)^{-1}X^TW^tz^t

where the response is:

\displaystyle z^t=X\theta^t+(W^t)^{-1}(y-\mu^t)

Hence, this can be understood as solving an iteratively reweighted least squares problem:

\displaystyle \theta^{t+1}=\arg\min_\theta(z-X\theta)^T(z-X\theta)
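A runnable IRLS sketch for logistic regression (the data and iteration count are illustrative assumptions), where W^t = \mathrm{diag}(\mu^t(1-\mu^t)) and each Newton step solves a weighted least squares problem:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(3000, 3))
y = (rng.random(3000) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(3)
for _ in range(20):  # Newton-Raphson / IRLS iterations
    mu = 1 / (1 + np.exp(-X @ theta))
    W = mu * (1 - mu)               # diagonal of the weight matrix W^t
    z = X @ theta + (y - mu) / W    # working response z^t
    # Weighted least squares: theta = (X^T W X)^{-1} X^T W z
    theta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

assert np.allclose(theta, theta_true, atol=0.3)
```

IRLS typically converges in far fewer iterations than first-order gradient ascent, at the cost of one linear solve per step.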

Global and local parameter independence

Simple graphical models can be viewed as building blocks of complex graphical models. With the same idea, if we assume the parameters of each local conditional probability distribution to be globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

\displaystyle \ell(\theta,D)=\log p(D|\theta)=\sum_i\left(\sum_n \log p(x_{n,i}|\mathbf{x}_{n,\pi_i},\theta_i)\right)
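Because the likelihood decomposes, MLE reduces to counting within each node's local CPD. A sketch on a hypothetical two-node binary network x_1 \to x_2 (the network and its CPDs are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ground-truth CPDs for the BN x1 -> x2 (binary variables)
p_x1 = 0.7                      # P(x1 = 1)
p_x2_given = {0: 0.2, 1: 0.9}   # P(x2 = 1 | x1)

N = 50000
x1 = (rng.random(N) < p_x1).astype(int)
x2 = (rng.random(N) < np.vectorize(p_x2_given.get)(x1)).astype(int)

# MLE decomposes per node: estimate each CPD from its own counts
p1_hat = x1.mean()
p2_hat = {v: x2[x1 == v].mean() for v in (0, 1)}

assert abs(p1_hat - p_x1) < 0.02
assert abs(p2_hat[0] - 0.2) < 0.02 and abs(p2_hat[1] - 0.9) < 0.02
```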

Plate

A plate is a macro that allows subgraphs to be replicated. Conventionally, instead of drawing each repeated variable individually, a plate is used to group these variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate.

Rules for plates: repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. N), updating the plate index variable (e.g. n) as you go; duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

For example, in a directed acyclic network the likelihood decomposes as

\displaystyle p(\mathbf x|\theta)=\prod_{i=1}^np(x_i|\mathbf{x}_{\pi_i},\theta_i)=p(x_1|\theta_1)p(x_2|x_1,\theta_2)p(x_3|x_1,\theta_3)p(x_4|x_2,x_3,\theta_4)

This is exactly like learning four separate small BNs, each of which consists of a node and its parents, as shown by the following graph:

Global parameter independence: for every DAG model, p(\theta_m|G)=\prod_{i=1}^{M}p(\theta_i|G).

Local parameter independence: for every node, p(\theta_i|G)=\prod_{j=1}^{q_i}p(\theta_{x_i^k|\mathbf{x}_{\pi_i}^j}|G).

Global parameter sharing

Consider a network structure G over a set of variables \mathcal{X} = \{X_1, \cdots, X_n\}, parameterized by a set of parameters \theta. Each variable X_i is associated with a CPD P(X_i | \mathbf{U}_i,\theta). Now, rather than assuming that each such CPD has its own parameterization \theta_{X_i|\mathbf{U}_i}, we assume that we have a certain set of shared parameters that are used by multiple variables in the network. This is global parameter sharing. Assume \theta is partitioned into disjoint subsets \theta^1, \cdots, \theta^k, and to each subset we assign a disjoint set of variables \mathcal{V}_k\subset\mathcal{X}. For X_i\in\mathcal{V}_k, P(X_i|\mathbf{U}_i,\theta)=P(X_i|\mathbf{U}_i,\theta^k), and for X_i,X_j\in\mathcal{V}_k,

\displaystyle P(X_i|\mathbf{U}_i,\theta^k)=P(X_j|\mathbf{U}_j,\theta^k)

Supervised ML estimation

Hidden Markov Model (HMM)

We are given an HMM whose hidden states y_1, \cdots, y_N and observations x_1, \cdots, x_N are all known to us.

At training time, we can record the frequency of each transition from one hidden state to another, and of each emission from a hidden state to an observation.

Let A_{ij} denote a transition from hidden state i to hidden state j, and let B_{ik} denote a transition from hidden state i to observation k. We calculate our parameters by maximum likelihood estimation as follows:

\displaystyle a_{ij}^{ML} = \frac{\#\text{transitions from } i \text{ to } j}{\#\text{transitions from } i \text{ to any hidden state}}
\displaystyle b_{ik}^{ML} = \frac{\#\text{transitions from } i \text{ to } k}{\#\text{transitions from } i \text{ to any observation}}

If the observations are continuous, we can treat them as Gaussian and apply the corresponding learning rules for the Gaussian.
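A minimal counting sketch of these estimators (the toy state/observation sequence is an illustrative assumption, not the casino example from the slides):

```python
from collections import Counter

# Toy fully observed HMM data: hidden states y and observations x
y = ['F', 'F', 'L', 'L', 'L', 'F', 'F', 'F', 'L', 'F']
x = [ 1 ,  3 ,  6 ,  6 ,  2 ,  4 ,  1 ,  6 ,  6 ,  5 ]

trans = Counter(zip(y, y[1:]))   # counts of hidden transitions i -> j
emit = Counter(zip(y, x))        # counts of state i emitting symbol k
n_from = Counter(y[:-1])         # transitions leaving state i
n_state = Counter(y)             # emissions from state i

a = {(i, j): c / n_from[i] for (i, j), c in trans.items()}
b = {(i, k): c / n_state[i] for (i, k), c in emit.items()}

# Each estimated row is a proper conditional distribution
assert abs(sum(v for (i, _), v in a.items() if i == 'F') - 1.0) < 1e-12
assert abs(sum(v for (i, _), v in b.items() if i == 'L') - 1.0) < 1e-12
```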

Pros

This method gives us the parameters that best fit, i.e. maximize the likelihood of, the training data. Therefore for test data, intuitively the estimates should be at least close to the truth. MLE tends to give good performance in practice.

Cons

As shown above, the parameters used to compute the probability of a test case depend on the frequency of the same pattern in the training data. This leads to a problem: if a test case never appears in the training data, we will automatically assign it zero probability, which is “infinitely wrong” because it leads to an infinite cross-entropy from the real-world distribution to our model.

This also illustrates the overfitting problem: we fit our training data so well that the model fails to generalize to the real world.

Example on the slide

b_{F4}=0 because there is no casino roll with x=4 and y=F in the training data. However, we all know this is not true!

Pseudocounts

To solve this problem, we can add “hallucinated counts” to all the cases, so that even cases never seen during training get some counts. Then at test time no case is assigned zero probability.

How many pseudocounts do we add?

Imagine the total count of cases in training is 100 and we add 10000 counts to one case; that case will then be assigned a very high probability at test time. What we have done amounts to saying “I strongly believe this will happen”. In other words, we put a lot of belief in that case.

However, if we add just one pseudocount to each case, the probabilities of frequent cases decrease, but not by much. Their rankings in probability will not change, because only the shared denominator in the MLE calculation changes, by the same amount for each case. The spared probability mass is distributed to cases never seen in training. These probabilities will be very small, but this eliminates zero probabilities at test time. This is what we call smoothing.

We can also view this as Bayesian estimation under a uniform prior whose “parameter strength” corresponds to the number of pseudocounts we add.
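A small sketch of add-one smoothing on hypothetical die-roll counts (the counts are illustrative assumptions):

```python
from collections import Counter

observed = Counter({'1': 10, '2': 8, '3': 12, '5': 6, '6': 14})  # face '4' never seen
faces = ['1', '2', '3', '4', '5', '6']
N = sum(observed.values())

# Plain MLE assigns zero probability to the unseen face '4'
mle = {f: observed[f] / N for f in faces}
assert mle['4'] == 0.0

# Add-one pseudocounts: every face gets one hallucinated count
alpha = 1
smoothed = {f: (observed[f] + alpha) / (N + alpha * len(faces)) for f in faces}
assert smoothed['4'] > 0.0
# Rankings among seen faces are preserved (counts shifted equally, same denominator)
assert smoothed['6'] > smoothed['3'] > smoothed['1'] > smoothed['2'] > smoothed['5']
```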

Supplementary

Density Estimation

Density estimation can be viewed as a single-node graphical model. It is a building block of general GMs.

For density estimation we have:

Discrete Distributions

Bernoulli distribution: Ber(p)

\displaystyle P(x)=p^x(1-p)^{1-x}

Multinomial distribution: Multinomial(n,\theta)

It is generally similar to the binomial. With a single trial (n = 1), it is similar to the Bernoulli, except that now there are k possible outcomes, each with probability \theta_i, where \sum_{i=1}^{k}\theta_i=1.

Suppose n is the observed vector with:

\displaystyle n = \begin{bmatrix} n_1 \\ \cdots \\ n_k \end{bmatrix}, \quad \sum_j n_j = N
\displaystyle P(n)=\frac{N!}{n_1!n_2! \cdots n_k!}\prod_k\theta_k^{n_k}

MLE: constrained optimization with Lagrange multipliers

Objective Function:

\displaystyle \ell(\theta;D)=\sum_k{n_k\log\theta_k}

constraint:

\displaystyle \sum_{k=1}^K{\theta_k}=1

Constrained cost function with a Lagrange multiplier:

\displaystyle \bar{\ell}=\sum_k{n_k\log\theta_k}+\lambda\left(1-\sum_{k=1}^K{\theta_k}\right)
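Setting the derivative of the Lagrangian to zero completes the derivation:

\displaystyle \frac{\partial\bar{\ell}}{\partial\theta_k}=\frac{n_k}{\theta_k}-\lambda=0 \implies \theta_k=\frac{n_k}{\lambda}

and the constraint \sum_k\theta_k=1 gives \lambda=\sum_k n_k=N, hence \hat{\theta}_{k,MLE}=\frac{n_k}{N}.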

Bayesian estimation

Dirichlet prior:
\displaystyle P(\theta)=C(\alpha)\prod_k{\theta_k^{\alpha_k-1}}

Posterior distribution of \theta:
\displaystyle P(\theta|x_1, \cdots, x_N)\propto\prod_k{\theta_k^{\alpha_k+n_k-1}}

Posterior mean estimation:
\displaystyle \theta_k=\frac{n_k+\alpha_k}{N+|\alpha|}
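A one-line computation of the posterior mean (the counts and hyperparameters are illustrative assumptions):

```python
# Posterior mean under a Dirichlet prior: theta_k = (n_k + alpha_k) / (N + |alpha|)
counts = [3, 0, 7]          # observed counts n_k (hypothetical)
alpha = [1.0, 1.0, 1.0]     # Dirichlet hyperparameters (uniform prior)

N = sum(counts)
A = sum(alpha)
theta = [(n + a) / (N + A) for n, a in zip(counts, alpha)]

assert abs(sum(theta) - 1.0) < 1e-12
assert theta[1] > 0          # an unseen category still gets probability mass
```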

Sequential Bayesian updating

Start with Dirichlet prior:
\displaystyle P(\bar{\theta}|\bar{\alpha})=Dir(\bar{\theta}:\bar{\alpha})

Observe N' samples with sufficient statistics \bar{n}'. The posterior becomes:
\displaystyle P(\bar{\theta}|\bar{\alpha},\bar{n}')=Dir(\bar{\theta}:\bar{\alpha}+\bar{n}')

Hierarchical Bayesian Models

\theta are the parameters of the likelihood P(x \vert \theta)

\alpha are the parameters of the prior P(\theta \vert \alpha)

We can have hyper-parameters, etc.

We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically make hyper-parameters constants.

Limitation of Dirichlet Prior

The Dirichlet prior can only put emphasis/bias on all coordinates or on one single coordinate; it cannot, for example, emphasize two coordinates simultaneously. The priors corresponding to different hyperparameters \alpha are shown in the following picture:

The Logistic Normal Prior

Pro: co-variance structure
Con: non-conjugate

Continuous Distributions

Uniform Probability Density Function

\displaystyle p(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{elsewhere} \end{cases}

Normal (Gaussian) Probability Density Function

\displaystyle P(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Multivariate Gaussian

\displaystyle P(X;\bar{\mu},\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(X-\bar{\mu})^T\Sigma^{-1}(X-\bar{\mu})\right)

MLE for a multivariate-Gaussian:

It can be shown that the MLEs for \mu and \Sigma are:
\displaystyle \mu_{MLE} = \frac{1}{N}\sum_n x_n
\displaystyle \Sigma_{MLE} = \frac{1}{N}\sum_n (x_n - \mu_{MLE})(x_n - \mu_{MLE})^T = \frac{1}{N} S

where the scatter matrix is:
\displaystyle S = \sum_n (x_n - \mu_{MLE})(x_n - \mu_{MLE})^T = \sum_n x_n x_n^T - N \mu_{ML} \mu_{ML}^T

Note that X^TX=\sum_n{x_n x_n^T} may not be full rank (e.g. if N<D), in which case \Sigma_{ML} is not invertible.
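A quick numerical check of these estimators (the true mean and covariance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=50000)

N = len(X)
mu_mle = X.mean(axis=0)                    # (1/N) sum_n x_n
centered = X - mu_mle
Sigma_mle = centered.T @ centered / N      # (1/N) * scatter matrix S

assert np.allclose(mu_mle, mu_true, atol=0.05)
assert np.allclose(Sigma_mle, Sigma_true, atol=0.05)
```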

Bayesian Parameter Estimation for a Gaussian

Reasons for Bayesian Approach:

Here we consider only conjugate priors, in various cases of increasing complexity:

Unknown \mu, known \sigma

Known \mu, unknown \lambda = \sigma^{-2}

Unknown \mu, unknown \lambda

Two nodes fully observed BNs


Classification

Goal: we wish to learn f: X \to Y

Conditional Gaussian

The data: \{ (x_1,y_1),(x_2, y_2), \cdots, (x_n, y_n) \}

y is a class indicator vector (one-hot encoding): p(y_n) = \text{multi}(y_n; \pi) = \prod_{k=1}^K \pi_k^{y_{n,k}}

x is a conditional Gaussian variable with a class-specific mean: p(x_n | y_{n,k} = 1, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{1}{2\sigma^2}(x_n-\mu_k)^2 \}

\displaystyle p(x|y, \mu, \sigma)=\prod_n p(x_n | y_n, \mu, \sigma) = \prod_n \left(\prod_k \mathcal{N} (x_n; \mu_k, \sigma_k)^{y_{n,k}} \right)

MLE of Conditional Gaussian


Bayesian Estimation of Conditional Gaussian


Gaussian Discriminative Analysis:

Linear Regression

The data: \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \}, where x is an input vector and y is a response (either continuous or discrete).

A regression scheme can be used to model p(yx)p(y \vert x) directly rather than p(x,y)p(x,y)

A discriminative probabilistic model

Assume y_i = \theta^T x_i + \epsilon_i, where \epsilon_i is an error term of unmodeled effects or random noise. Assume \epsilon_i \sim \mathcal{N}(0, \sigma^2); then p(y_i | x_i; \theta) \sim \mathcal{N}(\theta^Tx_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{ -\frac{1}{2\sigma^2} (y_i - \theta^Tx_i)^2 \}

Since each (y_i, x_i) is i.i.d., we have: L(\theta) = \prod_{i=1}^N p(y_i | x_i; \theta) = (\frac{1}{\sqrt{2\pi\sigma^2}})^N \exp\{ - \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - \theta^T x_i)^2 \}

hence: \ell(\theta) = N\log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - \theta^Tx_i)^2

We notice that maximizing this w.r.t. \theta is equivalent to minimizing the sum of squared errors: J(\theta) = \frac{1}{2}\sum_{i=1}^N(x_i^T \theta - y_i)^2
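Since the maximizer of this likelihood is the least-squares solution, it can be computed in closed form via the normal equations \theta = (X^TX)^{-1}X^Ty. A sketch with synthetic data (the parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = np.array([3.0, -1.5])
X = rng.normal(size=(1000, 2))
y = X @ theta_true + rng.normal(scale=0.5, size=1000)  # Gaussian noise

# Maximizing the Gaussian log-likelihood = minimizing the sum of squared errors,
# solved in closed form by the normal equations theta = (X^T X)^{-1} X^T y
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(theta_mle, theta_true, atol=0.1)
```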

ML Structure Learning for Completely Observed GMs

Two “Optimal” Approaches

“Optimal” means the employed algorithms are guaranteed to return a structure that maximizes the objective. Some popular heuristics, however, provide no guarantee of attaining optimality or interpretability, or do not even have an explicit objective, e.g. structural EM, module networks, greedy structural search, and deep learning via auto-encoders.

We will learn two classes of algorithms for guaranteed structure learning, albeit applying only to certain families of graphs:

But we can find the exact solution for an optimal tree under MLE:

This leads to the Chow-Liu algorithm.

Chow-Liu Algorithm

\begin{aligned} \ell(\theta_G, G; D) &= \log P(D | \theta_G, G)\\ &= \log \prod_n \left(\prod_i p(x_{n,i} | x_{n, \pi_i(G)}, \theta_{i | \pi_i(G)} ) \right)\\ &= \sum_{i} \left(\sum_{n}\log p(x_{n,i} | x_{n, \pi_i(G)}, \theta_{i | \pi_i(G)}) \right)\\ &= M\sum_i \left(\sum_{x_i, x_{\pi_i(G)}} \hat p(x_i, x_{\pi_i(G)})\log p(x_i | x_{\pi_i(G)}, \theta_{i | \pi_i(G)} )\right)\\ &= M\sum_i\left(\sum_{x_i, x_{\pi_i(G)}} \hat{p}(x_i, x_{\pi_i(G)}) \log \frac{\hat p(x_i, x_{\pi_i(G)})}{\hat p(x_{\pi_i(G)}) \hat p(x_i) } \right) + M \sum_i \left(\sum_{x_i} \hat p(x_i) \log \hat p(x_i)\right)\\ &= M\sum_i \hat I(x_i, x_{\pi_i(G)}) - M \sum_i \hat H(x_i) \end{aligned}
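Since the entropy term does not depend on the structure, maximizing the likelihood reduces to finding a maximum-weight spanning tree under empirical mutual information weights. A compact sketch of this (Kruskal's algorithm over pairwise MI; the synthetic chain data is an illustrative assumption):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

# Synthetic binary data with chain dependence 0 -> 1 -> 2
N = 20000
x0 = (rng.random(N) < 0.5).astype(int)
x1 = np.where(rng.random(N) < 0.9, x0, 1 - x0)  # x1 copies x0 90% of the time
x2 = np.where(rng.random(N) < 0.9, x1, 1 - x1)
X = np.stack([x0, x1, x2], axis=1)

def empirical_mi(a, b):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Maximum-weight spanning tree (Kruskal) over pairwise MI weights
edges = sorted(combinations(range(3), 2),
               key=lambda e: empirical_mi(X[:, e[0]], X[:, e[1]]), reverse=True)
parent = list(range(3))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i
tree = []
for u, v in edges:
    ru, rv = find(u), find(v)
    if ru != rv:               # add the edge only if it does not create a cycle
        parent[ru] = rv
        tree.append((u, v))

# The chain edges (0,1) and (1,2) carry more MI than the indirect pair (0,2)
assert sorted(tree) == [(0, 1), (1, 2)]
```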

Structure Learning for General Graphs

Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d \geq 2.

Most structure learning approaches use heuristics
