Last class, we presented the two major types of graphical-model tasks, inference and learning, and we mainly discussed inference. In this lecture, we begin to introduce the learning task and explore some learning techniques used in fully observable Bayesian networks.
Learning Graphical Models
Before we get started, it is helpful to build some intuition about what learning is and what goal it aims to achieve.
The goal
The goal: given a set of independent samples (assignments of random variables), find the best (i.e., the most likely) Bayesian network, comprising both the DAG and the CPDs.
As shown in the figure below, we are given a set of independent samples (assignments of binary random variables). Assuming the structure is a DAG, learning the directed links (causal relationships) between nodes is called structure learning, while learning the conditional probabilities is a separate task called parameter learning.
Overview
As listed below, there are several learning scenarios we are interested in:
1. Completely observed GMs:
1.1. directed
1.2. undirected
2. Partially or unobserved GMs:
2.1. directed
2.2. undirected (an open research topic)
Some useful estimation principles are also listed here:
Maximum likelihood estimation (MLE)
Bayesian estimation
Maximum conditional likelihood
Maximum "margin"
Maximum entropy
We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.
Parameter Learning
In this section, we are going to introduce Parameter Learning, and some useful related models.
When doing parameter learning, we assume that the graphical model G itself is known and fixed; G may come either from expert design or as an intermediate outcome of iterative structure learning.
The goal of parameter learning is to estimate parameters from a dataset of N independent, identically distributed (i.i.d.) training samples D = {x_1, \cdots, x_N}.
In general, each training case x_n = (x_{n,1}, \cdots, x_{n,M}) is a vector of M values, one per node. Depending on the model, the elements of x_n may be fully known (no missing values, no hidden variables) if the model is completely observable, or only partially known (\exists i such that x_{n,i} is not observed) if the model is partially observable.
In this class, we mainly consider learning parameters for a BN that is completely observable with a given structure.
Exponential Family Distributions
There are various density estimation tasks that can be viewed as single-node GMs (see the supplementary section for examples), and they are in fact instances of Exponential Family Distributions.
Exponential family distributions are building blocks for general GMs, and they have nice properties that allow us to easily perform MLE and Bayesian estimation. A distribution is in the exponential family if it can be written in the form

p(x \mid \eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\}

where \eta is the natural (canonical) parameter, T(x) is the sufficient statistic, A(\eta) is the log normalizer, and h(x) is a scaling constant.

Example 1: Multivariate Gaussian Distribution

Expanding the quadratic form in the Gaussian density N(x \mid \mu, \Sigma) and collecting terms gives the exponential family form with

\eta = \begin{bmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\mathrm{vec}(\Sigma^{-1}) \end{bmatrix}, \quad T(x) = \begin{bmatrix} x \\ \mathrm{vec}(xx^T) \end{bmatrix}, \quad A(\eta) = \frac{1}{2}\mu^T \Sigma^{-1}\mu + \frac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}
After this conversion, we see that the Multivariate Gaussian Distribution indeed belongs to exponential family.
Note that a k-dimensional multivariate Gaussian distribution has a (k + k^2)-dimensional natural parameter (and sufficient statistic). However, since \Sigma must be symmetric and PSD, the parameters actually have fewer degrees of freedom.
Example 2: Multinomial Distribution
For a one-hot binary vector random variable x \sim \text{multinomial}(x \vert \pi),

p(x \mid \pi) = \prod_k \pi_k^{x_k} = \exp\left\{\sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{\pi_K} + \ln\pi_K\right\}

which is in exponential family form with \eta_k = \ln(\pi_k / \pi_K), T(x) = x, and log normalizer A(\eta) = \ln\left(1 + \sum_{k=1}^{K-1} e^{\eta_k}\right).
Moment generating property

The qth derivative of the log normalizer A(\eta) gives the qth centered moment:

\frac{dA(\eta)}{d\eta} = \text{mean}

\frac{d^2 A(\eta)}{d\eta^2} = \text{variance}

\cdots
So we can take advantage of this to compute the moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta).
Note: when the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs. canonical parameters
Applying the moment generating property, the moment parameter μ can be derived from the natural (canonical) parameter by:
\frac{dA(\eta)}{d\eta} = \mu
Note that A(η) is convex since
\frac{d^2 A(\eta)}{d\eta^2} = \text{Var}[T(x)] > 0
Thus we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):
η=ψ(μ)
So we can say that a distribution in the exponential family can be parameterized not only by η - the canonical parameterization, but also by μ - the moment parameterization.
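To ground the two parameterizations, here is a worked Bernoulli example (not in the original notes, but standard):

```latex
% Bernoulli in exponential family form: p(x \mid \eta) = \exp\{\eta x - A(\eta)\},
% with natural parameter \eta = \log\frac{p}{1-p} and A(\eta) = \log(1 + e^{\eta}).
% Moment parameter from the canonical parameter:
\mu = \frac{dA(\eta)}{d\eta} = \frac{e^{\eta}}{1 + e^{\eta}} = p
% Canonical parameter from the moment parameter (the inverse map \psi):
\eta = \psi(\mu) = \log\frac{\mu}{1 - \mu}
```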
MLE for Exponential Family
For i.i.d. data, we have the log-likelihood:
\ell(\eta; D) = \log \prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\}

= \sum_n \log h(x_n) + \left(\eta^T \sum_n T(x_n)\right) - N A(\eta)
Take the derivatives and set to zero:
\frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial \eta} = 0
and
\frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N}\sum_n T(x_n)

\hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)
This amounts to moment matching. Also, we could infer the canonical parameters using \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE}).
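As a quick illustrative sketch (the toy data and names are assumptions, not from the notes), the following Python snippet performs MLE by moment matching for a Bernoulli model, whose sufficient statistic is T(x) = x:

```python
import numpy as np

# Toy i.i.d. Bernoulli data (assumed for illustration).
rng = np.random.default_rng(0)
data = rng.binomial(n=1, p=0.3, size=1000)

# MLE by moment matching: the moment parameter estimate is the
# empirical average of the sufficient statistic T(x) = x.
mu_mle = data.mean()

# Recover the canonical parameter via eta = psi(mu) = log(mu / (1 - mu)).
eta_mle = np.log(mu_mle / (1.0 - mu_mle))

print(f"moment parameter mu = {mu_mle:.3f}, canonical parameter eta = {eta_mle:.3f}")
```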
Sufficiency
We can see from the above that for p(x \mid \theta), T(x) is sufficient for \theta if there is no information in X regarding \theta beyond that in T(x). We can then discard X for the purpose of inference with respect to \theta. This can be shown from both the Bayesian and the frequentist views.
There are various conditional density estimation tasks that can be viewed as two-node GMs (see supplementary section for examples). Many are instances of Generalized Linear Models.
A generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows the linear model to be related to the response variable via a link function, and allows error distributions other than the normal distribution. For example, both linear regression and logistic regression are unified under the GLM framework.
GLM
Commonality
We model the expectation of y:

E_p[y] = \mu = f(\theta^T x)

where:
p is the conditional distribution of y
f is the response function
The observed input x is assumed to enter the model via a linear combination of its elements, \xi = \theta^T x. The conditional mean \mu is represented as a function f(\xi) of \xi, where f is the response function (its inverse is known as the link function). The observed output y is assumed to follow an exponential family distribution with conditional mean \mu.
Example 1: Linear Regression
Let’s take a quick recap on Linear Regression.
Assume that the target variable and the inputs are related by:
y_i = \theta^T x_i + \epsilon_i
where ϵ is an error term of unmodeled effects or random noise.
Now assume that \epsilon_i follows a Gaussian distribution N(0, \sigma^2); then
p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)
We can use the LMS (least mean squares) algorithm, a gradient ascent/descent approach, to estimate the parameter.
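As a minimal sketch (toy data and function names are assumptions), batch gradient descent on the squared error implements the LMS idea:

```python
import numpy as np

def lms_fit(X, y, lr=0.01, n_iters=1000):
    """Estimate theta for y = X @ theta + noise by gradient descent on MSE."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ theta                # (y_i - theta^T x_i) for all i
        theta += lr * X.T @ residual / len(y)   # gradient step on (1/2) * mean squared error
    return theta

# Toy data (assumed for illustration): y = 2*x1 - 1*x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)
print(lms_fit(X, y))  # should be close to [2, -1]
```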
Example 2: Logistic Regression (sigmoid classifier, perceptron, etc.)
The condition distribution is a Bernoulli:
p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}
where μ is a logistic function
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
We can use the brute-force gradient method as in linear regression. But we can also exploit the fact that p(y \mid x) is an exponential family distribution (a generalized linear model), so the gradient of the log-likelihood takes the generic form

\frac{\partial \ell}{\partial \theta} = \sum_n (y_n - \mu_n) x_n

which is a fixed-point equation, since \mu_n is a function of \theta.
Iteratively Reweighted Least Squares (IRLS)
Recall that in least-squares optimization \theta^* = (X^T X)^{-1} X^T y, and that for logistic regression the Hessian is H = -X^T W X, where W = \text{diag}(\mu_n(1 - \mu_n)). Applying the Newton-Raphson method to the cost function \ell:

\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta \ell = (X^T W^t X)^{-1} X^T W^t z^t
where the response is:
z^t = X\theta^t + (W^t)^{-1}(y - \mu^t)
Hence, this can be understood as solving the Iteratively reweighted least squares problem.
\theta^{t+1} = \arg\min_\theta (z^t - X\theta)^T W^t (z^t - X\theta)
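The following is a minimal runnable sketch of IRLS for logistic regression (the toy data and variable names are assumptions for illustration):

```python
import numpy as np

def irls_logistic(X, y, n_iters=20):
    """Fit logistic regression by iteratively reweighted least squares."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))   # current means mu_n
        w = mu * (1.0 - mu)                     # diagonal of W
        z = X @ theta + (y - mu) / w            # working response z^t
        # Weighted least-squares update: theta = (X^T W X)^{-1} X^T W z
        XtW = X.T * w
        theta = np.linalg.solve(XtW @ X, XtW @ z)
    return theta

# Toy data (assumed): labels drawn from a logistic model with known theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_theta = np.array([1.5, -2.0])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)
print(irls_logistic(X, y))  # should approach [1.5, -2.0]
```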
Global and local parameter independence
Simple graphical models can be viewed as building blocks of complex graphical models. With the same concept, if we assume the parameters for each local conditional probabilistic distribution to be globally independent, and all nodes are fully observed, then the log-likelihood function can be decomposed into a sum of local terms, one per node:
\ell(\theta; D) = \log p(D \mid \theta) = \sum_i \left(\sum_n \log p(x_{n,i} \mid x_{n,\pi_i}, \theta_i)\right)
Plate
A plate is a macro that allows subgraphs to be replicated. Conventionally, instead of drawing each repeated variable individually, a plate is used to group these variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate.
Rules for plates: Repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. N), updating the plate index variable (e.g. n) as you go; Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.
For example, the joint distribution of the four-node directed acyclic network can be decomposed as

p(x \mid \theta) = \prod_i p(x_i \mid x_{\pi_i}, \theta_i) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)
This is exactly like learning four separate small BNs, each of which consists of a node and its parents, as shown by the following graph:
Global parameter independence:
For every DAG model: p(\theta_m \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G).
Local parameter independence:
For every node: p(\theta_i \mid G) = \prod_{j=1}^{q_i} p(\theta_{x_i \mid \pi_i^{(j)}} \mid G), where q_i is the number of configurations of x_i's parents.
Global parameter sharing
Consider a network structure G over a set of variables X = \{X_1, \cdots, X_n\}, parameterized by a set of parameters \theta. Each variable X_i is associated with a CPD P(X_i \mid U_i, \theta). Now, rather than assuming that each such CPD has its own parameterization \theta_{X_i \mid U_i}, we assume that we have a certain set of shared parameters used by multiple variables in the network; this is global parameter sharing. Assume \theta is partitioned into disjoint subsets \theta^1, \cdots, \theta^K, and to each subset \theta^k we assign a disjoint set of variables V_k \subseteq X. For X_i \in V_k,

P(X_i \mid U_i, \theta) = P(X_i \mid U_i, \theta^k)

and for X_i, X_j \in V_k,

P(X_i \mid U_i, \theta^k) = P(X_j \mid U_j, \theta^k)
Supervised ML estimation
Hidden Markov Model (HMM)
We are given an HMM; let the hidden states be y_1, \cdots, y_N and the observations be x_1, \cdots, x_N, all of which are known to us.
At training time, we can record the frequency of each transition from one hidden state to another, or from a hidden state to an observation symbol.
Let A_{ij} count the transitions from hidden state i to hidden state j, and let B_{ik} count the emissions of observation symbol k from hidden state i. We calculate our parameters by maximum likelihood estimation as follows:

a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}

b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}
If the observations are continuous, we can treat them as Gaussian and apply the corresponding learning rules for the Gaussian distribution.
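A minimal counting sketch of these supervised estimates (the toy sequences and alphabet sizes are assumptions):

```python
import numpy as np

def supervised_hmm_mle(state_seqs, obs_seqs, n_states, n_symbols):
    """Estimate HMM transition and emission matrices by counting."""
    A = np.zeros((n_states, n_states))   # A[i, j]: count of transitions i -> j
    B = np.zeros((n_states, n_symbols))  # B[i, k]: count of emissions i -> k
    for states, obs in zip(state_seqs, obs_seqs):
        for t in range(1, len(states)):
            A[states[t - 1], states[t]] += 1
        for s, o in zip(states, obs):
            B[s, o] += 1
    # Normalize each row to obtain the MLE probabilities
    # (assumes every state appears in the training data).
    a = A / A.sum(axis=1, keepdims=True)
    b = B / B.sum(axis=1, keepdims=True)
    return a, b

# Toy labeled data (assumed): two hidden states, three observation symbols.
states = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
obs    = [[0, 1, 2, 2, 0], [2, 1, 0, 0, 1]]
a, b = supervised_hmm_mle(states, obs, n_states=2, n_symbols=3)
print(a); print(b)
```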
Pros
This method gives us the parameters that best fit, i.e., maximize the likelihood of, the training data. Therefore, for test data, the estimates should intuitively be at least close to the truth. MLE tends to give good performance in practice.
Cons
As shown above, the parameters used to compute the probability of a test case depend on the frequency with which the same pattern was seen at training time. This leads to a problem: if a test case was never seen in the training data, we will automatically assign it zero probability, which is "infinitely wrong" because it leads to an infinite cross entropy between the real-world distribution and our model.

This also illustrates the overfitting problem: we fit the training data so well that the model fails to generalize to the real world.
Example on the slide
b_{F,4} = 0 because there is no casino roll with x = 4 and y = F in the training data. However, we all know this is not true!
Pseudocounts
To solve this problem, we can add "hallucinated counts" (pseudocounts) to all cases, so that even cases never seen at training time get some counts. Therefore, at test time no case is assigned zero probability.
How many pseudocounts do we add?
Imagine the total count of cases at training time is 100, and we add 10,000 counts to one case; that case will then be assigned a very high probability at test time. What we have done amounts to saying "I strongly believe this will happen". In other words, we put a lot of belief on that case.

However, if we add just one pseudocount to each case, the probabilities of frequent cases decrease, but not by much. Their probability ranking does not change, because only the shared denominator of the MLE calculation has changed, by the same amount. The freed probability mass is distributed to cases never seen at training time; their probabilities are very small, but this eliminates the zero-probability problem at test time. This is what we call smoothing.
We can also view adding pseudocounts as Bayesian estimation under a uniform prior whose "parameter strength" is the total number of pseudocounts.
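A small sketch of add-one smoothing (the counts below are made-up) on the emission counts of one hidden state:

```python
import numpy as np

# Made-up emission counts for one hidden state over six symbols;
# symbol 3 (x = 4 in the dice example) was never observed.
counts = np.array([12.0, 9.0, 15.0, 0.0, 8.0, 6.0])

mle = counts / counts.sum()   # plain MLE: zero probability for symbol 3
pseudo = 1.0                  # one hallucinated count per case
smoothed = (counts + pseudo) / (counts.sum() + pseudo * len(counts))

print(mle)       # contains an exact zero
print(smoothed)  # small but nonzero everywhere; ranking of frequent cases unchanged
```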
Supplementary
Density Estimation
Density estimation can be viewed as single-node graphical models. It is the building block of general GM.
For density estimation we have:
MLE (maximum likelihood estimation)
Bayesian estimate
Discrete Distributions
Bernoulli distribution: Ber(p)
P(x) = p^x (1-p)^{1-x}
Multinomial distribution: Multinomial(n,θ)
It is generally similar to the binomial distribution. When the number-of-trials parameter is 1, there is only one trial, so it is similar to the Bernoulli distribution, except that now there are k possible outcomes, each with probability \theta_i, where \sum_{i=1}^{k}{\theta_i}=1.
Suppose n is the observed count vector with:

n = [n_1, \cdots, n_k]^T, \quad \sum_j n_j = N

P(n) = \frac{N!}{n_1! n_2! \cdots n_k!} \prod_k \theta_k^{n_k}
MLE: constrained optimization with Lagrange multipliers
Objective Function:
l(\theta; D) = \sum_k n_k \log \theta_k
Constraint:
\sum_{k=1}^{K} \theta_k = 1
Constrained cost function with a Lagrange multiplier:
\tilde{l} = \sum_k n_k \log \theta_k + \lambda\left(1 - \sum_{k=1}^{K} \theta_k\right)
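Carrying the optimization through (a standard step, sketched here for completeness):

```latex
% Setting the derivative w.r.t. each \theta_k to zero:
\frac{\partial \tilde{l}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0
\;\Rightarrow\; \theta_k = \frac{n_k}{\lambda}
% The constraint \sum_k \theta_k = 1 forces \lambda = \sum_k n_k = N, hence
\theta_{k,\mathrm{MLE}} = \frac{n_k}{N}
```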
Bayesian estimation
Dirichlet prior: P(\theta) = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}

Posterior distribution of \theta: P(\theta \mid x_1, \cdots, x_N) \propto \prod_k \theta_k^{\alpha_k + n_k - 1}

Posterior mean estimation: \theta_k = \frac{n_k + \alpha_k}{N + |\alpha|}
Sequential Bayesian updating
Start with a Dirichlet prior: P(\vec{\theta} \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta} : \vec{\alpha})

Observe N' samples with sufficient statistics \vec{n}'. The posterior becomes: P(\vec{\theta} \mid \vec{\alpha}, \vec{n}') = \mathrm{Dir}(\vec{\theta} : \vec{\alpha} + \vec{n}')
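A tiny sketch of this conjugate update (the counts are made-up), showing that sequential and batch updating agree:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior hyperparameters

# Two successive batches of multinomial counts (made-up).
batch1 = np.array([3, 0, 1])
batch2 = np.array([2, 5, 1])

# Sequential updating: posterior hyperparameters are alpha + counts, batch by batch.
posterior_seq = (alpha + batch1) + batch2

# Batch updating with all data at once gives the same posterior.
posterior_all = alpha + (batch1 + batch2)

assert np.allclose(posterior_seq, posterior_all)
print("posterior mean:", posterior_seq / posterior_seq.sum())
```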
Hierarchical Bayesian Models
θ are the parameters for the likelihood P(x∣θ)
α are the parameters for the prior P(θ∣α)
We can have hyper-parameters, etc.
We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically, the top-level hyper-parameters are made constants.
Limitation of Dirichlet Prior
A Dirichlet prior can only put emphasis/bias on all coordinates or on one single coordinate; it cannot, for example, emphasize two coordinates simultaneously. The priors corresponding to different hyperparameters \alpha are shown in the following picture:
Unknown \mu, known \sigma^2

Posterior: P(\mu \mid x, \mu_0, \tau^2) \sim N(\tilde{\mu}, \tilde{\sigma}^2) = \frac{1}{\sqrt{2\pi\tilde{\sigma}^2}}\exp\left\{-\frac{1}{2\tilde{\sigma}^2}(\mu - \tilde{\mu})^2\right\}

where:

\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\tau^2}\bar{x} + \frac{1/\tau^2}{N/\sigma^2 + 1/\tau^2}\mu_0, \quad \tilde{\sigma}^{-2} = \frac{N}{\sigma^2} + \frac{1}{\tau^2}
The posterior mean is a convex combination of sample mean and prior mean, and the posterior precision is the precision of the prior plus 1/σ2 of contribution for each observed data point.
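A quick numeric sketch of this convex combination (all numbers below are made-up):

```python
import numpy as np

# Made-up setup: known observation variance sigma^2, Gaussian prior N(mu0, tau^2) on the mean.
sigma2, tau2, mu0 = 4.0, 1.0, 0.0
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)
N, xbar = len(x), x.mean()

# Posterior precision is the prior precision plus one 1/sigma^2 per data point.
post_prec = N / sigma2 + 1.0 / tau2
# Posterior mean is a precision-weighted convex combination of sample and prior means.
mu_tilde = (N / sigma2 * xbar + 1.0 / tau2 * mu0) / post_prec

print(f"sample mean {xbar:.3f}, posterior mean {mu_tilde:.3f}, posterior var {1/post_prec:.4f}")
```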
Known \mu, unknown \lambda = \sigma^{-2}
The conjugate prior for \lambda is a Gamma distribution:

P(\lambda \mid a, b) = \Gamma(a)^{-1} b^a \lambda^{a-1} \exp(-b\lambda)
Classification

The joint probability of a datum and its label is:

P(x_n, y_{n,k} = 1 \mid \mu, \sigma) = p(y_{n,k} = 1)\, p(x_n \mid y_{n,k} = 1, \mu, \sigma) = \pi_k \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\right\}
Predict the conditional probability of label given the datum:
p(y_{n,k} = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k (2\pi\sigma^2)^{-1/2}\exp\{-(x_n - \mu_k)^2 / 2\sigma^2\}}{\sum_{k'} \pi_{k'} (2\pi\sigma^2)^{-1/2}\exp\{-(x_n - \mu_{k'})^2 / 2\sigma^2\}}
Frequentist Approach: fit π, μ and σ from data first
Bayesian Approach: compute the posterior dist. of the parameters first
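As a small sketch of the conditional-probability computation above (the fitted parameters are made-up, standing in for the frequentist fit):

```python
import numpy as np

# Made-up fitted parameters for a 3-class generative model with
# Gaussian class-conditionals sharing a single variance sigma^2.
pi = np.array([0.5, 0.3, 0.2])      # class priors pi_k
mu = np.array([-2.0, 0.0, 3.0])     # class means mu_k
sigma2 = 1.0

def class_posterior(x):
    """p(y_k = 1 | x) via Bayes rule; the (2*pi*sigma^2)^{-1/2} factor cancels."""
    joint = pi * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))
    return joint / joint.sum()

print(class_posterior(0.5))  # posterior over the three labels for x = 0.5
```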
Linear Regression
The data: \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}, where x is an input vector and y is a response (either continuous or discrete)
A regression scheme can be used to model p(y∣x) directly rather than p(x,y)
A discriminative probabilistic model
Assume y_i = \theta^T x_i + \epsilon_i, where \epsilon_i is an error term of unmodeled effects or random noise. Assume \epsilon_i \sim N(0, \sigma^2); then
p(y_i \mid x_i; \theta) \sim N(\theta^T x_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \theta^T x_i)^2\right\}
Since each pair (y_i, x_i) is i.i.d., we have:

L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \theta^T x_i)^2\right\}
hence:
l(\theta) = N\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \theta^T x_i)^2
We notice that maximizing this term w.r.t. \theta is equivalent to minimizing the MSE (mean squared error):
J(\theta) = \frac{1}{2}\sum_{i=1}^{N}(x_i^T\theta - y_i)^2
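As a small sketch (toy data assumed), minimizing J(\theta) has the closed-form normal-equations solution \theta^* = (X^T X)^{-1} X^T y:

```python
import numpy as np

# Toy regression data (assumed): y = 3*x1 + 0.5*x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal equations: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, a numerically stabler least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta, theta_lstsq)
```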
ML Structure Learning for Completely Observed GMs
Two “Optimal” Approaches
"Optimal" means the employed algorithms are guaranteed to return a structure that maximizes the objective. Some popular heuristics, however, provide no guarantee of attaining optimality or interpretability, or do not even have an explicit objective, e.g., structured EM, module networks, greedy structural search, and deep learning via auto-encoders.
We will learn two classes of algorithms for guaranteed structure learning, albeit applying only to certain families of graphs:
Direction: pick any node as the root and run BFS (breadth-first search) to define directions, i.e., we define an edge from node A to node B if we reach B from A while running BFS.
Structure Learning for General Graphs
Theorem: The problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d≥2.