Lecture 22: Bayesian non-parametrics
An introduction to Bayesian non-parametrics and the Dirichlet process.
Motivation
Clustering
Bayesian non-parametrics used to be a popular topic in machine learning. It is less popular right now, but it is still useful because it connects to a wide range of techniques we use today; for example, Gaussian processes describe the behavior of DNN layers in the infinite-width limit.
Let’s start from a simpler problem (clustering). We already know a number of algorithms to solve this problem. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Algorithms we’ve seen before (e.g. K-means, GMMs) solve this problem if the number of groups is specified.
However, in the real world, identifying the optimal number of clusters is still an open problem. In the two-dimensional case, we can usually identify the number of clusters through direct visualization, but this becomes hard when the data is high-dimensional. Nevertheless, we can choose the number of clusters by model selection (e.g. cross-validation over different values), although this adds computational cost.
Moreover, when we are dealing with streaming data (e.g. news/media analysis), the clusters change every day. The algorithm constantly needs to delete and add clusters. People often re-run K-means on the data every day to accommodate these changes, but this induces “the mapping problem”, since outputs of different K-means runs may not be aligned (i.e. cluster 1 today might correspond to cluster 2 tomorrow).
From GMM to Bayesian Mixture Model
We first start from the formulation of a mixture of Gaussians (over infinite data) for the clustering problem.
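In its standard form, with $K$ components, mixing weights $\pi_k$, and component parameters $(\mu_k, \Sigma_k)$, the mixture density of a data point $x_n$ is

$$ p(x_n \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1. $$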
The GMM above is known as a parametric model: we estimate a fixed number of parameters from the data.
A Bayesian way of estimating the mixing weights and mixture parameters is to put a prior on them, integrate them out, and then do MLE on the remaining hyper-parameters.
For convenience, we use conjugate priors. Note that the natural conjugate prior for the mixing weights is the Dirichlet distribution, which we introduce next.
Note that we still have no clue how to estimate the number of mixture components $K$.
The Dirichlet distribution
Reason for examining the Dirichlet distribution
The Dirichlet distribution is a distribution over vectors with values between 0 and 1 that sum to 1, meaning these vectors can be used as the probabilities of a discrete distribution. Because of this, the Dirichlet distribution can be seen as a distribution over distributions.
Details of the Dirichlet distribution
The Dirichlet distribution is defined as

$$ \mathrm{Dir}(\pi_1, \ldots, \pi_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\!\left(\sum_{k} \alpha_k\right)}{\prod_{k} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} $$
where:
- $[\alpha_1, \ldots, \alpha_K]$ are the parameters of the Dirichlet distribution, with $\alpha_k \geq 0$ and $\sum_k \alpha_k > 0$.
- $[\pi_1, \ldots, \pi_K]$ are the values whose probability the Dirichlet distribution describes, with $\sum_k \pi_k = 1$; such vectors can be seen as members of the $(K-1)$-dimensional standard simplex (a generalized triangle).
Properties of the Dirichlet distribution
- Probability is evenly distributed when $\alpha_k = 1$ for all $k$.
- $\alpha_k$ values less than 1 spread the probability away from the center of the simplex; $\alpha_k$ values greater than 1 gather the probability in the center.
- If $\alpha_i$ is greater than $\alpha_j$, probability gathers more towards vertex $i$ of the simplex.
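As a minimal illustration of these properties, we can draw samples from a 3-dimensional Dirichlet under a few settings of the concentration parameters and compare how spread out they are (a sketch using NumPy; the parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three settings of the concentration parameters for a 3-dimensional Dirichlet.
alphas = {
    "uniform (alpha_k = 1)": [1.0, 1.0, 1.0],
    "sparse (alpha_k < 1)": [0.1, 0.1, 0.1],
    "concentrated (alpha_k > 1)": [10.0, 10.0, 10.0],
}

for name, alpha in alphas.items():
    samples = rng.dirichlet(alpha, size=5)  # each row sums to 1
    print(name)
    print(np.round(samples, 3))
```

With $\alpha_k < 1$ most of each sample's mass piles onto one coordinate (near a vertex of the simplex), while with $\alpha_k > 1$ the samples cluster around the uniform vector $(1/3, 1/3, 1/3)$.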
Expectation: $\mathbb{E}[\pi_k] = \dfrac{\alpha_k}{\sum_{j} \alpha_j}$.
The Dirichlet distribution is the conjugate prior to the multinomial and categorical distributions.
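Concretely, conjugacy means that if $\pi \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$ and we observe counts $n_1, \ldots, n_K$ drawn from a multinomial with parameter $\pi$, the posterior is again a Dirichlet with the counts added to the concentration parameters:

$$ p(\pi \mid n_1, \ldots, n_K) = \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K). $$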
Dirichlet distribution as a distribution over distributions
Vectors sampled from the Dirichlet distribution can be used to represent distributions over the parameters of other distributions. For example, each probability in a sampled vector could be associated with the mean and variance of a normal distribution, and that normal distribution could itself be used to produce the parameters of another distribution. In this way, the Dirichlet distribution can act as a distribution over distributions of parameters.
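A small sketch of this idea (hypothetical parameter choices, using NumPy): draw mixing weights from a Dirichlet, attach a randomly drawn Gaussian mean to each weight, and the result is a random discrete distribution over those parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 5
alpha = np.full(K, 0.5)           # Dirichlet concentration parameters (arbitrary choice)
weights = rng.dirichlet(alpha)    # pi ~ Dir(alpha), a random probability vector
means = rng.normal(0.0, 3.0, K)   # theta_k: a parameter (here a Gaussian mean) per atom

# G = sum_k pi_k * delta_{theta_k}: a randomly drawn *distribution over parameters*.
for w, m in zip(weights, means):
    print(f"P(theta = {m:+.2f}) = {w:.3f}")

# Drawing data: first pick a parameter according to G, then sample from the
# component distribution that this parameter defines.
k = rng.choice(K, p=weights)
x = rng.normal(means[k], 1.0)
print("sampled data point:", x)
```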
Operations on the Dirichlet distribution
The dimension of the Dirichlet distribution can be modified while maintaining some properties of the distributions it describes.
Collapsing
If $(\pi_1, \ldots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$, then merging two entries gives $(\pi_1 + \pi_2, \pi_3, \ldots, \pi_K) \sim \mathrm{Dir}(\alpha_1 + \alpha_2, \alpha_3, \ldots, \alpha_K)$.
Splitting
If $(\pi_1, \ldots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$ and $(\tau_1, \tau_2) \sim \mathrm{Dir}(\alpha_1 \beta_1, \alpha_1 \beta_2)$ with $\beta_1 + \beta_2 = 1$, then splitting the first entry gives $(\pi_1 \tau_1, \pi_1 \tau_2, \pi_2, \ldots, \pi_K) \sim \mathrm{Dir}(\alpha_1 \beta_1, \alpha_1 \beta_2, \alpha_2, \ldots, \alpha_K)$.
Renormalization
If $(\pi_1, \ldots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$, then dropping the first entry and renormalizing gives $\dfrac{(\pi_2, \ldots, \pi_K)}{\sum_{k=2}^{K} \pi_k} \sim \mathrm{Dir}(\alpha_2, \ldots, \alpha_K)$.
Parametric vs non-parametric
In this class, we started with parametric models, which assume all data can be represented with a fixed, finite number of parameters. Models of this type include Gaussian mixture models (GMMs), generalized linear models (GLMs) with exponential families, and so on.
On the other hand, the number of parameters of a non-parametric model can grow with the sample size and may be random. An example of a purely non-parametric method is kernel density estimation, where the estimate of the probability density function (pdf) of a random variable is based solely on a finite sample of data points, together with a bandwidth that determines the smoothness and a kernel function that weights the distance from the observations to a particular value the random variable may take. However, non-parametric models can be cumbersome to estimate and inflexible, especially for data with strong structure such as clusters.
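To make the kernel density estimation example concrete, here is a minimal sketch using SciPy (the data, bandwidth, and evaluation grid below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# A finite sample from a bimodal distribution.
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

# The estimator is defined entirely by the sample, a kernel (Gaussian here),
# and a bandwidth controlling smoothness.
kde = gaussian_kde(data, bw_method=0.3)

grid = np.linspace(-5, 7, 5)
print(kde(grid))  # estimated pdf values on the grid
```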
Bayesian non-parametrics is an approach that combines the advantages of parametric and non-parametric modeling. This type of model also allows an infinite number of parameters, in the sense that the prior is a device defining a random distribution on an infinite space of possible parameters; however, a finite dataset will only ever use a finite subset of those parameters. In the Bayesian setting, we can integrate out the intermediate layers of random variables when using conjugate priors, leaving just the hyper-parameters and the data. In this way, we bypass the need to specify the infinite space explicitly and can still do inference. With these methods, we can marginalize out what is hard to operationalize, work only with the more intuitive and directly interpretable hyper-parameters, and still achieve the statistical effect of infinite modeling.
The Dirichlet process
Background
By using Bayesian non-parametrics for mixture models, we can write down the posterior probability of the observations as follows:
Construct an appropriate prior
The construction starts with a 2-dimensional Dirichlet distribution parameterized by $(\alpha/2, \alpha/2)$; by repeatedly applying the splitting operation above, we obtain higher- and higher-dimensional Dirichlet distributions, and in the limit an infinite-dimensional prior.
The Dirichlet process
Considering a mixture model, given a Dirichlet process prior with concentration parameter $\alpha$ and base measure $G_0$, a draw $G \sim \mathrm{DP}(\alpha, G_0)$ is itself a discrete distribution over component parameters.
This discrete distribution enables sampling centroids with a clustering effect. This cannot be achieved by sampling from a continuous distribution such as a Gaussian, since every sample would almost surely differ from the others; when sampling from a draw of the Dirichlet process, the same positions are repeatedly sampled with positive probability.
Sample from the Dirichlet process
We call the point masses in the resulting distribution ‘atoms’ and the corresponding weights ‘sizes’. ‘Sizes’ are determined by the concentration parameter $\alpha$, while the locations of the atoms are drawn from the base measure $G_0$.
Properties of the Dirichlet process
Sampling from a Dirichlet process results in partitions of the underlying parameter space.
Because of this, we do not need to represent the infinite dimensionality of the partition explicitly. Instead, we can work with a finite number of partitions, say $K$ of them.
Conjugacy of the Dirichlet process
Using the conjugacy property of the Dirichlet distribution, the posterior of a Dirichlet process given observed draws is again a Dirichlet process.
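In standard notation, if $G \sim \mathrm{DP}(\alpha, G_0)$ and $\theta_1, \ldots, \theta_n \mid G \overset{iid}{\sim} G$, then

$$ G \mid \theta_1, \ldots, \theta_n \;\sim\; \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha\, G_0 + \sum_{i=1}^{n} \delta_{\theta_i}}{\alpha + n}\right). $$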
Predictive distribution
Using the Dirichlet process (DP), we can define an easy and handy predictive distribution. Instead of maintaining the infinite-dimensional random measure $G$ explicitly, we integrate it out and predict each new sample directly from the previous ones.
Unlike later data points, the first data point always forms a new cluster on its own.
The second data point then has 2 choices: either join the existing cluster or start a new cluster. We have

$$ P(z_2 = z_1 \mid z_1) = \frac{1}{1 + \alpha}, \qquad P(z_2 \neq z_1 \mid z_1) = \frac{\alpha}{1 + \alpha}. $$

Thus, each new point joins an existing cluster with probability proportional to that cluster's current size, or starts a new cluster with probability proportional to $\alpha$. By applying the integration trick over the random measure $G$, we obtain the general predictive distribution

$$ \theta_{n+1} \mid \theta_1, \ldots, \theta_n \;\sim\; \frac{1}{\alpha + n}\left( \alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i} \right). $$

Thus, the predictive distribution depends only on the previously drawn values and the hyper-parameters, with no need to represent $G$ itself.
Metaphors for the Dirichlet process
There are many ways to visualize the Dirichlet process. Three common metaphors are the Pólya urn scheme, the Chinese restaurant process, and the stick-breaking process.
Pólya urn scheme

In the Pólya urn scheme (also called the Blackwell-MacQueen urn scheme), we start with an empty urn. At each step, with probability proportional to $\alpha$ we draw a new color from the base measure $G_0$ and add a ball of that color to the urn; with probability proportional to the number of balls already in the urn, we draw a ball uniformly at random, observe its color, and put it back together with an additional ball of the same color.
Then, after sampling $n$ balls, the distinct colors in the urn correspond to the atoms of the resulting distribution, and the number of balls of each color corresponds to that atom's size.
From this scheme, we can also easily see the self-reinforcing property that the Dirichlet process exhibits: new samples are more likely to be drawn from partitions that already have many existing samples.
The Chinese restaurant process
The Chinese restaurant process is an equivalent description of the Pólya urn scheme.
Imagine a restaurant with an infinite number of tables, all initially empty. The first customer enters the restaurant and picks a table to sit at. When the $n$-th customer enters, they sit at an already occupied table with probability proportional to the number of customers already seated there, or at a new table with probability proportional to $\alpha$. Each table corresponds to a cluster.
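A minimal simulation of this process (a sketch; the concentration parameter and the number of customers are arbitrary choices):

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha, rng):
    """Return a table assignment for each customer under a CRP with parameter alpha."""
    tables = []          # tables[k] = number of customers at table k
    assignments = []
    for n in range(n_customers):
        # Seat at existing table k w.p. tables[k] / (n + alpha),
        # or at a new table w.p. alpha / (n + alpha).
        probs = np.array(tables + [alpha], dtype=float) / (n + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

rng = np.random.default_rng(3)
assignments, tables = chinese_restaurant_process(100, alpha=2.0, rng=rng)
print("number of occupied tables:", len(tables))
print("table sizes:", tables)
```

Running this repeatedly shows the self-reinforcing behavior described above: a few large tables accumulate most customers, while the number of occupied tables grows only slowly with $n$.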
Note on Exchangeability
Even though the two descriptions above have a notion of an order of the drawn samples, it turns out that the distribution over the partitions of the first $n$ samples does not actually depend on the order in which the samples were drawn.
Distributions that do not depend on the order of the samples are called exchangeable distributions.
De Finetti’s theorem states that if a sequence of observations is exchangeable, then there must exist a random distribution conditioned on which the samples are i.i.d. In this case, the samples are i.i.d. given a draw $G$ from the Dirichlet process.
The stick-breaking process

The two metaphors above primarily show the distribution of the data points among the resulting partitions. The stick-breaking process shows how the partitions of the DP can be constructed, along with their associated parameters, to give the resulting distribution $G$.
Imagine we begin with a stick of unit length, which represents our total probability. We then repeatedly break off fractions of the stick, and assign parameters to each broken-off fraction according to our base measure.
Concretely, to construct the $k$-th partition:
- Sample a Beta$(1, \alpha)$ random variable $\beta_k \in [0, 1]$.
- Break off a fraction $\beta_k$ of the remaining stick. This gives us the $k$-th partition. We can calculate its atom size $\pi_k$ using the previous fractions $\beta_1, \ldots, \beta_k$.
- Sample a random parameter $\theta_k$ for this atom from our base measure $G_0$.
- Recur on the remaining stick.
In summary, $\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$, and the resulting random measure $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$ is a draw from $\mathrm{DP}(\alpha, G_0)$.
The stick-breaking metaphor also shows us how we can approximate a Dirichlet process in practice: the final distribution $G$ can be approximated by truncating the process after a finite number of breaks, since the remaining stick length (and hence the total weight of the ignored atoms) shrinks rapidly.
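A truncated stick-breaking sampler might look like the following sketch (the truncation level, concentration parameter, and standard normal base measure are illustrative assumptions):

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Approximate a draw G ~ DP(alpha, G_0) by truncating after `truncation` breaks.

    Returns atom weights pi_k and atom locations theta_k, with G_0 = N(0, 1).
    """
    betas = rng.beta(1.0, alpha, size=truncation)          # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                             # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    atoms = rng.normal(0.0, 1.0, size=truncation)           # theta_k ~ G_0
    return weights, atoms

rng = np.random.default_rng(4)
weights, atoms = stick_breaking(alpha=2.0, truncation=50, rng=rng)
print("total mass captured by the truncation:", weights.sum())
print("five largest atom sizes:", np.round(np.sort(weights)[::-1][:5], 3))
```

The printed total mass is close to 1, illustrating that a modest truncation already captures almost all of the probability for small $\alpha$.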
References
- Ferguson Distributions Via Polya Urn Schemes [link]
Blackwell, D. and MacQueen, J.B., 1973. Ann. Statist., Vol. 1(2), pp. 353–355. The Institute of Mathematical Statistics. DOI: 10.1214/aos/1176342372
- A Constructive Definition of Dirichlet Priors
Sethuraman, J., 1994. Statistica Sinica, Vol. 4, pp. 639–650.