There are no official learning objectives for this section. This can be used as a reference sheet for Probability in this course!
Suppose we have 3 random variables \(A, B\), and \(C\). Consider the expression \[P(+b, C) = \sum_{a \in \{a_1, a_2, a_3\}} P(a, +b, C)\]
In this course, we denote discrete random variables by capital letters and use them to represent all possible disjoint outcomes. In the above example, \(A, B,\) and \(C\) are random variables. We use lower case letters to denote outcomes, i.e. possible values our variables can take on, such as \(+b\) for the variable \(B\), or \(a_1, a_2,\) and \(a_3\) for the variable \(A\) in the above example.
We also use lower case letters like \(a\) as generic outcome variables, for example the summation index \(a\) in the expression above. These also represent a single outcome at a time (as opposed to random variables, which range over all of their outcomes).
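As a concrete illustration of the notation, here is a minimal Python sketch of the opening expression \(P(+b, C) = \sum_{a} P(a, +b, C)\). The joint table and all of its numbers are made up for this example:

```python
# A hypothetical joint distribution P(A, B, C), stored as a table keyed by
# outcome tuples. A has outcomes a1, a2, a3 (as in the example above);
# B and C are assumed binary here. All numbers are made up, but they sum
# to one over all combinations.
joint = {
    ("a1", "+b", "+c"): 0.05, ("a1", "+b", "-c"): 0.10,
    ("a1", "-b", "+c"): 0.05, ("a1", "-b", "-c"): 0.10,
    ("a2", "+b", "+c"): 0.10, ("a2", "+b", "-c"): 0.05,
    ("a2", "-b", "+c"): 0.15, ("a2", "-b", "-c"): 0.05,
    ("a3", "+b", "+c"): 0.10, ("a3", "+b", "-c"): 0.10,
    ("a3", "-b", "+c"): 0.05, ("a3", "-b", "-c"): 0.10,
}

# P(+b, C): fix the outcome +b, keep C as a random variable, and sum out A.
# Because C stays capitalized, the result is a table over C, not one number.
p_plus_b_C = {}
for (a, b, c), p in joint.items():
    if b == "+b":
        p_plus_b_C[c] = p_plus_b_C.get(c, 0.0) + p

print(p_plus_b_C)  # ≈ {'+c': 0.25, '-c': 0.25}
```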
\[\begin{align*} P(X \mid Y) = \frac{P(X,Y)}{P(Y)} \end{align*}\]
\[\begin{align*} P(X, Y) &= P(X \mid Y) P(Y) \\[0.5em] &= P(Y \mid X) P(X) \\[0.5em] P(X_1,X_2, X_3) &= P(X_1, X_2 \mid X_3)P(X_3) \\[0.5em] &= P(X_1 \mid X_2,X_3)P(X_2, X_3) \end{align*}\]
\[\begin{align*} P(Y \mid X) = \frac{P(X \mid Y)P(Y)}{P(X)} \end{align*}\]
\[\begin{align*} P(Y \mid X) &= \frac{P(X, Y)}{P(X)} = \frac{P(X , Y)}{\sum_{y} P(X , y)} \\[0.5em] P(Y \mid X) &\propto P(X , Y) \\[0.5em] P(Y \mid X) &= \alpha P(X , Y) \textit{~~~~~Note the difference between }\propto\textit{ and }\alpha \\[0.5em] \alpha &= \frac{1}{P(X)} = \frac{1}{\sum_{y} P(X , y)} \end{align*}\]
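For example, a minimal Python sketch of normalization on a hypothetical joint table; note that \(\alpha\) is computed from the unnormalized entries, so \(P(X)\) never needs to be computed separately:

```python
# Normalization on a hypothetical joint table P(X, Y) for binary X and Y.
joint = {("+x", "+y"): 0.12, ("+x", "-y"): 0.28,
         ("-x", "+y"): 0.18, ("-x", "-y"): 0.42}

# P(Y | +x): select the entries consistent with +x, then rescale them so
# they sum to one. alpha = 1 / P(+x) = 1 / sum_y P(+x, y).
unnormalized = {y: p for (x, y), p in joint.items() if x == "+x"}
alpha = 1.0 / sum(unnormalized.values())       # 1 / P(+x) = 1 / 0.4
posterior = {y: alpha * p for y, p in unnormalized.items()}

print(posterior)  # ≈ {'+y': 0.3, '-y': 0.7}
```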
\[\begin{align*} P(X_1,X_2, X_3) &= P(X_1 \mid X_2,X_3)P(X_2, X_3) \\[0.5em] &= P(X_1 \mid X_2,X_3)P(X_2 \mid X_3)P(X_3) \\[0.5em] P(X_1, ..., X_N) &= \prod_{n=1}^{N} P(X_n \mid X_1, ..., X_{n-1}) \end{align*}\]
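The chain rule also runs in the other direction: multiplying properly normalized factors reconstructs a full joint distribution. A minimal Python sketch, with hypothetical factor tables:

```python
# Reconstructing a joint from chain-rule factors. The three (hypothetical)
# tables below are each properly normalized conditional distributions.
p_x1 = {"+x1": 0.6, "-x1": 0.4}
p_x2_given_x1 = {("+x2", "+x1"): 0.7, ("-x2", "+x1"): 0.3,
                 ("+x2", "-x1"): 0.2, ("-x2", "-x1"): 0.8}
p_x3_given_x1x2 = {("+x3", "+x1", "+x2"): 0.9, ("-x3", "+x1", "+x2"): 0.1,
                   ("+x3", "+x1", "-x2"): 0.5, ("-x3", "+x1", "-x2"): 0.5,
                   ("+x3", "-x1", "+x2"): 0.4, ("-x3", "-x1", "+x2"): 0.6,
                   ("+x3", "-x1", "-x2"): 0.3, ("-x3", "-x1", "-x2"): 0.7}

# P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X1, X2)
joint = {}
for x1 in ("+x1", "-x1"):
    for x2 in ("+x2", "-x2"):
        for x3 in ("+x3", "-x3"):
            joint[(x1, x2, x3)] = (p_x1[x1] * p_x2_given_x1[(x2, x1)]
                                   * p_x3_given_x1x2[(x3, x1, x2)])

print(sum(joint.values()))  # ≈ 1.0: a full joint distribution sums to one
```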
\[\begin{align*} P(A) = P(A \mid b_1) P(b_1) + P(A \mid b_2) P(b_2) \end{align*}\] where events \(b_1, b_2\) partition the sample space of events in the world (i.e., they are disjoint and their union makes up the entire sample space). More generically, \[\begin{align*} P(A) = \sum_b P(A \mid b) P(b) = \sum_b P(A, b) \end{align*}\]
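For example, a minimal Python sketch with a hypothetical partition \(\{b_1, b_2\}\):

```python
# Law of total probability over a hypothetical partition {b1, b2}.
p_b = {"b1": 0.3, "b2": 0.7}
p_a_given_b = {("+a", "b1"): 0.9, ("-a", "b1"): 0.1,
               ("+a", "b2"): 0.2, ("-a", "b2"): 0.8}

# P(A) = sum_b P(A | b) P(b), computed for each outcome of A.
p_a = {}
for (a, b), p in p_a_given_b.items():
    p_a[a] = p_a.get(a, 0.0) + p * p_b[b]

print(p_a)  # ≈ {'+a': 0.41, '-a': 0.59}
```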
All of these basic probability rules still hold when conditioning on a set of random variables or outcomes; the conditioning variables simply need to appear in every term of the rule.
Take Bayes' Theorem from above, but now conditioned upon variables \(A\) and \(B\): \[\begin{align*} P(Y \mid X, A, B) = \frac{P(X \mid Y, A, B)P(Y \mid A, B)}{P(X \mid A, B)} \end{align*}\]
Marginalization uses the law of total probability to “sum out” variables from a joint distribution. This is useful when we are given the joint probability distribution and want to find the probability distribution over just a subset of the variables. Marginalization has the following forms:
To sum out a single variable: \[\begin{align*} P(X) = \sum_{y}P(X, y) \end{align*}\]
To sum out multiple variables: \[\begin{align*} P(X) = \sum_{z} \sum_{y} P(X, y, z) \end{align*}\]
This also works for conditional distributions when summing out a variable that is not conditioned upon, i.e. a variable to the left of the \(\mid\): \[\begin{align*} P(A \mid C, d) = \sum_{b} P(A, b \mid C, d) \end{align*}\]
This does NOT work when summing over a variable that is conditioned upon, i.e. a variable to the right of the \(\mid\): \[\begin{align*} P(A, b \mid C) \neq \sum_{d} P(A, b \mid C, d) \end{align*}\]
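A minimal Python sketch of the first case, summing out a variable on the left of the \(\mid\), with a hypothetical conditional table:

```python
# Summing out a variable on the left of the bar, with a hypothetical
# conditional table P(A, B | +c). Its four entries sum to one because
# everything left of the bar ranges over all outcome combinations.
p_AB_given_c = {("+a", "+b"): 0.10, ("+a", "-b"): 0.30,
                ("-a", "+b"): 0.25, ("-a", "-b"): 0.35}

# P(A | +c) = sum_b P(A, b | +c)
p_A_given_c = {}
for (a, b), p in p_AB_given_c.items():
    p_A_given_c[a] = p_A_given_c.get(a, 0.0) + p

print(p_A_given_c)  # ≈ {'+a': 0.4, '-a': 0.6}
```

To eliminate a variable on the right of the \(\mid\), weight by its probability instead, using the law of total probability: \(P(A, b \mid C) = \sum_{d} P(A, b \mid C, d)P(d \mid C)\).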
If two variables \(X\) and \(Y\) are independent (\(X \perp\mkern-10mu\perp Y\)), by definition the following are true:
\(P(X,Y) = P(X)P(Y)\)
\(P(X) = P(X \mid Y)\)
\(P(Y) = P(Y \mid X)\)
If two variables \(X\) and \(Y\) are conditionally independent given \(Z\) (\(X \perp\mkern-10mu\perp Y \mid Z\)), by definition the following are true:
\(P(X,Y \mid Z) = P(X \mid Z)P(Y \mid Z)\)
\(P(X \mid Y,Z) = P(X \mid Z)\)
\(P(Y \mid X,Z) = P(Y \mid Z)\)
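Independence can also be verified numerically from a joint table. A minimal sketch, using a hypothetical joint chosen so that it factors as \(P(X)P(Y)\):

```python
# Checking independence numerically on a hypothetical joint P(X, Y) that
# was chosen to factor as P(X) P(Y).
joint = {("+x", "+y"): 0.12, ("+x", "-y"): 0.28,
         ("-x", "+y"): 0.18, ("-x", "-y"): 0.42}

# Marginals, by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# X and Y are independent iff P(x, y) = P(x) P(y) for every outcome pair
# (a small tolerance absorbs floating-point error).
independent = all(abs(p - p_x[x] * p_y[y]) < 1e-12
                  for (x, y), p in joint.items())
print(independent)  # True
```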
Given: \(P(B \mid A)\), \(P(A)\)
Query: \(P(A \mid b)\)
1. Construct the joint distribution (use the product rule or chain rule).
Product Rule¹: \(P(B,A) = P(B \mid A)P(A)\)
2. Answer the query from the joint distribution (use conditional probability or the law of total probability).
By the definition of Conditional Probability, \(P(A \mid b) = \frac{P(b,A)}{P(b)}\)
By the Law of Total Probability, \(P(A \mid b) = \frac{P(b,A)}{\sum_{a}P(b,a)}\)
¹ Note that the product rule is a special case of the chain rule.
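Putting both steps together, a minimal Python sketch with hypothetical values for \(P(A)\) and \(P(B \mid A)\):

```python
# The two inference steps above, run on hypothetical numbers.
# Given: P(A) and P(B | A). Query: P(A | +b).
p_a = {"+a": 0.25, "-a": 0.75}
p_b_given_a = {("+b", "+a"): 0.8, ("-b", "+a"): 0.2,
               ("+b", "-a"): 0.4, ("-b", "-a"): 0.6}

# Step 1: construct the joint via the product rule, P(B, A) = P(B | A) P(A).
joint = {(b, a): p_b_given_a[(b, a)] * p_a[a]
         for b in ("+b", "-b") for a in ("+a", "-a")}

# Step 2: answer the query, P(A | +b) = P(+b, A) / sum_a P(+b, a).
unnormalized = {a: joint[("+b", a)] for a in p_a}
p_plus_b = sum(unnormalized.values())      # P(+b), by total probability
posterior = {a: p / p_plus_b for a, p in unnormalized.items()}

print(posterior)  # ≈ {'+a': 0.4, '-a': 0.6}
```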
When representing probabilities with capital letters, e.g. \(P(A, B)\), we are referring to all the combinations of outcomes that the discrete random variables can have. Thus, we have a table of probabilities rather than a single value. This is also true for conditional probabilities, e.g. \(P(A, B \mid C)\). When there is a mixture of capital letters and lower case letters, e.g. \(P(A, b \mid C, d)\), the table contains all the combinations of outcomes for the random variables, \(A\) and \(C\) (while the discrete values \(b\) and \(d\) are fixed).
It is important to understand when a probability table contains the complete distribution, or in other words, when a probability table sums to one.
A probability table will sum to one when:
there is exactly one specific combination of outcomes that is conditioned upon, and
we are considering all possible combinations of the other random variables.
Another way to phrase this: a probability table will sum to one when:
there are no capital letters on the right-hand side of the \(\mid\), and
there are only capital letters on the left-hand side.
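A final sketch of this rule, with hypothetical tables:

```python
# Hypothetical tables illustrating the rule. P(A | +b) fixes one outcome
# on the right of the bar and varies A on the left, so it sums to one.
p_A_given_plus_b = {"+a": 0.4, "-a": 0.6}
print(sum(p_A_given_plus_b.values()))   # 1.0

# P(+a | B) has a random variable on the right of the bar: each entry
# comes from a different distribution, so the sum is not constrained.
p_plus_a_given_B = {"+b": 0.4, "-b": 0.7}
print(sum(p_plus_a_given_B.values()))   # ≈ 1.1, not a complete distribution
```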