Point Estimators

Consider the following problem. We have $N$ samples $X_1, X_2, \cdots, X_N$ that have been drawn independently from a probability distribution $P_X(x)$ with unknown mean $\mu$ and variance $\sigma^2$. We would like to estimate this mean and variance.

The estimate for the mean of the distribution can be derived from the sample as the sample average. \[ \hat{\mu} = \frac{1}{N} \sum_{i=1}^N X_i \]

The estimate for the variance of the distribution can be derived from the sample as follows: \[ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^N (X_i - \hat{\mu})^2 \]

These estimators are instances of point estimators.
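As a concrete illustration, here is a minimal Python sketch (using NumPy; the exponential distribution and its parameters are just an arbitrary choice for the demonstration) that computes these two point estimates from a set of samples:

```python
import numpy as np

def point_estimates(samples):
    """Return the sample-mean estimate of mu and the unbiased estimate of sigma^2."""
    N = len(samples)
    mu_hat = np.sum(samples) / N                             # (1/N) * sum_i X_i
    sigma2_hat = np.sum((samples - mu_hat) ** 2) / (N - 1)   # note: divide by N-1, not N
    return mu_hat, sigma2_hat

# Example: estimates from 100 draws of an exponential distribution with mean 2 and variance 4
X = np.random.default_rng(0).exponential(scale=2.0, size=100)
print(point_estimates(X))   # should come out close to (2.0, 4.0)
```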

Bernoulli example

Consider a Bernoulli random variable $X$ (which only takes values $0$ or $1$) with $P_X(1) = P$. The expected value of $X$ is $E[X] = P$ as we saw in class.

Let $X_1, X_2, \cdots, X_N$ be $N$ random draws. E.g., $X$ could represent a coin with heads mapped to $1$ and tails to $0$, and $P(\text{heads}) = P$. Then $X_1, \cdots, X_N$ would be the outcomes of $N$ tosses of the coin.

The estimate for $E[X] = P$ is given by \[ \hat{P} = \frac{1}{N} \sum_{i=1}^N X_i = \frac{\#heads}{N} \]
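For instance, a short simulation (a sketch using NumPy; the true $P$ and number of tosses are arbitrary illustrative choices) shows the estimate in action:

```python
import numpy as np

rng = np.random.default_rng(1)
P_true = 0.3          # assumed true probability of heads (illustrative choice)
N = 1000              # number of coin tosses

tosses = rng.binomial(1, P_true, size=N)   # each toss: 1 = heads, 0 = tails
P_hat = tosses.sum() / N                   # #heads / N
print(P_hat)                               # typically close to 0.3
```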

Gaussian example

Consider a random variable $X$ with mean $\mu$ and variance $\sigma^2$. E.g., $X$ could be the height of a randomly chosen person in the world, $\mu$ the overall average height, and $\sigma^2$ the overall variance in height across all the people in the world. We wish to estimate the mean $\mu$ and the variance $\sigma^2$.

Let $X_1, X_2, \cdots, X_N$ be $N$ random draws, e.g. they could be the heights of $N$ randomly selected people. Then the point estimates of the global population mean and variance can be obtained by direct application of the formulae given above.
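As a sketch (with an assumed "true" population mean and standard deviation standing in for the real-world values), the same formulae applied to Gaussian draws look like this:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma_true = 170.0, 10.0          # assumed population mean/std in cm, for illustration
heights = rng.normal(mu_true, sigma_true, size=500)   # N = 500 randomly selected people

mu_hat = heights.mean()                    # sample average
sigma2_hat = heights.var(ddof=1)           # unbiased variance estimate (divides by N-1)
print(mu_hat, sigma2_hat)                  # close to 170 and 100
```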

Bias of estimators

Note that the estimate for the variance divides by $N-1$ and not $N$, and hence is not the sample variance of $X_1, \cdots, X_N$, which is given by $S^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \hat{\mu})^2$.

The reason is that the estimate $\hat{\sigma}^2$ above is an unbiased estimate of $\sigma^2$, whereas $S^2$ is a biased estimate.

To understand these terms, consider the estimate of the mean $\hat{\mu}$. The expected value of this estimate is the true mean: \[ E[\hat{\mu}] = E\left[\frac{1}{N} \sum_{i=1}^N X_i \right] = \frac{1}{N} \sum_{i=1}^N E[X_i] = \frac{1}{N} \sum_{i=1}^N \mu = \mu \]

In lay terms, if we were to conduct a large number of experiments, where each time we draw $N$ samples from the population and compute the sample mean, then the average of all of these sample means would be the true population mean.

In other words, the estimate for the mean, $\hat{\mu}$, is unbiased.
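This "many experiments" interpretation is easy to check numerically. The sketch below (assuming Gaussian data; the parameters and number of repeated experiments are arbitrary) averages the sample means from many independent draws:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, N, n_experiments = 5.0, 2.0, 20, 100_000

# Each row is one experiment of N samples; each row's mean is one estimate mu_hat.
samples = rng.normal(mu, sigma, size=(n_experiments, N))
mu_hats = samples.mean(axis=1)

print(mu_hats.mean())   # average of all the sample means: very close to mu = 5.0
```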

On the other hand, consider the sample variance $S^2$. The expected value of the sample variance is given by \[ E[S^2] = E\left[\frac{1}{N} \sum_{i=1}^N (X_i - \frac{1}{N}\sum_{j=1}^N X_j)^2\right] \] Here we've written out the explicit form of $\hat{\mu}$. We leave it as an exercise to show that this works out to $\frac{N-1}{N}\sigma^2$.

In other words, if we were to conduct many experiments, where each time we draw $N$ samples from the population and compute the sample variance $S^2$, then the average of all of these sample variances would fall short of the true variance $\sigma^2$ by a factor of $\frac{N-1}{N}$. The sample variance is a biased estimate of the true variance. However, we do note that as $N$ increases, $\frac{N-1}{N}$ tends to 1, and the expected value of the sample variance approaches the true variance. Hence we say the sample variance is an asymptotically unbiased estimator of $\sigma^2$, i.e. as $N \rightarrow \infty$, the bias tends to 0.

On the other hand, the variance estimator $\hat{\sigma}^2$ we gave earlier (with a denominator of $N-1$) is a truly unbiased estimate of the true variance $\sigma^2$.
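Both claims can be checked with a quick Monte Carlo experiment (a sketch with Gaussian data; the specific parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, N, n_experiments = 0.0, 3.0, 10, 200_000
sigma2 = sigma ** 2

samples = rng.normal(mu, sigma, size=(n_experiments, N))
S2 = samples.var(axis=1, ddof=0)          # sample variance, divides by N
sigma2_hat = samples.var(axis=1, ddof=1)  # unbiased estimator, divides by N-1

print(S2.mean(), (N - 1) / N * sigma2)    # both approx 8.1: E[S^2] = (N-1)/N * sigma^2
print(sigma2_hat.mean(), sigma2)          # both approx 9.0: E[sigma2_hat] = sigma^2
```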

Variance of estimators

Consider the variance of the estimator $\hat{\mu}$. \[ \begin{aligned} Var(\hat{\mu}) &= Var\left(\frac{1}{N}\sum_{i=1}^N X_i\right) \\ &= \frac{1}{N^2} Var\left(\sum_{i=1}^N X_i\right) \\ &= \frac{1}{N^2} \sum_{i=1}^N Var(X_i) \quad \text{(since the draws are independent)} \\ &= \frac{1}{N^2} \sum_{i=1}^N \sigma^2 \end{aligned} \] leading to \[ Var(\hat{\mu}) = \frac{\sigma^2}{N} \]

The variance of the estimator $\hat{\mu}$ characterizes how much sample estimates of the mean, $\hat{\mu}$, would vary from one another across different experiments in which we drew $N$ samples. What the above equation tells us is that this variance decreases as $1/N$, and goes to 0 as $N$ tends to infinity. For very large $N$, any draw of $N$ samples from the population would give you more or less the same estimate for the mean.
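The $\sigma^2/N$ behaviour can also be seen empirically. The sketch below (Gaussian data, arbitrary parameters) compares the spread of $\hat{\mu}$ across experiments with $\sigma^2/N$ for a few sample sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n_experiments = 1.0, 2.0, 10_000

for N in (10, 100, 1000):
    mu_hats = rng.normal(mu, sigma, size=(n_experiments, N)).mean(axis=1)
    print(N, mu_hats.var(), sigma ** 2 / N)   # empirical Var(mu_hat) vs sigma^2 / N
```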

Similarly, for Gaussian samples we would find that the variance of the sample variance $S^2$ is given by $Var(S^2) = \frac{2(N-1)\sigma^4}{N^2}$, and the variance of the unbiased estimator $\hat{\sigma}^2$ is given by $Var(\hat{\sigma}^2) = \frac{2\sigma^4}{N-1}$.
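These expressions hold for Gaussian samples, and a short simulation (a sketch with arbitrary parameters) can be used to verify them:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, N, n_experiments = 2.0, 10, 500_000
sigma4 = sigma ** 4

samples = rng.normal(0.0, sigma, size=(n_experiments, N))
S2 = samples.var(axis=1, ddof=0)          # sample variance (divides by N)
sigma2_hat = samples.var(axis=1, ddof=1)  # unbiased estimator (divides by N-1)

print(S2.var(), 2 * (N - 1) * sigma4 / N ** 2)   # Var(S^2)         = 2(N-1)sigma^4 / N^2
print(sigma2_hat.var(), 2 * sigma4 / (N - 1))    # Var(sigma2_hat)  = 2 sigma^4 / (N-1)
```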

Bias, Variance and Consistency

Low variance is a desired characteristic of an estimator, because we would prefer low variability between experiments. Ideally, the variance of the estimator should also decrease with increasing $N$; that way, we can reduce the variation between experiments by increasing the sample size.

Low variance does not mean low bias. We can have biased estimators with low variance, e.g. if we always estimate the mean of the distribution as 0. The estimate is wrong and has a bias of $-\mu$, but has zero variance.

We can have unbiased estimators with high variance. For example, if we estimated the mean of a distribution as $\frac{1}{N}\sum_{i=1}^N X_i + z$, where $z$ is independent noise drawn from a standard normal, then the variance of the estimator would never go below 1.
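The two failure modes can be made concrete with a small simulation (a sketch; both "estimators" below are deliberately artificial, and the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, N, n_experiments = 5.0, 1.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(n_experiments, N))

zero_est = np.zeros(n_experiments)                                  # always estimate 0: biased, zero variance
noisy_est = samples.mean(axis=1) + rng.normal(size=n_experiments)   # unbiased, but variance >= 1

print(zero_est.mean() - mu, zero_est.var())     # bias = -mu, variance = 0
print(noisy_est.mean() - mu, noisy_est.var())   # bias ~ 0, variance ~ sigma^2/N + 1 > 1
```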

Unbiased estimators whose variance goes to 0 as $N$ increases are said to be consistent. Both $\hat{\mu}$ and $\hat{\sigma}^2$ are consistent estimators. Consistent estimators are desirable because we want to be sure that as the sample size increases, our estimates will converge to the true value with low variability.