One-sample tests are tests in which we evaluate a single sample, without comparing it to any other. The distinction between these and two-sample tests will become apparent in the next section.
Bernoulli random variables are random variables that take one of two values. For convenience, let us represent these values as 1 and 0. So, formally, a Bernoulli RV has the form X = \begin{cases} 1 & \text{with probability } P \\ 0 & \text{with probability } (1-P) \end{cases}
In the discussion below, we will refer to the X=1 outcome as a “positive” outcome (this is only terminology; there need not be anything positive about the outcome that maps to X=1). There are many examples of Bernoulli RVs. We are familiar with the coin flip, which may be either heads or tails. But many common problems can be modelled by Bernoulli random variables. For instance, the output of a classifier on a single test instance is either correct or incorrect.
Many more such instances can be found.
The hypothesis testing problem for Bernoulli variables is as follows. The null hypothesis H_0 is that the Bernoulli parameter P, representing the probability of a positive outcome, has some value P_0. For example, it may be the claim that a given classifier has an accuracy P_0. The alternate hypothesis H_a is that the true value of the Bernoulli parameter is not P_0.
One-sided test:
One-sided tests may either be left-sided or right-sided.
In the right-sided test, the alternate hypothesis claims specifically that the true P is greater than the value P_0 claimed by the null hypothesis, e.g. we may claim that the classifier has an accuracy greater than P_0. This may happen, for instance, if your company claims that a classifier you have developed is no better than the default classifier used by the company, which has accuracy P_0.
To test the hypothesis, N samples X_1, X_2, \cdots, X_N are obtained from the process. We compute the statistic Y = \sum_{i=1}^N X_i. Y may represent the number of samples (from X_1, \cdots, X_N) that were correctly classified, for instance. Y has a Binomial distribution: P_Y(K) = {N \choose K}P^K (1-P)^{N-K}.
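As a quick check of this formula, here is a small sketch (assuming Python with numpy and scipy available; the parameter values are arbitrary illustrations) that simulates many batches of N Bernoulli draws and compares the empirical distribution of Y with the Binomial PMF:

```python
import numpy as np
from scipy.stats import binom

N, P = 10, 0.5                         # illustrative values only
rng = np.random.default_rng(0)

# 100,000 simulated tests; each row is one batch of N Bernoulli draws
X = (rng.random((100_000, N)) < P).astype(int)
Y = X.sum(axis=1)                      # the statistic Y for each simulated test

for K in range(N + 1):
    empirical = np.mean(Y == K)        # fraction of simulated tests with Y = K
    exact = binom.pmf(K, N, P)         # {N choose K} P^K (1-P)^(N-K)
    print(f"K={K:2d}  empirical={empirical:.4f}  Binomial PMF={exact:.4f}")
```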
We know that if the true Bernoulli parameter is greater than that proposed by the null hypothesis, i.e. if P > P_0, then the number of positive outcomes will tend to be greater than what we would obtain under the null hypothesis. Under the null hypothesis (P = P_0) the expected number of positive outcomes in N trials is NP_0. We will reject H_0 (and accept H_a) if Y \geq \theta_r, i.e. if Y is greater than or equal to a threshold \theta_r that is sufficiently larger than NP_0.
Since we want to limit the chance of a type-1 error (rejecting H_0 when it is true), we specify an acceptable probability of type-1 error, \alpha_z. Typical values (as mentioned earlier) are 0.01 and 0.05.
The probability of type-1 error using a threshold \theta is the probability of obtaining Y \geq \theta under the null hypothesis, and is given by \alpha(\theta) = \sum_{K=\theta}^N {N \choose K}P_0^K (1-P_0)^{N-K}
For the right-sided test, we choose the threshold \theta_r as the smallest \theta such that \alpha(\theta) \leq \alpha_z.
Consider, for instance, that our desired \alpha_z = 0.05. For N=10 trials with P_0 = 0.5, we obtain \alpha(6) = 0.38, which is too high. Trying other values of \theta, \alpha(7) = 0.172, \alpha(8) = 0.055, and \alpha(9) = 0.011. So clearly, the smallest value of \theta that gives us a sufficiently low probability of type-1 error (i.e. \alpha(\theta) \leq \alpha_z) is 9. We can only reject the null hypothesis of P = P_0 = 0.5 in favor of P > 0.5 with greater than 95% confidence if we get 9 or more positive outcomes in 10 trials, i.e. if we use a threshold \theta_r = 9.
Caption: The rejection region Y \geq 8 has an \alpha value of 0.055. For \alpha_z = 0.05, we must actually choose a higher threshold for this test.
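The threshold search above is easy to automate. The sketch below (a minimal illustration assuming scipy; the function name is ours, not a standard API) computes \alpha(\theta) = P(Y \geq \theta) under the null hypothesis and returns the smallest \theta that meets the desired bound:

```python
from scipy.stats import binom

def right_sided_threshold(N, P0, alpha_z):
    """Smallest theta such that P(Y >= theta) <= alpha_z under H0: P = P0."""
    for theta in range(N + 1):
        # binom.sf(theta - 1, N, P0) = P(Y > theta - 1) = P(Y >= theta)
        if binom.sf(theta - 1, N, P0) <= alpha_z:
            return theta
    return None  # alpha_z is unachievable with only N samples

# The example above: N = 10, P0 = 0.5, alpha_z = 0.05 gives theta_r = 9
print(right_sided_threshold(10, 0.5, 0.05))
```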
A left-sided test would be similarly derived. Now we must find the threshold \theta_r as the largest \theta such that \alpha(\theta) = \sum_{K=0}^{\theta} {N \choose K}P_0^K(1-P_0)^{N-K} is at most \alpha_z.
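The analogous left-sided search (again just an illustrative sketch) scans downward from N and returns the largest \theta whose left-tail probability is at most \alpha_z:

```python
from scipy.stats import binom

def left_sided_threshold(N, P0, alpha_z):
    """Largest theta such that P(Y <= theta) <= alpha_z under H0: P = P0."""
    for theta in range(N, -1, -1):
        if binom.cdf(theta, N, P0) <= alpha_z:   # P(Y <= theta)
            return theta
    return None  # even Y = 0 is not unlikely enough at this alpha_z

# For N = 10, P0 = 0.5, alpha_z = 0.05, this gives theta_r = 1 (reject if Y <= 1)
print(left_sided_threshold(10, 0.5, 0.05))
```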
Two-sided test:
In the two-sided test, we do not specify whether the true P value is greater or less than the value P_0 proposed by the null hypothesis. Our alternate hypothesis is only that P \neq P_0.
So now, we try to find a threshold \theta_r such that P(|Y - NP_0| \geq \theta_r) \leq \alpha_z, i.e. such that P(|Y - NP_0| \geq \theta_r) = P(Y \leq NP_0 -\theta_r) + P(Y \geq NP_0 + \theta_r) \leq \alpha_z
For the Binomial RV, under the null hypothesis, for any threshold \theta, we have P(Y \leq NP_0-\theta) = \sum_{K=0}^{NP_0-\theta}{N \choose K}P_0^K (1-P_0)^{N-K}, \\ P(Y \geq NP_0+\theta) = \sum_{K=NP_0+\theta}^{N}{N \choose K}P_0^K (1-P_0)^{N-K} and \alpha(\theta) = \sum_{K=0}^{NP_0-\theta}{N \choose K}P_0^K (1-P_0)^{N-K} + \sum_{K=NP_0+\theta}^{N}{N \choose K}P_0^K (1-P_0)^{N-K}
We find the smallest \theta value such that \alpha(\theta) \leq \alpha_z, and set that \theta to be \theta_r.
For our coin problem, with the null hypothesis of P_0 = 0.5 and N = 10 trials, we get \alpha(6) = 0, \alpha(5) = 0.002, \alpha(4) = 0.02, and \alpha(3) = 0.11. Thus, if we set our maximum acceptable probability of type-1 error \alpha_z = 0.05, we find that we must set \theta_r = 4, and our rejection region to be Y \geq 9 ~or~ Y \leq 1.
Caption: The rejection region |Y - 5| \geq 3 has an \alpha value of 0.11. For \alpha_z = 0.05, we must actually choose a higher threshold, of \theta = 4, for this test.
Note that if we had set \alpha_z = 0.001, we could never have conducted a two-sided test on this sample, since even the most extreme non-empty rejection region (Y = 0 or Y = 10, i.e. \theta = 5) has an \alpha value of 0.002. In that case you would need more samples in your test. With 20 samples, we find that we can use a threshold of \theta_r = 8, i.e. a rejection region of Y \leq 2 ~or~ Y \geq 18.
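The same kind of search works for the two-sided case. The sketch below (illustrative only; it assumes NP_0 is an integer, as in these examples) computes \alpha(\theta) = P(|Y - NP_0| \geq \theta) and reports the smallest acceptable \theta, reproducing the numbers above:

```python
from scipy.stats import binom

def two_sided_threshold(N, P0, alpha_z):
    """Smallest theta such that P(|Y - N*P0| >= theta) <= alpha_z under H0: P = P0."""
    mean = int(N * P0)                 # assumes N*P0 is an integer, as in the examples
    for theta in range(N + 1):
        # P(Y <= mean - theta) + P(Y >= mean + theta)
        alpha = binom.cdf(mean - theta, N, P0) + binom.sf(mean + theta - 1, N, P0)
        if alpha <= alpha_z:
            return theta
    return None

print(two_sided_threshold(10, 0.5, 0.05))    # -> 4  (reject if Y <= 1 or Y >= 9)
print(two_sided_threshold(20, 0.5, 0.001))   # -> 8  (reject if Y <= 2 or Y >= 18)
```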
Typical Problem:
Learning algorithm A has an accuracy of 80% on some problem. You have developed a new algorithm B. If you test 1000 samples, how many of them must your new algorithm classify correctly before you can be 95% confident that your new algorithm is superior to the default algorithm?
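A sketch of how one might compute the answer with the exact right-sided procedure described above (the large-sample approximation described in the next section could be used instead):

```python
from scipy.stats import binom

# H0: the new algorithm B is no better than A, i.e. its accuracy is P0 = 0.8.
# We reject H0 (conclude B is superior) with 95% confidence only if the number
# of correctly classified samples Y reaches the right-sided threshold.
N, P0, alpha_z = 1000, 0.8, 0.05

theta_r = next(theta for theta in range(N + 1)
               if binom.sf(theta - 1, N, P0) <= alpha_z)   # P(Y >= theta) <= alpha_z
print(f"Reject H0 (B is better than A) if Y >= {theta_r} correct out of {N}")
```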
Large-sample approximations to Bernoulli variables
For large N, working out the Binomial probability distribution values above can become expensive. In these cases we can obtain an excellent approximation using the Central Limit Theorem.
Our statistic Y = \sum_{i=1}^NX_i is a sum of N draws from the same distribution. Under the null hypothesis (that the Bernoulli parameter P=P_0), the expected value of Y = E[Y] = NP_0. If the N draws are independent, the variance of Y is simply the sum of the variances of the N draws: var(Y) = \sum_{i=1}^N var(X_i). Since each X_i is a Bernoulli RV with variance P_0(1-P_0), we obtain var(Y) = NP_0(1-P_0).
The central limit theorem tells us that for large N, Z = \frac{Y - E[Y]}{\sqrt{var(Y)}} = \frac{Y - NP_0}{\sqrt{NP_0(1-P_0)}} has (approximately) a standard normal PDF.
Caption: A Gaussian PDF. The numbers to the left of the mean at any x are the CDF \Phi(x), while the numbers to the right are 1 - \Phi(x). Note the symmetry.
Also, for any Z_\theta, Z \leq Z_\theta \Rightarrow \frac{Y - NP_0}{\sqrt{NP_0(1-P_0)}} \leq Z_\theta \\ \Rightarrow Y \leq NP_0 + Z_\theta\sqrt{NP_0(1-P_0)}
Thus P(Z \leq Z_\theta) = P(Y \leq NP_0 + Z_\theta\sqrt{NP_0(1-P_0)}).
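As a quick numerical check of this approximation (with arbitrary illustrative values), one can compare exact Binomial tail probabilities with their Gaussian approximations:

```python
import numpy as np
from scipy.stats import binom, norm

N, P0 = 100, 0.5                              # illustrative values
mean, std = N * P0, np.sqrt(N * P0 * (1 - P0))

for y in (55, 60, 65):
    exact = binom.sf(y - 1, N, P0)            # P(Y >= y), exact Binomial
    approx = 1 - norm.cdf((y - mean) / std)   # P(Z >= (y - mean)/std), CLT approximation
    print(f"y={y}: exact={exact:.4f}  normal approximation={approx:.4f}")
```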
Left-sided: For left-sided tests, our objective is to find a threshold \theta_r such that P(Y \leq \theta_r) \leq \alpha_z under the null hypothesis. To do so, we can instead find a Z_\alpha such that P(Z \leq Z_\alpha) = \alpha_z, and set \theta_r = \lfloor NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)} \rfloor. Since Z is distributed according to a standard normal, P(Z \leq Z_\alpha) = \Phi(Z_\alpha), the CDF of a standard normal. \Phi(Z_\alpha) can be obtained from standard normal tables.
A caveat is that the standard normal tables actually provide P(Z \leq Z_\alpha) only for Z_\alpha \geq 0. But since the normal distribution is symmetric, for Z_\alpha \leq 0 we can use the fact (refer to the Gaussian figure above) that P(Z \leq Z_\alpha) = P(Z \geq -Z_\alpha) = 1 - \Phi(-Z_\alpha). So, to find a Z_\alpha such that P(Z \leq Z_\alpha) = \alpha_z, we must first find the smallest C such that \Phi(C) \geq 1 - \alpha_z from the table and set Z_\alpha = -C.
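If you are using software rather than printed tables, the table lookup and the symmetry argument can be replaced by a single call to the standard normal quantile function (scipy's norm.ppf), as in this small sketch:

```python
from scipy.stats import norm

alpha_z = 0.05
Z_alpha = norm.ppf(alpha_z)        # the point with P(Z <= Z_alpha) = alpha_z (negative for alpha_z < 0.5)
print(Z_alpha)                     # approximately -1.645

# Equivalent to the table-based recipe: Z_alpha = -C, where C satisfies Phi(C) = 1 - alpha_z
print(-norm.ppf(1 - alpha_z))
```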
Consider the example of the coin. We have a null hypothesis that P_0 = 0.5. Let's assume a test of N=100 trials of the coin. Our alternate hypothesis is that P < P_0. We would like to be 95% confident when we reject the null hypothesis, so we set \alpha_z = 0.05. From the standard normal tables, we find that C = 1.65 is the smallest C value for which \Phi(C) \geq 0.95. Hence, we must set Z_\alpha = -1.65, from which we obtain \theta_r = \lfloor NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)} \rfloor = \lfloor 100\times 0.5 - 1.65\sqrt{100\times 0.5\times 0.5} \rfloor = \lfloor 41.75 \rfloor = 41. I.e. if 41 or fewer tosses in 100 are heads, we can be 95% sure that the coin is not fair.
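The full left-sided calculation for this example, as a short sketch (same illustrative numbers as above):

```python
import numpy as np
from scipy.stats import norm

N, P0, alpha_z = 100, 0.5, 0.05
Z_alpha = norm.ppf(alpha_z)                                   # about -1.645
theta_r = int(np.floor(N * P0 + Z_alpha * np.sqrt(N * P0 * (1 - P0))))
print(theta_r)   # 41: reject the fair-coin hypothesis if 41 or fewer heads are observed
```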
Right-sided: For right-sided tests our alternate hypothesis is that P > P_0, so we need to find a \theta_r such that if Y \geq \theta_r, we can confidently conclude that our alternate hypothesis is correct. Note that under the null hypothesis, P(Z \geq Z_\alpha) = P(Y \geq NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)}). So, in order to find a \theta_r such that P(Y \geq \theta_r) \leq \alpha_z, we need to find Z_\alpha such that P(Z \geq Z_\alpha) \leq \alpha_z, and set \theta_r = NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)}.
To find the necessary Z_\alpha from the standard normal table, we note that P(Z \geq Z_\alpha) = 1 - \Phi(Z_\alpha). So we must find the smallest C such that \Phi(C) \geq 1 - \alpha_z, and set Z_\alpha = C. \theta_r can now be computed as \theta_r = \lceil NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)} \rceil.
For our coin example, if our alternate hypothesis is that P > P_0, for a test of N=100 trials, and setting our threshold on the probability of type 1 error at 0.05 as before, we find that the smallest C for which \Phi(C) \geq 0.95 is C = 1.65. So, setting Z_\alpha = 1.65, we get \theta_r = 59. We must obtain at least 59 heads in 100 tosses to be 95% confident that P > 0.5.
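The corresponding right-sided computation, sketched with the same illustrative values:

```python
import numpy as np
from scipy.stats import norm

N, P0, alpha_z = 100, 0.5, 0.05
Z_alpha = norm.ppf(1 - alpha_z)                               # about +1.645
theta_r = int(np.ceil(N * P0 + Z_alpha * np.sqrt(N * P0 * (1 - P0))))
print(theta_r)   # 59: at least 59 heads in 100 tosses are needed to conclude P > 0.5
```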
Two-sided: For two-sided tests our alternate hypothesis is that P \neq P_0, so we need to find a \theta_r such that if |Y - NP_0| \geq \theta_r, that is, if Y \geq NP_0 + \theta_r or Y \leq NP_0 - \theta_r, we can conclude that our alternate hypothesis is correct (e.g. that the coin is not fair).
We note that for large N the distribution of Y is (approximately) symmetric, so P(Y \geq NP_0 + \theta_r) = P(Y \leq NP_0 - \theta_r). Hence, P(|Y - NP_0| \geq \theta_r) = 2P(Y \leq NP_0 - \theta_r). So, for any \alpha_z, we need to find a \theta_r such that 2P(Y \leq NP_0 - \theta_r) \leq \alpha_z, or P(Y \leq NP_0 - \theta_r) \leq \frac{\alpha_z}{2}. The procedure for finding \theta_r is hence identical to that for the right-sided test, with \alpha_z replaced by 0.5\alpha_z.
We must find the smallest Z_{\frac{\alpha}{2}} from the normal table such that \Phi(Z_{\frac{\alpha}{2}}) \geq 1 - \frac{\alpha_z}{2}. Note that unlike in the left- and right-sided tests, \theta_r is now not the actual threshold, but the distance from the mean NP_0. So we can compute \theta_r from Z_{\frac{\alpha}{2}} as \theta_r = \lceil Z_{\frac{\alpha}{2}} \sqrt{NP_0(1-P_0)}\rceil.
For our coin example, if our alternate hypothesis is that P \neq P_0, for a test of N=100 trials, and setting our threshold on the probability of type 1 error at 0.05 as before, we find that Z_{\frac{\alpha}{2}} = 1.96, giving us \theta_r = \lceil 1.96 \sqrt{100\times 0.5\times 0.5} \rceil = 10. We must obtain 40 or fewer heads, or 60 or more heads, to be 95% confident that the coin is unfair.
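And the two-sided version, again only as an illustrative sketch, which reproduces the rejection region above:

```python
import numpy as np
from scipy.stats import norm

N, P0, alpha_z = 100, 0.5, 0.05
Z_half = norm.ppf(1 - alpha_z / 2)                            # about 1.96
theta_r = int(np.ceil(Z_half * np.sqrt(N * P0 * (1 - P0))))   # distance from the mean N*P0
print(f"theta_r = {theta_r}: reject H0 if Y <= {int(N * P0 - theta_r)} or Y >= {int(N * P0 + theta_r)}")
```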