One-sample tests are tests in which we evaluate a single sample, without comparing it to any other. The distinction between these and two-sample tests will become apparent in the next section.
Bernoulli random variables are random variables that take one of two values. For convenience, let us represent these values as 1 and 0. So, formally, a Bernoulli RV has the form X = \begin{cases} 1 & \text{with probability } P \\ 0 & \text{with probability } (1-P) \end{cases}
In the discussion below, we will refer to the X=1 outcome as a “positive” outcome (this is only terminology; there need not be anything positive about the outcome that maps to X=1). There are many examples of Bernoulli RVs. We are familiar with the coin flip, which may be either heads or tails. But many common problems can be modelled by Bernoulli random variables. For instance, the output of a classifier on a single test instance is either correct or incorrect.
Many more such instances can be found.
The hypothesis testing problem for Bernoulli variables is as follows. The null hypothesis H_0 is that the Bernoulli parameter P, representing the probability of a positive outcome, has some value P_0. For example, it may be the claim that a given classifier has an accuracy P_0. The alternate hypothesis H_a is that the true value of the Bernoulli parameter is not P_0.
One-sided test:
One-sided tests may either be left-sided or right-sided.
In the right-sided test, the alternate hypothesis claims specifically that the true P is greater than the value P_0 claimed by the null hypothesis, e.g. we may claim that the classifier has an accuracy greater than P_0. This may happen, for instance, if your company claims that a classifier you have developed is no better than the default classifier used by the company, which has accuracy P_0.
To test the hypothesis, N samples X_1, X_2, \cdots, X_N are obtained from the process. We compute the statistic Y = \sum_{i=1}^N X_i. Y may represent the number of samples (from X_1, \cdots, X_N) that were correctly classified, for instance. Y has a Binomial distribution: P_Y(K) = {N \choose K}P^K (1-P)^{N-K}.
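As a quick check of this formula, here is a small sketch (assuming Python with numpy and scipy available; the parameter values are arbitrary illustrations) that simulates many batches of N Bernoulli draws and compares the empirical distribution of Y with the Binomial PMF:

```python
import numpy as np
from scipy.stats import binom

N, P = 10, 0.5                         # illustrative values only
rng = np.random.default_rng(0)

# 100,000 simulated tests; each row is one batch of N Bernoulli draws
X = (rng.random((100_000, N)) < P).astype(int)
Y = X.sum(axis=1)                      # the statistic Y for each simulated test

for K in range(N + 1):
    empirical = np.mean(Y == K)        # fraction of simulated tests with Y = K
    exact = binom.pmf(K, N, P)         # {N choose K} P^K (1-P)^(N-K)
    print(f"K={K:2d}  empirical={empirical:.4f}  Binomial PMF={exact:.4f}")
```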
We know that if the true Bernoulli parameter is greater than that proposed by the null hypothesis, i.e. if P > P_0, then the number of positive outcomes will tend to be greater than what we would obtain under the null hypothesis. Under the null hypothesis (P = P_0) the expected number of positive outcomes in N trials is NP_0. We will reject H_0 (and accept H_a) if Y \geq \theta_r, i.e. if Y is greater than or equal to a threshold \theta_r that is sufficiently larger than NP_0.
Since we want to limit the chance of a type-1 error (rejecting H_0 when it is true), we specify an acceptable probability of type-1 error, \alpha_z. Typical values (as mentioned earlier) are 0.01 and 0.05.
The probability of type-1 error using a threshold \theta is the probability of obtaining Y \geq \theta under the null hypothesis, and is given by \alpha(\theta) = \sum_{K=\theta}^N {N \choose K}P_0^K (1-P_0)^{N-K}
For the right-sided test, we choose the threshold \theta_r as the smallest \theta such that \alpha(\theta) \leq \alpha_z.
Consider, for instance, that our desired \alpha_z = 0.05. For N=10 trials with P_0 = 0.5, we obtain \alpha(6) = 0.38, which is too high. Trying other values of \theta, \alpha(7) = 0.172, \alpha(8) = 0.055, and \alpha(9) = 0.011. So clearly, the smallest value of \theta that gives us a sufficiently low probability of type-1 error (i.e. \alpha(\theta) \leq \alpha_z) is 9. We can only reject the null hypothesis of P = P_0 = 0.5 in favor of P > 0.5 with greater than 95% confidence if we get 9 or more positive outcomes in 10 trials, i.e. if we use a threshold \theta_r = 9.
Caption: The rejection region Y \geq 8 has an \alpha value of 0.055. For \alpha_z = 0.05, we must actually choose a higher threshold for this test.
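The threshold search above is easy to automate. The sketch below (a minimal illustration assuming scipy; the function name is ours, not a standard API) computes \alpha(\theta) = P(Y \geq \theta) under the null hypothesis and returns the smallest \theta that meets the desired bound:

```python
from scipy.stats import binom

def right_sided_threshold(N, P0, alpha_z):
    """Smallest theta such that P(Y >= theta) <= alpha_z under H0: P = P0."""
    for theta in range(N + 1):
        # binom.sf(theta - 1, N, P0) = P(Y > theta - 1) = P(Y >= theta)
        if binom.sf(theta - 1, N, P0) <= alpha_z:
            return theta
    return None  # alpha_z is unachievable with only N samples

# The example above: N = 10, P0 = 0.5, alpha_z = 0.05 gives theta_r = 9
print(right_sided_threshold(10, 0.5, 0.05))
```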
A left-sided test would be similarly derived. Now we must find the threshold \theta_r as the largest \theta such that \alpha(\theta) = \sum_{K=0}^{\theta} {N \choose K}P_0^K(1-P_0)^{N-K} is at most \alpha_z.
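The analogous left-sided search (again just an illustrative sketch) scans downward from N and returns the largest \theta whose left-tail probability is at most \alpha_z:

```python
from scipy.stats import binom

def left_sided_threshold(N, P0, alpha_z):
    """Largest theta such that P(Y <= theta) <= alpha_z under H0: P = P0."""
    for theta in range(N, -1, -1):
        if binom.cdf(theta, N, P0) <= alpha_z:   # P(Y <= theta)
            return theta
    return None  # even Y = 0 is not unlikely enough at this alpha_z

# For N = 10, P0 = 0.5, alpha_z = 0.05, this gives theta_r = 1 (reject if Y <= 1)
print(left_sided_threshold(10, 0.5, 0.05))
```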
Two-sided test:
In the two-sided test, we do not specify whether the true P value is greater or less than the value P_0 proposed by the null hypothesis. Our alternate hypothesis is only that P \neq P_0.
So now, we try to find a threshold \theta_r such that P(|Y - NP_0| \geq \theta_r) \leq \alpha_z, i.e. such that P(|Y - NP_0| \geq \theta_r) = P(Y \leq NP_0 -\theta_r) + P(Y \geq NP_0 + \theta_r) \leq \alpha_z
For the Binomial RV, under the null hypothesis, for any threshold \theta, we have P(Y \leq NP_0-\theta) = \sum_{K=0}^{NP_0-\theta}{N \choose K}P_0^K (1-P_0)^{N-K}, \\ P(Y \geq NP_0+\theta) = \sum_{K=NP_0+\theta}^{N}{N \choose K}P_0^K (1-P_0)^{N-K} and \alpha(\theta) = \sum_{K=0}^{NP_0-\theta}{N \choose K}P_0^K (1-P_0)^{N-K} + \sum_{K=NP_0+\theta}^{N}{N \choose K}P_0^K (1-P_0)^{N-K}
We find the smallest \theta value such that \alpha(\theta) \leq \alpha_z, and set that \theta to be \theta_r.
For our coin problem, with the null hypothesis of P_0 = 0.5 and N = 10 trials, we get \alpha(6) = 0, \alpha(5) = 0.002, \alpha(4) = 0.02, and \alpha(3) = 0.11. Thus, if we set our maximum acceptable probability of type-1 error \alpha_z = 0.05, we find that we must set \theta_r = 4, and our rejection region to be Y \geq 9 ~or~ Y \leq 1.
Caption: The rejection region |Y - 5| \geq 3 has an \alpha value of 0.11. For \alpha_z = 0.05, we must actually choose a higher threshold, of \theta = 4, for this test.
Note that if we had set \alpha_z = 0.001, we could never have conducted a two-sided test on this sample, since even the most extreme non-empty rejection region (Y = 0 or Y = 10, i.e. \theta = 5) has an \alpha value of 0.002. In that case you would need more samples in your test. With 20 samples, we find that we can use a threshold of \theta_r = 8, i.e. a rejection region of Y \leq 2 ~or~ Y \geq 18.
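The same kind of search works for the two-sided case. The sketch below (illustrative only; it assumes NP_0 is an integer, as in these examples) computes \alpha(\theta) = P(|Y - NP_0| \geq \theta) and reports the smallest acceptable \theta, reproducing the numbers above:

```python
from scipy.stats import binom

def two_sided_threshold(N, P0, alpha_z):
    """Smallest theta such that P(|Y - N*P0| >= theta) <= alpha_z under H0: P = P0."""
    mean = int(N * P0)                 # assumes N*P0 is an integer, as in the examples
    for theta in range(N + 1):
        # P(Y <= mean - theta) + P(Y >= mean + theta)
        alpha = binom.cdf(mean - theta, N, P0) + binom.sf(mean + theta - 1, N, P0)
        if alpha <= alpha_z:
            return theta
    return None

print(two_sided_threshold(10, 0.5, 0.05))    # -> 4  (reject if Y <= 1 or Y >= 9)
print(two_sided_threshold(20, 0.5, 0.001))   # -> 8  (reject if Y <= 2 or Y >= 18)
```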
Typical Problem:
Learning algorithm A has an accuracy of 80% on some problem. You have developed a new algorithm B. If you test 1000 samples, how many of them must your new algorithm classify correctly before you can be 95% confident that your new algorithm is superior to the default algorithm?
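A sketch of how one might compute the answer with the exact right-sided procedure described above (the large-sample approximation described in the next section could be used instead):

```python
from scipy.stats import binom

# H0: the new algorithm B is no better than A, i.e. its accuracy is P0 = 0.8.
# We reject H0 (conclude B is superior) with 95% confidence only if the number
# of correctly classified samples Y reaches the right-sided threshold.
N, P0, alpha_z = 1000, 0.8, 0.05

theta_r = next(theta for theta in range(N + 1)
               if binom.sf(theta - 1, N, P0) <= alpha_z)   # P(Y >= theta) <= alpha_z
print(f"Reject H0 (B is better than A) if Y >= {theta_r} correct out of {N}")
```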
Large-sample approximations to Bernoulli variables
For large N, working out the Binomial probability distribution values above can become expensive. In these cases we can obtain an excellent approximation using the Central Limit Theorem.
Our statistic Y = \sum_{i=1}^NX_i is a sum of N draws from the same distribution. Under the null hypothesis (that the Bernoulli parameter P=P_0), the expected value of Y = E[Y] = NP_0. If the N draws are independent, the variance of Y is simply the sum of the variances of the N draws: var(Y) = \sum_{i=1}^N var(X_i). Since each X_i is a Bernoulli RV with variance P_0(1-P_0), we obtain var(Y) = NP_0(1-P_0).
The central limit theorem tells us that for large N, Z = \frac{Y - E[Y]}{\sqrt{var(Y)}} = \frac{Y - NP_0}{\sqrt{NP_0(1-P_0)}} has (approximately) a standard normal PDF.
Caption: A Gaussian PDF. The numbers to the left of the mean at any x are the CDF \Phi(x), while the numbers to the right are 1 - \Phi(x). Note the symmetry.
Also, for any Z_\theta, Z \leq Z_\theta \Rightarrow \frac{Y - NP_0}{\sqrt{NP_0(1-P_0)}} \leq Z_\theta \\ \Rightarrow Y \leq NP_0 + Z_\theta\sqrt{NP_0(1-P_0)}
Thus P(Z \leq Z_\theta) = P(Y \leq NP_0 + Z_\theta\sqrt{NP_0(1-P_0)}).
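As a quick numerical check of this approximation (with arbitrary illustrative values), one can compare exact Binomial tail probabilities with their Gaussian approximations:

```python
import numpy as np
from scipy.stats import binom, norm

N, P0 = 100, 0.5                              # illustrative values
mean, std = N * P0, np.sqrt(N * P0 * (1 - P0))

for y in (55, 60, 65):
    exact = binom.sf(y - 1, N, P0)            # P(Y >= y), exact Binomial
    approx = 1 - norm.cdf((y - mean) / std)   # P(Z >= (y - mean)/std), CLT approximation
    print(f"y={y}: exact={exact:.4f}  normal approximation={approx:.4f}")
```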
Left-sided: For left-sided tests, our objective is to find a threshold \theta_r such that P(Y \leq \theta_r) \leq \alpha_z under the null hypothesis. To do so, we can instead find a Z_\alpha such that P(Z \leq Z_\alpha) = \alpha_z, and set \theta_r = \lfloor NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)} \rfloor. Since Z is distributed according to a standard normal, P(Z \leq Z_\alpha) = \Phi(Z_\alpha), the CDF of a standard normal. \Phi(Z_\alpha) can be obtained from standard normal tables.
A caveat is that the standard normal tables actually provide P(Z \leq Z_\alpha) only for Z_\alpha \geq 0. But since the normal distribution is symmetric, for Z_\alpha \leq 0 we can use the fact (refer to the Gaussian figure above) that P(Z \leq Z_\alpha) = P(Z \geq -Z_\alpha) = 1 - \Phi(-Z_\alpha). So, to find a Z_\alpha such that P(Z \leq Z_\alpha) = \alpha_z, we must first find the smallest C such that \Phi(C) \geq 1 - \alpha_z from the table and set Z_\alpha = -C.
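If you are using software rather than printed tables, the table lookup and the symmetry argument can be replaced by a single call to the standard normal quantile function (scipy's norm.ppf), as in this small sketch:

```python
from scipy.stats import norm

alpha_z = 0.05
Z_alpha = norm.ppf(alpha_z)        # the point with P(Z <= Z_alpha) = alpha_z (negative for alpha_z < 0.5)
print(Z_alpha)                     # approximately -1.645

# Equivalent to the table-based recipe: Z_alpha = -C, where C satisfies Phi(C) = 1 - alpha_z
print(-norm.ppf(1 - alpha_z))
```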
Consider the example of the coin. We have a null hypothesis that P_0 = 0.5. Let's assume a test of N=100 trials of the coin. Our alternate hypothesis is that P < P_0. We would like to be 95% confident when we reject the null hypothesis, so we set \alpha_z = 0.05. From the standard normal tables, we find that C = 1.65 is the smallest C value for which \Phi(C) \geq 0.95. Hence, we must set Z_\alpha = -1.65, from which we obtain \theta_r = \lfloor NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)} \rfloor = \lfloor 100\times 0.5 - 1.65\sqrt{100\times 0.5\times 0.5} \rfloor = \lfloor 41.75 \rfloor = 41. I.e. if 41 or fewer tosses in 100 are heads, we can be 95% sure that the coin is not fair.
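The full left-sided calculation for this example, as a short sketch (same illustrative numbers as above):

```python
import numpy as np
from scipy.stats import norm

N, P0, alpha_z = 100, 0.5, 0.05
Z_alpha = norm.ppf(alpha_z)                                   # about -1.645
theta_r = int(np.floor(N * P0 + Z_alpha * np.sqrt(N * P0 * (1 - P0))))
print(theta_r)   # 41: reject the fair-coin hypothesis if 41 or fewer heads are observed
```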
Right-sided: For right-sided tests our alternate hypothesis is that P > P_0, so we need to find a \theta_r such that if Y \geq \theta_r, we can confidently conclude that our alternate hypothesis is correct. Note that under the null hypothesis, P(Z \geq Z_\alpha) = P(Y \geq NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)}). So, in order to find a \theta_r such that P(Y \geq \theta_r) \leq \alpha_z, we need to find Z_\alpha such that P(Z \geq Z_\alpha) \leq \alpha_z, and set \theta_r = NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)}.
To find the necessary Z_\alpha from the standard normal table, we note that P(Z \geq Z_\alpha) = 1 - \Phi(Z_\alpha). So we must find the smallest C such that \Phi(C) \geq 1 - \alpha_z, and set Z_\alpha = C. \theta_r can now be computed as \theta_r = \lceil NP_0 + Z_\alpha\sqrt{NP_0(1-P_0)} \rceil.
For our coin example, if our alternate hypothesis is that P > P_0, for a test of N=100 trials, and setting our threshold on the probability of type 1 error at 0.05 as before, we find that the smallest C for which \Phi(C) \geq 0.95 is C = 1.65. So, setting Z_\alpha = 1.65, we get \theta_r = 59. We must obtain at least 59 heads in 100 tosses to be 95% confident that P > 0.5.
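The corresponding right-sided computation, sketched with the same illustrative values:

```python
import numpy as np
from scipy.stats import norm

N, P0, alpha_z = 100, 0.5, 0.05
Z_alpha = norm.ppf(1 - alpha_z)                               # about +1.645
theta_r = int(np.ceil(N * P0 + Z_alpha * np.sqrt(N * P0 * (1 - P0))))
print(theta_r)   # 59: at least 59 heads in 100 tosses are needed to conclude P > 0.5
```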
Two-sided: For two-sided tests our alternate hypothesis is that P \neq P_0, so we need to find a \theta_r such that if |Y - NP_0| \geq \theta_r, that is, if Y \geq NP_0 + \theta_r or Y \leq NP_0 - \theta_r, we can conclude that our alternate hypothesis is correct (e.g. that the coin is not fair).
We note that for large N the distribution of Y is (approximately) symmetric, so P(Y \geq NP_0 + \theta_r) = P(Y \leq NP_0 - \theta_r). Hence, P(|Y - NP_0| \geq \theta_r) = 2P(Y \leq NP_0 - \theta_r). So, for any \alpha_z, we need to find a \theta_r such that 2P(Y \leq NP_0 - \theta_r) \leq \alpha_z, or P(Y \leq NP_0 - \theta_r) \leq \frac{\alpha_z}{2}. The procedure for finding \theta_r is hence identical to that for the right-sided test, with \alpha_z replaced by 0.5\alpha_z.
We must find the smallest Z_{\frac{\alpha}{2}} from the normal table such that \Phi(Z_{\frac{\alpha}{2}}) \geq 1 - \frac{\alpha_z}{2}. Note that unlike in the left- and right-sided tests, \theta_r is now not the actual threshold, but the distance from the mean NP_0. So we can compute \theta_r from Z_{\frac{\alpha}{2}} as \theta_r = \lceil Z_{\frac{\alpha}{2}} \sqrt{NP_0(1-P_0)}\rceil.
For our coin example, if our alternate hypothesis is that P \neq P_0, for a test of N=100 trials, and setting our threshold on the probability of type 1 error at 0.05 as before, we find that Z_{\frac{\alpha}{2}} = 1.96, giving us \theta_r = \lceil 1.96 \sqrt{100\times 0.5\times 0.5} \rceil = 10. We must obtain 40 or fewer heads, or 60 or more heads, to be 95% confident that the coin is unfair.
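And the two-sided version, again only as an illustrative sketch, which reproduces the rejection region above:

```python
import numpy as np
from scipy.stats import norm

N, P0, alpha_z = 100, 0.5, 0.05
Z_half = norm.ppf(1 - alpha_z / 2)                            # about 1.96
theta_r = int(np.ceil(Z_half * np.sqrt(N * P0 * (1 - P0))))   # distance from the mean N*P0
print(f"theta_r = {theta_r}: reject H0 if Y <= {int(N * P0 - theta_r)} or Y >= {int(N * P0 + theta_r)}")
```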