In the context of active learning, we are assuming that the input distribution is known. With a mixture of Gaussians, one interpretation of this assumption is that we know $\mu_{x,i}$ and $\sigma^2_{x,i}$ for each Gaussian. In that case, our application of EM will estimate only $\mu_{y,i}$, $\sigma^2_{y,i}$, and $\sigma_{xy,i}$.
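To make this division of labor concrete, the sketch below (our illustration, not code from the paper) performs one EM-style iteration for a one-dimensional-input, one-dimensional-output mixture in which the input-side parameters, and the mixing proportions, are held fixed at their known values and only the parameters involving $y$ are re-estimated. The variable names (`mu_x`, `sigma2_x`, `sigma_xy`, and so on) are assumptions of the sketch.

```python
import numpy as np

def em_step_y_only(x, y, pi, mu_x, sigma2_x, mu_y, sigma2_y, sigma_xy):
    """One EM iteration for a mixture of bivariate Gaussians over (x, y) in
    which the input-side parameters mu_x, sigma2_x and the mixing
    proportions pi are treated as known and fixed; only the parameters
    involving y are re-estimated.

    x, y : arrays of shape (n,) -- training data.
    All remaining arguments : arrays of shape (k,) -- per-Gaussian parameters.
    """
    dx = x[None, :] - mu_x[:, None]          # (k, n) deviations in x
    dy = y[None, :] - mu_y[:, None]          # (k, n) deviations in y

    # E-step: responsibilities from the joint bivariate Gaussian density.
    det = sigma2_x * sigma2_y - sigma_xy ** 2
    quad = (sigma2_y[:, None] * dx ** 2
            - 2.0 * sigma_xy[:, None] * dx * dy
            + sigma2_x[:, None] * dy ** 2) / det[:, None]
    h = pi[:, None] * np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(det)[:, None])
    h /= h.sum(axis=0, keepdims=True)

    # M-step: update only the y-related parameters; mu_x, sigma2_x stay fixed.
    n_i = h.sum(axis=1)                      # effective count per Gaussian
    mu_y_new = (h * y[None, :]).sum(axis=1) / n_i
    dy_new = y[None, :] - mu_y_new[:, None]
    sigma2_y_new = (h * dy_new ** 2).sum(axis=1) / n_i
    sigma_xy_new = (h * dx * dy_new).sum(axis=1) / n_i
    return mu_y_new, sigma2_y_new, sigma_xy_new, n_i
```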
Generally, however, knowing the input distribution will not correspond to knowing the actual $\mu_{x,i}$ and $\sigma^2_{x,i}$ of each Gaussian. We may simply know, for example, that $P(x)$ is uniform, or we may only be able to approximate $P(x)$ by some set of sampled inputs. In such cases, we must use EM to estimate $\mu_{x,i}$ and $\sigma^2_{x,i}$ in addition to the parameters involving $y$. If we simply estimate these values from the training data, though, we will be estimating the joint distribution of the training data instead of that of the reference distribution $P(x)$. To obtain a proper estimate, we must correct Equation 5 using two quantities: one computed by applying Equation 7 with the mean and x variance of the training data, and the other computed by applying the same equation with the mean and x variance of a set of reference data drawn according to $P(x)$.
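As a small illustration of the ingredients of this correction (the correction formula itself is not reproduced here), the sketch below computes the mean and x variance of the training inputs and of a reference sample drawn according to $P(x)$, which are the statistics that would then be fed to Equation 7. The sampler argument and all names are assumptions of the sketch.

```python
import numpy as np

def input_statistics(x_train, p_x_sampler, n_reference=1000, rng=None):
    """Compute the two sets of input statistics the correction relies on:
    the mean and x variance of the training inputs, and the mean and x
    variance of a reference sample drawn according to P(x)."""
    rng = np.random.default_rng() if rng is None else rng
    x_ref = p_x_sampler(n_reference, rng)    # e.g. uniform samples when P(x) is uniform
    train_stats = (np.mean(x_train), np.var(x_train))
    ref_stats = (np.mean(x_ref), np.var(x_ref))
    return train_stats, ref_stats

# Example: P(x) known to be uniform on [0, 1].
# train_stats, ref_stats = input_statistics(x_train, lambda n, rng: rng.uniform(0.0, 1.0, n))
```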
If our goal in active learning is to minimize variance, we should select training examples so as to minimize the learner's variance $\sigma^2_{\hat{y}}$. With a mixture of Gaussians, we can compute $\sigma^2_{\hat{y}}$ efficiently. The model's estimated distribution of $\tilde{y}$ given a candidate query $\tilde{x}$ is explicit:

$$P(\tilde{y} \mid \tilde{x}) = \sum_i h_i \, N\!\left(\hat{y}_i(\tilde{x}),\ \sigma^2_{y|x,i}\right),$$
where $h_i$ is the probability that $\tilde{x}$ is attributed to Gaussian $g_i$, and $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. Given this, we can model the change in each $g_i$ separately, calculating its expected variance given a new point sampled from $P(\tilde{y} \mid \tilde{x}, g_i)$, and weight this change by $h_i$. The new expectations combine to form the learner's new expected variance $\langle \sigma^2_{\hat{y}} \rangle$, given in Equation 9, where the per-Gaussian expectation can be computed exactly in closed form.
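The sketch below illustrates these quantities for a one-dimensional mixture. The per-Gaussian conditional means and variances follow from standard bivariate-Gaussian conditioning; the expected new variance is approximated here by Monte Carlo sampling of $\tilde{y}$ rather than by the paper's closed form, and all function and variable names are our own.

```python
import numpy as np

def conditionals(x_q, mu_x, sigma2_x, mu_y, sigma2_y, sigma_xy, n):
    """Per-Gaussian conditional prediction at query x_q.

    Returns mixing weights h_i(x_q), conditional means yhat_i, and
    conditional variances sigma2_{y|x,i} (standard bivariate-Gaussian
    conditioning within each component)."""
    h = n * np.exp(-0.5 * (x_q - mu_x) ** 2 / sigma2_x) / np.sqrt(2 * np.pi * sigma2_x)
    h /= h.sum()
    yhat = mu_y + sigma_xy / sigma2_x * (x_q - mu_x)
    sigma2_cond = sigma2_y - sigma_xy ** 2 / sigma2_x
    return h, yhat, sigma2_cond

def expected_new_cond_variance(x_q, i, params, n_samples=1000, rng=None):
    """Monte Carlo stand-in for the closed-form expectation: sample a new
    y from Gaussian i's conditional at x_q, fold (x_q, y) into the
    component's weighted sufficient statistics, and average the resulting
    conditional variance sigma2_{y|x,i}."""
    rng = np.random.default_rng() if rng is None else rng
    mu_x, sigma2_x, mu_y, sigma2_y, sigma_xy, n = params
    h, yhat, s2 = conditionals(x_q, mu_x, sigma2_x, mu_y, sigma2_y, sigma_xy, n)
    new_vars = []
    for _ in range(n_samples):
        y_new = rng.normal(yhat[i], np.sqrt(s2[i]))
        w, m = h[i], n[i] + h[i]             # weight of the new point, new effective count
        mx = (n[i] * mu_x[i] + w * x_q) / m
        my = (n[i] * mu_y[i] + w * y_new) / m
        sx = (n[i] * (sigma2_x[i] + mu_x[i] ** 2) + w * x_q ** 2) / m - mx ** 2
        sy = (n[i] * (sigma2_y[i] + mu_y[i] ** 2) + w * y_new ** 2) / m - my ** 2
        sxy = (n[i] * (sigma_xy[i] + mu_x[i] * mu_y[i]) + w * x_q * y_new) / m - mx * my
        new_vars.append(sy - sxy ** 2 / sx)
    return np.mean(new_vars)
```

Averaging over samples of $\tilde{y}$ plays the same role as the closed-form expectation: it estimates how the component's conditional variance would change if the query were made and its answer incorporated with weight $h_i(\tilde{x})$.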
If, as discussed earlier, we are also estimating $\mu_{x,i}$ and $\sigma^2_{x,i}$, we must take into account the effect of the new example on those estimates, replacing $\mu_{x,i}$ and $\sigma^2_{x,i}$ in the above equations with their expected values after incorporating the new example at $\tilde{x}$.
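As an illustration of what such a replacement involves, the sketch below applies standard weighted incremental updates to Gaussian $i$'s input-side estimates, assuming the new example at $\tilde{x}$ enters the component with weight $h_i(\tilde{x})$; the paper's exact replacement expressions are not reproduced, and all names are our own.

```python
import numpy as np

def expected_updated_x_params(x_query, h_i, n_i, mu_x_i, sigma2_x_i):
    """Expected input-side estimates for Gaussian i after a new example at
    x_query is incorporated with responsibility weight h_i (standard
    weighted incremental mean/variance updates)."""
    n_new = n_i + h_i
    mu_new = (n_i * mu_x_i + h_i * x_query) / n_new
    # Update via second moments: new variance = new E[x^2] - (new mean)^2.
    second_moment = (n_i * (sigma2_x_i + mu_x_i ** 2) + h_i * x_query ** 2) / n_new
    sigma2_new = second_moment - mu_new ** 2
    return mu_new, sigma2_new, n_new
```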
We can use Equation 9 to guide active learning. By evaluating the expected new variance over a reference set given candidate $\tilde{x}$, we can select the $\tilde{x}$ giving the lowest expected model variance. Note that in high-dimensional spaces, it may be necessary to evaluate a prohibitive number of candidate points to get good coverage of the potential query space. In these cases, it is more efficient to differentiate Equation 9 and hillclimb on the resulting gradient to find a locally optimal $\tilde{x}$. See, for example, [Cohn 1994].
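A simplified version of this selection loop is sketched below. It averages an approximate expected new variance over a reference set and returns the candidate with the smallest value. The approximation models only the growth of the effective counts, $n_i \rightarrow n_i + h_i(\tilde{x})$, and keeps each Gaussian's conditional variance fixed, whereas Equation 9 also updates that variance in closed form; the per-Gaussian variance expression used is the standard estimator variance for a bivariate-Gaussian fit, and all names are illustrative.

```python
import numpy as np

def mixing_weights(x, mu_x, sigma2_x, n):
    """h_i(x): posterior probability that x is attributed to Gaussian i."""
    h = n * np.exp(-0.5 * (x - mu_x) ** 2 / sigma2_x) / np.sqrt(2 * np.pi * sigma2_x)
    return h / h.sum()

def model_variance(x, mu_x, sigma2_x, sigma2_cond, n_eff):
    """Variance of the learner's prediction at x: per-Gaussian estimator
    variance sigma2_cond/n * (1 + (x - mu_x)^2 / sigma2_x), combined with
    weights h_i(x)^2."""
    h = mixing_weights(x, mu_x, sigma2_x, n_eff)
    per_gauss = sigma2_cond / n_eff * (1.0 + (x - mu_x) ** 2 / sigma2_x)
    return np.sum(h ** 2 * per_gauss)

def select_query(candidates, reference, mu_x, sigma2_x, sigma2_cond, n):
    """Pick the candidate x~ whose approximate expected new model variance,
    averaged over the reference set, is smallest.  Only the growth of the
    effective counts n_i -> n_i + h_i(x~) is modelled here."""
    best_x, best_v = None, np.inf
    for x_q in candidates:
        n_new = n + mixing_weights(x_q, mu_x, sigma2_x, n)   # expected new counts
        avg_var = np.mean([model_variance(x_r, mu_x, sigma2_x, sigma2_cond, n_new)
                           for x_r in reference])
        if avg_var < best_v:
            best_x, best_v = x_q, avg_var
    return best_x, best_v
```

Because the reference points are drawn according to $P(x)$, the average approximates the expected model variance over the input distribution, which is the selection criterion described above; a gradient-based search would replace the loop over candidates when the query space is high-dimensional.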