We begin by defining $P(x, y)$ to be the unknown joint distribution over $x$ and $y$, and $P(x)$ to be the known marginal distribution of $x$ (commonly called the input distribution). We denote the learner's output on input $x$, given training set $\mathcal{D}$, as $\hat{y}(x; \mathcal{D})$. We can then write the expected error of the learner as follows:

$$\int_x E_T\left[\left(\hat{y}(x; \mathcal{D}) - y(x)\right)^2 \,\big|\, x\right] P(x)\, dx, \qquad (1)$$
where $E_T[\cdot]$ denotes expectation over $P(y|x)$ and over training sets $\mathcal{D}$. The expectation inside the integral may be decomposed as follows [Geman et al. 1992]:

$$E_T\left[\left(\hat{y}(x; \mathcal{D}) - y(x)\right)^2 \,\big|\, x\right] = E\left[\left(y(x) - E[y|x]\right)^2\right] + \left(E_{\mathcal{D}}\left[\hat{y}(x; \mathcal{D})\right] - E[y|x]\right)^2 + E_{\mathcal{D}}\left[\left(\hat{y}(x; \mathcal{D}) - E_{\mathcal{D}}\left[\hat{y}(x; \mathcal{D})\right]\right)^2\right] \qquad (2)$$
where $E_{\mathcal{D}}[\cdot]$ denotes the expectation over training sets $\mathcal{D}$, and the remaining expectations on the right-hand side are expectations with respect to the conditional density $P(y|x)$. It is important to remember here that in the case of active learning, the distribution of the examples in $\mathcal{D}$ may differ substantially from the joint distribution $P(x, y)$.
The first term in Equation 2 is the variance of $y$ given $x$ --- it is the noise in the distribution, and does not depend on the learner or on the training data. The second term is the learner's squared bias, and the third is its variance; these last two terms comprise the mean squared error of the learner with respect to the regression function $E[y|x]$. When the second term of Equation 2 is zero, we say that the learner is unbiased. We shall assume that the learners considered in this paper are approximately unbiased; that is, that their squared bias is negligible when compared with their overall mean squared error. Thus we focus on algorithms that minimize the learner's error by minimizing its variance:

$$\sigma^2_{\hat{y}} = E_{\mathcal{D}}\left[\left(\hat{y}(x; \mathcal{D}) - E_{\mathcal{D}}\left[\hat{y}(x; \mathcal{D})\right]\right)^2\right]. \qquad (3)$$
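As a quick numerical check of this decomposition, the following sketch estimates each term of Equation 2 by Monte Carlo for a deliberately biased learner (a straight-line fit to a sinusoid). The target function, noise level, and sample sizes are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative generative model: y = f(x) + eps, eps ~ N(0, noise_sd^2).
f = lambda x: np.sin(x)
noise_sd = 0.3
x0 = 1.0        # fixed input x at which we decompose the expected error
n_train = 20    # size of each training set D
n_sets = 2000   # number of training sets D to average over

def fit_predict(x0):
    """Draw one training set D, fit a degree-1 polynomial (a biased
    learner for sin), and return its prediction yhat(x0; D)."""
    x = rng.uniform(-3, 3, n_train)
    y = f(x) + rng.normal(0, noise_sd, n_train)
    coeffs = np.polyfit(x, y, 1)
    return np.polyval(coeffs, x0)

yhat = np.array([fit_predict(x0) for _ in range(n_sets)])
y0 = f(x0) + rng.normal(0, noise_sd, n_sets)  # draws of y given x = x0

mse = np.mean((yhat - y0) ** 2)     # left-hand side of Equation 2
noise = noise_sd ** 2               # E[(y - E[y|x])^2]: irreducible noise
bias2 = (yhat.mean() - f(x0)) ** 2  # (E_D[yhat] - E[y|x])^2: squared bias
var = yhat.var()                    # E_D[(yhat - E_D[yhat])^2]: variance

print(mse, noise + bias2 + var)  # the two agree up to Monte Carlo error
```

Note that for this straight-line learner the squared-bias term is far from negligible, which is exactly why the unbiasedness assumption above restricts the class of learners under consideration.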
(For readability, we will drop the explicit dependence on $x$ and $\mathcal{D}$ --- unless denoted otherwise, $\hat{y}$ and $\sigma^2_{\hat{y}}$ are functions of $x$ and $\mathcal{D}$.) In an active learning setting, we will have chosen the x-component of our training set $\mathcal{D}$; we indicate this by rewriting Equation 3 as

$$\sigma^2_{\hat{y}} = E_{\mathcal{D}|X}\left[\left(\hat{y} - E_{\mathcal{D}|X}[\hat{y}]\right)^2\right], \qquad (4)$$
where $E_{\mathcal{D}|X}[\cdot]$ denotes $E_{\mathcal{D}}[\cdot]$ given a fixed x-component of $\mathcal{D}$. When a new input $\tilde{x}$ is selected and queried, and the resulting $(\tilde{x}, \tilde{y})$ added to the training set, $\sigma^2_{\hat{y}}$ should change. We will denote the expectation (over values of $\tilde{y}$) of the learner's new variance as $\left\langle \tilde{\sigma}^2_{\hat{y}} \right\rangle$.
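As a generic illustration of this quantity (not the paper's own derivation), consider an ordinary-least-squares learner with output-noise variance $s^2$: its variance at a point $x$ is $s^2\, x^\top (X^\top X)^{-1} x$, which depends only on the x-component of the data, so the expectation over $\tilde{y}$ is immediate. The sketch below scores candidate queries $\tilde{x}$ by the variance that remains after the rank-one update $X^\top X \to X^\top X + \tilde{x}\tilde{x}^\top$; all names (`X`, `x_ref`, `candidates`, `s2`) are illustrative, not from the paper:

```python
import numpy as np

s2 = 0.25                              # assumed output-noise variance
X = np.array([[1.0, 0.1],              # current x-component of D:
              [1.0, 0.2],              # a bias feature plus one input
              [1.0, 0.3]])
x_ref = np.array([1.0, 2.0])           # point at which we track the variance
candidates = [np.array([1.0, c]) for c in (0.0, 1.0, 2.0, 3.0)]

def pred_var(XtX, x):
    """Learner's variance at x for OLS: s2 * x^T (X^T X)^{-1} x."""
    return s2 * x @ np.linalg.solve(XtX, x)

XtX = X.T @ X
current = pred_var(XtX, x_ref)
# Expected new variance after querying x_tilde: since the OLS variance
# does not involve y_tilde, the expectation over y_tilde is trivial and
# each candidate reduces to a rank-one update of X^T X.
new_vars = [pred_var(XtX + np.outer(xt, xt), x_ref) for xt in candidates]
best = candidates[int(np.argmin(new_vars))]
print(current, min(new_vars), best)
```

Adding any informative query can only shrink the variance here, but by different amounts; choosing the candidate with the smallest expected new variance is the selection criterion this decomposition motivates.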