We begin by specifying a large collection $\mathcal{F}$ of candidate features. We do
not require a priori that these features are actually relevant or useful.
Instead, we let the pool be as large as practically possible. Only a small
subset of this collection of features will eventually be employed in our final
model. In short, we would like to include in the model only a subset $\mathcal{S}$ of
the full set of candidate features $\mathcal{F}$. We will call $\mathcal{S}$ the set of
active features. The choice of $\mathcal{S}$ must capture as much information about
the random process as possible, yet only include features whose expected values
can be reliably estimated.
With each $f \in \mathcal{F}$ one might associate a Bernoulli random variable
with the following (``objectivist'' or ``non-Bayesian'')
interpretation. Imagine that the set of $(x, y)$ pairs comprising the training sample were
drawn from an imaginary, infinite distribution of $(x, y)$ pairs. In this imaginary
distribution, $f$ is active ($f(x, y) = 1$) with some probability $\pi_f$, and
inactive ($f(x, y) = 0$) with probability $1 - \pi_f$.
(A Bayesian would prefer to put a distribution over the possible values of
$\pi_f$, informed both by some empirical evidence--such as the fraction of
times $f(x, y) = 1$ in the sample--and also by some outside knowledge of
what one might expect $\pi_f$ to be. This line of reasoning forms the basis of
``fuzzy maximum entropy,'' which is a quite intriguing line of investigation we
shall not pursue here.)
As a Bernoulli variable, $f$ is subject to the law of large
numbers, which asserts that for any $\epsilon > 0$:
\[
\lim_{N \rightarrow \infty} \; p\!\left( \left| \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i) - \pi_f \right| > \epsilon \right) = 0 .
\]
That is, by increasing the number $N$ of examples $(x_i, y_i)$, the fraction
for which $f(x_i, y_i) = 1$ approaches the constant $\pi_f$. More
succinctly, $\tilde{p}(f) \rightarrow \pi_f$.
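To make the convergence concrete, the short simulation below is an illustrative sketch (the activation probability $\pi_f = 0.03$, the sample sizes, and the choice of Python are assumptions of ours, not figures from the text): it draws $N$ imaginary pairs for a single feature and shows the empirical fraction $\tilde{p}(f)$ approaching $\pi_f$ as $N$ grows.
\begin{verbatim}
import numpy as np

# Illustrative simulation: a single feature f behaves as a Bernoulli
# variable, active with (hypothetical) probability pi_f = 0.03.  The
# empirical fraction of active draws approaches pi_f as N grows.
rng = np.random.default_rng(0)
pi_f = 0.03

for N in (100, 1_000, 10_000, 100_000, 1_000_000):
    draws = rng.random(N) < pi_f       # f(x_i, y_i) for N imaginary pairs
    p_tilde = draws.mean()             # empirical estimate ~p(f)
    print(f"N = {N:>9,}  ~p(f) = {p_tilde:.5f}  "
          f"error = {abs(p_tilde - pi_f):.5f}")
\end{verbatim}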
Unfortunately, $N$ is not infinite, and the empirical estimate $\tilde{p}(f)$
for each $f \in \mathcal{F}$ is not guaranteed to lie close to its ``true'' value
$\pi_f$. However, a subset of $\mathcal{F}$ will exhibit
$\tilde{p}(f) \approx \pi_f$. And of these, a fraction will have a non-negligible bearing on the
likelihood of the model. The goal of the inductive learning process is to identify
this subset and construct a conditional model from it.
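One common practical proxy for whether $\tilde{p}(f)$ can be trusted (a heuristic of our own choosing here, not the selection criterion developed in the text) is the binomial standard error of the estimate, which shrinks only as the feature is observed sufficiently often:
\begin{verbatim}
import math

# Heuristic illustration (not the text's criterion): judge how reliable
# ~p(f) is by its binomial standard error sqrt(~p(f)(1 - ~p(f)) / N).
N = 50_000                                  # hypothetical sample size
counts = {"f_common": 4_200, "f_rare": 3}   # hypothetical activation counts
for name, c in counts.items():
    p_tilde = c / N
    stderr = math.sqrt(p_tilde * (1 - p_tilde) / N)
    print(f"{name}: ~p(f) = {p_tilde:.5f} +/- {stderr:.5f} "
          f"(relative error about {stderr / p_tilde:.0%})")
\end{verbatim}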
Searching exhaustively for this subset is hopeless for all but the most
restricted problems. Even fixing its size $|\mathcal{S}|$ a priori doesn't
help much: there are $\binom{|\mathcal{F}|}{|\mathcal{S}|}$ unique outcomes
of this selection process. For concreteness, with $|\mathcal{S}| = 25$ and a
candidate pool $\mathcal{F}$ of the size typical of documented experiments
involving maxent modeling, the number of possible sets of 25 active
features is astronomically large. Viewed as a pure search problem, finding the optimal set of
features is computationally infeasible.
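The combinatorics are easy to check directly. The figures in the snippet below are hypothetical stand-ins, since the text's exact numbers are not reproduced here; they serve only to show how quickly $\binom{|\mathcal{F}|}{|\mathcal{S}|}$ grows.
\begin{verbatim}
import math

# Hypothetical figures, for illustration only: even a modest candidate
# pool makes exhaustive selection of |S| = 25 features hopeless.
F_size, S_size = 100_000, 25                # |F| and |S| (assumed values)
n_subsets = math.comb(F_size, S_size)       # C(|F|, |S|) possible choices
print(f"{n_subsets:.3e} possible sets of {S_size} active features")
\end{verbatim}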
To find $\mathcal{S}$, we adopt an incremental approach to feature selection. The idea
is to build up $\mathcal{S}$ by successively adding features. The choice of feature to
add at each step is determined by the training data. Let us denote the set of
models determined by the feature set $\mathcal{S}$ as $C(\mathcal{S})$. ``Adding'' a feature $f$
is shorthand for requiring that the set of allowable models all satisfy the
equality $p(f) = \tilde{p}(f)$. Only some members of $C(\mathcal{S})$ will satisfy this
equality; the ones that do we denote by $C(\mathcal{S} \cup \{f\})$.
Thus, each time a candidate feature is adjoined to $\mathcal{S}$, another linear
constraint is imposed on the space $C(\mathcal{S})$ of models allowed by the features
in $\mathcal{S}$. As a result, $C(\mathcal{S})$ shrinks; the model $p_{\mathcal{S}}$ in $C(\mathcal{S})$ with the
greatest entropy reflects ever-increasing knowledge and thus, hopefully,
becomes a more accurate representation of the process. This narrowing of the
space of permissible models was represented in figure 1 by a
series of intersecting lines (hyperplanes, in general) in a probability
simplex. Perhaps more intuitively, we could represent it by a series of nested
subsets of $\mathcal{P}$, as in figure 2.
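To make the incremental scheme concrete, the sketch below implements a naive version of it on a toy problem: a conditional maxent model $p(y \mid x) \propto \exp\big(\sum_j \lambda_j f_j(x, y)\big)$ is refit by gradient ascent each time a candidate constraint $p(f) = \tilde{p}(f)$ is tried, and the candidate yielding the largest gain in training log-likelihood is adjoined to $\mathcal{S}$. The data, the indicator features, the learning rate, and the refit-everything scoring are all illustrative assumptions of ours; the text's actual selection criterion and training procedure may differ.
\begin{verbatim}
import numpy as np

# Illustrative sketch of incremental feature selection for a conditional
# maxent model on a toy problem; not the text's algorithm verbatim.
rng = np.random.default_rng(0)
LABELS = (0, 1)

# Toy pairs (x, y): x in {0,...,4}; y mostly indicates whether x == 2.
X = rng.integers(0, 5, size=400)
Y = np.where(rng.random(400) < 0.9, (X == 2).astype(int),
             rng.integers(0, 2, 400))

# Candidate pool F: one indicator f(x, y) = 1{x == a, y == b} per (a, b).
def indicator(a, b):
    return lambda x, y: 1.0 if (x == a and y == b) else 0.0

F_pool = {f"x={a},y={b}": indicator(a, b)
          for a in range(5) for b in LABELS}

def fit_and_loglik(feats, iters=500, lr=1.0):
    """Fit the lambdas by gradient ascent on the conditional
    log-likelihood; return the average training log-likelihood."""
    if not feats:
        return -np.log(len(LABELS))                    # uniform p(y|x)
    # A[i, y, j] = f_j(x_i, y)
    A = np.array([[[f(x, y) for f in feats] for y in LABELS] for x in X])
    lam = np.zeros(len(feats))
    emp = A[np.arange(len(X)), Y].mean(axis=0)         # empirical ~p(f)
    for _ in range(iters):
        scores = A @ lam                               # (N, |LABELS|)
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        model = np.einsum("ny,nyj->j", p, A) / len(X)  # model p(f)
        lam += lr * (emp - model)                      # push p(f) -> ~p(f)
    scores = A @ lam
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.log(p[np.arange(len(X)), Y]).mean()

# Greedy loop: adjoin the candidate whose constraint most improves the
# training log-likelihood of the refit model.
S, best_ll = [], fit_and_loglik([])
for step in range(3):
    gains = {name: fit_and_loglik([F_pool[n] for n in S + [name]]) - best_ll
             for name in F_pool if name not in S}
    name, gain = max(gains.items(), key=lambda kv: kv[1])
    S.append(name)
    best_ll += gain
    print(f"step {step + 1}: added {name}  (log-likelihood gain {gain:.4f})")
\end{verbatim}
Refitting the full model for every candidate, as this sketch does, is affordable only on toy problems; at realistic scale one needs a cheaper way to score candidate features.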
As an aside, the intractability of the ``all at once'' optimization problem is not unique to or an indictment of the exponential approach. Decision trees, for example, are typically constructed recursively using a greedy algorithm. And in designing a neural network representation of an arbitrary distribution, one typically either fixes the network topology in advance or performs a restricted search within a neighborhood of possible topologies for the optimal configuration, because the complete search over parameters and topologies is intractable.
Figure 2: A nested sequence of subsets $C(\mathcal{S})$ of $\mathcal{P}$ corresponding to increasingly large sets of features $\mathcal{S}$.