Earlier we divided the statistical modeling problem into two steps: first,
finding appropriate facts about the data; second, incorporating these
facts into the model. Up to this point we have proceeded by assuming that the
first task was somehow performed for us. Even in the simple example provided
above, we did not explicitly state how we selected those particular
constraints. That is, why is the fact that dans or
à was chosen by the expert translator 50% of the time any more
important than countless other facts contained in the data? In fact, the
principle of maximum entropy does not directly concern itself with the issue of
feature selection: it merely provides a recipe for combining constraints into a
model. But the feature selection problem is critical, since the universe of
possible constraints is typically in the thousands or even millions. In this
section we introduce a method for automatically selecting the features to be
included in a maximum entropy model, and then offer a series of refinements to
ease the computational burden. What we will describe is a form of inductive
learning: from a distribution $\tilde{p}$, derive a set of rules (features) which
characterize $\tilde{p}$.
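
To make this inductive view concrete, consider how the dans/à fact cited
above fits the mold. The following is a sketch in indicator-feature notation;
the symbols $f$, $\tilde{p}(f)$, and $p(f)$ follow the usual maximum entropy
conventions and are illustrative rather than fixed by the text above:
\[
f(y) =
\begin{cases}
1 & \text{if } y = \text{dans} \text{ or } y = \text{\`a}, \\
0 & \text{otherwise,}
\end{cases}
\qquad
\tilde{p}(f) = \sum_{y} \tilde{p}(y)\, f(y) = \frac{1}{2}.
\]
Selecting this feature amounts to imposing the constraint
$p(f) = \tilde{p}(f)$ on the model; the problem addressed in this section is
deciding which of the many candidate features merit such treatment.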