We introduce the concept of maximum entropy through a simple example. Suppose
we wish to model an expert translator's decisions concerning the proper French
rendering of the English word in. Our model of the expert's
decisions assigns to each French word or phrase $f$ an estimate, $p(f)$, of
the probability that the expert would choose $f$ as a translation of in.
To guide us in developing $p(f)$, we collect a large sample of instances of the
expert's decisions. Our goal is to extract a set of facts about the
decision-making process from the sample (the first task of modeling) that will
aid us in constructing a model of this process (the second task).
One obvious clue we might glean from the sample is the list of allowed translations. For example, we might discover that the expert translator always chooses among the following five French phrases: {dans, en, à, au cours de, pendant}. With this information in hand, we can impose our first constraint on our model $p$:
$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$$
This equation represents our first statistic of the process; we can now proceed
to search for a suitable model which obeys this equation. Of course, there are
an infinite number of models for which this identity holds. One model
which satisfies the above equation is $p(\text{dans}) = 1$; in other words, the model
always predicts dans. Another model which obeys this constraint predicts
pendant with a probability of $1/2$, and à with a probability of $1/2$.
But both of these models offend our sensibilities: knowing only that
the expert always chose from among these five French phrases, how can we
justify either of these probability distributions? Each seems to be making
rather bold assumptions, with no empirical justification. Knowing only that the
expert chose exclusively from among these five French phrases, the
most intuitively appealing model is
$$p(\text{dans}) = p(\text{en}) = p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{1}{5}$$
This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge. (It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.)
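These candidate models are simple enough to check mechanically. As a minimal illustrative sketch (the dictionary keys and helper name here are ours, not part of the exposition), each model can be represented as a table of probabilities over the five allowed phrases:

\begin{verbatim}
# Each candidate model assigns a probability to the five allowed phrases.
PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

always_dans = {"dans": 1.0, "en": 0.0, "à": 0.0,
               "au cours de": 0.0, "pendant": 0.0}
bold_split  = {"dans": 0.0, "en": 0.0, "à": 0.5,
               "au cours de": 0.0, "pendant": 0.5}
uniform     = {phrase: 1 / 5 for phrase in PHRASES}

def satisfies_first_constraint(p, tol=1e-9):
    """The five probabilities must sum to 1."""
    return abs(sum(p[phrase] for phrase in PHRASES) - 1.0) < tol

# All three models obey the constraint; only the uniform one avoids
# assumptions the sample does not support.
for model in (always_dans, bold_split, uniform):
    assert satisfies_first_constraint(model)
\end{verbatim}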
We might hope to glean more clues about the expert's decisions from our
sample. Suppose we notice that the expert chose either dans or en
30% of the time. We could apply this knowledge to update our model of the
translation process by requiring that $p$ satisfy two constraints:
$$p(\text{dans}) + p(\text{en}) = \frac{3}{10}$$
$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$$
Once again there are many probability distributions consistent with these two
constraints. In the absence of any other knowledge, a reasonable choice for
$p$ is again the most uniform--that is, the distribution which allocates
its probability as evenly as possible, subject to the constraints:
$$p(\text{dans}) = p(\text{en}) = \frac{3}{20}$$
$$p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{7}{30}$$
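The arithmetic behind this choice is worth spelling out: the first of the two constraints fixes the total mass on dans and en at $3/10$, which the most uniform model splits evenly between them, and normalization leaves $7/10$ to be split evenly among the remaining three phrases:
$$p(\text{dans}) = p(\text{en}) = \frac{1}{2}\cdot\frac{3}{10} = \frac{3}{20}, \qquad p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{1}{3}\left(1 - \frac{3}{10}\right) = \frac{7}{30}.$$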
Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:
$$p(\text{dans}) + p(\text{à}) = \frac{1}{2}$$
We can once again look for the most uniform $p$ satisfying these constraints,
but now the choice is not as obvious. As we have added complexity, we have
encountered two problems. First, what exactly is meant by ``uniform,'' and how
can one measure the uniformity of a model? Second, having determined a
suitable answer to these questions, how does one find the most uniform model
subject to a set of constraints like those we have described?
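Before answering in general, we can at least probe the question numerically. The following sketch (ours, purely for illustration) takes the entropy $-\sum_f p(f)\log p(f)$ as the uniformity measure, as the method's name suggests, and hands the three constraints to a generic constrained optimizer rather than to any method developed here:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# Phrase order: dans, en, à, au cours de, pendant.
PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # scipy minimizes, so we minimize the negative entropy;
    # clipping guards against log(0) at the boundary.
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3 / 10},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1 / 2},   # p(dans) + p(à) = 1/2
]

result = minimize(
    neg_entropy,
    x0=np.full(5, 1 / 5),        # start from the uniform model
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 5,
    constraints=constraints,
)

for phrase, prob in zip(PHRASES, result.x):
    print(f"p({phrase}) = {prob:.4f}")
\end{verbatim}

The optimizer converges to approximately $p(\text{dans}) \approx 0.186$, $p(\text{en}) \approx 0.114$, $p(\text{à}) \approx 0.314$, and $p(\text{au cours de}) = p(\text{pendant}) \approx 0.193$: the two unconstrained phrases still receive equal probability, but the solution is no longer a ratio of small integers, which is precisely why a principled definition of uniformity is needed.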
The maximum entropy method answers both these questions. Intuitively, the
principle is simple: model all that is known and assume nothing about that
which is unknown. In other words, given a collection of facts, choose a model
which is consistent with all the facts, but otherwise as uniform as
possible. This is precisely the approach we took in selecting our model at
each step in the above example.
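Stated compactly, writing $\mathcal{C}$ for the set of models consistent with the collected facts and $H(p) = -\sum_f p(f)\log p(f)$ for the entropy of a model (the notation here is ours), the method selects
$$p^\star = \operatorname*{argmax}_{p \in \mathcal{C}} H(p).$$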