For all but the most simple problems, the that maximize
cannot be found analytically. Instead, we must resort to
numerical methods. From the perspective of numerical optimization, the
function
is well behaved, since it is smooth and convex-
in each
. Consequently, a variety of numerical methods can be used to
calculate
. One simple method is coordinate-wise ascent, in which
is computed by iteratively maximizing
one
coordinate at a time. When applied to the maximum entropy problem, this
technique yields the popular Brown algorithm. Other general
purpose methods that can be used to maximize
include gradient
ascent and conjugate gradient.
An optimization method specifically tailored to the maximum entropy problem is
the iterative scaling algorithm of Darroch and Ratcliff. We present here a
version of this algorithm specifically designed for the problem at hand; the
interested reader is referred to the further readings. The algorithm is applicable
whenever the feature functions are non-negative:
This is, of course, true for the binary-valued feature functions we are
considering here. The algorithm generalizes the Darroch-Ratcliff procedure,
which requires, in addition to the non-negativity, that the feature functions
satisfy for all x,y.
The key step in the algorithm is step (2a), the computation of the increments
that solve (19). Notice that this equation
contains the term
, which changes as
becomes updated in
step (2b). For this reason, the algorithm is iterative, and requires a pass
through the entire set of
pairs in the empirical sample
for
each iteration.
Equation (19) merits a brief comment. In words, is the
number of features
which are ``active'' (
) for x,y. In
some cases,
may be a constant (
for all x,y, say), in
which case
is given explicitly as
However, in practice it is typically the case that different numbers of features
will apply at different x,y. If is not constant, then the
must be computed numerically. A simple and effective way of
doing this is by Newton's method. This method computes the solution
of an equation
iteratively by the recurrence
with an appropriate choice for and suitable attention paid to the
domain of g.