For all but the most simple problems, the that maximize cannot be found analytically. Instead, we must resort to numerical methods. From the perspective of numerical optimization, the function is well behaved, since it is smooth and convex- in each . Consequently, a variety of numerical methods can be used to calculate . One simple method is coordinate-wise ascent, in which is computed by iteratively maximizing one coordinate at a time. When applied to the maximum entropy problem, this technique yields the popular Brown algorithm. Other general purpose methods that can be used to maximize include gradient ascent and conjugate gradient.
An optimization method specifically tailored to the maximum entropy problem is the iterative scaling algorithm of Darroch and Ratcliff. We present here a version of this algorithm specifically designed for the problem at hand; the interested reader is referred to the further readings. The algorithm is applicable whenever the feature functions are non-negative:
This is, of course, true for the binary-valued feature functions we are considering here. The algorithm generalizes the Darroch-Ratcliff procedure, which requires, in addition to the non-negativity, that the feature functions satisfy for all x,y.
The key step in the algorithm is step (2a), the computation of the increments that solve (19). Notice that this equation contains the term , which changes as becomes updated in step (2b). For this reason, the algorithm is iterative, and requires a pass through the entire set of pairs in the empirical sample for each iteration.
Equation (19) merits a brief comment. In words, is the number of features which are ``active'' ( ) for x,y. In some cases, may be a constant ( for all x,y, say), in which case is given explicitly as
However, in practice it is typically the case that different numbers of features will apply at different x,y. If is not constant, then the must be computed numerically. A simple and effective way of doing this is by Newton's method. This method computes the solution of an equation iteratively by the recurrence
with an appropriate choice for and suitable attention paid to the domain of g.