In ``A maximum entropy approach to natural language processing'' (Computational Linguistics 22:1, March 1996), the appendix describes an approach to computing the gain of a single feature $f$. This note elaborates on the equations presented there; in particular, we show how to derive equations (37) and (38).
Background
Recall the form of the model:
$$ p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big), \tag{1} $$
where
$$ Z_\lambda(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big). $$
A typical measure of the quality of a model $p$ relative to some empirical distribution $\tilde p(x, y)$ is the log-likelihood:
$$ L_{\tilde p}(p) \equiv \sum_{x, y} \tilde p(x, y) \log p(y \mid x). \tag{2} $$
Furthermore, we introduce the gain $G_{S \cup f}(\alpha)$ of a feature $f$ with weight $\alpha$, which is the increase in log-likelihood of the adjoined model
$$ p^\alpha_{S \cup f}(y \mid x) = \frac{p_S(y \mid x)\, e^{\alpha f(x, y)}}{Z_\alpha(x)}, \qquad Z_\alpha(x) \equiv \sum_y p_S(y \mid x)\, e^{\alpha f(x, y)}, $$
relative to $p_S$:
$$ G_{S \cup f}(\alpha) \equiv L_{\tilde p}(p^\alpha_{S \cup f}) - L_{\tilde p}(p_S) = -\sum_x \tilde p(x) \log Z_\alpha(x) + \alpha\, \tilde p(f). \tag{3} $$
$G_{S \cup f}(\alpha)$, maximized over $\alpha$, arises as a natural and efficient approximation to the exact gain $\Delta L(S, f)$; see section 4.3 of the paper for details. Here $\tilde p(f) \equiv \sum_{x, y} \tilde p(x, y)\, f(x, y)$ is the fraction of times $f$ is active in the empirical sample; in other words, the ``empirical expected value'' of $f$.
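To make these definitions concrete, here is a minimal numerical sketch of the quantities above on a toy discrete problem; the names (p_S, feat, p_x, p_f) and the random toy data are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_x, n_y = 4, 3                                           # sizes of the toy X and Y spaces
p_S = rng.dirichlet(np.ones(n_y), size=n_x)               # current model p_S(y|x); rows sum to 1
feat = rng.integers(0, 2, size=(n_x, n_y)).astype(float)  # a binary candidate feature f(x, y)

# A toy empirical distribution p~(x, y), normalized to sum to 1.
p_emp = rng.dirichlet(np.ones(n_x * n_y)).reshape(n_x, n_y)
p_x = p_emp.sum(axis=1)                                   # marginal p~(x)
p_f = (p_emp * feat).sum()                                # empirical expectation p~(f)

def gain(alpha):
    """G_{S u f}(alpha) = -sum_x p~(x) log Z_alpha(x) + alpha p~(f), as in (3)."""
    Z = (p_S * np.exp(alpha * feat)).sum(axis=1)          # Z_alpha(x) for every x
    return -(p_x * np.log(Z)).sum() + alpha * p_f

print(gain(0.0))   # exactly 0: alpha = 0 leaves p_S unchanged
print(gain(0.5))
```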
Deriving $G'_{S \cup f}$ and $G''_{S \cup f}$
The efficient feature selection algorithm computes for each candidate feature $f$ the optimal weight $\alpha$ that $f$ would receive if it were adjoined to the set of active features $S$, under the assumption that the other parameters $\lambda_i$ remain fixed. Since the optimal $\alpha$ occurs at a zero of $G'_{S \cup f}$, we can apply a numerical root-finding technique (such as Newton's method) to $G'_{S \cup f}$ to locate the optimal $\alpha$. Under suitable conditions (here, the concavity of $G_{S \cup f}$ in $\alpha$, established below), Newton's method is guaranteed to move monotonically closer to the root with each iteration.
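Concretely, the Newton update is $\alpha \leftarrow \alpha - G'_{S \cup f}(\alpha) / G''_{S \cup f}(\alpha)$. Below is a sketch of this iteration, continuing the toy setup from the previous snippet and using the closed forms for the derivatives that are derived next; the function names are again illustrative.

```python
import numpy as np

def g_derivatives(alpha, p_S, feat, p_x, p_f):
    """Return G'(alpha) and G''(alpha) for a candidate feature."""
    w = p_S * np.exp(alpha * feat)             # unnormalized p^alpha(y|x)
    p_a = w / w.sum(axis=1, keepdims=True)     # p^alpha_{S u f}(y|x)
    e_f = (p_a * feat).sum(axis=1)             # E_{p^alpha}[f | x]
    e_f2 = (p_a * feat**2).sum(axis=1)         # E_{p^alpha}[f^2 | x]
    g1 = p_f - (p_x * e_f).sum()               # equation (37)
    g2 = -(p_x * (e_f2 - e_f**2)).sum()        # equation (38): minus a variance
    return g1, g2

def newton_alpha(p_S, feat, p_x, p_f, alpha=0.0, tol=1e-10, max_iter=50):
    """Locate the root of G' by Newton's method; G is concave, so G'' <= 0."""
    for _ in range(max_iter):
        g1, g2 = g_derivatives(alpha, p_S, feat, p_x, p_f)
        if abs(g1) < tol or g2 == 0.0:         # g2 == 0 iff f has zero variance
            break
        alpha -= g1 / g2                       # Newton step
    return alpha
```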
Differentiating (3), we get
$$ G'_{S \cup f}(\alpha) = \tilde p(f) - \sum_x \tilde p(x)\, \frac{Z'_\alpha(x)}{Z_\alpha(x)} = \tilde p(f) - \sum_x \tilde p(x) \sum_y \frac{p_S(y \mid x)\, f(x, y)\, e^{\alpha f(x, y)}}{Z_\alpha(x)}. $$
Introducing the abbreviation $p^\alpha \equiv p^\alpha_{S \cup f}$, we can rewrite this equation as
$$ G'_{S \cup f}(\alpha) = \tilde p(f) - \sum_x \tilde p(x) \sum_y p^\alpha(y \mid x)\, f(x, y), $$
which matches equation (37) in the paper.
For $G''_{S \cup f}(\alpha)$, we differentiate once more:
$$ G''_{S \cup f}(\alpha) = -\sum_x \tilde p(x) \sum_y f(x, y)\, \frac{d}{d\alpha}\, p^\alpha(y \mid x). $$
Before things get any hairier, we will write $f$ for $f(x, y)$ and introduce $\bar f \equiv f(x, \bar y)$. The bar (in the latter case) is meant to indicate that $\bar y$ is a constant, not taking part in summations over $y$; accordingly, we rename the outer summation variable above to $\bar y$. Carrying on, we have
$$ \frac{d}{d\alpha}\, p^\alpha(\bar y \mid x) = \frac{d}{d\alpha} \left[ \frac{p_S(\bar y \mid x)\, e^{\alpha \bar f}}{Z_\alpha(x)} \right] = p^\alpha(\bar y \mid x)\, \bar f - p^\alpha(\bar y \mid x)\, \frac{Z'_\alpha(x)}{Z_\alpha(x)} = p^\alpha(\bar y \mid x) \Big( \bar f - \sum_y p^\alpha(y \mid x)\, f \Big), $$
since $Z'_\alpha(x) / Z_\alpha(x) = \sum_y p^\alpha(y \mid x)\, f$. Therefore
$$
\begin{aligned}
G''_{S \cup f}(\alpha)
  &= -\sum_x \tilde p(x) \sum_{\bar y} \bar f\, p^\alpha(\bar y \mid x) \Big( \bar f - \sum_y p^\alpha(y \mid x)\, f \Big) \\
  &= -\sum_x \tilde p(x) \left[ \sum_{\bar y} p^\alpha(\bar y \mid x)\, \bar f^{\,2} - \Big( \sum_{\bar y} p^\alpha(\bar y \mid x)\, \bar f \Big) \Big( \sum_y p^\alpha(y \mid x)\, f \Big) \right] \\
  &= -\sum_x \tilde p(x) \left[ \sum_y p^\alpha(y \mid x)\, f^2 - \Big( \sum_y p^\alpha(y \mid x)\, f \Big)^2 \right].
\end{aligned}
$$
The last line matches equation (38) in the paper. Note that it is the negative of the empirically averaged conditional variance of $f$ under $p^\alpha$, so $G''_{S \cup f}(\alpha) \le 0$ and $G_{S \cup f}$ is indeed concave in $\alpha$, justifying the use of Newton's method above.
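As a quick sanity check of the derivation, the closed forms (37) and (38) can be compared against finite differences of the gain itself. The snippet below continues the toy sketches above (reusing gain() and g_derivatives()); the printed tolerances are only rough expectations for this step size.

```python
eps, a = 1e-4, 0.3
g1, g2 = g_derivatives(a, p_S, feat, p_x, p_f)
fd1 = (gain(a + eps) - gain(a - eps)) / (2 * eps)             # central difference ~ G'(a)
fd2 = (gain(a + eps) - 2 * gain(a) + gain(a - eps)) / eps**2  # central difference ~ G''(a)
print(abs(g1 - fd1), abs(g2 - fd2))   # both should be tiny, roughly 1e-6 or smaller
```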