In ``A maximum entropy approach to natural language processing'' (Computational Linguistics 22:1, March 1996), the appendix describes an approach to computing the gain of a single feature $f$. This note elaborates on the equations presented there; in particular, we show how to derive equations (37) and (38).
Background
Recall the form of the model:
$$
p_\lambda(y \mid x) \;=\; \frac{1}{Z_\lambda(x)} \exp\Bigl(\sum_i \lambda_i f_i(x,y)\Bigr), \tag{1}
$$
where
$$
Z_\lambda(x) \;=\; \sum_y \exp\Bigl(\sum_i \lambda_i f_i(x,y)\Bigr).
$$
A typical measure of the quality of a model $p$ relative to some empirical distribution $\tilde p$ is the log-likelihood:
$$
L_{\tilde p}(p) \;\equiv\; \sum_{x,y} \tilde p(x,y) \log p(y \mid x). \tag{2}
$$
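For concreteness, here is a minimal numerical sketch of equations (1) and (2). It assumes the events have been enumerated so that everything lives in dense arrays: `feats` holds the feature values $f_i(x,y)$ with shape `(n_features, X, Y)`, `lam` holds the weights $\lambda_i$, and `p_tilde_xy` holds $\tilde p(x,y)$; these names are illustrative, not from the paper.

```python
import numpy as np

def model(lam, feats):
    """p_lambda(y|x) of equation (1), returned as an (X, Y) array."""
    scores = np.exp(np.tensordot(lam, feats, axes=1))   # exp(sum_i lam_i f_i(x,y))
    return scores / scores.sum(axis=1, keepdims=True)   # normalize by Z_lambda(x)

def log_likelihood(p, p_tilde_xy):
    """L_{p~}(p) of equation (2)."""
    return (p_tilde_xy * np.log(p)).sum()
```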
Furthermore, we introduce the gain of a feature $f$: writing $p_S$ for the model built on the set $S$ of currently active features, the gain is the increase in log-likelihood of the model
$$
p^\alpha_{S \cup f}(y \mid x) \;\equiv\; \frac{p_S(y \mid x)\, e^{\alpha f(x,y)}}{Z_\alpha(x)},
\qquad
Z_\alpha(x) \;\equiv\; \sum_y p_S(y \mid x)\, e^{\alpha f(x,y)},
$$
relative to $p_S$:
$$
G_{S \cup f}(\alpha) \;\equiv\; L(p^\alpha_{S \cup f}) - L(p_S)
\;=\; \alpha\, \tilde p(f) \;-\; \sum_x \tilde p(x) \log Z_\alpha(x). \tag{3}
$$
$G_{S \cup f}(\alpha)$ arises as a natural and efficient approximation to the exact gain $\Delta L(S, f) \equiv L(p_{S \cup f}) - L(p_S)$, which would require refitting every parameter of the adjoined model; see section 4.3 of the paper for details. Here
$$
\tilde p(f) \;\equiv\; \sum_{x,y} \tilde p(x,y)\, f(x,y)
$$
is the fraction of times $f$ is active in the empirical sample; in other words, the ``empirical expected value'' of $f$.
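In the same hypothetical array layout as above, equation (3) can be evaluated directly; `p_S` is the current model $p_S(y \mid x)$ as an `(X, Y)` array and `f` is the candidate feature $f(x,y)$:

```python
import numpy as np

def gain(alpha, p_S, f, p_tilde_xy):
    """Approximate gain G_{S+f}(alpha) of equation (3)."""
    p_tilde_x = p_tilde_xy.sum(axis=1)                 # marginal p~(x)
    p_tilde_f = (p_tilde_xy * f).sum()                 # empirical expectation p~(f)
    Z_alpha = (p_S * np.exp(alpha * f)).sum(axis=1)    # Z_alpha(x), one entry per x
    return alpha * p_tilde_f - (p_tilde_x * np.log(Z_alpha)).sum()
```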
Deriving equations (37) and (38)
The efficient feature selection algorithm computes, for each candidate feature $f$, the optimal weight $\alpha$ that $f$ would receive if it were adjoined to the set of active features $S$, under the assumption that the other parameters remain fixed. Since the optimal $\alpha$ occurs at $G'_{S \cup f}(\alpha) = 0$, we can apply a numerical root-finding technique (such as Newton's method) to $G'_{S \cup f}$ to locate the optimal $\alpha$. Under suitable conditions (here, the concavity of $G_{S \cup f}$ in $\alpha$, established by equation (38) below), such a root-finding technique is guaranteed to move monotonically closer to the root with each iteration.
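The root-finding step itself is a few lines; this hypothetical helper takes callables `G_prime` and `G_double_prime` implementing equations (37) and (38) (a concrete sketch of both closes this note):

```python
def newton_optimal_alpha(G_prime, G_double_prime, alpha0=0.0,
                         tol=1e-10, max_iter=100):
    """Locate the alpha at which G'(alpha) = 0 by Newton's method."""
    alpha = alpha0
    for _ in range(max_iter):
        g1 = G_prime(alpha)
        if abs(g1) < tol:                    # close enough to the root
            break
        alpha -= g1 / G_double_prime(alpha)  # Newton update
    return alpha
```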
Differentiating (3), we get
$$
G'_{S \cup f}(\alpha)
\;=\; \tilde p(f) \;-\; \sum_x \tilde p(x)\, \frac{1}{Z_\alpha(x)} \sum_y p_S(y \mid x)\, f(x,y)\, e^{\alpha f(x,y)}.
$$
Introducing the abbreviation $p^\alpha(y \mid x) \equiv p^\alpha_{S \cup f}(y \mid x) = p_S(y \mid x)\, e^{\alpha f(x,y)} / Z_\alpha(x)$,
we can rewrite this equation as
$$
G'_{S \cup f}(\alpha) \;=\; \tilde p(f) \;-\; \sum_x \tilde p(x) \sum_y p^\alpha(y \mid x)\, f(x,y),
$$
which matches equation (37) in the paper.
For $G''_{S \cup f}(\alpha)$, we differentiate once more:
$$
G''_{S \cup f}(\alpha)
\;=\; -\sum_x \tilde p(x) \sum_y f(x,y)\, \frac{d}{d\alpha}\, p^\alpha(y \mid x),
$$
where, by the quotient rule,
$$
\frac{d}{d\alpha}\, p^\alpha(y \mid x)
\;=\; p^\alpha(y \mid x) \Bigl( f(x,y) \;-\; \sum_{y'} p^\alpha(y' \mid x)\, f(x,y') \Bigr).
$$
Before things get any hairier, we will write $f$ for $f(x,y)$ and introduce
$$
\bar f \;\equiv\; \sum_y p^\alpha(y \mid x)\, f(x,y).
$$
The notation (in the latter case) is meant to
indicate that $\bar f$ is a constant, not taking part in
summations over $y$. Carrying on, we have
$$
\begin{aligned}
G''_{S \cup f}(\alpha)
&= -\sum_x \tilde p(x) \sum_y p^\alpha(y \mid x)\, f\,(f - \bar f) \\
&= -\sum_x \tilde p(x) \Bigl( \sum_y p^\alpha(y \mid x)\, f^2 \;-\; \bar f \sum_y p^\alpha(y \mid x)\, f \Bigr) \\
&= -\sum_x \tilde p(x) \Bigl( \sum_y p^\alpha(y \mid x)\, f^2 \;-\; \bar f^{\,2} \Bigr) \\
&= -\sum_x \tilde p(x) \biggl( \sum_y p^\alpha(y \mid x)\, f(x,y)^2 \;-\; \Bigl( \sum_y p^\alpha(y \mid x)\, f(x,y) \Bigr)^{\!2} \biggr).
\end{aligned}
$$
The last line matches equation (38) in the paper. Note that the parenthesized quantity is precisely the variance of $f$ under $p^\alpha(\cdot \mid x)$, so $G''_{S \cup f}(\alpha) \le 0$ everywhere: $G_{S \cup f}$ is concave in $\alpha$, as promised above.
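To close the loop, here is a sketch of equations (37) and (38) in the same hypothetical array layout as above, packaged so that the result can be handed straight to the Newton helper:

```python
import numpy as np

def make_derivatives(p_S, f, p_tilde_xy):
    """Return callables computing G'(alpha) and G''(alpha) via (37) and (38)."""
    p_tilde_x = p_tilde_xy.sum(axis=1)        # p~(x)
    p_tilde_f = (p_tilde_xy * f).sum()        # p~(f)

    def p_alpha(alpha):                       # p^alpha(y|x) as an (X, Y) array
        w = p_S * np.exp(alpha * f)
        return w / w.sum(axis=1, keepdims=True)

    def G_prime(alpha):                       # equation (37)
        mean_f = (p_alpha(alpha) * f).sum(axis=1)          # sum_y p^alpha(y|x) f(x,y)
        return p_tilde_f - (p_tilde_x * mean_f).sum()

    def G_double_prime(alpha):                # equation (38)
        pa = p_alpha(alpha)
        mean_f = (pa * f).sum(axis=1)
        mean_f2 = (pa * f**2).sum(axis=1)
        return -(p_tilde_x * (mean_f2 - mean_f**2)).sum()  # minus an expected variance

    return G_prime, G_double_prime
```

With `G_prime` and `G_double_prime` in hand, `newton_optimal_alpha` returns the maximizing $\alpha$, and `gain` evaluated there gives the approximate gain of $f$; a finite-difference check of `gain` against `G_prime` is an easy way to validate the derivation numerically.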