Our goal is to construct a statistical model of the process which generated the training sample $\tilde{p}(x, y)$. The building blocks of this model will be a set of statistics of the training sample. In the current example we have employed several such statistics: the frequency with which in translated to either dans or en; the frequency with which it translated to either dans or au cours de; and so on. These particular statistics were independent of the context, but we could also consider statistics which depend on the conditioning information x. For instance, we might measure, in the training sample, how often the translation of in is en when April is the word following in.
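To express the event that in translates as en when April is the following word, we can introduce the indicator function
\[ f(x, y) = \begin{cases} 1 & \text{if } y = \text{en and April follows in} \\ 0 & \text{otherwise} \end{cases} \]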
The expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$ is exactly the statistic we are interested in. We denote this expected value by
\[ \tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y)\, f(x, y) \qquad (1) \]
We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short. (As with probability distributions, we will sometimes abuse notation and use $f(x, y)$ to denote both the value of f at a particular pair $(x, y)$ and the entire function f.)
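As a minimal sketch of how the empirical expected value of a feature can be computed, the Python fragment below evaluates $\tilde{p}(f)$ on a small sample. The sample itself, the representation of x as the single word following in, and the helper names `empirical_joint` and `empirical_expectation` are assumptions made purely for illustration; none of them come from the text.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample of (x, y) pairs, invented for illustration only:
# x is the word following "in", y is the observed French translation.
sample = [
    ("April", "en"), ("April", "en"), ("April", "dans"),
    ("the", "dans"), ("the", "dans"), ("the", "en"),
    ("1990", "en"), ("which", "dans"),
]

def f(x, y):
    """Indicator feature: 1 if y is "en" and "April" follows "in", else 0."""
    return 1 if (y == "en" and x == "April") else 0

def empirical_joint(pairs):
    """Empirical distribution p~(x, y): relative frequency of each pair."""
    counts = Counter(pairs)
    n = len(pairs)
    return {xy: Fraction(c, n) for xy, c in counts.items()}

def empirical_expectation(feature, pairs):
    """p~(f) = sum over (x, y) of p~(x, y) * f(x, y)."""
    p_tilde = empirical_joint(pairs)
    return sum(p * feature(x, y) for (x, y), p in p_tilde.items())

p_tilde_f = empirical_expectation(f, sample)
print(p_tilde_f)  # 1/4: two of the eight pairs are ("April", "en")
```

Because f is binary-valued, this expectation is simply the fraction of training pairs on which the feature fires.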
When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it. We do this by constraining the expected value that the model assigns to the corresponding feature function f. The expected value of f with respect to the model $p(y|x)$ is
\[ p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y) \qquad (2) \]
where $\tilde{p}(x)$ is the empirical distribution of x in the training sample. We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require
\[ p(f) = \tilde{p}(f) \qquad (3) \]
Combining (1), (2) and (3) yields the more explicit equation
\[ \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f(x, y) \]
We call the requirement (3) a constraint equation or simply a constraint. By restricting attention to those models for which (3) holds, we are eliminating from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.
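The constraint can be checked numerically for any candidate model. The following sketch, under stated assumptions, computes the model expectation of f for two candidates and tests whether each agrees with the training sample. The toy sample, the output set `translations`, and the two candidate models are all hypothetical and chosen only to make the check concrete.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample of (x, y) pairs, invented for illustration only:
# x is the word following "in", y is the observed French translation.
sample = [
    ("April", "en"), ("April", "en"), ("April", "dans"),
    ("the", "dans"), ("the", "dans"), ("the", "en"),
    ("1990", "en"), ("which", "dans"),
]
translations = ["dans", "en", "au cours de"]  # assumed set of outputs y

def f(x, y):
    """Indicator feature: 1 if y is "en" and "April" follows "in", else 0."""
    return 1 if (y == "en" and x == "April") else 0

def empirical_expectation(feature, pairs):
    """p~(f) = sum over (x, y) of p~(x, y) * f(x, y)."""
    n = len(pairs)
    return sum(Fraction(c, n) * feature(x, y)
               for (x, y), c in Counter(pairs).items())

def model_expectation(feature, pairs, model):
    """p(f) = sum over (x, y) of p~(x) * p(y|x) * f(x, y)."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    return sum(Fraction(c, n) * model(y, x) * feature(x, y)
               for x, c in x_counts.items()
               for y in translations)

def uniform_model(y, x):
    """Candidate model that ignores context: p(y|x) = 1/3 for every y."""
    return Fraction(1, len(translations))

def empirical_conditional(y, x):
    """Candidate model that copies the sample's conditional frequencies."""
    x_total = sum(1 for xi, _ in sample if xi == x)
    xy_total = sum(1 for xi, yi in sample if (xi, yi) == (x, y))
    return Fraction(xy_total, x_total)

def satisfies_constraint(feature, pairs, model):
    """True when the model expectation equals the empirical expectation."""
    return (model_expectation(feature, pairs, model)
            == empirical_expectation(feature, pairs))

print(satisfies_constraint(f, sample, uniform_model))          # False
print(satisfies_constraint(f, sample, empirical_conditional))  # True
```

On this toy sample the uniform model assigns the feature an expected value of 1/8, while the sample gives 1/4, so the constraint eliminates it. The model built from the sample's own conditional frequencies satisfies the constraint by construction.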
To sum up so far, we now have a means of representing statistical phenomena inherent in a sample of data (namely, $\tilde{p}(f)$), and also a means of requiring that our model of the process exhibit these phenomena (namely, $p(f) = \tilde{p}(f)$).
One final note about features and constraints: though the words ``feature'' and ``constraint'' are often used interchangeably in discussions of maximum entropy, we will be vigilant in distinguishing the two and urge the reader to do likewise. A feature is a binary-valued function of $(x, y)$; a constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.