In order to learn and utilize knowledge of the target function,
it is necessary first to choose an appropriate representation for
this function. This representation must be compatible with available
learning methods, and must allow the agent to evaluate learned knowledge
efficiently (i.e., with a delay negligible compared to typical page access
delays on the web). One complication is that web pages, information
associated with hyperlinks, and user information goals are all predominantly
text-based, whereas most machine learning methods assume a more structured
data representation such as a feature vector. We have experimented with a
variety of representations that re-represent the arbitrary-length text
associated with pages, links, and goals as a fixed-length feature vector.
This idea is common within information retrieval systems
[Salton and McGill, 1983]. It offers the advantage that the information in an arbitrary
amount of text is summarized in a fixed-length feature vector compatible with
current machine learning methods. It also carries the disadvantage that much
information is lost by this re-representation.
Table: Encoding of selected information for a given Page, Link, and
Goal.
The experiments described here all use the same representation. Information about the current Page, the user's information search Goal, and a particular outgoing Link is represented by a vector of approximately 530 boolean features, each feature indicating the occurrence of a particular word within the text that originally defines these three attributes. The vector of 530 features is composed of four concatenated subvectors.
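As a sketch, the boolean word-occurrence encoding described above might be implemented as follows; the small vocabulary is an illustrative stand-in for the roughly 530 selected words, and the function name is hypothetical:

```python
def encode(text, vocabulary):
    """Map arbitrary-length text to a fixed-length boolean feature vector:
    one 0/1 feature per vocabulary word, set when that word occurs in the text."""
    words = set(text.lower().split())
    return tuple(1 if w in words else 0 for w in vocabulary)

# Illustrative vocabulary (stand-in for the ~530 words chosen by mutual information)
vocab = ["learning", "web", "page", "agent"]
print(encode("The agent follows a web page link", vocab))  # -> (0, 1, 1, 1)
```

Concatenating one such subvector per field yields the single fixed-length vector fed to the learner; note that word order and word frequency are among the information lost by this re-representation.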
To choose the encodings for the first three fields, it was necessary to select
which words would be considered. In each case, the words were selected by
first gathering every distinct word that occurred over the training set, then
ranking these according to their mutual information with respect to correctly
classifying the training data, and finally choosing the top N words in this
ranking. Mutual information is a standard statistical
measure (see, e.g.,
[Quinlan, 1993]) of the degree to which knowing the value of an individual
feature (in this case the occurrence of a word) reduces uncertainty about the
class of the observed data.
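A minimal sketch of this word-selection step, assuming the usual definition of mutual information between a word's boolean occurrence and the class label; the function names and toy data below are illustrative, not taken from the original system:

```python
import math
from collections import Counter

def mutual_information(texts, labels, word):
    """I(word; label) in bits, where the word feature is boolean occurrence."""
    n = len(texts)
    joint = Counter()
    for text, y in zip(texts, labels):
        x = word in text.lower().split()
        joint[(x, y)] += 1
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_y = sum(v for (_, yy), v in joint.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def top_n_words(texts, labels, n):
    """Rank every distinct training-set word by mutual information, keep the top n."""
    vocab = {w for t in texts for w in t.lower().split()}
    return sorted(vocab, key=lambda w: -mutual_information(texts, labels, w))[:n]
```

For example, on the toy set `["good link here", "good page", "bad link", "bad page"]` with labels `[1, 1, 0, 0]`, the word "good" perfectly predicts the class (1 bit of mutual information) while "link" carries none, so "good" and "bad" would head the ranking.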
Figure 1 summarizes the encoding of information about the current Page, Link, and Goal.