The task of the learner is to learn this general target function, given a sample of training data logged from users. In order to explore possible learning approaches and to determine the level of competence achievable by a learning agent, we applied the following four methods to training data collected by WebWatcher during 30 information search sessions:
Winnow
[Littlestone, 1988] learns a boolean concept represented as
a single linear threshold function of the instance features. Weights for this
threshold function are learned using a multiplicative update rule. In our
experiments we enriched the original 530 attributes by a simple transformation:
each attribute $x_i$ of an example vector was mapped to two attributes, one
equivalent to the original and the other its negation ($1 - x_i$). After the
learning phase we removed the threshold and used the output of the learned
linear function as an evaluation for instances.
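The training procedure above can be sketched as follows. This is an illustrative reconstruction, not the original implementation: the promotion/demotion factor alpha and the threshold of n/2 are common textbook defaults, not values reported in the experiments.

```python
def enrich(x):
    """Map each 0/1 attribute x_i to the pair (x_i, 1 - x_i):
    the original attribute and its negation."""
    out = []
    for v in x:
        out.extend([v, 1 - v])
    return out

def train_winnow(examples, labels, alpha=2.0, epochs=5):
    """Winnow: a linear threshold unit whose weights are updated
    multiplicatively whenever it makes a mistake."""
    n = len(examples[0])
    threshold = n / 2.0          # illustrative default threshold
    w = [1.0] * n
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
            if pred == 1 and y == 0:
                # demotion: shrink the weights of active attributes
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
            elif pred == 0 and y == 1:
                # promotion: grow the weights of active attributes
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    return w

def linear_score(w, x):
    """After training, drop the threshold and use the raw linear
    sum as the evaluation of an instance."""
    return sum(wi * xi for wi, xi in zip(w, x))
```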
Wordstat
attempts to predict whether a link is followed, based directly on the
statistics of individual words. For each
feature in the vector, it keeps two counts: a
count of the number of times this feature was set over all training examples
(total), and a count of the number of times this feature was set and
the instance was classified as positive (pos). The ratio pos/total
provides an estimate of the conditional probability that the link will be
followed, given that this feature occurs. We experimented with various ways
of combining these ratios. Of the approaches we tried, the one that worked
best in our experiments, the results of which we report here, involves
assuming that these single-word estimates are mutually independent. This
assumption allows us to combine the individual estimates in a straightforward
way. If $p_i$ are the individual probabilities, and $I$ is the
set of indices for which a bit is set in a given test vector, then the
probability that the corresponding link was followed is estimated as
$1 - \prod_{i \in I} (1 - p_i)$.
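A minimal sketch of this scheme, assuming binary feature vectors; the function names are ours, and the per-word estimates are combined under the independence assumption as one minus the product of the complements:

```python
from collections import defaultdict

def word_stats(examples, labels):
    """For each feature index keep two counts: how often the feature
    was set over all training examples (total), and how often it was
    set in a positive example (pos). Return p_i = pos/total."""
    total, pos = defaultdict(int), defaultdict(int)
    for x, y in zip(examples, labels):
        for i, bit in enumerate(x):
            if bit:
                total[i] += 1
                if y:
                    pos[i] += 1
    return {i: pos[i] / total[i] for i in total}

def combine(p, x):
    """Assuming the single-word estimates are independent:
    P(followed) = 1 - prod over set bits i of (1 - p_i)."""
    prob_not_followed = 1.0
    for i, bit in enumerate(x):
        if bit and i in p:
            prob_not_followed *= 1.0 - p[i]
    return 1.0 - prob_not_followed
```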
TFIDF
with the cosine similarity measure [Salton and McGill, 1983; Lang, 1995]
is a method developed in information retrieval. In the general case, a
vector $V$ of words is created first. In our experiments it is already given
by the representation described above. Every instance can now be represented
as a vector of the same length as $V$, replacing every word by a number.
These numbers are calculated by the formula
$w_i = \mathit{tf}_i \cdot \log(n / \mathit{df}_i)$, with
$n$ being the total number of examples, $\mathit{tf}_i$ the number of
occurrences of word $i$ in the actual example, and $\mathit{df}_i$
the number of examples word $i$ appears in. The length of the vector is
normalized to 1. Prototype vectors for each class of the target concept are
created by adding all training vectors of this class. In our case we had a
target concept with two classes: positive (link was followed by the user), and
negative (link was not followed by the user). The evaluation of an instance is
calculated by subtracting the cosine between the instance vector and the
negative prototype vector from the cosine between the instance vector and the
positive prototype vector.
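The whole pipeline can be sketched as below; this is a simplified illustration under the formula given above, not the original implementation:

```python
import math

def doc_freq(examples):
    """df_i: number of training examples in which word i occurs."""
    return [sum(1 for x in examples if x[i] > 0)
            for i in range(len(examples[0]))]

def tfidf(example, n, df):
    """w_i = tf_i * log(n / df_i), then normalize the vector to length 1."""
    vec = [tf * math.log(n / df[i]) if df[i] else 0.0
           for i, tf in enumerate(example)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def prototype(vectors):
    """Class prototype: componentwise sum of all training vectors
    belonging to that class."""
    return [sum(col) for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def evaluate(vec, proto_pos, proto_neg):
    """Score = cos(vec, positive prototype) - cos(vec, negative prototype)."""
    return cosine(vec, proto_pos) - cosine(vec, proto_neg)
```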
Random
To provide a baseline measure against which to
compare the learning methods, we also measured the performance achieved by
randomly choosing one link on the page with uniform probability. The mean
number of links per page over the data used here is 16, ranging from a minimum
of 1 to a maximum of 300.
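This baseline amounts to a uniform draw over the links on a page; with $k$ links, the chance of picking the followed link is $1/k$, roughly 6% at the mean of 16 links. A trivial sketch:

```python
import random

def random_baseline(links, rng=random):
    """Baseline: choose one link on the page uniformly at random."""
    return rng.choice(links)
```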