The task of the learner is to learn this general target function, given a sample of training data logged from users. In order to explore possible learning approaches and to determine the level of competence achievable by a learning agent, we applied the following four methods to training data collected by WebWatcher during 30 information search sessions:
Winnow
[Littlestone, 1988] learns a boolean concept represented as
a single linear threshold function of the instance features. Weights for this
threshold function are learned using a multiplicative update rule. In our
experiments we enriched the original 530 attributes by a simple transformation:
each attribute $x_i$ of an example vector was mapped to two attributes, one
equivalent to the original and the other its negation ($1 - x_i$). After the
learning phase we removed the threshold and used the output of the learned
linear function as an evaluation for instances.
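The training procedure above can be sketched as follows. This is an illustrative reconstruction, not the original implementation: the promotion/demotion factor alpha and the threshold of n/2 are common textbook defaults, not values reported in the experiments.

```python
def enrich(x):
    """Map each 0/1 attribute x_i to the pair (x_i, 1 - x_i):
    the original attribute and its negation."""
    out = []
    for v in x:
        out.extend([v, 1 - v])
    return out

def train_winnow(examples, labels, alpha=2.0, epochs=5):
    """Winnow: a linear threshold unit whose weights are updated
    multiplicatively whenever it makes a mistake."""
    n = len(examples[0])
    threshold = n / 2.0          # illustrative default threshold
    w = [1.0] * n
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
            if pred == 1 and y == 0:
                # demotion: shrink the weights of active attributes
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
            elif pred == 0 and y == 1:
                # promotion: grow the weights of active attributes
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    return w

def linear_score(w, x):
    """After training, drop the threshold and use the raw linear
    sum as the evaluation of an instance."""
    return sum(wi * xi for wi, xi in zip(w, x))
```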
Wordstat
attempts to predict whether a link is followed, based directly on the
statistics of individual words. For each
feature in the vector, it keeps two counts: a
count of the number of times this feature was set over all training examples
(total), and a count of the number of times this feature was set and
the instance was classified as positive (pos). The ratio pos/total
provides an estimate of the conditional probability that the link will be
followed, given that this feature occurs. We experimented with various ways
of combining these ratios. Of the approaches we tried, the one that worked
best in our experiments, the results of which we report here, involves
assuming that these single-word estimates are mutually independent. This
assumption allows us to combine the individual estimates in a straightforward
way. If $p_i$ are the individual probabilities, and $I$ is the
set of indices for which a bit is set in a given test vector, then the
probability that the corresponding link was followed is estimated as
$1 - \prod_{i \in I} (1 - p_i)$.
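A minimal sketch of this scheme, assuming binary feature vectors; the function names are ours, and the per-word estimates are combined under the independence assumption as one minus the product of the complements:

```python
from collections import defaultdict

def word_stats(examples, labels):
    """For each feature index keep two counts: how often the feature
    was set over all training examples (total), and how often it was
    set in a positive example (pos). Return p_i = pos/total."""
    total, pos = defaultdict(int), defaultdict(int)
    for x, y in zip(examples, labels):
        for i, bit in enumerate(x):
            if bit:
                total[i] += 1
                if y:
                    pos[i] += 1
    return {i: pos[i] / total[i] for i in total}

def combine(p, x):
    """Assuming the single-word estimates are independent:
    P(followed) = 1 - prod over set bits i of (1 - p_i)."""
    prob_not_followed = 1.0
    for i, bit in enumerate(x):
        if bit and i in p:
            prob_not_followed *= 1.0 - p[i]
    return 1.0 - prob_not_followed
```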
TFIDF
with the cosine similarity measure [Salton and McGill, 1983; Lang, 1995]
is a method developed in information retrieval. In the general case, a
vector $V$ of words is created first. In our experiments it is already given
by the representation described above. Every instance can now be represented
as a vector of the same length as $V$, replacing every word by a number.
These numbers are calculated by the formula
$w_i = \mathit{tf}_i \cdot \log(n / \mathit{df}_i)$, with
$n$ being the total number of examples, $\mathit{tf}_i$ the number of
occurrences of word $i$ in the actual example, and $\mathit{df}_i$
the number of examples word $i$ appears in. The length of the vector is
normalized to 1. Prototype vectors for each class of the target concept are
created by adding all training vectors of this class. In our case we had a
target concept with two classes: positive (link was followed by the user), and
negative (link was not followed by the user). The evaluation of an instance is
calculated by subtracting the cosine between the instance vector and the
negative prototype vector from the cosine between the instance vector and the
positive prototype vector.
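The whole pipeline can be sketched as below; this is a simplified illustration under the formula given above, not the original implementation:

```python
import math

def doc_freq(examples):
    """df_i: number of training examples in which word i occurs."""
    return [sum(1 for x in examples if x[i] > 0)
            for i in range(len(examples[0]))]

def tfidf(example, n, df):
    """w_i = tf_i * log(n / df_i), then normalize the vector to length 1."""
    vec = [tf * math.log(n / df[i]) if df[i] else 0.0
           for i, tf in enumerate(example)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def prototype(vectors):
    """Class prototype: componentwise sum of all training vectors
    belonging to that class."""
    return [sum(col) for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def evaluate(vec, proto_pos, proto_neg):
    """Score = cos(vec, positive prototype) - cos(vec, negative prototype)."""
    return cosine(vec, proto_pos) - cosine(vec, proto_neg)
```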
Random
To provide a baseline measure against which to
compare the learning methods, we also measured the performance achieved by
randomly choosing one link on the page with uniform probability. The mean
number of links per page over the data used here is 16, ranging from a minimum
of 1 to a maximum of 300.
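This baseline amounts to a uniform draw over the links on a page; with $k$ links, the chance of picking the followed link is $1/k$, roughly 6% at the mean of 16 links. A trivial sketch:

```python
import random

def random_baseline(links, rng=random):
    """Baseline: choose one link on the page uniformly at random."""
    return rng.choice(links)
```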