Controlling the costs of collaborative filtering

To implement a collaborative filter, there must be some way of distributing a user's votes for or against articles to the users who might want this information. To be practical, however, this information exchange must place only a small overhead on the existing information stream. To illustrate that this is nontrivial, let us consider a simple example of how filtering information might be transported.

Consider an average-sized newsgroup such as comp.arch. This group is read by about 80,000 people and has an average of 1,000 posts a month, so each reader is presented with approximately 30 new messages a day. Let us assume that 1% of the readership (800 people) participate in our filtering system. The simplest way for these users to distribute their votes would be to post them as an article to some shadow newsgroup, say comp.arch.votes. The existing Net News system would then automatically distribute the votes to all sites throughout the world, and anyone who wanted to use the votes for filtering could simply have their software read the votes out of the shadow group.

Now consider what is happening to that shadow group. The shadow group is receiving 800 messages a day. Further, since we will want to store the votes for at least as long as we store the messages to which they refer, and the typical expiration time on Net News is 14 days, the shadow group will have on the order of 11,000 messages in it. To make matters worse, since each message contains only one person's opinion of a message, our software will have to read all 11,000 messages before it can accumulate the collaborative filtering information we need to filter the 30 new articles of actual information.
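The arithmetic behind these figures can be checked with a short sketch. The readership, posting rate, participation rate, and expiration time come from the text; the function name, the 30-day month, and the reading of "800 messages a day" as one vote article per participant per day are assumptions made for illustration.

```python
def shadow_group_load(readers, posts_per_month, participation,
                      expiry_days, days_per_month=30):
    """Estimate the traffic a shadow vote group would carry.

    Assumes (as the 800-messages-a-day figure implies) that each
    participant posts one vote article per day.
    """
    voters = round(readers * participation)
    new_articles_per_day = posts_per_month / days_per_month
    vote_messages_per_day = voters
    stored_vote_messages = vote_messages_per_day * expiry_days
    return {
        "voters": voters,
        "new_articles_per_day": new_articles_per_day,
        "vote_messages_per_day": vote_messages_per_day,
        "stored_vote_messages": stored_vote_messages,
        # Vote messages to be read for each new article filtered.
        "overhead_ratio": stored_vote_messages / new_articles_per_day,
    }


load = shadow_group_load(readers=80_000, posts_per_month=1_000,
                         participation=0.01, expiry_days=14)
for name, value in load.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions the shadow group holds 11,200 vote messages, several hundred for every new article a reader actually wants filtered.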

There is one benefit to the above system, however: the end user has access to complete information about the source of each vote. Having access to this information allows the end user to be very sophisticated in the design of the filter, since the filter can place more weight on some opinions and less on others. The filter could track the long-term behavior of opinion providers to determine which are the best predictors, or use a clustering algorithm to identify a group of peers with similar interests. This is the approach taken by GroupLens.[23]

There is an unavoidable natural tension between the completeness of the collaborative information we make available for use and the cost we must pay to transport it around. In this case, the completeness of the information refers to our ability to associate an opinion with the human being who created that opinion. Considering that we have already decided to provide anonymous voting, the resolution of this tension seems clear: information about the net-wide opinion of an article will be available only in summary form.

Even if there were a way to cheaply transport information about the source of each vote, the collaborative system would still have to find ways of ensuring the privacy of its users. By aggregating vote information into summaries we blur the information about where a vote originated, in the same way the U.S. Census Bureau provides privacy to U.S. citizens by only reporting census information aggregated over city-block sized units. Summaries are also ideal for minimizing the cost of transporting vote information, as information about each article only needs to be stored once. This makes the summary very compact. Further, it is trivial to combine two summaries to form a third which takes up little more space than either of the originals, yet carries the combined information.
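A minimal sketch of such a summary follows, to show why combining is cheap and why voter identity is lost. The class name, its fields, and the article identifiers are hypothetical and are not the format used by the system described here; the sketch also assumes that the two summaries being combined cover disjoint sets of voters, since otherwise votes would be counted twice.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VoteSummary:
    """Per-article vote tallies that keep no record of who voted."""
    votes_for: Counter = field(default_factory=Counter)
    votes_against: Counter = field(default_factory=Counter)

    def record(self, article_id: str, in_favor: bool) -> None:
        # Only the count changes; the voter's identity is never stored.
        tally = self.votes_for if in_favor else self.votes_against
        tally[article_id] += 1

    def merge(self, other: "VoteSummary") -> "VoteSummary":
        # Adding Counters sums the tallies article by article, so the
        # result still holds one entry per article, however many voted.
        return VoteSummary(self.votes_for + other.votes_for,
                           self.votes_against + other.votes_against)


site_a = VoteSummary()
site_a.record("article-1", in_favor=True)
site_a.record("article-1", in_favor=False)
site_a.record("article-2", in_favor=True)

site_b = VoteSummary()
site_b.record("article-1", in_favor=True)
site_b.record("article-3", in_favor=False)

combined = site_a.merge(site_b)
print(combined.votes_for["article-1"], combined.votes_against["article-1"])
```

The combined summary grows with the number of distinct articles, not with the number of voters, which is what keeps it compact.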

There must be some way of accessing non-summarized information, however, if we are to implement group and custom moderation. To provide moderation, we do not need to know the opinion of the net as a whole, but only the opinions of the several people whose judgments we trust. Our system provides a direct, point-to-point means of obtaining this information when it is specifically required. This client-pull approach will save network bandwidth in what we assume is the common case of users mainly requesting the opinions of other users at the same or nearby sites. If one set of opinions were to become requested by many nonlocal users (e.g., if someone started to sell their votes as a moderation service), then Uniform Resource Names (URNs), a technology related to URLs, could be used to replicate and distribute the collaborative information.[8]


David A. Maltz (dmaltz@cs.cmu.edu)