Guy Lebanon Language Technologies Institute Carnegie Mellon University August 2003 |
Information Retrieval Lab,
11-742 Lab instructor: Yiming Yang |
This toolkit is a set of C and Matlab functions implementing several methods of collaborative filtering (CF). In addition to implementing several algorithm proposed in the recent literature, we also supply functions for loading, handling and evaluating collaborative filtering methods. The main computational modules are implemented in C as mex files that are called from a Matlab environment thus achieving efficient computation of the bottlenck routines. |
Collaborative filtering (CF) is the
task of predicting the preferences of a user (called the active user) for
items unobserved by him. The preferences are predicted based on the active
user preference of a set of observed items and preference of other users.
Note that the item content does not play a role in the prediction. The
prediction are made on the basis of preference information. Adding content
information as features to the prediction algorithm may improve the result,
but in this toolkit we decided to focus on the traditional collaborative
filtering methods that does not employ content features. Breese et al. (1998) divides the approaches to collaborative filtering to the following two classes of algorithms
In this toolkit we implement the memory based algorithms of Pearson correlation coefficient, vector similarity, default voting and case amplification (see Breese et al, 1998 for a description). We also implement the model based method of personality diagnosis proposed by Pennock et al (2000). In addition, we implement routines for testing the method and evaluating them using different metrics and routines for handling the eachmovie data which is the canonical dataset for ranked collaborative filtering experiments. A more detailed description of the collaborative filtering task may be found here. |
The toolkit is intended to run from a Matlab environment. It is platform independent and should work on any of the recent Matlab versions (6-6.5). The computationally intensive routines were written in C as mex files and called from Matlab using the Mathworks application program interface. The mex files are available in a precompiled form for PC (.dll files). If the toolkit is used on a different platform, the mex files have to be compiled using Matlab's mex command. Preparing the software:
Software functionality
|
Preference are more often
than not unreported Using the spy command and the sparse matrix output from eachMovieFileReader.m it is possible to visualize the missing information pattern across the users. The following image shows a blue dote when the appropriate user (row) reported a preference on the appropriate item (column). The rows represent the first 600 users and the column the items that the first 600 users reported their preferences. Power law behavior of the number of users reporting preference of each item and the number of items each user reports The histogram of the number of users reporting the preference of the different items exhibit a linear trend in the log-log scale. The x axis represent how times items were reported and the y axis represent number of items that fall in this category. 10 bins were used in generating the histogram. The histogram of the number of items the different users report exhibit a linear trend in the log-log scale. The x axis represents number of users reporting the same items and the y axis representing number of occurrences that fall in this category. 10 bins were used in generating the histogram.
Rank-Value plots The following two graphs show rank-value graphs for the number of votes the different items have (first figure) and the number of items different users report. statistics for the graphs above were obtained from analyzing the first 1000 users. Performance comparison of different collaborative filtering strategies We compared the performance of three different CF systems: Pearson correlation coefficient, vector similarity and personality diagnosis (using standard deviation parameter of 0.7). An additional baseline of predicting a average performance for all items was also considered. The evaluation measures are mean absolute deviation, mean square error and ranked evaluation measure which measures the expected true preference of the chosen item when the probability to choose a recommended item decays exponentially with its location in a sorted list of recommendations. This is a generalization of the measure in Heckerman et al., 2000 for multiple preference values.
The experiments performed showed that Pearson correlation performed almost as well as vector similarity but personality diagnosis lags somewhat behind (yet still outperforms the trivial recommender of mean preference). Personality diagnosis has a free parameter - the standard deviation of the Gaussian model - whose optimization may improve the model's performance. |
[1] J. S. Breese, D. Heckerman and C. Kadie, Empirical
Analysis of Predictive Algorithms for Collaborative Filtering,
Uncertainty in Artificial Intelligence, 1998. [2] D. M. Pennock, E. Horvitz, S. Lawrence and C. L. Giles. Collaborative Filtering by Personality Diagnosis: A Hybrid Memory and Model-Based approach. Uncertainty in Artificial Intelligence, 2000. [3] C. M. Kadie, C. Meek and D. Heckerman. CFW: A Collaborative Filtering System Using Posteriors Over Weights of Evidence. Uncertainty in Artificial Intelligence, 2002. [4] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, C. Kadie. Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research 1, 2000. |