Main Page Namespace List Class Hierarchy Alphabetical List Compound List File List Namespace Members Compound Members File Members Related Pages

Probabilistic Latent Semantic Analysis

This application (PLSA.cpp) will either perform PLSA on a collection, building three probability tables: P(z), P(d|z), and P(w|z) where z in Z are the latent variables (categories), d in D are the documents in the collection, and w in W are the terms in the vocabulary over the collection, or open those tables and read them into memory to illustrate their potential use.

The parameter doTrain (true|false) determines whether the tables are constructed or read. The default value is true.

The other parameters accepted by PLSA are:

index -- the index to use. Default is none.
numCats -- the number of latent variables (categories) to use. Default is 20.
beta -- The value of beta for Tempered EM (TEM). Default is 1.
betaMin -- The minimum value for beta, TEM iterations stop when beta falls below this value. Default is 0.6.
eta -- Multiplier to scale beta before beginning a new set of TEM iterations. Must be less than 1. Default is 0.92.
annealcue -- Minimum allowed difference between likelihood in consecutive iterations. If the difference is less than this, beta is updated. Default is 0.
numIters -- Maximum number of iterations to perform. Default is 100.
numRestarts -- Number of times to recompute with different random seeds. Default is 1.
testPercentage -- Percentage of events (d,w) to hold out for validation.
doTrain -- whether to construct the probability tables or read them in. Default is true.

Generated on Wed Nov 3 13:00:03 2004 for Lemur Toolkit by

1.2.18