Frequently, only four classes are used (student, faculty, course, project); this subset is typically called WebKB4. This is not to be confused with the 4 universities subset, which includes web pages from Cornell, Washington, Wisconsin and Texas, but not pages from the misc collection.
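The difference between the two subsets can be shown with a minimal sketch. It assumes each page is held as a hypothetical (class, university, filename) tuple. The tuple layout, the lowercase labels, the function names and the sample records are illustrative assumptions, not part of the distributed data.

```python
# Assumed label spellings; adjust to match how your copy of the data is labelled.
WEBKB4_CLASSES = {"student", "faculty", "course", "project"}
FOUR_UNIVERSITIES = {"cornell", "washington", "wisconsin", "texas"}


def webkb4(pages):
    """Keep pages whose class is one of the four WebKB4 classes."""
    return [p for p in pages if p[0] in WEBKB4_CLASSES]


def four_universities(pages):
    """Keep pages from the four named universities (drops the misc collection)."""
    return [p for p in pages if p[1] in FOUR_UNIVERSITIES]


# Hypothetical records: (class, university, filename)
pages = [
    ("student", "cornell", "a.html"),
    ("staff", "texas", "b.html"),
    ("course", "misc", "c.html"),
]

print(webkb4(pages))             # filters by class only
print(four_universities(pages))  # filters by university only
```

The two filters are independent: WebKB4 restricts the class labels and may still include misc pages, while the 4 universities subset restricts the source and may still include classes outside the four.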
Some learning algorithms use both the web page text and the hyperlink structure. A relational representation of the pages and hyperlinks in the 4 universities subset is available. Also available is a collection of anchor text and full text from the 4 universities data, for discriminating between course and non-course pages.