Koala Document Fingerprinting (KDF)
KDF is an experimental system for identifying textually related documents.
The current system focuses on
computer science research documents such as technical reports,
conference papers and journal articles; however, there are many other applications.
Purpose
Most pieces of research writing evolve over a period of time. Different
versions of a document are published or otherwise made available at various stages
of this process. The purpose of the KDF system is
to trace the history of a document by identifying common segments of text
between documents. Typical uses of the system include:
- finding citations: you have a draft version of a
document and want to know where it was eventually published.
- tracing references: you have a conference abstract
and want to know if there is a revised or journal version of the article.
- tracing plagiarism: you have a paper and want to
check whether it has been previously published, in whole or in part:
  - by the same authors.
  - by different authors.
What happens during searching?
Document searches proceed in a number of steps. First, the document (given by its URL) is
loaded by the KDF server. If necessary, the document is converted to a textual
representation. Then, using this text, a fingerprint of the document is
generated. Finally, this fingerprint is matched against the current document fingerprint
database to find related documents.
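The fingerprint and matching steps can be illustrated with a minimal Python sketch. This is not the KDF implementation: it assumes a generic scheme in which every fixed-width substring of the normalized text is hashed and only the smallest hash values are kept as the fingerprint. The function names (normalize, fingerprint, related) and the parameters (width, keep, min_shared) are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only ASCII letters and digits, so layout differences vanish."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def fingerprint(text, width=20, keep=100):
    """Hash every `width`-character substring and keep the `keep` smallest hashes.

    Assumption: this selection rule is a generic one, not necessarily KDF's.
    """
    chars = normalize(text)
    hashes = set()
    for i in range(max(len(chars) - width + 1, 0)):
        digest = hashlib.sha1(chars[i:i + width].encode("ascii")).digest()
        hashes.add(int.from_bytes(digest[:4], "big"))  # 32-bit hash value
    return frozenset(sorted(hashes)[:keep])


def related(query, database, min_shared=5):
    """Return (url, shared_hash_count) pairs for likely related documents, best first."""
    scores = [(url, len(query & fp)) for url, fp in database.items()]
    matches = [s for s in scores if s[1] >= min_shared]
    return sorted(matches, key=lambda s: -s[1])
```

In use, each document in the database would be fingerprinted once and stored under its URL; a search then fingerprints the query document and calls related(query_fingerprint, database). Documents that share long passages share many hash values, while unrelated documents share essentially none.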
Limitations
Fingerprint Database: The scope of searches in the KDF system
is defined by what is currently in the fingerprint database. We are actively
seeking out papers available online; this is a lengthy process. If you know
of papers that are not currently in the database, you may add them yourself (see
the KDF home page). Alternatively, if you have a large collection of URLs,
please email them to nch@cs.cmu.edu and we will gladly add them.
The KDF system is designed for a large database of documents.
Searching is scalable, and storage requirements are low (about 500 bytes per
document). Our aim is for a database of about one million documents, including
computer science conference proceedings and journals.
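As a rough check on these figures, the two numbers quoted above imply a total fingerprint database of about 500 MB. The short Python calculation below uses only those two numbers; the variable names are illustrative.

```python
BYTES_PER_DOCUMENT = 500       # approximate figure quoted in the text
TARGET_DOCUMENTS = 1_000_000   # stated aim for the database size

total_bytes = BYTES_PER_DOCUMENT * TARGET_DOCUMENTS
print(f"{total_bytes:,} bytes = {total_bytes / 10**6:.0f} MB")
```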
Textual Conversion: To give reliable searches for documents
in a variety of representations, the basic fingerprinting techniques use textual
representations. Documents in non-textual representations must be converted to
text. The KDF system has limited support for PostScript.
Other documents must first be converted to text before they can be used.
PostScript Support: Reliable conversion from PostScript to text is
difficult, error-prone and expensive. Motivated mainly by resource constraints,
we have incorporated a limited string-based PostScript-to-text conversion
process into KDF. This is satisfactory for PostScript generated using TeX
and FrameMaker, but it is not effective on PostScript generated by Microsoft Word.
The limited PostScript support is included only for convenience: it is not
fundamental to the fingerprinting process. We hope to incorporate better
conversion processes as they become available (pointers welcome!).
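To give a feel for what a string-based conversion involves, here is a minimal Python sketch that collects the contents of PostScript string literals, which are written between parentheses. It is an assumption about the general approach, not KDF's actual converter, and the function name postscript_strings is hypothetical. It skips % comments, tracks nested parentheses, and keeps the character after a backslash literally; it does not decode octal or control escapes, and it ignores hexadecimal strings.

```python
def postscript_strings(source):
    """Return the contents of PostScript (...) string literals, in order."""
    strings = []
    current = []
    depth = 0  # parenthesis nesting depth; 0 means we are outside any string
    i = 0
    while i < len(source):
        ch = source[i]
        if depth == 0:
            if ch == "%":
                # Outside a string, % starts a comment that runs to end of line.
                end = source.find("\n", i)
                i = len(source) if end == -1 else end + 1
                continue
            if ch == "(":
                depth = 1
                current = []
        elif ch == "\\" and i + 1 < len(source):
            # Keep the escaped character as-is: correct for \( \) and \\,
            # but escapes such as \n or octal codes are not decoded.
            current.append(source[i + 1])
            i += 2
            continue
        elif ch == "(":
            depth += 1
            current.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                strings.append("".join(current))
            else:
                current.append(ch)
        else:
            current.append(ch)
        i += 1
    return strings
```

Real PostScript output often splits a single word across several strings (for example, for kerning), so a practical converter needs further heuristics to rejoin the fragments into words. This is one reason the quality of the result depends so heavily on the program that generated the file.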
Other Applications
The fingerprinting techniques used in KDF are independent of the underlying
textual objects used. While this particular implementation has focussed on
computer science research documents, the same system could be used for searching
web pages, magazine articles or speeches.
For further details...
See "Scalable Document Fingerprinting",
Second USENIX Electronic Commerce Workshop, pp. 191-200, 1996
(PostScript version).
The Koala Document Fingerprinting system was developed by Nevin Heintze while at
Carnegie Mellon University, August 1995.
"Koala Document Fingerprinting" copyright © 1995 Nevin Heintze. All Rights
Reserved.