Consider the following simple fingerprinting scheme: given a document, let the
fingerprint of the document consist of the set of all possible document
substrings of length . There are
such substrings, where
l is the length of the document. Comparing two documents under this scheme is
simply a matter of counting the number of substrings common to both
fingerprints: if we compare a document A of size |A| against a document B,
and if n is the number of substrings common to both documents then
is
the measure of how much of A is contained in B. If
is chosen
appropriately, this simple fingerprint gives reliable document matching results.
We refer to this scheme as full fingerprinting. Although it is not
practical (for space reasons), it is a very useful measure of document
similarity, and we shall use it for the evaluation of our system in Section
7. (We remark that it would not be a good idea to construct
fingerprints by chopping up the document into
substrings by making a cut at every
character, because insertion
of a character at the start of the document would shift the substrings by 1 and
the resulting fingerprint would be a poor match to the original, even though the
two documents are almost identical).
The choice of in this fingerprinting scheme is particularly important,
and is subject to two conflicting constraints. If
is too small, then
there will be many false matches (e.g. if
is the size of a word, then
the scheme reduces to little more than comparing lists of words in documents, a
poor similarity metric). If
is too large, then there will be many
false negatives because one character change can affect
substrings in
the fingerprint (e.g. if
is the size of a paragraph, then a single
character change in a paragraph would effectively prevent matching for the
entire paragraph). We remark that there is no ``right'' value for
: we
cannot quantitatively or empirically calculate a definitive value for
.
In essence, the choice of
defines the notion of document similarity for
the system. Different values of
will be useful for different kinds of
searches. The value of
used in this paper is effectively around 30-45
(more precisely, our strings consist of 20 character sequences of consonants; we
discuss this in detail in Section 5). We investigate the effect of
different values of
in Section 7.