One simple strategy is random selection. However this gives poor results. For
example suppose that we have a document of length 50,000 (which gives rise to
about 50,000 possible substrings of length ) and we use 100 substring
fingerprints for storage and 1000 substring fingerprints for search. Now
consider matching the document against itself. The probability that any
particular substring appears in the storage fingerprint is
.
Hence, the expected number of substrings from the search fingerprint that match
the storage fingerprint is
(i.e. a match ratio of about
2%). The results are of course much worse for documents that are related
but not identical.
To provide more reliable matches, the selection strategy must select similar substrings from similar documents. One approach is employ a string hash function, and then a fingerprint of size n can be obtained by picking the n substrings with the lowest hash values. The approach we use is related to the hash function approach and gives similar results, but reduces false positives. We defer the details to Section 6.