One simple strategy is random selection. However this gives poor results. For example suppose that we have a document of length 50,000 (which gives rise to about 50,000 possible substrings of length ) and we use 100 substring fingerprints for storage and 1000 substring fingerprints for search. Now consider matching the document against itself. The probability that any particular substring appears in the storage fingerprint is . Hence, the expected number of substrings from the search fingerprint that match the storage fingerprint is (i.e. a match ratio of about 2%). The results are of course much worse for documents that are related but not identical.
To provide more reliable matches, the selection strategy must select similar substrings from similar documents. One approach is employ a string hash function, and then a fingerprint of size n can be obtained by picking the n substrings with the lowest hash values. The approach we use is related to the hash function approach and gives similar results, but reduces false positives. We defer the details to Section 6.