Fingerprints are hashed and stored using a very simple indexed file. Specifically, each substring selected for inclusion in a fingerprint is hashed to a 28-bit unsigned integer. The top 16 bits of this 28-bit hash values are used as an index into a table at the start of the file. This table consists of pointers into 256 word blocks. The first word of each block contains a chain pointer to an overflow block (if there is one), and the second word contains a count of the number of words used in the block. The remaining 254 words are used to store fingerprint entries: each entry consists of the lower 12-bits of the 28-bit substring hash value and a 20-bit document identifier (for a maximum of 1M documents), which gives a total of 32 bits, or one word per fingerprint substring. Since we have 100 substrings per (stored) fingerprint, each fingerprint occupies 400 bytes.
We also store a log of each document in the database. This log includes the document identifier, date, URL and a contact email address. Currently this information is stored as raw ascii and consists of about 50-80 bytes per document. This can be compressed substantially. Early experiments indicate that a factor of 4 should be possible.
The simple indexing scheme we have used has a number of drawbacks. First, if the file only contains a small number of fingerprints then there will be a lot of wasted space because most of the blocks will be nearly empty. Second, as more documents are added, the overhead of following overflow blocks can become significant. For example, if there are 1M documents, there will be 100M words of fingerprints, which will occupy about 400-500K blocks, giving rise to an average chain length of 6-8, or about 6000-8000 disk probes per document search (search fingerprints have size 1000). These issues can be addressed by more sophisticated indexing/disk-management schemes.