for Web Crawling at the 2007 WWW Conference with a technique for detecting near-duplicates over a set of web pages.
They have developed a method aimed at performing near-duplicate detection over a corpus of 8B pages with a dataset of hashes of only 64 GB (Wow!).
Interesting issues tackled within the paper:
- A method (simhash) for hashing documents. It has a very interesting attribute: hashes of similar documents are very close. You could consider that 2 documents are duplicates if their Hamming distance is 3 or less. They say that 64 Bits hashes are enough for near-duplicate detection over 8B pages.
- A way to compress the hashes so that all the data needed for performing near-duplicate detection over 8B pages fit within over 32GB
- A fast way to look for duplicates at Hamming Distance of 3 or less.
- A fast way to perform batch queries using MapReduce with a speed of 1M fingerprints every 100 seconds with 200 mappers.
- A good review of the state of the art of the duplicate detection: it is shown what shingles are; the authors propose to use typical IR document vectors for document attribute extraction; some possible usages are also mentioned, as well as other algorithms for performing duplicate detection.