Monday, May 12, 2008

Paper: Detecting Near-Duplicates for Web Crawling

Three guys from Google published the paper Detecting Near-Duplicates
for Web Crawling
at the 2007 WWW Conference, presenting a technique for detecting near-duplicates over a set of web pages.

They developed a method for performing near-duplicate detection over a corpus of 8B pages with a dataset of hashes of only 64 GB (Wow!).

Interesting issues tackled in the paper:

  • A method (simhash) for hashing documents. It has a very interesting property: hashes of similar documents are very close to each other. You can consider two documents near-duplicates if the Hamming distance between their hashes is 3 or less. The authors argue that 64-bit hashes are enough for near-duplicate detection over 8B pages.
  • A way to compress the hashes so that all the data needed for performing near-duplicate detection over 8B pages fits within about 32 GB.
  • A fast way to look up duplicates at a Hamming distance of 3 or less.
  • A fast way to perform batch queries using MapReduce, processing a batch of 1M fingerprints in about 100 seconds with 200 mappers.
  • A good review of the state of the art in duplicate detection: it explains what shingles are, proposes using typical IR document vectors for extracting document features, and mentions some possible applications as well as other algorithms for performing duplicate detection.
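To make the first bullet concrete, here is a rough sketch of how a simhash-style fingerprint can be computed. The bit-voting scheme is the one commonly attributed to Charikar; the use of md5 as the per-token hash and the uniform token weights are my own simplifications for illustration, not what the paper prescribes:

```python
import hashlib

BITS = 64  # the paper argues 64-bit fingerprints suffice for 8B pages

def simhash(tokens, bits=BITS):
    """Bit-voting simhash sketch: each token votes +1/-1 per bit position."""
    v = [0] * bits
    for token in tokens:
        # Illustrative per-token hash; any good 64-bit hash would do.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint's i-th bit is set iff the vote sum is positive.
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def near_duplicates(a, b, k=3):
    """Two fingerprints are near-duplicates if they differ in at most k bits."""
    return hamming_distance(a, b) <= k
```

Because similar token sets produce mostly the same votes, small edits to a document flip only a few fingerprint bits, which is what makes the Hamming-distance test meaningful.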
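For the fast lookup at Hamming distance 3 or less, the paper builds sorted tables of permuted fingerprints. A simpler hash-table variant of the same pigeonhole idea might look like the sketch below: if two 64-bit fingerprints differ in at most 3 bits, then at least one of their four 16-bit blocks must match exactly, so indexing each block separately yields a small candidate set to verify. This is my own simplified illustration, not the paper's exact data structure:

```python
from collections import defaultdict

BITS, BLOCKS, K = 64, 4, 3  # distance <= 3 implies >= 1 of 4 blocks matches exactly

def blocks_of(fp):
    """Split a 64-bit fingerprint into four 16-bit blocks."""
    width = BITS // BLOCKS
    mask = (1 << width) - 1
    return [(fp >> (i * width)) & mask for i in range(BLOCKS)]

class NearDupIndex:
    def __init__(self):
        # One table per block position, keyed by that block's value.
        self.tables = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, fp):
        for i, b in enumerate(blocks_of(fp)):
            self.tables[i][b].append(fp)

    def query(self, fp):
        """Return all indexed fingerprints within Hamming distance K of fp."""
        found = set()
        for i, b in enumerate(blocks_of(fp)):
            for cand in self.tables[i][b]:
                if cand not in found and bin(cand ^ fp).count("1") <= K:
                    found.add(cand)
        return found
```

Only candidates that agree on at least one block are ever compared bit-by-bit, which is what keeps lookups fast even over billions of fingerprints.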