Monday, May 12, 2008

Paper: Detecting Near-Duplicates for Web Crawling

Three guys from Google published the paper Detecting Near-Duplicates
for Web Crawling
at the 2007 WWW Conference, presenting a technique for detecting near-duplicates over a set of web pages.

They developed a method that performs near-duplicate detection over a corpus of 8B pages using a dataset of hashes of only 64 GB (Wow!).
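The 64 GB figure follows directly from storing one 64-bit (8-byte) fingerprint per page; a quick sanity check:

```python
# Sanity check of the storage claim: 8 billion pages,
# one 64-bit (8-byte) simhash fingerprint per page.
pages = 8 * 10**9
bytes_per_fingerprint = 64 // 8
total_gb = pages * bytes_per_fingerprint / 10**9
print(total_gb)  # -> 64.0
```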

Interesting issues tackled within the paper:

  • A method (simhash) for hashing documents. It has a very interesting property: hashes of similar documents are very close to each other. Two documents can be considered near-duplicates if the Hamming distance between their hashes is 3 or less. The authors say that 64-bit hashes are enough for near-duplicate detection over 8B pages.
  • A way to compress the hashes so that all the data needed for performing near-duplicate detection over 8B pages fits within roughly 32 GB.
  • A fast way to look for duplicates at a Hamming distance of 3 or less.
  • A fast way to perform batch queries using MapReduce, at a speed of 1M fingerprints every 100 seconds with 200 mappers.
  • A good review of the state of the art in duplicate detection: shingles are explained; the authors propose using typical IR document vectors for feature extraction; some possible applications are mentioned, as well as other algorithms for performing duplicate detection.
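To give a feel for the simhash idea, here is a minimal sketch (my own illustration, not the authors' code): each document is reduced to weighted features, each feature is hashed to 64 bits, and a per-bit vote determines the final fingerprint. Similar feature sets produce fingerprints with a small Hamming distance.

```python
import hashlib

def simhash(features, bits=64):
    """Compute a simhash fingerprint from (feature, weight) pairs."""
    votes = [0] * bits
    for feature, weight in features:
        # Hash each feature to a stable 64-bit value.
        h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # Each fingerprint bit is the sign of the accumulated vote.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicates(a, b, k=3):
    # The paper's empirical threshold: distance <= 3 on 64-bit hashes.
    return hamming_distance(a, b) <= k

# Toy demo: two sentences differing by a single word.
doc1 = [(w, 1) for w in "the quick brown fox jumps over the lazy dog".split()]
doc2 = [(w, 1) for w in "the quick brown fox jumped over the lazy dog".split()]
h1, h2 = simhash(doc1), simhash(doc2)
print(hamming_distance(h1, h2))  # small for similar documents
```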

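The fast lookup at Hamming distance 3 or less can be understood through a pigeonhole argument (a simplification of mine; the paper uses a more refined scheme of permuted, sorted fingerprint tables): split each 64-bit fingerprint into 4 blocks of 16 bits, and any two fingerprints within distance 3 must agree exactly on at least one block, so exact-match tables on the blocks yield a small candidate set to verify.

```python
from collections import defaultdict

def blocks(fp, n_blocks=4, bits=64):
    """Split a fingerprint into (block_index, block_value) keys."""
    width = bits // n_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(n_blocks)]

class SimhashIndex:
    """Toy index for distance <= 3 queries via the pigeonhole principle."""

    def __init__(self):
        self.tables = defaultdict(set)

    def add(self, fp):
        # Index the fingerprint under each of its 4 block keys.
        for key in blocks(fp):
            self.tables[key].add(fp)

    def query(self, fp, k=3):
        # Any fingerprint within Hamming distance 3 of fp shares at
        # least one 16-bit block exactly, so it appears as a candidate.
        candidates = set()
        for key in blocks(fp):
            candidates |= self.tables[key]
        return [c for c in candidates if bin(c ^ fp).count("1") <= k]
```

A lookup then touches only a few hash-table buckets instead of scanning all fingerprints, which is what makes the approach practical at 8B-page scale.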

Lukas said...

Useful link, thanks!

Iván de Prado said...

Hello Lukas!

I'm adding your blog to my feeds reader.


Lukas said...

I hope you won't be bored, because I found I am blogging about the same kind of things - so you may not find any new information there :-)

Anyway, I hope you are doing well...