Sunday, September 14, 2008

Paper: "IRLbot: Scaling to 6 Billion Pages and Beyond"

Two of the most complex issues to deal with when developing a crawler are URL uniqueness and host politeness. When crawling, you need to visit new pages, and to know whether a URL points to a new page you have to look it up in the set of already-crawled URLs. That is easy with a small number of URLs, but it becomes extremely hard at the scale of billions of pages.
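To make the problem concrete, here is the naive approach (my own minimal sketch, not code from the paper): keep every crawled URL in an in-memory set. This works for small crawls, but at billions of URLs the set no longer fits in RAM and every lookup degenerates into random disk access.

```python
# Naive URL-uniqueness check: an in-memory set of seen URLs.
# Fine for small crawls; at billions of URLs the set outgrows RAM.
seen = set()

def is_new(url: str) -> bool:
    """Return True the first time a URL is seen, False afterwards."""
    if url in seen:
        return False
    seen.add(url)
    return True
```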

In the paper "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov, the authors describe a hybrid memory/disk data structure called DRUM (Disk Repository with Update Management). It provides efficient storage of large collections of (key, value) pairs and maximizes the amortized throughput of insertions, updates, and lookups. To achieve that throughput, queries are processed in batches: they are buffered in memory and only executed once the in-memory structures fill up, so queries are answered asynchronously. Another advantage is that DRUM adapts to different memory/disk combinations: more memory means better performance, and more disk space means greater capacity. DRUM is essentially a variation of the bucket sort algorithm.
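To illustrate the batching idea only, here is my own simplified sketch, not the actual DRUM implementation (which uses per-bucket disk buffers, <key, value, aux> tuples, and strictly sequential I/O): keys are hashed into buckets, pending check+update requests are buffered per bucket, and when a buffer fills the whole batch is merged against that bucket's on-disk store in one pass, with callbacks delivering the asynchronous answers.

```python
import os
import pickle
import zlib
from collections import defaultdict

class MiniDrum:
    """Toy sketch of a DRUM-like batched check+update structure.

    Keys are hashed into buckets; each bucket buffers pending requests in
    memory and merges them against its on-disk store only when the buffer
    fills. The real DRUM is considerably more elaborate; this only shows
    the batching and asynchronous-answer idea.
    """

    def __init__(self, workdir, num_buckets=4, buffer_limit=1000):
        self.workdir = workdir
        self.num_buckets = num_buckets
        self.buffer_limit = buffer_limit
        self.buffers = defaultdict(list)   # bucket id -> pending (key, callback)
        os.makedirs(workdir, exist_ok=True)

    def _bucket_of(self, key):
        # Stable hash so the same key always maps to the same bucket.
        return zlib.crc32(key.encode("utf-8")) % self.num_buckets

    def _store_path(self, bucket):
        return os.path.join(self.workdir, "bucket_%d.pkl" % bucket)

    def check_update(self, key, on_new):
        """Queue a check+update; on_new(key) is called later if the key was new."""
        bucket = self._bucket_of(key)
        self.buffers[bucket].append((key, on_new))
        if len(self.buffers[bucket]) >= self.buffer_limit:
            self._merge(bucket)

    def _merge(self, bucket):
        """One batched pass: load the bucket's store, classify pending keys."""
        path = self._store_path(bucket)
        store = set()
        if os.path.exists(path):
            with open(path, "rb") as f:
                store = pickle.load(f)
        for key, on_new in self.buffers[bucket]:
            if key not in store:
                store.add(key)
                on_new(key)                # the "asynchronous" answer
        self.buffers[bucket].clear()
        with open(path, "wb") as f:
            pickle.dump(store, f)

    def flush(self):
        """Force a merge of every bucket (e.g. at shutdown)."""
        for bucket in list(self.buffers):
            self._merge(bucket)
```

With a large buffer_limit, each disk pass amortizes its cost over many queued URLs, which is the property the paper's throughput analysis builds on.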

The paper also covers techniques for dealing with spam sites and a budget system that allocates more crawling to sites that receive more inbound links.
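The exact budget formulas are in the paper; purely as an illustration of the idea, a crawler could derive a per-site page budget from the number of distinct sites linking in (the function below is hypothetical, not the one IRLbot uses):

```python
def site_budget(inlink_sites: int, base: int = 10, cap: int = 10_000) -> int:
    """Hypothetical budget rule: sites with more distinct in-linking sites
    get a larger crawl budget, up to a cap. Not the formula from the paper."""
    return min(cap, base + inlink_sites)
```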

DRUM is the best approach I have seen so far to the problem of URL uniqueness. The paper is a “must read” if you are working with crawlers.