Sunday, February 24, 2008

Distributed databases: BigTable, HBase and Hypertable

Since the publication of the Google paper about BigTable, people have started to make up their mind about distributed databases. BigTable is a distributed database where you can store big amounts of data. On the other hand, a lot of requirements have been relaxed in order to achieve this scalability. Therefore BigTable cannot be compared with typical RDBMs.

We can see BigTable as a big ordered map where you can insert pairs of (key, value). The data are stored sorted by key, so retrievals by key or by a range of keys are fast. You cannot create indexes using other fields. These big tables are physically stored in smaller pieces (of approximately 100 Mb) named “tablets”. These tablets are stored in a distributed file system.

A very interesting attribute of this architecture is that tablets can be compressed before being written to disk. That is very useful if you are thinking in storing text data (web pages, documents, etc). For example, a big quantity of crawled content can be stored in this database.

In conclusion, BigTable is not a distributed replacement for RDBMs. Instead, it is a big database with limited functionalities but with a big quality: It is scalable.

Two Open Source projects are developing BigTable clones. HBase is a subproject of Hadoop, so it uses the HDFS as file system. It is entirely developed in Java. See some performance numbers here and here.

The second one is Hypertable. They have selected to use C++ as language. KFS or HDFS can be used as file systems. Recently Some performance numbers have been published.

Both projects are in an early stage of development, but look very promising...

No comments: