Thursday, February 21, 2008

Querying and Analyzing Big Data Sets

Using SQL and databases to analyze and extract data from datasets is common practice. Clauses like GROUP BY and ORDER BY, together with aggregation functions like COUNT and AVG, are useful and flexible enough for most needs. Tasks such as generating statistics from log files or extracting information from a dataset are easy with SQL.
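As a minimal sketch of the kind of log analysis meant here, using SQLite through Python's standard library (the table and field names — requests, user, status, bytes — are hypothetical):

```python
import sqlite3

# Hypothetical request log: one row per HTTP request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (user TEXT, status INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [("alice", 200, 512), ("alice", 404, 0), ("bob", 200, 2048)],
)

# Request count and average payload per user, heaviest users first:
# GROUP BY, ORDER BY and the COUNT/AVG aggregates do all the work.
rows = conn.execute(
    "SELECT user, COUNT(*), AVG(bytes) FROM requests "
    "GROUP BY user ORDER BY AVG(bytes) DESC"
).fetchall()

for user, count, avg_bytes in rows:
    print(user, count, avg_bytes)
```

This works well right up to the point where the table no longer fits on one machine.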

But the problem comes when you have a very big dataset (gigabytes or terabytes of data). At that scale, a single database simply does not cope: the computation needs to be distributed. There are two projects that can help.

Pig is a project built on top of Hadoop. As they say:

The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or Relational Algebra.

The computations in Pig are expressed in a language named Pig Latin, which provides constructs similar to SQL's but more powerful. The data is stored in the Hadoop Distributed File System (HDFS), and the computations are distributed across your Hadoop cluster, which means you can query terabytes of data. Pig is already usable, and a new version with many improvements is planned for the coming months.
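For a flavour of Pig Latin, here is a sketch of the classic grouping example: load a log, group by user, and count the records in each group (the file and field names are hypothetical):

```pig
-- Each statement names an intermediate relation; Pig compiles the
-- whole script into distributed jobs over the Hadoop cluster.
records = LOAD 'logs' AS (user, query);
grouped = GROUP records BY user;
counts  = FOREACH grouped GENERATE group, COUNT(records);
STORE counts INTO 'query_counts';
```

The same shape as an SQL GROUP BY query, but written as explicit steps that the engine can parallelize.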

Jaql is another project, younger than Pig. Jaql is a query language for JSON, inspired by SQL, XQuery and Pig Latin. The plan is to make it distributed using Hadoop; files can be read from HDFS or HBase.
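Jaql's own syntax aside, the kind of JSON filtering and transformation it targets looks like this in plain Python (the record structure is hypothetical, not Jaql code):

```python
import json

# Hypothetical JSON log records, of the sort Jaql is designed to query.
data = json.loads("""[
  {"user": "alice", "action": "search", "hits": 10},
  {"user": "bob",   "action": "search", "hits": 0},
  {"user": "alice", "action": "click",  "hits": 1}
]""")

# Roughly: "select user, hits where action = 'search' and hits > 0".
results = [
    {"user": r["user"], "hits": r["hits"]}
    for r in data
    if r["action"] == "search" and r["hits"] > 0
]
print(results)
```

A query language lets you write the filter once and have the engine run it over files too big for a single script like this.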

In conclusion, these two projects can help a lot when extracting information from big datasets.
