Sunday, February 24, 2008

Distributed databases: BigTable, HBase and Hypertable

Since the publication of Google's paper on BigTable, people have started to think seriously about distributed databases. BigTable is a distributed database in which you can store huge amounts of data. On the other hand, many requirements have been relaxed in order to achieve this scalability, so BigTable cannot be compared with a typical RDBMS.

We can see BigTable as a big ordered map in which you insert (key, value) pairs. The data are stored sorted by key, so retrievals by key or by a range of keys are fast. You cannot create indexes on other fields. These big tables are physically stored in smaller pieces (of approximately 100 MB) named “tablets”, which are stored in a distributed file system.
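The access pattern can be sketched with a plain in-memory sorted map in Java (an illustration of the idea only; the class and key names here are invented, and the real BigTable is distributed and persistent):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy illustration (not BigTable code): keeping rows sorted by key makes
// lookups by key and range scans over keys cheap.
public class SortedTable {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    public void put(String key, String value) {
        rows.put(key, value);
    }

    // All rows whose key falls in [from, to) -- a "scan", cheap because
    // the data are kept sorted by key.
    public NavigableMap<String, String> scan(String from, String to) {
        return rows.subMap(from, true, to, false);
    }

    public static void main(String[] args) {
        SortedTable t = new SortedTable();
        t.put("com.example/a", "page A");
        t.put("com.example/b", "page B");
        t.put("org.example/c", "page C");
        // Range scan over every key under com.example
        System.out.println(t.scan("com.example/", "com.example0").keySet());
    }
}
```

Because related keys (here, pages of the same domain) sort next to each other, a range scan touches one contiguous slice of the table, which is exactly what makes tablet-based storage efficient.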

A very interesting property of this architecture is that tablets can be compressed before being written to disk. That is very useful if you plan to store text data (web pages, documents, etc.). For example, a large quantity of crawled content can be stored in such a database.

In conclusion, BigTable is not a distributed replacement for RDBMSs. Instead, it is a big database with limited functionality but one big quality: it is scalable.

Two open source projects are developing BigTable clones. HBase is a subproject of Hadoop, so it uses HDFS as its file system. It is entirely developed in Java. See some performance numbers here and here.

The second one is Hypertable. They chose C++ as the implementation language, and either KFS or HDFS can be used as the file system. Recently, some performance numbers have been published.

Both projects are in an early stage of development, but look very promising...

Thursday, February 21, 2008

Coordination of services in a distributed system

ZooKeeper is a service to coordinate processes in a distributed system. As they say:

“Coordinating processes of a distributed system is challenging as there are often problems that arise when implementing synchronization primitives, such as race conditions and deadlocks. With ZooKeeper, our main goal is to make less complex to coordinate the actions of processes in a distributed system despite failures of servers and concurrency.”

It seems similar to Google's Chubby. In summary, the system is like a directory service that you can trust, so ZooKeeper can be used to solve synchronization and coordination issues in distributed systems.
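One classic pattern such a “trusted directory” enables is leader election: every process registers under a sequentially numbered node, and the process holding the lowest surviving number leads. A minimal in-memory sketch of the pattern in Java (an illustration only, not the ZooKeeper API; all names are invented):

```java
import java.util.TreeMap;

// Toy sketch of the coordination pattern services like ZooKeeper or Chubby
// support: processes register sequentially numbered nodes; when a process
// dies its node is removed, and the lowest remaining number is the leader.
public class ElectionDirectory {
    private long counter = 0;
    private final TreeMap<Long, String> nodes = new TreeMap<>();

    // A process joins and receives its sequence number.
    public synchronized long register(String processName) {
        long seq = counter++;
        nodes.put(seq, processName);
        return seq;
    }

    // A process (or the service, on detecting its failure) removes its node.
    public synchronized void remove(long seq) {
        nodes.remove(seq);
    }

    // The surviving process with the lowest sequence number leads.
    public synchronized String leader() {
        return nodes.isEmpty() ? null : nodes.firstEntry().getValue();
    }
}
```

The hard part in a real system is doing this reliably across machine failures, which is precisely what ZooKeeper takes care of.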

Very interesting… Would it be integrated in Hadoop and HBase?

Querying and Analysis of Big Data Sets

Using SQL and databases to analyze and extract data from datasets is common practice. Clauses like GROUP BY and ORDER BY and aggregation functions like COUNT and AVG are useful and flexible enough. Tasks such as generating statistics from log files or extracting information from a dataset are easy with SQL.
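As an illustration of what GROUP BY plus COUNT computes, here is the same aggregation over a few fake log lines in plain Java (the log format and all names are invented):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a "SELECT url, COUNT(*) ... GROUP BY url" aggregation done by
// hand: group log lines by their URL field and count each group.
public class LogStats {
    public static Map<String, Long> hitsPerUrl(List<String> logLines) {
        // Each fake line is "<ip> <url>"; key by the URL field and count.
        return logLines.stream()
                .map(line -> line.split(" ")[1])
                .collect(Collectors.groupingBy(url -> url, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "1.2.3.4 /index.html",
                "5.6.7.8 /index.html",
                "1.2.3.4 /about.html");
        System.out.println(hitsPerUrl(log));
    }
}
```

On a single machine this is trivial; the point of the projects below is to run the same kind of aggregation when the log no longer fits on one machine.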

But the problem comes when you have a very big dataset (gigabytes or terabytes of data). In those cases, traditional databases simply do not work: the computation needs to be distributed. There are two projects that can help.

Pig is a project built on top of Hadoop. As they say:

The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or Relational Algebra.

Computations in Pig are expressed in a language named PigLatin, which provides constructs similar to SQL's, but more powerful. The data are stored in the Hadoop Distributed File System (HDFS), and the computations are distributed over your Hadoop cluster. That means you can query terabytes of data. Pig is already working, and a new version with many improvements is planned for the next months.
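To give the flavour, a hypothetical PigLatin sketch (file and relation names invented) for the hits-per-URL kind of task that GROUP BY and COUNT would express in SQL:

```
-- load, group and count; each step names an intermediate relation
log   = LOAD 'access_log' AS (ip, url);
byurl = GROUP log BY url;
hits  = FOREACH byurl GENERATE group, COUNT(log);
STORE hits INTO 'hits_per_url';
```

Each statement is compiled into MapReduce jobs and executed over the cluster.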

Jaql is another project, younger than Pig. Jaql is a query language for JSON inspired by SQL, XQuery and PigLatin. They are planning to distribute it using Hadoop. Files can be read from HDFS or HBase.

In conclusion, these two projects can help a lot when extracting information from big datasets.

Friday, February 15, 2008

Spanish version of Properazzi

Today we launched the new Spanish version of the real estate portal Properazzi.

It includes a great many changes, among them a very clean and usable web interface and devilishly fast search. I think it is a big improvement. I invite you to try it.

Properazzi is a real estate portal with worldwide coverage, listing 4 million properties around the world.

Thursday, February 14, 2008

Reading Hadoop SequenceFiles from Pig

A trick for reading SequenceFiles generated by Hadoop into Pig:


// Pig LoadFunc that reads Hadoop SequenceFiles. MyKey and MyValue are the
// Writable key/value classes the file was written with.
public class SequenceFileStorage implements LoadFunc {

    protected SequenceFileRecordReader reader;

    public SequenceFileStorage() {}

    public void bindTo(String fileName, BufferedPositionedInputStream in,
            long offset, long end) throws IOException {
        Path file = new Path(fileName);
        // Restrict the reader to the slice [offset, end) of the file
        FileSplit split = new FileSplit(file, offset, end - offset, new JobConf());
        reader = new SequenceFileRecordReader(new Configuration(), split);
    }

    public Tuple getNext() throws IOException {
        MyKey key = new MyKey();
        MyValue value = new MyValue();

        // End of the split: close the reader and signal Pig there is no more data
        if (!reader.next(key, value)) {
            reader.close();
            return null;
        }

        Tuple tuple = new Tuple();
        tuple.appendField(value.getData());
        return tuple;
    }
}
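With the class compiled and on Pig's classpath, it could then be used from a script along these lines (a sketch; the jar and file names here are invented and depend on your setup):

```
REGISTER sequencefile-storage.jar;
A = LOAD 'part-00000' USING SequenceFileStorage();
```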

Grid computing with Java

I have found a very interesting project named GridGain. GridGain is an open source project for grid computing in Java.

With this library/platform it seems very easy to develop distributed applications and execute them on a cluster.

I will take a deeper look at this project in the future.

Tuesday, February 12, 2008

Open Source and Startups: the next step forward

It is clear that the open source LAMP platform was a revolution for startups. Open source projects like MySQL, Apache and PHP gave many startups the boost they needed to reach their objectives. Some of them could not have succeeded if they had had to buy this kind of technology.

LAMP is currently the standard to follow when starting a new Internet business, and LAMP (or one of its variations) is used in a high percentage of new websites. Nevertheless, as projects grow bigger, the limits of the LAMP architecture appear. There are scalability issues, the biggest of which is usually the impossibility of scaling the database (usually MySQL).

But, from my point of view, a new revolution is here. Two projects, Lucene and Hadoop, have come to take the next step forward. They do not come to replace the LAMP software; they come to complete it.

Lucene is a search engine library. It can index a big amount of text data into an index file, and such an index can answer keyword queries. The library is fast, and it can index more data and answer queries faster than database-based search systems.

Hadoop is an open source implementation of Google's MapReduce distributed system. It allows the processing of big amounts of data over a cluster of commodity computers. I do think this is one of the most important open source projects nowadays, and I think new startups will profit from these projects to develop innovative services. As an example of the scale these projects can reach, Powerset, a new startup, is trying to build a natural language search engine using Hadoop.
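The idea behind a keyword index like Lucene's can be sketched in a few lines of Java: map each term to the list of documents containing it, so a keyword query is a lookup instead of a full scan. (An illustration of the concept only; this is not Lucene's API, and all names are invented.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> ids of the documents that contain it.
public class InvertedIndex {
    private final Map<String, List<Integer>> index = new HashMap<>();

    public void add(int docId, String text) {
        // Naive tokenization by whitespace; Lucene's analyzers do far more.
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    public List<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "distributed systems and search engines");
        idx.add(2, "real estate search");
        System.out.println(idx.search("search"));  // both documents match
    }
}
```

What Lucene adds on top of this skeleton is compact on-disk storage, relevance ranking and speed at scale.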

I will write more about these projects and some of their subprojects in future posts.

Thursday, February 7, 2008

El economista camuflado

A magnificent book by Tim Harford: "El economista camuflado" ("The Undercover Economist" in English).

I must admit that the first chapter felt a bit obvious to me and I even stopped reading the book. But I picked it up again at the chapter "Why poor countries are poor" and grew increasingly fascinated and surprised by the freshness with which he explains, and makes accessible, economic concepts that are far from clear to ordinary people.

The book approaches the economy as a system. As such, it has certain rules, works in a particular way, and has its advantages and its vices or flaws. This is precisely what Harford tries to explain: which things work and which do not, why, and how to fix them by adjusting the system.

Chapter 4 deserves special mention. It deals with the negative effects that are not included when the price of a transaction between a seller and a buyer is set, precisely because they do not directly affect either of them. Harford calls these costs "negative externalities" and proposes including them in transactions through taxes. From my point of view, this is the main idea to apply in trying to reduce CO2 emissions into the atmosphere. If a company emits CO2, it should pay according to how much it emits; the company itself will then be interested in emitting less, because it will be more profitable. Moreover, if we let this cost of CO2 fluctuate through an auction system, the price companies must pay will be self-adjusting and therefore more realistic than one set arbitrarily: the companies themselves will show us how much they are willing to pay to emit.

The whole book is worthwhile. It also discusses health care systems, both public and private, explaining that both are imperfect, and proposes a mixed system that is apparently being used successfully in Singapore.

Other topics it covers are:
  • A way to fix traffic
  • An explanation of stock markets
  • The curious history of the use of auctions in awarding 3G licences
  • Why poor countries are poor
  • Globalization
  • The Chinese miracle
This book has clearly left its mark on me. In these times of diffuse ideologies, it has planted a first seed of what could become my future political ideas. That "ideology" would consist of a pragmatism that sees life and human relations as one great system: a system that, by adjusting its rules in the right way, can be made to converge toward the right place.

Monday, February 4, 2008

Another blog

Hello,

I am Iván de Prado, and I will write this blog in Spanish and English. I think I will write about computer science topics related to distributed systems, search engines, information retrieval, and so on. But surely other thoughts will creep in as well.

As a first topic, I would like to present the company where I work: www.properazzi.com. We are trying to build the biggest real estate search engine in the world, using crawling and artificial intelligence algorithms to extract property information from the sites of real estate agencies. We currently have 4 million properties, which makes us the real estate search engine with the largest number of properties in the world.

Yet another blog

Hi!

I finally had to give in... and here is my blog. I intend it to be a place for the things that cross my mind: my impressions, my doubts and my frustrations. I suppose I will be as inconstant as I am in certain other facets of my life, but I hope to update the blog from time to time. I think I will make it bilingual, with posts in both Spanish and English, as the mood takes me.