Sunday, November 23, 2008

Properazzi.com is now Enormo.com

Properazzi is now Enormo.com!

As part of our constant evolution, we have changed the portal's name. It is now called Enormo.com, a name that better fits the site's aspirations.

Enormo is currently the property search engine with the most listings worldwide: more than 6 million.

Sunday, September 14, 2008

Paper: "IRLbot: Scaling to 6 Billion Pages and Beyond"

Two of the most complex issues to deal with when developing a crawler are URL uniqueness and host politeness. When crawling, you need to visit new pages. In order to know whether a URL represents a new page, you have to look it up in the set of already crawled URLs. That is easy with small numbers of URLs, but it is extremely hard when dealing with billions of pages.

In the paper "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov, the authors describe a memory-and-disk data structure named DRUM. It provides efficient storage for large collections of (key, value) pairs and maximizes the amortized throughput of insertions, updates and lookups. To achieve maximum throughput, queries are done in batches: they are only executed once certain in-memory structures fill up, so queries are answered asynchronously. Another advantage is that DRUM adapts to different memory/disk combinations: the more memory, the better the performance, and the more disk space, the larger DRUM's capacity. DRUM is a variation of the bucket sort algorithm.
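
As a rough illustration of the batching idea only (this is not the paper's actual DRUM implementation), here is a minimal Python sketch that buffers URL keys in memory and checks them against the on-disk key set in one pass per batch:

import hashlib

class BatchedURLSeenFilter:
    # Toy sketch of DRUM-style batched uniqueness checks: keys are buffered in
    # memory and only checked against the on-disk key file once the buffer is
    # full, so the cost of touching the disk is amortized over a whole batch.
    # (A real DRUM keeps the disk file in sorted buckets and merges it
    # sequentially instead of loading it into memory as done here.)

    def __init__(self, path, batch_size=100000):
        self.path = path
        self.batch_size = batch_size
        self.pending = {}                 # key -> URL waiting for an answer
        open(path, "a").close()           # make sure the key file exists

    def check(self, url, on_unique):
        # Queue a URL; on_unique(url) is called later if it turns out to be new.
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        self.pending.setdefault(key, url)
        if len(self.pending) >= self.batch_size:
            self.flush(on_unique)

    def flush(self, on_unique):
        with open(self.path) as f:
            seen = set(line.strip() for line in f)
        new_keys = [k for k in sorted(self.pending) if k not in seen]
        with open(self.path, "w") as f:
            for k in sorted(seen.union(new_keys)):
                f.write(k + "\n")
        for k in new_keys:
            on_unique(self.pending[k])    # answers arrive asynchronously, in batches
        self.pending.clear()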

The paper also talks about the techniques for dealing with spam sites and a budget system to crawl more pages from sites that receive more inbound links.

DRUM is the best approach I have seen so far to the problem of URL uniqueness. The paper is a “must read” if you are working on crawlers.

Tuesday, July 8, 2008

Google Protocol Buffers released as Open Source

Google has released its Protocol Buffers library, used for serializing structured data, as open source (documentation). Google has been using this library extensively in its systems for storing and sharing data. I guess that most of the files stored in their internal GFS are encoded with Protocol Buffers.

It has several interesting features. First, the types support variable-length encoding, which can lead to big storage savings when dealing with large amounts of data.

The second characteristic is that it allows changes to the data schema while maintaining forward compatibility. This is really important because schema changes are common in practice. Besides, forward compatibility allows old systems and data to coexist with new ones.

The third feature is its availability for C++, Java and Python, making it easy to share data between these three languages. Facebook has recently released Thrift, another approach to serialization and RPC.
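
As a quick illustration of how it is used from Python (a hedged sketch: the Person message and its field names are made up, and the person_pb2 module is assumed to have been generated with protoc):

# Hypothetical person.proto, compiled with: protoc --python_out=. person.proto
#
#   message Person {
#     required string name  = 1;
#     optional int32  id    = 2;
#     repeated string email = 3;
#   }
import person_pb2

person = person_pb2.Person()
person.name = "Alice"
person.id = 42
person.email.append("alice@example.com")

data = person.SerializeToString()   # compact, variable-length encoded bytes

decoded = person_pb2.Person()
decoded.ParseFromString(data)       # old readers simply skip fields added later
print(decoded.name, decoded.id)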

More information about these topics, and a comparison with Hadoop serialization, can be found on Tom White's blog.

Monday, May 12, 2008

Paper: Detecting Near-Duplicates for Web Crawling

Three researchers from Google have published the paper Detecting Near-Duplicates for Web Crawling at the 2007 WWW Conference, presenting a technique for detecting near-duplicates over a set of web pages.

They have developed a method aimed at performing near-duplicate detection over a corpus of 8B pages with a dataset of hashes of only 64 GB (Wow!).

Interesting issues tackled within the paper:

  • A method (simhash) for hashing documents. It has a very interesting property: hashes of similar documents are very close, so you can consider two documents near-duplicates if the Hamming distance between their hashes is 3 or less. The authors say that 64-bit hashes are enough for near-duplicate detection over 8B pages (see the sketch after this list).
  • A way to compress the hashes so that all the data needed for near-duplicate detection over 8B pages fits in just over 32 GB.
  • A fast way to look for duplicates at Hamming Distance of 3 or less.
  • A fast way to perform batch queries using MapReduce with a speed of 1M fingerprints every 100 seconds with 200 mappers.
  • A good review of the state of the art in duplicate detection: it explains what shingles are, proposes using typical IR document vectors for feature extraction, and mentions possible applications as well as other duplicate-detection algorithms.
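
A minimal sketch of the simhash idea, based on the paper's description (the tokenization and the uniform token weights here are simplistic assumptions, not the authors' implementation):

import hashlib

def simhash(text, bits=64):
    # Similar documents get fingerprints that differ in only a few bit positions.
    v = [0] * bits
    for token in text.lower().split():          # naive tokenization, weight 1 per token
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

d = hamming_distance(simhash("the quick brown fox jumps over the lazy dog"),
                     simhash("the quick brown fox jumped over the lazy dog"))
print(d)   # much smaller than the ~32 bits expected for two unrelated texts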

Sunday, April 27, 2008

Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides

Slides and videos of the past Hadoop Summit and Data-Intensive Computing Symposium have been published.

I am still reading them but there is a lot of good stuff there.

For more information, read my previous posts Open Source and Startups: the next step forward, Big Data Sets Querying and Analysis, Distributed databases: BigTable, HBase and Hypertable, and ZooKeeper Video and Slides.

Properazzi.com

Properazzi.com has just launched a new, simplified interface, with many improvements in usability and speed.

These improvements come alongside an expansion into the following countries:

Sunday, April 20, 2008

The end of poverty

I have read the book “The End of Poverty: Economic Possibilities for Our Time” by the economist Jeffrey D. Sachs. I have discovered a lot of things with this book, and I am surprised, because it covers topics I had never heard about; they should be in the newspapers every day given their importance for the future of humankind. The book sheds light on these points:

1 – More than 1.1 billion people live in extreme poverty (on less than $1 per day). The poorest region in the world is Africa.

2 – Extremely poor countries are caught in a poverty trap and cannot get out by themselves. They have not reached the critical mass of capital that would allow them to sustain economic growth. Sachs argues it is false that extremely poor people are poor because they have not done enough to escape poverty. Two important factors are geography and climate: countries without access to the sea, with extreme climates, high exposure to disease or rugged terrain are candidates to fall into the poverty trap.

3 – It is possible to eradicate extreme poverty. Our generation has the possibility to do it.

4 – There is already a plan: the Millennium Development Goals. One of the goals is to halve extreme poverty by 2015. Through collaboration between the poor countries that want to take part in the plan and the rich countries, with the UN as coordinator and supervisor, we can halve extreme poverty by 2015 and eradicate it by 2025. The plan recognizes that there is no standard recipe that works for every country; instead, it proposes a differential diagnosis to detect each country's needs. Each poor country would be monitored to check whether the plan is progressing as expected and to ensure the money is spent on what was agreed.

5 – It is cheap! The cost is approximately $60 billion per year, less than 0.7% of GDP!

6 – Almost every country has already signed up to it, in the Millennium Declaration and the Monterrey Consensus… but they are not fulfilling their commitments.

Therefore, it is our responsibility to press our governments to support the biggest attempt so far to correct the injustices suffered by poor people. This is the greatest and most important international initiative of our time, and for that reason I cannot understand why I had not heard of its existence until now. It means the rich world has communication problems with things that are not related to its day-to-day life. I am sure that if people become aware, if we manage to spread the word about what is happening and what we can do, people will press governments and politicians, and they will be forced to do what their voters ask for.

Speaking specifically about my country, Spain, it seems that Zapatero's government is taking steps in the right direction. They promise to reach 0.7% of GDP by 2012. We have to keep up the pressure to achieve this goal and to increase the quality of the aid we provide.

Related links:

EndPoverty 2015

The Earth Institute

MDG 2007 reports:

MDG Report 2007

MDG Progress Chart 2007

Spain:

Pobreza Cero

Report of the Spanish Government on Goal 8

Thursday, April 17, 2008

Bloom Filter

As Wikipedia says, a Bloom filter

is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

This interesting structure needs very little memory: about 10 bits per item for a 1% false-positive rate.

BigTable uses this structure to reduce the lookup time for non-existing keys.
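
As a minimal sketch of the idea (the bit-array size and number of hash functions here are illustrative, not tuned values):

import hashlib

class BloomFilter:
    # Minimal Bloom filter: k hash functions set k bits per added item.
    # Membership tests can yield false positives but never false negatives.

    def __init__(self, num_bits=10000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)      # one byte per bit, for simplicity

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-42")
print("row-key-42" in bf)    # True
print("row-key-43" in bf)    # almost certainly False (tiny false-positive chance)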

More information and a talk on Bloom Filters here.

Tuesday, April 8, 2008

Google App Engine

Surprise! Google has released a new service aimed at web app development. Google App Engine is a framework for developing web applications that run on Google's infrastructure, which means the applications can scale easily.

The service is based on a shared-nothing architecture. You write handlers that process requests; you cannot create threads or processes, you cannot share anything between requests, and you cannot read or write files on a filesystem.

A relaxed but scalable database is available. This database is similar to Amazon SimpleDB and Microsoft SSDS.

The request processors are written in Python. A hello world example:
import wsgiref.handlers

from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello, webapp World!')

def main():
    application = webapp.WSGIApplication(
        [('/', MainPage)],
        debug=True)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()
An example of using the database:
from google.appengine.ext import db
from google.appengine.api import users

class Pet(db.Model):
    name = db.StringProperty(required=True)
    type = db.StringProperty(required=True, choices=set(["cat", "dog", "bird"]))
    birthdate = db.DateProperty()
    weight_in_pounds = db.IntegerProperty()
    spayed_or_neutered = db.BooleanProperty()
    owner = db.UserProperty()

pet = Pet(name="Fluffy",
          type="cat",
          owner=users.get_current_user())
pet.weight_in_pounds = 24
pet.put()

Example of get, modify, save:
if users.get_current_user():
    user_pets = db.GqlQuery("SELECT * FROM Pet WHERE owner = :1",
                            users.get_current_user())
    for pet in user_pets:
        pet.spayed_or_neutered = True

    db.put(user_pets)
App Engine is going to be useful for web app developers who need high scalability but do not have complex processes in their backend. As an example, I believe it would be easy to develop Twitter using App Engine. Developers of Facebook apps could find the service useful, too.

Google is going to compete with Amazon Web Services, although I find the two services different. Google's is simpler and easier, so its target is developers of simple but scalable web apps. Amazon Web Services provides more control, so companies with complex systems (like vertical search engines) will probably prefer it.

Monday, April 7, 2008

Andorra

The Future of Web Search: Beyond Text # Learning to Rank Answers on Large Online QA Collections

Mihai Surdeanu gave the talk “Learning to Rank Answers on Large Online QA Collections”. The idea is to use the large answer collections from Yahoo! Answers to learn a better way to rank answers that do not yet have social reputation.

Users’ votes for answers are used to evaluate the candidate ranking approaches; your ranking algorithm should therefore rank answers with votes first.

In the end, a well-tuned mix of different strategies gets the best results. No single strategy is dominant, so the best results come from mixing all of them.

Simple techniques, like discarding answers with fewer than 4 words, are mixed with more complex ones, like comparing answers in different languages and using NLP.

Friday, April 4, 2008

The Future of Web Search: Beyond Text # SAPIR Project

Pavel Zezula talked about a system for multi-feature indexing called MUFIN. It is interesting because it allows indexing objects under an arbitrary metric distance measure. For instance, you can index multimedia content (such as images) with a defined metric: two images are very similar if their colors are very close. Later, you can search for images close to a given color scheme.

Documents are clustered, and those in the same cluster are mapped to the same one-dimensional interval.

They have used a P2P architecture that ensures scalability.

The work is supported by the SAPIR EU project.

The Future of Web Search: Beyond Text

Yahoo! is hosting the workshop The Future of Web Search: Beyond Text in Andorra today. There are over 100 attendees. The workshop focuses especially on multimedia and specialized topics in Web search.

I’ll post some notes from the talks on several posts.

Monday, March 31, 2008

Simple DB

Simple DB is a new Amazon database service. It complements the other two Amazon "elastic demand" services (the EC2 cloud computing service and the S3 storage service). As you can see in these slides, Simple DB is a database with relaxed constraints: no schema is needed, there are no explicit relationships or referential integrity constraints, and there are not many data types. On the other hand, it provides a simple API, predictable speed, easy search (all fields are indexed), high availability and on-demand scalability.
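
As a rough illustration of what such a relaxed API looks like from code (a sketch using the boto Python library; the domain, item names and attributes are made up):

import boto

# Connect with AWS credentials taken from the environment or boto config.
sdb = boto.connect_sdb()

# Domains are schema-less: each item is just a bag of attribute/value pairs,
# and every attribute is automatically indexed.
domain = sdb.create_domain("properties")

item = domain.new_item("listing-123")
item["city"] = "Barcelona"
item["price"] = "250000"          # values are stored as strings
item.save()

print(domain.get_item("listing-123")["city"])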

A few months after the release of the Simple DB beta, Microsoft launched SQL Server Data Services (SSDS). As you can see in this document, SSDS is very similar to Simple DB; it seems Microsoft was inspired by Simple DB when creating SSDS.

From my point of view, this kind of database is going to be very useful for developers. They provide some advantages over typical relational databases, among them ease of use and high scalability. And I believe Amazon is doing very well with its strategy of developing these kinds of scalable services.

ZooKeeper Video and Slides

ZooKeeper is a coordination service that can be a great help when developing distributed systems (see my previous post about it). An introduction to ZooKeeper has recently been published; you can see the video and some slides in PDF.



ZooKeeper uses the file-system paradigm, adding some enhancements for synchronization and providing notifications. Managing the coordination of a distributed system with ZooKeeper turns out to be really elegant: distributed procedures like leader election and distributed locks become simple, as in the sketch below.
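
A minimal sketch of those recipes (using the kazoo Python client; the ZooKeeper address and the znode paths here are assumptions):

from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (the address is an assumption).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Leader election with ephemeral sequential znodes: every candidate creates a
# node under /election, and whoever holds the lowest sequence number leads.
zk.ensure_path("/election")
me = zk.create("/election/candidate-", ephemeral=True, sequence=True)
children = sorted(zk.get_children("/election"))
if me.split("/")[-1] == children[0]:
    print("I am the leader")
else:
    print("Following; watch the node just ahead of mine")

# kazoo also ships ready-made recipes, for example a distributed lock:
lock = zk.Lock("/locks/resource-1", "my-identifier")
with lock:
    pass  # critical section: only one client at a time gets here

zk.stop()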

Tuesday, March 11, 2008

Engineering Philosophy

I have found these commandments for Google engineers:

1. All developers work out of a ~single source depot; shared infrastructure!
2. A developer can fix bugs anywhere in the source tree.
3. Building a product takes 3 commands ("get, config, make")
4. Uniform coding style guidelines across company
5. Code reviews mandatory for all checkins
6. Pervasive unit testing, written by developers
7. Unit tests run continuously, email sent on failure
8. Powerful tools, shared company-wide
9. Rapid project cycles; developers change projects often; 20% time
10. Peer-driven review process; flat management structure
11. Transparency into projects, code, process, ideas, etc.
12. Dozens of offices around world => hire best people regardless of location

Nice philosophy. I would like to highlight some of them from the point of view of startups.

A uniform coding style is needed if you want your code base to grow fast without losing evolvability and understandability. Setting some standards at the start of the project will improve things in the long term.

Startups need to grow fast, very fast. Pervasive unit testing is the only way to grow fast without making mistakes. If you want to avoid the risk of moving backwards with new developments, you need obsessive testing. People usually do not realize it, but unit testing improves productivity too: with unit tests you will detect whether a new development breaks something or introduces bugs, which improves confidence in the software. In the long term you will have more time to develop new features, as you do not have to waste it fixing bugs. Unit testing increases productivity; use it if you do not want your software to be a house of cards.

You have to convince developers of the idea of developing for sharing. It is common to find developers writing code only for their own use in a particular case. That is wrong: if a developer needs something, it is likely that someone else will need it in the future too. So instead of writing that piece of code as if you were going to be its only user, write it thinking that it can be useful for other developers later. That means taking care of the API, generalizing, writing documentation, and maintaining and sharing your work. That improves the reusability of the software.

Rapid project cycles are good. Long project cycles usually increase the complexity and duration of development.

Transparency is another attribute needed in startups. You need the whole team to know everything and to always have the targets clear in their minds. That makes it easy for each member of the team to know how they can contribute to the company.

Saturday, March 8, 2008

BCN Meeting

I have received this announcement from my friend Tomy Pelluz:
hackers, developers, designers, entrepreneurs, thinkers, web-lovers, code-poets...

Something is happening in Barcelona Web n.0 scenario and you should pay attention...

Do not miss it ;-)

We will meet here > http://tinyurl.com/2vdkj2
Temporal coordinates > Tuesday 11 - 20:35
And then we will decide what we want to do... beers, tapas, running... everything is possible

Bring your friends, bring your brain.

Nice people like that will be there. I am planning to attend. I’ll see you there ;-)

Sunday, March 2, 2008

Pig Talk

Christopher Olston gave a talk about Pig. You can see the video of the talk here and the slides here.

Pig is an open source tool for the processing and analysis of big datasets, developed mainly by Yahoo! Research.

Sunday, February 24, 2008

Distributed databases: BigTable, HBase and Hypertable

Since the publication of the Google paper about BigTable, people have started to think seriously about distributed databases. BigTable is a distributed database where you can store huge amounts of data. On the other hand, a lot of requirements have been relaxed in order to achieve this scalability, so BigTable cannot be compared directly with typical RDBMSs.

We can see BigTable as a big ordered map where you can insert (key, value) pairs. The data are stored sorted by key, so retrievals by key or by a range of keys are fast. You cannot create indexes on other fields. These big tables are physically stored in smaller pieces (of approximately 100 MB) named “tablets”, and these tablets are stored in a distributed file system.
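
To illustrate the “big ordered map” idea, here is a toy in-memory sketch (not how tablets actually work; it only shows why sorted keys make point lookups and range scans cheap while other fields remain unindexed):

import bisect

class OrderedMap:
    # Toy sorted (key, value) store: point lookups and range scans are cheap
    # because keys are kept in sorted order, but there are no secondary
    # indexes on values.

    def __init__(self):
        self.keys = []
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def get(self, key):
        return self.values.get(key)

    def scan(self, start, stop):
        # Yield (key, value) pairs with start <= key < stop.
        i = bisect.bisect_left(self.keys, start)
        while i < len(self.keys) and self.keys[i] < stop:
            yield self.keys[i], self.values[self.keys[i]]
            i += 1

t = OrderedMap()
t.put("com.example.www/page1", "<html>...</html>")
t.put("com.example.www/page2", "<html>...</html>")
print(list(t.scan("com.example.www/", "com.example.www0")))   # range scan by key prefix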

A very interesting attribute of this architecture is that tablets can be compressed before being written to disk. That is very useful if you are thinking of storing text data (web pages, documents, etc.). For example, a large quantity of crawled content can be stored in this database.

In conclusion, BigTable is not a distributed replacement for RDBMSs. Instead, it is a big database with limited functionality but with one big quality: it is scalable.

Two open source projects are developing BigTable clones. HBase is a subproject of Hadoop, so it uses HDFS as its file system. It is entirely developed in Java. See some performance numbers here and here.

The second one is Hypertable, which is written in C++. KFS or HDFS can be used as the file system. Some performance numbers have recently been published.

Both projects are in an early stage of development, but look very promising...

Thursday, February 21, 2008

Coordination of services in a distributed system

ZooKeeper is a service to coordinate processes in a distributed system. As they say:

“Coordinating processes of a distributed system is challenging as there are often problems that arise when implementing synchronization primitives, such as race conditions and deadlocks. With ZooKeeper, our main goal is to make less complex to coordinate the actions of processes in a distributed system despite failures of servers and concurrency.”

It seems similar to Google's Chubby. In summary, the system is like a directory service you can trust, so ZooKeeper can be used to solve synchronization and coordination issues in distributed systems.

Very interesting… Will it be integrated into Hadoop and HBase?

Big Data Sets Querying and Analysis

Using SQL and databases to analyze and extract data from datasets is common practice. Clauses like GROUP BY and ORDER BY and aggregation functions like COUNT and AVG are useful and flexible enough. Tasks such as generating statistics from log files or extracting information from a dataset are easy with SQL.

But the problem comes when you have a very big dataset (gigabytes or terabytes of data). In those cases, databases simply do not cope, and you need to distribute the computation. There are two projects that can help.

Pig is a project built on top of Hadoop. As they say:

The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or Relational Algebra.

The computations in Pig are expressed in a language named PigLatin that provides constructs similar to SQL's but more powerful. The data are stored in the Hadoop Distributed File System (HDFS), and the computations are distributed across your Hadoop cluster. That means you can query terabytes of data. Pig is already working, and they plan to release a new version with a lot of improvements in the coming months.

Jaql is another, younger project. Jaql is a query language for JSON inspired by SQL, XQuery and PigLatin, and they are planning to make it distributed using Hadoop. Files can be read from HDFS or HBase.

In conclusion, these two projects can help a lot when extracting information from big datasets.
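
To give a flavor of the kind of distributed GROUP BY / COUNT these query languages save you from writing by hand, here is a hedged sketch in the Hadoop Streaming style, in Python (the tab-separated input format and the grouping column are assumptions):

# mapper.py - emits "key<TAB>1" for every input record; Hadoop Streaming
# groups and sorts by key before the reducer runs.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2:
        print(fields[2] + "\t1")          # group by the third column

# reducer.py - sums the counts for each key; its input arrives sorted by key.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        print(current_key + "\t" + str(count))
        count = 0
    current_key = key
    count += int(value)
if current_key is not None:
    print(current_key + "\t" + str(count))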

Friday, February 15, 2008

Spanish version of Properazzi

Today we have launched the new Spanish version of the real estate portal Properazzi.

It includes a great many changes, among them a very clear and usable web interface and blazingly fast search. I think it is a big improvement; I invite you to try it.

Properazzi is a real estate portal with worldwide coverage, with 4 million properties all over the world.

Thursday, February 14, 2008

Reading Hadoop SequenceFile from Pig

A trick for reading SequenceFiles generated by Hadoop into Pig:


// Imports assume Hadoop and Pig on the classpath; MyKey and MyValue stand for
// the Writable key/value classes used when the SequenceFile was written.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

import org.apache.pig.LoadFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.BufferedPositionedInputStream;

public class SequenceFileStorage implements LoadFunc {

    protected SequenceFileRecordReader reader;

    public SequenceFileStorage() {}

    public void bindTo(String fileName, BufferedPositionedInputStream in,
                       long offset, long end) throws IOException {
        // Open a record reader over the slice of the file assigned to this loader.
        Path file = new Path(fileName);
        FileSplit split = new FileSplit(file, offset, end - offset, new JobConf());
        reader = new SequenceFileRecordReader(new Configuration(), split);
    }

    public Tuple getNext() throws IOException {
        MyKey key = new MyKey();
        MyValue value = new MyValue();

        // Returning null tells Pig there are no more records in this split.
        if (!reader.next(key, value)) {
            reader.close();
            return null;
        }

        Tuple tuple = new Tuple();
        tuple.appendField(value.getData());

        return tuple;
    }
}

Grid computing with Java

I have found a very interesting project named GridGain. GridGain is an open source project for grid computing in Java.

It seems very easy to develop distributed applications and run them on a cluster with this library/platform.

I will take a deeper look at this project in the future.

Tuesday, February 12, 2008

Open Source and Startups: the next step forward

It is clear that the open source LAMP platform was a revolution for startups. Open source projects like MySQL, Apache and PHP gave many startups the “vitamins” they needed to reach their objectives. Some of them could not have succeeded if they had had to buy this kind of technology.

LAMP is currently the standard when starting a new Internet business; LAMP (or one of its variations) is used in a high percentage of new websites. Nevertheless, as projects grow, the limits of the LAMP architecture appear: there are scalability issues, the biggest usually being the difficulty of scaling the database (usually MySQL).

But, from my point of view, a new revolution is here. Two projects, Lucene and Hadoop, have come to take the next step forward. They do not come to replace the LAMP software; they come to complement it.

Lucene is a search engine library. It can index a large amount of text data into an index file, and such an index can answer keyword queries. The library is fast, and it can index more data and answer queries faster than database-based search systems.

Hadoop is an open source implementation of Google's MapReduce distributed system. It allows the processing of large amounts of data over a cluster of commodity computers. I think this is one of the most important open source projects around today, and new startups will take advantage of these projects to develop innovative services. As an example of the scale these projects can reach, Powerset, a new startup, is trying to build a natural language search engine using Hadoop.

I will write more about these projects and some of their subprojects in future posts.

Thursday, February 7, 2008

El economista camuflado

A magnificent book by Tim Harford: "El economista camuflado" (The Undercover Economist in the original English).

I have to admit that the first chapter struck me as a bit obvious and I even stopped reading the book. But I picked it up again at the chapter "Why poor countries are poor" and grew more and more fascinated and surprised by the freshness with which it explains, and makes accessible, economic concepts that are not at all clear to ordinary people.

The book approaches the economy as a system. As such, it has certain rules, it works in a particular way, and it has its advantages and its vices or flaws. This is precisely what Harford tries to explain in his book: which things work well and which do not, why, and how to fix them by adjusting the system.

Chapter 4 deserves special mention. It deals with the negative effects that are not included when the price of a transaction between a seller and a buyer is set; they are left out because they do not directly affect either the seller or the buyer. Harford calls these costs "negative externalities" and proposes including them in transactions through taxes. From my point of view, this is the main idea to apply in trying to reduce CO2 emissions into the atmosphere. If a company generates CO2, it should pay according to how much it emits; the company itself will then be interested in emitting less, since that will be more profitable. Moreover, if we let this cost of CO2 fluctuate through an auction system, the price companies must pay will adjust itself and will therefore be more realistic than if it were set arbitrarily: the companies themselves will show us how much they are willing to pay in order to emit.

There is not a wasted page in the book. It also discusses healthcare systems, both public and private, explaining that both are imperfect, and proposes a mixed system that is apparently being used successfully in Singapore.

Other topics it covers are:
  • A way to solve traffic congestion
  • An explanation of stock markets
  • The curious history of the use of auctions to award 3G licences
  • Why poor countries are poor
  • Globalization
  • China's miracle

It is clear that this book has left its mark on me. In this era of diffuse ideologies, it has planted a first seed of what could become my future political ideas. This "ideology" would consist of a pragmatism that sees life and human relations as one great system: a system that, by adjusting its rules appropriately, can be made to converge toward the right place.

Monday, February 4, 2008

Another blog

Hello,

I am Iván de Prado, and I will write this blog in Spanish and English. I think I will write about computer science topics related to distributed systems, search engines, information retrieval, and so on, but other thoughts will surely find their way in too.

As a first topic, I would like to introduce the company where I work: www.properazzi.com. We are trying to build the biggest real estate search engine in the world, using crawling and artificial intelligence algorithms to extract property information from real estate agency sites. Currently we have 4 million properties, which makes us the real estate search engine with the largest number of properties in the world.

Yet another blog

Hi!

In the end I had to give in... and here is my blog. I intend this to be a place to put the things going through my head: my impressions, my doubts and my frustrations. I suppose I will be as inconstant here as I am in certain facets of my life, but I hope to update the blog from time to time. I think I will keep the blog bilingual, with posts in both Spanish and English, as the mood takes me.