After 3 years at Enormo, I have started studying for an MSc in Economics. It will last 2 years. During these years I will focus on studying, so I will not be working at the same time.
Why this change? After learning how the monetary system works, I realized that I didn't really know what money is! Besides, the crisis made me see that the economic system is very important, but also far from perfect.
As a person interested in systems in general, I see the economic system as one of the most complex and important in the world.
Thursday, September 24, 2009
New Course
A New Direction
After three years working at Enormo, I have finally decided to change the direction of my life. I am going to spend the next two years studying for a Master's in Economics in Madrid. During those two years I will devote myself exclusively to the master's; that is, I will not work.
This change comes from a growing interest in economic matters. I suppose the interest arose after I learned how the monetary system works, which made a tremendous impression on me. I had no idea what money was! And I suspect not many people do.
On the other hand, the economic crisis has shown me how important and how imperfect the current economic system is.
Saturday, June 13, 2009
Proposal for the government: online invoicing without having to be self-employed (autónomo)
Currently there are only two legal ways to issue invoices:
- Owning a company
- Being registered as self-employed (autónomo)
This leaves people in situations like the following without an option:
- I am unemployed and cannot find a job. I have found an opportunity to paint a company's offices, but I cannot take it because they require an invoice.
- I am a professional with a stable job and some experience, so from time to time other companies offer me occasional work. I cannot accept it because I would have to issue an invoice.
- I am working as an employee of a company, but I am now thinking about starting my own business and becoming self-employed. The problem is that I am not sure about taking the step. It would help me a lot to be able to invoice a few jobs without registering as an autónomo, so that I could try it out before taking the big step.
Sunday, November 23, 2008
Properazzi.com is now Enormo.com
Properazzi is now Enormo.com!
As part of our constant evolution, we have changed the portal's name. It is now called Enormo.com, a name that better fits the site's aspirations.
Enormo is currently the property search engine with the most listings in the world: more than 6 million.
Sunday, September 14, 2008
Paper: "IRLbot: Scaling to 6 Billion Pages and Beyond"
Two of the most complex issues to deal with when developing a crawler are URL uniqueness and host politeness. When crawling, you need to visit new pages. To know whether a URL represents a new page, you have to do a lookup in the list of already crawled URLs. That is easy with a small number of URLs, but it is extremely hard when dealing with billions of pages.
In the paper "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov, the authors describe a combined memory and disk data structure named DRUM. It provides efficient storage of large collections of (key, value) pairs and maximizes the amortized throughput of insertions, updates and lookups. To achieve maximum throughput, queries are processed in batches: they are executed only when certain in-memory structures fill up, so they are answered asynchronously. Another advantage is that DRUM adapts to different combinations of memory and disk: the bigger the memory, the better the performance, and the larger the disk, the greater DRUM's capacity. DRUM is a modification of the bucket sort algorithm.
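To make the batching idea concrete, here is a rough sketch in Python. It is not the DRUM implementation from the paper, only a toy with the same shape: URLs are hashed into buckets, queued in memory, and answered all at once by a sorted merge against the stored keys. The class name, method names, parameters and the use of plain lists instead of disk files are all my own assumptions.

```python
import hashlib
from typing import List, Tuple


class BatchedUrlSet:
    """Toy batched check+update store (hypothetical; NOT the paper's DRUM code)."""

    def __init__(self, num_buckets: int = 4, buffer_limit: int = 1000) -> None:
        self.num_buckets = num_buckets
        self.buffer_limit = buffer_limit
        # Queries waiting in memory, one list per bucket.
        self.pending: List[List[Tuple[int, str]]] = [[] for _ in range(num_buckets)]
        # Sorted key lists; assumption: these stand in for the on-disk bucket files.
        self.stored: List[List[int]] = [[] for _ in range(num_buckets)]
        self.buffered = 0

    @staticmethod
    def _key(url: str) -> int:
        # 64-bit key taken from a SHA-1 digest (an arbitrary choice for this sketch).
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def check_update(self, url: str) -> List[Tuple[str, bool]]:
        """Queue a URL. Answers come back only when the buffer fills up."""
        key = self._key(url)
        self.pending[key % self.num_buckets].append((key, url))
        self.buffered += 1
        if self.buffered >= self.buffer_limit:
            return self.flush()
        return []

    def flush(self) -> List[Tuple[str, bool]]:
        """Answer all queued queries as (url, is_new) with one merge pass per bucket."""
        results: List[Tuple[str, bool]] = []
        for b in range(self.num_buckets):
            old = self.stored[b]
            merged: List[int] = []
            i = 0
            last_key = None
            for key, url in sorted(self.pending[b]):
                # Copy stored keys smaller than the current query key.
                while i < len(old) and old[i] < key:
                    merged.append(old[i])
                    i += 1
                seen = key == last_key or (i < len(old) and old[i] == key)
                results.append((url, not seen))
                if not seen:
                    merged.append(key)
                last_key = key
            merged.extend(old[i:])
            self.stored[b] = merged
            self.pending[b] = []
        self.buffered = 0
        return results
```

The point of the design is that each flush reads every bucket sequentially once, instead of doing one random lookup per URL. The price is that the caller does not learn whether a URL is new until the batch is processed.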
The paper also talks about the techniques for dealing with spam sites and a budget system to crawl more pages from sites that receive more inbound links.
DRUM is the best approach I have seen so far to the problem of URL uniqueness. The paper is a “must read” if you are working with crawlers.
Tuesday, July 8, 2008
Google Protocol Buffers released as Open Source
Google has released its Protocol Buffers library as open source. It is used for serializing structured data (documentation). Google has been using this library extensively in its systems for storing and sharing data. I guess that most of the files stored in their internal GFS are encoded using Protocol Buffers.
It has several interesting features. First of all, the types support variable-length encoding: small values take fewer bytes than large ones. This can lead to big storage savings when dealing with large amounts of data.
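To see where the savings come from, here is a small sketch of base-128 "varint" encoding, the scheme Protocol Buffers uses for integers: each byte carries 7 bits of the value, and the high bit says whether more bytes follow. The function names are mine, and the sketch handles only non-negative integers.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


print(len(encode_varint(1)))        # 1 byte instead of a fixed 4 or 8
print(len(encode_varint(300)))      # 2 bytes
print(len(encode_varint(2 ** 32)))  # 5 bytes
```

If most of the numbers in a dataset are small (counts, ids, enum values), most of them end up in one or two bytes.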
The second feature is that the data schema can change while forward compatibility is maintained. This is really important because schema changes are common in practice. Besides, forward compatibility allows old systems and data to coexist with new ones.
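A rough sketch of why this works: every value on the wire is preceded by a key that holds the field number and a wire type, so a reader can skip fields it does not know about. The code below only handles integer (varint) fields, and the schemas and field names (id, price, bedrooms) are made up for the example. It is a simplified reader, not the Protocol Buffers library API.

```python
from typing import Dict, Iterable, Tuple

WIRE_VARINT = 0  # Protocol Buffers wire type for varint-encoded integers


def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_record(fields: Iterable[Tuple[int, int]]) -> bytes:
    """Encode (field_number, value) pairs; the key is (field_number << 3) | wire_type."""
    out = bytearray()
    for number, value in fields:
        out += encode_varint((number << 3) | WIRE_VARINT)
        out += encode_varint(value)
    return bytes(out)


def decode_record(data: bytes, schema: Dict[int, str]) -> Dict[str, int]:
    """Decode a record, silently skipping field numbers missing from the schema."""
    record: Dict[str, int] = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type != WIRE_VARINT:
            raise ValueError("this sketch only understands varint fields")
        value, pos = decode_varint(data, pos)
        if number in schema:
            record[schema[number]] = value
        # Unknown field number: the value was consumed and is simply ignored.
    return record


# Hypothetical schemas: version 2 adds field 3 to version 1.
SCHEMA_V1 = {1: "id", 2: "price"}
SCHEMA_V2 = {1: "id", 2: "price", 3: "bedrooms"}

new_data = encode_record([(1, 42), (2, 250000), (3, 3)])
old_data = encode_record([(1, 42), (2, 250000)])

print(decode_record(new_data, SCHEMA_V1))  # old reader, new data: field 3 is skipped
print(decode_record(old_data, SCHEMA_V2))  # new reader, old data: field 3 is absent
```

The rule that makes this safe is that a field number, once used, keeps its meaning; new fields get new numbers.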
The third feature is its availability for C++, Java and Python, making it easy to share data between these three languages. Facebook has recently released Thrift, another approach to serialization and RPC.
More information about the topics in this post, and a comparison with Hadoop serialization, can be found on Tom White's blog.
Monday, May 12, 2008
Paper: Detecting Near-Duplicates for Web Crawling
Three guys from Google have published the paper Detecting Near-Duplicates for Web Crawling at the 2007 WWW Conference, describing a technique for detecting near-duplicates in a set of web pages.
They have developed a method aimed at performing near-duplicate detection over a corpus of 8B pages with a dataset of hashes of only 64 GB (Wow!).
Interesting issues tackled within the paper:
- A method (simhash) for hashing documents. It has a very interesting property: the hashes of similar documents are very close. You can consider 2 documents near-duplicates if the Hamming distance between their hashes is 3 or less. They say that 64-bit hashes are enough for near-duplicate detection over 8B pages.
- A way to compress the hashes so that all the data needed for near-duplicate detection over 8B pages fits in a little over 32 GB.
- A fast way to look for duplicates at a Hamming distance of 3 or less.
- A fast way to perform batch queries using MapReduce, at a speed of 1M fingerprints every 100 seconds with 200 mappers.
- A good review of the state of the art in duplicate detection: it explains what shingles are; the authors propose using typical IR document vectors for extracting document attributes; some possible uses are also mentioned, as well as other algorithms for duplicate detection.
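To give a feel for the first and third points, here is a small sketch in Python. The simhash function follows the general idea (each token votes on each of the 64 bits), but the tokenizer, the use of MD5 and the weighting by word count are my own assumptions, not the feature extraction used in the paper. The lookup is also a simplification: it relies on the fact that if two 64-bit hashes differ in at most 3 bits, at least one of their four 16-bit blocks must be identical, while the paper uses permuted sorted tables.

```python
import hashlib
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set

BITS = 64


def _hash64(token: str) -> int:
    # Assumption: any well-mixed 64-bit hash would do; MD5 is just convenient.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str) -> int:
    """64-bit fingerprint: each token adds or subtracts its weight on every bit."""
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * BITS
    for token, weight in weights.items():
        h = _hash64(token)
        for bit in range(BITS):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


class BlockIndex:
    """Hypothetical index: finds stored fingerprints within distance k <= 3."""

    def __init__(self) -> None:
        # One table per 16-bit block, mapping block value to fingerprints.
        self.tables: List[Dict[int, Set[int]]] = [defaultdict(set) for _ in range(4)]

    @staticmethod
    def _blocks(fp: int) -> List[int]:
        return [(fp >> (16 * i)) & 0xFFFF for i in range(4)]

    def add(self, fp: int) -> None:
        for i, block in enumerate(self._blocks(fp)):
            self.tables[i][block].add(fp)

    def query(self, fp: int, k: int = 3) -> Set[int]:
        if k > 3:
            raise ValueError("with 4 blocks the pigeonhole argument needs k <= 3")
        candidates: Set[int] = set()
        for i, block in enumerate(self._blocks(fp)):
            candidates |= self.tables[i].get(block, set())
        # Candidates share one block with fp; confirm with the real distance.
        return {c for c in candidates if hamming_distance(c, fp) <= k}
```

As a side note, the 64 GB figure is just arithmetic: 8 billion hashes of 8 bytes each.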