Sunday, April 27, 2008

Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides

Slides and videos of the past Hadoop Summit and Data-Intensive Computing Symposium have been published.

I am still going through them, but there is a lot of good stuff there.

For more information, read my previous posts: Open Source and Startups: the next step forward; Big Data Sets Querying and Analysis; Distributed databases: BigTable, HBase and Hypertable; and ZooKeeper Video and Slides.

Properazzi.com

Properazzi.com has just launched a new, simplified interface with many improvements in usability and speed.

These improvements coincide with an expansion into the following countries:

Sunday, April 20, 2008

The end of poverty

I have read the book “The End of Poverty: Economic Possibilities for Our Time” by the economist Jeffrey D. Sachs. I learned a lot from it, and I was surprised to find topics I had never heard about. They should be in the newspapers every day because of their importance for the future of humankind. The book sheds light on these points:

1 – More than 1,100 million people live in extreme poverty (on less than $1 per day). The poorest region in the world is Africa.

2 – Extremely poor countries are caught in a poverty trap and cannot get out of it by themselves. They have not reached the critical mass of capital that would allow them to sustain economic growth. Sachs says it is false that extremely poor people are poor because they have not done enough to escape poverty. Two important factors are a country's geography and climate: countries that are landlocked, have extreme climates, are highly exposed to disease or have rugged terrain are candidates to fall into the poverty trap.

3 – It is possible to eradicate extreme poverty. Our generation has the possibility to do it.

4 – There is already a plan: the Millennium Development Goals. One of the goals is to halve extreme poverty by 2015. Through collaboration between the poor countries that want to take part in the plan and the rich countries, with the UN as coordinator and supervisor, we can halve extreme poverty by 2015 and eradicate it by 2025. The plan recognizes that no standard recipe works for every country. Instead, it proposes a differential diagnosis to identify each country’s needs. Each poor country would be monitored to check that the plan is progressing as intended and to ensure that the money is spent on what was agreed.

5 – It is cheap! The cost is approximately 60,000 million dollars per year, less than 0.7% of GDP!

6 – Almost every country has already signed up to it in the Millennium Declaration and the Monterrey Consensus … but they are not fulfilling their commitments.

Therefore, it is our responsibility to press our governments to support the biggest attempt so far to address the injustices suffered by poor people. This is the greatest and most important international initiative at present, so I cannot understand why I had not heard of it until now… It means that the rich world has trouble communicating things that are unrelated to its day-to-day life. I am sure that if people are aware, if we succeed in spreading what is happening and what we can do, people will press governments and politicians, who will be forced to do what their voters ask for.

As for my own country, Spain, Zapatero’s government seems to be taking steps in the right direction. It has promised to reach 0.7% of GDP by 2012. We have to keep pressing to achieve this goal and to increase the quality of the aid we provide.

Related links:

EndPoverty 2015

The Earth Institute

MDG 2007 reports:

MDG Report 2007

MDG Progress Chart 2007

Spain:

Pobreza Cero

Informe del Gobierno Español sobre el objectivo 8

Thursday, April 17, 2008

Bloom Filter

As Wikipedia says, the Bloom filter

is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
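
To make the definition concrete, here is a minimal sketch of a Bloom filter in Python. The class name, the parameters and the choice of deriving the bit positions from a single SHA-256 digest are my own assumptions for illustration; real implementations usually pick faster hash functions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (double hashing). This is an illustrative choice, not a tuned one.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits
                for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(num_bits=1000, num_hashes=7)
for word in ("hadoop", "hbase", "zookeeper"):
    bf.add(word)
```

An added item always tests positive, because all of its bits were set. An item that was never added can still test positive if other items happened to set the same bits, which is where the false positives come from.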

This interesting structure needs very little memory: about 10 bits per item for a 1% false positive rate.

BigTable uses this structure to reduce the look-up time for non-existent keys.
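
The "about 10 bits" figure can be checked with the standard sizing formula for a Bloom filter, assuming the optimal number of hash functions is used: bits per item = -ln(p) / (ln 2)^2, where p is the target false positive rate. The function names below are mine.

```python
import math


def bits_per_item(false_positive_rate):
    """Bits needed per stored item, assuming an optimal number of hashes."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def optimal_num_hashes(false_positive_rate):
    """Number of hash functions that minimizes false positives at that size."""
    return max(1, round(bits_per_item(false_positive_rate) * math.log(2)))


bits_for_one_percent = bits_per_item(0.01)         # roughly 9.6 bits per item
hashes_for_one_percent = optimal_num_hashes(0.01)  # 7 hash functions
```

Note that the size depends only on the desired error rate and the number of items, not on how big each item is.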

More information and a talk on Bloom Filters here.

Tuesday, April 8, 2008

Google App Engine

Surprise! Google has released a new service for developing web apps. Google App Engine is a framework for developing web applications that run on Google's infrastructure. This means that the applications can scale easily.

The service is based on a shared-nothing architecture. You write functions that process requests. You cannot create threads or processes. You cannot share anything between requests. You cannot write or read files from a filesystem.

A relaxed but scalable database is available. This database is similar to Amazon SimpleDB and Microsoft SSDS.

The request processors are written in Python. A hello world example:
import wsgiref.handlers

from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello, webapp World!')

def main():
    application = webapp.WSGIApplication(
        [('/', MainPage)],
        debug=True)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()
An example of using the database:
from google.appengine.ext import db
from google.appengine.api import users

class Pet(db.Model):
    name = db.StringProperty(required=True)
    # set() takes a single iterable, so the choices go in a list.
    type = db.StringProperty(required=True, choices=set(["cat", "dog", "bird"]))
    birthdate = db.DateProperty()
    weight_in_pounds = db.IntegerProperty()
    spayed_or_neutered = db.BooleanProperty()
    owner = db.UserProperty()

pet = Pet(name="Fluffy",
          type="cat",
          owner=users.get_current_user())
pet.weight_in_pounds = 24
pet.put()

Example of get, modify, save:
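if users.get_current_user():
    # GQL filters on the property name directly (owner, not pet.owner).
    query = db.GqlQuery("SELECT * FROM Pet WHERE owner = :1",
                        users.get_current_user())
    # Keep the fetched entities in a list so the modified ones get saved.
    user_pets = list(query)
    for pet in user_pets:
        pet.spayed_or_neutered = True

    db.put(user_pets)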
App Engine is going to be useful for web app developers who need high scalability but do not have complex processes in their backend. As an example, I believe that it would be easy to develop Twitter using App Engine. Developers of Facebook apps could find the service useful, too.

Google is going to compete with Amazon Web Services, but I find the two services different. Google's is simpler and easier, so its target is developers of simple but scalable web apps. Amazon Web Services provides more control, so companies that have complex systems (like vertical search engines) would prefer it.

Monday, April 7, 2008

Andorra

The Future of Web Search: Beyond Text # Learning to Rank Answers on Large Online QA Collections

Mihai Surdeanu gave the talk “Learning to Rank Answers on Large Online QA Collections”. The idea is to use the large answer collections from Yahoo! Answers to learn a better way to rank answers that do not have social reputation yet.

The users’ votes for answers are used to evaluate the candidate ranking approaches: a good ranking algorithm has to place the answers with votes first.

In the end, a well-tuned mix of different strategies gets the best results. No single strategy is dominant, so the best results come from combining all of them.

Simple techniques, like discarding answers with fewer than four words, are mixed with more complex ones, like comparing answers in different languages and using NLP.

Friday, April 4, 2008

The Future of Web Search: Beyond Text # SAPIR Project

Pavel Zezula talked about a system for multi-feature indexing called MUFIN. It is interesting because it allows the indexing of objects with an arbitrary metric distance measure. For instance, you can index multimedia content (such as images) with a defined metric: two images are very similar if their colors are very close. Later on, you can search for images that are very close to a given color scheme.

Documents are clustered, and those in the same cluster are mapped to the same one-dimensional interval.

They use a P2P architecture that ensures scalability.
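
To illustrate what searching with a metric means, here is a toy sketch in Python. It is not how MUFIN works internally (it scans every item instead of using an index), and the image names, average colors and radius are made-up values. The point is only that the search function needs nothing from the data except a distance function.

```python
import math


def color_distance(a, b):
    """Euclidean distance between two (r, g, b) colors; a valid metric."""
    return math.dist(a, b)


def range_query(items, query, radius, distance):
    """Return the names of items whose feature is within radius of query."""
    return sorted(name for name, feature in items.items()
                  if distance(feature, query) <= radius)


# Hypothetical collection: each image is summarized by its average color.
images = {
    "sunset.jpg": (250, 120, 40),
    "forest.jpg": (30, 140, 50),
    "beach.jpg": (240, 200, 150),
}

# Find images whose average color is close to orange.
close_to_orange = range_query(images, (255, 128, 0), 60, color_distance)
```

Swapping `color_distance` for any other function that satisfies the metric properties (for instance a distance between shape descriptors) would leave `range_query` unchanged.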

The work is supported by the SAPIR EU project.

The Future of Web Search: Beyond Text

Yahoo! is hosting the workshop The Future of Web Search: Beyond Text in Andorra today. There are over 100 attendees. The workshop focuses especially on multimedia and specialized topics in web search.

I’ll post some notes from the talks in several posts.