
I'm trying to extract movie information from Freebase. I just need the name of the movie, and the names and IDs of the director and the actors.

I found it hard to do this using Freebase's topic dumps, because there is no reference to the director's ID, just the director's name.

What is the right approach for this task? Do I need to somehow parse the whole quad dump using Amazon's cloud? Or is there some easy way?

Jaroušek Puchlivec

2 Answers


You do need to use the quad dump, but it's under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing. A decent laptop should be fine. On a couple-year-old laptop, this simple-minded command:

time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545

real    18m56.968s
user    19m30.101s
sys 0m56.804s

extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
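For instance, a single pass can pull out just the film-to-director edges. Here's a minimal sketch assuming the dump's source/property/destination/value TSV layout; the /film/film/directed_by property name comes from the Freebase film schema, so verify it against your copy of the dump:

# one pass: keep only director edges, emitting film mid and director mid
bzcat freebase-datadump-quadruples.tsv.bz2 \
  | awk -F'\t' '$2 == "/film/film/directed_by" { print $1 "\t" $3 }' \
  > film_director.tsv

A similar pass over /type/object/name quads should then map those mids to human-readable names.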

You'll need to traverse an intermediary node (CVT in Freebase-speak) to get the actors, but the rest of your information should be connected directly to the subject film node.
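Concretely, that traversal is two passes plus a join on the intermediate node's id. Another sketch, with the /film/film/starring and /film/performance/actor property names again taken from the film schema and worth double-checking:

# pass 1: film -> performance CVT, sorted on the CVT id (column 2)
bzcat freebase-datadump-quadruples.tsv.bz2 \
  | awk -F'\t' '$2 == "/film/film/starring" { print $1 "\t" $3 }' \
  | sort -t $'\t' -k2,2 > film_performance.tsv

# pass 2: performance CVT -> actor, sorted on the CVT id (column 1)
bzcat freebase-datadump-quadruples.tsv.bz2 \
  | awk -F'\t' '$2 == "/film/performance/actor" { print $1 "\t" $3 }' \
  | sort -t $'\t' -k1,1 > performance_actor.tsv

# join the two passes on the CVT id; output columns are CVT, film, actor
join -t $'\t' -1 2 -2 1 film_performance.tsv performance_actor.tsv > film_actor.tsv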

Tom Morris

First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command-line tools to take 'interesting' slices of data out of the Freebase data dump.

However, an alternative would be to load the Freebase data into a 'graph' storage system locally and use the APIs and/or query language available from that system to interact with the data for further processing.

I use RDF, since the two data models are quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
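As a rough illustration (the store path, file name, and predicate URI below are made up for the example, and are not tied to freebase2rdf's actual output), Jena's tdbloader and tdbquery command-line tools make this a two-step job:

# bulk-load the converted RDF (assumed file name) into a TDB store
tdbloader --loc /data/freebase-tdb freebase.nt

# query the store with SPARQL; the predicate URI is illustrative
tdbquery --loc /data/freebase-tdb \
  'SELECT ?film ?director
   WHERE { ?film <http://rdf.freebase.com/ns/film.film.directed_by> ?director }'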

Another reasonable and scalable approach would be to implement what you need in MapReduce, but this makes sense only if your processing touches a large fraction of the Freebase data and is not as trivial as counting lines. It is also more expensive than using your own machine: you need a Hadoop cluster, or you need to use Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-))

My 2 cents.

castagna