I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, to a custom file format.
What I would like to do is more or less the following:
- start from a distributed list of records, such as a Scalding pipe
- group items by some computed function
- ensure that items belonging to the same group reside on the same server
- on each group, apply a transformation - which involves sorting - and write the result to disk. In fact, I need to write a bunch of MapFiles, which are essentially sorted SequenceFiles plus an index.
I would like to implement the above with Scalding, but I am not sure how to do the last step.
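To make the question concrete, here is roughly how far I can get with Scalding's typed API; `Record`, `bucketOf` and the reducer count below are placeholders standing in for my actual data, not real code:

```scala
import com.twitter.scalding.typed.TypedPipe

// Placeholder record type and grouping function, standing in for my real data.
case class Record(key: String, payload: String)
def bucketOf(r: Record): Int = math.abs(r.key.hashCode) % 16

val records: TypedPipe[Record] = TypedPipe.from(Seq(Record("b", "2"), Record("a", "1")))

records
  .groupBy(bucketOf)   // group items by a computed function
  .withReducers(16)    // one reducer per group, so each group stays on one node
  .sortBy(_.key)       // sort within each group
// ...and here I am stuck: which sink writes each sorted group as a MapFile?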
While of course one cannot write globally sorted data in a distributed fashion, it should still be doable to split the data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
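For reference, writing a single already-sorted chunk with the plain Hadoop API would look something like the sketch below; `MapFile.Writer.append` requires keys in increasing order, which is why each chunk must be sorted before it reaches the writer. The `writeChunk` helper is my own, not an existing API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{MapFile, Text}

// Writes one already-sorted chunk as a MapFile (a directory holding a sorted
// data file plus an index). append throws if keys arrive out of order.
def writeChunk(sorted: Iterator[(String, String)], dir: String): Unit = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val writer = new MapFile.Writer(conf, fs, dir, classOf[Text], classOf[Text])
  try sorted.foreach { case (k, v) => writer.append(new Text(k), new Text(v)) }
  finally writer.close()
}
```

The missing piece is hooking something like this up as the output of the reduce side of the pipeline.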
I recognize that sorting very large datasets is a bad idea, which is why I plan to split the data into chunks even on a single server.
Is there any way to do something like this with Scalding? I would possibly be OK with using Cascading directly, or really any other pipeline framework, such as Spark.
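For instance, with Spark I imagine `repartitionAndSortWithinPartitions` would get me most of the way: partition on the computed group, sort on the full composite key within each partition, then write each partition out from `foreachPartition`. Again only a sketch, with `BucketPartitioner` and the bucket count as my own placeholders:

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Routes a (bucket, key) pair by its bucket only, so each partition holds
// one whole group, while the sort still uses the full composite key.
class BucketPartitioner(buckets: Int) extends Partitioner {
  def numPartitions: Int = buckets
  def getPartition(key: Any): Int = key match {
    case (bucket: Int, _) => bucket % buckets
  }
}

val sc = new SparkContext(new SparkConf().setAppName("mapfile-dump"))
val records = sc.parallelize(Seq("b" -> "2", "a" -> "1"))  // stand-in for the real input
val keyed = records.map { case (k, v) => ((math.abs(k.hashCode) % 16, k), v) }

keyed
  .repartitionAndSortWithinPartitions(new BucketPartitioner(16))
  .foreachPartition { it =>
    // each iterator arrives sorted by (bucket, key); it could be streamed
    // into a MapFile.Writer like the one sketched above
  }
```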