I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, into a custom file format.

What I would like to do is more or less the following:

  • start from a distributed list of records, such as a Scalding pipe or similar
  • group items by some computed function
  • make sure that items belonging to the same group reside on the same server
  • to each group, apply a transformation - which involves sorting - and write the result to disk. In fact I need to write a bunch of MapFiles - which are essentially sorted SequenceFiles, plus an index.

I would like to implement the above with Scalding, but I am not sure how to do the last step.

While of course one cannot write sorted data in a distributed fashion, it should still be doable to split data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
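To make the last step concrete, here is a rough sketch of what I mean by writing one locally sorted chunk as a MapFile, driving hadoop-common's MapFile.Writer by hand (the method name and the way records reach it are made up; what I am missing is which Scalding/Cascading hook would hand me each group so I can run something like this on it):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{MapFile, Text}

// Hypothetical per-chunk writer: `records` is whatever a per-group callback hands me.
def writeChunk(records: Iterator[(String, String)], chunkDir: String): Unit = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  // MapFile requires keys to be appended in sorted order,
  // so the chunk is sorted in memory first (hence the need for small chunks).
  val sorted = records.toSeq.sortBy(_._1)
  val writer = new MapFile.Writer(conf, fs, chunkDir, classOf[Text], classOf[Text])
  try {
    sorted.foreach { case (k, v) => writer.append(new Text(k), new Text(v)) }
  } finally {
    writer.close()
  }
}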

I recognize it is a bad idea to sort very large data, which is why I plan to split the data into chunks even on a single server.

Is there any way to do something like that with Scalding? I would possibly be OK with using Cascading directly, or really another pipeline framework, such as Spark.

Andrea

1 Answer

With Scalding (and the underlying MapReduce) you will need to use the TotalOrderPartitioner, which pre-samples the input data to create appropriate buckets/splits.
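For reference, the raw MapReduce wiring underneath looks roughly like the sketch below (key types, paths and sampling parameters are placeholders, and the job's input format and paths are assumed to be configured already; how to plug this into a Scalding flow is a separate matter):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.partition.{InputSampler, TotalOrderPartitioner}

val job = Job.getInstance(new Configuration(), "total-order-sort")
job.setMapOutputKeyClass(classOf[Text])
job.setNumReduceTasks(16)                 // one sorted output chunk per reducer

// Point the partitioner at the file that will hold the computed split points.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration, new Path("/tmp/_partitions"))
job.setPartitionerClass(classOf[TotalOrderPartitioner[Text, Text]])

// Sample ~1% of the input (up to 10000 records) to pick split points
// that give the reducers evenly sized key ranges.
InputSampler.writePartitionFile(job, new InputSampler.RandomSampler[Text, Text](0.01, 10000))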

Using Spark will speed things up thanks to faster access paths to the data on disk. However, it will still require shuffles to disk/HDFS, so it will not be orders of magnitude better.

In Spark you would use a RangePartitioner, which takes the number of partitions and an RDD:

import org.apache.spark.RangePartitioner

val allData = sc.sequenceFile[String, String](paths)  // or sc.hadoopRDD / sc.hadoopFile for other input formats
val partitioner = new RangePartitioner(numPartitions, allData)
val partitionedRdd = allData.partitionBy(partitioner)
val groupedRdd = partitionedRdd.groupByKey()
// apply further transforms..
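Note that partitionBy on its own only guarantees that each key range lands in one partition; to get each chunk locally sorted you would still sort within the partition before writing it out (for example with repartitionAndSortWithinPartitions in recent Spark versions, or an explicit sort inside mapPartitions), and each sorted partition can then be written as one MapFile.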
WestCoastProjects