I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, into a custom file format.

What I would like to do is more or less the following:

  • start from a distributed list of records, such as a Scalding pipe or similar
  • group items by some computed function
  • make sure that items belonging to the same group reside on the same server
  • to each group, apply a transformation - which involves sorting - and write the result to disk. In fact I need to write a bunch of MapFiles - which are essentially sorted SequenceFiles, plus an index.

I would like to implement the above with Scalding, but I am not sure how to do the last step.

While of course one cannot write sorted data in a distributed fashion, it should still be doable to split data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
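To make the last step concrete, here is a rough sketch of what I mean by writing one locally sorted chunk as a MapFile, driving hadoop-common's MapFile.Writer by hand (the method name and the way records reach it are made up; what I am missing is which Scalding/Cascading hook would hand me each group so I can run something like this on it):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{MapFile, Text}

// Hypothetical per-chunk writer: `records` is whatever a per-group callback hands me.
def writeChunk(records: Iterator[(String, String)], chunkDir: String): Unit = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  // MapFile requires keys to be appended in sorted order,
  // so the chunk is sorted in memory first (hence the need for small chunks).
  val sorted = records.toSeq.sortBy(_._1)
  val writer = new MapFile.Writer(conf, fs, chunkDir, classOf[Text], classOf[Text])
  try {
    sorted.foreach { case (k, v) => writer.append(new Text(k), new Text(v)) }
  } finally {
    writer.close()
  }
}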

I recognize it is a bad idea to sort very large data, which is why I plan to split the data into chunks even on a single server.

Is there any way to do something like that with Scalding? I would possibly be OK with using Cascading directly, or really another pipeline framework, such as Spark.

Andrea

1 Answer

With Scalding (and the underlying MapReduce) you will need to use the TotalOrderPartitioner, which pre-samples the input data to create appropriate buckets/splits.
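For reference, the raw MapReduce wiring underneath looks roughly like the sketch below (key types, paths and sampling parameters are placeholders, and the job's input format and paths are assumed to be configured already; how to plug this into a Scalding flow is a separate matter):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.partition.{InputSampler, TotalOrderPartitioner}

val job = Job.getInstance(new Configuration(), "total-order-sort")
job.setMapOutputKeyClass(classOf[Text])
job.setNumReduceTasks(16)                 // one sorted output chunk per reducer

// Point the partitioner at the file that will hold the computed split points.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration, new Path("/tmp/_partitions"))
job.setPartitionerClass(classOf[TotalOrderPartitioner[Text, Text]])

// Sample ~1% of the input (up to 10000 records) to pick split points
// that give the reducers evenly sized key ranges.
InputSampler.writePartitionFile(job, new InputSampler.RandomSampler[Text, Text](0.01, 10000))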

Using Spark will speed things up thanks to faster access paths to the data on disk. However, it will still require shuffles to disk/HDFS, so it will not be orders of magnitude better.

In Spark you would use a RangePartitioner, which takes the number of partitions and an RDD:

import org.apache.spark.RangePartitioner

val allData = sc.sequenceFile[String, String](paths)  // or sc.hadoopRDD / sc.hadoopFile for other input formats
val partitioner = new RangePartitioner(numPartitions, allData)
val partitionedRdd = allData.partitionBy(partitioner)
val groupedRdd = partitionedRdd.groupByKey()
// apply further transforms..
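Note that partitionBy on its own only guarantees that each key range lands in one partition; to get each chunk locally sorted you would still sort within the partition before writing it out (for example with repartitionAndSortWithinPartitions in recent Spark versions, or an explicit sort inside mapPartitions), and each sorted partition can then be written as one MapFile.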
WestCoastProjects