
My question would be rather simple to answer in a single-node environment, but I don't know how to do the same thing in a distributed Spark environment. What I have now is a "frequency plot", in which for each item I have the number of times it occurs. For instance, it may be something like this: (1, 2), (2, 3), (3, 1), which means that 1 occurred 2 times, 2 occurred 3 times, and so on.

What I would like to get is the cumulative frequency for each item, so the result I would need from the example data above is: (1, 2), (2, 3+2=5), (3, 1+3+2=6).

So far, I have tried to do this by using mapPartitions, which gives the correct result if there is only one partition, but obviously not with more than one.
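
Roughly, this is the kind of code I mean (a simplified sketch, assuming the RDD is an RDD[(Int, Int)] of (item, frequency) pairs; the names are only illustrative):

// Simplified sketch: the running total lives inside each partition's
// iterator, so this is only a true cumulative sum when there is exactly
// one partition.
val attempt = rdd.sortBy(_._1).mapPartitions { iter =>
  var running = 0
  iter.map { case (item, freq) =>
    running += freq
    (item, running) // per-partition running sum of the frequencies
  }
}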

How can I do that?

Thanks. Marco

mgaido
  • Hi, is the first position in each tuple a "unique id"? I mean, is it possible to find (1,2) and then, somewhere else in the rdd, again (1,) ? – JoseM LM Mar 15 '15 at 07:29
  • It may be considered as a unique id, because the RDD is aggregated on that value before this step... – mgaido Mar 15 '15 at 09:36

2 Answers


I don't think what you want is possible as a distributed transformation in Spark unless your data is small enough to be aggregated into a single partition. Spark functions work by distributing jobs to remote processes, and the only way to communicate back is with an action that returns some value, or with an accumulator. Unfortunately, accumulators can't be read by the distributed jobs; they're write-only.

If your data is small enough to fit in memory on a single partition/process, you can coalesce(1), and then your existing code will work. If not, but a single partition will fit in memory, then you might use a local iterator:

var total = 0L
rdd.sortBy(_._1).toLocalIterator.foreach { case (item, count) =>
  total += count
  println((item, total)) // or write to a local file
}
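
For the coalesce(1) case, a rough sketch of the same idea (again, only viable if the whole data set fits in a single partition; rdd is assumed to be an RDD[(Int, Int)]):

// Rough sketch: forcing a single partition first turns the per-partition
// running sum from the question into a global one.
val cumulative = rdd.coalesce(1).sortBy(_._1).mapPartitions { iter =>
  var running = 0L
  iter.map { case (item, freq) =>
    running += freq
    (item, running)
  }
}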
DPM
  • What you have written is exactly my thought, but I hoped someone smarter than me would have found a suitable solution... – mgaido Mar 15 '15 at 09:38
  • I think it's a mismatch for the framework. Spark is for parallel calculations on large data sets, while you need a single-threaded calculation, hence your difficulty. Good luck. – DPM Mar 15 '15 at 20:57

If I understood your question correctly, it really looks like a fit for one of the combiner functions; take a look at the different variants of the aggregateByKey and reduceByKey functions (both in PairRDDFunctions).
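
For example, a minimal sketch of what reduceByKey does with pairs like these (it combines the values that share the same key; rdd is again assumed to be an RDD[(Int, Int)]):

// Sketch: reduceByKey sums the values per key, e.g. (1, 2), (1, 3), (2, 1)
// becomes (1, 5), (2, 1).
val perKeyTotals = rdd.reduceByKey(_ + _)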

ms.
  • I think you have not understood the question, or you don't know how those functions work: they aggregate or reduce over the values having the same key, while I need to aggregate values across different keys... – mgaido Mar 15 '15 at 09:38
  • I somehow missed the crucial keyword ("cumulative") and rushed to a conclusion. Sorry about that. – ms. Mar 15 '15 at 11:26