How would you transform this functional Java code to Scala on top of SparkContext?

Question

OK, so I have points, which is a List<GeoPoint>.

The following piece of code is written using the Java 8 functional API. It takes the points, calculates the matching cluster for each point, and then groups the points by ClusterKey. We end up with a Map<ClusterKey, List<GeoPoint>>. Here it is:

points.
   parallelStream().unordered().
   collect(groupingByConcurrent(Functions::calcClusterKey))
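
For readers who want to run the snippet above, here is a minimal self-contained sketch. The question does not show GeoPoint, ClusterKey, or Functions.calcClusterKey, so the classes and the 1-degree grid rule below are hypothetical stand-ins, not the asker's actual code. Note that groupingByConcurrent returns a ConcurrentMap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

class ClusterGrouping {

    // Hypothetical stand-in for the question's GeoPoint.
    static final class GeoPoint {
        final double lat;
        final double lon;
        GeoPoint(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    // Hypothetical stand-in for ClusterKey: a 1-degree grid cell.
    // equals/hashCode are needed because instances are used as map keys.
    static final class ClusterKey {
        final long latCell;
        final long lonCell;
        ClusterKey(long latCell, long lonCell) { this.latCell = latCell; this.lonCell = lonCell; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof ClusterKey)) return false;
            ClusterKey other = (ClusterKey) o;
            return latCell == other.latCell && lonCell == other.lonCell;
        }
        @Override public int hashCode() { return Objects.hash(latCell, lonCell); }
    }

    // Assumed clustering rule (not from the question): floor to whole degrees.
    static ClusterKey calcClusterKey(GeoPoint p) {
        return new ClusterKey((long) Math.floor(p.lat), (long) Math.floor(p.lon));
    }

    // Same pipeline as in the question, with the collector fully qualified.
    static ConcurrentMap<ClusterKey, List<GeoPoint>> cluster(List<GeoPoint> points) {
        return points.parallelStream()
                .unordered()
                .collect(Collectors.groupingByConcurrent(ClusterGrouping::calcClusterKey));
    }

    public static void main(String[] args) {
        List<GeoPoint> points = Arrays.asList(
                new GeoPoint(1.2, 3.4), new GeoPoint(1.7, 3.9), new GeoPoint(5.0, 6.0));
        ConcurrentMap<ClusterKey, List<GeoPoint>> clusters = cluster(points);
        System.out.println(clusters.size() + " clusters"); // prints "2 clusters"
    }
}
```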

If you have another question, well, ask another question, don't change the existing one. — Jens Schauder, Jan 28 '15 at 20:32

score 2 · Answer 1 · answered Jan 28 '15 at 16:02

sc.parallelize(points).groupBy(Functions.calcClusterKey).collect.toMap

The correspondence is pretty 1:1.

lmm

Pretty awesome, man, and quick! Thanks! The only thing is that I read somewhere that groupBy may be problematic if my list contains a couple of million points, and that I should go for a solution that involves reduceByKey instead. – kumetix Jan 28 '15 at 16:06
`groupByKey` is indeed expensive, but if you want the full `Map` you're going to have to do something equivalent to it sooner or later. If you don't need the `Map` but are happy to do some sort of aggregation first then you should indeed keep it as an `RDD` and use `reduceByKey` (or better still, `aggregateByKey`) to perform further reductions first, something like `....map{v => (Functions.calcClusterKey(v), v)}.aggregateByKey(...)`. – lmm Jan 28 '15 at 16:25
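
The trade-off in this comment is not specific to Spark. As a rough analogue in the question's own Java 8 streams API (not Spark code), a downstream collector reduces each group while it is being built, so the full per-key lists are never materialised. The sketch below assumes only a per-cluster count is needed; the GeoPoint class, the String key standing in for ClusterKey, and the grid rule are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

class ClusterCounts {

    // Hypothetical stand-in for the question's GeoPoint.
    static final class GeoPoint {
        final double lat;
        final double lon;
        GeoPoint(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    // Assumed clustering rule; a String such as "1:3" stands in for ClusterKey.
    static String calcClusterKey(GeoPoint p) {
        return (long) Math.floor(p.lat) + ":" + (long) Math.floor(p.lon);
    }

    // Aggregate per key (here: count) instead of collecting every point into a list.
    static ConcurrentMap<String, Long> countPerCluster(List<GeoPoint> points) {
        return points.parallelStream()
                .unordered()
                .collect(Collectors.groupingByConcurrent(
                        ClusterCounts::calcClusterKey, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<GeoPoint> points = Arrays.asList(
                new GeoPoint(1.2, 3.4), new GeoPoint(1.7, 3.9), new GeoPoint(5.0, 6.0));
        System.out.println(countPerCluster(points).get("1:3")); // prints 2
    }
}
```

If the complete Map<ClusterKey, List<GeoPoint>> really is required, this does not help, which is the point made in the comment: the grouping cost has to be paid somewhere.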
