
I have points, which is a List<GeoPoint>.

The following piece of code is written using the Java 8 functional API. It takes the points, calculates the matching cluster for each point, and then groups the points by ClusterKey, so we end up with a Map<ClusterKey, List<GeoPoint>>. Here it is:

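// requires: import static java.util.stream.Collectors.groupingByConcurrent;
Map<ClusterKey, List<GeoPoint>> clusters = points
    .parallelStream().unordered()
    .collect(groupingByConcurrent(Functions::calcClusterKey));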
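
For readers who want to run the snippet above, here is a minimal self-contained sketch. The post does not show GeoPoint, ClusterKey, or Functions.calcClusterKey, so the versions below are hypothetical stand-ins, and the 1-degree grid rule is only an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

class ClusterGrouping {

    // Hypothetical stand-in for the question's GeoPoint.
    static final class GeoPoint {
        final double lat;
        final double lon;
        GeoPoint(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    // Hypothetical stand-in for ClusterKey; equals/hashCode are needed for use as a map key.
    static final class ClusterKey {
        final long latCell;
        final long lonCell;
        ClusterKey(long latCell, long lonCell) { this.latCell = latCell; this.lonCell = lonCell; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof ClusterKey)) return false;
            ClusterKey other = (ClusterKey) o;
            return latCell == other.latCell && lonCell == other.lonCell;
        }
        @Override public int hashCode() { return Objects.hash(latCell, lonCell); }
    }

    // Assumed clustering rule (not from the post): snap each point to a 1-degree grid cell.
    static ClusterKey calcClusterKey(GeoPoint p) {
        return new ClusterKey((long) Math.floor(p.lat), (long) Math.floor(p.lon));
    }

    // Same pipeline as in the question: parallel, unordered, grouped concurrently by key.
    static Map<ClusterKey, List<GeoPoint>> cluster(List<GeoPoint> points) {
        return points.parallelStream()
                .unordered()
                .collect(Collectors.groupingByConcurrent(ClusterGrouping::calcClusterKey));
    }

    public static void main(String[] args) {
        List<GeoPoint> points = Arrays.asList(
                new GeoPoint(32.1, 34.8), new GeoPoint(32.9, 34.2), new GeoPoint(48.8, 2.3));
        Map<ClusterKey, List<GeoPoint>> clusters = cluster(points);
        clusters.forEach((k, v) ->
                System.out.println(k.latCell + "," + k.lonCell + " -> " + v.size() + " point(s)"));
    }
}
```

The returned map is a ConcurrentMap, and because the stream is unordered, neither the order of the keys nor the order of the points inside each list is guaranteed.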
kumetix

1 Answer

sc.parallelize(points).groupBy(Functions.calcClusterKey).collect.toMap

The correspondence is pretty 1:1.

lmm
  • Pretty awesome, and quick, thanks! The only thing is that I read somewhere that groupBy may be problematic if my list contains a couple of million points, and that I should go for a solution that involves reduceByKey. – kumetix Jan 28 '15 at 16:06
  • `groupByKey` is indeed expensive, but if you want the full `Map` you're going to have to do something equivalent to it sooner or later. If you don't need the `Map` but are happy to do some sort of aggregation first, then you should indeed keep it as an `RDD` and use `reduceByKey` (or better still, `aggregateByKey`) to perform further reductions first, something like `....map{v => (Functions.calcClusterKey(v), v)}.aggregateByKey(...)`. – lmm Jan 28 '15 at 16:25
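
The last comment's point, aggregating per key instead of materializing every group, can be illustrated locally with the Java 8 streams from the question. This is only an analogy and not Spark code: a downstream collector plays roughly the role that `reduceByKey` or `aggregateByKey` plays on an `RDD`. The key function and the `double[]` point representation below are hypothetical simplifications, and counting is just one possible aggregation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ClusterCounts {

    // Hypothetical key function: a 1-degree grid cell encoded as "latCell:lonCell".
    static String calcClusterKey(double[] latLon) {
        return (long) Math.floor(latLon[0]) + ":" + (long) Math.floor(latLon[1]);
    }

    // Counts points per cluster; the downstream collector folds each group
    // into a single Long, so no List of points is built per key.
    static Map<String, Long> countPerCluster(List<double[]> points) {
        return points.parallelStream()
                .unordered()
                .collect(Collectors.groupingByConcurrent(
                        ClusterCounts::calcClusterKey, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {32.1, 34.8}, new double[] {32.9, 34.2}, new double[] {48.8, 2.3});
        countPerCluster(points).forEach((k, n) -> System.out.println(k + " -> " + n));
    }
}
```

If the full list of points per cluster is actually required, this shortcut does not apply, which is the trade-off the comment describes.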