
I have an RDD[(breeze.linalg.DenseMatrix[Int], Array[Int])], where DenseMatrix comes from the Breeze library. I would like to reduce it, but I am not sure how. Here is an example (I simplified the code; a DenseMatrix with only one column is not really useful in real life):

First cell of the RDD: (DenseMatrix((1), (2), (3)), Array(2, 1))

Second cell of the RDD: (DenseMatrix((4), (5), (6)), Array(1, 2))

Expected result: (DenseMatrix((1), (2), (3), (4), (5), (6)), Array(2, 1, 1, 2)) or (DenseMatrix((4), (5), (6), (1), (2), (3)), Array(1, 2, 2, 1)). The order in which the cells are reduced does not matter.

I know the size of the resulting DenseMatrix in advance, so I was thinking about creating an empty one and then filling it in by looping over the RDD. Could I use reduce() or fold() instead? How can I use them with a non-standard immutable type like DenseMatrix?

Armand Grillet
  • So you're actually trying to perform some type of concatenation with aggregate? This is almost never a good idea. If the driver can handle the output, it is almost always cheaper to perform an operation like this locally. – zero323 Aug 13 '16 at 20:24
  • What do you mean by performing it locally? I use the RDD to parallelize an operation, so at the end I have these separate DenseMatrix instances that I need to concatenate again. Is the operation I described in the last paragraph what I should do (an empty DenseMatrix that I then fill by looping over the cells)? – Armand Grillet Aug 13 '16 at 20:35
  • I mean collecting and performing it directly. Since the amount of data you have to transfer is more or less constant and the partial results seem to be obsolete, there is nothing to gain here. To be fair, though, I don't understand what you're trying to do here. – zero323 Aug 13 '16 at 20:41
  • I have a big dataset as a .csv containing small clusters. I am cutting the dataset into tiles and then clustering the observations in each tile (this is the parallelized operation). The DenseMatrix represents the observations in the tile and the array represents the clusters, e.g. the first RDD cell has three observations and the first two are in the same cluster. Once I have clustered every tile, I wish to put all the clusters back in one DenseMatrix representing the given dataset grouped by clusters, with an array representing the number of observations per cluster. – Armand Grillet Aug 13 '16 at 20:46
  • So you actually work with something similar to `o.a.s.mllib.linalg.distributed.CoordinateMatrix`, don't you? If you're confident that the final result fits into driver memory, just `collect` and assemble the results later. `fold` would be a good choice if the operation actually reduced the amount of data to be transferred. – zero323 Aug 13 '16 at 20:53
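
The two approaches discussed in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the object `TileAssembly`, the type alias `Tile`, and the helper names are hypothetical and not from the question. It also assumes the RDD is non-empty, every matrix has the same number of columns, and the final result fits in driver memory. The concatenation uses Breeze's `DenseMatrix.vertcat`.

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper object; all names here are for illustration only.
object TileAssembly {
  type Tile = (DenseMatrix[Int], Array[Int])

  // Stack two tiles: the rows of b go under the rows of a, and the cluster
  // sizes are appended in the same order so matrix and array stay aligned.
  def concat(a: Tile, b: Tile): Tile =
    (DenseMatrix.vertcat(a._1, b._1), a._2 ++ b._2)

  // Driver-side assembly of tiles that were already collected.
  // Assumes at least one tile.
  def assembleLocally(parts: Array[Tile]): Tile =
    (DenseMatrix.vertcat(parts.map(_._1): _*), parts.flatMap(t => t._2.toSeq))

  // Option 1 (the suggestion from the comments): collect, then concatenate once.
  def viaCollect(tiles: RDD[Tile]): Tile = assembleLocally(tiles.collect())

  // Option 2: reduce on the cluster. Concatenation is associative but not
  // commutative, so the row order of the result is not guaranteed; the
  // question states that the order does not matter.
  def viaReduce(tiles: RDD[Tile]): Tile = tiles.reduce(concat)
}
```

With `viaReduce`, each merge step copies both inputs into a new matrix and the amount of data sent to the driver is not reduced, which is the reason the comments favour the `collect` variant here.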

0 Answers