5

If I have an array of array (similar to a matrix) in Scala, what's the efficient way to sum up each column of the matrix? For example, if my array of array is like below:

val arr =  Array(Array(1, 100, ...), Array(2, 200, ...), Array(3, 300, ...))

and I want to sum up each column (e.g., sum up the first element of all sub-arrays, sum up the second element of all sub-arrays, etc.) and get a new array like below:

newArr = Array(6, 600, ...)

How can I do this efficiently in Spark Scala?

Carter
  • 1,563
  • 8
  • 23
  • 32

4 Answers4

6

There is a suitable .transpose method on List that can help here, although I can't say what its efficiency is like:

arr.toList.transpose.map(_.sum)

(then call .toArray if you specifically need the result as an array).

Shadowlands
  • 14,994
  • 4
  • 45
  • 43
  • Also works without `toList`: `arr.transpose.map(_.sum) #=> res24: Array[Int] = Array(6, 600)` – Michael Kohl Oct 01 '15 at 03:54
  • Apparently requires all input arrays to be the same length, I get an error otherwise. – Paul Oct 01 '15 at 03:55
  • @MichaelKohl So it does, I must have made a mistake when I tried it directly on the initial array. – Shadowlands Oct 01 '15 at 03:56
  • @Paul Yes, it specifically warns about that in the (expanded) docs for `List` - doesn't seem to mention it in the docs for `Array`, though. – Shadowlands Oct 01 '15 at 03:58
  • The mention of Spark makes me think the array can be quite big. Transposing that isn't probably the best approach – The Archetypal Paul Oct 01 '15 at 07:10
  • @Shadowlands how can we do if we have randomly generated spark.liang.Densematrix? for example A= DenseMatrix.randn(m,n,rng). how make column or row wise sum of Matrix A – Humayoo Jan 04 '17 at 08:23
4

Using breeze Vector:

scala> val arr =  Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))

scala> arr.map(breeze.linalg.Vector(_)).reduce(_ + _)
res0: breeze.linalg.Vector[Int] = DenseVector(6, 600)

If your input is sparse you may consider using breeze.linalg.SparseVector.

zero323
  • 322,348
  • 103
  • 959
  • 935
  • https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra has useful reference functions for this sort of thing, though it assumes you're using Scala / Breeze data types (e.g. DenseMatrix) instead of an Array of Arrays. – Sarah Messer May 23 '18 at 20:44
4

In practice a linear algebra vector library as mentioned by @zero323 will often be the better choice.

If you can't use a vector library, I suggest writing a function col2sum that can sum two columns -- even if they are not the same length -- and then use Array.reduce to extend this operation to N columns. Using reduce is valid because we know that sums are not dependent on order of operations (i.e. 1+2+3 == 3+2+1 == 3+1+2 == 6) :

def col2sum(x:Array[Int],y:Array[Int]):Array[Int] = {
    x.zipAll(y,0,0).map(pair=>pair._1+pair._2)
}

def colsum(a:Array[Array[Int]]):Array[Int] = {
    a.reduce(col2sum)
}

val z = Array(Array(1, 2, 3, 4, 5), Array(2, 4, 6, 8, 10), Array(1, 9));

colsum(z)

--> Array[Int] = Array(4, 15, 9, 12, 15)
Paul
  • 26,170
  • 12
  • 85
  • 119
  • 1
    Nicely done, especially as it can be expressed quite succinctly: `arr.reduce(_.zipAll(_,0,0).map{case (a,b) => a+b})` – jwvh Oct 01 '15 at 08:49
  • @jwh Do you *need* the case statment, as there is only one case? – Paul Oct 01 '15 at 21:04
  • 1
    Well there are a number of ways to sum a Tuple2 (Int,Int). Your code breaks the received argument, a tuple, into its _1 _2 parts, which is good and straight forward. I chose the "partial function" method, which requires braces `{}` and the `case` statement, because I find it a little less visually cluttered. The `case` is needed to invoke pattern matching, which then assigns each part to the designated variables (`a` and `b`). – jwvh Oct 01 '15 at 22:03
0
scala> val arr =  Array(Array(1, 100), Array(2, 200), Array(3, 300 ))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))

scala> arr.flatten.zipWithIndex.groupBy(c => (c._2 + 1) % 2)
       .map(a => a._1 -> a._2.foldLeft(0)((sum, i) => sum + i._1))

res40: scala.collection.immutable.Map[Int,Int] = Map(2 -> 600, 1 -> 6, 0 -> 15)

flatten array and zipWithIndex to get index and groupBy to map new array as column array, foldLeft to sum the column array.

chengpohi
  • 14,064
  • 1
  • 24
  • 42