
I need to count how many times a given value, for example "2", occurs in each column.

My dataset has this structure:

1 1 2 0 0 0 2 
0 2 0 1 1 1 1
1 2 1 0 2 2 2
0 0 0 0 1 1 2

I imported the file:

val ip = sc.textFile("/home/../data-scala.txt").map(line => line.split(" "))

How can I count the values equal to "2" in each column? I would expect the result to be an array such as:

[0,2,1,0,1,1,3]
emesday
Joey

2 Answers


How about something like this:

import breeze.linalg.DenseVector

def toInd(s: String): DenseVector[Int] = {
    DenseVector[Int](s.split(" ").map(x => if(x == "2") 1 else 0))
}

sc.textFile("/path/to/file").map(toInd).reduce(_ + _)

If you expect a significant number of columns with a sum equal to zero, you can replace DenseVector with SparseVector.

The solution above creates a new DenseVector object for each element of the RDD. For performance reasons, you may consider using aggregate and mutating a single accumulator vector instead:

def seqOp(acc: DenseVector[Int] , cols: Array[String]): DenseVector[Int] = {
    cols.zipWithIndex.foreach{ case (x, i) => if(x == "2") acc(i) += 1}
    acc
}

def combOp(acc1: DenseVector[Int], acc2: DenseVector[Int]): DenseVector[Int] = {
    acc1 += acc2
    acc1
}

val n = ip.first.length
ip.aggregate(DenseVector.zeros[Int](n))(seqOp, combOp)

You can easily replace DenseVector with a sparse one or scala.collection.mutable.Map if you want.

If you ask me, it is rather ugly, so I provide it only to make the answer complete.
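
As a rough sketch of the scala.collection.mutable.Map variant mentioned above, the same seqOp/combOp pair could look something like the following. To keep it runnable without a cluster, it uses plain Scala collections: `partitions` is a hypothetical stand-in for the partitions of an RDD, and foldLeft plus reduce only imitate what aggregate would do. The names `seqOpMap`, `combOpMap`, `numCols` and `asArray` are made up for illustration and are not part of the original answer.

```scala
import scala.collection.mutable

// Hypothetical helpers: the Map holds (column index -> number of "2" values seen).
def seqOpMap(acc: mutable.Map[Int, Int], cols: Array[String]): mutable.Map[Int, Int] = {
  cols.zipWithIndex.foreach { case (x, i) =>
    if (x == "2") acc(i) = acc.getOrElse(i, 0) + 1
  }
  acc
}

def combOpMap(acc1: mutable.Map[Int, Int], acc2: mutable.Map[Int, Int]): mutable.Map[Int, Int] = {
  acc2.foreach { case (i, n) => acc1(i) = acc1.getOrElse(i, 0) + n }
  acc1
}

// Assumption: the sample data from the question, split into two fake "partitions".
val partitions: Seq[Seq[String]] = Seq(
  Seq("1 1 2 0 0 0 2", "0 2 0 1 1 1 1"),
  Seq("1 2 1 0 2 2 2", "0 0 0 0 1 1 2")
)

// Fold each partition with seqOpMap, then merge the partial results with combOpMap.
val counts = partitions
  .map(part => part.map(_.split(" ")).foldLeft(mutable.Map.empty[Int, Int])(seqOpMap))
  .reduce((a, b) => combOpMap(a, b))

// Columns that never contain "2" are absent from the Map, so fill them with 0.
val numCols = 7
val asArray = Array.tabulate(numCols)(i => counts.getOrElse(i, 0))
```

With a Map, only columns that actually contain the value take up space, at the cost of converting back to a dense array when one is needed.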

zero323

You could map the existence of 2 in each position first, giving you

[ 0 0 1 0 0 0 1 ]
[ 0 1 0 0 0 0 0 ]
[ 0 1 0 0 1 1 1 ]
[ 0 0 0 0 0 0 1 ]

Then just do a reduce that adds the rows element-wise, which accumulates the sum for each column.

Without involving Spark, it looks something like:

val list = Seq(
  Seq(1, 1, 2, 0, 0, 0, 2),
  Seq(0, 2, 0, 1, 1, 1, 1),
  Seq(1, 2, 1, 0, 2, 2, 2),
  Seq(0, 0, 0, 0, 1, 1, 2)
)

list.
   map(_.map(v => if (v == 2) 1 else 0)).
   reduce((a, b) => a.zip(b).map(t => t._1 + t._2))

Finding the optimum version of this one-liner is probably a bit of a code golf challenge.
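
In that code-golf spirit, one possible shorter variant for local collections is to transpose the rows into columns and count directly. This is only a sketch under the assumption that every row has the same length (transpose throws otherwise), and it applies to plain Scala sequences rather than to an RDD. The names `rows` and `twosPerColumn` are illustrative.

```scala
// Same sample data as above, under a different (illustrative) name.
val rows = Seq(
  Seq(1, 1, 2, 0, 0, 0, 2),
  Seq(0, 2, 0, 1, 1, 1, 1),
  Seq(1, 2, 1, 0, 2, 2, 2),
  Seq(0, 0, 0, 0, 1, 1, 2)
)

// transpose turns rows into columns; count then tallies the 2s in each column.
val twosPerColumn = rows.transpose.map(_.count(_ == 2))
```

For the sample data this gives the counts 0, 2, 1, 0, 1, 1, 3, which matches the result expected in the question.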

zero323
mattinbits