
I have a list of calls, smsIn, and smsOut records in a CSV file, and I want to count the number of smsIn/smsOut for each phone number.

CallType indicates the type (call, smsIn, smsOut)

An example of the data is (phoneNumber, callType):

7035076600, 30 
5081236732, 31
5024551234, 30
7035076600, 31
7035076600, 30

Ultimately, I want something like this: phoneNum, numSMSIn, numSMSOut. I have implemented something like this:

val smsOutByPhoneNum = partitionedCalls.
                       filter { arry => arry(2) == 30 }.
                       groupBy { x => x(1) }.
                       map(f => (f._1, f._2.iterator.length)).
                       collect()

The above gives the number of SMS out for each phone number. Similarly:

val smsInByPhoneNum = partitionedCalls.
                      filter { arry => arry(2) == 31 }.
                      groupBy { x => x(1) }.
                      map(f => (f._1, f._2.iterator.length)).
                      collect()

The above gives the number of SMS in for each phone number.

Is there a way I can get both done in one iteration instead of two?

Why was this question flagged down? Is it not a valid question? Is it offensive to anybody? Or is it a duplicate? ... I would appreciate it if somebody would give a reason for flagging this down. Flagging down without a reason is very rude. – sparkDabbler Dec 27 '15 at 02:06

3 Answers


There are multiple ways you can solve this problem. A naive approach is to aggregate by (number, type) tuple and group partial results:

val partitionedCalls = sc.parallelize(Array(
  ("7035076600", "30"), ("5081236732", "31"), ("5024551234", "30"),
  ("7035076600", "31"), ("7035076600", "30")))

val codes = partitionedCalls.values.distinct.sortBy(identity).collect    

val aggregated = partitionedCalls.map((_, 1L)).reduceByKey(_ + _)
  .map { case ((number, code), cnt) => (number, (code, cnt)) }
  .groupByKey
  .mapValues(vs => codes.map(vs.toMap.getOrElse(_, 0L)))  // one count per code, in codes order
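For the sample data this is easy to sanity-check (a sketch; collect returns results in no particular order, and with the question's encoding the first count is smsOut and the second is smsIn):

aggregated.collect().foreach { case (number, counts) =>
  println(s"$number -> ${counts.mkString(", ")}")  // counts follow the order of codes
}
// prints, in some order:
// 7035076600 -> 2, 1
// 5081236732 -> 0, 1
// 5024551234 -> 1, 0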

You can also map and reduceByKey with a structure that captures all the counts (note that, per the question, code 30 is smsOut and code 31 is smsIn):

case class CallCounter(calls: Long, smsIn: Long, smsOut: Long, other: Long)

partitionedCalls
  .map {
    case (number, "30") => (number, CallCounter(0L, 0L, 1L, 0L))  // smsOut
    case (number, "31") => (number, CallCounter(0L, 1L, 0L, 0L))  // smsIn
    case (number, "32") => (number, CallCounter(1L, 0L, 0L, 0L))  // call
    case (number, _)    => (number, CallCounter(0L, 0L, 0L, 1L))  // anything else
  }
  .reduceByKey((x, y) => CallCounter(
    x.calls + y.calls, x.smsIn + y.smsIn,
    x.smsOut + y.smsOut, x.other + y.other))
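For the sample data, appending .collect().foreach(println) to the pipeline above should print counters like these (fields are calls, smsIn, smsOut, other; order may vary):

// (7035076600,CallCounter(0,1,2,0))   // one smsIn, two smsOut
// (5081236732,CallCounter(0,1,0,0))
// (5024551234,CallCounter(0,0,1,0))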

or even combine map and reduce steps in a single aggregateByKey:

val transformed = partitionedCalls.aggregateByKey(
  scala.collection.mutable.HashMap.empty[String, Long].withDefault(_ => 0L)
)(
  (acc, x) => { acc(x) += 1; acc },                                       // count within a partition
  (acc1, acc2) => { acc2.foreach { case (k, v) => acc1(k) += v }; acc1 }  // merge partition maps
).mapValues(acc => codes.map(acc))                                        // counts in codes order

Depending on the context, you should adjust the accumulator class so it better suits your needs. For example, if the number of classes is large, consider using a linear algebra library like Breeze - see How to sum up every column of a Scala array?
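A minimal sketch of the Breeze variant, assuming breeze.linalg is on the classpath; codeIndex is a hypothetical broadcast lookup from code to vector slot, built from the codes collected above:

import breeze.linalg.DenseVector

val codeIndex = sc.broadcast(codes.zipWithIndex.toMap)

val vectorCounts = partitionedCalls.aggregateByKey(
  DenseVector.zeros[Long](codes.length)                      // one slot per code
)(
  (acc, code) => { acc(codeIndex.value(code)) += 1L; acc },  // count within a partition
  (acc1, acc2) => acc1 + acc2                                // merge partitions by vector addition
)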

One thing you should definitely avoid is groupBy + map when you really mean reduceByKey. It has to shuffle all the data when all you want is just a modified word count.
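A minimal illustration of the difference:

// avoid: ships every single record across the network before counting
partitionedCalls.groupBy(identity).map { case (k, vs) => (k, vs.size) }

// prefer: combines map-side, so only per-partition partial sums are shuffled
partitionedCalls.map((_, 1L)).reduceByKey(_ + _)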

zero323
Good solution, with both the CallCounter class and aggregateByKey. I ended up using aggregateByKey. Also a good suggestion to use reduceByKey instead of groupBy + map; it helped me avoid a performance issue that would have happened down the line in production. – sparkDabbler Dec 27 '15 at 17:36

The result codes are counterintuitive, but nevertheless, this code:

object PartitionedCalls {
  def authorVersion(partitionedCalls: Seq[Seq[Long]]) = {
    val smsOutByPhoneNum = partitionedCalls.
      filter { arry => arry(1) == 30 }.
      groupBy { x => x(0) }.
      map(f => (f._1, f._2.iterator.length))

    val smsInByPhoneNum = partitionedCalls.
      filter { arry => arry(1) == 31 }.
      groupBy { x => x(0) }.
      map(f => (f._1, f._2.iterator.length))

    (smsOutByPhoneNum, smsInByPhoneNum)
  }

  def myVersion(partitionedCalls: Seq[Seq[Long]]) = {
    val smsInOut = partitionedCalls.
      filter { arry => arry(1) == 30 || arry(1) == 31 }.
      groupBy { _(1) }.                      // group by call type first
      map { case (code, t) =>
        code -> t.
          groupBy { x => x(0) }.             // then by phone number within each type
          map(f => (f._1, f._2.iterator.length))
      }

    (smsInOut(30), smsInOut(31))
  }
}

passes these tests:

import org.scalatest.FunSuite

class PartitionedCallsTest extends FunSuite {
  val in = Seq(
    Seq(7035076600L, 30L),
    Seq(5081236732L, 31L),
    Seq(5024551234L, 30L),
    Seq(7035076600L, 31L),
    Seq(7035076600L, 30L)
  )

  val out = (Map(7035076600L -> 2L, 5024551234L -> 1L),Map(7035076600L -> 1L, 5081236732L -> 1L))

  test("Author"){
    assert(out == PartitionedCalls.authorVersion(in))
  }

  test("My"){
    assert(out == PartitionedCalls.myVersion(in))
  }
}
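For what it's worth, the same result can also be produced in a genuinely single pass over the input with foldLeft; a sketch (foldVersion is my name for it, not from the answer, and it satisfies the same expected output):

def foldVersion(partitionedCalls: Seq[Seq[Long]]) =
  partitionedCalls.foldLeft((Map.empty[Long, Long], Map.empty[Long, Long])) {
    case ((out, in), arry) if arry(1) == 30L =>                      // smsOut
      (out.updated(arry(0), out.getOrElse(arry(0), 0L) + 1L), in)
    case ((out, in), arry) if arry(1) == 31L =>                      // smsIn
      (out, in.updated(arry(0), in.getOrElse(arry(0), 0L) + 1L))
    case (acc, _) => acc                                             // ignore other call types
  }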
nikiforo
This is good and works, but I agree with zero323 that groupBy + map will lead to excessive shuffling, especially when the data size is large. – sparkDabbler Dec 27 '15 at 17:34

Great answer @zero323

val partitionedCalls = sc.parallelize(Array(
  ("7035076600", "30"), ("5081236732", "31"), ("5024551234", "30"),
  ("7035076600", "31"), ("7035076600", "30")))

// count the pairs <(phoneNumber, code), count>
val keyPairCounts = partitionedCalls.map((_, 1))

// using reduceByKey, then reshaping to <phoneNumber, (code, count)>
val aggregateCounts = keyPairCounts.reduceByKey(_ + _).map {
  case ((phNum, inOrOut), cnt) => (phNum, (inOrOut, cnt))
}

// using groupByKey to merge entries for the same phone number; sort by code so
// the array order is deterministic (index 0 -> "30", index 1 -> "31")
val result = aggregateCounts.groupByKey.map(x =>
  (x._1, x._2.toSeq.sortBy(_._1).map(_._2).toArray))

// collect the result
result.map(x => (x._1, x._2.lift(0).getOrElse(0), x._2.lift(1).getOrElse(0)))
  .collect().foreach(println)
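With the sample data (and the deterministic ordering above: index 0 is code "30"/smsOut, index 1 is code "31"/smsIn) this should print, in some order:

// (7035076600,2,1)
// (5081236732,0,1)
// (5024551234,1,0)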

Reference: a good explanation of the difference between groupByKey and reduceByKey: prefer_reducebykey_over_groupbykey

Pramit