
Anonymous functions work fine, but a named function raises a NotSerializableException.

The following code sets up the problem:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.appName("demo").getOrCreate()
import sparkSession.implicits._
val sc = sparkSession.sparkContext

case class DemoRow(keyId: Int, evenOddId: Int)
case class EvenOddCountRow(keyId: Int, oddCnt: Int, evenCnt: Int)

val demoDS = sc.parallelize(Seq(DemoRow(1, 1),
                                DemoRow(1, 2),
                                DemoRow(1, 3), 
                                DemoRow(2, 1),
                                DemoRow(2, 2))).toDS()

The output of demoDS.show():

+-----+---------+
|keyId|evenOddId|
+-----+---------+
|    1|        1|
|    1|        2|
|    1|        3|
|    2|        1|
|    2|        2|
+-----+---------+

Using the anonymous function id => id % 2 == 1 inside mapGroups() works fine:

val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  val (oddIds, evenIds) = evenOddIds.partition(id => id % 2 == 1)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})

The result of demoGroup.show() is what we expect:

+-----+------+-------+
|keyId|oddCnt|evenCnt|
+-----+------+-------+
|    1|     2|      1|
|    2|     1|      1|
+-----+------+-------+

Now if I define an isOdd function and use it inside mapGroups() as below, it raises an exception:

def isOdd(id: Int) = id % 2 == 1

val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  val (oddIds, evenIds) = evenOddIds.partition(isOdd)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})

Caused by: java.io.NotSerializableException: scala.collection.LinearSeqLike$$anon$1
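(Editor's note: an exception like this means the closure failed Spark's Java-serialization round trip before being shipped to executors. That round trip can be reproduced locally, with no Spark dependency at all, to test a candidate function. A minimal sketch; SerializableCheck and isSerializable are illustrative names made up here, not Spark API:)

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializableCheck {
  // Round-trips a value through Java serialization, which is what
  // Spark requires of closures before shipping them to executors.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val anon: Int => Boolean = id => id % 2 == 1
    println(SerializableCheck.isSerializable(anon)) // function literals are serializable
    println(SerializableCheck.isSerializable(new Object)) // a plain Object is not
  }
}
```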

I tried different ways of defining the isOdd function to try to make it serializable:

val isOdd = (id: Int) => id % 2 == 1 // does not work

case object isOdd extends Function[Int, Boolean] with Serializable {
  def apply(id: Int) = id % 2 == 1
} // still does not work

Did I miss anything, or did I do something wrong? Thanks in advance!

Y.G.

1 Answer


The following works for me:

object Utils {
  def isOdd(id: Int) = id % 2 == 1
}

And then use:

evenOddIds.partition(Utils.isOdd)
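(In the spark-shell, a top-level def lives on the REPL's synthetic line object, so the closure can capture that whole object along with anything non-serializable it happens to hold; that is likely why restarting the shell also made the original definitions work. Moving the method into a standalone object avoids that capture. The same pattern can be sanity-checked locally without Spark, using Utils exactly as defined above:)

```scala
object Utils {
  def isOdd(id: Int): Boolean = id % 2 == 1
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Eta-expansion of Utils.isOdd yields a plain function value with
    // no reference to any enclosing REPL state.
    val evenOddIds = List(1, 2, 3)
    val (oddIds, evenIds) = evenOddIds.partition(Utils.isOdd)
    println(oddIds)  // List(1, 3)
    println(evenIds) // List(2)
  }
}
```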
Alex Karpov
  • Strange. After restarting the spark-shell, not only does your solution work, my previous solutions work too. ... – Y.G. Feb 09 '17 at 23:17