An anonymous function works fine inside mapGroups(), but a named function raises a NotSerializableException.
The following code sets up the problem:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.appName("demo").getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.implicits._

case class DemoRow(keyId: Int, evenOddId: Int)
case class EvenOddCountRow(keyId: Int, oddCnt: Int, evenCnt: Int)

val demoDS = sc.parallelize(Seq(DemoRow(1, 1),
                                DemoRow(1, 2),
                                DemoRow(1, 3),
                                DemoRow(2, 1),
                                DemoRow(2, 2))).toDS()
The result of demoDS.show():
+-----+---------+
|keyId|evenOddId|
+-----+---------+
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
+-----+---------+
Using the anonymous function id => id % 2 == 1 inside mapGroups() works fine:
val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  val (oddIds, evenIds) = evenOddIds.partition(id => id % 2 == 1)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})
The result of demoGroup.show() is what we expected:
+-----+------+-------+
|keyId|oddCnt|evenCnt|
+-----+------+-------+
| 1| 2| 1|
| 2| 1| 1|
+-----+------+-------+
Now if I define an isOdd function and use it inside mapGroups() as shown below, it raises an exception:
def isOdd(id: Int) = id % 2 == 1

val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  val (oddIds, evenIds) = evenOddIds.partition(isOdd)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})
Caused by: java.io.NotSerializableException: scala.collection.LinearSeqLike$$anon$1
I tried different ways of defining the isOdd function to make it serializable:
val isOdd = (id: Int) => id % 2 == 1 // does not work

case object isOdd extends Function[Int, Boolean] with Serializable {
  def apply(id: Int) = id % 2 == 1
} // still does not work
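For reference, the next variant I would try is moving the predicate onto a standalone serializable object and referencing it from there. This is only a sketch (the Helpers object and its name are mine), and I have not confirmed whether it avoids the exception:

object Helpers extends Serializable {
  // Same predicate as above, but owned by a standalone object
  // rather than the enclosing scope.
  def isOdd(id: Int): Boolean = id % 2 == 1
}

val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  // Reference the predicate through the standalone object.
  val (oddIds, evenIds) = evenOddIds.partition(Helpers.isOdd)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})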
Did I miss anything, or am I doing something wrong? Thanks in advance!