I have a list of around 20-25 columns read from a conf file, and I need to aggregate the first non-null value of each. I tried writing a function that takes the column list and the aggregate expressions read from the conf file.
I was able to use the first function, but I couldn't find how to specify first with
ignoreNulls set to true.
The code that I tried is:
import org.apache.spark.sql.DataFrame

def groupAndAggregate(df: DataFrame, cols: List[String], aggregateFun: Map[String, String]): DataFrame = {
  df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
}
val df = sc.parallelize(Seq(
  (0, null, "1"),
  (1, "2", "2"),
  (0, "3", "3"),
  (0, "4", "4"),
  (1, "5", "5"),
  (1, "6", "6"),
  (1, "7", "7")
)).toDF("grp", "col1", "col2")
// first
groupAndAggregate(df, List("grp"), Map("col1" -> "first", "col2" -> "COUNT")).show()
+---+-----------+-----------+
|grp|first(col1)|count(col2)|
+---+-----------+-----------+
|  1|          2|          4|
|  0|       null|          3|
+---+-----------+-----------+
I need to get 3 as the result in place of null. I am using Spark 2.1.0 and Scala 2.11.
Edit 1:
If I use the following function
import org.apache.spark.sql.functions.{first,count}
df.groupBy("grp").agg(first(df("col1"), ignoreNulls = true), count("col2")).show()
I get my desired result. Is there a way to pass ignoreNulls = true for the first function through the Map?
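One direction I am considering, as a rough sketch only: keep the Map[String, String] from the conf file, but translate each entry into a Column expression before calling agg, so that "first" can be built with ignoreNulls = true. The names toAggColumn and groupAndAggregateIgnoringNulls are hypothetical helpers made up for illustration, and the fallback that passes any other function name through expr is an assumption about what the conf file may contain. The sketch also assumes the Map is non-empty.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, expr, first}

// Hypothetical helper: turns one (column -> function name) entry from the conf
// into a Column expression, special-casing "first" so that nulls are skipped.
def toAggColumn(colName: String, funName: String): Column =
  funName.toLowerCase match {
    case "first" => first(col(colName), ignoreNulls = true)
    case "count" => count(col(colName))
    // Assumption: any other name is a one-argument Spark SQL aggregate function.
    case other   => expr(s"$other(`$colName`)")
  }

// Hypothetical variant of groupAndAggregate that uses Column expressions
// instead of passing the Map[String, String] straight to agg.
// Assumes aggregateFun is non-empty, since agg needs at least one expression.
def groupAndAggregateIgnoringNulls(
    df: DataFrame,
    cols: List[String],
    aggregateFun: Map[String, String]): DataFrame = {
  val aggCols = aggregateFun.toList.map { case (c, f) => toAggColumn(c, f) }
  df.groupBy(cols.head, cols.tail: _*).agg(aggCols.head, aggCols.tail: _*)
}
```

With the df above, groupAndAggregateIgnoringNulls(df, List("grp"), Map("col1" -> "first", "col2" -> "COUNT")).show() should then give a non-null first(col1) for grp 0. Which non-null value comes first depends on row order within the group, which Spark does not guarantee after a shuffle.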