
I have a list of around 20-25 columns from a conf file and have to aggregate the first non-null value of each. I wrote a function that takes the column list and the aggregation expressions read from the conf file. I was able to use the first function, but couldn't find how to specify first with ignoreNulls set to true.

The code that I tried is

import org.apache.spark.sql.DataFrame

def groupAndAggregate(df: DataFrame, cols: List[String], aggregateFun: Map[String, String]): DataFrame = {
  df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
}

val df = sc.parallelize(Seq(
  (0,  null, "1"),
  (1, "2", "2"),
  (0, "3", "3"),
  (0, "4", "4"),
  (1, "5", "5"),
  (1, "6", "6"),
  (1, "7", "7")
)).toDF("grp", "col1", "col2")


//first
groupAndAggregate(df,  List("grp"), Map("col1"-> "first", "col2"-> "COUNT") ).show()

+---+-----------+-----------+
|grp|first(col1)|count(col2)|
+---+-----------+-----------+
|  1|          2|          4|
|  0|       null|          3|
+---+-----------+-----------+

I need to get 3 as the result in place of null. I am using Spark 2.1.0 and Scala 2.11.

Edit 1:

If I use the following instead

import org.apache.spark.sql.functions.{first,count}
df.groupBy("grp").agg(first(df("col1"), ignoreNulls = true), count("col2")).show()

I get my desired result. Can we pass ignoreNulls = true for the first function through the Map?
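For reference, one possible workaround (a sketch, not the accepted approach) is to skip the Map[String, String] overload of agg entirely, since it cannot carry ignoreNulls, and build Column expressions from the same Map using Spark SQL's two-argument first(col, ignoreNulls) form:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Build Column expressions from the conf-driven Map so that first()
// can carry ignoreNulls; other functions fall through unchanged.
def groupAndAggregate(df: DataFrame, cols: List[String],
                      aggregateFun: Map[String, String]): DataFrame = {
  val aggExprs = aggregateFun.map {
    case (col, fun) if fun.equalsIgnoreCase("first") =>
      expr(s"first($col, true)")           // true = ignore nulls
    case (col, fun) =>
      expr(s"${fun.toLowerCase}($col)")
  }.toList
  df.groupBy(cols.head, cols.tail: _*).agg(aggExprs.head, aggExprs.tail: _*)
}
```

This keeps the conf-file-driven call site unchanged; the output columns are named by Spark (e.g. `first(col1, true)`), so you may still want to alias them.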

Shiva Achari
2 Answers


I have been able to achieve this by creating a list of Columns and passing it to the agg function of groupBy. The earlier approach had an issue where I was not able to name the columns, as agg with a Map gives no control over the names or order of the columns in the output DataFrame, so I have renamed the columns in the list itself.

import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

def groupAndAggregate(df: DataFrame): DataFrame = {
  val list: ListBuffer[Column] = new ListBuffer[Column]()
  try {
    val columnFound = getAggColumns(df) // function to return a java.util.Map[String, String]

    // Build a first(..., ignoreNulls = true) expression for each configured
    // column, aliased so the output column names are predictable.
    columnFound.entrySet().asScala.foreach { field =>
      list += first(df(field.getValue), ignoreNulls = true).as(field.getKey)
    }

    list += sum(df("col1")).as("watch_time")
    list += count("*").as("frequency")

    val groupColumns = getGroupColumns(df) // function to return a List[String]

    df.groupBy(groupColumns.head, groupColumns.tail: _*)
      .agg(list.head, list.tail: _*)
  } catch {
    case e: Exception =>
      e.printStackTrace()
      null
  }
}
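As a self-contained illustration of the same pattern (a sketch using the question's example data rather than the conf-driven getAggColumns/getGroupColumns helpers; the aliases first_col1 and count_col2 are arbitrary):

```scala
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{count, first}

val spark = SparkSession.builder().master("local[1]").appName("aggList").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, null, "1"), (1, "2", "2"), (0, "3", "3"), (0, "4", "4"),
  (1, "5", "5"), (1, "6", "6"), (1, "7", "7")
).toDF("grp", "col1", "col2")

// Build the aggregation list: ignoreNulls for first(), plus explicit
// aliases so the output column names are under our control.
val aggCols = new ListBuffer[Column]()
aggCols += first(df("col1"), ignoreNulls = true).as("first_col1")
aggCols += count(df("col2")).as("count_col2")

df.groupBy("grp").agg(aggCols.head, aggCols.tail: _*).show()
```

With this, grp 0 yields a non-null value for first_col1 instead of null.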
Shiva Achari

I think you should use the na operator and drop all the nulls before you do the aggregation.

na: DataFrameNaFunctions Returns a DataFrameNaFunctions for working with missing data.

drop(cols: Array[String]): DataFrame Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.

The code would then look as follows:

df.na.drop(Seq("col1")).groupBy(...).agg(first("col1"))

That will impact the count, so you'd have to compute the count separately.
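The split described above can be sketched as follows (note that this na.drop overload takes column names as a Seq; the join back on grp reunites the two aggregates):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, first}

val spark = SparkSession.builder().master("local[1]").appName("naDrop").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, null, "1"), (1, "2", "2"), (0, "3", "3"), (0, "4", "4"),
  (1, "5", "5"), (1, "6", "6"), (1, "7", "7")
).toDF("grp", "col1", "col2")

// first() over only the rows where col1 is not null
val firsts = df.na.drop(Seq("col1")).groupBy("grp")
  .agg(first("col1").as("first_col1"))

// count over the full DataFrame, so rows with null col1 still count
val counts = df.groupBy("grp").agg(count("col2").as("count_col2"))

val result = firsts.join(counts, Seq("grp"))
result.show()
```

This is more moving parts than first(col, ignoreNulls = true), but it keeps each aggregate over the right set of rows.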

Jacek Laskowski