
Suppose I have a Spark DataFrame df with some columns (id, ...) and a string sqlFilter containing a SQL filter, e.g. "id is not null". I want to filter df based on sqlFilter, i.e.

val filtered = df.filter(sqlFilter)

Now, I want to have a list of 10 ids from df that were removed by the filter.

Currently, I'm using a "leftanti" join to achieve this, i.e.

val examples = df.select("id").join(filtered.select("id"), Seq("id"), "leftanti")
                 .take(10)
                 .map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})

However, this is really slow. My guess is that this can be implemented faster, because Spark only has to keep a list for every partition and add an id to the list whenever the filter removes a row and the list contains fewer than 10 elements. Once the action after the filter finishes, Spark has to collect the lists from the partitions until it has 10 ids.
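As a rough sketch of that idea, if I had the filter as a Scala predicate pred: Row => Boolean (hypothetical; I only have the SQL string), it could look like this:

import org.apache.spark.sql.{DataFrame, Row}

// pred is a hypothetical Row-level version of sqlFilter
def removedIds(df: DataFrame, pred: Row => Boolean, n: Int = 10): Array[String] =
  df.rdd
    .mapPartitions { rows =>
      // per partition, keep at most n ids of rows the filter would remove
      rows.filterNot(pred)
          .take(n)
          .map(r => Option(r.getAs[Any]("id")).map(_.toString).getOrElse("null"))
    }
    .take(n) // scans partitions incrementally until n examples are found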

I wanted to use accumulators as described here, but I failed because I could not find out how to parse and use sqlFilter.

Does anybody have an idea how I can improve the performance?

Update: Ramesh Maharjan suggested in the comments to invert the SQL query, i.e.

df.filter(s"NOT ($filterString)")
          .select(key)
          .take(10)
          .map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})

This indeed improves the performance, but it is not 100% equivalent. If there are multiple rows with the same id, the id ends up in the examples as soon as one of its rows is removed by the filter. With the leftanti join it does not end up in the examples because the id is still present in filtered. However, that is fine with me.
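To illustrate the difference with made-up data (assuming spark is the SparkSession):

import spark.implicits._

val d = Seq((1, 1), (1, -1)).toDF("id", "value")
val f = d.filter("value > 0")

// leftanti join: empty, because id 1 still appears in f via the first row
d.select("id").join(f.select("id"), Seq("id"), "leftanti").show()

// inverted filter: contains id 1, because the second row is removed
d.filter("NOT (value > 0)").select("id").show()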

I'm still interested if it is possible to create the list "on the fly" with accumulators or something similar.

Update 2

Another issue with inverting the filter is SQL's logical value UNKNOWN, because NOT UNKNOWN = UNKNOWN, i.e. NOT (null <> 1) <=> UNKNOWN, and hence such a row shows up neither in the filtered dataframe nor in the inverted dataframe.
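A small made-up example of this (assuming spark is the SparkSession):

import spark.implicits._

val d = Seq((Some(1), "record 1"), (None, "record 2")).toDF("id", "name")

d.filter("id <> 2").show()       // drops the null row: null <> 2 is UNKNOWN
d.filter("NOT (id <> 2)").show() // also drops it: NOT UNKNOWN is still UNKNOWN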

lmcoy

1 Answer


You can use a custom accumulator (a longAccumulator won't help you here because the removed ids are null), and you must formulate your filter statement as a function:

Suppose you have a dataframe:

+----+--------+
|  id|    name|
+----+--------+
|   1|record 1|
|null|record 2|
|   3|record 3|
+----+--------+

Then you could do:

import org.apache.spark.sql.Row
import org.apache.spark.util.AccumulatorV2

// accumulator that collects the removed rows
class RowAccumulator(var value: Seq[Row]) extends AccumulatorV2[Row, Seq[Row]] {
  def this() = this(Seq.empty[Row])
  override def isZero: Boolean = value.isEmpty
  override def copy(): AccumulatorV2[Row, Seq[Row]] = new RowAccumulator(value)
  override def reset(): Unit = value = Seq.empty[Row]
  override def add(v: Row): Unit = value = value :+ v
  override def merge(other: AccumulatorV2[Row, Seq[Row]]): Unit = value = value ++ other.value
}

val filteredAccum = new RowAccumulator()
ss.sparkContext.register(filteredAccum, "Filter Accum") // ss is the SparkSession

// filter predicate that records removed rows in the accumulator before dropping them
val filterIdIsNotNull = (r:Row) => {
  if(r.isNullAt(r.fieldIndex("id"))) {
    filteredAccum.add(r)
    false
  } else {
    true
  }}

df
  .filter(filterIdIsNotNull)
  .show()

println(filteredAccum.value)

gives

+---+--------+
| id|    name|
+---+--------+
|  1|record 1|
|  3|record 3|
+---+--------+

List([null,record 2])

But personally, I would not do this; I would rather do something like you've already suggested:

val dfWithFilter = df
  .withColumn("keep",expr("id is not null"))
  .cache() // check whether caching is feasible

// show 10 records which we do not keep
dfWithFilter.filter(!$"keep").drop($"keep").show(10) // or use take(10)

+----+--------+
|  id|    name|
+----+--------+
|null|record 2|
+----+--------+

// rows that we keep
val filteredDf = dfWithFilter.filter($"keep").drop($"keep")
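If the predicate comes in as your sqlFilter string, a sketch of the same idea (treating rows where the predicate evaluates to UNKNOWN/null as removed, which also addresses your Update 2) could look like this:

import org.apache.spark.sql.functions.{coalesce, expr, lit, not}
import ss.implicits._ // ss is the SparkSession, for the $"..." syntax

// sqlFilter is the user-supplied SQL string, e.g. "id is not null"
val dfWithSqlFilter = df
  .withColumn("keep", expr(sqlFilter))
  .cache()

// 10 example rows removed by the filter (keep = false or keep = null)
dfWithSqlFilter
  .filter(not(coalesce($"keep", lit(false))))
  .select("id")
  .show(10)

// rows that pass the filter
val keptDf = dfWithSqlFilter.filter($"keep" === true).drop("keep")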
Raphael Roth
  • My idea was to use my own version of `CollectionAccumulator` to store the ids. However, I went for the inverted filter (the `keep` column is a good idea) and I agree that this solution is better because it is easier to understand and maintain. I was just curious whether it is possible to use an accumulator, and I don't see how to do it. I can't use a Scala function because the filters are supplied by business users, and they want examples of the filtered rows to be able to look at the source data and find out why the data does not look as expected. – lmcoy Jul 16 '18 at 20:24