
I have the following DataFrame: `df`.

At some point I need to filter out rows based on timestamps (in milliseconds). However, it is important to me to record how many rows were filtered out (if there are too many, I want to fail the job). Naively I can do:

// ====== lots of calculations on df ======
val df_filtered = df.filter($"ts" >= startDay && $"ts"  <= endDay)
val filtered_count = df.count - df_filtered.count

However, this feels like complete overkill, since Spark will execute the whole lineage three times (the filter plus the two counts). In Hadoop MapReduce this task is easy, because I can maintain a counter and bump it for each filtered row. Is there a more efficient way? I could only find accumulators, but I can't see how to connect them to `filter`.

A suggested approach was to cache `df` before the filter; however, I would prefer to keep that option as a last resort because of the DataFrame's size.
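
For reference, the caching variant would look roughly like this (a sketch only; it trades the repeated evaluation for keeping `df` materialized, which is exactly what I would like to avoid given its size):

// Sketch of the suggested caching approach, kept only for comparison:
// persist df once so the filter and both counts reuse the materialized data.
df.cache()

val df_filtered = df.filter($"ts" >= startDay && $"ts" <= endDay)
val filtered_count = df.count - df_filtered.count

df.unpersist()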

RefiPeretz
  • https://stackoverflow.com/a/44279421/5741205 – MaxU - stand with Ukraine Apr 10 '18 at 12:18
  • How is this approach faster compared to `count`? Is there a way to count exactly? It still doesn't address all the work before the count itself: without a cache, all the calculations will be executed 3 times. – RefiPeretz Apr 10 '18 at 12:23
  • How about `df.except(df_filtered).count` (a sketch follows these comments)? – philantrovert Apr 10 '18 at 12:28
  • An example of accumulator usage can be found [here](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-accumulators.html). But note that using accumulators in transformations is not 100% accurate, since transformations can be executed multiple times in case of failures. I guess in your case this can be ignored. Another drawback is that you can't stop processing the moment you discover there are too many errors, because you can't get the accumulator value inside the filter operation. So you will need to fail the job after all processing. – Vladislav Varslavans Apr 10 '18 at 12:32
  • @VladislavVarslavans Can you give a simple example or link on how to use accumulators with a DataFrame filter, because I can't see it. If I filter out the values, how can I update the accumulator for them? – RefiPeretz Apr 10 '18 at 12:37
  • @RefiPeretz i've added the code as an answer below :) – Vladislav Varslavans Apr 10 '18 at 12:53
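
A minimal sketch of the except approach from the comments (maxFiltered is a hypothetical threshold, not part of the question; note that except has EXCEPT DISTINCT semantics, so duplicate rows are collapsed, and it still re-evaluates the lineage):

// Sketch of philantrovert's suggestion; maxFiltered is a hypothetical threshold.
val df_filtered = df.filter($"ts" >= startDay && $"ts" <= endDay)
val filtered_count = df.except(df_filtered).count

if (filtered_count > maxFiltered) {
  throw new RuntimeException(s"Too many rows filtered out: $filtered_count")
}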

1 Answer


Spark 1.6.0 code:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  case class xxx(a: Int, b: Int)

  def main(args: Array[String]): Unit = {

    val df = sqlContext.createDataFrame(sc.parallelize(Seq(xxx(1, 1), xxx(2, 2), xxx(3,3))))

    // Accumulator that counts the rows rejected by the filter.
    val acc = sc.accumulator[Long](0)

    // Drop to the RDD API so the filter function can bump the accumulator
    // for every row it rejects.
    val filteredRdd = df.rdd.filter(r => {
      if (r.getAs[Int]("a") > 2) {
        true
      } else {
        acc.add(1)
        false
      }
    })

    // Rebuild a DataFrame from the filtered RDD, reusing the original schema.
    val filteredRddDf = sqlContext.createDataFrame(filteredRdd, df.schema)

    filteredRddDf.show()

    // The accumulator is populated only after an action has evaluated the rows.
    println(acc.value)
  }
}

Spark 2.x.x code:

import org.apache.spark.sql.SparkSession

object Main {

  val ss = SparkSession.builder().master("local[*]").getOrCreate()
  val sc = ss.sparkContext

  case class xxx(a: Int, b: Int)

  def main(args: Array[String]): Unit = {

    val df = ss.createDataFrame(sc.parallelize(Seq(xxx(1, 1), xxx(2, 2), xxx(3,3))))

    // Built-in long accumulator available in Spark 2.x.
    val acc = sc.longAccumulator

    // Typed filter: the lambda rejects rows and bumps the accumulator for each one.
    val filteredDf = df.filter(r => {
      if (r.getAs[Int]("a") > 2) {
        true
      } else {
        acc.add(1)
        false
      }
    }).toDF()

    filteredDf.show()

    // The accumulator is populated only after an action has evaluated the rows.
    println(acc.value)

  }
}
Vladislav Varslavans
  • Done (see answer) :) – Vladislav Varslavans Apr 10 '18 at 13:31
  • Thanks a lot! I don't know what the problem is: when I run it I get 0 in the accumulator. I matched my code to yours, any idea? – RefiPeretz Apr 10 '18 at 14:23
  • OK, I think I got it, I only get the accumulator value when I do show(). Any suggestion for production code instead of show()? – RefiPeretz Apr 10 '18 at 14:38
  • Any action that triggers calculation of all elements will do. You can use `foreach` or `count`. But you *must* use an action, since it's one of the basics of Spark that calculation is triggered only by actions. `show` might be involved in some optimizations, since it prints only a limited number of elements. – Vladislav Varslavans Apr 10 '18 at 14:47
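
A minimal sketch of that advice applied back to the original requirement (maxFiltered is a hypothetical threshold; count forces every row through the filter, so the accumulator is complete before it is read):

// Use a full action such as count so the accumulator reflects every row.
val keptCount = filteredDf.count()

// Fail the job after processing if too many rows were dropped.
val dropped: Long = acc.value   // implicit unboxing of java.lang.Long
if (dropped > maxFiltered) {
  throw new RuntimeException(s"Filtered out $dropped rows, limit is $maxFiltered")
}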