Suppose I have a Spark DataFrame df with some columns (id, ...) and a string sqlFilter containing a SQL filter, e.g. "id is not null". I want to filter the DataFrame df based on sqlFilter, i.e.

val filtered = df.filter(sqlFilter)
Now I want a list of 10 ids from df that were removed by the filter. Currently, I'm using a "leftanti" join to achieve this, i.e.

val examples = df.select("id").join(filtered.select("id"), Seq("id"), "leftanti")
  .take(10)
  .map(row => Option(row.get(0)).map(_.toString).getOrElse("null"))
However, this is really slow. My guess is that it can be implemented faster: Spark would only need to keep one list per partition, adding an id whenever the filter removes a row and the list has fewer than 10 elements. Once the action after the filter finishes, Spark would collect the lists from the partitions until it has 10 ids.
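The per-partition idea can at least be approximated without accumulators: invert the predicate and cap each partition at 10 rows before collecting. A sketch under assumptions, not a drop-in for the original df and sqlFilter (the local SparkSession and the sample data are stand-ins):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, expr, lit, not}

val spark = SparkSession.builder().master("local[2]").appName("removed-ids").getOrCreate()
import spark.implicits._

// stand-ins for the df and sqlFilter of the question
val df = Seq(Some(1), None, Some(2), None).toDF("id")
val sqlFilter = "id is not null"

// rows removed by the filter: predicate false or UNKNOWN (null)
val removedRows = df.filter(not(coalesce(expr(sqlFilter), lit(false))))

// cap each partition at 10 ids, then take at most 10 overall
val examples = removedRows
  .select("id")
  .rdd
  .mapPartitions(_.take(10))
  .take(10)
  .map(row => Option(row.get(0)).map(_.toString).getOrElse("null"))
```

take(10) already short-circuits once enough rows are found, so the mapPartitions cap mostly documents the intent; the heavy cost is still one pass over the data.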
I wanted to use accumulators as described here, but I failed because I could not figure out how to parse and use sqlFilter. Does anybody have an idea how I can improve the performance?
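For the parsing part: a SQL filter string can be turned into a Column with org.apache.spark.sql.functions.expr, which makes an accumulator-based variant at least expressible. A minimal sketch, assuming a local SparkSession and sample data in place of the real df and sqlFilter; note that accumulator updates inside transformations are not guaranteed exactly-once, so duplicates are possible on task retries, and the list is not capped at 10:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, expr, lit, udf}

val spark = SparkSession.builder().master("local[2]").appName("acc-sketch").getOrCreate()
import spark.implicits._

// stand-ins for the df and sqlFilter of the question
val df = Seq(Some(1), None, Some(2), None).toDF("id")
val sqlFilter = "id is not null"

// expr() parses the SQL filter string into a Column; coalesce maps
// UNKNOWN (null) to false so the UDF below sees a plain Boolean
val pred = coalesce(expr(sqlFilter), lit(false))

// collect removed ids as a side effect of filtering
val removed = spark.sparkContext.collectionAccumulator[String]("removedIds")
val record = udf { (keep: Boolean, id: String) =>
  if (!keep) removed.add(id)
  keep
}

val filtered = df.filter(record(pred, coalesce(col("id").cast("string"), lit("null"))))
filtered.count() // the accumulator is only populated once an action runs
val examples = removed.value // java.util.List[String]; may contain duplicates on retries
```

This avoids a second pass over the data, at the price of the usual side-effect caveats for accumulators in transformations.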
Update: Ramesh Maharjan suggested in the comments to invert the SQL query, i.e.

df.filter(s"NOT ($filterString)")
  .select(key)
  .take(10)
  .map(row => Option(row.get(0)).map(_.toString).getOrElse("null"))
This indeed improves the performance, but it is not 100% equivalent: if there are multiple rows with the same id, the id ends up in the examples as soon as one of those rows is removed by the filter. With the leftanti join it would not, because the id is still present in filtered.
However, that is fine with me.
I'm still interested in whether it is possible to create the list "on the fly" with accumulators or something similar.
Update 2

Another issue with inverting the filter is SQL's logical value UNKNOWN: NOT UNKNOWN = UNKNOWN, e.g. NOT (null <> 1) <=> UNKNOWN, and hence such rows show up neither in the filtered dataframe nor in the inverted dataframe.
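The UNKNOWN problem has a null-safe workaround: a row is removed by filter(sqlFilter) exactly when the predicate is not true, i.e. false or UNKNOWN, which coalesce can express. A small demonstration (the local SparkSession, sample data, and filter are stand-ins):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, expr, lit, not}

val spark = SparkSession.builder().master("local[1]").appName("unknown-demo").getOrCreate()
import spark.implicits._

val df = Seq(Some(1), Some(2), None).toDF("id")
val sqlFilter = "id <> 1"

// naive inversion: NOT (null <> 1) is UNKNOWN, so the null row is
// missing from both the filtered and the inverted result
val naive = df.filter(s"NOT ($sqlFilter)").count() // 1: only id = 1

// null-safe inversion: treat UNKNOWN as "removed by the filter"
val nullSafe = df.filter(not(coalesce(expr(sqlFilter), lit(false)))).count() // 2: id = 1 and the null row
```

coalesce(pred, false) maps UNKNOWN to false before the negation, so the inverted result is the exact complement of the filtered one.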