-2

A sample data:

+--------------------+-----+--------+----+----------+--------+-----------+
|                  id|click|    hour|  C1|banner_pos| site_id|site_domain|
+--------------------+-----+--------+----+----------+--------+-----------+
| NULL               |    0|14102100|1005|         0|1fbe01fe|   f3845767|
|10000169349117863715|    0|14102100|1005|         0|1fbe01fe|   f3845767|
|10000371904215119486|    0|NULL    |1005|         0|1fbe01fe|   f3845767|
|10000640724480838376|    0|14102100|1005|         0|1fbe01fe|   f3845767|
|10000679056417042096|    0|14102100|1005|         1|fe8cc448|   9166c161|
+--------------------+-----+--------+----+----------+--------+-----------+

Expected result:

+--------------------+-----+--------+----+----------+--------+-----------+
|                  id|click|    hour|  C1|banner_pos| site_id|site_domain|
+--------------------+-----+--------+----+----------+--------+-----------+
| NULL               |    0|14102100|1005|         0|1fbe01fe|   f3845767|
|10000371904215119486|    0|NULL    |1005|         0|1fbe01fe|   f3845767|
+--------------------+-----+--------+----+----------+--------+-----------+

I want to filter all columns in a Dataframe with the condition that at least one column contains "NULL" string. I have many columns, around 30. I have seen that with a few columns conditions can be added like this:

df2 = df1.filter($"Status" === 2 || $"Status" === 3)

But since I have many columns with the same conditions, is there a way to treat the columns as a collection and filter them all together? I have tried the following but it doesn't seem to work:

  df2.filter(
    lit(
      df2.columns.map(col(_).contains("NULL")).contains(lit(true))
    )
  ).show()

It seems df2.columns.map(col(_).contains("NULL")).contains(lit(true)) always returns false.

Why is it like this? Could you explain why this doesn't work?


The following codes work for the above purpose.

  df.filter(
    lit(true).isin(df.columns.map(col(_).contains("NULL")): _*)
  ).show()
CyberPlayerOne
  • 3,078
  • 5
  • 30
  • 51

2 Answers2

1

The reason for this is your first snippet

df.columns.map(col(_).contains("NULL"))

returns an

Array[Column]

and you try to pass that entire array to search for the boolean literal "true" with this statement

df2.columns.map(col(_).contains("NULL")).contains(lit(true))

Since it is a complex structure of an Array[Column] it does not work.

But this following expression:

df.columns.map(col(_).contains("NULL")): _*

sends each of the array element as a single argument one my one to the isin function, and thus your filter works.

Hope this could explain!

partha_devArch
  • 414
  • 2
  • 10
0
df.filter(r=> r.toSeq.exists(c => c == null))