
Suppose I have the following pyspark dataframe:

>>> df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
>>> df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

How can I now select or filter for any row containing at least one null value, like so?

>>> df.SOME-COMMAND-HERE.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+
Ytsen de Boer
  • Possible duplicate of [how to filter out a null value from spark dataframe](http://stackoverflow.com/questions/39727742/how-to-filter-out-a-null-value-from-spark-dataframe) – Alberto Bonsanto Dec 25 '16 at 14:58
  • No, this is not at all the same question. There they want to **filter out** any rows containing a null value **for a specific column**. Here I want to **filter in** any rows containing **at least one** null value. – Ytsen de Boer Dec 29 '16 at 11:08
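
To make the comment's distinction concrete: the linked question keeps the rows where one specific column is not null, e.g. (a sketch for illustration only, not from either post):

# Per-column filter, as in the linked question: drop rows whose c2 is null
df.filter(df.c2.isNotNull()).show()

The question here instead needs a condition that spans every column at once.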

2 Answers


Create an intermediate dataframe from the original by dropping every row that contains a null. Then "subtract" it from the original to get exactly those rows back:

# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

# Construct an intermediate dataframe with the null-containing rows dropped
df_drop = df.dropna(how='any')
df_drop.show()
+---+---------+---+
| c1|       c2| c3|
+---+---------+---+
|  A|Amsterdam|3.4|
+---+---------+---+

# Then subtract it from the original to reveal the desired rows
df.subtract(df_drop).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+
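
A more direct route (a sketch added for comparison, not part of the original answer) is to OR together an isNull() check per column and pass the combined condition to filter, which avoids the extra dropna/subtract pass:

# Build one boolean column: c1 IS NULL OR c2 IS NULL OR c3 IS NULL
from functools import reduce
from pyspark.sql.functions import col

any_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
df.filter(any_null).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+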
Ytsen de Boer

Construct an appropriate raw SQL query and apply that:

# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

# Compose the appropriate raw SQL query
sql_query_base = 'SELECT * FROM df WHERE '
sql_query_apps = ['{} IS NULL'.format(col_name) for col_name in df.columns]
sql_query = sql_query_base + ' OR '.join(sql_query_apps)
sql_query
'SELECT * FROM df WHERE c1 IS NULL OR c2 IS NULL OR c3 IS NULL'

# Register the dataframe as a SQL table
sqlContext.registerDataFrameAsTable(df, 'df')

# Apply raw SQL
sqlContext.sql(sql_query).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+
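
As a side note: on Spark 2.x and later, the same registration and query can go through the SparkSession instead of the older sqlContext (a minimal sketch, assuming the spark session object from the question):

# Register the dataframe as a temporary view, then run the same query
df.createOrReplaceTempView('df')
spark.sql(sql_query).show()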
Ytsen de Boer