
How can I select only the rows where a certain column has NaN values in PySpark?

Setup

import numpy as np
import pandas as pd


# pyspark
import pyspark
from pyspark.sql import functions as F 
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext, SQLContext


spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
sc.setLogLevel("INFO")


# data
dft = pd.DataFrame({
    'Code': [1, 2, 3, 4, 5, 6],
    'Name': ['Odeon', 'Imperial', 'Majestic',
             'Royale', 'Paraiso', 'Nickelodeon'],
    'Movie': [5.0, 1.0, np.nan, 6.0, 3.0, np.nan]})


schema = StructType([
    StructField('Code', IntegerType(), True),
    StructField('Name', StringType(), True),
    StructField('Movie', FloatType(), True),
])

sdft = sqlContext.createDataFrame(dft, schema)
sdft.createOrReplaceTempView("MovieTheaters")
sdft.show()
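
For reference, `sdft.show()` prints something like the following; note that the `np.nan` entries display as `NaN`, not as `null`:

+----+-----------+-----+
|Code|       Name|Movie|
+----+-----------+-----+
|   1|      Odeon|  5.0|
|   2|   Imperial|  1.0|
|   3|   Majestic|  NaN|
|   4|     Royale|  6.0|
|   5|    Paraiso|  3.0|
|   6|Nickelodeon|  NaN|
+----+-----------+-----+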

My Attempt

spark.sql("""
select * from MovieTheaters where Movie is null
""").show()

+----+----+-----+
|Code|Name|Movie|
+----+----+-----+
+----+----+-----+

I am getting empty output. How can I solve this issue?

Expected output:

+----+-----------+-----+
|Code|       Name|Movie|
+----+-----------+-----+
|   3|   Majestic|  NaN|
|   6|Nickelodeon|  NaN|
+----+-----------+-----+
    `null` and `NaN` are not the same thing. You can use the `isnan` function. Try this: `spark.sql("select * from MovieTheaters where isnan(Movie)").show()` – pault Mar 23 '20 at 19:40

1 Answer


If you want to get the rows where the Movie column is np.nan, you can use the following query:

>>> spark.sql("""select * from MovieTheaters where Movie = 'NaN' """).show()
+----+-----------+-----+
|Code|       Name|Movie|
+----+-----------+-----+
|   3|   Majestic|  NaN|
|   6|Nickelodeon|  NaN|
+----+-----------+-----+
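
This works because Spark SQL casts the string 'NaN' to a float NaN for the comparison and, unlike standard SQL, treats NaN as equal to NaN. A more explicit alternative, as pault's comment suggests, is the isnan function; here is a minimal sketch using the same sdft and view defined in the setup above:

>>> # SQL: isnan() tests for floating-point NaN, which `is null` does not match
>>> spark.sql("select * from MovieTheaters where isnan(Movie)").show()
>>> # DataFrame API equivalent, using the F alias imported in the setup
>>> sdft.filter(F.isnan("Movie")).show()

Both return the same two NaN rows shown above. If the column could ever contain real nulls as well as NaN, combine the checks, e.g. sdft.filter(F.isnan("Movie") | F.col("Movie").isNull()).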