33

My DataFrame has columns containing null and NaN values respectively, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between the two? How can each be dealt with?

Shaido
Ivan Lee

4 Answers

56

null represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
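A quick way to see this in practice (a minimal sketch, assuming an active SparkSession named spark): dividing two double zeros yields NaN rather than null:

from pyspark.sql.functions import lit

# 0.0/0.0 on doubles follows IEEE semantics and produces NaN
spark.range(1).select((lit(0.0) / lit(0.0)).alias("x")).show()

+---+
|  x|
+---+
|NaN|
+---+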

One possible way to handle null values is to remove them with:

df.na.drop()
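Note that na.drop() removes rows containing null or NaN values, so on the question's example DataFrame both rows would be dropped (a sketch; worth verifying on your Spark version):

# row (1, NaN) is dropped for the NaN, row (null, 1.0) for the null
df.na.drop().show()

+---+---+
|  a|  b|
+---+---+
+---+---+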

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)
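On the example DataFrame, fill(0) replaces both the null in column a and the NaN in column b, since na.fill treats NaN in numeric columns as missing as well:

df.na.fill(0).show()

+---+---+
|  a|  b|
+---+---+
|  1|0.0|
|  0|1.0|
+---+---+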

Another way would be to select the rows where a specific column is null for further processing:

from pyspark.sql.functions import col

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent isnan function:

from pyspark.sql.functions import isnan
df.where(isnan(col("a")))
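For illustration on the question's DataFrame: null and NaN do not match each other's predicates, since NaN is a genuine double value rather than a missing one:

from pyspark.sql.functions import col, isnan

df.where(isnan(col("b"))).show()    # matches the row (1, NaN)
df.where(col("b").isNull()).show()  # matches nothing: NaN is not null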
Shaido
2

You can distinguish your NaN values using the isnan function, as in this example:

>>> from pyspark.sql.functions import isnan
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()
[Row(r1=False, r2=False), Row(r1=True, r2=True)]

The difference is in the type of the object that holds the value. NaN ("not a number") is an old-fashioned way to represent a "no value" for a number: imagine you have all the numbers (..., -2, -1, 0, 1, 2, ...) and need an extra value for error cases such as 1/0. You want 1/0 to give you a number, but which one? Since there is no number for 1/0, a new value called NaN was created, and it is itself of a numeric type.

None is used for the void, the absence of an element; it is even more abstract, because within a numeric type you have, besides the NaN value, the None value. The None value is present in the value sets of all types.
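A small sketch of this point (assuming a SparkSession named spark): None/null can appear in a column of any type, while NaN exists only for floating-point columns:

df2 = spark.createDataFrame(
    [(None, None), ("x", float("nan"))],
    "s: string, d: double")  # None fits both columns; NaN only the double one
df2.show()

+----+----+
|   s|   d|
+----+----+
|null|null|
|   x| NaN|
+----+----+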

bodha
developer_hatch
  • Thanks. Could you point out the difference between the two types, null and NaN, in Spark? I am still confused about why Spark has these two types to represent nothing. – Ivan Lee May 10 '17 at 03:11
  • I hope this enlightens you; your question was very interesting, because it is not simple to understand and deal with these kinds of concepts. Nice :) – developer_hatch May 10 '17 at 03:28
1

You can deal with it using this code:

import pandas

df = df.where(pandas.notnull(df), None)

The code will convert any NaN value into None, which Spark then treats as null. Note that this is a pandas operation, applied before converting a pandas DataFrame to a Spark DataFrame.
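A sketch of the typical use (assuming you start from a pandas DataFrame and an active SparkSession named spark). Be aware that in purely numeric pandas columns, None may be coerced back to NaN:

import pandas

pdf = pandas.DataFrame({"a": ["x", float("nan")]})  # object column containing a NaN
pdf = pdf.where(pandas.notnull(pdf), None)          # NaN -> None
sdf = spark.createDataFrame(pdf)                    # the NaN cell arrives as null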

Below is the reference link:

Link

Ayush Jain
0

I think differently: maybe you can change the NaN or null into another value, like this:

import org.apache.spark.sql.functions.{col, when}

xxDf.withColumn("xxColumn",
  when(col("xxColumn").isNull, "xxx")
    .when(col("xxColumn").isNaN, "xxx")
    .otherwise(col("xxColumn")))
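A hedged PySpark equivalent of the Scala snippet above (keeping the placeholder names xxDf, xxColumn, and xxx; note that isnan only applies to numeric columns):

from pyspark.sql.functions import col, isnan, when

xxDf = xxDf.withColumn(
    "xxColumn",
    when(col("xxColumn").isNull() | isnan(col("xxColumn")), "xxx")
    .otherwise(col("xxColumn")))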
ahscuml