33

My DataFrame has columns containing null and NaN values respectively, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between the two? How can each be dealt with?

Shaido
Ivan Lee

4 Answers

56

null represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
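A quick way to see this in practice (a minimal sketch, assuming an active SparkSession named spark): dividing two double zeros yields NaN rather than null:

from pyspark.sql.functions import lit

# 0.0/0.0 on doubles follows IEEE semantics and produces NaN
spark.range(1).select((lit(0.0) / lit(0.0)).alias("x")).show()

+---+
|  x|
+---+
|NaN|
+---+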

One possible way to handle null values is to remove them with:

df.na.drop()
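Note that na.drop() removes rows containing null or NaN values, so on the question's example DataFrame both rows would be dropped (a sketch; worth verifying on your Spark version):

# row (1, NaN) is dropped for the NaN, row (null, 1.0) for the null
df.na.drop().show()

+---+---+
|  a|  b|
+---+---+
+---+---+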

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)
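On the example DataFrame, fill(0) replaces both the null in column a and the NaN in column b, since na.fill treats NaN in numeric columns as missing as well:

df.na.fill(0).show()

+---+---+
|  a|  b|
+---+---+
|  1|0.0|
|  0|1.0|
+---+---+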

Another way would be to select the rows where a specific column is null for further processing:

from pyspark.sql.functions import col

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent isnan function:

from pyspark.sql.functions import isnan
df.where(isnan(col("a")))
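For illustration on the question's DataFrame: null and NaN do not match each other's predicates, since NaN is a genuine double value rather than a missing one:

from pyspark.sql.functions import col, isnan

df.where(isnan(col("b"))).show()    # matches the row (1, NaN)
df.where(col("b").isNull()).show()  # matches nothing: NaN is not null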
Shaido
2

You can distinguish your NaN values using the isnan function, as in this example:

>>> from pyspark.sql.functions import isnan
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()
[Row(r1=False, r2=False), Row(r1=True, r2=True)]

The difference is in the type of the object that holds the value. NaN ("not a number") is an old-fashioned way to represent a "no value" for a number: imagine you have all the numbers (..., -2, -1, 0, 1, 2, ...) and need an extra value for error cases such as 1/0. You want 1/0 to give you a number, but which one? Since there is no number for 1/0, a new value called NaN was created, and it is itself of a numeric type.

None is used for the void, the absence of an element; it is even more abstract, because within a numeric type you have, besides the NaN value, the None value. The None value is present in the value sets of all types.
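A small sketch of this point (assuming a SparkSession named spark): None/null can appear in a column of any type, while NaN exists only for floating-point columns:

df2 = spark.createDataFrame(
    [(None, None), ("x", float("nan"))],
    "s: string, d: double")  # None fits both columns; NaN only the double one
df2.show()

+----+----+
|   s|   d|
+----+----+
|null|null|
|   x| NaN|
+----+----+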

bodha
developer_hatch
  • Thanks. Could you point out the difference between the two types, null and NaN, in Spark? I am still confused about why Spark has these two types to represent nothing. – Ivan Lee May 10 '17 at 03:11
  • I hope this enlightens you; your question was very interesting, because it is not simple to understand and deal with these kinds of concepts. Nice :) – developer_hatch May 10 '17 at 03:28
1

You can deal with it using this code:

import pandas

df = df.where(pandas.notnull(df), None)

The code will convert any NaN value into None, which Spark then treats as null. Note that this is a pandas operation, applied before converting a pandas DataFrame to a Spark DataFrame.
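A sketch of the typical use (assuming you start from a pandas DataFrame and an active SparkSession named spark). Be aware that in purely numeric pandas columns, None may be coerced back to NaN:

import pandas

pdf = pandas.DataFrame({"a": ["x", float("nan")]})  # object column containing a NaN
pdf = pdf.where(pandas.notnull(pdf), None)          # NaN -> None
sdf = spark.createDataFrame(pdf)                    # the NaN cell arrives as null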

Below is the reference link:

Link

Ayush Jain
0

I think differently: maybe you can change the NaN or null into another value, like this:

import org.apache.spark.sql.functions.{col, when}

xxDf.withColumn("xxColumn",
  when(col("xxColumn").isNull, "xxx")
    .when(col("xxColumn").isNaN, "xxx")
    .otherwise(col("xxColumn")))
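A hedged PySpark equivalent of the Scala snippet above (keeping the placeholder names xxDf, xxColumn, and xxx; note that isnan only applies to numeric columns):

from pyspark.sql.functions import col, isnan, when

xxDf = xxDf.withColumn(
    "xxColumn",
    when(col("xxColumn").isNull() | isnan(col("xxColumn")), "xxx")
    .otherwise(col("xxColumn")))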
ahscuml