
It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing something?

import numpy as np

a = sqlContext.createDataFrame([(None, None), (1.0, np.inf), (None, 2.0)])
a.replace(np.inf, 10)

Or do I have to take the painful route: convert the PySpark DataFrame into a pandas DataFrame, replace the infinity values, and convert it back to a PySpark DataFrame?
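For reference, this is roughly what that painful round trip would look like (a minimal sketch, assuming the same `sqlContext` and `import numpy as np` as above; `a_fixed` is just an illustrative name):

pdf = a.toPandas()                          # collect to the driver as a pandas DataFrame
pdf = pdf.replace([np.inf, -np.inf], 10)    # replace infinities in pandas
a_fixed = sqlContext.createDataFrame(pdf)   # convert back to a PySpark DataFrame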


1 Answer


It seems like there is no support for replacing infinity values.

Actually it looks like a Py4J bug, not an issue with `replace` itself. See Support nan/inf between Python and Java.

As a workaround, you can try either a UDF (the slow option):

import numpy as np

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, lit, udf, when

df = sc.parallelize([(None, None), (1.0, np.inf), (None, 2.0)]).toDF(["x", "y"])

replace_infs_udf = udf(
    lambda x, v: float(v) if x and np.isinf(x) else x, DoubleType()
)

df.withColumn("x1", replace_infs_udf(col("y"), lit(-99.0))).show()

## +----+--------+-----+
## |   x|       y|   x1|
## +----+--------+-----+
## |null|    null| null|
## | 1.0|Infinity|-99.0|
## |null|     2.0|  2.0|
## +----+--------+-----+

or an expression like this:

def replace_infs(c, v):
    is_infinite = c.isin([
        lit("+Infinity").cast("double"),
        lit("-Infinity").cast("double")
    ])
    return when(c.isNotNull() & is_infinite, v).otherwise(c)

df.withColumn("x1", replace_infs(col("y"), lit(-99))).show()

## +----+--------+-----+
## |   x|       y|   x1|
## +----+--------+-----+
## |null|    null| null|
## | 1.0|Infinity|-99.0|
## |null|     2.0|  2.0|
## +----+--------+-----+
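If you need to clean several columns at once, the same expression can be applied column by column, for example (a sketch, assuming the columns of interest are already of type double):

double_cols = [c for c, t in df.dtypes if t == "double"]

df.select([
    replace_infs(col(c), lit(-99.0)).alias(c) if c in double_cols else col(c)
    for c in df.columns
]).show()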
zero323
  • Why are `UDF`s slower than expressions? – Alberto Bonsanto Dec 23 '15 at 11:09
  • @AlbertoBonsanto Because `DataFrame` is not a Python object, it requires a full round trip. – zero323 Dec 23 '15 at 11:20
  • @AlbertoBonsanto Another aspect, which is not PySpark specific, is that a UDF is just a black box for the optimizer. It shouldn't matter here, but in general it means that you cannot reason about an operation that requires a UDF. Finally, as far as I know, the internal representation doesn't use standard Scala types, so even in Scala or Java you may prefer using expressions directly without UDFs. – zero323 Dec 23 '15 at 11:37
  • Thanks. The problem is that sometimes I have trouble figuring out how to achieve things like that using expressions instead of `UDF`s. I asked because I had code that converted an array of letters to a `SparseVector` using `UDF`s and the code never finished. – Alberto Bonsanto Dec 23 '15 at 11:41
  • Well, it is not always possible, but if there is a choice then expression >> jvm-udf >> python-udf. – zero323 Dec 23 '15 at 12:03
  • Why does the last row of the DataFrame contain `(null, null)` rather than `(null, 2)`? – Kevin Ghaboosi Feb 26 '17 at 19:50
  • @KevinGhaboosi This is due to a type mismatch. Spark doesn't consider Python integers a valid value for a double / float column. Fixed (and thank you for the edit!). – zero323 Feb 26 '17 at 20:57