
I am trying to convert a Spark DataFrame to Pandas. However, it is giving the following error:

OutOfBoundsDatetime: Out of bounds nanosecond timestamp:

Is there a workaround for this?

It works if I drop all timestamp columns, but I would like to bring the whole table into Pandas.

I've never run into an error like this when bringing a Spark DataFrame into Pandas before.

This is a reasonably large table that contains multiple timestamp columns; some hold YYYY-MM-DD dates and some hold YYYY-MM-DD HH:MM:SS datetimes.

An unknown number of these columns contain nonexistent year values (for example, years far in the future).

Below is an example.

import pandas as pd
import pyspark.sql.functions as spark_fns

data = {
    "ID": ["AB", "CD", "DE", "EF"],
    "year": [2016, 2017, 2018, 2018],
    "time_var_1": [
        "3924-01-04 00:00:00",  # year 3924 is beyond the pandas nanosecond range
        "4004-12-12 12:38:00",  # year 4004 is beyond the pandas nanosecond range
        "2018-10-02 01:32:23",
        "2018-04-05 00:00:00",
    ],
}

df = pd.DataFrame(data)

sdf = spark.createDataFrame(df)

sdf = sdf.withColumn("time_var_1", spark_fns.to_timestamp(spark_fns.col("time_var_1")))
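
Converting this back then reproduces the error:

sdf.toPandas()  # raises OutOfBoundsDatetime: Out of bounds nanosecond timestamp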

I am not super familiar with PySpark, so I am not sure whether there is an errors='coerce' equivalent when bringing a table from a Spark DataFrame into Pandas.
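
For reference, this is the kind of coercion I mean on the pandas side, where out-of-range dates become NaT instead of raising:

import pandas as pd

pd.to_datetime(["3924-01-04 00:00:00", "2018-10-02 01:32:23"], errors="coerce")
# DatetimeIndex(['NaT', '2018-10-02 01:32:23'], dtype='datetime64[ns]', freq=None)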


1 Answer


You can mask the invalid timestamps with null using when:

import pandas as pd
import pyspark.sql.functions as F

# rows whose timestamp is not strictly below pd.Timestamp.max fall through to
# null (NaT after toPandas), because the when() has no otherwise() clause
pdf = sdf.withColumn(
    'time_var_1',
    F.when(F.col('time_var_1') < F.lit(pd.Timestamp.max), F.col('time_var_1'))
).toPandas()

print(pdf)
   ID  year          time_var_1
0  AB  2016                 NaT
1  CD  2017                 NaT
2  DE  2018 2018-10-02 01:32:23
3  EF  2018 2018-04-05 00:00:00
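
If the data could also contain dates before pd.Timestamp.min (around 1677), you can add a lower bound to the same condition, with something like:

pdf = sdf.withColumn(
    'time_var_1',
    F.when(
        (F.col('time_var_1') > F.lit(pd.Timestamp.min))
        & (F.col('time_var_1') < F.lit(pd.Timestamp.max)),
        F.col('time_var_1')
    )
).toPandas()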