I am trying to convert a Spark DataFrame to Pandas. However, it is giving the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp:
Is there a workaround for this?
It works if I drop all timestamp columns, but I would like to bring the whole table into Pandas.
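For reference, this is roughly what I'm doing now to make the conversion succeed (just a sketch, assuming the timestamp columns can be identified from the Spark schema of my DataFrame sdf):

from pyspark.sql.types import TimestampType

# find every column whose Spark type is timestamp, then drop them before converting
ts_cols = [f.name for f in sdf.schema.fields if isinstance(f.dataType, TimestampType)]
pdf = sdf.drop(*ts_cols).toPandas()  # succeeds once the timestamp columns are gone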
I've never encountered this error before when bringing a Spark DataFrame into Pandas. This is a reasonably large table containing multiple timestamp columns: some are formatted as YYYY-MM-DD and some as YYYY-MM-DD 00:00:00.
Several of the columns contain an unknown number of rows with invalid year values (years far in the future that Pandas cannot represent).
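As far as I understand, the underlying limit is that Pandas stores datetimes at nanosecond resolution, which only covers a fixed range, so a year like 3924 overflows:

import pandas as pd

# the range a pandas nanosecond timestamp can represent
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807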
Below is an example.
import pandas as pd
import pyspark.sql.functions as spark_fns

data = {
    "ID": ["AB", "CD", "DE", "EF"],
    "year": [2016, 2017, 2018, 2018],
    "time_var_1": [
        "3924-01-04 00:00:00",  # year out of range for pandas
        "4004-12-12 12:38:00",  # year out of range for pandas
        "2018-10-02 01:32:23",
        "2018-04-05 00:00:00",
    ],
}

df = pd.DataFrame(data)
sdf = spark.createDataFrame(df)
sdf = sdf.withColumn("time_var_1", spark_fns.to_timestamp(spark_fns.col("time_var_1")))

sdf.toPandas()  # raises OutOfBoundsDatetime
I am not super familiar with PySpark, so I am not sure whether there is an errors='coerce' equivalent when bringing a table from a Spark DataFrame into Pandas.
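To clarify what I mean by errors='coerce': in plain Pandas I would parse the strings like this, and the out-of-bounds values would become NaT instead of raising (a small sketch of the behavior I'm hoping to replicate):

import pandas as pd

s = pd.Series(["3924-01-04 00:00:00", "2018-10-02 01:32:23"])
parsed = pd.to_datetime(s, errors="coerce")  # out-of-bounds strings become NaT
print(parsed)
# 0                   NaT
# 1   2018-10-02 01:32:23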