
I have a feature that lets me query a Databricks delta table from a client app. This is the code I use for that purpose:

df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)

However, the second line throws the following error:

Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

I know what this error means: a date-typed field is out of bounds. I searched for a solution, but none of the ones I found applied to my scenario.

The solutions I found dealt with a specific dataframe column, but in my case the problem is global: I have lots of delta tables and I don't know in advance which columns are date-typed, so I can't manipulate their types individually to avoid this.

Is it possible to find all Timestamp type columns and cast them to string? Does this seem like a good solution? Do you have any other ideas on how I can achieve what I'm trying to do?

  • Does this answer your question? [Pyspark toPandas() Out of bounds nanosecond timestamp error](https://stackoverflow.com/questions/66320966/pyspark-topandas-out-of-bounds-nanosecond-timestamp-error) – Stef Dec 29 '21 at 12:58
  • No, it doesn't. This solution requires me to know the column name, which in my case is not a single one and different tables may have different column names with timestamp type. – anthino12 Dec 29 '21 at 13:02
  • can't you use [`schema`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.schema.html) to get these column names? – Stef Dec 29 '21 at 13:07
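
For reference, the schema-based lookup suggested in the comments might look something like the sketch below (the timestamp_cols name is just illustrative):

from pyspark.sql.types import TimestampType

# Collect the names of all columns whose Spark type is timestamp, via the DataFrame schema
timestamp_cols = [
    field.name for field in df.schema.fields
    if isinstance(field.dataType, TimestampType)
]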

1 Answer


Is it possible to find all Timestamp type columns and cast them to string?

Yes, that's the way to go. You can loop through df.dtypes and cast the columns whose type is "timestamp" to string before calling df.toPandas():

import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])

dataframe = df.toPandas()

You can define this as a function that takes a DataFrame as a parameter and reuse it with all your tables:

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def stringify_timestamps(df: DataFrame) -> DataFrame:
    # Cast every timestamp column to string, leave all other columns unchanged
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
        for c, t in df.dtypes
    ])
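
For example, the query from the question could then be converted like this:

df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = stringify_timestamps(df).toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)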

If you want to preserve the timestamp type, you can instead nullify the timestamp values that are greater than pd.Timestamp.max, as shown in this post, rather than converting them to strings.
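
A minimal sketch of that alternative (assuming out-of-range values can simply be replaced with null, and bounding on both ends of the pandas nanosecond range):

import pandas as pd
import pyspark.sql.functions as F

# pandas nanosecond timestamps only cover roughly 1677-09-21 to 2262-04-11
min_ts = str(pd.Timestamp.min.ceil("s"))
max_ts = str(pd.Timestamp.max.floor("s"))

df = df.select(*[
    # Keep values inside the pandas-representable range, replace the rest with null
    F.when(
        F.col(c).between(F.lit(min_ts).cast("timestamp"), F.lit(max_ts).cast("timestamp")),
        F.col(c)
    ).alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])

dataframe = df.toPandas()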
