
Spark reads a table from MySQL that has a timestamp column storing UTC values. Spark is configured with the local timezone (IST). MySQL stores the timestamp values below.

(screenshot of the timestamp values stored in MySQL)

spark.conf.set("spark.sql.session.timeZone" , "UTC")

df.show(100,False)

(screenshot of the df.show() output)

After setting the above conf, df.show() displays the correct records. However, df.rdd.collect() later converts these values back to the IST timezone.

for row in df.rdd.collect():
    print("row.Mindate ",row.Mindate)
    

row.Mindate 2021-03-02 19:30:31
row.Mindate 2021-04-01 14:05:03
row.Mindate 2021-06-15 11:39:40
row.Mindate 2021-07-07 18:14:17
row.Mindate 2021-08-03 10:48:51
row.Mindate 2021-10-06 10:21:11

The Spark DataFrame and df.rdd show different result sets. How are the values converted back to the local timezone even after setting "spark.sql.session.timeZone" to "UTC"?
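
For reference, a quick check on one collected row (a minimal sketch against the same df) shows the value comes back as a plain Python datetime with no timezone attached:

row = df.rdd.collect()[0]
print(type(row.Mindate), row.Mindate.tzinfo)
# <class 'datetime.datetime'> None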

Thanks in advance

EDIT 1:

df.printSchema()

root
 |-- Mindate: timestamp (nullable = true)
 |-- Maxdate: timestamp (nullable = true)
Arvind
  • Can you `printSchema` and post it here? Are you sure it's an actual conversion and not just the way it's formatted when it's printed (date -> str). – Kashyap Oct 27 '21 at 14:45
  • edited the post above with `printSchema()` – Arvind Oct 29 '21 at 06:38

2 Answers


The solution

Make sure that your Spark timezone (spark.sql.session.timeZone) is set to the same timezone as your Python timezone (TZ environment variable). Spark will convert between the two whenever you call DataFrame.collect(). You can do this as follows:

import os
import time

# change Python timezone
os.environ["TZ"] = "UTC"
time.tzset()

# change Spark timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")

More about the problem

I've run into the exact same issue, where a timestamp would be converted whenever I collect a DataFrame from Spark to Python. I wrote the following simple test to confirm the behaviour:

import datetime
from pyspark.sql import functions as f


def test_timezone_string_to_timestamp_utc(spark):
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df = spark.createDataFrame([("1970-01-01 10:00:00",)], "ts STRING").withColumn("ts", f.col("ts").cast("timestamp"))
    assert df.collect()[0]["ts"] == datetime.datetime(1970, 1, 1, 10)

This fails with the message `datetime.datetime(1970, 1, 1, 11, 0) != datetime.datetime(1970, 1, 1, 10, 0)`. What happens is that the timestamp is converted from UTC to my system default (CET) when the DataFrame is collected from Spark's TimestampType to Python's datetime.datetime. I don't know why this happens, but it happens... I've tried looking in the source, as @Kashyap also pointed out, but I don't really see why this would happen.

Python's timezone defaults to the system timezone and can be configured via the TZ environment variable. See the Python docs.

So the conversion happens whenever there is a difference between Python's timezone and Spark's timezone setting. We can check this as follows:

>>> spark.conf.get("spark.sql.session.timeZone")
'UTC'
>>> import time
>>> time.tzname
('CET', 'CEST')
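
And once both are aligned (using the snippet from "The solution" above), collecting a timestamp no longer shifts it. A minimal sketch, assuming an active spark session with both timezones set to UTC:

row = spark.sql("SELECT timestamp'1970-01-01 10:00:00' AS ts").collect()[0]
print(row["ts"])  # 1970-01-01 10:00:00 -- matches the Spark-side value, no shift on collect()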
RvdV

TL;DR

df.rdd.collect() converts timestamp column(UTC) to local timezone(IST) in pyspark

No, it does not. In fact, the timestamp inside the dataframe you read has no timezone. What you see is simply the behavior of show(), based on the session local timezone.


Timezone info is lost when you store a datetime.datetime value in a column of type TimestampType.

As described in the docs

Datetime type

  • TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time-zone. The timestamp value represents an absolute point in time.

As you can see in the code, TimestampType is a wrapper over Python's datetime.datetime, but it strips out the timezone and internally stores the value as epoch time.

class TimestampType(AtomicType, metaclass=DataTypeSingleton):
    """Timestamp (datetime.datetime) data type.
    """

    def needConversion(self):
        return True

    def toInternal(self, dt):
        if dt is not None:
            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                       else time.mktime(dt.timetuple()))
            return int(seconds) * 1000000 + dt.microsecond

    def fromInternal(self, ts):
        if ts is not None:
            # using int to avoid precision loss in float
            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
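
To make the aware-vs-naive branch in toInternal() concrete, here is a small illustrative sketch (plain Python, not part of the pyspark source) of the two conversion paths:

import calendar
import datetime
import time

aware = datetime.datetime(2021, 4, 1, 10, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=-5)))
naive = datetime.datetime(2021, 4, 1, 10, 0)

# tz-aware values are normalized to UTC via utctimetuple() before taking the epoch
print(calendar.timegm(aware.utctimetuple()))   # 1617289200, i.e. 2021-04-01 15:00:00 UTC
# naive values go through time.mktime(), i.e. they are interpreted in the local timezone of the process
print(int(time.mktime(naive.timetuple())))     # depends on the TZ of the Python process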

Some more sample code:

from typing import Union
from datetime import datetime

from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()


def log(*args):
    # timestamped print; stands in for the logging helper assumed by the code and output below
    print(datetime.now().isoformat(timespec='milliseconds'), *args)


def to_str(val: Union[str, datetime]) -> str:
    type_str = f'{type(val).__name__}:'
    if isinstance(val, str):
        return type_str + val
    else:
        return f'{type_str}{val.isoformat()} tz:{val.tzinfo}'


def print_df_info(df: DataFrame):
    df.show(truncate=False)
    for row in df.collect():
        log('DF :', ','.join([to_str(cell) for cell in row]))
    for row in df.rdd.collect():
        log('RDD:', ','.join([to_str(cell) for cell in row]))


spark.conf.set("spark.sql.session.timeZone", "UTC")
timestamps = ['2021-04-01 10:00:00 -05:00']
timestamp_data = [{'col_original_str': s} for s in timestamps]

my_df = spark.createDataFrame(timestamp_data)
# 1. col_original_str -> col_to_timestamp (convert to UTC and stored WITHOUT timezone)
my_df = my_df.withColumn('col_to_timestamp', F.to_timestamp(my_df.col_original_str))
# 2. col_to_timestamp -> col_date_format (convert an Epoch time (which has no timezone) to string)
my_df = my_df.withColumn('col_date_format', F.date_format(my_df.col_to_timestamp, "yyyy-MM-dd HH:mm:ss.SSSXXX"))
# This is really confusing.
# 3. col_to_timestamp -> col_to_utc_timestamp (tell pyspark to interpret col_to_timestamp with
#                                              timezone Asia/Kolkata, and convert it to UTC)
my_df = my_df.withColumn('col_reinterpret_tz', F.to_utc_timestamp(my_df.col_to_timestamp, 'Asia/Kolkata'))

my_df.printSchema()

log('#################################################')
log('df with session.timeZone set to UTC')
spark.conf.set("spark.sql.session.timeZone", "UTC")
print_df_info(my_df)

log('#################################################')
log('df with session.timeZone set to Asia/Kolkata')
spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
print_df_info(my_df)

Note in the output:

  1. The DF : and RDD : lines (see logs from print_df_info()) have exactly the same contents. They are different facades over the same data.
  2. Changing spark.sql.session.timeZone has no impact on the internal representation. See logs from print_df_info().
  3. Changing spark.sql.session.timeZone changes the way show() prints values of type timestamp.
2021-11-08T12:16:22.817 spark.version: 3.0.3
root
 |-- col_original_str: string (nullable = true)
 |-- col_to_timestamp: timestamp (nullable = true)
 |-- col_date_format: string (nullable = true)
 |-- col_reinterpret_tz: timestamp (nullable = true)

2021-11-08T13:57:54.243 #################################################
2021-11-08T13:57:54.244 df with session.timeZone set to UTC

+--------------------------+-------------------+------------------------+-------------------+
|col_original_str          |col_to_timestamp   |col_date_format         |col_reinterpret_tz |
+--------------------------+-------------------+------------------------+-------------------+
|2021-04-01 10:00:00 -05:00|2021-04-01 15:00:00|2021-04-01 15:00:00.000Z|2021-04-01 09:30:00|
+--------------------------+-------------------+------------------------+-------------------+

2021-11-08T13:57:54.506 DF : str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.569 RDD: str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None

2021-11-08T13:57:54.570 #################################################
2021-11-08T13:57:54.570 df with session.timeZone set to Asia/Kolkata

+--------------------------+-------------------+------------------------+-------------------+
|col_original_str          |col_to_timestamp   |col_date_format         |col_reinterpret_tz |
+--------------------------+-------------------+------------------------+-------------------+
|2021-04-01 10:00:00 -05:00|2021-04-01 20:30:00|2021-04-01 15:00:00.000Z|2021-04-01 15:00:00|
+--------------------------+-------------------+------------------------+-------------------+

2021-11-08T13:57:54.828 DF : str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.916 RDD: str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None


Kashyap
  • The rdd method is in fact forcing a conversion of the time -- the `datetime.datetime` object in the collected rows is a naive datetime in local time regardless of the value of `spark.sql.session.timeZone`. You can verify this inconsistency by comparing `df.toPandas()` with `pd.DataFrame([row.asDict() for row in df.collect()])` – charleyc Dec 17 '21 at 23:25
  • @charleyc, 1. not sure where Pandas came into discussion, it's not part of OP or my response. 2. I've seen and posted the code that shows that timezone is stripped when a timestamp is stored. 3. And not sure what you mean by _" the datetime.datetime object in the collected rows is a naive datetime in local time regardless of the value of spark.sql.session.timeZone"_ -- Word "local" implies timezone. – Kashyap Dec 27 '21 at 03:01