TL;DR

"df.rdd.collect() converts timestamp column (UTC) to local timezone (IST) in pyspark"

No, it does not. The timestamp inside the DataFrame you read has no timezone at all. What you see is simply the behavior of show(), which renders timestamps in the session-local timezone.
Timezone info is lost when you store a datetime.datetime value in a column of type TimestampType.
As described in the docs under Datetime type:
- TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time-zone. The timestamp value represents an absolute point in time.
As you can see in the code, TimestampType is a wrapper over Python's datetime.datetime, but it strips out the timezone and internally stores the value as epoch time (microseconds since the Unix epoch).
class TimestampType(AtomicType, metaclass=DataTypeSingleton):
    """Timestamp (datetime.datetime) data type."""

    def needConversion(self):
        return True

    def toInternal(self, dt):
        if dt is not None:
            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                       else time.mktime(dt.timetuple()))
            return int(seconds) * 1000000 + dt.microsecond

    def fromInternal(self, ts):
        if ts is not None:
            # using int to avoid precision loss in float
            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
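To see the stripping concretely, here is a stdlib-only sketch (no pyspark needed) that mirrors the toInternal/fromInternal logic above. The round trip keeps the absolute instant but loses the tzinfo:

```python
import calendar
import datetime
import time

def to_internal(dt: datetime.datetime) -> int:
    # Mirrors TimestampType.toInternal: tz-aware values are converted to
    # epoch microseconds in UTC; naive values are read in the local timezone.
    seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
               else time.mktime(dt.timetuple()))
    return int(seconds) * 1000000 + dt.microsecond

def from_internal(ts: int) -> datetime.datetime:
    # Mirrors TimestampType.fromInternal: epoch microseconds back to a
    # *naive* local datetime -- the original tzinfo is gone.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

aware = datetime.datetime(2021, 4, 1, 10, 0, 0,
                          tzinfo=datetime.timezone(datetime.timedelta(hours=-5)))
micros = to_internal(aware)        # absolute instant, timezone discarded
round_tripped = from_internal(micros)
print(micros)                      # 1617289200000000 (= 2021-04-01 15:00:00 UTC)
print(round_tripped.tzinfo)        # None
```

Note that from_internal renders the instant in whatever the machine's local timezone is, so only the tz loss (not the wall-clock value) is portable.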
Some more sample code:
from datetime import datetime
from typing import Union

from pyspark.sql import DataFrame, functions as F

def log(*args):
    # helper used below: print with a timestamp prefix, as in the log output shown later
    print(datetime.now().isoformat(timespec='milliseconds'), *args)

def to_str(val: Union[str, datetime]) -> str:
    type_str = f'{type(val).__name__}:'
    if isinstance(val, str):
        return type_str + val
    return f'{type_str}{val.isoformat()} tz:{val.tzinfo}'

def print_df_info(df: DataFrame):
    df.show(truncate=False)
    for row in df.collect():
        log('DF :', ','.join(to_str(cell) for cell in row))
    for row in df.rdd.collect():
        log('RDD:', ','.join(to_str(cell) for cell in row))
spark.conf.set("spark.sql.session.timeZone", "UTC")

timestamps = ['2021-04-01 10:00:00 -05:00']
timestamp_data = [{'col_original_str': s} for s in timestamps]
my_df = spark.createDataFrame(timestamp_data)

# 1. col_original_str -> col_to_timestamp: parse the string, convert to UTC,
#    and store WITHOUT a timezone.
my_df = my_df.withColumn('col_to_timestamp', F.to_timestamp(my_df.col_original_str))
# 2. col_to_timestamp -> col_date_format: format an epoch time (which has no
#    timezone) as a string.
my_df = my_df.withColumn('col_date_format', F.date_format(my_df.col_to_timestamp, "yyyy-MM-dd HH:mm:ss.SSSXXX"))
# This is really confusing.
# 3. col_to_timestamp -> col_reinterpret_tz: tell pyspark to interpret
#    col_to_timestamp as a wall clock in Asia/Kolkata, and convert it to UTC.
my_df = my_df.withColumn('col_reinterpret_tz', F.to_utc_timestamp(my_df.col_to_timestamp, 'Asia/Kolkata'))

my_df.printSchema()

log('#################################################')
log('df with session.timeZone set to UTC')
spark.conf.set("spark.sql.session.timeZone", "UTC")
print_df_info(my_df)

log('#################################################')
log('df with session.timeZone set to Asia/Kolkata')
spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
print_df_info(my_df)
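The arithmetic behind step 3 can be reproduced with plain stdlib datetimes (a sketch, assuming Python 3.9+ zoneinfo with tz data available): 15:00 re-interpreted as an Asia/Kolkata (UTC+05:30) wall clock is 09:30 UTC.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# col_to_timestamp holds the naive wall clock 2021-04-01 15:00:00.
naive = datetime(2021, 4, 1, 15, 0, 0)

# to_utc_timestamp(ts, 'Asia/Kolkata') treats that wall clock as
# Asia/Kolkata time and shifts it to UTC: 15:00 - 5:30 = 09:30.
reinterpreted = naive.replace(tzinfo=ZoneInfo("Asia/Kolkata"))
as_utc = reinterpreted.astimezone(timezone.utc).replace(tzinfo=None)
print(as_utc)   # 2021-04-01 09:30:00
```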
Note in the output:

- The DF : and RDD: lines (see logs from print_df_info()) have exactly the same contents. DataFrame and RDD are different facades over the same data.
- Changing spark.sql.session.timeZone has no impact on the internal representation: the DF/RDD log lines are identical in both runs.
- Changing spark.sql.session.timeZone does change the way show() prints values of type timestamp.
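That last point is just a display-time conversion of one absolute instant. A stdlib-only sketch (assuming zoneinfo is available) of what show() effectively does with col_to_timestamp under each session timezone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One absolute instant: 2021-04-01 15:00:00 UTC (what col_to_timestamp stores).
instant = datetime(2021, 4, 1, 15, 0, 0, tzinfo=timezone.utc)

# Rendering per "session timezone" changes only the displayed wall clock:
shown_utc = instant.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
shown_ist = instant.astimezone(ZoneInfo("Asia/Kolkata")).strftime("%Y-%m-%d %H:%M:%S")
print(shown_utc)   # 2021-04-01 15:00:00
print(shown_ist)   # 2021-04-01 20:30:00
```

These match the two show() rows in the logs below, while the collect() output stays the same.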
2021-11-08T12:16:22.817 spark.version: 3.0.3
root
|-- col_original_str: string (nullable = true)
|-- col_to_timestamp: timestamp (nullable = true)
|-- col_date_format: string (nullable = true)
|-- col_reinterpret_tz: timestamp (nullable = true)
2021-11-08T13:57:54.243 #################################################
2021-11-08T13:57:54.244 df with session.timeZone set to UTC
+--------------------------+-------------------+------------------------+-------------------+
|col_original_str |col_to_timestamp |col_date_format |col_reinterpret_tz |
+--------------------------+-------------------+------------------------+-------------------+
|2021-04-01 10:00:00 -05:00|2021-04-01 15:00:00|2021-04-01 15:00:00.000Z|2021-04-01 09:30:00|
+--------------------------+-------------------+------------------------+-------------------+
2021-11-08T13:57:54.506 DF : str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.569 RDD: str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.570 #################################################
2021-11-08T13:57:54.570 df with session.timeZone set to Asia/Kolkata
+--------------------------+-------------------+------------------------+-------------------+
|col_original_str |col_to_timestamp |col_date_format |col_reinterpret_tz |
+--------------------------+-------------------+------------------------+-------------------+
|2021-04-01 10:00:00 -05:00|2021-04-01 20:30:00|2021-04-01 15:00:00.000Z|2021-04-01 15:00:00|
+--------------------------+-------------------+------------------------+-------------------+
2021-11-08T13:57:54.828 DF : str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.916 RDD: str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None