
I'm trying to convert a column of GMT timestamp strings into a column of timestamps in Eastern timezone. I want to take daylight savings into account.

My column of timestamp strings looks like this:

'2017-02-01T10:15:21+00:00'

I figured out how to convert the string column into a timestamp in EST:

from pyspark.sql import functions as F

df2 = df1.withColumn('datetimeGMT', df1.myTimeColumnInGMT.cast('timestamp'))
df3 = df2.withColumn('datetimeEST', F.from_utc_timestamp(df2.datetimeGMT, "EST"))

But the times don't change with daylight savings. Is there another function or something that accounts for daylight savings when converting the timestamps?

EDIT: I think I figured it out. In the from_utc_timestamp call above, I needed to use "America/New_York" instead of "EST":

df3 = df2.withColumn('datetimeET', F.from_utc_timestamp(df2.datetimeGMT, "America/New_York"))
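To see why "America/New_York" behaves differently from "EST", here is a quick check using plain Python's `zoneinfo` module (not Spark): the IANA `Area/Locality` zone tracks daylight savings, while "EST" is a fixed UTC-5 abbreviation year-round. The dates are just examples.

```python
# Plain-Python illustration (Python 3.9+): the America/New_York zone
# switches between -5 (EST) and -4 (EDT), which is what from_utc_timestamp
# picks up when given the IANA zone ID.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

winter = datetime(2017, 2, 1, 10, 15, 21, tzinfo=timezone.utc)
summer = datetime(2017, 7, 1, 10, 15, 21, tzinfo=timezone.utc)

ny = ZoneInfo("America/New_York")
print(winter.astimezone(ny).utcoffset().total_seconds() / 3600)  # -5.0 (EST)
print(summer.astimezone(ny).utcoffset().total_seconds() / 3600)  # -4.0 (EDT)
```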
Bob Swain
    Yes, you are correct. You should always use the area/locality forms of time zone identifiers. You can answer your own question with this if you like, though this particular point has been addressed many times before (perhaps not with regard to Spark). – Matt Johnson-Pint Aug 18 '17 at 23:03

2 Answers


I ended up figuring out the answer, so I figured I would add it here. I also think this question/answer is worthwhile because, while searching for this issue before posting the question, I couldn't find anything about daylight savings for Spark. I probably should have realized that I needed to search for the underlying Java functions.

The answer to the question ended up being to use the string "America/New_York" instead of "EST". This correctly applies daylight savings.

from pyspark.sql import functions as F
df3 = df2.withColumn('datetimeET', F.from_utc_timestamp(df2.datetimeGMT, "America/New_York"))

EDIT:

This link shows a list of available time zone strings that can be used in this way: https://garygregory.wordpress.com/2013/06/18/what-are-the-java-timezone-ids/
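If you have Python 3.9+ handy, you can also enumerate the IANA zone IDs locally with the standard-library `zoneinfo` module (a plain-Python check, not Spark); these are the same `Area/Locality` identifiers that `from_utc_timestamp` accepts:

```python
# List the IANA time zone IDs available on this system (Python 3.9+).
from zoneinfo import available_timezones

zones = sorted(available_timezones())
eastern = [z for z in zones if z.startswith("America/New")]
print(eastern)  # includes 'America/New_York'
```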

Bob Swain
    I tried this for the UK time zone, but it's not working: `df3 = df_temp2.withColumn('datetimeET', F.from_utc_timestamp("SRC_ACTVTY_DATE", 'Europe/London'))` – anidev711 Nov 20 '19 at 17:34
  • Thanks Bob for the details. For a list of timezone IDs that can be used, I would suggest https://en.wikipedia.org/wiki/List_of_tz_database_time_zones, as it is a more up-to-date and exhaustive list. – Abhishek Vij Aug 27 '22 at 00:34

I would suggest converting the UTC datetime column to TimestampType with unix_timestamp() and then using from_utc_timestamp():

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

F.from_utc_timestamp(F.unix_timestamp(datetime_column).cast(TimestampType()), 'America/Chicago')

This handles daylight savings. Using to_timestamp() in place of unix_timestamp() won't handle daylight savings.
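For reference, the per-value conversion that from_utc_timestamp performs is conceptually the same as this plain-Python sketch (using `zoneinfo`, not Spark; the timestamp and zone name are just examples):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def from_utc(ts_string: str, zone: str) -> datetime:
    # Parse an ISO-8601 UTC timestamp string and render it in the target zone,
    # mirroring what Spark's from_utc_timestamp does for a single value.
    utc = datetime.fromisoformat(ts_string).astimezone(timezone.utc)
    return utc.astimezone(ZoneInfo(zone))

# February 1 falls in CST (UTC-6), so the wall-clock time shifts back 6 hours.
print(from_utc("2017-02-01T10:15:21+00:00", "America/Chicago"))
```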

sureskn3