
I have 5 columns in a df. I want to cast the column "occurrence" to timestamp. I have a piece of code, shown below, which does the work when I just pass the actual string. How do I modify the code to convert the entire "occurrence" column to a timestamp? I am very new to Python and would really appreciate your guidance here.

import uuid
import time_uuid
from datetime import datetime

my_uuid = uuid.UUID("2255270f-3310-11e9-7f7f-7f7f7f7f7f7f")
ts = time_uuid.TimeUUID(bytes=my_uuid.bytes).get_timestamp()
print(datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))


Rahul Diggi

1 Answer


Create a UDF to convert the UUID string to seconds, and use from_unixtime to convert the seconds to a timestamp; you can then apply it to your dataframe's "occurrence" column with withColumn.

from pyspark.sql import functions as func
from pyspark.sql.types import DoubleType

def uuid2ts(uuid_str):
    import uuid
    import time_uuid

    # a version-1 UUID embeds its creation time; get_timestamp() returns Unix seconds
    my_uuid = uuid.UUID(uuid_str)
    ts_long = time_uuid.TimeUUID(bytes=my_uuid.bytes).get_timestamp()

    return float(ts_long)

# DoubleType keeps full seconds precision (a 32-bit FloatType would
# round a ~1.5e9 epoch value by up to a minute)
uuid2ts_udf = func.udf(uuid2ts, DoubleType())
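If the third-party time_uuid package isn't available, the same conversion can be sketched with only the standard library: uuid.UUID.time exposes the raw 60-bit version-1 timestamp, which just needs the epoch offset applied (the GREGORIAN_OFFSET constant and uuid1_to_seconds helper names here are my own):

```python
import uuid
from datetime import datetime

# 100-nanosecond intervals between the UUIDv1 epoch (1582-10-15)
# and the Unix epoch (1970-01-01)
GREGORIAN_OFFSET = 0x01B21DD213814000

def uuid1_to_seconds(uuid_str):
    # uuid.UUID.time is the 60-bit v1 timestamp in 100ns units
    return (uuid.UUID(uuid_str).time - GREGORIAN_OFFSET) / 1e7

ts = uuid1_to_seconds('2255270f-3310-11e9-7f7f-7f7f7f7f7f7f')
print(datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
# 2019-02-18 00:00:00
```

This function can be used as the UDF body in place of the time_uuid call above.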

spark.sparkContext.parallelize([('2255270f-3310-11e9-7f7f-7f7f7f7f7f7f',)]). \
    toDF(['uuid_string']). \
    withColumn('ts', func.from_unixtime(uuid2ts_udf('uuid_string'))). \
    show(truncate=False)

# +------------------------------------+-------------------+
# |uuid_string                         |ts                 |
# +------------------------------------+-------------------+
# |2255270f-3310-11e9-7f7f-7f7f7f7f7f7f|2019-02-18 00:00:00|
# +------------------------------------+-------------------+
samkart