
I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152?

I have looked at the documentation and followed SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function.

This is my dataframe.

+--------------------------+
|updated_date              |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+

I used the millisecond format without any success, as below:

>>> df.select('updated_date').withColumn("updated_date_col2", 
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date              |updated_date_col2  |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+

I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152

Anand Hemmige
  • The timestamp type does not have milliseconds. If you want that, you'll have to keep your column as a string. – pault Jan 04 '19 at 20:08
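
If a plain string result is acceptable, a minimal sketch of that suggestion (my illustration, not from the thread; it assumes the column always carries at least three fractional digits) is to truncate the string to millisecond precision:

from pyspark.sql.functions import substring

# Keep the column as a string, cut at 23 characters
# (covers 'yyyy-MM-dd HH:mm:ss.SSS').
df.withColumn("updated_date_ms", substring("updated_date", 1, 23)).show(1, False)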

4 Answers


I think you can use a UDF and Python's standard datetime module, as below.

import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    # Parse the full string, including the fractional seconds ('%f')
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)
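
If you then want to render only three fractional digits, one possible follow-up (my assumption, not part of this answer) is date_format:

from pyspark.sql.functions import date_format

df.select('updated_date') \
  .withColumn("updated_date_col2", udf_to_timestamp("updated_date")) \
  .select(date_format("updated_date_col2", "yyyy-MM-dd HH:mm:ss.SSS")) \
  .show(1, False)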
iwataka

This is not a solution with to_timestamp, but you can easily keep your column in a timestamp format.

The following code is an example of converting a numeric Unix epoch value (seconds, possibly with a fractional part) to a timestamp.

from datetime import datetime

ts = datetime.now().timestamp()  # epoch seconds, e.g. 1547521021.83301
df = spark.createDataFrame([(1, ts)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)

+---+--------------------------+
|obs|time                      |
+---+--------------------------+
|1  |2019-01-15 12:15:49.565263|
+---+--------------------------+

Note that new Date().getTime() or Date.now() in JS return epoch milliseconds, while datetime.datetime.now().timestamp() in Python returns epoch seconds, so scale accordingly before casting.
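
If what you actually have is epoch milliseconds (say, from Date.now()), a hedged sketch is to divide by 1000 before the cast; the column name time_ms here is hypothetical:

from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 1547521021833)], ['obs', 'time_ms'])
# Epoch milliseconds -> epoch seconds (double) -> timestamp
df = df.withColumn('time', (col('time_ms') / 1000).cast('timestamp'))
df.show(1, False)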

JunL22
  • The above solution doesn't work. I keep getting a `'DataFrame' object has no attribute 'time'` even though I imported the datetime library – thentangler Oct 16 '20 at 15:44

The reason: pyspark's to_timestamp parses only up to seconds, while TimestampType is able to hold milliseconds.

The following workaround may work:

If the timestamp pattern contains S, invoke a UDF to get the 'INTERVAL <n> MILLISECONDS' string to use in an expression:

from pyspark.sql.functions import expr, to_timestamp

ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))

To get INTERVAL 256 MILLISECONDS we may use a Java UDF:

df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))

Inside the UDF, getIntervalStringUDF(String timeString, String pattern) would:

  1. Use SimpleDateFormat to parse the date according to the pattern
  2. Return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
  3. Return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
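
The answer describes a Java UDF; as a rough Python equivalent (my sketch, not the original code, and it hardcodes the parse format rather than deriving it from the pattern argument):

from datetime import datetime

def get_interval_string(time_string, pattern):
    # Sketch only: ignores 'pattern' and assumes input like
    # '2019-01-04 11:09:21.152815'
    try:
        parsed = datetime.strptime(time_string, '%Y-%m-%d %H:%M:%S.%f')
        return 'INTERVAL %d MILLISECONDS' % (parsed.microsecond // 1000)
    except ValueError:
        return 'INTERVAL 0 MILLISECONDS'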
Adeeb

You can directly cast the field and it will keep the milliseconds.

from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
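
For instance, with the question's sample value (the column name 'timestamp' is taken from the snippet above, not from the question):

df = spark.createDataFrame([('2019-01-04 11:09:21.152815',)], ['timestamp'])
df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
df.show(1, False)

The cast preserves the fractional seconds, e.g. 2019-01-04 11:09:21.152815.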
MrBigData