
I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152?

I have looked at the documentation and followed SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function.

This is my dataframe.

+--------------------------+
|updated_date              |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+

I used the millisecond format without any success, as below:

>>> df.select('updated_date').withColumn("updated_date_col2", 
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date              |updated_date_col2  |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+

I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152

Anand Hemmige
  • The timestamp type does not have milliseconds. If you want that, you'll have to keep your column as a string. – pault Jan 04 '19 at 20:08
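
If a plain string result is acceptable, a minimal sketch of that suggestion (my illustration, not from the thread; it assumes the column always carries at least three fractional digits) is to truncate the string to millisecond precision:

from pyspark.sql.functions import substring

# Keep the column as a string, cut at 23 characters
# (covers 'yyyy-MM-dd HH:mm:ss.SSS').
df.withColumn("updated_date_ms", substring("updated_date", 1, 23)).show(1, False)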

4 Answers


I think you can use a UDF and Python's standard datetime module, as below.

import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    # Parse the full string, including the fractional seconds ('%f')
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)
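
If you then want to render only three fractional digits, one possible follow-up (my assumption, not part of this answer) is date_format:

from pyspark.sql.functions import date_format

df.select('updated_date') \
  .withColumn("updated_date_col2", udf_to_timestamp("updated_date")) \
  .select(date_format("updated_date_col2", "yyyy-MM-dd HH:mm:ss.SSS")) \
  .show(1, False)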
iwataka

This is not a solution with to_timestamp, but you can easily keep your column in a timestamp format.

The following code is an example of converting a numeric Unix epoch value (seconds, possibly with a fractional part) to a timestamp.

from datetime import datetime

ts = datetime.now().timestamp()  # epoch seconds, e.g. 1547521021.83301
df = spark.createDataFrame([(1, ts)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)

+---+--------------------------+
|obs|time                      |
+---+--------------------------+
|1  |2019-01-15 12:15:49.565263|
+---+--------------------------+

Note that new Date().getTime() or Date.now() in JS return epoch milliseconds, while datetime.datetime.now().timestamp() in Python returns epoch seconds, so scale accordingly before casting.
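
If what you actually have is epoch milliseconds (say, from Date.now()), a hedged sketch is to divide by 1000 before the cast; the column name time_ms here is hypothetical:

from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 1547521021833)], ['obs', 'time_ms'])
# Epoch milliseconds -> epoch seconds (double) -> timestamp
df = df.withColumn('time', (col('time_ms') / 1000).cast('timestamp'))
df.show(1, False)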

JunL22
  • The above solution doesn't work. I keep getting a `'DataFrame' object has no attribute 'time'` even though I imported the datetime library – thentangler Oct 16 '20 at 15:44

The reason: pyspark's to_timestamp parses only up to seconds, while TimestampType is able to hold milliseconds.

The following workaround may work:

If the timestamp pattern contains S, invoke a UDF to get the 'INTERVAL <n> MILLISECONDS' string to use in an expression:

from pyspark.sql.functions import expr, to_timestamp

ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))

To get INTERVAL 256 MILLISECONDS we may use a Java UDF:

df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))

Inside the UDF, getIntervalStringUDF(String timeString, String pattern) would:

  1. Use SimpleDateFormat to parse the date according to the pattern
  2. Return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
  3. Return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
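
The answer describes a Java UDF; as a rough Python equivalent (my sketch, not the original code, and it hardcodes the parse format rather than deriving it from the pattern argument):

from datetime import datetime

def get_interval_string(time_string, pattern):
    # Sketch only: ignores 'pattern' and assumes input like
    # '2019-01-04 11:09:21.152815'
    try:
        parsed = datetime.strptime(time_string, '%Y-%m-%d %H:%M:%S.%f')
        return 'INTERVAL %d MILLISECONDS' % (parsed.microsecond // 1000)
    except ValueError:
        return 'INTERVAL 0 MILLISECONDS'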
Adeeb

You can directly cast the field and it will keep the milliseconds.

from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
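
For instance, with the question's sample value (the column name 'timestamp' is taken from the snippet above, not from the question):

df = spark.createDataFrame([('2019-01-04 11:09:21.152815',)], ['timestamp'])
df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
df.show(1, False)

The cast preserves the fractional seconds, e.g. 2019-01-04 11:09:21.152815.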
MrBigData