
I'm trying to get the unix time from a timestamp field in milliseconds (13 digits) but currently it returns in seconds (10 digits).

scala> var df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.123", "2017-01-18 11:00:00.882", "2017-01-18 11:00:02.432").toDF()
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df = df.selectExpr("value timeString", "cast(value as timestamp) time")
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp]


scala> df = df.withColumn("unix_time", unix_timestamp(df("time")))
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp ... 1 more field]

scala> df.take(4)
res63: Array[org.apache.spark.sql.Row] = Array(
[2017-01-18 11:00:00.000,2017-01-18 11:00:00.0,1484758800], 
[2017-01-18 11:00:00.123,2017-01-18 11:00:00.123,1484758800], 
[2017-01-18 11:00:00.882,2017-01-18 11:00:00.882,1484758800], 
[2017-01-18 11:00:02.432,2017-01-18 11:00:02.432,1484758802])

Even though 2017-01-18 11:00:00.123 and 2017-01-18 11:00:00.000 are different, I get the same unix time back: 1484758800.

What am I missing?


6 Answers


Milliseconds hide in the fractional part of the timestamp format.

Try this:

df = df.withColumn("time_in_milliseconds", col("time").cast("double"))

You'll get something like 1484758800.792, where 792 is the milliseconds part.

At least it works for me (Scala, Spark, Hive).
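
If you need the 13-digit integer from the question rather than a fractional-second double, here is a minimal Scala sketch (the withMillis name is hypothetical; df and time are the question's DataFrame and column, and scaling by 1000 then truncating is an assumption built on the cast above):

import org.apache.spark.sql.functions.col

// cast("double") yields epoch seconds with a fractional millisecond part;
// scale by 1000 and cast to long to get 13-digit epoch milliseconds
val withMillis = df.withColumn(
  "time_in_milliseconds",
  (col("time").cast("double") * 1000).cast("long")
)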

  • This is also particularly useful if you need to round a time field to the nearest second. unix_timestamp() does not round the timestamp; it takes the mathematical floor down to whole seconds, ignoring the milliseconds fraction. The decimal unix time from this answer can be rounded, providing a way to round a timestamp to the nearest second. – Brad Hein Jul 31 '23 at 15:07

Implementing the approach suggested in Dao Thi's answer

import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()

Output:

+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
 |-- TIME: string (nullable = true)

Converting the string time format (including milliseconds) to unix_timestamp (double): extract the milliseconds from the string with the substring method (start_position = -7, length_of_substring = 3) and add them to the unix_timestamp separately (the substring is cast to float for the addition).

df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)

Converting the unix_timestamp (double) to the timestamp datatype in Spark:

df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)

This gives the following output:

+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+

Checking the schema:

df2.printSchema()


root
 |-- TIME: string (nullable = true)
 |-- unix_timestamp: double (nullable = true)
 |-- TimestampType: timestamp (nullable = true)

It cannot be done with unix_timestamp(), but since Spark 3.1.0 there is a built-in function called unix_millis():

unix_millis(timestamp) - Returns the number of milliseconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision.
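
Applied to the question's df and time column, a minimal sketch (unix_millis is invoked here through expr, since it is a SQL built-in; withMillis is a hypothetical name):

import org.apache.spark.sql.functions.expr

// unix_millis (Spark 3.1.0+) returns epoch milliseconds, truncating finer precision
val withMillis = df.withColumn("unix_time_ms", expr("unix_millis(time)"))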


unix_timestamp() returns the unix timestamp in seconds.

The last 3 digits of the timestamp string are its milliseconds (1.999 sec = 1999 milliseconds), so just take those last 3 digits and append them to the seconds value returned by unix_timestamp(), as sketched below.
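
A minimal Scala sketch of that string-splicing idea, assuming the question's timeString and time columns and exactly three fractional digits in the string (withMillis is a hypothetical name):

import org.apache.spark.sql.functions.{col, concat, substring, unix_timestamp}

// glue the 3-digit millisecond suffix of the string onto the unix seconds
val withMillis = df.withColumn(
  "unix_time_ms",
  concat(
    unix_timestamp(col("time")).cast("string"),
    substring(col("timeString"), -3, 3)
  ).cast("long")
)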


Up to Spark version 3.0.1 it is not possible to convert a timestamp into unix time in milliseconds using the SQL built-in function unix_timestamp.

According to the code in Spark's DateTimeUtils:

"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."

Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can call getTime on it to obtain a Long in milliseconds. If you apply unix_timestamp you will only get unix time with second precision.

// java.sql.Timestamp.getTime returns epoch milliseconds as a Long
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)

Applying this to a variety of Timestamps:

val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
  .toDF("timestampString")
  .withColumn("timestamp", to_timestamp(col("timestampString")))
  .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
  .withColumn("timestampUnixTimestamp", unix_timestamp(col("timestamp")))

df.printSchema()
df.show(false)

// returns
root
 |-- timestampString: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampConversionToLong: long (nullable = false)
 |-- timestampUnixTimestamp: long (nullable = true)

+-----------------------+-----------------------+-------------------------+----------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampUnixTimestamp|
+-----------------------+-----------------------+-------------------------+----------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600            |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600            |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600            |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600            |
+-----------------------+-----------------------+-------------------------+----------------------+

Wow, same as @Тимур Залимов suggested, just cast it:

>>> df2 = df_msg.withColumn("datetime", F.col("timestamp").cast("timestamp")).withColumn("timestamp_back" , F.col("datetime").cast("double"))
>>> r = df2.rdd.take(1)[0]
>>> r.timestamp_back                                                            
1666509660.071501
>>> r.timestamp
1666509660.071501
>>> r.datetime
datetime.datetime(2022, 10, 23, 15, 21, 0, 71501)