
I'm trying to get the unix time from a timestamp field in milliseconds (13 digits) but currently it returns in seconds (10 digits).

scala> var df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.123", "2017-01-18 11:00:00.882", "2017-01-18 11:00:02.432").toDF()
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df = df.selectExpr("value timeString", "cast(value as timestamp) time")
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp]


scala> df = df.withColumn("unix_time", unix_timestamp(df("time")))
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp ... 1 more field]

scala> df.take(4)
res63: Array[org.apache.spark.sql.Row] = Array(
[2017-01-18 11:00:00.000,2017-01-18 11:00:00.0,1484758800], 
[2017-01-18 11:00:00.123,2017-01-18 11:00:00.123,1484758800], 
[2017-01-18 11:00:00.882,2017-01-18 11:00:00.882,1484758800], 
[2017-01-18 11:00:02.432,2017-01-18 11:00:02.432,1484758802])

Even though 2017-01-18 11:00:00.123 and 2017-01-18 11:00:00.000 are different, I get the same unix time back: 1484758800.

What am I missing?


6 Answers


Milliseconds hide in the fractional part of the timestamp format.

Try this:

df = df.withColumn("time_in_milliseconds", col("time").cast("double"))

You'll get something like 1484758800.792, where 792 is the milliseconds part.

At least it works for me (Scala, Spark, Hive).
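
If you need the 13-digit integer from the question rather than a fractional-second double, here is a minimal Scala sketch (the withMillis name is hypothetical; df and time are the question's DataFrame and column, and scaling by 1000 then truncating is an assumption built on the cast above):

import org.apache.spark.sql.functions.col

// cast("double") yields epoch seconds with a fractional millisecond part;
// scale by 1000 and cast to long to get 13-digit epoch milliseconds
val withMillis = df.withColumn(
  "time_in_milliseconds",
  (col("time").cast("double") * 1000).cast("long")
)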

  • This is also particularly useful if you need to round a time field to the nearest second. unix_timestamp() does not round the timestamp; it takes the mathematical floor down to whole seconds, ignoring the milliseconds fraction. The decimal unix time from this answer can be rounded, providing a way to round a timestamp to the nearest second. – Brad Hein Jul 31 '23 at 15:07

Implementing the approach suggested in Dao Thi's answer

import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()

Output:

+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
 |-- TIME: string (nullable = true)

Converting the string time format (including milliseconds) to unix_timestamp (double): extract the milliseconds from the string with the substring method (start_position = -7, length_of_substring = 3) and add them to the unix_timestamp separately (the substring is cast to float for the addition).

df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)

Converting the unix_timestamp (double) to the timestamp datatype in Spark:

df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)

This gives the following output:

+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+

Checking the schema:

df2.printSchema()


root
 |-- TIME: string (nullable = true)
 |-- unix_timestamp: double (nullable = true)
 |-- TimestampType: timestamp (nullable = true)

It cannot be done with unix_timestamp(), but since Spark 3.1.0 there is a built-in function called unix_millis():

unix_millis(timestamp) - Returns the number of milliseconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision.
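
Applied to the question's df and time column, a minimal sketch (unix_millis is invoked here through expr, since it is a SQL built-in; withMillis is a hypothetical name):

import org.apache.spark.sql.functions.expr

// unix_millis (Spark 3.1.0+) returns epoch milliseconds, truncating finer precision
val withMillis = df.withColumn("unix_time_ms", expr("unix_millis(time)"))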


unix_timestamp() returns the unix timestamp in seconds.

The last 3 digits of the timestamp string are its milliseconds (1.999 sec = 1999 milliseconds), so just take those last 3 digits and append them to the seconds value returned by unix_timestamp(), as sketched below.
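
A minimal Scala sketch of that string-splicing idea, assuming the question's timeString and time columns and exactly three fractional digits in the string (withMillis is a hypothetical name):

import org.apache.spark.sql.functions.{col, concat, substring, unix_timestamp}

// glue the 3-digit millisecond suffix of the string onto the unix seconds
val withMillis = df.withColumn(
  "unix_time_ms",
  concat(
    unix_timestamp(col("time")).cast("string"),
    substring(col("timeString"), -3, 3)
  ).cast("long")
)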


Up to Spark version 3.0.1 it is not possible to convert a timestamp into unix time in milliseconds using the SQL built-in function unix_timestamp.

According to the code in Spark's DateTimeUtils:

"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."

Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can call getTime on it to obtain a Long in milliseconds. If you apply unix_timestamp you will only get unix time with second precision.

// java.sql.Timestamp.getTime returns epoch milliseconds as a Long
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)

Applying this to a variety of Timestamps:

val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
  .toDF("timestampString")
  .withColumn("timestamp", to_timestamp(col("timestampString")))
  .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
  .withColumn("timestampUnixTimestamp", unix_timestamp(col("timestamp")))

df.printSchema()
df.show(false)

// returns
root
 |-- timestampString: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampConversionToLong: long (nullable = false)
 |-- timestampUnixTimestamp: long (nullable = true)

+-----------------------+-----------------------+-------------------------+----------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampUnixTimestamp|
+-----------------------+-----------------------+-------------------------+----------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600            |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600            |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600            |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600            |
+-----------------------+-----------------------+-------------------------+----------------------+

Wow, same as @Тимур Залимов suggested, just cast it:

>>> df2 = df_msg.withColumn("datetime", F.col("timestamp").cast("timestamp")).withColumn("timestamp_back" , F.col("datetime").cast("double"))
>>> r = df2.rdd.take(1)[0]
>>> r.timestamp_back                                                            
1666509660.071501
>>> r.timestamp
1666509660.071501
>>> r.datetime
datetime.datetime(2022, 10, 23, 15, 21, 0, 71501)