
I have a DataFrame as follows:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

test_df1 = spark.createDataFrame(
    [
        (1, "-", "-"),
        (1, "97", "00:00:00.02"),
        (1, "78", "00:00:00.02"),
        (2, "83", "00:00:00.02"),
        (2, "14", "00:00:00.02"),
        (2, "115", "00:00:00.02"),
    ],
    ['ID', 'random', 'time']
)
test_df1.show()

+---+------+-----------+
| ID|random|       time|
+---+------+-----------+
|  1|     -|          -|
|  1|    97|00:00:00.02|
|  1|    78|00:00:00.02|
|  2|    83|00:00:00.02|
|  2|    14|00:00:00.02|
|  2|   115|00:00:00.02|
+---+------+-----------+

How can I convert the time column to milliseconds as a DoubleType? I am currently doing it as shown below, where I take the part after the seconds as a string and cast it to double. Is there a better way?

test_df2 = test_df1.withColumn("time", F.substring_index("time", '.', -1).cast("double"))
test_df2.show()

+---+------+----+
| ID|random|time|
+---+------+----+
|  1|     -|null|
|  1|    97| 2.0|
|  1|    78| 2.0|
|  2|    83| 2.0|
|  2|    14| 2.0|
|  2|   115| 2.0|
+---+------+----+
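
A sketch of one alternative, assuming the strings always follow the HH:mm:ss.SS pattern ("-" simply becomes null): split on ":" and combine the parts, which yields true milliseconds (20.0 rather than 2.0 for "00:00:00.02") and needs no string slicing. The time_in_ms name is just illustrative.

parts = F.split("time", ":")
test_df_ms = test_df1.withColumn(
    "time_in_ms",
    # hours, minutes, and fractional seconds combined into milliseconds;
    # any non-numeric part casts to null, so the whole sum becomes null
    (parts.getItem(0).cast("double") * 3600
     + parts.getItem(1).cast("double") * 60
     + parts.getItem(2).cast("double")) * 1000,
)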
eemilk
    This should help you out: https://stackoverflow.com/questions/42237938/can-unix-timestamp-return-unix-time-in-milliseconds-in-apache-spark – user238607 Nov 16 '20 at 16:41

1 Answer


What I ended up doing is converting the time column to a timestamp and then to unix time, after which I subtract today's timestamp (as unix time) from it. This gives me seconds, which I can scale to ms or ns or whatever.

import datetime
from time import mktime

from pyspark.sql.functions import col, round, to_timestamp

# Midnight today as a unix timestamp; to_timestamp() stamps today's date
# onto the bare time strings, so subtracting this leaves the time of day.
today = datetime.date.today()
unixtime = mktime(today.timetuple())

test_df1 = test_df1.withColumn('time_to_timestamp', to_timestamp('time')) \
                    .withColumn("unix_time_w_ms", col("time_to_timestamp").cast("double")) \
                    .withColumn("time_in_s", col("unix_time_w_ms") - unixtime) \
                    .withColumn("time_in_s", round(col('time_in_s'), 3))

test_df1.show()

+---+------+-----------+--------------------+---------------+---------+
| ID|random|       time|   time_to_timestamp| unix_time_w_ms|time_in_s|
+---+------+-----------+--------------------+---------------+---------+
|  1|     -|          -|                null|           null|     null|
|  1|    97|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  1|    78|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|    83|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|    14|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|   115|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
+---+------+-----------+--------------------+---------------+---------+

I still have a feeling this could be done better, maybe without so many withColumn calls; if I have to loop this over a huge DataFrame, I've read that chained withColumn usage is not preferred.
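
A sketch of the same computation as a single select (reusing the unixtime value from above), so Spark builds one projection instead of four chained ones; the time_in_ms column and its scaling are just illustrative additions:

from pyspark.sql.functions import round, to_timestamp

# One projection: seconds since midnight, plus the same value in ms.
seconds = to_timestamp("time").cast("double") - unixtime
test_df2 = test_df1.select(
    "ID",
    "random",
    "time",
    round(seconds, 3).alias("time_in_s"),
    round(seconds * 1000, 3).alias("time_in_ms"),
)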

eemilk