Assuming you already have a DataFrame with columns of timestamp type:
from datetime import datetime
from pyspark.sql import functions as F

# Sample data with an ID and two timestamp columns
data = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- max_ts: timestamp (nullable = true)
# |-- min_ts: timestamp (nullable = true)
You can get the time in seconds by casting the timestamp column to double, or in milliseconds by multiplying that result by 1000 (and optionally casting to long if you want an integer).
For example:
df.select(
F.col('max_ts').cast('double').alias('time_in_seconds'),
(F.col('max_ts').cast('double') * 1000).cast('long').alias('time_in_milliseconds'),
).toPandas()
# time_in_seconds time_in_milliseconds
# 0 1532538906.390 1532538906390
# 1 1532517169.317 1532517169317
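As an aside, the cast to double is what preserves the fractional seconds. A quick sketch of why F.unix_timestamp is not a substitute here (it returns whole seconds as a long, so the millisecond part is dropped); the column name whole_seconds is just an example:
df.select(
    # unix_timestamp truncates to whole seconds, losing sub-second precision
    F.unix_timestamp('max_ts').alias('whole_seconds'),
).toPandas()
# whole_seconds
# 0 1532538906
# 1 1532517169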
Finally, if you want the difference between the two times in milliseconds, you could do:
df.select(
((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000).cast('long').alias('diff_in_milliseconds'),
).toPandas()
# diff_in_milliseconds
# 0 13
# 1 434
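If you would rather keep the original columns and simply attach the difference, the same expression works with withColumn. A minimal sketch (the name diff_in_milliseconds is just an example):
df_with_diff = df.withColumn(
    'diff_in_milliseconds',
    # same double-cast subtraction as above, scaled to milliseconds
    ((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000).cast('long')
)
# df_with_diff keeps ID, max_ts and min_ts, plus the new diff_in_milliseconds column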
I'm doing this on PySpark 2.4.2. There is no need to use string concatenation whatsoever.