
I'm trying to recreate in Databricks some work I have already done in Python. I have a dataframe containing a column called 'time', which holds Unix timestamps in nanoseconds. In Python, I use the following code to convert the field into the appropriate datetime values:

import pandas as pd

# Convert time field from nanoseconds into datetime
df["time"] = pd.to_datetime(df["time"], unit='ns')

This code converts the value 1642778070000000000 into 2022-01-21 15:14:30. I now want to do this in Databricks using PySpark (as I'm scaling up the problem and the dataset I'm using is too large to handle in plain Python). I've created a Spark dataframe called df, imported the pyspark.pandas functions, and then tried effectively the same code, but it doesn't work:

from pyspark import pandas as ps
df = df.ps.to_datetime(df.columns[2], unit='ns') #the time column is in column index 2

I get an error:

'DataFrame' object has no attribute 'ps'

Any suggestions?

JGW

1 Answer


PySpark does not provide a direct function for working with time in nanoseconds. There is a function called from_unixtime(), which takes a time in seconds as its argument and converts it to a timestamp in the format yyyy-MM-dd HH:mm:ss (your requirement). So, to use this function, we must first manually convert the nanoseconds to seconds.

The following demonstrates the code on my sample data:

  • My sample dataframe df contains an id column and a nanoseconds column holding the raw epoch values (its contents are shown in the df.show() output below).

  • To convert nanoseconds to seconds, divide by 1,000,000,000 (10^9) as shown below.
from pyspark.sql.functions import col, from_unixtime

# divide by 10^9 to go from nanoseconds to epoch seconds (result is a double)
df = df.withColumn("seconds", col('nanoseconds') / 1000000000)
df.show()

+---+-------------------+--------------------+
| id|        nanoseconds|             seconds|
+---+-------------------+--------------------+
|  1|1642778070000000000|        1.64277807E9|
|  2|1360287003083912345|1.3602870030839124E9|
+---+-------------------+--------------------+
  • Now we have a seconds column of type double. You can pass it to from_unixtime() as an argument, as shown below.
df = df.withColumn("date_time",from_unixtime(col('seconds'))) 
df.show() 

+---+-------------------+--------------------+-------------------+
| id|        nanoseconds|             seconds|          date_time|
+---+-------------------+--------------------+-------------------+
|  1|1642778070000000000|        1.64277807E9|2022-01-21 15:14:30|
|  2|1360287003083912345|1.3602870030839124E9|2013-02-08 01:30:03|
+---+-------------------+--------------------+-------------------+

You now have a date_time column of StringType containing the respective nanosecond values as timestamps in the format yyyy-MM-dd HH:mm:ss.
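
If you need a proper TimestampType column rather than a string (for sorting, windowing, date arithmetic, and so on), here is a minimal sketch of two options, reusing the nanoseconds column from above; the date_time_ts column name is just illustrative:

from pyspark.sql.functions import col, from_unixtime, timestamp_seconds

# Option 1: cast the from_unixtime() string to TimestampType
# (fractional seconds are lost, since from_unixtime truncates to whole seconds)
df = df.withColumn("date_time_ts", from_unixtime(col('nanoseconds') / 1000000000).cast("timestamp"))

# Option 2 (Spark 3.1+): timestamp_seconds() returns TimestampType directly
# and keeps fractional seconds up to Spark's microsecond precision
df = df.withColumn("date_time_ts", timestamp_seconds(col('nanoseconds') / 1000000000))

With timestamp_seconds(), the second sample row keeps its fractional part (to microseconds; Spark timestamps cannot hold full nanosecond precision), whereas the from_unixtime() route truncates to whole seconds.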

Saideep Arikontham
  • This is a fantastic answer - really detailed and helpful. It works perfectly. Thanks! – JGW Jul 06 '22 at 11:39