Cast String field to datetime64[ns] in parquet file using pandas-on-spark

Question

My input is a parquet file with columns that I need to recast as below:

df=spark.read.parquet("input.parquet")
psdf=df.to_pandas_on_spark()

psdf['reCasted'] = psdf['col1'].astype('float64')
psdf['reCasted'] = psdf['col2'].astype('int32')
psdf['reCasted'] = psdf['col3'].astype('datetime64[ns]')

In the code above I am able to convert col1 to float64 and col2 to int32. But when I try to convert col3 to datetime64[ns], the recast values come back as NaT. Note that col3 is originally a string, which I am trying to convert to datetime64[ns].

I can do this recasting using Pandas as below:

psdf['reCasted'] = pd.to_datetime(psdf['col3'], format='%Y-%m-%d')

But I don't want to use Pandas as the process is taking time. I want to use pandas_on_spark only. What can I try next?

score 0 · Answer 1 · answered Jul 10 '23 at 21:03

To recast the col3 column from a string to datetime64[ns] using pandas_on_spark, you can utilize the to_timestamp function provided by pandas_on_spark.

Trickster Ke

Tried it like this: psdf['reCasted'] = ps.to_timestamp(psdf['col3'], format='%Y-%m-%d'). But I am getting "AttributeError: module pyspark.pandas has no attribute to_timestamp" – user2531569 Jul 10 '23 at 21:19
psdf['reCasted'] = ps.to_datetime(psdf['col3'], format='%Y-%m-%d') is working. Thanks – user2531569 Jul 10 '23 at 21:41
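
A note on the format string: ps.to_datetime takes a strftime-style format, so every field needs its own % directive. A format such as '%Y-m-%d%' contains a literal "m" and a stray trailing "%", so it will not match ISO-style dates. Below is a minimal sketch for checking a format against one sample value with the standard library before running it on the whole column. The sample value and the helper name format_matches are assumptions for illustration only; the question does not show what col3 actually contains.

```python
from datetime import datetime

# Hypothetical sample value; the question does not show the contents of col3.
sample = "2023-07-10"


def format_matches(value: str, fmt: str) -> bool:
    """Return True if `value` parses with the strptime-style format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        # Raised both for a malformed format and for a value that does not fit it.
        return False


# Every field has its own % directive.
good = format_matches(sample, "%Y-%m-%d")
# Literal "m" instead of %m, plus a stray trailing "%".
bad = format_matches(sample, "%Y-m-%d%")
print(good, bad)
```

Once a format passes this check on a few representative values, the same string can be passed as the format argument of ps.to_datetime.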
