My input is a parquet file whose columns I need to recast as below:
df=spark.read.parquet("input.parquet")
psdf=df.to_pandas_on_spark()
psdf['reCasted'] = psdf['col1'].astype('float64')
psdf['reCasted'] = psdf['col2'].astype('int32')
psdf['reCasted'] = psdf['col3'].astype('datetime64[ns]')
In the above code I am able to convert col1 into float64 and col2 into int32. But when I try to convert col3 into datetime64[ns], I get NaT as the recast value. Note that col3 is originally a string, which I am trying to convert to datetime64[ns].
I can do this recasting using Pandas as below:
psdf['reCasted'] = pd.to_datetime(psdf['col3'], format='%Y-%m-%d')
But I don't want to use Pandas because the conversion is slow. I want to use pandas_on_spark only. What can I try next?
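For reference, here is a minimal sketch for sanity-checking a date format string outside Spark, using only the Python standard library. The sample values and the helper name parse_or_none are hypothetical, and the format '%Y-%m-%d' is an assumption about what col3 contains. Every directive needs its own leading %, and a value that does not match the format cannot be parsed, which is comparable to getting NaT back from a lenient conversion.

```python
from datetime import datetime
from typing import Optional


def parse_or_none(value: str, fmt: str) -> Optional[datetime]:
    """Parse value with fmt; return None when the two do not match."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        # strptime raises ValueError when the string does not fit the format
        return None


# Hypothetical sample values; the real contents of col3 are not shown here.
samples = ["2021-03-04", "04/03/2021"]

for raw in samples:
    parsed = parse_or_none(raw, "%Y-%m-%d")
    print(raw, "->", parsed)
```

If some of the real strings come back as None here, the stored format differs from the assumed one, and the format string would need to be adjusted before any datetime conversion can succeed.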