My input is a parquet file whose columns I need to recast as below:

df=spark.read.parquet("input.parquet")
psdf=df.to_pandas_on_spark()

psdf['reCasted'] = psdf['col1'].astype('float64')
psdf['reCasted'] = psdf['col2'].astype('int32')
psdf['reCasted'] = psdf['col3'].astype('datetime64[ns]')

In the above code I am able to convert col1 to float64 and col2 to int32. But when I try to convert col3 to datetime64[ns], the recast values come out as NaT. Note that col3 is originally a string column, which I am trying to convert to datetime64[ns].

I can do this recasting using Pandas as below:

psdf['reCasted'] = pd.to_datetime(psdf['col3'], format='%Y-%m-%d')

But I don't want to use Pandas because the process takes too long. I want to use pandas_on_spark only. What can I try next?

halfer
user2531569

1 Answer

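To recast the col3 column from a string to datetime64[ns] in pandas_on_spark, use ps.to_datetime, the pandas-on-Spark counterpart of pd.to_datetime, and pass the format of your strings explicitly. Note that pyspark.pandas has no to_timestamp function; to_timestamp lives in pyspark.sql.functions and operates on Spark DataFrame columns, not on pandas-on-Spark Series.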

  • Tried like this : psdf['reCasted'] = ps.to_timestamp(psdf['col3'],format='%Y-m-%d%'). But I am getting "AttributeError: module pyspark.pandas has no attribute to_timestamp" – user2531569 Jul 10 '23 at 21:19
  • psdf['reCasted'] = ps.to_datetime(psdf['col3'],format='%Y-m-%d%') is working. thanks – user2531569 Jul 10 '23 at 21:41