
I have a PySpark dataframe with millions of records. It has a string column holding Persian (Jalali) dates that I need to convert to Gregorian (miladi) dates. I tried several approaches: first a UDF written in Python, which performed poorly; then a UDF written in Scala, packaged as a JAR and called from the PySpark program, but performance did not improve much. I read that pandas_udf is faster, so I decided to use it; however, it did not work well either. I used pandas_udf in these ways:

First:

 import pandas as pd
 from pyspark.sql.functions import pandas_udf, PandasUDFType
 from persiantools.jdatetime import JalaliDate

 @pandas_udf('long', PandasUDFType.SCALAR)
 def f1(v: pd.Series) -> pd.Series:
     return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())
 
 df.withColumn('date_miladi', f1(df.trx_date)).show()
 
 Error:  TypeError: 'decimal.Decimal' object is not subscriptable
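For context, I think this error happens because `trx_date` is a `DecimalType` column, so inside the UDF each `x` arrives as a `decimal.Decimal`, which cannot be indexed; converting to `str` first is what makes slicing work. A minimal reproduction without Spark (the sample value is taken from the dataframe below):

```python
from decimal import Decimal

x = Decimal("14010901")
try:
    x[1]          # Decimal does not support indexing
except TypeError as e:
    print(e)      # 'decimal.Decimal' object is not subscriptable

# converting to str first makes the slicing work
s = str(x)
print(s[0:4], s[4:6], s[6:8])  # 1401 09 01
```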

Second:

 import pandas as pd
 from typing import Iterator
 from pyspark.sql.functions import pandas_udf
 from pyspark.sql.types import DateType
 from persiantools.jdatetime import JalaliDate

 @pandas_udf(DateType())
 def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
     for date in iterator:
         return pd.Series(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())

 df.withColumn('date_miladi', f1(df.trx_date)).show()
 
 Error: TypeError: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>

Third:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from persiantools.jdatetime import JalaliDate

@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
    return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())

df.withColumn('date_miladi', f1(df.trx_date)).show()    

Error: TypeError: 'decimal.Decimal' object is not subscriptable
    

Fourth:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DateType
from persiantools.jdatetime import JalaliDate

@pandas_udf(DateType())
def f1(col1: pd.Series) -> pd.Series:
    return JalaliDate(int(str(col1[1])[0:4]), int(str(col1[1])[4:6]), int(str(col1[1])[6:8])).to_gregorian()
    
df.withColumn('date_miladi', f1(df.trx_date)).show()


Error: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>

Update: I used the iterator variant as follows, but it still raises an error:

@pandas_udf("string", PandasUDFType.SCALAR_ITER)
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for date in iterator:
        print('type date:', type(date[1]))
        yield str(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())

 Error: AttributeError: 'str' object has no attribute 'isnull'

The dataframe looks like this:

 +-----------+-------------+
 |id         | persian_date|
 +-----------+-------------+
 |13085178737| 14010901    |
 |13098336049| 14010901    |
 |13098486609| 14010901    |
 |13097770966| 14010901    |
 |13099744296| 14010901    |
 |13101233891| 14010901    |
 |13100358276| 14010901    |
 +-----------+-------------+

The result should look like this:

 +-----------+-------------+--------------+
 |id         | persian_date| date_miladi  |
 +-----------+-------------+--------------+
 |13085178737| 14010901    |2022-11-22    |
 |13098336049| 14010901    |2022-11-22    |
 |13098486609| 14010901    |2022-11-22    |
 |13097770966| 14010901    |2022-11-22    |
 |13099744296| 14010901    |2022-11-22    |
 |13101233891| 14010901    |2022-11-22    |
 |13100358276| 14010901    |2022-11-22    |
 +-----------+-------------+--------------+

Would you please guide me on the correct way to use pandas_udf in a PySpark program?

Any help is really appreciated.

M_Gh
    I think that using pandas UDF will not yield significant improvements in performance as it relies on vectorized computation inside the UDF function, and map is not inherently vectorized as it simply iterates through the elements of the Series. – Shubham Sharma Mar 26 '23 at 11:37
  • @ShubhamSharma thank you for your feedback. So, is there any other way to improve the performance of the program? – M_Gh Mar 28 '23 at 11:08
  • Sorry for the late reply, I just saw your message. You can try a pandas UDF with an iterator to see if it improves the performance. – Shubham Sharma Mar 31 '23 at 16:23
    @ShubhamSharma I tried pandas UDF with iterator; but still it does not work, would you please take a look at **Update** part of the question? Many thanks. – M_Gh Apr 03 '23 at 14:25
  • @M_Gh can you include a sample of your dataframe in the question so we can reproduce your errors? you can copy and paste the output from `df.show()` as well as `df.printSchema()` in your question, thank you! – Derek O Apr 03 '23 at 17:33
    @DerekO I added a part of the dataframe, would you please take a look at this? Thanks. – M_Gh Apr 04 '23 at 06:27

1 Answer


Solution

Import the required modules:

import pandas as pd
from typing import Iterator
from pyspark.sql import functions as F
from persiantools.jdatetime import JalaliDate

First, define a utility function to parse persian_date to Gregorian:

def parse_date(s: str):
    s = str(s)
    return JalaliDate(*map(int, [s[:4], s[4:6], s[6:8]])).to_gregorian()
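The slicing-and-unpacking step can be checked with plain Python (no Spark or persiantools needed), assuming the 8-digit yyyymmdd layout shown in the sample data:

```python
# "14010901" -> Jalali year 1401, month 9, day 1
s = "14010901"
year, month, day = map(int, [s[:4], s[4:6], s[6:8]])
print(year, month, day)  # 1401 9 1
```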

Now you can try two approaches, although I would prefer approach 1: you are not doing any heavy initialization in the UDF, so there is no point in using iterators.

Approach 1: Series UDF

@F.pandas_udf('date')
def parse_date_pdf(series: pd.Series) -> pd.Series:
    return series.map(parse_date)

Approach 2: Series iterator UDF

@F.pandas_udf('date')
def parse_date_pdf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # looping over iterator yields a pandas series
    for series in iterator:
        yield series.map(parse_date)
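The iterator contract can be exercised outside Spark with plain pandas: the function must consume pd.Series batches and yield pd.Series batches. Yielding a scalar such as a str, as in the question's update, is what triggers the `'str' object has no attribute 'isnull'` error, because Spark applies Series methods to whatever is yielded. A small sketch with a stand-in mapper (extracting the year) instead of parse_date:

```python
import pandas as pd

def batches(iterator):
    # consume one pandas Series per batch, yield one Series per batch
    for series in iterator:
        yield series.map(lambda s: int(str(s)[:4]))  # stand-in for parse_date

out = list(batches([pd.Series(["14010901", "14010902"])]))[0]
print(out.tolist())  # [1401, 1401]
```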

Result

df = df.withColumn('date_miladi', parse_date_pdf('persian_date'))
df.show()

+-----------+------------+-----------+
|         id|persian_date|date_miladi|
+-----------+------------+-----------+
|13085178737|    14010901| 2022-11-22|
|13098336049|    14010901| 2022-11-22|
|13098486609|    14010901| 2022-11-22|
|13097770966|    14010901| 2022-11-22|
|13099744296|    14010901| 2022-11-22|
|13101233891|    14010901| 2022-11-22|
|13100358276|    14010901| 2022-11-22|
+-----------+------------+-----------+
Shubham Sharma
  • thank you for the complete answer. Would you please guide me if it is possible to use ```Date``` type instead of ```string``` in **pandas_udf**? – M_Gh Apr 04 '23 at 12:38
  • Dear @Shubham Sharma, would you please take a look at this question: https://stackoverflow.com/q/76300991/6640504 ? Many thanks. – M_Gh May 22 '23 at 07:01