I have a PySpark dataframe with millions of records. It has a column containing a Persian (Jalali) date as a string, and I need to convert it to a Gregorian (Miladi) date. I tried several approaches: first a UDF written in Python, which performed poorly; then a UDF written in Scala, packaged as a JAR and called from the PySpark program, but performance did not improve much. I read that pandas_udf is faster, so I decided to use it, but I could not get it to work. I tried pandas_udf in these ways:
First:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
    return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())

df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: 'decimal.Decimal' object is not subscriptable
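This error occurs because in a SCALAR pandas UDF each `x` passed to `v.map` is already a single scalar value (here a `decimal.Decimal`), not a row, so `x[1]` fails. A minimal sketch of the per-element parsing, kept separate from Spark so it can be tested on its own (`parse_jalali` is a hypothetical helper name):

```python
from decimal import Decimal

def parse_jalali(x):
    # x is one scalar value (Decimal, int, or str), not a row:
    # convert it to a string first, then slice out year/month/day
    s = str(int(x))  # Decimal("14010901") -> "14010901"
    return int(s[0:4]), int(s[4:6]), int(s[6:8])

print(parse_jalali(Decimal("14010901")))  # (1401, 9, 1)
```

Inside the UDF the mapping would then become, e.g., `v.map(lambda x: JalaliDate(*parse_jalali(x)).to_gregorian())`.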
Second:
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DateType

@pandas_udf(DateType())
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for date in iterator:
        return pd.Series(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())

df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>
Third:
import pandas as pd

@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
    return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())

df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: 'decimal.Decimal' object is not subscriptable
Fourth:
import pandas as pd

@pandas_udf(DateType())
def f1(col1: pd.Series) -> pd.Series:
    return JalaliDate(int(str(col1[1])[0:4]), int(str(col1[1])[4:6]), int(str(col1[1])[6:8])).to_gregorian()

df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>
Update:
I also tried the iterator (SCALAR_ITER) variant as follows, but it still raises an error:
@pandas_udf("string", PandasUDFType.SCALAR_ITER)
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for date in iterator:
        print('type date:', type(date[1]))
        yield str(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())
Error: AttributeError: 'str' object has no attribute 'isnull'
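The AttributeError comes from yielding a plain `str`: a SCALAR_ITER pandas UDF must yield one `pd.Series` per input batch. A sketch of the corrected batching logic, with the calendar conversion stubbed out so it can run without Spark or a Jalali library (in real code the stub would call `JalaliDate(...).to_gregorian()`):

```python
from typing import Iterator
import pandas as pd

def to_gregorian_string(x) -> str:
    # Stand-in for the real conversion; real code would return, e.g.,
    # str(JalaliDate(int(s[0:4]), int(s[4:6]), int(s[6:8])).to_gregorian())
    s = str(int(x))
    return f"{s[0:4]}-{s[4:6]}-{s[6:8]}"  # illustration only: still a Jalali date

def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # One pd.Series out for every pd.Series in; yielding a bare str is
    # what triggers "'str' object has no attribute 'isnull'"
    for batch in iterator:
        yield batch.map(to_gregorian_string)

# In Spark, the same function would carry the decorator
# @pandas_udf("string", PandasUDFType.SCALAR_ITER)
print(list(f1(iter([pd.Series([14010901])])))[0].tolist())  # ['1401-09-01']
```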
The dataframe looks like this:
+-----------+-------------+
|id | persian_date|
+-----------+-------------+
|13085178737| 14010901 |
|13098336049| 14010901 |
|13098486609| 14010901 |
|13097770966| 14010901 |
|13099744296| 14010901 |
|13101233891| 14010901 |
|13100358276| 14010901 |
+-----------+-------------+
Result should be like this:
+-----------+-------------+--------------+
|id | persian_date| date_miladi |
+-----------+-------------+--------------+
|13085178737| 14010901 |2022-11-22 |
|13098336049| 14010901 |2022-11-22 |
|13098486609| 14010901 |2022-11-22 |
|13097770966| 14010901 |2022-11-22 |
|13099744296| 14010901 |2022-11-22 |
|13101233891| 14010901 |2022-11-22 |
|13100358276| 14010901 |2022-11-22 |
+-----------+-------------+--------------+
Would you please guide me on the correct way to use pandas_udf in a PySpark program?
Any help is really appreciated.