I have a pyspark dataframe like this:
+-------+-------+
| level | value |
+-------+-------+
| 1 | 4 |
| 1 | 5 |
| 2 | 2 |
| 2 | 6 |
| 2 | 3 |
+-------+-------+
I have to create a value for every group in level column and save this in lable column. This value for every group must be unique, so I use ObjectId Mongo function to create that. Next dataframe is like this:
+-------+--------+-------+
| level | lable| value |
+-------+--------+-------+
| 1 | bb76 | 4 |
| 1 | bb76 | 5 |
| 2 | cv86 | 2 |
| 2 | cv86 | 6 |
| 2 | cv86 | 3 |
+-------+--------+-------+
Then I must create a dataframe as following:
+-------+-------+
| lable | value |
+-------+-------+
| bb76 | 9 |
| cv86 | 11 |
+-------+-------+
To do that, first I used spark groupby
:
def create_objectid():
a = str(ObjectId())
return a
def add_lable(df):
df = df.cache()
df.count()
grouped_df = df.groupby('level').agg(sum(df.value).alias('temp'))
grouped_df = grouped_df.withColumnRenamed('level', 'level_temp')
grouped_df = grouped_df.withColumn('lable', udf_create_objectid())
grouped_df = grouped_df.drop('temp')
df = df.join(grouped_df.select('level_temp','lable'), col('level') == col('level_temp'), how="left").drop(grouped_df.level_temp)
return df
When I used the above code on spark dataframe with 2 millions records, it takes about 155
seconds to finish.
I searched and found that spark window
has better performance. Then, I changed the last function to this one. Because pandas_udf
needs arg
, so I just pass one and print it:
@f.pandas_udf("string")
def create_objectid_on_window(v: pd.Series) -> str:
print('v:',v)
return str(ObjectId())
def add_lable(df):
w = Window.partitionBy('level')
df = df.withColumn('lable', create_objectid_on_window('level').over(w))
return df
But after running the program, I receive this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Update: I read this question and answers; I do know this is because of the pandas UDF function. Unfortunately, I do not know how to change it.
Would you please guide me how to change the pandas UDF function?
Any help is really appreciated.