How to assign a constant value to all records in a PySpark dataframe window

Question

I have a pyspark dataframe like this:

+-------+-------+
| level | value |
+-------+-------+
|  1    |   4   |
|  1    |   5   |
|  2    |   2   |
|  2    |   6   |
|  2    |   3   |
+-------+-------+

I need to generate one value for every group in the level column and store it in a lable column. The value must be unique for each group, so I use MongoDB's ObjectId function to generate it. The resulting dataframe looks like this:

+-------+-------+-------+
| level | lable | value |
+-------+-------+-------+
|  1    | bb76  |   4   |
|  1    | bb76  |   5   |
|  2    | cv86  |   2   |
|  2    | cv86  |   6   |
|  2    | cv86  |   3   |
+-------+-------+-------+

Then I need to sum value per lable to produce a dataframe like the following:

+-------+-------+
| lable | value |
+-------+-------+
|  bb76 |   9   |
|  cv86 |   11  |
+-------+-------+
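
To make the intended transformation concrete, here is a minimal plain-Python sketch of the same logic on the sample rows: one identifier per level, then a sum of value per identifier. The function name `label_and_sum` is hypothetical, and `uuid.uuid4()` is only a stand-in for ObjectId so the snippet has no MongoDB dependency.

```python
import uuid
from collections import defaultdict

# Sample rows from the first table: (level, value)
rows = [(1, 4), (1, 5), (2, 2), (2, 6), (2, 3)]


def label_and_sum(rows):
    """Assign one identifier per level, then sum value per identifier."""
    labels = {}                # level -> identifier, created once per level
    totals = defaultdict(int)  # identifier -> sum of value
    for level, value in rows:
        if level not in labels:
            # Stand-in for str(ObjectId()); any unique string works here
            labels[level] = uuid.uuid4().hex
        totals[labels[level]] += value
    return labels, dict(totals)


labels, totals = label_and_sum(rows)
for lable, total in totals.items():
    print(lable, total)
```

The identifiers differ on every run, but the two totals match the last table (9 and 11).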

To do that, I first used Spark's groupby:

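   from bson import ObjectId  # assumption: ObjectId comes from pymongo's bson package
   from pyspark.sql.functions import col, sum, udf  # note: this sum shadows the builtin
   from pyspark.sql.types import StringType

   def create_objectid():
       # Generate a new MongoDB ObjectId and return it as a string
       return str(ObjectId())

   # Assumed definition: the original post uses udf_create_objectid without showing it
   udf_create_objectid = udf(create_objectid, StringType())

   def add_lable(df):
       df = df.cache()
       df.count()  # materialize the cache
       # One row per level; the aggregate only exists to collapse each group
       grouped_df = df.groupby('level').agg(sum(df.value).alias('temp'))
       grouped_df = grouped_df.withColumnRenamed('level', 'level_temp')
       grouped_df = grouped_df.withColumn('lable', udf_create_objectid())
       grouped_df = grouped_df.drop('temp')
       # Attach each level's lable back onto the original rows
       df = df.join(grouped_df.select('level_temp', 'lable'), col('level') == col('level_temp'), how="left").drop('level_temp')
       return df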

When I ran the code above on a Spark dataframe with 2 million records, it took about 155 seconds to finish. I searched and found that Spark window functions are supposed to perform better, so I changed the last function to the one below. Because a pandas_udf needs an argument, I just pass one and print it:

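   import pandas as pd
   from bson import ObjectId  # assumption: ObjectId comes from pymongo's bson package
   from pyspark.sql import Window, functions as f

   @f.pandas_udf("string")
   def create_objectid_on_window(v: pd.Series) -> str:
       print('v:', v)  # v is unused; it is passed only because pandas_udf needs an argument
       return str(ObjectId())

   def add_lable(df):
       w = Window.partitionBy('level')
       df = df.withColumn('lable', create_objectid_on_window('level').over(w))
       return df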

But after running the program, I receive this error:

AttributeError: 'NoneType' object has no attribute '_jvm'

Update: I read this question and its answers, and I understand that the error is caused by the pandas UDF. Unfortunately, I do not know how to change it.

Could you please show me how to change the pandas UDF?

Any help is really appreciated.

Does this answer your question? [Why do I get AttributeError: 'NoneType' object has no attribute 'something'?](https://stackoverflow.com/questions/8949252/why-do-i-get-attributeerror-nonetype-object-has-no-attribute-something). Even if not, take some time to understand the cause of the error, it might help you find a solution or at least ask a less generic question. — Ulrich Eckhardt, May 21 '23 at 20:03
What is the full stacktrace ? You did not indicate which line raises this error. — Itération 122442, May 21 '23 at 20:55
Your question still has the same title, which kind-of indicates a problem that is explained in the given link. There is no [mcve] either. Don't get me wrong, your problem itself is welcome here, but please take the [tour] and read [ask] to understand how to ask a good question. — Ulrich Eckhardt, May 22 '23 at 16:22
