
I want to generate a sequential unique id column for a data frame that is subject to change. By change I mean that more rows will be added tomorrow, after I generate the ids today. When more rows are added, I want to look up the id column that holds the generated ids and continue incrementing for the newly added data.

+-------+--------------------+---------+
|deal_id|           deal_name|Unique_id|
+-------+--------------------+---------+
| 613760|ABCDEFGHI           |        1|
| 613740|TEST123             |        2|
| 598946|OMG                 |        3|
+-------+--------------------+---------+

Say I get more data tomorrow. I want to append it to this data frame, and the unique id should continue at 4 and so on.

+-------+--------------------+---------+
|deal_id|           deal_name|Unique_id|
+-------+--------------------+---------+
| 613760|ABCDEFGHI           |        1|
| 613740|TEST123             |        2|
| 598946|OMG                 |        3|
| 591234|OM21                |        4|
| 988217|Otres               |        5|
+-------+--------------------+---------+
...

Code Snippet

from pyspark.sql import functions as F

deals_df_final = deals_df.withColumn("Unique_id", F.monotonically_increasing_id())

But this didn't give sequential ids.

I can try `row_number` or RDD `zipWithIndex`, but the dataframe is immutable, so I'm not sure how to extend the sequence when new data arrives.

Any help, please? I want to be able to generate the ids and also increment them as and when data is added.

  • How is the `Unique_id` column different from the dataframe [index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html)? – S.Au.Ra.B.H Dec 18 '19 at 22:58
  • I am relatively new to this concept, but correct me if I am wrong: I want the unique_id to be generated sequentially starting from 1, and it should always be associated with the deal_id column. Not sure if generating an index would serve that purpose, but again, correct me if I am wrong. That's one part of the problem; the second part is the tricky one for me: after generating the unique ids for the first set of data, I want to be able to append data to this data frame and generate unique ids for the newly appended data, continuing the sequence. Again, not sure if we can do that with the data frame index. – Naveen Subramanian Dec 19 '19 at 00:18
  • the _default_ [dataframe index](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) is an auto-increment "column" that starts with 0. One can change the index to start from 1 using [this](https://stackoverflow.com/questions/20167930/start-index-at-1-for-pandas-dataframe) answer. – S.Au.Ra.B.H Dec 19 '19 at 01:04
  • Yes, it would increment. – S.Au.Ra.B.H Dec 19 '19 at 01:20
  • Yeah, a pandas dataframe doesn't have the ability to run on a cluster. I am looking for a solution for a PySpark dataframe. – Naveen Subramanian Dec 19 '19 at 16:07

1 Answer


Very brief note if it helps - I had the same problem, and the second example in this post helped me: https://kb.databricks.com/sql/gen-unique-increasing-values.html

My current in-progress code:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window as W

# Assign a monotonically increasing (but not consecutive) id first
df_with_increasing_id = df.withColumn("monotonically_increasing_id", F.monotonically_increasing_id())

# row_number over a window ordered by that id gives consecutive ids 1, 2, 3, ...
# (note: a window with no partitionBy pulls all rows into a single partition)
window = W.orderBy(F.col("monotonically_increasing_id"))
df_with_consecutive_increasing_id = df_with_increasing_id.withColumn("increasing_id", F.row_number().over(window))
df = df_with_consecutive_increasing_id.drop("monotonically_increasing_id")

# Now find the maximum value in the `increasing_id` column of the current
# dataframe before appending new rows
previous_max_id = df.agg({"increasing_id": "max"}).collect()[0]
previous_max_id = previous_max_id["max(increasing_id)"]

# CREATE NEW ROWS HERE
# then create ids for them the same way as originally, shift those ids by
# the previous maximum, and union / vertically concatenate the result with
# the old dataframe to get the combined one
df.withColumn("consecutive_increase", F.col("increasing_id") + F.lit(previous_max_id)).show()
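The offset at the end is what makes the incremental part work. A minimal plain-Python sketch of the same logic (no Spark needed), with hypothetical values standing in for the existing data and tomorrow's batch:

```python
# Existing rows already have sequential ids 1..3
existing_ids = [1, 2, 3]
previous_max = max(existing_ids)  # the previous_max_id lookup above

# The new batch gets fresh row numbers 1..2 (as row_number() would assign),
# then each is shifted by the previous maximum
new_row_numbers = [1, 2]
new_ids = [rn + previous_max for rn in new_row_numbers]  # [4, 5]

# Union of old and new keeps the sequence gap-free
combined = existing_ids + new_ids
print(combined)  # [1, 2, 3, 4, 5]
```

Re-running the same two steps (take the max, shift the next batch) on each append keeps extending the sequence.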