I have the pyspark code below. I’m trying to create a new match_id that is sequentially increasing. The old match_id was random unique values. The match_id indicates matched pairs of products in the data. So I’m sorting the original data by id. There can be multiple id corresponding to one match_id. I’m taking the first id for each match_id. Then sorting that smaller dataframe of first id and match_id, by id. Then I’m creating a new monotonically increasing id. Then I join back to the original dataframe on match_id, so I can have the original dataframe with the new monotonically increasing match_id. The problem that I’m having is that the new match_id is getting to values larger than the number of records in either dataframe. How can that be possible? Shouldn’t the new_match_id just equal the row number from id_to_match_id plus 1? So the largest new_match_id should be 377729? Can someone please explain what is causing these large new_match_id and suggest how I can create the new_match_id correctly?
code:
# partition by match_id and get first id
w2 = Window().partitionBy("match_id").orderBy(new_matched_df.id.asc())
id_to_match_id=new_matched_df\
.select(col("match_id"), first("id",True).over(w2).alias('id')).distinct()
# creating new match_id
from pyspark.sql.functions import monotonically_increasing_id
id_to_match_id=id_to_match_id.sort('id',ascending=True).withColumn('new_match_id',(monotonically_increasing_id()+1))
new_matched_df2=new_matched_df
# replacing old match_id with new match_id
new_matched_df2=new_matched_df2.alias('a')\
.join(id_to_match_id.alias('b'),
(col('a.match_id')==col('b.match_id')),
how='inner'
)\
.select(col('a.storeid'),
col('a.product_id'),
col('a.productname'),
col('a.productbrand'),
col('a.producttype'),
col('a.productsubtype'),
col('a.classification'),
col('a.weight'),
col('a.unitofmeasure'),
col('a.id'),
col('b.new_match_id').alias('match_id'))
id_to_match_id.sort('new_match_id',ascending=False).show()
print(new_matched_df2.count())
print(id_to_match_id.count())
Output:
+------------+------+-------------+
| match_id| id| new_match_id|
+------------+------+-------------+
|412316878198|864316|1709396985719|
|412316878188|864306|1709396985718|
|412316878183|864301|1709396985717|
|412316878182|864300|1709396985716|
|412316878181|864299|1709396985715|
|412316878178|864296|1709396985714|
|412316878177|864295|1709396985713|
|412316878175|864293|1709396985712|
|412316878174|864292|1709396985711|
|412316878169|864287|1709396985710|
|412316878160|864278|1709396985709|
|412316878156|864274|1709396985708|
|412316878154|864272|1709396985707|
|412316878149|864267|1709396985706|
|412316878148|864266|1709396985705|
|412316878146|864264|1709396985704|
|412316878145|864263|1709396985703|
|412316878143|864261|1709396985702|
|412316878136|864254|1709396985701|
|412316878135|864253|1709396985700|
+------------+------+-------------+
864302
377728