
I have the PySpark code below. I'm trying to create a new match_id that increases sequentially. The old match_id consists of random unique values; match_id indicates matched pairs of products in the data, and several ids can correspond to one match_id. So I take the first id for each match_id, sort that smaller dataframe of (match_id, id) by id, and add a new monotonically increasing id. Then I join back to the original dataframe on match_id, so the original dataframe ends up with the new, sequentially increasing match_id.

The problem is that the new match_id reaches values far larger than the number of records in either dataframe. How is that possible? Shouldn't new_match_id just equal the row number from id_to_match_id plus 1, so that the largest new_match_id is 377729? Can someone please explain what is causing these large new_match_id values and suggest how I can create the new_match_id correctly?

code:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, first

# partition by match_id and get the first id for each match_id
w2 = Window().partitionBy("match_id").orderBy(new_matched_df.id.asc())

id_to_match_id = new_matched_df\
    .select(col("match_id"), first("id", True).over(w2).alias('id')).distinct()

# creating new match_id
from pyspark.sql.functions import monotonically_increasing_id

id_to_match_id = id_to_match_id.sort('id', ascending=True)\
    .withColumn('new_match_id', monotonically_increasing_id() + 1)


new_matched_df2=new_matched_df

# replacing old match_id with new match_id
new_matched_df2=new_matched_df2.alias('a')\
.join(id_to_match_id.alias('b'),
     (col('a.match_id')==col('b.match_id')),
      how='inner'
     )\
.select(col('a.storeid'),
        col('a.product_id'),
        col('a.productname'),
        col('a.productbrand'),
        col('a.producttype'),
        col('a.productsubtype'),
        col('a.classification'),
        col('a.weight'),
        col('a.unitofmeasure'),
        col('a.id'),
        col('b.new_match_id').alias('match_id'))


id_to_match_id.sort('new_match_id',ascending=False).show()


print(new_matched_df2.count())

print(id_to_match_id.count())

Output:

+------------+------+-------------+
|    match_id|    id| new_match_id|
+------------+------+-------------+
|412316878198|864316|1709396985719|
|412316878188|864306|1709396985718|
|412316878183|864301|1709396985717|
|412316878182|864300|1709396985716|
|412316878181|864299|1709396985715|
|412316878178|864296|1709396985714|
|412316878177|864295|1709396985713|
|412316878175|864293|1709396985712|
|412316878174|864292|1709396985711|
|412316878169|864287|1709396985710|
|412316878160|864278|1709396985709|
|412316878156|864274|1709396985708|
|412316878154|864272|1709396985707|
|412316878149|864267|1709396985706|
|412316878148|864266|1709396985705|
|412316878146|864264|1709396985704|
|412316878145|864263|1709396985703|
|412316878143|864261|1709396985702|
|412316878136|864254|1709396985701|
|412316878135|864253|1709396985700|
+------------+------+-------------+


864302
377728
  • Could you provide the sample data set before these functions are applied and expected output, would be easier to understand your problem? – H Roy Jan 06 '20 at 05:53

1 Answer


Hi, check the documentation for monotonically_increasing_id for more detail.

According to the documentation:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
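That is why your new_match_id values are so large: the generated value encodes a partition ID and an in-partition record number, not a row number. As a rough sketch, you can decompose one of the values from your own output using the bit layout described above (the -1 undoes the +1 your code adds):

# decompose an observed new_match_id using the documented bit layout
observed = 1709396985719 - 1                        # undo the +1 added in the question's code
partition_id = observed >> 33                       # upper bits: partition 199
record_in_partition = observed & ((1 << 33) - 1)    # lower 33 bits: record 1910 in that partition

print(partition_id, record_in_partition)            # 199 1910

So the value depends on which Spark partition a row lands in, which is why it can dwarf the row count.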

What you need is row_number over a window, something like this:

from pyspark.sql.window import Window
from pyspark.sql import functions as fun

# add a (non-consecutive) monotonically increasing id
df2 = df1.withColumn("id", fun.monotonically_increasing_id())

# overwrite it with row_number() so the ids become consecutive 1, 2, 3, ...
windowSpec = Window.orderBy("id")

df2.withColumn("id", fun.row_number().over(windowSpec)).show()

+-----+---+
|value|id |
+-----+---+
|    1|  1|
|    3|  2|
|    9|  3|
|   13|  4|
+-----+---+

This will create the type of id you need.
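Applied to the dataframe from the question, a minimal sketch (assuming id_to_match_id is built as in the question, with one id per match_id) would be:

from pyspark.sql.window import Window
from pyspark.sql import functions as fun

# consecutive 1..N ids ordered by id, replacing the monotonically_increasing_id approach
w = Window.orderBy("id")
id_to_match_id = id_to_match_id.withColumn("new_match_id", fun.row_number().over(w))

Note that a window with no partitionBy moves all rows to a single partition; that is fine for roughly 378k rows, but it will not scale well to very large dataframes.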
