I have a dataframe billplan, which is a pyspark.pandas DataFrame. In the first version of the code I convert it to a pandas DataFrame and do the self-merges with pd.merge:
# TS_BILLPLAN : PREV & PREV_PREV attributes
import pandas as pd
import pyspark.pandas as ps

billplan = billplan.to_pandas()
# Sequential row number after sorting, materialised as an 'index' column.
billplan = billplan.sort_values(['SUBSCR_NO','BP_START','BP_END'], na_position='last').reset_index(drop=True).reset_index()
billplan['index_minus1'] = billplan['index'] + 1
billplan['index_minus2'] = billplan['index'] + 2
prev_cols = ['BP_DESC','BP_GROUP_0','BP_GROUP_1','BP_GROUP_2','BP_TIER','BP_START','BP_END','BP_SUBSCR_TYPE','BP_SOURCE','BP_RC_CHARGE_NET']
# Row i joins row i-1 (whose index_minus1 == i) and row i-2 (whose index_minus2 == i).
billplan = pd.merge(billplan, billplan[['SUBSCR_NO','index_minus1'] + prev_cols],
                    left_on=['SUBSCR_NO','index'], right_on=['SUBSCR_NO','index_minus1'],
                    how='left', suffixes=['','_PREV'])
billplan = pd.merge(billplan, billplan[['SUBSCR_NO','index_minus2'] + prev_cols],
                    left_on=['SUBSCR_NO','index'], right_on=['SUBSCR_NO','index_minus2'],
                    how='left', suffixes=['','_PREV_PREV'])
billplan = billplan.drop(columns=['index','index_minus1','index_minus2','index_minus1_PREV','index_minus2_PREV_PREV'])
billplan = ps.from_pandas(billplan)
print('Nr Rows:', billplan.shape)
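For context, here is a minimal, self-contained sketch of the same self-merge pattern on toy data, which behaves as expected in plain pandas (the frame and column values are simplified stand-ins for my real schema):

import pandas as pd

# Toy data: two subscribers with a few bill plans each.
df = pd.DataFrame({
    'SUBSCR_NO': [1, 1, 1, 2, 2],
    'BP_START':  [10, 20, 30, 10, 20],
    'BP_DESC':   ['A', 'B', 'C', 'X', 'Y'],
})

# Sequential row number after sorting.
df = df.sort_values(['SUBSCR_NO', 'BP_START']).reset_index(drop=True).reset_index()
df['index_minus1'] = df['index'] + 1

# Row i picks up the attributes of row i-1 for the same subscriber.
df = pd.merge(df, df[['SUBSCR_NO', 'index_minus1', 'BP_DESC']],
              left_on=['SUBSCR_NO', 'index'], right_on=['SUBSCR_NO', 'index_minus1'],
              how='left', suffixes=['', '_PREV'])

# BP_DESC_PREV is NaN on each subscriber's first row, as expected.
print(df[['SUBSCR_NO', 'BP_DESC', 'BP_DESC_PREV']])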
Here is another version where I don't convert to pandas and use ps.merge instead. The two versions are almost exactly the same, apart from the conversion to pandas:
# TS_BILLPLAN : PREV & PREV_PREV attributes
#billplan = billplan.to_pandas()
billplan = billplan.sort_values(['SUBSCR_NO','BP_START','BP_END'], na_position='last').reset_index(drop=True).reset_index()
billplan['index_minus1'] = billplan['index'] + 1
billplan['index_minus2'] = billplan['index'] + 2
prev_cols = ['BP_DESC','BP_GROUP_0','BP_GROUP_1','BP_GROUP_2','BP_TIER','BP_START','BP_END','BP_SUBSCR_TYPE','BP_SOURCE','BP_RC_CHARGE_NET']
billplan = ps.merge(billplan, billplan[['SUBSCR_NO','index_minus1'] + prev_cols],
                    left_on=['SUBSCR_NO','index'], right_on=['SUBSCR_NO','index_minus1'],
                    how='left', suffixes=['','_PREV'])
billplan = ps.merge(billplan, billplan[['SUBSCR_NO','index_minus2'] + prev_cols],
                    left_on=['SUBSCR_NO','index'], right_on=['SUBSCR_NO','index_minus2'],
                    how='left', suffixes=['','_PREV_PREV'])
#billplan = billplan.drop(columns=['index','index_minus1','index_minus2','index_minus1_PREV','index_minus2_PREV_PREV'])
#billplan = ps.from_pandas(billplan)
print('Nr Rows:', billplan.shape)
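And here is the equivalent toy sketch on pandas-on-Spark, in case anyone wants to try to reproduce the problem (I have not confirmed that this small example shows the mismatch; my full dataset does):

import pyspark.pandas as ps

psdf = ps.DataFrame({
    'SUBSCR_NO': [1, 1, 1, 2, 2],
    'BP_START':  [10, 20, 30, 10, 20],
    'BP_DESC':   ['A', 'B', 'C', 'X', 'Y'],
})

psdf = psdf.sort_values(['SUBSCR_NO', 'BP_START']).reset_index(drop=True).reset_index()
psdf['index_minus1'] = psdf['index'] + 1

psdf = ps.merge(psdf, psdf[['SUBSCR_NO', 'index_minus1', 'BP_DESC']],
                left_on=['SUBSCR_NO', 'index'], right_on=['SUBSCR_NO', 'index_minus1'],
                how='left', suffixes=['', '_PREV'])

# Re-sort for display: row order is not preserved across the shuffle.
print(psdf.sort_values(['SUBSCR_NO', 'BP_START'])[['SUBSCR_NO', 'BP_DESC', 'BP_DESC_PREV']])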
The pandas version gives the correct end result, but the pyspark.pandas version does not. The latter produces exactly the same index_minus1 and index_minus2 values as pandas, yet the merges simply don't attach the previous / prev-prev rows.
Any idea why this happens? Is it because pyspark.pandas executes DataFrame operations across multiple nodes?
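Since I suspect the default index, here is how I understand it can be inspected and pinned, in case that turns out to be the culprit (a sketch; ps.get_option / ps.set_option and compute.default_index_type are part of the pandas-on-Spark config API):

import pyspark.pandas as ps

# Which default index pandas-on-Spark attaches when one is needed.
print(ps.get_option('compute.default_index_type'))

# 'sequence' forces a deterministic 0..n-1 index (computed on a single
# node); 'distributed-sequence' and 'distributed' are computed in
# parallel and, as I understand it, are not guaranteed to be stable
# across recomputations of a lazy plan.
ps.set_option('compute.default_index_type', 'sequence')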