Spark Window - How to compare first row with nth row of a data frame?

Question

I have a dataframe as shown below. I need to take the rank of the row whose clm_typ is 'PD', compute the difference between that rank and the rank of the current row, and add the result as a new column.

Source dataframe:

Id        svc_dt    clm_typ  rank
48115882  20180209  RV       1
48115882  20180209  RJ       2
48115882  20180216  RJ       3
48115882  20180302  RJ       4
48115882  20180402  PD       5
48115882  20180502  RJ       6

Expected resultant dataframe:

Id        svc_dt    clm_typ  rank  diff_PD_Rank
48115882  20180209  RV       1     4     (Current rank - rank of column with 'PD')
48115882  20180209  RJ       2     3
48115882  20180216  RJ       3     2
48115882  20180302  RJ       4     1
48115882  20180402  PD       5     null
48115882  20180502  RJ       6     null

score 1 · Accepted Answer · answered May 20 '19 at 21:04


PySpark solution.

Assuming there is exactly one row with clm_typ 'PD' per Id, you can use conditional aggregation, max(when(...)) over a window, to get the necessary difference.

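# necessary imports
from pyspark.sql import Window
from pyspark.sql.functions import col, max as max_, row_number, when

# number the rows within each Id, ordered by service date
w1 = Window.partitionBy('Id').orderBy('svc_dt')
df = df.withColumn('rnum', row_number().over(w1))
# unordered window covering the whole Id partition
w2 = Window.partitionBy('Id')
# rnum of the 'PD' row minus the current row's rnum
res = df.withColumn('diff_pd_rank', max_(when(col('clm_typ') == 'PD', col('rnum'))).over(w2) - col('rnum'))
res.show()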
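
To sanity-check the numbers without a Spark session, the same conditional-aggregation idea can be sketched in plain Python on the sample rows from the question. This is only an illustration, not the PySpark code itself: the helper name add_diff_pd_rank is hypothetical, and masking zero and negative differences to None is an assumption made to match the nulls in the expected output (the window expression above returns 0 for the 'PD' row and a negative number for later rows).

```python
# Plain-Python sketch of the window logic; sample rows copied from the question.
rows = [
    {"Id": 48115882, "svc_dt": "20180209", "clm_typ": "RV", "rank": 1},
    {"Id": 48115882, "svc_dt": "20180209", "clm_typ": "RJ", "rank": 2},
    {"Id": 48115882, "svc_dt": "20180216", "clm_typ": "RJ", "rank": 3},
    {"Id": 48115882, "svc_dt": "20180302", "clm_typ": "RJ", "rank": 4},
    {"Id": 48115882, "svc_dt": "20180402", "clm_typ": "PD", "rank": 5},
    {"Id": 48115882, "svc_dt": "20180502", "clm_typ": "RJ", "rank": 6},
]


def add_diff_pd_rank(rows):
    """Hypothetical helper: add diff_PD_Rank to each row, partitioned by Id."""
    # Counterpart of max(when(clm_typ == 'PD', rank)) over a partition by Id:
    # only 'PD' rows contribute, and the largest such rank per Id is kept.
    pd_rank = {}
    for r in rows:
        if r["clm_typ"] == "PD":
            pd_rank[r["Id"]] = max(pd_rank.get(r["Id"], r["rank"]), r["rank"])

    out = []
    for r in rows:
        target = pd_rank.get(r["Id"])  # None when the Id has no 'PD' row
        diff = None if target is None else target - r["rank"]
        # Assumption: the 'PD' row and the rows after it should show null.
        if diff is not None and diff <= 0:
            diff = None
        out.append({**r, "diff_PD_Rank": diff})
    return out


result = add_diff_pd_rank(rows)
for r in result:
    print(r["svc_dt"], r["clm_typ"], r["rank"], r["diff_PD_Rank"])
```

In PySpark, a when(...) without an otherwise(...) evaluates to null when the condition is false, so a similar mask could presumably be applied by wrapping the difference in when(diff > 0, diff).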


Vamsi Prabhala


This worked. I have edited my question and have posted again. Need help. – Premkumar May 21 '19 at 14:34
that is unfair and invalidates my answer..you should post a new question instead and undo the changes made to the question. – Vamsi Prabhala May 21 '19 at 14:41
Am Sorry. Will do it now. – Premkumar May 21 '19 at 14:55
Hi Vamsi, as you suggested. I have added my new question here: https://stackoverflow.com/questions/56241454/how-to-find-the-difference-between-1st-row-and-nth-row-of-a-dataframe-based-on-a – Premkumar May 21 '19 at 15:59
