
Is it possible to remove rows when a value in the Block column occurs at least twice with different values in the ID column?

My data looks like this:

ID Block
1 A
1 C
1 C
3 A
3 B

In the above case, the value A in the Block column occurs twice, once with ID 1 and once with ID 3, so those rows are removed.

The expected output should be:

ID Block
1 C
1 C
3 B

I tried to use dropDuplicates after a groupBy, but I don't know how to filter on this type of condition. It seems I would need a set of ID values for each Block to check against.


2 Answers


One way to do it is using window functions. The first one (lag) marks a row if its ID differs from the previous row's within the same Block. The second (sum) spreads that mark to every row of such a Block. Lastly, the marked rows and the helper (_flag) column are deleted.

Input:

from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 'A'),
     (1, 'C'),
     (1, 'C'),
     (3, 'A'),
     (3, 'B')],
    ['ID', 'Block'])

Script:

w1 = W.partitionBy('Block').orderBy('ID')
w2 = W.partitionBy('Block')
# 1 if the ID changes within a Block (ordered by ID), else 0
grp = F.when(F.lag('ID').over(w1) != F.col('ID'), 1).otherwise(0)
# keep a Block only if no ID change was detected anywhere in it
df = df.withColumn('_flag', F.sum(grp).over(w2) == 0) \
    .filter('_flag').drop('_flag')

df.show()
# +---+-----+
# | ID|Block|
# +---+-----+
# |  3|    B|
# |  1|    C|
# |  1|    C|
# +---+-----+
ZygD
  • @ZygD, does lag()/lead() have any impact on distributed rows? The documentation does not provide these details, nor could I reproduce it in a test. ++ for this as it solved my problem, but I still have the above doubt. – Azhar Khan Nov 04 '22 at 13:55
  • @AzharKhan - As far as I know, in order to perform calculations based on partitions, the data belonging to a partition is moved to the same node, so shuffling happens. That is why it's important to partition your data properly; without a `partitionBy` clause, the whole dataframe would be moved onto one node, which is not efficient. `lag`/`lead` are no different in this context than any other function applied over a window. – ZygD Nov 04 '22 at 14:05
  • I am not sure if this is efficient, but I also got the same expected output using the command below. I forgot that I needed a duplicate threshold, so I ended up with this after getting some understanding from ZygD's answer. `duplicate_threshold=2` `df.join(df.groupBy('Block').agg(size(collect_set('ID')).alias('_flag')).filter(col('_flag') – john Nov 04 '22 at 14:30
  • Efficiency-wise, both options look comparable. Depending on data, any of the two ways could be more efficient. – ZygD Nov 04 '22 at 15:10
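
For reference, a minimal sketch of the join-based approach john describes in the comment above. His snippet is cut off, so the comparison against `duplicate_threshold` and the join back onto the original rows are assumptions about how it was likely completed; it reuses the `df` created in the answer's Input.

from pyspark.sql import functions as F

duplicate_threshold = 2  # assumed threshold, per john's comment

# Blocks that appear with fewer distinct IDs than the threshold are kept
blocks_to_keep = (
    df.groupBy('Block')
      .agg(F.size(F.collect_set('ID')).alias('_flag'))
      .filter(F.col('_flag') < duplicate_threshold)
      .select('Block')
)

df.join(blocks_to_keep, on='Block', how='inner').select('ID', 'Block').show()

Both this and the window-function script above shuffle data by Block, so, as ZygD notes, which one is faster depends on the data.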

Use window functions: get ranks per group of Blocks and throw away any rows that rank higher than 1. Code below.

from pyspark.sql import Window
from pyspark.sql.functions import col, rank, row_number

(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('ID', 'Block')))  # index to restore the original order afterwards
 .withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID')))  # rank per Block
 .orderBy('index')
 .where(col('BlockRank') == 1)
 .drop('index', 'BlockRank')
).show()

+---+-----+
| ID|Block|
+---+-----+
|  1|    A|
|  1|    C|
|  1|    C|
|  3|    B|
+---+-----+
wwnde