I am given an RDD. Example: test = sc.parallelize([(1,0), (2,0), (3,0)])
I need to get the Cartesian product and remove the resulting tuple pairs whose two entries are identical. In this toy example those would be ((1, 0), (1, 0)), ((2, 0), (2, 0)), ((3, 0), (3, 0)).
I can get the Cartesian product as follows (NOTE: the collect and print statements are there ONLY for troubleshooting):
def compute_cartesian(rdd):
    # sorted(...collect()) is only here so the troubleshooting output is readable
    result1 = sc.parallelize(sorted(rdd.cartesian(rdd).collect()))
    print(type(result1))
    print(result1.collect())
    return result1
My type and output at this stage are correct:
<class 'pyspark.rdd.RDD'>
[((1, 0), (1, 0)), ((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)), ((2, 0), (2, 0)), ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0)), ((3, 0), (3, 0))]
But now I need to remove the three pairs of tuples whose entries are identical.
Tried so far:
- .distinct(): runs, but returns the nine pairs unchanged. In hindsight that makes sense: distinct() removes repeated elements of the RDD, and all nine pairs are already distinct from one another (see the quick check after this list).
- .dropDuplicates(): will not run. I assume this is an incorrect usage, since dropDuplicates() belongs to the DataFrame API rather than to RDDs.
- Manual function:
Without an RDD this task is easy:
# result is the plain Python list of pairs (e.g. from a collect())
# Remove pairs whose two entries are identical.
# Iterate over a copy: removing from a list while looping over it skips elements.
for elem in list(result):
    if elem[0] == elem[1]:
        result.remove(elem)
print(result)
print("After: ", len(result))
This snippet removes the duplicate tuple pairs and then prints the resulting length so I could do a sanity check.
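The quick check referenced in the list above, confirming why .distinct() is a no-op here (reusing test and sc from the setup):
pairs = test.cartesian(test)
print(pairs.count())             # 9 pairs in the full Cartesian product
print(pairs.distinct().count())  # also 9 -- every pair is already unique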
I am just not sure how to perform this operation directly on the RDD, in this case removing the duplicate tuple pairs resulting from the Cartesian product, and get back an RDD.
Yes, I could .collect() it, do the work in plain Python, and then re-parallelize the result, but that defeats the purpose: suppose this were billions of pairs. I need to perform the operation on the RDD and return an RDD.
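For reference, a minimal sketch of a transformation-only approach, assuming a filter() with a pair-wise inequality test is acceptable (the helper name is just for illustration, and the trailing collect() is only there to inspect the toy output):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
test = sc.parallelize([(1, 0), (2, 0), (3, 0)])

def compute_cartesian_no_dupes(rdd):
    # cartesian() and filter() are both transformations, so the data
    # stays distributed; nothing is pulled back to the driver here.
    return rdd.cartesian(rdd).filter(lambda pair: pair[0] != pair[1])

result = compute_cartesian_no_dupes(test)
print(sorted(result.collect()))  # six pairs: the nine-pair product minus the three self-pairs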