
I have a Spark dataframe:

> df
+---+
| id|
+---+
|  a|
|  b|
|  c|
+---+

I want to obtain all pairs from the id column, so I need to cross join the dataframe with itself. However, I want to exclude symmetric results (in my case a,b == b,a, etc.).

If I apply df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show(), I obtain the following output:

+---+---+
|id1|id2|
+---+---+
|  a|  a|
|  a|  b|
|  a|  c|
|  b|  a|
|  b|  b|
|  b|  c|
|  c|  a|
|  c|  b|
|  c|  c|
+---+---+

But desired output is:

+---+---+
|id1|id2|
+---+---+
|  a|  a|
|  a|  b|
|  a|  c|
|  b|  b|
|  b|  c|
|  c|  c|
+---+---+

In SQL I could achieve such result using something like this:

select df1.*, df2.*
from df df1 join df df2
     on df1.id <= df2.id

How can I implement this using PySpark?

red_quark

1 Answer


You can use a range join:

df.withColumnRenamed('id', 'id1').createOrReplaceTempView("df1")

df.withColumnRenamed('id', 'id2').createOrReplaceTempView("df2")

spark.sql(
"""SELECT *
FROM df1, df2
WHERE df1.id1 = df2.id2
  OR df1.id1 < df2.id2""").show()

Or you can go your way: cross join (THIS IS A VERY EXPENSIVE OPERATION), build a sorted array from id1 and id2, and use it to drop duplicates. I wouldn't advise the cross join.

from pyspark.sql.functions import array, array_sort

(df.withColumnRenamed('id', 'id1')
   .crossJoin(df.withColumnRenamed('id', 'id2'))
   .withColumn('filter', array_sort(array('id1', 'id2')))  # canonical form of each pair
   .dropDuplicates(['filter'])
   .drop('filter')
   .show())

+---+---+
|id1|id2|
+---+---+
|  a|  a|
|  a|  b|
|  a|  c|
|  b|  b|
|  b|  c|
|  c|  c|
+---+---+
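
Note that dropDuplicates keeps an arbitrary row from each duplicate group, so for a pair like (a, b) / (b, a) there is no guarantee which orientation survives; if you need id1 <= id2 in every output row, the join-condition approach above enforces that directly.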
wwnde