I have a Spark dataframe:
> df
+---+
| id|
+---+
|  a|
|  b|
|  c|
+---+
I want to obtain all pairs of values from the id column, so I need to cross join the dataframe with itself. However, I want to exclude symmetric duplicates (in my case a,b is the same pair as b,a, and so on).
If I run df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show(),
I get the following output:
+---+---+
|id1|id2|
+---+---+
| a| a|
| a| b|
| a| c|
| b| a|
| b| b|
| b| c|
| c| a|
| c| b|
| c| c|
+---+---+
But the desired output is:
+---+---+
|id1|id2|
+---+---+
| a| a|
| a| b|
| a| c|
| b| b|
| b| c|
| c| c|
+---+---+
In SQL I could achieve this result with something like (note the <= comparison, which keeps the diagonal pairs a,a, b,b, c,c that appear in the desired output):
select df1.*, df2.*
from df df1 join df df2
on df1.id <= df2.id
How can I implement this using PySpark?