-1

How do I join a dataframe with itself using left outer (all from first matching from second)?

Not sure if this is correct

df.alias('d1').join(df.alias('d2'), how = 'leftouter')

Edit 1

df = spark.read.parquet(file)
dfSort = df.sort(col('ID').asc(), col('Date').asc())

dfIndex = dfSort.withColumn('Index', monotonically_increasing_id())
                .withColumn('IndexNext', col('Index')+1)
                .withColumn('AccountIndex', concat(col('ID'),lit('-'), col('Index')
                .withColumn('AccountIndexNext', concat(col('ID'),lit('-'), col('IndexNext')
                .drop('Index', 'Index Next')

dfJoined = dfIndex.alias('d1').join(dfIndex.alias('d2'), df1.AccountIndexNext == df2.AccountIndex, 'leftouter').dropDuplicates()

This takes a while to run but does it make sense?

chintan s
  • 6,170
  • 16
  • 53
  • 86
  • Your question is not entirely clear, at least to me.... but if you want to have all values in the first column with all the values in the second column, it works... only consider delete the duplicated values – Samir Hinojosa Feb 22 '23 at 15:24
  • Thanks @SamirHinojosa PowerBI has this option to use `left outer (all from first matching from second)` so I want to implement the same in PySpark. – chintan s Feb 22 '23 at 15:26

1 Answers1

1

Here is an example

data = [("JUAN","CHINTAN"), 
        ("JHON","USER"), 
        ("LAURA","CARLA"), 
        ("HELOISE","STEVE"),
        ("CALI","JULIA"), 
        ("STALIN","JULIA"), 
        ("DAVID","MATH")
      ]

dataColumns = ["A","B"]
dataDF = spark.createDataFrame(data=data, schema=dataColumns)

dataDF.show()

+-------+-------+
|      A|      B|
+-------+-------+
|   JUAN|CHINTAN|
|   JHON|   USER|
|  LAURA|  CARLA|
|HELOISE|  STEVE|
|   CALI|  JULIA|
| STALIN|  JULIA|
|  DAVID|   MATH|
+-------+-------+

After merging

dataDF.select("A").join(dataDF.select("B"), how = 'leftouter').dropDuplicates().show(50)
+-------+-------+
|      A|      B|
+-------+-------+
|   JUAN|CHINTAN|
|   JUAN|  STEVE|
|   JUAN|   USER|
|   JUAN|  JULIA|
|   JUAN|   MATH|
|   JUAN|  CARLA|
|   JHON|CHINTAN|
|   JHON|   USER|
|   JHON|  CARLA|
|   JHON|  JULIA|
|   JHON|   MATH|
|   JHON|  STEVE|
|  LAURA|   MATH|
|  LAURA|  STEVE|
|  LAURA|   USER|
|  LAURA|  JULIA|
|  LAURA|CHINTAN|
|  LAURA|  CARLA|
|HELOISE|   USER|
|HELOISE|  JULIA|
|HELOISE|   MATH|
|HELOISE|CHINTAN|
|HELOISE|  CARLA|
|HELOISE|  STEVE|
|   CALI|CHINTAN|
|   CALI|  JULIA|
|   CALI|   MATH|
|   CALI|  STEVE|
|   CALI|  CARLA|
|   CALI|   USER|
| STALIN|   MATH|
| STALIN|   USER|
| STALIN|  STEVE|
| STALIN|  JULIA|
| STALIN|  CARLA|
| STALIN|CHINTAN|
|  DAVID|   USER|
|  DAVID|  STEVE|
|  DAVID|  JULIA|
|  DAVID|CHINTAN|
|  DAVID|   MATH|
|  DAVID|  CARLA|
+-------+-------+

As I mentioned before, consider using .dropDuplicates()

Samir Hinojosa
  • 825
  • 7
  • 24