How do I join a DataFrame with itself using a left outer join (all rows from the first, matching rows from the second)?
Not sure if this is correct — note it has no join condition yet, so there is nothing to match the rows on:

df.alias('d1').join(df.alias('d2'), how='leftouter')
Edit 1
from pyspark.sql.functions import col, concat, lit, monotonically_increasing_id

df = spark.read.parquet(file)
dfSort = df.sort(col('ID').asc(), col('Date').asc())
dfIndex = (dfSort
    .withColumn('Index', monotonically_increasing_id())
    .withColumn('IndexNext', col('Index') + 1)
    .withColumn('AccountIndex', concat(col('ID'), lit('-'), col('Index')))
    .withColumn('AccountIndexNext', concat(col('ID'), lit('-'), col('IndexNext')))
    .drop('Index', 'IndexNext'))
dfJoined = (dfIndex.alias('d1')
    .join(dfIndex.alias('d2'),
          col('d1.AccountIndexNext') == col('d2.AccountIndex'),
          'leftouter')
    .dropDuplicates())
This takes a while to run, but does it make sense?