I'm using PySpark 2.1.0.
I'm attempting a left outer join of two DataFrames, whose schemas are as follows:
crimes
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- CRIME_TYPE: string (nullable = true)
|-- CURRENT_OUTCOME: string (nullable = true)
outcomes
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- FINAL_OUTCOME: string (nullable = true)
I need to join crimes to outcomes with a left outer join, since many outcomes can exist for a single crime. In the result I want to keep every column from crimes plus FINAL_OUTCOME from outcomes, without duplicating the columns common to both DataFrames (CRIME_ID and YEAR_MTH).
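For reference, here is a minimal setup that mirrors those schemas; the sample rows are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy rows matching the crimes schema above
crimes = spark.createDataFrame(
    [('a1', '2017-01', 'Burglary', 'Under investigation')],
    ['CRIME_ID', 'YEAR_MTH', 'CRIME_TYPE', 'CURRENT_OUTCOME'])

# toy rows matching the outcomes schema; note two outcomes for one crime
outcomes = spark.createDataFrame(
    [('a1', '2017-01', 'Suspect charged'),
     ('a1', '2017-03', 'Offender fined')],
    ['CRIME_ID', 'YEAR_MTH', 'FINAL_OUTCOME'])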
I have tried the following two approaches, but each generates an error:
cr_outs = crimes.join(outcomes, crimes.CRIME_ID == outcomes.CRIME_ID, 'left_outer')\
.select(['crimes.'+c for c in crimes.columns] + ['outcomes.FINAL_OUTCOME'])
import pyspark.sql.functions as fn

cr_outs = crimes.alias('a').join(outcomes.alias('b'), fn.col('b.CRIME_ID') == fn.col('a.CRIME_ID'), 'left_outer')\
    .select([fn.col('a.' + c) for c in a.columns] + b.FINAL_OUTCOME)
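To make the goal concrete, the result I'm after keeps all the crimes columns and appends FINAL_OUTCOME, i.e. a schema like:

result
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- CRIME_TYPE: string (nullable = true)
|-- CURRENT_OUTCOME: string (nullable = true)
|-- FINAL_OUTCOME: string (nullable = true)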
Could anybody suggest an alternative way? Thanks.
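Update: the closest I've got is a plain join with no select, which does run, but it leaves duplicate CRIME_ID and YEAR_MTH columns in the result, which is exactly what I'm trying to avoid:

# runs, but the result contains CRIME_ID and YEAR_MTH twice
cr_outs = crimes.join(outcomes, crimes.CRIME_ID == outcomes.CRIME_ID, 'left_outer')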