
I am trying to join two dataframes that share the following named columns. What's the best way to do a LEFT OUTER join?

df = df.join(df_forecast, ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"], 'left')

Currently, I get this error:

You're trying to access a column, but multiple columns have that name.

What am I missing?

x89
  • Do you wanna keep columns with the same name from both sides? – ARCrow Feb 23 '23 at 20:58
  • Yeah I guess that's the only way @ARCrow – x89 Feb 24 '23 at 07:04

1 Answer

join_keys = ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"]

df = (
    df
    .join(df_forecast, join_keys, 'left')
    .select(
        *join_keys,
        # select columns from the left side of the join that are not join keys
        *[df[element].alias('df_'+element) for element in df.columns if element not in join_keys],
        # select columns from the right side of the join that are not join keys
        *[df_forecast[element].alias('df_forecast_'+element) for element in df_forecast.columns if element not in join_keys]
    )
)
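To see why this removes the ambiguity, here is a plain-Python sketch of how the select list is assembled; the non-key column names (`PAGE_VIEWS`, `FORECAST`) are hypothetical stand-ins for whatever non-key columns your dataframes actually carry:

```python
join_keys = ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"]

# Hypothetical column lists for each side of the join.
df_columns = join_keys + ["PAGE_VIEWS"]         # left side
df_forecast_columns = join_keys + ["FORECAST"]  # right side

# Prefix each non-key column with its source dataframe's name,
# mirroring the .alias() calls above, so every output name is unique.
selected = (
    join_keys
    + ["df_" + c for c in df_columns if c not in join_keys]
    + ["df_forecast_" + c for c in df_forecast_columns if c not in join_keys]
)
print(selected)
# → ['D_ACCOUNTS_ID', 'D_APPS_ID', 'D_CONTENT_PAGE_ID',
#    'df_PAGE_VIEWS', 'df_forecast_FORECAST']
```

Because the join keys are passed as a list of names, Spark coalesces them into a single column each, and the aliasing takes care of every remaining duplicate.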
ARCrow