8

I am doing join of 2 data frames and select all columns of left frame for example:

val join_df = first_df.join(second_df, first_df("id") === second_df("id") , "left_outer")

in above I want to do select first_df.* .How can I select all columns of one frame in join ?

user2895589
  • 1,010
  • 4
  • 20
  • 33

4 Answers4

22

With alias:

first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
3

We can also do it with leftsemi join. leftsemi join will select the data from left side dataframe from a joined dataframe.

Here we join two dataframes df1 and df2 based on column col1.

    df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi") 
3

Suppose you:

  1. Want to use the DataFrame syntax.
  2. Want to select all columns from df1 but only a couple from df2.
  3. This is cumbersome to list out explicitly due to the number of columns in df1.

Then, you might do the following:

val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
Bryan Johnson
  • 651
  • 5
  • 5
0

Just to add one possibility, whithout using alias, I was able to do that in pyspark with

   first_df.join(second_df, "id", "left_outer").select( first_df["*"] )

Not sure if applies here, but hope it helps