120

Let's say I have a Spark data frame df1 with several columns (among which the column id), and a data frame df2 with two columns, id and other.

Is there a way to replicate the following command:

sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

by using only pyspark functions such as join(), select() and the like?

I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
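
For concreteness, a rough sketch of the kind of function I'd like to write (the name is just illustrative):

def with_other(df1, df2):
    # should return all columns of df1 plus df2.other, joined on id,
    # using only DataFrame methods (no sqlContext parameter)
    ...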

blackbishop
Francesco Sambo

15 Answers

118

The asterisk (*) works with an alias. For example:

from pyspark.sql.functions import *

df1 = df1.alias('df1')
df2 = df2.alias('df2')

df1.join(df2, df1.id == df2.id).select('df1.*')
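
If you also need the single column from df2 (as in the question), the alias-qualified name should work in the same select — a small extension of the snippet above, reusing the aliases already set:

df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')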
cronoik
maxcnunes
79

Not sure if it's the most efficient way, but this worked for me:

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])

The trick is in:

  • [col('a.'+xx) for xx in a.columns]: all columns of a

  • [col('b.other1'), col('b.other2')]: some columns of b
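
Note that `a.columns` in the snippet should be `df1.columns` — the alias `'a'` exists only inside Spark, not as a Python variable (a later answer below ran into exactly this). Addressing the comment about passing a list of columns from b, a sketch generalizing the same trick (`other1`/`other2` are placeholders):

from pyspark.sql.functions import col

b_cols = ['other1', 'other2']  # whichever columns of df2 you need

result = (
    df1.alias('a')
       .join(df2.alias('b'), col('a.id') == col('b.id'))
       .select([col('a.' + c) for c in df1.columns] +
               [col('b.' + c) for c in b_cols])
)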
Ramesh Maharjan
Pablo Estevez
    In spark2, I had to change this to col('b.id') == col('a.id') (with two equals signs). Otherwise, it gives me a 'SyntaxError: keyword can't be an expression' exception – void Jul 29 '17 at 06:43
  • Hi, How can I pass multiple columns as a list instead of individual cols like this [col('b.other1'),col('b.other2')] for df2 dataset – Manu Sharma Jul 04 '20 at 05:32
76

Without using an alias:

df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
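
Regarding the comment below about same-named columns: as far as I understand, df1["*"] is resolved against df1 by reference rather than by name, so it expands to df1's columns even if df2 happens to have columns with the same names. A small sketch with made-up data (assuming a SparkSession named spark):

# both frames deliberately share the column names id and value
df1 = spark.createDataFrame([(1, 'a')], ['id', 'value'])
df2 = spark.createDataFrame([(1, 'b')], ['id', 'value'])

# resulting columns: id, value (from df1), value (from df2)
df1.join(df2, df1.id == df2.id).select(df1['*'], df2['value']).show()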
    I notice that when joined dataframes have same-named column names, doing `df1["*"]` in the select method correctly gets the columns from that dataframe even if `df2` had columns with some of the same names as `df1`. Would you mind explaining (or linking to docs on) how this works? – lampShadesDrifter Dec 23 '19 at 20:21
    This is the current (2022) best answer IMHO – e.thompsy Dec 15 '22 at 17:18
12

Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.

a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
    
c = a.join(b, a.a_id == b.b_id).select(a["*"],b["other"])

Then, c.show() yields:

+----+-----+-----+
|a_id|extra|other|
+----+-----+-----+
|   a|  foo|   p1|
|   b|  hem|   p2|
|   c|  haw|   p3|
+----+-----+-----+
Katya Willard
    Well, the OP has asked for selection of only few cols, ie. filteration, the answer has all the columns after join. – Viv Apr 16 '19 at 14:45
  • Please modify your answer to include the missing ```select()``` to include only required columns after ```join()``` – Prasad Nadiger Apr 04 '23 at 07:45
9

I believe that this would be the easiest and most intuitive way:

final = (df1.alias('df1').join(df2.alias('df2'),
                               on = df1['id'] == df2['id'],
                               how = 'inner')
                         .select('df1.*',
                                 'df2.other')
)
Xehron
5

Drop the duplicate b_id column after the join:

c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
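
As a side note, if the join key had the same name in both dataframes, joining on a list of column names would avoid the duplicate column in the first place — a sketch using the question's id column:

c = df1.join(df2, on=['id'])  # the output keeps a single id column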
Selvaraj S.
4

Here is a code snippet that does an inner join, selects columns from both dataframes, and aliases the column that exists in both (Name) to a different name:

emp_df = spark.read.csv('Employees.csv', header=True)
dept_df = spark.read.csv('dept.csv', header=True)


emp_dept_df = emp_df.join(dept_df,'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':

+---+---------+------+------+
| ID|     Name|Salary|DeptID|
+---+---------+------+------+
|  1|     John| 20000|     1|
|  2|    Rohit| 15000|     2|
|  3|    Parth| 14600|     3|
|  4|  Rishabh| 20500|     1|
|  5|    Daisy| 34000|     2|
|  6|    Annie| 23000|     1|
|  7| Sushmita| 50000|     3|
|  8| Kaivalya| 20000|     1|
|  9|    Varun| 70000|     3|
| 10|Shambhavi| 21500|     2|
| 11|  Johnson| 25500|     3|
| 12|     Riya| 17000|     2|
| 13|    Krish| 17000|     1|
| 14| Akanksha| 20000|     2|
| 15|   Rutuja| 21000|     3|
+---+---------+------+------+

Output for 'dept_df.show()':
+------+----------+
|DeptID|      Name|
+------+----------+
|     1|     Sales|
|     2|Accounting|
|     3| Marketing|
+------+----------+

Join Output:
+---+---------+------+------+----------+
| ID|     Name|Salary|DeptID|     DName|
+---+---------+------+------+----------+
|  1|     John| 20000|     1|     Sales|
|  2|    Rohit| 15000|     2|Accounting|
|  3|    Parth| 14600|     3| Marketing|
|  4|  Rishabh| 20500|     1|     Sales|
|  5|    Daisy| 34000|     2|Accounting|
|  6|    Annie| 23000|     1|     Sales|
|  7| Sushmita| 50000|     3| Marketing|
|  8| Kaivalya| 20000|     1|     Sales|
|  9|    Varun| 70000|     3| Marketing|
| 10|Shambhavi| 21500|     2|Accounting|
| 11|  Johnson| 25500|     3| Marketing|
| 12|     Riya| 17000|     2|Accounting|
| 13|    Krish| 17000|     1|     Sales|
| 14| Akanksha| 20000|     2|Accounting|
| 15|   Rutuja| 21000|     3| Marketing|
+---+---------+------+------+----------+
Sunil
1

I got an error: 'a not found' using the suggested code:

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])

I changed a.columns to df1.columns and it worked out.
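
For reference, the corrected snippet (just applying the fix described above; other1/other2 are kept from the original answer):

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')])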

1

A function to drop duplicate columns after joining:

def dropDupeDfCols(df):
    newcols = []
    dupcols = []

    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)

    # rename columns to positional names so duplicates can be dropped by index
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))

    return df.toDF(*newcols)
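
Usage would look something like this (a sketch; the function keeps the first occurrence of each duplicated column name):

joined = df1.join(df2, df1.id == df2.id)
deduped = dropDupeDfCols(joined)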
aamirmalik124
1

With some of the answers above I got an ambiguous column exception (this happens when the two dataframes share column names; I am also using Spark on Databricks). The following worked for me:

df_join = df1.join(df2, (df1.a == df2.a) & (df1.b == df2.b), "inner").select([df1[c] for c in df1.columns] + [df2[c] for c in df2.columns])
AbhishekB
0

I just dropped the columns I didn't need from df2 and joined:

sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
Note that id needs to be in `columns_of_interest`.
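
A minimal concrete version of the same idea, using the question's column names:

columns_of_interest = ['id', 'other']  # id has to be included for the join
sliced_df = df2.select(columns_of_interest)
result = df1.join(sliced_df, on=['id'], how='left')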
0
df1.join(df2, ['id']).drop(df2.id)
Morteza
0

If you need multiple columns from the other PySpark dataframe, you can use this:

Based on a single join condition:

x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])

Based on multiple join conditions:

x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])
STKasha
0

I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in Databricks, and presumably works in a typical Spark environment (replacing `spark` with `sqlContext`):

df.createOrReplaceTempView('t1') #temp table t1
df2.createOrReplaceTempView('t2') #temp table t2

output = (
          spark.sql("""
                    select
                      t1.*
                      ,t2.desired_field(s)
                    from 
                      t1
                    left (or inner) join t2 on t1.id = t2.id
                    """
                   )
          )
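
A concrete instance of that template, filled in with the question's column names (other being the single column wanted from df2):

df1.createOrReplaceTempView('t1')
df2.createOrReplaceTempView('t2')

output = spark.sql("""
    select
      t1.*,
      t2.other
    from t1
    inner join t2 on t1.id = t2.id
""")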
Tom Renish
-1

You could just do the join and then select the wanted columns: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join

Erica
  • My question is exactly how to select all columns from one data frame (without enumerating them one by one) and one column from the other – Francesco Sambo Mar 23 '16 at 13:51