
I have two DataFrames in Spark SQL (D1 and D2).

I am trying to inner join them with D1.join(D2, "some column") and get back only the data of D1, not the complete data set.

Both D1 and D2 have the same columns.

Could someone please help me with this?

I am using Spark 1.6.

– Avi

4 Answers

Let's say you want to join on the "id" column. Then you could write:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables the $"column" syntax

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
– cheseaux
  • Hi, this answer helps. I have a silly question: what does the $ sign imply? I get an error when I try to write the $ sign in my code. – Avi Aug 02 '16 at 13:32
  • 3
    It's an alternative way of accessing DataFrame columns, as explained [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Column.html), but you have to import the implicits package (I've edited my answer) – cheseaux Aug 02 '16 at 13:42
  • 1
    Well, there is a "leftsemi" join option as well. It includes rows from the left table which have a matching row on the right. Also no need to dedup as the matching rows are included only once. – user238607 Oct 02 '17 at 04:12
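To illustrate user238607's point about duplicates, here is a small sketch with hypothetical data (d1 has one row with id 1, d2 has two rows matching it); it assumes the implicits import from the answer above:

val d1 = Seq((1, "a")).toDF("id", "value")
val d2 = Seq((1, "x"), (1, "y")).toDF("id", "value")

// Inner join + select: d1's row comes back once per match in d2 (2 rows here),
// so you may still need to deduplicate.
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*").count() // 2

// leftsemi: each matching d1 row comes back exactly once.
d1.join(d2, d1("id") === d2("id"), "leftsemi").count() // 1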

You could use left_semi:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")

A semi-join returns only the rows from the left dataset for which the joining condition is met.

There's also another interesting join type: left_anti, which works similarly to left_semi but keeps only those rows where the condition is not met.
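For completeness, the same call with left_anti, which would return the rows of d1 that have no match in d2 (a sketch under the same setup):

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti")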

– Krzysztof Atłasik

As an alternative answer, you could also do the following without adding aliases:

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)): _*) // keep every column, resolved against d1
– nsanglar
  • Hi @nsanglar, thanks, this was super helpful. However, if I have a derived column like `.withColumn("derived_id", coalesce(d2("id"), d1("id")))` and I want this column along with all the columns of d1, then how would I modify the select statement? (A sketch follows below.) – riyaB Jun 04 '20 at 15:16
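One possible way to handle riyaB's case is to append the derived column to the sequence of d1's columns before selecting. This is an untested sketch; the derived_id name and the coalesce expression come from the comment above:

import org.apache.spark.sql.functions.coalesce

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)) :+ coalesce(d2("id"), d1("id")).as("derived_id"): _*)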

You should use a leftsemi join, which is similar to an inner join; the difference is that a leftsemi join returns all columns from the left dataset and ignores all columns from the right dataset.

You can try something like the following in Scala to join Spark DataFrames using the leftsemi join type:

empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "leftsemi")
  .show(false)

If you are using Python, the equivalent PySpark example looks like this:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftsemi") \
    .show(truncate=False)
– NNK