
I have two DataFrames in Spark SQL (D1 and D2).

I am trying to inner join them with D1.join(D2, "some column") and get back only the data of D1, not the complete data set.

Both D1 and D2 have the same columns.

Could someone please help me with this?

I am using Spark 1.6.

– Avi

4 Answers

Let's say you want to join on the "id" column. Then you could write:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables the $"column" syntax

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
– cheseaux
  • Hi, this answer helps. I have a silly question: what does the $ sign imply? I get an error when I try to write the $ sign in my code. – Avi Aug 02 '16 at 13:32
  • 3
    It's an alternative way of accessing DataFrame columns, as explained [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Column.html), but you have to import the implicits package (I've edited my answer) – cheseaux Aug 02 '16 at 13:42
  • 1
    Well, there is a "leftsemi" join option as well. It includes rows from the left table which have a matching row on the right. Also no need to dedup as the matching rows are included only once. – user238607 Oct 02 '17 at 04:12
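To illustrate user238607's point about duplicates, here is a small sketch with hypothetical data (d1 has one row with id 1, d2 has two rows matching it); it assumes the implicits import from the answer above:

val d1 = Seq((1, "a")).toDF("id", "value")
val d2 = Seq((1, "x"), (1, "y")).toDF("id", "value")

// Inner join + select: d1's row comes back once per match in d2 (2 rows here),
// so you may still need to deduplicate.
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*").count() // 2

// leftsemi: each matching d1 row comes back exactly once.
d1.join(d2, d1("id") === d2("id"), "leftsemi").count() // 1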

You could use left_semi:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")

A semi-join returns only the rows from the left dataset for which the joining condition is met.

There's also another interesting join type: left_anti, which works similarly to left_semi but keeps only those rows where the condition is not met.
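For completeness, the same call with left_anti, which would return the rows of d1 that have no match in d2 (a sketch under the same setup):

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti")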

– Krzysztof Atłasik

As an alternative answer, you could also do the following without adding aliases:

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)): _*) // keep every column, resolved against d1
– nsanglar
  • Hi @nsanglar, thanks, this was super helpful. However, if I have a derived column like `.withColumn("derived_id", coalesce(d2("id"), d1("id")))` and I want this column along with all the columns of d1, then how would I modify the select statement? (A sketch follows below.) – riyaB Jun 04 '20 at 15:16
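One possible way to handle riyaB's case is to append the derived column to the sequence of d1's columns before selecting. This is an untested sketch; the derived_id name and the coalesce expression come from the comment above:

import org.apache.spark.sql.functions.coalesce

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)) :+ coalesce(d2("id"), d1("id")).as("derived_id"): _*)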

You should use a leftsemi join, which is similar to an inner join; the difference is that a leftsemi join returns all columns from the left dataset and ignores all columns from the right dataset.

You can try something like the following in Scala to join Spark DataFrames using the leftsemi join type:

empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "leftsemi")
  .show(false)

If you are using Python, the equivalent PySpark example looks like this:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftsemi") \
    .show(truncate=False)
– NNK