
I have the following two DataFrames:

DataFrame "dfPromotion":
date        | store
===================
2017-01-01  | 1    
2017-01-02  | 1


DataFrame "dfOther":
date        | store
===================
2017-01-01  | 1    
2017-01-03  | 1    

Later I need to union both of the DataFrames above. But first I have to remove all rows of dfOther whose date value is also contained in dfPromotion.

The result of the following filtering step should look like this:

DataFrame "dfPromotion" (this stays always the same, must not be changed in this step!)
date        | store
===================
2017-01-01  | 1    
2017-01-02  | 1


DataFrame "dfOther" (first row is removed as dfPromotion contains the date 2017-01-01 in the "date" column)
date        | store
===================
2017-01-03  | 1 

Is there a way to do this in Java? I only found the DataFrame.except method for this, but it compares all columns of the DataFrames. I need to filter the second DataFrame by the date column only, as other columns could be added later, which could contain different values...

Calling dfOther.filter(dfOther.col("date").isin(dfPromotion.col("date"))) throws the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) date#64 missing from date#0,store#13 in operator !Filter date#0 IN (date#64);
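
For reference, here is a minimal sketch of how the two example DataFrames can be constructed (assuming Spark 1.6.x, matching the linked docs; the local setup below is simplified for reproducibility):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Simplified local setup; adapt to your own Spark context.
JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("demo").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);

// Both DataFrames share the same two columns: date and store.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("date", DataTypes.StringType, false),
        DataTypes.createStructField("store", DataTypes.IntegerType, false)));

DataFrame dfPromotion = sqlContext.createDataFrame(Arrays.asList(
        RowFactory.create("2017-01-01", 1),
        RowFactory.create("2017-01-02", 1)), schema);

DataFrame dfOther = sqlContext.createDataFrame(Arrays.asList(
        RowFactory.create("2017-01-01", 1),
        RowFactory.create("2017-01-03", 1)), schema);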
D. Müller

2 Answers


You can use the subtract function:

dfOther.select("date").except(dfPromotion.select("date")).join(dfOther, "date").show()
Suresh
  • Sorry, but there's no `DataFrame.subtract` method in the Spark API: https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrame.html – D. Müller Mar 16 '17 at 09:19
  • Oh sorry, I had used it in PySpark and have used subtract on Java RDDs. I had no idea it was removed for DataFrame. Anyway, we have the except function; you can use that in the above line of code instead of subtract. It should mostly work. – Suresh Mar 16 '17 at 10:29
  • The point with the `except` function is that I have to filter the `dfOther` DataFrame by the `date` column only. So I can't use this method, else this would be the easiest way to go. – D. Müller Mar 16 '17 at 10:41
  • I believe we can: dfOther.select("date") returns a DataFrame with just the date column, and similarly for dfPromotion.select("date"). So an except between the two single-column DataFrames will return the dates we need. Just give it a try and let me know if I have missed something. – Suresh Mar 16 '17 at 10:47
  • This returns only two tables, each containing just the one column `date`. But I need all the columns in the result, which are `date` and `store` in the case above. – D. Müller Mar 16 '17 at 10:54
  • This is how you do it: dfOther.select("date").except(dfPromotion.select("date")).join(dfOther, "date").show() – Suresh Mar 16 '17 at 11:36
  • Yes, you're right, the following code did the same for me: `dfOther.select("date").except(dfPromotion.select("date")).join(dfOther, "date");` Thanks for your help! – D. Müller Mar 16 '17 at 13:08
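
Putting the thread together, the whole step might look like this in Java (a sketch continuing from the question's setup; note that unionAll is the Spark 1.x method, renamed union in Spark 2.x):

// Dates that occur in dfOther but not in dfPromotion.
DataFrame datesToKeep = dfOther.select("date")
        .except(dfPromotion.select("date"));

// Join back to dfOther to recover all of its columns (date and store).
DataFrame dfOtherFiltered = datesToKeep.join(dfOther, "date");

// dfPromotion stays unchanged; append the surviving rows of dfOther.
DataFrame result = dfPromotion.unionAll(dfOtherFiltered);
result.show();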

Since you mentioned Spark Hive, can you try a Spark SQL approach like the one below?

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfpromotion = sqlContext.sql("select * from dfpromotion");

dfpromotion.show
+----------+-----+
|        dt|store|
+----------+-----+
|2017-01-01|    1|
|2017-01-02|    1|
+----------+-----+

val dfother = sqlContext.sql("select * from dfother");

dfother.show
+----------+-----+
|        dt|store|
+----------+-----+
|2017-01-01|    1|
|2017-01-03|    1|
+----------+-----+


val dfdiff = sqlContext.sql("select o.dt, o.store from dfpromotion p right outer join dfother o on p.dt = o.dt where p.dt is null");
val dfunion = dfpromotion.unionAll(dfdiff); // DataFrame.union only exists from Spark 2.0 on


scala> dfunion.show
+----------+-----+
|        dt|store|
+----------+-----+
|2017-01-01|    1|
|2017-01-02|    1|
|2017-01-03|    1|
+----------+-----+
vikrame
  • This did it for me, thank you very much. My final code now: `dfPromotion.join(dfOther, dfPromotion.col("date").equalTo(dfOther.col("date")), "right_outer").where(dfPromotion.col("date").isNull()).select(dfOther.col("date"), dfOther.col("store"));` – D. Müller Mar 16 '17 at 09:18
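
For completeness, the same anti-join combined with the union step from the answer might look like this in Java (a sketch; unionAll assumes Spark 1.x, use union from Spark 2.0 on):

// Rows of dfOther whose date has no match in dfPromotion, expressed as
// a right outer join followed by an isNull filter on the left side.
DataFrame dfDiff = dfPromotion
        .join(dfOther, dfPromotion.col("date").equalTo(dfOther.col("date")), "right_outer")
        .where(dfPromotion.col("date").isNull())
        .select(dfOther.col("date"), dfOther.col("store"));

// dfPromotion itself stays unchanged; just append the surviving rows.
DataFrame dfUnion = dfPromotion.unionAll(dfDiff);
dfUnion.show();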