Multiple joins on Spark dataframes duplicate records

Question

I am trying to do two joins on two Spark dataframes, after which I want to keep the entries in the second dataframe, and only the matched results from the first dataframe. What I have so far is this:

    val join1 = blacklist.where($"RULE_TYPE".equalTo("S")).join(data,$"DEVICEID" === $"AleDeviceId", "rightouter")
    val join2 = blacklist.where($"RULE_TYPE".equalTo("M")).join(data, $"MODULESN" === $"ModuleSerialNumber" && $"DEVICEID" === $"AleDeviceId", "rightouter")

I used a rightouter join because it is my understanding this will accomplish what is described above. The issue is that after joining these I'd like to combine the results into one dataframe with the following:

    join1
      .union(join2)

This, however, will duplicate records from each output from the joins. Is there a way to do this without getting duplicate records in the final dataframe?

Thanks

Not the most efficient, but should work: `join1.union(join2).distinct()` — Travis Hegner, Jun 12 '18 at 18:32
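
The `union` plus `distinct()` idea from the comment above can be tried on toy data. The following is only a sketch: the rows, the object name `UnionDistinctSketch`, and the helper `unionOfJoins` are made up for illustration, and only the column names come from the question. One thing it suggests is that `distinct()` removes rows that are identical in every column, so a `data` row that matched in one join but not in the other would still appear twice, once with the blacklist columns filled in and once with nulls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionDistinctSketch {
  // Builds the two right outer joins from the question and unions them.
  def unionOfJoins(blacklist: DataFrame, data: DataFrame): DataFrame = {
    val sRules = blacklist.where(blacklist("RULE_TYPE") === "S")
    val mRules = blacklist.where(blacklist("RULE_TYPE") === "M")

    val join1 = sRules.join(
      data, sRules("DEVICEID") === data("AleDeviceId"), "rightouter")
    val join2 = mRules.join(
      data,
      mRules("MODULESN") === data("ModuleSerialNumber") &&
        mRules("DEVICEID") === data("AleDeviceId"),
      "rightouter")

    join1.union(join2)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows: one "S" rule for dev1, one "M" rule for dev2/sn2.
    val blacklist = Seq(
      ("S", "dev1", "sn1"),
      ("M", "dev2", "sn2")
    ).toDF("RULE_TYPE", "DEVICEID", "MODULESN")

    // Hypothetical rows: dev3 matches no rule at all.
    val data = Seq(
      ("dev1", "snX"),
      ("dev2", "sn2"),
      ("dev3", "sn3")
    ).toDF("AleDeviceId", "ModuleSerialNumber")

    val unioned = unionOfJoins(blacklist, data)
    println(unioned.count())            // 6: every data row appears once per join
    println(unioned.distinct().count()) // 5: only the all-null dev3 row collapses
    unioned.distinct().show()

    spark.stop()
  }
}
```

With these rows, dev1 and dev2 each remain twice after `distinct()`, so this approach may still need a further filtering step depending on what the final dataframe should look like.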

score 0 · Answer 1 · answered Jun 12 '18 at 18:24

Please try something like this:

    df = left.join(right, ["name"])

More description on this link

darthsidious

score 0 · Answer 2 · answered Jun 12 '18 at 20:50

You can combine both conditions together.

    val join = blacklist.join(data,
      ($"RULE_TYPE" === "S" && $"DEVICEID" === $"AleDeviceId") ||
      ($"RULE_TYPE" === "M" && $"MODULESN" === $"ModuleSerialNumber" && $"DEVICEID" === $"AleDeviceId"),
      "rightouter")
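
To see what the single combined join returns, here is a minimal sketch on made-up rows. The object name `CombinedJoinSketch`, the helper `combinedJoin`, and the sample values are assumptions for illustration; only the column names and the join condition come from this thread. It assumes both device-ID comparisons are meant to be column-to-column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CombinedJoinSketch {
  // One right outer join whose condition covers both rule types.
  def combinedJoin(blacklist: DataFrame, data: DataFrame): DataFrame = {
    val sameDevice = blacklist("DEVICEID") === data("AleDeviceId")
    val sRule = blacklist("RULE_TYPE") === "S" && sameDevice
    val mRule = blacklist("RULE_TYPE") === "M" &&
      blacklist("MODULESN") === data("ModuleSerialNumber") &&
      sameDevice
    blacklist.join(data, sRule || mRule, "rightouter")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("combined-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows: one "S" rule for dev1, one "M" rule for dev2/sn2.
    val blacklist = Seq(
      ("S", "dev1", "sn1"),
      ("M", "dev2", "sn2")
    ).toDF("RULE_TYPE", "DEVICEID", "MODULESN")

    // Hypothetical rows: dev3 matches no rule and keeps null blacklist columns.
    val data = Seq(
      ("dev1", "snX"),
      ("dev2", "sn2"),
      ("dev3", "sn3")
    ).toDF("AleDeviceId", "ModuleSerialNumber")

    val joined = combinedJoin(blacklist, data)
    println(joined.count()) // 3: one output row per data row for this input
    joined.show()

    spark.stop()
  }
}
```

Each `data` row shows up once here because no row matches more than one blacklist entry. If a device matched several blacklist rows, the join would still return one row per match.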

Kaushal

I like this. I'll give this a try. Thanks – CaroV1x3n Jun 13 '18 at 20:10
