I am trying to do two joins on two Spark dataframes, after which I want to keep every entry of the second dataframe and only the matched rows from the first dataframe. What I have so far is this:

    val join1 = blacklist.where($"RULE_TYPE".equalTo("S")).join(data,$"DEVICEID" === $"AleDeviceId", "rightouter")
    val join2 = blacklist.where($"RULE_TYPE".equalTo("M")).join(data,$"MODULESN" === $"ModuleSerialNumber" && $"DEVICEID" === $"AleDeviceId", "rightouter")

I used a rightouter join because it is my understanding this will accomplish what is described above. The issue is that after joining these I'd like to combine the results into one dataframe with the following:

    join1
      .union(join2)

This, however, duplicates records: because both joins are right outer joins, every row of data appears in both join1 and join2. Is there a way to do this without getting duplicate records in the final dataframe?

Thanks

Trevor Tracy
CaroV1x3n

2 Answers

Please try something like this:

    df = left.join(right, ["name"])

More description on this link

darthsidious

You can combine both conditions together.

    val join = blacklist.join(data, ($"RULE_TYPE" === "S" && $"DEVICEID" === $"AleDeviceId") ||
        ($"RULE_TYPE" === "M" && $"MODULESN" === $"ModuleSerialNumber" && $"DEVICEID" === $"AleDeviceId"), "rightouter")
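To see how a single join with both conditions behaves, here is a minimal, self-contained sketch. The sample rows, the object name CombinedJoinSketch and the local SparkSession settings are assumptions made up for illustration; only the column names come from the question. It is not a tested drop-in for the real data.

```scala
import org.apache.spark.sql.SparkSession

object CombinedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("combined-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; only the column names come from the question.
    val blacklist = Seq[(String, String, Option[String])](
      ("S", "dev1", None),
      ("M", "dev2", Some("sn2"))
    ).toDF("RULE_TYPE", "DEVICEID", "MODULESN")

    val data = Seq(
      ("dev1", "snA"),
      ("dev2", "sn2"),
      ("dev3", "snC")
    ).toDF("AleDeviceId", "ModuleSerialNumber")

    // One right outer join covering both rule types, so each row of data
    // is emitted by a single join instead of once per join and then unioned.
    val joined = blacklist.join(
      data,
      ($"RULE_TYPE" === "S" && $"DEVICEID" === $"AleDeviceId") ||
        ($"RULE_TYPE" === "M" && $"MODULESN" === $"ModuleSerialNumber" &&
          $"DEVICEID" === $"AleDeviceId"),
      "rightouter")

    joined.show()

    // Holds for this sample because no data row matches more than one rule.
    assert(joined.count() == data.count())

    spark.stop()
  }
}
```

If a device can match both an S rule and an M rule, it will still appear once per matching rule. In that case, something like dropDuplicates on the key columns of data may be needed afterwards, depending on which blacklist row should win.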
Kaushal