I am trying to run two joins between two Spark DataFrames. In each join I want to keep every row of the second DataFrame (data) and only the matching rows from the first (blacklist). What I have so far is this:
val join1 = blacklist.where($"RULE_TYPE".equalTo("S")).join(data,$"DEVICEID" === $"AleDeviceId", "rightouter")
val join2 = blacklist.where($"RULE_TYPE".equalTo("M")).join(data,$"MODULESN" === $"ModuleSerialNumber" && $"DEVICEID" === $"AleDeviceId", "rightouter")
I used a rightouter join because, as I understand it, that keeps all rows from the right side (data) and only the matching rows from the left side (blacklist). After the joins I would like to combine the two results into one DataFrame like this:
join1
.union(join2)
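This, however, produces duplicates: each right outer join keeps every row of data, so every data row shows up in both join outputs and therefore more than once in the union. Is there a way to combine the results without getting duplicate records in the final DataFrame?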
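For illustration, here is a minimal sketch of one possible direction, not a confirmed solution: fold both rules into a single join condition so that each unmatched row of data is kept only once. The SparkSession setup and the toy rows are assumptions made up for this example, and it is written as it would be typed into spark-shell or a notebook; only the column names and the rightouter join type come from the code above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for the example; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("blacklist-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data. "-" is just a placeholder for an unused MODULESN.
val blacklist = Seq(
  ("S", "d1", "-"),   // S rule: match on device id only
  ("M", "d2", "m2")   // M rule: match on device id and module serial number
).toDF("RULE_TYPE", "DEVICEID", "MODULESN")

val data = Seq(
  ("d1", "m1"),  // matches the S rule
  ("d2", "m2"),  // matches the M rule
  ("d2", "m9"),  // same device as the M rule, different module: no match
  ("d3", "m3")   // no match
).toDF("AleDeviceId", "ModuleSerialNumber")

// Both rules require the device ids to be equal, so that part is shared;
// the M rule additionally requires the module serial numbers to be equal.
val ruleMatches =
  $"DEVICEID" === $"AleDeviceId" &&
    ($"RULE_TYPE" === "S" ||
      ($"RULE_TYPE" === "M" && $"MODULESN" === $"ModuleSerialNumber"))

// One right outer join: every row of data is kept, with blacklist columns
// filled in where a rule matches and null otherwise.
val joined = blacklist.join(data, ruleMatches, "rightouter")
joined.show()
```

With this toy data the result has four rows, one per row of data, two of which carry a matched rule. A data row that matches both an S rule and an M rule would still appear once per matching rule, so whether that counts as a duplicate depends on what the final DataFrame is supposed to contain.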
Thanks