PySpark is giving me slightly odd results after dropDuplicates and a join between two datasets. The situation: there are two very large datasets, one with people's IDs and some variables, and a second one with their region_code.
first dataset:
ID|VAR1|VAR2|VAR3|VAR4|VAR5|
1|-----|----|---|---|----|
2|-----|----|---|---|----|
3|-----|----|---|---|----|
4|-----|----|---|---|----|
second dataset:
ID|region_code|
1|7|
2|5|
1|9|
4|7|
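For reference, here is a minimal sketch of how the two DataFrames could be built to reproduce this (the VAR values are just placeholders; in reality both files are read from much larger sources):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# file_1: IDs plus some variables (placeholder values)
file_1 = spark.createDataFrame(
    [(1, "a", "b", "c", "d", "e"),
     (2, "a", "b", "c", "d", "e"),
     (3, "a", "b", "c", "d", "e"),
     (4, "a", "b", "c", "d", "e")],
    ["ID", "VAR1", "VAR2", "VAR3", "VAR4", "VAR5"],
)

# file_2: IDs and region_code, with a duplicate entry for ID 1
file_2 = spark.createDataFrame(
    [(1, 7), (2, 5), (1, 9), (4, 7)],
    ["ID", "region_code"],
)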
The result I get after running the following code is:
file_1 = file_1.dropDuplicates(["ID"])
file_2 = file_2.dropDuplicates(["ID"])
file_2.filter("ID == '1'").show()
ID|region_code|
1|7|
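For what it's worth, a check along these lines (just a sketch) could confirm whether any ID still appears more than once in file_2 after dropDuplicates:

# Rows whose ID occurs more than once after dropDuplicates;
# an empty result would mean every ID is unique.
file_2.groupBy("ID").count().filter("count > 1").show()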
After joining the files like this, I'm expecting:
merge_file = file_1.join(file_2, "ID", "left")
ID|VAR1|VAR2|VAR3|VAR4|VAR5|region_code|
1|-----|----|---|---|----|7|
2|-----|----|---|---|----|5|
3|-----|----|---|---|----|null|
4|-----|----|---|---|----|7|
but this is what I actually get:
merge_file.filter("ID == '1'").show()
ID|VAR1|VAR2|VAR3|VAR4|VAR5|region_code|
1|-----|----|---|---|----|9|
I'm very curious about these strange results: why does ID 1 end up with region_code 9 after the join, when filtering file_2 right after dropDuplicates showed only the row with region_code 7?