
PySpark gives me slightly odd results after dropDuplicates and joining the datasets. The situation is that there are two very large datasets: one with people's IDs and some variables, and a second one with their region_code.

first dataset:

ID|VAR1|VAR2|VAR3|VAR4|VAR5|
1|-----|----|---|---|----|
2|-----|----|---|---|----|
3|-----|----|---|---|----|
4|-----|----|---|---|----|

second dataset:

ID|region_code|
1|7|
2|5|
1|9|
4|7|

The result I'm getting after the following code is:

file_1 = file_1.dropDuplicates(["ID"])
file_2 = file_2.dropDuplicates(["ID"])
file_2.filter("ID == '1'").show()

ID|region_code|
1|7|

After joining the files I'm expecting:

merge_file = file_1.join(file_2, "ID", "left")

ID|VAR1|VAR2|VAR3|VAR4|VAR5|region_code|
1|-----|----|---|---|----|7|
2|-----|----|---|---|----|5|
3|-----|----|---|---|----|null|
4|-----|----|---|---|----|7|

but I got:

merge_file.filter("ID == '1'").show()

ID|VAR1|VAR2|VAR3|VAR4|VAR5|region_code|
1|-----|----|---|---|----|9|

Why am I getting these inconsistent results?
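
For reference, here is a minimal, self-contained sketch of the setup above (the VAR columns are omitted and the sample values are toy data; which row survives dropDuplicates can differ from run to run):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the two tables above (VAR columns left out)
file_1 = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["ID"])
file_2 = spark.createDataFrame([(1, 7), (2, 5), (1, 9), (4, 7)],
                               ["ID", "region_code"])

file_2 = file_2.dropDuplicates(["ID"])
file_2.filter("ID == '1'").show()        # may keep region_code 7 here...

merge_file = file_1.join(file_2, "ID", "left")
merge_file.filter("ID == '1'").show()    # ...and region_code 9 here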

default_settings
  • I'm not sure `dropDuplicates` is the function you're looking for here. In your example `ID == 1` has two associated regions. Dropping duplicate `ID` would arbitrarily select a region for each `ID` then drop the others, which isn't a reliable behavior. – vielkind May 31 '18 at 13:42
  • Is your suggestion to use `file_2.dropDuplicates(["ID", "region_code"])`? And why does the filter applied to file_2 after dropDuplicates give `ID|region_code| 1|7|`? – default_settings May 31 '18 at 13:51

1 Answer


PySpark is lazily evaluated. Your code is only executed when you call an action (e.g. show(), count(), etc.). In your code example you are creating file_2. Instead of thinking of file_2 as an object living in memory, file_2 is really just a set of instructions that tells the PySpark engine the processing steps. When you call file_2.filter("ID == '1'").show(), those instructions are executed (including dropDuplicates()) to generate the output.
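
If all you need is for both actions to see the same deduplicated rows, one workaround (a sketch; cache() trades memory for consistency here) is to persist the result of dropDuplicates() so the plan is materialized once and, in practice, the surviving rows are pinned:

# Without cache(), each action below re-executes the full plan,
# so dropDuplicates() may keep a different row each time.
file_2 = file_2.dropDuplicates(["ID"]).cache()

file_2.filter("ID == '1'").show()        # materializes and caches the result
merge_file = file_1.join(file_2, "ID", "left")
merge_file.filter("ID == '1'").show()    # reuses the cached rows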

When you create merge_file you are referencing the processing steps for file_2, which can be evaluated differently than in your previous example. Because of PySpark's lazy execution there is no guarantee that dropDuplicates() will drop the same rows every time. This is why you are getting different results between the two DataFrames.

This is also why I would suggest you think about your data and what you want to accomplish with your merge, because dropDuplicates() is not a reliable method if the relationship between ID and region is important.
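
If exactly one region per ID is required and the choice must be reproducible, a common alternative is an explicit window ordering (a sketch; ordering by region_code is an arbitrary choice here, pick whatever tie-breaker fits your data):

from pyspark.sql import Window
from pyspark.sql import functions as func

# Keep the row with the smallest region_code per ID -- any explicit
# ordering makes the result deterministic, unlike dropDuplicates().
w = Window.partitionBy("ID").orderBy(func.col("region_code"))

file_2_dedup = (
    file_2
    .withColumn("rn", func.row_number().over(w))
    .filter(func.col("rn") == 1)
    .drop("rn")
)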

vielkind
  • Thank you @vielkind for the explanation. Which function would you recommend for dropping duplicated records from file_2? I want to achieve only one record per ID in file_2. – default_settings May 31 '18 at 14:16
  • You could do the following: `file_2.groupby(func.col("ID")).agg(func.collect_set(func.col("region_code")).alias("arr_region"))`, which will create one record per `ID` that will contain an array of all distinct `region_code` associated with the `ID`, then perform your `merge` (see the runnable sketch below). – vielkind May 31 '18 at 14:22
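
For completeness, a runnable sketch of the collect_set() approach from the last comment (the func alias and the final join are filled in as assumptions):

from pyspark.sql import functions as func

# One row per ID, with all distinct region codes gathered into an array
file_2_regions = (
    file_2
    .groupby(func.col("ID"))
    .agg(func.collect_set(func.col("region_code")).alias("arr_region"))
)

merge_file = file_1.join(file_2_regions, "ID", "left")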