7

A question about inconsistency in Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice:

imp_sample.where(col("location").isNotNull()).count()

And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:

imp_sample.where(col("location").isNull()).count()

and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!

user3245256
  • 1,842
  • 4
  • 24
  • 51
  • That shouldn't happen, how do you populate `imp_sample`? – Alex Dec 02 '17 at 22:05
  • 1
    @Jaco - so it's important how I populate it? I mean - why should it be important? I have a long code that modifies it again and again. But once it's populated, the same command yields different results. Similarly, when I do the final imp_sample count, write that file out as a parquet file and then read it in - I am also getting a slightly different number of rows! – user3245256 Dec 03 '17 at 12:55
  • 1
    @Jaco I've been thinking about your question and want to thank you for it. Maybe you can provide it as an answer so that I could upvote it? Right before I do the count, I do sampling: sampled_impressions = impressions3.sampleBy("click_status", fractions={0: 0.037, 1: 1}, seed=0) - I guess there is some error due to rounding because I have 70 million rows. So, every time I execute count after this line, the results are slightly different. Correct? – user3245256 Dec 03 '17 at 15:09

2 Answers

2

Ok, I have suffered majorly from this in the past. I had a seven or eight stage pipeline that normalised a couple of tables, added ids, joined them and grouped them. Consecutive runs of the same pipeline gave different results, although not in any coherent pattern I could understand.

Long story short, I traced this behaviour to my use of the function monotonically_increasing_id, supposedly resolved by this JIRA ticket, but still evident in Spark 2.2.

I do not know exactly what your pipeline does, but my fix was to force Spark to persist the results after calling monotonically_increasing_id. I never saw the issue again after I started doing this.

Let me know if a judicious persist resolves this issue.

To persist an RDD or DataFrame, call either df.cache() (which defaults to in-memory persistence) or df.persist(<some storage level>), for example:

from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)

Again, it may not help you, but in my case it forced Spark to materialise and write out the id values, which were behaving non-deterministically across repeated invocations of the pipeline.
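
As a rough sketch of that fix (the DataFrame and column names here are illustrative, not from the original pipeline), persisting immediately after the id assignment pins the ids down so that later actions reuse the same values:

from pyspark.sql import functions as F
from pyspark import StorageLevel

# Assign the ids, then persist so they are computed once and reused,
# rather than potentially re-evaluated on every subsequent action.
df_with_id = df.withColumn("row_id", F.monotonically_increasing_id())
df_with_id.persist(StorageLevel.DISK_ONLY)
df_with_id.count()  # an action to materialise the persisted data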

Chondrops
  • 728
  • 1
  • 4
  • 14
  • Thank you! In my comment in response to Jaco above, I am hypothesizing that the different number of rows I am getting is due to the fact that I sample right before the count. I do fix the seed, but maybe the rounding leads to a slightly different result each time. This being said, I also tried to use monotonically_increasing_id and after it produced crappy results (when I applied it to 2 DFs of the same height, the ids were not consecutive), I stopped. But how do you force Spark to persist results? What's the code for that? Thank you! – user3245256 Dec 03 '17 at 15:17
  • Added an example - very curious if this sorts your problem out! – Chondrops Dec 03 '17 at 22:02
  • This, `df.persist(StorageLevel.DISK_ONLY)`, solved an issue I had with Bucketizer() giving inconsistent results at every run. – mamonu Aug 16 '18 at 12:09
  • Thank you! Your post helped me identify an issue I was debugging. Btw, if there is memory pressure, data cached via df.cache (with the default persistence level) can be evicted, thereby forcing the `monotonically_increasing_id` function to be re-evaluated. So the disk option might be safer. – qrslt Jan 31 '20 at 06:35
2

As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fractions of rows. It takes a sample where each record is included with the probability given in fractions, so the resulting count can vary from run to run.
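
For illustration, here is the call from the comment thread (the impressions3 DataFrame and click_status column come from that comment); the count of the result only approximates the requested fractions rather than matching them exactly:

# Stratified sample: keep roughly 3.7% of rows where click_status == 0
# and all rows where click_status == 1.
sampled_impressions = impressions3.sampleBy("click_status", fractions={0: 0.037, 1: 1.0}, seed=0)
sampled_impressions.count()  # roughly 0.037 * (rows with 0) + (rows with 1)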

Regarding your monotonically_increasing_id question in the comments, it only guarantees that the next id is larger than the previous one; however, it doesn't guarantee that the ids are consecutive (i, i+1, i+2, etc.).
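
A small sketch of that behaviour (assuming a SparkSession named spark; the exact output values are illustrative):

from pyspark.sql import functions as F

# Spread a few rows over two partitions, then assign ids.
df = spark.range(4).repartition(2)
df.withColumn("id", F.monotonically_increasing_id()).show()
# The ids increase, but the partition index is encoded in the upper bits,
# so they jump between partitions, e.g. 0, 1, 8589934592, 8589934593.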

Finally, you can persist a DataFrame by calling persist() on it.

Alex
  • 21,273
  • 10
  • 61
  • 73