7

A question about inconsistency in Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice:

imp_sample.where(col("location").isNotNull()).count()

And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:

imp_sample.where(col("location").isNull()).count()

and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!

user3245256
  • 1,842
  • 4
  • 24
  • 51
  • That shouldn't happen, how do you populate `imp_sample`? – Alex Dec 02 '17 at 22:05
  • 1
    @Jaco - so it's important how I populate it? I mean - why should it be important? I have a long code that modifies it again and again. But once it's populated, the same command yields different results. Similarly, when I do the final imp_sample count, write that file out as a parquet file and then read it in - I am also getting a slightly different number of rows! – user3245256 Dec 03 '17 at 12:55
  • 1
    @Jaco I've been thinking about your question and want to thank you for it. Maybe you can provide it as an answer so that I could upvote it? Right before I do the count, I do sampling: sampled_impressions = impressions3.sampleBy("click_status", fractions={0: 0.037, 1: 1}, seed=0) - I guess there is some error due to rounding because I have 70 million rows. So, every time I execute count after this line, the results are slightly different. Correct? – user3245256 Dec 03 '17 at 15:09

2 Answers

2

Ok, I have suffered majorly from this in the past. I had a seven or eight stage pipeline that normalised a couple of tables, added ids, joined them and grouped them. Consecutive runs of the same pipeline gave different results, although not in any coherent pattern I could understand.

Long story short, I traced this behaviour to my use of the function monotonically_increasing_id, supposedly resolved by this JIRA ticket, but still evident in Spark 2.2.

I do not know exactly what your pipeline does, but my fix was to force Spark to persist the results after calling monotonically_increasing_id. I never saw the issue again after I started doing this.

Let me know if a judicious persist resolves this issue.

To persist an RDD or DataFrame, call either df.cache() (which defaults to in-memory persistence) or df.persist(<some storage level>), for example:

from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)

Again, it may not help you, but in my case it forced Spark to materialise and write out the id values, which were behaving non-deterministically across repeated invocations of the pipeline.
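
As a rough sketch of that fix (the DataFrame and column names here are illustrative, not from the original pipeline), persisting immediately after the id assignment pins the ids down so that later actions reuse the same values:

from pyspark.sql import functions as F
from pyspark import StorageLevel

# Assign the ids, then persist so they are computed once and reused,
# rather than potentially re-evaluated on every subsequent action.
df_with_id = df.withColumn("row_id", F.monotonically_increasing_id())
df_with_id.persist(StorageLevel.DISK_ONLY)
df_with_id.count()  # an action to materialise the persisted data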

Chondrops
  • 728
  • 1
  • 4
  • 14
  • Thank you! In my comment in response to Jaco above, I am hypothesizing that the different number of rows I am getting is due to the fact that I sample right before the count. I do fix the seed, but maybe the rounding leads to a slightly different result each time. This being said, I also tried to use monotonically_increasing_id and after it produced crappy results (when I applied it to 2 DFs of the same height, the ids were not consecutive), I stopped. But how do you force Spark to persist results? What's the code for that? Thank you! – user3245256 Dec 03 '17 at 15:17
  • Added an example - very curious if this sorts your problem out! – Chondrops Dec 03 '17 at 22:02
  • This, `df.persist(StorageLevel.DISK_ONLY)`, solved an issue I had with Bucketizer() giving inconsistent results at every run. – mamonu Aug 16 '18 at 12:09
  • Thank you! Your post helped me identify an issue I was debugging. Btw, if there is memory pressure, data cached via df.cache (with the default persistence level) can be evicted, thereby forcing the `monotonically_increasing_id` function to be re-evaluated. So the disk option might be safer. – qrslt Jan 31 '20 at 06:35
2

As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fractions of rows. It takes a sample where each record is included with the probability given in fractions, so the resulting count can vary from run to run.
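
For illustration, here is the call from the comment thread (the impressions3 DataFrame and click_status column come from that comment); the count of the result only approximates the requested fractions rather than matching them exactly:

# Stratified sample: keep roughly 3.7% of rows where click_status == 0
# and all rows where click_status == 1.
sampled_impressions = impressions3.sampleBy("click_status", fractions={0: 0.037, 1: 1.0}, seed=0)
sampled_impressions.count()  # roughly 0.037 * (rows with 0) + (rows with 1)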

Regarding your monotonically_increasing_id question in the comments, it only guarantees that the next id is larger than the previous one; however, it doesn't guarantee that the ids are consecutive (i, i+1, i+2, etc.).
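
A small sketch of that behaviour (assuming a SparkSession named spark; the exact output values are illustrative):

from pyspark.sql import functions as F

# Spread a few rows over two partitions, then assign ids.
df = spark.range(4).repartition(2)
df.withColumn("id", F.monotonically_increasing_id()).show()
# The ids increase, but the partition index is encoded in the upper bits,
# so they jump between partitions, e.g. 0, 1, 8589934592, 8589934593.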

Finally, you can persist a DataFrame by calling persist() on it.

Alex
  • 21,273
  • 10
  • 61
  • 73