
I'm a bit baffled.

A simple rdd.count() gives different results when run multiple times.

Here is the code I run:

// inputConfig is a Hadoop Configuration describing the MongoDB input source
val inputRdd = sc.newAPIHadoopRDD(inputConfig,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Long],
  classOf[org.bson.BSONObject])

println(inputRdd.count())

It opens a connection to a MongoDB server and simply counts the objects. Seems pretty straightforward to me.

According to MongoDB, there are 3,349,495 entries.

Here is my Spark output; all runs used the same jar:

spark1 :    3,257,048  
spark2 :    3,303,272  
spark3 :    3,303,272  
spark4 :    3,303,272  
spark5 :    3,303,271  
spark6 :    3,303,271  
spark7 :    3,303,272  
spark8 :    3,303,272  
spark9 :    3,306,300  
spark10:    3,303,272  
spark11:    3,303,271  

Spark and MongoDB run on the same cluster.
We are running:

Spark version 1.5.0-cdh5.6.1  
Scala version 2.10.4  
MongoDB version 2.6.12  

Unfortunately, we cannot update these.

Is Spark non-deterministic?
Can anyone enlighten me?

Thanks in advance

EDIT / Further info
I just noticed an error in our mongod.log. Could this error cause the inconsistent behaviour?

[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet syncing to: hadoop05:27017
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet error RS102 too stale to catch up, at least from hadoop05:27017
[rsBackgroundSync] replSet our last optime : Jul  2 10:19:44 57777920:111
[rsBackgroundSync] replSet oldest at hadoop05:27017 : Jul  5 15:17:58 577bb386:59
[rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
[rsBackgroundSync] replSet error RS102 too stale to catch up
PeterLudolf
  • Did you check the number of entries in MongoDB several times (in parallel to running the Spark `count()`)? – Yaron Jan 25 '17 at 14:56
  • The number of entries in MongoDB didn't change while it was running. And thanks for reformatting :) – PeterLudolf Jan 25 '17 at 20:00
  • a) What's your MongoDB deployment topology? (Replica set or sharded cluster?) Perhaps the Spark workers return different answers depending on which MongoDB members they read from, i.e. some of the members haven't replicated the data yet. b) MongoDB v2.6 reached its end of life in October 2016; please upgrade whenever possible. – Wan B. Feb 24 '17 at 00:29

2 Answers


As you already spotted, the problem does not appear to be with Spark (or Scala) but with MongoDB.

As such, the question regarding the difference seems to be resolved.

You will still want to troubleshoot the actual MongoDB error; the linked page is a good starting point for that: http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
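
If the replica set does turn out to be the culprit, one mitigation while you repair it is to pin reads to the primary so that a stale secondary never serves part of the scan. A minimal sketch, assuming the mongo-hadoop connector's mongo.input.uri property (which takes a standard connection string); the database and collection names here are placeholders, and the hosts are taken from your log:

import org.apache.hadoop.conf.Configuration

// Read only from the primary so stale members (like the RS102 one
// in the log) cannot contribute out-of-date documents to the count.
val inputConfig = new Configuration()
inputConfig.set("mongo.input.uri",
  "mongodb://hadoop04:27017,hadoop05:27017/mydb.mycollection?readPreference=primary")

val inputRdd = sc.newAPIHadoopRDD(inputConfig,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Long],
  classOf[org.bson.BSONObject])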

Dennis Jaheruddin

MongoDB's `count` returns an estimated count taken from collection metadata. As such, the value returned can change even if the number of documents hasn't changed.

`countDocuments` was added in MongoDB 4.0 to provide an accurate count (one that also works in multi-document transactions).
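
For illustration, a minimal sketch with the MongoDB Scala driver against a 4.0+ server, contrasting the two calls; the connection string, database and collection names are placeholders:

import org.mongodb.scala._
import scala.concurrent.Await
import scala.concurrent.duration._

val client = MongoClient("mongodb://localhost:27017")
val coll = client.getDatabase("mydb").getCollection("mycollection")

// Estimated count: served from collection metadata; fast, but can drift.
val estimated = Await.result(coll.estimatedDocumentCount().toFuture(), 10.seconds)

// Accurate count: actually counts matching documents (MongoDB 4.0+).
val accurate = Await.result(coll.countDocuments().toFuture(), 10.seconds)

println(s"estimated: $estimated, accurate: $accurate")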

D. SM