I use the Spark mongo-hadoop connector to sync data from a MongoDB collection to an HDFS file. My code works fine if the collection is read through mongos, but when it comes to local.oplog.rs, a replica set collection that can only be read through mongod, it gives me this exception:

Caused by: com.mongodb.hadoop.splitter.SplitFailedException: Unable to calculate input splits: couldn't find index over splitting key { _id: 1 }

I think the document structure is different between oplog.rs and a normal collection: oplog.rs doesn't have an "_id" property, so newAPIHadoopRDD cannot work normally. Is that right?

1 Answer

Yes, the document structure is a bit different in oplog.rs. You will find your actual document in the "o" field of the oplog document.

Example oplog document:

{
    "_id" : ObjectId("586e74b70dec07dc3e901d5f"),
    "ts" : Timestamp(1459500301, 6436),
    "h" : NumberLong("5511242317261841397"),
    "v" : 2,
    "op" : "i",
    "ns" : "urDB.urCollection",
    "o" : {
        "_id" : ObjectId("567ba035e4b01052437cbb27"),
        ....                // this is your original document
    }
}

Use "ns" and "o" of oplog.rs to get your expected collection and document.
