
I have been reading a few articles on MongoDB fault tolerance, and some people complain it can never really be achieved with MongoDB (like in this article: http://hackingdistributed.com/2013/01/29/mongo-ft/), and this got me confused.

Can someone confirm (and if possible show me the appropriate docs) that using the write concern "Journal + Majority" is enough to make sure that 100% of the writes that were reported as successful by my driver are durably written and won't be lost, even if any replica fails just after the write?

I'm talking about a 3-replica setup. I'm OK with the system no longer accepting writes in case of failure, but when a write is reported as successful by the driver, I need it to be durably committed (regardless of the number of replicas failing after that).
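
For reference, here is roughly what I mean at the driver level. This is just a sketch in PyMongo; the connection string, database, and collection names are placeholders, and the exact API will depend on your driver and version:

    # Sketch of the "Journal + Majority" write concern described above.
    # Success should only be reported once a majority of the replica set
    # has acknowledged the write and it has been committed to the journal.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
    orders = client.mydb.get_collection(
        "orders", write_concern=WriteConcern(w="majority", j=True)
    )

    try:
        orders.insert_one({"order_id": 42, "status": "paid"})
        # Acknowledged: I want this to mean the write is durable.
    except PyMongoError:
        # Not acknowledged: the write may or may not have been applied,
        # so it cannot be treated as durable.
        raise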

Flavien
  • If at most one replica fails, then at least one replica will always have the write; as such it will always be durable even if you don't have the journal. In fact, having the journal will do nothing to help you in the event of a failure in a distributed data setup and might encourage manual rollbacks – Sammaye Apr 22 '13 at 13:31
  • Coming back to that article, I have just skim-read it and there are so many flaws in its logic. It was written by a guy who, a. didn't understand why SQL is slow, b. didn't understand the concept of MongoDB, and c. clearly didn't read up on many parts of the documentation, but instead decided to just moan about the default settings that came with MongoDB. No offence to the author, but his article is just another bash by someone who doesn't really know what they are on about. They expected one thing but in reality got another because they didn't research – Sammaye Apr 22 '13 at 13:41
  • Actually I changed my question. I can't have a write reported as succeeded and not durably written, in any situation (even with 2 replicas failing). – Flavien Apr 22 '13 at 13:52
  • If you actually ask MongoDB to ack, then it will either write to a node or, if it cannot, it will error with a "no candidate servers" error; it cannot write to something that does not exist. If it is able to write to a node, it will. – Sammaye Apr 22 '13 at 13:55
  • If I use "Safe + Majority", it might successfully connect to 2 replicas, report success, and then the 2 replicas fail before they get a chance to commit the change to disk. The change is not committed, yet success was reported to the driver. – Flavien Apr 22 '13 at 14:06
  • There is a 60 ms window, but the chance of two systems failing in the same 60 ms without your application server going down with them is unlikely. At that point you will suffer the same problems as any other site on the internet. – Sammaye Apr 22 '13 at 14:09
  • It should be noted that you can of course configure that time to be as low as you like as well; again, it is a default setting. Though if you are looking for straight-to-disk writes then you are probably looking at the wrong tech, just saying – Sammaye Apr 22 '13 at 14:10
  • This is 60 seconds, not 60 ms, and "unlikely" means it will happen sooner or later. Writing to disk is the main purpose of a database. Are you saying MongoDB is not capable of delivering basic durability? Can you comment on why "Journal + Majority" is not good for what I am trying to do? – Flavien Apr 22 '13 at 14:18
  • http://docs.mongodb.org/manual/reference/command/fsync/ – Flavien Apr 22 '13 at 14:20
  • "Are you saying MongoDB is not capable of delivering basic durability?" It is capable of delivering excellent durability for its scenarios. And if you are looking into using journal then you are killing any speed MongoDB has, you might as well be using SQL since yor scenairo is orientated around that and thats weird I am sure it used to say ms :\ – Sammaye Apr 22 '13 at 14:22
  • Note that my question was not about speed though. I know journaling adds a delay of 100 ms, and I am OK with that. What I am not OK with is having a risk (however small it is) of losing data that should have been committed. – Flavien Apr 22 '13 at 14:26
  • Then you are using the wrong technology; if you require consistency and durability (ACID compliance) over everything else, you should probably seek out SQL. – Sammaye Apr 22 '13 at 14:28
  • Ah, I am such a dumbass: it is 60 ms to the journal and then you have fsync, so yeah, 60 ms of loss with journaling turned on; just remembered it – Sammaye Apr 22 '13 at 16:47
  • @Flavien I wouldn't pay too much attention to this discussion; there are a lot of unrelated things being mixed up in these comments. If you receive acknowledgement to j:true and w:majority, it means that the write occurred on a majority of the replica set *and* was flushed to disk on the primary (see the sketch after these comments). It's not a 100 ms wait for journal acknowledgement, it's at most 33 ms. The fsync interval is irrelevant. The highest durability AND high availability of data is achieved with replication, *not* disk durability, as there is no guarantee the disk will survive a crash, but on a secondary the data will still be available. – Asya Kamsky Apr 23 '13 at 04:03
  • @AsyaKamsky Unrelated? I think earlier I actually mentioned that even though there is a 60 ms window, sending it to the majority keeps availability, which is an alternative to what you just said. Also, if it is 33 ms at most, you might want to fix the doc page on journaling, which states 100 ms... – Sammaye Apr 23 '13 at 07:04
  • It's 100 ms by default, but the journal flush is done up to 3x as often if there is anyone waiting for acknowledgement with "j:1". Once it's on disk in the journal it will get fsync'ed to the data file, so the fsync write concern is never necessary (when journaling is used). I don't know where you got 60 ms from. – Asya Kamsky Apr 23 '13 at 13:31
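
A rough sketch of the write concern semantics described in the comment above (PyMongo; the connection string and names are placeholders, and wtimeout is an optional bound so the driver errors instead of waiting indefinitely if a majority cannot acknowledge):

    # j=True waits for the journal flush on the primary, w="majority" waits
    # for replication to a majority, and wtimeout bounds how long the driver
    # waits for that majority before raising an error.
    from pymongo import MongoClient
    from pymongo.errors import WTimeoutError
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
    events = client.mydb.get_collection(
        "events",
        write_concern=WriteConcern(w="majority", j=True, wtimeout=5000),
    )

    try:
        events.insert_one({"event": "signup", "user": "alice"})
    except WTimeoutError:
        # A majority did not acknowledge within 5 seconds (e.g. replicas down).
        # The write may still exist on some nodes, but it must not be treated
        # as durably committed.
        pass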

1 Answer


Right, so if you choose a journal write you are basically ensuring the write has made it to the disk of a single node. If you choose to do a majority write, you are ensuring that the write has made it to the memory of at least x number of nodes in your replica set.

By default, MongoDB will flush from memory to the journal every 100 ms. By having your replica nodes on different machines (physical or virtual), ideally in different data centres, you are very unlikely to ever see ALL nodes in a geographically distributed replica set go down within the same 100 ms before one gets to disk.

Alternatively, to guarantee that the write has made it to the disk of a single node, use a journal write.
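
To illustrate the two options, here is just a sketch in PyMongo (the connection string and names are placeholders):

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
    db = client.mydb

    # Journal write: acknowledged once the primary has committed the write
    # to its on-disk journal.
    logs_journaled = db.get_collection(
        "logs", write_concern=WriteConcern(w=1, j=True)
    )

    # Majority write: acknowledged once a majority of replica set members
    # have the write (in memory; each node's journal flush happens on its
    # own schedule, every 100 ms by default).
    logs_majority = db.get_collection(
        "logs", write_concern=WriteConcern(w="majority")
    )

    logs_journaled.insert_one({"msg": "durable on the primary's disk"})
    logs_majority.insert_one({"msg": "replicated to a majority of nodes"})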

sweaves
  • Just FYI - that article is hideously inaccurate. He also works for another database start-up and is likely trying to cause controversy by bashing other tech. – sweaves Apr 22 '13 at 18:09