UpdateOne fails on client due to timeout, but MongoDB processes it anyway

Question

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:

Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.

This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an incomplete read of message header: context deadline exceeded error, which makes it retry the command, but the previous one indeed succeeded because the field ends up being incremented twice.

I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.

Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?

Any help is appreciated.

UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test ping the database before disabling the proxy to get the handshake out of the way and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timeout.

@gold_cy I didn't, but the function under test doesn't spawn goroutines or runs inside one. — Michael Benford, Aug 26 '21 at 22:34
If the client can't read the command response, you can't assume anything about its success. How would you proceed if you'd know the error represents response reading failure? — icza, Aug 26 '21 at 22:52
@icza Fair point. I thought I wouldn't need to retry if l knew the command went through, even if l couldn't read its response (it's an upsert and I don't care about the returning ID). — Michael Benford, Aug 26 '21 at 23:13

icza · Answer 1 · 2021-08-27T06:49:52.553

In general, if you know the command went through (to the server), if you can't read the response, you can't assume anything about its success.

In some cases when it only matters if the server got the command, or you only care about the command reaching the server, then read on.

Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.

I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:

mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}

And there is the error when the timeout happens writing the command:

mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}

As you can see, in both cases mongo.CommandError is returned, with identical Code and Labels fields. Which leaves you having to analyze the error string (which is ugly and may "break" with future changes).

So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.

Thanks for taking the time to try and reproduce the problem on your end. I managed to make my test stop randomly failing by pinging the database before shutting it down, so the handshake phase could complete. My best guess is that it was contributing to the command timeout and making the test more likely to fail. — Michael Benford, Aug 27 '21 at 14:04
In my test I only used a context with deadline to reproduce the issue, with no proxy or network error, so the issue is real. — icza, Aug 27 '21 at 14:32

score 1 · Answer 2 · answered Aug 27 '21 at 19:04

1

If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.

This functionality has been around for years so unless you are using an ancient driver version you should already have it.

If you are performing write retries in your application manually rather than using the driver's retryable write functionality, you can write twice as you found out. The solution is to use the driver's retryable writes.

answered Aug 27 '21 at 19:04

D. SM

13,584
3
12
21

I'm using the latest driver (1.7.1). I have retryable writes disabled because as far as I know, they only retry once, and my function needs to retry until it succeeds. Am I missing something? – Michael Benford Aug 27 '21 at 19:18
1

You can read https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.rst#id19 ("constructing write commands") and implement your own retryable writes that retry any number of times. As far as I know no driver implements multiple retries at this point, though this is likely to change soon. – D. SM Aug 29 '21 at 15:16
Interesting. [The doc](https://docs.mongodb.com/manual/core/retryable-writes/) reads _Retryable writes allow MongoDB drivers to automatically retry certain write operations **a single time** if they encounter network errors (...)_, so I assumed it wasn't up to the driver to retry multiple times. – Michael Benford Aug 29 '21 at 23:56
1

See https://github.com/mongodb/specifications/pull/881. – D. SM Aug 30 '21 at 04:36

score 0 · Answer 3 · answered Jan 09 '22 at 10:11

I had the same problem (running on go.mongodb.org/mongo-driver v1.8.1 on a MongoDB 4.4) and will leave my experiences with this problem here.

To add to @icza solution: You can also get the error context deadline exceeded so check also for that. A check for a context abortion would look something like this:

if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
   ...
}

My solution to the problem was instead of first checking if there was an error you'd first check if there was a result from the transaction.

Example:

result, err := database.collection.InsertOne(context, item)
if result != nil {
    return result.InsertedID, err
}
return nil, err

If the transaction did process it despite the error, you could add some compensation logic to undo the transaction.

This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/30773890) — Bracken, Jan 12 '22 at 12:28

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

3 Answers3