
I need to come up with a strategy for handling client-retries on a data-store entry creation:

  • Client sends request to create new entry in database
  • Server performs entry creation and prepares success-reply
  • Some error happens that makes the client believe that the request wasn't processed (packet loss, ...)
  • Client sends same request to create new entry in database again
  • Server detects the retry, recreates the original reply, and sends it without creating another data-store entry
  • Client receives reply
  • Everyone is happy and only ONE entry was created in database

I have one restriction: The server is STATELESS! It keeps no session information about the client.

My current idea is the following:

  • Tag every create-request with a guaranteed globally unique ID (here's how I create them, although they are not too relevant for the question):
    • Using the data-store (and memcache), I assign a unique, monotonically increasing ID to every server instance once it loads (let's call it SI)
    • When a client requests the starting-page, the instance that served the request generates a unique monotonically increasing page-load-id (PL) and sends SI.PL to the client along with the page content
    • For every create-request, the client generates a unique monotonically increasing request-id (RI) and sends SI.PL.RI along with the create-request
  • For every create-request, the server first checks whether it knows the create-tag
  • If not, it creates the new entry and somehow stores the create-tag along with it
  • If it does know the tag, it uses it to find the originally created entry and recreates a corresponding reply

Here are the implementation options that I am thinking about right now and their problems:

  1. Store the create-tag as an indexed property inside the entry:
    • When the server gets a request, it has to use a query to find any existing entry
    • Problem: Since queries in AppEngine are only eventually consistent, it might miss an entry
  2. Use the create-tag as the entry's key:
    • Should be ok as it is guaranteed to be unique if the numbers don't wrap (unlikely with longs)
    • Minor inconvenience: It increases the length of the entries' keys in any future use (unneeded overhead)
    • Major problem: This will generate sequential entry keys in the datastore, which should be avoided at all costs as it creates hot-spots in the stored data and can thus impact performance significantly

One solution I am contemplating for option 2 is to use some sort of formula that takes the sequential numbers and re-maps them onto a unique, deterministic, but random-looking sequence instead to eliminate hot-spots. Any ideas on what such a formula could look like?

Or maybe there is a better approach altogether?

Markus A.
  • Seems awfully complicated. There must be some combination of the data (even if it's all of it) that makes it unique (else you can't distinguish a retry), so just use that combination, or a hash of it, as your key. – Greg Sep 22 '14 at 19:37
  • @Greg I could use the data, that's true. But unfortunately that doesn't solve the problem of eventually consistent queries: The query might still return an empty result-set even if the entry already exists. – Markus A. Sep 22 '14 at 19:45
  • @Greg Maybe I made it sound more complicated than it is. It's fairly trivial to do id+=1 to generate these "unique monotonically increasing" server-, client-, and thus request-ids. It's definitely MUCH faster and more transferable than worrying about which part of the data I need to hash (i.e. what part of it is the unique identifier) and how to handle potential modifications in this part of the data by other clients before the retry arrives... :) – Markus A. Sep 22 '14 at 19:49
  • You'd use the hash as the key, so it absolutely does solve the problem of eventual-consistency. – Greg Sep 22 '14 at 19:53
  • @Greg Sorry: I missed the "as your key" part... Yes. That does solve the eventual consistency issue. But how do I handle/avoid hash collisions? (I know they're unlikely, but unfortunately not impossible) – Markus A. Sep 22 '14 at 19:53

2 Answers


How do you assign a key to a new entity?

If you create a key yourself, problem solved. A repeat entity will simply overwrite the existing entity because it has the same key. An example would be creating a product entity where a product's SKU is used to generate a key.
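As a minimal sketch of this idea (using the low-level Java Datastore API; the kind, property, and variable names are only placeholders):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;

    // Sketch only: the SKU becomes the key name, so a retried create request
    // overwrites the same entity instead of adding a duplicate.
    public static void saveProduct(String sku, String productName) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Key key = KeyFactory.createKey("Product", sku); // key derived from client data
        Entity product = new Entity(key);
        product.setProperty("name", productName);
        datastore.put(product); // idempotent for the same key
    }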

If a key is assigned by the Datastore, then, when a request times out, show an error message to the user and reload the data on the client. The user will then see whether the entity was already created.

It's not as fancy as "random-looking sequences", but it's simpler and more reliable :)

Andrei Volgin
  • In option (1) above, I would have the datastore create the keys for me. In option (2), I would create the keys myself. Unfortunately, while that does solve the problem of preventing re-creation, it comes with the above described other issues that I need to solve to make that work. – Markus A. Sep 22 '14 at 19:46
  • Still, it looks like a classic case of over-engineering. If 1 request out of 100,000 fails this way, just reload the data to a user. By the time a response times out, the datastore state will be already consistent. – Andrei Volgin Sep 22 '14 at 19:54
  • Agreed, I'm trying to work on the 5th 9 here, but since this is code that will go into a library that hopefully will get a bunch of use, I'd rather spend a day or two on truly solving the issue than hoping that the library will face so few users that their combined frustration is lower than mine... :) – Markus A. Sep 22 '14 at 19:57
  • Or are you saying that the retry-roundtrip will always take longer than the consistency in the datastore, so if I go with option (1) above, there really is no issue? Unfortunately I haven't been able to find any sort of guarantees on consistency times in the docs... – Markus A. Sep 22 '14 at 20:00
  • You said it yourself: "when a request times out". That's usually a magnitude longer than a typical eventual consistency delay. As for the library, you cannot solve every possible scenario: a memcache may clear at any moment, your SI.PL.RI request may fail, etc. – Andrei Volgin Sep 22 '14 at 20:11
  • Unfortunately the most likely scenarios that even trigger a lost response don't really wait for the timeout: For example if a computer loses its wifi signal, I believe all open TCP connections are terminated immediately. Not sure how mobile devices handle a temporary carrier loss. I could imagine the retry to happen on the order of seconds. Especially since it might also be triggered by other failures than a reply-loss. As far as solving every possible scenario goes: I think this one is solvable. Option (2), for example, should work if I can fix the hot-spot performance issue, no? – Markus A. Sep 22 '14 at 20:17
  • If a computer loses a signal, how would a user retry to create the same entity within a second or two :) ? – Andrei Volgin Sep 22 '14 at 20:21
  • :) Let me change the question to allow for other failures beyond reply-loss... Also, am I guaranteed consistency within a few seconds? – Markus A. Sep 22 '14 at 20:23
  • Most of the time it's within a second. I saw reports that it may be up to a few seconds on a bad day. – Andrei Volgin Sep 22 '14 at 23:54
  • I think I found a fairly straight-forward solution, but definitely +1 for some good pointers. Thanks. :) – Markus A. Sep 23 '14 at 18:21

While it is possible to make implementation option (1) from above work by using a cleverly designed data structure, the right transactions, and garbage collection, it would be quite a pain to get it working reliably.

So, it seems like the right solution is to go with a generalized version of option (2) instead:

Use some unique identifier for the created entity as the entity's key.

That way, if the same entity is created again in a retry, it is either easy to reliably find an existing copy (as gets are strongly consistent), or blindly writing it again will simply overwrite the first version.

@Greg suggested in the comments to use a hash over the entity's uniquely identifying data as the key. While this solves the problem of having keys that are evenly distributed across parameter-space and thus leads to an efficient distribution of the data across physical storage locations, it does create the new problem of having to manage (or ignore) hash-collisions, especially if one tries to keep keys from getting very long.
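For illustration, a rough sketch of that suggestion (the identifying fields here are made up; in practice one would hash whatever combination of properties actually defines uniqueness):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: hash the uniquely identifying fields and use the hex-encoded
    // digest as the entity's key name.
    static String keyNameFor(String userId, String itemName) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest((userId + "\n" + itemName).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }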

There are ways to handle these collisions. For example: In case of a collision, compare the actual content to check whether it truly is a duplicate, and, if not, add a "1" to the key. Then, see if that key exists also, and, if so, check again if it has the same content. If not, add a "2" instead, check again for a collision, and so on... While this works, it gets quite messy.

Or you can just say that hash collisions are so rare that one will never have enough user data in one's database to ever see one. I personally don't like these sorts of "keep-your-fingers-crossed"-approaches, but in many cases, it might be an acceptable way to go.

But, luckily I already have a collision-free globally unique identifier for the data: the create-tag. And it turns out the two issues that I saw with using it are both easily remedied through some clever bit-shuffling:

Using the same identifiers as in the original question, my create-tag SI.PL.RI consists of SI, which will keep increasing forever, PL, which resets to 0 every time a new server instance is created, and RI which resets for every new client session. So RI is likely always tiny, PL will stay somewhat small, and SI will slowly get huge.

Given that, I could for example build the key like this (starting with the most significant bits):

- Lowest 10 bits of PL
- Lowest  4 bits of RI
- Lowest 17 bits of SI
- 1 bit indicating whether there are any further non-zero values
- Next lowest 10 bits of PL
- Next lowest  4 bits of RI
- Next lowest 17 bits of SI
- 1 bit indicating whether there are any further non-zero values
- ... and so on, until ALL bits of RI, PL, and SI are used (eventually breaking the 10-4-17 pattern)

That way, the generated keys are spread nicely across parameter space when sorted in lexical order (as AppEngine does), the first keys are only half as long as the auto-generated ones, and they only get longer as needed.
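A rough Java sketch of this packing (the class and method names are arbitrary, the resulting bytes would still need to be encoded, e.g. as hex, to serve as a key name, and this simplified version keeps the 10-4-17 pattern throughout rather than breaking it once a field runs out of bits):

    import java.util.ArrayList;
    import java.util.List;

    public final class CreateTagKey {

        // Packs SI.PL.RI into 32-bit chunks: 10 bits of PL, 4 bits of RI,
        // 17 bits of SI, and a 1-bit "more chunks follow" flag per chunk.
        // Chunks holding the lowest (fastest-changing) bits come first, so the
        // resulting byte strings spread out nicely in lexical order.
        public static byte[] encode(long si, long pl, long ri) {
            List<Integer> chunks = new ArrayList<>();
            while (true) {
                int plPart = (int) (pl & 0x3FF);   // lowest 10 bits of PL
                int riPart = (int) (ri & 0xF);     // lowest  4 bits of RI
                int siPart = (int) (si & 0x1FFFF); // lowest 17 bits of SI
                pl >>>= 10;
                ri >>>= 4;
                si >>>= 17;
                boolean more = (pl | ri | si) != 0;
                chunks.add((plPart << 22) | (riPart << 18) | (siPart << 1) | (more ? 1 : 0));
                if (!more) {
                    break;
                }
            }
            // Serialize the chunks in order, big-endian within each chunk.
            byte[] key = new byte[chunks.size() * 4];
            for (int i = 0; i < chunks.size(); i++) {
                int c = chunks.get(i);
                key[i * 4]     = (byte) (c >>> 24);
                key[i * 4 + 1] = (byte) (c >>> 16);
                key[i * 4 + 2] = (byte) (c >>> 8);
                key[i * 4 + 3] = (byte) c;
            }
            return key;
        }
    }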

Aside 1:

In fact, if no server instance is ever alive for long enough to serve more than a thousand page-loads and no one client ever creates more than 16 new entities in one session, and server instances are not spawned faster than one every 5 minutes on average, it will take more than a year before keys get longer than 4 bytes on average.

And if no server instance is ever alive for long enough to serve more than a million page-loads and no one client ever creates more than 256 new entities in one session, and server instances are not spawned faster than one every second on average, it will still take over 500 years before keys get longer than 8 bytes (and thus longer than auto-generated ones) on average. Should be fine... :)

Aside 2:

If I also need to use these keys to index a Java HashMap, the hashCode() function of my key object can return an integer built from the first 4 key bytes in reverse order to spread the keys across buckets.
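As a sketch, assuming the key object stores the encoded bytes from above in a field called keyBytes (at least 4 bytes long):

    // Reversing the first four key bytes puts the rapidly varying bits into the
    // low-order positions that HashMap uses when choosing a bucket.
    @Override
    public int hashCode() {
        return ((keyBytes[3] & 0xFF) << 24)
             | ((keyBytes[2] & 0xFF) << 16)
             | ((keyBytes[1] & 0xFF) << 8)
             |  (keyBytes[0] & 0xFF);
    }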

Markus A.