
Using Google App Engine's NDB datastore, how do I ensure a strongly consistent read of a list of entities after creating a new entity?

The example use case is that I have entities of the Employee kind.

  • Create a new employee entity
  • Immediately load a list of employees (including the one that was added)

I understand that the approach below yields an eventually consistent read of the list of employees, which may or may not contain the new employee. When it doesn't, the user gets a bad experience.

e = Employee(...)
e.put()                      # write is applied immediately
Employee.query().fetch(...)  # global query: eventually consistent

Now here are a few options I've thought about:

IMPORTANT QUALIFIERS

I only care about a consistent list read for the user who added the new employee. I don't care if other users get an eventually consistent read.

Let's assume I do not want to put all the employees under a common ancestor to enable a strongly consistent ancestor query. With thousands and thousands of employee entities, the 5 writes/second limitation on a single entity group is not worth it.

Let's also assume that the write and the list read happen in two separate HTTP requests. I could theoretically put both the write and the read into a single transaction (?), but that would make for a very non-RESTful API endpoint.

Option 1

  • Create a new employee entity in the datastore
  • Additionally, write the new employee object to memcache, a local browser cookie, or local mobile storage.
  • Query datastore for list of employees (eventually consistent)
  • If the new employee entity is not in this list, add it to the list (in my application code) from memcache / local storage
  • Render results to user. If user selects the new employee entity, retrieve the entity using key.get() (strongly consistent).
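The merge step of Option 1 can be sketched datastore-agnostically. This is a minimal illustration, not NDB API code: `merge_new_entity` and `key_fn` are hypothetical names, and with NDB entities `key_fn` would presumably be something like `lambda e: e.key`.

```python
def merge_new_entity(query_results, new_entity, key_fn):
    """Ensure a freshly written entity appears in an eventually
    consistent result list (Option 1's merge step).

    query_results: entities from the eventually consistent query.
    new_entity: the entity just written, recovered from memcache,
        a cookie, or local mobile storage.
    key_fn: returns a stable identity for an entity.
    """
    seen = {key_fn(e) for e in query_results}
    if key_fn(new_entity) not in seen:
        # The index hasn't caught up yet; splice the cached copy in.
        return query_results + [new_entity]
    # The query already sees the new entity; nothing to do.
    return query_results
```

The helper is idempotent, so it is safe to call on every list read regardless of whether the index has caught up.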

Option 2

  • Create a new employee entity using a transaction
  • Query datastore for list of employees in a transaction

I'm not sure Option #2 actually works.

  • Technically, does the previous write transaction get applied everywhere before the read transaction of that entity occurs? Or is that not how it behaves?
  • Transactions (including XG) are limited in the number of entity groups they can touch, and a list of employees (each in its own entity group) could exceed this limit.
  • What are the downsides of read-only transactions vs. normal reads?

Thoughts? Option #1 seems like it would work, but it seems like a lot of work to ensure consistency on a follow-on read.

Dan McGrath
dopster
  • You have the key of the new employee, so why query? Perform a query and add the entity to the result set. Also, a get on the key will force the index writes. The question is how long after the creation of the entity the query will be performed, and by whom? If it's the same user, then a session object can manage a list of newly created entities with some sort of time boundary. I have been involved with a system with 2000+ users and we generally don't see issues with CRUD operations. – Tim Hoffman Apr 17 '15 at 11:31

3 Answers


If you do not use an entity group, you can combine a keys-only query with a get_multi(keys) lookup for entity consistency. For the new employee, you have to add the new key to the key list passed to get_multi.

Docs: A combination of the keys-only global query with a lookup method will read the latest entity values. But it should be noted that a keys-only global query cannot exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.

More info and magic here: Balancing Strong and Eventual Consistency with Google Cloud Datastore
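The key-list construction this answer describes can be sketched as a small pure-Python helper; `keys_for_lookup` is a hypothetical name, and the NDB calls shown in the trailing comments are the rough shape of the surrounding code, not copied from the question.

```python
def keys_for_lookup(query_keys, new_key):
    """Build the key list for the by-key lookup step:
    keys returned by the (eventually consistent) keys-only query,
    plus the key of the just-written entity, deduplicated while
    preserving the query's order."""
    merged = list(query_keys)
    if new_key not in merged:
        # The index may not include the new entity yet; append its key
        # so the subsequent by-key lookup (strongly consistent) finds it.
        merged.append(new_key)
    return merged

# In NDB, the surrounding calls would look roughly like:
#   keys = Employee.query().fetch(keys_only=True)            # global, keys-only
#   employees = ndb.get_multi(keys_for_lookup(keys, e.key))  # by-key reads
```

As the quoted docs note, the by-key reads return the latest entity values, but the keys-only query itself can still miss the new key — hence the explicit append.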

voscausa

I had the same problem, option #2 doesn't really work: a read using the key will work, but a query might still miss the new employee.

Option #1 could work, but only within the same request. The saved memcache key can disappear at any time; a subsequent query on the same instance, or on another instance potentially running on different hardware, could still miss the new employee.

The only "solution" that comes to mind for consistent query results is to not attempt to force the new employee into the results and instead let things flow naturally until it shows up. I'd just add a warning that creating the new employee will take "a while". If tolerable, maybe keep polling/querying in the original request until it shows up - that is the only place where the employee-creation event is known with certainty.
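The polling idea can be sketched as a generic helper; `poll_until_visible`, `fetch_fn`, and `predicate` are hypothetical names, and `fetch_fn` would wrap the eventually consistent `Employee.query().fetch(...)` call.

```python
import time

def poll_until_visible(fetch_fn, predicate, timeout=5.0, interval=0.1,
                       clock=time.monotonic, sleep=time.sleep):
    """Re-run an eventually consistent query until the new entity
    shows up or the timeout elapses.

    fetch_fn: returns the current (eventually consistent) result list.
    predicate: tests whether the new entity is present in a result.
    """
    deadline = clock() + timeout
    while True:
        results = fetch_fn()
        if any(predicate(e) for e in results):
            return results
        if clock() >= deadline:
            # Give up; the caller can warn the user that the new
            # entity will appear "in a while".
            return results
        sleep(interval)
```

The `clock` and `sleep` parameters are injected so the loop can be tested without real waiting; in a request handler you would also want `timeout` well under the request deadline.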

Dan Cornilescu

This question is old as I write this. However, it is a good question and will be relevant long term.

Option #2 from the original question will not work.

If the entity creation and the subsequent query are truly independent, with no context linking them, then you are either really just stuck - or you don't care. The trick is that there is almost always some relationship or some use case that must be covered. If the query really is some kind of ad hoc query, then you genuinely don't care: you quote the CAP theorem and remind the client executing the query how great it is that this system scales. But almost always, if you are worried about eventual consistency, there is some use case or set of cases that must be handled.

For example, if you have a high-score list, the highest score must be at the top of the list, and it may have just been achieved by the user who is now looking at the list. Another example: when an employee is created, that employee must be on the "new employees" list. So what you usually do is exploit these known cases to balance the throughput you need against consistency.

For the high-score example, you may be able to afford to keep a secondary index (an entity) that is the list of high scores. You always get it by key, and you can write to it as frequently as needed because, presumably, high scores are not generated that often. For the new-employee example, you might use the approach you started to suggest: store the timestamp of the last employee in memcache, then, when you query, check that your list includes that employee ... or something along those lines.
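The memcache-timestamp check at the end of that answer can be sketched as a small staleness test; `list_is_stale` and `ts_fn` are hypothetical names, and `last_write_ts` stands for the creation timestamp cached (e.g. in memcache) when the employee was written.

```python
def list_is_stale(results, last_write_ts, ts_fn):
    """Check whether an eventually consistent list has caught up with
    the most recent known write.

    results: entities from the eventually consistent query.
    last_write_ts: creation timestamp cached at write time, or None
        if no recent write is known.
    ts_fn: extracts an entity's creation timestamp.
    """
    if last_write_ts is None:
        return False  # nothing cached; accept the list as-is
    newest = max((ts_fn(e) for e in results), default=None)
    # Stale if the list is empty or its newest entity predates
    # the cached write.
    return newest is None or newest < last_write_ts
```

When the check reports a stale list, the application can fall back to one of the mitigations above (merge in the cached entity, re-query, or warn the user).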

The price of balancing write throughput and consistency on App Engine and similar systems is always the same: increased model and code complexity to bridge the business needs.

Jay