
I have a bit of a strange problem. I have a module running on GAE that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb models. Each task reads a bunch of data from a few different kinds and then calls put.

The first few tasks work fine, but as time goes on I start getting this on the final put:

suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)

So I wrapped the put in a try with a randomised sleep so it retries a couple of times. This mitigated the problem a little; the error just shows up later on.

Here is some pseudocode for my task:

def my_task(request):
    stuff = get_ndb_instances()  # reads a few things from different kinds
    better_stuff = process(stuff)  # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        # out of retries: do the put and let any error propagate
        return oInstance.put()
    try:
        return oInstance.put()
    except:  # shortened; the real code does not use a bare except
        import time
        import random
        logger.info("sleeping")
        time.sleep(random.random() * 20)
        return try_put(oInstance, iCountdown - 1)

Without try_put the queue gets about 30% of the way through before it stops working. With try_put it gets further, to about 60%.

Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.

EDIT:

There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected, a randomly timed retry happens, and this eliminates the contention perfectly well for a little while. Tasks keep running and completing, and the more of them that successfully return, the more contention happens, even though the processes using the contended-upon data should be finished. Is there something going on that's holding onto datastore handles when it shouldn't be? What's going on?

EDIT2:

Here is a little bit about the key structures in play:

My ndb models sit in a hierarchy where we have something like this (the direction of the arrows indicates parent-child relationships, i.e. a Type has a bunch of child Instances, etc.):

Type->Instance->Position

The IDs of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
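
For concreteness, here is a minimal sketch of how that hierarchy is usually expressed with ndb ancestor keys (the property names are made up for illustration; the parenting is the point):

from google.appengine.ext import ndb

class Type(ndb.Model):
    name = ndb.StringProperty()

class Instance(ndb.Model):
    # created with parent=type_key, so every Instance (and everything
    # below it) lives in its Type's entity group
    label = ndb.StringProperty()

class Position(ndb.Model):
    # created with parent=instance_key; IDs come from a small fixed set of names
    value = ndb.FloatProperty()

# a full Position key under this scheme then looks like:
# ndb.Key(Type, 'some_type', Instance, 1234, Position, 'open')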

I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in the obvious way; a sketch follows below) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.
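
The post doesn't show try_put_multi; assuming it simply mirrors try_put around ndb.put_multi, it would look roughly like this (a sketch, not the actual code):

import logging
import random
import time

from google.appengine.ext import ndb

logger = logging.getLogger(__name__)

def try_put_multi(instances, countdown=10):
    if countdown < 1:
        # out of retries: let any error propagate
        return ndb.put_multi(instances)
    try:
        return ndb.put_multi(instances)
    except Exception:
        logger.info("sleeping")
        time.sleep(random.random() * 20)
        return try_put_multi(instances, countdown - 1)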

  • Do you really have bare try/except? – Tim Hoffman Feb 16 '16 at 09:21
  • What is the key structure used, what sort of contention error are you getting ? – Tim Hoffman Feb 16 '16 at 11:16
  • Possible duplicate of ["Too much contention" when creating new entity in dataStore](http://stackoverflow.com/questions/17308179/too-much-contention-when-creating-new-entity-in-datastore) – Brent Washburne Feb 16 '16 at 17:19
  • @TimHoffman: No. This code is a shortened version of the real thing – Sheena Mar 01 '16 at 13:28
  • @TimHoffman: re the type of contention, I'm not getting much info beyond the exception I pasted in my question above. I'll add an edit to talk about key structures. – Sheena Mar 01 '16 at 13:33
  • @BrentWashburne: I don't believe this is a duplicate. the issue is that contention gets progressively worse over time even though the number of processes dealing with the data has a hard upper limit. – Sheena Mar 01 '16 at 13:36
  • Do you see the number of app instances increasing during this time by any chance? – Dan Cornilescu Mar 01 '16 at 13:41
  • @DanCornilescu: I'll run this stuff again and get those numbers – Sheena Mar 01 '16 at 14:05
  • Thousands of `Instance`s to few `Type`s => large entity groups, each group supporting max ~ 1 write per second. What's the rate of the tasks updating `Instance`s for the same `Type` parent (i.e. same group)? Do you have `threadsafe: true` in your `.yaml` config? – Dan Cornilescu Mar 01 '16 at 14:17

1 Answer


Contention will get worse over time if you continually exceed the limit of 1 write/transaction per entity group per second. The answer lies in how Megastore/Paxos works and how Cloud Datastore handles contention in the backend.

When two writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and retries the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.

If sustained writes above the recommended limit are attempted, the chance that a transaction needs to be retried multiple times increases, as does the number of transactions in an internal retry state. Eventually, transactions start reaching our internal retry limit and a contention error is returned to the client.

A randomized sleep is an incorrect way to handle error responses. You should instead look into exponential back-off with jitter (example).
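
A minimal sketch of full-jitter back-off for this case; the function name, attempt count, and delay parameters are illustrative choices, not prescribed values:

import logging
import random
import time

from google.appengine.api import datastore_errors

def put_with_backoff(entity, max_attempts=6, base=0.1, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return entity.put()
        except datastore_errors.TransactionFailedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the contention error
            # full jitter: sleep a random amount in [0, min(cap, base * 2**attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            logging.info("contention, sleeping %.2fs before retry", delay)
            time.sleep(delay)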

More fundamentally, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (and remove it if not), or whether you should shard the entity group in some manner that makes sense for your queries and consistency requirements.
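
For example, if you don't need strongly consistent ancestor queries over Positions, one option (an assumption about your requirements, not a prescription) is to drop the parent and keep a KeyProperty back-reference, so each Position becomes the root of its own entity group:

from google.appengine.ext import ndb

class Position(ndb.Model):
    # no parent: each Position is its own entity group, so concurrent
    # puts to different Positions no longer contend with each other
    instance = ndb.KeyProperty(kind='Instance')  # back-reference replaces ancestry
    value = ndb.FloatProperty()

# trade-off: Position.query(Position.instance == some_key) is eventually
# consistent, unlike an ancestor query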
