
I have an account key that I check when processing items. This check prevents me from spending time working on items that are no longer valid. In an inner processing loop, I validate that I can still find the key I am working with.

Entity definition:

from google.appengine.ext import ndb

class AuthorizedKey(ndb.Model):
    secret_key = ndb.StringProperty()
    ...additional fields...

Code I am using to check:

authorized_key = AuthorizedKey.query(AuthorizedKey.secret_key == first_key).get()

first_key is a string that matches an AuthorizedKey entity that has already been validated and is in the datastore. The entity has not been modified in weeks or months, depending on the key.

This query periodically fails to retrieve a result. Roughly 1 in 10,000 times. Errors appear to be distributed among the keys (i.e. the problem is not with just one record).

Any suggestions on how to get rid of these errors? I need the information that is stored inside AuthorizedKey in order to accurately process the requests. If data is coming in that does not have a key, I want to stop processing it as quickly as possible.

Update --- The record that matches the query is typically created a few days before it is ever used. These records have been in place on average for months. They have worked millions of times, existed before the failures, and continue to exist afterwards. Sometimes, the response to the get() call is None.

While multiple records have exhibited this problem, one particular record has been consistently accessed since about 2 days after its creation. I just added monitoring code on the inner loop a few days ago. There is no error message returned.

"if not authorized_key:" 

right after the code above returns False roughly once out of every 10,000 times. Accessing before and after works. An entity with a secret_key that matches my argument has existed for just shy of 8 months and has been successfully accessed millions of times. This is the reason for my question, to ask for help. I quietly fail in this case and automatically retry later. The retry works fine. Exactly the same data.
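
For context, here is a minimal sketch of the inner-loop check and quiet-retry flow described above. The retry mechanism shown (re-enqueueing a task with a delay) is an assumption on my part; the endpoint name and delay are hypothetical.

from google.appengine.api import taskqueue

def validate_and_process(first_key):
    # Inner-loop validation: confirm the key can still be found before doing work.
    authorized_key = AuthorizedKey.query(AuthorizedKey.secret_key == first_key).get()
    if not authorized_key:
        # Quiet failure: stop processing this item and retry it later.
        # The retry endpoint and countdown are hypothetical.
        taskqueue.add(url='/tasks/retry-item', params={'key': first_key}, countdown=60)
        return
    # ...continue processing using the fields stored on authorized_key...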

Aside from the odd nature of the issue, the main reasons that I worry about this are:

  • Reliability of the core system
  • Other areas of the system may not be degrading as gracefully as this one.
TheJacobTaylor
  • Are you sure that that specific `AuthorizedKey` entity is in the datastore and was not deleted by any other process of your app? When you query for that key right after it failed, do you get the entity? – aschmid00 Dec 12 '12 at 20:21
  • @aschmid00 The most recent event involves a key that was created back in April; it has existed consistently the entire time. It is conceivable that the entity is occasionally updated, but it should never be changed. It is designed to be written once and updated once (barring upgrades). The incidents of the errors are not correlated with upgrades. – TheJacobTaylor Dec 12 '12 at 20:47
  • When you speak of `existed consistently`, are you saying the key/entity does still exist? Did you try to retrieve the entity after the 'error' where it was not found? – aschmid00 Dec 12 '12 at 21:06
  • What is the actual error? There are many kinds of datastore errors. – bossylobster Dec 12 '12 at 22:22
  • @bossylobster there is no 'error', the error is that the entity is not found. – aschmid00 Dec 12 '12 at 22:51
  • This record has been consistently accessed since about 2 days after its creation. I just added monitoring code on the inner loop a few days ago. There is no error message returned. "if not authorized_key:" right after the code above evaluates to true roughly once out of every 10,000 times. Accessing before and after works. An entity with that attribute has existed for just shy of 8 months and has been successfully accessed millions of times. This is the reason for my question, to ask for help. I quietly fail in this case and automatically retry later. The retry works fine. Exactly the same data. – TheJacobTaylor Dec 12 '12 at 23:13
  • I see. Did you write tests for this query? I suggest you use `self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)` in your tests (see the sketch after these comments); dragonx might be right with his answer. – aschmid00 Dec 12 '12 at 23:19
  • Are you 100% certain that first_key is an exact match in the datastore? Is it possible that first_key sometimes has a space at the end or special characters or different encoding? – Rob Curtis Dec 13 '12 at 09:59
  • @Bert I thought of that as well. The same data, on a retry, goes through just fine. The same data was already searched for, and found, with a slightly different query earlier in processing (before adding it to a task queue). I send the key that caused the error in email and have verified that there are no spaces on either side of it and have used cut/paste to put the value into a query. – TheJacobTaylor Dec 13 '12 at 17:56
  • @aschmid00 I don't understand how an eventual consistency issue should be able to cause a problem with a record that is 8 months old. We generally issue new keys hours before their first use. More frequently, the keys are created days before their first use. One field that I do not have on this record is a last modified time. These records are typically written once and updated once. Is there any chance that a datastore write to the same field could cause the query to not find a result? The error has affected only 4 keys to date. For the past two days, it is limited to 1 key. – TheJacobTaylor Dec 13 '12 at 18:13
  • @TheJacobTaylor I can't really tell what the issue is, but it should not happen, and eventual consistency should not affect 'older' entities like you say. – aschmid00 Dec 13 '12 at 18:53
  • I just ran a simple test where I ran the query that was failing in a tight loop checking for success. I ran 10 concurrent requests fetching 1,000 times each. Running two sets of these produced 0 errors out of roughly 180,000 requests. Anybody have suggestions for tests I can run? Is there another way to structure this query to be more reliable? – TheJacobTaylor Dec 13 '12 at 19:01
  • Correction, roughly 18,000 requests. Not enough. Trying more. – TheJacobTaylor Dec 13 '12 at 22:16
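
The testbed consistency policy aschmid00 suggests above would be wired up roughly as follows. This is only a sketch: the test class, entity values, and assertion are assumptions of mine; only the PseudoRandomHRConsistencyPolicy(probability=0) line comes from the comment.

import unittest

from google.appengine.datastore import datastore_stub_util
from google.appengine.ext import testbed

class AuthorizedKeyQueryTest(unittest.TestCase):
    def setUp(self):
        self.testbed = testbed.Testbed()
        self.testbed.activate()
        # probability=0 means global queries never see fresh writes,
        # which surfaces eventual-consistency bugs deterministically.
        self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
        self.testbed.init_datastore_v3_stub(consistency_policy=self.policy)
        self.testbed.init_memcache_stub()

    def tearDown(self):
        self.testbed.deactivate()

    def test_query_by_secret_key_under_eventual_consistency(self):
        AuthorizedKey(secret_key='abc123').put()
        found = AuthorizedKey.query(AuthorizedKey.secret_key == 'abc123').get()
        # With probability=0 the global (non-ancestor) query does not see
        # the fresh write, so the lookup is expected to return None here.
        self.assertIsNone(found)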

2 Answers


You're using a non-ancestor query, so you may get eventually consistent results. If the entity is fairly new, a query for it could fail to return it.
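
For illustration, a strongly consistent alternative is a lookup by key rather than a global query. This is only a sketch and assumes the secret key is stored as the entity's key name, which is not the poster's current schema (a similar change is mentioned in the comments below).

from google.appengine.ext import ndb

# At creation time, use the secret as the key name (hypothetical schema change):
AuthorizedKey(id=first_key, secret_key=first_key).put()

# A get by key is strongly consistent, unlike the eventually consistent query:
authorized_key = ndb.Key(AuthorizedKey, first_key).get()
# equivalently: AuthorizedKey.get_by_id(first_key)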

dragonx
  • Isn't eventual consistency only an issue on newly created entities, until they are replicated in the cloud? – aschmid00 Dec 12 '12 at 22:50
  • OK, I misread your answer; he is speaking about an older entity. – aschmid00 Dec 13 '12 at 00:05
  • I just saw the update about these entities being old. Sounds pretty scary actually. I wonder if it's possibly an ndb bug, since ndb get requests can be cached... I wonder if there's a small timing bug in the ndb caching that fails rarely. – dragonx Dec 13 '12 at 05:33
  • NDB does not cache queries. – Guido van Rossum Dec 13 '12 at 07:37
  • I have actually been planning to change the key for these entities to be the value I am querying against, to allow the entity cache to be used. – TheJacobTaylor Dec 13 '12 at 17:59
  • @TheJacobTaylor Are you checking for exceptions? I.e., are you certain that it's returning no result, vs. failing with some exception indicating there's something wrong with the datastore? – dragonx Dec 13 '12 at 18:04
  • @dragonx It is all in a try/catch block that catches everything. Right after the query, I am checking to see if the result was invalid with 'if not authorized_key'. The code in that if block is what is notifying me of the problem. Aside from the error, 4x in the past day, we have had one transient datastore error and a few transaction retries. None of the times appeared correlated (off by over 2 hours at closest) – TheJacobTaylor Dec 13 '12 at 18:27

It turns out that this internal processing code was called in two modes. The first mode pulls a batch of items and processes them. If there are too many items in the batch, it schedules a follow-up task for later. When the follow-up task is scheduled, the request is running in a namespace other than the default.

The second mode is as a follow-up task, which carries forward the namespace from which it was enqueued. This means that, in that mode, I was looking for the AuthorizedKey in the wrong namespace and not finding it. My batch size was rather high, but not astronomical. When one device is very active, or the system is heavily loaded, events can hit that mark, causing the error very sporadically.
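
As a sketch of the eventual fix, the lookup could be pinned to the namespace the keys actually live in, so it no longer depends on whatever namespace the current task inherited. The assumption that the keys live in the default namespace ('') is mine.

from google.appengine.api import namespace_manager

# The follow-up task runs in whatever namespace it was enqueued from:
inherited_ns = namespace_manager.get_namespace()

# Pin the query to the default namespace, regardless of the inherited one:
authorized_key = AuthorizedKey.query(
    AuthorizedKey.secret_key == first_key, namespace='').get()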

Thank you all for your suggestions and debugging help!

TheJacobTaylor