
I'm looping over data in App Engine using chained deferred tasks and query cursors (Python 2.7, using db, not ndb). For example:

from google.appengine.ext import db
from google.appengine.ext import deferred

import models  # defines the Asset entity (db.Model)

def loop_assets(cursor=None):
    try:
        assets = models.Asset.all().order('-size')
        if cursor:
            assets.with_cursor(cursor)

        for asset in assets.run():
            if asset.is_special():
                asset.yay = True
                asset.put()

    except db.Timeout:
        # Grab the query's current position and chain the next task from it.
        # (version and dont_retry are defined elsewhere.)
        cursor = assets.cursor()
        deferred.defer(loop_assets, cursor=cursor, _countdown=3,
                       _target=version, _retry_options=dont_retry)
        return
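For context, version and dont_retry aren't defined in the snippet; plausible definitions (my assumption, not shown in the original code) would be:

import os
from google.appengine.api import taskqueue

# Hypothetical definitions for the names the snippet relies on:
# pin follow-up tasks to the current app version, and never auto-retry them.
version = os.environ['CURRENT_VERSION_ID'].split('.')[0]
dont_retry = taskqueue.TaskRetryOptions(task_retry_limit=0)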

This ran for ~75 minutes in total (each task for ~1 minute), then raised this exception:

BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.

Reading the docs, the only stated cause of this is:

New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.

So maybe that's what happened, but it seems quite a coincidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).

Is there another explanation for this? Is there a way to re-architect my code to avoid this?

If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
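A minimal sketch of that fallback, assuming a hypothetical processed flag added to the model (the flag and its name are my invention):

from google.appengine.ext import db

# In models.py (sketch): add a bookkeeping flag to the existing model.
class Asset(db.Model):
    size = db.IntegerProperty()
    yay = db.BooleanProperty(default=False)
    processed = db.BooleanProperty(default=False)  # hypothetical flag

# In the loop: skip anything already handled, and mark entities as we go.
# (The equality filter plus the '-size' sort needs a composite index.)
assets = models.Asset.all().filter('processed =', False).order('-size')
for asset in assets.run():
    if asset.is_special():
        asset.yay = True
    asset.processed = True
    asset.put()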

For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it can't be the same issue.

– tom

  • Of note: the BadRequestError is not always caused by an implementation change – it's possible that it was caused by a single sub-query timing out. Implementing a retry could be a solution. – Adam Feb 28 '16 at 00:24
  • @Adam thanks. Do you have links to any documentation on other causes of BadRequestError? I'd love to read up on it. – tom Mar 01 '16 at 01:55

2 Answers


You may want to look at App Engine MapReduce (a minimal mapper sketch follows the excerpt below). From the docs:

MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:

  • Analyzing application logs
  • Aggregating related data from external sources
  • Transforming data from one format to another
  • Exporting data for external analysis
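For this question's case, a minimal sketch with the Python appengine-mapreduce library might look like the following (the module path tasks.mark_special is hypothetical; models.Asset and is_special() are from the question):

from mapreduce import control
from mapreduce import operation as op

def mark_special(asset):
    # Map handler: the framework calls this once per Asset entity.
    if asset.is_special():
        asset.yay = True
        yield op.db.Put(asset)  # writes are batched by the framework

def start_job():
    # Kick the job off programmatically (a mapreduce.yaml entry works too).
    control.start_map(
        name='MarkSpecialAssets',
        handler_spec='tasks.mark_special',  # hypothetical module path
        reader_spec='mapreduce.input_readers.DatastoreInputReader',
        mapper_parameters={'entity_kind': 'models.Asset'})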
– Brent Washburne
  • Thanks. In my is_special() block I'm actually doing a call to AWS S3 (via boto sdk) which does a urlfetch. Can that be done in MapReduce, or is it limited to data manipulation? – tom Feb 24 '16 at 21:33
  • I think you could add a stage to perform the urlfetch before another stage that processes the data: https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/1.2-Jobs-and-Stages – Brent Washburne Feb 24 '16 at 21:44
  • Yes, you can perform the urlfetch using a custom input reader as per https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/3.4-Readers-and-Writers. – Adam Feb 28 '16 at 00:19

When I asked this question, I had run the code once and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.

In my specific case, it was also possible to tweak the query so that if the cursor 'expires', the query can restart without a cursor from where it left off. Effectively, change the query to:

assets = models.Asset.all().order('-size').filter('size <', last_seen_size)

Where last_seen_size is a value passed from each task to the next.
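Putting that together, here's a minimal sketch of the reworked loop (my assumptions: BadRequestError is imported from google.appengine.api.datastore_errors, and the _target/_retry_options arguments from the question are omitted for brevity):

from google.appengine.api import datastore_errors
from google.appengine.ext import db, deferred

import models

def loop_assets(cursor=None, last_seen_size=None):
    assets = models.Asset.all().order('-size')
    if last_seen_size is not None:
        # Restart point: skip everything already processed.
        # (Ties on size at the boundary would be skipped too.)
        assets.filter('size <', last_seen_size)
    if cursor:
        assets.with_cursor(cursor)
    try:
        for asset in assets.run():
            last_seen_size = asset.size
            if asset.is_special():
                asset.yay = True  # idempotent: setting it twice is harmless
                asset.put()
    except db.Timeout:
        # Normal chaining: resume from the cursor.
        deferred.defer(loop_assets, cursor=assets.cursor(),
                       last_seen_size=last_seen_size, _countdown=3)
    except datastore_errors.BadRequestError:
        # Cursor expired: drop it and let the size filter resume the query.
        deferred.defer(loop_assets, cursor=None,
                       last_seen_size=last_seen_size, _countdown=3)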

– tom