
I'm trying to iterate over a huge number of datastore records, currently about 330,000. Conceptually, each record has a row, a column, and a value, and I'm iterating over the records and constructing a matrix, which I'll then use for calculations.
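To make that concrete, the matrix-building step amounts to something like this (plain dicts with made-up values standing in for the real datastore records; the field names match the query below):

```python
# Simplified stand-in for the matrix-building step: each record contributes
# one cell, keyed by (prod_id, cause). Plain dicts replace the real
# datastore entities for illustration.
records = [
    {'prod_id': 'p1', 'cause': 'c1', 'value': 2.0},
    {'prod_id': 'p1', 'cause': 'c2', 'value': 3.5},
    {'prod_id': 'p2', 'cause': 'c1', 'value': 1.0},
]

matrix = {}
for pc in records:
    # row is prod_id, col is cause, cell is value
    matrix[(pc['prod_id'], pc['cause'])] = pc['value']
```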

The error I get is: Timeout: The datastore operation timed out, or the data was temporarily unavailable.

[ADDED: Note that my issue is not an App Engine request timeout. Running as a cron job, I have plenty of time, and the datastore error happens well before the App Engine timeout. Also, I have tried the answers given in other questions, as I mention below.]

The error happens after the iteration has run over fewer than 100,000 of the records.

My current code, which I wrote after consulting past related threads, is:

    prodcauses_query = ProdCause.query(projection=['prod_id', 'value', 'cause']).filter(ProdCause.seller_id == seller_id)
    for pc in prodcauses_query.iter(read_policy=ndb.EVENTUAL_CONSISTENCY, deadline=600):
        ### COPY DATA IN RECORD PC INTO A MATRIX
        ### row is prod_id, col is cause, value is value

Is there a better way to do this than `iter()`? Are there better settings for `batch_size`, `deadline`, or `read_policy`?

Note that this process runs in a cron job, so it doesn't bother me if it takes a long time. The rest of the process takes a few seconds; the hard part has been reading in the data.

Thanks for any thoughts!

Dan McGrath

2 Answers


Two options:

  • Use the MapReduce library for App Engine to run over all of your entities, and do the per-entity work you need in the map phase. A tutorial can be found here: MapReduce on App Engine made easy
  • Or, use cursors and tasks with a limited query size. That is, your cron job runs over the first batch of entities, and if any remain, it starts another task with the cursor from the query it just ran.
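A minimal sketch of the cursor approach. The page-fetching function is abstracted out so the control flow is runnable without the App Engine SDK; with NDB the call would be `results, cursor, more = query.fetch_page(BATCH_SIZE, start_cursor=cursor)`, and in the real setup each batch would run in its own task with the cursor passed along in the payload:

```python
BATCH_SIZE = 5  # illustrative; in NDB this would be the fetch_page page size

def process_in_batches(fetch_page, handle_record):
    """Repeatedly fetch one page of results and process it, resuming
    from a cursor each time.

    fetch_page(cursor) -> (records, next_cursor, more)
    """
    cursor = None
    while True:
        records, cursor, more = fetch_page(cursor)
        for record in records:
            handle_record(record)
        if not more:
            break

# Tiny in-memory stand-in for the datastore, to show the control flow.
# An integer offset plays the role of the opaque query cursor.
def fake_fetch_page(cursor):
    data = list(range(12))
    start = cursor or 0
    page = data[start:start + BATCH_SIZE]
    next_cursor = start + BATCH_SIZE
    return page, next_cursor, next_cursor < len(data)

seen = []
process_in_batches(fake_fetch_page, seen.append)
# seen now holds all twelve records, fetched five at a time
```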
MeLight

You haven't said whether you're using a task queue, so I will assume you aren't.

A cron job should start a task to do your processing; otherwise the handler will still have a 60-second deadline. Running it as a task gives you a 10-minute deadline.

Then consider your batch size: specifying large batch sizes reduces the number of round trips to the datastore.

Lastly, if the job runs for long periods, you can either chain tasks (watch how long you have been running and start a new task to continue where you left off) or look at MapReduce jobs.
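A runnable sketch of the chaining idea. The `make_task` helper and the list-based queue are stand-ins for real task-queue machinery; on App Engine you would re-enqueue with `deferred.defer(...)` or `taskqueue.add(...)`, passing the query cursor's `urlsafe()` string in the task payload:

```python
# Each "task" processes one batch, then enqueues a successor task that
# resumes from the cursor it reached. fetch_page(cursor) -> (records,
# next_cursor, more), mirroring NDB's query.fetch_page().

def make_task(fetch_page, handle_record, cursor=None):
    def task(enqueue):
        records, next_cursor, more = fetch_page(cursor)
        for record in records:
            handle_record(record)
        if more:
            # In a real app: deferred.defer(..., cursor=next_cursor.urlsafe())
            enqueue(make_task(fetch_page, handle_record, next_cursor))
    return task

# In-memory stand-ins for the datastore and the task queue:
def fake_fetch_page(cursor):
    data = list(range(7))
    start = cursor or 0
    page = data[start:start + 3]
    return page, start + 3, start + 3 < len(data)

processed = []
queue = [make_task(fake_fetch_page, processed.append)]
while queue:                      # the task-queue service plays this role
    queue.pop(0)(queue.append)
# processed now holds all seven records, three per task
```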

Tim Hoffman
  • cron jobs have the same deadline as tasks. Starting a task to get retries is still useful though. – Greg Oct 04 '15 at 07:57
  • I missed the limit increase (it appears to be documented only in the SDK 1.4 release notes, as far as I can tell). I have always had cron start tasks, as tasks are retryable whereas cron requests are not. – Tim Hoffman Oct 04 '15 at 09:10