
I'm trying to iterate over a huge number of datastore records, currently about 330,000. Conceptually, each record has a row, a column, and a value, and I'm iterating over the records and constructing a matrix, which I'll then use for calculations.
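To make that concrete, the matrix-building step amounts to something like this (plain dicts with made-up values standing in for the real datastore records; the field names match the query below):

```python
# Simplified stand-in for the matrix-building step: each record contributes
# one cell, keyed by (prod_id, cause). Plain dicts replace the real
# datastore entities for illustration.
records = [
    {'prod_id': 'p1', 'cause': 'c1', 'value': 2.0},
    {'prod_id': 'p1', 'cause': 'c2', 'value': 3.5},
    {'prod_id': 'p2', 'cause': 'c1', 'value': 1.0},
]

matrix = {}
for pc in records:
    # row is prod_id, col is cause, cell is value
    matrix[(pc['prod_id'], pc['cause'])] = pc['value']
```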

The error I get is: Timeout: The datastore operation timed out, or the data was temporarily unavailable.

[ADDED: Note that my issue is not an App Engine request timeout. Running as a cron job, I have plenty of time, and the datastore error happens well before the App Engine timeout. Also, I have tried the answers given in other questions, as I mention below.]

The error happens after the iteration has run over fewer than 100,000 of the records.

My current code, which I wrote after consulting past related threads, is:

    prodcauses_query = ProdCause.query(projection=['prod_id', 'value', 'cause']).filter(ProdCause.seller_id == seller_id)
    for pc in prodcauses_query.iter(read_policy=ndb.EVENTUAL_CONSISTENCY, deadline=600):
        ### COPY DATA IN RECORD PC INTO A MATRIX
        ### row is prod_id, col is cause, value is value

Is there a better way to do this than `iter()`? Are there better settings for `batch_size`, `deadline`, or `read_policy`?

Note that this process runs in a cron job, so it doesn't bother me if it takes a long time. The rest of the process takes a few seconds; the hard part has been reading in the data.

Thanks for any thoughts!

Dan McGrath

2 Answers


Two options:

  • Use the MapReduce library for App Engine to run over all of your entities, and do the per-entity work you need in the map phase. A tutorial can be found here: MapReduce on App Engine made easy
  • Or, use cursors and tasks with a limited query size. That is, your cron job runs over the first batch of entities, and if any remain, it starts another task with the cursor from the query it just ran.
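A minimal sketch of the cursor approach. The page-fetching function is abstracted out so the control flow is runnable without the App Engine SDK; with NDB the call would be `results, cursor, more = query.fetch_page(BATCH_SIZE, start_cursor=cursor)`, and in the real setup each batch would run in its own task with the cursor passed along in the payload:

```python
BATCH_SIZE = 5  # illustrative; in NDB this would be the fetch_page page size

def process_in_batches(fetch_page, handle_record):
    """Repeatedly fetch one page of results and process it, resuming
    from a cursor each time.

    fetch_page(cursor) -> (records, next_cursor, more)
    """
    cursor = None
    while True:
        records, cursor, more = fetch_page(cursor)
        for record in records:
            handle_record(record)
        if not more:
            break

# Tiny in-memory stand-in for the datastore, to show the control flow.
# An integer offset plays the role of the opaque query cursor.
def fake_fetch_page(cursor):
    data = list(range(12))
    start = cursor or 0
    page = data[start:start + BATCH_SIZE]
    next_cursor = start + BATCH_SIZE
    return page, next_cursor, next_cursor < len(data)

seen = []
process_in_batches(fake_fetch_page, seen.append)
# seen now holds all twelve records, fetched five at a time
```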
MeLight

You haven't said whether you're using a task queue, so I will assume you aren't.

A cron job should start a task to do your processing; otherwise the handler will still have a 60-second deadline. Running it as a task gives you a 10-minute deadline.

Then consider your batch size: specifying large batch sizes reduces the number of round trips to the datastore.

Lastly, if the job runs for long periods, you can either chain tasks (watch how long you have been running and start a new task to continue where you left off) or look at MapReduce jobs.
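A runnable sketch of the chaining idea. The `make_task` helper and the list-based queue are stand-ins for real task-queue machinery; on App Engine you would re-enqueue with `deferred.defer(...)` or `taskqueue.add(...)`, passing the query cursor's `urlsafe()` string in the task payload:

```python
# Each "task" processes one batch, then enqueues a successor task that
# resumes from the cursor it reached. fetch_page(cursor) -> (records,
# next_cursor, more), mirroring NDB's query.fetch_page().

def make_task(fetch_page, handle_record, cursor=None):
    def task(enqueue):
        records, next_cursor, more = fetch_page(cursor)
        for record in records:
            handle_record(record)
        if more:
            # In a real app: deferred.defer(..., cursor=next_cursor.urlsafe())
            enqueue(make_task(fetch_page, handle_record, next_cursor))
    return task

# In-memory stand-ins for the datastore and the task queue:
def fake_fetch_page(cursor):
    data = list(range(7))
    start = cursor or 0
    page = data[start:start + 3]
    return page, start + 3, start + 3 < len(data)

processed = []
queue = [make_task(fake_fetch_page, processed.append)]
while queue:                      # the task-queue service plays this role
    queue.pop(0)(queue.append)
# processed now holds all seven records, three per task
```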

Tim Hoffman
  • cron jobs have the same deadline as tasks. Starting a task to get retries is still useful though. – Greg Oct 04 '15 at 07:57
  • I missed the limit increase (it appears to be documented only in the SDK 1.4 release notes, as far as I can tell). I have always had cron start tasks, as tasks are retryable whereas cron requests are not. – Tim Hoffman Oct 04 '15 at 09:10