
I need to read all the entries in a Google App Engine datastore to do some initialization work. There are a lot of entities (80k currently) and this number continues to grow. I'm starting to hit the 30-second datastore query timeout.

Are there any best practices for how to shard these types of huge reads in the datastore? Any examples?

user1617999
    Could you explain the use case? – Sebastian Kreft Aug 23 '12 at 20:26
  • I have a query which basically just does a scan on my datastore for entities of a particular kind. There are about 80k of them in there and they take a long time to read, about 45 seconds. This exceeds the datastore read timeout which means that these table scans fail. I'm trying to understand how I can somehow break up my reads into small chunks or otherwise push this to some longer deadline type of processing so that my initialization won't fail. Also, the number of entities I have (80k today) is likely to grow so I'd like this to work for 800k entities. @SebastianKreft – user1617999 Aug 23 '12 at 21:01
  • 2
    Sounds like a job for a mapreduce. – Daniel Roseman Aug 23 '12 at 21:44
  • Without knowing more about what the data is and why you would want to query so much of it at once all I can do is agree with @DanielRoseman that mapreduce tends to be a good tool for jobs of this size. With more information about the reasoning and purpose behind the query and data we may be able to provide better advice. – Bryce Cutt Aug 23 '12 at 21:46

2 Answers


You can tackle this in several ways:

  1. Execute your code on a Task Queue, which has a 10-minute deadline instead of 30s (more like 60s in practice). The easiest way to do this is via a DeferredTask; see the sketch after this list.

    Warning: a DeferredTask must be serializable, so it's hard to pass complex data to it. Also, don't make it an inner class.

  2. See backends. Requests served by a backend instance have no time limit.

  3. Finally, if you need to break up a big task and execute it in parallel, then look at mapreduce.
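
A minimal sketch of option 1 in Python (the DeferredTask mentioned above is the Java Task Queue API; the Python analogue is the deferred library). MyEntity and init_all_entities are illustrative names, not from the original answer:

from google.appengine.ext import db, deferred

class MyEntity(db.Model):              # hypothetical kind, for illustration only
    initialized = db.BooleanProperty(default=False)

def init_all_entities():
    # Runs on the task queue, so it gets the 10-minute task deadline
    # instead of the ~60s frontend request deadline.
    for entity in MyEntity.all():
        entity.initialized = True
        entity.put()

# Enqueue the work from a normal request handler:
deferred.defer(init_all_entities)

For very large kinds you would still page through the entities with a query cursor inside the task, as the comments below describe.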

Peter Knego
  • Thanks for this analysis. I think you're basically right. The one question I have is about the Task deadline: my understanding is that the datastore read deadline is still 30s whether it's in a Task or a regular servlet/JSP page. If that's true, are there any good ways to chunk up the reads using multiple Tasks? I'm thinking of something like starting 10 Tasks that each read a section of keys. I know this is basically what MR does, but just wondering. – user1617999 Aug 24 '12 at 17:38
  • 1
    Yes, that's what we're doing: querying the datastore with a cursor, collecting 1000 entities on every iteration, then creating a DeferredTask and passing the data to it. This works without a problem: we process 2M entities within minutes (2000 tasks). – Peter Knego Aug 24 '12 at 21:32
  • But you should also rethink your architecture: in NoSQL it's better to calculate in real-time when data comes in instead of running large expensive queries like this. – Peter Knego Aug 24 '12 at 21:34
  • Also, the datastore deadline only applies if you make a read operation that's too large: for example, making a read and passing 1M keys. For small reads, like 1000 entities, it's not an issue. – Peter Knego Aug 24 '12 at 21:44
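
A minimal Python sketch of the cursor-plus-deferred chunking pattern described in the comments above (the original setup is Java-based; fan_out, process_batch, MyEntity, and BATCH_SIZE are illustrative names, not from the answer):

from google.appengine.ext import db, deferred

BATCH_SIZE = 1000

def process_batch(keys):
    # Hypothetical per-chunk work; each call runs in its own deferred task.
    for entity in db.get(keys):
        pass  # initialization work for each entity goes here

def fan_out(kind, cursor=None):
    # Walk the kind with a query cursor, hand each chunk of keys to a
    # separate deferred task, then re-enqueue this function for the next chunk.
    query = kind.all(keys_only=True)
    if cursor:
        query.with_cursor(cursor)
    keys = [str(k) for k in query.fetch(BATCH_SIZE)]   # string-encoded keys pickle cleanly
    if not keys:
        return                                         # done
    deferred.defer(process_batch, keys)                # process this chunk in parallel
    deferred.defer(fan_out, kind, query.cursor())      # continue the scan

# Kick it off from a request handler (MyEntity is a hypothetical db.Model kind):
# deferred.defer(fan_out, MyEntity)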

This answer on StackExchange served me well:

Expired queries and appengine

I had to modify it slightly to make it work for me:

import logging

def loop_over_objects_in_batches(batch_size, object_class, callback):
    # object_class is a db.Model subclass; callback is invoked once per entity.
    num_els = object_class.all().count(limit=None)   # limit=None so the count isn't capped at the default 1000
    num_loops = num_els // batch_size                # number of full batches (integer division)
    remainder = num_els - num_loops * batch_size
    logging.info("Calling batched loop with batch_size: %d, num_els: %s, num_loops: %s, "
                 "remainder: %s, object_class: %s, callback: %s",
                 batch_size, num_els, num_loops, remainder, object_class, callback)
    offset = 0
    while offset < num_loops * batch_size:
        logging.info("Processing batch (%d:%d)", offset, offset + batch_size)
        for entity in object_class.all().fetch(batch_size, offset=offset):
            callback(entity)
        offset += batch_size

    if remainder:
        logging.info("Processing remainder batch (%d:%d)", offset, num_els)
        for entity in object_class.all().fetch(remainder, offset=offset):
            callback(entity)
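
As a rough usage example (MyEntity and init_entity are hypothetical names, not part of the original answer):

def init_entity(entity):
    # hypothetical per-entity initialization
    entity.initialized = True
    entity.put()

loop_over_objects_in_batches(1000, MyEntity, init_entity)

Note that offset-based queries still make the datastore skip over all preceding entities on every batch, so for very large kinds the cursor-based approach from the other answer's comments scales better; running the loop from a deferred task also buys you the 10-minute deadline.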