8

I'm using the new experimental task queue for Java App Engine, and I'm trying to create tasks that aggregate statistics in my datastore. I want to count the number of unique values within all the entities (of a certain type) in my datastore. More concretely, say entities of type X have a field A. I want to count the number of unique values of A in my datastore.

My current approach is to create a task which queries for the first 10 entities of type X, creating a hashtable to store the unique values of A in, then passing this hashtable to the next task as the payload. This next task will count the next 10 entities and so on and so forth until I've gone through all the entities. During the execution of the last task, I'll count the number of keys in my hashtable (that's been passed from task to task all along) to find the total number of unique values of A.
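For illustration, here is a minimal sketch of that chaining pattern, written against the Python SDK's deferred library (the question is about Java, but the structure carries over); the model X, its field A, and the batch size are stand-ins from the question:

    import logging
    from google.appengine.ext import db, deferred

    class X(db.Model):
        A = db.StringProperty()  # the field whose unique values we count

    BATCH_SIZE = 10

    def count_unique(seen, last_key=None):
        # 'seen' is the set of unique A values carried from task to task;
        # deferred.defer() pickles it into the task payload.
        q = X.all().order('__key__')
        if last_key is not None:
            q.filter('__key__ >', last_key)
        batch = q.fetch(BATCH_SIZE)
        for entity in batch:
            seen.add(entity.A)
        if len(batch) == BATCH_SIZE:
            # More entities may remain: chain the next task, passing the set on.
            deferred.defer(count_unique, seen, batch[-1].key())
        else:
            logging.info('unique values of A: %d', len(seen))

    # kick off the chain:
    # deferred.defer(count_unique, set())

Note that deferred pickles its arguments into the task payload, so the growing set still runs into the same size limit the question asks about.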

This works for a small number of entities in my datastore, but I'm worried that this hashtable will get too big once I have a lot of unique values. What is the maximum allowable size for the payload of an App Engine task?

Can you suggest any alternative approaches?

Thanks.

aloo

3 Answers

14

According to the docs, the maximum task object size is 100K.

Jonathan Feinberg
  • does object size = payload size? – aloo Dec 22 '09 at 05:21
  • 4
    You need to serialize your object somehow. That's the payload. If you expect it to be more than 10k, you can use the deferred library's trick of serializing the key of a datastore entity containing the actual data. – Nick Johnson Dec 24 '09 at 19:23
  • 1
    Updated URL to quota page: https://cloud.google.com/appengine/docs/quotas#Task_Queue – Nathan Jul 13 '16 at 13:54
1

"Can you suggest any alternative approaches?".

Create an entity for each unique value by constructing a key name based on the value and using Model.get_or_insert. Then use Query.count to count up the entities in batches of 1000 (or however many you can count before your request times out; certainly more than 10), using the normal paging tricks.
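A minimal sketch of that approach, assuming the Python db API (UniqueA and the 'A:' key prefix are made-up names):

    from google.appengine.ext import db

    class UniqueA(db.Model):
        pass  # the entity's existence is the only information we need

    def record_value(value):
        # The key name is derived from the value, so each unique value maps
        # to exactly one entity; get_or_insert is transactional, so two
        # concurrent tasks cannot create duplicates.
        UniqueA.get_or_insert('A:%s' % value)

    def count_unique():
        # Page through the UniqueA keys in batches, since a single count()
        # call was capped at 1000 at the time of this question.
        total, last_key = 0, None
        while True:
            q = UniqueA.all(keys_only=True).order('__key__')
            if last_key is not None:
                q.filter('__key__ >', last_key)
            keys = q.fetch(1000)
            if not keys:
                return total
            total += len(keys)
            last_key = keys[-1]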

Or use code similar to that given in the docs for get_or_insert to keep count as you go - App Engine transactions can be run more than once, so a memcached count incremented in the transaction would be unreliable. There may be some trick around that, though, or you could keep the count in the datastore provided that you aren't doing anything too unpleasant with entity parents.
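One hedged sketch of keeping the count in the datastore: make the counter entity the parent of every marker entity, so the existence check and the increment share an entity group and can run in one transaction (Counter and Marker are hypothetical names; this serializes all writes through a single entity group, which is the "unpleasant" part):

    from google.appengine.ext import db

    class Counter(db.Model):
        count = db.IntegerProperty(default=0)

    class Marker(db.Model):
        pass

    def record_value(value):
        counter_key = db.Key.from_path('Counter', 'unique-A')

        def txn():
            counter = Counter.get(counter_key)
            if counter is None:
                counter = Counter(key_name='unique-A')
            # Marker is a child of the counter, so both reads and writes
            # stay inside one entity group for the transaction.
            marker = Marker.get_by_key_name('A:%s' % value, parent=counter_key)
            if marker is None:
                Marker(key_name='A:%s' % value, parent=counter_key).put()
                counter.count += 1
                counter.put()

        db.run_in_transaction(txn)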

Steve Jessop
0

This may be too late, but perhaps it can be of use. First, any time there's a chance you'll want to walk serially through a set of entities, I suggest adding an indexed, automatically populated date_created or date_modified field. From there, you can create a model with a TextProperty to store your hash table using json.dumps(). All you need to pass along is the last date processed and the model id of the hash-table entity. Each task does a query for entities with date_created later than the last date, json.loads() the TextProperty, and accumulates the next 10 records. This could get a bit more sophisticated (e.g. handling date_created collisions by using the parameters passed and a slightly different query approach). Add a 1-second countdown to the next task to avoid any issues with updating the hash-table entity too quickly. HTH, -stevep
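A rough sketch of this recipe in the Python db API (HashTableHolder and process_batch are invented names, and the model X with field A is carried over from the question):

    import json
    import logging
    from google.appengine.ext import db, deferred

    class X(db.Model):
        A = db.StringProperty()
        # auto_now_add stamps the entity once at creation; indexed by default.
        date_created = db.DateTimeProperty(auto_now_add=True)

    class HashTableHolder(db.Model):
        # JSON-encoded set of unique A values seen so far, stored as a dict.
        table = db.TextProperty(default='{}')

    def process_batch(holder_id, last_date=None):
        holder = HashTableHolder.get_by_id(holder_id)
        seen = json.loads(holder.table)

        q = X.all().order('date_created')
        if last_date is not None:
            # Note: '>' skips entities sharing the same timestamp; the
            # collision case mentioned above would need extra handling.
            q.filter('date_created >', last_date)
        batch = q.fetch(10)

        for entity in batch:
            seen[entity.A] = True
        holder.table = json.dumps(seen)
        holder.put()

        if batch:
            # 1-second countdown to avoid rewriting the holder entity too fast.
            deferred.defer(process_batch, holder_id,
                           batch[-1].date_created, _countdown=1)
        else:
            logging.info('unique values of A: %d', len(seen))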

stevep