
In my Google App Engine app, I have a large number of entities representing people. At certain times, I want to process these entities, and it is really important that I have the most up-to-date data. There are far too many entities to put in a single entity group or to fetch in a cross-group transaction (which is limited to 25 entity groups).

As a solution, I am considering storing a list of keys in Google Cloud Storage. Since I use each person's email address as the key name, the list is just a text file of email addresses, one per line.

When I want to process all of the entities, I can do the following (rough code after the list):

  1. Read the file from Google Cloud Storage
  2. Iterate over the list of keys in batches (of, say, 100)
  3. Use ndb.get_multi() to fetch the entities (gets by key are strongly consistent, so this always returns the most recent data)
  4. Process the entities
  5. Repeat with next batch until done
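
Roughly, I imagine the code looking like this (just a sketch; the Person kind name, the file path, and process() are placeholders for my actual code):

    import cloudstorage as gcs
    from google.appengine.ext import ndb

    BATCH_SIZE = 100

    def process_all_people():
        # Read the list of email addresses (one per line) from Cloud Storage.
        with gcs.open('/my-bucket/person-keys.txt') as f:  # placeholder path
            emails = [line.strip() for line in f if line.strip()]

        for i in range(0, len(emails), BATCH_SIZE):
            keys = [ndb.Key('Person', email)
                    for email in emails[i:i + BATCH_SIZE]]
            # Gets by key are strongly consistent, so this is current data.
            for person in ndb.get_multi(keys):
                if person is not None:  # an entity may have been deleted
                    process(person)  # placeholder for the real processing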

Are there any problems with this process or is there a better way to do it?

  • Will your list of keys change often? If so, are you not just pushing the problem towards making sure you have a strongly consistent list of keys? – tx802 Oct 03 '15 at 08:41
  • I don't see how this solves your problem at all. How do you plan on updating the file? – Greg Oct 03 '15 at 11:54
  • Updating the file is not a problem. The file is updated before new entities are created. New entities are created relatively rarely. – new name Oct 03 '15 at 14:16
  • Why not add a common ancestor to all your people and do an ancestor query? If that doesn't seem feasible, I'll throw in something odd: sharded ancestors. Evenly distribute your people entities over shards of an ancestor and ancestor-query those shards when you need the up-to-date data. It's just a thought but maybe worth a try (sketch after these comments). – konqi Oct 03 '15 at 22:18
  • @konqi, interesting idea but it seems complicated to implement. I'll have to think about it more. – new name Oct 04 '15 at 04:42
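
To make konqi's suggestion concrete, a sharded-ancestor layout could look roughly like this (only a sketch; the shard count, the PersonShard kind, and the Person model are made up):

    from google.appengine.ext import ndb

    NUM_SHARDS = 20  # illustrative; choose based on write volume per shard

    class Person(ndb.Model):
        name = ndb.StringProperty()  # example property

    def person_key(email):
        # Deterministically assign each person to one shard ancestor.
        shard_id = hash(email) % NUM_SHARDS + 1  # key IDs must be nonzero
        return ndb.Key('PersonShard', shard_id, 'Person', email)

    def iter_all_people():
        # Ancestor queries are strongly consistent, unlike global queries.
        for shard_id in range(1, NUM_SHARDS + 1):
            ancestor = ndb.Key('PersonShard', shard_id)
            for person in Person.query(ancestor=ancestor):
                yield person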

2 Answers


If, as you say in the comments, your list changes rarely and you can't use ancestors (I assume because of the write frequency in the rest of your system), your proposed solution will work fine. You can issue as many get_multi() calls, as frequently as you wish; Datastore can handle it.

Since you mentioned you can keep that key list updated as needed, this is a good way to do it. You can stream-read a big file from Cloud Storage (one key per line) and use Datastore async reads to finish very quickly, or use Google Cloud Dataflow to do the reading and the processing/consolidating. Dataflow can also be used to generate that key list file in Cloud Storage in the first place.
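
For example, the async reads could look roughly like this (a sketch; handle() is a placeholder for whatever processing you do):

    from google.appengine.ext import ndb

    def process_batches_async(key_batches):
        # Kick off all batch reads at once; each call returns futures.
        pending = [ndb.get_multi_async(keys) for keys in key_batches]
        for futures in pending:
            # get_result() blocks only until that batch's RPCs complete.
            entities = [f.get_result() for f in futures]
            handle([e for e in entities if e is not None])  # placeholder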

Zig Mandel

You probably don't need to write your own solution; there are libraries available to help you process a large number of entities on App Engine. You could do it with MapReduce, although the preferred way now is via the Google App Engine Pipeline API.
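
For instance, a fan-out with the Pipeline API could look roughly like this (a sketch; the Person kind and process() are placeholders, and the exact import depends on how you vendor the library):

    import pipeline  # the appengine-pipelines package
    from google.appengine.ext import ndb

    class ProcessAllPipeline(pipeline.Pipeline):
        def run(self, emails):
            # Fan out one child pipeline per batch of 100 keys.
            for i in range(0, len(emails), 100):
                yield ProcessBatchPipeline(emails[i:i + 100])

    class ProcessBatchPipeline(pipeline.Pipeline):
        def run(self, batch):
            keys = [ndb.Key('Person', email) for email in batch]
            for person in ndb.get_multi(keys):
                if person is not None:
                    process(person)  # placeholder

    # Started from a handler with: ProcessAllPipeline(emails).start()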

Julian Go
  • My concern is making sure I get the most recent data (avoiding eventual consistency), and I don't think map reduce will do that. Also, I don't have so much data that I would need map reduce, just too much data to put in a single entity group or transaction. – new name Oct 02 '15 at 21:40
  • This does not answer the question. OP is right: you can't use MapReduce, because you only get fresh results when you get by key. – Zig Mandel Oct 05 '15 at 04:21