
We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.

One complication is that our entities have child entities that we also want to purge. No problem, we thought: we'd just query the datastore for those children, and drop them at the same time as the parent:

query = ParentKind.all()
query.filter('bar =', 'foo')
to_delete = []
# fetch(100) caps us at 100 matching parents per batch
for entity in query.fetch(100):
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))
db.delete(to_delete)

We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind has around 40 ChildKindA and ChildKindB children in total, so one batch touches perhaps 4,000 entities.

This seemed reasonable at the time, but we ran one batch as a test, and it took 9 seconds of real time to run -- and spent 1,933 seconds of billable CPU time accessing the datastore.

This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?

Update

We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.

Next, we installed Appstats in our app (thanks, kevpie) and ran a smaller batch -- 10 parent entities, which would amount to ~450 entities total. Here's the updated code:

query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')
to_delete = []
# keys_only queries return keys, not full entities
for key in query.fetch(10):
    to_delete.append(key)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))
db.delete(to_delete)

The results from Appstats:

service.call           #RPCs  real time  api time
datastore_v3.RunQuery  22     352ms      555ms
datastore_v3.Delete    1      366ms      132825ms
taskqueue.BulkAdd      1      7ms        0ms

The Delete call is the single most expensive part of the operation!

Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo query.
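For reference, the batch-and-delete loop we have in mind can be factored into pure Python; this is a minimal sketch, where `fetch_batch` stands in for the keys_only query (e.g. `query.fetch(batch_size)`) and `delete_keys` stands in for `db.delete` -- both names are hypothetical:

```python
def purge_in_batches(fetch_batch, delete_keys, batch_size=100):
    """Fetch up to batch_size keys at a time and delete them until none remain.

    Returns the total number of keys deleted.
    """
    total = 0
    while True:
        keys = fetch_batch(batch_size)
        if not keys:
            return total
        delete_keys(keys)
        total += len(keys)
```

On App Engine, a loop like this would live inside a deferred task that re-enqueues itself between batches, to stay under the request deadline.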

Blair Holloway
    Instead of `db.delete(models)` try using `key_only` queries with `db.delete(keys)`. You can avoid the serialization and return of the entities in your loop. Sounds like you want to delete an entire entity group at a time. – kevpie Dec 15 '10 at 09:01
  • Ah, of course! I hadn't thought to use a `keys_only` query. I'll need to try that out and see what it yields. – Blair Holloway Dec 15 '10 at 09:06
  • Also, no, I'm not using Appstats. – Blair Holloway Dec 15 '10 at 09:06
  • 1
    It is your new friend. http://code.google.com/appengine/docs/python/tools/appstats.html It is one of GVR's creations. It provides for you visuals for appengine api calls. So each query in your loop and the db.delete will show their cost. – kevpie Dec 15 '10 at 10:41
  • You're also deleting entities one at a time (with possible children) - a batch delete would be much much faster. – Nick Johnson Dec 15 '10 at 21:56

2 Answers


We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.

Nick Johnson

If you want to spread out the CPU burn, you could create a mapreduce job. It will still iterate over every entity (this is a current limitation of the mapper API), but you can check whether each entity meets the condition and decide whether to delete it at that time.

To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.
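A mapper handler along these lines might look like the following sketch. On App Engine you would yield the mapreduce library's `op.db.Delete` operation; the stand-in `Delete` class below is only there so the sketch is self-contained outside App Engine, and the handler name `process` is illustrative:

```python
try:
    from mapreduce import operation as op  # real library, if available
    Delete = op.db.Delete
except ImportError:
    class Delete(object):  # stand-in for op.db.Delete
        def __init__(self, entity):
            self.entity = entity

def process(entity):
    """Mapper handler: called once per ParentKind entity."""
    if getattr(entity, 'bar', None) == 'foo':
        yield Delete(entity)  # only matching entities are deleted
```

The mapper framework collects the yielded operations and applies the deletes in batches, so only the entities that match `bar = 'foo'` are removed even though every entity is visited.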

Luke Francl