4

For 100k+ entities in Google Datastore, ndb.query().count() gets cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() or fetch_page() return a cursor; count() doesn't.

How can I count such a large number of entities?

Ray Yun
  • I'm having the same issue, inside a task queue. The issue isn't the 60-second `DeadlineExceededError` for front-end requests, it's the 60-second (ish?) deadline for the query itself, `_ToDatastoreError(err)... Timeout: The datastore operation timed out, or the data was temporarily unavailable.` Well, I'm guessing that's OP's problem. – Zach Young Jul 27 '17 at 21:50

3 Answers

2

To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which you can use to simplify the whole process of running background tasks.

Here is an example of how you could use the deferred library in your app:

import logging

def count_large_query(query):
  total = query.count()
  logging.info('Total entities: %d' % total)

Then you can call the above function from within your app like:

from google.appengine.ext import deferred

# Somewhere in your request:
# MyModel stands in for your ndb.Model subclass:
deferred.defer(count_large_query, MyModel.query())

While I'm still not sure whether count() will return any results with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):

import logging

LIMIT = 1024

def count_large_query(query):
  # Page through the results with keys-only fetches, accumulating the
  # count until fetch_page() reports that no more results remain.
  cursor = None
  more = True
  total = 0
  while more:
    ndbs, cursor, more = query.fetch_page(LIMIT, start_cursor=cursor, keys_only=True)
    total += len(ndbs)

  logging.info('Total entities: %d' % total)

To try the above locally, set LIMIT to 4 and check that the Total entities: ## line shows up in your console.


As Guido mentioned in the comments, this is not going to scale either:

This still doesn't scale (though it may postpone the problem). A task has a 10-minute limit instead of a 1-minute one, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work).

So you might want to take a look at the best practices for writing scalable applications, and especially at sharding counters.
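
For reference, here is a minimal sketch of the sharded counter pattern those docs describe; the CounterShard model, the NUM_SHARDS value, and the key naming scheme are illustrative choices, not part of the original answer:

import random

from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput, slightly costlier reads

class CounterShard(ndb.Model):
  count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
  # Spread writes across shards so no single entity becomes a write hot spot.
  shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
  shard = CounterShard.get_by_id(shard_id)
  if shard is None:
    shard = CounterShard(id=shard_id)
  shard.count += 1
  shard.put()

def get_count(name):
  # Reading the total is a single batch get over all shard keys.
  keys = [ndb.Key(CounterShard, '%s-%d' % (name, i)) for i in range(NUM_SHARDS)]
  return sum(shard.count for shard in ndb.get_multi(keys) if shard)

The trade-off: you must maintain the counter on every write, but reading the total becomes cheap and deadline-proof.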

Lipis
  • Hm. This still doesn't scale (though it may postpone the problem). A task has a 10-minute limit instead of a 1-minute one, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work). – Guido van Rossum Feb 05 '13 at 15:58
  • @RayYan I updated my answer after Guido's comment.. so take that into consideration as well. – Lipis Feb 05 '13 at 16:35
2

This is indeed a frustrating issue. I've been doing some work in this area lately to get some general count stats - basically, the number of entities that satisfy some query. count() is a great idea, but it is hobbled by the datastore RPC timeout.

It would be nice if count() supported cursors somehow, so that you could cursor across the result set and simply add up the resulting integers rather than returning a large list of keys only to throw them away. With cursors you could continue across all the 1-minute / 10-minute boundaries, using the "pass the baton" deferred approach. With count() (as opposed to fetch(keys_only=True)) you could greatly reduce the waste and hopefully increase the speed of the RPC calls; as it stands, it takes a shocking amount of time to count to 1,000,000 using the fetch(keys_only=True) approach - an expensive proposition on backends.
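
To make the "pass the baton" idea concrete, here is an untested sketch that chains deferred tasks, carrying a websafe cursor and a running total between runs; the function name and the 'Account' kind are made up for illustration:

import logging

from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import deferred, ndb

BATCH = 1000  # keys per task run; tune so each run finishes well under the deadline

def count_baton(kind_name, cursor_urlsafe=None, total=0):
  # Rebuild the query and cursor from plain strings so the deferred
  # payload pickles cleanly across task invocations.
  query = ndb.Query(kind=kind_name)
  cursor = Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
  keys, cursor, more = query.fetch_page(BATCH, start_cursor=cursor, keys_only=True)
  total += len(keys)
  if more:
    # Pass the baton: re-enqueue ourselves with the new cursor and the tally.
    deferred.defer(count_baton, kind_name, cursor.urlsafe(), total)
  else:
    logging.info('%s total: %d', kind_name, total)

# Kick it off with, e.g.: deferred.defer(count_baton, 'Account')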

Sharded counters are a lot of overhead if you only need/want periodic count statistics (e.g., a daily count of all my accounts in the system by, e.g., country).

JasonC
  • The solution here is to use `db.Query.count` instead, which *does* support cursors. It's the difference between many minutes counting and several seconds, when the count is on the order of millions. Using `deferred.defer` to store and forward results, you can count up to 2M in two 60s requests. Note that you can use `db.Query.count` even if you're using an `ndb` mapping. – technomage Apr 06 '14 at 12:17
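
To illustrate what that comment describes, here is an untested sketch of chunked counting with the older db API; the function and parameter names are illustrative, and the key point is resuming count() from a cursor via with_cursor()/cursor():

from google.appengine.ext import db

def count_in_chunks(kind_name, chunk=100000):
  # db.Query can count up to a limit, expose the cursor where counting
  # stopped, and resume from there - unlike ndb's count().
  query = db.Query(None, kind=kind_name, keys_only=True)
  total = 0
  while True:
    matched = query.count(limit=chunk)
    total += matched
    if matched < chunk:
      return total
    query.with_cursor(query.cursor())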
1

It is better to use Google App Engine backends. Backends are exempt from the 60-second deadline for user requests and the 10-minute deadline for tasks, and can run indefinitely. Please take a look at the documentation here: https://developers.google.com/appengine/docs/java/backends/overview
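
As a sketch: assuming you have declared a backend named counter in backends.yaml (the name is illustrative), you can route the counting work to it through the deferred library's standard _target option:

from google.appengine.ext import deferred

# '_target' is a standard deferred/taskqueue option; 'counter' must match
# a backend declared in backends.yaml. count_large_query is the function
# from the accepted answer above, and MyModel stands in for your model.
deferred.defer(count_large_query, MyModel.query(), _target='counter')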

Evan
  • Thanks, but even with backends, the 60s query deadline is unavoidable. – Ray Yun Feb 04 '13 at 11:29
  • This is true; it's the database API that's timing out rather than the platform you are calling it from. It's unavoidable, as I found out from a similar question/answer. – Paul Collingwood Feb 04 '13 at 11:41