I have an App Engine app using the go115
runtime that interacts with Cloud Datastore via the google.golang.org/appengine/v2/datastore
package. The app inserts and queries Song
entities with automatically-assigned integer IDs and a Rating
field (along with other fields):
type Song struct {
// ...
Rating float64 `json:"rating"`
// ...
}
With 19,900 Song
entities, https://console.cloud.google.com/datastore/stats reports a data size of 15.53 MB, a built-in index size of 102.27 MB, and a composite index size of 86.31 MB. The app is running in us-central
and Datastore is in nam5 (us-central)
.
The following keys-only query consistently takes 2-4 seconds to return 13,571 keys when run from App Engine:
q := datastore.NewQuery("Song").KeysOnly().Filter("Rating >=", 0.75)
keys, err := q.GetAll(ctx, nil)
Using Run
instead of GetAll
and collecting the IDs while iterating seems to be a tiny bit faster. I also tried using the cloud.google.com/go/datastore
package from my laptop just to make sure that the older appengine
package isn't particularly slow, but the results were far worse (7-9 seconds).
My understanding (based on things like the Indexes document and the I/O 2010 "Next gen queries" talk) is that Datastore automatically maintains an index for the Rating
field. I would expect it to be trivial to satisfy this query using that index. The integer IDs in the result set seem like they should consume about 100 KB of space.
I haven't found any official guidance from Google about what type of performance to expect from Datastore, but this is far slower than I'd hoped for. When I perform a brute-force approximation of this query on my laptop by running the jq tool and grep
against a JSON dump of the data, it only takes 400 ms.
Is it expected that a trivial keys-only Datastore query will take multiple seconds to return thousands of keys?
When I run keys-only queries that return smaller result sets, the performance is better: queries that return 300-400 keys often take 100-200 ms, while queries that return fewer than 200 keys complete in 40-70 ms.
I'm unable to modify many of my queries to limit the number of results they return:
I sometimes perform multiple queries in parallel and then manually intersect or subtract their results to get around the Datastore limitation that inequality filters are limited to at most one property.
I sometimes get the full result set in order to return a random subset of it. I've read about approaches people take to simulate randomized results when using Datastore, but I don't think they'll work here since my queries already use inequality filters.
I can think of hacks that may help in some cases (e.g. sharding/partitioning my entities so I can run multiple queries in parallel), but I'd prefer to avoid that if possible.