I have an application which requires very flexible searching functionality. As part of this, users will need to be able to do full-text searching of a number of text fields, and also filter by a number of numeric fields whose data is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.

I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.

Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?

My only thought so far is to implement a two-step process, where the results of a full-text search are then either used in a query against the NDB datastore or manually filtered using Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.

Twistieman

1 Answer

It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.

One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well, either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.

Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
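As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes the doc id and the entity key name are the same string, and it uses an ordinary dict as a stand-in for memcache or a backend cache. `NumericCache`, `attach_numeric_data`, and `load_from_datastore` are hypothetical names for this example, not part of any App Engine API.

```python
class NumericCache(object):
    """In-process stand-in for memcache: maps doc id -> dict of numeric fields."""

    def __init__(self):
        self._by_doc_id = {}

    def put(self, doc_id, numeric_fields):
        # Call this whenever the Datastore entity is written,
        # so the cached copy stays in step with the entity.
        self._by_doc_id[doc_id] = dict(numeric_fields)

    def get_multi(self, doc_ids):
        # Returns only the ids that are cached; the caller handles misses.
        return {d: self._by_doc_id[d] for d in doc_ids if d in self._by_doc_id}


def attach_numeric_data(doc_ids, cache, load_from_datastore):
    """Return {doc_id: numeric_fields} for the docs a text search returned.

    load_from_datastore is a hypothetical callable that takes a list of ids
    and returns {id: numeric_fields}; it is only invoked for cache misses.
    """
    found = cache.get_multi(doc_ids)
    missing = [d for d in doc_ids if d not in found]
    if missing:
        loaded = load_from_datastore(missing)
        for doc_id, fields in loaded.items():
            cache.put(doc_id, fields)
        found.update(loaded)
    return found
```

With something like this in place, the frequent numeric updates only touch the entity and the cache entry, and the search documents are left alone.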

This could work well as long as the size of your cache does not need to be prohibitively large.

If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.

Amy U.
  • The major predicament I have is further filtering results using these numeric values. So, if I then need to narrow down results further based on the numeric values, should I just use this cached data, iterating through each result in the result set together with its associated numeric data from the cache, and use standard Python logical comparisons to eliminate results that don't meet the criteria? – Twistieman Aug 17 '12 at 00:19
  • Sorry, I'd missed that you wanted to filter on the numeric fields too. In that case, it does essentially boil down to doing an app-level join. (Or, maintain all the information in Search docs). However, you want to leverage the datastore's query engine if possible. So, one approach is to do a keys-only query on the entities, based on your numeric filters, then check that list of keys against the docs returned by the text query, throwing out any docs that don't intersect with the entity key list. Then, for the docs that pass that filter, fetch the associated numeric data from the cache. – Amy U. Aug 17 '12 at 01:31
  • An addendum to my comment above: that approach would be feasible if your entity query returned a manageably small result set. If not, and if the doc result set is much smaller than the entity query results, you are probably better off (as you proposed) pushing the doc id constraint to the datastore query. – Amy U. Aug 17 '12 at 02:44
  • Thanks for your help, I'm still relatively new to the App Engine environment and your advice has been invaluable in solving this problem. – Twistieman Aug 17 '12 at 04:22
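The two filtering options discussed in the comments can be sketched in plain Python. This is only an illustration under stated assumptions: the doc ids and entity key names are taken to be the same strings, the list of key names stands in for the result of a keys-only datastore query, and both function names are hypothetical.

```python
def intersect_with_entity_keys(doc_ids, entity_key_names):
    """App-level join: keep docs whose id also appears in the keys-only
    query result, preserving the order returned by the text search."""
    allowed = set(entity_key_names)
    return [doc_id for doc_id in doc_ids if doc_id in allowed]


def filter_by_cached_values(doc_ids, numeric_by_doc_id, predicate):
    """Filter in Python: keep docs whose cached numeric fields satisfy
    predicate. Docs with no cached data are dropped."""
    return [
        doc_id for doc_id in doc_ids
        if doc_id in numeric_by_doc_id and predicate(numeric_by_doc_id[doc_id])
    ]


# Example: ids from a text search, narrowed down in each of the two ways.
search_hits = ["a", "b", "c", "d"]
keys_matching_numeric_filter = ["d", "b", "x"]   # from a keys-only query
cached = {"a": {"stock": 0}, "b": {"stock": 7}, "d": {"stock": 3}}

joined = intersect_with_entity_keys(search_hits, keys_matching_numeric_filter)
in_stock = filter_by_cached_values(search_hits, cached, lambda f: f["stock"] > 0)
```

The first function suits the case where the numeric query returns a manageably small key list. The second suits the case where the text search result set is the smaller one and the numeric data is already cached.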