I need to perform a nightly update in the datastore on a relatively large dataset (syncing a subset of corporate data with GAE). I've been using the bulkloader, and it does the job, but the write costs are really adding up. Since I'm specifying key strings for each entity, the bulkloader is essentially rewriting the ENTIRE entity for every record it loads, which, in my case, is about 90 writes PER ENTITY. (It's a large, flat dataset with a lot of indexes.) But within my dataset, only six of my 50 properties actually change overnight, so I'm doing a lot of redundant writing.
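(For anyone checking my math: as I read the billing docs, a put on a new entity costs roughly 2 writes plus 2 writes per indexed property value, so my ~44 indexed properties come out to about 2 + 88 = 90 writes per entity.)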
My first thought was to keep a cache of the prior night's build, loop through it for changes, get each changed entity, and then execute a put() with the properties that need it. This works effectively to reduce writes, but takes a LONG time -- even when I batch the put()s. It only takes ~3 minutes to load the ENTIRE dataset with the bulkloader -- and 16-18 minutes just to run the updates! (I'm using the remote API, BTW.) This won't work when I scale up.
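For reference, my update pass looks roughly like this (heavily simplified, with made-up property names; the real model has ~50 properties):

```python
from google.appengine.ext import ndb

class Record(ndb.Model):
    # ~50 properties in the real model; only a handful change overnight
    price = ndb.FloatProperty()
    qty = ndb.IntegerProperty()

def nightly_update(prior, current, batch_size=500):
    """prior/current: dicts mapping key name -> dict of property values."""
    # Diff against last night's cached build to find the dirty entities.
    changed = [name for name, row in current.items() if prior.get(name) != row]
    for i in range(0, len(changed), batch_size):
        names = changed[i:i + batch_size]
        keys = [ndb.Key(Record, n) for n in names]
        entities = ndb.get_multi(keys)   # one batched read RPC
        for ent, name in zip(entities, names):
            # Assumes every key already exists in the datastore.
            ent.price = current[name]['price']
            ent.qty = current[name]['qty']
        ndb.put_multi(entities)          # one batched write RPC
```

Even batched like this, every dirty entity still makes a round trip over the remote API, which I assume is where most of the 16-18 minutes goes.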
I tried using an ndb.KeyProperty in my model and only updating the changed fields via the bulkloader, but then I lose the ability to query/sort on the fields behind the KeyProperty, which I need.
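Concretely, the split looked something like this (names are illustrative):

```python
from google.appengine.ext import ndb

class Volatile(ndb.Model):      # the six nightly-changing fields
    price = ndb.FloatProperty()

class Record(ndb.Model):        # the ~44 stable fields
    name = ndb.StringProperty()
    volatile = ndb.KeyProperty(kind=Volatile)

# Nightly I can rewrite just the small Volatile entities, but the
# datastore has no joins, so there's no way to sort Record results
# by a property that lives on the referenced Volatile entity.
```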
I also tried a StructuredProperty, which DOES let you query/sort, but a structured property can't be given an ID of its own, so I can't load just the structured property.
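For comparison, the StructuredProperty version (same illustrative names):

```python
from google.appengine.ext import ndb

class Volatile(ndb.Model):
    price = ndb.FloatProperty()

class Record(ndb.Model):
    name = ndb.StringProperty()
    vol = ndb.StructuredProperty(Volatile)

# Query/sort works because the sub-fields are flattened into Record:
query = Record.query().order(Record.vol.price)
# ...but `vol` has no key or ID of its own -- it exists only inside
# its Record -- so there's nothing for the bulkloader to target separately.
```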
So...is there a way for me to reduce these writes and keep the functionality I need? Can I use the bulkloader to update changes only? Do I need to restructure my dataset??
The problem with using the child/parent approach is that I sacrifice query/sort capabilities... – Simon Phoenix May 22 '14 at 15:21