What would be the right way to search riak-search for documents that need correction, and then update them?
By design, riak-search is an index that may NOT be in sync with the riak-kv content. I expect that under heavy check/write operations my index won't match my riak-kv content.
I count on riak-search to restrict read/write operations to a limited number of matching entries.
I really can't operate using this kind of algorithm:
    page = 0
    while True:
        results = riak.search('index', 'sex:male', start=page)
        if results['num_found'] == 0:
            break
        for r in results['docs']:
            # look up the riak-kv object from the bucket type, bucket and key stored in the index
            obj = riak.bucket_type(r['_yz_rt']).bucket(r['_yz_rb']).get(r['_yz_rk'])
            # alter object
            obj.store()
        page = page + len(results['docs'])
I see a lot of issues with it:
- First, as riak-search catches up, it won't find the first documents I altered, breaking my pagination.
- Paginating from the end is a tempting alternative, but it will stress Solr, or hit the max_search_results limit.
- Testing num_found is not a good way of breaking the loop, I'm pretty sure of it.
Should I load all riak-kv keys before starting to edit? Is there a proper algorithm/way to achieve my needs?
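To make the "load the keys first" idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: `search`, `fetch`, `needs_patch`, `patch` and `store` are hypothetical callables standing in for the real client calls, and the in-memory `kv` dict and `fake_search` exist only to make the sketch runnable. The first phase pages through the index without writing anything, so the result set does not shift; the second phase re-checks each object in riak-kv, because the index may be stale.

```python
def collect_keys(search, query, rows=100):
    """Phase 1: page through the results WITHOUT writing; return unique refs."""
    refs, seen, start = [], set(), 0
    while True:
        docs = search(query, start=start, rows=rows)["docs"]
        if not docs:
            break
        for doc in docs:
            ref = (doc["_yz_rt"], doc["_yz_rb"], doc["_yz_rk"])
            if ref not in seen:
                seen.add(ref)
                refs.append(ref)
        start += len(docs)
    return refs


def patch_all(refs, fetch, needs_patch, patch, store):
    """Phase 2: fetch each object, re-check it, and store the patched version."""
    patched = 0
    for ref in refs:
        obj = fetch(ref)
        # The index may lag behind riak-kv, so check the object itself.
        if obj is not None and needs_patch(obj):
            store(ref, patch(obj))
            patched += 1
    return patched


# In-memory stand-ins (hypothetical), only to make the sketch runnable.
kv = {("t", "b", f"k{i}"): {"terms": ["badtoken", "ok"]} for i in range(5)}


def fake_search(query, start, rows):
    hits = [{"_yz_rt": t, "_yz_rb": b, "_yz_rk": k}
            for (t, b, k), v in sorted(kv.items()) if "badtoken" in v["terms"]]
    return {"num_found": len(hits), "docs": hits[start:start + rows]}


refs = collect_keys(fake_search, "terms:badtoken", rows=2)
count = patch_all(
    refs,
    fetch=kv.get,
    needs_patch=lambda obj: "badtoken" in obj["terms"],
    patch=lambda obj: {"terms": [t for t in obj["terms"] if t != "badtoken"]},
    store=kv.__setitem__,
)
```

Deep offsets in the first phase would still be subject to the max_search_results limit, and documents written while the keys are being collected would be missed, so I am not sure this is good enough.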
EDIT:
My use case is the following. I store text documents that contain an array of terms produced by my string tokenizer algorithm; like any machine learning system, it evolves and gets better over time. The string tokenizer does nothing but create a word cloud.
My bucket type is ever growing and I need to patch old term arrays produced by previous tokenizer versions. To achieve that, I want to search for old documents, or for documents that contain bad tokens that I know were corrected in my new tokenizer version.
So, my search query is either:
- terms:badtoken
- created_date:[2000-11-01 TO 2014-12-01]
Working with dates is not an issue, but working with tokens is. Removing the bad token from a document changes the Solr index within a matter of seconds, while I am still searching for "badtoken". This shifts my current pagination and makes me miss documents.
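The effect can be reproduced without Riak at all. The following is a small simulation, not real client code: `make_search` builds a hypothetical in-memory "index", and it assumes the worst case where the index reflects each write immediately (in reality the lag makes the outcome less predictable).

```python
def make_search(docs):
    """Return a fake search function over an in-memory 'index' (hypothetical)."""
    def search(term, start, rows):
        matching = [key for key, terms in sorted(docs.items()) if term in terms]
        return {"num_found": len(matching), "docs": matching[start:start + rows]}
    return search


def patch_with_offset_paging(docs, term, rows=2):
    """Naive loop: keep advancing the offset while the result set shrinks."""
    search = make_search(docs)
    patched, start = [], 0
    while True:
        results = search(term, start, rows)
        if not results["docs"]:
            break
        for key in results["docs"]:
            docs[key].remove(term)  # the fix removes the doc from the result set
            patched.append(key)
        start += len(results["docs"])
    return patched


docs = {f"doc{i}": ["badtoken", "ok"] for i in range(6)}
patched = patch_with_offset_paging(docs, "badtoken")
missed = sorted(key for key, terms in docs.items() if "badtoken" in terms)
print("patched:", patched)  # patched: ['doc0', 'doc1', 'doc4', 'doc5']
print("missed:", missed)    # missed: ['doc2', 'doc3']
```

After the first page is fixed, the remaining matches slide into the offsets already consumed, so the second page starts two documents too far and doc2 and doc3 are never visited.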
For the moment, I have given up on using the index and simply walk the whole bucket.