
What would be the right way to search riak-search for documents that need correction, and then update them? By design, riak-search is an index that may NOT exactly match the riak-kv content. I expect that during a heavy check/write operation my index won't match my riak-kv content.

I am counting on riak-search to limit the read/write operations to the small number of matching entries.

I really can't operate using this kind of algorithm:

page = 0
rows = 100
while True:
    results = riak.search('index', 'sex:male', start=page, rows=rows)
    if results['num_found'] == 0:
        break
    for r in results['docs']:
        obj = riak.bucket_type(r['_yz_rt']).bucket(r['_yz_rb']).get(r['_yz_rk'])
        # alter object
        obj.store()
    page = page + len(results['docs'])

I see a lot of issues with it:

  • First, as riak-search catches up with my updates, it no longer returns the documents I already altered, which shifts the result set and breaks my pagination.
  • Paginating from the end is a tempting alternative, but it would either stress Solr or hit the max_search_results limit.
  • Testing num_found is not a good way of breaking the loop, I'm pretty sure of that.

Should I load all the riak-kv keys before starting to edit? Is there a proper algorithm/way to achieve this?
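
The best I have come up with so far is a two-phase approach: collect every matching key first, while nothing is being modified, then update the objects from that frozen list, so the index can drift without breaking anything. A rough, untested sketch, assuming the official riak Python client (fulltext_search is supposed to pass start/rows/sort through to Solr; the index name, page size, and fix callback are placeholders):

import riak

client = riak.RiakClient()
PAGE = 100  # arbitrary page size

def collect_keys(index, query):
    # Phase 1: page through the index before touching anything, so the
    # result set cannot shift under the pagination. Presorting keeps
    # pages consistent between requests.
    keys, start = [], 0
    while True:
        results = client.fulltext_search(index, query, start=start,
                                         rows=PAGE, sort='_yz_id asc')
        docs = results['docs']
        keys.extend((d['_yz_rt'], d['_yz_rb'], d['_yz_rk']) for d in docs)
        if len(docs) < PAGE:  # short page means we reached the end
            break
        start += len(docs)
    return keys

def update_all(keys, fix):
    # Phase 2: the index may now drift freely, we never read it again.
    for rt, rb, rk in keys:
        obj = client.bucket_type(rt).bucket(rb).get(rk)
        fix(obj)  # apply the correction to obj.data
        obj.store()

This still pages through the whole match set, so it only helps if the total stays under max_search_results and no concurrent writer reintroduces matches while phase 1 runs.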

EDIT:

My use case is the following. I store text documents that contain an array of terms produced by my string tokenizer; like any machine learning system, it evolves and gets better over time. The string tokenizer does nothing but build a word cloud.

My bucket type is ever growing, and I need to patch the old term arrays produced by previous tokenizer versions. To achieve that, I want to search either for old documents or for documents that contain bad tokens that I know were corrected in my new tokenizer version.

So, my search query is either:

  • terms:badtoken
  • created_date:[2000-11-01 TO 2014-12-01]

Working with dates is not an issue, but working with tokens is: removing the badtoken from a document changes the Solr index within seconds, while I am still searching for "badtoken". That shifts my current pagination and makes me miss documents.
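
One idea for the badtoken case specifically would be to drain the result set instead of paginating it: since fixing a document removes it from the matches, I could always fetch the first page and repeat until the query comes back empty. A rough, untested sketch (same assumption about the official riak Python client; the seen set guards against Solr returning documents it has not yet re-indexed):

import time
import riak

client = riak.RiakClient()

def drain(index, query, fix, rows=100):
    # Only valid when the correction removes the document from the
    # result set: fixed documents then fall out of the first page on
    # their own, so there is no pagination to break.
    seen = set()
    stale_rounds = 0
    while True:
        results = client.fulltext_search(index, query, start=0, rows=rows)
        fresh = [d for d in results['docs']
                 if (d['_yz_rt'], d['_yz_rb'], d['_yz_rk']) not in seen]
        if not fresh:
            if results['num_found'] == 0 or stale_rounds >= 5:
                break  # drained, or Solr never caught up
            stale_rounds += 1
            time.sleep(1)  # give Solr time to absorb the recent writes
            continue
        stale_rounds = 0
        for d in fresh:
            coord = (d['_yz_rt'], d['_yz_rb'], d['_yz_rk'])
            obj = client.bucket_type(coord[0]).bucket(coord[1]).get(coord[2])
            fix(obj)
            obj.store()
            seen.add(coord)

This obviously only works for queries like terms:badtoken, where the correction removes the document from the result set; for the date-range query the first page would never change.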

For the moment, I have given up on the index and simply walk the whole bucket.
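
For reference, the bucket walk looks roughly like this, assuming the official Python client's stream_keys (the bucket names, the needs_fixing predicate, and the retokenize helper are hypothetical):

import riak

client = riak.RiakClient()
bucket = client.bucket_type('documents').bucket('texts')  # hypothetical names

# stream_keys() yields keys in batches; a full scan is heavy on the
# cluster, but it does not depend on the search index at all.
for batch in bucket.stream_keys():
    for key in batch:
        obj = bucket.get(key)
        if needs_fixing(obj.data):  # hypothetical predicate on the terms
            obj.data['terms'] = retokenize(obj.data)  # hypothetical fixer
            obj.store()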

  • Could you describe your use case? I would say that if you struggle applying a Riak technique to your use case, then your data modeling isn't right. Why are you concerned with search indexes being out of sync with the data? Do you have a way to figure out whether a value needs editing or not? In any case, you should pass `rows` in addition to `start` to paginate (see http://docs.basho.com/riak/latest/dev/using/search/#Querying, the section about Pagination), and also presort results. Stop when a page contains less than your page size (`rows`). – vempo Mar 20 '16 at 17:45
  • Added more context. The out-of-sync issue is when I search on a filter that is altered by the data correction. – Guibod Mar 21 '16 at 08:37

0 Answers