
I know how to set up the river plugin and search across it. The problem is that if the same document is edited multiple times (multiple revisions), only the data from the latest revision is retained and the older data is lost. I intend to keep an index of all revisions for my entire CouchDB, so I don't have to keep the history in CouchDB itself and can retrieve a document's history using Elasticsearch instead of going to Futon. I know the issue will be to uniquely determine a key for a CouchDB doc while indexing, but we can append the revision number to the key so that every key is unique.

I couldn't find a way to do that in any documentation. Does anyone have an idea how to do it?

Any suggestions/thoughts are welcome.

EDIT 1: To be more explicit, at the moment Elasticsearch saves CouchDB docs like this:

{
    "_index": "foo",
    "_type": "foo",
    "_id": "27fd33f3f51e16c0262e333f2002580a",
    "_score": 1.0310782,
    "_source": {
        "barVal": "bar",
        "_rev": "3-d10004227969c8073bc573c33e7e5cfd",
        "_id": "27fd33f3f51e16c0262e333f2002580a"
    }
}

Here the _id from CouchDB is the same as the _id for the search index. I want the search index _id to be concat("_id", "_rev") from CouchDB.
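For illustration, the composite key I'm after would be built like this (a hypothetical Python helper, not existing river behavior):

```python
def composite_id(doc):
    """Build a unique index key by appending the revision to the doc id.

    `doc` is a CouchDB document dict; this helper is an illustration of
    the desired key, not part of the river plugin.
    """
    return doc["_id"] + doc["_rev"]

doc = {
    "_id": "27fd33f3f51e16c0262e333f2002580a",
    "_rev": "3-d10004227969c8073bc573c33e7e5cfd",
    "barVal": "bar",
}
print(composite_id(doc))
# -> 27fd33f3f51e16c0262e333f2002580a3-d10004227969c8073bc573c33e7e5cfd
```

Since the revision string is unique per edit, every revision of a document gets its own entry in the index instead of overwriting the previous one.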

EDIT 2 (after trying out @DaveS's solution): So I tried the following, but it didn't work; the search index is still keyed on CouchDB's _id.

What I did:

curl -XDELETE 127.0.0.1:9200/_all
curl -XPUT 'localhost:9200/foo_test' -d '{
  "mappings": {
    "foo_test": {
      "_id": {
        "path": "newId",
        "index": "not_analyzed",
        "store": "yes"
      }
    }
  }
}'

curl -XPUT 'localhost:9200/_river/foo_test/_meta' -d '{
  "type": "couchdb",
  "couchdb": {
    "host": "127.0.0.1",
    "port": 5984,
    "db": "foo_test",
    "script": "ctx.doc.newId = ctx.doc._id + ctx.doc._rev",
    "filter": null
  },
  "index": {
    "index": "foo_test",
    "type": "foo_test",
    "bulk_size": "100",
    "bulk_timeout": "10ms"
  }
}'

And after this, when I search for a doc I added, I get:

_index: foo_test
_type: foo_test
_id: 53fa6fcf981a01b05387e680ac4a2efa
_score: 8.238497
_source: {
    _rev: 4-8f8808f84eebd0984d269318ad21de93
    content: {
        foo: bar
        foo3: bar3
        foo2: bar2
    }
    _id: 53fa6fcf981a01b05387e680ac4a2efa
    newId: 53fa6fcf981a01b05387e680ac4a2efa4-8f8808f84eebd0984d269318ad21de93
}

@DaveS - Hope this helps in explaining that Elasticsearch is not using the new path to define its "_id" field.

EDIT 3 - for @dadoonet. Hope this helps

This is how you get all the older revision info for a CouchDB doc. Then you can iterate through the revisions that are available, fetch their data, and index them:

  1. Get a list of all revisions on a doc id:

    curl http://<foo>:5984/testdb/cde07b966fa7f32433d33b8d16000ecd?revs_info=true
    {"_id":"cde07b966fa7f32433d33b8d16000ecd",
     "_rev":"2-16e89e657d637c67749c8dd9375e662f",
     "foo":"bar",
     "foo2":"bar2",
     "_revs_info":[
       {"rev":"2-16e89e657d637c67749c8dd9375e662f", "status":"available"},
       {"rev":"1-4c6114c65e295552ab1019e2b046b10e", "status":"available"}]}

And then, if the status is available, you can retrieve each version with:

curl http://<foo>:5984/testdb/cde07b966fa7f32433d33b8d16000ecd?rev=1-4c6114c65e295552ab1019e2b046b10e
{"_id":"cde07b966fa7f32433d33b8d16000ecd",
 "_rev":"1-4c6114c65e295552ab1019e2b046b10e",
 "foo":"bar"}

curl http://<foo>:5984/testdb/cde07b966fa7f32433d33b8d16000ecd?rev=2-16e89e657d637c67749c8dd9375e662f
{"_id":"cde07b966fa7f32433d33b8d16000ecd",
 "_rev":"2-16e89e657d637c67749c8dd9375e662f",
 "foo":"bar",
 "foo2":"bar2"}
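The two steps above can be sketched in Python (the hosts, database, and index names are assumptions; this is an illustration of the manual approach, not the river itself):

```python
import json
import urllib.request

# Assumptions: CouchDB on localhost:5984 with db "testdb", and
# Elasticsearch on localhost:9200 with index/type "foo_test".
COUCH = "http://localhost:5984/testdb"
ES = "http://localhost:9200/foo_test/foo_test"

def available_revs(doc_info):
    """Extract the revisions marked 'available' from a ?revs_info=true response."""
    return [r["rev"] for r in doc_info["_revs_info"] if r["status"] == "available"]

def index_all_revisions(doc_id):
    """Fetch every available revision of a doc and index each under a composite id."""
    with urllib.request.urlopen(f"{COUCH}/{doc_id}?revs_info=true") as resp:
        info = json.load(resp)
    for rev in available_revs(info):
        # Step 2: retrieve the document body for this specific revision.
        with urllib.request.urlopen(f"{COUCH}/{doc_id}?rev={rev}") as resp:
            doc = json.load(resp)
        # Index under the composite key _id + _rev so revisions don't collide.
        req = urllib.request.Request(
            f"{ES}/{doc_id}{rev}",
            data=json.dumps(doc).encode(),
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

Note that this only works before compaction: once CouchDB compacts the database, older revisions switch from "available" to "missing" and can no longer be fetched.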
Sunny
  • Have you considered using ES's version feature directly, instead of rolling your own versioned documents? E.g. http://www.elasticsearch.org/blog/2011/02/08/versioning.html – Dave S. Mar 15 '13 at 02:00
  • I did look at Elasticsearch's versioning, but that doesn't solve the issue, as it doesn't let me retrieve/search for older versions, which is what I want to do. >>> you can't do this using the built-in versioning. All that does is store the current version number to prevent you applying updates out of order. If you wanted to keep multiple versions available, then you'd have to implement that yourself. Refer: http://stackoverflow.com/questions/8218309/can-we-retrieve-previous-source-docs-with-elastic-search-versions – Sunny Mar 15 '13 at 08:18
  • Thanks, I didn't know you couldn't load old versions. Bummer! – Dave S. Mar 15 '13 at 13:17

2 Answers


I don't think you can, because as far as I remember CouchDB does not keep the older versions of a document: after a compaction, old revisions are removed.

That said, even if it were doable in CouchDB, you cannot store different versions of a document in Elasticsearch under the same _id.

To do that, you would have to define a new ID for each indexed document, for example DOCID_REVNUM.

That way, new revisions won't update the existing document.

The CouchDB river does not do that for now.

I suggest that you manage that in CouchDB (i.e. create a new doc for each new version of a document) and let the standard CouchDB river index each one as another document.
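A minimal sketch of that approach, assuming you snapshot each revision into a new CouchDB document yourself before compaction (the `source_id`/`source_rev` field names and the `_` separator in the id are my own invention, not a river convention):

```python
def snapshot_doc(doc):
    """Build a new CouchDB document that freezes one revision of `doc`.

    The snapshot gets the id DOCID_REVNUM so each saved version is a
    separate document that the standard CouchDB river will index on its own.
    """
    # Copy the user fields, dropping CouchDB metadata (_id, _rev, ...).
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    # New id combines the original id and revision, so snapshots never collide.
    body["_id"] = f"{doc['_id']}_{doc['_rev']}"
    # Keep a pointer back to the original document and revision.
    body["source_id"] = doc["_id"]
    body["source_rev"] = doc["_rev"]
    return body
```

You would POST the result of `snapshot_doc` back into CouchDB on every update (e.g. from the application layer); the river then picks each snapshot up from the _changes feed as an ordinary new document.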

Hope this helps

dadoonet
  • Yes, that is precisely what I wanted. I want Elasticsearch to create "_id" based on a composite of the "_id" and "_rev" fields of the CouchDB document it gets from the _changes stream. I do understand that it's a custom setting, but I'm sure it would be something I can change in the codebase of Elasticsearch. I just don't know if that would break something, and was curious to see if anyone has done that yet. – Sunny Mar 14 '13 at 23:16
  • Also, the main reason I'm trying to do this is so that I do not blow up my CouchDB. So creating a new doc every time wouldn't solve it. I want to be able to compact the CouchDB every day, and all changes will be stored in the index on Elasticsearch. – Sunny Mar 14 '13 at 23:24
  • I will have a look at it and see how we can modify the couchdb river to store revisions. – dadoonet Mar 15 '13 at 07:05
  • Thanks, I'll wait for your reply! – Sunny Mar 15 '13 at 08:19
  • I made some progress on that but I'm not able to have old documents from CouchDb. I mean that CouchDb doesn't seem to keep old docs. How do you setup CouchDb to keep documents whatever the version is? – dadoonet Mar 16 '13 at 13:03
  • I have updated my question with an edit (EDIT 3) which explains how to solve your blocker on the CouchDB end. Replace <foo> with localhost/127.0.0.1/wherever your CouchDB is - Stack Overflow wasn't allowing me to include that! – Sunny Mar 16 '13 at 15:00
  • >"I made some progress on that but I'm not able to have old documents from CouchDb." By this do you mean some code change for the river plugin, or a different way to start the river/pass in a different filter/script? – Sunny Mar 19 '13 at 03:24
  • I meant that the progress I made were by modifying river source code. – dadoonet Mar 19 '13 at 07:05
  • I can push it in my repo but I'm not able to test it as _changes API doesn't send me expected sequences. Do you want to test it? – dadoonet Mar 19 '13 at 07:08
  • Can you pass me the link to your repo... I looked at your github, couldn't find one for couchdb river. – Sunny Mar 19 '13 at 16:23
  • I just pushed it here: https://github.com/dadoonet/elasticsearch-river-couchdb/tree/issue/use_rev – dadoonet Mar 19 '13 at 18:05
  • Yes, that works, but I did not end up using this - I modified the way Couch stores history and indexed the history itself. I found Couch takes less space to store data than Elasticsearch, and it's more stable as well. That way I also separate "search" - the service - from data storage. – Sunny Apr 03 '13 at 19:13
  • Thanks for testing it. So that means you don't need this change in CouchDb river? I can remove that branch? – dadoonet Apr 04 '13 at 01:36
  • I guess it will be nice to have it in - as we already have a solution which fixes things. But yeah, I do not intend to use it. Thanks for your help! – Sunny Apr 04 '13 at 01:55

You might consider adjusting your mapping to pull the _id field from a generated field, e.g. from the docs:

{
    "couchdoc" : {
        "_id" : {
            "path" : "doc_rev_id"
        }
    }
}

Then "just" modify the river to concatenate the strings and add the result into the document in doc_rev_id. One way to do that might be to use the script filter that the CouchDB river provides. E.g. something like this:

{
    "type" : "couchdb",
    "couchdb" : {
        "script" : "ctx.doc.doc_rev_id = ctx.doc._id + '_' + ctx.doc._rev"
    }
}

You'd take the above snippet and PUT it to the river's endpoint, possibly with the rest of the definition, e.g. via curl -XPUT 'localhost:9200/_river/my_db/_meta' -d '<snippet from above>'. Take care to escape the quotes as necessary.

Dave S.
  • Thanks @Dave, I did look at that doc page you suggested, but couldn't figure out how to add an entry to a couchdoc while indexing it via the river. Can you be more specific on how to accomplish this - _Then "just" modify the river to concatenate the strings and add the result into the document._ I'll be able to mark your answer as the solution then. – Sunny Mar 15 '13 at 05:04
  • @Sunny - I added an example, and though it is untested, it's basically straight out of the docs. Can you try it and see how it goes? Does my explanation of how to load it make sense? – Dave S. Mar 15 '13 at 13:16
  • I've put relevant info as an edit to the question (EDIT 2), which is more readable. – Sunny Mar 15 '13 at 17:58
  • Hmm, not sure what's wrong there. What if you have your script set `ctx.doc._id = ctx.doc._id + ctx.doc._rev` and then skip the _id/path mapping change? E.g. just overwrite the _id field in the incoming doc. – Dave S. Mar 15 '13 at 19:31
  • Haha.. I had tried that as well. [84]: index [foo_test], type [foo_test], id [53fa6fcf981a01b05387e680ac4a2efa], message [MapperParsingException[Failed to parse [_id]]; nested: MapperParsingException[Provided id [53fa6fcf981a01b05387e680ac4a2efa] does not match the content one [53fa6fcf981a01b05387e680ac4a2efa5-2bfe470c3b93e970041d885bed436f4f]]; ] – Sunny Mar 15 '13 at 19:53
  • That makes me think that there is some other route through which it already knows which "id" of CouchDB's doc it is trying to index - probably from the "_changes" stream. Not sure how to circumvent that! – Sunny Mar 15 '13 at 19:55
  • Found this: the id the document is indexed under is extracted from the CouchDB _changes stream (under "id"), not from the document itself, so that's what gets reported by Couch. – Sunny Mar 15 '13 at 19:57