I'm new to elasticsearch and want to index many sentences to search them efficiently. At first I tried bulk adding to an index, but that didn't work for me, so now I'm adding sentences one by one using the following piece of (python) code:
import json
import pycurl

def add_document(c, index_name, js, _id):
    # index one sentence (js, a python dict) under localhost:9200/<index>/sentence/<id>
    c.setopt(c.POST, 1)
    c.setopt(c.URL, 'localhost:9200/%s/sentence/%i' % (index_name, _id))
    c.setopt(c.POSTFIELDS, json.dumps(js))
    c.perform()

c = pycurl.Curl()
add_document(c, 'myIndexName', someSentenceDict, 99)  # someSentenceDict: a dict like the example below
Here I'm incrementing the id for every insert, and an example of the JSON that gets sent would be:
{"sentence_id": 2, "article_name": "Kegelschnitt", "paragraph_id": 1, "plaintext": "Ein Kegelschnitt ist der zweidimensionale Sonderfall einer Quadrik .", "postags": "Ein/ART Kegelschnitt/NN ist/VAFIN der/ART zweidimensionale/ADJA Sonderfall/NN einer/ART Quadrik/NE ./$."}
So far so good, the one-by-one approach seems to work. I suspect that the bulk route would be a lot more efficient, but since this is a one-time-only process, efficiency is not my primary concern. I'm using this query (on the command line) to get an overview of my indices:
curl 'localhost:9200/_cat/indices?v'
Which gives me (for the relevant index):
health status index             pri rep docs.count docs.deleted store.size pri.store.size
yellow open   wiki_dump_jan2019   5   1     795502       276551    528.1mb        528.1mb
Similarly, the query:
curl -XGET 'localhost:9200/wiki_dump_jan2019/sentence/_count?pretty' -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'
returns
{
  "count" : 795502,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
This tells me that I have 795,502 sentences in my index.
My problem is that in total I do over 23 million inserts. I realise that there may well be some duplicate sentences, but I checked this and found over 21 million unique sentences. My python code executed fine with no errors, and I checked the elasticsearch logs and found nothing alarming there. I'm a bit unsure about the number of docs.deleted in the index (276,551, see above), but I understood that this may have to do with re-indexing and duplicates and is not necessarily a problem (and in any case, the total of docs.count and docs.deleted is still way below my number of sentences).
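One way to narrow down where the missing documents go is to capture the response body of every insert, which states whether a document was created or updated. A minimal sketch, as a variant of add_document above (the exact response fields differ between elasticsearch versions):

import io
import json

def add_document_checked(c, index_name, js, _id):
    # same insert as add_document, but capture the response body so we can see
    # whether elasticsearch created a new document or overwrote an existing one
    buf = io.BytesIO()
    c.setopt(c.POST, 1)
    c.setopt(c.URL, 'localhost:9200/%s/sentence/%i' % (index_name, _id))
    c.setopt(c.POSTFIELDS, json.dumps(js))
    c.setopt(c.WRITEDATA, buf)
    c.perform()
    resp = json.loads(buf.getvalue().decode('utf-8'))
    # older versions report "created": true/false, newer ones "result": "created"/"updated";
    # a growing "_version" for the same id also means that id is being overwritten
    return resp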
The closest thing to my problem that I could find was this post: elasticsearch stops indexing new documents after a while, using Tire. However, the following query:
curl -XGET 'localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors'
returns:
{"nodes":{"hoOAMZoCTkOgirg6_aIkUQ":{"process":{"max_file_descr^Ctors":65536}}}}
so from what I understand, it defaulted to the maximum value at installation time, and this should not be the issue.
Can anyone shed some light on this?
UPDATE: Ok, I guess I'm officially stupid. My issue was that I used the sentence_id as the document id when inserting. The sentence_id is relative to one particular article, so the maximum number of docs (sentences) in my index equals the highest sentence_id (the longest article in my data set apparently had 795,502 sentences). Every new article just kept overwriting the entries of the previous ones... Sorry for having wasted your time if you read this. NOT an elasticsearch issue; it was a bug in my python code (outside of the displayed function above).
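In case it helps anyone else: one simple way around this is to not reuse sentence_id as the document id at all, for example by keeping one global counter over all inserts. Just a sketch, reusing add_document from above (all_sentence_dicts is a placeholder for the parsed sentences):

import itertools

# sentence_id restarts for every article, so it is not unique across the dump;
# a single running counter (or an id built from article_name + paragraph_id +
# sentence_id) avoids the collisions
next_id = itertools.count()
for doc in all_sentence_dicts:
    add_document(c, 'wiki_dump_jan2019', doc, next(next_id))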