
I'm new to Elasticsearch and want to index many sentences so I can search them efficiently. At first I tried bulk-adding to an index, but that didn't work for me, so now I'm adding sentences one by one using the following piece of (Python) code:

import json
import pycurl

def add_document(c, index_name, js, _id):
    # js is the document as a Python dict; it is serialized to JSON here
    c.setopt(c.POST, 1)
    c.setopt(c.URL, 'localhost:9200/%s/sentence/%i' % (index_name, _id))
    c.setopt(c.POSTFIELDS, json.dumps(js))
    c.perform()

c = pycurl.Curl()
add_document(c, 'myIndexName', some_sentence_dict, 99)  # some_sentence_dict: a dict like the example below

Where I'm incrementing the id, and an example of an input document (before it gets serialized to JSON) would be:

{"sentence_id": 2, "article_name": "Kegelschnitt", "paragraph_id": 1, "plaintext": "Ein Kegelschnitt ist der zweidimensionale Sonderfall einer Quadrik .", "postags": "Ein/ART Kegelschnitt/NN ist/VAFIN der/ART zweidimensionale/ADJA Sonderfall/NN einer/ART Quadrik/NE ./$."}

So far so good, it seems to work. I suspect that getting this to work as a bulk import would be a lot more efficient, but since this is a one-time-only process, efficiency is not my primary concern (a rough sketch of what I understand the bulk variant would look like follows below).
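
From the bulk API documentation I understand that the request body is newline-delimited JSON, with an action/metadata line preceding each document; a rough, untested sketch of how I imagine that would look with pycurl (bulk_add and docs_with_ids are just placeholder names):

import json
import pycurl

def bulk_add(c, index_name, docs_with_ids):
    # c: a pycurl.Curl() handle, as above; docs_with_ids: a list of (doc, _id) pairs
    lines = []
    for doc, _id in docs_with_ids:
        # action/metadata line first, then the document source
        lines.append(json.dumps({'index': {'_index': index_name, '_type': 'sentence', '_id': _id}}))
        lines.append(json.dumps(doc))
    body = '\n'.join(lines) + '\n'  # the bulk body has to end with a newline
    c.setopt(c.POST, 1)
    c.setopt(c.URL, 'localhost:9200/_bulk')
    c.setopt(c.HTTPHEADER, ['Content-Type: application/x-ndjson'])
    c.setopt(c.POSTFIELDS, body)
    c.perform()

I'm using this query (on the command line) to get an overview of my indices: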

curl 'localhost:9200/_cat/indices?v'

Which gives me (for the relevant index):

health status index             pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   wiki_dump_jan2019   5   1     795502       276551    528.1mb        528.1mb 

Similarly, the query:

curl -XGET 'localhost:9200/wiki_dump_jan2019/sentence/_count?pretty' -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'

returns

{
  "count" : 795502,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

Telling me that I have 795.502 sentences in my index.

My problem here is that in total I do over 23 million inserts. I realise that there may well be some duplicate sentences, but I checked this and found over 21 million unique sentences. My Python code executed fine, with no errors, and I checked the Elasticsearch logs and did not find anything alarming there. I'm a bit unsure about the number of docs.deleted in the index (276.551, see above), but I understood that this may have to do with re-indexing and overwritten/updated documents (the old version is kept as deleted until segments are merged), and should not necessarily be a problem (and in any case, docs.count plus docs.deleted is still way below my number of sentences).
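
For what it's worth, fetching a single document by its id also returns a _version field, which (as far as I understand) only goes above 1 when a document with that id gets overwritten or updated:

curl -XGET 'localhost:9200/wiki_dump_jan2019/sentence/99?pretty'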

The only thing I could find that came close to my problem was this post: elasticsearch stops indexing new documents after a while, using Tire. But the following query:

curl -XGET 'localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors'

returns:

{"nodes":{"hoOAMZoCTkOgirg6_aIkUQ":{"process":{"max_file_descr^Ctors":65536}}}}

so from what I understand it defaulted to the maximum value upon installation, and this should not be the issue.

Can anyone shed some light on this?

UPDATE: Ok, I guess I'm officially stupid. My issue was that I used the sentence_id as the document id when adding/inserting. This sentence_id comes from one particular article, so the maximum number of docs (sentences) in my index would be the highest sentence_id (the longest document in my data set apparently has 795502 sentences). It just kept overwriting the existing entries with every new document... Sorry for having wasted your time if you read this. NOT an Elasticsearch issue; a bug in my Python code (outside of the function displayed above).
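
In case it helps anyone else: the fix is simply to give every sentence its own id instead of reusing sentence_id, e.g. with a running counter; a sketch (not my actual code, all_sentences is just a placeholder for the parsed sentence dicts):

for doc_counter, doc in enumerate(all_sentences):
    # doc_counter is unique across all articles, unlike sentence_id
    add_document(c, 'wiki_dump_jan2019', doc, doc_counter)

Alternatively, one could POST to localhost:9200/wiki_dump_jan2019/sentence/ without an id and let Elasticsearch generate one.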

Igor
  • Can you check your disk usage? Elasticsearch makes indices read-only after 95% disk usage under some rules. https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html#disk-allocator – gaurav9620 Jun 18 '19 at 08:57
  • thanks! Current usage is at 83% with 33GB still available (the total size of my sentences in text form is 11GB, so that shouldn't be an issue; still, I can try freeing some space and trying again) – Igor Jun 18 '19 at 08:59
  • Then there is some other issue. Can you show me the error you are getting while indexing? Or are you getting any error at all? – gaurav9620 Jun 18 '19 at 09:01
  • Also check your Python logs while indexing. In the case of bulk ingest, it will not directly throw an exception if there is a problem with a document; you have to get the bulk errors from the bulk response and print them – gaurav9620 Jun 18 '19 at 09:06
  • during the index population (which takes several hours), the ES logs just display `[2019-06-18T11:33:14,317][INFO ][o.e.m.j.JvmGcMonitorService] [user-VirtualBox] [gc][1975] overhead, spent [497ms] collecting in the last [1.2s]` every few minutes, nothing serious. I'll see if I can get some more info from pycurl. – Igor Jun 18 '19 at 10:05
  • see update above. Sorry for wasting your time, thanks for the comments! – Igor Jun 18 '19 at 11:16
  • It's ok, no issues – gaurav9620 Jun 18 '19 at 11:20

0 Answers