Elasticsearch 5.0.0 ingest-attachment plugin: issues indexing a PDF

Question

My Env:

{
  "name" : "node-0",
  "cluster_name" : "ES500-JBD-0",
  "cluster_uuid" : "q_akJRkrSI-glTwT5vfH4A",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}

Index & pipeline creation (Edit 3):

curl -XPUT 'vm01.jbdata.fr:9200/_ingest/pipeline/attachment' -d '{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}'

Mapping creation (Edit 4) with the French analyzer:

curl -XPUT 'vm01.jbdata.fr:9200/ged-idx-00' -d '{
  "mappings" : {
    "ged_type_0" : {
      "properties" : {
        "attachment.data" : {
          "type" : "text",
          "analyzer" : "french"
        }
      }
    }
  }
}'

ES specific config (Edit 1 & Edit 2):

$ bin/elasticsearch-plugin list
ingest-attachment

From config/elasticsearch.yml

plugin.mandatory: ingest-attachment

Commands to index a PDF:

1/ A "raw" PDF.

curl -H 'Content-Type: application/pdf' -XPUT vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment -d @/tmp/zookeeperAdmin.pdf

{"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [%PDF-1.4%�� ... 0D33957F>]>>startxref76764%%EOF; line: 1, column: 2]"}},"status":500}

2/ A Base64-encoded PDF.

aPath='/tmp/zookeeperAdmin.pdf'
aB64content=$(base64 $aPath | perl -pe 's/\n/\\n/g')
echo $aB64content > /tmp/zookeeperAdmin.pdf.b64
curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

{"error":{"root_cause":... "reason":"failed to parse source for create index","caused_by":{"type":"json_parse_exception","reason":"Unexpected character (':' (code 58)): was expecting comma to separate Object entries\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@65a254b6; line: 2, column: 25]"}},"status":400}

How do I correctly use the ingest-attachment plugin to index a PDF?

score 1 · Answer 1 · answered Dec 12 '16 at 02:55

In my experience, the file needs to be encoded in Base64, so your option 2 is the right way to go.

About your last attempt:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

The provided JSON is malformed ("a" : "b" : "c"), hence the error.

As specified in your pipeline creation, you only need a data field, so the following should do the trick:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "data" : "'$aB64content'"
}'

`curl -XPUT "http://192.168.56.101:9200/ged-idx-00?pipeline=attachment" -d '{ "data" : "'$aB64content'" }'` **Error:** ...sh: line 7: /usr/bin/curl: Argument list too long — jbigdata.fr, Mar 03 '17 at 13:29
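
The "Argument list too long" error in the comment above comes from the shell, not from Elasticsearch: the whole Base64 string is expanded into a single curl argument, and for a large file that exceeds the operating system's argument size limit. One way around it, as a minimal sketch assuming a POSIX shell with `base64` and `tr` available, is to write the JSON body to a file and let curl read it with `@file`. The helper name `build_payload`, the output path, and the document id `1` are hypothetical; the type `ged_type_0` is taken from the mapping in the question.

```shell
#!/bin/sh
# build_payload is a hypothetical helper, not part of curl or Elasticsearch.
# It writes {"data":"<base64 of the input file>"} to the output path, so the
# Base64 text is never passed on the command line.
build_payload() {
    pdf_path=$1
    json_path=$2
    {
        printf '{"data":"'
        # Strip the line wraps that base64 adds; a JSON string cannot
        # contain raw newlines.
        base64 < "$pdf_path" | tr -d '\n'
        printf '"}\n'
    } > "$json_path"
}

# Usage (host and index from the question; type/id in the URL are assumptions):
# build_payload /tmp/zookeeperAdmin.pdf /tmp/zookeeperAdmin.json
# curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00/ged_type_0/1?pipeline=attachment" \
#      -H 'Content-Type: application/json' \
#      --data-binary @/tmp/zookeeperAdmin.json
```

The commented curl call targets a document URL (index/type/id) rather than the bare index. A PUT on the bare index name is the create-index request, which would explain the "failed to parse source for create index" message quoted in the question.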

score 1 · Answer 2 · answered Apr 04 '17 at 13:52

In fact, extracting text from PDFs properly is quite difficult. Often you have to extract inline images, or render the whole page and OCR it, depending on the text extracted from the page and its content (for example, you have to analyse whether the encoding is right or not). You simply cannot tune Tika to use custom logic inside the parsing process, and you cannot do so with Ingest Attachment either. If you are aiming for good-quality PDF parsing, Ingest Attachment is not what you are looking for; you have to do it yourself.

Read the full story here: https://blog.ambar.cloud/ingest-attachment-plugin-for-elasticsearch-should-you-use-it/

[fscrawler](https://github.com/dadoonet/fscrawler) does the job of indexing documents with ES. — jbigdata.fr, May 23 '17 at 08:28
This [link](http://jbigdata.fr/jbigdata/ged-02.html) shows a case study about indexing quality. — jbigdata.fr, May 23 '17 at 08:37
