See this post
My Env:
{
  "name" : "node-0",
  "cluster_name" : "ES500-JBD-0",
  "cluster_uuid" : "q_akJRkrSI-glTwT5vfH4A",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}
Pipeline creation (Edit 3):
curl -XPUT 'vm01.jbdata.fr:9200/_ingest/pipeline/attachment' -d '{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}'
Mapping creation (Edit 4) with the french analyzer:
curl -XPUT 'vm01.jbdata.fr:9200/ged-idx-00' -d '{
  "mappings" : {
    "ged_type_0" : {
      "properties" : {
        "attachment.data" : {
          "type": "text",
          "analyzer" : "french"
        }
      }
    }
  }
}'
ES-specific config (Edit 1 & Edit 2):
$ bin/elasticsearch-plugin list
ingest-attachment
From config/elasticsearch.yml:
plugin.mandatory: ingest-attachment
Commands I tried to index a PDF:
1/ A "raw" PDF.
curl -H 'Content-Type: application/pdf' -XPUT vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment -d @/tmp/zookeeperAdmin.pdf
{"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [%PDF-1.4%��� ... 0D33957F>]>>startxref76764%%EOF; line: 1, column: 2]"}},"status":500}
2/ A Base64-encoded PDF.
aPath='/tmp/zookeeperAdmin.pdf'
aB64content=$(base64 $aPath | perl -pe 's/\n/\\n/g')
echo $aB64content > /tmp/zookeeperAdmin.pdf.b64
curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
"file" : "content" : "'$aB64content'"
}'
{"error":{"root_cause":... "reason":"failed to parse source for create index","caused_by":{"type":"json_parse_exception","reason":"Unexpected character (':' (code 58)): was expecting comma to separate Object entries\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@65a254b6; line: 2, column: 25]"}},"status":400}
How do I correctly use the ingest-attachment plugin to index a PDF?
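For reference, here is a minimal sketch of what a well-formed request might look like. It rests on two readings of the errors above. Both failing requests PUT to the index URL itself, and both error messages ("Failed to load settings", "failed to parse source for create index") suggest ES treated them as index creation, so the sketch targets a document URL with a type and an id instead. It also sends valid JSON with a single field named data, which is the field the pipeline above reads. The document id 1, the ES_URL and PDF_PATH environment variables, and the placeholder file are assumptions for illustration, not something confirmed to work on this cluster.

```shell
#!/bin/sh
# PDF_PATH is a hypothetical variable; point it at e.g. /tmp/zookeeperAdmin.pdf.
pdf_path="${PDF_PATH:-}"
if [ -z "$pdf_path" ] || [ ! -f "$pdf_path" ]; then
  # No real file given: fall back to a tiny placeholder so the sketch still runs.
  pdf_path=$(mktemp)
  printf '%%PDF-1.4 placeholder' > "$pdf_path"
fi

# Base64-encode the file and strip line breaks, so the value is one JSON string.
b64=$(base64 < "$pdf_path" | tr -d '\n')

# Body with one field, "data", matching "field" : "data" in the pipeline.
body=$(printf '{"data":"%s"}' "$b64")
body_file=$(mktemp)
printf '%s' "$body" > "$body_file"

# Send only when ES_URL is set (assumption: e.g. http://vm01.jbdata.fr:9200).
# The URL names index, type and a hypothetical document id 1.
if [ -n "${ES_URL:-}" ]; then
  curl -XPUT "$ES_URL/ged-idx-00/ged_type_0/1?pipeline=attachment" -d @"$body_file"
fi
```

Run with something like ES_URL=http://vm01.jbdata.fr:9200 PDF_PATH=/tmp/zookeeperAdmin.pdf. Writing the body to a file and passing it with -d @file avoids putting a large Base64 string on the curl command line.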