I have recently been testing the mapper-attachments plugin for Elasticsearch.
Here are my very basic mappings in Elasticsearch:
PUT /test
PUT /test/docs/_mapping
{
  "docs": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "term_vector": "with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}
I am using the Elasticsearch client for Python, and the following examples assume this setup:
import base64
from elasticsearch import Elasticsearch
es = Elasticsearch()
When I encode a string myself, it works:
encoded = base64.b64encode(b'Encoding Test')
es.index(index='test',
         doc_type='docs',
         id=1,
         refresh='true',
         body={
             'file': {
                 "_content": encoded,
                 "_indexed_chars": -1
             }
         })
I also tried indexing a PDF file, which works fine too.
But when I try a Microsoft Office document (Open XML), such as a pptx or docx, the content of the file does not seem to be indexed:
file_path = "docx/test.docx"
with open(file_path, "rb") as docx_file:
    es.index(index='test', doc_type='docs', id=2, refresh='true', body={
        'file': {
            "_content": base64.b64encode(docx_file.read()),
            "_indexed_chars": -1
        }
    })
I tried many Word and PowerPoint files, with no success.
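One way to check whether the extracted text actually made it into the index is a search that both matches on and returns the stored content field. This is only a sketch: it assumes the extracted text is exposed as `file.content` (as in the mapping above) and that the Elasticsearch 2.x `fields` option is used to fetch stored fields. The helper name `build_content_check` is made up for this post.

```python
def build_content_check(term):
    """Build a search body that matches on the extracted text and returns it.

    Hypothetical helper; "file.content" is assumed from the mapping above.
    """
    return {
        "query": {"match": {"file.content": term}},
        # In Elasticsearch 2.x, "fields" requests stored fields instead of _source
        "fields": ["file.content"],
    }


body = build_content_check("Encoding")
# With the client from above, the check would be run as:
# es.search(index='test', doc_type='docs', body=body)
```

For the plain string and the PDF, a query like this should return hits with the stored text; for the docx and pptx documents it is where the content appears to be missing.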
When I extract the text manually (unzipping the file, pulling the text out of the XML, and base64-encoding that text), it works, but what is the point of doing this when the mapper-attachments plugin (Apache Tika) is supposed to handle it?
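For reference, the manual workaround looks roughly like this for a docx. It is a minimal sketch using only the standard library: the helper names `extract_docx_text` and `encode_for_attachment` are made up for this post, and it assumes the main body text lives in `word/document.xml` inside the archive (headers, footers, and pptx slides are not handled).

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in .docx files
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(docx_bytes):
    """Return the body text of a .docx, one line per paragraph (hypothetical helper)."""
    # A .docx is a zip archive; the main document part is word/document.xml
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # Text runs are stored in <w:t> elements inside each paragraph
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def encode_for_attachment(text):
    """Base64-encode extracted text as an ASCII str for the "_content" field."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")
```

The result of `encode_for_attachment(extract_docx_text(data))` is what I then pass as `"_content"`, in the same way as the plain string example.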
This seems odd to me.
Do you have any idea what is happening?
Have I missed a step?
Python: 3.4
Elasticsearch: 2.2.0
Plugins: mapper-attachments
OS: Mac OS X 10.11.3 (on my laptop)