Elastic mapper-attachments does not extract text from Office documents

Question

I recently tried out the mapper-attachments plugin for Elasticsearch.
Here are my very basic mappings:

PUT /test
PUT /test/docs/_mapping
{
  "docs": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "term_vector":"with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

I am using the official Python client for Elasticsearch, and the following examples assume this setup:

import base64
from elasticsearch import Elasticsearch
es = Elasticsearch()

When I encode a string myself, it works:

encoded = base64.b64encode(b'Encoding Test')
es.index(index='test',
         doc_type='docs',
         id=1,
         refresh='true',
         body={
             'file': {
                 "_content": encoded,
                 "_indexed_chars": -1
             }
         })

I also tried indexing a PDF file, which works fine too.
But when I try a Microsoft Office document (Office Open XML), such as a pptx or docx, the content of the file does not seem to be indexed:

file_path = "docx/test.docx"
with open(file_path, "rb") as docx_file:
    es.index(index='test', doc_type='docs', id=2, refresh='true', body={
        'file' : {
            "_content": base64.b64encode(docx_file.read()),
            "_indexed_chars" : -1
         }
    })
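
One way to check whether any text was actually extracted is to ask for the stored file.content field of the document. This is only a sketch: the helper names are hypothetical, and it assumes that the "fields" key of the search body returns stored fields on Elasticsearch 2.x and that the response has the usual hits.hits[].fields shape.

```python
def build_content_check(doc_id):
    """Build a search body requesting the stored 'file.content' field
    of a single document (hypothetical helper, ES 2.x style body)."""
    return {
        "query": {"ids": {"values": [str(doc_id)]}},
        "fields": ["file.content"],
    }


def extracted_text(response):
    """Return the extracted text found in a search response dict,
    or an empty string if nothing was stored for the first hit."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return ""
    values = hits[0].get("fields", {}).get("file.content", [])
    return "".join(values).strip()


# Intended use against the setup above (requires a running cluster):
# response = es.search(index='test', doc_type='docs',
#                      body=build_content_check(2))
# print(repr(extracted_text(response)))
```

An empty string for the docx document, next to a non-empty one for the PDF, would suggest that the extraction step itself fails rather than the indexing call.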

I tried many Word and PowerPoint files, with no success. When I extract the text manually (unzipping the file, pulling the text out of the XML, and re-encoding it with base64), it works, but then what is the point of using the mapper-attachments plugin (Apache Tika)?
This seems weird to me.
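
For reference, the manual workaround described above can be sketched roughly as follows for a docx file. It relies only on the standard library; the function names are hypothetical, and it assumes the main text lives in word/document.xml, so headers, footers and footnotes are ignored.

```python
import base64
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each w:t element holds a text piece.
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


def encode_for_attachment(text):
    """Base64-encode plain text as an ASCII string for the '_content' key."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")
```

The result of encode_for_attachment(docx_to_text("docx/test.docx")) can then be passed as "_content" in the es.index call shown earlier, which sidesteps Tika's Office parsing entirely.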

Do you have any clue what is happening?
Have I missed a step?

Python : 3.4
Elasticsearch : 2.2.0
Plugins : mapper-attachments
On my laptop (Mac OS X v 10.11.3)

We've found this too. I see no one has responded. Did you make any progress? - Drammy, Jun 29 '16 at 09:44
Unfortunately no. I tried on Linux, same failure. It does not seem to be platform-dependent. Otherwise, I've seen that mapper-attachments will be replaced by a new plugin called "ingest-attachment". So wait and see? - bosswhale, Jun 29 '16 at 16:53
I asked some of the Elastic folks yesterday and this is now fixed in ES v2.3.3. BUT: they've stripped out a lot of the Tika dependencies, so the attachment plugin doesn't support the same number of document types that Tika does. I've asked for a list of supported types, but it has not been very forthcoming so far... - Drammy, Jun 30 '16 at 10:11

0 Answers