How do I get plain-text search results from indexed PDF file contents with FOSElasticaBundle?

I'm using:
  • ElasticSearch with the Attachment-Mapper plugin
  • Elastica
  • FOSElasticaBundle, with Doctrine on Symfony2

So far, I've been able to get the mapper-attachment plugin up and running. The PDF file content is indexed following this GitHub issue as a guide: https://github.com/FriendsOfSymfony/FOSElasticaBundle/issues/96
Summary of the method, so you don't have to read the entire GitHub thread:

1) Create a "Document" entity with a "getEncodedFile" method. Note: this method only returns the raw file contents. I don't believe the data needs to be base64-encoded here, since that happens later (I'm pretty sure the Elastica Document class does it).

2) Then I set up the config.yml:

      types: 
          document:  
                mappings:  
                    id: ~  
                    encodedFile:  
                        type: attachment
                persistence:
                    driver: orm 
                    model: MyBundle\Entity\Document
                    provider: ~
                    finder: ~
                    listener: ~  
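
For reference, the entity from step 1 looks roughly like the sketch below. This is a trimmed-down illustration, not the exact code: the `path` property and the constructor are hypothetical, and the Doctrine mapping annotations are left out. It returns the raw bytes, on the assumption stated above that base64 encoding happens later.

```php
<?php

// Trimmed-down, hypothetical version of the entity (Doctrine annotations omitted).
class Document
{
    /** @var string Path to the PDF on disk (assumed property name). */
    private $path;

    public function __construct($path)
    {
        $this->path = $path;
    }

    /**
     * Getter behind the "encodedFile" attachment field in config.yml.
     * Returns the raw file bytes; no base64 encoding is done here.
     */
    public function getEncodedFile()
    {
        if (!is_readable($this->path)) {
            throw new \RuntimeException('Cannot read file: ' . $this->path);
        }

        return file_get_contents($this->path);
    }
}
```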

The search returns the correct entity: when I var_dump the hybrid results, I get the right entity with all of its fields. However, adding a highlight with "setHighlight" changes nothing for the "encodedFile" field; no highlight fragments come back for it. setHighlight does work with my other fields.

How do I pull the plain text search results (with some context) from the indexed base64 encoded data?

According to this Stack Overflow post, "Best practices for searchable archive of thousands of documents (pdf and/or xml)", it seems possible.

Thanks in advance

UPDATE

So I caved. I ended up using Xpdf to extract the text of each PDF document and indexing that text directly. Then I just run the query as normal.
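
A minimal sketch of the extraction step, assuming Xpdf's `pdftotext` binary is installed and on the PATH. The helper name `extractPdfText` is made up for illustration; the returned text would then go into an ordinary string field of the entity instead of the attachment field.

```php
<?php

/**
 * Hypothetical helper: extracts plain text from a PDF via Xpdf's pdftotext.
 * Assumes the pdftotext binary is available on the PATH.
 */
function extractPdfText($pdfPath)
{
    // Passing "-" as the output file makes pdftotext write to stdout.
    $command = 'pdftotext ' . escapeshellarg($pdfPath) . ' -';

    $lines = array();
    $exitCode = 0;
    exec($command, $lines, $exitCode);

    // A non-zero exit code means pdftotext is missing or could not read the file.
    if ($exitCode !== 0) {
        throw new \RuntimeException('pdftotext failed for ' . $pdfPath);
    }

    return implode("\n", $lines);
}
```

Because the text is indexed as a normal string field, highlighting should behave the same way it already does for the other fields.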

JonnyS
  • Does your query work if you run it via curl on the command line? For an example of the notation see http://pastebin.com/ph8AbPU5 – herrjeh42 Apr 21 '13 at 09:24
  • If I run the query on the command line, I get the same results as through PHP: the correct documents are returned, but without a plain-text excerpt of the PDF contents. The search itself works as expected; what I'd like is to tease out a plain-text excerpt from the PDF contents (stored as base64-encoded data). – JonnyS Apr 22 '13 at 16:14
  • Can you post the JSON query? Which version of ElasticSearch are you using? – herrjeh42 Apr 23 '13 at 21:28
