How do I get plain-text search results from indexed PDF file contents with FOSElasticaBundle?
I'm using:
Elasticsearch with the mapper-attachments plugin
Elastica
FOS Elastica Bundle, with Doctrine on Symfony2
So far, I've been able to get the mapper-attachments plugin up and running. The PDF file content is indexed using this GitHub issue as a guide: https://github.com/FriendsOfSymfony/FOSElasticaBundle/issues/96
Summary of the method, so you don't have to read the entire GitHub thread:
1) Create a "Document" entity with a "getEncodedFile" method. Note: I only read the raw file contents in this method. I don't believe there is a need to base64-encode the data here, as this happens later (I'm pretty sure the Elastica Document class does this).
2) Then I set up the config.yml:
types:
    document:
        mappings:
            id: ~
            encodedFile:
                type: attachment
        persistence:
            driver: orm
            model: MyBundle\Entity\Document
            provider: ~
            finder: ~
            listener: ~
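To make the two steps above concrete, here is a minimal sketch of what ends up being sent for the attachment field: the raw file bytes, base64-encoded exactly once. It is written in Python purely for illustration (the real work is done by Elastica and the bundle), and the helper names are hypothetical.

```python
import base64


def encode_attachment(raw_bytes: bytes) -> str:
    """Base64-encode raw file bytes, as an attachment field expects."""
    return base64.b64encode(raw_bytes).decode("ascii")


def attachment_document(doc_id: int, raw_bytes: bytes) -> dict:
    """Build the document body for the mapping above.

    Assumption: the attachment field is called "encodedFile", matching
    the config. The helper name itself is hypothetical.
    """
    return {"id": doc_id, "encodedFile": encode_attachment(raw_bytes)}


if __name__ == "__main__":
    # Stand-in bytes; a real caller would read the PDF file from disk.
    print(attachment_document(1, b"%PDF-1.4"))
```

If the entity method already returned base64 and a later layer encoded it again, the plugin would receive doubly encoded data, which is why the encoding should happen in only one place.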
The search function returns the correct entity. When I var_dump the hybrid results, I get the correct entity, including all of its fields. Adding a "setHighlight" call to the query changes nothing for the attachment: the results contain no highlight for the "encodedFile" field, even though setHighlight does work for other fields.
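For illustration, the request I am effectively trying to build looks roughly like the sketch below, shown as the raw JSON body built in Python rather than through Elastica. It rests on assumptions that are not confirmed here: that highlighting the extracted text requires the attachment sub-field to be stored with term vectors, and that the sub-field name depends on the plugin version (named after the parent field in older releases, "content" in newer ones). The function names are hypothetical.

```python
import json


def attachment_mapping(field: str, sub_field: str) -> dict:
    """Mapping fragment that stores the extracted text for highlighting.

    Assumption: "store" plus "term_vector" on the sub-field is what makes
    the extracted text available to the highlighter.
    """
    return {
        field: {
            "type": "attachment",
            "fields": {
                sub_field: {
                    "store": "yes",
                    "term_vector": "with_positions_offsets",
                }
            },
        }
    }


def highlight_request(search_field: str, text: str) -> dict:
    """Search body that asks for highlighted fragments of one field."""
    return {
        "query": {"match": {search_field: text}},
        "highlight": {"fields": {search_field: {}}},
    }


if __name__ == "__main__":
    print(json.dumps(attachment_mapping("encodedFile", "encodedFile"), indent=2))
    print(json.dumps(highlight_request("encodedFile", "invoice"), indent=2))
```

With Elastica, the "highlight" part of this body corresponds to the array passed to setHighlight.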
How do I pull the plain text search results (with some context) from the indexed base64 encoded data?
According to this Stack post, "Best practices for searchable archive of thousands of documents (pdf and/or xml)", it seems possible.
Thanks in advance
UPDATE
So I caved. I ended up using Xpdf to extract the text of each PDF document and indexing that plain text instead. Then I just run the query as normal.
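A rough sketch of that workaround, in Python for brevity: it assumes Xpdf's pdftotext binary is on the PATH, and the "content" field name and the helper names are hypothetical. Because the text is indexed as an ordinary string field, highlighting then behaves as it does for any other field.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Command line for Xpdf's pdftotext; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text (needs the binary installed)."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_document(doc_id: int, text: str) -> dict:
    """Plain-text document to index instead of the base64 attachment.

    Whitespace (including the form feeds pdftotext emits between pages)
    is collapsed to single spaces. "content" is a hypothetical field name.
    """
    return {"id": doc_id, "content": " ".join(text.split())}


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(build_document(1, extract_text(sys.argv[1])))
```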