I'm currently designing a full text search system where users perform text queries against MS Office and PDF documents, and the query will return a list of documents that best match it. The user will then be able to select any document returned and view that document within MS Word, Excel, or a PDF viewer.

Can I use Elasticsearch or Solr to import the raw binary documents (i.e. .docx, .xlsx, .pdf files) into their data store, and then export a document to the user's device on demand for viewing?

Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection had a text index), and that worked fine. However, MongoDB's full text searching is quite basic, so I'm now looking at either Solr or Elasticsearch to perform more complex text searching.
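For reference, a minimal sketch of that GridFS-plus-text-index setup, assuming pymongo/gridfs; the database and collection names are illustrative:

```python
from pymongo import MongoClient, TEXT
import gridfs

client = MongoClient()
db = client["docstore"]
fs = gridfs.GridFS(db)

# Store the raw binary file in GridFS.
with open("report.docx", "rb") as f:
    file_id = fs.put(f, filename="report.docx")

# Store the extracted text in a separate collection with a text index,
# keeping a reference back to the GridFS file.
db.doc_text.create_index([("text", TEXT)])
db.doc_text.insert_one({"file_id": file_id, "text": "text extracted from report.docx"})

# A basic $text query; the matching file_id lets you fetch the raw bytes.
for hit in db.doc_text.find({"$text": {"$search": "quarterly revenue"}}):
    raw = fs.get(hit["file_id"]).read()
```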

Nick

  • You may want to consider http://elasticwarehouse.org for that. It reads files, extracts metadata using Tika, and stores binary content inside ES (as a binary element) or in an external filesystem. You can also use it to test your use case (storing huge binary files or lots of binary files may cause ES cluster issues) – zuko Oct 12 '15 at 11:22
  • Hi, can you give any feedback about the solution used to meet your needs, and the concerns you've faced when trying to implement search engines? Thanks in advance. – Naou Nov 08 '15 at 14:33
  • How are you extracting text from the PDFs? Do you have some custom tools to do that, or is Elasticsearch handling that too? – The Unknown Dev Feb 24 '16 at 21:22

5 Answers

Both Solr and Elasticsearch will index the content of the document. Solr has that built in; Elasticsearch needs a plugin. It's easy either way, and both use Tika under the covers.

Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.

Additionally, neither Solr nor Elasticsearch is currently recommended as a primary store. They can do it, but it is not as mission-critical for them as it is for, say, a filesystem implementation.

So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
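A rough sketch of that layout, using the REST API from Python via requests against a modern Elasticsearch; the index and field names here are illustrative, not anything Elasticsearch prescribes:

```python
import requests

ES = "http://localhost:9200"

# Index only the extracted text plus a pointer to where the file actually
# lives; the authoritative copy stays on disk (or S3, a database, etc.).
doc = {
    "path": "/var/files/report.docx",  # where the real file is stored
    "filename": "report.docx",
    "content": "text extracted with Tika or similar",
}
requests.put(f"{ES}/documents/_doc/1", params={"refresh": "true"}, json=doc)

# Search the extracted text; use the returned path to stream the file back.
query = {"query": {"match": {"content": "quarterly revenue"}}}
resp = requests.post(f"{ES}/documents/_search", json=query).json()
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["path"])
```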

Alexandre Rafalovitch
  • Thank you! I'm now thinking of storing the documents in MongoDB and the extracted text in Elasticsearch (using the MongoDB river plugin as the link) – ngekas Jan 18 '15 at 22:48
  • @ngekas you can use Ambar as solution, we developed it to be a solid solution for such kind of problems. Check it out here https://github.com/RD17/ambar – Ilia P Apr 17 '17 at 08:49

I would try the Elasticsearch attachment plugin. Details can be found here:

https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html

https://github.com/elasticsearch/elasticsearch-mapper-attachments

It's built on top of Apache Tika:

http://tika.apache.org/1.7/formats.html

Attachment Type

The attachment type allows indexing different "attachment" type fields (encoded as base64), for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list can be found here).

The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under $ES_HOME/plugins location. It will be automatically detected and the attachment type will be added.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • iWorks document formats
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Feed and Syndication formats
  • Help formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • Source code
  • Mail formats
  • CAD formats
  • Font formats
  • Scientific formats
  • Executable programs and libraries
  • Crypto formats
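A minimal sketch of indexing through that plugin, assuming an Elasticsearch 2.x node with mapper-attachments installed; the index, type, and field names are illustrative, and note the plugin was later superseded by ingest-attachment:

```python
import base64
import requests

ES = "http://localhost:9200"

# Map a field as type "attachment" so Tika extracts its content at index time.
mapping = {"mappings": {"doc": {"properties": {"file": {"type": "attachment"}}}}}
requests.put(f"{ES}/files", json=mapping)

# Send the document itself base64-encoded; the plugin parses it on the way in.
with open("report.pdf", "rb") as f:
    payload = {"file": base64.b64encode(f.read()).decode("ascii")}
requests.put(f"{ES}/files/doc/1", params={"refresh": "true"}, json=payload)

# The extracted text is searchable via the field's content sub-field.
query = {"query": {"match": {"file.content": "quarterly revenue"}}}
print(requests.post(f"{ES}/files/_search", json=query).json())
```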
John Petrone

A bit late to the party, but this may help someone :)

I had a similar problem and some research led me to fscrawler. Description:

This crawler helps to index binary documents such as PDF, Open Office, MS Office.

Main features:

  • Local file system (or a mounted drive) crawling: index new files, update existing ones and remove old ones.
  • Remote file system over SSH crawling.
  • REST interface to let you "upload" your binary documents to elasticsearch (see the sketch after this list).
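A hedged sketch of that REST upload, assuming fscrawler is running with its REST service enabled on the default 127.0.0.1:8080 endpoint (check the fscrawler docs for your version's exact path):

```python
import requests

# Multipart-upload a file; fscrawler extracts the text and indexes it.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8080/fscrawler/_upload",
        files={"file": ("report.pdf", f)},
    )
print(resp.json())  # fscrawler echoes back the indexed document's metadata
```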
Nick

Regarding Solr:

If the docs only need to be returned on metadata searches, Solr features a BinaryField field type, to which you can send binary data base64-encoded. Keep in mind that people generally recommend against doing this, as it may bloat your index (RAM requirements/performance); if possible, a setup where you store the files externally (and only the path to the file in Solr) might be a better choice.

If you want Solr to automatically index the text inside the PDF/doc, that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
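A sketch of posting a PDF through that handler, assuming a core named "docs" on the default localhost:8983; the /update/extract path and literal.* parameters follow the standard Solr Cell interface:

```python
import requests

SOLR = "http://localhost:8983/solr/docs"

# Multipart-post the file; Solr runs it through Tika and indexes the text.
with open("report.pdf", "rb") as f:
    requests.post(
        f"{SOLR}/update/extract",
        params={"literal.id": "doc1", "commit": "true"},
        files={"file": ("report.pdf", f, "application/pdf")},
    )

# The extracted body is now searchable in the core's default content field.
resp = requests.get(f"{SOLR}/select", params={"q": "quarterly revenue"})
print(resp.json())
```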

Alegis
    This is why Solr does have the [external file type](https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes). When using it, you do not need to re-invent the handling of index vs. filesystem on your own. – cheffe Jan 16 '15 at 07:44
  • Thanks for explaining the limitations of storing binary data within Solr (I assume the same limitation applies to Elasticsearch also). – ngekas Jan 18 '15 at 22:52

Elasticsearch does store documents (.pdf and .doc files, for instance) in the _source field. It can be used as a NoSQL datastore (same as MongoDB).
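If you do index a file base64-encoded, the bytes come back in _source; a sketch reusing the hypothetical "files" index from the attachment-plugin example above:

```python
import base64
import requests

# Fetch the stored document and decode the original binary from _source.
resp = requests.get("http://localhost:9200/files/doc/1").json()
raw = base64.b64decode(resp["_source"]["file"])
with open("retrieved.pdf", "wb") as out:
    out.write(raw)  # original bytes, ready to send to the user's device
```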

Jeff