
My first post to Stack Overflow, so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is a search engine that will index roughly 2,000 documents, a mixture of PDF, Word, Excel and HTML.

I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:

So I'm left with two options:

  1. Pick a different search tool
  2. Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Which approach do you recommend?

If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!

If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?

Thanks everyone. Really appreciate your help.

javanna
Mike

2 Answers

2

Well, I have not done binary file indexing before, but apparently Solr has support for it: see "Indexing files with SPHINX/ultrasphinx" and http://wiki.apache.org/solr/ExtractingRequestHandler

There are quite a few gems available for Solr. Sunspot seems to be a popular one: http://outoftime.github.com/sunspot/

Sunspot does not seem to have built-in support for Solr Cell, but there is some work going into it: https://github.com/tomasc/sunspot_cell

There are probably better options out there, but this should give you a good starting point.
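For anyone curious what talking to the ExtractingRequestHandler directly might look like without a gem, here is a rough sketch using only Ruby's standard library. It is not taken from Sunspot or sunspot_cell. The host, port and `/solr/update/extract` path are assumptions that depend on your Solr installation, and `solr_extract_uri` and `index_file` are hypothetical helper names.

```ruby
require "net/http"
require "uri"

# Assumed Solr endpoint; host, port and path depend on your installation.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Build the request URI. literal.id sets the document's unique id and
# commit=true asks Solr to commit once the document has been added.
def solr_extract_uri(doc_id, base = SOLR_EXTRACT_URL)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form("literal.id" => doc_id, "commit" => "true")
  uri
end

# POST the raw file body to Solr, which passes it to Tika for extraction.
# content_type is the file's MIME type, e.g. "application/pdf".
def index_file(path, doc_id, content_type)
  uri = solr_extract_uri(doc_id)
  request = Net::HTTP::Post.new(uri.request_uri)
  request["Content-Type"] = content_type
  request.body = File.binread(path)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

Something like `index_file("report.pdf", "doc1", "application/pdf")` would then send one document, assuming Solr is running with the extracting handler enabled.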

Community
maecro
Many thanks for your feedback. I've decided to go down the **try to extract plain-text versions of the attachments into the database for thinking-sphinx to read** route, as per my answer below, but your suggestion is useful nonetheless. – Mike Oct 16 '11 at 09:25
1

Just to update this. The approach I've decided to go with is:

Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Specifically, I'll be doing the following:

  • Using thinking-sphinx
  • Using the subexec gem to call Tika from the command line

It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file] but I'll post my experiences if it turns out to be more complicated!
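As a rough sketch of that call, here is one way it could be wrapped in Ruby. I've used the standard library's Open3 here rather than subexec so the snippet has no gem dependencies. The jar location and the helper names `tika_command` and `extract_text` are assumptions for illustration only.

```ruby
require "open3"

# Assumed location of the Tika jar; adjust to wherever it is installed.
TIKA_JAR = "tika-app-0.10.jar"

# Build the argument list for: java -jar tika-app-0.10.jar -t [file]
# Passing an array instead of one string means no shell is involved,
# so file names containing spaces need no quoting.
def tika_command(path, jar = TIKA_JAR)
  ["java", "-jar", jar, "-t", path.to_s]
end

# Run Tika and return the extracted plain text.
# Raises if Tika exits with a non-zero status.
def extract_text(path, jar = TIKA_JAR)
  stdout, stderr, status = Open3.capture3(*tika_command(path, jar))
  raise "Tika failed for #{path}: #{stderr}" unless status.success?
  stdout
end
```

The string returned by `extract_text` could then be saved to a text column on the model (the column name is up to you) so that thinking-sphinx can index it like any other database field.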

Mike