
What is the best way to integrate Apache Tika, given that I already have Nutch (2.2.1) and Solr (4.3) connected and working?

I understand that Tika can be integrated into Nutch and/or Solr, but which is the better choice?

Osy
  • What are you trying to do? What code do you already have? – Gagravarr Aug 09 '13 at 09:59
  • No code, only configuration. I am crawling with Nutch and indexing with Solr, but my requirements also include parsing PDFs, Office documents, etc. found on the web, so I need to add Tika. My question is about the best way to do it: on the Solr side or the Nutch side. – Osy Aug 09 '13 at 17:07

2 Answers


Set up the Tika plugin in Nutch. Nutch will then parse the fetched documents for you during the crawl and do the hard work.
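As a rough sketch of what "setting up the Tika plugin" can look like, the fragment below shows two properties that would go inside the existing `<configuration>` element of `conf/nutch-site.xml`. The exact plugin list is an assumption modelled on the Nutch 2.x defaults; compare it with the `plugin.includes` value in your own `nutch-default.xml` before copying it. Raising `http.content.limit` is also an assumption about your setup: large PDFs that are truncated at the default limit usually fail to parse.

```xml
<!-- Sketch only: merge into the existing <configuration> in conf/nutch-site.xml. -->

<!-- Enable the Tika parser next to the HTML parser. The rest of the list is
     an assumption; keep whatever plugins your current setup already uses. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>

<!-- Assumption: remove the download size cap so binary documents such as
     PDFs are fetched whole. -1 means no limit; pick a byte count if you
     prefer a bound. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```

Your URL filters also matter here: if `regex-urlfilter.txt` excludes suffixes such as `pdf` or `doc`, those documents never reach the parsing phase.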

I would suggest setting it up in Solr as well: you may wish to send documents to Solr directly with the curl command, and having Tika configured on the Solr side makes that possible. It takes little extra configuration and has no performance cost.

There is a guide to setting up Tika and the extracting request handler here
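For orientation, the Solr side usually comes down to a fragment like the one below in `solrconfig.xml`, modelled on the example configuration shipped with Solr 4.x. The `lib` paths assume the stock directory layout of the Solr distribution, and the field mappings (`ignored_`, `links`) assume matching fields exist in your `schema.xml`; treat both as assumptions to adapt.

```xml
<!-- Sketch only: load the Solr Cell (Tika) jars. Paths are relative to the
     core's instance directory and assume the stock Solr 4.x layout. -->
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />

<!-- Register the extracting request handler at /update/extract. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Lowercase Tika metadata names so they map cleanly to field names. -->
    <str name="lowernames">true</str>
    <!-- Prefix for metadata fields not defined in schema.xml (assumes an
         ignored_* dynamic field exists). -->
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
```

With this in place, a file posted to `/update/extract` (for example with curl, passing a `literal.id` parameter) is parsed by Tika inside Solr, independently of anything Nutch has crawled.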

Allan Macmillan
  • If I set up Tika on both sides (Nutch and Solr), is there any way to avoid maintaining Tika twice? – Osy Aug 21 '13 at 17:15
  • There is no maintenance. Once the config is set up, that's it; you won't need to change anything. I have it set up on both sides and haven't needed any maintenance. – Allan Macmillan Aug 21 '13 at 22:01

Apply the Tika parser in Nutch's parsing phase.

GS Majumder