
How do I maintain the original formatting of an HTML document in the results returned by Solr?

I am trying to add search functionality to my company's website, which has millions of documents. They do not all share the same formatting, so it is hard to format each document individually.

I am using a Solr 4.1 nightly build from the Apache site, which has built-in support for Solr Cell and Tika, so I do not need to configure them separately.

Does Solr Cell or Tika retain this formatting anywhere?

If the formatting is not retained, I will need to fetch each document from its physical file location using Solr's resourcename field and then apply the highlights and other ready-made Solr functionality myself, but that process is too tedious.

EDIT: Which request handler can I use if I have to use "HTMLStripCharFilterFactory", as suggested by Jayendra in the answer? Also, can I extract metadata tags in that case?

Can anyone guide me on this?

Thank you for all your support!

Ry-
Mantra

1 Answer


Solr Cell with Tika does not maintain the original formatting of the document.
You would get only the extracted text from the documents fed to Solr through Tika.

Otherwise, you have to feed the HTML document as a normal Solr field and apply the HTMLStripCharFilterFactory char filter in that field's analyzer, so that both copies are maintained: the original markup and the searchable text.
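
As a rough sketch of what "feeding the HTML as a normal field" could look like from a client, the snippet below builds a JSON update request that carries the raw HTML in a single field. The URL, the core name `collection1` and the field name `html_content` are assumptions for illustration only; the field would have to be defined in your own schema.xml with an analyzer that uses HTMLStripCharFilterFactory. It also assumes the standard update handler at `/update` accepts JSON, so ExtractingRequestHandler is not involved.

```python
import json
import urllib.request

# Assumed values for illustration; adjust to the actual Solr host and core.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_payload(doc_id, raw_html):
    """Return the JSON body for one document with the raw HTML in a single field."""
    # "html_content" is a hypothetical field name, not a Solr default.
    doc = {"id": doc_id, "html_content": raw_html}
    return json.dumps([doc]).encode("utf-8")


def post_to_solr(payload, url=SOLR_UPDATE_URL):
    """POST the payload to the update handler and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a running Solr instance, `post_to_solr(build_payload("doc-1", open("page.html").read()))` would send one document; `page.html` is a placeholder file name.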

Solr will keep the original document, including the HTML markup, when the field has stored=true.
For search (indexed=true), however, matching happens only on the text content and not on the HTML elements, because the char filter strips the tags before the text is indexed.
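
To make the stored versus indexed distinction concrete, here is a small Python stand-in for the idea. It is not Solr's implementation, only an approximation: `searchable_text` plays the role of the tag-stripping step, and `matches` checks a term against the stripped text only. Both function names are made up for this illustration. Unlike a real strip filter it does nothing special for script or style blocks.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and drop the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for tag names or attributes.
        self.parts.append(data)


def searchable_text(raw_html):
    """Approximate the text that would be indexed after tags are stripped."""
    parser = TextOnly()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)


def matches(raw_html, term):
    """Match against the stripped text only; the raw HTML stays untouched."""
    return term.lower() in searchable_text(raw_html).lower()
```

For a document such as `<div class="note">Hello <b>world</b></div>`, a search for "world" matches while a search for "div" or "note" does not, and the value you would display is still the untouched HTML.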

Jayendra
  • Thank you for the reply. I was expecting an answer from you, as I have seen you answering in lots of Solr tags. Coming to the point, can you please explain more about "_document as a normal field_"? Does it mean I have to feed the HTML document to Solr in text format? – Mantra Feb 08 '13 at 11:04
  • Yup, you should feed the HTML document contents as a normal Solr field, which would be analyzed through the HTML filter. – Jayendra Feb 08 '13 at 11:07
  • I hope you understood my question: I want to display the original document in which the match was found, with the highlights and other enrichments. If I provide the HTML doc as text, the search query will also be matched against the HTML tags, which I don't want. Can you guide me further with this? I am totally new to Solr. – Mantra Feb 08 '13 at 11:12
  • Can you please post some example configuration changes that I need to make in either schema.xml or solrconfig.xml? – Mantra Feb 08 '13 at 11:23
  • Which request handler can I use other than ExtractingRequestHandler? Do I need to write my own, or is there a predefined handler? :( Please help! – Mantra Feb 12 '13 at 13:50