Solr query results - need searched text and a few lines around it

Question

I am completely lost. I think I am definitely missing something fundamental here. Everybody has such awesome stuff to say about Solr but I fail to see it.

I indexed a structured PDF document in Solr. The problem is that when I search for a simple string, I get the entire content field as the response! I don't know how to change that. My requirement is that if, let's say, I search for "metadata", it should give me

"MetadataDiscussion . . . 4 matches ... make sure that Tika users have a chance to get to all of the metadata created and/or extracted by Tika. == Original Problem == The original inspiration for this page was a Tika ... 10.7k - rev: 2 (current) last modified: 2010-08-02 18:09:45 "

But it gives me the whole document, the entire string that was indexed. It seems like Lucene can only tell me in which field the match occurred, not where in the field it occurred.

Any help will be greatly appreciated!!

score 0 · Accepted Answer · answered May 19 '12 at 02:08


Lucene/Solr is primarily a retrieval engine: it retrieves documents that match a query, so this behavior is expected. For your requirement, you can use Solr's highlighting feature, which returns snippets of text around the matched terms. Suppose your document text is stored in a field named text; then you would add the following parameters to the query:

&hl=true&hl.fl=text&hl.snippets=5&hl.fragsize=200

Look through the other highlighting parameters to customize it even further.
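As a minimal sketch of how those parameters fit into a full request, the Python snippet below assembles a query URL. The base URL `http://localhost:8983/solr/select` and the helper name `build_highlight_query` are assumptions for illustration (the host and port are taken from the comment thread, the `/select` handler path is assumed); adjust them to your own setup.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, query, field="text", snippets=5, fragsize=200):
    """Build a Solr query URL with highlighting enabled.

    base_url is assumed to point at a Solr search handler;
    field must be a stored field for highlighting to work.
    """
    params = {
        "q": query,
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field(s) to generate snippets from
        "hl.snippets": snippets,  # maximum snippets per field
        "hl.fragsize": fragsize,  # approximate snippet size in characters
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance; the handler path is an assumption.
url = build_highlight_query("http://localhost:8983/solr/select", "metadata")
print(url)
```

Building the URL with `urlencode` rather than by hand keeps queries containing spaces or special characters correctly escaped.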

Solr is amazing :)


Ansari


I tried that. It doesn't work. :( It again returns the entire field. The field is supposed to be stored, right? I've made almost no changes to the solrconfig.xml besides making the text field stored. I post the document using Solr Cell, so: curl "http://localhost:8983/solr/update/extract?stream.file=/home/Desktop/DOCUMENTS/T.pdf&stream.contentType=application/pdf&literal.id=DOC_N&commit=true&captureAttr=true" – 12rad May 20 '12 at 23:42
And then doing a query *:* shows me that all the content of this document got indexed in the field. Now when I do a simple search with the highlight parameters on, I still get the entire content. Nothing changes. What am I doing wrong? Could it be that the PDF parser is not indexing the document correctly? I've struggled with this a lot! :( – 12rad May 20 '12 at 23:44
– 12rad May 20 '12 at 23:55
That is the only response I get! – 12rad May 20 '12 at 23:55
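One thing worth checking, offered as a possibility rather than a diagnosis: Solr returns highlighting snippets in a separate `highlighting` section of the response, keyed by document id, while the matched documents themselves still carry their full stored fields unless the `fl` parameter limits them. The sketch below pulls the snippets out of a response that has already been parsed from JSON (for example, one requested with `wt=json`). The helper name `extract_snippets` and the sample response are hypothetical and hand-written for illustration; they are not output from a real Solr instance.

```python
def extract_snippets(response, field="text"):
    """Return {doc_id: [snippet, ...]} from a parsed Solr JSON response."""
    # Snippets live under the top-level "highlighting" key, not in the docs.
    highlighting = response.get("highlighting", {})
    return {
        doc_id: fields.get(field, [])
        for doc_id, fields in highlighting.items()
    }


# Hand-written illustrative response; not real Solr output.
sample_response = {
    "response": {
        "numFound": 1,
        "docs": [{"id": "DOC_N", "text": "...full stored content..."}],
    },
    "highlighting": {
        "DOC_N": {
            "text": ["all of the <em>metadata</em> created and/or extracted by Tika"]
        }
    },
}

snippets = extract_snippets(sample_response)
for doc_id, parts in snippets.items():
    print(doc_id, parts)
```

If the `highlighting` section is present but empty for a document, the highlighted field name or its stored setting would be the next things to verify.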
