Solr Get Paragraphs of Documents

Asked Dec 31 '11 at 13:40

Active Jan 01 '12 at 13:30

Viewed 329 times

I've been working with Solr for a couple of days, and I need to split a document into its paragraphs and then search each paragraph individually. I have tried many things, but Solr does not capture the paragraphs correctly: it either captures nothing or captures everything as one large block of text. This is what I tried:

    ContentStreamUpdateRequest up
        = new ContentStreamUpdateRequest("/update/extract");

    up.addFile(new File("/home/usr/Documents/example.doc"));
    up.setParam("literal.id", "foo");

    up.setParam(ExtractingParams.CAPTURE_ATTRIBUTES, "true");
    up.setParam(ExtractingParams.CAPTURE_ELEMENTS, "p");
    up.setParam(ExtractingParams.MAP_PREFIX + "p", "attr_paragraphs");

Whatever combination I try, I always get the wrong results. Does anyone know how to extract the paragraphs so that they are easy to work with? I am writing a plugin that does basic query-based summarization; it is supposed to retrieve the paragraph that carries the most information about the query, but I don't know how to get the paragraphs.

Thanks!

edited Jan 01 '12 at 13:30

javanna


asked Dec 31 '11 at 13:40

user1124347


Uh, why not split up the article client-side and then upload each paragraph individually to the server? You likely want each paragraph to be a separate Solr document (the term for a Solr entity). – Jesvin Jose Dec 31 '11 at 17:44
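
A minimal sketch of the client-side approach suggested in the comment above, not a tested Solr integration. It assumes the document's plain text has already been extracted (for example with Apache Tika) and that paragraphs are separated by blank lines. The class name `ParagraphSplitter` and the field names `parent_id` and `text` are hypothetical and would need to match your own schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits extracted text into paragraphs and builds
// one field map per paragraph, ready to be copied into a Solr document.
final class ParagraphSplitter {

    /** Splits plain text into paragraphs on one or more blank lines. */
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        // Separator: a line break, optional whitespace, then another line break.
        for (String chunk : text.split("\\R\\s*\\R")) {
            // Collapse line breaks and runs of whitespace inside a paragraph.
            String paragraph = chunk.trim().replaceAll("\\s+", " ");
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    /** Builds one field map per paragraph, with ids such as "foo_p0". */
    static List<Map<String, String>> toDocuments(String parentId,
                                                 List<String> paragraphs) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", parentId + "_p" + i);   // unique id per paragraph
            doc.put("parent_id", parentId);       // assumed field name
            doc.put("text", paragraphs.get(i));   // assumed field name
            docs.add(doc);
        }
        return docs;
    }
}
```

Each map could then be copied field by field into a SolrJ `SolrInputDocument` and sent to the regular update handler, so that a query returns the best-matching paragraph rather than the whole file. Keeping the original id in a separate field makes it possible to trace a hit back to its source document.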


0 Answers