
I've been working with Solr for a couple of days, and I need to split a document into its paragraphs and then search each of them. I've tried many things, but Solr doesn't capture the paragraphs correctly: either it captures nothing, or it captures everything as one big block of text. Here is what I tried:

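  import java.io.File;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
  import org.apache.solr.handler.extraction.ExtractingParams;

  // Send the file to the extracting request handler (Solr Cell / Tika).
  ContentStreamUpdateRequest up =
      new ContentStreamUpdateRequest("/update/extract");

  up.addFile(new File("/home/usr/Documents/example.doc"));
  up.setParam("literal.id", "foo");  // set the document id to "foo"

  // Index element attributes, capture <p> elements separately,
  // and map the captured "p" field to "attr_paragraphs".
  up.setParam(ExtractingParams.CAPTURE_ATTRIBUTES, "true");
  up.setParam(ExtractingParams.CAPTURE_ELEMENTS, "p");
  up.setParam(ExtractingParams.MAP_PREFIX + "p", "attr_paragraphs");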

Whatever combination I try, I always get the wrong results. Does anyone know how to extract the paragraphs and make them easy to use? I am writing a plugin that does basic query-based summarization: it is supposed to retrieve the paragraph that contains the most information about the query, but I don't know how to get at the paragraphs.

Thanks!

javanna
    Uh, why not split up the article client-side and then upload each paragraph individually to the server? You likely want each paragraph to be a separate solr document (term for solr entity). – Jesvin Jose Dec 31 '11 at 17:44
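
The client-side approach suggested in the comment above could be sketched roughly as follows. This is only an illustration under several assumptions: the document text has already been extracted to a plain string (for example with Tika), paragraphs are separated by blank lines, and the class name `ParagraphSplitter` and the field names `doc_id` and `paragraph_text` are hypothetical and would have to match your own schema. Each resulting map stands in for one Solr document; sending them to Solr (e.g. via SolrJ) is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits extracted text into paragraphs on the client
// and prepares one field map per paragraph, to be indexed as separate documents.
final class ParagraphSplitter {

    // Split plain text on blank lines (assumption: that is how paragraphs
    // are delimited in the extracted text). Empty chunks are skipped.
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : text.split("\\R\\s*\\R")) {
            // Collapse internal line breaks and runs of whitespace.
            String p = chunk.trim().replaceAll("\\s+", " ");
            if (!p.isEmpty()) {
                paragraphs.add(p);
            }
        }
        return paragraphs;
    }

    // Build one field map per paragraph. The field names "doc_id" and
    // "paragraph_text" are illustrative, not part of any Solr default schema.
    static List<Map<String, String>> toDocs(String docId, String text) {
        List<Map<String, String>> docs = new ArrayList<>();
        List<String> paragraphs = split(text);
        for (int i = 0; i < paragraphs.size(); i++) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", docId + "_p" + i);      // unique id per paragraph
            doc.put("doc_id", docId);             // link back to the source document
            doc.put("paragraph_text", paragraphs.get(i));
            docs.add(doc);
        }
        return docs;
    }
}
```

With the paragraphs indexed as separate documents, a query returns the best-matching paragraphs directly, and `doc_id` can be used to filter or group results by the original file.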
