I've been working with Solr for a couple of days, and I need to split a document into its paragraphs and then search each one individually. I've tried a lot of things, but Solr won't capture the paragraphs correctly: it either captures nothing or captures everything as one big block of text. This is what I tried:
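// Send the file to the extracting request handler (Solr Cell).
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File("/home/usr/Documents/example.doc"));
up.setParam("literal.id", "foo");
// captureAttr: index attributes of the Tika XHTML elements into separate fields.
up.setParam(ExtractingParams.CAPTURE_ATTRIBUTES, "true");
// capture: capture the content of <p> elements separately.
up.setParam(ExtractingParams.CAPTURE_ELEMENTS, "p");
// fmap.p: map the captured "p" field to attr_paragraphs.
up.setParam(ExtractingParams.MAP_PREFIX + "p", "attr_paragraphs");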
Whatever combination of parameters I try, I always get the wrong results. Does anyone know how to extract the paragraphs and make them easy to use? I am writing a plugin that does basic query-based summarization and is supposed to retrieve the paragraph that has the most information about the query, but I just don't know how to get the paragraphs.
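To make the goal concrete, here is a minimal sketch of the kind of paragraph selection I have in mind, independent of Solr. Everything in it is an assumption for illustration only: the class name `ParagraphPicker` is hypothetical, it assumes paragraphs in the extracted plain text are separated by blank lines, and it scores a paragraph simply by how many distinct query terms it contains. It is not what the plugin currently does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class ParagraphPicker {

    // Assumption: paragraphs are separated by one or more blank lines.
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : text.split("\\R\\s*\\R")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    // Lowercases the text and collects its distinct letter/digit tokens.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Returns the paragraph sharing the most distinct terms with the query.
    // Ties go to the earliest paragraph; returns null if there are no paragraphs.
    static String bestParagraph(String text, String query) {
        Set<String> queryTerms = terms(query);
        String best = null;
        int bestScore = -1;
        for (String paragraph : splitParagraphs(text)) {
            Set<String> overlap = terms(paragraph);
            overlap.retainAll(queryTerms);
            if (overlap.size() > bestScore) {
                bestScore = overlap.size();
                best = paragraph;
            }
        }
        return best;
    }
}
```

Calling `ParagraphPicker.bestParagraph(extractedText, "some query")` would give me the paragraph I want to show in the summary. Plain term overlap is only a stand-in for a real relevance score, which is why I would prefer to have each paragraph indexed in Solr and let Solr do the ranking.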
Thanks!