Solr file indexing: map content to pages

Question

I would like to index files in Solr. I have already made an "output script" with PHP, but my project leader has given me the task of displaying the page number of the found text.

So:
- I am searching for the word "Foo".
- Solr returns the results and also the highlighted text.
- Now I would like to know which page this highlighted text is on, so that I can find it.

The files are *.pdf files.

One solution I have thought of would be to import the text of the PDF files into different fields, one per page. Or maybe into a single multivalued field named "content".

Maybe like this:

JSON:
    {
        "content": {
            "1": "page one text",
            "2": "page two text"
        }
    }

and so on?

Is this possible? Or is there a better way to find this information out? Thanks for your help! :-)

Hi Cyruxx - Welcome to StackOverflow. You might like to post the PHP code you already have; that could help people suggest where to make changes. – Neil Townsend, Apr 05 '13 at 15:59

score 0 · Answer 1 · answered Apr 06 '13 at 07:45

You need to create a separate Solr document for every page of every PDF file, with the page number stored in its own field. If you want to return only one result per file, you can then use FieldCollapsing (result grouping) to group all the results from the same PDF file.
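To make the idea concrete, here is a minimal sketch in Python of what "one document per page" could look like. The field names (`id`, `file_id`, `page`, `content`) and the file name are assumptions for illustration, not something from the question's schema, and the page texts are assumed to be already extracted. The query parameters (`hl`, `hl.fl`, `group`, `group.field`) are standard Solr parameters. Grouping expects `file_id` to be a single-valued field.

```python
import json
from urllib.parse import urlencode


def build_page_docs(file_id, pages):
    """Return one Solr document (a dict) per page of a single PDF.

    `pages` is a list of page texts in reading order; page numbers start at 1.
    All field names here are hypothetical and must match your schema.
    """
    return [
        {
            "id": f"{file_id}_p{number}",  # unique key per page
            "file_id": file_id,            # shared by all pages of one file
            "page": number,                # the page number to display
            "content": text,               # the searchable page text
        }
        for number, text in enumerate(pages, start=1)
    ]


def build_grouped_query(term):
    """Return a query string that highlights hits and groups them by file."""
    params = {
        "q": f"content:{term}",
        "fl": "id,file_id,page",
        "hl": "true",
        "hl.fl": "content",
        "group": "true",
        "group.field": "file_id",
        "wt": "json",
    }
    return urlencode(params)


docs = build_page_docs("report.pdf", ["page one text", "page two text"])
payload = json.dumps(docs)           # JSON body for a POST to the core's /update handler
query = build_grouped_query("Foo")   # append to the core's /select handler URL
```

Each hit then carries its own `page` value, so the output script can print the page number next to the highlighted snippet.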

nikhil500

Hello, but I'm using the ExtractingRequestHandler for this, so how is this possible? By the way, thanks for your solution. :) – amahrt Apr 08 '13 at 07:46
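One possible way to combine the two, sketched under assumptions: split each PDF into single-page PDFs with an external tool first (not shown here), then send every page file to the extract handler separately and attach the page number through `literal.<field>` parameters, which Solr Cell uses to set field values on the extracted document. The core URL, field names and ID scheme below are hypothetical.

```python
from urllib.parse import urlencode


def extract_url(base_url, file_id, page):
    """Build the /update/extract URL for one single-page PDF.

    The literal.* parameters set fields on the document Solr creates
    from the uploaded file; the field names are assumptions.
    """
    params = {
        "literal.id": f"{file_id}_p{page}",
        "literal.file_id": file_id,
        "literal.page": page,
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


# Hypothetical core URL; each URL receives the matching single-page PDF as the POST body.
urls = [
    extract_url("http://localhost:8983/solr/mycore", "report.pdf", n)
    for n in (1, 2)
]
```

A commit is still needed after the uploads before the pages become searchable.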
