Get page numbers of search results for a PDF in Solr

Question

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.

So what I need is the page number and a short text snippet of every search result.

I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.

I found "Indexing PDF with page numbers with Solr", but it wasn't really helpful.

score 2 · Accepted Answer · answered Mar 21 '13 at 11:14

I'm now splitting the PDF and sending each page to Solr separately. Every page is its own document with the id <id_of_document>_<page_number> and an additional field, doc_id, which contains only <id_of_document> and is used for grouping the results.
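
A minimal sketch of that per-page scheme, assuming the page texts have already been extracted (for example with Tika or PDFBox, which is not shown here). The field names `page` and `content`, the document id `report42`, and both helper functions are made up for illustration; only the `id` and `doc_id` layout comes from the answer. The query side uses Solr's standard result-grouping and highlighting parameters, and the search term is not escaped in this sketch.

```python
import json
from urllib.parse import urlencode


def build_page_docs(doc_id, page_texts):
    """Return one Solr document per page, numbering pages from 1."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": doc_id,                 # shared by all pages of one PDF
            "page": page_number,              # hypothetical field name
            "content": text,                  # hypothetical field name
        })
    return docs


def grouped_query(term):
    """Build a query string that groups page hits by doc_id and requests snippets."""
    return urlencode({
        "q": f"content:{term}",
        "group": "true",
        "group.field": "doc_id",
        "hl": "true",
        "hl.fl": "content",
        "wt": "json",
    })


docs = build_page_docs("report42", ["Intro text", "Solr paging details"])
payload = json.dumps(docs)        # JSON array to send to the update handler
query = grouped_query("paging")   # append to the select handler URL
```

The `content` field would have to be stored for the highlighter to return snippets, and the page number of each hit can be read from the `page` field or from the suffix of `id`.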

Gesh

Hi @Gesh, maybe you can share how you managed to split your PDFs? – zygimantus Jan 19 '17 at 12:12

score 0 · Answer 2 · answered Feb 28 '13 at 04:12

There is a JIRA issue, SOLR-380, with a patch that you can check out.

Jayendra

Thx, but it doesn't seem to work with PDF files converted by Tika. I also doubt that this patch works with Solr 4.1. – Gesh Mar 01 '13 at 10:01

score 0 · Answer 3 · answered Sep 02 '16 at 04:20

I also tried getting the results with page numbers but could not do it. Instead, I used Apache PDFBox to split all the PDFs in a directory and sent the resulting files to the Solr server.

Mayank Vij

So you were using PDFBox twice? Once when splitting and again when parsing? – zygimantus Jan 19 '17 at 12:13
No, I used PDFBox only once: to split each PDF into separate pages and to write the parent file name into each page's title. Then I sent the files to the Solr server, and there I opened the file using the combination of parent file name + page number. – Mayank Vij Mar 03 '17 at 05:49

score 0 · Answer 4 · answered May 03 '18 at 16:49

I have not tried this myself, but here is the approach:

1. Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs.
2. Create multiple fields in Solr such as page1, page2, page3, …, pageN. Alternatively, use dynamic fields.
3. In the custom connector, read each PDF page by page and index every page into its respective page field or dynamic field.
4. Enable search on all the "page" fields.
5. When a user searches, use the highlighter (summary/teaser) component to retrieve only the "page" fields that have hits.
6. The "page" fields that have a hit for a given record (as reported by the highlighter) are the pages that contain the searched phrase.
7. Link to the PDF with "#PageNumber" appended and open that page on click.
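
Steps 5 to 7 could be sketched roughly as follows. This assumes the pages are indexed into dynamic fields named like `page_7_txt` (a made-up naming pattern), that highlighting was requested on those fields, and that the pdf.js viewer lives at `/pdfjs/web/viewer.html` (also an assumption). The `highlighting` section of Solr's JSON response maps each document id to the fields that produced snippets, so the page numbers can be read off the field names; both helper functions are hypothetical.

```python
import re
from urllib.parse import quote

# Hypothetical dynamic field naming: page_<number>_txt
PAGE_FIELD = re.compile(r"^page_(\d+)_txt$")


def page_hits(response):
    """Return (doc id, page number, snippet) tuples from a Solr JSON response."""
    hits = []
    for doc_id, fields in response.get("highlighting", {}).items():
        for field_name, snippets in fields.items():
            match = PAGE_FIELD.match(field_name)
            if match and snippets:
                hits.append((doc_id, int(match.group(1)), snippets[0]))
    # Sort by document id, then by page number
    return sorted(hits, key=lambda hit: (hit[0], hit[1]))


def viewer_link(pdf_url, page_number, viewer="/pdfjs/web/viewer.html"):
    """Build a pdf.js viewer URL that opens the PDF at the given page."""
    return f"{viewer}?file={quote(pdf_url, safe='')}#page={page_number}"


# Hand-written sample in the shape of a Solr highlighting response
sample = {
    "highlighting": {
        "report42": {
            "page_7_txt": ["the <em>index</em> is rebuilt nightly"],
            "page_2_txt": ["an <em>index</em> per page"],
        }
    }
}
hits = page_hits(sample)
links = [viewer_link("/files/report42.pdf", page) for _, page, _ in hits]
```

The pdf.js viewer reads the `#page=N` fragment, so each link opens the document at the page that matched.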

I think this is a far better approach than splitting the PDFs and indexing the pages as separate Solr docs.

If you find a flaw in this design, respond to my thread. I will attempt to resolve it.

aswath86

Why is this a "far better approach" than indexing each page as a separate document? – MatsLindh May 03 '18 at 21:57
Maybe because you don't have to retain 2 copies of all your PDFs? – aswath86 May 04 '18 at 23:50
Why would you require that? You only store the content for each page once with associated file name? – MatsLindh May 05 '18 at 13:18
You need the full PDF to link it to the user on the search results. You need the split PDFs for indexing them. 2 copies. User is not interested in the split PDF. Search Engine is not interested in the Full PDF. – aswath86 May 07 '18 at 19:01
Sorry, but that's not how it works - the indexed content would be the same, and you still need only one copy of the PDF. There is no difference in the amount of data indexed, and you don't need to generate or keep a split pdf around - that can be done entirely in code when indexing (i.e. extract the content of a single page). This will not be an issue. – MatsLindh May 08 '18 at 07:17
"That's not how it works" holds only if you are planning to split the PDF in your code. I was talking about splitting them as a separate process with some tool. In your case, you have to do result grouping or collapsing at search time. To me, it's a clumsy process to do pagination, entity extraction, or even basic faceting when I have multiple Solr docs for 1 actual doc. – aswath86 May 08 '18 at 15:33
And relevancy tuning as well. I wouldn't like to trade these away. But to each their own use case. Sometimes all you care about is a text-based lookup. Sometimes you need them all. – aswath86 May 08 '18 at 15:43
