3

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.

So what I need is the page number and a short text snippet of every search result.

I'm using SOLR 4.1 to index the PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.

I found the question "Indexing PDF with page numbers with Solr", but it wasn't really helpful.

Gesh

4 Answers

2

I'm now splitting the PDF and sending each page separately to SOLR. Every page is its own document with the id <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document> and is used for grouping the results.
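
A minimal sketch of that layout in Python, not the exact code I run. It assumes the page texts have already been extracted (for example with Tika or PDFBox), and the field names `page` and `content` as well as the three helper functions are made up for illustration. Only `id` and `doc_id` come from the scheme described above. The query parameters `group`, `group.field`, `hl` and `hl.fl` are standard Solr parameters for result grouping and highlighting.

```python
from urllib.parse import urlencode


def build_page_docs(doc_id, page_texts):
    """Return one Solr document (a plain dict) per PDF page."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": doc_id,                 # shared by all pages, used for grouping
            "page": page_number,              # hypothetical field name
            "content": text,                  # hypothetical field name
        })
    return docs


def build_search_params(term):
    """Build a query string that groups page hits by their parent PDF."""
    return urlencode({
        "q": f"content:{term}",   # assumes a single-word term, no escaping done
        "group": "true",
        "group.field": "doc_id",
        "hl": "true",             # ask Solr for highlighted snippets
        "hl.fl": "content",
        "wt": "json",
    })


def viewer_link(pdf_url, page_number):
    """Append a page fragment; assumes the viewer honours '#page=N'."""
    return f"{pdf_url}#page={page_number}"
```

Each hit then already carries its page number, and the highlighting section of the response supplies the snippet for that page.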

Gesh
0

There is a JIRA issue, SOLR-380, with a patch that you can check.

Jayendra
  • Thx, but it doesn't seem to work with pdf files converted by Tika. I also doubt that this patch is working with SOLR 4.1. – Gesh Mar 01 '13 at 10:01
0

I also tried getting the results with page numbers but could not do it directly. Instead, I used Apache PDFBox to split all the PDFs in a directory into pages and sent the resulting files to the Solr server.

  • So you were using PDFBox two times? When splitting and then when parsing? – zygimantus Jan 19 '17 at 12:13
  • No. I used PDFbox only once. I used it to split it into multiple pages and also to enter the parent file name in its title. Then I sent the file to Solr server and there I opened the file using combination of the parent file name+Page number. – Mayank Vij Mar 03 '17 at 05:49
0

I have not tried this myself, but the approach would be:

  1. Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs
  2. Create multiple attributes in Solr like page1, page2, page3…,pageN; alternatively, use dynamic attributes in Solr
  3. In the custom connector, read the PDFs page by page and index each page into its respective page attribute/dynamic attribute
  4. Enable search on all the “page” attributes
  5. When a user searches, use the “highlighter/Summary/Teaser” component to retrieve only the “page” attributes that have hits
  6. The “page” attributes that have a hit (as reported by the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase.
  7. Link the PDF with the “#PageNumber” of the PDF and pop up the page on click
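
Since I have not built this, the following is only a sketch of steps 2, 3 and 6 in Python. It assumes a dynamic field pattern such as `*_txt` is defined in the schema, so the page attributes are named `page_1_txt`, `page_2_txt` and so on. Those field names and both helper functions are hypothetical. The only thing taken from Solr itself is the shape of the `highlighting` section of a JSON response, which maps document id to field name to a list of snippets.

```python
import re

# Matches the hypothetical per-page dynamic fields, e.g. "page_12_txt".
PAGE_FIELD = re.compile(r"^page_(\d+)_txt$")


def build_document(doc_id, page_texts):
    """Build ONE Solr document holding every page in its own field."""
    doc = {"id": doc_id}
    for page_number, text in enumerate(page_texts, start=1):
        doc[f"page_{page_number}_txt"] = text
    return doc


def pages_with_hits(response, doc_id):
    """Map page number -> snippets, read from the highlighting section.

    `response` is the parsed JSON of a Solr query sent with hl=true and
    an hl.fl value that covers the page fields.
    """
    fields = response.get("highlighting", {}).get(doc_id, {})
    hits = {}
    for field_name, snippets in fields.items():
        match = PAGE_FIELD.match(field_name)
        if match and snippets:  # ignore non-page fields and empty results
            hits[int(match.group(1))] = snippets
    return dict(sorted(hits.items()))
```

The page numbers returned by `pages_with_hits` are what step 7 would put into the link, and the snippets can be shown as the teaser text.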

This is a far better approach than splitting the PDFs and indexing them as separate Solr docs.

If you find a flaw in this design, respond to my thread. I will attempt to resolve it.

aswath86
  • Why is this a "far better approach" than indexing each page as a separate document? – MatsLindh May 03 '18 at 21:57
  • Maybe because you don't have to retain 2 copies of all your PDFs? – aswath86 May 04 '18 at 23:50
  • Why would you require that? You only store the content for each page once with associated file name? – MatsLindh May 05 '18 at 13:18
  • You need the full PDF to link it to the user on the search results. You need the split PDFs for indexing them. 2 copies. User is not interested in the split PDF. Search Engine is not interested in the Full PDF. – aswath86 May 07 '18 at 19:01
  • Sorry, but that's not how it works - the indexed content would be the same, and you still need only one copy of the PDF. There is no difference in the amount of data indexed, and you don't need to generate or keep a split pdf around - that can be done entirely in code when indexing (i.e. extract the content of a single page). This will not be an issue. – MatsLindh May 08 '18 at 07:17
  • "That's not how it works" if you are planning to split the pdf in your code. I was talking about "splitting them" as a separate process with some tool. In your case, you have to do result grouping or collapsing during search time. To me, it's a clumsy process to do pagination or entity extraction or even basic faceting for that matter when I have multiple solr docs against 1 actual doc. – aswath86 May 08 '18 at 15:33
  • And relevancy tuning also. I wouldn't like to trade these. But to each their own use case. Sometimes, all you care about is just a text-based lookup. Sometimes, you need them all. – aswath86 May 08 '18 at 15:43