
We're building a PDF search engine with Solr and Lucene where users can search for text in PDFs. The database contains only PDFs.

In the search results page ("/browse") we want to append #page=X to the link to the PDF file, where X is the page the text was found on. (Adobe Acrobat automatically scrolls to a given page if it is specified with an anchor in the URL.)

For example, if I search for foobar and there's a PDF document where foobar is on page 5, the link should be http://pdfserver/pdfs/pdf.pdf#page=5 (note the anchor at the end).

  1. Is this possible?
  2. How would we get this page number?
Simon Fredsted
  • I don't think I understand what you're actually trying to achieve. Do you want to index PDF files and have every search return the page number of the matched text, or is it something else? – omu_negru Jun 30 '14 at 09:59
  • Exactly that. So if I search for "foobar" and there's a PDF document where "foobar" is on page 5, the link should be http://pdfserver/pdfs/pdf.pdf#page=5 – Simon Fredsted Jun 30 '14 at 10:32
  • Did you ever find a solution to this? Seems like a basic requirement when indexing a load of PDF files. – MrTelly Dec 07 '15 at 04:47
  • @MrTelly, I used the #search solution, URL-encoding the search term. – Simon Fredsted Dec 07 '15 at 09:11

2 Answers


One easy-to-implement solution I found was to use the #search parameter that Adobe Reader supports when embedded in IE.

For example:

http://pdfserver/pdfs/pdf.pdf#search=foobar

Adobe Reader then jumps to the page where the term is found.

One would need to URL-encode the search terms, of course.
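
A minimal sketch of building such a link in Python, assuming the PDF URL and the user's search term are already available as strings. The helper name `build_search_link` is hypothetical; only `urllib.parse.quote` comes from the standard library.

```python
from urllib.parse import quote


def build_search_link(pdf_url: str, term: str) -> str:
    """Append a #search fragment holding the percent-encoded search term."""
    # safe="" also encodes "/", "&" and "#", so the term cannot break the fragment.
    return f"{pdf_url}#search={quote(term, safe='')}"


print(build_search_link("http://pdfserver/pdfs/pdf.pdf", "foo bar"))
# prints http://pdfserver/pdfs/pdf.pdf#search=foo%20bar
```

Whether the viewer honours the fragment still depends on the PDF plugin in use, so it is worth checking in the target browser.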

Simon Fredsted

Apache Tika can transform PDF files into structured data that you can feed into the Solr server.

My approach to your problem would be to index each PDF page by page, so that every page becomes its own entry with extra fields linking to the chapter, the text title (or absolute path, or both) and the page number. Using this data, you can then open the relevant document at the relevant page.
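
As a rough sketch of that idea in Python, assuming the text of each page has already been extracted (for instance with Tika) into a list of strings: the field names `id`, `url`, `page_number` and `text`, and both helper functions, are hypothetical and would have to match your own Solr schema.

```python
def build_page_docs(pdf_url, page_texts):
    """Turn the pages of one PDF into one index document per page."""
    return [
        {
            "id": f"{pdf_url}#page={number}",  # unique key per page (assumed scheme)
            "url": pdf_url,
            "page_number": number,  # 1-based, as used by the #page anchor
            "text": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def page_link(doc):
    """Build the result link for a matching page document."""
    return f"{doc['url']}#page={doc['page_number']}"


docs = build_page_docs("http://pdfserver/pdfs/pdf.pdf", ["intro", "foobar here"])
print(page_link(docs[1]))
# prints http://pdfserver/pdfs/pdf.pdf#page=2
```

Each dictionary would be sent to Solr as a separate document, so a hit on the `text` field already carries the page number needed for the link.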

Read more about Tika here: http://tika.apache.org/

omu_negru