
I am trying to find the page number of a PDF object using iText's Java API. The following code reads in the PDF file and gets the object containing the open action. How do I get the page number of that object?

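import java.io.FileInputStream;
import java.io.IOException;
import com.itextpdf.text.pdf.PRIndirectReference;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

PdfReader soPdfItext = null;
try {
    soPdfItext = new PdfReader(new FileInputStream("C:\\Temp\\sample.pdf"));
} catch (IOException e) {
    /* barf here */
}
/* Get the catalog */
PdfDictionary soCatalog = soPdfItext.getCatalog();
/* Get the object referring to the open action (assumes it is stored as an indirect reference) */
PRIndirectReference soOpenActionReference = (PRIndirectReference) soCatalog.get(PdfName.OPENACTION);
/* Get the actual object containing the open action */
PdfObject soOpenActionObject = soPdfItext.getPdfObject(soOpenActionReference.getNumber());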

Now what? There is a class Document that contains a method getPageNumber(), but I'm not sure if a) it's relevant to what I want to do and b) if it is relevant, how to implement.

Bruno Lowagie
9-Pin
  • Have you tried using the getPageNumber method? What values does it return? – JamesB Jun 15 '15 at 21:43
  • I'm having a hard time finding this class `Document`. Could you provide a link? (It doesn't seem to be either of the default Java API `Document` classes and I don't see a `Document` class on the iText API either.) – River Jun 15 '15 at 21:47
  • Usually (but not always), the `/OpenAction` refers to another object: a *destination*. The destination object contains the page *index* and one of the available `/Fit*` options. If you want to receive an actual *page number*, you will need to check `/PageLabels` to convert the zero-based index. This can be found in the PDF Reference, but you're going to have to translate it into iText functions... – Jongware Jun 15 '15 at 22:17
  • To River: I found the documentation for the Document class [here](http://api.itextpdf.com/itext/com/itextpdf/text/Document.html). To Jongware, thank you for the hint. Will give it a try tomorrow. – 9-Pin Jun 16 '15 at 00:08

1 Answer


There are no such things as page numbers in a PDF. Pages are part of a page tree. This page tree consists of /Pages elements (the branches of the tree) and /Page elements (the leaves of the tree). The page index is calculated by traversing the different branches and leaves of the tree. Optionally, a PDF also defines /PageLabels. If you know the page index and if you have the definition of the page labels, you can derive the page number.
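
To make the last step concrete, here is a minimal, library-independent sketch of turning a zero-based page index into a label. It is not iText code: `PageLabelSketch`, `Range` and `labelFor` are hypothetical names, the `TreeMap` stands in for the /PageLabels number tree, and only decimal numbering with an optional prefix (/P) and start value (/St) is modelled. The fallback for an uncovered page is also an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

/** Library-independent model of /PageLabels; hypothetical, not an iText class. */
final class PageLabelSketch {

    /** One labelling range: a prefix (/P) and the first number (/St, default 1). */
    static final class Range {
        final String prefix;
        final int start;

        Range(String prefix, int start) {
            this.prefix = prefix;
            this.start = start;
        }
    }

    /** Maps a zero-based page index to its label (decimal style only). */
    static String labelFor(TreeMap<Integer, Range> ranges, int pageIndex) {
        // The applicable range is the one with the greatest start index <= pageIndex.
        Map.Entry<Integer, Range> entry = ranges.floorEntry(pageIndex);
        if (entry == null) {
            // Assumption of this sketch: fall back to the one-based index.
            return Integer.toString(pageIndex + 1);
        }
        Range range = entry.getValue();
        int offset = pageIndex - entry.getKey();
        return range.prefix + (range.start + offset);
    }

    public static void main(String[] args) {
        TreeMap<Integer, Range> ranges = new TreeMap<>();
        ranges.put(0, new Range("A-", 1)); // page indices 0..2 are A-1, A-2, A-3
        ranges.put(3, new Range("", 1));   // numbering restarts at page index 3
        System.out.println(labelFor(ranges, 2)); // prints A-3
        System.out.println(labelFor(ranges, 5)); // prints 3
    }
}
```

Roman and alphabetic styles (/S values other than /D) would need an extra formatting step that this sketch leaves out.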

You are extracting a PdfObject that represents an open action. It can be a PdfDictionary or a PdfArray.

PdfDictionary

If the PdfObject is an instance of a PdfDictionary, then you need to look at the /S item of this dictionary to find out which type of action will be triggered.

  • That action could be some JavaScript. If that JavaScript contains an action that jumps to a specific page, there might be a page number in that method.
  • That action could be a GoTo action, in which case you need to look at the /D entry for the destination (*).

There are 20 possible types of actions, and actions can be chained, so it's up to you to loop through the action chain and to examine every possible action.
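
The loop over an action chain could look roughly like the sketch below. It is a plain-Java model, not iText code: `ActionChainSketch` and `Action` are hypothetical names, each action has at most one follow-up action (a real /Next entry may also be an array), the destination is reduced to the object number of the target page, and the chain is assumed to contain no cycles.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of an action chain; hypothetical, not an iText class. */
final class ActionChainSketch {

    /** Simplified action: its /S type, an optional destination page, an optional next action. */
    static final class Action {
        final String type;                   // value of /S, e.g. "GoTo" or "JavaScript"
        final Integer destinationPageObject; // object number from /D, or null if absent
        final Action next;                   // simplified /Next, or null at the end

        Action(String type, Integer destinationPageObject, Action next) {
            this.type = type;
            this.destinationPageObject = destinationPageObject;
            this.next = next;
        }
    }

    /** Collects the target page object numbers of every GoTo action in the chain. */
    static List<Integer> goToTargets(Action first) {
        List<Integer> targets = new ArrayList<>();
        for (Action action = first; action != null; action = action.next) {
            if ("GoTo".equals(action.type) && action.destinationPageObject != null) {
                targets.add(action.destinationPageObject);
            }
            // Other action types (JavaScript, URI, ...) would need their own handling.
        }
        return targets;
    }

    public static void main(String[] args) {
        Action goTo = new Action("GoTo", 8, null);
        Action script = new Action("JavaScript", null, goTo);
        System.out.println(goToTargets(script)); // prints [8]
    }
}
```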

This is an example:

/OpenAction<</D[8 0 R/Fit]/S/GoTo>>

The << and >> indicate that the open action is described using a dictionary. The /S shows that you have a /GoTo action and /D describes the destination.

PdfArray

If the PdfObject is an instance of a PdfArray, then this array is a destination (*).

This is an example:

/OpenAction[6 0 R/XYZ 0 806 0]

Destination

A destination is an array that consists of a variable number of elements. These are some examples:

[8 0 R/Fit]
[6 0 R/XYZ 0 806 0]

The first example is an array with two elements: 8 0 R and /Fit. The second example is an array with five elements: 6 0 R, /XYZ, 0, 806 and 0. You need the first element. It doesn't give you the page number (because there is no such thing as page numbers), but it gives you a reference to the /Page object. Based on that reference, you can deduce the page number by looping over the page tree and comparing the object number of a specific page with the object number in the destination.
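
As an illustration of that lookup, here is a minimal sketch that walks a page tree depth-first and returns the zero-based index of the page whose object number matches the one in the destination. It models the tree with plain Java objects rather than iText classes; `PageTreeSketch`, `Node` and `pageIndexOf` are hypothetical names, and the object numbers in `main` are made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java model of a PDF page tree; hypothetical, not an iText class. */
final class PageTreeSketch {

    /** A /Pages branch (with kids) or a /Page leaf (without). */
    static final class Node {
        final int objectNumber;
        final boolean isPage;
        final List<Node> kids;

        private Node(int objectNumber, boolean isPage, List<Node> kids) {
            this.objectNumber = objectNumber;
            this.isPage = isPage;
            this.kids = kids;
        }

        static Node page(int objectNumber) {
            return new Node(objectNumber, true, Collections.<Node>emptyList());
        }

        static Node pages(int objectNumber, Node... kids) {
            return new Node(objectNumber, false, Arrays.asList(kids));
        }
    }

    /** Returns the zero-based index of the page with the given object number, or -1. */
    static int pageIndexOf(Node root, int targetObjectNumber) {
        return search(root, targetObjectNumber, new int[] {0});
    }

    private static int search(Node node, int target, int[] leavesSeen) {
        if (node.isPage) {
            if (node.objectNumber == target) {
                return leavesSeen[0];
            }
            leavesSeen[0]++; // count this leaf and keep looking
            return -1;
        }
        for (Node kid : node.kids) { // /Kids order defines page order
            int found = search(kid, target, leavesSeen);
            if (found >= 0) {
                return found;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Node root = Node.pages(1, Node.pages(2, Node.page(6), Node.page(8)), Node.page(10));
        System.out.println(pageIndexOf(root, 8));  // prints 1
        System.out.println(pageIndexOf(root, 10)); // prints 2
    }
}
```

With iText 5 it may not be necessary to walk the tree by hand: `PdfReader` offers `getNumberOfPages()` and `getPageOrigRef(int)`, so comparing `getNumber()` of each page reference with the number in the destination should give the same result, though that is worth checking against the version in use.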

P.S. the other elements are explained in my answer to this question: iTextPDF hyperlink not linking to the right place

Bruno Lowagie
  • Yes, the page *index* is calculated and based on this index and the page labels, you can get the page number. I'll update my answer. – Bruno Lowagie Jun 16 '15 at 08:38
  • Bruno, thank you for that response. I see the tree root in the catalog, and all the `/Pages` and `/Page` leaves. I spent all morning traversing a few PDFs' trees, landing on the required object, and determining the page number. It works every time! – 9-Pin Jun 16 '15 at 22:14