Page count of PDF with Java

Question

At the moment I am using iText to read the page count of a PDF. This takes quite long because the library seems to scan the whole file.

Is the page information somewhere in the header of the PDF, or is a full file scan needed?

It's more of a general question than a code question. I will stay with iText if it does the best it can. But loading the complete file seems useless. — hans sausage, May 17 '11 at 06:33
http://stackoverflow.com/a/4135059/489364 This answer uses the Apache PDFBox Java library. — kommradHomer, Jun 18 '13 at 06:44

score 26 · Answer 1 · edited Jan 01 '14 at 22:03

That's correct. iText parses quite a bit of a PDF when it is opened (it doesn't read the contents of stream objects, but that's about it)...

UNLESS you use the PdfReader(RandomAccessFileOrArray) constructor, in which case it will only read the xrefs (mostly required), but not parse anything until you start requesting specific objects (directly or via various calls).

The first PDF program I ever wrote did exactly this. It opened a PDF and, doing the bare minimum amount of work necessary, read the number of pages. It didn't even parse the xrefs it didn't have to. Haven't thought about that program in years...

So while not perfectly efficient, it'll be vastly more efficient to use a RandomAccessFileOrArray:

int efficientPDFPageCount(String path) throws IOException {
  // Random access lets iText read the xrefs without parsing everything up front.
  RandomAccessFileOrArray file = new RandomAccessFileOrArray(path, false, true);
  PdfReader reader = new PdfReader(file);
  int ret = reader.getNumberOfPages();
  reader.close();
  return ret;
}
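
As a rough illustration of why a partial read can be cheap: a PDF ends with a `startxref` keyword followed by the byte offset of the cross-reference section, so a reader can seek to the tail of the file instead of scanning from the start. The sketch below is plain Java and is not iText's implementation; the class name, the method names and the 1024-byte tail window are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class StartXrefLocator {

    /** Reads only the tail of the file and returns the startxref offset, or -1 if not found. */
    static long findStartXref(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            // Assumed window: the last 1024 bytes (or the whole file if it is shorter).
            int tailLength = (int) Math.min(1024, raf.length());
            byte[] tail = new byte[tailLength];
            raf.seek(raf.length() - tailLength);
            raf.readFully(tail);
            return parseStartXref(tail);
        }
    }

    /** Parses the number that follows the last "startxref" keyword in the given bytes. */
    static long parseStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost or merged.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        // The offset is the first whitespace-delimited token after the keyword.
        String rest = text.substring(keyword + "startxref".length()).trim();
        String[] tokens = rest.split("\\s+");
        try {
            return Long.parseLong(tokens[0]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

From that offset a parser can jump to the cross-reference section and the trailer, and from there to the catalog and the page tree, without touching the page content.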

Update:

The iText API underwent a small overhaul. Now (in version 5.4.x) the correct way to use it is to pass through java.io.RandomAccessFile:

int efficientPDFPageCount(File file) throws IOException {
     RandomAccessFile raf = new RandomAccessFile(file, "r");
     RandomAccessFileOrArray pdfFile = new RandomAccessFileOrArray(
          new RandomAccessSourceFactory().createSource(raf));
     // An empty owner password is passed because the file is assumed to be unencrypted.
     PdfReader reader = new PdfReader(pdfFile, new byte[0]);
     int pages = reader.getNumberOfPages();
     reader.close();
     return pages;
}

I ran a quick test locally where I read a bunch of PDF files using `Apache PDFBox (2.0.26)` vs `IText (5.5.13.3)` with the `byte[]` constructor for `PdfReader` vs same `IText` with the code shown above using a `RandomAccessFile`. For some reason, the code above seems to come out as slowest. `IText` with `byte[]` constructor finished in `767ms`, followed by `PDFBox` in `1485ms` and the code above in `8462ms`. Not sure if I'm missing something, or if something changed in `IText` in the meantime, but using the `byte[]` constructor seems to be much faster. — SND, Jul 07 '22 at 06:20

score 4 · Answer 2 · aioobe · edited May 17 '11 at 06:40


Lars Vogel uses the following code:

PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();

I'd be surprised if the implementation of getNumberOfPages is slower than any other solution.

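Section F.3.3 of the PDF specification describes the linearization parameter dictionary, which appears near the start of linearized ("fast web view") PDFs only. It has a field called N, described as follows: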

N     integer (Required)      The number of pages in the document.
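
A minimal sketch of reading that field without a PDF library, assuming the file is linearized. The specification requires the linearization parameter dictionary to sit within the first 1024 bytes, so only the head of the file is read. The class and method names are hypothetical, and the regex matching is a simplification of real PDF tokenizing. For non-linearized files the method returns -1, and a real reader would then fall back to the page tree.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinearizedPageCount {
    private static final Pattern LINEARIZED = Pattern.compile("/Linearized\\s+[0-9.]+");
    private static final Pattern N_ENTRY = Pattern.compile("/N\\s+(\\d{1,9})(?!\\d)");

    /** Reads at most the first 1024 bytes of the file and looks for /N there. */
    static int pageCountFromFile(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] head = new byte[(int) Math.min(1024, raf.length())];
            raf.readFully(head);
            return pageCountFromHead(head);
        }
    }

    /** Returns the /N value of the linearization dictionary, or -1 if none is found. */
    static int pageCountFromHead(byte[] head) {
        int length = Math.min(head.length, 1024);
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);

        Matcher linearized = LINEARIZED.matcher(text);
        if (!linearized.find()) {
            return -1; // not a linearized file
        }
        // Restrict the search to the dictionary that contains /Linearized.
        int start = text.lastIndexOf("<<", linearized.start());
        int end = text.indexOf(">>", linearized.end());
        if (start < 0 || end < 0) {
            return -1;
        }
        Matcher n = N_ENTRY.matcher(text.substring(start, end));
        return n.find() ? Integer.parseInt(n.group(1)) : -1;
    }
}
```

One caveat: if a linearized file was later modified by an incremental update, this dictionary may no longer describe it. A careful reader would compare the dictionary's /L entry (the file length) with the actual file size before trusting /N.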

edited May 17 '11 at 06:40

answered May 17 '11 at 06:28

aioobe

Yes, I know, that's my code. But does this piece of code have to scan the full PDF, or would there be an easier way if you only read the header of the PDF? – hans sausage May 17 '11 at 06:32

score 3 · Answer 3 · answered May 17 '11 at 07:28

You just need to read the page tree (Catalog, Pages, Kids) and count the Page entries.

answered May 17 '11 at 07:28

mark stephens

Actually, you just need the root Pages object and get its /Count. – Mark Storer May 17 '11 at 20:39
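
To make the difference between the two suggestions concrete, here is a small sketch on a hypothetical in-memory model of a page tree. It does not parse a real PDF. `Node`, `page()` and `pages()` are invented for illustration: a /Pages node holds kids plus a /Count of all leaf pages below it, and a /Page node is a leaf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageTreeSketch {

    /** Hypothetical stand-in for a /Pages (intermediate) or /Page (leaf) object. */
    static final class Node {
        final boolean leaf;
        final List<Node> kids;
        // For /Pages nodes this models /Count; leaves get 1 only to simplify summing.
        final int count;

        private Node(boolean leaf, List<Node> kids, int count) {
            this.leaf = leaf;
            this.kids = kids;
            this.count = count;
        }
    }

    static Node page() {
        return new Node(true, new ArrayList<>(), 1);
    }

    static Node pages(Node... kids) {
        int total = 0;
        for (Node kid : kids) {
            total += kid.count;
        }
        return new Node(false, Arrays.asList(kids), total);
    }

    /** Walks the whole tree and counts leaf pages: visits every node. */
    static int countByWalking(Node node) {
        if (node.leaf) {
            return 1;
        }
        int total = 0;
        for (Node kid : node.kids) {
            total += countByWalking(kid);
        }
        return total;
    }

    /** Reads the root's /Count: a single lookup, however large the tree is. */
    static int countFromRoot(Node root) {
        return root.count;
    }
}
```

In a well-formed file both approaches give the same number. Reading the root /Count only requires resolving the catalog and the root Pages object, which is why it is the cheaper option.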

score 1 · Answer 4 · answered Nov 10 '20 at 12:32

In iText version 5.5.13 the method below will give you the page count without scanning the whole file. It will not read the full file content into memory.

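int efficientPDFPageCount(String filePath) throws IOException {
     // Third argument 'true' requests a partial read; the empty array is the owner password.
     PdfReader reader = new PdfReader(filePath, new byte[0], true);
     int pages = reader.getNumberOfPages();
     reader.close();
     return pages;
}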

answered Nov 10 '20 at 12:32

Deividas Duda

You can also use the `PdfReader(ReaderProperties properties, final String filename)` constructor and set partial read to `true`. – dvlcube Aug 18 '23 at 04:52

score 0 · Answer 5 · edited May 17 '11 at 07:32

PdfReader document = new PdfReader(new FileInputStream(new File("filename")));  
int noPages = document.getNumberOfPages();

edited May 17 '11 at 07:32

Joachim Sauer

answered May 17 '11 at 06:29

TKV

score 0 · Answer 6 · edited May 17 '11 at 07:33

PdfReader document = new PdfReader(new FileInputStream(new File("filename")));   
int noPages = document.getNumberOfPages();

The above is the process for counting the PDF pages.

edited May 17 '11 at 07:33

Joachim Sauer

answered May 17 '11 at 06:30

developer

I know it's been two years but... iText @Jaydev – Taslim Oseni Jul 17 '18 at 00:06
