Get the number of pages in a PDF document

Question

This question is for referencing and comparing. The solution is the accepted answer below.

I have spent many hours searching for a fast and easy, but above all accurate, way to get the number of pages in a PDF document. I work for a graphic printing and reproduction company that handles a lot of PDFs, and the number of pages in a document must be known precisely before it is processed. The PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick takes a lot of effort to install and Apache has to be restarted. When I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression:

This opens the PDF file as a stream and searches it for a string that contains the page count or something similar.

function countPagesByRegex($f)
{
    $stream = fopen($f, "rb");
    if (!$stream)
        return 0;

    $content = fread($stream, filesize($f));
    fclose($stream);
    if (!$content)
        return 0;

    $count = 0;
    // Regular Expressions found by Googling (all linked to SO answers):
    $regex  = "/\/Count\s+(\d+)/";
    $regex2 = "/\/Page\W*(\d+)/";
    $regex3 = "/\/N\s+(\d+)/";

    // $matches[1] holds the captured numbers; take the largest one
    if (preg_match_all($regex, $content, $matches))
        $count = max(array_map("intval", $matches[1]));

    return $count;
}

echo countPagesByRegex("test1.pdf");

/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents contain the /Count parameter, so most of the time it returns nothing.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't return the number of pages; the match mostly contains some other data.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as a document can contain multiple /N values, most if not all of which are not the page count.

So, what does work reliably and accurately?

See the answer below

score 111 · Accepted Answer · edited Jan 13 '22 at 19:32

A simple command line executable called: pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

I haven't seen a PDF document for which it returned a wrong page count (yet). It is also really fast: even with big documents of 200+ MB the response time is just a few seconds or less.

There is an easy way of extracting the pagecount from the output, here in PHP:

// Make a function for convenience 
function getPDFPages($document)
{
    $cmd = "/path/to/pdfinfo";              // Linux
    // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows: use this line instead
    
    // Parse entire output
    // Surround with double quotes if file name has spaces
    exec("$cmd \"$document\"", $output);

    // Iterate through lines
    $pagecount = 0;
    foreach($output as $op)
    {
        // Extract the number
        if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }
    
    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13

Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.

I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.

Security Notice: Use escapeshellarg() on $document if the document name comes from user input or file uploads.
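To make the security notice concrete, here is a minimal sketch of the same idea with the file name escaped and the exit code checked. The function names are my own (hypothetical, not part of the answer), and it assumes pdfinfo is on the PATH; pass a full path as the second argument otherwise. The parsing is split into its own function so it can be tried without the binary.

```php
<?php
// Sketch only: helper names are illustrative, not from the original answer.

/**
 * Pull the page count out of pdfinfo's output lines.
 * Returns null when no "Pages:" line is present.
 */
function parsePdfinfoPages(array $lines): ?int
{
    foreach ($lines as $line) {
        // Anchored at the start so "Page size:" can never match
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return null;
}

/**
 * Run pdfinfo on a file whose name may come from an untrusted source.
 * $pdfinfo is assumed to be a trusted, hard-coded command or path.
 */
function getPdfPagesSafely(string $document, string $pdfinfo = 'pdfinfo'): ?int
{
    if (!is_readable($document)) {
        return null;
    }

    // escapeshellarg() quotes the file name so spaces and shell
    // metacharacters cannot break out of the argument
    $cmd = $pdfinfo . ' ' . escapeshellarg($document);
    exec($cmd, $output, $exitCode);

    return $exitCode === 0 ? parsePdfinfoPages($output) : null;
}
```

Returning null instead of 0 on failure lets the caller tell "could not read the file" apart from a real page count.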

edited Jan 13 '22 at 19:32 by Ali Nadalizadeh · answered Feb 01 '13 at 10:33 by Richard de Wit

+1 For taking time to help community and for sharing your knowledge gained as a result of this problem – Hanky Panky Apr 18 '13 at 12:04
As an alternative (if pdfinfo is not available on the server), you can also use **pdftk** with the `dump_data` option. You just have to make a few changes: set the $cmd variable to the pdftk binary, and change the preg_match pattern from `Pages` to `NumberOfPages`. And that's all :-) – Levure Feb 02 '15 at 12:56
@bouchon - It sure looks like something nice (the Server one that is, the rest has a GUI), although you have to install it. `pdfinfo` is a single binary file. Just download it and place it anywhere (e.g. next to your PHP script for easy access) – Richard de Wit Feb 02 '15 at 13:28
I made a Composer package for this. I hope it helps: https://github.com/howtomakeaturn/pdfinfo – 尤川豪 Mar 21 '15 at 09:46
@尤川豪 Wow, that's really impressive! I'm honored :) – Richard de Wit Mar 21 '15 at 19:34
This can be done right in the shell using the usual gnu tools: pdfinfo $PDF_File | grep Pages | awk '{print $2}' – Sunday Apr 20 '15 at 14:42
Any recommendation for Centos/Amazon Linux? pdfinfo and xpdf don't seem to be available for this OS. – Mark Kasson Nov 16 '15 at 20:27
I found poppler. sudo yum install poppler-utils and now I have pdfinfo in Amazon Linux on EC2 – Mark Kasson Nov 17 '15 at 13:14
`mutool info` from the mupdf-tools package is significantly faster than `pdfinfo`. – frabjous Aug 18 '16 at 17:53
pdfinfo is inside `sudo apt-get install poppler-utils` for the lazy ubuntu / lxss users – toster-cx May 16 '17 at 08:05
What if I need to get the number of pages of an online PDF without downloading it? – Jacquelyn.Marquardt Sep 04 '17 at 17:43
@f126ck well you need *some* way of reading the file. So I guess you could load it with `curl` or `wget` into a temp file and then execute that script on it – Richard de Wit Sep 04 '17 at 17:58
`pdfinfo` is definitively worth it to count pages against command line `gs` or npm packages like `pdf2json`. Its output is immediate compared to others that take several seconds for large files. Thank you! – maganap Jul 10 '20 at 16:43
I can only recommend `qpdf`, because qpdf returns json or pure values, and you save yourself the parsing. This is not only faster but also less code. – CodeBrauer Aug 26 '20 at 12:22
Be aware that `pdfinfo` (and all of `poppler-utils`) is licensed under `GPL`. https://www.glyphandcog.com/opensource.html – omnesia Jul 06 '21 at 14:48

score 31 · Answer 2 · edited Apr 11 '17 at 05:31

The simplest of all is using ImageMagick.

Here is some sample code:

$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();

Otherwise, you can also use PDF libraries for PHP such as MPDF or TCPDF.

edited Apr 11 '17 at 05:31 by Richard de Wit · answered Dec 30 '15 at 15:29 by Kuldeep Dangi

Brilliant, thank you, just something to note though, not all PHP installations have the imagick mod installed... you may need to check if that class exists first. – Craig Wayne Aug 19 '18 at 23:00
I found this worked at first, but then some PDFs gave a 'Failed to read the file' error; presumably they were not compatible. I suggest using the library noted above: https://github.com/howtomakeaturn/pdfinfo – pgee70 Dec 11 '18 at 09:21
I'm getting an error like this: Imagick::pingImage not implemented – Mukhilan Elangovan Mar 07 '23 at 09:36

score 10 · Answer 3 · answered Aug 19 '19 at 19:26

You can use qpdf like below. If a file file_name.pdf has 100 pages,

$ qpdf --show-npages file_name.pdf
100
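Called from PHP, this could look roughly like the sketch below. It assumes qpdf is on the PATH, and the function names are made up for illustration. Because the command prints just the number, the only "parsing" needed is a check that the output really is a plain integer; anything else (empty output, an error message) is treated as a failure.

```php
<?php
// Sketch only: helper names are illustrative, not part of qpdf or the answer.

/**
 * Accept the output of `qpdf --show-npages` only if it is a plain number.
 */
function parseQpdfPages(string $raw): ?int
{
    $raw = trim($raw);
    return preg_match('/^\d+$/', $raw) === 1 ? (int) $raw : null;
}

/**
 * Ask qpdf for the page count of $document.
 * $qpdf is assumed to be a trusted, hard-coded command or path.
 */
function getPdfPagesWithQpdf(string $document, string $qpdf = 'qpdf'): ?int
{
    $raw = shell_exec($qpdf . ' --show-npages ' . escapeshellarg($document));

    // shell_exec() yields a non-string value when there is no output or the call fails
    return is_string($raw) ? parseQpdfPages($raw) : null;
}
```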

answered Aug 19 '19 at 19:26 by SuperNova

+1 One of the few options that is not licensed under GPL: http://qpdf.sourceforge.net/ – omnesia Jul 06 '21 at 15:05

score 5 · Answer 4 · answered Oct 27 '20 at 13:38

Here is a simple example to get the number of pages in a PDF with PHP.

<?php

function count_pdf_pages($pdfname) {
  $pdftext = file_get_contents($pdfname);
  $num = preg_match_all("/\/Page\W/", $pdftext, $dummy);

  return $num;
}

$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);

echo $pages;

?>

answered Oct 27 '20 at 13:38 by Purvik Dhorajiya

In case of PDFs without incremental updates this may often work. – mkl Oct 27 '20 at 14:02
Can confirm this works on many occasions. But recently I ran into problems with PDFs of more than 150 pages. E.g. for a 179-page PDF, this counts 181. Other than that, simple and useful. – Skywarth Apr 08 '21 at 21:14
The reason for the extra pages is likely pdfmarks for Bookmarks. See the bookmarks section of the [Adobe pdfmarks reference](https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/pdfs/acrobatsdk_pdfmark.pdf) – Kurt Friars Aug 18 '21 at 10:15
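To illustrate mkl's remark about incremental updates, here is a small sketch that runs the same regex on hand-written fragments instead of a real file. The fragments are simplified and not valid PDFs, and the function name is my own. An incremental update appends a new version of a changed object while the old version stays in the file, so a page object can be counted twice. This is only one possible cause of over-counting, not necessarily the one Skywarth ran into.

```php
<?php
// Sketch only: same regex as the answer, applied to a string instead of a file.
function countPageMarkers(string $pdftext): int
{
    return (int) preg_match_all('/\/Page\W/', $pdftext);
}

// Hand-written fragment (not a valid PDF): a page tree with two pages.
// "/Pages" is not counted because "s" is a word character.
$original = "1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >> endobj\n"
          . "2 0 obj << /Type /Page /Parent 1 0 R >> endobj\n"
          . "3 0 obj << /Type /Page /Parent 1 0 R >> endobj\n";

// An incremental update appends a new version of object 2;
// the superseded version is still present earlier in the file.
$updated = $original
         . "2 0 obj << /Type /Page /Parent 1 0 R /Rotate 90 >> endobj\n";

echo countPageMarkers($original), "\n"; // prints 2
echo countPageMarkers($updated), "\n";  // prints 3, although there are still 2 pages
```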

score 3 · Answer 5 · answered Sep 25 '14 at 05:10

If you can't install any additional packages, you can use this simple one-liner:

foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)

answered Sep 25 '14 at 05:10 by Muad'Dib

Could you explain what it does and needs? What is `sed -n`, `sort -rn` and `head -n`? Also it seems that you are looking for `/Count `, which I showed in my question, doesn't work. – Richard de Wit Sep 25 '14 at 06:36
strings - grabs all strings from PDF binary. Sed - matches values found in 'Count' strings. The (-n) when used in conjunction with (p)rint will avoid repetition of line printing. sort - will take found 'Count' values and sort in (-r)everse order, handling each as a (n)umbers (descending). head - will print first -n line numbers. In this case, 1 (default is 10), which will be the highest 'Count' value. I haven't run across any PDFs that haven't had a Count value. Just luck I guess. Have you verified that your regex is working properly outside of preg_match_all? – Muad'Dib Sep 26 '14 at 02:14
Thank you for your explanation. Yes I have. I have tested a lot of PDFs for this (as I'm mainly working with PDFs, like 100/day) and approx. 40% of all PDFs actually have the `Count` value. I've also tested this by simply writing the stream to a textfile and search for it (or even parts of it) manually. On some PDFs I've found it, but on most PDFs I didn't. – Richard de Wit Sep 26 '14 at 05:42

score 2 · Answer 6 · answered Jun 01 '17 at 21:40

This seems to work pretty well, without the need for special packages or parsing command output.

<?php

$target_pdf = "multi-page-test.pdf";
// identify prints one line per page, so the number of lines is the page count
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
$pages = count($output);

answered Jun 01 '17 at 21:40 by dhildreth

Running this command returns the following: `identify-im6.q16: attempt to perform an operation not allowed by the security policy PDF @ error/constitute.c/IsCoderAuthorized/408.` ! – Meloman Jun 04 '21 at 09:10

score 2 · Answer 7 · answered May 19 '19 at 02:06

Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:

cpdf.exe -pages "my file.pdf"

answered May 19 '19 at 02:06 by Franck Dernoncourt

james-geldart · Answer 8 · 2020-02-06T17:05:10.077

I created a wrapper class for pdfinfo in case it's useful to anyone, based on Richard's answer:

/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 * 
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {

    const PDFINFO_CMD = 'pdfinfo';

    /**
     * keyed array to hold all the info
     */
    protected $info = array();

    /**
     * raw output in case we need it
     */
    public $raw = "";

    /**
     * Constructor
     * @param string $filePath - path to file
     */
    public function __construct($filePath) {
        // escapeshellarg() guards against spaces and shell metacharacters in the path
        exec(self::PDFINFO_CMD . ' ' . escapeshellarg($filePath), $output);

        //loop each line and split into key and value
        foreach($output as $line) {
            $colon = strpos($line, ':');
            if($colon) {
                $key = trim(substr($line, 0, $colon));
                $val = trim(substr($line, $colon + 1));

                //use strtolower to make case insensitive
                $this->info[strtolower($key)] = $val;
            }
        }

        //store the raw output
        $this->raw = implode("\n", $output);

    }

    /**
     * get a value
     * @param string $key - key name, case insensitive
     * @return string|null value, or null if the key is not present
     */
    public function getValue($key) {
        return $this->info[strtolower($key)] ?? null;
    }

    /**
     * list all the keys
     * @return array of key names
     */
    public function getAllKeys() {
        return array_keys($this->info);
    }

}

Thinking about this, for security a check that $filePath is valid (i.e. `if(!file_exists($filePath)) return false`) prior to calling exec() should probably be added — james-geldart, Aug 28 '20 at 09:30
james well said, I would use is_readable instead of file_exists though :) Thanks — Oliver M Grech, Apr 14 '22 at 11:04

score 1 · Answer 9 · answered Aug 22 '21 at 21:45

This simple one-liner seems to do the job well:

strings "$path_to_pdf" | grep Kids | grep -o R | wc -l

There is a block in the PDF file that details the number of pages in this funky string:

/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]

The number of 'R' characters is the number of pages

screenshot of terminal showing output from strings
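The same idea can be sketched in PHP on a string, counting whole "n g R" references inside the first /Kids array rather than every letter R on the line. The function name is made up for illustration. Note the assumption baked into this approach: it only gives the page count when all pages hang directly off a single /Kids array. PDF page trees can be nested, and in that case the first /Kids array lists intermediate nodes rather than pages.

```php
<?php
// Sketch only: assumes a flat page tree with one uncompressed /Kids array.

/**
 * Count the indirect references ("n g R") in the first /Kids array found.
 * Returns null when no /Kids array is present in the text.
 */
function countKidsReferences(string $pdftext): ?int
{
    if (preg_match('/\/Kids\s*\[([^\]]*)\]/', $pdftext, $m) !== 1) {
        return null;
    }

    // Each child is written as "<object number> <generation> R"
    return (int) preg_match_all('/\d+\s+\d+\s+R\b/', $m[1]);
}

$sample = "<< /Type /Pages /Count 3 /Kids [3 0 R 4 0 R 5 0 R] >>";
echo countKidsReferences($sample), "\n"; // prints 3
```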

score 1 · Answer 10 · answered Oct 11 '21 at 08:17

You can use mutool.

mutool show FILE.pdf trailer/Root/Pages/Count

mutool is part of the MuPDF software package.

edited Oct 24 '21 at 17:17 · answered Oct 11 '21 at 08:17 by lezambranof

score 0 · Answer 11 · answered Aug 13 '15 at 19:41

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

pdf.file.page.number <- function(fname) {
    # shQuote() protects file names containing spaces or shell metacharacters
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (FALSE) {
    pdf.file.page.number("a.pdf")
}

score 0 · Answer 12 · answered Nov 03 '15 at 00:17

Here is a Windows command script that uses Ghostscript to report the number of pages in a PDF file:

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem

:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=%
  set __datetime__=%__datetime__: =%
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3

:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  rem read the Ghostscript message back from the temp file written above
  FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A 
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%

:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end

:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end

:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end

:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end

:end
  exit /b

score 0 · Answer 13 · answered Jan 18 '17 at 22:03

The pdf_info() function from the R package pdftools provides information on the number of pages in a PDF.

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages

$pages
[1] 65

answered Jan 18 '17 at 22:03 by emeryville

Saran · Answer 14 · 2017-06-21T16:06:16.403

If you have access to a shell, the simplest (but not usable on 100% of PDFs) approach would be to use grep.

This should return just the number of pages:

grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

Example: https://regex101.com/r/BrUTKn/1

Switches description:

-m 1 is necessary as some files can have more than one match of the regex pattern (volunteer needed to replace this with a match-only-first regex solution)
-a is necessary to treat the binary file as text
-o to show only the match
-P to use Perl regular expression

Regex explanation:

starting "delimiter": (?<=\/N ) lookbehind of /N (nb. space character not seen here)
actual result: \d+ any number of digits
ending "delimiter": (?=\/) lookahead of /

Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
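For PHP users, the same lookbehind/lookahead pattern can be sketched with preg_match on the file contents. The function name and the sample string are made up for illustration. The /N entry matched here is the page count stored in the linearization dictionary, so this can only find something in linearized ("fast web view") PDFs. The sketch returns null when nothing matches and leaves the choice of fallback to the caller.

```php
<?php
// Sketch only: PHP equivalent of the grep call above, applied to a string.

/**
 * Return the number following "/N " and directly followed by "/",
 * or null when the pattern does not occur.
 */
function linearizedPageCount(string $pdftext): ?int
{
    // preg_match stops at the first match, like grep -m 1
    if (preg_match('/(?<=\/N )\d+(?=\/)/', $pdftext, $m) === 1) {
        return (int) $m[0];
    }
    return null;
}

// Hand-written linearization dictionary with illustrative values
$sample = "<</Linearized 1/L 17569259/O 15/E 98765/N 13/T 17568900/H [ 500 200 ]>>";
echo linearizedPageCount($sample), "\n"; // prints 13
```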

score 0 · Answer 15 · answered Dec 09 '22 at 13:03

I had problems with ImageMagick installations on a production server. After hours of attempts, I decided to get rid of IM and found another approach:

Install poppler-utils:

$ sudo apt install poppler-utils     [On Debian/Ubuntu & Mint]
$ sudo dnf install poppler-utils     [On RHEL/CentOS & Fedora]
$ sudo zypper install poppler-tools  [On OpenSUSE]  
$ sudo pacman -S poppler             [On Arch Linux]

Then execute it via the shell from your programming language (e.g. PHP):

shell_exec("pdfinfo " . escapeshellarg($filePath) . " | grep Pages | cut -f 2 -d':' | xargs");

score 0 · Answer 16 · answered Dec 09 '22 at 16:48

This works fine in ImageMagick.

convert image.pdf -format "%n\n" info: | head -n 1

answered Dec 09 '22 at 16:48 by fmw42

score -1 · Answer 17 · answered Dec 31 '21 at 09:09

The regex /\/Page\W/ is often suggested, but it doesn't work for me on several PDF files. So here is another regular expression that does work for me.

$pdf = file_get_contents($path_pdf);
// Match against the file contents ($pdf), not the file name
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);
