This question is for referencing and comparing. The solution is the accepted answer below.

I have spent many hours searching for a fast and easy, but above all accurate, way to get the number of pages in a PDF document. I work for a graphic printing and reproduction company that handles a lot of PDFs, and the number of pages in a document must be known precisely before it is processed. The PDF documents come from many different clients, so they aren't generated with the same application and don't all use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick requires a lot of installation work, Apache has to be restarted, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and always reported 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression

This opens the PDF file as a stream and searches for a string that contains the page count or something similar.

$f = "test1.pdf";
$stream = fopen($f, "rb");
if (!$stream)
    return 0;

$content = fread($stream, filesize($f));
fclose($stream);
if (!$content)
    return 0;

$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";

// $matches[1] holds the captured numbers; take the largest one
if (preg_match_all($regex, $content, $matches))
    $count = max(array_map('intval', $matches[1]));

return $count;
  • /\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the /Count parameter inside, so most of the time it doesn't return anything.
  • /\/Page\W*(\d+)/ (looks for /Page<number>) doesn't return the number of pages; the match mostly contains some other data.
  • /\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as a document can contain multiple /N values, most (if not all) of which are not the page count.

So, what does work reliably and accurately?

See the answer below

Richard de Wit

17 Answers

A simple command line executable called: pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

I haven't yet seen a PDF document for which it returned a wrong page count. It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.

There is an easy way of extracting the pagecount from the output, here in PHP:

// Make a function for convenience 
function getPDFPages($document)
{
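    // Path to the pdfinfo binary: keep the line that matches your system
    $cmd = "/path/to/pdfinfo";              // Linux
    // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows

    // Run pdfinfo and collect every output line in $output.
    // escapeshellarg() quotes the file name, so spaces and shell metacharacters are safe
    exec($cmd . " " . escapeshellarg($document), $output);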

    // Iterate through lines
    $pagecount = 0;
    foreach($output as $op)
    {
        // Extract the number
        if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }
    
    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13

Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.

I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.

Security Notice: Use escapeshellarg on $document if document name is being fed from user input or file uploads.
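A slightly more defensive variant of the function above could separate running the tool from parsing its output, and report failure explicitly instead of returning 0. This is only a sketch: the names `parsePdfinfoPageCount` and `getPdfPageCount` are made up for illustration, and it assumes a `pdfinfo` binary is reachable on the PATH.

```php
<?php

/**
 * Extract the page count from the lines printed by pdfinfo.
 * Returns null when no "Pages:" line is present.
 * (Hypothetical helper name, not part of pdfinfo or PHP.)
 */
function parsePdfinfoPageCount(array $lines): ?int
{
    foreach ($lines as $line) {
        // Anchor at the line start so only the "Pages:" field matches
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return null;
}

/**
 * Run pdfinfo on a file and return its page count, or null on failure.
 * Assumption: "pdfinfo" is on the PATH; pass another path otherwise.
 */
function getPdfPageCount(string $document, string $pdfinfo = 'pdfinfo'): ?int
{
    if (!is_readable($document)) {
        return null;
    }
    $output = [];
    $status = 0;
    // escapeshellarg() protects against spaces and shell metacharacters
    exec(escapeshellarg($pdfinfo) . ' ' . escapeshellarg($document) . ' 2>&1', $output, $status);

    // A non-zero exit status is treated as "could not process the file"
    return $status === 0 ? parsePdfinfoPageCount($output) : null;
}

// Usage with lines taken from the sample output shown earlier (no pdfinfo call needed):
$sample = ['Tagged:         yes', 'Pages:          13', 'Encrypted:      no'];
var_dump(parsePdfinfoPageCount($sample)); // int(13)
```

Returning null rather than 0 lets the caller tell "pdfinfo failed" apart from a genuine page count.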

Ali Nadalizadeh
Richard de Wit
  • +1 For taking time to help community and for sharing your knowledge gained as a result of this problem – Hanky Panky Apr 18 '13 at 12:04
  • As an alternative (if pdfinfo is not available on the server), you can also use **pdftk** with the `dump_data` option. You just have to do a few changes : - Set the $cmd variable to the pdftk binary - Change the preg_match call from `Pages` to `NumberOfPages` And that's all :-) – Levure Feb 02 '15 at 12:56
  • @bouchon - It sure looks like something nice (the Server one that is, the rest has a GUI), although you have to install it. `pdfinfo` is a single binary file. Just download it and place it anywhere (e.g. next to your PHP script for easy access) – Richard de Wit Feb 02 '15 at 13:28
  • I make a composer package for this. Wish it can help https://github.com/howtomakeaturn/pdfinfo – 尤川豪 Mar 21 '15 at 09:46
  • @尤川豪 Wow, that's really impressive! I'm honored :) – Richard de Wit Mar 21 '15 at 19:34
  • This can be done right in the shell using the usual gnu tools: pdfinfo $PDF_File | grep Pages | awk '{print $2}' – Sunday Apr 20 '15 at 14:42
  • Any recommendation for Centos/Amazon Linux? pdfinfo and xpdf don't seem to be available for this OS. – Mark Kasson Nov 16 '15 at 20:27
  • I found poppler. sudo yum install poppler-utils and now I have pdfinfo in Amazon Linux on EC2 – Mark Kasson Nov 17 '15 at 13:14
  • `mutool info` from the mupdf-tools package is significantly faster than `pdfinfo`. – frabjous Aug 18 '16 at 17:53
  • pdfinfo is inside `sudo apt-get install poppler-utils` for the lazy ubuntu / lxss users – toster-cx May 16 '17 at 08:05
  • What if I need to get the number of pages of an online PDF without downloading it? – Jacquelyn.Marquardt Sep 04 '17 at 17:43
  • @f126ck well you need *some* way of reading the file. So I guess you could load it with `curl` or `wget` into a temp file and then execute that script on it – Richard de Wit Sep 04 '17 at 17:58
  • `pdfinfo` is definitively worth it to count pages against command line `gs` or npm packages like `pdf2json`. Its output is immediate compared to others that take several seconds for large files. Thank you! – maganap Jul 10 '20 at 16:43
  • I can only recommend `qpdf`, because qpdf returns json or pure values, and you save yourself the parsing. This is not only faster but also less code. – CodeBrauer Aug 26 '20 at 12:22
  • Be aware that `pdfinfo` (and all of `poppler-utils`) is licensed under `GPL`. https://www.glyphandcog.com/opensource.html – omnesia Jul 06 '21 at 14:48

The simplest of all is to use ImageMagick.

Here is some sample code:

$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();

Otherwise, you can also use PDF libraries such as mPDF or TCPDF for PHP.
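Because not every PHP installation has the imagick extension, and some PDFs make it throw, a guarded version of the snippet may be safer. The following is a sketch only; `countPagesWithImagick` is a hypothetical helper name, and it assumes failures surface as an `ImagickException`.

```php
<?php

/**
 * Count pages with the Imagick extension, or return null when the
 * extension is missing or the file cannot be read.
 * (countPagesWithImagick is a hypothetical helper name.)
 */
function countPagesWithImagick(string $path): ?int
{
    // Not every PHP installation ships the imagick extension
    if (!class_exists('Imagick')) {
        return null;
    }
    try {
        $image = new Imagick();
        // pingImage() reads image metadata without decoding pixel data
        $image->pingImage($path);
        return $image->getNumberImages();
    } catch (ImagickException $e) {
        // e.g. unreadable file, or PDF handling blocked by ImageMagick's policy
        return null;
    }
}

$pages = countPagesWithImagick('myPdfFile.pdf');
echo $pages === null ? "Could not determine page count\n" : $pages . "\n";
```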

Richard de Wit
Kuldeep Dangi
  • Brilliant, thank you, just something to note though, not all PHP installations have the imagick mod installed... you may need to check if that class exists first. – Craig Wayne Aug 19 '18 at 23:00
  • i found this worked at first, but then some PDFs gave an error of 'Failed to read the file' presumably they were not compatible. Suggest using the library noted above: https://github.com/howtomakeaturn/pdfinfo – pgee70 Dec 11 '18 at 09:21
  • Imagick::pingImage not implemented im getting an error like this – Mukhilan Elangovan Mar 07 '23 at 09:36

You can use qpdf as shown below. If the file file_name.pdf has 100 pages:

$ qpdf --show-npages file_name.pdf
100
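Since `qpdf --show-npages` prints only the number, calling it from PHP needs almost no parsing. A minimal sketch, assuming `qpdf` is on the PATH; `parseQpdfNpages` and `qpdfPageCount` are hypothetical helper names:

```php
<?php

/**
 * Interpret the result of `qpdf --show-npages FILE`.
 * On success the first output line holds just the number.
 * (parseQpdfNpages and qpdfPageCount are hypothetical helper names.)
 */
function parseQpdfNpages(array $output, int $status): ?int
{
    $first = trim($output[0] ?? '');
    // Accept only a clean exit whose first line is purely digits.
    // Note: this also rejects runs where qpdf merely warned and exited non-zero.
    if ($status !== 0 || !ctype_digit($first)) {
        return null;
    }
    return (int) $first;
}

function qpdfPageCount(string $file): ?int
{
    $output = [];
    $status = 1;
    exec('qpdf --show-npages ' . escapeshellarg($file), $output, $status);
    return parseQpdfNpages($output, $status);
}

var_dump(parseQpdfNpages(['100'], 0)); // int(100)
```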
SuperNova

Here is a simple example to get the number of pages in PDF with PHP.

<?php

function count_pdf_pages($pdfname) {
  $pdftext = file_get_contents($pdfname);
  $num = preg_match_all("/\/Page\W/", $pdftext, $dummy);

  return $num;
}

$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);

echo $pages;

?>
Purvik Dhorajiya
  • In case of PDFs without incremental updates this may often work. – mkl Oct 27 '20 at 14:02
  • Can confirm this works on many occasions. But recently I ran into problems with PDFs consisting of more than 150 pages. E.g. for a 179-page PDF, this counts 181. Other than that, simple and useful. – Skywarth Apr 08 '21 at 21:14
  • The reason for the extra pages is likely pdfmarks for Bookmarks. See the bookmarks section of the [Adobe pdfmarks reference](https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/pdfs/acrobatsdk_pdfmark.pdf) – Kurt Friars Aug 18 '21 at 10:15
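The comments above mention incremental updates as one situation where this count goes wrong: an update appends a new copy of a changed page object, so the same page is matched more than once. One possible mitigation, sketched below under the assumption that the objects are visible in the raw bytes (not packed into compressed object streams), is to count distinct object numbers instead of raw matches. `countUniquePageObjects` is a hypothetical name, and pages deleted by an update would still be counted.

```php
<?php

/**
 * Count distinct object numbers whose body contains /Type /Page.
 * Works on raw bytes only, so pages stored inside compressed object
 * streams are invisible to it. (Hypothetical helper name.)
 */
function countUniquePageObjects(string $pdfBytes): int
{
    $pageObjects = [];
    // Splitting on "endobj" leaves roughly one object body per chunk
    foreach (explode('endobj', $pdfBytes) as $chunk) {
        // /Page not followed by a letter, so /Pages nodes are skipped
        if (!preg_match('/\/Type\s*\/Page(?![A-Za-z])/', $chunk)) {
            continue;
        }
        // The last "N G obj" header in the chunk names the object
        if (preg_match_all('/(\d+)\s+\d+\s+obj\b/', $chunk, $m) > 0) {
            $pageObjects[end($m[1])] = true; // a redefinition reuses the same key
        }
    }
    return count($pageObjects);
}

// Synthetic two-page file in which object 3 was rewritten by an incremental update
$pdf = "1 0 obj\n<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >>\nendobj\n"
     . "2 0 obj\n<< /Type /Page /Parent 1 0 R >>\nendobj\n"
     . "3 0 obj\n<< /Type /Page /Parent 1 0 R >>\nendobj\n"
     . "3 0 obj\n<< /Type /Page /Parent 1 0 R /Rotate 90 >>\nendobj\n";

echo preg_match_all('/\/Page\W/', $pdf), "\n"; // 3 (naive count)
echo countUniquePageObjects($pdf), "\n";       // 2
```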

If you can't install any additional packages, you can use this simple one-liner:

foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
Muad'Dib
  • Could you explain what it does and needs? What is `sed -n`, `sort -rn` and `head -n`? Also it seems that you are looking for `/Count `, which I showed in my question, doesn't work. – Richard de Wit Sep 25 '14 at 06:36
  • strings - grabs all strings from PDF binary. Sed - matches values found in 'Count' strings. The (-n) when used in conjunction with (p)rint will avoid repetition of line printing. sort - will take found 'Count' values and sort in (-r)everse order, handling each as a (n)umbers (descending). head - will print first -n line numbers. In this case, 1 (default is 10), which will be the highest 'Count' value. I haven't run across any PDFs that haven't had a Count value. Just luck I guess. Have you verified that your regex is working properly outside of preg_match_all? – Muad'Dib Sep 26 '14 at 02:14
  • Thank you for your explanation. Yes I have. I have tested a lot of PDFs for this (as I'm mainly working with PDFs, like 100/day) and approx. 40% of all PDFs actually have the `Count` value. I've also tested this by simply writing the stream to a textfile and search for it (or even parts of it) manually. On some PDFs I've found it, but on most PDFs I didn't. – Richard de Wit Sep 26 '14 at 05:42

This seems to work pretty well, without the need for special packages or parsing command output.

<?php                                                                               

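$target_pdf = "multi-page-test.pdf";
// escapeshellarg() keeps file names with spaces or shell metacharacters safe
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
// identify prints one line per page
$pages = count($output);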
dhildreth
  • Running this command returns the following for me: `identify-im6.q16: attempt to perform an operation not allowed by the security policy PDF @ error/constitute.c/IsCoderAuthorized/408.` – Meloman Jun 04 '21 at 09:10

Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:

cpdf.exe -pages "my file.pdf"
Franck Dernoncourt

I created a wrapper class for pdfinfo in case it's useful to anyone, based on Richard's answer:

/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 * 
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {

    const PDFINFO_CMD = 'pdfinfo';

    /**
     * keyed array to hold all the info
     */
    protected $info = array();

    /**
     * raw output in case we need it
     */
    public $raw = "";

    /**
     * Constructor
     * @param string $filePath - path to file
     */
    public function __construct($filePath) {
        // escapeshellarg() quotes the path so spaces and shell metacharacters are safe
        exec(self::PDFINFO_CMD . ' ' . escapeshellarg($filePath), $output);

        //loop each line and split into key and value
        foreach($output as $line) {
            $colon = strpos($line, ':');
            if($colon) {
                $key = trim(substr($line, 0, $colon));
                $val = trim(substr($line, $colon + 1));

                //use strtolower to make case insensitive
                $this->info[strtolower($key)] = $val;
            }
        }

        //store the raw output
        $this->raw = implode("\n", $output);

    }

    /**
     * get a value
     * @param string $key - key name, case insensitive
     * @return string|null value, or null if the key is not present
     */
    public function getValue($key) {
        return $this->info[strtolower($key)] ?? null;
    }

    /**
     * list all the keys
     * @returns array of key names
     */
    public function getAllKeys() {
        return array_keys($this->info);
    }

}
james-geldart
  • Thinking about this, for security a check that $filePath is valid (i.e. `if(!file_exists($filePath)) return false`) prior to calling exec() should probably be added – james-geldart Aug 28 '20 at 09:30
  • james well said, I would use is_readable instead of file_exists though :) Thanks – Oliver M Grech Apr 14 '22 at 11:04

This simple one-liner seems to do the job well:

strings "$path_to_pdf" | grep Kids | grep -o R | wc -l

There is a block in the PDF file that lists the pages in this funky string:

/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]

The number of 'R' characters is the number of pages
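The same idea can be sketched in PHP by counting complete references of the form "12 0 R" rather than single 'R' characters. This assumes a flat page tree whose /Kids array is readable in the raw bytes; in a nested page tree the first /Kids array may list intermediate /Pages nodes instead of pages, so the result would be wrong. `countKidsReferences` is a hypothetical helper name.

```php
<?php

/**
 * Count the indirect references ("12 0 R") in the first /Kids array.
 * Assumes a flat page tree whose /Kids array is visible in the raw bytes.
 * (countKidsReferences is a hypothetical helper name.)
 */
function countKidsReferences(string $pdfBytes): ?int
{
    // Capture everything between "/Kids [" and the closing "]"
    if (preg_match('/\/Kids\s*\[([^\]]*)\]/', $pdfBytes, $m) !== 1) {
        return null;
    }
    // Each child is written as "<object number> <generation> R"
    return (int) preg_match_all('/\d+\s+\d+\s+R\b/', $m[1]);
}

$sample = '/Kids [3 0 R 4 0 R 5 0 R 6 0 R]';
var_dump(countKidsReferences($sample)); // int(4)
```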

dryliketoast

You can use mutool.

mutool show FILE.pdf trailer/Root/Pages/Count

mutool is part of the MuPDF software package.

lezambranof

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

pdf.file.page.number <- function(fname) {
    # shQuote() keeps file names with spaces or shell metacharacters safe
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (F) {
    pdf.file.page.number("a.pdf")
}
Feiming Chen

Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file:

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem

:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=%
  set __datetime__=%__datetime__: =%
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3

:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%

:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end

:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end

:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end

:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end

:end
  exit /b

The pdf_info() function from the R package pdftools reports, among other things, the number of pages in a PDF.

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages

$pages
[1] 65
emeryville

If you have shell access, the simplest approach (though it does not work for every PDF) is to use grep.

This should return just the number of pages:

grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

Example: https://regex101.com/r/BrUTKn/1

Switches description:

  • -m 1 is necessary as some files can have more than one match of the regex pattern (volunteer needed to replace this with a regex that matches only the first occurrence)
  • -a is necessary to treat the binary file as text
  • -o to show only the match
  • -P to use Perl regular expression

Regex explanation:

  • starting "delimiter": (?<=\/N ) lookbehind of /N (nb. space character not seen here)
  • actual result: \d+ any number of digits
  • ending "delimiter": (?=\/) lookahead of /

Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
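For PHP users, the same lookbehind/lookahead pattern can be applied with `preg_match`, which returns only the first match and so needs no equivalent of `-m 1`. This is a sketch under the assumption that the file is linearized ("fast web view"), since that is where /N holds the page count; other /N entries can produce false matches. `pageCountFromLinearizedN` is a hypothetical helper name, and the sample header is made up for illustration.

```php
<?php

/**
 * Read /N from raw PDF bytes using the lookbehind/lookahead pattern.
 * Returns null when nothing matches instead of assuming a page count.
 * (pageCountFromLinearizedN is a hypothetical helper name.)
 */
function pageCountFromLinearizedN(string $pdfBytes): ?int
{
    // preg_match() stops at the first match in the data
    if (preg_match('/(?<=\/N )\d+(?=\/)/', $pdfBytes, $m) === 1) {
        return (int) $m[0];
    }
    return null;
}

// Made-up linearization dictionary, for illustration only
$header = "%PDF-1.6\n1 0 obj\n<</Linearized 1/L 17569259/O 15/E 4000/N 13/T 17568900/H [ 500 200]>>\nendobj\n";
var_dump(pageCountFromLinearizedN($header)); // int(13)
```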

Saran

I had problems installing ImageMagick on a production server. After hours of attempts, I decided to drop ImageMagick and found another approach:

Install poppler-utils:

$ sudo apt install poppler-utils     [On Debian/Ubuntu & Mint]
$ sudo dnf install poppler-utils     [On RHEL/CentOS & Fedora]
$ sudo zypper install poppler-tools  [On OpenSUSE]  
$ sudo pacman -S poppler             [On Arch Linux]

Then run it through the shell from your programming language (e.g. PHP):

shell_exec("pdfinfo " . escapeshellarg($filePath) . " | grep Pages | cut -f 2 -d':' | xargs");
Alex Rsk

This works fine in ImageMagick.

convert image.pdf -format "%n\n" info: | head -n 1

fmw42

You often see the regex /\/Page\W/ suggested, but it doesn't work for several of my PDF files. Here is another regular expression that works for me:

function count_pdf_pages($path_pdf) {
    $pdf = file_get_contents($path_pdf);
    // Match against the file contents ($pdf), not the path
    return preg_match_all("/[<>][\r\n]*\/Type\s*\/Page\W/", $pdf, $dummy);
}
hulky