How do I read a particular page (given a page number) from a PDF document using PDFBox?
22
-
Can you be more specific about what you mean by "read"? – Adrian Petrescu Jul 27 '11 at 05:30
-
@Adrian: Say, I want page #2 as a `PDPage` object. – missingfaktor Jul 27 '11 at 05:34
6 Answers
32
This should work, where `doc` is the `PDDocumentCatalog` returned by `document.getDocumentCatalog()`:
PDPage firstPage = (PDPage) doc.getAllPages().get(0);
as seen in the BookMark section of the tutorial.
Update 2015, Version 2.0.0 SNAPSHOT
It seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage page = document.getPage(0);
The getAllPages method has been renamed getPages, so the equivalent call, which no longer needs a cast, is:
PDPage page = document.getPages().get(0);
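Note that the index passed to these calls is zero-based, so page #2 from the question is index 1. A minimal sketch of that conversion, using only the standard library: `PageIndex` and `toIndex` are hypothetical names for illustration (not part of PDFBox), and `pageCount` stands in for the value you would get from `PDDocument.getNumberOfPages()`.

```java
final class PageIndex {
    private PageIndex() {}

    /**
     * Converts a 1-based page number into the 0-based index expected by
     * list-style accessors, rejecting numbers outside 1..pageCount.
     */
    static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }
}
```

With PDFBox 2.0 this could be used as `document.getPage(PageIndex.toIndex(2, document.getNumberOfPages()))`.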

Nicolas Modrzyk
-
What is the type of `doc` here? The `PDDocument` class doesn't seem to have a `getAllPages` method. – missingfaktor Jul 27 '11 at 05:36
-
@missingfaktor `doc` is a [PDDocumentCatalog](http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/PDDocumentCatalog.html) object – Jacob Jul 27 '11 at 05:45
-
For those coming along here later: http://pdfbox.apache.org/cookbook/textextraction.html Basically -- use PDFTextStripper, not PDPage as PDPage seems to be more about displaying a page on screen than getting text http://stackoverflow.com/questions/13563482/read-text-from-a-particular-page-using-pdfbox#answer-15689797 – Don Cheadle Oct 09 '14 at 16:06
-
For PDFBox 2.0 I simply used: pdDoc.getPage(pageNumber); where pdDoc is of type PDDocument. – jcomouth Sep 17 '15 at 08:51
-
For PDFBox 1.8.10 there seems to be no method getAllPages() for the PDDocument type. Unfortunately, the link does not work any more. – Uli Köhler Oct 09 '15 at 22:10
21
// Uses the PDFBox library, available from http://pdfbox.apache.org/
// Writes a specific page of a PDF document to a new PDF file
// Read in the source PDF document
PDDocument pdDoc = PDDocument.load(file);
// Create a new PDF document
PDDocument document = null;
// Add page "i" (a zero-based index into the page list), then save the new PDF document
try {
    document = new PDDocument();
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
    document.save("file path" + "new document title" + ".pdf");
    document.close();
    // Close the source document only after the new one has been saved
    pdDoc.close();
} catch (Exception e) {
    // Report the failure instead of silently swallowing it
    e.printStackTrace();
}

Raymond C Borges Hink
4
Thought I would add my answer here as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually and look for a keyword; if that keyword appeared, I would then do something with that page (i.e. copy or ignore it).
I've tried to simplify the code and replace common variables etc. in my answer:
public void extractImages() throws Exception {
    try {
        String destinationDir = "OUTPUT DIR GOES HERE";
        // Load the PDF
        String inputPdf = "INPUT PDF DIR GOES HERE";
        PDDocument document = PDDocument.load(inputPdf);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        // Declare output fileName
        String fileName = "output.pdf";
        // Create output file
        PDDocument newDocument = new PDDocument();
        // Create PDFTextStripper - used for searching the page text
        PDFTextStripper textStripper = new PDFTextStripper();
        // Declare "pages" and "found" variables
        String pages = null;
        boolean found = false;
        // Loop through each page and search for "SEARCH STRING". If it doesn't appear,
        // i.e. this is the image page, then copy the page into the new output.pdf.
        // PDFTextStripper page numbers are 1-based, while the page list is 0-based.
        for (int i = 1; i <= list.size(); i++) {
            // Set textStripper to search one page at a time
            textStripper.setStartPage(i);
            textStripper.setEndPage(i);
            // Fetch page text and insert into "pages" string
            pages = textStripper.getText(document);
            found = pages.contains("SEARCH STRING");
            // If nothing is found, then copy the page across to the new output PDF file
            if (found == false) {
                PDPage returnPage = list.get(i - 1);
                System.out.println("page returned is: " + returnPage);
                System.out.println("Copy page");
                newDocument.importPage(returnPage);
            }
        }
        newDocument.save(destinationDir + fileName);
        System.out.println(fileName + " saved");
        // Close both documents once the output has been saved
        newDocument.close();
        document.close();
    }
    catch (Exception e) {
        e.printStackTrace();
        System.out.println("catch extract image");
    }
}
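The fiddly part of this approach is keeping the 1-based page numbers used by the text stripper in step with the 0-based page list. A minimal sketch of just that selection logic, with plain strings standing in for the per-page text that `PDFTextStripper.getText` would return: `PageFilter` and `pagesWithout` are hypothetical names for illustration, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

final class PageFilter {
    private PageFilter() {}

    /**
     * Returns the 1-based numbers of the pages whose text does not contain
     * the keyword. pageTexts.get(0) holds the text of page 1, and so on.
     */
    static List<Integer> pagesWithout(List<String> pageTexts, String keyword) {
        List<Integer> result = new ArrayList<>();
        for (int index = 0; index < pageTexts.size(); index++) {
            if (!pageTexts.get(index).contains(keyword)) {
                // Convert the 0-based list index to a 1-based page number
                result.add(index + 1);
            }
        }
        return result;
    }
}
```

Each returned page number `n` would then correspond to `list.get(n - 1)` in the page list above.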

sam9046
-
Personal preference, but I find "if (! found)" to be much more readable than the "if (found == false)" syntax :) – user85116 Apr 21 '14 at 19:52
1
You can use the getPage method on a PDDocument instance:
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);

Prasad Khode

Bilal Shahid
1
Here is the solution. Hope it will solve your issue.
String fileName = "C:\\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
// Pages 1 to 2 (1-based, inclusive) will be parsed. To parse a single page, set both values to the same number, e.g. setStartPage(1); setEndPage(1);
String result = stripper.getText(doc);
doc.close();

Mowazzem Hosen
0
Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.

Paul