PDFBox 2.0.3/Java 7 - OOM Error when importing page from one PDF to another

Question

I have some code that reviews every page in a large PDF (20,000+ pages) and if that page contains a certain String, then it imports that page to another PDF.

Due to the number of occurrences, the PDF that it's being imported into grows almost as large as the source PDF. When it gets too large, it bombs out with the exception below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf (Unknown Source)
at java.io.ByteArrayOutputStream.toByteArray (Unknown Source)
at org.apache.pdfbox.cos.COSOutputStream.close(COSOutputStream.java:87)
at java.io.FilterOutputStream.close(Unknown Source)
at org.apache.pdfbox.cos.COSStream$1.close(COSStream.java:223)
at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:138)
at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:104)
at org.apache.pdfbox.pdmodel.PDDocument.importPage(PDDocument.java:562)
at ExtractPage.extractString(ExtractPage.java:57)
at RunApp.run(RunApp.java:15)

I have researched the issue, and it looks like using a temp file for streaming could resolve my problem. However, I just can't work out how to implement it in my code.

I do have a workaround where I would batch the pages into separate files and then merge them afterwards, using the solution mentioned here. However, it certainly would be much more efficient and cleaner to avoid this.
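For reference, the batching half of that workaround can be sketched as below. This is only an illustration under assumptions: the class name `PageBatcher`, the method `toBatches`, and the batch size are hypothetical, and the PDFBox steps of writing each batch to its own file and merging the files afterwards are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups matching page numbers
// into batches so that each batch can be written to its own PDF file.
class PageBatcher {

    /** Splits pageNumbers into consecutive batches of at most batchSize entries. */
    static List<List<Integer>> toBatches(List<Integer> pageNumbers, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < pageNumbers.size(); start += batchSize) {
            int end = Math.min(start + batchSize, pageNumbers.size());
            // Copy the sub-list so each batch is independent of the input list
            batches.add(new ArrayList<>(pageNumbers.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch would then be imported into a separate document and saved before the next batch starts, so that no single output document holds all matching pages at once.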

Please see a summary of my code below:

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;

File sourceFile = new File("C:\\Temp\\extractFROM.pdf");
PDDocument sourceDocument = PDDocument.load(sourceFile, MemoryUsageSetting.setupTempFileOnly());
PDPageTree sourcePageTree = sourceDocument.getDocumentCatalog().getPages();
PDDocument tempDocument = new PDDocument(MemoryUsageSetting.setupTempFileOnly());

for (PDPage page : sourcePageTree) {
    // Code to extract the page text and confirm whether it contains the String
    if (above pseudocode is true) {
        tempDocument.importPage(page);
    }
}

tempDocument.save(sourceFile);

Once it has imported around 7,000 pages, it bombs out at the tempDocument.importPage(page) line. It works perfectly for PDFs with fewer matching pages than that.

Can anyone assist?

@TilmanHausherr - I can't get 2.0.8, as I'm using PDFBox for a client and they are a little behind the times in terms of sourcing. As such, I'm stuck with 2.0.3 — Rusty Shackleford, Nov 30 '17 at 21:24
@mkl - I have increased the heap as a run configuration to 670 MB (the maximum I can secure with my client equipment) and this has successfully resolved the issue. In fact, I tried it on a PDF twice the size of the original failing PDF, and it easily managed this as well. If I could mark your comment as an accepted answer, I would! — Rusty Shackleford, Nov 30 '17 at 21:29

score 1 · Accepted Answer · answered Nov 30 '17 at 22:09

A program running into an OutOfMemoryError might have a memory leak, or it might simply require more memory to run properly.

Thus, one change to try in such a situation is to simply increase the memory assigned to the program. If the program then runs without an issue, you can consider this a fix. As long as the memory assigned does not become completely unreasonable, that is...

This appears to be the case here, as the OP confirmed:

I have increased the heap as a run configuration to 670 MB (the maximum I can secure with my client equipment) and this has successfully resolved the issue. In fact, I tried it on a PDF twice the size of the original failing PDF, and it easily managed this as well.
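Before and after changing the heap setting, it can help to check what limit the JVM is actually running with. The following is a minimal sketch using only the standard `Runtime` API; the class name `HeapCheck` and the helper `toMiB` are illustrative and not taken from the question.

```java
// Illustrative helper (names are assumptions): reports the heap limits
// of the JVM it runs in, in mebibytes.
class HeapCheck {

    /** Converts a byte count to whole mebibytes (rounding down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap the JVM will attempt to use (governed by -Xmx)
        System.out.println("Max heap:       " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory(): heap currently reserved by the JVM
        System.out.println("Allocated heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory(): unused part of the currently reserved heap
        System.out.println("Free in heap:   " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Running it with the same options as the real program, for example `java -Xmx670m HeapCheck`, shows whether the larger limit has taken effect. The reported maximum may be slightly below the requested value, depending on the JVM.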

Just to provide further information, I have faced memory leaks prior to this and have successfully managed to correct them without increasing the heap. However, I had been working on this problem for a few days and couldn't find a solution. It also appears my JVM had a default maximum heap of 256 MB, so no wonder it was falling over considering the sizes of the PDFs I am working with. — Rusty Shackleford, Nov 30 '17 at 22:14
