I have some code that checks every page of a large PDF (20,000+ pages) and, if a page contains a certain String, imports that page into another PDF.
Due to the number of occurrences, the PDF being imported into grows almost as large as the source PDF. When it gets too large, it bombs out with the exception below:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf (Unknown Source)
at java.io.ByteArrayOutputStream.toByteArray (Unknown Source)
at org.apache.pdfbox.cos.COSOutputStream.close(COSOutputStream.java:87)
at java.io.FilterOutputStream.close(Unknown Source)
at org.apache.pdfbox.cos.COSStream$1.close(COSStream.java:223)
at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:138)
at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:104)
at org.apache.pdfbox.pdmodel.PDDocument.importPage(PDDocument.java:562)
at ExtractPage.extractString(ExtractPage.java:57)
at RunApp.run(RunApp.java:15)
I have researched the issue, and it looks like using a temp file for streaming could resolve my problem. However, I just can't work out how to implement it in my code.
I do have a workaround where I would batch the pages into separate files and then merge them afterwards, using the solution mentioned here. However, it would certainly be much more efficient and cleaner to avoid this.
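For context, the batching part of that workaround would look roughly like the following. This is only a minimal sketch under my own assumptions: PageBatcher and toBatches are placeholder names, the batch size is arbitrary, and the PDFBox side (importing each batch into its own PDDocument, saving it, and merging the files afterwards) is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder helper (not part of PDFBox): groups the indices of the
// matching pages into batches so each batch can go into its own PDF.
final class PageBatcher {

    // Splits matchingPages into consecutive batches of at most batchSize
    // entries, preserving the original page order.
    static List<List<Integer>> toBatches(List<Integer> matchingPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < matchingPages.size(); start += batchSize) {
            int end = Math.min(start + batchSize, matchingPages.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(matchingPages.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch would then be imported into its own temporary document, for example by calling toBatches(matchingPages, 1000) and saving one file per batch before merging.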
Please see a summary of my code below:
File sourceFile = new File("C:\\Temp\\extractFROM.pdf");
PDDocument sourceDocument = PDDocument.load(sourceFile, MemoryUsageSetting.setupTempFileOnly());
PDPageTree sourcePageTree = sourceDocument.getDocumentCatalog().getPages();
PDDocument tempDocument = new PDDocument(MemoryUsageSetting.setupTempFileOnly());
for (PDPage page : sourcePageTree) {
    // pageContainsString is a placeholder for the code that extracts
    // the page text and confirms whether it contains the String
    if (pageContainsString(page)) {
        tempDocument.importPage(page);
    }
}
tempDocument.save(sourceFile);
Once it has imported around 7,000 pages, it bombs out at the tempDocument.importPage(page) line. It works perfectly for PDFs below that number of pages.
Can anyone assist?