We are trying to render images from different PDF files using PDFRenderer's renderImageWithDPI method. On one particular PDF, the renderer behaves differently for some pages.

Rendering takes far longer than for other, similar pages, and memory consumption reaches unusually high values: the memory used by the process grows by about 50 MB every 1 - 2 seconds, until the application process is using around 5 GB of RAM while in renderImageWithDPI. Once the thread finishes renderImageWithDPI, memory consumption drops by 1.5 - 2 GB almost immediately. Because of the high memory consumption, a java.lang.OutOfMemoryError (Java heap space) is sometimes thrown.

The pages on which this happens are not visibly different from the others: they have the same width, height, and size on disk. Rendering is done at 250 DPI with ImageType RGB. The application also runs with the "-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider" parameter.

Is this a memory leak or expected behaviour? Also, could somebody explain why some pages use 2 GB of memory and take a minute to render, while others render in a couple of seconds?

Cristian
  • Can you share the pdf in question? – mkl Apr 06 '17 at 07:46
  • Could you give me your e-mail address so that I can send you a Google Drive link to it? – Cristian Apr 06 '17 at 07:52
  • maybe shadings, maybe complex patterns... Please send the link also to tilman at snafu dot de. – Tilman Hausherr Apr 06 '17 at 08:18
  • @Cristian you can find an address if you click on my name here, mkl@... but definitely also send it to Tilman, he is an active pdfbox developer. – mkl Apr 06 '17 at 09:03
  • Are you using the latest version (2.0.5)? – Tilman Hausherr Apr 06 '17 at 11:18
  • No, we are using 2.0.4 currently. Should we upgrade? – Cristian Apr 06 '17 at 11:26
  • Page 34 is quite slow, but that one has ten thousand image XObjects. You can lessen the memory footprint with the tricks mentioned in the PDFBox FAQ. – Tilman Hausherr Apr 06 '17 at 17:29
  • Indeed, page 34 is the slowest page in the book. I will have a look at the "PDF rendering" section of the FAQ as well. Is the memory footprint growing by 50 MB every 2 seconds while processing page 34 because of the large number of XObjects on that specific page? – Cristian Apr 07 '17 at 06:01
  • Well, that's the most likely cause. 10132 XObjects for a single page is the worst I've ever seen. You can see this yourself with the PDFDebugger command line app: go to page 34, then resources, then XObject. You can also observe that the next time the page is shown, it is done much faster. – Tilman Hausherr Apr 07 '17 at 07:58
  • Is the document confidential? If not, I'd like to put p34 into a public issue. – Tilman Hausherr Apr 07 '17 at 08:19
  • Hi Tilman. The PDF is confidential, so please don't extract the page and make it public. If you can extract data from the page that can't be used to reconstruct it in detail later, please go ahead. I would like to accept your answer with details, so if you want, please create an answer and summarize what we have discussed in the comments. Thanks for all the support. – Cristian Apr 07 '17 at 08:49
  • Another thing to try: `-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true` . I'll make an answer tonight. I'll keep the p34 for myself and put "confidential" in its filename so I remember that. – Tilman Hausherr Apr 07 '17 at 08:53

1 Answer

Analysis of the PDF shows that page 34 has over 10,000 XObject elements, almost all of them CMYK images. You can see this yourself with the PDFDebugger command line app: go to page 34, then resources, then XObject. Converting these images is not very fast in Java. The memory usage is most likely due to PDFBox caching them; you can observe that the next time the page is shown, it is done much faster. The FAQ shows how to disable the cache.
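
To check the caching explanation on your own machine, one option is to time the same page twice and compare the heap growth of each run. The following is only a minimal sketch using plain JDK calls; `RenderProbe`, `Sample`, and the placeholder workload are hypothetical names for illustration, and the heap delta is a rough indicator because the garbage collector can run at any time.

```java
// Sketch only: RenderProbe and Sample are hypothetical helper names, not PDFBox classes.
final class RenderProbe {

    /** One measured run: wall-clock time and change in used heap. */
    static final class Sample {
        final long millis;
        final long heapDeltaBytes;

        Sample(long millis, long heapDeltaBytes) {
            this.millis = millis;
            this.heapDeltaBytes = heapDeltaBytes;
        }

        @Override
        public String toString() {
            return millis + " ms, heap delta " + (heapDeltaBytes / (1024 * 1024)) + " MB";
        }
    }

    /** Heap currently in use by this JVM, in bytes (approximate). */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the task once and reports elapsed time and heap growth. */
    static Sample measure(Runnable task) {
        long heapBefore = usedHeap();
        long start = System.nanoTime();
        task.run();
        long millis = (System.nanoTime() - start) / 1_000_000;
        return new Sample(millis, usedHeap() - heapBefore);
    }

    public static void main(String[] args) {
        // Placeholder workload. Replace the body with the real render call
        // for the slow page (wrapping its checked IOException).
        Runnable renderPage = () -> {
            int[] pixels = new int[1_000_000];
            pixels[0] = 1;
        };

        System.out.println("first run:  " + measure(renderPage));
        System.out.println("second run: " + measure(renderPage));
    }
}
```

If the second run on the same open document is much faster and adds little heap, that is consistent with the images having been cached during the first run.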

I also get a speed improvement (21 seconds instead of 89 seconds) by using this option: -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true. However, image quality may differ very slightly; see PDFBOX-3569 for a discussion.
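
If changing the JVM command line is inconvenient, the same two switches can be set as system properties in code. This is a sketch under one assumption: that the properties are read when the relevant classes are first used, so `apply` has to run before any PDFBox or Java2D colour class is touched. `RenderingFlags` is a hypothetical helper name; the property names and values are the ones quoted in this thread.

```java
// Sketch only: RenderingFlags is a hypothetical helper, not part of PDFBox.
final class RenderingFlags {
    // Property names and values copied from the question and the answer.
    static final String CMM_KEY = "sun.java2d.cmm";
    static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";
    static final String PURE_JAVA_CMYK_KEY =
            "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion";

    /**
     * Sets both switches as system properties.
     * Assumption: call this at startup, before any rendering class is loaded.
     */
    static void apply(boolean pureJavaCmyk) {
        System.setProperty(CMM_KEY, KCMS_PROVIDER);
        System.setProperty(PURE_JAVA_CMYK_KEY, Boolean.toString(pureJavaCmyk));
    }

    public static void main(String[] args) {
        apply(true);
        System.out.println(CMM_KEY + "=" + System.getProperty(CMM_KEY));
        System.out.println(PURE_JAVA_CMYK_KEY + "=" + System.getProperty(PURE_JAVA_CMYK_KEY));
    }
}
```

Passing `false` keeps the KCMS provider while leaving the pure-Java CMYK conversion off, which makes it easy to time both combinations on the same document.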

Tilman Hausherr
  • We will try all the options from the FAQ and also the last one that you mentioned, bearing in mind that image quality is actually important for us. Thanks for the support in understanding this issue. – Cristian Apr 07 '17 at 18:15
  • After reading https://issues.apache.org/jira/browse/PDFBOX-3569, my understanding is that KcmsServiceProvider and UsePureJavaCMYKConversion should not be used at the same time because it slows down the rendering. We will stick only with KcmsServiceProvider since we can't switch between the two depending on one pdf or another. – Cristian Apr 10 '17 at 12:35
  • In my case, using both made it faster. – Tilman Hausherr Apr 10 '17 at 12:38
  • @TilmanHausherr Can you please have a look at this https://stackoverflow.com/questions/51615758/memory-leak-issue-with-pdfbox – Richa Jul 31 '18 at 14:33