
When I try to convert a PDF to an image, I get an "out of memory" error for some PDFs. I increased the heap size, and then got the same error for a different PDF file. For the time being, assume I have no memory leak from other objects. What could be the reason for this out-of-memory error? Is it just that the image is so large (which I don't think is the case) that it consumes the heap, or does PDFBox perhaps keep a buffered image of each page in memory, contributing to the growing heap usage? Any insight would be wonderful.

Here's the link to the PDF I am trying to render: https://drive.google.com/file/d/0B_Ke2amBgdpeNFFDem5KVVVzanc/view?usp=sharing

Here's the code segment:

// pdDoc is the already-loaded PDDocument; page is the 1-based page number
PDFRenderer pdfRenderer = new PDFRenderer(pdDoc);
BufferedImage image = pdfRenderer.renderImageWithDPI(page - 1, 300, ImageType.GRAY);
// image = ImageHelper.convertImageToGrayscale(image);
ImageIOUtil.writeImage(image, "G:/Trial/tempImg.png", 300);

Please note that for this particular PDF the problem was solved by increasing the heap size, but what I want to know is whether PDFBox stores buffered images in memory and so contributes to heap usage.

Here's another PDF that hit the same issue even after increasing the heap size: https://drive.google.com/file/d/0B_Ke2amBgdpedDBtaG1QcW1oYlU/view?usp=sharing In this PDF my code takes forever while rendering page 44. I don't know why this is happening.

ANKIT
  • maybe add the size of what you are trying to convert and the snippet of code doing it – Zeromus Jun 24 '16 at 08:08
  • I have edited my post and uploaded the code and file. – ANKIT Jun 24 '16 at 08:25
  • Regarding memory usage: if I'm not mistaken, PDFBox uses a lot of memory (especially with colored images), and yes, it keeps all those pages in memory even though you don't need them (a year or so ago they had plans for a readOnDemand/remove-after-usage feature, but I didn't keep up to date). You can try to use a scratch file to save memory, but it's going to be slow. – Zeromus Jun 24 '16 at 08:51
  • Your file is huge... maybe you'll need even more -Xmx space. I don't see any problems with p44. It's just a bunch of very large scans. Yes, PDFBox does store a lot in memory. Make sure that when converting, you don't keep the images (e.g. in an array) so that the space becomes available. And if you're using JDK8, don't forget the special setting. https://pdfbox.apache.org/2.0/getting-started.html – Tilman Hausherr Jun 24 '16 at 09:00
  • So I guess that's the reason for my out-of-memory error in the 1st PDF. But why does my code halt while rendering page 44 of the 2nd PDF, any idea? – ANKIT Jun 24 '16 at 09:01
  • I am already using that setting and I am using -xmx1024m, should I increase it more? So all these problems are because of the image size and aren't related to any memory leak in PDFBox? – ANKIT Jun 24 '16 at 09:03
  • Also, I am not storing images in a list , my above code segment is in a function which is called for every page. – ANKIT Jun 24 '16 at 09:05
  • @Zeromus we are caching images but using a SoftReference since the 2.0 release, so they shouldn't be kept in memory. I have no problem with p44. Btw the extracted pages are up to 31MB large (p8). It is often a bad idea to scan in color. And scanning text papers to JPEG (as in the linked PDF) is also a bad idea, due to the artefacts. Sadly, many poorly programmed multifunctional copiers do this. – Tilman Hausherr Jun 24 '16 at 09:22
  • @ANKIT try -Xmx2g. – Tilman Hausherr Jun 24 '16 at 09:22
  • @Tilman Hausherr, I am using a 32-bit JVM and I can't seem to set my Xmx to 2g. – ANKIT Jun 24 '16 at 09:24
  • Well, the limit of the 32-bit version is known; can't you upgrade to 64? – Zeromus Jun 24 '16 at 09:24
  • I was curious as to why does stack overflow prevents us from posting new question before 90 minutes of posting previous question? – ANKIT Jun 24 '16 at 09:25
  • @ANKIT then try -Xmx1999m (not the big X). Consider replacing your jvm. – Tilman Hausherr Jun 24 '16 at 09:25
  • Oracle FAQ: The maximum theoretical heap limit for the 32-bit JVM is 4G. Due to various additional constraints such as available swap, kernel address space usage, memory fragmentation, and VM overhead, in practice the limit can be much lower. On most modern 32-bit Windows systems the maximum heap size will range from 1.4G to 1.6G. On 32-bit Solaris kernels the address space is limited to 2G. On 64-bit operating systems running the 32-bit VM, the max heap size can be higher, approaching 4G on many Solaris systems. – Zeromus Jun 24 '16 at 09:26
  • I am using tess4j wrapper and it doesn't work with 64bit jvm, so I can't update – ANKIT Jun 24 '16 at 09:26
  • @ANKIT https://www.google.com/search?q=90+minutes+stackoverflow – Tilman Hausherr Jun 24 '16 at 09:27
  • @Tilman Hausherr , I wanted to ask another question, guess will have to wait for half an hour. I will upload my entire function in that question and Please take a look because that is related to the scratch file buffer we were discussing about.(and the memory error) – ANKIT Jun 24 '16 at 09:30
  • @TilmanHausherr , here's the link :- http://stackoverflow.com/questions/38010063/facing-set-datapath-error-while-using-tesseract-in-java – ANKIT Jun 24 '16 at 09:32
  • So to end this discussion: the problem I am facing is not due to any kind of bug or memory leak, but is due to the image size and the limited heap size available on a 32-bit JVM. Correct me if I am wrong. – ANKIT Jun 24 '16 at 09:33
  • The PDF size is not too large in my opinion. But I don't have any idea how much is too large for the JVM. – ANKIT Jun 24 '16 at 09:38
  • @TilmanHausherr , Would it make a difference If I use other renderer like ghost4j? – ANKIT Jun 24 '16 at 10:20
  • @ANKIT sorry, I can't comment on a software that I haven't used. All I can say is that the rendering of ghostscript (I have used gswin) is great. – Tilman Hausherr Jun 24 '16 at 10:25
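Since several comments above turn on whether the JVM is 32-bit and how much heap it actually got, a quick check from inside the program can help. This is a minimal sketch using only the JDK; the class name `HeapInfo` is made up for illustration, and `sun.arch.data.model` is a HotSpot-specific property that may be missing on other JVMs, which is why a fallback value is used.

```java
public class HeapInfo {
    /** Maximum heap the JVM will try to use, in bytes (reflects -Xmx when set). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One-line summary of heap limit and JVM architecture. */
    static String describe() {
        long maxMiB = maxHeapBytes() / (1024 * 1024);
        // HotSpot-specific property; "unknown" is returned on JVMs that do not define it.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        String osArch = System.getProperty("os.arch", "unknown");
        return "max heap: " + maxMiB + " MiB, data model: " + dataModel
                + "-bit, os.arch: " + osArch;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with the same options as the real application (for example with -Xmx1024m) shows whether the setting took effect and whether the JVM in use is the 32-bit one.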

1 Answer


It seems that this problem is not due to any bug or memory leak but to the image size. Proposed solutions: 1) Increase your Xmx size. 2) Switch to a 64-bit JVM.

EDIT: Thanks for the answers. I am just going to lay it out here. Tests were performed by @Tilman Hausherr, and the result was that the heap size should be increased. Note that a 64-bit JVM was used.
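To get a feel for why image size matters, here is a rough back-of-the-envelope sketch of the raster size for one rendered page. It is only an estimate under stated assumptions: an A4-sized page of 595 x 842 points (the real page sizes of the linked PDFs are not known here), 1 byte per pixel for a gray image and 4 bytes per pixel for an int-based RGB image. The class and method names are made up for illustration, and the figure ignores everything else the renderer holds, such as the decoded embedded scans.

```java
public class PageMemoryEstimate {
    /**
     * Approximate raster size in bytes for one page rendered at the given DPI.
     * Ignores JVM object overhead and any intermediate buffers.
     */
    static long estimateBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        double scale = dpi / 72.0; // PDF user space has 72 units per inch
        long widthPx = (long) Math.max(Math.floor(widthPt * scale), 1);
        long heightPx = (long) Math.max(Math.floor(heightPt * scale), 1);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        double widthPt = 595;  // assumed A4 width in points
        double heightPt = 842; // assumed A4 height in points
        long gray = estimateBytes(widthPt, heightPt, 300, 1);
        long rgb = estimateBytes(widthPt, heightPt, 300, 4);
        System.out.printf("gray: %.1f MiB, rgb: %.1f MiB%n",
                gray / (1024.0 * 1024.0), rgb / (1024.0 * 1024.0));
    }
}
```

Under these assumptions a single gray page at 300 DPI is roughly 8 MiB and an RGB page roughly 33 MiB, before counting the source images inside the PDF. Larger scanned pages scale this up quickly, which is consistent with needing more heap rather than with a leak.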

ANKIT
  • On JDK7 64bit, one file works with -Xmx90m, the other with -Xmx400m. – Tilman Hausherr Jun 26 '16 at 11:09
  • Yes, increasing the Xmx size is a fix, but when there are too many PDF files the heap still fills up. Hey @TilmanHausherr, can you do one thing for me please: after adding an image.flush() line, run the above code in a loop many times and then look at the heap dump. You can do that with any PDF. Please update me on the result. My result: too many Finalizer references, and the count only increases with the number of loops. – ANKIT Jun 26 '16 at 11:16
  • This will take some time. My PC is 6 years old. And I usually switch it off at night so that the room can cool down a bit. – Tilman Hausherr Jun 26 '16 at 11:31
  • @TilmanHausherr, Whenever you get time , do this and then please update me on the result. – ANKIT Jun 26 '16 at 11:37
  • Why is the finalize method used with objects such as BufferedImage, which may be instantiated many times? It only increases the number of finalizers in the reference queue; I don't understand the purpose. Methods like close or dispose could be used instead. – ANKIT Jun 26 '16 at 11:38
  • The heat wave is over here, so I had your files rendered for 24 hours in a loop (30 loops were done) without any problem, -Xmx400m is set, on a 64bit JDK7 on W7. – Tilman Hausherr Jun 27 '16 at 15:49
  • @TilmanHausherr Thanks for running the code. Can you send me the code? I want to see where I am making a mistake. You did this on 64-bit, so I cannot really compare, but I'll try. Using System.gc() in between pages prevented the memory error. I haven't run the code for a long time yet. Tomorrow I will update you with the result, whether it is working or not. – ANKIT Jun 27 '16 at 16:49
  • Unfortunately I don't see any difference between your code and my code. Anyway, thanks for the effort. Maybe the difference is due to the JVM. – ANKIT Jun 27 '16 at 16:59
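The per-page pattern discussed in these comments (render, use, drop the reference, move on) can be sketched with the JDK alone. This is not the code that was actually run in the tests above: `fakeRender` is a hypothetical stand-in for the renderer call, and the class name is invented. The point it illustrates is only that each image lives in a local variable, so nothing accumulates across pages.

```java
import java.awt.image.BufferedImage;

public class RenderLoopSketch {
    /** Hypothetical stand-in for the real page renderer; just allocates a gray raster. */
    static BufferedImage fakeRender(int widthPx, int heightPx) {
        return new BufferedImage(widthPx, heightPx, BufferedImage.TYPE_BYTE_GRAY);
    }

    /** Processes pages one at a time without keeping any image in a list or field. */
    static int processPages(int pageCount, int widthPx, int heightPx) {
        int done = 0;
        for (int page = 0; page < pageCount; page++) {
            BufferedImage image = fakeRender(widthPx, heightPx);
            // ... write the image to disk or hand it to OCR here ...
            // flush() releases cached resources tied to the image; the pixel
            // data itself is reclaimed by the garbage collector once the
            // reference goes out of scope at the end of this iteration.
            image.flush();
            done++;
        }
        return done;
    }

    public static void main(String[] args) {
        // 2479 x 3508 is roughly an A4 page at 300 DPI (an assumption for illustration).
        System.out.println("pages processed: " + processPages(20, 2479, 3508));
    }
}
```

With this structure the heap only has to hold about one page at a time. An explicit System.gc() between pages, as tried above, is only a hint to the JVM and should not normally be required.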