Hi, I am facing issues while trying to convert PDF files to .jpeg. I am running Python from the Anaconda distribution on a Windows machine.

Below is the code, which works for some of the PDFs:

import os
from wand.image import Image as wi

pdf_dir = r"C:\\Users\Downloads\python computer vison\Computer-Vision-with-Python\pdf_to_convert"
os.chdir(pdf_dir)
path = r"C:/Users/Downloads/python computer vison/Computer-Vision-with-Python/jpeg_extract/"
for pdf_file in os.listdir(pdf_dir):
    print("filename is ", pdf_file)
    # Read the PDF at 300 DPI
    pdf = wi(filename=pdf_file, resolution=300)
    #print("filename is ",pdf_file)
    pdfImage = pdf.convert("jpeg")
    i = 1
    # Save every page of the PDF as its own numbered JPEG
    for img in pdfImage.sequence:
        page = wi(image=img)
        page.save(filename=path + pdf_file + str(i) + ".jpg")
        i += 1

And below is the output:

filename is  tmpdocument-page0.pdf
filename is  tmpdocument-page1.pdf
filename is  tmpdocument-page100.pdf
filename is  tmpdocument-page1000.pdf
filename is  tmpdocument-page1001.pdf
filename is  tmpdocument-page1002.pdf
filename is  tmpdocument-page1003.pdf
filename is  tmpdocument-page1004.pdf
filename is  tmpdocument-page1005.pdf
filename is  tmpdocument-page1006.pdf
filename is  tmpdocument-page1007.pdf
filename is  tmpdocument-page1008.pdf
filename is  tmpdocument-page1009.pdf
filename is  tmpdocument-page1012.pdf
---------------------------------------------------------------------------
CorruptImageError                         Traceback (most recent call last)
<ipython-input-7-84715f25da7c> in <module>()
      8     #path = r"C://Users/Downloads/Work /ml_training_samples/tmp/"
      9     print("filename is ",pdf_file)
---> 10     pdf = wi(filename=pdf_file,resolution=300)
     11     #print("filename is ",pdf_file)
     12     pdfImage = pdf.convert("jpeg")

~\Anaconda3\envs\python-cvcourse\lib\site-packages\wand\image.py in __init__(self, image, blob, file, filename, format, width, height, depth, background, resolution, pseudo)
   4706                     self.read(blob=blob, resolution=resolution)
   4707                 elif filename is not None:
-> 4708                     self.read(filename=filename, resolution=resolution)
   4709                 # clear the wand format, otherwise any subsequent call to
   4710                 # MagickGetImageBlob will silently change the image to this

~\Anaconda3\envs\python-cvcourse\lib\site-packages\wand\image.py in read(self, file, filename, blob, resolution)
   5000             r = library.MagickReadImage(self.wand, filename)
   5001         if not r:
-> 5002             self.raise_exception()
   5003 
   5004     def save(self, file=None, filename=None):

~\Anaconda3\envs\python-cvcourse\lib\site-packages\wand\resource.py in raise_exception(self, stacklevel)
    220             warnings.warn(e, stacklevel=stacklevel + 1)
    221         elif isinstance(e, Exception):
--> 222             raise e
    223 
    224     def __enter__(self):

CorruptImageError: unable to read image data `C:/Users/AppData/Local/Temp/magick-40700dP2k-1ORw81R1' @ error/pnm.c/ReadPNMImage/1346

Background: I have an image-based PDF document, which I named tmpdocument, with over 2200 pages, so I split it into individual PDF documents using Python. Now I am trying to convert them to JPEG.

Problem:

When I try to convert the PDFs to JPEG, some pages convert successfully and some fail with the above error. Since all of these pages come from the same document, I highly doubt this is a format issue. I am also able to open and view the failing pages in Adobe, so I am sure they are not corrupted.
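
To find out exactly which pages fail, instead of stopping at the first error, each conversion could be wrapped in a try/except and the failures collected. This is only a minimal sketch, assuming Wand is installed; `convert_batch` and `wand_convert` are hypothetical helper names I made up for illustration, not part of Wand:

```python
import os


def convert_batch(pdf_paths, convert_one):
    """Call convert_one(path) for every .pdf path and keep going on errors.

    Returns (succeeded, failed), where failed holds (path, error text) pairs.
    """
    succeeded, failed = [], []
    for path in pdf_paths:
        if not path.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        try:
            convert_one(path)
        except Exception as exc:  # e.g. wand's CorruptImageError
            failed.append((path, repr(exc)))
        else:
            succeeded.append(path)
    return succeeded, failed


def wand_convert(pdf_path, out_dir, resolution=300):
    """Hypothetical converter: save each page of one PDF as a JPEG."""
    # Imported here so convert_batch can be used without Wand installed.
    from wand.image import Image as WandImage

    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    # The with-blocks release ImageMagick resources even when a page fails.
    with WandImage(filename=pdf_path, resolution=resolution) as pdf:
        for number, frame in enumerate(pdf.sequence, start=1):
            with WandImage(image=frame) as page:
                page.format = "jpeg"
                out_name = "%s-%d.jpg" % (stem, number)
                page.save(filename=os.path.join(out_dir, out_name))
```

It could be called as `convert_batch(paths, lambda p: wand_convert(p, out_dir))`, and the `failed` list then shows whether the same files fail on every run.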

Lastly, ImageMagick takes up a lot of disk space, and with this issue on top of that I am truly lost. Is there any other way to achieve the above scenario? Any input would be helpful.

Thanks.

Update

Thanks for the reply. Yes, I am using Ghostscript 9.26. The PDF contains sensitive data, so unfortunately I can't post it online. The temp folder is 18 MB, so I think that is okay.

I found some code online that generates the JPEG files, but it overwrites them rather than creating new files. I have never used subprocess before, and this code gives no visibility into whether the program is running or has failed, or how to kill it. Any input here is also appreciated.

I understand it no longer uses ImageMagick, but I am okay with that as long as I can generate the JPEGs.

import os, subprocess

pdf_dir = r"C:\\Users\Downloads\latest_python\python computer vison\Computer-Vision-with-Python\pdf_to_convert"
os.chdir(pdf_dir)
pdftoppm_path = r"C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin\pdftoppm.exe"
i = 1
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        subprocess.Popen('"%s" -jpeg %s out' % (pdftoppm_path, pdf_file))
        i+=1
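
The files get replaced because every call uses the same output prefix `out`, and pdftoppm appends only the page number and extension to that prefix. Below is a minimal sketch of one possible fix, assuming Python 3.7+ and the same pdftoppm.exe as above. The helper names, the `-r` resolution value, and the timeout are my own assumptions for illustration:

```python
import os
import subprocess


def build_pdftoppm_command(pdftoppm_path, pdf_path, out_dir, dpi=300):
    """Build the argument list, using the PDF's own name as output prefix."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    out_prefix = os.path.join(out_dir, stem)  # unique per input PDF
    return [pdftoppm_path, "-jpeg", "-r", str(dpi), pdf_path, out_prefix]


def run_pdftoppm(pdftoppm_path, pdf_path, out_dir, dpi=300, timeout=120):
    """Run pdftoppm, wait for it, and report its exit code and stderr."""
    command = build_pdftoppm_command(pdftoppm_path, pdf_path, out_dir, dpi)
    # subprocess.run blocks until the process exits; if the timeout is hit,
    # it kills the process and raises subprocess.TimeoutExpired.
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stderr
```

Passing the command as a list avoids quoting problems with spaces in paths. A non-zero return code or text in stderr gives some visibility into failures. Unlike `Popen`, which starts every conversion at once without waiting, this runs one file at a time.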
kumarm
  • Sorry, I do not know Wand. Just some guesses. 1) Does the same file fail every time? If so, can you post a link to a PDF that fails? It may not be compatible with Ghostscript. 2) You may be filling your ImageMagick TEMP (/Temp and/or /tmp) directory, leaving it too full to process more files. Clean out any large files, or all of them. What version of ImageMagick are you using, on what platform (assumed Windows), and what version of Ghostscript? If Ghostscript is too old or is 9.26, change it to 9.23-9.25. 9.26 may have some issues. – fmw42 Mar 03 '19 at 01:41
  • It's not just one file. I have over 2200 pages and a lot of them are failing. I have updated my question above. – kumarm Mar 03 '19 at 02:01
  • If you are trying to process 2200 pages in one PDF, that can overload your memory resources and cause other issues. If Wand crashes, even on smaller PDF files, it can leave a lot of files in your temp directory. Did you check whether it needed clearing? Try downgrading your Ghostscript. Poppler may be an option to replace Ghostscript. I am not that familiar with it, but I think it can process PDF. ImageMagick will not use it, though, so you are right that you need subprocess. If you have one PDF that fails every time, try Ghostscript standalone to see whether the problem is in GS, and try Poppler as well. Downgrade GS and try again. – fmw42 Mar 03 '19 at 03:30
  • Yes, you are right: Wand and ImageMagick were giving a lot of issues, and subprocess is also very slow. I have now changed my approach (again) and am using pdf2image. It is fast and works reasonably well; it converted over 1300 pages within 30 minutes. But some pages give me DecompressionBombError: Image size (196305115 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack. I am not sure where to set the max pixel size to avoid this. I am also posting the code in another comment. – kumarm Mar 03 '19 at 04:00
  • import os
    from pdf2image import convert_from_path
    pdf_dir = r"C:\\Users\Downloads\latest_python\python computer vison\Computer-Vision-with-Python\pdf_to_convert"
    os.chdir(pdf_dir)
    for pdf_file in os.listdir(pdf_dir):
        print("file name is ",pdf_file)
        if pdf_file.endswith(".pdf"):
            pages = convert_from_path(pdf_file, 300)
            pdf_file = pdf_file[:-4]
            for page in pages:
                page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")
    – kumarm Mar 03 '19 at 04:01
  • I did this and it worked:
    from PIL import Image
    Image.MAX_IMAGE_PIXELS = None
    – kumarm Mar 03 '19 at 04:08
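
For anyone hitting the same DecompressionBombError: the 178956970 in the message is twice Pillow's default `Image.MAX_IMAGE_PIXELS` of 89478485. Setting it to `None` removes the check entirely. A rough sketch of two gentler options follows: raising the limit to a finite value, or rendering oversized pages at a lower DPI. The helper names and the idea of passing in a `pixel_limit` are my own assumptions, and the DPI estimate ignores rounding of page dimensions:

```python
import math
import os

# Pillow's default limit; it warns above this and raises above twice this.
PILLOW_DEFAULT_MAX_PIXELS = 89478485


def max_dpi_under_limit(pixels_at_dpi, dpi, pixel_limit):
    """Estimate the largest whole DPI that keeps a page within pixel_limit.

    pixels_at_dpi is the pixel count the page produced when rendered at dpi.
    """
    area_sq_in = pixels_at_dpi / (dpi * dpi)  # page area in square inches
    return math.floor(math.sqrt(pixel_limit / area_sq_in))


def convert_pdf(pdf_path, out_dir, dpi=300, pixel_limit=None):
    """Hypothetical helper: save each page as a JPEG and return the count."""
    # Imported here so the DPI helper works without these packages installed.
    from PIL import Image
    from pdf2image import convert_from_path

    if pixel_limit is not None:
        Image.MAX_IMAGE_PIXELS = pixel_limit  # finite limit instead of None
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    pages = convert_from_path(pdf_path, dpi)
    # enumerate avoids the repeated list search done by pages.index(page)
    for number, page in enumerate(pages):
        out_name = "%s-page%d.jpg" % (stem, number)
        page.save(os.path.join(out_dir, out_name), "JPEG")
    return len(pages)
```

By this estimate, the failing page (196305115 pixels at 300 DPI) would fit under the default error threshold at about 286 DPI.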
