Questions tagged [pdf2image]

A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.

pdf2image is a Python package that wraps the pdftoppm and pdftocairo command-line tools to convert each page of a PDF into a PIL Image object.

71 questions
1 vote, 0 answers

pdf2image on AWS Lambda - resulting PNG has wrong fonts

I am using pdf2image's convert_from_bytes to convert my own PDFs to PNG. The context is AWS Lambda, Python 3.8. ... images = convert_from_bytes(infile, dpi=DPI, fmt=FMT) for…
TaiT's
1 vote, 0 answers

Poppler is installed, pdf2image path error seems to have no resolution. Has this been fixed?

I am running Debian Buster in a Docker image. I have installed every poppler package to rule out anything unusual. I have explicitly added the paths to all of the poppler files, containing directories, etc. I have followed the documentation…
Chris
0 votes, 1 answer

Convert very large PDF to images with Python

I have an extremely large PDF containing scans that are approximately 30,000 px wide (wtf!). I have a Python script that works well for normal-sized PDFs, but when confronted with this large PDF it outputs only 1-pixel-wide white squares as images. The…
Seglinglin
0 votes, 1 answer

How to fix TypeError: expected str, bytes or os.PathLike object, not UploadedFile

I'm trying to build an OCR platform using Streamlit and EasyOCR. I have already managed to extract text from images, but I can't convert PDF to JPG in order to continue further processing. I tried downloading the PDF, then converting it to JPG, then…
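A possible direction for a TypeError like the one in the title, sketched under assumptions rather than taken from the asker's code: pdf2image's convert_from_path expects a filesystem path, while an upload object is file-like, so its raw bytes can be passed to convert_from_bytes instead. The helper names upload_to_bytes and pdf_upload_to_images are hypothetical, and the sketch assumes the upload object exposes read() and, optionally, seek().

```python
import io


def upload_to_bytes(upload):
    """Return the raw bytes of a file-like upload object (hypothetical helper)."""
    # Rewind first in case something has already read from the stream.
    if hasattr(upload, "seek"):
        upload.seek(0)
    return upload.read()


def pdf_upload_to_images(upload, dpi=200):
    """Convert an uploaded PDF to a list of PIL Images (hypothetical helper)."""
    # Imported lazily so upload_to_bytes can be used without pdf2image/poppler.
    from pdf2image import convert_from_bytes

    return convert_from_bytes(upload_to_bytes(upload), dpi=dpi)


# Stand-in for an uploaded file; a real upload would come from the web framework.
fake_upload = io.BytesIO(b"%PDF-1.4 demo")
raw = upload_to_bytes(fake_upload)
```

Each image returned by pdf_upload_to_images could then be saved as JPEG or passed to the OCR step; that part still requires poppler to be installed.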
0 votes, 0 answers

"Killed" error while removing watermark from PDF and merging the images into a PDF in VS Code [OS: Ubuntu]

Last two lines of the output shown in my terminal: Adding TEXT TOP @ LEFT CORNER for AP ECET 2020 Electronics and Communication Engineering September 14, 2020 Shift 2 English Question Paper Killed Removing the Watermark step from pdf is working…
0 votes, 2 answers

How can I convert image coordinates to PDF coordinates when using pdf2image and table-transformers?

I am using pdf2image to convert PDFs to images and detecting tables with table-transformers. I need help with coordinates. The table borders are detected perfectly, but the pixel coordinates in the images differ from the PDF coordinates. Is there any way to convert image…
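The usual relationship between the two coordinate systems can be sketched as follows, under stated assumptions: PDF user space is measured in points (72 per inch) with the origin at the bottom-left, the rendered image has its origin at the top-left, the page is not rotated, and its visible box starts at (0, 0). The function name pixel_box_to_pdf and the example box are hypothetical; dpi must be the value passed to pdf2image (its default is 200).

```python
def pixel_box_to_pdf(box_px, dpi, page_height_pt):
    """Map an (x0, y0, x1, y1) pixel box with a top-left origin to PDF points.

    Returns (left, bottom, right, top) with a bottom-left origin.
    Assumes an unrotated page whose visible box starts at (0, 0).
    """
    x0, y0, x1, y1 = box_px

    def to_pt(pixels):
        # One inch is `dpi` pixels in the image and 72 points in the PDF.
        return pixels * 72.0 / dpi

    left, right = to_pt(x0), to_pt(x1)
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    top = page_height_pt - to_pt(y0)
    bottom = page_height_pt - to_pt(y1)
    return (left, bottom, right, top)


# US Letter page (792 pt tall) rendered at 200 dpi.
print(pixel_box_to_pdf((200, 400, 600, 800), dpi=200, page_height_pt=792))
# → (72.0, 504.0, 216.0, 648.0)
```

If the page height in points is not known, it can be approximated as the image height in pixels times 72 / dpi, subject to rounding in the renderer.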
0 votes, 1 answer

Pdf file produces blank

I am creating a PDF file without text from a pdf file with text using the following program def remove_text_from_pdf(pdf_path_in, pdf_path_out): '''Removes the text from the PDF file and saves it as a new PDF file''' #Open the PDF file with the…
0 votes, 0 answers

Obtain the position of tables in a PDF and plot the bounding boxes on the image

Following this script, I can get the bounding box of the tables in my e-pdf: tabula.read_pdf(file, stream=True,guess=True,lattice=False,multiple_tables=True, output_format="json", pages=pg_num) However, I want to plot the bounding boxes detected…
skw1990
0 votes, 0 answers

PDFPageCountError: Unable to get page count

I am trying to use pdf2image, but I am getting this error: PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'C:\Users\user_name\Desktop\folder_name\folder2_name\folder3_name\007-084841-1 to 31 Dec'22': No error. It is…
CrisD
0 votes, 0 answers

Error with the path of the poppler folder

I am getting the following error when using a script with a poppler path: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH? But poppler is correctly installed on my computer. The code I am using…
clemdcz
0 votes, 0 answers

How to remove boxes around shx text without AutoCAD?

I am trying to use OCR (optical character recognition) on many documents of the same type. I use the pdf2image library for Python. But when it encounters PDFs with AutoCAD shx text, it captures the bounding boxes around the text as well. At first they are not visible on…
ArsenK
0 votes, 0 answers

Converting multi-page PDF to JPEG results in a single page

I have written this Python script to convert multi-page PDFs to JPEG. import requests, io from pdf2image import convert_from_bytes url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf' response = requests.get(url) pages =…
Rahul
0 votes, 0 answers

Difference in length of image bytes when performing the PIL Image .getvalue() operation on AWS Lambda?

I am trying to perform the .getvalue() operation on a PIL Image on AWS Lambda to extract the image's bytes, but the length of the byte string differs between my local machine and AWS Lambda. Below…
0 votes, 0 answers

Django convert InMemoryUploadedFile PDF to images

I need to convert an uploaded PDF to images. I'm using the pdf2image function convert_from_path() to convert the PDF, but I am getting the error "Unable to get page count". My code looks somewhat like this: pages =…
0 votes, 1 answer

Python: pdf2image doesn't write .jpg - no error message

I'm working on a Python script that checks the .pdf files in a directory, creates a new directory for each file, converts the .pdf into images, and writes the images as .jpg into the new directory. I'm using pdf2image and have the following…
bluesky