Is it possible to extract a pdf with its white spaces in Python?

Question

I have been trying to extract text from a PDF with Python, after a tool was already built to extract it using Java and PDFBox.

While the Java implementation was successful for the same PDF, I have been struggling to do the same in Python: pdfminer, pypdf, and PyPDF2 have not been able to extract the PDF line by line with its spaces. In particular, pdfminer's pdf2txt for some reason split the PDF into 3 columns and then read it line by line.

The closest I've gotten was using the implementation from a Stack Overflow question, which unfortunately does not keep the spaces. Since some of my fields both contain numbers, I cannot recover them from the text once the spaces are lost.

Given this, is it possible to extract a PDF line by line in Python with its white spaces preserved?

The most success I've had getting text from a PDF is by using [`pdftotext`](http://www.foolabs.com/xpdf/download.html). If you run Linux, you likely already have it installed on your system. I used to run `pdftotext` from my Python script, open the resulting text file, and then parse the text data. It wasn't perfect, but I found a correlation between the format of the PDF and the format of the text file and used that to parse it. - elssar, Jun 16 '13 at 15:50
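The workflow described in this comment could be sketched roughly as follows. This is only a sketch, assuming `pdftotext` is on the `PATH`; the helper names (`build_pdftotext_command`, `read_lines_keep_spaces`, `extract_lines`) are made up for illustration. The `-layout` flag asks `pdftotext` to keep the physical layout of the page, which is usually what keeps the spacing between columns.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for a layout-preserving pdftotext run."""
    # -layout asks pdftotext to maintain the original physical layout.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def read_lines_keep_spaces(text):
    """Split text into lines, removing only the line terminators."""
    # Leading and inner spaces are left untouched; note that splitlines()
    # also breaks on the form feed pdftotext writes between pages.
    return text.splitlines()


def extract_lines(pdf_path):
    """Run pdftotext on pdf_path and return its lines with spacing intact."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext fails.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    return read_lines_keep_spaces(text)
```

Calling `extract_lines("sample.pdf")` would write `sample.txt` next to the PDF and return its lines; how well the columns line up still depends on the PDF itself.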

score 0 · Answer 1 · answered Mar 17 '21 at 14:07

The following works in my case:

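import os

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to an image and save it for OCR.
os.makedirs("./images", exist_ok=True)
images = convert_from_path("sample.pdf")
for i, image in enumerate(images, start=1):
    image.save(f"./images/page_{i}.jpg", "JPEG")

# Run OCR on the first page and print the recognized text.
print(pytesseract.image_to_string("./images/page_1.jpg"))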

The idea here is to first convert the PDF to an image and then read the text from it. This approach preserves the whitespace.

Dependencies:

conda install -c conda-forge tesseract
conda install pdf2image
conda install pytesseract
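If the OCR output still collapses runs of spaces, one variation that may help is passing Tesseract's `preserve_interword_spaces` setting through pytesseract's `config` argument. The sketch below is untested against any particular PDF; the function names, the page segmentation mode (`--psm 6`), and the 300 dpi rendering are assumptions to adjust, not part of the answer above.

```python
def ocr_config(psm=6):
    """Build a Tesseract config string that asks it to keep inter-word spaces."""
    # --psm 6 treats the page as a single uniform block of text;
    # preserve_interword_spaces=1 asks Tesseract to keep runs of spaces.
    return f"--psm {psm} -c preserve_interword_spaces=1"


def ocr_pdf_lines(pdf_path, dpi=300):
    """OCR every page of pdf_path in memory and return the text lines."""
    # Imported here so ocr_config() can be used without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    lines = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        text = pytesseract.image_to_string(image, config=ocr_config())
        lines.extend(text.splitlines())
    return lines
```

This version skips the intermediate JPEG files and hands each rendered page to Tesseract directly, so every page is read rather than only the first.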

score 0 · Answer 2 · answered Jul 15 '21 at 17:47

You can use Aspose.PDF Cloud SDK for Python to extract text from a PDF line by line, along with its whitespace. Currently, it supports processing files from cloud storage (Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage, FTP Storage, and Aspose default Cloud Storage).

Here is sample code:

import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi

# Get Client Id and Client Secret from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxx',
    app_sid='xxxx-xxxx-xxxx-xxxx-xxxxxxxxxx')

pdf_api = PdfApi(pdf_api_client)
temp_folder = "Temp"

# Upload the PDF file to storage
data_file = "C:/Temp/02_pages.pdf"
remote_name = "02_pages.pdf"
pdf_api.upload_file(temp_folder + '/' + remote_name, data_file)

llx = 0
lly = 0
urx = 0
ury = 0

response = pdf_api.get_text(remote_name, llx, lly, urx, ury, folder=temp_folder)

for i in response.text_occurrences.list:
    print(i.text)

P.S.: I'm a developer evangelist at Aspose.
