Text extracted from a PDF file in Azure Blob Storage comes out with spaces between the characters of words

I am downloading a PDF file from a container in Azure Blob Storage:

from azure.storage.blob import BlobServiceClient

# connection_str and container are defined elsewhere
blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container=container)
blob_list = container_client.list_blobs()

for blob in blob_list:
    # Only process PDF blobs
    if str(blob.name).endswith(".pdf"):
        blob_client_pdf = container_client.get_blob_client(str(blob.name))
        blob_download_pdf = blob_client_pdf.download_blob()

Then I load it into an in-memory byte stream and read it with PyPDF2's PdfFileReader:

from io import BytesIO

import PyPDF2

stream = BytesIO()
blob_download_pdf.download_to_stream(stream)
stream.seek(0)  # rewind the buffer before handing it to the reader
fileReader = PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)

I get the text of the first page of the file:

text_pdf = fileReader.getPage(0).extractText()

And upload it as JSON to a container:

import json

dict_json = {}
dict_json["0"] = text_pdf  # key "0" holds the text of the first page
body = json.dumps(dict_json)
blob_client_json = blob_service_client.get_blob_client("corpus", blob="1234.json")
blob_client_json.upload_blob(body, overwrite=True)

But the extracted text has a lot of spaces between the letters within words.

1 Answer

PyPDF2's text extraction improved a lot in 2022. Please upgrade to a recent version.

Additionally, you should move from PyPDF2 to pypdf (I'm the maintainer of both). See https://pypdf.readthedocs.io/en/latest/user/migration-1-to-2.html for details. If you install pypdf==3.5.0, you will get lots of deprecation warnings telling you what to adjust whenever your code hits a deprecated name.
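As a rough sketch of what the migrated code might look like, assuming pypdf is installed: PdfReader, reader.pages and extract_text() are the pypdf names that replace PdfFileReader, getPage() and extractText(). The helper names extract_page_texts and to_json_body are made up for illustration and are not part of any library.

```python
import json
from io import BytesIO


def extract_page_texts(pdf_bytes: bytes) -> dict[str, str]:
    """Return {page_index: text} for every page of an in-memory PDF.

    Hypothetical helper; assumes the pypdf package is installed.
    """
    # Imported here so the JSON helper below stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(BytesIO(pdf_bytes))
    return {str(i): page.extract_text() for i, page in enumerate(reader.pages)}


def to_json_body(page_texts: dict[str, str]) -> str:
    """Serialize the page texts; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(page_texts, ensure_ascii=False)
```

With the question's variables, this could be used roughly as pdf_bytes = blob_client_pdf.download_blob().readall(), then body = to_json_body(extract_page_texts(pdf_bytes)), and the upload_blob call stays as it is. Whether the spacing problem disappears depends on the PDF itself, so treat this as a starting point and not a guaranteed fix.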

Martin Thoma