Extracting text from a PDF file stored in Azure Blob Storage: the output has spaces between the characters in words
I am downloading a PDF file from a container in Azure:
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container=container)
blob_list = container_client.list_blobs()
for blob in blob_list:
    if str(blob.name).endswith(".pdf"):
        try:
            blob_client_pdf = container_client.get_blob_client(str(blob.name))
            blob_download_pdf = blob_client_pdf.download_blob()
Then I load it into an in-memory byte stream and use PyPDF2's PdfFileReader to read the current file:
from io import BytesIO
import PyPDF2

stream = BytesIO()
blob_download_pdf.download_to_stream(stream)
fileReader = PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
Then I get the text of the first page of the file:
text_pdf = fileReader.getPage(0).extractText()
And I upload it as JSON to a container:
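import json

dict_json = {}  # maps page index to extracted text (initialization assumed, not shown originally)
dict_json["0"] = text_pdf
body = json.dumps(dict_json)
blob_client_json = blob_service_client.get_blob_client("corpus", blob="1234.json")
blob_client_json.upload_blob(body, overwrite=True)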
But the output comes with a lot of spaces between the letters in words, as in the attached screenshot.
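To narrow down where the spaces come from, a check along these lines may help. It is only a sketch: `single_char_ratio` and `sample_text` are hypothetical names, and the sample string is made up to resemble the output, not taken from the real PDF. In the actual code, `text_pdf` would take the place of `sample_text`.

```python
import json


def single_char_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(tok) == 1 for tok in tokens) / len(tokens)


# Hypothetical sample resembling the spaced-out output; in the real code this
# would be the value of text_pdf returned by extractText().
sample_text = "T h i s  i s  a  t e s t"

# repr() makes every space and newline in the string visible.
print(repr(sample_text))

# A ratio close to 1.0 means almost every token is a single letter.
print(single_char_ratio(sample_text))

# Serializing to JSON and parsing it back returns the same string, so the
# JSON step itself neither adds nor removes spaces.
round_tripped = json.loads(json.dumps({"0": sample_text}))["0"]
print(round_tripped == sample_text)
```

If the ratio is already high on the string returned by `extractText()`, the spaces are introduced during extraction and not by `json.dumps` or the upload.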