Converting a Hindi PDF to Editable Text in Docx with Proper Text Encoding Detection and Conversion

Question

This code is a Python script to convert a PDF file to a .docx file. It performs the following steps:

1. Import the necessary libraries and modules, including codecs, chardet, pdfminer, and python-docx.
2. Detect the text encoding of the PDF file by opening it in binary mode and passing its contents to the chardet library's detect function. The function returns a dictionary of encoding information, and the script stores the value of the "encoding" key in the "encoding" variable.
3. Use pdfminer to convert the PDF file to text. PDFResourceManager is used to store shared resources such as fonts or images used by multiple pages. PDFPageInterpreter is used to process each page of the PDF and extract the text. The extracted text is stored in a StringIO object named "retstr".
4. Decode the extracted text using the codecs.decode function and the detected encoding, and store the result in the "text" variable.
5. Create a new Document object from the python-docx library, add a paragraph containing the converted text, and save the .docx file as "output.docx".

I have attached my experimental Python code below:

import codecs
import chardet
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
from docx import Document

# Detect the text encoding of the PDF file
with open("input.pdf", "rb") as pdf_file:
    result = chardet.detect(pdf_file.read())
    encoding = result["encoding"]

# Convert the PDF file to text using pdfminer
rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, laparams=laparams)
with open("input.pdf", "rb") as pdf_file:
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(pdf_file):
        interpreter.process_page(page)
    text = retstr.getvalue()

# Convert the text to Unicode using the detected encoding
text = codecs.decode(text, encoding)

# Save the converted text to a .docx file
doc = Document()
doc.add_paragraph(text)
doc.save("output.docx")

But I am getting an error on line 27 of the code.

TypeError: decode() argument 'encoding' must be str, not None

After changing line 27 to `text = text.decode(encoding)`, I now get:

AttributeError: 'str' object has no attribute 'decode'
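
For reference, both errors can be reproduced without a PDF. The sketch below is only an illustration and rests on two assumptions: that `chardet.detect` returned `{"encoding": None}` for the raw PDF bytes (which the first traceback suggests), and that the value read from the `StringIO` buffer is a Python 3 `str`. The helper `error_name` and the sample word are made up for the example.

```python
import codecs

hindi = "नमस्ते"              # a str: already decoded Unicode text
raw = hindi.encode("utf-8")   # bytes: the only kind of object that can be decoded


def error_name(func):
    """Call func and return the name of the exception it raises, or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Assumption: chardet found no encoding, so None ends up as the encoding argument.
first = error_name(lambda: codecs.decode(raw, None))

# A Python 3 str has no .decode() method at all.
second = error_name(lambda: hindi.decode("utf-8"))

# Decoding only makes sense in the bytes -> str direction.
roundtrip = raw.decode("utf-8")

print(first, second, roundtrip == hindi)  # TypeError AttributeError True
```

If those assumptions hold, the text returned by `retstr.getvalue()` needs no decoding step at all.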

You already have a string object, why do you need to decode it? — Pavan Kumar Polavarapu, Feb 07 '23 at 05:45
The reason for decoding the string is that the encoding of the text in the PDF file may not be in Unicode. By using the *chardet* library to detect the encoding of the text, we can then convert it to Unicode using the *codecs* library's *decode* function. This ensures that the text is in a format that can be properly processed and displayed, without encountering any encoding errors. — Pranav, Feb 07 '23 at 05:50
I will wait for other responses, but in my opinion all strings in Python are Unicode characters represented in UTF-8 format. If you are sure that the data will be in another format, you might want to use BytesIO. — Pavan Kumar Polavarapu, Feb 07 '23 at 06:07
I don't think it's a good idea to feed the raw PDF to `chardet`. PDF is a binary format, not text – the text contained is embedded and encoded in some way or another, but `chardet` wouldn't know. It's like guessing the brand of different potato chips by taste, but you don't unpack them but instead start chewing on the packaging. — lenz, Feb 07 '23 at 21:13
I don't know much about how PDFs are built up internally, but I'm convinced they contain metadata that specifies how the text is encoded. A solid PDF library should be able to properly parse this information and return the extracted text correctly decoded. However, I've seen PDFs where the creators used tricks or hacks (for whatever reason), such that the text looks fine when displayed on screen, but when extracting it (e.g. through copy-pasting) you end up with garbled character salad. `chardet` _may or may not_ help you recover from such a situation (using it after extraction). — lenz, Feb 07 '23 at 21:20
"_the encoding of the text in the PDF file may not be in Unicode_" - The encoding is guaranteed to not be in Unicode because Unicode is not an encoding - it's a standard. If you can provide a [mre] which demonstrates a specific re-creatable problem (which means some data is probably needed, as well as code) then you may be more likely to get a good answer. — andrewJames, Feb 07 '23 at 23:12

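Following the comments above, one possible direction is to drop the `chardet` and `codecs.decode` steps and pass the text from pdfminer straight to python-docx. This is a rough sketch, not a verified fix: it assumes the pdfminer.six and python-docx packages are installed, and the function names `split_paragraphs` and `pdf_to_docx` are made up for illustration. It also does nothing for PDFs whose fonts map glyphs to the wrong characters, in which case the extracted Hindi text may still come out garbled.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines."""
    # pdfminer's text output ends each page with a form feed (\x0c);
    # treat it as a paragraph break so it never reaches the .docx file.
    text = text.replace("\x0c", "\n\n")
    blocks = re.split(r"\n\s*\n", text)
    # Collapse the line breaks inside each paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def pdf_to_docx(pdf_path, docx_path):
    """Extract text from a PDF and write it to a .docx file, one paragraph at a time."""
    # Third-party imports are kept local so split_paragraphs works without them.
    from pdfminer.high_level import extract_text  # from pdfminer.six
    from docx import Document                     # from python-docx

    text = extract_text(pdf_path)  # already a str, so no decode step
    doc = Document()
    for paragraph in split_paragraphs(text):
        doc.add_paragraph(paragraph)
    doc.save(docx_path)


if __name__ == "__main__":
    sample = "नमस्ते\nदुनिया\n\nदूसरा\x0c"
    print(split_paragraphs(sample))
```

With the file names from the question, the call would be `pdf_to_docx("input.pdf", "output.docx")`.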
