I need to extract text from some PDFs, and I use PyMuPDF to do this. However, I run into a ligature problem when writing the selected text to a text file.
I use the following code snippet to read the PDF:
import fitz  # PyMuPDF

pdf = fitz.open("file_path")
full_text = ""
for page_n in range(pdf.page_count):
    page = pdf.load_page(page_n)
    full_text += page.get_text()
pdf.close()
# do some processing to get the desired text
desire_text = ...
And I use the following code snippet to write it to a text file:
with open('output.txt', 'w') as f:
    f.write(desire_text)
but I get the following error:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
33 with open('output.txt', 'w') as f:
---> 34 f.write(desire_text)
File c:\Python311\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 2: character maps to <undefined>
I know that the PDF contains ligatures such as "ffl", which cause the issue (the character in the error, U+FB02, is the "fl" ligature). I can replace them manually using string replace, but I don't think handling this manually would be efficient for large PDFs.
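For reference, here is a minimal sketch of two approaches that might avoid the manual replacement, using only the standard library. It assumes the extracted text is an ordinary Python `str`. The helper names `expand_ligatures` and `write_text` and the sample string are hypothetical and only for illustration. The first option writes the file as UTF-8, which can represent the ligature characters as they are. The second applies Unicode NFKC normalization, which decomposes compatibility characters such as U+FB02 into plain letters.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Decompose compatibility characters (e.g. U+FB02) into plain letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


def write_text(path: str, text: str) -> None:
    """Write text as UTF-8 rather than the platform default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


if __name__ == "__main__":
    # Hypothetical stand-in for the text returned by page.get_text()
    sample = "\ufb02ow and e\ufb04uent"
    print(expand_ligatures(sample))  # flow and effluent
    # Option 1: keep the ligatures, but write with an encoding that supports them
    write_text("output.txt", sample)
```

Note that NFKC is broader than ligature replacement: it also rewrites other compatibility characters (for example, a non-breaking space becomes a regular space, and superscript digits become plain digits), so it may not be appropriate if the text must stay character-for-character identical to the PDF. PyMuPDF also exposes a `TEXT_PRESERVE_LIGATURES` extraction flag, which may be worth checking in its documentation as an alternative.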