I extracted the text of a PDF file using pdfminer and noticed that in several places the two letters 'fi' were replaced
by a strange non-ASCII character, the single-glyph ligature 'ﬁ' (U+FB01).
An easy way to correct this problem seems to be
content = re.sub('\ufb01', 'fi', content)  # '\ufb01' is the single-character ligature
However, I could only correct the problem because I happened to notice it, and it is very hard to spot. I only noticed because I was writing a report in LaTeX about a mistake my code was making, caused by an incorrect classification that spaCy assigned to the word 'fortified' (written with this character). At that point, LaTeX failed to produce the DVI file. When I checked, I found that the two characters 'fi' had been replaced by something else.
This is probably some kind of PDF font problem, where the font draws 'fi' as one ligature glyph and the extractor keeps it as one character.
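If the odd characters really are Unicode compatibility ligatures (an assumption, since I have only confirmed the 'fi' case), a more general fix than one substitution per character might be Unicode NFKC normalization. The sketch below uses only the standard library; the helper name fold_ligatures is made up for illustration.

```python
import unicodedata

def fold_ligatures(text):
    # NFKC maps compatibility characters (e.g. U+FB01) to their plain equivalents.
    # Hypothetical helper name; not part of pdfminer or spaCy.
    return unicodedata.normalize("NFKC", text)

sample = "forti\ufb01ed"          # 'fortified' written with the single-character ligature
cleaned = fold_ligatures(sample)
print(cleaned)                    # plain ASCII 'fortified'
```

Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms, non-breaking spaces), which may or may not be wanted before NLP.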
Is there a list of known problems like this that I can anticipate and fix automatically before any NLP step? Or is there a way to use spaCy to check whether a given word is unknown (I believe the word 'fortified' with the strange replacement was unknown to spaCy)? Or should I search the extracted text for non-ASCII characters?
Which of these solutions would work?
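For the last option, a minimal sketch of what I have in mind: count every non-ASCII character in the extracted text and print its code point and Unicode name, so that unexpected ones stand out. It uses only the standard library; the function name non_ascii_report and the sample string are made up for illustration.

```python
import unicodedata
from collections import Counter

def non_ascii_report(text):
    # Count each character outside the ASCII range (code point > 127).
    counts = Counter(ch for ch in text if ord(ch) > 127)
    # One row per character: (char, code point, Unicode name, count), most frequent first.
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]

# Made-up sample: two ligatures and one legitimate accented letter.
report = non_ascii_report("forti\ufb01ed caf\u00e9 forti\ufb01ed")
for row in report:
    print(row)
```

This only lists candidates; legitimate characters such as accented letters show up too, so the output still needs a manual look before replacing anything.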