I converted a PDF file to text using pdfminer and noticed that in several places a strange non-ASCII character 'ﬁ' (U+FB01) appears in place of the two letters 'fi'.

An easy way to correct this problem seems to be

    content = re.sub('\ufb01', 'fi', content)

However, I could only correct the problem because I happened to notice it, and it is very hard to notice. I only caught it because I was writing a report in LaTeX about a mistake my code was making, caused by an incorrect classification that spaCy assigned to the word 'fortiﬁed' (containing this character). At that point I saw that the DVI file (the output of LaTeX) was coming out wrong. When I checked, I found that the two characters 'fi' had been replaced by something else.

This is probably some kind of PDF font problem.

Is there a list of problems like this that I can anticipate and fix automatically before any NLP processing? Or is there a way to use spaCy to check whether a given word is unknown (I believe the word 'fortiﬁed', with the strange replacement, was unknown to spaCy)? Or should I look for non-ASCII characters in the extracted text?

Which of these solutions work?
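
For reference, the third option (looking for non-ASCII characters) could be sketched roughly as follows. The helper name `non_ascii_report` and the sample string are made up for illustration; only the standard library is used.

```python
import unicodedata
from collections import Counter

def non_ascii_report(text):
    """Count each non-ASCII character in text, keyed by (char, Unicode name)."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return {(ch, unicodedata.name(ch, "UNKNOWN")): n for ch, n in counts.items()}

# Hypothetical sample: "fortified office" with both 'fi' pairs as U+FB01.
sample = "The forti\ufb01ed of\ufb01ce"
for (ch, name), n in non_ascii_report(sample).items():
    print(f"U+{ord(ch):04X} {name}: {n}")
```

A report like this only shows which unexpected characters are present; deciding what to replace them with is a separate step.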

DanielTheRocketMan
  • Those are called [ligatures](https://en.wikipedia.org/wiki/Orthographic_ligature#Ligatures_in_Unicode_(Latin_alphabets)); you'll find a table of ligature Unicode code points in that Wikipedia article. Their use is [discouraged](http://www.unicode.org/faq/ligature_digraph.html), so in a way this is indeed a problem of the pdf font in question, more exactly, of its **ToUnicode** map. – mkl Aug 12 '20 at 08:11
  • @mkl Is there any solution, especially when we don't have the source of the pdf, but just the pdf file itself? – peter.cyc Aug 14 '20 at 17:56
  • Simply replace the ligature Unicode code points with the code points of their constituents. – mkl Aug 14 '20 at 21:51
  • This is a GREAT question. I'm hitting this problem trying to check PDFs we create from text, and I see the two-character "fi" turning into the ligature "ﬁ", so the comparison for equality fails. The OP's question about how to deal with this mismatch is really unanswered here, except to say "yes, that's a problem. Good luck." Or "Simply replace characters..." as if that helped us to know all of the ligatures that might come along and whether replacing them by "code points of their constituents" does any good. – pauljohn32 Nov 21 '22 at 19:47
  • Yes! I have replaced the characters! See my answer below! – DanielTheRocketMan Nov 22 '22 at 21:36

1 Answer

In the end, I now automatically replace all the ligatures:

        import re

        # Replace each ligature code point with its constituent letters.
        if isinstance(content, str):
            content=re.sub(r'\uA732', 'AA', content)
            content=re.sub(r'\uA733', 'aa', content)
    
            content=re.sub(r'\u00C6', 'AE', content)
            content=re.sub(r'\u00E6', 'ae', content)
    
            content=re.sub(r'\uA734', 'AO', content)
            content=re.sub(r'\uA735', 'ao', content)
    
            content=re.sub(r'\uA736', 'AU', content)
            content=re.sub(r'\uA737', 'au', content)
            
            content=re.sub(r'\uA738', 'AV', content)
            content=re.sub(r'\uA739', 'av', content)
    
            content=re.sub(r'\uA73A', 'AV', content)
            content=re.sub(r'\uA73B', 'av', content)
    
            content=re.sub(r'\uA73C', 'AY', content)
            content=re.sub(r'\uA73D', 'ay', content)
            
            # Code points above U+FFFF need \U with 8 hex digits; \u reads only 4.
            content=re.sub(r'\U0001F670', 'et', content)
    
            content=re.sub(r'\uFB00', 'ff', content)
            content=re.sub(r'\uFB03', 'ffi', content)
            content=re.sub(r'\uFB04', 'ffl', content)
            content=re.sub(r'\uFB01', 'fi', content)
            content=re.sub(r'\uFB02', 'fl', content)
    
            content=re.sub(r'\u01F6', 'Hv', content)
            content=re.sub(r'\u0195', 'hv', content)
    
            content=re.sub(r'\u2114', 'lb', content)
            
            content=re.sub(r'\u1EFA', 'lL', content)
            content=re.sub(r'\u1EFB', 'll', content)
    
            content=re.sub(r'\u0152', 'OE', content)
            content=re.sub(r'\u0153', 'oe', content)
    
            content=re.sub(r'\uA74E', 'OO', content)
            content=re.sub(r'\uA74F', 'oo', content)
            
            content=re.sub(r'\uFB06', 'st', content)
            
            # U+FB05 is the "long s + t" ligature (ſt), so it maps to 'st', not 'ft'.
            content=re.sub(r'\uFB05', 'st', content)
            
            content=re.sub(r'\uA728', 'TZ', content)
            content=re.sub(r'\uA729', 'tz', content)
            
            content=re.sub(r'\u1D6B', 'ue', content)
            content=re.sub(r'\uAB63', 'uo', content)        
            
            content=re.sub(r'\uA760', 'VY', content)
            content=re.sub(r'\uA761', 'vy', content)
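
The same replacement can also be written as a lookup table, which is easier to extend than a long run of `re.sub` calls. The following is only a sketch: the names `LIGATURES`, `expand_ligatures` and `expand_compat` are made up for illustration, and the table lists just the U+FB00 to U+FB06 ligatures rather than the full set above. For comparison it includes the standard library's NFKC normalization, which expands those code points as well.

```python
import unicodedata

# Partial table (assumption: extend it with the other pairs from the answer).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)

def expand_ligatures(text):
    """Replace only the code points listed in LIGATURES."""
    return text.translate(_TABLE)

def expand_compat(text):
    """Apply Unicode NFKC normalization to the whole string."""
    return unicodedata.normalize("NFKC", text)

print(expand_ligatures("forti\ufb01ed"))
print(expand_compat("o\ufb03ce"))
```

NFKC is not a drop-in substitute for the explicit table. It also rewrites other compatibility characters (for example, a superscript two becomes a plain `2`), and it leaves letters such as æ and œ untouched because they have no Unicode decomposition.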
DanielTheRocketMan