1

When I compare two strings, the comparison reports that they are not equal even though they look identical.

I am extracting text from two PDFs. The extracted text looks the same, but I can see a slight font change in one of the strings. Why does the comparison fail?

str1 = 'Conﬁrmations'

str2 = 'Confirmations'

str1 == str2

False

Matteo
bh7781

3 Answers

2

You need to compare the normalized forms of the strings to ignore irrelevant typographical differences.

For example:

In [59]: import unicodedata

In [60]: str1 = 'Conﬁrmations'

In [61]: str2 = 'Confirmations'

In [62]: str1 == str2
Out[62]: False

In [63]: unicodedata.normalize('NFKD', str1) == unicodedata.normalize('NFKD', str2)
Out[63]: True
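
If this comparison is needed in several places, it can be wrapped in a small helper. The following is only a sketch: the name `same_text` and the default of NFKC are assumptions, not part of the original answer. NFKC and NFKD both expand compatibility characters such as the "fi" ligature, so either form works for this case.

```python
import unicodedata


def same_text(a, b, form="NFKC"):
    """Return True if a and b are equal after Unicode normalization.

    `same_text` is a hypothetical helper name; `form` may be any form
    accepted by unicodedata.normalize ('NFC', 'NFD', 'NFKC', 'NFKD').
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


str1 = "Con\ufb01rmations"  # contains U+FB01 LATIN SMALL LIGATURE FI
str2 = "Confirmations"      # plain "f" followed by "i"

print(str1 == str2)           # raw comparison
print(same_text(str1, str2))  # comparison after normalization
```

Note that the canonical forms NFC and NFD leave the ligature untouched, so a compatibility form (NFKC or NFKD) is required here.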
sh7
1

The problem is that the "fi" in the first string is a single ligature character (U+FB01, see https://en.wikipedia.org/wiki/Typographic_ligature), while in the second string it is the two separate characters "f" and "i".

You can use a function that checks whether the ligature is present and replaces it with plain text:

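def ligature(string):
    # '\ufb01' is the single-character "fi" ligature (U+FB01)
    if '\ufb01' in string:
        # str.replace returns a new string, so the result must be reassigned
        string = string.replace('\ufb01', 'fi')
    return string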

You can add further checks for other ligatures if you find more in your text.
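
As a sketch of how several ligatures could be handled at once, a translation table avoids a chain of if statements. The names `LIGATURES` and `expand_ligatures` are hypothetical, and the table only covers five common Latin ligatures; extend it to whatever actually appears in your PDFs.

```python
# Mapping from single-character Latin ligatures to their plain-text spelling.
# Only a handful of common ones are listed; this is not an exhaustive table.
LIGATURES = str.maketrans({
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
})


def expand_ligatures(text):
    """Return text with the ligatures above replaced by plain letters."""
    return text.translate(LIGATURES)


print(expand_ligatures("Con\ufb01rmations") == "Confirmations")
```

Strings without any of these characters pass through unchanged, so the function is safe to apply to all extracted text before comparing.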

Matteo
  • Hi @bh7781 if this or any answer has solved your question please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – Matteo Jul 29 '19 at 07:21
1

Using the difflib library, you can see that there is a real difference between the strings you are comparing. To check this yourself, try the following:

>>> import difflib
>>> str2 = 'Confirmations'
>>> str1 = 'Conﬁrmations'
>>> print('\n'.join(difflib.ndiff([str1], [str2])))

which yields

- Conﬁrmations
?    ^

+ Confirmations
?    ^^

>>>
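
The diff shows where the strings differ but not which character is involved. One possible way to identify it, sketched here with the hypothetical helper `non_ascii`, is to list every non-ASCII character together with its position, code point, and Unicode name:

```python
import unicodedata


def non_ascii(text):
    """List (index, code point, Unicode name) for each non-ASCII character.

    `non_ascii` is a hypothetical helper name, not a standard-library function.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


str1 = "Con\ufb01rmations"  # the string containing the ligature
str2 = "Confirmations"

print(non_ascii(str1))
print(non_ascii(str2))
```

For the ligature string this reports a single character at index 3, matching the caret position in the diff output.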
s3nh