How to compare two strings with different unicode?

Question

When I compare two strings, Python reports that they are not equal even though they look identical.

I am extracting text from two PDFs. The extracted text looks the same, but one of the strings appears to be rendered in a slightly different font, and I don't understand why.

str1 = 'Conﬁrmations'

str2 = 'Confirmations'

str1 == str2

False

score 2 · Answer 1 · answered Oct 06 '21 at 09:09

You need to compare the normalized forms of the strings to ignore irrelevant typographical differences.

For example:

In [59]: import unicodedata

In [60]: str1 = 'Conﬁrmations'

In [61]: str2 = 'Confirmations'

In [62]: str1 == str2
Out[62]: False

In [63]: unicodedata.normalize('NFKD', str1) == unicodedata.normalize('NFKD', str2)
Out[63]: True
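If the comparison is needed in several places, it could be wrapped in a small helper. The sketch below is only an illustration: the name `same_text` and its `form` parameter are made up here, and it assumes NFKC (the composed counterpart of NFKD) is acceptable for your data. It also shows that the canonical forms (NFC/NFD) do not fold the ligature; only the compatibility forms (NFKC/NFKD) do.

```python
import unicodedata


def same_text(a, b, form="NFKC"):
    """Hypothetical helper: compare two strings after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


str1 = "Con\ufb01rmations"  # contains U+FB01 LATIN SMALL LIGATURE FI
str2 = "Confirmations"      # plain "f" followed by "i"

print(same_text(str1, str2))         # True: compatibility form expands the ligature
print(same_text(str1, str2, "NFC"))  # False: canonical form leaves the ligature as-is
```

Note that compatibility normalization also folds other characters (for example superscript digits and full-width letters), so it is worth checking that this is what you want before applying it to all extracted text.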

score 1 · Accepted Answer · answered Jul 26 '19 at 12:33


The problem is that the "fi" in the first string is a single ligature character (https://en.wikipedia.org/wiki/Typographic_ligature), while in the second string it is the two separate characters "f" and "i".

You can use a function that checks whether the ligature is present and replaces it with plain text:

def ligature(string):
    if 'ﬁ' in string:
        # str.replace returns a new string, so the result must be assigned
        string = string.replace('ﬁ', 'fi')
    return string

You can add further checks for other ligatures if you find more in your text.
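Instead of one if statement per ligature, a translation table can handle several at once. This is a rough sketch rather than a complete list: the names `LIGATURES` and `expand_ligatures` are invented for illustration, and the table only covers the five common Latin f-ligatures (U+FB00 to U+FB04).

```python
# Map each single ligature character to its plain-letter expansion.
LIGATURES = str.maketrans({
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
})


def expand_ligatures(text):
    """Hypothetical helper: replace the ligatures listed above with plain letters."""
    return text.translate(LIGATURES)


print(expand_ligatures("Con\ufb01rmations") == "Confirmations")
```

Unlike full Unicode normalization, this touches only the characters listed in the table, which may be preferable if the rest of the text should stay exactly as extracted.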

answered Jul 26 '19 at 12:33

Matteo


Hi @bh7781 if this or any answer has solved your question please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – Matteo Jul 29 '19 at 07:21

score 1 · Answer 3 · answered Jul 26 '19 at 12:35

Using the difflib library, you can see that there is a visible difference between the strings you want to compare. To check this yourself, try the following:

>>> import difflib
>>> str2 = 'Confirmations'
>>> str1 = 'Conﬁrmations'
>>> print('\n'.join(difflib.ndiff([str1], [str2])))

which yields

- Conﬁrmations
?    ^

+ Confirmations
?    ^^

>>>
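The diff shows where the strings differ but not which characters are involved. As a possible follow-up, the sketch below lists the code point and Unicode name of each character; `describe` is a made-up helper name, and `"<unnamed>"` is just a placeholder for characters that have no name in the Unicode database.

```python
import unicodedata


def describe(text):
    """Hypothetical helper: return (code point, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for code, name in describe("Con\ufb01"):
    print(code, name)
```

For the first string, the fourth character is reported as U+FB01 LATIN SMALL LIGATURE FI, whereas the second string has U+0066 followed by U+0069 at that position.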
