Not too surprising. Fingerprint clustering simply applies the fingerprint()
function to each cell and then groups the cells whose results are identical. Here is the result of fingerprint()
in the three cases you mention:
1
row   value                 value.fingerprint()
1.    "école"               "ecole"
2.    "école école ecole"   "ecole ecole"
2
row   value                 value.fingerprint()
1.    "école"               "ecole"
2.    "école école ecole"   "ecole ecole"
3
row   value                 value.fingerprint()
1.    "ecole"               "ecole"
2.    "école école école"   "ecole"
Why is the third case different? Because the fingerprint algorithm performs the following operations, in a strict order:
1. remove leading and trailing whitespace
" école école école " -> "école école école"
2. change all characters to their lowercase representation
"éCole écoLe école" -> "école école école"
3. remove all punctuation and control characters
"école-école, école" -> "école école école"
4. split the string into whitespace-separated tokens
"école école école" -> ["école", "école", "école"]
5. sort the tokens and remove duplicates
["école", "école", "école"] -> ["école"]
6. join the tokens back together
["école"] -> "école"
7. normalize extended western characters to their ASCII representation
"école" -> "ecole"
One might wonder whether operation 7 should come earlier, before the tokens are sorted and deduplicated in operation 5, so that "école" and "ecole" would count as the same token. But in your example the bug, if there is one, is more likely in the 3rd case. The string "ecole" is very different from the string "école école école", and in my opinion they should not be merged, just as the given name "John-John" is not equivalent to "John".
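To make the ordering question concrete, here is a small Python comparison of the two orders. It is a simplified sketch: trimming and punctuation handling are left out, the ASCII normalization is approximated with Unicode decomposition, and the function names are made up for the illustration.

```python
import unicodedata


def fold(s: str) -> str:
    """Approximate ASCII normalization: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_fold_last(value: str) -> str:
    # Order described above: deduplicate the tokens first, normalize last.
    return fold(" ".join(sorted(set(value.lower().split()))))


def key_fold_first(value: str) -> str:
    # Hypothetical alternative: normalize first, then deduplicate.
    return " ".join(sorted(set(fold(value.lower()).split())))


print(key_fold_last("école école ecole"))   # ecole ecole
print(key_fold_first("école école ecole"))  # ecole
```

With the normalization done first, "école école ecole" would get the same key as "école", so the rows in cases 1 and 2 would also end up in one cluster.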
EDIT: One of the developers agrees with you.