I think there is a bug (or a very surprising feature...) in the way OpenRefine handles diacritics in "key collision-fingerprint" clustering:

row 1 : école
row 2 : école école ecole

-> clustering -> 0 clusters

same issue with

row 1 : école
row 2 : école école ecole

-> 0 clusters

But this case works well:

row 1 : ecole
row 2 : école école école

-> 1 cluster

Mathieu Saby

1 Answer

Not too surprising. Fingerprint clustering simply applies the fingerprint() function to each cell and then groups the cells whose fingerprints are exactly equal. Here is the result of fingerprint() in the three cases you mention:

1

row value               value.fingerprint()
1.  école               ecole
2.  école école ecole   ecole ecole

2

row value               value.fingerprint()
1.  école               ecole
2.  école école ecole   ecole ecole

3

row value               value.fingerprint()
1.  ecole               ecole
2.  école école école   ecole

Why the difference in the third case? Because the fingerprint algorithm performs the following operations, in a strict order:

1. remove leading and trailing whitespace

" école école école " -> "école école école"

2. change all characters to their lowercase representation

"éCole écoLe école" -> "école école école"

3. remove all punctuation and control characters

"école, école. école!" -> "école école école"

4. split the string into whitespace-separated tokens

"école école école" -> ["école", "école", "école"]

5. sort the tokens and remove duplicates

["école", "école", "école"] -> ["école"]

6. join the tokens back together

["école"] -> "école"

7. normalize extended western characters to their ASCII representation

"école" -> "ecole"
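
The seven steps above can be sketched in Python as follows. This is only an illustration of the ordering, not OpenRefine's actual implementation: the names `to_ascii` and `fingerprint` are mine, punctuation is approximated with ASCII `string.punctuation`, and the ASCII normalization is approximated by stripping combining marks after NFD decomposition.

```python
import string
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(value: str) -> str:
    """Apply the seven steps in the order listed above (a sketch)."""
    s = value.strip()                       # 1. trim whitespace
    s = s.lower()                           # 2. lowercase
    s = "".join(                            # 3. drop punctuation / control chars
        ch for ch in s
        if ch.isspace()
        or (ch not in string.punctuation and unicodedata.category(ch) != "Cc")
    )
    tokens = s.split()                      # 4. split on whitespace
    tokens = sorted(set(tokens))            # 5. sort and remove duplicates
    joined = " ".join(tokens)               # 6. join
    return to_ascii(joined)                 # 7. ASCII normalization comes last


print(fingerprint("école"))                 # ecole
print(fingerprint("école école ecole"))     # ecole ecole
print(fingerprint("école école école"))     # ecole
```

Because duplicates are removed in step 5, before the accents are stripped in step 7, "école" and "ecole" are still two different tokens when deduplication happens. That is why the second value keeps two tokens and ends up with a different fingerprint.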

One might wonder whether operation 7 should be done earlier. But in your example, the bug, if there is one, is arguably in the third case. The string "ecole" is very different from the string "école école école", and in my opinion they should not be merged, just as the given name "John-John" is not equivalent to "John".
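
To see what moving the ASCII normalization earlier would change, here is a hypothetical variant (the name `fingerprint_ascii_first` is mine, and the punctuation step is left out for brevity). It is a sketch of the alternative ordering, not a description of any OpenRefine release.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint_ascii_first(value: str) -> str:
    """Hypothetical ordering: strip accents BEFORE tokenizing and deduplicating."""
    s = to_ascii(value.strip().lower())
    return " ".join(sorted(set(s.split())))


for value in ("école", "école école ecole", "ecole", "école école école"):
    print(repr(value), "->", repr(fingerprint_ascii_first(value)))
```

With this ordering all four values get the fingerprint "ecole", so every example in the question would fall into a single cluster. Whether that is desirable is exactly the point under discussion.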

EDIT: One of the developers agrees with you.

Ettore Rizza