7

Lets say I have 9 rows of records. Each 3 rows have the same value. For instance:

Mike  
Mike  
Mike  
John  
John  
John  
Ryan  
Ryan  
Ryan

Is there a way I can search for similarities of these records? For example spelling mistakes, additional characters, missing characters, etc. So, for example, the correct version is Mike, but there might be a record down in the list with value Mke which is incorrect (spelling mistake). How can I find this and replace it with the correct one?

The above example is obviously simplified. I actually have ~1mln rows. Right now to achieve the 'grouping' of the elements I just sort them alphabetically.

phuclv
  • 37,963
  • 15
  • 156
  • 475
Sergiu Prodan
  • 93
  • 1
  • 2
  • 7

3 Answers3

14

I was facing the exact same problem! With a few searches I could get and modify the following VBA code that will enable a function named =Similarity(). This function will output a number that goes from 0 to 1, according to the similarity of the two input cells.

  • How I used it:

I ordered alphabetically my column info and applied the formula. Then I created a Conditional Formatting Rule to highlight the ones with a high similarity rate (i.e.: at least 65%). Then I searched for each highlighted occurrence and fixed my records manually.

  • Usage:

    =Similarity(cell1, cell2)
    

Obs.: The similarity indicator goes from 0 to 1 (0% to 100%)

  • Example:

enter image description here

  • To use it, you must:

    1. Open VBE (Alt+F11)
    2. Insert Module
    3. Paste the following code into the Module Window

enter image description here

Code:

Public Function Similarity(ByVal String1 As String, _
    ByVal String2 As String, _
    Optional ByRef RetMatch As String, _
    Optional min_match = 1) As Single

Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long

If UCase(String1) = UCase(String2) Then
    Similarity = 1
Else:
    lngLen1 = Len(String1)
    lngLen2 = Len(String2)
    If (lngLen1 = 0) Or (lngLen2 = 0) Then
        Similarity = 0
    Else:
        b1() = StrConv(UCase(String1), vbFromUnicode)
        b2() = StrConv(UCase(String2), vbFromUnicode)
        lngResult = Similarity_sub(0, lngLen1 - 1, _
        0, lngLen2 - 1, _
        b1, b2, _
        String1, _
        RetMatch, _
        min_match)
        Erase b1
        Erase b2
        If lngLen1 >= lngLen2 Then
            Similarity = lngResult / lngLen1
        Else
            Similarity = lngResult / lngLen2
        End If
    End If
End If

End Function

Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
                                ByVal start2 As Long, ByVal end2 As Long, _
                                ByRef b1() As Byte, ByRef b2() As Byte, _
                                ByVal FirstString As String, _
                                ByRef RetMatch As String, _
                                ByVal min_match As Long, _
                                Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity *(RECURSIVE)

Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim I As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String

If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
    Exit Function '(exit if start/end is out of string, or length is too short)
End If

For lngCurr1 = start1 To end1
    For lngCurr2 = start2 To end2
        I = 0
        Do Until b1(lngCurr1 + I) <> b2(lngCurr2 + I)
            I = I + 1
            If I > lngLongestMatch Then
                lngMatchAt1 = lngCurr1
                lngMatchAt2 = lngCurr2
                lngLongestMatch = I
            End If
            If (lngCurr1 + I) > end1 Or (lngCurr2 + I) > end2 Then Exit Do
        Loop
    Next lngCurr2
Next lngCurr1

If lngLongestMatch < min_match Then Exit Function

lngLocalLongestMatch = lngLongestMatch
RetMatch = ""

lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
    RetMatch = RetMatch & strRetMatch1 & "*"
Else
    RetMatch = RetMatch & IIf(recur_level = 0 _
    And lngLocalLongestMatch > 0 _
    And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
    , "*", "")
End If


RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)


lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)

If strRetMatch2 <> "" Then
    RetMatch = RetMatch & "*" & strRetMatch2
Else
    RetMatch = RetMatch & IIf(recur_level = 0 _
    And lngLocalLongestMatch > 0 _
    And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
    Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
    , "*", "")
End If

Similarity_sub = lngLongestMatch

End Function
  • Output according to your data set:

enter image description here

phuclv
  • 37,963
  • 15
  • 156
  • 475
dot.Py
  • 5,007
  • 5
  • 31
  • 52
  • thank you for the answer, it works fine. the problem now becomes that I have ~11k "right" values. So manual work would require lots of time. do you have any ideas how would I automate this stuff? – Sergiu Prodan Mar 07 '17 at 20:51
  • Well... I know a little about stemming and lemmatization, but it's kinda hard to apply their algos for names, since names don't respect gramatical rules. I mean, both Myike, Myke and Miyke are gramatically (and socially?) acceptable, so how would your program know which value is the "correct" when it finds, for example, "Myke"? Should it replace by "Mike" or what? Maybe you can remove the duplicates from the 11k rows that have a "high degree of similarity", and then you can create a function for replacing a name by another, using a "name dictionary", `IF()`, `SUBSTITUTE()` and `OR()`. – dot.Py Mar 07 '17 at 21:40
1

Text similarity can get quite complicated, depending on just how far you want to go. A complete survey of all the different algorithms can be found in this paper A Survey of Text Similarity Approaches (Gomaa & Fahmy, IJCA 2013). It can hurt your head, but it's good stuff.

Specific to VBA you can refer to this previous answer on SO

Community
  • 1
  • 1
PeterT
  • 8,232
  • 1
  • 17
  • 38
0

I don't know a completely automated way to do that. There is an Excel "Fuzzy Match" add-in that might be helpful: https://www.microsoft.com/en-us/download/details.aspx?id=15011

I have used it. It works most of the time but struggles with larger worksheets.

jce
  • 133
  • 1
  • 10