
I'm working on an application that will require the Levenshtein algorithm to calculate the similarity of two strings.

A long time ago I adapted a C# version (which can easily be found on the internet) to VB.NET, and it looks like this:

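Public Function Levenshtein1(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    ' d(i, j) holds the edit distance between the first i characters of s1
    ' and the first j characters of s2.
    Dim d(n, m) As Integer
    Dim cost As Integer
    Dim s1c As Char

    ' Reaching or leaving the empty prefix takes one edit per character.
    For i = 1 To n
        d(i, 0) = i
    Next
    For j = 1 To m
        d(0, j) = j
    Next

    For i = 1 To n
        s1c = s1(i - 1)

        For j = 1 To m
            ' Substitution is free when the characters match.
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            ' Cheapest of deletion, insertion and substitution.
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    ' Convert the distance into a similarity percentage (100 = identical).
    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function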

Then, trying to tweak it and improve its performance, I ended up with this version:

Public Function Levenshtein2(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim s1c As Char
    Dim cost As Integer

    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j

            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

Basically, I thought that the array of distances d(,) could be initialized inside the main for cycles, instead of requiring two additional cycles up front. I really thought this would be a huge improvement... unfortunately, not only does it not improve on the original, it actually runs slower!

I have already tried to analyze both versions by looking at the generated IL code but I just can't understand it.

So, I was hoping that someone could shed some light on this issue and explain why the second version (even though it has fewer for cycles) runs slower than the original.

NOTE: The time difference is about 0.15 nanoseconds. This doesn't look like much, but when you have to check thousands of millions of strings... the difference becomes quite noticeable.

xfx

3 Answers


It's because of this:

    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j 'THIS LINE HERE

You used to initialize this row once, at the beginning, but now you are initializing it n times. Every array write has a cost, and you are now doing n * m of them instead of m. You could change the line to: If i = 1 Then d(0, j) = j. However, in my tests you still end up with a slightly slower version than the original, and that makes sense too: you are now evaluating this If statement n * m times, which again has a cost.

Moving the initialization out of the loops, as in the original version, is a lot cheaper: it ends up being O(n + m). Since the overall algorithm is O(n * m), any work you can move out into a linear step is going to be a win.
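
A minimal sketch of that fix, assuming everything else stays as in the question. The module and function names (LevenshteinFixSketch, Levenshtein2Fixed) are hypothetical. The first row is filled once before the main loops, while the first-column write stays in the outer loop, where it only runs n times:

```vbnet
Imports System

Module LevenshteinFixSketch
    Public Function Levenshtein2Fixed(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length

        Dim d(n, m) As Integer
        Dim s1c As Char
        Dim cost As Integer

        ' Fill the first row once: m writes instead of n * m.
        For j As Integer = 1 To m
            d(0, j) = j
        Next

        For i As Integer = 1 To n
            d(i, 0) = i ' first column: one write per row
            s1c = s1(i - 1)

            For j As Integer = 1 To m
                If s1c = s2(j - 1) Then
                    cost = 0
                Else
                    cost = 1
                End If

                d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
            Next
        Next

        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function

    Sub Main()
        ' "kitten" -> "sitting" has distance 3 over a maximum length of 7.
        Console.WriteLine(Levenshtein2Fixed("kitten", "sitting"))
    End Sub
End Module
```

Whether this beats the original on a given machine still has to be measured; the sketch only removes the repeated first-row writes.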

aquinas
  • How did I miss that!? Thank you very much! – xfx Sep 12 '12 at 23:45
  • I just moved the initialization of the rows outside of the inner for cycle and now the second version of the algorithm is about 40% faster than the original -- so, once again, thank you! (I'm mad at myself for not seeing that!) – xfx Sep 12 '12 at 23:49

You can split the following line:

d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)

as follows:

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)

This way you avoid one addition.

Furthermore, you can move the last "min" comparison inside the If statement and avoid assigning cost:

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
End If

So you save an addition whenever s1c = s2(j - 1).
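
Put together, the two changes might look like the following sketch. It assumes the two initialization loops from the original version. The names LevenshteinBranchSketch and LevenshteinBranch are hypothetical, and tmp is declared as an Integer local:

```vbnet
Imports System

Module LevenshteinBranchSketch
    Public Function LevenshteinBranch(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length

        Dim d(n, m) As Integer
        Dim tmp As Integer
        Dim s1c As Char

        For i As Integer = 1 To n
            d(i, 0) = i
        Next
        For j As Integer = 1 To m
            d(0, j) = j
        Next

        For i As Integer = 1 To n
            s1c = s1(i - 1)

            For j As Integer = 1 To m
                ' Cheapest of deletion and insertion, with a single "+ 1".
                tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1

                If s1c = s2(j - 1) Then
                    ' Characters match: the diagonal value is used as-is.
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
                Else
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
                End If
            Next
        Next

        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function

    Sub Main()
        Console.WriteLine(LevenshteinBranch("kitten", "sitting"))
    End Sub
End Module
```

The results are identical to the original formula; whether the saved additions are measurable depends on what the JIT compiler already does with the original expression.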

goicox

Not a direct answer to your question, but for faster performance you should consider using a jagged array (an array of arrays) instead of a multidimensional array. See "What are the differences between a multidimensional array and an array of arrays in C#?" and "Why are multi-dimensional arrays in .NET slower than normal arrays?".

You will see there that the jagged array version compiles to a code size of 7, as opposed to 10 for the multidimensional array.
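
For readers less familiar with the two array kinds, here is a small sketch of how each is declared and indexed in VB.NET. The module and variable names are made up for illustration:

```vbnet
Imports System

Module ArrayKindsSketch
    Sub Main()
        ' Multidimensional (rectangular) array: a single object, indexed as rect(i, j).
        ' The bounds are upper bounds, so this is 3 rows by 4 columns.
        Dim rect(2, 3) As Integer
        rect(1, 2) = 42

        ' Jagged array: an array whose elements are 1-D arrays, indexed as jagged(i)(j).
        Dim jagged()() As Integer = New Integer(2)() {}
        For i As Integer = 0 To 2
            jagged(i) = New Integer(3) {} ' each row has to be allocated separately
        Next
        jagged(1)(2) = 42

        ' Both hold the same value at the same logical position.
        Console.WriteLine(rect(1, 2) = jagged(1)(2))
    End Sub
End Module
```

The extra allocation per row is the price of the jagged layout; the claim in the linked questions is that the cheaper element access makes up for it.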

The code below uses a jagged array (an array of single-dimensional arrays):

Public Function Levenshtein3(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d()() As Integer = New Integer(n)() {}
    Dim cost As Integer
    Dim s1c As Char

    For i = 0 To n
        d(i) = New Integer(m) {}
    Next

    For j = 1 To m
        d(0)(j) = j
    Next

    For i = 1 To n
        d(i)(0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function
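
To check on your own machine whether the jagged version actually helps, a rough timing harness along these lines may be useful. TimeIt and the placeholder candidate are hypothetical; substitute AddressOf Levenshtein1, AddressOf Levenshtein3 and so on. It is only a sketch: it uses a single warm-up call and one pair of strings, so treat the numbers as indicative.

```vbnet
Imports System
Imports System.Diagnostics

Module TimingSketch
    ' Hypothetical helper: runs f(a, b) `iterations` times and returns the elapsed time.
    Function TimeIt(f As Func(Of String, String, Double),
                    a As String, b As String, iterations As Integer) As TimeSpan
        Dim sink As Double = f(a, b) ' warm-up call, so JIT compilation is not timed
        Dim sw As Stopwatch = Stopwatch.StartNew()
        For k As Integer = 1 To iterations
            sink += f(a, b) ' accumulate the result so the call is not dead code
        Next
        sw.Stop()
        Return sw.Elapsed
    End Function

    Sub Main()
        ' Placeholder candidate; replace with AddressOf Levenshtein1, AddressOf Levenshtein3, etc.
        Dim candidate As Func(Of String, String, Double) =
            Function(x, y) CDbl(x.Length + y.Length)

        Dim elapsed As TimeSpan = TimeIt(candidate, "kitten", "sitting", 1000000)
        Console.WriteLine("Elapsed: {0} ms", elapsed.TotalMilliseconds)
    End Sub
End Module
```

Run it from a Release build without the debugger attached; otherwise the JIT optimizations that matter for this comparison are disabled.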
Seph
  • Thank you for the suggestion Seph. I will give it a try but I'm afraid that the for cycle to initialize the second dimension of the array will take away any improvements the jagged arrays could provide. – xfx Sep 14 '12 at 11:19
  • It is definitely an improvement. About 10 - 15 milliseconds less than version 2 of the algorithm. – xfx Sep 14 '12 at 11:35