
I'm looking for a method to find similar symbol names, where those names are often a combination of text and numbers, like "value1", "_value2", "test_5" etc.

To find similar names I tried the Levenshtein distance, but to the algorithm the difference between "_value1" and ".value1" is the same as the difference between "_value1" and "_value8". Is there a way to compare strings that does not allow digits to be changed?

The code I'm currently using is from http://www.dotnetperls.com/levenshtein
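
To make the problem concrete, here is a minimal sketch of the plain algorithm. It is a condensed version written for this example, not the linked code, and the class name `PlainLevenshtein` is made up for illustration.

```csharp
using System;

// Hypothetical helper class, condensed for illustration (not the linked code).
public static class PlainLevenshtein
{
    // Classic edit distance: insert, delete and substitute each cost 1,
    // no matter which characters are involved.
    public static int Distance(string s, string t)
    {
        int n = s.Length, m = t.Length;
        var d = new int[n + 1, m + 1];

        // Turning a prefix into the empty string costs one edit per character.
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }
}
```

With this sketch, `PlainLevenshtein.Distance("_value1", ".value1")` and `PlainLevenshtein.Distance("_value1", "_value8")` both return 1, which is exactly the behaviour I want to avoid.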

Thanks in advance!

Robin

2 Answers


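You can give any unequal comparison that involves a digit a very high cost, like 200. The aim is to keep a distance of 1 (similar) between "_text1" and ".text1", but get a distance of 200 (very dissimilar) between "text1" and "text10". For this to hold, the insertion and deletion costs in step 6 need the same weighting; otherwise a digit can still be deleted and re-inserted at a cost of 1 each.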

You would do this by changing steps two ...

// Step 2
d[0, 0] = 0;

for (int i = 1; i <= n; i++)
{
    if('0' <= s[i - 1] && s[i - 1] <= '9')
        d[i, 0] = d[i-1, 0] + 200;
    else
        d[i, 0] = d[i-1, 0] + 1;
}


for (int j = 1; j <= m; j++)
{
    if('0' <= t[j - 1] && t[j - 1] <= '9')
        d[0, j] = d[0, j-1] + 200;
    else
        d[0, j] = d[0, j-1] + 1;
}

... and five ...

// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
if (('0' <= t[j - 1] && t[j - 1] <= '9') ||
    ('0' <= s[i - 1] && s[i - 1] <= '9'))
    cost *= 200;
Kittsil
    Hello Kittsil, thanks for the fast reply! That's almost the solution I was looking for, except that, due to the last "Math.Min(...)", the final distance between e.g. "var_3" and "var_1" is 2, which is still pretty low... but I can probably figure this out myself. By the way, your first "for(int i...)" needs to start with "i = 1". – Robin Aug 22 '16 at 08:58

In regard to Kittsil's answer, here is my complete solution. I'm not sure if it's completely correct, but it seems to work for me.

        ushort n = (ushort)s.Length;
        ushort m = (ushort)t.Length;
        ushort[,] d = new ushort[n + 1, m + 1];

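        // Step 1
        // No early return for empty strings: the weighted initialization in
        // step 2 already handles them, and returning the plain length would
        // ignore the digit weight (e.g. "" vs. "1" should cost 200, not 1).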

        // Step 2
        d[0, 0] = 0;
        for (int i = 1; i <= n; i++)
        {
            if ('0' <= s[i - 1] && s[i - 1] <= '9')
                d[i, 0] = (ushort)(d[i - 1, 0] + 200);
            else
                d[i, 0] = (ushort)(d[i - 1, 0] + 1);
        }


        for (int j = 1; j <= m; j++)
        {
            if ('0' <= t[j - 1] && t[j - 1] <= '9')
                d[0, j] = (ushort)(d[0, j - 1] + 200);
            else
                d[0, j] = (ushort)(d[0, j - 1] + 1);
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            // Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                bool isIdentical = t[j - 1] == s[i - 1];
                bool isNumber = ('0' <= t[j - 1] && t[j - 1] <= '9') || ('0' <= s[i - 1] && s[i - 1] <= '9');

                int cost1 = isIdentical ? 0 : (isNumber ? 200 : 1);
                int cost2 = isNumber ? 200 : 1;

                // Step 6
                d[i, j] = (ushort)(Math.Min(Math.Min(d[i - 1, j] + cost2, d[i, j - 1] + cost2), d[i - 1, j - 1] + cost1));
            }
        }
        // Step 7
        return d[n, m];
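
For a quick check, the same idea can be packed into a self-contained sketch. The names `WeightedLevenshtein`, `DigitCost` and `Weight` are my own choices for illustration. Two things differ from the code above and are assumptions on my part: it uses `int` instead of `ushort` so long inputs cannot overflow, and it weights a deletion by the deleted character and an insertion by the inserted character, rather than by either character of the pair.

```csharp
using System;

// Hypothetical, self-contained variant of the weighted distance (a sketch).
public static class WeightedLevenshtein
{
    // Assumed penalty for any edit that touches a digit.
    private const int DigitCost = 200;

    // Cost of inserting or deleting a single character.
    private static int Weight(char c)
    {
        return ('0' <= c && c <= '9') ? DigitCost : 1;
    }

    public static int Distance(string s, string t)
    {
        int n = s.Length, m = t.Length;
        var d = new int[n + 1, m + 1];

        // First column and row: delete or insert every character of a prefix.
        for (int i = 1; i <= n; i++) d[i, 0] = d[i - 1, 0] + Weight(s[i - 1]);
        for (int j = 1; j <= m; j++) d[0, j] = d[0, j - 1] + Weight(t[j - 1]);

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                char a = s[i - 1], b = t[j - 1];
                // A substitution is free for equal characters and costs
                // DigitCost as soon as one side is a digit.
                int substitute = (a == b) ? 0 : Math.Max(Weight(a), Weight(b));

                int viaDelete = d[i - 1, j] + Weight(a);
                int viaInsert = d[i, j - 1] + Weight(b);
                int viaSubstitute = d[i - 1, j - 1] + substitute;

                d[i, j] = Math.Min(Math.Min(viaDelete, viaInsert), viaSubstitute);
            }
        }
        return d[n, m];
    }
}
```

With this sketch, `WeightedLevenshtein.Distance("_value1", ".value1")` returns 1, while `Distance("_value1", "_value8")`, `Distance("var_3", "var_1")` and `Distance("text1", "text10")` each return 200, so names that differ only in a digit no longer look similar.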
Robin