comparing strings and comparing how close they match

Question

I extract exceptions from a log. Here is an example:

Exception: System.InvalidOperationException: Collection was modified; enumeration operation may not execute. at System.Collections.Generic.List`1.Enumerator.MoveNextRare() at test.Modules.UI.Table.<>c_DisplayClass2.b_0() at System.Win

Sometimes the logs are in a different language, so the entry looks like this:

Exception: System.InvalidOperationException: La colección fue modificada, la operación de enumeración no puede ejecutar. at System.Collections.Generic.List`1.Enumerator.MoveNextRare() at Test.Modules.UI.Table.<>c_DisplayClass2.b_0() at System.Win

As you can see, only the message part of the exception differs because it is in a different language; the part after it is identical. I have all these exceptions stored in a database, each trimmed to 300 characters. They are often much longer, but 300 characters is sufficient to tell whether they are the same.

So I was thinking of skipping the exception message and comparing the next 300 characters after it, but it is going to be extremely difficult to know where the message ends: nothing specific marks its start or end.

Any ideas how I could overcome this? Maybe I could just use Levenshtein distance to highlight close matches, then filter those and set up an interface that lets me link exceptions once I have manually identified them as the same exception written in a different language?

My end goal is to review thousands of these logs and count how many of the exceptions found are the same. Most of the logs are in English, but maybe 25% are not. Normally I could just run a query for an exact match on the exception, but because the language of the message part differs, it is probably only going to be a 60-70% match. There might be rare cases where the part after the message closely matches a different exception, but that would be rare, so it is not much of a concern.

I need to do this in PHP

Are you sure all exceptions do not end with a colon? – Cups, Nov 03 '12 at 16:09

score 0 · Answer 1 · answered Nov 03 '12 at 16:51

Not 100% robust, but you could match on both the text before the second colon (the exception type) AND the text that follows the first occurrence of the word "at". I bet the word "at" is followed by a new line in the original log, so the word plus new line is very unlikely to appear in the exception message itself, making it a decent choice as a delimiter.
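A minimal sketch of that idea, written in Python for illustration (the question targets PHP, where the same regular expressions should carry over to `preg_match`). The names `split_exception` and `match_key` are hypothetical. It assumes every entry starts with `Exception: <type>:` and that the first whitespace-delimited "at" after the type marks the start of the stack trace; a message that itself contains " at " would be cut too early. The two sample frames differ in case (`test.Modules` vs `Test.Modules`), so the key is lowercased.

```python
import re

# Assumption: entries look like "Exception: <type>: <message> at <frame> at <frame> ..."
HEADER = re.compile(r"\s*Exception:\s*([^:\s]+)\s*:")
FRAME_START = re.compile(r"\sat\s")


def split_exception(entry):
    """Return (exception_type, stack_text), discarding the localized message."""
    header = HEADER.match(entry)
    if header is None:
        return "", ""
    exc_type = header.group(1)
    rest = entry[header.end():]
    # The first " at " after the type is treated as the start of the stack trace.
    frame = FRAME_START.search(rest)
    stack = rest[frame.end():] if frame else ""
    return exc_type, stack.strip()


def match_key(entry):
    """Build a language-independent key from the type and the stack trace."""
    exc_type, stack = split_exception(entry)
    return (exc_type + "|" + stack).lower()


ENGLISH = (
    "Exception: System.InvalidOperationException: Collection was modified; "
    "enumeration operation may not execute. at "
    "System.Collections.Generic.List`1.Enumerator.MoveNextRare() at "
    "test.Modules.UI.Table.<>c_DisplayClass2.b_0() at System.Win"
)
SPANISH = (
    "Exception: System.InvalidOperationException: La colección fue modificada, "
    "la operación de enumeración no puede ejecutar. at "
    "System.Collections.Generic.List`1.Enumerator.MoveNextRare() at "
    "Test.Modules.UI.Table.<>c_DisplayClass2.b_0() at System.Win"
)

if __name__ == "__main__":
    print(match_key(ENGLISH) == match_key(SPANISH))
```

Storing such a key in its own database column would let an exact-match query group the English and non-English entries together.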

I think that in any scheme you devise, you want to ignore the exception message entirely. You aren't going to find common structure between languages, so letting the message text contribute to the match ranking will only dilute the confidence of your matching.
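To illustrate the dilution, here is a rough comparison in Python (for illustration only; the question targets PHP). It uses `difflib.SequenceMatcher` as a stand-in for a Levenshtein-style score, which is a different measure from Levenshtein distance but behaves similarly for this purpose. The helper names `similarity` and `stack_only` are hypothetical, and the sketch assumes the first " at " in an entry marks the start of the stack trace.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()


def stack_only(entry):
    """Drop everything up to the first ' at ' (assumed start of the stack trace)."""
    _, _, stack = entry.partition(" at ")
    return stack


ENGLISH = (
    "Exception: System.InvalidOperationException: Collection was modified; "
    "enumeration operation may not execute. at "
    "System.Collections.Generic.List`1.Enumerator.MoveNextRare() at "
    "test.Modules.UI.Table.<>c_DisplayClass2.b_0() at System.Win"
)
SPANISH = (
    "Exception: System.InvalidOperationException: La colección fue modificada, "
    "la operación de enumeración no puede ejecutar. at "
    "System.Collections.Generic.List`1.Enumerator.MoveNextRare() at "
    "Test.Modules.UI.Table.<>c_DisplayClass2.b_0() at System.Win"
)

if __name__ == "__main__":
    # Whole entry: the translated message drags the score down.
    print("full entry:", similarity(ENGLISH, SPANISH))
    # Stack trace only: the localized message no longer affects the score.
    print("stack only:", similarity(stack_only(ENGLISH), stack_only(SPANISH)))
```

With the message excluded, a fuzzy threshold is only needed for stack traces that genuinely differ, not for translations of the same exception.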
