I'm trying to quantify the difference between two strings as part of a change-monitor system.
The issue I'm having is that the strings are large: I am often dealing with strings of 100K+ characters.
I'm currently using Levenshtein distance, but computing the Levenshtein distance for large strings is very inefficient. The implementations I've found run in O(mn) time, even those that reduce memory use to O(min(m, n)).
Since both strings are of approximately the same length, the distance calculation process can take many seconds.
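For reference, this is roughly the kind of computation I mean. It is a minimal Python sketch of the standard two-row dynamic-programming approach, not my actual code, and the names `levenshtein` and `change_ratio` are purely illustrative. With two strings of 100K characters each, the inner loop body runs about 10^10 times, which is where the time goes.

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming Levenshtein distance (illustrative sketch)."""
    # Keep the shorter string along the row so memory stays O(min(m, n)).
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def change_ratio(a: str, b: str) -> float:
    """Distance normalised by the longer length, as a fraction in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

The memory use is fine; it is the m × n loop that I'd like to avoid, given that I only need the ratio to about three decimal places.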
I do not need high precision. A change resolution of 1 in 1000 (i.e. 0.1%) would be plenty for my application.
What options are there for more efficient string distance computation?