
I'm trying to quantify the difference between two strings as part of a change-monitor system.

The issue I'm having is that the strings are large - I can often be dealing with strings with 100K+ characters.

I'm currently using Levenshtein distance, but computing the Levenshtein distance for large strings is very inefficient. The standard dynamic-programming implementations take O(mn) time, with O(min(m, n)) memory at best.

Since both strings are of approximately the same length, the distance calculation process can take many seconds.

I do not need high precision. A change resolution of 1 in 1000 (i.e. 0.1%) would be plenty for my application.

What options are there for more efficient string distance computation?

mavroprovato
Fake Name
  • Aaaand stackoverflow doesn't have mathjax. WTF? – Fake Name Dec 09 '14 at 08:38
  • http://meta.stackexchange.com/questions/30559/latex-on-stack-overflow – Oliver Charlesworth Dec 09 '14 at 08:40
  • Interesting question! Are you implementing the Levenshtein distance by building a matrix? That might be slow. You haven't said which language you're using, but if you create a byte array for each string, maybe you can just iterate through them? I mean, 100k iterations should be fairly quick if you can make do with a single number `d`, the difference in characters. I don't think you can get a lower time complexity, but you might get constant memory if you use, for example, Java, which would yield a faster practical implementation. – Johan S Dec 09 '14 at 09:01
  • By the way, is your time complexity really correct? – Johan S Dec 09 '14 at 09:08
  • @JohanS - [Seems correct](http://en.wikipedia.org/wiki/Edit_distance#Computation). The naive string comparison doesn't work because a single removed character at the beginning of the string would make every character thereon not match. – Fake Name Dec 09 '14 at 09:16
  • I have found [this paper](http://arxiv.org/abs/1005.4033), but it's purely academic, and I have to confess I straight-up can't understand the math (at least at this time), and there are no implementations of it that I've seen. – Fake Name Dec 09 '14 at 09:20
  • Well yeah, I'm guessing you probably won't be able to do it with anything other than the vanilla Levenshtein algorithm. If you have any kind of threshold or similar, maybe you could break early? Think about general optimizations such as re-using the matrix if you don't run this in parallel. – Johan S Dec 09 '14 at 09:24
  • @JohanS - Take a look at the paper I linked. They claim they've figured out an exponential improvement (**!**), but I don't know if it's passed peer-review. – Fake Name Dec 09 '14 at 09:31
  • I checked it out (I'm currently a student myself), and I get the feeling that it's academic mumbo-jumbo, since they don't show or reference any source code. Try to find the paper that they improved on? – Johan S Dec 09 '14 at 09:51
  • @JohanS - I'm still chewing over this problem. I spent a while studying that paper, and from what I can tell, their improvement is basically placing bounds on the access of *one* of the strings. Their model apparently assumes that one string is expensive to access, and the other is free. – Fake Name Apr 29 '15 at 02:46

1 Answer


If you can tolerate some error, you can try splitting the strings into smaller chunks and calculating the Levenshtein distance between each pair of corresponding chunks.

This method would obviously yield an accurate result for replacements. Inserts and deletes, however, would incur an accuracy penalty that depends on the number of chunks (the worst-case scenario would give you a distance of 2 * <number of insert/deletes> * <number of chunks> instead of <number of insert/deletes>).
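As a rough illustration of the chunking idea, here is a minimal Python sketch. The function names and the `chunk_size` parameter are made up for this example, and it assumes chunks are simply taken at the same offsets in both strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def chunked_distance(a: str, b: str, chunk_size: int = 1000) -> int:
    """Sum the edit distances of chunks taken at the same offsets.

    chunk_size is a hypothetical tuning knob, not a recommended value.
    """
    total = 0
    for start in range(0, max(len(a), len(b)), chunk_size):
        total += levenshtein(a[start:start + chunk_size],
                             b[start:start + chunk_size])
    return total
```

Each chunk costs roughly `chunk_size ** 2` steps and there are about `n / chunk_size` chunks, so the total work is on the order of `n * chunk_size` rather than `n ** 2`. The result is never smaller than the true distance, since the per-chunk edits put together form a valid edit script for the whole string.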

The next step could be to make the process adaptive. I see two ways of doing it, depending on the expected nature of the changes:

  1. Try a small chunk size first then move on to larger and larger chunks and observe the drop between each iteration. That should help you estimate how much of your measured distance is error (though I haven't worked out exactly how).
  2. Once you find a difference between two chunks, try to identify what the difference is (exactly how many characters were added/deleted overall), and shift your next chunk to the left or to the right accordingly.
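One possible way to sketch the second option in Python: compare each chunk of the first string against every prefix of a slightly longer window of the second string, keep the best-matching prefix, and start the next window where that prefix ended. The `slack` parameter (how far a chunk boundary may drift in one step) and both function names are assumptions made for this example. The choice is greedy, so treat it as an approximation rather than a worked-out algorithm.

```python
def prefix_distances(a: str, b: str) -> list:
    """Return row d where d[j] is the edit distance between a and b[:j]."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous


def shifting_chunk_distance(a: str, b: str,
                            chunk_size: int = 1000, slack: int = 50) -> int:
    """Chunked distance where each chunk of b starts where the last match ended.

    chunk_size and slack are hypothetical tuning knobs.
    """
    total = 0
    pos_a = pos_b = 0
    while pos_a < len(a):
        chunk = a[pos_a:pos_a + chunk_size]
        window = b[pos_b:pos_b + len(chunk) + slack]
        row = prefix_distances(chunk, window)
        # Pick the prefix of the window that matches the chunk best; on ties,
        # prefer the prefix whose length is closest to the chunk's length.
        best = min(range(len(row)), key=lambda j: (row[j], abs(j - len(chunk))))
        total += row[best]
        pos_a += len(chunk)
        pos_b += best  # shift the next window by the net insert/delete count
    # Anything left over in b can only be accounted for as insertions.
    total += len(b) - pos_b
    return total
```

With the single-insertion example `shifting_chunk_distance("abcdefgh", "Xabcdefgh", chunk_size=4, slack=2)` this gives 1, where fixed-offset chunking would count the shift again in every chunk. The result is still an upper bound on the true distance, and a run of inserts longer than `slack` inside one chunk would not be absorbed in a single step.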
biziclop