
With string matching, you look for exact matches.

There are algorithms that account for up to k differences, each counted as a single unit: the omission of a character, the addition of a character, or the replacement of a character (I forget the algorithm's name). They run in O(n) time.

Is there an algorithm that instead returns the total difference between the strings, as opposed to the number of differences?

In effect, this would be a more generalised version of that algorithm: where the original registers the value 1 for every difference (a != d), this one registers the amount they differ by, e.g. 3 for d - a.

In the original algorithm, a string matches if it has a total number of mismatches less than k; in the algorithm I'm looking for, the condition is that the string has a total difference less than some value e.
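
A minimal sketch of what I mean, assuming characters are compared by their code-point values; the function name and example are made up, and this naive scan is O(n*m) rather than the O(n+m) I'm hoping for:

```python
# Hypothetical sketch: report every offset where `pattern`, aligned against
# `text`, has a *total* per-character difference below the threshold `e`.
def total_difference_matches(text: str, pattern: str, e: int) -> list[int]:
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        total = 0
        for j in range(m):
            total += abs(ord(text[i + j]) - ord(pattern[j]))
            if total >= e:          # stop early once the budget is exceeded
                break
        if total < e:
            matches.append(i)
    return matches

# 'a' and 'd' differ by 3, so "dbc" matches the "abc" inside "xxabcxx" when e = 4.
print(total_difference_matches("xxabcxx", "dbc", 4))  # [2]
```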

Tobi Akinyemi
  • How about `memcmp`? –  Jul 27 '20 at 20:20
  • @super That's a single comparison of two strings. I'm looking to match the string within. i.e. `str1.contains(str2)` - ideally in `O(n+m)` time – Tobi Akinyemi Jul 27 '20 at 20:25
  • It sounds like you might be looking for `edit distance` or `Levenshtein distance`. – 500 - Internal Server Error Jul 27 '20 at 20:27
  • @500-InternalServerError are those the same algorithm? I believe that's the algorithm I was referring to that I said I forgot the name of – Tobi Akinyemi Jul 27 '20 at 20:32
  • @500-InternalServerError those don't account for the amount mismatched by, only the fact that a mismatch has occurred – Tobi Akinyemi Jul 27 '20 at 20:32
  • @500-InternalServerError do you reckon it would be easy to extend that algorithm; i.e. on a mismatch, instead of only incrementing by 1, you increment by the actual distance – Tobi Akinyemi Jul 27 '20 at 20:34
  • Why would you ever need to do this? –  Jul 27 '20 at 20:37
  • `those don't account for the amount mismatched by, only the fact that a mismatch has occurred` - are you sure about that? I haven't used either, just trying to help out with terminology. – 500 - Internal Server Error Jul 27 '20 at 20:37
  • Oh, reread the question and see what you mean now. – 500 - Internal Server Error Jul 27 '20 at 20:38
  • @super that's a weird question. I clearly want to match based on distance as opposed to (binary-)mismatch-count – Tobi Akinyemi Jul 27 '20 at 20:38
  • FWIW, I don't see why you couldn't do a version that sums the deltas for when a character changes, but what, then, about inserts and deletes? – 500 - Internal Server Error Jul 27 '20 at 20:40
  • Ahh now I see. You mean alphabetically? –  Jul 27 '20 at 20:40
  • @super I guess. It's not an alphabet though. The range is continuous, not discrete – Tobi Akinyemi Jul 27 '20 at 20:41
  • Adapting the Levenshtein metric for this is fairly straightforward. You'll just have to replace three constants in the code with the respective difference values. The more interesting part would be the choice of difference for insertions/deletions (a sketch along these lines follows this comment thread). –  Jul 27 '20 at 21:12
  • @Paul I don't think I would want to support insertions / deletions – Tobi Akinyemi Jul 27 '20 at 22:44
  • @TobiAkinyemi then you're limited to strings of equal length and it's just the pairwise difference of the characters. –  Jul 28 '20 at 16:48
  • @Paul Limited to *matches* of the same length. I was thinking about it yesterday, and I don't think this algorithm would be any better than a naive (brute-force) implementation. The algorithm for `Levenshtein distance` is `O(n*m)`, which is really bad compared to other `O(n+m)` matching algorithms, but the benefit is that you obtain k-diff matching as opposed to exact matching. – Tobi Akinyemi Jul 28 '20 at 17:45
  • @TobiAkinyemi I guess your basic problem isn't that you're looking for an algorithm, but that you're lacking a definition of "total difference". You should first get a clean definition for that before you continue to look for algorithms. –  Jul 28 '20 at 20:32
  • @Paul My definition has not at all changed since creating the question. What I essentially said is the `Levenshtein` algorithm (which I was trying to adapt) is slow. – Tobi Akinyemi Jul 28 '20 at 20:33
  • @TobiAkinyemi then you'll also have insertions and deletions and are pretty much left with that complexity. There's no way to get faster than `O(n*m)` without losing information –  Jul 28 '20 at 20:39
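
A rough sketch of the adaptation Paul suggests: the standard Levenshtein DP, with the unit substitution cost replaced by the character difference. The insertion/deletion penalty `indel` is an assumption; as noted in the comments, that choice is the open question, and if it is too small a delete-plus-insert pair will undercut a substitution.

```python
# Sketch: Levenshtein-style DP where substituting a for b costs |a - b|
# instead of 1. `indel` (the insertion/deletion penalty) is an assumed
# parameter; it must be large enough that deleting and re-inserting a
# character does not become cheaper than substituting it.
def weighted_edit_distance(s: str, t: str, indel: int) -> int:
    n, m = len(s), len(t)
    # dp[i][j] = minimal total difference between s[:i] and t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = abs(ord(s[i - 1]) - ord(t[j - 1]))   # difference, not 0/1
            dp[i][j] = min(dp[i - 1][j] + indel,       # delete s[i-1]
                           dp[i][j - 1] + indel,       # insert t[j-1]
                           dp[i - 1][j - 1] + sub)     # substitute
    return dp[n][m]

print(weighted_edit_distance("abc", "dbc", indel=10))  # 3, since |'a' - 'd'| = 3
```

This remains O(n*m), the complexity discussed in the comments above.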

1 Answer


  • Search the web with Google for "efficient string matching algorithm": 17,100,000 results in 0.53 seconds.
  • Work out how Google manages this: so many results, with string matching over many documents.
  • Assume one document is 100 characters and the search string is 35 characters.
  • 17,100,000 * 100 * 35 comparisons = 59,850,000,000.
  • At one comparison per nanosecond, that is roughly 59.85 seconds.
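
A quick check of the arithmetic above, using the answer's own assumptions:

```python
# Back-of-the-envelope check of the figures quoted in the answer.
documents = 17_100_000      # reported result count
doc_length = 100            # assumed characters per document
pattern_length = 35         # assumed search string length

comparisons = documents * doc_length * pattern_length
print(comparisons)              # 59850000000
print(comparisons / 1e9, "s")   # 59.85 s at one comparison per nanosecond
```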

  • This doesn't answer the question. – Douglas Zare Apr 15 '22 at 15:55