Missing/additional words when comparing texts

Question

I want to compare two text files. Spelling mistakes (a missing, wrong, or additional character) are not a problem, but I don't know how to handle a missing or additional line or word. In my research, I found that many people suggest Levenshtein distance for comparing texts, but I don't see how it helps in this case. For example, if text1 was:

Montorgueil
1 Amalfi 8,20
1 Bali   3,90

and text2 was:

H
Montorgueil
bali     3,90

I have to figure out a way to report that there is an additional line 'H', a missing line '1 Amalfi 8,20', and a missing character '1'.

Are there any algorithms that I can use for this? I am not even looking for code.

What do you want to get in the end? Should these two be similar? What would be an example of different texts? You need to define your own baselines, before you consider usage of any algorithm. Consider providing 3-4 sample input/output, which are similar, and try to pinpoint your edge cases. — Victor Zakharov, Nov 29 '14 at 16:55
You may want to start your research here [A Generic, Reusable Diff Algorithm in C#](http://www.codeproject.com/Articles/6943/A-Generic-Reusable-Diff-Algorithm-in-C-II). — Victor Zakharov, Nov 29 '14 at 16:58
I think Levenshtein distance https://en.wikipedia.org/wiki/Levenshtein_distance is only suitable for comparing two words, to see how similar or dissimilar they are. I doubt it is useful for what you are trying to do. — RenniePet, Nov 29 '14 at 18:24
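
To illustrate the diff-style approach suggested in the comments, here is a minimal Python sketch of a two-level comparison: lines are aligned first, and only lines that appear to correspond are then compared word by word. It uses the standard library's `difflib.SequenceMatcher`, which is not Levenshtein distance and not the algorithm from the linked C# article. The function names (`normalize`, `word_diff`, `pair_lines`, `compare_texts`), the 0.6 similarity threshold, and the choice to ignore case and whitespace runs are all assumptions made for this example, not something the question specifies.

```python
import difflib
from itertools import product


def normalize(line):
    """Collapse whitespace runs and ignore case (an assumption of this sketch)."""
    return " ".join(line.split()).casefold()


def word_diff(old_line, new_line):
    """Report word-level differences between two lines judged to correspond."""
    old_words = normalize(old_line).split()
    new_words = normalize(new_line).split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            report += [("missing word", w) for w in old_words[i1:i2]]
        if tag in ("insert", "replace"):
            report += [("additional word", w) for w in new_words[j1:j2]]
    return report


def pair_lines(old_block, new_block, threshold):
    """Greedily pair the most similar lines first; returns {old_index: new_index}."""
    scored = sorted(
        (
            (difflib.SequenceMatcher(None, normalize(o), normalize(n)).ratio(), i, j)
            for (i, o), (j, n) in product(enumerate(old_block), enumerate(new_block))
        ),
        reverse=True,
    )
    pairs, used_new = {}, set()
    for score, i, j in scored:
        if score >= threshold and i not in pairs and j not in used_new:
            pairs[i] = j
            used_new.add(j)
    return pairs


def compare_texts(old_text, new_text, threshold=0.6):
    """Return a list of (kind, text) differences; blank lines are skipped."""
    old = [line for line in old_text.splitlines() if line.strip()]
    new = [line for line in new_text.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(
        None, [normalize(l) for l in old], [normalize(l) for l in new], autojunk=False
    )
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_block, new_block = old[i1:i2], new[j1:j2]
        # Inside a changed region, decide which lines are edits of each other.
        pairs = pair_lines(old_block, new_block, threshold)
        for i, line in enumerate(old_block):
            if i in pairs:
                report += word_diff(line, new_block[pairs[i]])
            else:
                report.append(("missing line", line))
        paired_new = set(pairs.values())
        report += [
            ("additional line", line)
            for j, line in enumerate(new_block)
            if j not in paired_new
        ]
    return report


text1 = "Montorgueil\n1 Amalfi 8,20\n1 Bali   3,90\n"
text2 = "H\nMontorgueil\nbali     3,90\n"
for kind, text in compare_texts(text1, text2):
    print(f"{kind}: {text!r}")
# additional line: 'H'
# missing line: '1 Amalfi 8,20'
# missing word: '1'
```

The greedy pairing is a simplification: it can pair lines out of order inside a changed region, and the threshold would need tuning on real data. A per-word spelling check (for example Levenshtein distance) could replace the exact word comparison in `word_diff` if typos should not be reported as missing and additional words.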

0 Answers