10

Possible Duplicate:
Is there a way to diff files from C++?

I have long text strings that I wish to diff and patch. That is, given strings a and b:

string a = ...;
string b = ...;

string a_diff_b = create_patch(a,b);
string a2 = apply_patch(a_diff_b, b);

assert(a == a2);

If a_diff_b were human-readable, that would be a bonus.

One way to implement this would be to use system(3) to call the diff and patch shell commands (from GNU diffutils and GNU patch, respectively) and pipe them the strings. Another way would be to implement the functions myself (I was thinking of treating each line atomically and using the standard dynamic-programming edit distance algorithm, which is quadratic in the number of lines, line-wise with backtracking).
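The line-wise approach with backtracking could look roughly like the sketch below. It uses the classic longest-common-subsequence table, which needs O(M*N) time and memory for M and N lines. The names `LineEdit`, `split_lines`, `diff_lines` and `rebuild` are hypothetical helpers invented for this illustration; they are not the `create_patch`/`apply_patch` functions from the question, and the script produced here carries the full text of both sides rather than a compact patch.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One entry of the edit script (hypothetical type for this sketch):
// ' ' = line common to both, '-' = line only in a, '+' = line only in b.
struct LineEdit {
    char tag;
    std::string line;
};

// Split text into lines; a trailing newline does not yield an empty last line.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// Line-wise diff using the LCS dynamic-programming table plus a walk
// through the table to recover the edit script.
std::vector<LineEdit> diff_lines(const std::vector<std::string>& a,
                                 const std::vector<std::string>& b) {
    const std::size_t m = a.size(), n = b.size();
    // lcs[i][j] = length of the LCS of the suffixes a[i..] and b[j..]
    std::vector<std::vector<std::size_t>> lcs(
        m + 1, std::vector<std::size_t>(n + 1, 0));
    for (std::size_t i = m; i-- > 0;)
        for (std::size_t j = n; j-- > 0;)
            lcs[i][j] = (a[i] == b[j])
                            ? lcs[i + 1][j + 1] + 1
                            : std::max(lcs[i + 1][j], lcs[i][j + 1]);

    std::vector<LineEdit> script;
    std::size_t i = 0, j = 0;
    while (i < m && j < n) {
        if (a[i] == b[j]) {
            script.push_back({' ', a[i]});
            ++i; ++j;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            script.push_back({'-', a[i]});  // dropping a[i] keeps the LCS
            ++i;
        } else {
            script.push_back({'+', b[j]});  // otherwise take b[j] as an insertion
            ++j;
        }
    }
    for (; i < m; ++i) script.push_back({'-', a[i]});
    for (; j < n; ++j) script.push_back({'+', b[j]});
    return script;
}

// Rebuild one side from the script: dropping '+' entries yields a,
// dropping '-' entries yields b.
std::vector<std::string> rebuild(const std::vector<LineEdit>& script,
                                 char drop_tag) {
    assert(drop_tag == '+' || drop_tag == '-');
    std::vector<std::string> out;
    for (const LineEdit& e : script)
        if (e.tag != drop_tag) out.push_back(e.line);
    return out;
}
```

Printing each entry as its tag followed by the line gives a human-readable listing similar in spirit to diff output. A real patch format would store only the changed lines plus some context, which this sketch does not attempt.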

I was wondering if anyone knows of a good Linux C or C++ library that would do the job in-process?

Andrew Tomazos

3 Answers

9

You could search for an implementation of the Myers diff algorithm ("An O(ND) Difference Algorithm and Its Variations") or for libraries that solve the "longest common subsequence" problem.

As far as I know, the situation with diff/patch in C++ isn't good. There are several libraries (including diff match patch and libmba), but in my experience each has at least one of these drawbacks: it is somewhat poorly documented; it has heavy external dependencies (diff match patch requires Qt 4, for example); it is specialized on a type you don't need (std::string when you need Unicode, for example); it isn't generic enough; or it uses a generic algorithm with very high memory requirements (on the order of (M+N)^2, where M and N are the lengths of the input sequences).

You could also try to implement the Myers algorithm yourself (its memory requirements are on the order of N+M), but the solution is extremely difficult to understand; expect to spend at least a week reading documentation. A somewhat human-readable explanation of the Myers algorithm is available here.
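For orientation, the greedy core of the O(ND) algorithm is fairly short if you only want the length of the shortest edit script and not the script itself. The following is a minimal sketch under that assumption: it keeps one array of furthest-reaching x positions per diagonal, so memory is linear in N+M, but it does not include the "middle snake" divide-and-conquer step needed to recover the actual edits in linear space. The function name `myers_edit_distance` is made up for this example.

```cpp
#include <string>
#include <vector>

// Length D of the shortest edit script (insertions + deletions only)
// that turns a into b, following the greedy forward search from
// "An O(ND) Difference Algorithm and Its Variations".
int myers_edit_distance(const std::string& a, const std::string& b) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int max_d = n + m;
    // v[offset + k] = furthest x reached so far on diagonal k, where k = x - y.
    // Diagonals k - 1 and k + 1 are read, so allow for k in [-max_d-1, max_d+1].
    const int offset = max_d + 1;
    std::vector<int> v(2 * max_d + 3, 0);

    for (int d = 0; d <= max_d; ++d) {
        for (int k = -d; k <= d; k += 2) {
            int x;
            if (k == -d || (k != d && v[offset + k - 1] < v[offset + k + 1])) {
                x = v[offset + k + 1];      // move down: insert a symbol of b
            } else {
                x = v[offset + k - 1] + 1;  // move right: delete a symbol of a
            }
            int y = x - k;
            // Follow the diagonal ("snake") while the symbols match.
            while (x < n && y < m && a[x] == b[y]) {
                ++x;
                ++y;
            }
            v[offset + k] = x;
            if (x >= n && y >= m) return d;  // reached the bottom-right corner
        }
    }
    return max_d;  // not reached: d = n + m always suffices
}
```

To diff text line by line instead of character by character, the same loop works on two `std::vector<std::string>` inputs. Recovering the edit script requires either storing a copy of `v` for every d (more memory) or the linear-space refinement described in the paper.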

SigTerm
  • I read the original paper last night: http://www.xmailserver.org/diff2.pdf. It's fairly straightforward if you know the edit distance algorithm first. Basically, rather than search the whole edit graph, it searches the paths with the fewest changes first and memoizes the results for the next iteration, extending them by one change each time. Thus, once it finds the endpoint, it has a solution with minimal changes, and it has only searched better possible solutions first. It's a specific case of a best-first search algorithm (`A*`). – Andrew Tomazos Nov 19 '12 at 13:03
  • @AndrewTomazos-Fathomling: It's not straightforward. For example, it is extremely difficult to tell what exactly is called "middle snake". Of course, it might be straightforward for people with mathematical background. – SigTerm Nov 19 '12 at 14:03
  • The "middle snake" refinement is an extension to the basic algorithm. It just means simultaneously conducting a search from the top-left to the bottom-right, and vice versa. When the two searches meet, you have a solution. You can discard the backtracking path information and use this dual algorithm repeatedly as a "binary search" divide-and-conquer recursion, so that although it will take logarithmically more time, you only need linear space (if space is at a premium). – Andrew Tomazos Nov 19 '12 at 15:32
  • The "middle snake" is not a refinement. It is central to the Myers algorithm. I wrote libmba, and the diff implementation has no external dependencies (it only uses one other module from the libmba package, so you can completely isolate it by tweaking the Makefile) and it's lean. I realize it's fun to implement stuff like this yourself, and that would be a great programming exercise, but you're going to be hard-pressed to find something better. – squarewav Nov 18 '15 at 18:31
  • Regarding Myers's algorithm, I implemented a lightweight C/C++ library (http://github.com/martinsos/edlib) based on his later article, where he describes a bit-vector algorithm for calculating edit distance. It is certainly not trivial to implement, especially if you want it to be fast. It also returns the "middle snake", as you call it (I call it the alignment path). – Martinsos Jul 09 '16 at 00:30
8

I believe that dtl (https://github.com/cubicdaiya/dtl/wiki/Tutorial) may have what you need.

kirbyfan64sos
Caribou
3

http://code.google.com/p/google-diff-match-patch/

The Diff Match and Patch libraries offer robust algorithms to perform the operations required for synchronizing plain text.

Currently available in Java, JavaScript, Dart, C++, C#, Objective C, Lua and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.

Colonel Panic