56

I need to work with large files and must find the differences between two of them. I don't need the differing content itself, only the number of differences.

To find the number of differing rows I came up with:

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

And it works, but is there a better way to do it?

And how can I count the exact number of differences (with standard tools like bash, diff, awk, sed, or some old version of Perl)?

animuson
Zsolt Botykai
  • Where in the question does it say that he wants to count the line differences, and not the **character** differences? I see "bits" and "exact number of differences", but "rows" was just his attempt to do it.. – vstepaniuk Apr 03 '20 at 17:26

7 Answers

52

If you want to count the number of lines that are different use this:

diff -U 0 file1 file2 | grep ^@ | wc -l

Doesn't John's answer double count the different lines?
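To make the two competing counts concrete, here is a minimal Python sketch (not part of either answer) that uses the standard-library difflib module to produce a zero-context unified diff and report both the number of hunks (what `grep -c ^@` counts) and the number of added/removed lines (what `grep -v ^@` counts after subtracting the two header lines). The function name `count_hunks_and_lines` is made up for illustration, and difflib's matching may not agree with GNU diff on every input.

```python
import difflib


def count_hunks_and_lines(old_lines, new_lines):
    """Return (hunks, changed_lines) for a zero-context unified diff."""
    hunks = 0
    changed = 0
    in_hunk = False  # the two file-name header lines come before the first hunk
    for line in difflib.unified_diff(old_lines, new_lines, n=0, lineterm=""):
        if line.startswith("@@"):
            hunks += 1
            in_hunk = True
        elif in_hunk and line.startswith(("+", "-")):
            # With n=0 there are no context lines, so every line inside a
            # hunk is either a removal ("-") or an addition ("+").
            changed += 1
    return hunks, changed


if __name__ == "__main__":
    # Two entirely different files: one hunk, but four changed lines.
    print(count_hunks_and_lines(["a", "b"], ["x", "y"]))
```

Contiguous changes collapse into a single hunk, so the hunk count can be far lower than the number of changed lines, while the line count reports a modified line twice (once as removed, once as added).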

Josh
  • Yes, it double counts. See my comment on the accepted answer. The command in this answer is correct. – Henrik Warne Dec 19 '12 at 17:33
  • 2
    This appears to potentially double-count lines to me as well, both on MacOSX and Ubuntu. Batches of contiguous lines can be grouped together in a single block, and it depends on your task as to whether or not that should be one difference or several. – Michael H. May 07 '13 at 19:18
  • Don't forget coloured output means lines begin with an escape sequence! Had to use hexdump to figure that one out. – James Morris Aug 02 '13 at 23:19
  • 11
    As @khedron points out batches of contiguous lines can be grouped together in a single block. By my reckoning this means this method is prone to undercounting. –  Oct 08 '13 at 09:20
  • 6
    You could write `grep -c ^@` instead of `grep ^@ | wc -l` – Shiplu Mokaddim Mar 05 '14 at 12:18
  • 8
    "Prone to undercounting" is putting it mildly - run this command on two entirely different files and it will give you a result of 1. – nemetroid Jul 22 '17 at 18:26
49
diff -U 0 file1 file2 | grep -v ^@ | wc -l

Subtract 2 from the result to account for the two file-name lines at the top of the diff listing. Unified format is probably a bit faster than side-by-side format.

John Kugelman
  • 7
    This doesn't work, as I define "working" http://pastie.org/pastes/3179433/text There is only one character in each file, what does the number "4" relate to? – Stop Slandering Monica Cellio Jan 13 '12 at 17:39
  • This does work. For your example you have four lines: the first two are the name of each file (as explained in the answer), and the other two are the two differences, 1 line with 'a' removed and 1 line with 'b' added. – Rafael Barbosa Oct 02 '12 at 13:48
  • 5
    It depends on how you count differences. In this example [pastie.org/5553254](http://pastie.org/5553254), I consider there to be 2 lines that differ, i.e. I agree with sequoia mcdowell. It is also inconvenient to have to subtract 2 from the result (due to the printing of the 2 diffed files). Therefore, I think Josh's answer is the correct one. It can be shortened slightly by using the -c (count) option on grep, instead of piping to wc -l, like this: `diff -U 0 file1 file2 | grep -c ^@` – Henrik Warne Dec 19 '12 at 17:31
  • `diff -U 0 file1 file2 | grep -v ^@ | tail -n +3 | wc -l` should give the correct count. It excludes the filenames at the top of the diff output. – Matt Kneiser Jul 20 '15 at 18:22
  • 6
    correct solution is here https://unix.stackexchange.com/questions/53719/get-correct-number-of-lines-in-diff-output as accepted answer – tsusanka Oct 18 '15 at 19:29
  • @tsusanka Why don't you make that an answer (pointing to the other answer)? Otherwise it's buried here in the comments! Also, I notice that answer is almost the same as the code in the question (with a small bugfix). – Neal Gokli Dec 20 '18 at 20:15
  • It seems like the original question was not looking for this way of counting or Josh's way of counting, given the example code in the question and "To find the number of different rows". Though I guess they did accept this answer! – Neal Gokli Dec 20 '18 at 20:19
5

Since every output line that differs starts with a < or > character, I would suggest this:

diff file1 file2 | grep '^[<>]' | wc -l

By matching only '^<' or only '^>' you can count the differing lines of just one of the files.

Michal Nemec
5

If using Linux/Unix, what about comm -23 file1 file2 to print lines in file1 that aren't in file2, comm -23 file1 file2 | wc -l to count them, and similarly comm -13 ... for lines that are only in file2?

dubiousjim
  • 1
    As sureshw points out in another answer, `comm` expects its arguments to be *sorted* files. So this suggestion can only be relied on in special cases. (I think it would be easy to write your own version of `comm` using awk that worked for not-sorted input, too, but doubt that this satisfies the spirit of the original question anymore.) – dubiousjim May 31 '12 at 01:03
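The comm-like comparison for unsorted input mentioned in the comment can be sketched in Python rather than awk. This is only an illustration under stated assumptions: the function name `unique_line_counts` is hypothetical, both files are assumed to fit in memory, and line order is ignored entirely (lines are treated as a multiset), which is a different notion of "difference" from what diff reports.

```python
from collections import Counter


def unique_line_counts(lines1, lines2):
    """Count lines found only in the first input and only in the second.

    Inputs do not need to be sorted. Duplicates are respected: a line that
    occurs twice in one input and once in the other counts as one unique line.
    """
    c1 = Counter(lines1)
    c2 = Counter(lines2)
    only_in_1 = sum((c1 - c2).values())  # Counter subtraction drops counts <= 0
    only_in_2 = sum((c2 - c1).values())
    return only_in_1, only_in_2


if __name__ == "__main__":
    print(unique_line_counts(["b", "a", "a", "c"], ["a", "d", "b"]))
```

To apply it to files, pass the open file objects (or `f.read().splitlines()`) as the two arguments.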
3

I believe the correct solution is in this answer, that is:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1
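The side-by-side count treats a changed line as a single difference, so a modified line is counted once rather than as one removal plus one addition. A rough Python equivalent, assuming difflib's alignment is close enough to GNU diff's for your data (it is not guaranteed to match), could look like this. The function name `side_by_side_diff_rows` is made up for this sketch.

```python
import difflib


def side_by_side_diff_rows(old_lines, new_lines):
    """Approximate `diff -y --suppress-common-lines old new | wc -l`.

    Each changed pair of lines occupies one row; lines present on only one
    side occupy one row each.
    """
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    rows = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced block of m old lines and n new lines takes max(m, n) rows.
            rows += max(i2 - i1, j2 - j1)
    return rows


if __name__ == "__main__":
    print(side_by_side_diff_rows(["a"], ["b"]))
```

Note that SequenceMatcher can be slow on very large inputs, so for big files the diff command itself is likely the better tool.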
Neal Gokli
tsusanka
0

If you're dealing with files with analogous content that should be sorted the same line-for-line (like CSV files describing similar things) and you would e.g. want to find 2 differences in the following files:

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

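from itertools import zip_longest

file1, file2 = "a.csv", "b.csv"  # example paths; replace with your own files

different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest pads the shorter file with None, so extra lines count as differences
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1
print(different_lines)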
Daniel Lee
0

Here is a way to count any kind of difference between two files, using a regex to define the unit of comparison (here ., which matches any character except a newline):

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

An excerpt from man git-diff :

--patience
           Generate a diff using the "patience diff" algorithm.
--word-diff[=<mode>]
           Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
           porcelain
               Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff
               format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input
               are represented by a tilde ~ on a line of its own.
--word-diff-regex=<regex>
           Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it
           was already enabled.
           Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!)
           for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches
           all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
           For example, --word-diff-regex=.  will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of pcre2-utils package on Ubuntu 20.04.
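For comparison, a character-level count can also be sketched with Python's standard difflib. This is a different technique from the git pipeline above: it uses difflib's own matching rather than patience diff, it counts individual removed and added characters rather than changed runs, and it includes newlines. The function name `count_char_differences` is hypothetical, and SequenceMatcher may be too slow for very large files.

```python
import difflib


def count_char_differences(text1, text2):
    """Return (removed_chars, added_chars) between two strings."""
    matcher = difflib.SequenceMatcher(None, text1, text2, autojunk=False)
    removed = 0
    added = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # characters present only in text1
        if tag in ("replace", "insert"):
            added += j2 - j1  # characters present only in text2
    return removed, added


if __name__ == "__main__":
    print(count_char_differences("kitten", "sitten"))
```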

vstepaniuk