
I have two files: disk.txt contains 57665977 rows and database.txt contains 39035203 rows.

To test my script I made two example files:

$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29

$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49

What I am trying to accomplish is to create files containing the differences:

  1. A file with the uniques in disk.txt (so I can delete them from disk)
  2. A file with the uniques in database.txt (so I can retrieve them from backup and restore them)

Using comm to retrieve differences

I used comm to see the differences between the two files. Sadly, comm also returns a duplicate (a line that is present in both files) after the uniques.

$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49

Using comm on one of these large files takes 28,38s. That is really fast, but on its own it does not solve my problem.

Using fgrep to strip duplicates from the comm result

I can use fgrep to remove the duplicates from the comm result and this works on the example.

$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1

On the large files this command just crashed after a while, presumably because fgrep -f has to hold all of the pattern lines in memory at once. So it is not a viable option to solve my problem.

Using python difflib to get uniques

I tried using this Python script, which I got from BigSpicyPotato's answer to a different post:

import difflib
with open(r'disk.txt','r') as masterdata:
    with open(r'database.txt','r') as useddata:
        with open(r'uniq-disk.txt','w+') as Newdata:
            usedfile = [ x.strip('\n') for x in list(useddata) ]
            masterfile = [ x.strip('\n') for x in list(masterdata) ]

            for line in masterfile:
                if line not in usedfile:
                    Newdata.write(line + '\n')

This also works on the example. On the real files it is currently still running and takes up a lot of my CPU power. Judging by how slowly the uniq-disk file grows, it is really slow as well: `line not in usedfile` scans the whole list for every line, so the loop does on the order of 57 million × 39 million comparisons.
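For comparison, the same loop with usedfile held in a set instead of a list avoids that scan, since set membership tests are hash lookups. A minimal sketch, assuming all of database.txt fits in memory at once:

# Same structure as the script above, but the lookup table is a set,
# so each membership test is a hash lookup instead of a scan of a 39M-entry list.
with open('database.txt', 'r') as useddata:
    usedfile = { x.strip('\n') for x in useddata }

with open('disk.txt', 'r') as masterdata, open('uniq-disk.txt', 'w') as newdata:
    for line in masterdata:
        line = line.strip('\n')
        if line not in usedfile:
            newdata.write(line + '\n')

Whether 39 million UUID strings actually fit comfortably in RAM as a set is another matter, which is part of why I am asking the question below.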

Question

Is there any faster / better option I can try in bash / python? I was also looking into awk / sed to maybe parse the results from comm.

Max Visser
  • Hi @markp-fuso, thanks for the feedback. I adjusted the post with more similar sample inputs to what my files contain. Hope it helps! – Max Visser Jun 17 '22 at 13:19
  • Hi @markp-fuso, `comm --version` is not working for me... I am guessing I run an outdated version. I use comm on macOS 12.2; `man comm` does not contain any mention of how to get the version number. The manual seems to have been written on January 26, 2005. – Max Visser Jun 17 '22 at 13:45
  • you mention the `comm` takes `28,38s` to run ... is that the time for *each* of the `comm` calls or is that the total/combined time it takes to run the 2 `comm` calls? – markp-fuso Jun 17 '22 at 13:45
  • To answer your question the data and output I provide is copied directly from console. (before I only added the * and ^ to make the problem more clear). I just reran my test and it is still an issue. – Max Visser Jun 17 '22 at 13:46
  • @markp-fuso the 28,38s is on my large files for one of the comm commands. – Max Visser Jun 17 '22 at 13:47
  • wondering the same thing as KamilCuk ... are there any other characters in your files for the lines that start with `10fff` (even an extra space at the end of one line will generate the output you're showing); consider running `grep -h 10fff database.txt disk.txt | od -c` ... should show each character for both lines; add this output to the question if you're not sure how to read the `od -c` output – markp-fuso Jun 17 '22 at 14:08
  • ah I did have a space behind the last line of my example... – Max Visser Jun 17 '22 at 14:13
  • Well, that white space did make me learn a lot... Thanks for your help @markp-fuso, much appreciated! – Max Visser Jun 17 '22 at 14:18
  • further performance improvement would (obviously) come from replacing the current 2x `comm` calls with a single call of 'something' that runs on the same order as a single `comm`; KamilCuk's `join | ...` suggestion may work; a custom `python` or `awk` script that processes the 2 input files in parallel would also work (obviously) with a bit more coding .... – markp-fuso Jun 17 '22 at 14:33

3 Answers

2

From man comm (emphasis added by me):

Compare **sorted** files FILE1 and FILE2 line by line.

You have to sort the files for comm.

sort database.txt > database_sorted.txt
sort disk.txt > disk_sorted.txt
comm -13 database_sorted.txt disk_sorted.txt

See man sort for various speed and memory tuning options, such as --batch-size, --temporary-directory, --buffer-size and --parallel.

A file with the uniques in disk.txt
A file with the uniques in database.txt

After sorting, you can implement your Python program so that it compares the two files line by line and writes to the mentioned files, just like comm but with custom output (see the sketch below). Do not store the whole files in memory.
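A minimal sketch of that idea (an illustration only, assuming both files were sorted with LC_ALL=C sort so that Python's string comparison agrees with the sort order; the output file names match the join pipeline below):

# Walk two sorted files in lock-step, writing lines unique to each one.
# Only two lines are held in memory at any time.
# Stray trailing whitespace (the cause of the confusion in the comments)
# should be cleaned up before sorting, otherwise it makes lines look unique.
with open('disk_sorted.txt') as disk, open('database_sorted.txt') as db, \
     open('unique_in_disk.txt', 'w') as only_disk, \
     open('unique_in_database.txt', 'w') as only_db:
    a, b = disk.readline(), db.readline()
    while a and b:
        ka, kb = a.rstrip('\n'), b.rstrip('\n')
        if ka == kb:            # present in both files: not interesting
            a, b = disk.readline(), db.readline()
        elif ka < kb:           # only in disk.txt
            only_disk.write(ka + '\n')
            a = disk.readline()
        else:                   # only in database.txt
            only_db.write(kb + '\n')
            b = db.readline()
    # whatever remains in either file has no possible match in the other
    while a:
        only_disk.write(a.rstrip('\n') + '\n')
        a = disk.readline()
    while b:
        only_db.write(b.rstrip('\n') + '\n')
        b = db.readline()

Like comm, this reads each file exactly once, so it should run on the same order as a single comm call while producing both output files in one pass.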

You can also do something along these lines with join or with comm --output-delimiter=' ': join -v1 -v2 prints the unpairable lines from both files, and -o 1.1,2.1 puts the disk.txt value in the first output field and the database.txt value in the second, so the tee plus cut below can split the stream into the two result files:

join -v1 -v2 -o 1.1,2.1 disk_sorted.txt database_sorted.txt | tee >(
    cut -d' ' -f1 | grep -v '^$' > unique_in_disk.txt) |
    cut -d' ' -f2 | grep -v '^$' > unique_in_database.txt
KamilCuk
  • Hi KamilCuk, I already sorted my files but comm still returns duplicates between files. Your other suggestion seems to work on my example files. So I will now try it on my real files. :) – Max Visser Jun 17 '22 at 13:21
  • Check the duplicate lines with `hexdump -C` for unprintable characters. – KamilCuk Jun 17 '22 at 13:52
  • How does `hexdump -C` work exactly? Does it give an error if it finds unprintable characters? – Max Visser Jun 17 '22 at 14:00
  • `work exactly` It displays file contents in hexadecimal and other forms. `does it give an error if it finds unprintable characters` No, you have to go through the output manually. – KamilCuk Jun 17 '22 at 14:14
0

comm does exactly what I needed. I had a white space behind line 10 of my disk.txt file, therefore comm returned it as a unique string. Please check @KamilCuk's answer for more context about sorting your files and using comm.

Max Visser
0
 # WHINY_USERS=1 isn't trying to insult anyone - 
 # it's a special shell variable recognized by 
 # mawk-1 to presort the results

 WHINY_USERS=1 {m,g}awk '

 function phr(_) { 
    print \
          "\n Uniques for file : { "\
     (_)" } \n\n -----------------\n" 
 } 
 BEGIN {
          split(_,____) 
        split(_,______) 
     PROCINFO["sorted_in"] = "@val_num_asc"
                        FS = "^$"
 } FNR==NF {
    ______[++_____]=FILENAME 
 }       {
      if($_ in ____) { 
          delete ____[$_] 
      } else {
          ____[$_]=_____ ":" NR 
      }
 } END  { 
    for(__ in ______) { 
         phr(______[__])
             _____=_<_
         for(_ in ____) {
             if(+(___=____[_])==+__) {
                 print "  ",++_____,_,
                       "(line "(substr(___,
                       index(___,":")+!!__))")" 
     }   }   }
 printf("\n\n") } ' testfile_disk.txt testfile_database.txt 

Output:

 Uniques for file : { testfile_disk.txt } 

 -----------------

   1 07fffeed-5a0b-41f8-86cd-e6d99834c187 (line 7)
   2 08ffff24-fb12-488c-87eb-1a07072fc706 (line 8)
   3 09ffff29-ba3d-4582-8ce2-80b47ed927d1 (line 9)

 Uniques for file : { testfile_database.txt } 

 -----------------

   1 11ffffaf-fd54-49f3-9719-4a63690430d9 (line 18)
   2 12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29 (line 19)
RARE Kpop Manifesto