I have two files: disk.txt contains 57665977 rows and database.txt contains 39035203 rows.
To test my script I made two example files:
$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
What I am trying to accomplish is to create files with the differences:
- A file with the lines unique to disk.txt (so I can delete them from disk)
- A file with the lines unique to database.txt (so I can retrieve them from backup and restore them)
Using comm to retrieve differences
I used comm to see the differences between the two files. Sadly comm also returns duplicates after some of the uniques.
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
Using comm on these large files takes 28.38s. This is really fast, but by itself it is not a solution.
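For context, this is the full pipeline I have in mind. comm expects both inputs to be sorted in the same collation, so my guess is that the stray duplicates come from the real files not being sorted that way; the sketch below forces a byte-wise sort first (the only-on-disk.txt / only-in-database.txt output names are just placeholders I made up):
export LC_ALL=C                       # byte-wise collation so sort and comm agree
sort disk.txt     > disk.sorted.txt
sort database.txt > database.sorted.txt
comm -23 disk.sorted.txt database.sorted.txt > only-on-disk.txt       # uniques in disk.txt, to delete
comm -13 disk.sorted.txt database.sorted.txt > only-in-database.txt   # uniques in database.txt, to restore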
Using fgrep to strip duplicates from the comm result
I can use fgrep to remove the duplicates from the comm result, and this works on the example.
$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
On the large files this approach just crashed after a while, so it is not a viable option for solving my problem.
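Since I mention awk further down: a one-pass awk filter that loads one file into a hash and streams the other might avoid fgrep's huge pattern list. This is only a sketch, I have not tried it on the full data, and holding ~57 million keys in memory could still be a problem:
# lines only in database.txt (disk.txt is loaded into the seen[] hash first)
awk 'NR==FNR { seen[$0] = 1; next } !($0 in seen)' disk.txt database.txt > uniq-database.txt
# lines only in disk.txt
awk 'NR==FNR { seen[$0] = 1; next } !($0 in seen)' database.txt disk.txt > uniq-disk.txt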
Using python difflib to get uniques
I tried this Python script, which I got from BigSpicyPotato's answer on a different post:
import difflib  # imported in the original answer but never actually used below

with open(r'disk.txt','r') as masterdata:
    with open(r'database.txt','r') as useddata:
        with open(r'uniq-disk.txt','w+') as Newdata:
            usedfile = [ x.strip('\n') for x in list(useddata) ]
            masterfile = [ x.strip('\n') for x in list(masterdata) ]
            for line in masterfile:
                # linear scan over the whole usedfile list for every line
                if line not in usedfile:
                    Newdata.write(line + '\n')
This also works on the example. Currently it is still running and taking up a lot of my CPU power. Judging by the uniq-disk.txt file, it is really slow as well.
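If I understand it right, the slow part is the "line not in usedfile" check, which scans the whole list for every line. A set-based variant of the same idea (my own sketch, not from the original answer, and it still holds all of database.txt in memory) would look roughly like this:
with open('disk.txt', 'r') as masterdata, \
     open('database.txt', 'r') as useddata, \
     open('uniq-disk.txt', 'w') as newdata:
    used = { line.strip('\n') for line in useddata }   # set: constant-time membership tests
    for line in masterdata:
        line = line.strip('\n')
        if line not in used:
            newdata.write(line + '\n')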
Question
Is there a faster / better option I can try in bash / python? I was also looking into awk / sed to maybe parse the results from comm.