
I'm stuck on something and need some help.

I have two huge txt files (tab-separated) in this kind of format:

File1 :

1  robert  youh  xpla@ioaio.fr
2  patrick  yuad  qqqq@ioaio.fr
3  bob  fsgfq  ddd@ioaio.fr
8  tim  qqjh  hjahj@uayua.com
9  john  ajkajk  rtaeraer@auiaui.com

File2 :

1  baby  france  paris
2  father  usa  detroit
3  mother  uk  london
4  baby  italy  milan

Both files are sorted, but File1 is bigger than File2. I need to find the lines from File1 that don't appear in File2, comparing on the first column only.

I tried many ways:

- awk: I didn't find a way to do it (because of the difference in length)
- Python (with a for line in ... loop and fileinput.input ...): my script spent around 5 minutes to process 0.3% of the lines

Result: I can retrieve the correct result with Python (tested on small files), but it cannot cope with my very large files.

Any ideas to help?

My files are around 2 million lines each.

Damien
  • Have a look at the unix `join` command. – user1717259 Feb 11 '15 at 11:42
  • Show us your python code and we might be able to help make it more efficient. – Don Roby Feb 11 '15 at 11:43
  • Here is my Python source code: http://pastebin.com/QtTzFPWK (args are: --nomatch file1 file2) – Damien Feb 11 '15 at 12:11
  • Given your particular example, `cat file1` would suffice, as there are no matches between the `file1` and `file2` that you quote. You perhaps ought to give a somewhat more representative example... – twalberg Feb 11 '15 at 15:08

4 Answers


Assuming, as you state, that the files are sorted (I did not test it, but it should work):

FILE1_POS = 0
FILE2_POS = 1
MAX_POS = 2

missing = []
# pull the integer key out of field `f` of a line
get_rec_no = lambda l, f=FILE1_POS: int(l.split(None, MAX_POS)[f])
with open('File1') as source, open('File2') as trgt:
    trgt = iter(trgt)
    trgt_rec_no = None
    for line in source:
        src_rec_no = get_rec_no(line)
        try:
            # advance File2 until its key catches up with File1's key
            while trgt_rec_no is None or trgt_rec_no < src_rec_no:
                trgt_rec_no = get_rec_no(next(trgt), FILE2_POS)
        except StopIteration:
            # File2 is exhausted, so the current key is missing too
            missing.append(src_rec_no)
            break
        if trgt_rec_no != src_rec_no:
            missing.append(src_rec_no)
    # if File2 ran out first, every remaining File1 key is missing
    for line in source:
        missing.append(get_rec_no(line))

EDIT:

I changed it at your request: the id is read from column 1 of File1 and column 2 of File2.
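
As a quick sanity check (a sketch, assuming the layout of the sample files in the question, i.e. the id in the first column of both files, so FILE1_POS = FILE2_POS = 0):

print(missing)  # -> [8, 9]: the keys present in File1 but not in File2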

volcano
  • Well, my real files are not formatted like this: in file1 the id is in column 1, in file2 the id is in column 2. How can I adapt your code? – Damien Feb 11 '15 at 12:14
  • Thanks, I'm going to try it. Why do you do `for line in source` twice? I can't understand that. – Damien Feb 11 '15 at 14:02
  • @Damien, the moment you exhaust _File2_, you may either set a flag in the first loop and use it to skip processing of _File2_, or finish processing _File1_ in a separate loop. I liked this solution better. – volcano Feb 11 '15 at 14:32
  • @Damien, just out of curiosity, how fast was it? – volcano Feb 12 '15 at 08:39

If you are having performance problems, you may want to look into using something like Hadoop/MapReduce to split the files into a number of smaller files, so that each piece can be processed on a different processor to achieve a speedup.

As a simple example of splitting the file in two, you could have one subfile containing all the keys in [A-M] and another for [N-Z]. That way, if you know the key from one file, you know which subfile to search for its possible match. In theory, you will have cut the search time in half by splitting the files in two (though not exactly in half, because there is overhead involved). A sketch of the split step follows.
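
A minimal sketch of that split step, assuming numeric keys and bucketing by key modulo the chunk count rather than by alphabetic range (any rule works as long as both files use the same one; the file names here are hypothetical):

N_CHUNKS = 2

def split_file(path):
    # one output file per bucket, e.g. file1.txt.part0 and file1.txt.part1
    outs = [open('{}.part{}'.format(path, i), 'w') for i in range(N_CHUNKS)]
    with open(path) as f:
        for line in f:
            key = int(line.split(None, 1)[0])
            outs[key % N_CHUNKS].write(line)
    for out in outs:
        out.close()

split_file('file1.txt')
split_file('file2.txt')

Matching part0 of file1 against part0 of file2 (and so on) can then run as independent processes.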

Basically, the programming would involve writing a mapper.py and a reducer.py, sketched below. If you're not familiar with Hadoop/MapReduce, there are many good tutorials on setting up a MapReduce job in Python; the one I would suggest is the Udacity course "Intro to Hadoop and MapReduce", which solves some similar problems using Python and MapReduce.
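
A rough sketch of what those two scripts could look like under Hadoop Streaming. This is an assumption about the job setup rather than tested code: it relies on the environment variable (mapreduce_map_input_file, or map_input_file on older versions) that Hadoop Streaming sets to name the current input file, and it assumes the input file names contain "file1"/"file2".

# mapper.py: tag each record with its key and its source file
import os
import sys

input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))
tag = 'F1' if 'file1' in input_file else 'F2'  # assumes the file names

for line in sys.stdin:
    key = line.split(None, 1)[0]
    print('\t'.join((key, tag, line.rstrip('\n'))))

# reducer.py: print File1 records whose key never appears in File2
import sys

current_key, seen_f2, f1_lines = None, False, []

def flush():
    # emit the buffered File1 records if no File2 record shared the key
    if not seen_f2:
        for record in f1_lines:
            print(record)

for line in sys.stdin:
    key, tag, record = line.rstrip('\n').split('\t', 2)
    if key != current_key:
        if current_key is not None:
            flush()
        current_key, seen_f2, f1_lines = key, False, []
    if tag == 'F2':
        seen_f2 = True
    else:
        f1_lines.append(record)

if current_key is not None:
    flush()

The shuffle phase groups records by key, so each reducer sees all the F1/F2 records for a given id together.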

Also, you may want to consider editing your tags to include Hadoop MapReduce, in which case you could get some more specific help on writing the mapper.py and reducer.py.

Wes

The fact that the files are of a different length doesn't exclude an awk solution. Taking your two inputs:

 [ damien $] cat file1.txt
1  robert  youh  xpla@ioaio.fr
2  patrick  yuad  qqqq@ioaio.fr
3  bob  fsgfq  ddd@ioaio.fr
8  tim  qqjh  hjahj@uayua.com
9  john  ajkajk  rtaeraer@auiaui.com
 [ damien $] cat file2.txt
1  baby  france  paris
2  father  usa  detroit
3  mother  uk  london
4  baby  italy  milan

 [ damien $] 

consider the following script:

 [ damien $] cat n_join.awk
#!/usr/bin/gawk -f

NR==FNR{
    # first file on the command line (file2.txt): remember each key
    # file2[$1]=$0;
    file2[$1]=1;
}
NR!=FNR{
    if(!($1 in file2)){
        # print the current record if its key is not in file2
        print;
    }else{
        # $1 from file1 has been found. If keys are really unique
        # in file1, the entry can be deleted from file2. Otherwise,
        # comment this line out:
        delete file2[$1];
    }
}
 [ damien $]

which gives the output:

[ damien $] chmod +x n_join.awk
 [ damien $] ./n_join.awk file2.txt file1.txt
8  tim  qqjh  hjahj@uayua.com
9  john  ajkajk  rtaeraer@auiaui.com
 [ damien $]

note that file2.txt must be passed in first. I have no idea if this will work on a file that's 2 million lines long, but I would be interested to know if you have the time to try it. :)

If you can make the files available (unlikely), I'll try it myself... :D

EDIT: I understand that you've accepted an answer and have probably moved on with your life; however, I would like to add some additional information.

If I create two large files of the type that you specify, file1.big.txt containing 5 million records:

 [ damien $] seq 1 1 5000000 > file1.big.txt
 [ damien $] sed -i 's|$| bob  fsgfq  ddd@ioaio.fr|' file1.big.txt
 [ damien $] head file1.big.txt
1 bob  fsgfq  ddd@ioaio.fr
2 bob  fsgfq  ddd@ioaio.fr
3 bob  fsgfq  ddd@ioaio.fr
4 bob  fsgfq  ddd@ioaio.fr
5 bob  fsgfq  ddd@ioaio.fr
6 bob  fsgfq  ddd@ioaio.fr
7 bob  fsgfq  ddd@ioaio.fr
8 bob  fsgfq  ddd@ioaio.fr
9 bob  fsgfq  ddd@ioaio.fr
10 bob  fsgfq  ddd@ioaio.fr
 [ damien $] tail file1.big.txt
4999991 bob  fsgfq  ddd@ioaio.fr
4999992 bob  fsgfq  ddd@ioaio.fr
4999993 bob  fsgfq  ddd@ioaio.fr
4999994 bob  fsgfq  ddd@ioaio.fr
4999995 bob  fsgfq  ddd@ioaio.fr
4999996 bob  fsgfq  ddd@ioaio.fr
4999997 bob  fsgfq  ddd@ioaio.fr
4999998 bob  fsgfq  ddd@ioaio.fr
4999999 bob  fsgfq  ddd@ioaio.fr
5000000 bob  fsgfq  ddd@ioaio.fr
 [ damien $]

and

[ damien $]
 [ damien $] seq 2 2 5000000 > file2.big.txt
 [ damien $]  sed -i 's|$| baby  france  paris|' file2.big.txt
 [ damien $] head file2.big.txt
2 baby  france  paris
4 baby  france  paris
6 baby  france  paris
8 baby  france  paris
10 baby  france  paris
12 baby  france  paris
14 baby  france  paris
16 baby  france  paris
18 baby  france  paris
20 baby  france  paris
 [ damien $] tail file2.big.txt
4999982 baby  france  paris
4999984 baby  france  paris
4999986 baby  france  paris
4999988 baby  france  paris
4999990 baby  france  paris
4999992 baby  france  paris
4999994 baby  france  paris
4999996 baby  france  paris
4999998 baby  france  paris
5000000 baby  france  paris
 [ damien $]

with only even-numbered keys. Running my script gives:

 [ damien $]
 [ damien $] time ./n_join.awk file2.big.txt file1.big.txt > output.big

real    0m4.154s
user    0m3.893s
sys     0m0.207s
 [ damien $]
 [ damien $] head output.big
1 bob  fsgfq  ddd@ioaio.fr
3 bob  fsgfq  ddd@ioaio.fr
5 bob  fsgfq  ddd@ioaio.fr
7 bob  fsgfq  ddd@ioaio.fr
9 bob  fsgfq  ddd@ioaio.fr
11 bob  fsgfq  ddd@ioaio.fr
13 bob  fsgfq  ddd@ioaio.fr
15 bob  fsgfq  ddd@ioaio.fr
17 bob  fsgfq  ddd@ioaio.fr
19 bob  fsgfq  ddd@ioaio.fr
 [ damien $] tail output.big
4999981 bob  fsgfq  ddd@ioaio.fr
4999983 bob  fsgfq  ddd@ioaio.fr
4999985 bob  fsgfq  ddd@ioaio.fr
4999987 bob  fsgfq  ddd@ioaio.fr
4999989 bob  fsgfq  ddd@ioaio.fr
4999991 bob  fsgfq  ddd@ioaio.fr
4999993 bob  fsgfq  ddd@ioaio.fr
4999995 bob  fsgfq  ddd@ioaio.fr
4999997 bob  fsgfq  ddd@ioaio.fr
4999999 bob  fsgfq  ddd@ioaio.fr
 [ damien $] wc -l output.big
2500000 output.big
 [ damien $]

where the whole thing completes in about 4 seconds, which doesn't seem at all prohibitive. Either there's a big difference in the data sets, or your script operates significantly differently from mine. Maybe this is useful to somebody. :/

P.S. I have an i7-3770 CPU @ 3.40GHz, according to /proc/cpuinfo.

user3065349
  • Other than the file size, it's not clear to me why this wouldn't work. Presumably you cannot make the files available for me to try? – user3065349 Feb 12 '15 at 08:06

Try this in bash:

join -v 1 -j 1 file1.txt file2.txt

The result is:

8 tim qqjh hjahj@uayua.com
9 john ajkajk rtaeraer@auiaui.com

The parameters are as follows:

-j 1 is the field to join on, i.e. the first field

-v 1 means print only the unpairable lines from the first file (suppressing the joined lines)

One caveat: join expects its inputs sorted lexically on the join field, so with purely numeric keys (where 10 sorts before 2 as text) you may need to run each file through sort -k1,1 first.

David K-J
  • This is not what I'm trying to do. I want to retrieve the list of items that don't match the second file (not join those that match). – Damien Feb 11 '15 at 16:24
  • Forget for a moment that the command is called `join`... the output is the two rows in file1 that aren't in file2, which is what you asked for: "the lines from File1 which don't appear in File2 (according to the first column, so just the first column is used for comparison)." – David K-J Feb 11 '15 at 17:15