The fact that the files are of different lengths doesn't rule out an awk solution. Taking your two inputs:
[ damien $] cat file1.txt
1 robert youh xpla@ioaio.fr
2 patrick yuad qqqq@ioaio.fr
3 bob fsgfq ddd@ioaio.fr
8 tim qqjh hjahj@uayua.com
9 john ajkajk rtaeraer@auiaui.com
[ damien $] cat file2.txt
1 baby france paris
2 father usa detroit
3 mother uk london
4 baby italy milan
[ damien $]
consider the following script:
[ damien $] cat n_join.awk
#!/usr/bin/gawk -f
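# First file on the command line (file2.txt): remember each key.
NR==FNR{
    # file2[$1]=$0;
    file2[$1]=1;
}
# Second file (file1.txt): print records whose key was not seen.
NR!=FNR{
    if(!($1 in file2)){
        # print current record if not in file2
        print ;
    }else{
        # $1 from file1 was found in file2. If keys
        # are really unique in file1, the entry can
        # be deleted from file2. Otherwise comment
        # out this line:
        delete file2[$1];
    }
}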
[ damien $]
which gives the output:
[ damien $] chmod +x n_join.awk
[ damien $] ./n_join.awk file2.txt file1.txt
8 tim qqjh hjahj@uayua.com
9 john ajkajk rtaeraer@auiaui.com
[ damien $]
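Note that file2.txt must be passed in first, because the NR==FNR block that loads the keys only matches the first file on the command line. I have no idea whether this will work on a file that's 2 million lines long, but I would be interested to know if you have the time to try it. :)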
If you can make the files available (unlikely), I'll try it myself... :D
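For what it's worth, the same lookup can be condensed into a one-liner. This is only a minimal sketch, not the script above verbatim: the array name `seen` and the output file name only_in_file1.txt are my own, and it never deletes keys, so it behaves like the script with the `delete` line commented out. The printf lines just recreate the sample inputs so the snippet runs on its own.

```shell
# Recreate the two sample inputs from above so the snippet is self-contained.
printf '%s\n' \
  '1 robert youh xpla@ioaio.fr' \
  '2 patrick yuad qqqq@ioaio.fr' \
  '3 bob fsgfq ddd@ioaio.fr' \
  '8 tim qqjh hjahj@uayua.com' \
  '9 john ajkajk rtaeraer@auiaui.com' > file1.txt
printf '%s\n' \
  '1 baby france paris' \
  '2 father usa detroit' \
  '3 mother uk london' \
  '4 baby italy milan' > file2.txt

# First file (file2.txt): remember every key, then skip to the next record.
# Second file (file1.txt): print records whose key was never seen.
awk 'NR==FNR { seen[$1]; next } !($1 in seen)' file2.txt file1.txt > only_in_file1.txt
cat only_in_file1.txt
```

One caveat that applies to this form and to the script alike: if file2.txt is empty, NR==FNR is also true for every record of file1.txt, so nothing gets printed.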
EDIT: I understand that you've accepted an answer and have probably moved on with your life; however, I would like to add some additional information.
If I create two large files of the type that you specify, with file1.big.txt containing 5 million records:
[ damien $] seq 1 1 5000000 > file1.big.txt
[ damien $] sed -i 's|$| bob fsgfq ddd@ioaio.fr|' file1.big.txt
[ damien $] head file1.big.txt
1 bob fsgfq ddd@ioaio.fr
2 bob fsgfq ddd@ioaio.fr
3 bob fsgfq ddd@ioaio.fr
4 bob fsgfq ddd@ioaio.fr
5 bob fsgfq ddd@ioaio.fr
6 bob fsgfq ddd@ioaio.fr
7 bob fsgfq ddd@ioaio.fr
8 bob fsgfq ddd@ioaio.fr
9 bob fsgfq ddd@ioaio.fr
10 bob fsgfq ddd@ioaio.fr
[ damien $] tail file1.big.txt
4999991 bob fsgfq ddd@ioaio.fr
4999992 bob fsgfq ddd@ioaio.fr
4999993 bob fsgfq ddd@ioaio.fr
4999994 bob fsgfq ddd@ioaio.fr
4999995 bob fsgfq ddd@ioaio.fr
4999996 bob fsgfq ddd@ioaio.fr
4999997 bob fsgfq ddd@ioaio.fr
4999998 bob fsgfq ddd@ioaio.fr
4999999 bob fsgfq ddd@ioaio.fr
5000000 bob fsgfq ddd@ioaio.fr
[ damien $]
and
[ damien $]
[ damien $] seq 2 2 5000000 > file2.big.txt
[ damien $] sed -i 's|$| baby france paris|' file2.big.txt
[ damien $] head file2.big.txt
2 baby france paris
4 baby france paris
6 baby france paris
8 baby france paris
10 baby france paris
12 baby france paris
14 baby france paris
16 baby france paris
18 baby france paris
20 baby france paris
[ damien $] tail file2.big.txt
4999982 baby france paris
4999984 baby france paris
4999986 baby france paris
4999988 baby france paris
4999990 baby france paris
4999992 baby france paris
4999994 baby france paris
4999996 baby france paris
4999998 baby france paris
5000000 baby france paris
[ damien $]
with only even-numbered keys (2.5 million records). Running my script gives:
[ damien $]
[ damien $] time ./n_join.awk file2.big.txt file1.big.txt > output.big
real 0m4.154s
user 0m3.893s
sys 0m0.207s
[ damien $]
[ damien $] head output.big
1 bob fsgfq ddd@ioaio.fr
3 bob fsgfq ddd@ioaio.fr
5 bob fsgfq ddd@ioaio.fr
7 bob fsgfq ddd@ioaio.fr
9 bob fsgfq ddd@ioaio.fr
11 bob fsgfq ddd@ioaio.fr
13 bob fsgfq ddd@ioaio.fr
15 bob fsgfq ddd@ioaio.fr
17 bob fsgfq ddd@ioaio.fr
19 bob fsgfq ddd@ioaio.fr
[ damien $] tail output.big
4999981 bob fsgfq ddd@ioaio.fr
4999983 bob fsgfq ddd@ioaio.fr
4999985 bob fsgfq ddd@ioaio.fr
4999987 bob fsgfq ddd@ioaio.fr
4999989 bob fsgfq ddd@ioaio.fr
4999991 bob fsgfq ddd@ioaio.fr
4999993 bob fsgfq ddd@ioaio.fr
4999995 bob fsgfq ddd@ioaio.fr
4999997 bob fsgfq ddd@ioaio.fr
4999999 bob fsgfq ddd@ioaio.fr
[ damien $] wc -l output.big
2500000 output.big
[ damien $]
where the whole thing completes in about 4 seconds, which doesn't seem at all prohibitive. Either there's a big difference in the data sets or your script operated significantly differently to mine. Maybe this is useful to somebody. :/
PS: I have an i7-3770 CPU @ 3.40GHz according to /proc/cpuinfo.
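If anybody wants to repeat the experiment without generating 5 million lines, here is a scaled-down sketch of the benchmark above. Assumptions on my part: the variable N, the *.small* file names, and the array name `keys` are made up for illustration, and the key lookup is inlined as an awk one-liner (without the `delete`) instead of calling n_join.awk.

```shell
N=1000   # hypothetical scale; the run above used 5000000

# file1: every key 1..N; file2: only the even keys 2..N.
seq 1 1 "$N" | sed 's|$| bob fsgfq ddd@ioaio.fr|' > file1.small.txt
seq 2 2 "$N" | sed 's|$| baby france paris|'      > file2.small.txt

# Load the keys of file2 into memory, then print file1 records with no match.
awk 'NR==FNR { keys[$1] = 1; next } !($1 in keys)' \
    file2.small.txt file1.small.txt > output.small

# Sanity check: count the surviving records and any even keys among them.
lines=$(awk 'END { print NR }' output.small)
even=$(awk '$1 % 2 == 0 { n++ } END { print n + 0 }' output.small)
echo "records kept: $lines, even keys left: $even"
```

With N=1000 this should report 500 records kept and 0 even keys left, which mirrors the 2500000 lines in output.big. Raising N and prefixing the awk command with `time` gives a rough idea of how the run time scales on your own machine.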