0

I have two email lists. One is a newline-delimited file of just email addresses, 100k lines. The second file has one email,date,ipaddress record per line; it is 4M lines and contains duplicates, which I am not concerned with.

grep -f fileA.txt fileB.txt works when fileA.txt is a test file of 100 or 1000 lines, but with the full 100k lines it isn't getting anywhere.

I'm open to perl as well :)

Lloyd
  • 1

5 Answers

1

When faced with this kind of thing, and I don't or can't store all of one file in an array as Eric suggested, I resort to a slightly unconventional approach: each file is exported to a separate table in a database (I like Perl for this part) and the desired results are obtained via SQL queries.
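A minimal sketch of that approach using the sqlite3 command-line shell rather than Perl (an assumption on my part; the table names and the tiny stand-in files are made up, and the real fileA.txt/fileB.txt would be imported instead):

```shell
# Stand-in data; the real lists would be imported instead.
printf 'bob@example.com\n' > fileA.txt
printf 'bob@example.com,2011-02-17,10.0.0.1\ncarol@example.com,2011-02-17,10.0.0.2\n' > fileB.txt

# One table per file, then a join on the email column.
sqlite3 match.db <<'EOF'
CREATE TABLE wanted (email TEXT);
CREATE TABLE log (email TEXT, date TEXT, ip TEXT);
.mode csv
.import fileA.txt wanted
.import fileB.txt log
SELECT log.email, log.date, log.ip FROM log JOIN wanted USING (email);
EOF
```

With an index on the email columns, the database does the hash/merge join work that grep -f would otherwise do pattern by pattern.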

John Gardeniers
  • 27,458
  • 12
  • 55
  • 109
0

I'm assuming Linux here. I would try creating a ramdisk and putting both files in it; that may be the fastest thing to try. Put this line in fstab, then run mount /mnt for a really quick way of setting it up:

ramdisk /mnt tmpfs mode=1777,size=1G
carson
  • 1,630
  • 11
  • 15
  • I don't think speed is the issue. The OP seems to be asking how to do this and with only 100k lines that should occur in the blink of an eye. – John Gardeniers Feb 18 '11 at 02:00
0

You may be able to speed it up a bit by using the -F option so it's searching for fixed strings.

grep -Ff fileA.txt fileB.txt

Did you time your tests? What does extrapolating that time to the larger files tell you?
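One way to run that timing test (a sketch with tiny stand-in files; a real measurement would slice the actual fileA.txt): time grep -Ff against a fixed-size sample of the pattern file, then extrapolate to the full 100k patterns.

```shell
# Stand-in data; a real test would use the actual lists.
printf 'bob@example.com\nalice@example.com\n' > fileA.txt
printf 'bob@example.com,2011-02-17,10.0.0.1\n' > fileB.txt

# Time a fixed slice of the pattern file, then extrapolate upward.
head -n 1000 fileA.txt > sample.txt
time grep -Ff sample.txt fileB.txt
```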

Dennis Williamson
  • 62,149
  • 16
  • 116
  • 151
  • A test I just performed showed a considerable difference when using `-F`. `time grep -Fqsf 30000-lines 900000-lines` took about 1.2% of the time of `time grep -qsf 10000-lines 900000-lines`! Note that's with *three times as many lines* in the pattern file! – Dennis Williamson Feb 18 '11 at 01:55
0

Sort and then diff them? That ought to work.
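A sketch of that idea using comm rather than diff (my substitution: diff's output would need post-filtering, while comm -12 prints exactly the lines common to both sorted inputs). The email column is cut out of the big file first; tiny stand-in files replace the real ones:

```shell
# Stand-in data for fileA.txt (emails) and fileB.txt (email,date,ip).
printf 'alice@example.com\nbob@example.com\n' > fileA.txt
printf 'bob@example.com,2011-02-17,10.0.0.1\ncarol@example.com,2011-02-17,10.0.0.2\n' > fileB.txt

# comm requires sorted input; -12 suppresses lines unique to either file.
cut -d, -f1 fileB.txt | sort -u > b.sorted
sort -u fileA.txt > a.sorted
comm -12 a.sorted b.sorted
```

This prints only the matching addresses; getting the full date/ip records back would take one more lookup pass against fileB.txt.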

mfinni
  • 36,144
  • 4
  • 53
  • 86
0

In Perl:

#!/usr/bin/perl -w
use strict;

# Read the list of wanted addresses into an array.
my @emails;
if ( open( my $emailfile, '<', '/path/fileA.txt' ) )
{
  chomp( @emails = <$emailfile> );
  close $emailfile;
}

# For each address, scan the big file and print matching lines.
foreach my $email ( @emails )
{
  if ( open( my $file2, '<', '/path/fileB.txt' ) )
  {
    while ( <$file2> )
    {
      print $_ if /\Q$email\E/;
    }
    close $file2;
  }
}
Eric Fossum
  • 225
  • 3
  • 11
  • Won't that try to read the whole file into memory, for each one of the loops? – coredump Feb 18 '11 at 01:46
  • true, sorry, you can also place the foreach loop inside the while loop to only iterate the larger file once. I just went a little too fast for myself :/ – Eric Fossum Feb 18 '11 at 01:53