0

I have two email lists. One is a newline-delimited file of just email addresses, 100k lines. The second file has one email,date,ipaddress record per line; it is 4M lines and contains duplicates, which I am not concerned with.

grep -f fileA.txt fileB.txt works when fileA.txt is a test file of 100 or 1000 lines, but with the full 100k lines it isn't getting anywhere.

I'm open to perl as well :)

Lloyd
  • 1

5 Answers

1

When faced with this kind of thing, and I don't or can't store all of one file in an array as Eric suggested, I resort to a slightly unconventional approach: each file is exported to a separate table in a database (I like Perl for this part) and the desired results are obtained via SQL queries.
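A minimal sketch of that approach using the sqlite3 command-line shell rather than Perl (an assumption on my part; the table names and the tiny stand-in files are made up, and the real fileA.txt/fileB.txt would be imported instead):

```shell
# Stand-in data; the real lists would be imported instead.
printf 'bob@example.com\n' > fileA.txt
printf 'bob@example.com,2011-02-17,10.0.0.1\ncarol@example.com,2011-02-17,10.0.0.2\n' > fileB.txt

# One table per file, then a join on the email column.
sqlite3 match.db <<'EOF'
CREATE TABLE wanted (email TEXT);
CREATE TABLE log (email TEXT, date TEXT, ip TEXT);
.mode csv
.import fileA.txt wanted
.import fileB.txt log
SELECT log.email, log.date, log.ip FROM log JOIN wanted USING (email);
EOF
```

With an index on the email columns, the database does the hash/merge join work that grep -f would otherwise do pattern by pattern.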

John Gardeniers
  • 27,458
  • 12
  • 55
  • 109
0

I'm assuming Linux here. I would try creating a ramdisk and putting both files in it; that may be the fastest thing to try. Put this line in fstab, then run mount /mnt for a really quick way of setting it up:

ramdisk /mnt tmpfs mode=1777,size=1G
carson
  • 1,630
  • 11
  • 15
  • I don't think speed is the issue. The OP seems to be asking how to do this and with only 100k lines that should occur in the blink of an eye. – John Gardeniers Feb 18 '11 at 02:00
0

You may be able to speed it up a bit by using the -F option so it's searching for fixed strings.

grep -Ff fileA.txt fileB.txt

Did you time your tests? What does extrapolating that time to the larger files tell you?
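One way to run that timing test (a sketch with tiny stand-in files; a real measurement would slice the actual fileA.txt): time grep -Ff against a fixed-size sample of the pattern file, then extrapolate to the full 100k patterns.

```shell
# Stand-in data; a real test would use the actual lists.
printf 'bob@example.com\nalice@example.com\n' > fileA.txt
printf 'bob@example.com,2011-02-17,10.0.0.1\n' > fileB.txt

# Time a fixed slice of the pattern file, then extrapolate upward.
head -n 1000 fileA.txt > sample.txt
time grep -Ff sample.txt fileB.txt
```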

Dennis Williamson
  • 62,149
  • 16
  • 116
  • 151
  • A test I just performed showed a considerable difference when using `-F`. `time grep -Fqsf 30000-lines 900000-lines` took about 1.2% of the time of `time grep -qsf 10000-lines 900000-lines`! Note that's with *three times as many lines* in the pattern file! – Dennis Williamson Feb 18 '11 at 01:55
0

Sort and then diff them? That ought to work.
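A sketch of that idea using comm rather than diff (my substitution: diff's output would need post-filtering, while comm -12 prints exactly the lines common to both sorted inputs). The email column is cut out of the big file first; tiny stand-in files replace the real ones:

```shell
# Stand-in data for fileA.txt (emails) and fileB.txt (email,date,ip).
printf 'alice@example.com\nbob@example.com\n' > fileA.txt
printf 'bob@example.com,2011-02-17,10.0.0.1\ncarol@example.com,2011-02-17,10.0.0.2\n' > fileB.txt

# comm requires sorted input; -12 suppresses lines unique to either file.
cut -d, -f1 fileB.txt | sort -u > b.sorted
sort -u fileA.txt > a.sorted
comm -12 a.sorted b.sorted
```

This prints only the matching addresses; getting the full date/ip records back would take one more lookup pass against fileB.txt.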

mfinni
  • 36,144
  • 4
  • 53
  • 86
0

In Perl:

#!/usr/bin/perl -w
use strict;

# Read the list of wanted addresses into an array.
my @emails;
if ( open( my $emailfile, '<', '/path/fileA.txt' ) )
{
  chomp( @emails = <$emailfile> );
  close $emailfile;
}

# For each address, scan the big file and print matching lines.
foreach my $email ( @emails )
{
  if ( open( my $file2, '<', '/path/fileB.txt' ) )
  {
    while ( <$file2> )
    {
      print $_ if /\Q$email\E/;
    }
    close $file2;
  }
}
Eric Fossum
  • 225
  • 3
  • 11
  • Won't that try to read the whole file into memory, for each one of the loops? – coredump Feb 18 '11 at 01:46
  • true, sorry, you can also place the foreach loop inside the while loop to only iterate the larger file once. I just went a little too fast for myself :/ – Eric Fossum Feb 18 '11 at 01:53