
I have been trying to find values that match between two columns (column a and column b) of a large file and print the common values, plus the corresponding column d. I have been doing this by iterating through hashes; however, because the file is so large, there is not enough memory to produce the output file. Is there any other way to do the same thing using less memory?

Any help is much appreciated.

The script I have written thus far is below:

#!/usr/bin/perl
use warnings;
use strict;

open (FILE1, "<input.txt") || die "$!\n Couldn't open input.txt\n";
open (Output, ">output.txt")||die "Can't Open output.txt ";
my $hash1={};
my $hash2={};

while (<FILE1>) {
    chomp (my $line=$_);
    my ($a, $b, $c, $d) = split (/\t/, $line);

    if ($a) {
        $hash1->{$a}{info1} = "$d"; #original_ID-> YOB
    }
    if ($b) {
        $hash2->{$b}{info2} = "$a"; #original_ID-> sire
    }

    foreach my $key (keys %$hash2) {
        if (exists $hash1->{$a}) {
            my $info1 = $hash1->{$a}{info1};
            print "$a\t$info1\n";
        }
    }
}

close FILE1;
close Output;
print "Done\n";

To clarify, the input file is a large pedigree file. An example is:

1    2   3   1977
2    4   5   1944
3    4   5   1950
4    5   6   1930
5    7   6   1928

An example of the output file is:

2   1944
4   1950
5   1928
  • Can you provide a small snippet of the input file and the desired output ? – Georgi Rangelov Jun 11 '15 at 07:30
  • http://perl.goeszen.com/working-with-very-large-hashes.html and sidenote: you can get rid of extra level of hash `info1,info2` – mpapec Jun 11 '15 at 07:39
  • @GeorgiRangelov, I have added an example of the input and output to my original post. – e2121 Jun 11 '15 at 09:14
  • how big is your file? Have you considered any of the `tie` modules? – Patrick J. S. Jun 11 '15 at 09:44
  • You *might* want to move the `foreach` until after the `while` unless that repeat behavior is important. – Jim Davis Jun 11 '15 at 15:40
  • For large amount of data where you're matching across multiple columns, I would suggest putting the data into a SQLite database and doing SQL queries. It will be faster, more flexible and more efficient. – Schwern Jun 11 '15 at 19:35
  • The program does not produce the output from the input as shown; you should provide consistent information. Also it is unclear what you mean by _corresponding column d_: column d from the line with matching column a, or column d from the line with matching column b. – Armali Nov 06 '15 at 09:47
  • Also: Never use `$a` like that. One letter var names are bad anyway, but `$a` and `$b` have a particular meaning in perl (used for `sort`). – Sobrique Nov 06 '15 at 10:48
  • And what are the columns anyway. Given you've got 4 columns, 3 of which have similar data, how do they correlate? (You don't seem to use `$c` at all?) – Sobrique Nov 06 '15 at 11:00
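
A rough sketch of the SQLite route Schwern mentions above, assuming DBI and DBD::SQLite are installed; the database, table and column names (pedigree.db, pedigree, id/sire/dam/yob) are placeholders invented for this illustration, not anything from the original question:

#!/usr/bin/perl
# Sketch only: load the pedigree into SQLite, then let SQL find the matches.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=pedigree.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS pedigree (id TEXT, sire TEXT, dam TEXT, yob TEXT)");

my $ins = $dbh->prepare("INSERT INTO pedigree (id, sire, dam, yob) VALUES (?, ?, ?, ?)");
open my $fh, '<', 'input.txt' or die "Couldn't open input.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /\t/, $line;
    $ins->execute(@fields[0..3]);
}
close $fh;
$dbh->commit;

# column a values that also occur somewhere in column b, with their column d
my $sth = $dbh->prepare(
    "SELECT DISTINCT id, yob FROM pedigree WHERE id IN (SELECT sire FROM pedigree)"
);
$sth->execute;
while (my ($id, $yob) = $sth->fetchrow_array) {
    print "$id\t$yob\n";
}
$dbh->disconnect;

Only one row at a time is held in Perl's memory; the matching work is done by SQLite on disk.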

1 Answer


Does the below work for you?

#!/usr/local/bin/perl

use strict;
use warnings;
use DBM::Deep;
use List::MoreUtils qw(uniq);

my @seen;

# tie the hash to a file on disk so the column a => column d map
# is not held in memory
my $db = DBM::Deep->new(
    file => "foo.db",
    autoflush => 1
);

while (<>) {
    chomp;
    my @fields = split /\s+/;
    # store column a => column d in the on-disk hash
    $$db{$fields[0]} = $fields[3];
    # remember every value seen in column b
    push @seen, $fields[1];
}

# print each unique column b value that also appeared in column a,
# together with the column d stored for it
for (uniq @seen) {
    print $_ . " " . $$db{$_} . "\n" if exists $$db{$_};
}
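
If the script is saved as, say, match.pl (the name is only for illustration), the <> operator reads whatever filenames are passed on the command line, so a run could look like:

perl match.pl input.txt > output.txt

Only the list of column b values (@seen) is kept in memory; the column a => column d map lives on disk in foo.db via DBM::Deep.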
    FYI, dbm-deep has a [memory leak](https://github.com/robkinyon/dbm-deep/issues/11) somewhere. In my experience, it won't really reduce memory usage on a large file, it will just slow things down. – SES Jun 12 '15 at 17:41