Perl script for combining 2 files with multiple entries

Question

I have a tab-delimited text file like this:

contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data

And another like this:

contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N

I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:

contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data

I have little scripting experience, but I have done this with hashes/keys in cases where e.g. "contig11" occurs only once in one of the files. But I can't even begin to get my head around how to do this! I'd really appreciate some help or hints as to how to tackle this problem.

EDIT: I have tried ikegami's suggestion (see answers) with the script below. It produces the output I need except for everything from the GO:100 column onwards ($rest in the script?). Any ideas what I'm doing wrong?

#!/usr/bin/env/perl

use warnings;

open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";

my %goterm;

while (<GOTERMS>)
{
    my($id, $rest) = /^(\S++)(,*)/s;
    push @{$goterm{$id}}, $rest;
}

while (my $row2 = <SNPS>)
{
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} })
    {
        print("$row2$rest\n");
    }
}

close GOTERMS;
close SNPS;

Is the order of the output lines important? If you use a hash for both files, the result will be out of the original order... Also, where do these lines come from? Do they come from a file, or are there programs generating them? — TrueY, Apr 15 '13 at 18:44
@TrueY The order of the output is not important to me. They are just files but with potentially tens or hundreds of thousands of lines in each file. — Amy Ellison, Apr 15 '13 at 18:48

ikegami · Accepted Answer · 2013-04-16T01:10:35.580

Look at your output. It's clearly produced by

for each row of the second file,
- for each row of the first file with the same id,
  - print out the combined rows

So the question is: How do you find the rows of the first file with the same id as a row of the second file?

The answer is: You store the rows of the first file in a hash indexed by the row's id.

my %file1;
while (<$file1_fh>) {
   my ($id, $rest) = /^(\S++)(.*)/s;
   push @{ $file1{$id} }, $rest;
}
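
As a rough illustration of what that hash ends up holding, here is a small self-contained sketch that runs the same regex over a few sample lines from the question. The sub name `index_by_id` and the in-memory `@lines` array are assumptions made for this example; the real script reads from a file handle instead.

```perl
use strict;
use warnings;

# Sample rows from the question's first file (tab-delimited).
my @lines = (
    "contig11\tGO:100\tother columns of data\n",
    "contig11\tGO:289\tother columns of data\n",
    "contig22\tGO:388\tother columns of data\n",
);

# Hypothetical helper: group the remainder of each row by its id,
# giving a hash whose values are array references (a hash of arrays).
sub index_by_id {
    my @rows = @_;
    my %by_id;
    for my $line (@rows) {
        # $id is the first run of non-whitespace; $rest is everything
        # after it, including the leading tab and the trailing newline.
        my ($id, $rest) = $line =~ /^(\S++)(.*)/s;
        push @{ $by_id{$id} }, $rest;
    }
    return \%by_id;
}

my $by_id = index_by_id(@lines);
for my $id (sort keys %$by_id) {
    printf "%s => %d row(s)\n", $id, scalar @{ $by_id->{$id} };
}
```

Each id maps to a list of all the rows that share it, which is what allows one id to appear many times in the first file.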

So the earlier pseudo code resolves to

while (my $row2 = <$file2_fh>) {
   chomp($row2);
   my ($id) = $row2 =~ /^(\S+)/;
   for my $rest (@{ $file1{$id} }) {
      print("$row2$rest");
   }
}

Putting the two pieces together gives a complete script:

#!/usr/bin/env perl

use strict;
use warnings;

open(my $GOTERMS, '<', $ARGV[0])
     or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, '<', $ARGV[1])
     or die("Error opening SNP file \"$ARGV[1]\": $!\n");

my %goterm;
while (<$GOTERMS>) {
    my ($id, $rest) = /^(\S++)(.*)/s;
    push @{ $goterm{$id} }, $rest;
}

while (my $row2 = <$SNPS>) {
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} }) {
        print("$row2$rest");
    }
}
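
For illustration, here is a small sketch of why the attempt in the question loses the GO columns. It runs both regexes on one sample line from the question; the variable names are made up for this example.

```perl
use strict;
use warnings;

my $line = "contig11\tGO:100\tother columns of data\n";

# Pattern from the question's attempt: `,*` means "zero or more commas".
# The character after the id is a tab, so the capture is the empty string.
my ($id_typo, $rest_typo) = $line =~ /^(\S++)(,*)/s;

# Pattern from this answer: `.*` with /s captures the rest of the line,
# including the trailing newline.
my ($id_ok, $rest_ok) = $line =~ /^(\S++)(.*)/s;

printf "with ,* the rest has %d characters\n", length $rest_typo;
printf "with .* the rest has %d characters\n", length $rest_ok;
```

Because `$rest` already ends in a newline when `.*` is used, printing `"$row2$rest\n"` would add a blank line after every row, which is why the `print` in the script above has no `\n`.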

I am trying this but not getting all the data from file1 - am I missing something? I have added my script to the question. — Amy Ellison, Apr 15 '13 at 19:12
If you change `,*` back to `.*` and you remove the `\n` you added, you get exactly the requested output. — ikegami, Apr 16 '13 at 01:08
Woops, what a stupid typo - thank you for your script and explanation - very helpful! — Amy Ellison, Apr 16 '13 at 12:44

score 0 · Answer 2 · answered Apr 15 '13 at 18:32

I will describe how you can do this. Put each file into an array (each line is an array item). Then you just need to compare these arrays in the way you need. You need two loops. The main loop runs over each record of the array/file that contains the string you will use for the comparison (in your example, the 2nd file). Inside this loop you need another loop over each record of the array/file you are comparing against. Then just check each record of one array against each record of the other array and process the results.

foreach my $record2 (@array2) {
    foreach my $record1 (@array1){
        if ($record2->{field} eq $record1->{field}){
            #here you need to create the string which you will show
            my $res_string = $record2->{field}.$record1->{field};
            print "$res_string\n";
        }
    }
}

Or don't use arrays. Just read the files and compare each line with each line of the other file. The general idea is the same ))
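
A minimal runnable sketch of the nested-loop idea, assuming each record is a hash reference with the hypothetical keys `id` (first column) and `rest` (the remaining columns). The helper names `parse_record` and `nested_join` are made up for this example, and the sample rows are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split a tab-delimited line into { id, rest }.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my ($id, $rest) = split /\t/, $line, 2;
    return { id => $id, rest => $rest };
}

# Compare every record of the second file with every record of the
# first, and collect a combined row for each pair with the same id.
sub nested_join {
    my ($array1, $array2) = @_;
    my @out;
    for my $record2 (@$array2) {
        for my $record1 (@$array1) {
            next unless $record2->{id} eq $record1->{id};
            push @out, join "\t", $record2->{id}, $record2->{rest}, $record1->{rest};
        }
    }
    return @out;
}

my @array1 = map { parse_record($_) } (
    "contig11\tGO:100\tother columns of data\n",
    "contig22\tGO:388\tother columns of data\n",
);
my @array2 = map { parse_record($_) } (
    "contig11\t3\tN\n",
    "contig11\t1\tY\n",
);

print "$_\n" for nested_join(\@array1, \@array2);
```

Note that this compares every line of one file with every line of the other, so the work grows with the product of the two line counts. With hundreds of thousands of lines per file, the hash lookup in the accepted answer is likely to be much faster.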
