I am trying to split a big file into separate files, each containing the information for a single variable (sample) in the file.

My input file looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  PID008SM

...info here 1.....

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL001-SC

....info here 2....

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL001-SC

....info here 3....        

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  PID008SM

....info here 4....

In this case I would like to create two output files (one for PID008SM and one for CL001-SC), each with the information related to that sample.

Output for CL001-SC:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL001-SC

....info here 2...

....info here 3...

Output for PID008SM:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  PID008SM
....info here 1....

....info here 4....

The script that I have used is in Perl, but any suggestion is more than welcome. Thank you in advance.

code:

#!/usr/bin/perl;
use strict;
use warnings;

my $file1 = $ARGV[0] ;
my $file2 = $ARGV[1];

open (F1, $file1); #Opens first .vcf file for comparison
open (F2, $file2); #2nd for comparison

my %file;

## Create the hash key with each line of the file2
while (<F2> ) {
        #chomp;
        $file{$_}='';
}

## Print the line , if key   exist in the hash ;       

foreach my $string (<F1>) {

        if ( exists $file{$_}) and ($string =~ /(#)(.+?)(#)/s) {
                print $string;
        }
}
  • So you have the same header over and over again, with paragraphs in between? Please specify more precisely so we can give a more accurate answer. – fedorqui Apr 19 '13 at 11:28
  • Your code appears to have nothing to do with your question. Your question is about splitting a file into two, while the code compares two different files for lines that match. Please clarify. –  Apr 19 '13 at 11:31
  • @fedorqui Correct, I have the same header over and over, with paragraphs in between. I would like to extract the information in between for each sample and (if possible) keep one header. Apologies for not being more exhaustive in the description above. – Gaia Andreoletti Apr 19 '13 at 14:05
  • @dan1111 I was trying with that sample perl script just to see if I was able to find a match between the two files and then print the output in a different file. I am sorry for not being clear. – Gaia Andreoletti Apr 19 '13 at 14:07

2 Answers

Something like this perhaps?

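use strict;
use warnings;

open my $fh, '<', 'chrom.txt' or die $!;

# Output file handles, keyed by sample name
my %fh;

while (<$fh>) {

  if ( /^#CHROM/ ) {

    # The sample name is the last whitespace-separated field of the header
    my $name =  (split)[-1];

    # Sample seen before: switch output to its file and skip the repeated header
    if ($fh{$name}) {
      select $fh{$name};
      next;
    }

    # New sample: create its output file and make it the default for print
    my $file = "$name.txt";
    open $fh{$name}, '>', $file or die qq{Unable to open "$file" for output: $!};
    print STDOUT qq{Created file "$file"\n};
    select $fh{$name};
  }

  # Goes to the currently selected handle (STDOUT until the first header)
  print;
}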
Borodin
  • Thanks a lot! So basically you create a hash entry for each header and create an output file for each key. And then how does it put the information between headers into each output file? Thanks for the help! – Gaia Andreoletti Apr 19 '13 at 14:15
  • The `select` function switches the default output file handle so that `print` without a file handle parameter outputs to the last selected handle. `STDOUT` is the selected handle when any program starts. Please accept the answer that you think best answers your question. – Borodin Apr 19 '13 at 15:26
awk '/^#CHROM/{typ=$10;a[$0]++} a[$0]<2{print >> typ}' inputFile

This awk script seems to work.
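
For anyone who wants to try the approach on a small file first, here is a self-contained sketch of the same idea written out with comments. It is a variant, not the one-liner itself. The sample data lines, the `.txt` suffix, and the names `out` and `seen` are assumptions for illustration. It takes the sample name from the last field (`$NF`) rather than `$10`, and it uses `>` instead of `>>` so that a rerun starts from empty files rather than appending to the previous run's output.

```shell
#!/bin/sh
# Build a small sample input (made-up data lines) in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PID008SM' \
  'info 1' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CL001-SC' \
  'info 2' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CL001-SC' \
  'info 3' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PID008SM' \
  'info 4' > inputFile

awk '
  /^#CHROM/ {
    out = $NF ".txt"          # sample name is the last field of the header
    if (seen[$NF]++) next     # header already written for this sample
  }
  out != "" { print > out }   # ">" truncates on first open, then keeps appending
' inputFile
```

Each output file gets its header once, followed by every block that appeared under that header in the input. awk keeps each output file open for the whole run, so a file with a very large number of distinct samples could hit the open-file limit.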

Sidharth C. Nadhan