2

I have a list of genes and the following information:

  • Their name 'XLOC_0000...'
  • The genomic scaffold on which they're located 'Scaffold...'
  • The location of each feature on its Scaffold ('start', 'stop')

I've written a piece of Perl code that finds each gene in the genomic scaffolds and saves it to a file. Briefly, first I put each gene in a hash of arrays, e.g.

 my %geneID = map { $xloc[$_] => [ $scaffold[$_], $start[$_], $stop[$_] ] } (0 .. $#xloc);

I then make a hash of the fasta file containing scaffolds:

open FASTA, '<', 'genome.fasta' || die "Can't open 'genome.fasta'\n"; #Read in 'fasta' file
my (@head, @sequence);
while (<FASTA>) { 
    chomp;
    push @head, $_ if /^>/;     
    push @sequence, $_ if /^[A-Z]/;
}

my %scaf;
@scaf{@head} = @sequence; # All scaffolds, as ordered in FH.

Then I assign the elements of the first HoA, and using substr, find the gene's start and stop position within the scaffold of the same name

foreach my $xloc (sort keys %geneID) {
        print "gene sequence for $xloc is: ";
        my $chm = @{$geneID{$xloc}}[0];
        my $start = @{$geneID{$xloc}}[1];  
        my $end = @{$geneID{$xloc}}[2];
        my $seq = substr($scaf{$chm},$start-1,$end-($start-1));         
        print "$seq\n"; 
}

The problem with this is that if I have xlocs with the same name, e.g. XLOC_00001, the the hash key only takes the last value. I want to be able to add multiple 'sub-values' to each hash, find their locations using substr, and essentially join them together at the end.

Any suggestions on how to do this?


UPDATE:

This is a test example showing the sort of results I get:

'GENOME' FASTA FILE

>Scaffold1
ONEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold2
TWOATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold3
THREEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold4
FOURATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold5
FIVEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold6
SIXATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold7
SEVENATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold8
EIGHTATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold9
NINEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold10
TENATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA

KEYS and values for %geneID:

Key: XLOC_000027 contains the values: >Scaffold1 1 10 
Key: XLOC_000037 contains the values: >Scaffold2 1 15 
Key: XLOC_000038 contains the values: >Scaffold3 2 9 
Key: XLOC_000051 contains the values: >Scaffold4 6 8 
Key: XLOC_000077 contains the values: >Scaffold5 2 7 
Key: XLOC_000079 contains the values: >Scaffold6 4 16 
Key: XLOC_000096 contains the values: >Scaffold7 4 9 
Key: XLOC_000100 contains the values: >Scaffold8 3 20 
Key: XLOC_000117 contains the values: >Scaffold9 6 8 
Key: XLOC_000119 contains the values: >Scaffold10 7 14 

Results, showing 'gene' as substring of scaffold on which it's located for each XLOC:

gene sequence for XLOC_000027 is: ONEATCGCG
gene sequence for XLOC_000037 is: TWOATCGCGCTTAG
gene sequence for XLOC_000038 is: HREEATCG
gene sequence for XLOC_000051 is: TCGCGCT
gene sequence for XLOC_000077 is: IVEATC
gene sequence for XLOC_000079 is: ATCGCGCTTAGTGCA
gene sequence for XLOC_000096 is: ENATCGCG
gene sequence for XLOC_000100 is: GHTATCGCGCTTAGTGCAG
gene sequence for XLOC_000117 is: TCGCGCT
gene sequence for XLOC_000119 is: GCGCTTAGTGCAG
cHao
  • 84,970
  • 20
  • 145
  • 172
fugu
  • 6,417
  • 5
  • 40
  • 75
  • I think you need to show some example data. I get most of it but cannot understand *"find their locations using substr"*. It is also unclear how `%geneID` and `@xloc` are related to `%scaf`, and which hash it is that has duplicate keys. – Borodin Jun 20 '13 at 15:54
  • Added an update to show output – fugu Jun 20 '13 at 16:05
  • @Nick Have you considered using a module for this? Such as [`BioPerl`](http://search.cpan.org/perldoc?BioPerl). No sense in trying to reinvent the wheel. – TLP Jun 20 '13 at 19:23

2 Answers2

2

It sounds like you need to push each set of (scaffold, start, stop) values onto an array for each element of the %geneID hash. Like this

my %geneID;
push @{ $geneID{ $xloc[$_] } }, [ $scaffold[$_], $start[$_], $stop[$_] ] for 0 .. $#xloc;

Then, once the %scaf hash has been built, you can construct a concatentation of the subsequences in a loop over all constituents of the sequence.

for my $xloc (sort keys %geneID) {

  my $sequence;
  for my $part (@{ $geneID{$xloc} }) {
    my ($chm, $start, $end) = @$part;
    my $off = $start - 1;
    my $len = $end - $off;
    $sequence .= substr $scaf{$chm}, $off, $len;
  }

  print "gene sequence for $xloc is: $sequence\n";
}

I hope this helps.


Update

By the way, you have a bug in your file open statement.

open FASTA, '<', 'genome.fasta' || die "Can't open 'genome.fasta'\n"

is the same as

open FASTA, '<', ('genome.fasta' || die "Can't open 'genome.fasta'\n")

and because the filename is always true (unless it is 0) die will never be called.

Idiomatically you should use the lower-priority or operator, together with a lexical file handle as global file handles are considered bad practice.

open my $fasta, '<', 'genome.fasta' or die "Can't open 'genome.fasta'\n"

And, if it matters to you, putting a \n on the end of your die string prevents perl from displaying the file and line number where the error occurred.

This whole loop is better written

my $fasta_file = 'genome.fasta';
open my $fasta, '<', $fasta_file or die "Can't open '$fasta_file'";
my (%scaf, $scaffold);
while (<$fasta>) {
  chomp;
  $scaffold = $_ if /^>/;   
  $scaf{$scaffold} = $_ if /^[A-Z]/;
}
Borodin
  • 126,100
  • 9
  • 70
  • 144
  • Thanks - this looks great. I'll have a play around with this approach and let you know how it goes. I really appreciate your help! Nick – fugu Jun 21 '13 at 09:42
  • @ Borodin - I think I understand the loops you've suggested: Data Dumper output shows exactly what I want when I use your suggestion. However, the two loops through %geneID show the following errors: "Use of uninitialized value in substr" and "Use of uninitialized value in concatenation (.) or string at" – fugu Jun 21 '13 at 14:24
  • @Nick: That error arises because `$scaf{$chm}` doesn't exist. i.e. the `@scaffold` array contains a value that doesn't correspond to any of the keys of the `%scaf` hash. – Borodin Jun 21 '13 at 16:23
1

If you know you are going to have duplicates you can create your hashes in this format:

use strict;
use warnings FATAL => 'all';

use Data::Dumper;

my %hash;
push @{$hash{key1}}, 'Value1';
push @{$hash{key2}}, 'Value2';
push @{$hash{key1}}, 'Value3';

print Dumper ( \%hash );

Perl has a property called autovivification, which will allows it to create the hash value if it doesnt exist, and append to one if it does. (assuming you are using list-context)

$VAR1 = {
          'key2' => [
                      'Value2'
                    ],
          'key1' => [
                      'Value1',
                      'Value3'
                    ]
        };

You can now get arrayref from your hash key and examine all of the Xlocs (whatever that is).

Hunter McMillen
  • 59,865
  • 24
  • 119
  • 170