
My data is in the format shown below. For a specific set of CCNA locus tags, I want to extract the corresponding location values in the form (202-1107, 1700-2557,..). How can I do this in Perl?

>lcl|CP001340.1_cds_ACL93468.1_1 [locus_tag=CCNA_00001] [protein=pyruvate, phosphate dikinase regulatory protein] [protein_id=ACL93468.1] [location=202..1107] [gbkey=CDS]
GTGGTTAAGCAACCGTTAACGGATGATCCACAGGAGAGTCTGGCGCAGGGCGAGAGCGAAAGGCTGCCGC
CACGCTTCGCCACCTACTTCCATATCCACTTGGTTTCAGACTCCACAGGCGAGACGCTGAACGCGATGGC
GCGGGCGGT
>lcl|CP001340.1_cds_ACL93470.1_3 [locus_tag=CCNA_00003] [protein=shikimate 5-dehydrogenase] [protein_id=ACL93470.1] [location=1700..2557] [gbkey=CDS]
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
GCCCGGTGATCCACAACGCCTGGATCGCAGCGGCCGGCCTTGACGCGGCTTATGTGCCATTCGCCCCGGC
Polar Bear
  • Look for a module that can read those kind of files. For example on metacpan. – TLP Sep 04 '21 at 09:37
  • Bioperl probably has a reader for the format (looks like fasta?) – Shawn Sep 04 '21 at 09:50
  • Have you tried anything? Are you having some specific issues? If so, please show us your code and explain what specific problems you have. – Dada Sep 04 '21 at 10:31
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Sep 08 '21 at 07:35

1 Answer


The following sample code demonstrates how you can:

  • define the search criteria
  • read the data file
  • form an array of records
  • fill a data structure with genome data chunks
  • output the data matching the search criteria
use strict;
use warnings;
use feature 'say';

my($protein_id,$location) = ('ACL93470.1','1700..2557');  # search criteria
my(@chain,$genome);
my($proteins,$id);

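# read genome data into array @chain, one record (header + sequence) per element

while( <DATA> ) {
    push @chain, $genome if defined $genome and /^>lcl/;  # new header: store the previous record
    $genome = undef if /^>lcl/;                            # start a fresh record
    $genome .= $_;
}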

push @chain, $genome if defined $genome;

# build data structure $proteins

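for( @chain ) {
    ($id) = /^>lcl\|(\S+)/;              # record ID from the header line
    my @elements = /\[(.*?)\]/g;         # every [key=value] field of the header
    # limit the split to 2 fields so a value containing '=' cannot break the key/value pairing
    $proteins->{$id} = { map { split /=/, $_, 2 } @elements };
    # note: this overwrites the [protein=...] name with the sequence text that follows the last ']'
    ($proteins->{$id}{protein}) = /([^]]*)\z/;
}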

# output data of search criteria

for( keys %$proteins ) {
    if( $proteins->{$_}{protein_id} eq $protein_id and $proteins->{$_}{location} eq $location ) {
        say "Protein ID: $protein_id\n"
          . "Location:   $location\n"
          . "Protein:    $proteins->{$_}{protein}";
    }
} 

__DATA__
>lcl|CP001340.1_cds_ACL93468.1_1 [locus_tag=CCNA_00001] [protein=pyruvate, phosphate dikinase regulatory protein] [protein_id=ACL93468.1] [location=202..1107] [gbkey=CDS]
GTGGTTAAGCAACCGTTAACGGATGATCCACAGGAGAGTCTGGCGCAGGGCGAGAGCGAAAGGCTGCCGC
CACGCTTCGCCACCTACTTCCATATCCACTTGGTTTCAGACTCCACAGGCGAGACGCTGAACGCGATGGC
GCGGGCGGT
>lcl|CP001340.1_cds_ACL93470.1_3 [locus_tag=CCNA_00003] [protein=shikimate 5-dehydrogenase] [protein_id=ACL93470.1] [location=1700..2557] [gbkey=CDS]
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
GCCCGGTGATCCACAACGCCTGGATCGCAGCGGCCGGCCTTGACGCGGCTTATGTGCCATTCGCCCCGGC

Output

Protein ID: ACL93470.1
Location:   1700..2557
Protein:
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
GCCCGGTGATCCACAACGCCTGGATCGCAGCGGCCGGCCTTGACGCGGCTTATGTGCCATTCGCCCCGGC
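
The question itself asks for the locations of specific CCNA locus tags in the form 202-1107, 1700-2557. A minimal sketch of that follows. It rests on some assumptions: the set of wanted tags (`%wanted`) and the helper name `wanted_locations` are made up for illustration, the sample is read from an in-memory string rather than a real file, and only simple `start..end` locations are handled (not forms such as `complement(...)` or `join(...)`).

```perl
use strict;
use warnings;

# Hypothetical set of locus tags of interest (an assumption for illustration).
my %wanted = map { $_ => 1 } qw(CCNA_00001 CCNA_00003);

# Sample records copied from the question, shortened to one sequence line each.
my $sample = <<'END';
>lcl|CP001340.1_cds_ACL93468.1_1 [locus_tag=CCNA_00001] [protein=pyruvate, phosphate dikinase regulatory protein] [protein_id=ACL93468.1] [location=202..1107] [gbkey=CDS]
GTGGTTAAGCAACCGTTAACGGATGATCCACAGGAGAGTCTGGCGCAGGGCGAGAGCGAAAGGCTGCCGC
>lcl|CP001340.1_cds_ACL93470.1_3 [locus_tag=CCNA_00003] [protein=shikimate 5-dehydrogenase] [protein_id=ACL93470.1] [location=1700..2557] [gbkey=CDS]
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
END

# Return "start-end" strings for header lines whose locus_tag is in %$wanted.
sub wanted_locations {
    my ($fh, $wanted) = @_;
    my @locations;
    while ( my $line = <$fh> ) {
        next unless $line =~ /^>/;    # only header lines carry the fields
        my ($tag) = $line =~ /\[locus_tag=([^\]]+)\]/ or next;
        next unless $wanted->{$tag};
        # Only plain "start..end" locations are recognised here.
        my ($start, $end) = $line =~ /\[location=(\d+)\.\.(\d+)\]/ or next;
        push @locations, "$start-$end";
    }
    return @locations;
}

# Read the sample through an in-memory filehandle; a real file handle works the same way.
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print join(', ', wanted_locations($fh, \%wanted)), "\n";
close $fh;
```

With the two sample records above this prints `202-1107, 1700-2557`. To read a real file, open it with `open my $fh, '<', $filename` and pass that handle instead.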
Polar Bear