
I'm somewhat new to Perl programming and I have a hash that can be written like this:

$hash{"snake"}{ACB2}   = [70, 120];
$hash{"snake"}{SGJK}   = [183, 120];
$hash{"snake"}{KDMFS}   = [1213, 120];
$hash{"snake"}{VCS2}   = [21, 120];
...
$hash{"bear"}{ACB2}   = [12, 87];
$hash{"bear"}{GASF}   = [131, 87];
$hash{"bear"}{SDVS}   = [53, 87];
...
$hash{"monkey"}{ACB2}   = [70, 230];
$hash{"monkey"}{GMSD}   = [234, 230];
$hash{"monkey"}{GJAS}   = [521, 230];
$hash{"monkey"}{ASDA}   = [134, 230];
$hash{"monkey"}{ASMD}   = [700, 230];

The structure of the hash is in summary:

$hash{Organism}{ProteinID} = [protein_length, total_of_proteins_in_that_organism]

I would like to sort this hash according to some conditions. First, I only want to consider organisms whose total number of proteins is higher than 100; then, for each of those, I want to show the name of the organism together with its largest protein and that protein's length.

For this, I'm going for the following approach:

    foreach my $org (sort keys %hash) {
        foreach my $prot (keys %{ $hash{$org} }) {
            if ($hash{$org}{$prot}[1] > 100) {
                @sortedarray = sort {$hash{$b}[0]<=>$hash{$a}[0]} keys %hash;

                print $org."\n";
                print @sortedarray[-1]."\n";
                print $hash{$org}{$sortedarray[-1]}[0]."\n"; 
            }
        }
    }

However, this prints the name of the organism as many times as the total number of proteins; for instance, it prints "snake" 120 times. Besides, it is not sorting properly, because I guess I should make use of the variables $org and $prot in the sorting line.

Finally, the output should look like this:

snake
"Largest protein": KDMFS [1213]

monkey
"Largest protein": ASMD [700]
zdim
  • Should the output show all with "_total number of proteins higher than 100_" as the text says or is only the largest one, as in the desired output example? – zdim Nov 22 '19 at 08:17

4 Answers


The following prints all of the data, sorted:

use warnings;
use strict;
use feature 'say';

use List::Util qw(max);

my %hash;   
$hash{"snake"}{ACB2}   = [70, 120];
$hash{"snake"}{SGJK}   = [183, 120];
$hash{"snake"}{KDMFS}   = [1213, 120];
$hash{"snake"}{VCS2}   = [21, 120];
$hash{"bear"}{ACB2}   = [12, 87];
$hash{"bear"}{GASF}   = [131, 87];
$hash{"bear"}{SDVS}   = [53, 87];    
$hash{"monkey"}{ACB2}   = [70, 230];
$hash{"monkey"}{GMSD}   = [234, 230];
$hash{"monkey"}{GJAS}   = [521, 230];
$hash{"monkey"}{ASDA}   = [134, 230];
$hash{"monkey"}{ASMD}   = [700, 230];

my @top_level_keys_sorted = 
    sort {   
        ( max map { $hash{$b}{$_}->[0] } keys %{$hash{$b}} ) <=> 
        ( max map { $hash{$a}{$_}->[0] } keys %{$hash{$a}} )
    }   
    keys %hash;

for my $k (@top_level_keys_sorted) {
    say $k; 
    say "\t$_ --> @{$hash{$k}{$_}}" for 
        sort { $hash{$k}{$b}->[0] <=> $hash{$k}{$a}->[0] } 
        keys %{$hash{$k}};
}

This first sorts the top-level keys (the organisms) by the largest protein length found under each one, in descending order. With that sorted list of keys on hand, we then go inside each key's hashref and sort its proteins by length. That loop is what we'd tweak to limit the output as wanted (only organisms with a total above 100, only the largest protein by length, etc.).

It prints

snake
        KDMFS --> 1213 120
        SGJK --> 183 120
        ACB2 --> 70 120
        VCS2 --> 21 120
monkey
        ASMD --> 700 230
        GJAS --> 521 230
        GMSD --> 234 230
        ASDA --> 134 230
        ACB2 --> 70 230
bear
        GASF --> 131 87
        SDVS --> 53 87
        ACB2 --> 12 87

I can't tell whether the output should show all of the "organisms with a total number of proteins higher than 100" (text) or only the largest one (desired output), so I am leaving all of it. Cut it off as needed. To get only the largest one, either compare the max from each key in the loop or see this post (same problem).

Note that a hash itself cannot be "sorted" as it is inherently unordered. But we can print things out sorted, as above, or generate ancillary data structures which can be sorted, if needed.
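
As one possible way to narrow the loop down to the output asked for in the question, here is a minimal sketch. It assumes that every protein of an organism carries the same total (true for the sample data), so it reads the total from any one entry; the names `$min_total` and `@report` are made up for this illustration.

```perl
use strict;
use warnings;

my %hash = (
    snake  => { ACB2 => [70, 120], SGJK => [183, 120], KDMFS => [1213, 120], VCS2 => [21, 120] },
    bear   => { ACB2 => [12, 87],  GASF => [131, 87],  SDVS  => [53, 87] },
    monkey => { ACB2 => [70, 230], GMSD => [234, 230], GJAS  => [521, 230],
                ASDA => [134, 230], ASMD => [700, 230] },
);

my $min_total = 100;    # threshold taken from the question
my @report;             # rows of [organism, protein ID, length]

for my $org (keys %hash) {
    my $proteins = $hash{$org};

    # Assumption: the total is identical for all proteins of one organism,
    # so looking at any single entry is enough.
    my ($any) = values %$proteins;
    next unless $any->[1] > $min_total;

    # Sort the protein IDs by length, longest first, and keep the first one.
    my ($largest) = sort { $proteins->{$b}[0] <=> $proteins->{$a}[0] } keys %$proteins;
    push @report, [ $org, $largest, $proteins->{$largest}[0] ];
}

# Print the organism with the longest protein first.
for my $row (sort { $b->[2] <=> $a->[2] } @report) {
    printf qq{%s\n"Largest protein": %s [%d]\n\n}, @$row;
}
```

Collecting the rows in `@report` first is an example of the ancillary structure mentioned above: the hash stays as it is, and only the small list of results gets sorted.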

zdim
  • @FernandoDelgadoChaves Please let me know if you need comments or explanations here. This sorts your full list because I am not sure what exactly you wanted (as stated in my answer), but I hope that it is easy to modify it to print only what you want. If not, please clarify and I can edit. – zdim Nov 22 '19 at 18:21
  • I'm not familiar with mapping hashes but I guess this works! Isn't there a way of doing it without external packages? – Fernando Delgado Chaves Nov 22 '19 at 18:34
  • @FernandoDelgadoChaves The "mapping" is a standard way to process a list in perl -- kind of `out = map { ... } in` where code in that block `{...}` is applied to each element of `in` and those transformed elements form `out`. This allows us to solve the whole thing directly with `sort` -- without having to iterate over elements etc. – zdim Nov 22 '19 at 18:46
  • @FernandoDelgadoChaves The `List::Util` is core in modern versions of Perl, so no need to install it. But, in general: yes, you can do it "_without external packages_" -- can write a routine to find the largest element of a list, instead of using the existing `max` (from `List::Util`). That's silly and a road to madness though -- for many jobs it is practically necessary to use "external" libraries (in all programming languages). If there are problems getting packages installed, that should better be resolved. In this case, `List::Util` routines are easy to write yourself. – zdim Nov 22 '19 at 18:49
  • @FernandoDelgadoChaves Note that you can install packages as a user, if there are admin issues where you work. There's a system how to do it, which is easy and explained in many Stackoverflow posts. I understand that doing that as well is a distraction, but if you use Perl it's well worth doing this _once_. Libraries ("packages") are extremely helpful, and often necessary. – zdim Nov 22 '19 at 18:55

Are you using

use strict;
use warnings;

at the beginning of your script? At least some sources of problems would be highlighted that way. Without them, Perl will silently do stuff that it could easily point out as stupid, pointless or even most likely programming errors.


The assignment

$hash{"snake"}{ACB2}   = (70, 120);

will only assign the value 120: the assignment expects a scalar, but you have a list of values on the right, and in scalar context the comma operator returns only its last value.

To assign an arrayref, you must explicitly state it:

$hash{"snake"}{ACB2}   = [70, 120];

Your use of sigils ($, @, %) seems off.

  • Use $ if you want to handle a scalar or an individual array or hash value.
  • Use @ if you want to handle an array, or an array or hash slice (multiple values); e.g. @array[0,2] returns the first and third items of the array @array.
  • Use % if you want to handle a hash.

So

@sortedarray[-1]

should be

$sortedarray[-1]

since you're only accessing a single value.
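
To make both points concrete, here is a small sketch; the variable names are made up for the illustration. The `no warnings 'void'` line only silences the warning Perl would otherwise emit for the discarded value 70.

```perl
use strict;
use warnings;

my %hash;

# A parenthesized list assigned to a single hash value: in scalar context
# the comma operator keeps only its last operand.
{
    no warnings 'void';    # Perl would warn about the discarded 70
    $hash{snake}{ACB2} = (70, 120);
}
my $stored = $hash{snake}{ACB2};
print "stored by the list assignment: $stored\n";

# Square brackets build an array reference, which is a single scalar.
$hash{snake}{ACB2} = [70, 120];
my ($length, $total) = @{ $hash{snake}{ACB2} };
print "length=$length total=$total\n";

# Sigils follow what you get back, not what the variable is.
my @sortedarray = (21, 70, 183, 1213);
my $largest = $sortedarray[-1];       # one element: $ sigil
my @ends    = @sortedarray[0, -1];    # slice of several elements: @ sigil
print "largest=$largest ends=@ends\n";
```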

Silvar
  • Hi Silvar, I'm using strict and warnings, and also [] instead of (), and $ instead of @ when referring to a list position. I corrected these mistakes; please note this is a shortened version of the actual problem. – Fernando Delgado Chaves Nov 22 '19 at 07:23

If I understood you correctly, then the following code should do what you expect. I keep the result in a hash; feel free to print the data in any form your heart desires.

use strict;
use warnings;

use Data::Dumper;

my $debug = 1;

my %data;
my $totalProteinsSearch = 100;

while( <DATA> ) {
    chomp;
    my @row = split ',';

    $data{$row[0]}{$row[1]} = { proteinLength => $row[2], totalProteins => $row[3] };
}

print Dumper(\%data) if $debug == 1;

my %result;

while( my($organism,$value) = each %data ) {
    while( my($proteinID, $data) = each %{$value} ) {
        next if $data->{totalProteins} < $totalProteinsSearch;
        $result{$organism} = {
                                proteinID => $proteinID,
                                proteinLength => $data->{proteinLength}, 
                                totalProteins => $data->{totalProteins} 
                            }
            if not defined $result{$organism}
                or 
            $data->{proteinLength} > $result{$organism}{proteinLength};
    }
}

print Dumper(\%result) if $debug;

__DATA__
snake,ACB2,70,120
snake,SGJK,183,120
snake,KDMFS,1213,120
snake,VCS2,21,120
bear,ACB2,12,87
bear,GASF,131,87
bear,SDVS,53,87
monkey,ACB2,70,230
monkey,GMSD,234,230
monkey,GJAS,521,230
monkey,ASDA,134,230
monkey,ASMD,700,230

You can print the information like this, for example (turn off debugging first with $debug = 0):

while( my($organism,$data) = each %result ) {
    printf "%s\nLargest protein: %s [%d]\n\n",
            $organism, 
            $data->{proteinID},
            $data->{proteinLength};
}
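
Note that `each` walks a hash in no particular order, so the organisms may come out in a different order from run to run. If a stable order matters, one option is to sort the keys before printing. The sketch below assumes a `%result` of the same shape as the one built above (filled in by hand here) and orders the organisms by protein length, longest first; `@lines` is a made-up helper.

```perl
use strict;
use warnings;

# Hand-written stand-in with the same shape as %result above.
my %result = (
    snake  => { proteinID => 'KDMFS', proteinLength => 1213, totalProteins => 120 },
    monkey => { proteinID => 'ASMD',  proteinLength => 700,  totalProteins => 230 },
);

my @lines;
for my $organism (
    sort { $result{$b}{proteinLength} <=> $result{$a}{proteinLength} } keys %result
) {
    my $data = $result{$organism};
    push @lines, sprintf( "%s\nLargest protein: %s [%d]\n",
        $organism, $data->{proteinID}, $data->{proteinLength} );
}

# Blank line between organisms.
print join( "\n", @lines );
```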
Polar Bear
  • It looks like you are reading a file with that while() loop and creating the hash as you go. I did not mention such a thing, but I got my hash from a .txt file, creating the structure I mentioned in the question. Then you create another hash to show results? Isn't that too long? Isn't there a way to have a unique hash and show specific information of it? Besides, the organism name should only appear once, and the total number of proteins within the organism should not appear. Thank you in advance! – Fernando Delgado Chaves Nov 22 '19 at 07:31
  • You did not provide a sample of the `.txt` file with the data which you read. I have recreated the hash from the `__DATA__` block; my guess is that your `.txt` file looks quite similar. Then I analyse the data and store the results in a new hash, which holds a unique record for each organism that complies with your requirements. You ask in your comment _Isn't there a way to have a unique hash and show specific information of it?_ If you set the variable `$debug = 0` then you will not see anything. I use `print Dumper(...)` to demonstrate what was read and what result was stored. I have mentioned that you can print `%result` in any form. – Polar Bear Nov 22 '19 at 18:59
  • Well, I tried executing without the while() loop and providing the hash instead, as I wrote it in the question. I get the following error: Not a HASH reference at ejemplo1.pl line 31. which relates to this line: next if $data->{totalProteins} < $totalProteinsSearch; the problem is at the totalProteins part... – Fernando Delgado Chaves Nov 22 '19 at 23:37

You could use List::Util reduce to get the max per organism.

I realize that you might not be familiar with this function from List::Util but for this question's example, it is pretty straightforward.
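
As a quick illustration of how `reduce` works: it walks the list pairwise, with `$a` holding the value kept so far and `$b` the next element, and whatever the block returns becomes the new `$a`. A minimal sketch with made-up sample values:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Plain numbers: keep whichever of the two is larger at each step.
my @lengths = (70, 183, 1213, 21);
my $longest = reduce { $a > $b ? $a : $b } @lengths;
print "$longest\n";

# Records: compare one field, but keep the whole record.
my @records = ( [ 'ACB2', 70 ], [ 'KDMFS', 1213 ], [ 'VCS2', 21 ] );
my $max = reduce { $a->[1] > $b->[1] ? $a : $b } @records;
print "$max->[0] [$max->[1]]\n";
```

The second form is the reason to reach for `reduce` rather than `max` here: `max` would return only the number, while `reduce` hands back the complete record that contains it.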

The data is held in a hash of arrays: the organism is the key, and each input line is stored as an array reference pushed onto that key's array (so each value is an array of arrays).

The choice of a data structure is partly determined by the form of the desired output.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;

use constant {org_idx => 0, prot_idx => 1, len_idx => 2, cnt_idx => 3};

my %stuff;

while (<DATA>) {
    chomp;
    my @data = split /,/;

    # saving all 4 items (per line)
    # key is the organism
    push @{$stuff{$data[org_idx]}}, \@data; 
}

for my $org (sort keys %stuff) {
    my $aref = $stuff{$org}; # an array of arrays reference

    # to find the record with the max length
    # $max becomes an array reference containing all 4 of the items as a list
    my $max = reduce{$a->[len_idx] > $b->[len_idx] ? $a : $b} @$aref;

    next if $max->[cnt_idx] < 100;

    print $org, "\n";
    print "Largest protein: $max->[prot_idx] [$max->[len_idx]]\n";
}



__DATA__
snake,ACB2,70,120
snake,SGJK,183,120
snake,KDMFS,1213,120
snake,VCS2,21,120
bear,ACB2,12,87
bear,GASF,131,87
bear,SDVS,53,87
monkey,ACB2,70,230
monkey,GMSD,234,230
monkey,GJAS,521,230
monkey,ASDA,134,230
monkey,ASMD,700,230

Prints

monkey
Largest protein: ASMD [700]
snake
Largest protein: KDMFS [1213]

Data::Dumper of %stuff

$VAR1 = {
          'monkey' => [
                        [
                          'monkey',
                          'ACB2',
                          70,
                          '230'
                        ],
                        [
                          'monkey',
                          'GMSD',
                          234,
                          '230'
                        ],
                        [
                          'monkey',
                          'GJAS',
                          521,
                          '230'
                        ],
                        [
                          'monkey',
                          'ASDA',
                          134,
                          '230'
                        ],
                        [
                          'monkey',
                          'ASMD',
                          700,
                          '230'
                        ]
                      ],
          'bear' => [
                      [
                        'bear',
                        'ACB2',
                        12,
                        '87'
                      ],
                      [
                        'bear',
                        'GASF',
                        131,
                        '87'
                      ],
                      [
                        'bear',
                        'SDVS',
                        53,
                        '87'
                      ]
                    ],
          'snake' => [
                       [
                         'snake',
                         'ACB2',
                         70,
                         '120'
                       ],
                       [
                         'snake',
                         'SGJK',
                         183,
                         '120'
                       ],
                       [
                         'snake',
                         'KDMFS',
                         1213,
                         '120'
                       ],
                       [
                         'snake',
                         'VCS2',
                         21,
                         '120'
                       ]
                     ]
        };
Chris Charley