3

The code is getting messy somewhere in the loop! Please help me to solve it.

Details

Replicate and/or reassign most of the array elements in the multi-dimensional array, using reference elements.

  • File-1: List of array indices & the elements that needs to be maintained in the original array.

  • File-2: The original multi-dimensional array that needs to be rewritten with the above info. Except the elements from the above, the rest of all elements have to be reassigned.

  • File-3: Expected output (reassigned array elements)

Note: Addition to the array indices from file1, rest of all the indices will be replaced with the reference line. Reference line is usually present in the first line of the array.
In the modified array, the reference line is not needed.

File-1:

ID1    2    E1,E4
ID2    5    E6,E7,E9
ID3    1    E3

File-2:

ID1.txt

Ref K L M N O P A B C D
E1 S H G U S K R E K K
E2 S L G N O P A B C D
E3 S L G N O P A B C D
E4 U L G G O P A B C D
E5 U L M G O P A J C D
E6 U L M G O P A J C D
E7 U L M G O P A J C D
E8 U L M G O P A J C D
E9 S L M N O P A J C D
E10 S L M N O P A J C D
.
.
.

File-3: Expected output

new_ID1.txt

E1    K L G N O P A B C D
E2    K L M N O P A B C D
E3    K L M N O P A B C D
E4    K L G N O P A B C D
E5    K L M N O P A B C D
E6    K L M N O P A B C D
E7    K L M N O P A B C D
E8    K L M N O P A B C D
E9    K L M N O P A B C D
E10    K L M N O P A B C D
.
.
.

In the expected output, (new_ID1.txt), second index of the array for "E1" and "E4" is maintained from the original array. Everything else is replaced by the reference line in "E2,E3,E5...".

Code

#!/usr/bin/perl 

use strict;
use warnings;

my %HoHoA = ();

open(IN,"ids.txt");
my @ids = <IN>; chomp @ids; close IN;

open(IN2,"indices_and_values.txt");

while(my $l = <IN2>)
{
    chomp $l;
my @tmp = split "\t", $l;
my $lid = $tmp[0];
my $pos = $tmp[1];
my @gps = @tmp[2..$#tmp];

    foreach my $g (@gps)
    {
        push @{$HoHoA{$lid}{$g}}, $pos;
    }
}
close IN2;


foreach my $outer (sort keys %HoHoA)
{
open(IN3,"$outer.txt");
my @rS = <IN3>; chomp @rS; close IN3;

    my @orgArr = (); my @refArr = (); my @newArr = ();
    foreach my $unk (@rS) 
    { 
        @orgArr = split "\t", $unk;
        if($unk =~ /^Ref/)
        { 
            @refArr = split "\t", $unk;
            next;
        }
    foreach my $inner (sort keys %{$HoHoA{$outer}})
    {
        if($inner =~ /^$orgArr[0]/)
        {
            foreach my $ele (sort {$a <=> $b} @{$HoHoA{$outer}{$inner}})
            {
                $refArr[$ele] = $orgArr[$ele];
            }
        }
        #else
        #{
        #}
    }
    print ">$orgArr[0]\t";
    print join("\t",@refArr[1..$#refArr]);
    print "\n";
}
    @rS = ();
    print "\n";

}

Biffen
  • 6,249
  • 6
  • 28
  • 36
  • I don't understand. It says that E1 is kept (at least the "_second index_"...) but E1 row in ID1.txt file is `S H G U...` while in the desired output file it is `K L G N...` --- completely different. What do you mean by "array E1" -- the row that starts with E1? – zdim Oct 23 '18 at 17:09
  • @zdim Only the given array indices from file 1 have to be kept NOT the entire elements. For instance, E1[2] should be maintained. E1[2] is "G". Every other element should be replaced by the Ref elements (the first line of the array). So it is K L G N O P A B C D. I (assumed) mentioned array E1, E2 and so on because of multiple lines (rows/columns). Would be helpful to know the other way too. File-1 format is as follows, "id index_value arrays" – perlbeginner Oct 23 '18 at 17:29
  • Ah, thank you -- yes, it says all that in the question. Let me know if I still misunderstood something (see answer) – zdim Oct 23 '18 at 18:24

2 Answers2

2

The shown code is well-meant but a bit too complicated; you may have lost your way in the maneuvers over the nested data structure. Here's another, simpler, approach.

Parse the information from the "reference" file (File-1) into a hash (E1 => [2, ...], ..). I put indices for data to be kept in an arrayref to allow for multiple indices for a row. Then go line by line, replacing data at these indices for rows that have a key, and print output as you go.

use warnings;
use strict;
use feature 'say';

my ($ref_file, $data_file) = @ARGV;
die "Usage: $0 ref-file data-file\n" if not $ref_file or not $data_file;

open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my %rows;
while (<$fh>) {
    my (undef, $idx, $row_id) = split;
    for (split /,/, $row_id) {
        push @{$rows{$_}}, $idx;        # elem => [ indices ]
    }
}

my $outfile = 'new_' . $data_file;
open    $fh,     '<', $data_file  or die "Can't open $data_file: $!";
open my $fh_out, '>', $outfile    or die "Can't open $outfile: $!";

my @ref = split ' ', <$fh>;
shift @ref;                  # toss the first field

while (<$fh>) {
    my ($row_id, @data) = split;

    if (exists $rows{$row_id}) {              # this row needs attention
        my @new_row = @ref;
        foreach my $idx (@{$rows{$row_id}}) { # keep data at these indices
            $new_row[$idx] = $data[$idx];
        }
        say $fh_out join "\t", $row_id, @new_row;
    }
    else {                                    # use whole reference line
        say $fh_out join "\t", $row_id, @ref;
    }
}

The new file (shown with two spaces instead of the actual tabs, for readability)

E1  K  L  G  N  O  P  A  B  C  D
E2  K  L  M  N  O  P  A  B  C  D
E3  K  L  M  N  O  P  A  B  C  D
E4  K  L  G  N  O  P  A  B  C  D
E5  K  L  M  N  O  P  A  B  C  D
E6  K  L  M  N  O  P  A  B  C  D
E7  K  L  M  N  O  P  A  B  C  D
E8  K  L  M  N  O  P  A  B  C  D
E9  K  L  M  N  O  P  A  B  C  D
E10  K  L  M  N  O  P  A  B  C  D

Note that the given input file happens to have the same entries as the reference line to use in replacement at many indices of interest -- so we can't see those "changes" in the above output. (I tested by changing the input file so to be able to see.)

zdim
  • 64,580
  • 5
  • 52
  • 81
  • unfortunately it gives the same input as output still! – perlbeginner Oct 23 '18 at 18:48
  • 1
    @perlbeginner Hum? It prints the output as shown, which is your desired output from the question ... ? It uses the input file that you give in the question. I also tested for all fields that should be changed, as explained, and it correctly keeps data at those indices (while replacing all else by that reference line). – zdim Oct 23 '18 at 18:53
  • @perlbeginner I don't understand: copy-paste the program and run it with files that you show in the question (File-1 and ID1.txt) -- it has to print what is shown in the answer (I copy-pasted it from my console, when printed with two spaces instead of tabs). This is precisely the desired output. – zdim Oct 23 '18 at 18:56
  • I copy-paste and tried exactly the same. The output you showed here is the desired output. But when I run it, it gives the similar to the file2 (ID1.txt) not new_ID1.txt – perlbeginner Oct 23 '18 at 19:39
  • @perlbeginner I just checked: I copy-pasted my code above into a new file, and ran `script.pl File-1 ID1.txt`. It produced `new_ID1.txt` with content as shown above. Are you sure you are checking the right file? Do you have other files around which may confuse matters? – zdim Oct 23 '18 at 19:43
  • @perlbeginner Note, I copy-pasted `File-1` and `ID1.txt` from here, exactly as shown. (Well, without those trailing `.` lines). If input is actually different then the code may not find any of what it looks for and may not produce output (and you may be looking at your old "output" which ins't good?) – zdim Oct 23 '18 at 19:47
  • Yeah! great effort of you and thank your very much, it gives the desired output now. Just one short thing, I made it loop for many ids (ID1.txt, ID2.txt and so on for multiple files) but this one is made for single file right? – perlbeginner Oct 23 '18 at 20:03
  • @perlbeginner Great :) Yes, this will only work with one file as it stands. It shouldn't be hard to modify it for multiple files -- can you do that? One way: make the reference file a command-line option (best processed with `Getopt::Long`), and then after that's processed the files remain in `@ARGV`. Then you can put the code in a loop over `@ARGV`. Another way: write a wrapper script which takes the list of files to process, and then out of it run this program, each time with reference file and one of files as input. – zdim Oct 23 '18 at 20:18
  • Neat that. Thanks a lot :) – perlbeginner Oct 23 '18 at 20:49
  • @perlbeginner Good -- judged by the complexity of code you posted handling multiple input files shouldn't be a problem, but let me know if it is. (Or post another question of course.) – zdim Oct 23 '18 at 20:54
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182433/discussion-between-perlbeginner-and-zdim). – perlbeginner Oct 24 '18 at 14:24
1

This is one way to do it, if I understood your problem statement correctly:

#!/usr/bin/perl

use strict;
use warnings;

my %keep_idx;

open FILE, "file-1" or die "Couldn't open file-1";
while(<FILE>) {
    my (undef, $idx, $id_str) = split /\s+/;
    my @ids = split /,/, $id_str;
    foreach my $id (@ids) {
        $keep_idx{$id}{$idx} = 1;
    }
}
close FILE;

open FILE, "file-2" or die "Couldn't open file-2";
open OUTFILE, ">file-3" or die "Couldn't open file-3";
my (undef, @ref) = split /\s+/, <FILE>;
while(<FILE>) {
    my ($id, @src) = split /\s+/;
    my $line = "$id";
    for (my $i = 0; $i <= $#src; $i++) {
        my $e = $keep_idx{$id}{$i} ? $src[$i] : $ref[$i];
        $line .= " $e";
    }
    print OUTFILE "$line\n";
}
close OUTFILE;
close FILE;
Automaton
  • 143
  • 8