Perl script for counting occurrences of three different IDs in string where occurrence may be zero

Question

I've attached below an example featuring 5 lines from my input file (tab-delimited):

G157    G157.2  535 3   344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157    G157.6  535 42  276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157    G157.9  535 4   344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157    G157.11 535 6   276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs

The second column contains transcript IDs, the fifth column contains sample IDs which had reads mapping to that transcript, and the sixth column contains a list of all of those reads. Below is an explanation of the structure of the sixth column:

SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes

There are three sample IDs (276,344,&365), but each transcript can have coverage from any one, two, or all three samples.

This is what I WANT the output to look like:

transcript_id   276 344 365
G157.1  0   0   2
G157.2  0   3   0
G157.6  9   18  15
G157.9  0   4   0
G157.11 6   0   0

I've been able to get this to work, but because I'm relatively new to Perl, I'm not able to figure out how to accomplish the task entirely in Perl. I piece together a matrix at the end using a second R script. This is my Perl script:

#!/usr/bin/perl -w
use strict;
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
while(<INFILE>) {
    $line=$_;
    chomp $line;
    if ($line=~m/\S+/) {
        my ($transcript) = ($line=~m/\S+\s+(\S+)/);
        my ($transcript_info) = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);
    
       if ($unique_donor_counts !~ m/,/) {
           my $commas = $transcript_info =~ y/,//;
           my $counts = $commas+1;
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);  
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;
               my $counts = $commas+1;
               my ($donor) = ($i=~m/(\d\d\d)/);
               print "$transcript\t$donor\t$counts\n";
           }
       }
    }
}

which outputs:

G157.1  2   365
G157.2  3   344
G157.6  9   276
G157.6  18  344
G157.6  15  365
G157.9  4   344
G157.11 6   276

Instead of printing, I save it to an output file and run this R script:

library(tidyr)

DF <- read.delim("output", sep = "\t", header = F)
df <- tidyr::pivot_wider(output, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = T, row.names = F, quote = F, dec = ".")

The problem I was encountering in Perl is that I can't figure out how to get a count of 0 when the sample ID doesn't occur in that line. I really want to get better at Perl, so I'd like to figure out how to do this all in one script instead of manipulating the matrix using R. Thank you for taking the time to help a Perl newbie learn!

edited post from my original with some progress

I need to print an output file giving the read count for every sample for each transcript ID. An example of what that output looks like for my provided example input looks like this: `transcript_id 276 344 365 G157.1 0 0 2 G157.2 0 3 0 G157.6 9 18 15 G157.9 0 4 0 G157.11 6 0 0` — winstonhc, Jan 25 '22 at 01:45
I don't understand what you mean though, because my expected output is already in my original question? I highlighted it in my original post in bold, but I don't understand what you mean by "the right format". What is wrong about the format I used in my post? — winstonhc, Jan 25 '22 at 03:00
The problem with the format was with the numbers you posted in a comment, not the output you had not highlighted in the question, which I did not consider. After you highlighted it, I did notice it. I fixed the formatting for you now. — TLP, Jan 25 '22 at 13:58
Your output does not match what I get when I run your program with your input. For one thing, the numbers come in a different order. For another, some are missing. For another, some are right, some are one off. You should never post code and output which you have not tested first. — TLP, Jan 25 '22 at 15:35

Shawn · Answer 1 · 2022-01-25T06:11:22.717

I'd do it with a lot of splits.

#!/usr/bin/env perl
use strict;
use warnings; # Prefer over the -w option
use feature qw/say/;

# Reads from standard input/filenames given on command line, prints to
# standard output.

# All samples encountered in the entire input
my %samples;
# The data for each line of input
my @transcripts;

# For each line of input
while (my $line = <<>>) {
    # Split into fields on tabs.
    my @F = split /\t/, $line;

    # Add to the known samples
    my @s = split /,/, $F[4];
    @samples{@s} = (1) x @s;

    my %counts;
    # Split the 6th column on semicolon and iterate over each element
    for my $read (split /;/, $F[5]) {
        # Extract the sample id and increment its count
        my ($id, @reads) = split /[:,]/, $read;
        $counts{$id} += @reads;
    }

    # Save this line's counts
    push @transcripts, [ $F[0], \%counts ];
}

# Sample ids in sorted order
my @samples = sort { $a <=> $b } keys %samples;

# Header line
say join("\t", "transcript_id", @samples);

foreach my $transcript (@transcripts) {
    # And print the counts of each transcript's sample, replacing undef values with 0.
    say join("\t", $$transcript[0], map { $_ // 0 } @{$$transcript[1]}{@samples});
}

Example:

$ perl count.pl input.tsv
transcript_id   276 344 365
G157.2  0   3   0
G157.6  10  18  14
G157.9  0   4   0
G157.11 6   0   0

One thing this uses in several places is a slice, to set/retrieve multiple elements of a hash in one expression. Also helps to be familiar with references, which are used in more complicated data structures than just plain arrays and hashes.

TLP · Accepted Answer · 2022-01-25T16:26:55.077

I'll walk through your program file and comment where needed.

#!/usr/bin/perl -w
use strict;

The -w switch turns on warnings globally. You should use the lexical version of warnings: use warnings. It is generally considered a better practice, as it allows you to turn off warnings if you need to, with for example: no warnings 'experimental'.

my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

All of these variables are unused. So remove all unused variables.

You should not declare all of your variables at the top of your program and make them all global. You should declare them as close as you can to the place you are going to use them, and in the smallest scope possible. The scope of a variable declared with my is the nearest enclosing block: { block here }. This will encapsulate your code, and also make it easier to read.

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";

This is the old, two argument open, which is generally considered bad practice, as it allows code insertion. You want the three argument open, with explicit open mode. You also want to include error reporting $! in your die statement:

open my $fh, "<", $ARGV[0] or die "Cannot open '$ARGV[0]' for reading: $!";

You also do not want to put a newline at the end of the die statement, because that will suppress the line number in the error.

while(<INFILE>) {
    $line=$_;
    chomp $line;

You do not have to jump through all these hoops. You can do either

while (my $line = <$fh>) {    # declared in the smallest possible scope
    chomp $line;

Or

while (<$fh>) {
    chomp;        # chomps $_

The latter has the benefit of not having to type out what variable you want to match your regexes against. E.g. instead of $line =~ m/..../ just type m/..../. It is a commonly used Perl idiom.

    if ($line=~m/\S+/) {
        my ($transcript)         = ($line=~m/\S+\s+(\S+)/);
        my ($transcript_info)    = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);

You could just match once, and capture three values at the same time:

my ($trans,$udc,$trans_info) = $line =~ /^\S+\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)/;

Note that you have to have the variables in the same order as the captured values.

This might also be somewhat simpler if you use split

my @vals = split ' ', $line;
my $trans = $vals[1];        # and so on...

This part

       my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);

is wrong. You think it just matches a number in front of the first colon, e.g. 344:. If you only have one colon, that is what it does. Otherwise it matches all the characters before the last colon in the line, because \S+ is greedy and will try to match as much as possible. When you later split $unique_donor_ID on ;, you will miss all the data after the last colon. You can see the result here. Notice line 2 of the output which begins with 276 and ends with 365. 365 is the start of the last record.

       if ($unique_donor_counts !~ m/,/) {            # checks for 311,322, ..
           my $commas = $transcript_info =~ y/,//;    # count by destroying data
           my $counts = $commas+1;                    # unnecessary extra variable
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);    # splitting the wrong variable
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;              
               my $counts = $commas+1;
               my ($donor) = ($i=~m/(\d\d\d)/);       # risky way to get the first 3 numbers
               print "$transcript\t$donor\t$counts\n";
        .....

Your logic is to check for commas in the donor id short list, and handle the single values separately. That is most likely a flawed approach. Why not just split the last field on ; and handle the results from there?

use strict;
use warnings;
use feature 'say';

my %dat;             # file data
my @donors;          # store donor ids
my @trans;           # store the transcript names in original order
while (<DATA>) {
    my @f = split;                      # split line on whitespace
    my $trans = $f[1];                  # second field is transcript
    push @trans, $trans;                # save trans for later
    my $tinfo = $f[5];                  # 6th field is our payload
    my @data = split /;/, $tinfo;       # split payload
    for (@data) {                       # for each data segment
        my ($donor) = /^(\d+):/;    # ... extract donor id
        my $count = () = /,/g;      # ... count commas
        $count++;                   # ... add 1 comma for whatever reason
        #say join "\t", $trans, $count, $donor;  # optional print
        push @donors, $donor;       # store donor ids
        $dat{$trans}{$donor} = $count;   # this is the data we want
    }
}

sub uniq {         # for deduping
    my %seen;
    grep { !$seen{$_}++ } @_;
}
@donors= sort { $a <=> $b } uniq(@donors);   # dedupe and sort
@trans  = uniq(@trans);                      # just dedupe, preserve original order from file
say join "\t", "transcript_id",  @donors;    # print header
for my $key (@trans) {
    # for each donor id, for each key, print number, or 0 if undefined
    say join "\t", $key, map { $dat{$key}{$_} // 0 } @donors;
}

__DATA__
G157    G157.2  535 3   344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157    G157.6  535 42  276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157    G157.9  535 4   344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157    G157.11 535 6   276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs

Note the two lines

my ($donor) = /^(\d+):/;    
my $count = () = /,/g;

Are using the implicit pattern match, /,/g meaning the same as $_ =~ /,/g. In the first case we match and capture, and assign to $donor. In the second case, we enforce a count of matches with = () =, a Perl trick to change the context of the regex. This does the same as y/,//, except that it is not destructive, and also can be used with a proper regex instead of a single character.

Output:

transcript_id   276     344     365
G157.2  0       3       0
G157.6  10      18      14
G157.9  0       4       0
G157.11 6       0       0

Perl script for counting occurrences of three different IDs in string where occurrence may be zero

2 Answers2