I've attached below an example featuring 5 lines from my input file (tab-delimited):
G157 G157.2 535 3 344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157 G157.6 535 42 276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157 G157.9 535 4 344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157 G157.11 535 6 276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs
The second column contains transcript IDs, the fifth column lists the IDs of the samples that had reads mapping to that transcript, and the sixth column lists all of those reads, grouped by sample. Within the sixth column, sample blocks are separated by semicolons and the reads within a block are separated by commas:
SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes
There are three sample IDs (276, 344, and 365), and each transcript can have coverage from any one, two, or all three of them.
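To make the structure of the sixth column concrete, here is a minimal sketch of how one such field can be taken apart. The helper name count_reads_per_sample and the read names in the example field are made up for illustration and are not part of my script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): split one column-six
# field into a hash of sample ID => number of reads.
sub count_reads_per_sample {
    my ($field) = @_;
    my %count;
    for my $block (split /;/, $field) {
        # Each block looks like SAMPLEID:read1,read2,...
        my ($sample, $reads) = split /:/, $block, 2;
        next unless defined $reads && length $reads;
        my @reads = split /,/, $reads;
        $count{$sample} += scalar @reads;
    }
    return %count;
}

# Made-up field in the same shape as the structure shown above.
my %c = count_reads_per_sample('276:run/1/ccs,run/2/ccs;344:run/3/ccs');
print "$_\t$c{$_}\n" for sort keys %c;
# prints:
# 276	2
# 344	1
```

Splitting on the first colon only (the limit of 2) keeps the read list intact even if a read name were ever to contain a colon itself.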
This is what I WANT the output to look like:
transcript_id 276 344 365
G157.1 0 0 2
G157.2 0 3 0
G157.6 9 18 15
G157.9 0 4 0
G157.11 6 0 0
I've been able to get this to work, but because I'm relatively new to Perl, I can't figure out how to accomplish the whole task in Perl. Instead, I piece the matrix together at the end with a second script written in R. This is my Perl script:
#!/usr/bin/perl -w
use strict;

open(my $in, '<', $ARGV[0]) or die "Cannot open '$ARGV[0]': $!\n";
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ m/\S+/) {
        # Column 2: transcript ID
        my ($transcript) = ($line =~ m/\S+\s+(\S+)/);
        # Column 6: reads grouped by sample (SAMPLEID:read,read;SAMPLEID:read)
        my ($transcript_info) = ($line =~ m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        # Column 5: comma-separated sample IDs with reads for this transcript
        my ($unique_donor_counts) = ($line =~ m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        if ($unique_donor_counts !~ m/,/) {
            # Only one sample: its ID is everything before the colon
            my ($unique_donor_ID) = ($transcript_info =~ m/^([^:]+):/);
            my $commas = ($transcript_info =~ y/,//);
            my $counts = $commas + 1;
            print "$transcript\t$unique_donor_ID\t$counts\n";
        } else {
            # Several samples: split column 6 into one block per sample
            my @spl = split(';', $transcript_info);
            foreach my $i (@spl) {
                my $commas = ($i =~ y/,//);
                my $counts = $commas + 1;
                my ($donor) = ($i =~ m/^(\d\d\d)/);
                print "$transcript\t$donor\t$counts\n";
            }
        }
    }
}
close($in);
which outputs:
G157.1 365 2
G157.2 344 3
G157.6 276 9
G157.6 344 18
G157.6 365 15
G157.9 344 4
G157.11 276 6
Rather than printing to the terminal, I save this to a file named output and run this R script:
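library(tidyr)
# Read the long-format Perl output: V1 = transcript, V2 = sample ID, V3 = count
DF <- read.delim("output", sep = "\t", header = FALSE)
# Spread the sample IDs into columns, one row per transcript
df <- tidyr::pivot_wider(DF, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = TRUE, row.names = FALSE, quote = FALSE, dec = ".")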
The problem I keep running into in Perl is that I can't figure out how to get a count of 0 when a sample ID doesn't occur on a given line. I really want to get better at Perl, so I'd like to do this all in one script instead of reshaping the output into a matrix with R. Thank you for taking the time to help a Perl newbie learn!
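For reference, this is a minimal sketch of the reshaping step I am trying to reproduce in Perl, not a tested solution for my real data. It assumes the three sample IDs are known in advance, that the counts are already available as (transcript, sample, count) triples like the ones my script prints, and that Perl 5.10 or later is available for the // (defined-or) operator. The name build_matrix is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn long-format rows [transcript, sample, count]
# into tab-delimited matrix lines, writing 0 for any missing combination.
sub build_matrix {
    my ($samples, $rows) = @_;
    my (%matrix, @order);
    for my $row (@$rows) {
        my ($transcript, $sample, $count) = @$row;
        # Remember first-seen order so output follows the input order.
        push @order, $transcript unless exists $matrix{$transcript};
        $matrix{$transcript}{$sample} += $count;
    }
    my @lines = (join "\t", 'transcript_id', @$samples);
    for my $transcript (@order) {
        # A sample never seen for this transcript is undefined, so // gives 0.
        push @lines, join "\t", $transcript,
            map { $matrix{$transcript}{$_} // 0 } @$samples;
    }
    return @lines;
}

# Three of the rows from the output shown above.
my @rows = (['G157.2', 344, 3], ['G157.9', 344, 4], ['G157.11', 276, 6]);
print "$_\n" for build_matrix([276, 344, 365], \@rows);
# prints:
# transcript_id	276	344	365
# G157.2	0	3	0
# G157.9	0	4	0
# G157.11	6	0	0
```

Storing the counts in a hash of hashes and looping over a fixed sample list at print time is what supplies the zeros: the lookup simply falls back to 0 whenever a sample has no entry for that transcript.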
Edit: updated from my original post with some progress.