This Perl script builds a hash that you should be able to work with. For convenience I used List::MoreUtils for uniq and Data::Printer for dumping the data structure:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use DDP;
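my %paper;
my @categories;

# Each record is an ID followed by tab-separated categories.
while (<DATA>) {
    chomp;
    my @record = split /\t/;
    # Flag every category that appears on this line with 1.
    $paper{$record[0]} = { map { $_ => 1 } @record[1..$#record] };
    push @categories, @record[1..$#record];
}
@categories = uniq @categories;

# Fill in 0 for the categories a paper does not have.
foreach (keys %paper) {
    foreach my $category (@categories) {
        $paper{$_}{$category} //= 0;
    }
}
p %paper;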
__DATA__
19801464	Animals Biodiversity	Computational Biology/methods	DNA
19696045	Environmental Microbiology	Computational Biology/methods	Software
Output:
{
    19696045 {
        'Animals Biodiversity' 0,
        'Computational Biology/methods' 1,
        DNA 0,
        'Environmental Microbiology' 1,
        Software 1
    },
    19801464 {
        'Animals Biodiversity' 1,
        'Computational Biology/methods' 1,
        DNA 1,
        'Environmental Microbiology' 0,
        Software 0
    }
}
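If List::MoreUtils is not installed, roughly the same hash can be built with core Perl only. The following is a minimal sketch, assuming the same tab-separated records as above; build_matrix is a hypothetical helper name, and a %seen hash stands in for uniq:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): takes tab-separated
# records (ID first, then categories) and returns (\%paper, \@categories)
# with a 0/1 flag for every category on every paper.
sub build_matrix {
    my @lines = @_;
    my (%paper, %seen, @categories);
    for my $line (@lines) {
        chomp $line;
        my ($id, @cats) = split /\t/, $line;
        $paper{$id}{$_} = 1 for @cats;
        # keep first-seen order while dropping duplicates
        push @categories, grep { !$seen{$_}++ } @cats;
    }
    for my $id (keys %paper) {
        $paper{$id}{$_} //= 0 for @categories;
    }
    return (\%paper, \@categories);
}

my ($paper, $categories) = build_matrix(
    "19801464\tAnimals Biodiversity\tComputational Biology/methods\tDNA",
    "19696045\tEnvironmental Microbiology\tComputational Biology/methods\tSoftware",
);
print join(", ", @$categories), "\n";
```

The grep on %seen keeps the categories in the order they first appear, which gives a stable column order for any later printing.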
Getting from there to the output you want may require printf to format the lines properly. The following might be enough for your purposes:
print "\t", (join " ", @categories);
for (keys %paper) {
    print "\n", $_, "\t\t";
    for my $category (@categories) {
        print $paper{$_}{$category}, " " x 17;
    }
}
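If the fixed padding does not line up with your real category names, one option is to size each column from its header. This is only a sketch under assumptions: table_lines is a hypothetical helper, the sample data is a cut-down stand-in for %paper and @categories, and the ID column width of 10 is an arbitrary choice:

```perl
use strict;
use warnings;

# Sample data in the same shape as %paper and @categories above (shortened).
my @categories = ('Animals Biodiversity', 'DNA', 'Software');
my %paper = (
    19801464 => { 'Animals Biodiversity' => 1, 'DNA' => 1, 'Software' => 0 },
    19696045 => { 'Animals Biodiversity' => 0, 'DNA' => 0, 'Software' => 1 },
);

# Hypothetical helper: returns the table as a list of formatted lines.
sub table_lines {
    my ($paper, $categories) = @_;
    # one left-aligned "%-<width>s" field per category, sized to its header
    my $fmt = join('  ', '%-10s', map { '%-' . length($_) . 's' } @$categories) . "\n";
    my @lines = sprintf($fmt, 'ID', @$categories);
    for my $id (sort keys %$paper) {
        # the hash slice returns the flags in header order
        push @lines, sprintf($fmt, $id, @{ $paper->{$id} }{@$categories});
    }
    return @lines;
}

print table_lines(\%paper, \@categories);
```

Reading the flags with a hash slice over the category list, rather than with values, keeps each number under its own header, because values returns hash entries in no particular order.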
Edit
A few alternatives for formatting your output (we use x to repeat the format section once per element of the @categories array, so the fields match the columns):
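Using format:
# The hash slice keeps the values in @categories order (values would not).
my $format_line = 'format STDOUT =' . "\n"
                . '@# ' x scalar(@categories) . "\n"
                . '@{ $paper{$num} }{@categories}' . "\n"
                . '.' . "\n";
for my $num (keys %paper) {
    print $num;
    no warnings 'redefine';
    eval $format_line;   # recompile the format so it sees the current $num
    write;
}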
Using printf:
print " " x 9, join(" ", @categories), "\n";
for my $num (keys %paper) {
    print $num;
    # The hash slice returns the flags in the same order as the header.
    printf "%19d", $_ for @{ $paper{$num} }{@categories};
    print "\n";
}
Using form:
use Perl6::Form;
for my $num (keys %paper) {
    print form
        "{<<<<<<<<}" . "{>}" x scalar(@categories),
        $num, @{ $paper{$num} }{@categories};
}
Depending on what you plan to do with the data, you may be able to do the rest of your analysis in Perl, so precise formatting for printing might not be a priority until a later stage in your workflow. See BioPerl for ideas.
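As an illustration of staying in Perl, here is a small sketch that counts how many papers carry each category. It assumes a hash shaped like %paper above; category_counts is a hypothetical helper and the data is a shortened sample:

```perl
use strict;
use warnings;

# Shortened sample in the same shape as %paper above.
my %paper = (
    19801464 => { 'Animals Biodiversity' => 1, 'DNA' => 1, 'Software' => 0 },
    19696045 => { 'Animals Biodiversity' => 0, 'DNA' => 0, 'Software' => 1 },
);

# Hypothetical helper: sums the 0/1 flags per category across all papers.
sub category_counts {
    my ($paper) = @_;
    my %count;
    for my $flags (values %$paper) {
        $count{$_} += $flags->{$_} for keys %$flags;
    }
    return %count;
}

my %count = category_counts(\%paper);
printf "%-22s %d\n", $_, $count{$_} for sort keys %count;
```

Because the flags are plain 0/1 numbers, sums like this need no extra bookkeeping.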