Parse MEDLINE file for GWAS mining

Question

I'm trying to parse as a 0,1 table a MedLine file to perform some statistical downstream analysis: PCA, GWAS, etc. I formatted it using a Python module called Bio.Medline with some additional shell commands. Now, I don't know how to continue.

I need to transform File 1, - a key-value file with one paper per line and tab-separated keywords - into a file with collapsed keywords and presence/absence of keywords shown as 1 or 0 values.

I would like to do this with Perl but other solutions are welcome.

Thanks, Bernardo

File 1:

19801464    Animals Biodiversity    Computational Biology/methods   DNA
19696045    Environmental Microbiology  Computational Biology/methods   Software

Desired output:

    Animals Biodiversity    Computational Biology/methods   DNA Environmental Microbiology  Software
19801464    1   1   1   0   0
19696045    0   1   0   1   1

A first-pass way would be to go through all the keywords and put those into an array. Now, go through the lines of the document and for each paperID, create a new object in a hash object (`papers["19801464"]` in Ruby) and assign that hash a key of each keyword and a value of 0 or 1 depending on the keyword's presence on that line (`papers["19801464"]["Animal Biodiversity"] = 1` in Ruby). — Mark Silverberg, Jul 16 '14 at 01:51
Shouldn't record `19696045` in your desired output have a "1" in columns 4 for "Environmental Microbiology" ? — G. Cito, Jul 16 '14 at 06:02
I moved to PCA step, here it is: http://stats.stackexchange.com/questions/108148/pull-out-most-important-variables-from-pca — biotech, Jul 16 '14 at 11:50

score 2 · Answer 1 · answered Jul 16 '14 at 08:51

You can perform this with Python and Pandas :

In [1]: df = pd.read_table("file", header=None, sep="\t", names=["A", "B","C","D"], index_col=0)
In [2]: df
Out[2]: 
          A                           B                              C  \
0  19801464        Animals Biodiversity  Computational Biology/methods   
1  19696045  Environmental Microbiology  Computational Biology/methods   

          D  
0       DNA  
1  Software  

In [3]: b = pd.get_dummies(df.B)

In [4]: c = pd.get_dummies(df.C)

In [5]: d = pd.get_dummies(df.D)

In [6]: presence_absence = b.merge(c, right_index=True, left_index=True).merge(d,right_index=True, left_index=True)

In [7]: presence_absence
Out[7]: 
          Animals Biodiversity  Environmental Microbiology  \
A                                                            
19801464                     1                           0   
19696045                     0                           1   

          Computational Biology/methods  DNA  Software  
A                                                       
19801464                              1    1         0  
19696045                              1    0         1

Hope this helps

G. Cito · Accepted Answer · 2014-07-19T13:41:55.713

This perl script will build a hash that you should be able to work with. For convenience I used List::MoreUtils for uniq and Data::Printer for dumping the data structure:

#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use DDP;

my %paper ;
my @categories;

while (<DATA>){
  chomp;
  my @record = split /\t/ ;
  $paper{$record[0]}  = { map { $_ => 1 } @record[1..$#record] } ;
  push @categories , @record[1..$#record] ;
}

@categories = uniq @categories; 

foreach (keys %paper) {
  foreach my $category(@categories) {
    $paper{$_}{$category} //= 0 ;
  } 
}; 

p %paper ;

__DATA__
19801464   Animals Biodiversity  Computational Biology/methods  DNA     
19696045   Environmental Microbiology   Computational Biology/methods Software

Output

{
    19696045   {
        'Animals Biodiversity'            0,
        'Computational Biology/methods'   1,
        DNA                               0,
        'Environmental Microbiology'      1,
        Software                          1
    },
    19801464   {
        'Animals Biodiversity'            1,
        'Computational Biology/methods'   1,
        DNA                               1,
        'Environmental Microbiology'      0,
        Software                          0
    }
}

From there to producing the output you want may require printf to format the lines properly. The following might be enough for your purposes:

print "\t", (join "  ", @categories); 
for (keys %paper) {
  print "\n", $_, "\t\t" ;
  for my $category(@categories) { 
    print $paper{$_}{$category}," "x17 ; 
  }  
}

Edit

A few alternatives for formatting your output ... (we use x to multiply the format sections by the length, or number of elements, in the @categories array so they match):

Using format

my $format_line = 'format STDOUT =' ."\n"
                . '@# 'x ~~@categories . "\n" 
                . 'values %{ $paper{$num} }' . "\n"
                . '.'."\n"; 
for $num (keys %paper) {
  print $num ;
  no warnings 'redefine'; 
  eval $format_line;
write;
}

Using printf:

print (" "x9, join "  ", @categories, "\n"); 
for $num (keys %paper) {
  print $num  ;
  map{ printf "%19d", $_ }  values %{ $paper{$num} } ;
  print "\n";   
}

Using form:

use Perl6::Form;                                                              
for $num (keys %paper) {                                                       
  print form                                                         
  "{<<<<<<<<}" . "{>}" x ~~@categories ,                                      
    $num       , values %{ $paper{$num} }                                      
}

Depending on what you plan on doing with the data, you may be able to do the rest your of analysis in perl, so perhaps precise formatting for printing might not be a priority until a later stage in your workflow. See BioPerl for ideas.

Thanks for your response G. Cito. My idea is to continue in R. — biotech, Jul 16 '14 at 07:52
How could I print yes or no instead of 1,0? Seems you are using a function that automatically prints 1 or 0. — biotech, Jul 16 '14 at 10:20
Since 1 and 0 are booleans, you could change the `print` statement to something like: `( $paper{$_}{$category} ? print "yes " : print "no " ) ;`. Again, to get the proper formatting in your output you'll need to use `printf` or [`perl` formats](http://http://perldoc.perl.org/perlform.html). A tool like [PDL](http://pdl.perl.org/) (Perl Data Language) or R may have more of these kinds of utility functions "built in" and ready to use. — G. Cito, Jul 16 '14 at 14:17

Parse MEDLINE file for GWAS mining

2 Answers2