I am trying to process a tab-separated table (about 50 columns) from which I want to select only some columns (7 to 10). However, some of those columns have empty entries. I think something like this is a minimal working example:

A    B    C    D    E#note these are 5 tab separated columns
this    that    semething    something more     the end 
this.line    is    very    incomplete    #column E empty
but    this    is    v.very    complete
whereas        this    is    not #column B empty

As you can see, the 3rd line is empty in the last position.

I want to find an efficient way of replacing all empty fields in the columns I am interested in with a string, say "NA".

Of course, I could do it in the following way, but it is not very elegant to do this for all the 10 columns that I have in my real data:

#!/usr/local/bin/perl
use strict;
use warnings;

my $path = '.';    # placeholder: set to the directory that holds file.txt
open my $data, '<', "$path\\file.txt" or die "Cannot open file: $!";

my @selecteddata;
my ($blankB, $blankE);
while (<$data>) {
    chomp;
    my @line = split /\t/;
    # A dropped trailing field is undef; an empty middle field is ''
    if (not defined $line[4] or $line[4] eq '') {
        $blankE = "NA";
    } else {
        $blankE = $line[4];
    }
    if (not defined $line[1] or $line[1] eq '') {
        $blankB = "NA";
    } else {
        $blankB = $line[1];
    }
    push @selecteddata, "$line[0]\t$blankB\t$line[2]\t$line[3]\t$blankE\n";
}
close $data;

Alternatively, I can pre-process the file and replace all undefined entries with "NA", but I would like to avoid this.

So the main question is this: is there a more elegant way to replace blank entries with some word, only in the columns that I am interested in?

Thank you!

Sos

3 Answers

The trick to keeping trailing empty fields (rather than having split silently drop them) is to specify a negative LIMIT as the third argument to split (kudos ikegami).
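
To see the difference the LIMIT makes, here is a minimal sketch. The sample line is made up for illustration (columns C, D and E are empty); it only counts the fields each call returns:

```perl
use strict;
use warnings;

# Hypothetical sample line: "a", "b", then three empty trailing columns.
my $line = "a\tb\t\t\t";

my @default = split /\t/, $line;        # trailing empty fields are dropped
my @keep    = split /\t/, $line, -1;    # negative LIMIT keeps them

printf "default:  %d fields\n", scalar @default;    # default:  2 fields
printf "LIMIT -1: %d fields\n", scalar @keep;       # LIMIT -1: 5 fields
```

Without the LIMIT, the missing trailing columns come back as undef when indexed, which is why the question's code had to test with defined.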

map makes light work of setting the "NA" values:

while ( <$data> ) {
    chomp;

    my @fields = split /\t/, $_, -1;

    @fields = map { length($_) ? $_ : 'NA' } @fields;  # Transform @fields

    my $updated = join("\t", @fields) . "\n";

    push @selected_data, $updated ;
}

In one-liner mode:

$ perl -lne 'print join "\t", map { length ? $_ : "NA" } split /\t/, $_, -1' input > output
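
The snippets above fill every empty field. For the OP's case of only a few columns of interest, a rough sketch could take an array slice before the map. The helper name select_with_na and the column indices (1 and 4, i.e. B and E from the question's example) are assumptions for illustration, not part of the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: return only the requested (zero-based) columns,
# with 'NA' substituted for any that are empty or missing.
sub select_with_na {
    my ($line, $cols) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;
    # A slice past the end of @fields yields undef, hence the defined() check.
    return map { defined($_) && length($_) ? $_ : 'NA' } @fields[@$cols];
}

my @wanted = (1, 4);    # columns B and E
print join("\t", select_with_na("whereas\t\tthis\tis\tnot\n", \@wanted)), "\n";
# prints "NA<TAB>not"
print join("\t", select_with_na("this.line\tis\tvery\tincomplete\t\n", \@wanted)), "\n";
# prints "is<TAB>NA"
```

Only the selected columns are touched, so the other 40-odd columns never need to be inspected.
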
Zaid
  • @ikegami : Thanks for the bullet-proofing – Zaid Apr 22 '14 at 16:00
  • `-F` and `-a` have no use in one liner; argument for `length` can be omitted. – mpapec Apr 22 '14 at 16:11
  • @ikegami : Thanks for the bug-fixing :) – Zaid Apr 22 '14 at 16:14
  • I would say that -F and -a are useful, but the `split` isn't. :) *shrug* – ikegami Apr 22 '14 at 16:18
  • Alternative: `@fields = map s/^$/NA/r, @fields` – Zaid Apr 22 '14 at 16:20
  • And to further nitpick, since `@fields` gets overwritten, perhaps foreach instead of map `length($_) or $_ ='NA' for @fields;` – mpapec Apr 22 '14 at 16:20
  • perhaps `s/^$/NA/r` but `/r` is from 5.14? – mpapec Apr 22 '14 at 16:22
  • Yep. Or the less aesthetic `map { s/^$/NA/; $_ } @fields;` to cover older versions of Perl – Zaid Apr 22 '14 at 16:23
  • +1 Because this is beautiful ... I am ashamed but invoke TIMTOWTDI hoping for forgiveness :) – G. Cito Apr 22 '14 at 17:28
  • After posting my "baby perl" I found [this multiple substitution example](http://stackoverflow.com/questions/3140553/multiple-substitutions-with-a-single-regular-expression-in-perl) on SO with contributions from @ether and @Zaid! ... I'm still doing baby perl but I feel more kindergarten every day ;-) – G. Cito Apr 22 '14 at 17:38

I would say that using split and join is undoubtedly the clearest approach, since you'll likely need to do that for other parsing as well. However, this can also be solved using lookaround assertions.

Basically, the boundary between elements will either be a tab or the end or beginning of a string, so if those conditions are true for both directions, then we have an empty field:

use strict;
use warnings;

while (<DATA>) {
    s/(?:^|(?<=\t))(?=\t|$)/NA/g;
    print;
}

__DATA__
a   b   c   d   e
a   b   c   d   e
a   b       d   e
    b   c   d   e
a   b           
a   b       d   
a               e

Outputs:

a       b       c       d       e
a       b       c       d       e
a       b       NA      d       e
NA      b       c       d       e
a       b       NA      NA      NA
a       b       NA      d       NA
a       NA      NA      NA      e

Turning this into a one liner is trivial, but I will point out that this could be done using \K as well saving 2 characters: s/(?:\t|^)\K(?=\t|$)/NA/g;

Zaid
Miller

I'm not sure whether a sequence of substitutions that looks for tabs followed by another tab or by whitespace, or sitting at the start of the line, would catch everything, but it's quick and easy if you have a lazy brain ;-)

 perl -pne 's/\t\t/\tNA\t/;s/\t\s/\tNA/;s/^\t/NA\t/' col_data-undef.txt

I'm not sure if in a neater scriptish format it looks less or more yucky :-)

#!/usr/bin/env perl
# read_cols.pl - munge tab separated data with empty "cells"
use strict; 
use warnings;

while (<>){
 s/\t\t/\tNA\t/;
 s/\t\s/\tNA/;
 s/^\t/NA\t/;
 print ;
}

Here are vim buffers of the input and output, with tabs shown as ^I in red ;-)

./read_cols.pl col_data-undef.txt > col_data-NA.txt

[Image: buffers showing tabs]

Is everything in the correct order? Would it work on 50 columns ?!?

Sometimes lazy is good but sometimes you need @ikegami ... :-)

G. Cito
  • In my own defense I give you [Exhibit A](http://stackoverflow.com/a/3140845/2019415) here on SO where Zaid comments neutrally on similar syntax ;-) – G. Cito Apr 22 '14 at 17:32
  • Just so you know, this code is buggy (in particular the second `s///`). – Zaid Apr 22 '14 at 20:50
  • @Zaid :-\ oops - thought it worked so I `rm`ed it ... will try again ... do you mean the code at the link or the list of `s///`'s above? – G. Cito Apr 23 '14 at 00:49
  • What if there is a tab followed by a space? According to the OP's requirements such a field should not be NA-ized. – Zaid Apr 23 '14 at 07:41
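
To make the last comment concrete, here is a small sketch of the case Zaid describes. The helper name fill_naive and the sample line are made up for illustration; the three substitutions are copied from the answer above:

```perl
use strict;
use warnings;

# The three substitutions from the answer, wrapped in a hypothetical helper.
sub fill_naive {
    my ($s) = @_;
    $s =~ s/\t\t/\tNA\t/;
    $s =~ s/\t\s/\tNA/;     # \s matches a space, so " b" is treated as empty
    $s =~ s/^\t/NA\t/;
    return $s;
}

# Column B holds " b" (with a leading space), which is not an empty field.
my $out = fill_naive("a\t b\tc\n");
print $out;    # prints "a<TAB>NAb<TAB>c": the space is gone and NA is glued onto b
```

A split on /\t/ with a negative LIMIT followed by a length test would leave " b" untouched, since the field has length 2.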