Selecting surrounding lines around the missing sequence numbers

Question

I have a file whose contents are as follows:

TEST_4002_sample11_1_20110531.TXT
TEST_4002_sample11_2_20110531.TXT
TEST_4002_sample11_4_20110531.TXT
TEST_4002_sample11_5_20110531.TXT
TEST_4002_sample11_6_20110531.TXT
TEST_4002_sample10_1_20110531.TXT
TEST_4002_sample10_2_20110531.TXT
TEST_4002_sample10_4_20110531.TXT
TEST_4002_sample10_5_20110531.TXT

If a number is missing from the sequence in the 4th field (fields separated by _), I want to print the file name just before the gap and the file name just after it. Expected output:

TEST_4002_sample11_2_20110531.TXT
TEST_4002_sample11_4_20110531.TXT
TEST_4002_sample10_2_20110531.TXT
TEST_4002_sample10_4_20110531.TXT

Hm. I really don't understand why you closed this question. It is a real-world programming question with a concrete solution (as you can see in the answers). It provides example input and the wanted output. – clt60, Jun 14 '11 at 07:19

score 1 · Accepted Answer · answered Jun 10 '11 at 13:32

This awk variant seems to produce the required output:

awk -F_ '$4>c+1{print p"\n"$0}{p=$0;c=$4}'
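Spelled out with comments, the same idea might look like the sketch below. The function name find_gaps is hypothetical, and the extra check on field 3 is an assumption that numbering restarts for each sample group; the one-liner above compares field 4 only.

```shell
# Print the lines on either side of each gap in field 4 (fields split on "_").
# Assumption: numbering restarts for every sample group (field 3), so the
# counter is only compared against the previous line of the same group.
find_gaps() {
  awk -F_ '
    # Same group as the previous line, but the number jumped by more than 1.
    $3 == group && $4 > seq + 1 { print prev; print $0 }
    # Remember the current line for the next comparison.
    { prev = $0; group = $3; seq = $4 }
  ' "$@"
}

# Sample input from the question, fed on stdin.
find_gaps <<'EOF'
TEST_4002_sample11_1_20110531.TXT
TEST_4002_sample11_2_20110531.TXT
TEST_4002_sample11_4_20110531.TXT
TEST_4002_sample11_5_20110531.TXT
TEST_4002_sample11_6_20110531.TXT
TEST_4002_sample10_1_20110531.TXT
TEST_4002_sample10_2_20110531.TXT
TEST_4002_sample10_4_20110531.TXT
TEST_4002_sample10_5_20110531.TXT
EOF
```

On the sample input this prints the four lines asked for in the question. The function also accepts file names as arguments, since they are passed straight through to awk.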

ripat

A very simple Perl script you have given. Thanks for your response. – gyrous Jun 14 '11 at 04:52

score 1 · Answer 2 · answered Jun 12 '11 at 00:08

A simple Perl way:

perl -F_ -lane 'print "$o\n$_" if $F[3]-$n>1;$o=$_;$n=$F[3]' < file

clt60

Thanks for your perl command. – gyrous Jun 14 '11 at 04:50

Qtax · Answer 3 · 2011-06-10T11:24:18.580

In Perl you could do something like this:

use strict;
use warnings;

my $prev_line;
my $prev_val;

while(<>){
    # get the 4th value
    my $val = (split '_')[3];

    # skip if invalid line
    next if !defined $val;

    # print if missed sequence
    if(defined($prev_val) && $val > $prev_val + 1){
        print $prev_line . $_;
    }

    # save for next iteration
    $prev_line = $_;
    $prev_val = $val;
}

Save that in foo.pl and run it with something like:

cat file.txt | perl foo.pl

I'm sure it can be shortened quite a lot. Could use something like this if all lines are valid:

perl -n -e '$v=(/[^_]+/g)[3];print"$l$_"if$l&&$v>$p+1;$p=$v;$l=$_' file.txt

or

perl -naF_ -e '$v=$F[3];print"$l$_"if$l&&$v>$p+1;$p=$v;$l=$_' file.txt

score 0 · Answer 4 · answered Jun 10 '11 at 11:39

As far as I understand what you need, here is a Perl script that does the job:

#!/usr/local/bin/perl 
use strict;
use warnings;

my $prev = '';
my %seq1;
while(<DATA>) {
    chomp;
    my ($seq1, $seq2) = $_ =~ /^.*?(\d+)_(\d+)_\d+\.TXT$/;
    $seq1{$seq1} = $seq2 - 1 unless exists $seq1{$seq1};
    if ($seq1{$seq1}+1 != $seq2) {
        print $prev,"\n",$_,"\n";
    }
    $prev = $_;
    $seq1{$seq1} = $seq2;
}


__DATA__
TEST_4002_sample11_1_20110531.TXT
TEST_4002_sample11_2_20110531.TXT
TEST_4002_sample11_4_20110531.TXT
TEST_4002_sample11_5_20110531.TXT
TEST_4002_sample11_6_20110531.TXT
TEST_4002_sample10_1_20110531.TXT
TEST_4002_sample10_2_20110531.TXT
TEST_4002_sample10_4_20110531.TXT
TEST_4002_sample10_5_20110531.TXT

Output:

TEST_4002_sample11_2_20110531.TXT
TEST_4002_sample11_4_20110531.TXT
TEST_4002_sample10_2_20110531.TXT
TEST_4002_sample10_4_20110531.TXT

score 0 · Answer 5 · answered Jun 10 '11 at 13:47

I used glob to get the files (it's possible that it's as simple as <TEST_*.TXT>).

use strict;
use warnings;

my %last = ( name => '', group => '', seq => 0 );

foreach my $file ( sort glob('TEST_[0-9][0-9][0-9][0-9]_sample[0-9][0-9]_[0-9]_*.TXT') ) {
    my ( $group, $seq ) = $file =~ m/(\d{4,}_sample\d+)_(\d+)/;
    if ( $group eq $last{group} && $seq - $last{seq} > 1 ) { 
        print join( "\n", $last{name}, $file, '' );
    }
    @last{ qw<name group seq> } = ( $file, $group, $seq );
}
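For comparison, a rough shell sketch of the same files-on-disk approach: list the matching names, sort them, and let awk look for gaps within each group. The function name gaps_in_dir is hypothetical. The sketch assumes file names contain no newlines and that sequence numbers are single digits, as the glob pattern above also assumes, so that sorted order keeps each group's numbers ascending.

```shell
# Hypothetical helper: report gaps among the TEST_*.TXT names in directory $1.
# Assumptions: file names contain no newlines, and sequence numbers are
# single digits, so sorted order keeps each group's numbers ascending.
gaps_in_dir() {
  (
    cd "$1" || exit 1
    for f in TEST_*.TXT; do
      # An unmatched glob stays literal, so skip names that do not exist.
      if [ -e "$f" ]; then printf '%s\n' "$f"; fi
    done
  ) | LC_ALL=C sort | awk -F_ '
    # Same group as the previous name, but the number jumped by more than 1.
    $3 == group && $4 > seq + 1 { print prev; print $0 }
    # Remember the current name for the next comparison.
    { prev = $0; group = $3; seq = $4 }
  '
}

# Scan the current directory.
gaps_in_dir .
```

Pass a different directory as the first argument to scan it instead of the current one.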
