Removing lines with the first and the last occurrence of a column value

Question

I have the following file:

    ID      Score    Other
    ABR     0.98     NBNMSB
    BCG     0.76     NBNMSB
    CVD     0.6      NBNMSB
    BCG     0.9      VSCVA
    CVD     0.56     VSCVA
    ABR     0.9      VSCVA
    CVD     0.7      BAVSC
    BCG     0.4      BAVSC
    ABR     0.5      BAVSC
    AAC     0.1      BAVSC
    ABR     0.8      NBNMSB
    BCG     0.6      NBNMSB
    CVD     0.3      NBNMSB
    BCG     0.7      VSCVA
    CVD     0.0      VSCVA
    ABR     0.1      VSCVA
    CVD     0.5      BAVSC
    BCG     0.8      BAVSC
    ABR     1.0      BAVSC

I want to exclude the first and the last line of each consecutive block of identical values in column 3, so that I get this output:

    ID      Score    Other
    BCG     0.76     NBNMSB
    CVD     0.56     VSCVA
    BCG     0.4      BAVSC
    ABR     0.5      BAVSC
    BCG     0.6      NBNMSB
    CVD     0.0      VSCVA
    BCG     0.8      BAVSC

What would you expect if you fed your proposed output back through the filter (i.e. if there's only two of a specific value in column 3, so that literally removing the first and the last occurrence would in fact remove all occurrences)? — twalberg, Jan 27 '16 at 20:24

F. Knorr · Answer 1 · 2016-01-27T20:19:19.820

4

In awk, you can try this:

    awk 'NR==1
         {last[NR%3]=$3; lastLine[NR%3]=$0}
         last[(NR-1)%3]==last[(NR-2)%3] &&
               last[(NR-1)%3]==last[NR%3] {print lastLine[(NR-1)%3]}' test

which yields the expected output:

    ID      Score    Other
    BCG     0.76     NBNMSB
    CVD     0.56     VSCVA
    BCG     0.4      BAVSC
    ABR     0.5      BAVSC
    BCG     0.6      NBNMSB
    CVD     0.0      VSCVA
    BCG     0.8      BAVSC

Explanation
1. The NR==1 pattern has no action, so it simply prints the first line (the header).
2. {last[NR%3]=$3;lastLine[NR%3]=$0;} stores the third column and the full text of the current line in two arrays (last and lastLine). Because the index is NR%3, the arrays always hold the current line and the two lines before it.
3. With last[(NR-1)%3]==last[(NR-2)%3] && last[(NR-1)%3]==last[NR%3] we check whether the previous line has the same value in the third column as both the line before it and the current line (i.e., whether all three lines share the same value in the 3rd column). If so, we print the previous line.
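To make the three steps above easier to follow, here is a sketch of the same idea spread over several lines with comments. The helper name `middle_of_three`, the `col` variable, the shortened array names and the small sample file are assumptions added for illustration; they are not part of the original answer.

```shell
# Small sample: a header, then two runs (NBNMSB x3, VSCVA x3) in column 3.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID Score Other
ABR 0.98 NBNMSB
BCG 0.76 NBNMSB
CVD 0.6 NBNMSB
BCG 0.9 VSCVA
CVD 0.56 VSCVA
ABR 0.9 VSCVA
EOF

# middle_of_three COL FILE  (hypothetical helper name)
middle_of_three() {
    awk -v col="$1" '
        NR == 1                      # no action: print the header as is
        {
            key[NR % 3]  = $col      # three-slot ring buffer: compared value
            line[NR % 3] = $0        # three-slot ring buffer: full line
        }
        # previous line has the same key as the line before it and the current line
        key[(NR-1) % 3] == key[(NR-2) % 3] && key[(NR-1) % 3] == key[NR % 3] {
            print line[(NR-1) % 3]
        }
    ' "$2"
}

# Prints the header, then "BCG 0.76 NBNMSB" and "CVD 0.56 VSCVA".
middle_of_three 3 "$sample"
```

Passing 4 instead of 3 as the first argument would compare the fourth column. The `% 3` stays as it is, because it sets the size of the three-line window rather than the column.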

edited Jan 27 '16 at 20:19

answered Jan 27 '16 at 20:07


Can you break down the code, if it is not too much trouble? TIA – AishwaryaKulkarni Jan 27 '16 at 20:09
I have provided an explanation. Is that what you meant? – F. Knorr Jan 27 '16 at 20:16
Yes. Also, how do I go about it if I have 4 columns instead of 3? For instance, my rows have the format abc 1234 2345 NBNMSB. How do I tweak the above code to work on the 4th column? I tried replacing 3 with 4; is there anything else I need to change? – AishwaryaKulkarni Jan 27 '16 at 20:21
In this case, just replace each $3 by $4. Don't change the `%3` -- this is necessary to compare just the last three lines (i.e., to correctly discard the first and last occurrence of the specified column's value.) – F. Knorr Jan 27 '16 at 20:25
I did something like this : awk 'NR==1{last[NR%4]=$4;lastLine[NR%4]=$0;}last[(NR-1)%4]==last[(NR-2)%4]&&last[(NR-1)%4]==last[NR%4]{print lastLine[(NR-1)%4]}' file and yet it didn't spit out any results, I hope I am doing it right. – AishwaryaKulkarni Jan 27 '16 at 20:27
awk 'NR==1 {last[NR%3]=$4;lastLine[NR%3]=$0;} last[(NR-1)%3]==last[(NR-2)%3] && last[(NR-1)%3]==last[NR%3]{print lastLine[(NR-1)%3]}' test – F. Knorr Jan 27 '16 at 20:39

jas · Accepted Answer · 2016-01-27T21:14:43.303

2

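If you have tac (or gtac), you can remove the first instance of each block, reverse the file, remove the first (really the last) instance of each block, and flip the file one last time. Note that the header line is dropped too, because it is treated as the first line of its own block.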

    $ awk '$3==p;{p=$3}' file1 | tac | awk '$3==p;{p=$3}' | tac
    BCG     0.76     NBNMSB
    CVD     0.56     VSCVA
    BCG     0.4      BAVSC
    ABR     0.5      BAVSC
    BCG     0.6      NBNMSB
    CVD     0.0      VSCVA
    BCG     0.8      BAVSC

EDIT:

Here is a more flexible version. Just set the initial value of c to the desired column:

Use column 3:

    c=3 && awk -v c="$c" '$c==p;{p=$c}' file1 | tac | awk -v c="$c" '$c==p;{p=$c}' | tac

Use column 4:

    c=4 && awk -v c="$c" '$c==p;{p=$c}' file1 | tac | awk -v c="$c" '$c==p;{p=$c}' | tac
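If you use this often, the pipeline could be wrapped in a small shell function. The sketch below is only an illustration: the name `drop_run_edges` and the sample file are made up here, and it assumes `tac` is on the PATH (substitute `gtac` where needed).

```shell
# Small sample: a header, then two runs (NBNMSB x3, VSCVA x3) in column 3.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID Score Other
ABR 0.98 NBNMSB
BCG 0.76 NBNMSB
CVD 0.6 NBNMSB
BCG 0.9 VSCVA
CVD 0.56 VSCVA
ABR 0.9 VSCVA
EOF

# drop_run_edges COL FILE  (hypothetical helper, not from the answer)
drop_run_edges() {
    awk -v c="$1" '$c == p; { p = $c }' "$2" |   # drop the first line of each run
        tac |                                     # reverse what is left
        awk -v c="$1" '$c == p; { p = $c }' |     # drop the originally last line of each run
        tac                                       # restore the original order
}

# Prints "BCG 0.76 NBNMSB" and "CVD 0.56 VSCVA"; the header is not kept.
drop_run_edges 3 "$sample"
```

The column number is passed once as the first argument, so switching from column 3 to column 4 only changes the call, not the pipeline.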

edited Jan 27 '16 at 21:14

answered Jan 27 '16 at 20:19


Also, how do I go about it if I have 4 columns instead of 3? For instance, my rows have the format abc 1234 2345 NBNMSB. How do I tweak the above code to work on the 4th column? I tried replacing 3 with 4; is there anything else I need to change? – AishwaryaKulkarni Jan 27 '16 at 20:24
Globally changing `$3` to `$4` would be the only change necessary. – jas Jan 27 '16 at 20:26
Excellent. Also see the edit for a more flexible version. – jas Jan 27 '16 at 20:37

bkmoney · Answer 3 · 2016-01-27T21:18:05.267

Another, simpler awk solution is:

    awk 'NR == 1; prev != $3 {prev = $3; line = 0; next}
    {if (line) print line; line = $0}' foo.txt | column -t

You will get

    ID   Score  Other
    BCG  0.76   NBNMSB
    CVD  0.56   VSCVA
    BCG  0.4    BAVSC
    ABR  0.5    BAVSC
    BCG  0.6    NBNMSB
    CVD  0.0    VSCVA
    BCG  0.8    BAVSC

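What this does is store the current 3rd-column value in prev and the most recently read line of the current block in line. The first line of a block is skipped, and every other line is held back and printed only when the next line with the same value arrives, so the last line of a block is never printed.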

Note that this takes only one pass through the file, unlike the tac approach, which needs several.
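Here is a commented sketch of the same single-pass idea, which also shows what happens when a value occurs only twice in a row: both lines are removed, so nothing from that block survives. The helper name `inner_lines`, the `col` variable and the sample file are assumptions for illustration. It also uses a separate `have` flag instead of testing the stored line itself, a small deviation so that a line consisting only of `0` would not be mistaken for "nothing stored".

```shell
# Small sample: a header, a run of three (NBNMSB) and a run of two (VSCVA).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID Score Other
ABR 0.98 NBNMSB
BCG 0.76 NBNMSB
CVD 0.6 NBNMSB
BCG 0.9 VSCVA
CVD 0.56 VSCVA
EOF

# inner_lines COL FILE  (hypothetical helper name)
inner_lines() {
    awk -v col="$1" '
        NR == 1                        # no action: print the header
        $col != prev {                 # a new run starts here
            prev = $col; have = 0      # remember its value, forget the held line
            next                       # and skip the first line of the run
        }
        {
            if (have) print held       # the held line has a successor, so it is not last
            held = $0; have = 1        # hold the current line until the next one arrives
        }
    ' "$2"
}

# Prints the header and "BCG 0.76 NBNMSB"; the two VSCVA lines both disappear.
inner_lines 3 "$sample"
```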

score 1 · Answer 4 · answered Jan 27 '16 at 22:03

This might work for you (GNU sed):

    sed -r '1p;$!N;/(\S+)\n.*\1$/!d;P;D' file

Print the first line regardless (it is the header). Read two lines at a time, and if those two lines don't have the same third column, delete them both. Otherwise, print the first line, append the next line, and repeat.
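One thing to watch: the pattern above is not anchored to the start of a field, so `(\S+)` can match just the trailing characters of the first line. Two lines ending in `XB` and `NBNMSB` would therefore count as equal. The sketch below uses a stricter pattern that compares the whole last field. It is an illustration only: the helper name `strict_sed` and the sample files are made up, and it assumes GNU sed. Like the original one-liner, it still appears to rely on every value occurring at least twice in a row.

```shell
# Small sample: a header, then two runs (NBNMSB x3, VSCVA x3) in the last column.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID Score Other
ABR 0.98 NBNMSB
BCG 0.76 NBNMSB
CVD 0.6 NBNMSB
BCG 0.9 VSCVA
CVD 0.56 VSCVA
ABR 0.9 VSCVA
EOF

# strict_sed FILE  (hypothetical helper; assumes GNU sed)
# Group 2 is the complete last field of the first line; the second line must
# end with exactly that field, preceded by whitespace or the line start.
strict_sed() {
    sed -E '1p;$!N;/(^|[[:space:]])([^[:space:]]+)\n(.*[[:space:]])?\2$/!d;P;D' "$1"
}

# Prints the header, then "BCG 0.76 NBNMSB" and "CVD 0.56 VSCVA".
strict_sed "$sample"
```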
