
This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print one line for each distinct peptide in column 2, meaning the above input would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I've tried so far, but clearly neither does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file | sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing: it will need to treat peptides that are substrings of other peptides as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)

Bede Constantinides

4 Answers


Just use sort:

sort -k 2,2 -u file

The -u removes duplicate entries (as you wanted), and -k 2,2 makes field 2 the only sort key, so the rest of the line is ignored when checking for duplicates.
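One caveat worth noting (an observation, not part of the original answer): sort -u reorders the output by the key, and POSIX does not specify which line of a duplicate group survives (GNU sort keeps the first occurrence), whereas the awk-based answers below preserve input order. A quick sketch against the sample data, assuming it is saved as `file`:

```shell
# Recreate the sample input from the question (assumed filename: file).
cat > file <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Sort on field 2 only (-k 2,2) and keep one line per distinct key (-u).
# Five lines remain, one per peptide, ordered alphabetically by field 2.
sort -k 2,2 -u file
```

Substrings are no problem here: sort compares whole fields, so VSSILED and VSSILEDKILSR would be distinct keys.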

flolo
  • Awesome! And if you want to get the top X number of unique entries once you've sorted the file with sort, instead of keeping only one entry per key, you can use a little app I created here: https://github.com/danieliversen/MiscStuff/blob/master/scripts/findTopUniques.java – Daniel Iversen Feb 25 '16 at 10:30

One way using awk:

awk '!array[$2]++' file.txt

Results:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
Steve
  • Could you explain your approach, please? (like why array and ++) – Mauri1313 Feb 13 '23 at 13:48
  • @Mauri1313 Basically, this just uses an associative array (called 'array') to print only the unique lines based on the second field in the line. If the value of the key is undefined (which is the case for the first occurrence of the key), then the expression returns true and the line is printed. If the key has already been seen, the value of the key is set to a non-zero value, and the expression returns false, in which case the line is not printed. – Steve Feb 13 '23 at 15:47
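For readers new to the idiom, the one-liner can be expanded into an equivalent long-hand version (a sketch; `seen` is a hypothetical name standing in for the array):

```shell
# Long-hand equivalent of awk '!array[$2]++' (illustrative names only).
printf '%s\n' 'pep> AIQLTGK 1 x' 'pep> AIQLTGK 8 x' 'pep> VSSILEDKTT 9 y' |
awk '{
    if (seen[$2] == 0)     # first time this peptide appears...
        print $0           # ...print the whole line
    seen[$2]++             # remember it for later lines
}'
```

Unlike sort -u, this keeps the surviving lines in their original input order.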

I would use Perl for this:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The -n switch loops over the input line by line, and the -a switch splits each line into the @F array, so $F[1] is the second field.

choroba
awk '{if($2==temp){next;}else{print}temp=$2}' your_file

tested below:

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750
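Note that this approach only compares each line with the one immediately before it, which is why the second AIQLTGK line survives in the test output above. It only deduplicates input where lines with the same key are adjacent; sorting on field 2 first would group them together (a sketch combining the two tools, not part of the original answer):

```shell
# Group duplicate peptides together first, then the previous-line
# comparison removes all repeats (field 2 is the peptide).
printf '%s\n' 'p AIQLTGK 1' 'p VSSILEDKTT 9' 'p AIQLTGK 10' |
sort -k 2,2 |
awk '$2 == temp {next} {print; temp = $2}'
```

The cost is that, as with sort -u, the original line order is lost.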
Vijay