
This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print one line for each distinct peptide in column 2, meaning the above input would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I've tried so far, but clearly neither does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file | sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing: it will need to treat peptides that are substrings of other peptides as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)

Bede Constantinides

4 Answers


Just use sort:

sort -k 2,2 -u file

The -u removes duplicate entries (as you wanted), and -k 2,2 makes field 2 the only sort key, so the rest of the line is ignored when checking for duplicates.
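One caveat worth noting (an observation, not part of the original answer): sort -u reorders the output by the key, and POSIX does not specify which line of a duplicate group survives (GNU sort keeps the first occurrence), whereas the awk-based answers below preserve input order. A quick sketch against the sample data, assuming it is saved as `file`:

```shell
# Recreate the sample input from the question (assumed filename: file).
cat > file <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Sort on field 2 only (-k 2,2) and keep one line per distinct key (-u).
# Five lines remain, one per peptide, ordered alphabetically by field 2.
sort -k 2,2 -u file
```

Substrings are no problem here: sort compares whole fields, so VSSILED and VSSILEDKILSR would be distinct keys.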

flolo
  • Awesome! And if you want to get the top X number of unique entries once you've sorted the file with sort, instead of keeping only one entry per key, you can use a little app I created here: https://github.com/danieliversen/MiscStuff/blob/master/scripts/findTopUniques.java – Daniel Iversen Feb 25 '16 at 10:30

One way using awk:

awk '!array[$2]++' file.txt

Results:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
Steve
  • Could you explain your approach, please? (like why array and ++) – Mauri1313 Feb 13 '23 at 13:48
  • @Mauri1313 Basically, this just uses an associative array (called 'array') to print only the unique lines based on the second field in the line. If the value of the key is undefined (which is the case for the first occurrence of the key), then the expression returns true and the line is printed. If the key has already been seen, the value of the key is set to a non-zero value, and the expression returns false, in which case the line is not printed. – Steve Feb 13 '23 at 15:47
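For readers new to the idiom, the one-liner can be expanded into an equivalent long-hand version (a sketch; `seen` is a hypothetical name standing in for the array):

```shell
# Long-hand equivalent of awk '!array[$2]++' (illustrative names only).
printf '%s\n' 'pep> AIQLTGK 1 x' 'pep> AIQLTGK 8 x' 'pep> VSSILEDKTT 9 y' |
awk '{
    if (seen[$2] == 0)     # first time this peptide appears...
        print $0           # ...print the whole line
    seen[$2]++             # remember it for later lines
}'
```

Unlike sort -u, this keeps the surviving lines in their original input order.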

I would use Perl for this:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The -n switch loops over the input line by line, and the -a switch splits each line into the @F array, so $F[1] is the second field.

choroba
awk '{if($2==temp){next;}else{print}temp=$2}' your_file

tested below:

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750
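Note that this approach only compares each line with the one immediately before it, which is why the second AIQLTGK line survives in the test output above. It only deduplicates input where lines with the same key are adjacent; sorting on field 2 first would group them together (a sketch combining the two tools, not part of the original answer):

```shell
# Group duplicate peptides together first, then the previous-line
# comparison removes all repeats (field 2 is the peptide).
printf '%s\n' 'p AIQLTGK 1' 'p VSSILEDKTT 9' 'p AIQLTGK 10' |
sort -k 2,2 |
awk '$2 == temp {next} {print; temp = $2}'
```

The cost is that, as with sort -u, the original line order is lost.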
Vijay