Advanced `uniq` with "unique part regex"

Question

uniq is a tool that enables one to filter the lines of a file so that only unique lines are shown. uniq has some support for specifying when two lines are "equivalent", but the options are limited.

I'm looking for a tool/extension on uniq that allows one to enter a regex. If the captured group is the same for two lines, then the two lines are considered "equivalent". Only the "first match" is returned for each equivalence class.

Example:

file.dat:

foo!bar!baz
!baz!quix
!bar!foobar
ID!baz!

Using grep -P '(!\w+!)' -o, one can extract the "unique parts":

!bar!
!baz!
!bar!
!baz!

This means that the first line is considered to be "equivalent" with the third and the second with the fourth. Thus only the first and the second are printed (the third and fourth are ignored).

Then uniq '(!\w+!)' < file.dat should return:

foo!bar!baz
!baz!quix

Do you have a better example? Not sure how you'd ever make that regex do what you want without writing something custom, but quite sure there'd be a solution using some standard tools if we can see how your data would look. — arco444, Oct 29 '14 at 15:07

score 2 · Accepted Answer · answered Oct 29 '14 at 15:19

Not using uniq but using gnu-awk you can get the results you want:

awk -v re='![[:alnum:]]+!' 'match($0, re, a) && !(a[0] in p) {p[a[0]]; print}' file
foo!bar!baz
!baz!quix

The regex is passed in through a command-line variable: -v re=...
The three-argument match function (a gawk extension) matches the regex against each line and stores the matched text in the array a; a[0] holds the whole match.
When the match succeeds and a[0] is not yet a key of the associative array p, we record it in p and print the line.
This effectively gives uniq-like behavior with regex support.
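
If gawk is not available, the same idea can be sketched with the two-argument match from POSIX awk, which sets RSTART and RLENGTH instead of filling an array. This is only a sketch: the function name uniq_re is made up for illustration, and the bracket expression [A-Za-z0-9_] is an assumed stand-in for \w that older awk builds also accept.

```shell
# uniq_re REGEX [FILE...]: print the first line for each distinct regex match.
# Hypothetical helper name; relies only on POSIX awk (RSTART/RLENGTH).
uniq_re() {
  re=$1
  shift
  awk -v re="$re" '
    match($0, re) {                        # sets RSTART and RLENGTH on success
      key = substr($0, RSTART, RLENGTH)    # the matched text
      if (!(key in seen)) {                # first line with this key
        seen[key]
        print
      }
    }
  ' "$@"
}

# Sample data from the question
printf '%s\n' 'foo!bar!baz' '!baz!quix' '!bar!foobar' 'ID!baz!' > file.dat
uniq_re '![A-Za-z0-9_]+!' file.dat
```

As in the gawk version, lines that do not match the regex at all are dropped, and the key is the whole match rather than a capture group.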

anubhava

I'm getting a syntax error: `context is match($0, >>> re, <<<` – Lincoln Bergeson Feb 03 '17 at 18:25
Make sure you're using gnu-awk as written in my answer. – anubhava Feb 03 '17 at 18:53
I'm on Mac OS, maybe that's the problem. – Lincoln Bergeson Feb 03 '17 at 19:12

Lucas Trzesniewski · Answer 2 · 2014-10-29T15:26:49.697

Here's a simple Perl script that will do the work:

#!/usr/bin/env perl
use strict;
use warnings;

my $re = qr($ARGV[0]);

my %matches;
while(<STDIN>) {
    next if $_ !~ $re;
    print if !$matches{$1};
    $matches{$1} = 1;
}

Usage:

$ ./uniq.pl '(!\w+!)' < file.dat
foo!bar!baz
!baz!quix

Here, I've used $1 to match on the first extracted group, but you can replace it with $& to use the whole pattern match.
This script will filter out lines that don't match the regex, but you can adjust it if you need a different behavior.

edited Oct 29 '14 at 15:26

answered Oct 29 '14 at 15:20

Lucas Trzesniewski

Man, StackOverflow's awesome. Thanks @LucasTrzesniewski :-) – Lincoln Bergeson Feb 03 '17 at 19:43

arco444 · Answer 3 · 2014-10-29T15:38:54.970

You can do this with just grep and sort:

DATAFILE=file.dat

# Extract each distinct key, then print the first line that contains it.
grep -oP '(!\w+!)' "$DATAFILE" | sort -u | while IFS= read -r match; do
  grep -F -m1 -- "$match" "$DATAFILE"
done

Outputs:

foo!bar!baz
!baz!quix
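
Because sort -u orders the keys alphabetically, the lines come out in key order rather than input order. A sketch that keeps first-appearance order, assuming awk is acceptable in place of sort, is shown below; the function name first_per_key and the sample file are made up for illustration, and [A-Za-z0-9_] is an assumed stand-in for \w so that plain grep -E suffices.

```shell
# first_per_key FILE: hypothetical helper name.
first_per_key() {
  file=$1
  grep -oE '![A-Za-z0-9_]+!' "$file" |   # every key occurrence, in input order
    awk '!seen[$0]++' |                  # keep the first occurrence of each key
    while IFS= read -r key; do
      grep -F -m1 -- "$key" "$file"      # first line containing that key
    done
}

# Sample input where key order and input order differ
printf '%s\n' 'x!zed!1' 'y!abc!2' 'z!zed!3' > sample.dat
first_per_key sample.dat
```

Like the loop above, this rereads the file once per key, and grep -o reports every match on a line, not only the first, so it is best suited to small inputs with one key per line.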

edited Oct 29 '14 at 15:38

answered Oct 29 '14 at 15:25

arco444

Isn't a side effect that the values will get sorted? – Willem Van Onsem Oct 29 '14 at 15:25
I don't know about "side effect" - it's something that will happen. How big is the input file? If you want something much smarter, the perl solution is perfect. – arco444 Oct 29 '14 at 15:27
