With Gnu or FreeBSD sort
, you can do it as follows;
sort -k1,1 -k2,2g file | sort -k1,1g -su
The first sort
sorts the file into order by first and then second column value. The second sort
uniquifies the file (-u
) using only the first column to determine uniqueness. It also uses the -s
flag to guarantee that the second column is still in order. In both cases, the sort uses the -g
flag when it matters (see below), which does general numeric comparison, unlike the Posix-standard -n
flag which only compares leading integers.
Performance note: (And thanks to OP for spurring me to do the measurements):
Leaving the g
off of -k1,1
in the first sort is not a typo; it actually considerably speeds the sort up (on large files, with Gnu sort). Standard or integer (-n
) sorts are much faster than general numeric sorts, perhaps 10 times as fast. However, all key types are about twice as fast for files which are "mostly sorted". For more-or-less uniformly sampled random numbers, a lexicographic sort is a close approximation to a general numeric sort; close enough that the result shows the "mostly sorted" speed-up.
It would have been possible to only sort by the second field in the first sort: sort -k2,2g file | sort -k1,1g -su
but this is much slower, both because the primary sort in the first pass is general numeric instead of lexicographic and because the file is no longer mostly sorted for the second pass.
Here's just one sample point, although I did a few tests with similar results. The input file consists of 299,902 lines, each containing two numbers in the range 0 to 1,000,000, with three decimal digits. There are precisely 100,000 distinct numbers in the first column; each appears from one to five times with different numbers in the second column. (All numbers in the second column are distinct, as it happens.)
All timings were collected with bash's time
verb, taking the real (wallclock) time. (Sort multithreads nicely so the user time was always greater).
With the first column correctly sorted and the second column randomised:
sort -k1,1 -k2,2g sorted | sort -k1,1g -su 1.24s
sort -k1,1g -k2,2g sorted | sort -k1,1g -su 1.78s
sort -k2,2g sorted | sort -k1,1g -su 3.00s
With the first column randomised:
sort -k1,1 -k2,2g unsorted | sort -k1,1g -su 1.42s
sort -k1,1g -k2,2g unsorted | sort -k1,1g -su 2.19s
sort -k2,2g unsorted | sort -k1,1g -su 3.01s