
I am trying to obtain the smallest $2 value for every $1 value. My data looks as follows:

0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718

The output must look like:

0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718

I can manage this by splitting the data into a separate file for each $1 value and finding the minimum in each file. But I am wondering if there might be a more elegant solution for this?

Thanks.

jamie
The smallest value for 23.9901 is 3.364. Are identical $1 values always consecutive? – Casimir et Hippolyte Feb 15 '16 at 15:24
  • Yes, that is correct. Just fixed it. $1 values are always sorted in increasing order. – jamie Feb 15 '16 at 15:30
  • Also, your sample output looks to have rounded `$1` to two decimal places, as you only have `45.79 0` in your output; did you want to process `45.7902 12.1518`? Hm... no, you have included a similar case in `54.1102 14.0718`. Oh yeah, and you forgot to include your attempt in code to solve your problem. We can help you fix your code, but S.O. isn't a free coding service. Good luck. – shellter Feb 15 '16 at 15:31

5 Answers

3

With GNU or FreeBSD sort, you can do it as follows:

sort -k1,1 -k2,2g file | sort -k1,1g -su

The first sort orders the file by first-column and then second-column value. The second sort uniquifies the file (-u), using only the first column to determine uniqueness; it also uses the -s flag to suppress the last-resort whole-line comparison, which guarantees that the line kept for each key is the first in its group, i.e. the one with the smallest second column. In both cases, the sort uses the -g flag where it matters (see below), which does general numeric comparison, unlike the POSIX-standard -n flag, which only compares a leading plain-decimal prefix.
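
To see what is happening, here is the intermediate stream the first sort produces on the question's sample data: equal first-column values are grouped, smallest second-column value first, so the second sort's stable -u keeps exactly the line we want for each key.

$ sort -k1,1 -k2,2g file
0 0
23.9901 3.364
23.9901 3.364
23.9901 13.604
23.9901 13.604
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718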

Performance note: (And thanks to OP for spurring me to do the measurements):

Leaving the g off of -k1,1 in the first sort is not a typo; it actually speeds the sort up considerably (on large files, with GNU sort). Standard (lexicographic) or integer (-n) sorts are much faster than general numeric sorts, perhaps 10 times as fast. However, all key types are about twice as fast on files which are "mostly sorted". For more-or-less uniformly sampled random numbers, a lexicographic sort is a close approximation to a general numeric sort; close enough that the result shows the "mostly sorted" speed-up.

It would have been possible to sort only by the second field in the first pass: sort -k2,2g file | sort -k1,1g -su, but this is much slower, both because the primary sort in the first pass is general numeric instead of lexicographic and because the file is no longer mostly sorted for the second pass.

Here's just one sample point, although I did a few tests with similar results. The input file consists of 299,902 lines, each containing two numbers in the range 0 to 1,000,000, with three decimal digits. There are precisely 100,000 distinct numbers in the first column; each appears from one to five times with different numbers in the second column. (All numbers in the second column are distinct, as it happens.)

All timings were collected with bash's time keyword, taking the real (wall-clock) time. (sort multithreads nicely, so the user time was always greater than the real time.)
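
For anyone who wants to reproduce the comparison, here is a minimal sketch of a generator for input of roughly this shape. This is my approximation, not rici's actual generator, and the file names unsorted and sorted are just illustrative:

awk 'BEGIN {
  srand(42)                                  # fixed seed for repeatability
  for (i = 0; i < 100000; i++) {
    key = sprintf("%.3f", rand() * 1000000)  # first-column value, 3 decimals
    reps = 1 + int(rand() * 5)               # one to five rows per key
    for (j = 0; j < reps; j++)
      printf "%s %.3f\n", key, rand() * 1000000
  }
}' > unsorted
sort -k1,1g unsorted > sorted                # first column numerically ordered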

With the first column correctly sorted and the second column randomised:

sort -k1,1  -k2,2g sorted | sort -k1,1g -su          1.24s
sort -k1,1g -k2,2g sorted | sort -k1,1g -su          1.78s
sort        -k2,2g sorted | sort -k1,1g -su          3.00s

With the first column randomised:

sort -k1,1  -k2,2g unsorted | sort -k1,1g -su        1.42s
sort -k1,1g -k2,2g unsorted | sort -k1,1g -su        2.19s
sort        -k2,2g unsorted | sort -k1,1g -su        3.01s
rici
  • This is a great way to solve this problem! However, a bit slow for large data... – jamie Feb 15 '16 at 16:10
  • @jamie: Yes, this solution is more apt for the case where the first column is not already sorted. I missed that in the specification. – rici Feb 15 '16 at 16:44
2

You can use this gnu-awk command:

awk '!($1 in m) || m[$1]>$2{m[$1]=$2} END{for (i in m) print i, m[i]}' file

Or, to get the same order as the input file:

awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} !($1 in m) || m[$1] > $2 {m[$1] = $2}
     END{for (i in m) print i, m[i]}' file

BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} makes for (i in m) traverse the associative array in ascending numeric index order; it is a gawk extension.

Output:

0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
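
The PROCINFO["sorted_in"] trick requires gawk. A portable sketch of the same idea (my variant, not part of this answer) works with any POSIX awk by sorting afterwards:

awk '!($1 in m) || m[$1] > $2 {m[$1] = $2} END{for (i in m) print i, m[i]}' file | sort -n

Here sort -n compares the leading numeric string of each line, which is enough because the keys are plain decimals.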
anubhava
1

Since equal $1 values are always consecutive in your input, you can do it in a single pass:

awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2} END{print k,v}' file

The same script, indented and commented:

awk '
# for the first record, store the two fields
NR==1 {
    k=$1
    v=$2
    next
}
# while the first field does not change
k==$1 {
    # check if the second field is lower
    if (v>$2)
       v=$2
    next
}
{
    # otherwise print the stored fields and reinitialize them
    print k,v
    k=$1
    v=$2
}
# at end of input, print the pair for the last key
END {
    print k,v
}' file
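
This single pass relies on equal $1 values being adjacent. If the input were not already ordered by $1, a sketch of a workaround (not part of this answer; assumes GNU sort for -g) is to group the keys first and feed the result to the same script:

sort -k1,1g file | awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2} END{print k,v}'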
Casimir et Hippolyte
1

In Perl:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

my %min;  # smallest second-column value seen for each first-column key

while (<>) {
  chomp;
  my ($key, $value) = split;
  # store the value if the key is new or this value is smaller
  if (!exists $min{$key} or $value < $min{$key}) {
    $min{$key} = $value;
  }
}

# print the keys in ascending numeric order
for (sort { $a <=> $b } keys %min) {
  say "$_ $min{$_}";
}

It's written as a Unix filter, so it reads from STDIN and writes to STDOUT. Call it as:

$ ./get_min < input_file > output_file
Dave Cross
0

When you want to use sort alone, you first have to fix the ordering. This sort will not understand the decimal point, so temporarily change it to an x, which splits each number into an integer field and a fraction field.
Now sort numerically on the resulting numeric fields and put the decimal point back. The resulting list is sorted correctly; the final sort -u takes the first value of each key.

sed 's/\./ x /g' inputfile | sort -n -k1,3 -k4,6  | sed 's/ x /./g'  | sort -u -k1,1
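
For illustration, the intermediate stream after the first sed looks like this on the question's sample data; each number now spans an integer field, a literal x, and a fraction field, which is why the sort keys cover fields 1-3 and 4-6:

$ sed 's/\./ x /g' inputfile | head -4
0 0
23 x 9901 13 x 604
23 x 9901 13 x 604
23 x 9901 3 x 364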
Walter A