-1

I have three fields in dataset file.

field 1 acts as id

field 2 is used to compare the min

field 3 is boolean either 0 or 1.

I need to find the min value in field 2 but with respect to field 1. that is, consider below dataset.

dataset

1 0.432 0 
1 0.12  1
1 0.298 0
2 0.056 0
2 0.06  1
3 0.982 0

I need to compare the values in field 2 for first 3 records and check if field 3 has value = 1 for the min value in field 2. if so, ++count.

Then find min again in field 2 but for records with field 1 = 2. that is, only record 4 and 5.and so on...

What would be the best way to go about with it? The file contains approx 2,000,000 records.

Is it possible to sort field 2 and then take one record for each different value of field1?

Community
  • 1
  • 1
Murlidhar Fichadia
  • 2,589
  • 6
  • 43
  • 93

1 Answers1

1

the easiest...

$ sort -n file | awk '!a[$1]++'

1 0.12  1
2 0.056 0
3 0.982 0

to count the sum

$ sort -n file | awk '!a[$1]++{sum+=$3} END{print sum}'
1

however, if there is a match in field two and you want to pick the record with last field 1, you have to reverse sort for field 3, i.e. sort -k1,2n -k3r

Explanation

!a[$1]++ is an awk idiom to select the first unique entry for field 1. Creating a counter mapped with key, logically will be true only for the first entry (due to negation and automatic conversion of values to boolean)

sorting: first two fields in ascending order (but numerical sort so 2 < 11), third is descending order (reverse) so that 1 will appear before 0. Since last field is one digit only numerical sorting or lexical sorting doesn't matter, otherwise you want it to be numerical too.

karakfa
  • 66,216
  • 7
  • 41
  • 56