0

I have a space-separated file that contains:

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv
5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.22e-01 7.23e-01 ./dir2/dir2/file2.csv
5.75e-01 7.00e-1 5.02e-01 7.93e-01 ./dir3/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

I would like to extract the value of column 5 that corresponds to the maximum of column 3 for each value of dir#. For example, let's say I'm talking about dir1. That corresponds to these rows:

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv

and I can find these using:

max_val_acc_=$(awk '$5 ~ /dir1/ { print }' filename.txt)
echo $max_val_acc

Now I think I need to pipe this result through a sort and take the head, but I can't get it working. The result I'm looking for (for dir1) is:

./dir1/dir2/file1.csv

and the complete result for all dir#:

./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
./dir3/dir2/file2.csv
StatsSorceress
  • 3,019
  • 7
  • 41
  • 82

5 Answers5

1

Is this what you're looking for?

$ cat tst.awk
{
    split($5,path,"/")
    dir = path[2]
    if ( !(dir in max) || ($3 > max[dir]) ) {
        max[dir] = $3
        val[dir] = $5
    }
}
END {
    for (dir in val) {
        print val[dir]
    }
}

$ awk -f tst.awk file
./dir3/dir2/file2.csv
./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
0

I'm not sure I understood you right but this is in awk how I understood you:

awk -v s="dir1" '         # search parameter in your s
index($5,"./" s "/") {    # if your s is found in $5
    if(max==""||$3>max){  # we initialize $3 or compare to previous max
        max=$3;           # store new max
        maxv=$5           # and new mac value 
    }
}
END{ print maxv }         # print the stored max value
' file                    # oh just the file
./dir1/dir2/file1.csv
James Brown
  • 36,089
  • 7
  • 43
  • 59
  • 1
    When doing min/max comparisons always seed with the first value read rather than hoping all the input values will be positive. Also, try it with `s="dir2"` and you'll see it fail since `dir2` appears on every line (you need to make it `index($0,"./"s"/")` or similar). – Ed Morton Mar 16 '17 at 17:34
0

another alternative with sort and awk

$ sort -k5 -k3,3r file | awk -F/ '!a[$NF]++'

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv

above is for filenames, if based on first dir names

$ sort -k3,3r file | awk '{split($NF,d,"/")} !a[d[2]]++'

5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

and if you want to print the dirs only

$ sort -k3,3r file | awk '{split($NF,d,"/")} !a[d[2]]++{print $NF}'

./dir2/dir2/file1.csv
./dir1/dir2/file1.csv
./dir3/dir2/file2.csv
karakfa
  • 66,216
  • 7
  • 41
  • 56
  • What does the 3r do? I understand -k3 is to sort by column 3, but I read the man page for sort and still don't understand -k3,3r – StatsSorceress Mar 24 '17 at 18:59
0

Just for fun - without any (awk, perl like) programming language

file="./data.txt"
paste -d ' ' "$file" <(cut -d/ -f2 "$file") |\
    LC_ALL=C sort -k6 -k3gr | uniq -f5 | cut -d' ' -f5

output

./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
./dir3/dir2/file2.csv

Use some awk solution. As i said, this is only for showing another way.

clt60
  • 62,119
  • 17
  • 107
  • 194
0

While not matching your requested output and similar to the other answers already posted, I find this awk command more memorable:

< file | sort -k3,3r | awk -F "/" '!seen[$2]++'

output:

5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

The more general point being: sort the entire list by size (column 3) and don't worry about sorting it by directory name (part of column 5), then extract the first of each directory name (i.e. awk prints only newly seen key name).

If you really want the output to just the directories names and sorted, then add the following to the pipe chain:

| cut -d ' ' -f5- | sort
gregory
  • 10,969
  • 2
  • 30
  • 42