Bash: get value of one column based on value of another column

Question

I have a space-separated file that contains:

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv
5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.22e-01 7.23e-01 ./dir2/dir2/file2.csv
5.75e-01 7.00e-1 5.02e-01 7.93e-01 ./dir3/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

I would like to extract the value of column 5 that corresponds to the maximum of column 3 for each value of dir#. For example, let's say I'm talking about dir1. That corresponds to these rows:

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv

and I can find these using:

max_val_acc_=$(awk '$5 ~ /dir1/ { print }' filename.txt)
echo $max_val_acc

Now I think I need to pipe this result through a sort and take the head, but I can't get it working. The result I'm looking for (for dir1) is:

./dir1/dir2/file1.csv

and the complete result for all dir#:

./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
./dir3/dir2/file2.csv

What should the output be if 2 dirs have the same max $3 value? — Ed Morton, Mar 16 '17 at 21:23

Ed Morton · Answer 1 · 2017-03-16T17:36:50.333

1

Is this what you're looking for?

$ cat tst.awk
{
    split($5,path,"/")
    dir = path[2]
    if ( !(dir in max) || ($3 > max[dir]) ) {
        max[dir] = $3
        val[dir] = $5
    }
}
END {
    for (dir in val) {
        print val[dir]
    }
}

$ awk -f tst.awk file
./dir3/dir2/file2.csv
./dir1/dir2/file1.csv
./dir2/dir2/file1.csv

edited Mar 16 '17 at 17:36

answered Mar 16 '17 at 17:29

Ed Morton

188,023
17
78
185

James Brown · Answer 2 · 2017-03-17T09:20:13.613

0

I'm not sure I understood you right but this is in awk how I understood you:

awk -v s="dir1" '         # search parameter in your s
index($5,"./" s "/") {    # if your s is found in $5
    if(max==""||$3>max){  # we initialize $3 or compare to previous max
        max=$3;           # store new max
        maxv=$5           # and new mac value 
    }
}
END{ print maxv }         # print the stored max value
' file                    # oh just the file
./dir1/dir2/file1.csv

edited Mar 17 '17 at 09:20

answered Mar 16 '17 at 17:19

James Brown

36,089
7
43
59

1

When doing min/max comparisons always seed with the first value read rather than hoping all the input values will be positive. Also, try it with `s="dir2"` and you'll see it fail since `dir2` appears on every line (you need to make it `index($0,"./"s"/")` or similar). – Ed Morton Mar 16 '17 at 17:34

karakfa · Answer 3 · 2017-03-16T19:16:32.450

another alternative with sort and awk

$ sort -k5 -k3,3r file | awk -F/ '!a[$NF]++'

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv

above is for filenames, if based on first dir names

$ sort -k3,3r file | awk '{split($NF,d,"/")} !a[d[2]]++'

5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

and if you want to print the dirs only

$ sort -k3,3r file | awk '{split($NF,d,"/")} !a[d[2]]++{print $NF}'

./dir2/dir2/file1.csv
./dir1/dir2/file1.csv
./dir3/dir2/file2.csv

What does the 3r do? I understand -k3 is to sort by column 3, but I read the man page for sort and still don't understand -k3,3r — StatsSorceress, Mar 24 '17 at 18:59

score 0 · Answer 4 · answered Mar 16 '17 at 21:15

Just for fun - without any (awk, perl like) programming language

file="./data.txt"
paste -d ' ' "$file" <(cut -d/ -f2 "$file") |\
    LC_ALL=C sort -k6 -k3gr | uniq -f5 | cut -d' ' -f5

output

./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
./dir3/dir2/file2.csv

Use some awk solution. As i said, this is only for showing another way.

score 0 · Answer 5 · answered Mar 16 '17 at 21:35

While not matching your requested output and similar to the other answers already posted, I find this awk command more memorable:

< file | sort -k3,3r | awk -F "/" '!seen[$2]++'

output:

5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

The more general point being: sort the entire list by size (column 3) and don't worry about sorting it by directory name (part of column 5), then extract the first of each directory name (i.e. awk prints only newly seen key name).

If you really want the output to just the directories names and sorted, then add the following to the pipe chain:

| cut -d ' ' -f5- | sort

Bash: get value of one column based on value of another column

5 Answers5