
I am using a column in one file to look up values in another file. The second file is very large and I would like to find all the values in a single pass with awk. I have tried doing this with an associative array, but am stumped on how to get the output I want. I want to take F1, use $2 from each of its lines to look up matching values of $4 in F2, and produce the output described below: $0 from F1 as the header, followed by $10 from F2 sorted and counted for each unique string (i.e. piped through `sort | uniq -c`).

F1

+ID=dnaK.p01 12121 TTGGGCAGTTGAAACCAGACGTTTCGCCCCTATTACAGAC[T]CACAACCACATGATGACCG

F2

solid309_20110930_FRAG_BC_bcSample12273_1541_657_F3     0       NC_012759       12121   42      35M     *       0       0       ACACAACCACATGATGACCGAATATATAGTGGCTC     BBBBBBA@BBBAB@?B@BBBB<5BBBAA@:>>&B7
solid309_20110930_FRAG_BC_bcSample12295_323_1714_F3     0       NC_012759       12121   42      35M     *       0       0       ACACAACCACATGATGACCGAATATATAGTGGAGA     BB@@A@@A@@@?@<=?@@=><6*7=?9993>4&7,
solid309_20110930_FRAG_BC_bcSample12325_1148_609_F3     0       NC_012759       12121   42      35M     *       0       0       ACACAACCACATGATGACCGAATATATAGTGGAGA     BBBB@B@?@B@@A@??BBBA@<.<==:6:1>9(<-
solid309_20110930_FRAG_BC_bcSample11796_1531_1170_F3    0       NC_012759       12122   42      35M     *       0       0       CACAACCACATGATGACCGAATATATAGTGGAGCA     '&&+&&)&')&0(.,',(.3+&&&+,&&&&&&&&&
solid309_20110930_FRAG_BC_bcSample12110_1166_1149_F3    0       NC_012759       12122   42      35M     *       0       0       CACAACCACATGATGACCGAATATATAGTGGAGAC     -(:18)538;,9277*'8:<)&,0-+)//3&'1+'
solid309_20110930_FRAG_BC_bcSample183_686_962_F3        0       NC_012759       12123   42      35M     *       0       0       ACAACCACATGATGACCGAATATATAGTGGAGTGC     BB?BBBB;BBBBBB@ABB;@7AA@@A@*>?+B8@9

I am currently doing this with the following script:

for line in `awk '{if ($1~"-") print ($2-34); else print $2}' $1`
do
    awk -v l=$line '{if ($1~"-") l=l+34; if ($2==l) print}' $1 >> f2
    awk -v l=$line '{if ($4==l) print $10}' URA2.sam | sort | uniq -c | awk '{if ($1>15) print}' >> f2
done

This requires multiple awk passes over the files, one round for each line in F1. I was thinking I could use an associative array built from F1 to do this in a single pass. F2 is sorted by $4. I used the following script to try to get the output I wanted:

awk 'FNR==NR{a[$2]=$0;next}$4 in a{print $10}' f1 f2 | sort | uniq -c
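I think the grouping gets lost here because `print $10` throws away the key, so `sort | uniq -c` counts each sequence across every matching position at once. A minimal sketch that keeps the key on each output line (so the counts at least stay per position, though it still does not print the F1 line as a header) would be something like:

awk 'FNR==NR{a[$2]=$0; next} $4 in a{print $4, $10}' f1 f2 | sort | uniq -c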
jeffpkamp
  • consider editing your question to show what the current output is from the `awk` 1-liner, if you remove the sort | uniq -c. That seems very close to something that should work, given your description 'using a column in one file to look up values in another file'. BUT your example output doesn't really seem to match your stated requirement 'followed by $10 from F2 sorted and counted for each unique string'. Good luck. – shellter Jan 24 '14 at 22:30
  • Does the output per-header really need to be sorted or is it enough for it to be unique with a count? – Ed Morton Jan 24 '14 at 22:31
  • @EdMorton The values under each header need to be sorted for the uniq program, because it only compares adjacent lines when determining whether a string is unique. – jeffpkamp Jan 24 '14 at 22:53
  • @jeffpkamp if all you want is a count of unique entries, that's trivially done within awk (see the answers already posted, and the sketch just below these comments); you don't need to pipe its output to uniq. The relatively hard thing to do in awk is sorting. – Ed Morton Jan 24 '14 at 22:55
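For example, a minimal sketch of counting the unique $10 strings within awk itself (assuming the same f1/f2 files as above; the output order is unspecified):

awk 'FNR==NR{a[$2]=$0;next} $4 in a{cnt[$10]++} END{for (s in cnt) print cnt[s], s}' f1 f2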

2 Answers


Here's a solution using GNU awk for asorti() and delete array:

$ cat tst.awk
function prtCounts(     val,sorted,idx) {
    # only print a group whose key ($4) appeared as $2 in the first file
    if (prev in f1) {
        print f1[prev]
        asorti(count,sorted)          # sort the $10 strings (the indices of count[])
        for (idx=1; idx in sorted; idx++) {
            val = sorted[idx]
            print count[val], val
        }
    }
    delete count
}

NR==FNR { f1[$2] = $0; next }         # first file: save each line, keyed by $2

{
    # second file: when $4 changes (F2 is sorted by $4), flush the previous group
    if ( (FNR>1) && ($4!=prev) )
        prtCounts()
    count[$10]++
    prev = $4
}

END { prtCounts() }                   # flush the last group

$ gawk -f tst.awk file1 file2
a 1 b c d
1 BALH
2 BLAH
b 2 b c d
1 HAHA
2 ZAHA
Ed Morton
  • This works fine on the test data, but I am having problems with the real data set. With the real data, it is giving me the sum of unique strings for each unique $4, rather than just those specified in the F1 file. – jeffpkamp Jan 24 '14 at 23:10
  • @jeffpkamp - actually, I THINK I understand the situation you're describing, so I updated my answer with a solution. I just posted the gawk version with sorted output, keeping it brief... – Ed Morton Jan 24 '14 at 23:19
  • Now I am not getting any output. Is there a place I can post two small files to test with? 256K worth of data looks ugly on a post :/. – jeffpkamp Jan 24 '14 at 23:36
  • I'm working on getting a small set to test it on. Looking at the original script you wrote (which I copied, not typed), it looks like it prints out all fields whether they are in f1[array] or not. Give me 5 minutes to put up 2 test files. – jeffpkamp Jan 25 '14 at 00:01

I would process F1 first and use an associative array to save each line, with the second field as the key and the whole line as the value. Then you only have to process F2 once and, each time the fourth field changes, print the count of each repeated value in the tenth field, like:

awk '
    ## Process F1.
    FNR == NR {
        f1[$2] = $0
        next
    }

    ## Process F2.
    {
        if (FNR == 1 || prev_key == $4) {
            f2[$10]++
        }
        else {
            printf "%s\n", f1[prev_key]
            for (key in f2) {
                printf "%d %s\n", f2[key], key
            }
            delete f2
            f2[$10]++
        }
        prev_key = $4
    }

    END {
        printf "%s\n", f1[prev_key]
        for (key in f2) {
            printf "%d %s\n", f2[key], key
        }
    }
' F1 F2

It yields:

a 1 b c d
1 BALH
2 BLAH
b 2 b c d
2 ZAHA
1 HAHA
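A couple of notes on this version: `for (key in f2)` visits keys in an unspecified order, which is why the groups above are not sorted, and the script prints a group for every distinct $4 in F2, not only those present in F1 (an `if (prev_key in f1)` guard around each print block would restrict it). If GNU awk is available, one possible way (not part of the original answer) to get each group's lines in sorted order is to set the traversal order up front:

BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }   # GNU awk only: iterate arrays in ascending index order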
Birei