I use awk to extract and calculate information from two different files, and I want to merge the results into a single file in columns (for example, the output of the first file in columns 1 and 2 and the output of the second in columns 3 and 4).

The input files contain:

file1

SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196    SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004    SRR513804.544253

file2

>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT

To parse the first file I do this:

awk '
{
  s      = NF
  center = $1
}
{
  printf "%s\t %d\n", center, s
}
' file1

To parse the second file I do this:

awk '
/^>/ {
    if (count != "")
      printf "%s\t %d\n", seq_id, count
    count  = 0
    seq_id = $0
    next
}

NF {
  long  = length($0)
  count = count+long
}
END{
  if (count != "")
    printf "%s\t %d\n", seq_id, count
}
' file2

My provisional solution is to create a temporary file and overwrite it in the second step. Is there a more "elegant" way to get this output?

Thor
user2245731
  • +1 for better than average first(ish) post and correct use of formatting tool. BUT, you'll get the best and fastest replies if you include Minimal sample input, required output from that input, (current output, error msgs and current thinking on problem (not appropriate for your question)). Don't make us guess about how the data looks! ;-) Good luck. – shellter May 23 '13 at 18:19
  • I'm not quite clear on what your current code is trying to do, but the usual awk way to tell what file you're on is to compare `NR` (number of rows seen total) with `FNR` (number of rows seen in the current file). That said, it's probably easiest to use `paste` after the fact, or `join` if you want to make sure a column matches up. – Kevin May 23 '13 at 18:28
  • Do you have to use `awk`? The `paste` command does exactly what you want. – Paul May 23 '13 at 18:29
  • Please add desired output to the question. – Thor May 23 '13 at 19:09
  • With `paste` I need to have the output of each action in different files and then merge them, no? Oh! And thank you for the editing help, the post is more understandable now. – user2245731 May 24 '13 at 08:27
  • Is there just one line in each input file? If not, what is common between the files? Are you using the line number? Or is there some common value in both files? – Bruce Barnett May 25 '13 at 17:28
  • Your first script could be simply `{print $1, NF}`, so I don't see why you are doing it that way. Why not `{print $1,$2}`, unless the number of fields varies per line? If so, you have to give more information. – Bruce Barnett May 25 '13 at 17:38
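
The `NR`/`FNR` comparison mentioned in the comments can be seen in a small sketch. The function name `tag_lines` is hypothetical and only for illustration; it assumes exactly two input files and a non-empty first file.

```shell
# Hypothetical helper: tag_lines FILE_A FILE_B
# Shows how NR (lines read in total) and FNR (lines read in the current file)
# differ once awk moves on to the second file.
tag_lines() {
    awk '
    # NR == FNR only holds while awk is still reading the first file.
    NR == FNR { print "first:  NR=" NR " FNR=" FNR; next }
    # Any line that reaches this rule belongs to the second file.
              { print "second: NR=" NR " FNR=" FNR }
    ' "$1" "$2"
}
```

Calling `tag_lines file1 file2` prints one tagged line per input line. If the first file is empty, `NR == FNR` also holds for the second file, so the test cannot tell the files apart in that case.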

1 Answer

I am not fully clear on the requirement; if you can update the question, we may be able to improve the answer. From what I have gathered, you would like to summarize the output from both files. I have assumed that the content of both files is in the same sequential order. If that is not the case, we will have to add additional checks while printing the summary.

Content of script.awk (re-using most of your existing code):

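# First file: remember the first field and the field count of each line.
NR==FNR {
    s[NR]      = NF
    center[NR] = $1
    n          = NR    # lines read from file1; avoids the non-POSIX length(array)
    next
}

# Second file, header line: start a new sequence.
/^>/ {
    seq_id[++i] = $0
    next
}

# Second file, non-empty sequence line: add its length to the current sequence.
NF {
    long[i] += length($0)
}

# Pair line x of file1 with sequence x of file2.
END {
    for (x = 1; x <= n; x++) {
        printf "%s\t %d\t %d\n", center[x], s[x], long[x]
    }
}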

Test:

$ cat file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196    SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004    SRR513804.544253

$ cat file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT

$ awk -f script.awk file1 file2
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120  4   200
ST695_116193610:4:2206:10596:165949  3   0
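
If the two files are not in the same order, one possible form of the "additional checks" is to match on the sequence ID instead of the position. This is only a sketch: the function name `merge_by_id` is hypothetical, and it assumes that the first field of each file1 line equals a file2 header without the leading `>`, which holds for the sample above but may not hold for the real data.

```shell
# Hypothetical helper: merge_by_id FILE1 FILE2
# Assumption: $1 of each FILE1 line equals a FILE2 header minus the leading ">".
merge_by_id() {
    awk '
    # First argument to awk (FILE2): total the sequence length under each header.
    NR == FNR {
        if (/^>/)    id = substr($0, 2)       # drop the leading ">"
        else if (NF) len[id] += length($0)    # skip blank lines
        next
    }
    # Second argument to awk (FILE1): first field, field count, matching length.
    # An ID with no match in FILE2 prints a length of 0.
    { printf "%s\t%d\t%d\n", $1, NF, len[$1] }
    ' "$2" "$1"
}
```

Called as `merge_by_id file1 file2`, this reads file2 first so that every length is known before file1 is printed. An empty file2 would need extra handling, because `NR == FNR` would then also be true for file1.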
jaypal singh