I have a large input file with 150+ columns and 50M rows, a sample of which is shown here:

id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1

I have a bash shell script:

function awkScript() {
awk -F, -v cols="$1" -v hdr="$2" '
   BEGIN {OFS=FS}
   NR==1 {n=split(cols,cn); 
          for(i=1;i<=NF;i++) 
            for(j=1;j<=n;j++) 
              if($i==cn[j]) c[++k]=i; 
          $(NF+1)=hdr}
   NR >1 {v1=$c[1]; v2=$c[2]; v3=$c[3]
          if(!v2 && !v3) $(NF+1) = v1?10:0
          else $(NF+1) = v3?(v1-v3)/v3:0 + v2?(v1-v2)/v2:0}1' "$3" 
}   

function awkScript1() {
awk -F, -v cols="$1" -v hdr="$2" '
   BEGIN {OFS=FS}
   NR==1 {n=split(cols,cn); 
          for(i=1;i<=NF;i++) 
            for(j=1;j<=n;j++) 
              if($i==cn[j]) c[++k]=i; 
          $(NF+1)=hdr}
   NR >1 {v1=$c[1]; v2=$c[2]; v3=$c[3]; v4=$c[4]
          $(NF+1) = v1?(v1/(v1+v2+v3+v4)):0
         }1' "$3"
}

function awkScriptWrapper() {
   awkScript "$1" "$2"
}

function awkScriptWrapper1() {
   awkScript1 "$1" "$2"
}

awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" >> output.txt
awkScript1 "c7,c8,c9,c10" "Header3" "input.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4" >> output1.txt 

Sample of output.txt is:

id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1

Sample of output1.txt is:

id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5

I need to append Header1, Header2, Header3 and Header4 to the end of the same input file, i.e. the above script should produce just one output file, "finaloutput.txt":

id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1,1,0.5

I tried the following:

awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" >> temp_output.txt
awkScript1 "c7,c8,c9,c10" "Header3" "temp_output.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4" >> finaloutput.txt

But this does not give me the expected result.

Any help would be much appreciated.

marc_s
Zaire
  • Take a look at `man 1 join` and this answer I posted yesterday: https://stackoverflow.com/questions/41144043/subtract-corresponding-lines/41145735#41145735 – Andreas Louv Dec 15 '16 at 08:53
  • @andlrc: My awk scripts are working fine. My issue is in redirecting both their results to the same output file. To my understanding, the answer that I require is not given here [https://stackoverflow.com/questions/41144043/subtract-corresponding-lines/41145735#41145735]. Please take the time to read my question thoroughly. – Zaire Dec 15 '16 at 09:04
  • As I said, `join` seems to be the tool for the job. You can specify the output columns with `-o`. Remember that you can use process substitutions: `command1 | join ... - <(command2)` – Andreas Louv Dec 15 '16 at 11:48
  • Can you please take the time to elaborate and put it up as an answer? I'm fairly new to shell scripting. – Zaire Dec 15 '16 at 11:56
  • just use the output of the first script as the input file to the second. – karakfa Dec 15 '16 at 14:33
  • @karakfa, I already tried that but I'm getting a `fatal: division by zero attempted` error. Please help – Zaire Dec 15 '16 at 15:03
  • @karakfa: just take a look at what I tried in the question above. – Zaire Dec 15 '16 at 15:06

1 Answer

Assuming that you need to join the output of two commands in a pipeline:

$ cmd1 | join --header -j1 -t, -o1.{1..17} -o2.16,2.17 - <(cmd2)
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1,1,0.5

The above assumes that cmd1 outputs:

id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1

While cmd2 outputs:

id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5

How does it work?

--header treats the first line of each file as field headers (this is a GNU extension to join)
-j1 joins on field one
-t, specifies , as the field delimiter
-o xxx specifies the output columns: 1.1 means column one from file one, in this case cmd1, and 2.1 means column one from file two, in this case cmd2

-o1.{1..17} will expand to:

-o1.1 -o1.2 -o1.3 -o1.4 -o1.5 -o1.6 -o1.7 -o1.8 -o1.9 -o1.10 -o1.11 -o1.12 -o1.13 -o1.14 -o1.15 -o1.16 -o1.17

And is a quick way to specify the first 17 columns from cmd1.
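
To see the expansion for yourself, here is a minimal check, assuming bash (brace expansion is done by the shell, not by join). The function name expand_demo and the three-column range are only for illustration:

```shell
#!/usr/bin/env bash
# Brace expansion happens in bash before the command runs,
# so printf receives three separate words, one per column.
expand_demo() {
  printf '%s\n' -o1.{1..3}
}
expand_demo
```

Each printed line is one of the words that join would receive as a separate -o option.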

- refers to standard input, which in this case is the output from cmd1

<(command) is a process substitution.
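
Putting the pieces together, here is a small sketch of the same pattern on a reduced data set, assuming bash and GNU join. The functions cmd1, cmd2 and join_demo are hypothetical stand-ins for the two awk pipelines, and the three-column data is invented for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two pipelines; each prints a header
# line followed by rows that share the same id column.
cmd1() { printf '%s\n' 'id,c1,Header1' '1,0,-1' '2,1,0'; }
cmd2() { printf '%s\n' 'id,c1,Header3' '1,0,0.5' '2,1,1'; }

join_demo() {
  # "-" is cmd1's output arriving on standard input;
  # <(cmd2) makes cmd2's output readable as if it were a file.
  # Output: all three columns of cmd1, then column 3 of cmd2.
  cmd1 | join --header -j1 -t, -o 1.1,1.2,1.3,2.3 - <(cmd2)
}
join_demo
```

Note that join expects both inputs to be ordered on the join field, so this sketch relies on the two commands emitting their rows in the same order.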

You can change to:

join [options] file1 file2

if you need to join two regular files.
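
As an alternative to join, the comments suggest feeding the output of one stage into the next, since every stage passes its input through and appends one column. Below is a minimal sketch of that idea, assuming bash and awk. addShare is a hypothetical helper loosely modelled on awkScript1 (two columns instead of four), not the original function, and the sample data is invented:

```shell
#!/usr/bin/env bash
# Hypothetical helper: appends v1/(v1+v2) as a new column named "$2"
# for the two columns listed in "$1". Without a third argument it reads
# standard input, so calls can be chained in a single pipeline.
addShare() {
  awk -F, -v cols="$1" -v hdr="$2" '
    BEGIN { OFS = FS }
    NR == 1 { n = split(cols, cn, ",")
              for (i = 1; i <= NF; i++)
                for (j = 1; j <= n; j++)
                  if ($i == cn[j]) c[++k] = i
              $(NF + 1) = hdr; print; next }
    { sum = $(c[1]) + $(c[2])
      # Guard the division so a row whose sum is zero cannot abort awk.
      $(NF + 1) = sum ? $(c[1]) / sum : 0
      print }' ${3:+"$3"}
}

chain_demo() {
  printf '%s\n' 'id,c1,c2,c3,c4' '1,0,0,1,1' '2,1,3,0,2' |
    addShare "c1,c2" "HeaderA" |
    addShare "c3,c4" "HeaderB"
}
chain_demo
```

When redirecting such a chain to a file, writing with > rather than >> avoids appending a second header line to an existing file on a rerun. A header line in the middle of the data is one possible, unconfirmed cause of the `division by zero` error mentioned in the comments, because a non-numeric field such as "c7" is true as a string but counts as 0 in arithmetic.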

Andreas Louv
  • Did you take the time to test your answer? I'm getting errors while executing it. – Zaire Dec 15 '16 at 14:35
  • I'm running it as a bash script – Zaire Dec 15 '16 at 14:42
  • `--header unrecognized command` You are doing a poor job of executing the command I posted, as you are missing the command name (`join`) – Andreas Louv Dec 15 '16 at 14:47
  • Please provide a solution that you have tested on my sample inputs – Zaire Dec 15 '16 at 14:48
  • This is what I used: `awkScript "c1,c2,c3" "Header1" "stinput.txt" | awkScriptWrapper "c4,c5,c6" "Header2" | join --header -j1 -t, -o1.{1..17} -o2.16,2.17 - <(awkScript1 "c7,c8,c9,c10" "Header3" "stinput.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4")`. Mind explaining where I've gone wrong? – Zaire Dec 15 '16 at 14:50
  • I executed the above command, and the error I'm getting is the one I mentioned above – Zaire Dec 15 '16 at 14:54
  • @Zaire - When I execute your shown command with your sample input, I get the required output without error. You could `set -x` before the command to see what's going on. – Armali Oct 16 '18 at 11:59