I have a large number (nTotal) of files, each containing one column of L floating-point numbers. I want to add up the entries on line i across all of these files, and then compute the average and standard deviation for each line. I first read each file. Then I try to add its array to an accumulator array, which gives me a syntax error: (standard_in) 2: syntax error. I expect sum[i] to contain the sum of the entries on line i of all the files. Then I find the average. Edit: I changed the for loops as suggested.

for (( n= 1 ; n < $nTotal; n++ ))
do

   IFS=$'\n'
   arr1=($(./a.out filename | sed 's/:.*//'))

   for (( i= 1 ; i < $L; i++ ))       
   do
       sum[i]=`echo "${sum[i]} - ${arr1[i]}" | bc`
   done
done

for (( i= 1 ; i < $L; i++ ))  
do
   ya=$(echo -1*${sum[i]} | bc)
   aveSum=$(echo $ya/$nTotal | bc -l)
done

Edit: ./a.out produces files with one column of float numbers.

To find the standard deviation, though, I read the data files again and store them in arrays (I'm sure this is not the smartest way of doing it, but I couldn't think of anything else). I also could not compute the standard deviation using:

for (( i= 1 ; i < $L; i++ ))  
 do
    ya=$(echo -1*${sum[i]} | bc)
    ta=$(echo $ya/$nTotal | bc -l)

    tempval=`echo "${arr1[i]} - $ta * ${arr1[i]} - $ta" | bc`
    val[i]=`echo "${val[i]} - $tempval" | bc`
 done

Here I get zero for every element of val[i], and I can't figure out what is wrong. I would really appreciate any guidance on this problem.

PyPhys
  • `{1..L}` isn't doing anything useful for you. `echo {1..L}` -> `{1..L}`. Did you mean `$n` instead of `L`? Even if you did that won't work. What is `sum` in the second loop of the first snippet? What is `arra` in the first loop of the first snippet? What is `arra` in the second snippet? – Etan Reisner Oct 28 '14 at 15:09
  • Could you add to your question a typical output of your program `a.out`? – gniourf_gniourf Oct 28 '14 at 15:10
  • `for ((i=1; i<=L; i++)); do ...; done` is the Right Way to write what you might mean by `for i in {1..L}`. – Charles Duffy Oct 28 '14 at 15:12
  • ...by the way, that's entry 33 in http://mywiki.wooledge.org/BashPitfalls – Charles Duffy Oct 28 '14 at 15:13
  • I'd also suggest using `set -x` to look at what commands your script actually invokes when run (maybe with `PS4=':$LINENO+'` to show which line it's on at any given time), finding the first place it behaves unexpectedly, and asking a question focused on that behavior specifically (if it isn't obvious). – Charles Duffy Oct 28 '14 at 15:14
  • @EtanReisner I edited my question; suma and arra were both typos. L is the total number of lines in each data file, which I treat as a given variable. – PyPhys Oct 28 '14 at 15:28
  • @gniourf_gniourf **./a.out** is the code producing data files with one column of float numbers. – PyPhys Oct 28 '14 at 15:29
  • Brace expansions do not create an arithmetic environment, so even if parameter expansions *could* be used in braces, you would still have to write `{1..$L}`, not `{1..L}`. – chepner Oct 28 '14 at 15:30
  • So what's the purpose of your `sed` statement? – gniourf_gniourf Oct 28 '14 at 15:31
  • If `a.out` creates a file with content then you cannot pipe that output to `sed` and have that do anything meaningful. You want `sed` to operate on the output file. `sed '...' filename`. – Etan Reisner Oct 28 '14 at 15:32
  • Frankly, `bash` seems like the wrong language to do this kind of arithmetic-heavy coding. If it needs to be `bash`, specify what version you are using (3.2, 4.1, 4.2, 4.3, etc). The exact version will let us know what tools you have to work with. – chepner Oct 28 '14 at 15:33
  • In case you haven't figured this out by now, you have a number of *serious* problems with this script, sufficient in number and severity to make it virtually impossible for any one answer to actually be of help to you short of simply rewriting the script for you. Perhaps you should retract the question. Go over some of the suggestions you've been given (particularly the `set -x` suggestion from @CharlesDuffy), fix the errors as best you can, and come back with a better script and a more focused question. – Etan Reisner Oct 28 '14 at 15:34
  • @chepner I thought the same too. I'm using bash 3.2.51 though. – PyPhys Oct 28 '14 at 15:44
  • @EtanReisner I am working on implementing the suggestions now. – PyPhys Oct 28 '14 at 15:45
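
The loop advice in the comments above can be illustrated with a short sketch. It assumes bash (the asker reports 3.2.51) and a hypothetical value `L=5`; the variable names `literal`, `unexpanded`, and `indices` are made up for this illustration and do not appear in the original script.

```bash
#!/usr/bin/env bash
L=5   # hypothetical number of lines per data file

# Brace expansion happens before parameter expansion and is not an
# arithmetic context, so neither form below produces a number sequence.
literal=$(echo {1..L})       # stays as the literal text {1..L}
unexpanded=$(echo {1..$L})   # only $L is substituted, giving {1..5}

# The C-style loop evaluates L arithmetically and visits 1 through L.
indices=""
for (( i = 1; i <= L; i++ )); do
    indices="$indices$i "
done

echo "$literal"
echo "$unexpanded"
echo "$indices"
```

Note that the loop condition here is `i <= L`, as in the comment, so the last line index is included.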

1 Answer

Bash might not be exactly the easiest for this problem, particularly since it doesn't implement non-integer arithmetic. I'd use awk:

awk '{ n[FNR]++;
       delta = $1 - mean[FNR];
       mean[FNR] += delta / n[FNR];
       m2[FNR] += delta * ($1 - mean[FNR]);
     }
     END {for (i=1; i in n; ++i)
            print mean[i], sqrt(m2[i]/(n[i]-1));
     }' file1 file2 ...

The math is taken directly from the well-known "online" (single-pass) mean and variance algorithms. The program assumes that all files have exactly L lines. If a few have more or fewer, the missing data is simply ignored, so you might want to add a stricter validity test. In the particular case where only one file has extra lines, n[i]-1 is zero for those lines and the standard deviation computation aborts with a division-by-zero error. You could argue that this doesn't matter, since the correct data will already have been printed, but you might want to fix that, too.
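
One possible way to add such a validity test and avoid the division by zero is sketched below. It is an illustration rather than part of the original answer: the temporary sample files, the `stats` variable, the warning text, and the choice to print `nan` when a line has fewer than two values are all assumptions.

```bash
# Hypothetical sample data: f3 has one line more than the other files.
tmp=$(mktemp -d)
printf '1\n2\n'    > "$tmp/f1"
printf '3\n4\n'    > "$tmp/f2"
printf '5\n6\n7\n' > "$tmp/f3"

stats=$(awk '
    { n[FNR]++                              # values seen so far on this line
      delta = $1 - mean[FNR]
      mean[FNR] += delta / n[FNR]           # running mean
      m2[FNR] += delta * ($1 - mean[FNR])   # running sum of squared deviations
    }
    END {
      for (i = 1; i in n; ++i) {
        # Assumed policy: warn on stderr when a line has a different
        # number of values than line 1.
        if (n[i] != n[1])
          printf "warning: line %d has %d values, expected %d\n", i, n[i], n[1] | "cat 1>&2"
        if (n[i] > 1)
          print mean[i], sqrt(m2[i] / (n[i] - 1))
        else
          print mean[i], "nan"              # sample deviation undefined for one value
      }
    }' "$tmp/f1" "$tmp/f2" "$tmp/f3")

echo "$stats"
rm -rf "$tmp"
```

With this sample data the first two lines each have three values, while the third line has a single value and is reported with a warning instead of stopping the program.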

The program makes use of a couple of awk features: first, arrays are automatically (and lazily) initialized to 0 (if used as numbers); second, FNR is the line number in the current file. (NR is the line number in the input as a whole, but in this case FNR is more useful.)
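
The two awk features mentioned above can be seen in isolation in the following sketch. The sample files and the names `counters`, `lazy`, and `total` are hypothetical and exist only for this illustration.

```bash
# Hypothetical sample files with two lines each.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1"
printf 'c\nd\n' > "$tmp/f2"

# NR keeps counting across files, while FNR restarts at 1 for each file.
counters=$(awk '{ print $1, NR, FNR }' "$tmp/f1" "$tmp/f2")

# An array element that was never assigned behaves as 0 in arithmetic,
# so "+=" works without any explicit initialization.
lazy=$(awk 'BEGIN { total[3] += 2.5; print total[3], total[7] + 0 }')

echo "$counters"
echo "$lazy"
rm -rf "$tmp"
```

Because FNR restarts for every file, it can serve directly as the per-line index into `n`, `mean`, and `m2` in the program above.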

rici
  • Thanks, this is very nice. I ended up using C code; however, I was still curious about how to do this in the shell. Thanks :) – PyPhys Oct 29 '14 at 15:23