
Basically, the file I'm getting has the first three columns followed by a column of blank lines, because it looks like nothing is getting appended to column4.txt.

I feel like I probably shouldn't be using the variables I created inside the command substitution, but I'm unsure how I would otherwise access the numbers I need.

#!/bin/sh
# the first file in the expression of a bunch of patients to be made into data files that can be put into the graph
awk '{print "hs"$1,"\t",$2,"\t",$3}' $1 > temp1.txt     #important columns saved
numLines=`wc -l $1`     
touch column4.txt       #creates a column for the average of column 6-
for ((s=0;s<$numlines;s++)); do                 
        currentRow=0                            #Will eventually be the average of column 6- for the row of focus
        for ((i=6;i<=106;i++)); do              
                addition=`cut -f $i $1 | head -n $s | tail -n 1`        # cuts out the number at the row and column of focus for this loop
                currentRow=`expr $currentRow + $addition`              # adding the newly extracted number to the total
        done
        currentRow=`expr $currentRow / 101`                            #divides so the number is an average instead of a really big number
        echo $currentRow >> column4.txt                                 #appends this current row into a text file that can be pasted onto the first three columns
done
paste temp1.txt column4.txt
rm temp1.txt column4.txt

If it helps, the input file is very large (about 106 columns and tens of thousands of rows), but here's an example of what it looks like:

Important identifier line grant regis 76 83 02 38 0 38 29 38 48 (.. up to 106 columns)
another important identifier bill susan 98 389 20 29 38 20 94 29 0 (.. same point)

And then the output would look like this (assuming we exclude the columns after ..):

Important identifier line 34.88
another important identifier 79.67

Sorry if something is unclear; I tried my best to make it clear. Just ask if there's something you're wondering about and I will edit or comment.

Thank-you

Sam
  • In assignments, remove the `$` from the left hand side. – choroba May 16 '16 at 18:01
  • Of course, thank-you, that did not fix the problem however so I edited those out of the question because I'm trying to figure out something else with this question but thank-you for your help! – Sam May 16 '16 at 18:05
  • `numlines` is not the same as `numLines`. – choroba May 16 '16 at 18:14
  • [ShellCheck](http://www.shellcheck.net) points out that you're confusing the two names `numLines` and `numlines`, and that you're using bash features with `#!/bin/sh`. Can you fix that, update the post, and post the actual output (or errors) you get? – that other guy May 16 '16 at 18:14
  • `wc` with a file name returns the file name, too. Use redirection: ``numlines=`wc -l < "$1"` ``. – choroba May 16 '16 at 18:17
  • Is your input file tab separated? If it uses just spaces, you need to tell `cut`: `cut -d' '` – choroba May 16 '16 at 18:18
  • Even if you fix all the mentioned problems and make the script work, it will be unbearably slow, as it calls an external command for each element of each line. It's much faster to do all the summing in one go, e.g. by replacing spaces by pluses and piping to `bc`. I'd just rewrite the whole thing in Perl. – choroba May 16 '16 at 18:20
  • You could make this a lot easier for yourself if you used arrays. `while read -r -a lineArray; do ... done < $1` would really help things along – Jeffrey Cash May 18 '16 at 16:47
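
Pulling the comments above together, here is a minimal bash sketch of the loop using `read -r -a`, so no external command runs per cell. It assumes whitespace-separated input with non-negative integers from column 6 onward; `row_averages` is a made-up helper name, and the `hs` prefix from the question's awk line is left out to match the expected output shown in the question. Bash has no floating point, so the average is rounded to two decimals with scaled integer arithmetic.

```shell
#!/usr/bin/env bash
# Sketch only: per-row average of columns 6..end.
# row_averages is a hypothetical name, not part of the original script.
row_averages() {
    local -a fields
    local sum count i scaled
    while read -r -a fields; do
        count=$(( ${#fields[@]} - 5 ))
        (( count > 0 )) || continue      # skip rows with no numeric columns
        sum=0
        for (( i = 5; i < ${#fields[@]}; i++ )); do
            # 10# forces base 10 so values such as "02" or "08" are not read as octal
            sum=$(( sum + 10#${fields[i]} ))
        done
        # integer arithmetic only: scale by 100 and round to the nearest hundredth
        scaled=$(( (sum * 100 + count / 2) / count ))
        printf '%s\t%s\t%s\t%d.%02d\n' "${fields[0]}" "${fields[1]}" "${fields[2]}" \
            $(( scaled / 100 )) $(( scaled % 100 ))
    done < "$1"
}
```

Called as `row_averages input.txt`, this writes the first three columns plus the average, tab-separated, to standard output. It needs bash rather than plain `sh`, because of the arrays and the `(( ))` loops.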

2 Answers


awk to the rescue!

You can replace the whole script with this one; the output below uses the values from the sample input.

$ awk '{for(i=6;i<=NF;i++) sum+=$i; 
        printf "%s %s %s %.2f\n", $1,$2,$3, sum/(NF-5); 
        sum=0}' file

Important identifier line 39.11
another important identifier 79.67

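For the median (with an odd number of values) you can do this. Note that `asort` is a GNU awk (gawk) extension, and the array is cleared on each line so that shorter rows don't pick up stale values:

$ awk '{delete a; for(i=6;i<=NF;i++) a[i-5]=$i;
        asort(a);
        mid=(NF-4)/2; print mid, a[mid]}' file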

5 38
5 29

For an even number of values, the general approach is to take the average of the two middle values (it can also be a distance-weighted average).
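
As a rough sketch of that even/odd handling, here is a version that covers both cases and avoids `asort`, so it should also run on awks other than gawk. It is not the one-liner above: it swaps in a small insertion sort, `row_median` is a made-up helper name, and it assumes whitespace-separated input with numeric values from column 6 onward.

```shell
# Sketch only: per-row median of columns 6..end, odd or even count.
# row_median is a hypothetical name used for illustration.
row_median() {
    awk '{
        n = NF - 5
        if (n < 1) next                        # no numeric columns on this row
        for (i = 1; i <= n; i++) v[i] = $(i + 5) + 0
        # insertion sort, ascending (stands in for gawk asort)
        for (i = 2; i <= n; i++) {
            x = v[i]
            for (j = i - 1; j >= 1 && v[j] > x; j--) v[j + 1] = v[j]
            v[j + 1] = x
        }
        if (n % 2) med = v[(n + 1) / 2]                    # odd: middle value
        else       med = (v[n / 2] + v[n / 2 + 1]) / 2     # even: mean of the two middle values
        print $1, $2, $3, med
    }' "$1"
}
```

Usage would be `row_median file`. Only the first `n` slots of `v` are read on each row, so leftover entries from a longer previous row are harmless.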

karakfa
  • Thank-you @karakfa! I didn't even know that awk could do this. Do you know anywhere that's good to learn a bit more about awk? I learned it for a few hours in one of my classes but I feel like I would like to learn more of its abilities – Sam May 16 '16 at 18:41
  • Also! Is there a way to get the median with this? – Sam May 16 '16 at 18:42

You could try to use the following:

perl -MList::Util=sum -lanE '@n=grep{/^\d+$/}@F; say "@F[0..4] ",sum(@n)/@n'

prints:

Important identifier line grant regis 39.1111111111111
another important identifier bill susan 79.6666666666667

or, with two-decimal precision:

perl -MList::Util=sum -lanE '@n=grep{/^\d+$/}@F; printf "@F[0..4] %.2f\n",sum(@n)/@n'

Important identifier line grant regis 39.11
another important identifier bill susan 79.67

The above calculates the average of all purely numeric fields in the line. To average exactly columns 6 onward, you could use, for example:

perl -MList::Util=sum -lanE 'say "@F[0..4] ",sum(@F[5..$#F])/(@F-5)'

also prints

Important identifier line grant regis 39.1111111111111
another important identifier bill susan 79.6666666666667

For printing both the average and the median (odd or even number of elements):

perl -MList::Util=sum -lanE '
    @s = sort { $a <=> $b } @F[5..$#F];
    $m = int(@s/2);
    printf "@F[0..4] %.2f %s\n",
    sum(@s)/@s,
    (@s % 2) ? $s[$m] : sum(@s[$m-1,$m])/2
' filename

prints:

Important identifier line grant regis 39.11 38
another important identifier bill susan 79.67 29

And finally, the same as above, as a Perl script with descriptive variable names.

use strict;
use warnings;
use List::Util qw(sum);

while(<>) {
    chomp;
    my(@text) = split;
    my(@sorted_numbers) = sort { $a <=> $b } grep { /^\d+$/ } splice @text, 5;

    my $average = sum(@sorted_numbers)/@sorted_numbers;

    my $median;
    my $mid = int(@sorted_numbers / 2);

    if( @sorted_numbers % 2) {
        $median = $sorted_numbers[$mid];
    } else {
        $median  = sum(@sorted_numbers[$mid-1,$mid])/2;
    }
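    # pass the text as an argument so a literal % in it is not treated as a format;
    # %s keeps a fractional median (e.g. 38.5) instead of truncating it
    printf "%s %.2f %s\n", "@text", $average, $median;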
}
clt60