0

I'm trying to find the mean of several numbers in a file, taken from the lines that contain "<Overall>".

My code:

awk -v file=$file '{if ($1~"<Overall>") {rating+=$1; count++;}} {rating=rating/count; print file, rating;}}' $file | sed 's/<Overall>//'

I'm getting

awk: cmd. line:1: (FILENAME=[file] FNR=1) fatal: division by zero attempted

for every file. I can't see why count would be zero if the file does contain a line such as "<Overall>5".

EDIT: Sample from the (very large) input file, as requested:

<Author>RW53
<Content>Location! Location?       view from room of nearby freeway 
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1

Expected output:

[filename] X

Where X is the average of the numbers on all the lines containing <Overall>.

daltojam

3 Answers

4

Use awk as below:

awk -F'<Overall>' 'NF==2 {sum+=$2; count++}
                   END{printf "[%s] %s\n",FILENAME,(count?sum/count:0)}' file

For an input file named input-file that contains two <Overall> lines, like this:

<Author>RW53
<Content>Location! Location?       view from room of nearby freeway
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1
<Overall>2

Running it produces:

[input-file] 2.5

The -F'<Overall>' option splits each input line using <Overall> as the delimiter. Only lines containing <Overall> split into two fields, so NF==2 selects exactly those lines. The number after <Overall> is $2, which is added to the sum variable, and the number of matching lines is tracked in count.

The END clause is executed after all lines have been read. It prints the filename using the awk special variable FILENAME, which holds the name of the file being processed, and computes the average only if count is not zero (otherwise it prints 0).
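To see how the delimiter-based filter described above behaves, here is a minimal sketch that prints the field count and second field for a few lines, then averages them. The inline sample data and the shell variable names (sample, fields, avg) are made up for illustration and are not part of the original answer.

```shell
# Hypothetical sample data, trimmed down from the input shown in the question.
sample='<Author>RW53
<Overall>3
<Value>4
<Overall>2'

# Show what awk sees per line: NF is 2 only where <Overall> is present.
fields=$(printf '%s\n' "$sample" | awk -F'<Overall>' '{print NF ": " $2}')
printf '%s\n' "$fields"

# Same filter-and-average logic as above, applied to the sample.
avg=$(printf '%s\n' "$sample" |
  awk -F'<Overall>' 'NF==2 {sum+=$2; count++} END {print (count ? sum/count : 0)}')
printf 'average: %s\n' "$avg"
```

Lines without <Overall> have a single field, so they never contribute to sum or count.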

Inian
1

You aren't waiting until you've completely read the file to compute the average rating. This is simpler if you use patterns rather than an if statement. You also need to remove <Overall> before you attempt to increment rating.

awk '$1 ~ /<Overall>/ {sub("<Overall>", "", $1); rating+=$1; count++}
     END {rating=rating/(count?count:1); print FILENAME, rating}' "$file"

(Answer has been updated to fix a typo in the call to sub and to correctly avoid dividing by 0.)
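One detail worth checking when stripping the tag: awk's sub() edits its target in place and returns the number of replacements made, not the modified string. So the result of sub() should not be added to the running total; the field should be read after the call. A small sketch, using a made-up one-line input and illustrative variable names (ret, val):

```shell
# Capture what sub() returns: the count of substitutions (here, 1).
ret=$(printf '<Overall>3\n' | awk '{n = sub("<Overall>", "", $1); print n}')

# Capture the field after sub() has modified it in place (here, 3).
val=$(printf '<Overall>3\n' | awk '{sub("<Overall>", "", $1); print $1}')

printf 'return value: %s, field after sub: %s\n' "$ret" "$val"
```

Summing the return value would add 1 per matching line, which makes every average come out as 1.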

chepner
  • I've got some unexpected output with this answer. The ratings in all the files are out of 5 but all of the output is greater than this. I think the logical OR is choosing 1 every time even when count!=0. However when I remove the "||1" every output is 1. Any ideas? – daltojam Mar 03 '17 at 13:49
  • @daltojam the first argument to sub() should be "", not " – linuxfan says Reinstate Monica Mar 03 '17 at 13:58
0
awk -F '>' '
   # the field separator is >
   # for lines that contain <Overall>
   /<Overall>/ {
       # add the rating to the sum and increment the counter
       Rate+=$2;Count++}
   # at the end of the current file
   END{
      # print the average (the divisor is at least 1)
      printf( "[%s] %f\n", FILENAME, Rate / ( Count + ( ! Count ) ) )
      }
   ' "${File}"

# one liner
awk -F '>' '/<Overall>/{r+=$2;c++}END{printf("[%s] %f\n",FILENAME,r/(c+(!c)))}' "${File}"

Note:

  • ( c + ( ! c ) ) relies on the value of logical NOT (!): it is 1 if c = 0, and 0 otherwise. So 1 is added to c only when c = 0, ensuring a divisor of at least 1.
  • This assumes the whole file follows the format of the sample.
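The divisor trick in the note can be checked in isolation. This is a minimal sketch that evaluates c + (!c) for a few hand-picked counts; the values 0, 1 and 4 and the variable name out are chosen purely for illustration.

```shell
# Evaluate the guarded divisor for a zero count and two non-zero counts.
out=$(awk 'BEGIN {
  n = split("0 1 4", cs, " ")
  for (i = 1; i <= n; i++) {
    c = cs[i] + 0              # force a numeric value
    printf "%d -> %d\n", c, c + (!c)
  }
}')
printf '%s\n' "$out"
```

Only the zero count is changed (to 1); non-zero counts pass through unchanged, so the average is unaffected whenever at least one line matched.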
NeronLeVelu