1

I am new to Awk, and am having trouble despite this being a common and seemingly simple problem.

I am trying to get the average of a column, but my addition seems to be not working. My script:

BEGIN {FS = ","}
{
        AgentDC1 = $4;
        AgentDC2 = $5;
        AutoDC1 = $23;
        AutoDC2 = $24;
        CallDuration = $28;
        CallDurationMinutes = $27;
        CallStart = $33;
        ConnectTime = $35;

        num = (CallDuration ? CallDuration : 0)
        print num
        sum += num;
}
END {print sum;}

When run, it prints the values (in quotes, it this normal?) but then prints the average as 0 (without quotes). For example:

$ awk -f search.awk callrecords.csv
"644.0"
"149.0"
"397.0"
...
""
"117.0"
"165.0"
""
0

So empty slots are being printed as "", and nothing is being added to sum. I hate to post HOW DO questions, but I am really stuck here, none of the other SOs I found were illuminating.

Community
  • 1
  • 1
Will
  • 4,299
  • 5
  • 32
  • 50

2 Answers2

2

I suppose that the quotes are actually present in the data file. Awk will not remove them magically.

In awk, when you use a variable as though it were a number, awk just ignores the characters in the variable, starting with the first one which can't be part og a number. If nothing is left of the variable's value, awk uses the value 0.

Assuming that all your fields actually contain quotation marks, the value of num will start with a quote, so using it as a number will result in the value 0. It still prints out ok, because it is printed as a string.


Here is a gawk solution which can also deal with fields which contain commas. The FPAT regex was modified from the gawk manual, while the function fix was adapted from some code on the same page. Both assume the "normal" CSV convention that quotes in quoted fields are doubled. (As @EdMorton points out in a comment, embedded newlines will not be handled correctly.)

function fix(x) {
    if (substr(x, 1, 1) == "\"")
      return gensub(/""/, "\"", "g",
                    substr(x, 2, length(x) - 2))
    else
      return x
}
BEGIN {
    FPAT = "([^,\"][^,]*|(\"[^\"]*\")+)?
}
{
    AgentDC1 = fix($4)
    AgentDC2 = fix($5)
    AutoDC1 = fix($23)
    AutoDC2 = fix($24)
    CallDuration = fix($28)
    CallDurationMinutes = fix($27)
    CallStart = fix($33)
    ConnectTime = fix($35)
    # Unlike the original, this casts num to a number.
    # It's unnecessary. sum += CallDuration; would be just fine.
    num = CallDuration+0
    print num
    sum += num
}
END {print sum+0}
rici
  • 234,347
  • 28
  • 237
  • 341
  • There are no quotes in the column, but the top row is column headers. Is it detecting a string in the first row and assuming everything below is a string too? – Will Jul 09 '15 at 22:29
  • @will: no, every line is independent. Awk won't invent quotes; if it prints them out, it is because they are part of the variable's value. – rici Jul 09 '15 at 22:42
  • 2
    @Will you are wrong - there ARE quotes in input field 28, awk will not add spurious characters like quotes to the output. Reduce it to 5 or 6 fields and then post the sample input and expected output and the script you are running on THAT if you still have a question. I bet you are using data exported as CSV from Excel and are using Excel to view the file instead of a text editor and that's why you can't see the quotes. – Ed Morton Jul 09 '15 at 22:42
  • 2
    @EdMorton: you took the words right out of my mouth. – rici Jul 09 '15 at 22:48
  • `fix()` could just be `gsub(/^"|"$/,"",x)`. In fact you could just put `gsub(/"/,"")` at the top of the action block since the FPAT wouldn't allow escaped quotes within fields. – Ed Morton Jul 10 '15 at 13:46
  • 1
    How embarrassing! Of course there are quotes, and of course LibreOffice assumes you don't want to see them. Thanks rici and @Ed Norton, really appreciate the help. – Will Jul 10 '15 at 14:32
  • 1
    @edmorton: technically, a csv can include a quote if its not the first character. Within a quoted field, quotes need to be doubled and the FPAT I grabbed doesn't handle that case; I'll fix it in a bit. – rici Jul 10 '15 at 14:41
  • @rici different CSV specs handle quotes within fields as either doubled quotes (`"abc""def"`) or escaped quotes (`"abc\"def"`) but either way your `FPAT` setting handles it. In fact, I tried it with input of `"abc","def""ghi","klm\"nop",qrs,,"",yz` and it worked perfectly so IMHO that is THE `FPAT` setting everyone should be using for parsing CSV files that don't contain embedded newlines - great job! Can you think of any cases besides embedded newlines that it wouldn't handle? – Ed Morton Jul 10 '15 at 15:28
  • @EdMorton: Actually, there was a bug in the regex; otherwise, it wouldn't have handled the escaped quote test case you provided. (The bug would have manifested had you used `"klm\"n,op"` as the test case: it's a classic longest-match problem.) I think you need to know which convention is being used for escaping quotes. Anyway, I fixed the regex again (I hope). It's not immediately obvious to me how to handle CSV fields with embedded newlines in AWK, since it decomposes into records before decomposing into fields, and there is not RPAT corresponding to FPAT. Interesting problem. – rici Jul 10 '15 at 15:48
  • To handle embedded newlines you need to count matching pairs of quotes and append to the previous line until you get an even number. It gets nasty. – Ed Morton Jul 10 '15 at 16:15
  • @EdMorton: Yeah, but you also need to *recognize* an incomplete quoted pattern in the FPAT regex. So it's even nastier. – rici Jul 10 '15 at 16:19
2

You have quotes in your input data. Try this:

BEGIN {FS = "\"?,\"?"}
{
        gsub(/^"|"$/,"")
        AgentDC1 = $4
        AgentDC2 = $5
        AutoDC1 = $23
        AutoDC2 = $24
        CallDuration = $28
        CallDurationMinutes = $27
        CallStart = $33
        ConnectTime = $35

        num = (CallDuration ? CallDuration : 0)
        print num
        sum += num
}
END {print sum+0}

The above won't work if you have commas inside your fields.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185