
Preamble: Enterprise Network Engineer/Architect (Non-Programmer).
Synopsis: Logs are moving from text output to JSON.
Issue: Trying, without success, to migrate a working awk data extraction to a jq one-liner against the JSON.
Active awk:

awk '
  # key on source IP ($1) and destination IP ($2);
  # sum the duration ($3) and count the connections per pair
  BEGIN{ FS="\t" }
  { arr[$1 FS $2] += $3; count[$1 FS $2] += 1 }
  # emit: src, dst, count, total duration (tab-separated)
  END{ for (key in arr) printf "%s%s%s%s%s\n", key, FS, count[key], FS, arr[key] }
' | sort -nrk 4 | head -1 | awk '{ print $1" | "$2" | "$4/60/60 }'  # top pair by total duration, in hours
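For reference, the pipeline above expects three tab-separated columns: source IP, destination IP, and duration. With Zeek-style TSV logs (matching the field names in the JSON sample below), that extraction might look like the following sketch, assuming the standard zeek-cut utility; conn.log is a placeholder path:

cat conn.log | zeek-cut id.orig_h id.resp_h duration  # feeds the awk pipeline above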

End result: use jq to count duplicate entries per source/destination IP address and destination port, and ADD up the cumulative duration of those connections.

SAMPLE JSON INPUT

{
  "ts": 1636xxxxx.41xxx34,
  "uid": "hex_code",
  "id.orig_h": "10.x.x.11",
  "id.orig_p": 42996,
  "id.resp_h": "10.x.x.123",
  "id.resp_p": 53,
  "proto": "udp",
  "service": "dns",
  "duration": 0.01117664844,
  "conn_state": "SF",
  "local_orig": true,
  "local_resp": true,
  "missed_bytes": 0,
  "history": "Dd",
  "orig_pkts": 1,
  "orig_ip_bytes": 71,
  "resp_pkts": 1,
  "resp_ip_bytes": 71
}
{
  "ts": 1xxxx0501.5xxx47,
  "uid": "hex_code",
  "id.orig_h": "10.x.x.11",
  "id.orig_p": 36299,
  "id.resp_h": "10.x.x.123",
  "id.resp_p": 53,
  "proto": "udp",
  "service": "dns",
  "duration": 0.00857415966797,
  "conn_state": "SF",
  "local_orig": true,
  "local_resp": true,
  "missed_bytes": 0,
  "history": "Dd",
  "orig_pkts": 1,
  "orig_ip_bytes": 74,
  "resp_pkts": 1,
  "resp_ip_bytes": 74
}
Targeted jq output:
10.xxx.xxx.21 | 18.xx1.xx1.158 | 45.6606 <-- the time shown is the cumulative duration
net_solv
  • Sample json output: – net_solv Nov 10 '21 at 17:41
    Given your example JSON, can you provide the output you want `jq` to produce? – pmf Nov 10 '21 at 18:35
  • Is "SAMPLE JSON OUTPUT" actually the input? If so, what output do you want? If not, what input should produce that output? – ikegami Nov 10 '21 at 19:24
  • Tip: `. |` is redundant. It doesn't do anything. – ikegami Nov 10 '21 at 19:25
  • If that's the input, you don't have a JSON document; you have a stream of JSON documents. To work with all the documents at once (e.g. to remove duplicates), you'll need to use `-n` and `[ inputs ]` (see the sketch after these comments). – ikegami Nov 10 '21 at 19:26
  • I see your edit. You're not making things clear. What is the input you provide to jq, and what output do you want? Is the thing identified as "SAMPLE JSON OUTPUT" the input? Is the thing identified as "current output" the output expected for that input? – ikegami Nov 10 '21 at 19:34
  • Still not answering the question. Clearly identify the input for the jq program and the desired output from that program for that input. – ikegami Nov 10 '21 at 19:35
  • To start from the beginning... we get text-based log files from network sensors; the organization moved to JSON-structured log files for the SIEM. I was using awk to parse the file to give me the cumulative time an IP address spends connecting to another IP address. The log contains thousands of duplicate entries (one each time an IP connects to the same or a different external IP address), so there are tons of duplicate entries per source IP and destination IP. The goal is to TOTAL up how many times a source IP 'talks' with a destination IP and add up each duration for the total amount of time. – net_solv Nov 10 '21 at 19:42
  • UPDATED POST to reflect the requested data points. – net_solv Nov 10 '21 at 19:51
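
A minimal illustration of ikegami's `-n` / `[ inputs ]` tip (extract.json is a placeholder filename):

jq -n '[inputs] | length' extract.json

With -n, jq does not consume the first input automatically; [inputs] then collects the entire stream of objects into a single array (equivalent to slurping with -s, as used in the answer below), and length reports how many objects it holds.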

1 Answer


As far as I understand what you are trying to accomplish, this might point you in the right direction:

jq -sr '
  group_by([."id.orig_h", ."id.resp_h"])[]
  | [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)]
  | @csv
' extract.json

Explanation: Your input is a stream of objects. Reading it in with -s (or --slurp) turns the stream into an array. With group_by we transform it into an array of arrays, grouping objects by equal values of the given criterion, here an array of the two IP fields. Next, for each element of the outer array (each group) we construct an array holding the two IP fields of the first member only (which suffices, as all members of a group are equal in this regard), and as the third value we add up each group member's .duration field. Lastly, the constructed array is turned into a line of CSV using @csv, which is printed raw thanks to the -r (--raw-output) parameter.
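
As a worked check, the two sample records above (with the obfuscated placeholders treated as real values so the input parses) share the IP pair 10.x.x.11 / 10.x.x.123, so they fall into a single group and the filter yields one CSV line:

"10.x.x.11","10.x.x.123",0.01975080810797

That is 0.01117664844 + 0.00857415966797; the exact decimal rendering may vary slightly due to floating-point.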

Note: I treated field values like "ts": 1636xxxxx.41xxx34 as obfuscations of an actual number. If, however, the value is actually a string containing some x characters, then the string literal needs to be quoted ("ts": "1636xxxxx.41xxx34") to be proper JSON.


Answering a follow-up question: how to filter out durations equal to 0, and how to sort the remaining lines by duration from highest to lowest:

jq -sr '
  group_by([."id.orig_h", ."id.resp_h"]) | map(
    [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)]
    | select(.[2] > 0)
  )
  | sort_by(-.[2])[]
  | @csv
' extract.json
pmf
  • Correct, the .ts value was obfuscated; here is sample output from the code shown jq -s 'group_by([."id.orig_h", ."id.resp_h"])[] | [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)] | @csv': "\"0.0.0.0\",\"255.255.255.255\"," "\"10.xx.xx.31\",\"10.xx.xx.11\",0.06085658073425293" "\"10.xx.xx.11\",\"224.x.x.251\",462.63780641555786" "\"10.xx.xx11\",\"46.xx.xx.210\",34.87883520126343" "\"fe80::X:X:14f5:b599\",\"ff02::X\",12.92171025276184" "\"fe80::X:X:14f5:b599\",\"ff02::2\",366.5762286186218" "\"fe80::baX:X::80e0\",\"ff02::1\",89629.40626692772" – net_solv Nov 10 '21 at 20:10
  • Forgot to mark the output as raw. Fixed it (and added some closer explanation). – pmf Nov 10 '21 at 20:13
  • Thanks for your assistance, follow-up ?, is there a way to sort the output by longest/highest duration and can objects with duration = null be filtered out? – net_solv Nov 10 '21 at 20:22
  • Sure. Apply `select(.[2] > 0)` to each array (= later a line of CSV) in order to drop those whose third column doesn't match. Then, enclose all arrays (= later the whole CSV file) in an outer array (i.e. replace the `[]` after the grouping with a `map`) and apply `sort_by(.[2])` to it in order to sort by the third column (in numerically ascending order; to get descending order, `sort_by` the negative value `-.[2]`). – pmf Nov 10 '21 at 20:32
  • Sorry for my programmatic illiteracy... here is what I got but clueless on where to place the sort_by... jq -sr 'group_by([."id.orig_h", ."id.resp_h"])[] | [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)] | select(.[2] > 0) | @csv' – net_solv Nov 10 '21 at 20:50
  • I already added it to my original answer :) – pmf Nov 10 '21 at 20:52
  • :) just saw... F5 helps, lol... Here is the current output: "fe80::xxxx:xxxx:14f5:b599","ff02::1:xxxx:xxxx",408181.28009080887 which is AWESOME BTW! Last question: is it better to 'format' the output in jq or with awk? i.e.: IP -> IP -> time in hours (408181.28xxxx / 60 seconds / 60 minutes) – net_solv Nov 10 '21 at 21:05
  • In terms of efficiency, it is always better to invoke fewer tools. So, if you've come to jq with the goal of abandoning awk, then sure, do it in jq. If, however, you'll still be employing both, then it doesn't really matter. Ideally, perform these conversions as late as possible, at least after jq has (potentially) reduced the number of lines, which results in fewer iterations and invocations. – pmf Nov 10 '21 at 21:22
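
Following up on that last exchange, a hedged sketch that does the final formatting entirely in jq: string interpolation replaces @csv to produce the "IP | IP | hours" layout from the question, and the summed duration is converted from seconds to hours inside the filter (extract.json is again a placeholder):

jq -sr '
  group_by([."id.orig_h", ."id.resp_h"]) | map(
    [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)]
    | select(.[2] > 0)
  )
  | sort_by(-.[2])[]
  | "\(.[0]) | \(.[1]) | \(.[2] / 60 / 60)"
' extract.json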