Multiple line parsing within bash

Question

I'm having trouble parsing a multi-line file. I've tried awk, but I only know how to use it on single lines.

The file contains records like this:

0123456789ab    "(channel
  (1
    (saturation(14))
  )
  (2
    (saturation(41))
  )
  (3
    (saturation(25))
  )
  (4
    (saturation(27))
  )
  (5
    (saturation(33))
    (ssid
      (0
        (ssid(TestingAlpha))
        (rssi(5))
      )
    )
  )
  (6
    (saturation(100))
    (ssid
      (0
        (ssid(TestingBravo))
        (rssi(70))
      )
      (1
        (ssid(TestingCharlie))
        (rssi(44))
      )
    )
  )
  (7
    (saturation(40))
  )
  (8
    (saturation(22))
  )
  (9
    (saturation(19))
  )
  (10
    (saturation(20))
  )
  (11
    (saturation(11))
  )
  (12
  )
  (13
    (saturation(11))
  )
)
"

It's a wireless survey. Any output that can be analyzed (database records, Excel columns, etc.) is acceptable.

Possible duplicate of https://stackoverflow.com/questions/31232843/jq-or-xsltproc-alternative-for-s-expressions — tripleee, Dec 15 '17 at 11:39
Using a trad Lisp format instead of JSON is Hacker News crazy. See also https://www.reddit.com/r/programming/comments/oon44/sexpressions_the_fatfree_alternative_to_json/ — tripleee, Dec 15 '17 at 11:41
I'm googling for how to use Awk to parse sexp and so far drawing a blank. It's not particularly hard if your format is this simple, but you need to store the results in memory until you reach the end. — tripleee, Dec 15 '17 at 11:43
One way is to remove all newlines and parse it as one huge line, i.e., use `tr -d '\n'` — Fredrik Pihl, Dec 15 '17 at 11:44
This is butt-ugly but turns your example into almost-JSON: `sed -e 's/(\([0-9][0-9]*\))/ \1/g' -e 's/(\([A-Za-z][A-Za-z]*\))/ "\1"/g' -e 's/(\([0-9a-z][0-9a-z]*\)/{"\1":/g' -e 's/)/},/g'` — tripleee, Dec 15 '17 at 11:49
Awk is very efficient at processing tabular data. Here you have kind of a tree data structure. Awk won't fit. You could try to transform the data into a more common tree data format such as XML or JSON. You can also have a look at Lisp, as it's full of parentheses :) — Setop, Dec 15 '17 at 11:58
You need a context-free grammar to recognize an S-expression, and regular expressions simply aren't up to the task. — chepner, Dec 15 '17 at 12:37
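
The nesting point can be illustrated with a small sketch: a single regular expression cannot tell whether the parentheses are balanced, but a running counter can. The snippet below is only a sanity check under stated assumptions, not a parser. It runs on a trimmed, hypothetical copy of the sample, and the `sample` and `balance` names are made up for illustration.

```shell
# Trimmed, hypothetical copy of the record from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0123456789ab    "(channel
  (1
    (saturation(14))
  )
)
"
EOF

# Count "(" and ")" on every line and keep a running nesting depth.
# gsub() returns the number of replacements, so replacing each
# parenthesis with itself counts it without changing the line.
balance=$(awk '
{
    opens  = gsub(/\(/, "(")
    closes = gsub(/\)/, ")")
    depth += opens - closes
}
END { print "unclosed=" depth }' "$sample")

printf '%s\n' "$balance"
rm -f "$sample"
```

A non-zero count would suggest a truncated or damaged record, which is worth knowing before attempting any extraction.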

score 1 · Answer 1 · answered Dec 15 '17 at 12:34

As I said in a comment, Awk is very efficient at processing tabular data. Here you have kind of a tree data structure. Awk won't fit.

However because the format seems stable, we can cheat a bit :

BEGIN {
    FS="("    # split fields on "(" so that $1 is the indentation, $2 the name and $3 the value
    OFS=";"
    print "bizid", "ssid", "channel", "saturation", "rssi"
}

NR == 1 {
    split($1,A," ")
    bizid=A[1]
    next
}

{
    level = length($1) / 2
}

function clearv(v,      R) {
    split(v,R,")")
    return R[1]
}

level == 1 {
    channel=$2
    next
}

level == 2 && $2 == "saturation" {
    saturation=clearv($3)
    next
}

level == 4 && $2 == "ssid" {
    ssid=clearv($3)
    next
}

level == 4 && $2 == "rssi" {
    print bizid, ssid, channel, saturation, clearv($3)
    next
}

produces :

bizid;ssid;channel;saturation;rssi
0123456789ab;TestingAlpha;5;33;5
0123456789ab;TestingBravo;6;100;70
0123456789ab;TestingCharlie;6;100;44

which seems acceptable for analysis.
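
The output above lists only channels on which at least one SSID was seen. If a per-channel saturation table is also useful for the survey, a sketch in the same style might look like the following. It assumes the same fixed two-space indentation, splits fields on `(` via `-F'('`, and runs on a trimmed copy of the sample. The `sample` and `channels` names are made up for illustration.

```shell
# Trimmed, hypothetical copy of the record from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0123456789ab    "(channel
  (1
    (saturation(14))
  )
  (5
    (saturation(33))
    (ssid
      (0
        (ssid(TestingAlpha))
        (rssi(5))
      )
    )
  )
  (12
  )
)
"
EOF

# Same indentation trick as the script above, with fields split on "(".
# One row is printed when a channel block closes; the saturation column
# stays empty for channels that report none.
channels=$(awk -F'(' '
BEGIN   { OFS = ";"; print "bizid", "channel", "saturation" }
NR == 1 { split($1, A, " "); bizid = A[1]; next }
        { level = length($1) / 2 }
level == 1 { channel = $2; saturation = ""; next }
level == 2 && $2 == "saturation" { split($3, R, ")"); saturation = R[1]; next }
/^  \)$/ { print bizid, channel, saturation }
' "$sample")

printf '%s\n' "$channels"
rm -f "$sample"
```

Like the original script, this reads the bizid from the first line only, so it handles a single record per file.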

Ed Morton · Answer 2 · 2017-12-15T15:16:48.383

It needs to be debugged (not by me!) but here's how to approach the problem: write a recursive function that just descends every time it hits "(", build up an array indexed by the current depth of the calls at that time, and print that array's contents when you hit the last ")" in the string:

$ cat tst.awk
BEGIN { RS="[)]\\s*\"\\s*" }
function descend(tail) {
    if ( ++depth == 30 ) {
        print "ERROR: went too deep" | "cat>&2"
        exit 1
    }
    while ( match(tail,/([^()]+)([()])(.*)/,a) ) {
        val[depth] = gensub(/^\s+|\s+$/,"","g",a[1])
        if ( a[2] == "(" ) {
            descend(a[3])
        }
        else {
            for (i=1; i<=depth; i++) {
                printf "%s,", val[i]
            }
            print ""
        }
        tail = a[3]
    }
    --depth
}
{ sub(/^[^"]+"[(]/,""); descend($0) }


$ awk -f tst.awk file
channel,1,saturation,14,
channel,1,saturation,,
channel,1,saturation,,2,saturation,41,
channel,1,saturation,,2,saturation,,
channel,1,saturation,,2,saturation,,3,saturation,25,
channel,1,saturation,,2,saturation,,3,saturation,,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,27,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,33,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,TestingAlpha,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,,rssi,5,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,,rssi,,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,,rssi,,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,,rssi,,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,,rssi,,6,saturation,100,
channel,1,saturation,,2,saturation,,3,saturation,,4,saturation,,5,saturation,,ssid,0,ssid,,rssi,,6,saturation,,ssid,0,ssid,TestingBravo,
ERROR: went too deep

The above uses GNU awk for multi-char RS and gensub().

I do really like the idea of converting it to JSON and then using jq on it instead, though; I'm just not familiar enough with JSON or jq to tackle that.
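
One possible way to get the depth-indexed array idea working is sketched below in plain awk, using a character loop and an explicit depth counter instead of recursion. It is a sketch under assumptions, not a debugged version of the script above. It assumes each record looks like `id "(...)"`, that names and values contain no spaces, parentheses or quotes, and that a list with no nested list is a leaf. All variable names are illustrative.

```shell
# Trimmed, hypothetical copy of the record from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0123456789ab    "(channel
  (1
    (saturation(14))
  )
  (5
    (saturation(33))
    (ssid
      (0
        (ssid(TestingAlpha))
        (rssi(5))
      )
    )
  )
  (12
  )
)
"
EOF

paths=$(awk '
{ text = text $0 " " }                  # join all lines; newlines become spaces
END {
    n = length(text)
    for (i = 1; i <= n; i++) {
        c = substr(text, i, 1)
        if (c == "\"") {                # a quote opens or closes a record body
            if (!inside) id = tok       # text before the opening quote is the id
            tok = ""
            inside = !inside
        } else if (!inside) {
            if (c != " " && c != "\t") tok = tok c
        } else if (c == "(" || c == ")" || c == " " || c == "\t") {
            if (tok != "") { name[depth] = tok; tok = "" }
            if (c == "(") {
                if (depth > 0) kids[depth]++      # parent gains a child
                depth++
                kids[depth] = 0
                name[depth] = ""
            } else if (c == ")") {
                if (kids[depth] == 0) {           # leaf: print id plus full path
                    line = id
                    for (d = 1; d <= depth; d++) line = line "," name[d]
                    print line
                }
                depth--
            }
        } else {
            tok = tok c                 # accumulate a name or value
        }
    }
}' "$sample")

printf '%s\n' "$paths"
rm -f "$sample"
```

Each output line is the record id followed by the path to one leaf, for example `0123456789ab,channel,5,ssid,0,rssi,5`, so the rows can be filtered with ordinary line tools or loaded as CSV-like data.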
