awk and multiline matching (sub-regex)

Question

I am trying to use awk to parse a multiline expression. A single one of them looks like this:

_begin  hello world !
_attrib0    123
_attrib1    super duper
_attrib1    yet another value
_attrib2    foo
_end

I need to extract the values associated with _begin and _attrib1. So for the example above, the awk script should print the following (one line per block):

hello world ! super duper yet another value

The separator used is a tab (\t) character. Spaces are used only within strings.

ghoti · Accepted Answer · score 9 · 2012-10-30T18:36:08.357

The following awk script does the job:

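#!/usr/bin/awk -f
BEGIN { FS="\t"; }                        # fields are separated by a single tab
/^_begin/      { output=$2; }             # start a new record with the _begin value
$1=="_attrib1" { output=output " " $2; }  # append each _attrib1 value (exact match on field 1)
/^_end/        { print output; }          # end of record: print the collected values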

You didn't specify whether you want a tab (\t) to be your output field separator. If you do, let me know and I'll update the answer. (Or you can; it's trivial.)
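For what it's worth, a tab-separated variant could look something like the sketch below. It assumes the only change wanted is the joiner between the values. `join_tabbed` is a made-up wrapper name, used just so the example can be run from a shell, and the sample input is the block from the question.

```shell
# join_tabbed is a hypothetical helper name, used only for this sketch.
join_tabbed() {
  awk '
    BEGIN { FS = "\t" }
    /^_begin/        { output = $2 }              # start a new record
    $1 == "_attrib1" { output = output "\t" $2 }  # join with a tab instead of a space
    /^_end/          { print output }             # emit the finished record
  ' "$@"
}

# Feed it the sample block from the question (fields separated by tabs).
printf '_begin\thello world !\n_attrib0\t123\n_attrib1\tsuper duper\n_attrib1\tyet another value\n_attrib2\tfoo\n_end\n' | join_tabbed
```

This should print the three values on one line, separated by tabs rather than spaces.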

Of course, if you want a scary alternative (since we're getting close to Hallowe'en), here is a solution using sed:

$ sed -ne '/^_begin./{s///;h;};/^_attrib1[^0-9]/{s///;H;x;s/\n/ /;x;};/^_end/{;g;p;}' input.txt 
hello world ! super duper yet another value

How does this work? Mwaahahaa, I'm glad you asked.

/^_begin./{s///;h;}; -- When we see _begin, strip it off and store the rest of the line to sed's "hold buffer".
/^_attrib1[^0-9]/{s///;H;x;s/\n/ /;x;}; -- When we see _attrib1 (followed by a non-digit, so _attrib11 is not matched), strip it off, append the rest of the line to the hold buffer, swap the hold buffer and pattern space, replace the newline that H inserted with a space, and swap the hold buffer and pattern space back again.
/^_end/{;g;p;} -- We've reached the end, so pull the hold buffer into the pattern space and print it.

This assumes that your input field separator is just a single tab.
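Spread over several lines, the same three steps might read as follows. This is only a restatement of the one-liner for readability, wrapped in a hypothetical shell function `join_with_sed`. The sed commands themselves are unchanged, and the sample input is the block from the question.

```shell
# join_with_sed is a hypothetical wrapper name, used only for this sketch.
join_with_sed() {
  sed -n '
    # _begin: strip the tag and its separator, then copy the value to the hold buffer
    /^_begin./{
      s///
      h
    }
    # _attrib1: strip the tag, append the value to the hold buffer,
    # then turn the newline added by H into a space
    /^_attrib1[^0-9]/{
      s///
      H
      x
      s/\n/ /
      x
    }
    # _end: fetch the accumulated line from the hold buffer and print it
    /^_end/{
      g
      p
    }
  ' "$@"
}

printf '_begin\thello world !\n_attrib0\t123\n_attrib1\tsuper duper\n_attrib1\tyet another value\n_attrib2\tfoo\n_end\n' | join_with_sed
# prints: hello world ! super duper yet another value
```

The empty pattern in `s///` reuses the regex from the address that selected the line, which is why the tag and the tab after it disappear without being spelled out twice.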

SO simple. Who ever said sed was arcane?!

edited Oct 30 '12 at 18:36

answered Oct 30 '12 at 17:36

ghoti


_attrib11 makes this script fail (_attrib1 matches it too) – malat Oct 30 '12 at 17:57
There was no `_attrib11` in the sample data you provided. If you like, you can make conditions like `$1=="_attrib1"` instead of `/^_attrib1/` to handle that, or you can just leave it as a regex but terminate it, like `$1~/^_attrib1$/`. I recommend the first alternate solution; always choose string matching first, regex (at least) second. – ghoti Oct 30 '12 at 18:23
Updated my answer per your new requirement. Also added a `sed` alternative, for your reading pleasure. – ghoti Oct 30 '12 at 18:41
@ghoti, Your first example does not work for me. It prints only a blank line. Why? – Tedee12345 Nov 11 '12 at 15:35
@Tedee12345 - Perhaps it has something to do with your input data. Why not [post a question](http://stackoverflow.com/questions/ask) and we'll see what we can do? – ghoti Nov 11 '12 at 17:17
@ghoti, You're right, it was a problem with the input data. My data used a different field separator from yours. Thanks for the hint. – Tedee12345 Nov 11 '12 at 19:59
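
To illustrate the difference between the prefix regex and the exact comparison discussed in the comments above, here is a small sketch. The `sample` function and its `_attrib11` line are invented for the demonstration and are not from the original data.

```shell
# sample is a hypothetical helper that emits a record containing
# both _attrib1 and _attrib11 (the latter is made up for this demo).
sample() {
  printf '_begin\thello\n_attrib1\twanted\n_attrib11\tunwanted\n_end\n'
}

# Prefix regex on the whole line: also picks up _attrib11.
sample | awk 'BEGIN { FS = "\t" } /^_attrib1/ { print $2 }'
# prints: wanted
#         unwanted

# Exact string comparison on the first field: only _attrib1.
sample | awk 'BEGIN { FS = "\t" } $1 == "_attrib1" { print $2 }'
# prints: wanted

# Regex anchored at both ends of the first field: same result as the string comparison.
sample | awk 'BEGIN { FS = "\t" } $1 ~ /^_attrib1$/ { print $2 }'
# prints: wanted
```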

score 1 · Answer 2 · answered Oct 30 '12 at 17:44

This should work:

#!/bin/bash

# Reads from standard input and prints one line per _begin ... _end block.
awk 'BEGIN {FS="\t"} $1=="_begin" {output=$2} $1=="_attrib1" {output=output " " $2} $1=="_end" {print output}'

sampson-chen

