2

Suppose I have multi-line records with = as the record separator, but only when the = is at the start of a line:

$ cat file
record 1, field 1
record 1, field 2 with a = in it
record 1, field 3
= record 2, field 1
record 2, field 2 also with a = in it
record 2, field 3
= final record 3, field 1
record 3, field 2

I would like to separate a file similar to this into records delimited by ^=[ \t] and fields by \n.

I tried:

$ gawk -v RS="^=[ \t]" -v FS="\n" '{printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}' file

but that results in:

record 1, field 1
record 1, field 2 with a = in it
record 1, field 3
= record 2, field 1
record 2, field 2 also with a = in it
record 2, field 3
= final record 3, field 1
record 3, field 2

--- NF=9, NR=1 ---

i.e., the ^ does not match at the beginning of each line, as I expected it to.

I know I can do:

$ gawk -v RS="\n=[ \t]" -v FS="\n" '{printf "%s\nNF=%s, NR=%s\n", $0, NF, FNR}' file

But that feels like it would have Unix/Windows line-ending issues. It also leaves an extra \n attached to the final record.

I could use sed to replace the ^=[ \t] with an extra \n then use gawk in paragraph mode:

$ sed 's/^=[ \t]/\
/' file | gawk -v RS="" -v FS="\n" '{printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}'
record 1, field 1
record 1, field 2 with a = in it
record 1, field 3
--- NF=3, NR=1 ---
record 2, field 1
record 2, field 2 also with a = in it
record 2, field 3
--- NF=3, NR=2 ---
final record 3, field 1
record 3, field 2
--- NF=2, NR=3 ---

Which is precisely what I am looking for.

Question: Is there a way to use ^ in RS to indicate 'start of the line' in gawk with multiline records so I don't have to pipe through sed? I guess I am looking for the equivalent of the m flag in a PCRE regex in gawk.

dawg

3 Answers

4

^ means start of string, not start of line. There is no start-of-line character, only carriage return (\r = return the cursor to the start of the line) and line feed (\n = drop the cursor to the next line) characters, which are used together or separately, depending on the tool/OS, to indicate end of line, aka newline. Windows tools tend to use \r\n to mean newline while UNIX uses \n alone, which is why \n is often referred to as the newline character in UNIX.

Many tools, e.g. sed and grep (and awk by default), read only one line at a time, so their input buffer holds a single line. In that context start of string is the same as start of line, which is why you often hear ^ referred to as the start-of-line character when in general it isn't. Similarly, $ is the end-of-string character, not the end-of-line character as it's often called, but it can represent end of line when the string being matched is an input buffer that a tool populates one line at a time.
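As a small illustration of the line-at-a-time case (a sketch only; the helper name count_eq_lines and the sample input are made up for this example): with awk's default RS each record is one line, so /^=/ behaves like "starts the line with =".

```shell
# Hypothetical helper: count input lines that begin with "=".
# With the default RS every record is a single line, so ^ (start of string)
# coincides with start of line here.
count_eq_lines() {
  awk '/^=/ { n++ } END { print n + 0 }'
}

# Two of the four lines start with "="; the "=" inside "a = b" is ignored.
printf 'a = b\n= c\nd\n= e\n' | count_eq_lines
```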

What that means is that if your tool is NOT reading one line at a time then the regexp to match a character X at the start of line in UNIX files is actually:

(^|\n)X

and at the end of a line is:

X(\n|$)

but be aware that this also matches and consumes the linefeed character, if present.

In Windows, change \n to \r\n above. To work in both you can use \r?\n, unless your file was created on Windows and could contain a linefeed mid-line, e.g. CSVs exported from Excel could look like
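To make the difference concrete, here is a minimal sketch that counts matches of both patterns against one multi-line string. The function name count_anchors and the sample string are assumptions for illustration; gsub with "&" as the replacement leaves the text unchanged and simply returns the number of matches.

```shell
# Hypothetical helper: compare ^= with (^|\n)= on a single multi-line string.
count_anchors() {
  awk 'BEGIN {
    s = "=a\nb=c\n=d\n"                        # three lines; two start with "="
    t = s
    start_of_string = gsub(/^=/, "&", t)       # ^ only matches at the very start
    t = s
    start_of_line = gsub(/(^|\n)=/, "&", t)    # also matches "=" right after a newline
    print start_of_string, start_of_line
  }'
}

count_anchors   # prints: 1 2
```

The "=" in the middle of "b=c" is counted by neither pattern, since it is preceded by neither the start of the string nor a newline.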

field1,"field2 part a\nfield2 part b",field3\r\n

where the \n and \r would of course be literal. In that case you would not want the standalone \n mid-field to be misinterpreted as a newline.
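A rough sketch of the \r?\n idea applied to the record separator from the question follows. It assumes awk is gawk, or another awk that accepts a regexp RS, and that the input has no standalone \n inside a field; split_records is a made-up name.

```shell
# Hypothetical helper: split on "=" at the start of a line, accepting
# either LF or CRLF line endings.
split_records() {
  awk -v RS='\r?\n=[ \t]' -v FS='\r?\n' '{
    sub(/\r?\n$/, "")                 # drop the line ending left on the last record
    printf "NR=%d NF=%d\n", FNR, NF   # modifying $0 above re-splits the fields
  }'
}

# The same two records, first with UNIX then with Windows line endings.
printf 'r1 f1\nr1 f2\n= r2 f1\nr2 f2\n' | split_records
printf 'r1 f1\r\nr1 f2\r\n= r2 f1\r\nr2 f2\r\n' | split_records
```

Both invocations should report two records with two fields each.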

Try this (gawk-only due to multi-char RS and \s shorthand for [[:space:]]):

$ awk -v RS='\n(=\\s*|$)' -F'\n' '{printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}' file
record 1, field 1
record 1, field 2 with a = in it
record 1, field 3
--- NF=3, NR=1 ---
record 2, field 1
record 2, field 2 also with a = in it
record 2, field 3
--- NF=3, NR=2 ---
final record 3, field 1
record 3, field 2
--- NF=2, NR=3 ---
Ed Morton
1

You can avoid the extra newline on the last record by checking whether the last field is empty:

$ awk -F'\n' -v RS='\n=[ \t]' -v OFS='\n' '{NF-=$NF==""; 
                                            print $0, "---NF="NF ", ---NR="FNR}' file
record 1, field 1
record 1, field 2 with a = in it
record 1, field 3
---NF=3, ---NR=1
record 2, field 1
record 2, field 2 also with a = in it
record 2, field 3
---NF=3, ---NR=2
final record 3, field 1
record 3, field 2
---NF=2, ---NR=3
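To see what NF-=$NF=="" is doing, here is a small sketch that prints NF before and after the adjustment. The comparison $NF=="" evaluates to 1 or 0, so NF drops by one only when the last field is empty. The name show_nf_fix and the sample input are assumptions, and it relies on an awk that supports a regexp RS and assignment to NF (gawk does).

```shell
# Hypothetical helper: show NF before and after dropping an empty last field.
show_nf_fix() {
  awk -v FS='\n' -v RS='\n=[ \t]' '{
    before = NF
    NF -= ($NF == "")     # subtract 1 only if the last field is empty
    print FNR ": NF " before " -> " NF
  }'
}

# The final record ends in "\n", which produces an empty trailing field.
printf 'r1 f1\n= r2 f1\nr2 f2\n' | show_nf_fix
```

With this input the first record keeps NF=1, while the last record goes from 3 fields to 2.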
karakfa
0

I don't know if it makes a difference, but I found it slightly easier to do this inside of the BEGIN clause:

awk 'BEGIN {RS = "\n= "; FS = "\n"} {printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}' records

This gives the result:

record 1, field 1
record 1, field 2 with a = in it
record 1, field 3
--- NF=3, NR=1 ---
record 2, field 1
record 2, field 2 also with a = in it
record 2, field 3
--- NF=3, NR=2 ---
final record 3, field 1
record 3, field 2

--- NF=3, NR=3 ---

No explanation needed, as it really doesn't do anything but slightly reformulate what you already did. How does that look?

The problem with ^, afaik, is that there are no "lines" per se. There are records. I could be wrong, but I don't think that the "start of line" concept is relevant in this context. "Start of field" would be, or "start of record", although the latter would simply be something like:

$0 ~ /^chars/

But, I don't know much about the internal workings of this part of awk, so I welcome education on it.
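A quick way to check the "start of record" versus "start of field" idea is the sketch below. It uses paragraph mode (RS="") with one field per line; anchor_scope is a made-up name and the input is invented for the example.

```shell
# Hypothetical helper: test ^bar against the whole record and against field 2.
# Prints the record number followed by 1 (match) or 0 (no match) for each test.
anchor_scope() {
  awk -v RS='' -v FS='\n' '{ print FNR, ($0 ~ /^bar/), ($2 ~ /^bar/) }'
}

# Record 1 is "foo\nbar" and record 2 is "bar\nfoo".
printf 'foo\nbar\n\nbar\nfoo\n' | anchor_scope
```

This prints "1 0 1" and then "2 1 0": against $0 the ^ only matches at the start of the record, while against $2 it matches at the start of that field, even though the field begins in the middle of the record.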

Andrew