I've got a block of text with some sections that are clearly delineated by four-space indentation:

PERCHANCE he for whom this bell tolls may be so ill, as that he knows not it
tolls for him; and perchance I may think myself so much better than I am, as
that they who are about me, and see my state, may have caused it to toll for me,
and I know not that. 

    The church is Catholic, universal, so are all her actions; all that she does
    belongs to all. When she baptizes a child, that action concerns me; for that
    child is thereby connected to that body which is my head too, and ingrafted into
    that body whereof I am a member.

And when she buries a man, that action concerns me: all mankind is of one
author, and is one volume; when one man dies, one chapter is not torn out of the
book, but translated into a better language; and every chapter must be so
translated; God employs several translators; some pieces are translated by age,
some by sickness, some by war, some by justice; but God's hand is in every
translation, and his hand shall bind up all our scattered leaves again for that
library where every book shall lie open to one another.

    As therefore the bell that rings to a sermon calls not upon the preacher only,
    but upon the congregation to come, so this bell calls us all; but how much more
    me, who am brought so near the door by this sickness.

There was a contention as far as a suit (in which both piety and dignity,
religion and estimation, were mingled), which of the religious orders should
ring to prayers first in the morning; and it was determined, that they should
ring first that rose earliest.

I'd like each indented block to be immediately preceded with START QUOTE and immediately followed with END QUOTE. I've been playing around with sed for fifteen minutes but still can't get it quite right. Here's my best effort so far:

#!/usr/bin/sed -Ef
/^$/ {
N
    /\n    / {
    P
    s/^\n//
    i\
    START QUOTE
    }
}

/^    / {
N
    /\n$/ {
    s/\n$/&END QUOTE/
    G
    }
}

Running ./parse.sed <script.txt, I get the following output:

PERCHANCE he for whom this bell tolls may be so ill, as that he knows not it
tolls for him; and perchance I may think myself so much better than I am, as
that they who are about me, and see my state, may have caused it to toll for me,
and I know not that. 

START QUOTE
    The church is Catholic, universal, so are all her actions; all that she does
    belongs to all. When she baptizes a child, that action concerns me; for that
    child is thereby connected to that body which is my head too, and ingrafted into
    that body whereof I am a member.

And when she buries a man, that action concerns me: all mankind is of one
author, and is one volume; when one man dies, one chapter is not torn out of the
book, but translated into a better language; and every chapter must be so
translated; God employs several translators; some pieces are translated by age,
some by sickness, some by war, some by justice; but God's hand is in every
translation, and his hand shall bind up all our scattered leaves again for that
library where every book shall lie open to one another.

START QUOTE
    As therefore the bell that rings to a sermon calls not upon the preacher only,
    but upon the congregation to come, so this bell calls us all; but how much more
    me, who am brought so near the door by this sickness.
END QUOTE

There was a contention as far as a suit (in which both piety and dignity,
religion and estimation, were mingled), which of the religious orders should
ring to prayers first in the morning; and it was determined, that they should
ring first that rose earliest.

Note the missing END QUOTE on the first quoted block. I think what's going on here is that the second command in the script:

/^    / {
N
    /\n$/ {
    s/\n$/&END QUOTE/
    G
    }
}

only properly finds the boundary at the end of the block if the current line is the last line of the quote block. But sometimes, it's off by one, and the boundary gets ingested in two separate N commands, and thus not recognized. Any pointers on what the right way to do this with sed is?

ravron

4 Answers

Using sed

When looking for the end of the quote, the original script read in lines in pairs. As a consequence, the end of a quote was only found when the quote contained an odd number of lines. The solution is to read the whole quote in at once and then add END QUOTE to the end of it:

#!/usr/bin/sed -Ef
/^$/ {
N
    /\n    / {
    P
    s/^\n//
    i\
    START QUOTE
    }
}

/^    / {
    :a;N;/\n$/!ba
    s/$/END QUOTE\n/
}

The key change here is :a;N;/\n$/!ba which reads lines in until it finds an empty line.

[The above was tested under GNU sed. BSD (OSX) sed is often slightly different.]
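
To see the loop in isolation, here is a minimal sketch that keeps only the second half of the script. It assumes GNU sed (the `\n` in the replacement text is a GNU extension), and `close_quotes` is just a hypothetical helper name for the demo. The sample quote has two lines, which is the even-length case that the pairwise version missed.

```shell
#!/bin/sh
# Assumes GNU sed. close_quotes is a hypothetical name used only for this demo.
close_quotes() {
    sed '/^    /{
        # label: top of the loop
        :a
        # append the next input line to the pattern space
        N
        # keep looping until the line just appended is empty
        /\n$/!ba
        # the pattern space now ends with a newline; add the marker after it
        s/$/END QUOTE\n/
    }' "$@"
}

printf '    one\n    two\n\nafter\n' | close_quotes
```

With that input, END QUOTE should land directly after the second indented line, followed by the blank line and then the unindented text. Note that a quote running to the very end of the file, with no blank line after it, is not covered by this sketch.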

Using awk

sed can do anything, but things with complex logic are often easier to do with awk. For your problem, try:

awk '/^    / && q{print;next} q{print "END QUOTE"; q=0} /^    /{print "START QUOTE"; q=1} 1' file

With your input, for example:

$ awk '/^    / && q{print;next} q{print "END QUOTE"; q=0} /^    /{print "START QUOTE"; q=1} 1' file
PERCHANCE he for whom this bell tolls may be so ill, as that he knows not it
tolls for him; and perchance I may think myself so much better than I am, as
that they who are about me, and see my state, may have caused it to toll for me,
and I know not that. 

START QUOTE
    The church is Catholic, universal, so are all her actions; all that she does
    belongs to all. When she baptizes a child, that action concerns me; for that
    child is thereby connected to that body which is my head too, and ingrafted into
    that body whereof I am a member.
END QUOTE

And when she buries a man, that action concerns me: all mankind is of one
author, and is one volume; when one man dies, one chapter is not torn out of the
book, but translated into a better language; and every chapter must be so
translated; God employs several translators; some pieces are translated by age,
some by sickness, some by war, some by justice; but God's hand is in every
translation, and his hand shall bind up all our scattered leaves again for that
library where every book shall lie open to one another.

START QUOTE
    As therefore the bell that rings to a sermon calls not upon the preacher only,
    but upon the congregation to come, so this bell calls us all; but how much more
    me, who am brought so near the door by this sickness.
END QUOTE

There was a contention as far as a suit (in which both piety and dignity,
religion and estimation, were mingled), which of the religious orders should
ring to prayers first in the morning; and it was determined, that they should
ring first that rose earliest.

How it works

This script uses a single variable q which is 1 when we are in a quote and zero otherwise.

  • /^    / && q{print;next}

    If q is true and this line begins with 4 spaces, then print the line, skip the rest of the commands and jump to the next line.

  • q{print "END QUOTE"; q=0}

    If we get here when q is true, then this line does not begin with 4 spaces. This means that a quote has just ended and we print END QUOTE and reset q to false (0).

  • /^    /{print "START QUOTE"; q=1}

    If we get here with a line that begins with 4 spaces, then a quote has just started. We print START QUOTE and set q to true (1).

  • 1

    This is awk's cryptic shorthand for print the line.
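
For readability, the same rules can be laid out one per line. The sketch below is the one-liner above with comments, wrapped in a hypothetical `mark_quotes` shell function. The END rule is my addition, not part of the original command: it is an assumption about what you would want if the input stops while still inside a quote.

```shell
#!/bin/sh
# mark_quotes is a hypothetical wrapper name; the awk rules mirror the one-liner.
mark_quotes() {
    awk '
        /^    / && q { print; next }                  # still inside a quote
        q            { print "END QUOTE"; q = 0 }     # quote just ended
        /^    /      { print "START QUOTE"; q = 1 }   # quote just started
                     { print }                        # print the current line
        END          { if (q) print "END QUOTE" }     # added: quote ran to end of input
    ' "$@"
}

printf 'intro\n\n    quoted one\n    quoted two\n' | mark_quotes
```

Here the input ends on an indented line, so without the END rule the final END QUOTE would never be printed.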

John1024
  • Neat. I've intentionally started by learning sed, and I plan to get to awk next. This does look pretty straightforward (for some definition of straightforward, anyways). That said, any thoughts on how to do this right in sed? It seems like you're suggesting I might want branching, which of course sed supports as well. – ravron Jun 10 '16 at 19:04
  • @RileyAvron I've added a sed solution (and, yes, it involves looping). The key difference between sed and awk is that awk supports variables and arithmetic, and sed does not. Also, awk commands can be easier to read because awk supports if-then-else statements and `for` loops. Both sed and awk are excellent tools in their respective domains; complex logic is just easier in awk. – John1024 Jun 10 '16 at 19:38
  • @RileyAvron - here is what you need to know about sed to use it effectively in 99% of situations: `s/old_regexp/new_string/`. Now read the book Effective Awk Programming, 4th Edition, by Arnold Robbins. Don't waste your time learning a bunch of sed constructs that became obsolete in the mid-1970s when awk was invented. – Ed Morton Jun 13 '16 at 19:57
  • Thanks, @EdMorton! I'd been planning to read O'Reilly's sed & awk book. Do you recommend that, as well? Or should I eschew it in favor of your suggestion? – ravron Jun 13 '16 at 19:59
  • That book is very old and outdated, missing many useful modern awk features. You don't need a book to learn the stuff you should use sed for (s, g, and p with -n) and the Awk book I mention is the best/most current for learning awk. – Ed Morton Jun 13 '16 at 22:18

Try this:

#!/usr/bin/sed -f
/^    / {
    H
    d
  }
/^$/ {
  x
  s/^\n    /START QUOTE&/
  /    /s/$/\nEND QUOTE\n/
}

Lines starting with four spaces are appended to the hold space and deleted from the pattern space.

When the next blank line (/^$/) is found, x exchanges the contents of the hold space and the pattern space. We then add START QUOTE and END QUOTE to the beginning and the end of the block.

SLePort

This might work for you (GNU sed):

sed -r 'N;/^\n\s{4}\S/s//\nSTART QUOTE&/;/^\s{4}\S.*\n$/s//&END QUOTE\n/;t;P;D' file

Process the file in a running window of a pair of lines (N ...P;D). When the required pair matches prepend/append the required literal and then bail out (see t) and then resume with the next pair of lines.
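
If the N ... P;D idiom is unfamiliar, this stripped-down sketch may help. It is not the solution itself: it only shows the two-line window by printing each line that is immediately followed by an empty line. The helper name `lines_before_blank` is hypothetical, and `$!N` is used so that the last line is handled the same way regardless of the sed implementation.

```shell
#!/bin/sh
# lines_before_blank is a hypothetical name used only for this illustration.
lines_before_blank() {
    sed -n '
        # append the next line (unless on the last one): the window is two lines
        $!N
        # if the second line of the window is empty, print the first
        /\n$/P
        # drop the first line and restart with the remainder of the window
        D
    ' "$@"
}

printf 'a\nb\n\nc\n' | lines_before_blank
```

For this input only b is followed by an empty line, so that is the one line printed.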

An alternate method:

sed '/^    /{s/^/START QUOTE\n/;:a;n;/^    /ba;s/^/END QUOTE\n/}'  file
potong

sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:

$ cat tst.awk
!inBlock && /^    / { print "START QUOTE"; inBlock=1 }
inBlock && !/^    / { print "END QUOTE"; inBlock=0 }
{ print }

$ awk -f tst.awk file
PERCHANCE he for whom this bell tolls may be so ill, as that he knows not it
tolls for him; and perchance I may think myself so much better than I am, as
that they who are about me, and see my state, may have caused it to toll for me,
and I know not that.

START QUOTE
    The church is Catholic, universal, so are all her actions; all that she does
    belongs to all. When she baptizes a child, that action concerns me; for that
    child is thereby connected to that body which is my head too, and ingrafted into
    that body whereof I am a member.
END QUOTE

And when she buries a man, that action concerns me: all mankind is of one
author, and is one volume; when one man dies, one chapter is not torn out of the
book, but translated into a better language; and every chapter must be so
translated; God employs several translators; some pieces are translated by age,
some by sickness, some by war, some by justice; but God's hand is in every
translation, and his hand shall bind up all our scattered leaves again for that
library where every book shall lie open to one another.

START QUOTE
    As therefore the bell that rings to a sermon calls not upon the preacher only,
    but upon the congregation to come, so this bell calls us all; but how much more
    me, who am brought so near the door by this sickness.
END QUOTE

There was a contention as far as a suit (in which both piety and dignity,
religion and estimation, were mingled), which of the religious orders should
ring to prayers first in the morning; and it was determined, that they should
ring first that rose earliest.
Ed Morton