
I want to replace part of the following html text (excerpt of a huge file), to update old forum formatting (resulting from a very bad forum porting job done 2 years ago) to regular phpBB formatting:

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>

this should be filtered into:

    [quote=User]

I used the following regex in sed

    s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g

This works on the given example, but in the actual file several quotes like this can appear on one line. In that case sed is too greedy and collapses everything from the first match to the last into a single [quote=...] tag. I cannot seem to make it replace every occurrence of this pattern in the line... (I don't think there are any nested quotes, but that would make it even more difficult.)

Ewout

3 Answers


You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.

The easiest way to do this is simply to use Perl in the first place.

If you have an existing sed script, you can translate it into Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. In most cases the \1 is grandfathered, but in general you want $1 there:

s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;

Notice I have removed the backslash from the front of the parens. Another advantage of using Perl is that it uses the sane egrep-style regexes (like awk), not the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes all over the place.

Another advantage to using Perl is you can use paired, nestable delimiters to avoid ugly backslashes. For example:

s{<blockquote.*?written by (.*?)</i>}
 {[quote=$1]}g;

Other advantages include that Perl gets along excellently well with UTF-8 (now the Web’s majority encoding form), and that you can do multiline matches without the extreme pain that sed requires for that. For example:

$ perl -CSD -00 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...

The -CSD makes it treat stdin, stdout, and files as UTF-8. The -00 puts Perl in paragraph mode, so it reads blank-line-separated chunks instead of single lines (to slurp each file whole, use -0777 instead), and the /s makes the dot cross newline boundaries as need be.

tchrist
  • Awesome! Funny thing is I started with Perl in the first place, but because it is supposedly much faster, I was seduced to use sed... Not knowing it was so limited in this. Not sure if -00 is a good idea though, since it's a 500M file (sql containing html, I was incomplete in the first post). Thank you so much!!! – Ewout Jun 10 '12 at 07:39

I don't think sed supports non-greedy match. You can try perl though:

perl -pe 's/<blockquote.*?written by \(.*\)<\/i>/[quote=\1]/g' filename
Hari Menon
  • Good idea, but that won’t quite work just the way you have it: you forgot to switch to *egrep*-style patterns with fewer backslashes, so you didn’t capture anything. See my answer. – tchrist Jun 09 '12 at 21:04

This might work for you:

sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g' file

Explanation:

  • If a line doesn't contain the pattern then skip it. /<blockquote.*written by .*<\/i>/!b
  • Change the front of the pattern into a newline globally throughout the line. s/<blockquote/\n/g
  • Globally replace the newline followed by the remaining pattern using a [^\n]* instead of .*. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
  • Revert those newlines not changed to the original front pattern. s/\n/\<blockquote/g
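The steps above can be tried on a small input. This is a sketch only, assuming GNU sed (the `\n` escapes in the replacement and inside `[^\n]` are not portable to every sed) and a made-up sample line; `sample` and `result` are illustrative names.

```shell
# Hypothetical sample line with two old-style quotes (not from the real file).
sample='<blockquote id="quote"><i>written by Alice</i>hello <blockquote id="quote"><i>written by Bob</i>world'

# Same script as above: mark each <blockquote with a newline, replace within
# newline-free stretches, then restore any markers that were not consumed.
result=$(printf '%s\n' "$sample" |
  sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g')

printf '%s\n' "$result"
```

With GNU sed this prints `[quote=Alice]hello [quote=Bob]world`. One caveat: `[^\n]*` is still greedy within each stretch, so if the text after a user name contains another `</i>` before the next `<blockquote`, the captured name would extend up to that later `</i>`.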
potong