Positive lookbehind with a matching group to be extracted

Question

testString = ("<h2>Tricks</h2>"
              "<a href=\"#\"><i class=\"icon-envelope\"></i></a>")
import re
re.sub("(?<=[<h2>(.+?)</h2>\s+])<a href=\"#\"><i class=\"icon-(.+?)\"></i></a>", "{{ \\1 @ \\2 }}", testString)

This produces: invalid group reference.

Making the replacement use only \\1 extracts just envelope, which makes me think that the lookbehind is ignored. Is there a way to extract something from a lookbehind?

I'm hoping to produce:

<h2>Tricks</h2>
{{ Tricks @ envelope }}

You created a character class (a set of characters that is allowed to match) consisting of `<`, `h`, `2`, `>`, etc. there. Don't use `[..]` unless you want to create a set of characters for a match (`\s`, `\d`, etc. are pre-built character classes). — Martijn Pieters, Feb 06 '13 at 15:16
Looks like you *really* want to use an HTML parser instead. Mixing regular expressions and HTML gets real painful, really really fast. — Martijn Pieters, Feb 06 '13 at 15:18
I am trying to write a complex F&R for Sublime Text editor, to replace some of the stuff within my files. And, without that `[..]`, `.search` found nothing. — tomsseisums, Feb 06 '13 at 15:22
Without the character class, the lookbehind is not allowed because you are not allowed to use variable-width patterns in a lookbehind (no `+` or `*`). *With* the character class, the lookbehind no longer matches what you think it matches. — Martijn Pieters, Feb 06 '13 at 15:23
@psycketom ST2 isn't stopping you from using an HTML library if it's more suited to your purposes for this F&R :) (of course, you could look at the `regex` library, which supports variable length look ahead/behind assertions) — Jon Clements, Feb 06 '13 at 15:23
@JonClements Could you point me in the direction of an HTML library? I have never seen such a plugin before. — tomsseisums, Feb 06 '13 at 15:30
Look at http://www.crummy.com/software/BeautifulSoup/ - have a play with that — Jon Clements, Feb 06 '13 at 15:32

score 1 · Accepted Answer · answered Feb 06 '13 at 16:05

Looks like you really want to use an HTML parser instead. Mixing regular expressions and HTML gets real painful, really really fast.

In your regular expression, you created a character class (a set of characters that is allowed to match) consisting of <, h, 2, >, etc. here:

[<h2>(.+?)</h2>\s+]

which could have been written as:

[<>h2()+.?/\s]

and it would match the same characters.

Don't use [..] unless you want to create a set of characters for a match (\s, \d, etc. are pre-built character classes).
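
To make the character-class point concrete, here is a minimal sketch (variable names are illustrative, not from the question). It checks that the two bracket expressions accept the same single characters, and that the parentheses inside the brackets are literals, so the whole pattern has only one capturing group. That would explain both the "invalid group reference" error for \2 and why \1 yields envelope.

```python
import re
import string

test_string = '<h2>Tricks</h2><a href="#"><i class="icon-envelope"></i></a>'

# Both expressions are character classes that match exactly one character.
original_class = re.compile(r"[<h2>(.+?)</h2>\s+]")
rewritten_class = re.compile(r"[<>h2()+.?/\s]")

# Compare them character by character over all printable ASCII.
same = all(
    bool(original_class.fullmatch(ch)) == bool(rewritten_class.fullmatch(ch))
    for ch in string.printable
)
print(same)

# The question's pattern: "(" and ")" inside [...] are literal characters,
# so the only real capturing group is the one after "icon-".
pattern = re.compile(
    r'(?<=[<h2>(.+?)</h2>\s+])<a href="#"><i class="icon-(.+?)"></i></a>'
)
print(pattern.groups)

# Referencing only group 1 works; the lookbehind just checks the preceding ">".
only_group_one = pattern.sub(r"\1", test_string)
print(only_group_one)

# Referencing group 2 fails because that group does not exist.
try:
    pattern.sub(r"{{ \1 @ \2 }}", test_string)
    group_two_error = None
except re.error as exc:
    group_two_error = str(exc)
print(group_two_error)
```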

However, even if you were to remove the brackets, the lookbehind would not be allowed: you cannot use variable-width patterns in a lookbehind (no + or *). So with the character class the lookbehind no longer matches what you think it matches, and without it the lookbehind is not permissible.
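
A short sketch of that restriction, plus one possible regex-only workaround that is not part of the original answer: capture the heading with an ordinary group instead of a lookbehind, and re-emit it in the replacement. The use of `\s*` (rather than `\s+`) and the `\n` in the replacement are assumptions made to fit the test string and the desired output from the question.

```python
import re

test_string = '<h2>Tricks</h2><a href="#"><i class="icon-envelope"></i></a>'

# Without the brackets the lookbehind contains "+", which makes it
# variable-width, so the re module refuses to compile the pattern.
try:
    re.compile(r'(?<=<h2>(.+?)</h2>\s+)<a href="#"><i class="icon-(.+?)"></i></a>')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# Workaround sketch: match the heading as a normal group and put it back.
#   group 1 = the whole <h2>...</h2> element
#   group 2 = the heading text
#   group 3 = the icon name
pattern = re.compile(
    r'(<h2>(.+?)</h2>)\s*<a href="#"><i class="icon-(.+?)"></i></a>'
)
result = pattern.sub(r"\1\n{{ \2 @ \3 }}", test_string)
print(result)
```

Because the heading is consumed by the match and then written back via \1, this only behaves like a lookbehind as long as nothing else needs to match the same heading. For anything beyond a one-off find and replace, the parser route is likely to be more robust.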

All in all, just use BeautifulSoup instead.
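
As a rough illustration of the parser approach, here is a sketch that uses the standard library's `html.parser` in place of BeautifulSoup, purely so that it runs without installing anything. The `HeadingIconParser` class and its attributes are hypothetical names made up for this example. It only extracts the heading and icon name; it does not rewrite the document.

```python
from html.parser import HTMLParser


class HeadingIconParser(HTMLParser):
    """Hypothetical helper: pairs each icon-* class with the latest <h2> text."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False        # True while inside an <h2> element
        self.last_heading = None  # text of the most recent <h2>
        self.pairs = []           # collected (heading, icon name) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.last_heading = ""
        elif tag == "i" and self.last_heading is not None:
            # attrs is a list of (name, value) tuples; value may be None.
            classes = (dict(attrs).get("class") or "").split()
            for cls in classes:
                if cls.startswith("icon-"):
                    self.pairs.append((self.last_heading, cls[len("icon-"):]))

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.last_heading += data


parser = HeadingIconParser()
parser.feed('<h2>Tricks</h2><a href="#"><i class="icon-envelope"></i></a>')
parser.close()

pairs = parser.pairs
formatted = ["{{ %s @ %s }}" % pair for pair in pairs]
print(formatted)
```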
