Disclaimer: I agree that parsing HTML is best performed using an HTML parser. However, the poster has specifically asked for a regex solution, and this particular problem presents a good vehicle to demonstrate a clever (and little-known) regex technique that is quite handy.
But first, there is a logic error in the original function. It blindly performs its numerical replacement, which produces wrong output when the WIDTH is exactly half the HEIGHT. For example, given the following:
<img src = "/" width="10" height="20" />
The original posted program returns the following erroneous result:
<img src = "/" width="40" height="40" />
The problem is that WIDTH gets doubled twice. Additional logic is needed to guarantee correct replacement.
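To see how this kind of double-doubling can happen, here is a minimal sketch. The function `naive_double` is hypothetical (it is not the poster's code); it assumes each number is doubled by a plain string replacement over the whole tag:

```python
import re

def naive_double(tag):
    # Hypothetical reconstruction, not the poster's code: collect the
    # numbers first, then replace each one by its double everywhere
    # in the tag.
    for num in re.findall(r'\d+', tag):
        tag = tag.replace(num, str(int(num) * 2))
    return tag

result = naive_double('<img src = "/" width="10" height="20" />')
print(result)
```

The first pass turns `10` into `20`, so the second pass (which doubles every `20`) hits both attributes.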
A cool regex trick you may not know:
Here is a modified version of the original program which fixes the above-mentioned error and uses an improved (and commented) regex:
import re
s = '<img src = "/" width="10" height="111" />'
def a(x):
    # Callback: double whichever of WIDTH/HEIGHT was captured in this tag.
    b = x.group(0)
    if x.group(1):
        b = b.replace(x.group(1),
            "width=\"" + str(int(x.group(2))*2) + "\"")
    if x.group(3):
        b = b.replace(x.group(3),
            "height=\"" + str(int(x.group(4))*2) + "\"")
    return b
reobj = re.compile(r'''
    <img                        # Start of IMG tag.
    (?:                         # Group for multiple attributes.
      \s+                       # Attributes separated by whitespace.
      (?:                       # Group for attribute alternatives.
        (width\s*=\s*"(\d+)")   # $1: WIDTH attribute, $2 value.
      | (height\s*=\s*"(\d+)")  # $3: HEIGHT attribute, $4 value.
      |[^\s>]+)                 # Other IMG attributes.
    )+                          # One or more attributes.
    [^>]*>                      # End of IMG tag.
    ''', re.IGNORECASE | re.VERBOSE)
ss = re.sub(reobj, a, s)
print(ss)
Note that the WIDTH gets captured into groups $1 and $2 and HEIGHT into groups $3 and $4, even if their order is reversed in the target string. I wish I could say that I thought up this cool trick, but I didn't. I stole it from one of Steven Levithan's excellent blog posts: Capturing Multiple, Optional HTML Attribute Values. Pretty nifty, eh?
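A quick way to convince yourself of the order-independence is to run the same pattern against two tags whose attributes are swapped. The snippet below repeats the regex so it runs on its own; the names `img_re`, `groups_a` and `groups_b` are just for illustration:

```python
import re

# Same pattern as in the program above, repeated so this runs standalone.
img_re = re.compile(r'''
    <img                        # Start of IMG tag.
    (?:                         # Group for multiple attributes.
      \s+                       # Attributes separated by whitespace.
      (?:                       # Group for attribute alternatives.
        (width\s*=\s*"(\d+)")   # $1: WIDTH attribute, $2 value.
      | (height\s*=\s*"(\d+)")  # $3: HEIGHT attribute, $4 value.
      |[^\s>]+)                 # Other IMG attributes.
    )+                          # One or more attributes.
    [^>]*>                      # End of IMG tag.
    ''', re.IGNORECASE | re.VERBOSE)

# WIDTH before HEIGHT, then HEIGHT before WIDTH.
groups_a = img_re.search('<img src="/" width="10" height="111" />').groups()
groups_b = img_re.search('<img height="111" src="/" width="10" />').groups()

print(groups_a)
print(groups_b)
```

Both calls yield the same four groups, because each alternative keeps the value it captured on an earlier pass through the repeated group. Note that this relies on the regex engine retaining those captures; Python's `re` does, but not every engine behaves this way.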
A cleaner regex solution
Clever as that may be, it is still more complex than it needs to be for this job. I would keep it simple and just do two separate replace operations like so:
import re
s = '<img src = "/" width="10" height="111" />'
def a(x):
    # Keep the matched prefix (group 1) and double the number (group 2).
    return x.group(1) + str(int(x.group(2))*2)
ss = re.sub(r"(?i)(<img[^>]*?width\s*=\s*[\"'])(\d+)", a, s)
ss = re.sub(r"(?i)(<img[^>]*?height\s*=\s*[\"'])(\d+)", a, ss)
print(ss)
Smaller, cleaner, easier to read and probably the fastest solution. (Note that the callback function becomes trivial.)
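If you find yourself scaling more than two attributes, the pair of calls can be folded into a small helper. This is only a sketch: `scale_img_attr` and its `factor` parameter are names made up for illustration, and the pattern is the same one used above with the attribute name substituted in.

```python
import re

def scale_img_attr(html, attr, factor=2):
    # Hypothetical helper: multiply the integer value of one IMG attribute.
    # Group 1 is everything up to and including the opening quote,
    # group 2 is the number to scale.
    pattern = re.compile(
        r"(<img[^>]*?" + re.escape(attr) + r"\s*=\s*[\"'])(\d+)",
        re.IGNORECASE)
    return pattern.sub(
        lambda m: m.group(1) + str(int(m.group(2)) * factor), html)

# The troublesome case from the top of this answer: WIDTH is half of HEIGHT.
html = '<img src = "/" width="10" height="20" />'
out = scale_img_attr(scale_img_attr(html, "width"), "height")
print(out)
```

Because each pass targets one named attribute rather than a bare number, the width-is-half-the-height case no longer gets doubled twice.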