1

I have to make a lot of changes like this one:

<img src = "/" height="111" width="10" />

to

<img src = "/" height="222" width="20" />

So I want to use a Python regular expression. This is my code:

import re

s = '<img src = "werwerwe" height="111" width="10" />'

def a(x):
    print x.group(2)
    print x.group(4)

ss = re.sub(r'''<img.*(width\s*="?(\d+)"?)*\s*(height\s*="?(\d+)"?)*''',a, s)

print ss

What can I do?

Thanks.

Update:

It works now:

import re

s = '<img src = "/" height="111" width="10" />'


def a(x):
    b = x.group(0)
    b = b.replace(x.group(1),str(int(x.group(1))*2))
    b = b.replace(x.group(2),str(int(x.group(2))*2))
    return b

ss = re.sub(r'''<img.*?height=\"(\d+)\".*?width=\"(\d+)\"[^>]*>''',a, s)

print ss
zjm1126
  • That will take much more time, my boss told me. – zjm1126 May 04 '11 at 03:19
  • Your boss isn't concerned with you spending more time to find the wrong solution, is he? – Imran May 04 '11 at 03:20
  • Friends don't let friends parse HTML with regular expressions. – PaulMcG May 04 '11 at 03:41
  • Have you considered looking for a new job? Or at least a new boss?! – David Heffernan May 04 '11 at 03:52
  • I think we may all have missed the point that he was looking for a fast solution ("but my boss told me it would spend much time"), since he already had one that worked. (At least, none of the answers so far have addressed this.) – Jeremy May 04 '11 at 04:18
  • At some point, people should just be fired if they don't listen to best-practice advice or constantly ignore good advice from people who really know better than the OP. –  May 04 '11 at 04:37
  • Are you generating the original HTML yourself? – Winston Ewert May 05 '11 at 02:05

6 Answers

4

Don't use regular expressions to parse HTML. Use BeautifulSoup instead:

>>> from BeautifulSoup import BeautifulSoup
>>> ht = '<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>'
>>> soup = BeautifulSoup(ht)
>>> soup
<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>
>>> soup.findAll('img')
[<img src="foo/img.png" height="111" width="22" />, <img src="foo/img2.png" height="32" width="44" />]
>>> for img in soup.findAll('img'):
...     ht = int(img['height'])
...     wi = int(img['width'])
...     img['height'] = str(ht * 2)
...     img['width'] = str(wi * 2)
...     
... 
>>> print soup.prettify()
<html>
 <head>
  <title>
   foo
  </title>
 </head>
 <body>
  <p>
   whatever:
   <img src="foo/img.png" height="222" width="44" />
  </p>
  <ul>
   <li>
    <img src="foo/img2.png" height="64" width="88" />
   </li>
  </ul>
 </body>
</html>
>>> 
jonesy
  • This method is good, but it takes too much time. This is a web application: in China it shows width=100 height=10 and in Japan it shows width=200 height=20, so maybe I must use a regular expression. Thanks. – zjm1126 May 04 '11 at 02:47
  • How do regular expressions represent an improvement over this method given those conditions? I don't understand how having conditions that affect the actual numbers affects the mechanics of actually making the change. Can you explain further? – jonesy May 04 '11 at 03:06
2

Don't use regular expressions when dealing with HTML. Parse it properly with something like lxml.

import lxml.html

html = '<img src = "werwerwe" height="111" width="10" />'

etree = lxml.html.fromstring(html)

images = etree.xpath('//img')
for image in images:
    h = int(image.attrib['height'])
    w = int(image.attrib['width'])
    image.attrib['height'] = str(h*2)
    image.attrib['width'] = str(w*2)

print lxml.html.tostring(etree)

Gives:

<img src="werwerwe" height="222" width="20">

Acorn
2

Disclaimer: I agree that parsing HTML is best performed using an HTML parser. However, the poster has specifically asked for a regex solution, and this particular problem presents a good vehicle to demonstrate a clever (and little-known) regex technique that is quite handy.

But first, there is a logic error in the original function. It blindly performs its numerical replacement on the whole tag, which gives erroneous results when the HEIGHT is exactly half the WIDTH (the original regex expects HEIGHT before WIDTH). E.g. given the following:

<img src = "/" height="10" width="20" />

The original posted program returns the following erroneous result:

<img src = "/" height="40" width="40" />

The problem is that HEIGHT gets doubled twice. Additional logic is needed to guarantee correct replacement.
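The double replacement can be reproduced with a short sketch. It is written in Python 3 syntax (the posts here use Python 2), and the names `PATTERN`, `double_naive`, and `double_by_span` are made up for this illustration. The pattern is the one from the question's update, which expects HEIGHT before WIDTH, so the input is chosen such that the doubled first value equals the second. `double_by_span` is only one possible fix, assuming each number should be replaced at the position where it was captured:

```python
import re

# Pattern from the question's update (HEIGHT must come before WIDTH).
PATTERN = r'<img.*?height="(\d+)".*?width="(\d+)"[^>]*>'

def double_naive(match):
    # Same logic as the question's callback: str.replace touches every
    # occurrence of the digits, not just the captured attribute value.
    tag = match.group(0)
    tag = tag.replace(match.group(1), str(int(match.group(1)) * 2))
    tag = tag.replace(match.group(2), str(int(match.group(2)) * 2))
    return tag

def double_by_span(match):
    # Hypothetical fix: splice each doubled value in at its captured position.
    tag = match.group(0)
    offset = match.start(0)
    # Group 2 lies to the right of group 1, so handle it first; that way
    # the positions of group 1 are still valid after the first splice.
    for group in (2, 1):
        start = match.start(group) - offset
        end = match.end(group) - offset
        tag = tag[:start] + str(int(match.group(group)) * 2) + tag[end:]
    return tag

s = '<img src = "/" height="10" width="20" />'
print(re.sub(PATTERN, double_naive, s))    # HEIGHT is doubled twice
print(re.sub(PATTERN, double_by_span, s))  # each value is doubled once
```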

A cool regex trick you may not know:

Here is a modified version of the original program which fixes the above mentioned error and includes a (commented) version of an improved regex:

import re
s = '<img src = "/" width="10" height="111"  />'

def a(x):
    b = x.group(0)
    if x.group(1):
        b = b.replace(x.group(1),
            "width=\""+ str(int(x.group(2))*2) +"\"")
    if x.group(3):
        b = b.replace(x.group(3),
            "height=\""+ str(int(x.group(4))*2) +"\"")
    return b

reobj = re.compile(r'''
    <img                        # Start of IMG tag.
    (?:                         # Group for multiple attributes.
      \s+                       # Attributes separated by whitespace.
      (?:                       # Group for attribute alternatives.
        (width\s*=\s*"(\d+)")   # $1: WIDTH attribute, $2 value.
      | (height\s*=\s*"(\d+)")  # $3: HEIGHT attribute, $4 value.
      |[^\s>]+)                 # Other IMG attributes.
    )+                          # One or more attributes.
    [^>]*>                      # End of IMG tag.
    ''', re.IGNORECASE | re.VERBOSE)

ss = re.sub(reobj, a, s)

print ss

Note that the WIDTH gets captured into groups $1 and $2 and HEIGHT into groups $3 and $4, even if their order is reversed in the target string. I wish I could say that I thought up this cool trick, but I didn't. I stole it from one of Steven Levithan's excellent blog posts: Capturing Multiple, Optional HTML Attribute Values. Pretty nifty eh?
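The order-independence claim can be checked with a small sketch. It uses Python 3 syntax rather than the Python 2 of the program above, and `IMG_ATTRS` and `dimensions` are names invented here. The pattern itself is the one from the program above with the comments removed:

```python
import re

IMG_ATTRS = re.compile(r'''
    <img
    (?:
      \s+
      (?:
        (width\s*=\s*"(\d+)")    # groups 1 and 2
      | (height\s*=\s*"(\d+)")   # groups 3 and 4
      | [^\s>]+
      )
    )+
    [^>]*>
    ''', re.IGNORECASE | re.VERBOSE)

def dimensions(tag):
    # Return (width, height) as strings, whatever their order in the tag.
    m = IMG_ATTRS.search(tag)
    return m.group(2), m.group(4)

for tag in ('<img src="/" width="10" height="111" />',
            '<img src="/" height="111" width="10" />'):
    print(dimensions(tag))
```

Both tags should print the same pair, because Python's `re` keeps the last value each group captured across repetitions of the enclosing `(?: ... )+`.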

A cleaner regex solution

Clever as that may be, it is still more complex than it needs to be for this job. I would keep it simple and just do two separate replace operations like so:

import re
s = '<img src = "/" width="10" height="111"  />'

def a(x):
    return x.group(1) + str(int(x.group(2))*2)

ss = re.sub(r"(?i)(<img[^>]*?width\s*=\s*[\"'])(\d+)",a, s)
ss = re.sub(r"(?i)(<img[^>]*?height\s*=\s*[\"'])(\d+)",a, ss)

print ss

Smaller, cleaner, easier to read and probably the fastest solution. (Note that the callback function becomes trivial.)
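If the factor has to vary (the comments mention different sizes per region), the two-pass approach might be wrapped in a function, roughly as follows. This is a sketch in Python 3 syntax; `scale_img_attribute` and `scale_images` are hypothetical names, and `factor` is assumed to be an integer:

```python
import re

def scale_img_attribute(html, name, factor):
    # Same pattern shape as the two re.sub calls above, with the
    # attribute name filled in. Group 1 is everything up to the opening
    # quote, group 2 is the numeric value.
    pattern = re.compile(
        r"(<img[^>]*?%s\s*=\s*[\"'])(\d+)" % re.escape(name),
        re.IGNORECASE)
    return pattern.sub(
        lambda m: m.group(1) + str(int(m.group(2)) * factor), html)

def scale_images(html, factor=2):
    # One pass per attribute, as in the answer.
    for name in ("width", "height"):
        html = scale_img_attribute(html, name, factor)
    return html

print(scale_images('<img src = "/" width="10" height="111"  />'))
print(scale_images('<img src = "/" height="111" width="10" />', factor=3))
```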

ridgerunner
1

Nothing good will come from attempting to use regex to parse HTML. No matter what you do, it will eventually break.

So, use an HTML parser like Python's HTMLParser. It will decode all of the HTML text, and you just need to print it back out with your changes.

On another note, modifying html like you are doing looks suspicious. You are probably doing something the very hard way.
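As a rough illustration of the HTMLParser approach, here is a minimal sketch that assumes Python 3's `html.parser` module (in Python 2 the module is named `HTMLParser`). The class `ImgScaler` and the function `scale_images` are invented for this example. It re-emits every tag it sees, so attribute quoting, tag case, and spacing inside tags may come out normalized rather than byte-for-byte identical to the input:

```python
from html import escape
from html.parser import HTMLParser

class ImgScaler(HTMLParser):
    """Re-emit HTML, scaling numeric width/height on <img> tags."""

    def __init__(self, factor=2):
        # convert_charrefs=False: entities in text are passed to
        # handle_entityref/handle_charref instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.factor = factor
        self.out = []

    def _render(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:            # attribute without a value
                parts.append(name)
                continue
            if tag == "img" and name in ("width", "height") and value.isdecimal():
                value = str(int(value) * self.factor)
            # Attribute values arrive unescaped, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

def scale_images(html, factor=2):
    parser = ImgScaler(factor)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(scale_images('<p>Logo: <img src="a.png" height="111" width="10" /></p>'))
```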

Winston Ewert
1

Once again, that task is best solved with an HTML parser, as suggested here and here.


If you still want to use a regular expression for that purpose, you can use this one instead:

<img.*?(width|height)=\"(\d+)\".*?(width|height)=\"(\d+)\"

For example:

In text: <img src = "/" width="10" height="111"/> will match the following groups:

  • Group 1: "width"
  • Group 2: "10"
  • Group 3: "height"
  • Group 4: "111"

In text: <img src = "/" height="111" width="10"/> it will match:

  • Group 1: "height"
  • Group 2: "111"
  • Group 3: "width"
  • Group 4: "10"

Now it matches whether width comes before height or vice versa, and I think the 4 groups give you enough info when doing the replacement.

Edit:
I captured the names height and width so you know which value matched first (otherwise, if you obtain 111 and 10, you won't know which one is the height and which the width). I don't think that's necessary in your case, because all you have to do is double both values, but it could be useful if you want to scale height and width by different amounts.
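One way the four groups might be used for the replacement is sketched below, in Python 3 syntax. The `FACTORS` table and the `scale` callback are hypothetical, and the different factors for width and height are chosen only to show why capturing the attribute names can matter:

```python
import re

PAIR = re.compile(r'<img.*?(width|height)="(\d+)".*?(width|height)="(\d+)"')

# Hypothetical per-attribute factors, for illustration only.
FACTORS = {"width": 2, "height": 3}

def scale(match):
    tag = match.group(0)
    offset = match.start(0)
    # Handle the right-hand pair first so the positions of the
    # left-hand pair stay valid after the first splice.
    for name_group, value_group in ((3, 4), (1, 2)):
        factor = FACTORS[match.group(name_group)]
        start = match.start(value_group) - offset
        end = match.end(value_group) - offset
        tag = tag[:start] + str(int(match.group(value_group)) * factor) + tag[end:]
    return tag

print(PAIR.sub(scale, '<img src = "/" width="10" height="111"/>'))
print(PAIR.sub(scale, '<img src = "/" height="111" width="10"/>'))
```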

Oscar Mederos
0

Try with the following regex:

<img.*?height=\"(\d+)\".*?width=\"(\d+)\"

Group 1 will capture the height and Group 2 the width.

Oscar Mederos
  • Cthulu approves of this solution. – Acorn May 04 '11 at 03:02
  • @Acorn I totally agree with you and @jonesy :) Take a look at all the answers I have answered about website scraping and parsing HTML, mainly in C# with HtmlAgilityPack, but to be honest, I'm not familiar with HTML Parsers in Python (of course, I know the existence of `BeautifulSoup`, probably the most famous when doing website scraping), and sometimes people (like in this time) want to go through the incorrect way. I was going to edit my question to tell him not to parse HTMLs with Regular Expressions, but then you did it ;) – Oscar Mederos May 04 '11 at 03:12
  • @Acorn In fact, take a look at the question the OP is asking now: http://stackoverflow.com/questions/5878079/how-to-replace-this-string-using-python. That wouldn't have happened if he had used an HTML parser :) – Oscar Mederos May 04 '11 at 03:16