I have an HTML page that lists a long index of topics and page numbers. I want to find all the page numbers and their anchor tag links and decrement the page numbers by 1.

Here is an example line in the HTML:

<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>

I'm trying to find the number 28 in both places and decrement by 1.

So far I've been able to find the number and replace it with itself, but I can't figure out how to decrement it. My code so far:

import fileinput
import re

for line in fileinput.input():
    line = re.sub(r'\>([0-9]+)\<', r'>\1<', line.rstrip())
    print(line)
jonrsharpe
John Gayle

2 Answers

You can use a replacement function while substituting:

import re
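s = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'
# The lambda receives each match and rebuilds it with the number reduced by one.
result = re.sub(r'page(\d+)">\1', lambda m: 'page{0}">{0}'.format(int(m.group(1)) - 1), s)
print(result)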

Result:

<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page27">27</a></p>

The pattern page(\d+)">\1 matches the word page followed by a number, then ">, then the same number again: \1 is a backreference to whatever the first pair of parentheses captured.

The substitution function receives a match object as its argument. We take the first group of the match (m.group(1)), which is the number as a string, convert it to an integer and decrement it. Then we rebuild the matched text using the decremented number in both places.
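As a rough sketch of the same idea with a named function instead of a lambda, applied to several entries at once. The names PAGE_REF and decrement_page and the second index entry are made up for illustration; they are not part of the original code.

```python
import re

# Same pattern as above: "page", a captured number, '">', then the same number again.
PAGE_REF = re.compile(r'page(\d+)">\1')

def decrement_page(match):
    """Rebuild the matched text with the page number reduced by one."""
    new_number = int(match.group(1)) - 1
    return 'page{0}">{0}'.format(new_number)

# Hypothetical two-entry index; the second entry is invented for this example.
html = (
    '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>\n'
    '<p class="index">budget, <a href="ch03.xhtml#page41">41</a></p>'
)

# sub() calls decrement_page once for every non-overlapping match in the text.
result = PAGE_REF.sub(decrement_page, html)
print(result)
```

Because the pattern does not depend on line boundaries, this should work on the whole file contents as a single string as well as line by line.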

JuniorCompressor
  • It might be nice to provide a bit more explanation, rather than inlining everything and leaving the OP to trawl through it... that said, I like the all-in-one approach to replacing both values. – jonrsharpe Apr 07 '15 at 21:39
  • Thank you, I would never have been able to figure that out on my own! I was able to incorporate your code into mine to get it to parse the XML line by line and write it out as a new file. – John Gayle Apr 10 '15 at 20:06
  • @JohnGayle it's supposed to work on the whole xml and not line by line...if it doesn't it may need a tweak – JuniorCompressor Apr 10 '15 at 22:47
  • @JuniorCompressor This is the code I ended up with: `p = re.sub(r'page(\d+)">\1', lambda m: 'page{0}">{0}'.format(int(m.group(1)) - 1), line.rstrip())`. Then I write the results to a text file. – John Gayle Apr 12 '15 at 23:45
Note that you can pass a function as the repl argument to re.sub; it is called with a single match object "for every non-overlapping occurrence of pattern", and its return value is used as the replacement string:

def decrement(match):
    """Decrement the number in the match."""
    return str(int(match.group()) - 1)

Note that this expects match.group() to be a string of digits; to match only the number, without the surrounding > and <, use lookarounds:

page_num = re.compile(r'''
    (?<=>) # a > before the group
    \d+    # followed by one or more digits
    (?=<)  # and a < after the group
''', re.VERBOSE)

This works as you require:

>>> page_num.sub(decrement, line)
'<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">27</a></p>'

and can be applied similarly for '#page28"'.
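A minimal sketch of what "applied similarly" could look like, assuming the href always ends in #page followed by the number and a closing double quote. The name href_num is hypothetical; decrement and page_num are repeated from above so the snippet runs on its own.

```python
import re

def decrement(match):
    """Decrement the number in the match."""
    return str(int(match.group()) - 1)

# Digits between '>' and '<', as in the pattern above.
page_num = re.compile(r'(?<=>)\d+(?=<)')
# Assumed variant: digits preceded by '#page' and followed by a double quote.
href_num = re.compile(r'(?<=#page)\d+(?=")')

line = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'

# Apply both substitutions; each pattern touches only its own occurrence of the number.
updated = href_num.sub(decrement, page_num.sub(decrement, line))
print(updated)
```

Unlike the single-pattern approach with a backreference, the two substitutions run independently here, so nothing checks that the href and the link text held the same number to begin with.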

However, note that you should generally use an actual HTML parser, not regular expressions, for parsing HTML (which isn't a regular language).
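For illustration only, here is a sketch of the parser route using the standard library's xml.etree.ElementTree. It assumes the input is a well-formed XML fragment like the example line; a full XHTML file normally declares a namespace, which would change the tag names you search for, and HTML that is not well-formed would need a real HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

line = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'

# Parse the fragment into an element tree (raises ParseError if it is not well-formed).
paragraph = ET.fromstring(line)

for link in paragraph.iter('a'):
    text = (link.text or '').strip()
    if not text.isdecimal():
        continue  # skip links whose text is not a plain page number
    new_number = str(int(text) - 1)
    link.text = new_number
    # Rewrite only a trailing "#page<number>" fragment in the href, if present.
    href = link.get('href', '')
    link.set('href', re.sub(r'(#page)\d+$', lambda m: m.group(1) + new_number, href))

parsed_result = ET.tostring(paragraph, encoding='unicode')
print(parsed_result)
```

The regular expression is still used, but only on the href attribute value, not on the markup itself.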

jonrsharpe
  • The `step=` argument is useless since there is absolutely no way to supply it. If you really wanted it, you could do: `def decrementer(step=1): return lambda match: str(int(match.group()) - step)` and then `page_num.sub(decrementer(), line)` or `page_num.sub(decrementer(2), line)`. – Matt Apr 07 '15 at 21:31
  • @Matt there are two ways to supply it: using `functools.partial`, or a `lambda`. That being said, I think your approach (aside from the `lambda` usage) is neater. – jonrsharpe Apr 07 '15 at 21:33
  • At which point it would be easier to inline the whole function as a lambda. Or you could do what I did. – Matt Apr 07 '15 at 21:34
  • By the way, I wish I could give you a second +1 for mentioning that HTML is not a regular language. Although it seems that in this case the OP has a limited data set in which regular expressions would do the job quicker without having to solve the problem for the general case. – Matt Apr 07 '15 at 21:43
  • @Matt I think that's often why people end up parsing HTML with regex - it seems simpler for the limited set they start with, but inevitably edge cases creep in and the expressions get hairier and then they'd probably have been best off biting the bullet and parsing properly to begin with! – jonrsharpe Apr 07 '15 at 21:46