8

I'm trying to achieve the following replacement in python. Replace all html tags with {n} & create a hash of [tag, {n}]
Original string -> "<h> This is a string. </H><P> This is another part. </P>"
Replaced text -> "{0} This is a string. {1}{2} This is another part. {3}"

Here's my code. I've started with replacement but I'm stuck at the replacement logic as I cannot figure out the best way to replace each occurrence in a consecutive manner i.e with {0}, {1} and so on:

import re
text = "<h> This is a string. </H><p> This is another part. </P>"

num_mat = re.findall(r"(?:<(\/*)[a-zA-Z0-9]+>)",text)
print(str(len(num_mat)))

reg = re.compile(r"(?:<(\/*)[a-zA-Z0-9]+>)",re.VERBOSE)

phctr = 0
#for phctr in num_mat:
#    phtxt = "{" + str(phctr) + "}"
phtxt = "{" + str(phctr) + "}"
newtext = re.sub(reg,phtxt,text)

print(newtext)

Can someone help with a better way of achieving this? Thank you!

Ans
  • 105
  • 4

1 Answers1

7
import re
import itertools as it

text = "<h> This is a string. </H><p> This is another part. </P>"

cnt = it.count()
print re.sub(r"</?\w+>", lambda x: '{{{}}}'.format(next(cnt)), text)

prints

{0} This is a string. {1}{2} This is another part. {3}

Works for simple tags only (no attributes/spaces in tags). For extended tags, you have to adapt the regexp.

Also, not reinitializing cnt = it.count() will keep the numbering going on.

UPDATE to get a mapping dict:

import re
import itertools as it

text = "<h> This is a string. </H><p> This is another part. </P>"

cnt = it.count()
d = {}
def replace(tag, d, cnt):
    if tag not in d:
        d[tag] = '{{{}}}'.format(next(cnt))
    return d[tag]
print re.sub(r"(</?\w+>)", lambda x: replace(x.group(1), d, cnt), text)
print d

prints:

{0} This is a string. {1}{2} This is another part. {3}
{'</P>': '{3}', '<h>': '{0}', '<p>': '{2}', '</H>': '{1}'}
eumiro
  • 207,213
  • 34
  • 299
  • 261
  • Wow! Thank you very much. I need to learn more about lamda. Thanks for introducing it to me. :) Can you also help me to show the best way to create a hash/dictionary with the combination of string that is replaced and the replaced string i.e {'':'{1}','':'{2}'} What I am doing is - findall occurrences of match in a loop and iterate to place into a dictionary. Just would like to know a more efficient way - looking at the style of your code – Ans Nov 29 '12 at 10:02
  • Sorry for the late reply. Thank you very much! – Ans Dec 02 '12 at 15:27
  • 18 Ramen Noodles lines of code replaced by one strand of pearls. Thanks. – JayJay123 Sep 18 '19 at 02:51