How to remove everything inside a specific html tag (along with the tag itself)

Question

Say you have this string:

text = """<p>Bla bla bla.</p><p>Blo blo blo<a 
href="http://www.example.com">bli bli</a>.</p><p>blu blu<br>
<span style="font-size: x-small;"><br>
content to remove</span></p>"""

My goal is to remove everything inside <span style="font-size: x-small;"><br>content to remove</span>, along with the opening and closing tags.

So I can only delete span tags (and their content) if the style attribute is "font-size: x-small;".

My code doesn't work. Here it is:

import re    
pattern = re.compile(r"\<span style='font-size: x-small;'\>.*?\</span\>")
new_text = pattern.sub(lambda match: match.group(0).replace(match.group(0),'') ,text)

I'd rather go with plain Python, because I know nothing about regex (as you can see...). But if regex is the way to go, I will take it.

score 1 · Answer 1 · answered Jun 27 '18 at 11:15

You could use find, indexing and string concatenation.

new_text = text[:text.find("<span")]+text[text.find("</span>")+7:]

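text.find("</span>")+7 looks for the index of the first occurrence of </span>, then adds 7 to that index, the length of the closing tag itself, so the second slice resumes right after it. The first slice stops at the first "<span", so everything between the two is dropped.

There are many ways to approach this. For any non-trivial HTML parsing I'd recommend Beautiful Soup.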

answered Jun 27 '18 at 11:15

Tom Rijntjes


Well it works, but I can only delete span tags (and its content) if `style="font-size: x-small;"`. I guess your code would remove all span tags content, which is no good for me. – Luis Rock Jun 27 '18 at 11:29
Only the first occurrence. See the documentation for find. You could expand the find clause with the style attribute to do so. – Tom Rijntjes Jun 27 '18 at 11:41
I saw the documentation, but I couldn't find a way to remove all span tags (along with content) when `style="font-size: x-small;"` – Luis Rock Jun 27 '18 at 20:08
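
The suggestion in the comments (put the style into the find clause and repeat for every occurrence) could be sketched roughly as below. This is only an illustration: the function name `remove_marked_blocks` and its parameters are made up for this example, and it assumes the opening tag is always written exactly as in the question and that the spans are not nested.

```python
def remove_marked_blocks(html,
                         opening='<span style="font-size: x-small;">',
                         closing="</span>"):
    """Drop every block that starts with `opening` and ends at the next `closing`.

    Hypothetical helper: relies on exact string matching, so a different
    attribute order, different spacing, or nested spans will not be handled.
    """
    kept = []
    pos = 0
    while True:
        start = html.find(opening, pos)
        if start == -1:
            # No more matching opening tags: keep the remainder as-is
            kept.append(html[pos:])
            break
        end = html.find(closing, start)
        if end == -1:
            # Opening tag without a closing tag: leave the rest untouched
            kept.append(html[pos:])
            break
        kept.append(html[pos:start])   # text before the span
        pos = end + len(closing)       # resume right after </span>
    return "".join(kept)
```

Calling `remove_marked_blocks(text)` on the string from the question would leave spans with any other style (or no style) in place. For anything less regular than this, a real HTML parser is the safer choice.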

score 1 · Answer 2 · answered Jun 27 '18 at 21:32

I found a way with Beautiful Soup:

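from bs4 import BeautifulSoup

# `text` is the HTML string from the question
soup = BeautifulSoup(text, 'html.parser')
# Match only spans whose style attribute contains the target declaration
spans_to_delete = soup.find_all('span', style=lambda value: value and 'font-size: x-small' in value)

if not spans_to_delete:
    print('No span with desired style found')

for span in spans_to_delete:
    span.decompose()  # remove the tag together with everything inside it

# Defined whether or not anything was removed
new_text = str(soup)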

Actually this thread's first answer gave me the directions.

score 0 · Answer 3 · answered Jun 27 '18 at 11:33

I would go with regex.

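The regex <span(.*?)</span> matches a span element, from its opening tag through the first closing tag that follows (the original greedy form \<span(.*)span> would run on to the last closing tag in the string). Note that it matches every span, whatever its style. Try this (Java):

    String text = "<p>Bla bla bla.</p><p>Blo blo blo<a "
        + "href=\"http://www.example.com\">bli bli</a>.</p><p>blu blu<br><span "
        + "style=\"font-size: x-small;\"><br>content to remove</span></p>";
    // (?s) lets . match line breaks; .*? stops at the first closing tag
    text = text.replaceAll("(?s)<span(.*?)</span>", "");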

answered Jun 27 '18 at 11:33

lbalso


OP asked specifically to avoid regex – Tom Rijntjes Jun 27 '18 at 11:41
No he didn't. He just said he prefers python, but he would consider a regex solution. – lbalso Jun 27 '18 at 11:45
As I said before, I can only delete span tags (and its content) if style="font-size: x-small;". I guess your code would remove all span tags content, which is no good for me. But, yes, I would consider regex or maybe beautifulsoup. – Luis Rock Jun 27 '18 at 20:06
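
For the regex route in Python, restricted to the style the asker cares about, something like the following might work. It is a sketch, not a general HTML solution: the function name `strip_small_spans` is invented here, and it assumes the opening tag appears exactly as `<span style="font-size: x-small;">` with no nested spans. The pattern in the question fails for two reasons: it uses single quotes around the attribute value while the text uses double quotes, and `.` does not match the line break inside the span unless `re.DOTALL` is set.

```python
import re

# Double quotes to match the markup; re.DOTALL so that . also matches newlines;
# .*? is non-greedy, so the match ends at the first </span>.
SMALL_SPAN = re.compile(r'<span style="font-size: x-small;">.*?</span>', re.DOTALL)


def strip_small_spans(html):
    """Remove every span with the exact style above, tags included (hypothetical helper)."""
    return SMALL_SPAN.sub("", html)
```

With the string from the question, `new_text = strip_small_spans(text)` removes the span and its content and leaves the surrounding paragraphs untouched. Spans with other styles are not affected.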
