1

Say you have this string:

text = """<p>Bla bla bla.</p><p>Blo blo blo<a 
href="http://www.example.com">bli bli</a>.</p><p>blu blu<br>
<span style="font-size: x-small;"><br>
content to remove</span></p>"""

My goal is to remove everything inside <span style="font-size: x-small;"><br>content to remove</span>, along with the opening and closing tags.

So I should only delete span tags (and their content) if the style attribute is "font-size: x-small;".

My code doesn't work. Here it is:

import re    
pattern = re.compile(r"\<span style='font-size: x-small;'\>.*?\</span\>")
new_text = pattern.sub(lambda match: match.group(0).replace(match.group(0),'') ,text) 

I'd rather go with plain Python, because I know nothing about regex (as you can see...). But if regex is the way to go, I will take it.

Luis Rock

3 Answers

1

You could use find, indexing and string concatenation.

new_text = text[:text.find("<span")]+text[text.find("</span>")+7:]

text.find("</span>")+7 looks up the index of the first occurrence of </span>, then adds 7 to that index, the length of the tag itself.
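
To remove only the spans with a specific style, and all of them rather than just the first, the same find idea can be put in a loop that searches for the full opening tag. This is just a sketch: the helper name remove_tagged_spans and the sample string are made up for illustration, and it assumes the opening tag is spelled exactly as shown and that spans are not nested.

```python
def remove_tagged_spans(html,
                        open_tag='<span style="font-size: x-small;">',
                        close_tag="</span>"):
    """Remove every open_tag ... close_tag block, tags included.

    Hypothetical helper: assumes an exact-match opening tag and no nested spans.
    """
    pieces = []
    pos = 0
    while True:
        start = html.find(open_tag, pos)
        if start == -1:
            break  # no more matching opening tags
        end = html.find(close_tag, start + len(open_tag))
        if end == -1:
            break  # unclosed tag: leave the rest untouched
        pieces.append(html[pos:start])   # keep everything before the span
        pos = end + len(close_tag)       # resume right after </span>
    pieces.append(html[pos:])            # keep the tail
    return "".join(pieces)


# Illustrative input: two spans to remove, one span with another style to keep
sample = ('<p>a<span style="font-size: x-small;">x</span>b'
          '<span style="color: red;">keep</span>c'
          '<span style="font-size: x-small;">y</span></p>')
cleaned = remove_tagged_spans(sample)
```

Spans with any other style are left alone because the search string includes the style attribute. If the attribute order or spacing can vary, plain string matching gets fragile and a parser is the safer choice.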

There are many ways to approach this. For any non-trivial HTML parsing I'd recommend Beautiful Soup.

Tom Rijntjes
  • Well, it works, but I can only delete span tags (and their content) if `style="font-size: x-small;"`. I guess your code would remove the content of all span tags, which is no good for me. – Luis Rock Jun 27 '18 at 11:29
  • Only the first occurrence. See the documentation for find. You could expand the find clause with the style tags to do so. – Tom Rijntjes Jun 27 '18 at 11:41
  • I saw the documentation, but I couldn't find a way to remove all span tags (along with content) when `style="font-size: x-small;"` – Luis Rock Jun 27 '18 at 20:08
1

I found a way with Beautiful Soup:

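from bs4 import BeautifulSoup

# Parse the HTML string from the question
soup = BeautifulSoup(text, 'html.parser')
# Match <span> tags whose style attribute contains the target declaration
spans_to_delete = soup.find_all('span', style=lambda value: value and 'font-size: x-small' in value)

if spans_to_delete:
    for span in spans_to_delete:
        # extract() removes the tag and its contents from the tree
        span.extract()

    new_text = str(soup)
else:
    print('No span with desired style found')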

Actually this thread's first answer gave me the directions.

Luis Rock
0

I would go with regex.

The regex \<span(.*)span> matches everything inside the span tags, including the opening and closing tags. Try this:

    String text = "<p>Bla bla bla.</p><p>Blo blo blo<a "
        + "href=\"http://www.example.com\">bli bli</a>.</p><p>blu blu<br><span "
        + "style=\"font-size: x-small;\"><br>content to remove</span></p>";
    text = text.replaceAll("\\<span(.*)span>", "");
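
Since the question is about Python, here is a rough Python sketch of the same regex idea, restricted to the target style. It assumes the opening tag is written exactly as in the question (double quotes, style as the only attribute) and that spans are not nested. The non-greedy .*? stops at the first closing tag, and re.DOTALL lets . match the newline inside the span.

```python
import re

text = """<p>Bla bla bla.</p><p>Blo blo blo<a
href="http://www.example.com">bli bli</a>.</p><p>blu blu<br>
<span style="font-size: x-small;"><br>
content to remove</span></p>"""

# Double quotes around the attribute value, as in the input string.
# .*? is non-greedy; re.DOTALL makes '.' match newlines too.
pattern = re.compile(r'<span style="font-size: x-small;">.*?</span>', re.DOTALL)

new_text = pattern.sub("", text)
```

The pattern in the question fails for two reasons this sketch avoids: it looks for single quotes around the style value while the string uses double quotes, and without re.DOTALL the . cannot cross the line break inside the span.
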
lbalso
  • OP asked specifically to avoid regex – Tom Rijntjes Jun 27 '18 at 11:41
  • No, he didn't. He just said he prefers Python, but he would consider a regex solution. – lbalso Jun 27 '18 at 11:45
  • As I said before, I can only delete span tags (and their content) if style="font-size: x-small;". I guess your code would remove the content of all span tags, which is no good for me. But, yes, I would consider regex or maybe Beautiful Soup. – Luis Rock Jun 27 '18 at 20:06