I'm trying to parse a number of web pages containing text, tables and HTML. Every page has a different number of paragraphs, and while every paragraph begins with an opening <div>, the closing </div> tags do not occur until the end. I just want to get the content, filtering out certain elements and replacing them with something else.

Desired result: text1 <b>text2</b> (table deleted) text3

Actual result: text1\n\ntext2some text heretext 3text2some text heretext 3 (table deleted)

from bs4 import BeautifulSoup

html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
    text1
    <div>
        <b>next2</b><table>some text here</table>text 3
    </div>
</div>"""

soup = BeautifulSoup(html, 'html5lib')
tags = soup.find('h3').find_all_next()
contents = ""
for tag in tags:
    if tag.name == 'table':
        contents += " (table deleted) "

    contents += tag.text.strip()

print(contents)
bluppfisk

1 Answer

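Don't use html5lib as the parser; use html.parser instead. html5lib applies the HTML5 parsing rules, which move loose text found inside a <table> out in front of the table, so the tree no longer matches your markup. That being said, you can access the "div" that immediately follows your "h3" tag using the CSS selector 'h3 + div' and the select_one method.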

From there, you can unwrap the nested "div" tag with the unwrap method and replace the "table" tag using the replace_with method:

In [107]: from bs4 import BeautifulSoup

In [108]: html = """
     ...: <h1>title</h1>
     ...: <h3>extra data</h3>
     ...: <div>
     ...:     text1
     ...:     <div>
     ...:         <b>next2</b><table>some text here</table>text 3
     ...:     </div>
     ...: </div>"""

In [109]: soup = BeautifulSoup(html, 'html.parser')

In [110]: my_div = soup.select_one('h3 + div')

In [111]: my_div
Out[111]: 
<div>
    text1
    <div>
<b>next2</b><table>some text here</table>text 3
    </div>
</div>

In [112]: my_div.div.unwrap()
Out[112]: <div></div>

In [113]: my_div
Out[113]: 
<div>
    text1

<b>next2</b><table>some text here</table>text 3

</div>

In [114]: my_div.table.replace_with('(table deleted)')
Out[114]: <table>some text here</table>

In [115]: my_div
Out[115]: 
<div>
    text1

<b>next2</b>(table deleted)text 3

</div>
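
The same steps can be collected into one reusable function. This is only a sketch, not a definitive implementation: the name extract_content and its placeholder parameter are made up for illustration, and it assumes every page follows the "h3 followed by a div" layout from the question. It also collapses runs of whitespace, which the session above does not do.

```python
from bs4 import BeautifulSoup

HTML = """
<h1>title</h1>
<h3>extra data</h3>
<div>
    text1
    <div>
        <b>next2</b><table>some text here</table>text 3
    </div>
</div>"""


def extract_content(html, placeholder=" (table deleted) "):
    """Hypothetical helper: return the markup inside the div after the h3."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the content always sits in the div right after the h3.
    outer = soup.select_one("h3 + div")
    if outer is None:
        return ""

    # Swap every table for the placeholder string.
    for table in outer.find_all("table"):
        table.replace_with(placeholder)

    # Drop the nested div tags but keep their children in place.
    for inner in outer.find_all("div"):
        inner.unwrap()

    # decode_contents() renders the children without the outer <div> itself;
    # split/join collapses the leftover indentation and newlines.
    return " ".join(outer.decode_contents().split())


print(extract_content(HTML))
```

Because find_all is used for both steps, the function also copes with pages that contain several tables or more than one level of nested "div" tags.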
styvane