Split HTML string into sections based on a specific tag in Python

Question

I'm fairly new to Python. I spent days on the forum, and the answers to my question exist, but only for JavaScript.

I have an HTML page with news, and I want the content to be split into a new section every time there is an H4 tag. I want to name each section based on the content of the H4 string and then later pull the sections into separate emails (but that's for later). I can't seem to figure out how to create these sections. Below is what the HTML looks like. Any advice is very much appreciated, and sorry if my question is rudimentary. Thank you!

'<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>             
<tr>
    <td><p>* Bolivia&bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>             
<tr>'

score 0 · Answer 1 · answered Sep 30 '19 at 14:06

You can either do it "manually" by using Regular Expressions (https://en.wikipedia.org/wiki/Regular_expression) or use a library that's built specifically for parsing HTML (https://pypi.org/project/beautifulsoup4/). If you plan on doing more HTML parsing, I'd recommend using the purpose-built library. Both take a bit of getting used to if you're not familiar with them, but both are worth learning.

import re
from bs4 import BeautifulSoup

html_code = """<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>             
<tr>
    <td><p>* Bolivia&bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>             
<tr>"""

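print('* with regex:')
# Non-greedy match so each <h4>...</h4> pair is captured separately.
print(re.findall(r'<h4>(.*?)</h4>', html_code))

print('* with beautiful soup:')
# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html_code, 'html.parser')
tmp = soup.find_all('h4')
for val in tmp:
    print(val.contents)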

will output

* with regex:
['Bolivia bla bla', 'BRAZIL: bla bla']
* with beautiful soup:
['Bolivia bla bla']
['BRAZIL: bla bla']
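
The snippet above only extracts the H4 titles. To actually split the string into named sections, one option is re.split with a capturing group, which keeps the titles in the result. This is a minimal sketch using only the standard library; the function name split_by_h4 is made up for illustration, and it assumes the h4 tags carry no attributes.

```python
import re

html_code = """<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>
<tr>
    <td><p>* Bolivia&bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>
<tr>"""


def split_by_h4(html):
    """Map each <h4> title to the markup that follows it, up to the next <h4>."""
    # With a capturing group, re.split keeps the titles in the result:
    # [before_first_h4, title_1, body_1, title_2, body_2, ...]
    parts = re.split(r'<h4>(.*?)</h4>', html, flags=re.DOTALL)
    titles, bodies = parts[1::2], parts[2::2]
    return {title.strip(): body for title, body in zip(titles, bodies)}


sections = split_by_h4(html_code)
for title, body in sections.items():
    # Crude tag stripping, for display only.
    text = ' '.join(re.sub(r'<[^>]+>', ' ', body).split())
    print(title, '->', text)
```

Note that everything up to the next h4 lands in the section, so the following h3 heading ("Brazil") ends up at the end of the Bolivia section. If that matters, a real HTML parser is probably the safer route.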

score 0 · Answer 2 · answered Sep 30 '19 at 14:59

You can use itertools.groupby:

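import itertools, re
from bs4 import BeautifulSoup as soup

# s holds the HTML string from the question.
# Take the first h3/h4 found in each <td>, skipping cells that have neither.
r = list(filter(None, [i.find(re.compile('h3|h4')) for i in soup(s, 'html.parser').find_all('td')]))
# Group consecutive headings by whether they are h4, so h3 runs and h4 runs alternate.
result = [(a, list(b)) for a, b in itertools.groupby(r, key=lambda x: x.name == 'h4')]
# Join each h3 run with the h4 run that follows it.
final_result = [[b.text for b in result[i][-1]] + [b.text for b in result[i+1][-1]] for i in range(0, len(result), 2)]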

Output:

[['Andean', 'Bolivia bla bla'], ['Brazil', 'BRAZIL: bla bla']]
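
The same grouping can also be written as an explicit loop, which may be easier to follow than the index arithmetic on the groupby result. This is only a sketch and swaps in the standard-library html.parser module instead of BeautifulSoup; the names HeadingCollector and group_headings are made up for illustration. Each h3 starts a new group and every h4 is appended to the most recent group, so a trailing h3 with no h4 after it simply becomes a one-element group.

```python
from html.parser import HTMLParser

s = """<td><h3>Andean</h3><hr/></td>
</tr><tr>
    <td><h4>Bolivia bla bla</h4></td>
</tr>
<tr>
    <td><p>* Bolivia&bla bla text text </p></td>
</tr><tr>
    <td><h3>Brazil</h3><hr/></td>
</tr><tr>
    <td><h4>BRAZIL: bla bla</h4></td>
</tr>
<tr>"""


class HeadingCollector(HTMLParser):
    """Collect (tag, text) pairs for every <h3> and <h4>, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag currently being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h3', 'h4'):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, ''.join(self._buffer).strip()))
            self._current = None


def group_headings(html):
    """Start a new group at each h3 and append the following h4 texts to it."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    groups = []
    for tag, text in collector.headings:
        if tag == 'h3' or not groups:
            groups.append([text])
        else:
            groups[-1].append(text)
    return groups


final_result = group_headings(s)
print(final_result)
```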

score 0 · Answer 3 · answered Oct 16 '19 at 15:47

Hey thanks so much for your help @Ajax1234 and @orangeInk.

I took a closer look at the HTML, which has changed in the meantime. I ended up using find_all for the h2 titles and for the divs with a particular class that hold the content, then looping through them to create a dataframe where each row corresponds to a section/country. I'm not sure if what I did is ideal, but this is what I got:

comment_h2_tags = main_table.find_all('div',attrs={'class':'cr_title_in'})
comment_div_tags = main_table.find_all('div',attrs={'class':'itemBody'})

# Collect the stripped link text of each title block.
h2s = []
for h2_tag in comment_h2_tags:
    h2s.append(h2_tag.a.text.strip())
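
Once the title strings and the body strings are in two lists of equal length, pairing them into one record per section could look roughly like this. It is a sketch under assumptions: pair_sections and the sample values are made up for illustration and stand in for the text scraped from the cr_title_in and itemBody divs.

```python
def pair_sections(titles, bodies):
    """Pair each title with the body at the same position."""
    if len(titles) != len(bodies):
        raise ValueError(f'{len(titles)} titles but {len(bodies)} bodies')
    return [
        {'title': title.strip(), 'content': body.strip()}
        for title, body in zip(titles, bodies)
    ]


# Made-up sample values standing in for the scraped text.
example_titles = ['Bolivia bla bla ', 'BRAZIL: bla bla']
example_bodies = ['* Bolivia text text', ' * Brazil text text ']

rows = pair_sections(example_titles, example_bodies)
print(rows)
```

A list of dicts like this can be passed directly to pandas.DataFrame if a dataframe is the end goal.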

I'm inputting the country name manually for now, but I figured I'd give an update. Thanks!
