Extract Text Data from a Div Tag but Not from a Child H3 Tag

Question

I have an HTML snippet that I need to get data from using BeautifulSoup:

<!doctype html>
<html lang="en">
    <body>
        <div class="sidebar-box">
            <h3><i class="fa fa-users"></i> Management Team</h3>
                        Chairman, Director
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-male"></i> Teacher</h3>
                        John Doe
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                        National Top Quality Educational Development
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-building"></i> School Type</h3>
                        Secondary
        </div>
    </body>
</html>

I need to get the .text value of the second div from the top, "John Doe", but not the .text value inside the h3 tag in that div. My problem is that I currently get both text values, as in this code snippet:

# Python 3.7, BeautifulSoup 4.7
# html variable is equal to the above HTML snippet
from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()
print(school_head_teacher)

This outputs:

Teacher
                        John Doe

However, I only need the John Doe value.

chitown88 · Accepted Answer · 2019-02-15T11:14:02.177

Here are two solutions. The first is not the most elegant, but it is quick: split the text on whitespace, drop the first word ('Teacher'), and join the rest back together. Note that this only works when the heading is a single word.

Option 1:

html = '''
<!doctype html>
<html lang="en">
    <body>
        <div class="sidebar-box">
            <h3><i class="fa fa-users"></i> Management Team</h3>
                        Chairman, Director
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-male"></i> Teacher</h3>
                        John Doe
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                        National Top Quality Educational Development
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-building"></i> School Type</h3>
                        Secondary
        </div>
    </body>
</html>'''



from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()

school_head_teacher = school_head_teacher.split()[1:]
school_head_teacher = ' '.join(school_head_teacher)

print(school_head_teacher)

Output:

John Doe

Option 2:

I think this one is a bit better. You find the text node that contains 'Teacher', then get its parent tag (the h3). Since the value you want comes right after that tag, you use .next_sibling and then strip it.

soup4(text=re.compile('Teacher'))[0].parent.next_sibling.strip()

I had it in a for loop in case there are multiple teachers, but you can substitute the one-liner above for the loop:

from bs4 import BeautifulSoup
import re

soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
for elem in soup4(text=re.compile('Teacher')):
    print(elem.parent.next_sibling.strip())
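
For reuse, Option 2 could be wrapped in a small function. This is only a sketch: text_after_heading and SAMPLE_HTML are names made up for this example, and it assumes the value is always a plain text node directly after the h3. It uses string=, the newer spelling of the text= argument (available since BeautifulSoup 4.4).

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Trimmed-down copy of the question's markup so the sketch runs on its own.
SAMPLE_HTML = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
"""


def text_after_heading(html, heading):
    """Return the text that follows the tag whose label contains `heading`.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Find the first text node that matches the heading label.
    label = soup.find(string=re.compile(heading))
    if label is None:
        return None
    # The label's parent is the <h3>; the value is the node right after it.
    sibling = label.parent.next_sibling
    # next_sibling can be None or another tag, so only accept a text node.
    if not isinstance(sibling, NavigableString):
        return None
    return sibling.strip()


print(text_after_heading(SAMPLE_HTML, "Teacher"))
```

Returning None instead of raising keeps the call safe when a heading is missing from the page, which may or may not be what you want.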

I am accepting your solution, 'Option 2' really. It fully meets my needs, is much more Pythonic, and even covers some use cases I didn't include in the question. — ArthurEzenwanne, Feb 16 '19 at 18:56

Jack Fleeting · Answer 2 · 2019-02-15T12:30:16.183

Another option:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

teacher_name = soup.find_all('div', class_='sidebar-box')
print(teacher_name[1].contents[2].strip())

Output:

John Doe
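
The index 2 works because .contents of that div holds three children: the whitespace before the h3, the h3 itself, and the trailing text. If the position of the text might vary, one possible variation is to collect only the div's direct text nodes with find_all(string=True, recursive=False). The sketch below assumes that approach; own_text and SAMPLE_HTML are names invented for the example.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup so the sketch runs on its own.
SAMPLE_HTML = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
"""


def own_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper: recursive=False limits the search to direct
    children, so text nested inside the <h3> is skipped.
    """
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only nodes such as the indentation before the <h3>.
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
boxes = soup.find_all("div", class_="sidebar-box")
teacher = own_text(boxes[1])
print(teacher)
```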

edited Feb 15 '19 at 12:30

answered Feb 15 '19 at 12:24


score 1 · Answer 3 · answered Feb 16 '19 at 15:03

In <div class="sidebar-box"> <h3><i class="fa fa-male"></i> Teacher</h3> John Doe </div>, the text John Doe is the next sibling of <h3><i class="fa fa-male"></i> Teacher</h3>.

So we can use a combination of find_next() and next_sibling on <div class="sidebar-box">:

html = '''
<!doctype html>
<html lang="en">
    <body>
        <div class="sidebar-box">
            <h3><i class="fa fa-users"></i> Management Team</h3>
                        Chairman, Director
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-male"></i> Teacher</h3>
                        John Doe
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                        National Top Quality Educational Development
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-building"></i> School Type</h3>
                        Secondary
        </div>
    </body>
</html>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup.find_all('div', {'class':'sidebar-box'})
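# find_next() with no arguments returns the next tag (the <h3>);
# its next sibling is the text node that holds the name
head_teacher = school_head_teacher[1].find_next().next_sibling
print(head_teacher.strip())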

This way, you can also loop over the other divs that follow the same pattern:

for school_info in school_head_teacher:
    print(school_info.find_next().next_sibling.strip())
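
If you need all the values at once, the same pattern could feed a dictionary keyed by the heading text. This is a rough sketch under the assumption that every box has the shape shown in the question (one h3 followed by plain text); sidebar_to_dict and SAMPLE_HTML are made-up names, and box.find("h3") is used here in place of find_next() to make the target tag explicit.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed-down copy of the question's markup so the sketch runs on its own.
SAMPLE_HTML = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                National Top Quality Educational Development
</div>
"""


def sidebar_to_dict(html):
    """Map each sidebar heading to the text that follows it.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    info = {}
    for box in soup.find_all("div", class_="sidebar-box"):
        heading = box.find("h3")
        if heading is None:
            continue  # box without a heading: nothing to key on
        value = heading.next_sibling
        if not isinstance(value, NavigableString):
            continue  # no plain text directly after the heading
        info[heading.get_text(strip=True)] = value.strip()
    return info


info = sidebar_to_dict(SAMPLE_HTML)
print(info["Teacher"])
```

Looking values up by heading rather than by position means the code keeps working if the order of the boxes changes.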
