Parsing text in the same level as an html tag in beautifulsoup4

Question

I have this chunk of html I want to parse:

<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>

How can I parse that with BeautifulSoup? I know how to get the text inside a tag, but how do I get text that sits at the same level as a tag?

soup1.find("div", class_="class123")

grabs everything inside the first div

What do you mean by on the same level? And, what have you tried so far? — AKS, Dec 01 '16 at 11:37

score 1 · Accepted Answer · answered Dec 01 '16 at 11:59

You can iterate over the div's contents and keep only the strings:

>>> from bs4 import NavigableString
>>> for x in soup.find("div", class_="class123").contents:
...     if isinstance(x, NavigableString):
...         print(x.strip())
...

I want to parse this, how can do that?

.contents is a list of the Tag and NavigableString objects contained directly within the parent.

A NavigableString is a piece of text; it does not contain any sub-elements.
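As a rough sketch of the same idea for Python 3, the loop can be wrapped in a small helper that also drops whitespace-only strings. The helper name direct_text is hypothetical (it is not part of bs4), and the built-in "html.parser" is assumed here only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>"""


def direct_text(tag):
    """Return the non-empty strings that are direct children of `tag`.

    Hypothetical helper for illustration; not a bs4 function.
    """
    pieces = []
    for child in tag.contents:
        # Direct text children are NavigableString objects; nested tags are Tag objects.
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # skip strings that were only whitespace
                pieces.append(text)
    return pieces


soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("div", class_="class123")
print(direct_text(outer))
```

Text inside the nested div ("title123", "something else") is skipped because those strings are children of the inner tags, not of the outer div.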

score 0 · Answer 2 · edited May 23 '17 at 12:08

I think what you are asking is how to extract text contained in the element, not child elements or text contained in child elements.

You can use .find_all(text=True, recursive=False) (see "Only extracting text from this element, not its children").

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(
...     """<div class="class123">
...   <div><strong>title123</strong>
...     <span style="something123">something else</span>
...   </div>
... 
...   I want to parse this, how can do that?
... </div>""", 'lxml')
>>> 
>>> print(soup.find("div", class_="class123").find_all(text=True, recursive=False))
['\n', '\n\n  I want to parse this, how can do that?\n']

If there are multiple matching <div> elements, you'll have to loop through them:

>>> for result in soup.find_all("div", class_="class123"):
...     print(result.find_all(text=True, recursive=False))
... 
['\n', '\n\n  I want to parse this, how can do that?\n']

Lastly, you can tidy up the result to return a single string:

>>> print(" ".join([s.strip() for s in \
...     soup.find("div", class_="class123").find_all(text=True, recursive=False) \
...     ]).strip())
I want to parse this, how can do that?
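The looping and tidying steps can be combined in one place. The following is a minimal sketch, not a definitive implementation: own_text is a hypothetical helper name, the sample markup is made up for illustration, and it assumes Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with two matching divs (for illustration only).
HTML = """<div class="class123"><div><strong>a</strong></div> first </div>
<div class="class123"> second <span>b</span> third </div>"""


def own_text(tag):
    """Join the text that belongs directly to `tag`, ignoring its children.

    Hypothetical helper for illustration; not a bs4 function.
    """
    # recursive=False limits the search to direct children of `tag`.
    strings = tag.find_all(string=True, recursive=False)
    return " ".join(s.strip() for s in strings if s.strip())


soup = BeautifulSoup(HTML, "html.parser")
results = [own_text(div) for div in soup.find_all("div", class_="class123")]
print(results)
```

Filtering out whitespace-only strings before joining avoids the need for a final strip() on the joined result.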
