I have this chunk of HTML that I want to parse:

<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>

How can I parse that with BeautifulSoup? I know how to get something inside a tag, but how do I get the text that sits at the same level as the child tags?

soup1.find("div", class_="class123") 

grabs everything inside the first div, nested tags included.

Eli Korvigo
Kurama

2 Answers

You can iterate over the div's contents and keep only the text nodes:

>>> from bs4 import NavigableString
>>> # soup is the BeautifulSoup object built from the HTML in the question
>>> for x in soup.find("div", class_="class123").contents:
...     if isinstance(x, NavigableString):
...         print(x.strip())
...

I want to parse this, how can do that?

.contents is a list of the Tag and NavigableString objects that are direct children of the parent.

A NavigableString is a piece of text; unlike a Tag, it contains no sub-elements.
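
For reference, the same idea as a self-contained Python 3 script. This is only a sketch: it assumes the built-in "html.parser" backend (not something the question specifies), and the names html, parent and direct_text are made up for illustration. It also drops whitespace-only text nodes, which the loop above prints as empty lines.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>"""

# Assumption: the stdlib "html.parser" backend; other parsers should behave alike here.
soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="class123")

# Keep only direct children that are text nodes, skipping whitespace-only ones.
direct_text = [
    child.strip()
    for child in parent.contents
    if isinstance(child, NavigableString) and child.strip()
]
print(direct_text)  # ['I want to parse this, how can do that?']
```

Text inside the nested div ("title123", "something else") is not picked up, because .contents only lists direct children.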

nu11p01n73R

I think what you are asking is how to extract text contained in the element, not child elements or text contained in child elements.

You can use .find_all(text=True, recursive=False) (see Only extracting text from this element, not its children).

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(
...     """<div class="class123">
...   <div><strong>title123</strong>
...     <span style="something123">something else</span>
...   </div>
... 
...   I want to parse this, how can do that?
... </div>""", 'lxml')
>>> 
>>> print(soup.find("div", class_="class123").find_all(text=True, recursive=False))
['\n', '\n\n  I want to parse this, how can do that?\n']

If there are multiple matching <div> elements, you'll have to loop through them:

>>> for result in soup.find_all("div", class_="class123"):
...     print(result.find_all(text=True, recursive=False))
... 
['\n', '\n\n  I want to parse this, how can do that?\n']

Lastly, you can tidy up the result so that it is a single string:

>>> print(" ".join([s.strip() for s in \
...     soup.find("div", class_="class123").find_all(text=True, recursive=False) \
...     ]).strip())
I want to parse this, how can do that?
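
If you need this in several places, the tidy-up step could be wrapped in a small helper. The sketch below is an illustration, not part of any library: own_text is a hypothetical name, the sample markup is shortened, and it assumes a Beautiful Soup version (4.4.0 or later) where the text= argument is also available as string=.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return the text directly inside `tag`, ignoring text in child elements.

    Hypothetical helper; `string=True` is the newer spelling of `text=True`.
    """
    pieces = tag.find_all(string=True, recursive=False)
    # Strip each piece and drop the ones that were only whitespace.
    return " ".join(p.strip() for p in pieces if p.strip())


# Shortened sample markup, parsed with the stdlib backend (an assumption).
soup = BeautifulSoup(
    '<div class="class123"><div><strong>title123</strong></div> own text </div>',
    "html.parser",
)
print(own_text(soup.find("div", class_="class123")))  # own text
```

Filtering out the empty pieces before joining avoids the stray spaces that joining whitespace-only nodes would otherwise leave in the result.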
Community