
I want to extract the content (Content here) from the following HTML, first with BeautifulSoup and then with XPath. How can this be done?

<div class="paragraph">
    <h1>Title here</h1>
    Content here
</div>

Output:

Content here
CodingGuy
  • Does this answer your question? [Only extracting text from this element, not its children](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children) – fpsthirty Nov 17 '19 at 09:48

1 Answer

There are several ways to achieve that. Here are a few of them, shown in this order in the code that follows: .contents, .next_element, .next_sibling, and .stripped_strings.

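from bs4 import BeautifulSoup

html = '''<div class="paragraph">
    <h1>Title here</h1>
    Content here
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', class_='paragraph')

# contents: the div's children are [whitespace, <h1>, text], so the text is at index 2
print(div.contents[2].strip())
# next_element: the first hop is the h1's own string, the second is the text after it
print(div.find('h1').next_element.next_element.strip())
# next_sibling: the node that directly follows <h1> inside the div
print(div.find('h1').next_sibling.strip())
# stripped_strings: yields 'Title here' first, then 'Content here'
print(list(div.stripped_strings)[1])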
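The lookups above rely on the text sitting at a known position (index 2, or right after the h1). If the markup may vary, one option is to collect only the strings that are direct children of the div. This is a minimal sketch, assuming a reasonably recent bs4; the helper name direct_text is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

HTML = '''<div class="paragraph">
    <h1>Title here</h1>
    Content here
</div>'''


def direct_text(tag):
    """Return the text directly inside `tag`, ignoring text of child elements.

    Hypothetical helper for illustration only.
    """
    # recursive=False limits the search to direct children;
    # string=True keeps only text nodes, so <h1> and its text are skipped.
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only nodes and join whatever is left.
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", class_="paragraph")
print(direct_text(div))
```

This keeps working if more child elements are added before or after the text, since nothing depends on a fixed index.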

You can use CSS selectors as well.

from bs4 import BeautifulSoup

html = '''<div class="paragraph">
    <h1>Title here</h1>
    Content here
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one('.paragraph')    # the div itself
h1 = soup.select_one('.paragraph >h1') # the h1 that is a direct child of the div

# Same four lookups as before, starting from the CSS-selected nodes
print(div.contents[2].strip())
print(h1.next_element.next_element.strip())
print(h1.next_sibling.strip())
print(list(div.stripped_strings)[1])
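The question also asks for XPath, which BeautifulSoup does not evaluate itself. A possible sketch with the separate lxml package (an assumption here, since the answer above only uses bs4) is shown below. The two function names are made up for illustration.

```python
from lxml import html as lxml_html

HTML = '''<div class="paragraph">
    <h1>Title here</h1>
    Content here
</div>'''

# For a single-element fragment, fromstring returns that element (the div).
root = lxml_html.fromstring(HTML)


def content_via_text_nodes(node):
    """Select the div's own text nodes with text() and drop the empty ones."""
    texts = node.xpath('//div[@class="paragraph"]/text()')
    return " ".join(t.strip() for t in texts if t.strip())


def content_via_sibling(node):
    """Select the first text node that follows the h1, trimmed by XPath."""
    return node.xpath(
        'normalize-space(//div[@class="paragraph"]/h1/following-sibling::text()[1])'
    )


print(content_via_text_nodes(root))
print(content_via_sibling(root))
```

The first expression mirrors the direct-children idea, while the second mirrors next_sibling: it depends on the text coming right after the h1.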
KunduK