
Here is an example of the HTML I am trying to extract from:

    <div class="small subtle link">                      
                    <a href="https://example.com" target=&quot;_blank&quot;  nofollow >Example</a>
                

                
                     This text!
            </div>

I want to grab "This text!", but I keep getting "Example" along with it when I do this:

    myText=soup.findAll('div',{'class':re.compile('small subtle link')})
    if myText: 
        extractedText=myText.text.strip()

How do I leave out the text that is in the a tag?

AMC
soapy
  • Have you tried `extractedText=myText[-1].text.strip()` ? – IoaTzimas Nov 03 '20 at 23:48
  • Does this answer your question? [Only extracting text from this element, not its children](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children) – AMC Nov 04 '20 at 01:09
  • @Sophia P Pls check out my solution. – Sushil Nov 04 '20 at 01:32

3 Answers


There are a few possible solutions; which one fits depends on the exact behaviour you're looking for.

This produces the correct output:

from bs4 import BeautifulSoup

html_src = \
    '''
    <html>
    <body>
    <div class="small subtle link">
        <a href="https://example.com" nofollow="" target='"_blank"'>
            Example
        </a>
        This text!
    </div>
    </body>
    </html>
    '''

soup = BeautifulSoup(html_src, 'lxml')
print(soup.prettify())

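# Select the div by the exact string value of its class attribute.
div_tag = soup.find(name='div', attrs={'class': 'small subtle link'})

# Collect only the div's direct text nodes; text inside child tags such as <a> is skipped.
div_content_text = []
for curr_text in div_tag.find_all(recursive=False, text=True):
    curr_text = curr_text.strip()
    if curr_text:  # drop whitespace-only nodes
        div_content_text.append(curr_text)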

print(div_content_text)
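
Another of the possible solutions, sketched here as an alternative that is not part of the original answer: remove the `a` tag from a copy of the div and then read the remaining text. The simplified HTML and the names `div_copy` and `text_without_links` are illustrative assumptions, and `html.parser` is used only to avoid an extra dependency.

```python
import copy

from bs4 import BeautifulSoup

# Simplified version of the HTML from the question (assumed for illustration).
html_src = '''
<div class="small subtle link">
    <a href="https://example.com">Example</a>
    This text!
</div>
'''

soup = BeautifulSoup(html_src, "html.parser")
div_tag = soup.find("div", class_="small subtle link")

# Work on a copy so the original tree keeps its <a> tag.
div_copy = copy.copy(div_tag)
for link in div_copy.find_all("a"):
    link.decompose()  # drop the tag together with its text

# With the link gone, the remaining text is only the div's own text.
text_without_links = div_copy.get_text(strip=True)
print(text_without_links)
```

This variant is probably most useful when the div can contain several child tags whose text should all be ignored; if only the direct text nodes matter, the loop over `find_all(recursive=False, text=True)` above avoids the copy.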

Edit: The solution by Sushil is quite clean, too.

AMC

This returns the first text node that sits directly inside the div, ignoring text in child tags:

soup.div.find(text=True, recursive=False)
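
One caveat worth checking: with markup like the question's, the first direct text node of the div is the whitespace before the `a` tag, so the call above may return only whitespace. A minimal sketch that skips whitespace-only nodes follows; the simplified HTML and the names `first_node` and `result` are assumptions for illustration, and `string=` is the newer spelling of the `text=` argument.

```python
from bs4 import BeautifulSoup

# Simplified version of the HTML from the question (assumed for illustration).
html = '''<div class="small subtle link">
    <a href="https://example.com">Example</a>

    This text!
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="small subtle link")

# The first direct text node is the whitespace between <div> and <a>.
first_node = div.find(string=True, recursive=False)
print(repr(first_node))

# Keep every direct text node that still has content after stripping.
direct_text = [
    node.strip()
    for node in div.find_all(string=True, recursive=False)
    if node.strip()
]
result = " ".join(direct_text)
print(result)
```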
IoaTzimas

You can try this:

print(div.a.find_next_sibling(text=True).strip())

This finds the `a` tag inside the div and prints the text node that follows it.

Here is the full code:

from bs4 import BeautifulSoup

html = """
<div class="small subtle link">                      
                    <a href="https://example.com" target=&quot;_blank&quot;  nofollow >Example</a>
                

                
                     This text!
            </div>
"""

soup = BeautifulSoup(html,'html5lib')

div = soup.find('div', class_ = "small subtle link")

print(div.a.find_next_sibling(text=True).strip())

Output:

This text!
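
If some divs might not contain an `a` tag, `div.a` is `None` and the one-liner raises an `AttributeError`. A defensive sketch, assuming a hypothetical helper named `text_after_link` (not part of BeautifulSoup) and using `html.parser` to avoid the `html5lib` dependency:

```python
from bs4 import BeautifulSoup


def text_after_link(div):
    """Return the stripped text node after the div's first <a>, or None."""
    if div is None or div.a is None:
        return None
    sibling = div.a.find_next_sibling(string=True)
    return sibling.strip() if sibling is not None else None


# Two small illustrative documents: one with a link, one without.
with_link = BeautifulSoup(
    '<div class="small subtle link">'
    '<a href="https://example.com">Example</a> This text!</div>',
    "html.parser",
)
without_link = BeautifulSoup(
    '<div class="small subtle link">Only text</div>',
    "html.parser",
)

print(text_after_link(with_link.div))
print(text_after_link(without_link.div))
```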
Sushil