
I am trying to use bs4 to extract this email address. I've tried multiple methods, but the output is always either None or blank.

<div> class = name 1
<div> class = name 2
    <div> class = name 3
        <div>
        <p> blah blah </p>
        <p>
            <a href = "mailto:email@email.com">
            email@email.com </a>
        </p>
    </div>
</div>

This was my first attempt, but I still get no output:

from bs4 import BeautifulSoup
import requests

# Named `response` rather than `html` so it cannot shadow an imported html module.
response = requests.get('https://soundcloud.com/camcontrast')
soup = BeautifulSoup(response.text, 'lxml')

for a in soup.select('.infoStats__description p a'):
    print(a['href'], a.get_text(strip=True))

3 Answers


BeautifulSoup has a find_all() method that you could use for this purpose. Have a look at the find_all() documentation, and also at the find_all() shortcut.

from bs4 import BeautifulSoup

text = """
<div> class = name 1
<div> class = name 2
    <div> class = name 3
        <div>
        <p> blah blah </p>
        <p>
            <a href = "mailto:email@email.com">email@email.com </a>
        </p>
    </div>
</div>

"""

data = BeautifulSoup(text, 'html.parser')
for e in data.find_all('a'):
    print(e.string)

As an aside, you might want to check if the class text should be inside the <div> tags.
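To illustrate the aside, here is a minimal sketch of what the markup might look like with the class names written as real attributes. The names name1, name2 and name3 are placeholders assumed for illustration, not the classes used on the actual page. Once class is an attribute, a class selector can reach the link:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: name1..name3 are placeholder class names,
# written as class attributes inside the opening <div> tags.
html_doc = """
<div class="name1">
  <div class="name2">
    <div class="name3">
      <p> blah blah </p>
      <p>
        <a href="mailto:email@email.com">email@email.com </a>
      </p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# ".name3 p a" matches only because class="name3" is a real attribute.
links = soup.select(".name3 p a")
texts = [a.get_text(strip=True) for a in links]
print(texts)
```

With the markup as posted in the question, where the class text sits outside the tags, the same selector would have nothing to match.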

VMatić

You can use the CSS selector method select() to extract the email address from the <a> tag:

from bs4 import BeautifulSoup

html_doc="""
<div> class = name 1
<div> class = name 2
    <div> class = name 3
        <div>
        <p> blah blah </p>
        <p>
            <a href = "mailto:email@email.com">email@email.com </a>
        </p>
    </div>
</div>

"""

data = BeautifulSoup(html_doc,'lxml')
data = data.select('a')
for i in data:
    print(i.text)

Output will be:

email@email.com
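
If the page contains other links as well, select('a') returns all of them. One possible refinement, sketched here on made-up markup (the /about link is an assumption added for contrast), is a CSS attribute selector that matches only anchors whose href starts with mailto:

```python
from bs4 import BeautifulSoup

# Made-up markup with one ordinary link and one mailto link.
html_doc = """
<p><a href="/about">About</a></p>
<p><a href="mailto:email@email.com">email@email.com </a></p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# [href^="mailto:"] keeps only anchors whose href begins with "mailto:".
mail_links = soup.select('a[href^="mailto:"]')
mail_texts = [a.get_text(strip=True) for a in mail_links]
print(mail_texts)
```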
Humayun Ahmad Rajib

Try getting all the <a> tags with .find_all('a'), then keep only those whose href starts with mailto:

from bs4 import BeautifulSoup

text = """<div> class = name 1
<div> class = name 2
    <div> class = name 3
        <div>
        <p> blah blah </p>
        <p>
            <a href = "mailto:email@email.com">
            email@email.com </a>
        </p>
    </div>
</div>"""

soup = BeautifulSoup(text, 'html.parser')
links = soup.find_all('a')

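for link in links:
    # Default to '' so anchors without an href do not raise AttributeError.
    if link.get('href', '').startswith('mailto:'):
        # get_text() also works when the link contains nested tags.
        print(link.get_text(strip=True))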

Output:

email@email.com
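
The same filtering idea can be wrapped in a small helper that reads the address from the href itself rather than from the link text, which may differ from the address. This is only a sketch: extract_mailto_addresses is a hypothetical name, and the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup


def extract_mailto_addresses(html_text):
    """Return the addresses found in mailto: links (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    addresses = []
    # href=True skips anchors that have no href attribute at all.
    for link in soup.find_all("a", href=True):
        href = link["href"].strip()
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." style query part.
            address = href[len("mailto:"):].split("?", 1)[0].strip()
            if address:
                addresses.append(address)
    return addresses


# Invented sample: an anchor without href, a normal link, and a mailto link.
sample = (
    '<a name="top">no href</a>'
    '<a href="/home">Home</a>'
    '<a href="mailto:email@email.com?subject=Hi"><b>Write</b> to us</a>'
)
print(extract_mailto_addresses(sample))
```

The helper takes HTML text as its argument, so it can be given a string literal or the text of a downloaded page.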
Pygirl
  • I want to use this without having to paste in the actual text. I want it to pull the email directly from the website. – Cameron Long Jun 11 '20 at 14:28
  • You have not given us the complete question, then. This answers your current post. Update the question or, better, ask a separate one. – Pygirl Jun 15 '20 at 11:22