
I want to collect data from HTML files and put some of it into variables or lists, but I don't understand Beautiful Soup very well, especially how to navigate the document structure.

What is the best way to get the src URL attribute here?

<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>

How do I navigate this and put the values of the p elements (selected by class) into a list?

                <p class="bioheading">value</p>
                <div class="biodata">value</div>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>
                <p class="bioheading">value</p>
                <p class="biodata"><a href"http://url.com/month=01&amp;year=2018&amp;day=02">January 01, 1900</a> (117 years old)</p>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>

The same question for this:

<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
    <div class="row">
        <div class="col-xs-12 col-sm-4">
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
        </div>

How do I get the Gender value here?

<input name="Gender" value="m" type="hidden">

Note that this HTML can be malformed. Sorry for the beginner question.

Best regards.

EDIT:

k=0
a_table=[]
bday1=''
for link in soup.findAll('a'):
    a_table.append(str(link.get('href')))
    #out.write(str(i)+'\t'+str(p.text)+'\n')
    if re.match(regs4,str(link.get('href')),re.M) != None:
        bday1 = re.search(regs1,str(link.get('href')),re.M)
    else:
        bday1 = 'http://url.com/calendar.asp?calmonth=01&amp;calyear=2018&amp;calday=01'
    k=k+1

I tried this to collect every a href and check with a regex whether it is the URL I want. Calling .find_All() does not work; I get this error:

builtins.TypeError: 'NoneType' object is not callable

So I am using .findAll().

The following does not work either, because there are several input elements:

for _input in soup.findAll('input'):
    if str(_input.attrs['name']) == 'Gender':
        if str(_input.attrs['value']) == 'f':
            out.write('F') 
        elif str(_input.attrs['value']) == 'm':
            out.write('M')
        else:
            out.write('—')

I get this error:

builtins.KeyError: 'name'
RyosanCiffer
  • Check this page- [how-to-scrape-websites-with-python-and-beautifulsoup](https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe) – Keyur Potdar Jan 06 '18 at 13:53

2 Answers


Just some modifications/improvements to Bill's answer:

  • you can use .select_one() instead of .select()[0] to find a single element by a CSS selector
  • you don't need attrs; you can use dictionary-like access to tag attributes:

    soup.select_one('#headshot img')['src']
    
  • .get_text() is a bit more robust than accessing .text directly

  • you can improve the CSS selector used to get the p elements and use the fact that class names start with bio:

    #vitalbox #home p[class^=bio]
    
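  • you should be using find_all() and not the deprecated findAll(). Note the all-lowercase spelling: .find_All is not a method, so Beautiful Soup treats it as a search for a <find_All> tag, finds nothing, and returns None; calling that None is what raises "'NoneType' object is not callable"

  • you can even use the soup('p') shortcut instead of soup.find_all('p'), and soup.input['value'] instead of soup.find('input').attrs['value']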
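
Putting the points above together, here is a minimal sketch rather than a definitive version. It assumes the question's snippets are combined into one document, that the built-in 'html.parser' is used instead of 'lxml', and that the unclosed div elements are closed; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample assembled from the snippets in the question (closing tags added).
HTML = """
<div id="headshot">
<img title="Photo of someone" alt="Photo of someone" src="url/file.jpg">
</div>
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">value</p>
<p class="biodata">value</p>
</div>
</div>
</div>
</div>
<input name="Gender" value="m" type="hidden">
"""

soup = BeautifulSoup(HTML, "html.parser")

# select_one() returns the first match or None, so guard before indexing.
img = soup.select_one("#headshot img")
src = img["src"] if img is not None else None

# p elements whose class attribute starts with "bio", inside #vitalbox #home.
bio_texts = [
    p.get_text(strip=True)
    for p in soup.select('#vitalbox #home p[class^="bio"]')
]

# soup('p') is shorthand for soup.find_all('p').
paragraph_count = len(soup("p"))

# soup.input is shorthand for soup.find('input').
gender = soup.input["value"]

print(src, bio_texts, paragraph_count, gender)
```

The None check matters for malformed pages: if the img is missing, select_one() returns None and indexing it would raise a TypeError.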
alecxe

With select, #headshot finds the element with id 'headshot', and img finds the descendant element with that tag. Since select returns a list of matching elements, we take the first item in the list and ask for that element's src attribute.

>>> HTML = '''\
... <div id="headshot">
... <img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
... </div>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.select('#headshot img')[0].attrs['src']
'url/file.jpg'

Use findAll to identify all of the p elements, then get the text of each in a list comprehension. Note that the div with class biodata is not a p element, so its text is not included.

>>> HTML = '''\
... <p class="bioheading">value</p>
... <div class="biodata">value</div>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata"><a href"http://url.com/month=01&amp;year=2018&amp;day=02">January 01, 1900</a> (117 years old)</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> [p.text for p in soup.findAll('p')]
['value', 'value', 'value', 'value', 'January 01, 1900 (117 years old)', 'value', 'value', 'value', 'value']

As above, use select to unambiguously specify the content required, then get the text values in a list comprehension.

>>> HTML = '''\
... <div id="vitalbox" class="tab-content">
... <div role="tabpanel" class="tab-pane active" id="home">
... <div class="row">
... <div class="col-xs-12 col-sm-4">
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... </div>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> [p.text for p in soup.select('#vitalbox #home .row .col-xs-12 p')]
['value', 'value', 'value', 'value', 'value', 'value']
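
If the goal is to keep each heading together with its data, one possible sketch is to select the two classes separately and zip them. This assumes headings and data strictly alternate, which may not hold on malformed pages; the labels "heading 1", "data 1", and so on are hypothetical placeholders, and 'html.parser' is used here instead of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: same structure as above, with distinguishable text.
HTML = """
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">heading 1</p>
<p class="biodata">data 1</p>
<p class="bioheading">heading 2</p>
<p class="biodata">data 2</p>
<p class="bioheading">heading 3</p>
<p class="biodata">data 3</p>
</div>
</div>
</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
box = soup.select_one("#vitalbox #home")

# Collect headings and data in document order, then pair them positionally.
headings = [el.get_text(strip=True) for el in box.select(".bioheading")]
values = [el.get_text(strip=True) for el in box.select(".biodata")]
pairs = list(zip(headings, values))

print(pairs)
```

Selecting by class alone (.biodata rather than p.biodata) also picks up a value that sits in a div instead of a p, as in the second snippet of the question.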

In this case there is only one element, namely input; therefore, I use find. Since I've used find (rather than a method that returns a list), I know at most one element will be returned. I then request its value attribute.

>>> HTML = '''\
... <input name="Gender" value="m" type="hidden">'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('input').attrs['value']
'm'
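
When the page has several input elements, as in the edit to the question, some of them may have no name attribute, which is what raises KeyError: 'name'. A hedged sketch of one way around it: filter on the attribute in find itself and read the value with .get(). The extra submit input and the "-" fallback are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one input without a name, plus the Gender input.
HTML = """
<input type="submit" value="Search">
<input name="Gender" value="m" type="hidden">
"""

soup = BeautifulSoup(HTML, "html.parser")

# The first positional argument of find() is already called `name` (the tag
# name), so the HTML name attribute has to be passed through attrs.
gender_input = soup.find("input", attrs={"name": "Gender"})

# .get() returns None instead of raising KeyError when the attribute is absent.
value = gender_input.get("value") if gender_input is not None else None

# Map the raw value to the output label; anything unexpected becomes "-".
label = {"f": "F", "m": "M"}.get(value, "-")

print(label)
```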
Bill Bell