How to collect data from HTML with Beautiful Soup and put it in a list

Question

I want to collect data from HTML files and put some of it into variables or a list, but I don't understand Beautiful Soup very well, especially how to navigate the document structure.

What is the best way to get the src URL attribute here?

<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>

How do I navigate this and put the values of the p elements into a list?

                <p class="bioheading">value</p>
                <div class="biodata">value</div>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>
                <p class="bioheading">value</p>
                <p class="biodata"><a href"http://url.com/month=01&amp;year=2018&amp;day=02">January 01, 1900</a> (117 years old)</p>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>

The same question for this:

<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
    <div class="row">
        <div class="col-xs-12 col-sm-4">
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
        </div>

How do I get the Gender value here?

<input name="Gender" value="m" type="hidden">

Note that this HTML can be malformed. Sorry for the beginner question.

Best regards.

EDIT:

k=0
a_table=[]
bday1=''
for link in soup.findAll('a'):
    a_table.append(str(link.get('href')))
    #out.write(str(i)+'\t'+str(p.text)+'\n')
    if re.match(regs4,str(link.get('href')),re.M) != None:
        bday1 = re.search(regs1,str(link.get('href')),re.M)
    else:
        bday1 = 'http://url.com/calendar.asp?calmonth=01&amp;calyear=2018&amp;calday=01'
    k=k+1

I tried this to collect every a href and check with a regex whether it is the wanted URL. .find_All() does not work; I get this error:

builtins.TypeError: 'NoneType' object is not callable

So I am using .findAll()

This does not work either, because there are several input elements:

for _input in soup.findAll('input'):
    if str(_input.attrs['name']) == 'Gender':
        if str(_input.attrs['value']) == 'f':
            out.write('F') 
        elif str(_input.attrs['value']) == 'm':
            out.write('M')
        else:
            out.write('—')

I get this error:

builtins.KeyError: 'name'

Check this page- [how-to-scrape-websites-with-python-and-beautifulsoup](https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe) — Keyur Potdar, Jan 06 '18 at 13:53

score 3 · Answer 1 · answered Jan 06 '18 at 19:56

Just some modifications/improvements to Bill's answer:

- you can use .select_one() instead of .select()[0] to find a single element by a CSS selector
- you don't need attrs; tags support dictionary-like access to their attributes:
```
soup.select_one('#headshot img')['src']
```
- .get_text() is a bit more robust than accessing .text directly
- you can improve the CSS selector used to get the p elements by using the fact that the class names start with bio:
```
#vitalbox #home p[class^=bio]
```
- you should be using find_all() and not the deprecated findAll()
- you can even use the soup('p') shortcut instead of soup.find_all('p'), and soup.input['value'] instead of soup.find('input').attrs['value']
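
Putting the points above together, here is a minimal sketch. It assumes a recent Beautiful Soup 4 release (where CSS selectors are handled by the Soup Sieve package) and uses the built-in html.parser instead of lxml so that nothing extra needs to be installed. The sample markup is adapted from the question, with the closing tags filled in and placeholder text values.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question (closing tags added, values are placeholders).
HTML = """
<div id="headshot">
<img title="Photo of someone" alt="Photo of someone" src="url/file.jpg">
</div>
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">Heading</p>
<p class="biodata">Data</p>
</div>
</div>
</div>
</div>
<input name="Gender" value="m" type="hidden">
"""

soup = BeautifulSoup(HTML, "html.parser")

# select_one() returns the first match (or None), so no [0] is needed.
src = soup.select_one("#headshot img")["src"]

# Attribute selector: p elements whose class attribute starts with "bio".
bio_texts = [p.get_text() for p in soup.select("#vitalbox #home p[class^=bio]")]

# find_all() (lowercase "all") and the call shortcut return the same tags.
same = soup.find_all("p") == soup("p")

# soup.input is shorthand for soup.find("input").
gender = soup.input["value"]

print(src, bio_texts, same, gender)
```

Note the spelling find_all: find_All (capital A) is not a Beautiful Soup method, which is most likely why the call in the question failed.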

p[class^=bio] doesn't work for me. I also want to get both p[class^=bio] and div[class^=bio]. — RyosanCiffer, Jan 08 '18 at 10:58
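
One way to match both p and div elements is a comma-separated selector group. This is only a sketch, assuming a Beautiful Soup version whose select() supports attribute-prefix selectors and selector groups (recent releases do, via Soup Sieve); an older installation may be the reason the selector failed. The markup and values below are placeholders.

```python
from bs4 import BeautifulSoup

# Placeholder markup mixing p and div elements, as in the question.
HTML = """
<p class="bioheading">Heading 1</p>
<div class="biodata">Data 1</div>
<p class="bioheading">Heading 2</p>
<p class="biodata">Data 2</p>
<p class="other">ignored</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The comma means "either selector"; matches come back in document order.
tags = soup.select("p[class^=bio], div[class^=bio]")
values = [tag.get_text(strip=True) for tag in tags]

print(values)
```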

Bill Bell · Answer 2 · 2018-01-06T17:34:47.657

With select, #headshot finds the element with id 'headshot', and img finds the descendant element with that tag. Since select can return a list of elements, we take the first item in the list and ask for that element's src attribute.

>>> import bs4
>>> HTML = '''\
... <div id="headshot">
... <img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
... </div>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.select('#headshot img')[0].attrs['src']
'url/file.jpg'

Use findAll to identify all of the p elements, then in a list comprehension, get the text of each.

>>> HTML = '''\
... <p class="bioheading">value</p>
... <div class="biodata">value</div>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata"><a href"http://url.com/month=01&amp;year=2018&amp;day=02">January 01, 1900</a> (117 years old)</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> [p.text for p in soup.findAll('p')]
['value', 'value', 'value', 'value', 'January 01, 1900 (117 years old)', 'value', 'value', 'value', 'value']
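
If the headings and their data need to stay together, one possible approach is to filter by class and look up the sibling that follows each heading. This is a sketch rather than part of the original answer: the heading and data values are made up for illustration, the href is written with its = sign, and html.parser is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Placeholder values; the second data element is a div, as in the question.
HTML = """
<p class="bioheading">Name</p>
<div class="biodata">Someone</div>
<p class="bioheading">Birthday</p>
<p class="biodata"><a href="http://url.com/">January 01, 1900</a> (117 years old)</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

pairs = []
for heading in soup.find_all("p", class_="bioheading"):
    # Next sibling element with class "biodata", whatever its tag name is.
    data = heading.find_next_sibling(class_="biodata")
    text = data.get_text(" ", strip=True) if data is not None else None
    pairs.append((heading.get_text(strip=True), text))

print(pairs)
```

Passing " " as the separator to get_text() keeps a space between the link text and the text that follows it once each piece has been stripped.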

As above, use select to unambiguously specify the content required, then get the text values in a list comprehension.

>>> HTML = '''\
... <div id="vitalbox" class="tab-content">
... <div role="tabpanel" class="tab-pane active" id="home">
... <div class="row">
... <div class="col-xs-12 col-sm-4">
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
... <p class="biodata">value</p>
... <p class="bioheading">value</p>
...  <p class="biodata">value</p>
...  </div>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> [p.text for p in soup.select('#vitalbox #home .row .col-xs-12 p')]
['value', 'value', 'value', 'value', 'value', 'value']

In this case there is only one element, namely input, so I use find. Because find returns a single element (or None if nothing matches) rather than a list, I can request its attribute directly.

>>> HTML = '''\
... <input name="Gender" value="m" type="hidden">'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('input').attrs['value']
'm'
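
When the page has several input elements, as in the question's edit, indexing attrs['name'] raises KeyError for any input that has no name attribute. A hedged sketch of one way around that: filter on the attribute in find() and read values with .get(). The extra submit input and the "-" fallback are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical form: the first input has no name attribute.
HTML = """
<form>
<input type="submit" value="Search">
<input name="Gender" value="m" type="hidden">
</form>
"""

soup = BeautifulSoup(HTML, "html.parser")

# .get() returns None instead of raising KeyError for a missing attribute.
names = [tag.get("name") for tag in soup.find_all("input")]

# Use attrs= here: the first positional argument of find() is the tag name,
# so the HTML attribute called "name" cannot be passed as a keyword.
gender_input = soup.find("input", attrs={"name": "Gender"})
value = gender_input.get("value") if gender_input is not None else None

# Map the raw value to the label to be written out, with a fallback.
label = {"f": "F", "m": "M"}.get(value, "-")

print(names, value, label)
```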
