I want to collect data from HTML files and store some of it in variables or lists. However, I don't understand Beautiful Soup very well, especially how to navigate the document structure.
What is the best way to get the src URL attribute here?
<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>
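One possible approach for the src case is sketched below. It assumes the bs4 package with Python's built-in html.parser backend; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Find the container by its id, then the first <img> inside it.
headshot = soup.find("div", id="headshot")
img = headshot.find("img") if headshot is not None else None

# .get() returns None instead of raising KeyError when src is missing.
src = img.get("src") if img is not None else None
print(src)
```

The None checks are there because the HTML may be malformed, so either the div or the img could be absent.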
How can I navigate this markup and collect the text of the elements of each class into a list?
<p class="bioheading">value</p>
<div class="biodata">value</div>
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata"><a href"http://url.com/month=01&year=2018&day=02">January 01, 1900</a> (117 years old)</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
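A minimal sketch of collecting each class into its own list, assuming bs4 with the built-in html.parser. The sample values ("Name", "Someone", and so on) and the shortened link are placeholders, not real data. Filtering on class_ alone, without a tag name, also catches a biodata value that sits in a div rather than a p.

```python
from bs4 import BeautifulSoup

html = '''
<p class="bioheading">Name</p>
<div class="biodata">Someone</div>
<p class="bioheading">Birthday</p>
<p class="biodata"><a href="http://url.com/">January 01, 1900</a> (117 years old)</p>
'''

soup = BeautifulSoup(html, "html.parser")

# class_ (with trailing underscore) filters on the CSS class;
# omitting the tag name matches <p> and <div> alike.
headings = [tag.get_text(strip=True) for tag in soup.find_all(class_="bioheading")]

# The " " separator joins the link text and the trailing text with one space.
values = [tag.get_text(" ", strip=True) for tag in soup.find_all(class_="biodata")]

print(headings)
print(values)
```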
The same question applies to this:
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
</div>
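For the nested case, one option is to restrict the search to the vitalbox container and pair each heading with the value that follows it. This is only a sketch under the assumption that every bioheading is followed by a sibling with class biodata; the sample headings and values are made up.

```python
from bs4 import BeautifulSoup

# The fragment is left unclosed on purpose; the parser closes the open <div>s.
html = '''
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">Height</p>
<p class="biodata">180 cm</p>
<p class="bioheading">Weight</p>
<p class="biodata">75 kg</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
box = soup.find(id="vitalbox")

pairs = {}
if box is not None:
    # Searching from box ignores bioheading elements elsewhere on the page.
    for heading in box.find_all(class_="bioheading"):
        value = heading.find_next_sibling(class_="biodata")
        if value is not None:
            pairs[heading.get_text(strip=True)] = value.get_text(" ", strip=True)

print(pairs)
```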
How do I get the Gender value here?
<input name="Gender" value="m" type="hidden">
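A possible way to read the hidden field directly, assuming bs4 with html.parser. Because name is already the first positional parameter of find() (the tag name), the HTML name attribute has to be passed through the attrs dictionary.

```python
from bs4 import BeautifulSoup

html = '<input name="Gender" value="m" type="hidden">'
soup = BeautifulSoup(html, "html.parser")

# attrs={"name": ...} filters on the HTML attribute called "name".
gender_input = soup.find("input", attrs={"name": "Gender"})

# Guard against a missing element or a missing value attribute.
gender = gender_input.get("value") if gender_input is not None else None
print(gender)
```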
Note that this HTML can be malformed. Sorry for the beginner question.
Best regards.
EDIT:
k=0
a_table=[]
bday1=''
for link in soup.findAll('a'):
    a_table.append(str(link.get('href')))
    #out.write(str(i)+'\t'+str(p.text)+'\n')
    if re.match(regs4,str(link.get('href')),re.M) != None:
        bday1 = re.search(regs1,str(link.get('href')),re.M)
    else:
        bday1 = 'http://url.com/calendar.asp?calmonth=01&calyear=2018&calday=01'
    k=k+1
I tried this to collect every a href and check, with a regex, whether it is the URL I want. Calling .find_All() does not work; I get this error:
builtins.TypeError: 'NoneType' object is not callable
So I am using .findAll() instead.
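The error is consistent with how bs4 handles attribute access: an unknown attribute on a soup object is treated as a search for a child tag of that name, so soup.find_All evaluates to None, and calling None raises the TypeError. The method is spelled find_all (findAll is the older alias). The sketch below assumes bs4 with html.parser; the sample links and the calendar pattern are hypothetical stand-ins for regs1 and regs4, which are not shown.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<a href="http://url.com/calendar.asp?calmonth=01&calyear=2018&calday=01">x</a>'
    '<a href="/other">y</a>'
    '<a>no href</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A misspelled method name is looked up as a tag name and yields None.
print(soup.find_All)

# href=True keeps only anchors that actually carry an href attribute.
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# A compiled regex can be passed directly as an attribute filter.
calendar_pattern = re.compile(r"calendar\.asp")  # hypothetical pattern
calendar_links = [a["href"] for a in soup.find_all("a", href=calendar_pattern)]

print(len(all_hrefs), len(calendar_links))
```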
The following does not work either; there are several input elements:
for _input in soup.findAll('input'):
    if str(_input.attrs['name']) == 'Gender':
        if str(_input.attrs['value']) == 'f':
            out.write('F')
        elif str(_input.attrs['value']) == 'm':
            out.write('M')
        else:
            out.write('—')
I get this error:
builtins.KeyError: 'name'
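A KeyError like this is what indexing attrs raises when an input element has no name attribute at all, which is plausible when a page contains several inputs. A sketch of a more tolerant loop, assuming bs4 with html.parser, follows; the extra submit input, the labels mapping, and the "-" fallback are illustrative choices.

```python
from bs4 import BeautifulSoup

html = '''
<input type="submit" value="Go">
<input name="Gender" value="m" type="hidden">
'''
soup = BeautifulSoup(html, "html.parser")

labels = {"f": "F", "m": "M"}  # hypothetical mapping of raw value to output
result = "-"                   # fallback when no Gender input is found

for _input in soup.find_all("input"):
    # Tag.get() returns None for a missing attribute instead of raising KeyError.
    if _input.get("name") == "Gender":
        result = labels.get(_input.get("value"), "-")
        break

print(result)
```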