I am trying to scrape "span" tags using BeautifulSoup. Here is my code:

import urllib
from bs4 import BeautifulSoup
url="someurl"
res=urllib.urlopen(url)
html=res.read()
soup=BeautifulSoup(html,"html.parser")
soup.findAll("span")

But for some specific web pages this doesn't list all the spans; it only shows a limited number of them. However, when I do

soup.prettify()

the output contains all the spans. What might be the reason? Am I missing something? Some answers I found suggest using headless browsers such as "htmlunit", but I am not sure what exactly they are. Can I integrate them into my Django project?

soup.prettify() gives: https://drive.google.com/file/d/0BxhTzDujWhPVTzdIS2VWd1pZcHM/view?usp=sharing

Expected output of soup.findAll("span"):

A list of all the spans

Output I'm getting:

[<span class="ssc-ftpl ssc_ga_tag" data-gaa="Opened" data-gac="Footer" data-gal="Responsible Gambling" tabindex="0"> Responsible Gambling</span>, <span class="ssc-ftpl ssc_ga_tag" data-gaa="Opened" data-gac="Footer" data-gal="About Betfair" tabindex="0"> About Betfair</span>, <span class="ssc-ftpl ssc-ftls " tabindex="0">English - UK</span>, <span class="ssc-ftpl" tabindex="0">\xa9 \xae</span>]

2 Answers

1

I finally found the solution. The problem was the default "html.parser", which was not able to handle this page's markup. Using "html5lib" as the parser instead gives the desired results.

soup=BeautifulSoup(html,"html5lib")
soup.findAll("span")

The html5lib parser parses the page the same way a web browser does.
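To check whether the parser is really what makes the difference on a given page, one option is to parse the same HTML with several parsers and compare the span counts. The following is only a minimal sketch: the helper name `count_spans_by_parser` and the sample markup are made up for illustration, and it assumes that "lxml" and "html5lib" may or may not be installed (missing ones are skipped).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_spans_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <span> tags found} for each installed parser.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The requested parser library is not installed, so skip it.
            continue
        counts[name] = len(soup.find_all("span"))
    return counts


if __name__ == "__main__":
    # Well-formed sample markup; every parser should agree on this input.
    sample = "<html><body><span>a</span><p><span>b</span></p></body></html>"
    print(count_spans_by_parser(sample))
```

On well-formed markup the counts should match. If they differ on the real page, that suggests the parsers are recovering from broken markup in different ways.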

0

Maybe you are trying to scrape a different page, but I didn't have a problem scraping that site. Here is my code:

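import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.betfair.com/sport/football'
html = urllib.request.urlopen(url).read()
# No parser is named here, so BeautifulSoup picks the best one installed
soup = BeautifulSoup(html)

# Collect and print every <span> on the page
test = soup.find_all('span')
for span in test:
    print(span)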

This produced a large list of spans, including the lines/scores, which is what I figured you were interested in:

<span class="ssc-lkh"></span>
<span>Join Now</span>
<span class="new flag-en"></span>
<span class="new flag-en"></span>
<span class="sportIcon-6423"></span>
<span class="sportName">American Football</span>
<span class="sportIcon-3988"></span>
<span class="sportName">Athletics</span>
<span class="sportIcon-61420"></span>
.....

Updated in response to the comment below

Here is some revised code to show that my code does indeed pull in the spans you need.

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.betfair.com/sport/football'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)

# Keep only the spans whose class is "away-team-name"
test = soup.find_all('span', attrs={"class": "away-team-name"})
for span in test:
    print("away team" + span.text)

Produces:

away team
Marseille

away team
Lazio

away team
Academica

away team
Canada (W)

away team
Arnett Gardens FC

away team
UWI FC
....
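The line breaks in the output above suggest that the text of each span carries surrounding whitespace. As a rough sketch, the same class filter can be written with the `class_` keyword, and `get_text(strip=True)` trims that whitespace. The sample markup, the home team name, and the helper name `away_team_names` are made up for illustration and only stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; only the class name
# "away-team-name" and the two away teams are taken from the output above.
SAMPLE_HTML = """
<div>
  <span class="home-team-name">
    Example Home FC
  </span>
  <span class="away-team-name">
    Marseille
  </span>
  <span class="away-team-name">
    Lazio
  </span>
</div>
"""


def away_team_names(html, parser="html.parser"):
    """Return the stripped text of every span with class "away-team-name"."""
    soup = BeautifulSoup(html, parser)
    spans = soup.find_all("span", class_="away-team-name")
    # strip=True removes the leading and trailing whitespace around each name.
    return [span.get_text(strip=True) for span in spans]


if __name__ == "__main__":
    for name in away_team_names(SAMPLE_HTML):
        print("away team: " + name)
```

The `parser` argument defaults to "html.parser" here only so the sketch runs without extra packages; pass "html5lib" if that is the parser that works for the real page.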
dstudeba
  • Yes, but your code will also not list all the spans. Check for a span containing the team name of any ongoing or in-play match, like this one: https://drive.google.com/file/d/0BxhTzDujWhPVekN6UW9CUzd0eWc/view?usp=sharing – sidharth kumar Dec 20 '15 at 20:31
  • @sidharthkumar Did you try my code? When I run it it gets those `span`s. Please see my updated code where I explicitly get the `span` you say you can't get. – dstudeba Dec 20 '15 at 21:30
  • I did (almost) the same: https://drive.google.com/file/d/0BxhTzDujWhPVNGE4Sk9xTW10UFE/view?usp=sharing – sidharth kumar Dec 21 '15 at 10:10
  • Thanks for the help. I used html5lib as the parser and got the output. – sidharth kumar Dec 22 '15 at 23:11