scrape span using BeautifulSoup

Question

I was trying to scrape "span" tags using BeautifulSoup. Here's my code:

import urllib
from bs4 import BeautifulSoup
url="someurl"
res=urllib.urlopen(url)
html=res.read()
soup=BeautifulSoup(html,"html.parser")
soup.findAll("span")

But for some specific web pages, this doesn't list all the spans; it shows only a limited number of them. However, when I do

soup.prettify()

the output contains all the spans. What might be the reason? Am I missing something? Also, some answers I found suggested using headless browsers like "htmlunit", but I'm not sure exactly what they are. Can I integrate them into my Django project?

soup.prettify gives https://drive.google.com/file/d/0BxhTzDujWhPVTzdIS2VWd1pZcHM/view?usp=sharing

expected output of soup.findAll("span")

list of all the spans

output I'm getting

[<span class="ssc-ftpl ssc_ga_tag" data-gaa="Opened" data-gac="Footer" data-gal="Responsible Gambling" tabindex="0"> Responsible Gambling</span>, <span class="ssc-ftpl ssc_ga_tag" data-gaa="Opened" data-gac="Footer" data-gal="About Betfair" tabindex="0"> About Betfair</span>, <span class="ssc-ftpl ssc-ftls " tabindex="0">English - UK</span>, <span class="ssc-ftpl" tabindex="0">\xa9 \xae</span>]

Could you, please, provide input, expected output and real output? — awesoon, Dec 19 '15 at 14:50
Do a diff between the output of `print(soup)` and `print(soup.prettify())`. Is there any difference? — dstudeba, Dec 19 '15 at 15:34

score 1 · Answer 1 · answered Dec 22 '15 at 23:10

I finally found the solution. The problem was the default "html.parser", which was not able to handle this page's markup. Use "html5lib" for parsing instead to get the desired results.

soup=BeautifulSoup(html,"html5lib")
soup.findAll("span")

The html5lib parser parses the page the same way a web browser does.
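
To check whether the parser is what makes spans go missing on a given page, it can help to count the spans each parser finds in the same HTML. The following is only a minimal sketch: `count_spans` is a hypothetical helper name, not part of BeautifulSoup, and it assumes that "lxml" and "html5lib" may or may not be installed, so any missing parser is skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_spans(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <span> tags found} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The library backing this parser is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("span"))
    return counts


# Well-formed markup: every installed parser should agree on the count.
sample = "<div><span>a</span><span>b</span></div>"
print(count_spans(sample))
```

If the counts differ when this is run on the downloaded page, the parsers are building different trees from the same markup, and switching parsers (as above) is worth trying.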

dstudeba · Answer 2 · 2015-12-20T21:28:42.023

Maybe you are trying to scrape a different page, but I didn't have a problem scraping that site. Here is my code:

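import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.betfair.com/sport/football'
html = urllib.request.urlopen(url).read()
# No parser is named here, so BeautifulSoup picks the best one installed
soup = BeautifulSoup(html)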

test = soup.find_all('span')
for span in test:
    print(span)

This produced a large list of spans including the lines/scores which is what I figured you are interested in:

<span class="ssc-lkh"></span>
<span>Join Now</span>
<span class="new flag-en"></span>
<span class="new flag-en"></span>
<span class="sportIcon-6423"></span>
<span class="sportName">American Football</span>
<span class="sportIcon-3988"></span>
<span class="sportName">Athletics</span>
<span class="sportIcon-61420"></span>
.....

Updated in response to the comment below

Here is some revised code to show that my code does indeed pull in the spans you need.

url='https://www.betfair.com/sport/football'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)

test = soup.find_all('span', attrs={"class": "away-team-name"})
for span in test:
    print("away team" + span.text)

Produces:

away team
Marseille

away team
Lazio

away team
Academica

away team
Canada (W)

away team
Arnett Gardens FC

away team
UWI FC
....

Yes, but yours will not list all the spans either. Look for a span containing a team name from any ongoing or in-play match, like this one: https://drive.google.com/file/d/0BxhTzDujWhPVekN6UW9CUzd0eWc/view?usp=sharing — sidharth kumar, Dec 20 '15 at 20:31
@sidharthkumar Did you try my code? When I run it it gets those `span`s. Please see my updated code where I explicitly get the `span` you say you can't get. — dstudeba, Dec 20 '15 at 21:30
I did (almost) the same: https://drive.google.com/file/d/0BxhTzDujWhPVNGE4Sk9xTW10UFE/view?usp=sharing — sidharth kumar, Dec 21 '15 at 10:10
Thanks for the help. I used html5lib as the parser and got the output. — sidharth kumar, Dec 22 '15 at 23:11
