
So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular HTML parser. I have run the following code using the default 'html.parser' as well as 'lxml' and 'html5lib', yet I only find one instance when I should be finding 14.

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://robertsspaceindustries.com/pledge/ships'

uClient = uReq(my_url)

page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, features="lxml")

containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)

I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://i.stack.imgur.com/mqash.jpg).

1 Answer


When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.

Initially there is only a single tag with the class name 'ships-listing', because that tag is part of the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and those are created by JavaScript.


So when you download the page using urllib, the downloaded content contains only the original source page (you can confirm this with the view-source option in the browser).
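You can see the same behaviour with a self-contained sketch (static markup here, not the live site): find_all() only matches tags that are literally present in the markup you hand to BeautifulSoup; it cannot see elements that JavaScript would add later.

```python
from bs4 import BeautifulSoup

# Static stand-in for the downloaded source: one listing is in the
# HTML, the rest would be injected by JavaScript after page load.
static_html = """
<body>
  <ul class="ships-listing"><li>Aurora</li></ul>
  <div id="more-ships"><!-- filled in by JavaScript on scroll --></div>
</body>
"""

page_soup = BeautifulSoup(static_html, "html.parser")
containers = page_soup.find_all("ul", {"class": "ships-listing"})
print(len(containers))  # 1 -- only the <ul> present in the source
```

Swapping the parser ('lxml', 'html5lib') makes no difference here, because the missing elements were never in the downloaded bytes to begin with.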

Tharindu
  • Thank you so much! I was just starting to think something like this. Should I use something like selenium to automate the browser and have it scroll to the bottom of the page before scraping the html? Or is there a more elegant way to do it? – Thomas DeGreve Oct 31 '18 at 17:17
  • @ThomasDeGreve this may help you: https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python – Tharindu Nov 02 '18 at 04:00
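If you do drive a real browser (e.g. Selenium, as the comments suggest), the usual pattern is to keep scrolling until the number of loaded listings stops growing, then hand the rendered HTML to BeautifulSoup. A minimal sketch of that loop, written against a stub page object so it runs standalone (with Selenium, `get_count` would call `driver.find_elements` and `scroll` would run a `window.scrollTo` script; both names here are placeholders, not Selenium API):

```python
def scroll_until_stable(page, max_rounds=20):
    """Scroll repeatedly until the number of loaded items stops growing,
    then return the final count."""
    last = page.get_count()
    for _ in range(max_rounds):
        page.scroll()
        count = page.get_count()
        if count == last:      # nothing new loaded -> page is fully rendered
            return count
        last = count
    return last

class FakePage:
    """Stub standing in for a browser driver: each scroll loads four more
    listings until all 14 are present."""
    def __init__(self):
        self.count = 1
    def scroll(self):
        self.count = min(self.count + 4, 14)
    def get_count(self):
        return self.count

print(scroll_until_stable(FakePage()))  # 14
```

The stop-when-stable check matters because infinite-scroll pages give no explicit "done" signal; counting elements between scrolls is a simple, driver-agnostic way to detect it.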