
So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular HTML parser. I have run the following code using the default 'html.parser' as well as 'lxml' and 'html5lib', yet I only find one instance when I should be finding 14.

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://robertsspaceindustries.com/pledge/ships'

uClient = uReq(my_url)

page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, features="lxml")

containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)

I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://i.stack.imgur.com/mqash.jpg).

1 Answer


When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.

Initially there is only a single tag with the class name 'ships-listing', because that tag is part of the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and those are created by JavaScript.


So when you download the page using urllib, the downloaded content contains only the original source page (you can confirm this with the view-source option in the browser).
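You can see the same behaviour with a self-contained sketch (static markup here, not the live site): find_all() only matches tags that are literally present in the markup you hand to BeautifulSoup; it cannot see elements that JavaScript would add later.

```python
from bs4 import BeautifulSoup

# Static stand-in for the downloaded source: one listing is in the
# HTML, the rest would be injected by JavaScript after page load.
static_html = """
<body>
  <ul class="ships-listing"><li>Aurora</li></ul>
  <div id="more-ships"><!-- filled in by JavaScript on scroll --></div>
</body>
"""

page_soup = BeautifulSoup(static_html, "html.parser")
containers = page_soup.find_all("ul", {"class": "ships-listing"})
print(len(containers))  # 1 -- only the <ul> present in the source
```

Swapping the parser ('lxml', 'html5lib') makes no difference here, because the missing elements were never in the downloaded bytes to begin with.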

Tharindu
  • Thank you so much! I was just starting to think something like this. Should I use something like selenium to automate the browser and have it scroll to the bottom of the page before scraping the html? Or is there a more elegant way to do it? – Thomas DeGreve Oct 31 '18 at 17:17
  • @ThomasDeGreve this may help you: https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python – Tharindu Nov 02 '18 at 04:00
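If you do drive a real browser (e.g. Selenium, as the comments suggest), the usual pattern is to keep scrolling until the number of loaded listings stops growing, then hand the rendered HTML to BeautifulSoup. A minimal sketch of that loop, written against a stub page object so it runs standalone (with Selenium, `get_count` would call `driver.find_elements` and `scroll` would run a `window.scrollTo` script; both names here are placeholders, not Selenium API):

```python
def scroll_until_stable(page, max_rounds=20):
    """Scroll repeatedly until the number of loaded items stops growing,
    then return the final count."""
    last = page.get_count()
    for _ in range(max_rounds):
        page.scroll()
        count = page.get_count()
        if count == last:      # nothing new loaded -> page is fully rendered
            return count
        last = count
    return last

class FakePage:
    """Stub standing in for a browser driver: each scroll loads four more
    listings until all 14 are present."""
    def __init__(self):
        self.count = 1
    def scroll(self):
        self.count = min(self.count + 4, 14)
    def get_count(self):
        return self.count

print(scroll_until_stable(FakePage()))  # 14
```

The stop-when-stable check matters because infinite-scroll pages give no explicit "done" signal; counting elements between scrolls is a simple, driver-agnostic way to detect it.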