
I am having trouble scraping websites that contain more than one declaration of <!DOCTYPE html>.

I am using Python 2.7.9 with requests and BeautifulSoup from bs4. When I execute requests.get(url), I notice that the result captures the text from the outer <!DOCTYPE html> and NOT from the second, inner <!DOCTYPE html>.
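
For reference, a minimal sketch of what I am doing (the URL below is a placeholder, not the actual site):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real page nests a second <!DOCTYPE html> document.
url = "http://example.com/page-with-nested-doctype"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")
# The resulting tree only reflects the outer document; everything under
# the inner <!DOCTYPE html> is missing from the soup.
print(soup.prettify())
```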

My question is: is there a way, preferably in Python, to scrape all the information from a website that contains more than one <!DOCTYPE html>?

This person has the same problem, but their question received no answers: https://stackoverflow.com/questions/27259682/mechanize-cutting-off-html-early-python

Any help would be appreciated! Thanks.

Update v1:

I looked around Stack Overflow and encountered this post: Using Python requests.get to parse html code that does not load at once

The test link is http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/.

Note that the test link is not the link I am working with, but the idea is much the same. Both sites use JS to load the additional information (I should have stated this earlier, but I did not realize it until now, sorry!).

After trying Selenium to load the page I am working on (I did not try it on the test link), I still could not get at the information inside the nested HTML.

I am certain that my code for Selenium works as intended. Any hints on how I should proceed?
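
For context, my Selenium attempt looks roughly like this (again with a placeholder URL):

```python
from selenium import webdriver

# Placeholder URL; this mirrors the attempt described above.
driver = webdriver.Chrome()
driver.get("http://example.com/page-with-nested-doctype")

# Even after the page's JS has run, page_source still only contains
# the outer document, not the nested <!DOCTYPE html>.
html = driver.page_source
driver.quit()
```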


1 Answer


I solved my own question.

The answer is outlined in the steps below:

  1. Use an actual browser, preferably Chrome, and visit the website in question.

  2. Observe and note the GET/POST requests in the XHR tab under the Network section in Chrome (right-click the page and click "Inspect Element").

  3. From there, we replicate each GET/POST request in Python (see the sketch below).

  4. For each GET/POST request, we can then scrape the information normally.

No need to use Selenium!
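
To illustrate steps 3 and 4, here is a sketch that replicates one such request with requests. The endpoint, parameters, and headers below are hypothetical placeholders; substitute whatever the XHR tab actually shows for your site:

```python
import requests

session = requests.Session()

# Hypothetical endpoint and parameters -- copy the real ones from the
# XHR tab in Chrome's Network section:
response = session.get(
    "http://example.com/ajax/productDetails.jsp",
    params={"id": "12345"},
    headers={"X-Requested-With": "XMLHttpRequest"},
)
response.raise_for_status()

# The endpoint returns its payload directly (JSON or an HTML fragment),
# so it can be parsed without rendering any JavaScript.
print(response.text)
```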
