I am having trouble scraping some websites that contain more than one <!DOCTYPE html> declaration.
I am using Python 2.7.9 with requests and BeautifulSoup from bs4. When I call requests.get(url), the result contains the text from the outer <!DOCTYPE html> but NOT from the second, inner <!DOCTYPE html>.
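To make the symptom concrete, here is a minimal sketch of how one could check how many doctype declarations actually arrive in the fetched text. It runs on a hard-coded sample string rather than the real page. The names `DoctypeCounter` and `summarize_html` and the sample markup are illustrative assumptions, not taken from the site. It also lists any iframe sources, on the unconfirmed assumption that the inner document might be loaded through an iframe.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class DoctypeCounter(HTMLParser):
    """Counts doctype declarations and collects iframe src values."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.doctypes = 0
        self.iframe_srcs = []

    def handle_decl(self, decl):
        # For "<!DOCTYPE html>" the parser passes "DOCTYPE html" here.
        if decl.lower().startswith("doctype"):
            self.doctypes += 1

    def handle_starttag(self, tag, attrs):
        # Assumption: a nested document may be referenced by an iframe.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.iframe_srcs.append(src)


def summarize_html(text):
    """Return (number of doctypes, list of iframe src values) for text."""
    parser = DoctypeCounter()
    parser.feed(text)
    parser.close()
    return parser.doctypes, parser.iframe_srcs


# Hypothetical stand-in for the text returned by requests.get(url).text
sample = (
    "<!DOCTYPE html><html><body>"
    "<p>outer</p>"
    "<iframe src='/inner.html'></iframe>"
    "</body></html>"
)
print(summarize_html(sample))
```

If the count is 1 for the real response, the inner document was never part of the downloaded text, so no parser setting on the BeautifulSoup side would recover it.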
My question: is there a way, preferably using Python, to scrape all the information from a website that contains more than one <!DOCTYPE html>?
This person seems to have the same problem, but the question received no answers: https://stackoverflow.com/questions/27259682/mechanize-cutting-off-html-early-python
Any help would be appreciated! Thanks.
Update v1:
I looked around Stack Overflow and found this post: "Using Python requests.get to parse html code that does not load at once"
The test link is http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/.
Note that the test link is not the link I am working with, but the idea is much the same. Both sites use JS to load the additional information (I should have stated this earlier, but I did not realize it until now, sorry!).
After trying out Selenium to load the page I am working on (I did not try it on the test link), I still could not get the information inside the nested html.
I am certain that my code for Selenium works as intended. Any hints on how I should proceed?