Python BeautifulSoup not processing entire page

Asked Jul 26 '14 at 19:51

Active Dec 03 '18 at 12:00

Viewed 18 times

I'm writing a tool that returns every link on a web page. With help from another Stack Overflow question, I came up with this code:

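import httplib2
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://example.com'  # placeholder; replace with the page to scan

http = httplib2.Http()
# request() returns (response headers, body); the body holds the raw HTML
status, response = http.request(url)

# Parse only the <a> tags and print each href found in the static HTML
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])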

I thought it worked, but then I noticed that many links are missing. I believe this is because some of the links are generated by JavaScript after the DOM is built, and BeautifulSoup does not take that into account.

edited Dec 03 '18 at 12:00

Cœur

asked Jul 26 '14 at 19:51

Fernando Retimo

BeautifulSoup cannot execute JavaScript; it is not a browser. Use an actual browser (even one that is not displaying anything, a headless browser) to do so and get a full DOM. Selenium or PhantomJS or similar tools can all do that for you, driven by Python. – Martijn Pieters Jul 26 '14 at 19:55
Thank you. Do you have a code sample that could give me a lead? – Fernando Retimo Jul 26 '14 at 19:56
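
As a possible starting point for the approach suggested in the comment above, here is a minimal sketch that lets a headless browser render the page and then hands the resulting DOM to BeautifulSoup. It assumes Python 3, Selenium with a local Chrome and matching driver, and `bs4` installed. The function names `fetch_rendered_html` and `extract_links` are made up for illustration, and resolving relative links with `urljoin` is an addition that was not in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the DOM after scripts have run.

    Assumption: Selenium plus a Chrome driver are available locally.
    Selenium is imported here so extract_links() works without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the current DOM, including script-added nodes
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down


def extract_links(html, base_url=""):
    """Return the href of every <a> tag in html, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```

With a real address in `url`, calling `extract_links(fetch_rendered_html(url), url)` should return the links present once the page's scripts have run. Pages that add links some time after the initial load would probably need an explicit wait before `page_source` is read.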

0 Answers