I'm writing a tool that returns every link on a web page. With help from another Stack Overflow question, I came up with this code:

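import httplib2
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://example.com'  # placeholder; substitute the page to scan

http = httplib2.Http()
# request() returns a (response headers, body) pair
status, response = http.request(url)

# Parse only the <a> tags and keep those that carry an href attribute
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])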

I thought it worked, but then I noticed that many links are missing. I believe this is because some of the links are generated by JavaScript when the page's DOM is built in the browser, and BeautifulSoup does not take that into account.
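To illustrate the suspicion, here is a minimal sketch using only the standard library's `html.parser`. The page content in `PAGE` and the `LinkCollector` class are made up for the example. One link is present in the markup and one would only exist after the script runs in a browser, so a parser that reads the raw HTML only ever sees the first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical page: one link in the markup, one created by a script.
PAGE = """
<html><body>
<a href="/static-link">static</a>
<script>
  var a = document.createElement('a');
  a.href = '/dynamic-link';
  document.body.appendChild(a);
</script>
</body></html>
"""

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)  # ['/static-link']
```

The script is never executed, so `/dynamic-link` does not appear. The same would apply to any parser that works on the downloaded HTML alone.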

Cœur
Fernando Retimo
  • BeautifulSoup cannot execute JavaScript; it is not a browser. Use an actual browser (even one that is not displaying anything, a headless browser) to do so and get a full DOM. Selenium or PhantomJS or similar tools can all do that for you, driven by Python. – Martijn Pieters Jul 26 '14 at 19:55
  • Thank you. Do you have a code sample that could give me a lead? – Fernando Retimo Jul 26 '14 at 19:56
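As a possible lead, here is a rough sketch of the headless-browser approach suggested in the comment above, assuming Selenium 4 with Chrome and a matching driver available on the machine. The function names `unique_hrefs` and `rendered_links` are invented for this example. It is one way to do it, not the only one.

```python
def unique_hrefs(hrefs):
    """Drop empty values and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def rendered_links(url):
    """Load url in headless Chrome and return hrefs found in the live DOM."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the browser runs the page's JavaScript
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return unique_hrefs(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `rendered_links(url)` with the address of the page should return the links present after the scripts have run. Links that a page adds some time after loading may still be missed unless an explicit wait is added before collecting the anchors.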
