
I'm using the following to get all external JavaScript references from a web page. How can I modify the code to search not only this URL, but all pages of the website?

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://stackoverflow.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
    if link.has_key('src'):
        if 'http' in link['src']:
            print link['src']

Below is my first attempt at making it scrape two pages deep. Any advice on how to make it return only unique URLs? As it is, most of them are duplicates. (Note that all internal links contain the word "index" on the sites I need to run this on.)

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

site = 'http://www.stackoverflow.com/'
http = httplib2.Http()
status, response = http.request(site)

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if 'index' in link['href']:
            page = site + link['href']
            status, response = http.request(page)

            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                if link.has_key('src'):
                    if 'http' in link['src']:
                        print "script" + " " + link['src']
            for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                print "iframe" + " " + iframe['src']

            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
                if link.has_key('href'):
                    if 'index' in link['href']:
                        page = site + link['href']
                        status, response = http.request(page)

                        for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                            if link.has_key('src'):
                                if 'http' in link['src']:
                                    print "script" + " " + link['src']
                        for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                            print "iframe" + " " + iframe['src']
    There's not really a way to know _all_ of the possible pages of a website, but you could scrape for any relative hrefs in anchor tags and recurse using those. – perfect5th Oct 02 '17 at 17:14

1 Answer


Crawling websites is a vast subject: you have to decide how to index content and how to crawl deeper into the website, and it includes content parsing of the kind your rudimentary crawler or spider is already doing. It is definitely non-trivial to write a bot anywhere near as good as Google's bot. Professional crawling bots do a lot of work, which may include:

  • Monitoring domain-related changes to initiate crawls
  • Scheduling sitemap lookups
  • Fetching web content (which is the scope of this question)
  • Fetching the set of links for further crawling
  • Adding weights or priorities to each URL (see the sketch after this list)
  • Monitoring when services on the website go down
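
As a tiny illustration of the weights/priorities point, one common approach is to keep the frontier in a priority queue and pop the lowest weight first. The weighting rule and URLs below are made up purely for illustration:

import heapq
from urlparse import urlparse   # urllib.parse on Python 3

def priority(url, home_netloc):
    # Smaller number = crawled sooner; this rule is invented just to show the mechanism.
    return 0 if urlparse(url).netloc == home_netloc else 1

home = 'www.stackoverflow.com'
frontier = []
for url in ['http://www.stackoverflow.com/questions',
            'http://example.com/ad',
            'http://www.stackoverflow.com/tags']:
    heapq.heappush(frontier, (priority(url, home), url))

while frontier:
    weight, url = heapq.heappop(frontier)
    print "%d %s" % (weight, url)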

For just doing a crawl of a specific website like Stack Overflow, I have modified your code for recursive crawling. It would be trivial to convert this code further to a multi-threaded form. It uses a Bloom filter to make sure it does not crawl the same page twice. Let me warn you upfront: there will still be unexpected pitfalls while crawling. Mature crawling software like Scrapy, Nutch or Heritrix does a much better job.

import requests
from bs4 import BeautifulSoup as Soup, SoupStrainer
from bs4.element import Tag
from bloom_filter import BloomFilter
from Queue import Queue                  # queue on Python 3
from urlparse import urljoin, urlparse   # urllib.parse on Python 3

# Bloom filter of URLs already fetched; probabilistic, so a tiny fraction
# of pages may be wrongly reported as visited.
visited = BloomFilter(max_elements=100000, error_rate=0.1)
# FIFO frontier of URLs still waiting to be fetched.
visitlist = Queue()

def isurlabsolute(url):
    # An absolute URL has a network location (scheme://host/...).
    return bool(urlparse(url).netloc)

def visit(url):
    print "Visiting %s" % url
    visited.add(url)
    return requests.get(url)

def parsehref(response):
    # Queue every link on the page that has not been visited yet.
    if response.status_code == 200:
        for link in Soup(response.content, 'lxml', parse_only=SoupStrainer('a')):
            if type(link) == Tag and link.has_attr('href'):
                href = link['href']
                if not isurlabsolute(href):
                    # Resolve relative links against the page that contained them.
                    href = urljoin(response.url, href)
                href = str(href)
                if href not in visited:
                    visitlist.put_nowait(href)
                else:
                    print "Already visited %s" % href
    else:
        print "Got issues mate"

if __name__ == '__main__':
    visitlist.put_nowait('http://www.stackoverflow.com/')
    while not visitlist.empty():
        url = visitlist.get()
        if url not in visited:
            # The same URL can be queued several times before its first visit,
            # so check the filter again here to avoid refetching it.
            resp = visit(url)
            parsehref(resp)
        visitlist.task_done()
    visitlist.join()
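
The crawler above only queues <a> links. To get back to your original goal of listing external JavaScript references, a couple of helpers could be bolted on. This is only a sketch, and the single-domain restriction (ALLOWED_NETLOC) is my own assumption, not something your question requires:

from bs4 import BeautifulSoup as Soup, SoupStrainer
from bs4.element import Tag
from urlparse import urlparse   # urllib.parse on Python 3

ALLOWED_NETLOC = 'stackoverflow.com'   # assumed domain to stay on

def parsescripts(response):
    # Print every external <script src=...> found on the fetched page.
    for script in Soup(response.content, 'lxml', parse_only=SoupStrainer('script')):
        if type(script) == Tag and script.has_attr('src') and script['src'].startswith('http'):
            print "script %s" % script['src']

def samesite(url):
    # True only for URLs on the allowed domain, so the crawl stays on one site.
    return ALLOWED_NETLOC in urlparse(url).netloc

In the main loop you would call parsescripts(resp) right after parsehref(resp), and inside parsehref you would check samesite(href) before calling visitlist.put_nowait(href).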