I am trying to recursively crawl a Wikipedia URL for all English article links. I want to perform a depth-first traversal to depth n, but for some reason my code is not recursing on every pass. Any idea why?
def crawler(url, depth):
    if depth == 0:
        return None
    links = bs.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
    print("Level ", depth, " ", url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org" + link['href'], depth - 1)
This is how I call the crawler:
url = "https://en.wikipedia.org/wiki/Harry_Potter"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
crawler(url, 3)
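For comparison, here is a minimal sketch of a depth-first traversal in which each page's links are fetched inside the recursive call rather than once up front. The fetching is abstracted behind a `fetch_links` parameter (an illustrative name, not from the code above) so the traversal logic can be demonstrated without live HTTP; a real crawler would download and parse each page inside that function:

```python
def crawl(url, depth, fetch_links, visited=None):
    # Depth-first traversal: stop when the depth budget is exhausted.
    if depth == 0:
        return []
    if visited is None:
        visited = set()
    if url in visited:
        return []
    visited.add(url)
    order = [url]
    # fetch_links(url) must return the outgoing links of *this* page;
    # fetching inside the recursion is what makes each level see new links.
    for link in fetch_links(url):
        order.extend(crawl(link, depth - 1, fetch_links, visited))
    return order

# Usage with a fake link graph instead of live HTTP:
graph = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],
}
print(crawl("/wiki/A", 3, lambda u: graph.get(u, [])))
# depth-first order: ['/wiki/A', '/wiki/B', '/wiki/C']
```

The `visited` set also prevents revisiting pages, which matters on Wikipedia where articles link back to each other heavily.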