
I have this code:

import urllib
from bs4 import BeautifulSoup

base_url='https://en.wikipedia.org'
start_url='https://en.wikipedia.org/wiki/Computer_programming'
outfile_name='Computer_programming.csv'
no_of_links=10

fp=open(outfile_name, 'wb')

def get_links(link):
    html = urllib.urlopen(link).read()
    soup = BeautifulSoup(html, "lxml")
    ret_list=soup.select('p a[href]')
    count=0
    ret=[]
    for tag in ret_list:
        link=tag['href']
        if link[0]=='/' and ':' not in link and link[:5]=='/wiki' and '#' not in link:
            ret.append(base_url+link)
            count=count+1
        if count==no_of_links:
            return ret

l1=get_links(start_url)
for link in l1:
    fp.write('%s;%s\n'%(start_url,link))

for link1 in l1:
    l2=get_links(link1)
    for link in l2:
        fp.write('%s;%s\n'%(link1,link))

    for link2 in l2:
        l3=get_links(link2)
        for link in l3:
            fp.write('%s;%s\n'%(link2,link))

fp.close()

It saves a neighborhood of nodes to a CSV file. But when I try to run it, I get this error:

for link in l3:

TypeError: 'NoneType' object is not iterable

I get the same error when I try to run the code for another Wikipedia link, like https://en.wikipedia.org/wiki/Technology. The only page on which it works is https://en.wikipedia.org/wiki/Computer_science, and that's a problem, since I need to collect data from more pages than just the Computer science one.

Can anyone give me a hint on how to deal with it?

Thanks a lot.

Lila
  • You should debug your program line by line. It seems that at some point in get_links, count never reaches no_of_links, so the function returns None. – Eugene Primako Jan 02 '16 at 15:51
  • The empty links should be skipped, not stop the program; that's the problem. – Lila Jan 02 '16 at 16:08
  • And what if there are fewer than 10 links on a page? Try returning ret at the end of the function. – Eugene Primako Jan 02 '16 at 16:13
  • 1
  • Why do you loop over the links twice, `for link in l1:` and then `for link1 in l1:`? You could just combine those loops. – OneCricketeer Jan 02 '16 at 16:14
  • Also, it looks like you're doing breadth-first search. I would recommend a recursive function with a depth limit instead of nesting sequences of for loops; see the sketch after these comments. – OneCricketeer Jan 02 '16 at 16:21
  • I always like to code in small segments to make sure everything works before writing 100 lines and then trying to figure out what went wrong. – birdoftheday Jan 02 '16 at 16:28
  • Parsing HTML is rarely the best approach for anything. Try using the [links API](https://www.mediawiki.org/wiki/API:Links) ([example](https://en.wikipedia.org/w/api.php?action=query&titles=Computer%20programming&prop=links&pllimit=500)); a second sketch below shows the idea. – Tgr Jan 04 '16 at 06:38
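To make the recursive suggestion concrete, here is a minimal sketch. It assumes a get_links that always returns a list (as in the answer below); the function name crawl and the depth parameter are illustrative, not part of the original code:

def crawl(url, depth, fp):
    # depth counts how many more levels of links to follow
    if depth == 0:
        return
    for link in get_links(url):
        fp.write('%s;%s\n' % (url, link))
        crawl(link, depth - 1, fp)

crawl(start_url, 3, fp)  # covers the same three levels as the nested loops

Like the original loops, this makes no attempt to skip pages that have already been visited.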
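And to illustrate the links API route: the sketch below uses only documented query parameters (action=query, prop=links, plnamespace, pllimit, format) and the same Python 2 urllib style as the question's code; get_links_api is a made-up name, and note that the API returns page titles rather than URLs:

import json
import urllib

def get_links_api(title, limit=10):
    # Ask the MediaWiki API for the page's links instead of scraping HTML.
    params = urllib.urlencode({
        'action': 'query',
        'titles': title,
        'prop': 'links',
        'plnamespace': 0,   # article namespace only
        'pllimit': limit,
        'format': 'json',
    })
    url = 'https://en.wikipedia.org/w/api.php?' + params
    data = json.loads(urllib.urlopen(url).read())
    # The response keys pages by page id; missing pages have no 'links'.
    for page in data['query']['pages'].values():
        for link in page.get('links', []):
            yield link['title']

# e.g. list(get_links_api('Computer programming'))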

1 Answer


If ret_list is empty, or if there are fewer matching links than requested, the loop finishes without ever reaching the explicit return statement, so the function implicitly returns None when it falls off the end.
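As a minimal illustration of that behavior (a toy function, not part of the code above):

def first_positive(numbers):
    for n in numbers:
        if n > 0:
            return n
    # no explicit return here, so falling off the end yields None

print first_positive([-1, -2])  # prints None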

Without going into other problems with this code, you probably want something like this instead:

def get_links(link):
    html = urllib.urlopen(link).read()
    soup = BeautifulSoup(html, "lxml")
    ret_list=soup.select('p a[href]')
    count=0
    ret=[]
    for tag in ret_list:
        link=tag['href']
        if link[0]=='/' and ':' not in link and link[:5]=='/wiki' and '#' not in link:
            ret.append(base_url+link)
            count=count+1
        if count==no_of_links:
            break  # stop early; execution still falls through to the return below
    return ret
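
Now every code path returns ret, so the callers always get a list back, even when a page yields fewer than no_of_links matching links, and the for loops no longer fail with the TypeError.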
tripleee