
I'm writing a web crawler with the ultimate goal of creating a map of the path the crawler has taken. While I have no idea at what rate other, and most definitely better, crawlers pull down pages, mine clocks in at about 2,000 pages per minute.

The crawler works on a recursive backtracking algorithm which I have limited to a depth of 15. Furthermore, to prevent the crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list and checks that list before following the next candidate URL.

for href in tempUrl:
    ...
    if href not in urls:
        collect(href, parent, depth + 1)

This method seems to become a problem by the time it has pulled down around 300,000 pages; at that point the crawler has been averaging about 500 pages per minute.

So my question is: what is another way of achieving the same functionality while improving efficiency?

I've thought that decreasing the size of each entry might help, so instead of appending the entire URL, I append the first two and last two characters of each URL as a string. This, however, hasn't helped.

Is there a way I could do this with sets or something?

Thanks for the help

edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.


4 Answers


Perhaps you could use a set instead of a list for the urls that you have seen so far.
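A rough sketch of what that change might look like, reusing the `urls`, `tempUrl`, and `collect` names from the question (everything else is assumed to work as in the original crawler):

urls = set()  # was a list

for href in tempUrl:
    if href not in urls:        # average-case O(1) membership test on a set
        urls.add(href)          # was urls.append(href) on the list
        collect(href, parent, depth + 1)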

Bill Lynch
  • Would I just check it the same way as I would a list? Would the performance really improve with such a simple change? D: – danem Jun 28 '11 at 16:18
  • +1 as something to try, checking existence in a set should be O(1) iirc, and I've seen amazing speedups for checking existence by switching to sets from lists – Davy8 Jun 28 '11 at 16:19
  • @Pete, yes -- testing for membership in sets and dicts is O(1), in lists and tuples, O(n) in the length of the list/tuple. So for long lists/tuples it makes a _huge_ difference. (+1 btw.) – senderle Jun 28 '11 at 16:19
  • @Pete give it a try, sets are ***extremely*** optimized if order doesn't matter. – Davy8 Jun 28 '11 at 16:19
  • @Pete, et al, see the edit to [this answer](http://stackoverflow.com/questions/4882428/help-me-optimize-this-python-code-project-euler-question-23/4882525#4882525) for a dramatic example of this speedup. – senderle Jun 28 '11 at 16:24
  • And FYI @Pete if you didn't know, O(1) means that it takes the same amount of time regardless of how big the set is, whereas with a list O(n) means that if the list is 10 times bigger, checking the existence can take up to 10 times longer (in the worst case if it's at the end of the list) – Davy8 Jun 28 '11 at 16:26
  • Wow I hardly see any performance drain at 500,000+ members of the set. Thanks a lot! @Dave Thanks for the explanation. If you couldn't tell, I am no computer scientist heh... – danem Jun 28 '11 at 16:32

Simply replace your "list of crawled URLs" with a "set of crawled URLs". Sets are optimised for random access (using the same hashing algorithm that dictionaries use) and they're a heck of a lot faster. A lookup in a list is done with a linear search, so it's not particularly fast. You won't need to change the actual code that does the lookup.

Check this out.

In [3]: timeit.timeit("500 in t", "t = list(range(1000))")
Out[3]: 10.020853042602539

In [4]: timeit.timeit("500 in t", "t = set(range(1000))")
Out[4]: 0.1159818172454834
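
If the crawler already holds a list of visited URLs, it can be converted once up front; a one-line sketch, assuming the variable is called `urls` as in the question:

urls = set(urls)  # one-time conversion; later `href not in urls` checks are unchanged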
Noufal Ibrahim

I had a similar problem and ended up profiling various methods (list/file/set/sqlite) for memory versus time; see the two posts below. In the end sqlite was the best choice. You can also hash each URL to reduce its stored size (a rough sketch of that follows the links).

Searching for a string in a large text file - profiling various methods in python

sqlite database design with millions of 'url' strings - slow bulk import from csv
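
For illustration, a minimal sketch of the sqlite-plus-hash idea; the table name, column name, and the choice of MD5 are my own assumptions, not something specified in the linked posts:

import hashlib
import sqlite3

conn = sqlite3.connect("crawler.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (url_hash TEXT PRIMARY KEY)")

def already_seen(url):
    # Store a fixed-length hex digest instead of the full URL to keep rows small.
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    return conn.execute("SELECT 1 FROM seen WHERE url_hash = ?", (h,)).fetchone() is not None

def mark_seen(url):
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE makes re-adding a known URL a no-op.
    conn.execute("INSERT OR IGNORE INTO seen (url_hash) VALUES (?)", (h,))
    conn.commit()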

user

Use a dict with the urls as keys instead (O(1) access time).

But a set will also work. See

http://wiki.python.org/moin/TimeComplexity
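
A tiny sketch of the dict variant, for illustration; only the keys matter, and the stored value is just a placeholder:

seen = {}

url = "http://example.com/"
if url not in seen:      # key lookup in a dict is also average-case O(1)
    seen[url] = True     # the value is irrelevant; only the key is used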