
I have to create my own web crawler (for educational purposes) that crawls through every Bulgarian website (.bg domain), or as many as possible, and records the server each one is running on, using the `curl -I` command in the Linux shell or the `requests` library. As a starting point I'm using a big directory-style website that contains links to many other websites.

So I have to check every link on every site, find the server it's running on, and push the result into a database. The tricky part is that I need to open every link and go deeper, opening other links (like a tree). The idea is to use a BFS algorithm, keep the visited sites in a list, and add every link I haven't visited yet. I'm also only interested in the base URL, not the relative pages inside the site, since I only care about the server the site is running on. In other words, I should check example.bg only once, and I'm not interested in example.bg/xyz/xyz/....

I don't really know where to start, so I'm interested in a general algorithm for solving this problem using Beautiful Soup and requests.

Boyan Kushlev
  • Look into `Scrapy` or `PySpider` - they would solve a huge part of the problem for you right away: http://stackoverflow.com/questions/27243246/can-scrapy-be-replaced-by-pyspider. – alecxe Feb 12 '16 at 19:51

1 Answer


As you say, you'll need a graph traversal algorithm such as BFS or DFS. I would start by thinking about how to adapt one of these algorithms to your purpose, which basically means marking each website as visited. If you aren't familiar with them, here is a link for reference: http://www.geeksforgeeks.org/depth-first-traversal-for-a-graph/
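A minimal BFS sketch, assuming the graph is given as an adjacency-list dict; in your crawler the "neighbours" of a site would be the links you find on its pages, but the visited-set pattern is the same:

```python
from collections import deque

def bfs(graph, start):
    visited = {start}           # mark each site as visited so it is processed only once
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

# Toy example:
# bfs({"a": ["b", "c"], "b": ["d"], "c": [], "d": []}, "a")  ->  ["a", "b", "c", "d"]
```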

Secondly, you can use Beautiful Soup (together with requests) to pull the data of interest out of the HTML pages.
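A rough sketch of how the two parts could fit together, assuming `requests` and `beautifulsoup4` are installed; the seed URL, the `limit` parameter, and the helper names (`base_url`, `crawl`) are placeholders for illustration, and the returned dict stands in for your database insert:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def base_url(url):
    """Reduce any link to scheme://host so each site is visited only once."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

def crawl(seed, limit=100):
    visited = set()
    queue = deque([base_url(seed)])
    servers = {}
    while queue and len(visited) < limit:
        site = queue.popleft()
        if site in visited:
            continue
        visited.add(site)
        try:
            response = requests.get(site, timeout=5)
        except requests.RequestException:
            continue
        # The Server response header is what `curl -I` would show.
        servers[site] = response.headers.get("Server", "unknown")
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            absolute = urljoin(site, anchor["href"])
            host = urlparse(absolute).hostname or ""
            if host.endswith(".bg"):          # stay inside the .bg domain
                link = base_url(absolute)
                if link not in visited:
                    queue.append(link)
    return servers  # push these into your database instead of returning them
```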

melalonso