I have to create my own web crawler (for educational purposes) that crawls through every single Bulgarian website (the .bg domain), or as many as possible, and returns the server each one is running on, using the `curl -I` command in the Linux shell or the `requests` library. As a starting point I'm using a big database-like website that contains links to many other sites.
So I have to follow every link on every site, check the server it's running on, and push the result into a database. The tricky part is that I need to open every link and go deeper, opening further links (like a tree). So the idea is to use a BFS algorithm, keep the visited sites in a list, and enqueue every link I haven't visited yet. I'm also only interested in the base URL, not the relative pages inside the site, since all I care about is the server the site is running on. In other words, I should only check `example.bg` once, and I'm not interested in `example.bg/xyz/xyz/...`.
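To make the question concrete, here is the rough shape I'm picturing, though I'm not sure it's the right way to structure it (`crawl`, `seed_url`, and `max_sites` are just placeholder names, and the error handling is minimal):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_sites=1000):
    """BFS over .bg sites, recording the Server header once per base URL."""
    visited = set()               # base URLs already checked
    queue = deque([seed_url])
    servers = {}                  # base URL -> Server header

    while queue and len(visited) < max_sites:
        url = queue.popleft()
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"   # only the base URL matters
        if base in visited or not parsed.netloc.endswith(".bg"):
            continue
        visited.add(base)

        try:
            response = requests.get(base, timeout=5)
        except requests.RequestException:
            continue

        servers[base] = response.headers.get("Server")  # may be None

        # Collect links from the page and enqueue them (relative links resolved against base)
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(base, a["href"]))

    return servers
```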
I don't really know where to start, so I'm interested in a general algorithm for solving this problem using Beautiful Soup and `requests`.