
I have been building a crawler in Python for the last 10 months. The crawler uses threading and a Queue to hold all the visited and non-visited links.

I use BeautifulSoup and requests to access the URLs and pick up the page title, meta description, keywords, CMS system and more.

At the moment the crawler checks the first seed URL for data, and when it's done scraping it finds new links on the current page in a single thread, then keeps repeating the process.

Everything works fine, except that when I want to stop the scraping process, the threads don't stop and it just keeps running. I have added a lock-protected variable that counts the number of pages crawled, but when the limit is reached it only stops the crawl process, not the link-finding process.
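What I imagine I need is a shared stop flag that every worker checks on each loop iteration, set by whichever worker reaches the page limit. A simplified sketch of that idea, not my actual code (the names, the limit and the worker count are placeholders):

```python
import queue
import threading

import requests
from bs4 import BeautifulSoup

MAX_PAGES = 100                      # placeholder crawl limit
stop_event = threading.Event()       # set once the limit is reached
url_queue = queue.Queue()            # non-visited links
visited = set()                      # visited links
visited_lock = threading.Lock()


def worker():
    # Every worker re-checks the shared stop flag on each iteration.
    while not stop_event.is_set():
        try:
            url = url_queue.get(timeout=1)   # short timeout so the flag is re-checked
        except queue.Empty:
            continue
        with visited_lock:
            if len(visited) >= MAX_PAGES:
                stop_event.set()             # tell every worker to wind down
                continue
            if url in visited:
                continue
            visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        # ... scrape title, meta description, etc. and put newly found links on url_queue ...


url_queue.put("https://example.com")         # placeholder seed
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                 # returns once every worker has seen the flag
```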

Would the setup below be better?

  1. A Link class that has one job.

This class would use threads to find all the links on a web page, or, if a limit is set, only return the number of links requested.

  2. A Scraper class that has one job.

This class would run through all the links returned from the Link class and collect the requested data using threads (see the skeleton below).
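A rough skeleton of what I have in mind, just to make the idea concrete (class names, methods and the fields collected are placeholders, nothing is final):

```python
import queue
import threading

import requests
from bs4 import BeautifulSoup


class LinkFinder:
    """Fetches a page and returns the links found on it (optionally capped)."""

    def __init__(self, limit=None):
        self.limit = limit

    def find_links(self, url):
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        links = [a["href"] for a in soup.find_all("a", href=True)]
        return links[: self.limit] if self.limit else links


class Scraper:
    """Runs through the links from LinkFinder and collects page data with threads."""

    def __init__(self, links, workers=4):
        self.queue = queue.Queue()
        for link in links:
            self.queue.put(link)
        self.workers = workers
        self.results = []
        self.lock = threading.Lock()

    def _work(self):
        while True:
            try:
                url = self.queue.get_nowait()
            except queue.Empty:
                return
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            data = {"url": url, "title": soup.title.string if soup.title else None}
            with self.lock:
                self.results.append(data)

    def run(self):
        threads = [threading.Thread(target=self._work) for _ in range(self.workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results
```

As the skeleton shows, every URL would be requested once by `LinkFinder` and again by `Scraper`, which is exactly my issue below.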

My Issue:

If I create the new setup, then the Link class would send requests to the URLs, and so would the Scraper class.

So if my Link class returns 100 URLs, I would send a minimum of 100 requests, and when the Scraper class goes through the links found by the Link class, it adds 100 more requests.

So instead of 100 requests I end up sending 200 requests.

  • I could pass the soup objects from the Link class on in some array, but as I see it I would then end up with a memory problem (one alternative is sketched below).
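One alternative I can think of, sketched very roughly below, is to do the request and the parsing in one pass and only pass small plain values downstream (a dict of scraped fields plus the list of links), so the soup object can be garbage-collected straight away. All names here are placeholders:

```python
import requests
from bs4 import BeautifulSoup


def fetch_and_extract(url):
    """One request per URL: parse once, return only small plain values,
    and let the large soup object be garbage-collected straight away."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    description_tag = soup.find("meta", attrs={"name": "description"})
    data = {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "description": description_tag.get("content") if description_tag else None,
    }
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return data, links   # hand both downstream; no second request, no soup kept around
```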

What is your advice?

Dannie
  • My advice is: don't use threads, use [trio](https://trio.readthedocs.io/); it is designed for (among other things) exiting gracefully. – L3viathan Aug 02 '18 at 14:16
  • Hi @L3viathan, thanks, I'll have a look at Trio. Do you think it would be a good idea to split up the crawler? – Dannie Aug 04 '18 at 09:19
  • Yes, I do. I've written a simple crawler in Trio before, and I used separate workers for the actual crawling (several of them to do parallel requests), for analysing the responses and extracting new links, and one for putting the links back on the crawling queue. – L3viathan Aug 04 '18 at 09:35
  • How would you handle the issue with multiple requests? – Dannie Aug 04 '18 at 11:06
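(For reference, a very rough trio sketch of the worker split described in the comments above: several fetch workers, an analyser task, and memory channels feeding newly found links back to the fetchers. The worker count and the use of httpx as an async HTTP client under trio are assumptions, and deduplication plus a stop condition are deliberately left out.)

```python
import math

import httpx
import trio
from bs4 import BeautifulSoup


async def fetcher(client, url_receive, page_send):
    # Several of these run in parallel; each fetches a URL and passes the HTML on.
    async for url in url_receive:
        response = await client.get(url)
        await page_send.send((url, response.text))


async def analyser(page_receive, url_send):
    # Extracts data and new links, and feeds the links back to the fetchers.
    async for url, html in page_receive:
        soup = BeautifulSoup(html, "html.parser")
        print(url, soup.title.string if soup.title else None)   # placeholder for real scraping
        for a in soup.find_all("a", href=True):
            await url_send.send(a["href"])


async def main(seed):
    url_send, url_receive = trio.open_memory_channel(math.inf)
    page_send, page_receive = trio.open_memory_channel(math.inf)
    async with httpx.AsyncClient() as client:
        async with trio.open_nursery() as nursery:
            for _ in range(4):               # parallel fetch workers
                nursery.start_soon(fetcher, client, url_receive, page_send)
            nursery.start_soon(analyser, page_receive, url_send)
            await url_send.send(seed)


trio.run(main, "https://example.com")
```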

0 Answers