It's normal for the spider to request robots.txt; that's where the rules are. robots.txt is essentially a blacklist of URLs that the crawler should not visit, specified with a glob/regex-like syntax. Scrapy reads robots.txt and translates those rules into code. During the crawl, whenever the spider encounters a URL, it first validates it against the rules generated from robots.txt to confirm that the URL may be visited. If the URL is not blacklisted by robots.txt, Scrapy visits it and delivers a Response.
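
To make that validation step concrete, here is a minimal sketch of the check using Python's standard-library robots.txt parser rather than Scrapy's own middleware; the rule and the URLs are made up for illustration, but the logic mirrors what happens before each request is made.

from urllib.robotparser import RobotFileParser

# One rule in the style of the example robots.txt shown further down
# (hypothetical input, for illustration only).
rules = """\
User-Agent: *
Disallow: /vote?
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Blacklisted by the Disallow rule: the request is never made.
print(rp.can_fetch("*", "https://example.com/vote?id=42"))  # False
# Not matched by any rule: the URL may be visited.
print(rp.can_fetch("*", "https://example.com/item?id=42"))  # True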
robots.txt is not only for blacklisting URLs; it can also specify the speed at which the crawl may happen. Here is an example robots.txt:
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
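
On the Scrapy side, this behaviour is controlled from the project's settings. A minimal sketch, assuming a standard project layout (ROBOTSTXT_OBEY and DOWNLOAD_DELAY are the relevant settings; as far as I know Scrapy does not apply the Crawl-delay directive by itself, so the delay is usually mirrored by hand):

# settings.py (sketch)
ROBOTSTXT_OBEY = True   # download robots.txt and drop disallowed requests
DOWNLOAD_DELAY = 30     # seconds between requests, matching Crawl-delay above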