
I saw this post on making Scrapy crawl any site without the allowed_domains restriction.

Is there a better way of doing it, such as using a wildcard or regular expression in the allowed_domains variable, like this:

allowed_domains = ["*"]

I hope there is some way to do this other than hacking the Scrapy framework itself.

hrishikeshp19

2 Answers


Don't set allowed_domains at all. When it is unset or empty, the offsite middleware lets requests to every domain through.

Look at the get_host_regex() function in this Scrapy file:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
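
As a rough illustration of why leaving allowed_domains unset works, here is a minimal standalone sketch of the kind of logic get_host_regex() implements. It does not import Scrapy. The function name host_regex and the exact pattern are assumptions made for illustration, not a copy of the linked source, so check the file itself for the real code.

```python
import re


def host_regex(allowed_domains=None):
    """Hypothetical stand-in for the middleware's host-matching logic."""
    if not allowed_domains:
        # An empty pattern matches any string, so no host is filtered out.
        return re.compile("")
    # Each domain is escaped, so entries are literal names, not patterns.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    # Match the domain itself or any subdomain of it.
    return re.compile(r"^(.*\.)?(%s)$" % domains)


allow_all = host_regex()
restricted = host_regex(["example.com"])

print(bool(allow_all.search("anything.example.org")))  # True
print(bool(restricted.search("sub.example.com")))      # True
print(bool(restricted.search("example.org")))          # False
```

Under this sketch, a value such as ["*"] would be escaped and would only match a host literally named "*", which suggests why a wildcard entry is not the way to lift the restriction.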

Shawn Lewis

You should deactivate the offsite middleware, which is a built-in spider middleware in Scrapy. For more information, see http://doc.scrapy.org/en/latest/topics/spider-middleware.html
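
A minimal sketch of what that could look like in a project's settings.py. Mapping a middleware to None in SPIDER_MIDDLEWARES is Scrapy's documented way of disabling it. The class path used here is an assumption based on the module linked in the other answer; newer Scrapy releases use a different path, so check the documentation for your installed version.

```python
# settings.py
# Assumption: this class path matches the Scrapy release you are running.
# It follows the scrapy/contrib/spidermiddleware/offsite.py layout and
# may differ in newer versions.
SPIDER_MIDDLEWARES = {
    # Mapping a middleware to None disables it.
    "scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware": None,
}
```

With the middleware disabled, allowed_domains is no longer consulted for filtering, so the spider follows links to any domain.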