I've been looking to start using Scrapy for a large web scraping project. I've been using python-requests (HTTP for Humans) and BeautifulSoup for most of the websites I've scraped over the last 5 years.
My reasons for wanting to use the `requests` library over `Scrapy.Request`:
- While playing around with `Scrapy`, I've noticed that the content it returns is very primitive, in the sense that it's closer to the raw format. With `requests.content` and `requests.text`, by contrast, the output is cleaner because of how `requests` more elegantly handles content-encoding.
- I've seen issues with `Scrapy.Request` not returning the same content as the `requests` library.
- `requests` offers easy ways to conveniently access response headers, such as `rel="next"` links, whereas `Scrapy.Request` does not (and also converts headers to uppercase and forces the format to bytes).
- `requests` offers a convenient `response.json()` method (I know, I know, trivial, but still). It's been briefly discussed in this issue and has been essentially rotting since 2016.
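To make the header and JSON points concrete, here is a minimal sketch of the manual work involved when a `Link` header and a body arrive as raw bytes, as described above. The helper names `parse_link_header` and `parse_json_body` are hypothetical and not part of Scrapy or `requests`, and the parsing only covers the common `<url>; rel="..."` form rather than every case the spec allows.

```python
import json
import re

# One Link entry looks like: <url>; rel="next"; other="param"
_LINK_RE = re.compile(r'<([^>]+)>\s*((?:;\s*[^;,]+)*)')
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*"?([^";,]+)"?')


def parse_link_header(raw):
    """Map each rel value to its URL, given a raw Link header (bytes or str)."""
    if isinstance(raw, bytes):
        # Assumption: header bytes are decoded as latin-1, which never fails.
        raw = raw.decode("latin-1")
    links = {}
    for url, params in _LINK_RE.findall(raw):
        for name, value in _PARAM_RE.findall(params):
            if name.lower() == "rel":
                # A single rel parameter may list several relations.
                for rel in value.split():
                    links[rel] = url
    return links


def parse_json_body(body, encoding="utf-8"):
    """Decode a raw bytes body and parse it as JSON."""
    return json.loads(body.decode(encoding))
```

With `requests`, the equivalent of the first helper is roughly `response.links`, and the second is `response.json()`, which is the convenience I'd like to keep.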
I've seen similar questions about doing this on StackOverflow. The most direct request was here, but it did not receive much attention. Just as with the other questions I found that mention wanting to use `python-requests` with `Scrapy`, people either mistakenly assumed the OPs were talking about `Scrapy.Request`, or simply ignored the point of the question and provided a `Scrapy.Request` equivalent to the OPs' requests.
The first `Scrapy` method I started looking into to see about overriding `Scrapy.Request` was `start_requests`.
No Luck
Then I thought about seeing if someone had created a custom `Downloader-Middleware` that allowed for drop-in `python-requests` compatibility.
No Luck
Finally, I thought to try searching for an upstream solution involving `Twisted`'s way of handling requests. That's the only place where I found anything close to a solution: `txrequests`. However, there didn't seem to be a way to tie that in with `Scrapy`.
So, having done my research, here I am asking for help.
Finally, the Question
While I know it might be ill-advised, I'm asking: how can I override `Scrapy.Request` to use `python-requests` across my entire project?