
I've been looking to start using Scrapy for a large web scraping project. Until now I've used python-requests (HTTP for Humans) and BeautifulSoup for most of the websites I've scraped over the last 5 years.

My reasons for wanting to use the requests library over scrapy.Request:

  • While playing around with Scrapy, I've noticed that the content it returns is fairly raw. With requests, response.content and response.text are cleaner because requests handles content encoding more gracefully.
  • I've seen cases where scrapy.Request does not return the same content as the requests library for the same URL.
  • requests offers convenient access to response headers and to parsed rel="next" links, whereas Scrapy does not (it also rewrites the case of header names and forces the values to bytes).
  • requests offers a convenient response.json() method (I know, I know, trivial, but still). A Scrapy equivalent was briefly discussed in this issue, which has essentially been rotting since 2016.
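For reference, the last two points can be approximated on top of a Scrapy response with a couple of small helpers. This is only a rough sketch using the standard library: the function names `parse_link_header` and `decode_json_body` are hypothetical, and it assumes the inputs are the bytes you would get from `response.headers.get("Link")` and `response.body`.

```python
import json
import re

# Matches one "<url>; params" entry of an HTTP Link header.
_LINK_RE = re.compile(r'<([^>]*)>([^<]*)')
# Pulls the rel value out of the params, quoted or unquoted.
_REL_RE = re.compile(r'rel\s*=\s*"?([^";,]+)"?')


def parse_link_header(raw_link):
    """Return a {rel: url} dict from a Link header given as bytes or str."""
    if raw_link is None:
        return {}
    text = raw_link.decode("latin-1") if isinstance(raw_link, bytes) else raw_link
    links = {}
    for url, params in _LINK_RE.findall(text):
        match = _REL_RE.search(params)
        if match:
            # A single entry may carry several space-separated rel values.
            for rel in match.group(1).split():
                links[rel] = url
    return links


def decode_json_body(body, encoding="utf-8"):
    """Decode a bytes body and parse it as JSON."""
    return json.loads(body.decode(encoding))
```

Inside a spider callback this would be used roughly as `parse_link_header(response.headers.get("Link")).get("next")`. It is a workaround rather than a replacement for what requests provides out of the box.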

I've seen similar questions about this on Stack Overflow. The most direct one was here, but it did not receive much attention. As with the other questions I found about using python-requests with Scrapy, people either assumed the OPs were talking about scrapy.Request or ignored the point of the question and provided a scrapy.Request equivalent of the OPs' requests code.

The first Scrapy method I looked into overriding in order to replace scrapy.Request was start_requests.

No Luck

Then I looked for an existing custom downloader middleware that allowed for drop-in python-requests compatibility.

No Luck

Finally, I tried searching for an upstream solution involving Twisted's way of handling requests. That's the only place where I found anything close to a solution: txrequests. However, there didn't seem to be a way to tie that in with Scrapy.

So, having done my research, here I am asking for help.

Finally, the Question 

While I know it might be ill-advised, I'm asking: how can I override scrapy.Request to use python-requests across my entire project?

user2357112
CaffeinatedMike

1 Answer


It won't work because requests is not asynchronous, unless you're OK with blocking calls, in which case I don't see the point of using Scrapy.

If you're not happy with Scrapy's requests, you could try aiohttp or switch to a language or runtime with built-in async support, like Go or Node.

pguardiario
  • `txrequests` takes care of extending the `requests` library to be async. I just don't know how or where to apply it for it to be used throughout the entire project. I'm guessing it'd be in a custom downloader middleware. – CaffeinatedMike Mar 02 '20 at 03:39
  • I've never heard of txrequests. You might want to contact the author. – pguardiario Mar 02 '20 at 04:02
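The downloader-middleware idea from the comments could look roughly like the sketch below. It is not a tested drop-in and does not use txrequests: it runs the blocking `requests` call in Twisted's thread pool via `deferToThread` so the reactor is not blocked. The class name, the `use_requests` meta key, and the 30-second timeout are all hypothetical, and the sketch assumes that your Scrapy version accepts a Deferred returned from `process_request`, which is worth verifying before relying on it.

```python
# Headers that no longer describe the body once requests has decoded it.
_STALE_HEADERS = {"content-encoding", "content-length", "transfer-encoding"}


def build_response_kwargs(resp):
    """Map a requests.Response-like object to scrapy Response keyword arguments."""
    headers = {
        name: value
        for name, value in resp.headers.items()
        # Dropped so Scrapy's compression handling does not decode the body twice.
        if name.lower() not in _STALE_HEADERS
    }
    return {
        "url": resp.url,
        "status": resp.status_code,
        "headers": headers,
        "body": resp.content,
    }


class RequestsDownloaderMiddleware:
    """Hypothetical middleware: fetch opted-in requests with python-requests."""

    def process_request(self, request, spider):
        # Only requests that opt in via meta bypass Scrapy's own downloader.
        if not request.meta.get("use_requests"):
            return None
        # Imported lazily so the helper above works without Twisted installed.
        from twisted.internet.threads import deferToThread

        # Assumption: the middleware manager waits on a returned Deferred.
        return deferToThread(self._fetch, request)

    def _fetch(self, request):
        # Runs in a worker thread, so the blocking call does not stall the reactor.
        import requests
        from scrapy.http import TextResponse

        resp = requests.request(
            request.method,
            request.url,
            headers=dict(request.headers.to_unicode_dict()),
            data=request.body or None,
            timeout=30,  # arbitrary value chosen for this sketch
        )
        # Assumes a text body; binary downloads would need a plain Response.
        return TextResponse(request=request, **build_response_kwargs(resp))
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` under whatever module path your project uses, and individual requests would opt in with `meta={"use_requests": True}`. Concurrency is then bounded by the reactor thread pool (Scrapy's `REACTOR_THREADPOOL_MAXSIZE` setting) rather than by Scrapy's downloader, and the spider still receives a Scrapy response object rather than a `requests.Response`.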