Questions tagged [crawlera]

26 questions
0
votes
0 answers

Website redirects endlessly until the maximum number of redirects is reached in Scrapy

The site behaves normally when accessed through a browser, but the redirection issue occurs when accessing it through Scrapy bots. I use the Scrapy-Crawlera proxy service, yet the site still redirects endlessly. If I use handle_httpstatus_list = [302] or…
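For anyone hitting this, a minimal sketch of capturing the 302 instead of following it (spider name and URL are placeholders): Scrapy's RedirectMiddleware skips statuses listed in handle_httpstatus_list, so the callback can inspect the Location header directly.

```python
import scrapy

class RedirectProbeSpider(scrapy.Spider):
    # Hypothetical spider for illustration; name and URL are placeholders.
    name = "redirect_probe"
    start_urls = ["https://example.com/"]

    # RedirectMiddleware skips statuses listed here, so the 302 reaches
    # parse() instead of being followed until REDIRECT_MAX_TIMES is hit.
    handle_httpstatus_list = [302]

    def parse(self, response):
        if response.status == 302:
            # The Location header often reveals a cookie or JS challenge page.
            self.logger.info("Redirected to: %s",
                             response.headers.get("Location"))
        else:
            yield {"title": response.css("title::text").get()}
```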
0
votes
1 answer

Scrapy crawlera bug

Scrapy 2.0.1, scrapy_crawlera 1.7.0. I think scrapy_crawlera should access meta differently (https://github.com/scrapy/scrapy/issues/3516). 2020-04-02 06:02:36 [scrapy.core.engine] INFO: Spider opened 2020-04-02 06:02:36 [scrapy.extensions.logstats]…
aikipooh
  • 137
  • 1
  • 19
0
votes
1 answer

Crawlera, cookies, sessions, rate limiting

I'm trying to use Scrapinghub to crawl a website that heavily limits the request rate. If I run the spider as-is, I get a 429 pretty soon. If I enable Crawlera as per the standard instructions, the spider doesn't work anymore. If I set headers =…
kenshin
  • 197
  • 11
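For context, the standard scrapy-crawlera setup the question refers to looks roughly like this (a sketch; the API key is a placeholder):

```python
# settings.py -- the standard scrapy-crawlera setup; key is a placeholder.
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-api-key>"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}

# Crawlera throttles on its own side, so its docs suggest relaxing Scrapy's
# limits and disabling AutoThrottle, which otherwise fights the proxy.
CONCURRENT_REQUESTS = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600

# For sticky cookies, a Crawlera session pins requests to one outgoing IP:
# yield scrapy.Request(url, headers={"X-Crawlera-Session": "create"})
```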
0
votes
1 answer

How to make the website believe that the request is coming from a browser using Scrapy?

I am trying to scrape this URL: https://www.bloomberg.com/news/articles/2019-06-03/a-tesla-collapse-would-boost-european-carmakers-bernstein-says I just want to scrape the title and posted date, but Bloomberg always bans me and thinks that I am…
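For reference, browser-like headers in Scrapy are set via settings; a sketch — the values are examples only and will not defeat fingerprinting-based bot detection such as Bloomberg's on their own:

```python
# settings.py -- browser-like request headers; example values only.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```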
0
votes
0 answers

scrapy-splash response.body contains no HTML

I'm trying to use Crawlera alongside a local Splash instance. This is my Lua script: function main(splash) function use_crawlera(splash) local user = splash.args.crawlera_user local host = 'proxy.crawlera.com' local port = 8010 local…
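For context, the Python side of such a setup typically looks like this (a sketch assuming scrapy-splash; the spider, URL, and key are placeholders, and LUA_SOURCE stands in for the question's truncated script):

```python
import scrapy
from scrapy_splash import SplashRequest

# Paste the full use_crawlera script from the question here; it is
# truncated above, so only a stub is shown.
LUA_SOURCE = """
function main(splash)
  -- use_crawlera(splash) setup and splash:go() go here
end
"""

class CrawleraSplashSpider(scrapy.Spider):
    name = "crawlera_splash"  # placeholder

    def start_requests(self):
        yield SplashRequest(
            "https://example.com/",  # placeholder URL
            self.parse,
            endpoint="execute",
            args={
                "lua_source": LUA_SOURCE,
                # Available in Lua as splash.args.crawlera_user:
                "crawlera_user": "<your-api-key>",
                # Crawlera adds latency; a short Splash timeout can yield an
                # empty body. 60 is Splash's default --max-timeout ceiling.
                "timeout": 60,
            },
        )

    def parse(self, response):
        # If the script returns {html = splash:html()}, the rendered page
        # is what arrives here as response.body.
        self.logger.info("Got %d bytes", len(response.body))
```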
0
votes
1 answer

Stop Scrapy request pipeline for a few minutes and retry

I am scraping a single domain using Scrapy and the Crawlera proxy. Sometimes, due to Crawlera issues (a technical break), I get a 407 status code and can't scrape any site. Is it possible to stop the request pipeline for 10 minutes and then restart…
Bociek
  • 1,195
  • 2
  • 13
  • 28
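One common (if blunt) pattern for this is a downloader middleware that pauses the engine on a 407 and retries. A rough sketch, not production-ready: the blocking sleep stalls the Twisted reactor, which is tolerable only because nothing useful can run while the proxy is down anyway.

```python
import time

class PauseOn407Middleware:
    # Add to DOWNLOADER_MIDDLEWARES, e.g.
    # {"myproject.middlewares.PauseOn407Middleware": 610}  (path is a placeholder)
    PAUSE_SECONDS = 600  # 10 minutes

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        mw.crawler = crawler
        return mw

    def process_response(self, request, response, spider):
        if response.status == 407:
            spider.logger.warning("407 from proxy; pausing %ss",
                                  self.PAUSE_SECONDS)
            self.crawler.engine.pause()
            time.sleep(self.PAUSE_SECONDS)  # blocks the reactor on purpose
            self.crawler.engine.unpause()
            # Re-issue the same request, bypassing the dupefilter.
            return request.replace(dont_filter=True)
        return response
```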
0
votes
2 answers

Scrapy spider not working with Crawlera middleware

I wrote a spider to crawl a large site. I'm hosting it on Scrapinghub and am using the Crawlera add-on. Without Crawlera my spider runs on Scrapinghub just fine. As soon as I switch to the Crawlera middleware, the spider just exits without doing a single…
joe
  • 73
  • 2
  • 8
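Worth noting: scrapy-crawlera can also be toggled per spider via attributes, and the X-Crawlera-Version response header confirms whether requests really went through the proxy, which helps diagnose a spider that silently does nothing. A sketch (spider name and URL are placeholders):

```python
import scrapy

class LargeSiteSpider(scrapy.Spider):
    # Hypothetical spider; scrapy_crawlera.CrawleraMiddleware must still be
    # listed in DOWNLOADER_MIDDLEWARES for these attributes to take effect.
    name = "large_site"
    crawlera_enabled = True
    crawlera_apikey = "<your-api-key>"  # placeholder

    start_urls = ["https://example.com/"]  # placeholder

    def parse(self, response):
        # X-Crawlera-Version is only present on responses that actually
        # passed through the proxy.
        self.logger.info("status=%s via=%s", response.status,
                         response.headers.get("X-Crawlera-Version"))
```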
0
votes
2 answers

Does scrapy-crawlera handle a 429 status code?

Does anyone know whether the scrapy-crawlera middleware handles the 429 status code when using Scrapy, or do I need to implement my own retry logic? I can't seem to find it documented anywhere.
Kevin Glasson
  • 408
  • 2
  • 13
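For reference, retrying 429 is controlled by Scrapy's own RetryMiddleware settings rather than by scrapy-crawlera; recent Scrapy releases include 429 in the defaults, older ones do not. A sketch:

```python
# settings.py -- make sure 429 is retried; recent Scrapy releases include
# it in the default RETRY_HTTP_CODES, older ones do not.
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```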
0
votes
2 answers

How to get session_id when using Crawlera lua script in Scrapy Splash?

As you know, we use this Lua script when trying to use Scrapy Splash with Crawlera: function use_crawlera(splash) -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg. -- Have a look at the file spiders/quotes-js.py to see…
Aminah Nuraini
  • 18,120
  • 8
  • 90
  • 108
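One approach (a sketch, assuming a Splash version with splash:on_response) is to capture the X-Crawlera-Session response header inside the Lua script and return it alongside the HTML; the field, spider name, URL, and key below are illustrative:

```python
import scrapy
from scrapy_splash import SplashRequest

# Sketch: record the X-Crawlera-Session response header inside Lua and
# hand it back to the spider next to the rendered HTML.
LUA_SOURCE = """
function main(splash)
  local session_id = nil
  splash:on_response(function(response)
    local sid = response.headers['X-Crawlera-Session']
    if sid then session_id = sid end
  end)
  -- ... the use_crawlera(splash) setup from the original script ...
  assert(splash:go(splash.args.url))
  return {html = splash:html(), session_id = session_id}
end
"""

class SessionSpider(scrapy.Spider):
    name = "crawlera_session"  # placeholder

    def start_requests(self):
        yield SplashRequest("https://example.com/", self.parse,  # placeholder
                            endpoint="execute",
                            args={"lua_source": LUA_SOURCE,
                                  "crawlera_user": "<your-api-key>"})

    def parse(self, response):
        # With the 'execute' endpoint, the Lua return table is exposed as
        # response.data on the SplashJsonResponse.
        self.logger.info("Crawlera session: %s",
                         response.data.get("session_id"))
```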
0
votes
1 answer

Scrapy Splash + Crawlera in Linux always get 503 service unavailable error

When I use Scrapy Splash + Crawlera on my Linux server, I always get 503 errors. It works just fine on Windows. Why is that?
Aminah Nuraini
  • 18,120
  • 8
  • 90
  • 108
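A quick way to narrow this down is to test Crawlera from the Linux host independently of Splash, using the documented APIKEY-as-username proxy form (a sketch; the key is a placeholder):

```python
import requests

# Plain HTTP request through Crawlera from the affected host; if this also
# returns 503, the problem is between the server and Crawlera rather than
# in the Splash setup.
resp = requests.get(
    "http://httpbin.org/ip",
    proxies={"http": "http://<your-api-key>:@proxy.crawlera.com:8010"},
    timeout=60,
)
print(resp.status_code, resp.text)
```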
-1
votes
1 answer

How to authenticate using a Scrapy spider with Zyte Smart Proxy Manager (formerly Crawlera) enabled?

I followed the scrapy-zyte-smartproxy documentation to integrate proxy usage into my spider. Now my spider can't log in.
Danil
  • 4,781
  • 1
  • 35
  • 50
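If the proxy's rotating IPs break the login flow, one option is to bypass the proxy for just the login request via the dont_proxy meta key documented for scrapy-crawlera, which scrapy-zyte-smartproxy also honors. A sketch — the spider, URLs, and credentials are placeholders, and sessions bound to the login IP may still be invalidated by later proxied requests:

```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_then_proxy"  # placeholder

    def start_requests(self):
        yield scrapy.FormRequest(
            "https://example.com/login",  # placeholder URL
            formdata={"user": "me", "pass": "secret"},  # placeholders
            # Skip Smart Proxy Manager for this one request so the session
            # is established from a single stable IP.
            meta={"dont_proxy": True},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Later requests go through the proxy again as usual.
        yield scrapy.Request("https://example.com/account", self.parse)

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```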