Hello!

A question for anyone who uses Scrapinghub, shub-image, Selenium + PhantomJS, and Crawlera. Sorry, my English is not very good.

I need to scrape a site that relies heavily on JavaScript, so I use Scrapy with Selenium. It also has to run on Scrapy Cloud. I wrote a spider that uses Scrapy + Selenium + PhantomJS and ran it on my local machine, where everything works. Then I deployed the project to Scrapy Cloud using shub-image. The deployment succeeds, but the result of webdriver.page_source is different: locally it is the expected HTML, while on the cloud the request returns HTTP 200 but the HTML contains a 403 message. I then decided to use my Crawlera account and added it with:

service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]

For Windows (local):

self.driver = webdriver.PhantomJS(executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe', service_args=service_args)

For the Docker instance:

self.driver = webdriver.PhantomJS(executable_path=r'/usr/bin/phantomjs', service_args=service_args, desired_capabilities=dcap)
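The two constructor calls above differ only in the executable path, so the setup could be sketched as one shared helper. This is a minimal sketch, not a confirmed fix: `build_service_args` and `phantomjs_path` are hypothetical names, and the paths and proxy values are the ones from the question. Dropping the inner quotes is an assumption. Selenium passes `service_args` to the PhantomJS process as a plain argument list rather than through a shell, so quotes inside an argument may arrive as literal characters.

```python
import sys


def build_service_args(proxy, proxy_auth, proxy_type="https"):
    """Return PhantomJS command-line flags for a proxy (hypothetical helper)."""
    # No inner quotes: each list item is handed to the process as one argument.
    return [
        "--proxy={}".format(proxy),
        "--proxy-type={}".format(proxy_type),
        "--proxy-auth={}".format(proxy_auth),
    ]


def phantomjs_path(platform=sys.platform):
    """Pick the PhantomJS binary for the platform (paths taken from the question)."""
    if platform.startswith("win"):
        return r"D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe"
    return "/usr/bin/phantomjs"


# "apikey" is the placeholder used in the question, not a real credential.
service_args = build_service_args("proxy.crawlera.com:8010", "apikey")
```

With this, both environments could create the driver with `webdriver.PhantomJS(executable_path=phantomjs_path(), service_args=service_args)`, which rules out accidental differences between the two code paths.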

Again, everything works locally but not on the cloud. I checked the Crawlera info and it looks fine: requests are sent from both environments (local and cloud).

To repeat: both environments use the same proxy (Crawlera). Response on Windows: HTTP 200, HTML with the correct content.

Response on Scrapy Cloud (Docker instance): HTTP 200, HTML containing a 403 (Forbidden) message.

I don't understand what is wrong. I think it might be a difference between the PhantomJS versions (Windows vs. Linux).

Any ideas?

kzr
  • A site is denying access for bots from a well-known scraping service. Do you not see what could be wrong here? – Vaviloff Apr 20 '17 at 03:43
  • @Vaviloff, sure, I know that. But I am using a proxy (Crawlera)... As I wrote, I use the same proxy and the same user-agent from both local and Scrapy Cloud, and the result is different. – kzr Apr 20 '17 at 07:29
  • No experience with selenium and phantomjs but for Splash it's recommended to make a regular (do not use Crawlera) request to Splash and use Crawlera in the Splash request. Maybe a similar logic can be applied for your case. – Casper Apr 20 '17 at 07:50
  • Are proxies that you use transparent? Could they be transmitting originating IP of a request? – Vaviloff Apr 20 '17 at 08:57
  • Try checking the headers for both [here](https://httpbin.org/headers) and compare them – pguardiario Apr 20 '17 at 23:46
  • @Vaviloff I can see IPs of each request – kzr Apr 21 '17 at 08:26
  • @Casper Thanks, that may be useful – kzr Apr 21 '17 at 08:28
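
The header comparison suggested in the comments could be sketched roughly as follows. It assumes each environment loads https://httpbin.org/headers through the same driver and that `page_source` contains the JSON response, possibly wrapped in HTML by the browser. `extract_headers` and `diff_headers` are hypothetical helper names, not part of Selenium or Scrapy.

```python
import json
import re


def extract_headers(page_source):
    """Pull the "headers" object out of the page rendered for httpbin."""
    # The browser may wrap the raw JSON in HTML tags, so grab the span
    # from the first "{" to the last "}" and parse only that.
    match = re.search(r"\{.*\}", page_source, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in page source")
    return json.loads(match.group(0))["headers"]


def diff_headers(local, cloud):
    """Return {name: (local_value, cloud_value)} for headers that differ."""
    names = set(local) | set(cloud)
    return {
        name: (local.get(name), cloud.get(name))
        for name in sorted(names)
        if local.get(name) != cloud.get(name)
    }
```

In each environment the spider would call `self.driver.get("https://httpbin.org/headers")` and log `extract_headers(self.driver.page_source)`. Feeding the two logged dictionaries to `diff_headers` then shows any header, such as a forwarded IP or a different user-agent, that only one side sends.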

0 Answers