
I need to scrape a web page that is a JavaScript-rendered AngularJS app. The site's developers detect Safari/Firefox in private browsing mode and block it, which also blocks scraping. The page works in Safari/Firefox when you are not in private mode.

The interesting thing is that no such warning is given when using Chrome whether in private mode or not. I was using Scrapy+Selenium, but I was really hoping to use ScrapyJS/Splash for this project. However, it looks like the Scrapy/Splash combination suffers from the website's private browsing wall.

Is it possible to tell Scrapy to use Chrome? I know Selenium has quite a few drivers, and how to use each one is well documented, but I can't find any information on whether Scrapy supports other browsers or whether someone else has already done this. Google/SO searches haven't turned up anything either.

Randy
  • Have you tried changing the user agent? http://stackoverflow.com/questions/18920930/scrapy-python-set-up-user-agent – Gustavo Bezerra Mar 22 '16 at 04:16
  • Yes, I tried this in the Scrapy `settings.py` file and it didn't seem to have an effect. I tried a few known Chrome/Firefox/Safari agents as well as some "Scrapy be a good citizen" user agents. – Randy Mar 22 '16 at 04:21
  • Have you tried using selenium's `chrome driver` ? – Rahul Mar 22 '16 at 05:55
  • Sorry if I am wrong, but from my limited experience with Scrapy over a year ago, as far as I know, unlike Selenium, it doesn't really use the backend of any browser. It just sends HTTP requests using requests/twisted, so the idea of "using browser X with Scrapy" doesn't seem to make much sense. I guess your best shot is trying Selenium. – Gustavo Bezerra Mar 22 '16 at 06:56
  • @Randy, starting from Splash 2.0, you can disable private mode at startup or runtime. See https://splash.readthedocs.org/en/stable/changes.html#id4 _"it is now possible to turn Private mode OFF at startup using command-line option or at runtime using splash.private_mode_enabled attribute;"_ – paul trmbrth Mar 22 '16 at 09:14
  • @Rahul - yes, I have tried that and it works, but I would like to use a Scrapy/Splash combination instead of Scrapy/Selenium. – Randy Mar 23 '16 at 14:40
  • @paultrmbrth - Thanks! I will check that out. – Randy Mar 23 '16 at 14:40
  • @paultrmbrth, if you want to add your comment as an answer, I tried it out and it works just as I hoped. – Randy Mar 25 '16 at 03:31
  • @Randy, done. Thx for the feedback – paul trmbrth Mar 25 '16 at 10:04

1 Answer


Starting from Splash 2.0, you can disable Private mode (which is "on" by default).

There are two ways to go about it:

  • at startup, with the --disable-private-mode argument, e.g., if you're using Docker:

    $ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash --disable-private-mode
    
  • at runtime, when using the /execute endpoint, by setting splash.private_mode_enabled = false in the Lua script
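
As a rough illustration of the runtime option, here is a minimal sketch that posts a Lua script to the /execute endpoint using only Python's standard library. It assumes Splash 2.0+ is listening on http://localhost:8050 (as in the Docker command above); the helper names `build_execute_request` and `render`, the `wait` argument, and the 0.5 second default are hypothetical choices for this example, not part of Splash.

```python
import json
import urllib.request

# Lua script run by Splash; splash.args holds the JSON fields we POST below.
LUA_SOURCE = """
function main(splash)
  -- Turn private mode off for this render (available since Splash 2.0).
  splash.private_mode_enabled = false
  assert(splash:go(splash.args.url))
  assert(splash:wait(splash.args.wait))
  return splash:html()
end
"""


def build_execute_request(target_url, splash_base="http://localhost:8050", wait=0.5):
    """Build (but do not send) a POST request for Splash's /execute endpoint.

    splash_base and wait are assumptions for this sketch; adjust to your setup.
    """
    payload = {"lua_source": LUA_SOURCE, "url": target_url, "wait": wait}
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render(target_url, **kwargs):
    """Send the request to a running Splash instance and return the rendered HTML."""
    request = build_execute_request(target_url, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With a Splash container running, `html = render("http://example.com")` should return the page source as rendered with private mode off. From Scrapy, the same `lua_source` can be sent to the execute endpoint through whichever Splash integration you already use.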

Also, take note of the effect of disabling private mode:

_"Note that if you disable private mode browsing data such as cookies or items kept in local storage may persist between requests."_

paul trmbrth