I am currently changing the user_agent by passing different strings to html_session().

Is there also a way to change your IP address on a timer when scraping a website?

scoa
tonyk
  • This sounds an awful lot like a method for circumventing the terms of use of a website ... – Ben Bolker Jan 04 '17 at 14:55
  • Take a look here: http://google-scraper.squabbel.com/ It is dedicated to Google scraping, but the information will help with your question as well. It applies to almost any website, and most are easier than Google. – John Jan 04 '17 at 22:30
  • You can use Tor with Privoxy, or Tor directly, for this purpose. Note: I personally believe there is nothing unethical in circumventing a website restriction. Obviously you should not abuse this by making an unnecessarily large number of hits to the target webpage. – Indranil Gayen Jan 05 '17 at 04:15
  • Thank you guys. @IndranilGayen, do you know of a good guide for doing that in R? Failing that, I could always use Python. – tonyk Jan 09 '17 at 18:11

1 Answer

You can use a proxy (which changes your IP) via use_proxy as follows:

html_session("your-url", use_proxy("proxy-ip", port))

For more details see: ?httr::use_proxy

To check if it is working you can do the following:

require(httr)

content(GET("https://ifconfig.co/json"), "parsed")
content(GET("https://ifconfig.co/json", use_proxy("138.201.63.123", 31288)), "parsed")

The first call returns your own IP. The second call should return 138.201.63.123 as the IP.

This proxy was taken from http://proxylist.hidemyass.com/ - no guarantees for anything...
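
To change the proxy on a timer, one option is to keep a small pool of proxies and pick one based on the elapsed time. The following is only a minimal sketch and not part of httr or rvest: `proxy_index()` and `get_rotating()` are hypothetical helper names, the 60-second interval is an arbitrary assumption, and the two pool entries are simply the addresses mentioned on this page, which are probably no longer reachable.

```r
# Hypothetical proxy pool; replace with proxies you are allowed to use.
proxies <- data.frame(
  ip   = c("138.201.63.123", "95.171.198.206"),
  port = c(31288, 8080),
  stringsAsFactors = FALSE
)

# Map elapsed seconds to a 1-based index into the pool,
# moving to the next proxy every `interval` seconds and wrapping around.
proxy_index <- function(elapsed, interval, n) {
  (floor(elapsed / interval) %% n) + 1
}

# Fetch `url` through whichever proxy is current at this moment.
# `start` is the time the scraping run began (e.g. Sys.time()).
get_rotating <- function(url, start, interval = 60) {
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  i <- proxy_index(elapsed, interval, nrow(proxies))
  httr::GET(url, httr::use_proxy(proxies$ip[i], proxies$port[i]), httr::timeout(10))
}

# Usage (not run here, it needs working proxies):
# start <- Sys.time()
# res <- get_rotating("https://ifconfig.co/json", start)
# httr::content(res, "parsed")
```

The same `use_proxy()` config can be passed to html_session() instead of GET() if you want an rvest session.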

Rentrop
  • Thank you. Are there any restrictions on the IP address or port number that can be used? – tonyk Jan 04 '17 at 14:48
  • @tonyk What would be such a restriction? – lukeA Jan 04 '17 at 14:55
  • @tonyk it has to be a valid URL of a proxy server. If you want to use a _socks_-proxy use something like `use_proxy("socks://127.0.0.1", 9050)` – Rentrop Jan 04 '17 at 15:03
  • So for instance it could be any currently valid entry on the https://www.socks-proxy.net/ website? – tonyk Jan 04 '17 at 16:57
  • Thank you. Sending one request gets me a robot check. Do you know how to view the information sent in the request? – tonyk Jan 09 '17 at 18:10
  • Have a look at ?verbose – Rentrop Jan 09 '17 at 20:28
  • I have tried the solution and it does not work for me. `html_session("https://www.maxmodels.pl", use_proxy("95.171.198.206", 8080))` generated the error Error in `curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: Connection timed out after 10000 milliseconds` – AAAA Apr 08 '18 at 12:15
  • Are you sure the proxy is working correctly? Oftentimes proxies from the web are outdated or blocked by the other site. Did it work using the proxy, e.g. via curl in the shell? – Rentrop Apr 08 '18 at 14:09
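
The last few comments suggest two checks: looking at what is actually sent, and confirming that a proxy responds before relying on it. A rough sketch of both follows. `proxy_works()` is a hypothetical helper, not an httr function, and the test URL and the timeout default are assumptions.

```r
# Return TRUE if a request through the given proxy succeeds within `secs` seconds.
# Connection errors and timeouts are caught and reported as FALSE.
proxy_works <- function(ip, port,
                        test_url = "https://ifconfig.co/json",
                        secs = 10) {
  res <- tryCatch(
    httr::GET(test_url, httr::use_proxy(ip, port), httr::timeout(secs)),
    error = function(e) NULL
  )
  # A usable proxy gives a response that is not an HTTP error (status < 400).
  !is.null(res) && !httr::http_error(res)
}

# Usage (not run here, it needs network access):
# proxy_works("95.171.198.206", 8080)
#
# To print the request headers that are sent, add verbose() to the call:
# httr::GET("https://ifconfig.co/json", httr::verbose())
```

Filtering a proxy list with a check like this before scraping avoids the kind of timeout error reported above, at the cost of one extra request per proxy.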