I'm trying to scrape data from this website using httr and rvest. After roughly 90-100 requests, the website automatically redirects me to another URL with a captcha.
This is the normal URL: "https://fs.lianjia.com/ershoufang/pg1"
This is the captcha URL: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes across the captcha URL, it tells me to stop and solve the captcha in a browser, which I do by hand. But when I run the spider again and send a GET request, it is still redirected to the captcha URL. Meanwhile everything works normally in the browser: even if I type in the captcha URL, it redirects me back to the normal URL.
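For reference, this is roughly how the spider notices the captcha redirect (a minimal sketch; checking the response's final URL with grepl is an assumption about how I wired it up):

library(httr)

# httr exposes the final URL after redirects as a$url;
# if it points at the captcha host, stop and ask me to solve it by hand.
is_captcha <- function(resp) {
  grepl("captcha.lianjia.com", resp$url, fixed = TRUE)
}

if (is_captcha(a)) {
  stop("Captcha page detected - solve it in the browser, then rerun.")
}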
Even when I use a proxy, I still get the same problem: in the browser I can browse the website normally, while the spider keeps being redirected to the captcha URL.
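One way I can check that the proxy is actually being used is to ask an IP echo service which address it sees (a minimal sketch; httpbin.org/ip is just a convenient echo service, and the proxy/port values are placeholders):

library(httr)

proxy <- "123.45.67.89"   # placeholder proxy IP
port  <- 8080             # placeholder proxy port

# The echo service should report the proxy's IP, not mine,
# if use_proxy() is taking effect.
ip_check <- GET("https://httpbin.org/ip", use_proxy(proxy, port), timeout(10))
content(ip_check, as = "parsed")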
I was wondering:
- Is my way of using the proxy correct?
- Why does the spider keep getting redirected while the browser doesn't, even though they come from the same IP? (One thing I could test for this is sketched right after this list.)
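One difference I can think of between the browser and a fresh GET is cookies: the browser keeps the session cookies set after solving the captcha, while the spider does not send them. To test whether that explains it, I could copy the browser's cookies into the request (a sketch only; the cookie names and values below are placeholders taken from the browser's dev tools):

library(httr)

# Placeholder cookie names/values - the real ones would come from the browser
# (DevTools -> Application -> Cookies) after solving the captcha.
browser_cookies <- c(
  some_session_cookie = "value-from-browser",
  another_cookie      = "value-from-browser"
)

a <- GET(url, use_proxy(proxy, port), timeout(10),
         set_cookies(.cookies = browser_cookies))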
Thanks.
This is my code:
library(httr)
library(rvest)

url <- "https://fs.lianjia.com/ershoufang/pg1"
# proxy and port are defined elsewhere, e.g. proxy <- "123.45.67.89"; port <- 8080

a <- GET(url, use_proxy(proxy, port), timeout(10),
         add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                     'Connection' = 'keep-alive',
                     'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
                     'Accept-Encoding' = 'gzip, deflate, br',
                     'Host' = 'ajax.api.lianjia.com',
                     'Accept' = '*/*',
                     'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
                     'Cache-Control' = 'max-age=0'))

# Extract the listing titles from the left content column
b <- a %>% read_html() %>%
  html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
  html_nodes('div.title') %>% html_text()
In the end, I switched to RSelenium. It's slow, but there are no more captchas, and even when one does appear I can solve it directly in the browser.
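For anyone curious, this is roughly the setup I ended up with (a minimal sketch; the browser choice and port are assumptions, and rsDriver downloads and manages its own driver binaries):

library(RSelenium)
library(rvest)

# Start a local Selenium server plus a browser session
driver <- rsDriver(browser = "firefox", port = 4445L)
remDr  <- driver$client

remDr$navigate("https://fs.lianjia.com/ershoufang/pg1")

# Parse the rendered page with rvest, using the same selectors as before
titles <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('div.leftContent div.info.clear div.title') %>%
  html_text()

remDr$close()
driver$server$stop()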