How to complete geetest (captcha) when scraping with python-requests, using request values taken from solving the captcha manually?

Question

I'm trying to scrape a website that uses DataDome, and after some requests I have to complete a geetest (slider captcha puzzle).

Here is a sample link to it: captcha link

I've decided not to use selenium (at least for now) and I'm trying to solve my problem with the Python module Requests. My idea was to complete the geetest myself and then send, from my program, the same request that my web browser sends after the slider is completed.

First, I scraped the HTML that the website returned after the captcha prompt:

<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>

I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built from the HTML above. As you can see in the link above, cid is initialCid, hsh is hash, etc.; the other cid in the link is a cookie that I got at the moment the captcha appeared.
I've seen that there are services which can solve captchas for you, so I decided to complete the captcha myself and then send the exact same request, including cookies and headers, from my program with requests. For now I'm doing it by hand, but it doesn't work: the response is 403, whereas in the browser it's 200 and a redirect.
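As an illustration of the link building described above, here is a minimal sketch that pulls the dd object out of the block page and assembles a query string from it. The cid to initialCid and hsh to hash mapping comes from the question; the "/captcha/" path, the pass-through of t and s, and the names parse_dd, build_captcha_url and cookie_cid are assumptions made for this example, so compare the result with the iframe URL your own browser shows.

```python
import ast
import re
from urllib.parse import urlencode


def parse_dd(html: str) -> dict:
    """Extract the `var dd={...}` object from the block page as a dict."""
    match = re.search(r"var dd=(\{.*?\})", html)
    if match is None:
        raise ValueError("no dd object found in HTML")
    # The object uses single-quoted strings, so it is also a valid
    # Python dict literal and can be parsed without executing anything.
    return ast.literal_eval(match.group(1))


def build_captcha_url(dd: dict, cookie_cid: str) -> str:
    """Assemble the iframe link (path and extra parameters are assumptions)."""
    query = urlencode({
        "initialCid": dd["cid"],  # mapping stated in the question
        "hash": dd["hsh"],        # mapping stated in the question
        "cid": cookie_cid,        # cookie received when the captcha appeared
        "t": dd["t"],             # assumed to be passed through unchanged
        "s": dd["s"],             # assumed to be passed through unchanged
    })
    return f"https://{dd['host']}/captcha/?{query}"


html = (
    "<script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==',"
    "'hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,"
    "'host':'geo.captcha-delivery.com'}</script>"
)
dd = parse_dd(html)
url = build_captcha_url(dd, cookie_cid="COOKIE_VALUE")  # placeholder cookie
print(url)
```

ast.literal_eval is used instead of json.loads because the object is written with single quotes, which JSON does not accept.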

Here is a sample request that my browser is sending after completing captcha:

sample request

I'm sending it in program by:

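import requests

s = requests.Session()
s.headers.update(headers)  # headers copied from the browser request
# cookies.set() needs a name and a value; "datadome" is an assumed name
s.cookies.set("datadome", cookie_from_web_browser)
captcha = s.get(request)  # request: URL the browser calls after the slider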

The response is 403 and I have no idea how to make it work. Please help.

score 1 · Answer 1 · answered Jun 06 '21 at 17:55


Captchas are really tricky in the web scraping world. Most of the time you can get past this by solving the captcha manually, then taking the cookie from the returned page and plugging it into your script. Depending on the website, the cookie could stay valid for 15 minutes, a day, or even longer.

The other alternative is to use a captcha-solving service such as https://www.scraperapi.com/, where you pay a fee for a given number of requests but don't run into the captcha issue, because the service solves them for you.


Yuriy Glukhov


Thanks for the answer, I appreciate it. I'm trying to do it manually, can you help me? – Kuba -a Jun 06 '21 at 18:16
1) open up a fresh tab and get the page to load the captcha 2) open the Network tab with F12, then solve the captcha 3) the returned page will have a cookie on it 4) put the cookie into the header and send the request – Yuriy Glukhov Jun 06 '21 at 18:47
That is actually a different approach, but the output is the same, and it works!!! Thank you very much. – Kuba -a Jun 06 '21 at 19:12
@Kuba-a no problem. If it helped you, accepting the answer is appreciated. Good luck! – Yuriy Glukhov Jun 06 '21 at 19:19
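The four steps from the comments could look roughly like the sketch below, which is an illustration and not code from the answerer. It assumes the cookie string was copied from the browser's Network tab after solving the captcha; the helper name build_session, the cookie name datadome, and all values are placeholders. Nothing is sent over the network here: the request is only prepared so the resulting headers can be inspected.

```python
import requests


def build_session(cookie_header: str, user_agent: str) -> requests.Session:
    """Create a session that carries cookies copied from the browser."""
    session = requests.Session()
    # Reuse the browser's User-Agent so the cookie is presented by a
    # client that looks like the one which solved the captcha.
    session.headers.update({"User-Agent": user_agent})
    # Split "name1=value1; name2=value2" into individual cookies.
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:
            session.cookies.set(name, value)
    return session


# Placeholder values: paste the real ones from your own browser session.
session = build_session(
    "datadome=PASTE_VALUE_HERE",  # cookie name is an assumption
    "Mozilla/5.0 (placeholder)",
)

# Prepare (but do not send) a request to see what would go on the wire.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/")
)
print(prepared.headers["Cookie"])
```

To actually fetch a page, call session.get(url) with the real target URL. As the answer notes, the cookie only stays valid for a limited time, so a 403 after a while usually means the captcha has to be solved again.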

score 0 · Answer 2 · answered Jan 20 '22 at 10:18

Pass a headers parameter to solve this problem, like so:

import requests

# Browser-like headers sent with the request
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/",
}

r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)

Test it against the web cache before running it with the real URL.
