
I have a scraper that worked without an issue for 18 months until today. Now I get a 403 response from hltv.org and can't seem to fix the issue. My code is below, so the answer is not the usual "just add headers". If I print response.text, it says something about captchas. So I assume I'd have to bypass the captcha, or my IP is blocked? Please help :)

import requests

url = 'https://www.hltv.org/matches'
headers = {
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "http://thewebsite.com",
    "Connection": "keep-alive"}
response = requests.get(url, headers=headers)
print(response)

EDIT: This remains a mystery to me, but today my code started working again on my main PC. I did not make any changes to the code. KokoseiJ could not reproduce the problem, but Booboo did. The code also worked on my old PC, which I dug out of storage, but not on my main PC. Anyway, thanks to all who tried to help me with this issue.

Jepa
  • Have you tried using a web browser from your computer? – eventHandler Jan 23 '22 at 12:06
  • With a web browser I do get to the site – Jepa Jan 23 '22 at 12:09
  • Try to copy the headers that your browser is sending to the server. Also try again without using the `Referer` header. – eventHandler Jan 23 '22 at 12:12
  • Can't reproduce. You've possibly hit the rate limit, which you have to solve via captcha. – KokoseiJ Jan 23 '22 at 12:43
  • That's what I feared. Thanks. – Jepa Jan 23 '22 at 12:54
  • Now, how do I go about with captcha? – Jepa Jan 23 '22 at 12:55
  • I tried doing this using the exact headers Chrome was sending up according to Chrome's Inspector, including headers that begin with ':', which took a bit of doing since I had to override header validation to accomplish that -- and I still got a 403 response. I will post the code for whatever it is worth. – Booboo Jan 23 '22 at 13:02
  • With my other computer the code works. They are connected to the same router, so shouldn't an IP block apply to the other PC as well? Anyone have any suggestions to fix the issue with my main PC? – Jepa Jan 23 '22 at 13:21
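Since the comments distinguish between a captcha challenge (rate limit) and an outright block, here is a minimal sketch of how that distinction can be made from the response itself. The helper name and labels are illustrative, not part of the original code; the body check mirrors the question's observation that response.text "says something about captchas".

```python
# Sketch only: a small helper (names are illustrative, not from the
# original code) to tell a captcha challenge apart from a plain block
# when a scrape request fails.
def classify_failure(status_code, body):
    """Return a rough label for a scrape response."""
    if status_code == 200:
        return "ok"
    if status_code != 403:
        return "other-error"
    if "captcha" in body.lower():
        # Likely a rate limit: solve the captcha in a browser, or back off.
        return "captcha-challenge"
    # No captcha hint in the body: more likely an IP or fingerprint block.
    return "blocked"

# Example: a 403 whose body mentions a captcha, as in the question.
print(classify_failure(403, "Please complete the Captcha to continue"))
```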

2 Answers


I am posting this not as a solution but as something that did not work, in case the information is useful.

I went to https://www.hltv.org/matches then brought up Chrome's Inspector and reloaded the page and looked at the request headers Chrome (supposedly) used for the GET request. Some of the header names began with a ':', which requests considers illegal. But looking around Stack Overflow, I found a way to get around that (supposedly for Python 3.7 and greater). See the accepted answer and comments here for details.

This still resulted in a 403 error. Perhaps somebody might spot an error in this (or not).

These were the headers shown by the Inspector:

:authority: www.hltv.org
:method: GET
:path: /matches
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
cookie: MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1%2C%22lan%22:false%2C%22teams%22:[]}
dnt: 1
pragma: no-cache
sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
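As an aside, the cookie value in that list is just URL-encoded JSON; decoding it with the standard library (a sketch, unrelated to the 403 itself) shows the match filter it carries:

```python
import json
from urllib.parse import unquote

# The MatchFilter cookie from the Inspector is percent-encoded JSON
# (%22 is a double quote, %2C is a comma).
raw = ('MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1'
       '%2C%22lan%22:false%2C%22teams%22:[]}')
name, _, value = raw.partition('=')
data = json.loads(unquote(value))
print(name, data)
# MatchFilter {'active': False, 'live': False, 'stars': 1, 'lan': False, 'teams': []}
```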

And the code:

import requests
import http.client
import re

# Relax http.client's header-name validation so names beginning with ':'
# (the HTTP/2 pseudo-headers shown by the Inspector) are accepted.
http.client._is_legal_header_name = re.compile(rb'\S[^:\r\n]*').fullmatch

url = 'https://www.hltv.org/matches'
headers = {
    ':authority': 'www.hltv.org',
    ':method': 'GET',
    ':path': '/matches',
    ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1%2C%22lan%22:false%2C%22teams%22:[]}',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.text)
print(response)
Booboo

I also came across this issue recently. My solution was using the js-fetch library (see answer).

I assume Cloudflare and others have found some way to detect whether a request is made by a real browser (JS) or by another programming language.

kaliiiiiiiii