I have built a script to scrape the www.tesco.com grocery results page, example link:
https://www.tesco.com/groceries/en-GB/search?query=kitkat
Unfortunately my Python script is getting blocked by the server (a regular GET request). I have even tried curl on my machine to troubleshoot:
curl https://www.tesco.com
but I get the response below:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://dce-homepage.tesco.com/" on this server.<P>
Reference #18.496cd417.1592645071.44e961c
</BODY>
</HTML>
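To narrow down whether this Access Denied comes from the request headers or from the source IP, it may help to compare a bare request against one that only changes the User-Agent. A minimal sketch (the browser User-Agent string is just an example value, not something from the original test):

import requests

# Example browser User-Agent; any realistic value can be substituted here.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')

for label, headers in [('no headers', {}), ('browser UA', {'User-Agent': BROWSER_UA})]:
    try:
        r = requests.get('https://www.tesco.com/', headers=headers, timeout=10)
        print(label, r.status_code)
    except requests.exceptions.RequestException as exc:
        print(label, 'failed:', exc)

If both variants fail from AWS while the second one succeeds from a local PC, the block is more likely keyed on the source IP than on the headers.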
When I try Postman with its standard headers I get a 200 OK response. In my script I have tried the same headers as Postman and I also get 200 OK, but only on my local PC. When I spin up a fresh instance on AWS (free tier, Ubuntu 18.04 or similar), even curl gets blocked with the Access Denied page shown above. Ideally I would like my script to work on AWS. When run there, the script doesn't work; it just hangs. When I interrupt it I get the following:
^CTraceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ttest.py", line 18, in <module>
    results = requests.get(url, headers = headers)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 630, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
KeyboardInterrupt
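The traceback shows the process is blocked inside self._sock.recv_into(b): the TCP/TLS connection was established, but the server never sends a status line, and since no timeout is passed to requests.get the call waits indefinitely. Adding a timeout at least turns the hang into a visible error; a minimal sketch (the timeout values and the reduced headers dict are just examples):

import requests

url = 'https://www.tesco.com/groceries/en-GB/search?query=kitkat'
headers = {'User-Agent': 'PostmanRuntime/7.25.0', 'Accept': '*/*'}

try:
    # 5 s to connect, 15 s to receive the first byte of the response
    results = requests.get(url, headers=headers, timeout=(5, 15))
    print(results.status_code)
except requests.exceptions.Timeout:
    print('Connected, but the server never sent a response')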
Perhaps tesco.com has banned all AWS instances from scraping their website?
Here is the code, which works on my PC but not on the AWS instance.
EDIT: I have tried without cookies in the headers; still no luck.
import requests

headers = {
    'User-Agent': 'PostmanRuntime/7.25.0',
    'Accept': '*/*',
    'Cache-Control': 'no-cache',
    'Host': 'www.tesco.com',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'bm_sz=04919BE521C5C4D8ADF4617D5250A484~YAAQrpxkX+b8IYVyAQAA/VQr0QgTg5gDEXUmuUfa0qqtHv0QHHZjtL4gcSJ9RA7hoaEXJOTp1DYPb9xCrGwP37BrvtUY2kCKB7PqvVLXAXnfrt9F0ZiEPj10SiSVXZRZj8klW46ZA7Ho/0XtWlsO2aFX1MPkmD2/C10cDH6E1PgeO9EUNkZi9uPu109p4DE=; _abck=5621BD87FE69A39458BD0AB267BB9A81~-1~YAAQrpxkX+f8IYVyAQAA/VQr0QTSvxcBlxnRsND9THtPksH0EbfK/A3XkW0xT9oCk0Bj1ewbVDXr3PqtBjR7hHO6h6IXMvC2XID5RrAk0gVEKGwm9RDyBWyvp6hnPzicHMH6tTUIZdYLmssjIBAJ2WnpBkKUuF0YbX45V4H8d3m6u8FOhyqZewFyT1+Yvh14NDHwmDw4Yb4hQkLPglrkzt8LV39SpfSjjGkWMjyX4l967aCe+SHK5hjcTIz9bjSAoOQNqFWR5ATMnfBDSLOfaAQ4Dic=~-1~-1~-1; atrc=48693e75-78d9-4fce-85d0-9a0a50232644; _csrf=2wH2UKiamS-tjvd4hERekcG2',
    'Referer': 'http://www.tesco.com/'
}

url = 'https://www.tesco.com/groceries/en-GB/search?query=kitkat'
results = requests.get(url, headers=headers)
print(results.status_code)
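For completeness, the hard-coded Cookie value above was copied from Postman and will eventually expire; a requests.Session keeps whatever cookies the server sets and re-sends them automatically, so the header doesn't have to be pasted in by hand. This is only a sketch of that idea, not a fix for the AWS block:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'PostmanRuntime/7.25.0',
    'Accept': '*/*',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
})

# Any Set-Cookie values from this response are stored in session.cookies
# and sent automatically on the next request.
resp = session.get('https://www.tesco.com/groceries/en-GB/search?query=kitkat', timeout=15)
print(resp.status_code)
print(session.cookies.get_dict())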
The www.tesco.com robots.txt doesn't forbid scraping this page:
Sitemap: https://www.tesco.com/UK.sitemap.xml
User-agent: *
Disallow: *reviews/submission/*
Disallow: *&sortBy*
Disallow: *promotion=*
Disallow: *currentModal*
Disallow: *active-tab*
Disallow: *include-children*
Disallow: *new&new*
Disallow: /groceries/*reviews/submission
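The same check can be run programmatically with the standard library, although urllib.robotparser matches Disallow entries as plain prefixes and does not expand the * wildcards used in this file, so the result is only a rough confirmation:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.tesco.com/robots.txt')
rp.read()  # note: fetching robots.txt may itself be blocked from the AWS instance

url = 'https://www.tesco.com/groceries/en-GB/search?query=kitkat'
print(rp.can_fetch('*', url))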
EDIT:
I have downloaded a headless Chrome browser onto my Ubuntu server instance on AWS and tried to take a screenshot of tesco.com. I get the error below:
For clarification, I tried to browse the https address, which shouldn't matter as I'm sure the site has an https redirect.
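In case it helps reproduce the screenshot attempt, here is a minimal Selenium sketch; it assumes the selenium package and a matching chromedriver are installed on the instance (the original attempt may have used the Chrome command line directly):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')             # commonly needed when running on a server
options.add_argument('--disable-dev-shm-usage')  # avoids /dev/shm size issues on small instances

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.tesco.com')
    driver.save_screenshot('tesco.png')
finally:
    driver.quit()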