
I have built a script to scrape the www.tesco.com grocery search results page. Example link:

https://www.tesco.com/groceries/en-GB/search?query=kitkat

Unfortunately my Python script is getting blocked by the server (a regular GET request). I have even tried to use curl on my machine to troubleshoot:

curl https://www.tesco.com

but I get the response below:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
 
You don't have permission to access "http&#58;&#47;&#47;dce&#45;homepage&#46;tesco&#46;com&#47;" on this server.<P>
Reference&#32;&#35;18&#46;496cd417&#46;1592645071&#46;44e961c
</BODY>
</HTML>

When trying Postman with its standard headers I get a 200 OK response. In my script I have tried using the same headers as Postman and I get 200 OK, but only on my local PC. When I spin up a fresh instance on AWS (free tier, Ubuntu 18.04 or similar), even curl gets Access Denied as above. Ideally I would like my script to work on AWS. When run there, the script doesn't work - it just hangs. When I interrupt it I get the below:

^CTraceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ttest.py", line 18, in <module>
    results = requests.get(url, headers = headers)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 630, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
KeyboardInterrupt

Perhaps tesco.com has banned all AWS instances from scraping their website?

Here is the code, which works on my PC but not on an AWS instance.

EDIT - I have tried without the cookies in the headers - still no luck.


import requests

headers = {
    'User-Agent': 'PostmanRuntime/7.25.0',
    'Accept': '*/*',
    'Cache-Control': 'no-cache',
    'Host': 'www.tesco.com',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'bm_sz=04919BE521C5C4D8ADF4617D5250A484~YAAQrpxkX+b8IYVyAQAA/VQr0QgTg5gDEXUmuUfa0qqtHv0QHHZjtL4gcSJ9RA7hoaEXJOTp1DYPb9xCrGwP37BrvtUY2kCKB7PqvVLXAXnfrt9F0ZiEPj10SiSVXZRZj8klW46ZA7Ho/0XtWlsO2aFX1MPkmD2/C10cDH6E1PgeO9EUNkZi9uPu109p4DE=; _abck=5621BD87FE69A39458BD0AB267BB9A81~-1~YAAQrpxkX+f8IYVyAQAA/VQr0QTSvxcBlxnRsND9THtPksH0EbfK/A3XkW0xT9oCk0Bj1ewbVDXr3PqtBjR7hHO6h6IXMvC2XID5RrAk0gVEKGwm9RDyBWyvp6hnPzicHMH6tTUIZdYLmssjIBAJ2WnpBkKUuF0YbX45V4H8d3m6u8FOhyqZewFyT1+Yvh14NDHwmDw4Yb4hQkLPglrkzt8LV39SpfSjjGkWMjyX4l967aCe+SHK5hjcTIz9bjSAoOQNqFWR5ATMnfBDSLOfaAQ4Dic=~-1~-1~-1; atrc=48693e75-78d9-4fce-85d0-9a0a50232644; _csrf=2wH2UKiamS-tjvd4hERekcG2',
    'Referer': 'http://www.tesco.com/',
}

url = 'https://www.tesco.com/groceries/en-GB/search?query=kitkat'
results = requests.get(url, headers=headers)

print(results.status_code)
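One note on the hang: `requests.get` has no default timeout, so when the server silently drops the connection the call blocks forever (which matches the KeyboardInterrupt traceback above). A minimal sketch of a session with a timeout and retries - the retry counts, backoff factor, and status codes here are illustrative, not required values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Build a session that retries transient failures with exponential backoff."""
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    return session

session = make_session()
# timeout=(connect, read) in seconds, so the call fails fast
# instead of blocking forever on a silently dropped connection:
# results = session.get(url, headers=headers, timeout=(5, 15))
```

This won't fix an IP-level block, but it turns an indefinite hang into a quick, catchable `requests.exceptions.RequestException`.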

www.tesco.com robots.txt doesn't forbid scraping:


Sitemap: https://www.tesco.com/UK.sitemap.xml
 
User-agent: *
Disallow: *reviews/submission/*
Disallow: *&sortBy*
Disallow: *promotion=*
Disallow: *currentModal*
Disallow: *active-tab*
Disallow: *include-children*
Disallow: *new&new*
Disallow: /groceries/*reviews/submission

EDIT:

I have downloaded a headless Chrome browser to my Ubuntu server instance on AWS and tried to take a screenshot of tesco.com, and I get an error there too (screenshot omitted).

For clarification, I tried to browse the https address - which shouldn't matter, as I'm sure the site redirects to https anyway.

Krzysztof_K
  • I had suspicions around the CSRF token but I was able to get a 200 response with exactly your code on both a repl.it window and a Digital Ocean instance. My other suspicion would be around rate limiting. Try implementing an exponential backoff? – shaunakde Jun 20 '20 at 13:10
  • Well, I'm not sending requests to tesco.com at high rates. At the moment I run the script a few times a day. I will try to spin up a Windows instance on AWS, install Postman and try again - this may tell me if all AWS IPs are blocked by tesco.com or perhaps only Linux instances? – Krzysztof_K Jun 20 '20 at 13:17
  • @shaunakde what do you mean by version mismatch? I'm currently using Python 3.8.2 – Krzysztof_K Jun 20 '20 at 13:24
  • I can run `curl 'https://www.tesco.com/groceries/en-GB/search?query=kitkat' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'` from my local machine and it works, but running it from an AWS instance does not work. I guess AWS IPs are blocked; I also expect common proxies are blocked, and some user agents appear blocked too, e.g. curl. – Dan-Dev Jun 20 '20 at 22:17
  • @Dan-Dev Thank you! I was hoping someone else would spin up an AWS instance and check. Now I know for sure that it's not just me :) I will research some paid proxies. Does anyone know any cheap ones? My project isn't commercial, so I don't want to spend loads of money on a portfolio app. – Krzysztof_K Jun 20 '20 at 22:24
  • @Dan-Dev I have used paid proxies and those work fine. Thank you for your help! Damn you Tesco! – Krzysztof_K Jun 20 '20 at 23:52
  • Did you get the proxies working on your server? I am using Selenium on GCP and facing the same problem. I am unable to set up proxies at the server level. It works for curl calls and requests in Python, but my use case is with Selenium. @Dan-Dev, can I set up a proxy for my Selenium browser? – Mayank Feb 21 '21 at 04:53
  • Hi. Yes, proxies saved me. I found an online set of 5 proxies - free month trial, then a few dollars a month. Works great; I just rotate them at random from a text file. – Krzysztof_K Mar 09 '21 at 00:17

2 Answers


AWS publishes its IP ranges in JSON format. These can be imported into web servers to block scraping, and I would expect a large supermarket chain like Tesco to implement this.

One thing to try is changing the AWS region to a newer one, e.g. Europe (Paris) eu-west-3, on the small chance that their imported IP ranges are out of date.

There's also the possibility that someone with an AWS Lambda in the same shared IP range submitted too many requests in a short period and got the range automatically blocked.

To get around this issue you can connect to a VPN, which will hide your AWS IP address. Alternatively, you could set up a VPN tunnel to your local machine (and therefore use your local PC's IP address).
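To see whether this is what's happening, you can check an address against the published ranges yourself. A sketch - the two prefixes below are an illustrative excerpt only; in practice you would load the full list from AWS's documented feed at https://ip-ranges.amazonaws.com/ip-ranges.json:

```python
import ipaddress

# Illustrative excerpt only - load the real list from AWS's ip-ranges.json feed
AWS_PREFIXES = ["52.95.110.0/24", "3.5.140.0/22"]

def is_aws_ip(ip: str, prefixes=AWS_PREFIXES) -> bool:
    """Return True if `ip` falls inside any of the given AWS prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(p) for p in prefixes)

print(is_aws_ip("52.95.110.1"))   # True: inside 52.95.110.0/24
print(is_aws_ip("192.168.0.1"))   # False: a private address
```

If your instance's public IP tests True against the full feed, any site importing that list can block you regardless of headers.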

Greg
  • Thank you! I'm just checking if an AWS Windows Server instance will be blocked. I did try using proxies for scraping - yes, the free ones - and unfortunately I was still blocked. I guess I can try a VPN. – Krzysztof_K Jun 20 '20 at 13:21
  • Just for testing purposes, you can create a basic Python Lambda and try various regions. Also, it shouldn't make a difference, but in your headers you should only need Referer and User-Agent (and set the user agent to Chrome). Setting the Host can sometimes cause issues. – Greg Jun 20 '20 at 13:31
  • Ok, I have just tried `curl www.tesco.com` on a Windows instance of AWS - no reply. But using IE allows me to browse their website. Hmmmm, any ideas? – Krzysztof_K Jun 20 '20 at 13:40
  • I have used additional header options to mimic Postman requests, as those go through fine. – Krzysztof_K Jun 20 '20 at 13:41
  • I just tried wget on my Linux instance - same issue: `wget https://www.tesco.com` ```--2020-06-20 13:49:56-- https://www.tesco.com/ Resolving www.tesco.com (www.tesco.com)... 23.218.140.69 Connecting to www.tesco.com (www.tesco.com)|23.218.140.69|:443... connected. HTTP request sent, awaiting response...``` – Krzysztof_K Jun 20 '20 at 13:50
  • When you ran `curl www.tesco.com` on the Windows instance of AWS - if there's no response at all, then it's usually a firewall issue. Try a URL that you know will work (one that you own). Also, if Tesco works in IE, try going to a webpage that does an IP lookup (just to make sure it's using the AWS IP). IE, Postman and your code are all doing the same thing: they send a GET request to a URL, with some headers, from an IP address. If the response is nothing, then that's usually an inbound/outbound issue. If you get "Access Denied", that means the website (Tesco) didn't like the URL, headers and/or IP address. – Greg Jun 20 '20 at 17:38
  • I get the below error when I wait long enough on the Windows instance. Otherwise IE or Chrome display tesco.com fine. ```curl : The underlying connection was closed: An unexpected error occurred on a receive. At line:1 char:1 + curl https://www.tesco.com + ~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebExc eption + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand ``` I can confirm that it's using the AWS external IP address – Krzysztof_K Jun 20 '20 at 18:14
  • Can you install Postman onto the server? "The underlying connection was closed: An unexpected error" - this looks like the instance hasn't been set up correctly with TLS. This goes a bit beyond me, but have a look at https://stackoverflow.com/questions/41897114/unexpected-error-occurred-running-a-simple-unauthorized-rest-query?rq=1 – Greg Jun 20 '20 at 18:53
  • I have installed Postman and indeed it's getting blocked. It gets stuck on "Sending request" and I get nothing back. In my opinion their website is smart enough to know which clients are real browsers and which are not, and it blocks everyone else. Shame. – Krzysztof_K Jun 20 '20 at 19:44

Looks like tesco.com is blocking AWS IP addresses. I have resorted to using paid proxies, which work fine for now. Thank you @Dan-Dev for helping by checking on your AWS instances.
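A minimal sketch of the rotating-proxy approach mentioned in the comments - the endpoints below are placeholders, not real proxies; substitute your provider's list:

```python
import random
import requests

# Placeholder endpoints - replace with your paid provider's proxy list
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def pick_proxy(proxy_list=PROXIES) -> dict:
    """Choose one proxy at random and build the `proxies` mapping for requests."""
    proxy = random.choice(proxy_list)
    return {"http": proxy, "https": proxy}

# Route the request through the chosen proxy; always set a timeout:
# results = requests.get(url, headers=headers,
#                        proxies=pick_proxy(), timeout=15)
```

Picking at random per request spreads traffic across the pool, so no single proxy IP accumulates enough requests to get itself blocked.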

Krzysztof_K