I've built a simple Python web scraper that works as expected locally but fails on AWS Lambda -- specifically, and only, for the website I want to scrape. I've tested the scraping portion of the code in isolation and can confirm it is a Cloudflare anti-bot issue.
I've combed through the relevant SO and Medium articles and tried:

- adding the appropriate headers
- specifying a user agent
- using different libraries (`urllib`, `cloudscraper`, `selenium`) -- a simplified `cloudscraper` sketch follows this list
- using a virtual display (`pyvirtualdisplay` with `xvfb`) as described in this post: How to bypass Cloudflare bot protection in selenium
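For reference, this is roughly what the `cloudscraper` attempt looked like (trimmed to the essentials; the return value is just for illustration):

```python
import cloudscraper


def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    # cloudscraper wraps requests and attempts to solve Cloudflare's JS challenge
    scraper = cloudscraper.create_scraper()
    resp = scraper.get(url)
    # On Lambda this still comes back as a 403 with the challenge page
    return {'status': resp.status_code, 'body': resp.text[:500]}
```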
Example code of the `urllib` version to illustrate the question:
```python
import urllib.request


def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    # Spoof a desktop browser user agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
    }
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)  # on Lambda this raises HTTPError 403
    respData = resp.read()
    return respData
```
The above code returns a 403 status plus a reCAPTCHA challenge page.
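For completeness, this is roughly how I inspected what comes back (`check_response` is just an illustrative helper, not part of the deployed handler) -- the body served with the 403 contains the Cloudflare challenge markup:

```python
import urllib.error
import urllib.request


def check_response(url, headers):
    # urlopen raises HTTPError for 4xx/5xx responses, so catch it to read the body
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as e:
        # The error object still carries the response body (the Cloudflare challenge page)
        return e.code, e.read().decode('utf-8', errors='replace')
```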
I understand that data center IP ranges are treated with more suspicion by anti-bot services than residential IPs -- is there any workaround for this?
Thank you in advance.