I've built a simple Python web scraper that works as expected locally but fails on AWS Lambda, and only for the one website I want to scrape. I've tested the scraping portion of the code in isolation and can confirm that it is a Cloudflare anti-bot issue.

I've combed through relevant SO and Medium articles and tried:

Example code of the urllib version to illustrate the question:

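import urllib.request

def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    # Send a browser-like User-Agent instead of urllib's default one
    headers = {
        'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    }
    req = urllib.request.Request(url, headers=headers)
    # urlopen raises urllib.error.HTTPError on a 4xx/5xx response
    with urllib.request.urlopen(req) as resp:
        resp_data = resp.read()
    # Decode the bytes so the Lambda return value is JSON-serializable
    return resp_data.decode('utf-8')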

On Lambda, the above code fails with a 403 status and a reCAPTCHA challenge page.
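For anyone trying to reproduce this: urllib.request.urlopen raises urllib.error.HTTPError on a 403 rather than returning a response, so the status and the challenge page have to be read from the exception. A minimal sketch of how that could be captured is below. The function name fetch_with_diagnostics and the injectable opener parameter are illustrative choices, not part of my Lambda.

```python
import urllib.error
import urllib.request


def fetch_with_diagnostics(url, headers, opener=urllib.request.urlopen):
    """Return (status, body_text), including for HTTP error responses.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urlopen.
    """
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The error object still carries the response body, e.g. the
        # challenge page served together with the 403.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Logging the returned status and the first few hundred characters of the body should make it easy to tell a challenge page apart from an ordinary error.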

I understand that data center IP ranges are treated more strictly by anti-bot systems than residential IPs. Is there any workaround for this?

Thank you in advance.

  • Got the same issue with a site protected by Cloudflare anti-bot protection. This seems to be smart :( I'm gonna check how to use a special proxy inside of AWS Lambda. – Roman T Jul 16 '22 at 20:42
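The proxy idea from the comment could be sketched with urllib alone, roughly as below. This is only an illustration under assumptions: the environment variable name SCRAPER_PROXY_URL and both function names are hypothetical, a working proxy endpoint has to be supplied separately, and nothing here guarantees that a request sent through a proxy will pass the anti-bot check.

```python
import os
import urllib.request


def build_proxy_opener(proxy_url):
    """Build a urllib opener that sends http and https requests via proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


def fetch_via_proxy(url, user_agent):
    # SCRAPER_PROXY_URL is a hypothetical environment variable name,
    # e.g. set in the Lambda function's configuration.
    proxy_url = os.environ["SCRAPER_PROXY_URL"]
    opener = build_proxy_opener(proxy_url)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Reading the proxy address from the environment keeps credentials out of the handler code.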

0 Answers