
I tried running this Python script, which uses the BeautifulSoup and requests modules:

from bs4 import BeautifulSoup as bs
import requests

url = 'https://udemyfreecourses.org/'
headers = {'UserAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
soup = bs(requests.get(url, headers=headers).text, 'lxml')

But when I run this line:

print(soup.get_text())

It doesn't scrape the page text; instead, it returns this output:

Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.

I even sent headers with the request so that it looks like a normal browser, but I'm still getting this message, which prevents me from accessing the real webpage.

Note: The webpage works perfectly when opened directly in a browser, but it doesn't return much when I try to scrape it.

Is there any way, other than the headers I used, to send a valid request to the website and get past this Mod_Security protection?

Any help would be much appreciated. Thanks.

    ModSecurity is a web application firewall that can be configured with rules, and it is smart enough not to tell you which rule rejected your traffic. I guess in your case the website wants to tell you that it does not like being scraped. – Klaus D. Dec 27 '20 at 14:10

1 Answer


EDIT: The dash in "User-Agent" is essential.
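
Here is a rough way to check this without contacting the site, using the request-preparation step of requests. A key spelled 'UserAgent' goes out as an unrelated custom header, so the real User-Agent header keeps the library default. The helper name sent_user_agent is made up for this illustration.

```python
import requests

URL = "https://udemyfreecourses.org/"
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) "
    "Gecko/20100101 Firefox/55.0"
)


def sent_user_agent(headers):
    """Return the User-Agent requests would send, without any network traffic.

    Hypothetical helper for illustration only.
    """
    with requests.Session() as session:
        # prepare_request merges the session's default headers with ours,
        # which is what requests.get() does internally before sending.
        prepared = session.prepare_request(
            requests.Request("GET", URL, headers=headers)
        )
    return prepared.headers["User-Agent"]


if __name__ == "__main__":
    # Misspelled key: the default python-requests/<version> value is still sent.
    print(sent_user_agent({"UserAgent": BROWSER_UA}))
    # Correct key: the browser string replaces the default.
    print(sent_user_agent({"User-Agent": BROWSER_UA}))
```

A firewall rule that rejects the default python-requests agent would explain why the original script was blocked, though the exact rule on this site is not known.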

Following this answer: https://stackoverflow.com/a/61968635/8106583

headers = {
     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

Your User-Agent is the problem. This User-Agent works for me.

Also: your IP might be blocked by now :D
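
Putting it together, a minimal sketch of the corrected script might look like this. The function names fetch_text and page_text and the 10-second timeout are my own choices, and I swapped in the built-in 'html.parser' so the example does not depend on lxml. I am also assuming the firewall rejection comes back with an HTTP error status, in which case raise_for_status() raises instead of letting the error page be parsed silently.

```python
import requests
from bs4 import BeautifulSoup

# Header from the answer above; note the dash in the key.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) "
        "Gecko/20100101 Firefox/55.0"
    ),
}


def page_text(html):
    """Extract the visible text from an HTML string."""
    # 'html.parser' ships with Python; the question used 'lxml' instead.
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)


def fetch_text(url, timeout=10):
    """Download a page with a browser-like User-Agent and return its text."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError on 4xx/5xx responses (assumed to include
    # the firewall's rejection).
    response.raise_for_status()
    return page_text(response.text)


if __name__ == "__main__":
    print(fetch_text("https://udemyfreecourses.org/"))
```

If the site still refuses the request, the block is probably based on something other than the User-Agent, such as the IP address.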

wuerfelfreak