
I am new to this, and I am trying to grab the source code of a web page for a tutorial. I have beautifulsoup and requests installed. As a first step I want to grab the page source. I am doing this scraping exercise against "https://pythonhow.com/example.html". I am not doing anything illegal, and I believe this site was set up for exactly this purpose. Here's my code:

import requests
from bs4 import BeautifulSoup

r=requests.get("http://pythonhow.com/example.html")
c=r.content
c

And I got this Mod_Security error:

b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'

Thanks to everyone who takes the time to help. Respectfully,

özgür Sanli

2 Answers


You can easily fix this issue by providing a user agent to the request. By doing so, the website will think that someone is actually visiting the site using a web browser.

Here is the code that you want to use:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

r = requests.get("http://pythonhow.com/example.html", headers=headers)
c = r.content

print(c)

This gives you the expected output:

b'<!DOCTYPE html>\n<html>\n<head>\n<style>\ndiv.cities {\n    background-color:black;\n    color:white;\n    margin:20px;\n    padding:20px;\n} \n</style>\n</head>\n<body>\n<h1 align="center"> Here are three big cities </h1>\n<div class="cities">\n<h2>London</h2>\n<p>London is the capital of England and it\'s been a British settlement since 2000 years ago. </p>\n</div>\n<div class="cities">\n<h2>Paris</h2>\n<p>Paris is the capital city of France. It was declared capital since 508.</p>\n</div>\n<div class="cities">\n<h2>Tokyo</h2>\n<p>Tokyo is the capital of Japan and one of the most populated cities in the world.</p>\n</div>\n</body>\n</html>'
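Once the page comes back, the BeautifulSoup import from the question can actually be put to use. A minimal sketch, parsing a shortened copy of the returned HTML directly so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the HTML returned above, pasted in so the sketch runs offline.
html = """<!DOCTYPE html>
<html>
<body>
<h1 align="center"> Here are three big cities </h1>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital city of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Pull the city name out of each <div class="cities"> block.
cities = [div.h2.get_text() for div in soup.find_all("div", class_="cities")]
print(cities)  # → ['London', 'Paris', 'Tokyo']
```

With a live request you would pass `r.content` to BeautifulSoup instead of the hardcoded string.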
Siddharth Dushantha

We just need to pass an argument called headers to requests.get:

import requests
from bs4 import BeautifulSoup

url = "http://pythonhow.com/example.html"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
AgentNirmites
    Can you tell me why we put `headers`? Second, I am using Windows, and the header there looks like: `agent = { "User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}` So should we put both of them, or change it per our system? – Engr Ali Aug 24 '22 at 04:39
    @EngrAli, this is basic security, and we can bypass it. By adding the header, we are telling the server that the request is coming from a browser. Second, if you are using Windows and the Linux user agent works, then there's nothing to change; it is just a user agent. For more information, open the developer console in any (decent) browser and go to the "Network" tab. There you can see all the requests your browser sends, and in code we have to do the same. Thanks for your question; please don't forget to upvote if it helped you ☺. – AgentNirmites Aug 27 '22 at 13:09
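As the comment above suggests, you can also inspect what your own script sends before anything goes over the wire. A minimal sketch using requests' prepared-request machinery (no network traffic involved; the URL is the one from the question):

```python
import requests

url = "http://pythonhow.com/example.html"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}

# Build the request but do not send it, so we can look at its headers.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url, headers=headers))

# The prepared request now advertises a browser User-Agent instead of the
# default "python-requests/x.y.z" string that Mod_Security rejects.
print(prepared.headers["User-Agent"])
```

If you drop the `headers=` argument, the same inspection shows the default `python-requests/…` User-Agent, which is exactly what the server was blocking.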