I am trying to scrape pages from this website. The Arabic and French versions of each page share the same URL. I tried the following code:

    headers = {'Accept-Language': "lang=\"AR-DZ"}
    r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx",headers)
    soup = BeautifulSoup(r.content,"lxml")
    print(soup.getText)

I get the following error message:

<bound method Tag.get_text of <html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br/><br/>Your support ID is: 12750291427324767866<br/><br/><a href="javascript:history.back();">[Go Back]</a></body></html>>

When I remove the header, BeautifulSoup scrapes the page in French.

My goal is to scrape the statements and speeches in Arabic in order to build a corpus. Any help appreciated.

OneCricketeer
user3357814
  • Try using Postman and see if you get the same error. – oubaydos Dec 01 '21 at 16:43
  • The support ID error generally means your request was blocked by their firewall. Double-check that your request headers are correct; you may need additional headers for the request to be accepted. – Wondercricket Dec 01 '21 at 16:45
  • Normally, to change the language on this page you have to click the link `http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx`, which has `language=ar`, so maybe do the same in code. Use `Session()` to remember cookies and make the first `get()` request to this URL through the session. Maybe it will set the correct language in the cookies. – furas Dec 01 '21 at 19:37
  • In the string `"lang=\"AR-DZ"` you have an opening `"` before `AR-DZ` but no closing `"` after it; maybe you should use `"lang=AR-DZ"` instead. – furas Dec 01 '21 at 19:40

2 Answers

First: the string "lang=\"AR-DZ" has an opening " before AR-DZ but no closing " after it. You should use "lang=AR-DZ" instead.
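
A related detail, offered as a hedged sketch: in the question the dict is passed as the second positional argument of requests.get(), and that argument is params, so the dict ends up in the query string rather than in the request headers. The snippet below builds both variants without sending anything, so the difference can be inspected. The header value "ar-DZ" is an assumption (a plain language tag, the usual form for Accept-Language); whether this site honours the header at all is not confirmed here.

```python
import requests

URL = "http://www.mae.gov.dz/news_article/6396.aspx"

# Assumption: a plain language tag as the header value.
headers = {"Accept-Language": "ar-DZ"}

# requests.get(URL, headers) binds the dict to `params`,
# which is equivalent to this prepared request.
as_params = requests.Request("GET", URL, params=headers).prepare()

# Passing the dict by keyword sends it as a real HTTP header.
as_headers = requests.Request("GET", URL, headers=headers).prepare()

print(as_params.url)    # the dict was appended to the query string
print(as_headers.url)   # the URL is unchanged
print(as_headers.headers["Accept-Language"])
```

So if you do want to send a header, write requests.get(url, headers=headers) with the keyword.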


Normally, to change the language of this page in a browser, you click the link http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx, which contains language=ar. You can do the same in code.

Use Session() to remember cookies, and make the first get() request to this URL through the session. It will set the correct language in the cookies.

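    import requests
    from bs4 import BeautifulSoup

    # Optional headers, not needed for this approach
    # headers = {'User-Agent': 'Mozilla/5.0'}
    # headers = {'Accept-Language': "lang=AR-DZ"}

    # A Session keeps cookies between requests
    s = requests.Session()

    # Visit the language-selection URL first so the language cookie gets set
    url = 'http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx'
    r = s.get(url)  # optionally: s.get(url, headers=headers)

    # Then request the article through the same session
    url = 'http://www.mae.gov.dz/news_article/6396.aspx'
    r = s.get(url)  # optionally: s.get(url, headers=headers)

    soup = BeautifulSoup(r.content, "lxml")
    # Call the method; printing soup.getText without () shows only the bound method
    print(soup.get_text())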
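
Since the goal is a corpus, the same session idea can be wrapped in a small helper that selects Arabic once and then fetches several articles. This is only a sketch: the function names are hypothetical, the /news_article/<id>.aspx pattern is assumed from the single URL in the question, and "html.parser" is used here just to avoid the lxml dependency.

```python
import requests
from bs4 import BeautifulSoup

BASE = "http://www.mae.gov.dz"
LANG_URL = BASE + "/select_language.aspx?language=ar&file=default_ar.aspx"


def extract_text(html):
    """Return the text of an HTML document, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)


def fetch_arabic_pages(article_ids):
    """Fetch several articles through one session after selecting Arabic."""
    texts = {}
    with requests.Session() as s:
        # First request sets the language cookie on the session
        s.get(LANG_URL, timeout=30)
        for article_id in article_ids:
            # Assumed URL pattern, taken from the one article in the question
            r = s.get(f"{BASE}/news_article/{article_id}.aspx", timeout=30)
            r.raise_for_status()
            texts[article_id] = extract_text(r.content)
    return texts
```

Calling fetch_arabic_pages([6396]) would then return a dict that maps each article id to its extracted text.
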
furas

Set the language cookie to "ar":

    import requests
    from bs4 import BeautifulSoup

    cookies = dict(language='ar')

    r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx", cookies=cookies)
    soup = BeautifulSoup(r.content, "lxml")
    print(soup.text)
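
Because the French and Arabic pages share one URL, it may be worth checking what actually came back before adding it to a corpus. The helpers below are a hypothetical sketch, not part of any library: one looks for the "Request Rejected" title shown in the question, the other measures how many letters fall in the basic Arabic Unicode block (U+0600 to U+06FF).

```python
def is_rejected(html_text):
    """True if the page looks like the firewall's rejection page."""
    return "Request Rejected" in html_text


def arabic_ratio(text):
    """Share of alphabetic characters that are in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)
```

For example, you could keep a page only if is_rejected(r.text) is false and arabic_ratio(soup.text) is above some cutoff such as 0.5; that cutoff is an arbitrary assumption and would need tuning on real pages.
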
diggusbickus