How to get Beautiful Soup to scrape pages in Arabic from a multilingual website where pages in different languages have the same URL

Question

I am trying to scrape pages from this website. The pages in Arabic and French have the same URL. I tried the following code:

    headers = {'Accept-Language': "lang=\"AR-DZ"}
    r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx",headers)
    soup = BeautifulSoup(r.content,"lxml")
    print(soup.getText)

I get the following error message:

<bound method Tag.get_text of <html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br/><br/>Your support ID is: 12750291427324767866<br/><br/><a href="javascript:history.back();">[Go Back]</a></body></html>>

When I remove the header, BeautifulSoup scrapes the page in French.

My goal is to scrape the statements and speeches in Arabic in order to build a corpus. Any help appreciated.

The support ID error generally means your request was blocked by their firewall. Double-check that your request headers are correct; you may need additional headers for the request to be accepted. — Wondercricket, Dec 01 '21 at 16:45
Normally, to change the language on this page you have to click the link `http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx`, which has `language=ar`, so maybe do the same in code. Use `Session()` to remember cookies and first use `get()` with this URL. Maybe it will set the correct language in the cookies. — furas, Dec 01 '21 at 19:37
You have an opening `"` before `AR-DZ` but no closing `"` after `AR-DZ` in the string `"lang=\"AR-DZ"`, but maybe you should use `"lang=AR-DZ"` instead. — furas, Dec 01 '21 at 19:40

score 1 · Accepted Answer · answered Dec 01 '21 at 19:47

First: the string "lang=\"AR-DZ" has an opening " before AR-DZ but no closing " after it. You should rather use "lang=AR-DZ".
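
A related detail in the question's code: `requests.get(url, headers)` passes the dict positionally, and the second positional parameter of `requests.get` is `params`, so the dict ends up in the query string rather than in the request headers. Also, Accept-Language values are normally plain language tags such as `ar-DZ`, optionally weighted with `;q=`, without a `lang=` prefix. The sketch below shows one way to build such a value and pass it by keyword. The helper names `accept_language` and `fetch_with_language` are hypothetical, and whether this particular site honors the header at all is an assumption that has not been checked here.

```python
def accept_language(tags):
    """Build an Accept-Language value such as 'ar-DZ,ar;q=0.9,fr;q=0.8'.

    Hypothetical helper: the first tag gets no q-value (implicitly 1.0),
    later tags get decreasing weights, never below 0.1.
    """
    parts = []
    for i, tag in enumerate(tags):
        if i == 0:
            parts.append(tag)
        else:
            q = max(1.0 - 0.1 * i, 0.1)
            parts.append(f"{tag};q={q:.1f}")
    return ",".join(parts)


def fetch_with_language(url, tags):
    """Request url with an Accept-Language header passed by keyword."""
    import requests  # imported here so accept_language has no dependencies

    headers = {"Accept-Language": accept_language(tags)}
    # headers= must be a keyword argument; a bare second positional
    # argument to requests.get is treated as query-string params.
    return requests.get(url, headers=headers, timeout=10)
```

If the header alone does not change the language, the cookie-based approach described next is the one to rely on.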

Normally, in a browser, you change the language on this page by clicking the link with the URL http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx, which contains language=ar, so you can do the same in code.

Use Session() to remember cookies, and first call the session's get() with this URL. It will set the correct language in the cookies.

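    import requests
    from bs4 import BeautifulSoup

    #headers = {'User-Agent': 'Mozilla/5.0'}
    #headers = {'Accept-Language': "lang=AR-DZ"}

    # A Session keeps cookies between requests
    s = requests.Session()

    # First visit the language-selection URL so the site sets the language cookie
    url = 'http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx'
    r = s.get(url)#, headers=headers)

    # Then request the article with the same session
    url = 'http://www.mae.gov.dz/news_article/6396.aspx'
    r = s.get(url)#, headers=headers)

    soup = BeautifulSoup(r.content, "lxml")
    # Call the method; soup.getText without () prints the bound method itself
    print(soup.get_text())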
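
Because the Arabic and French versions share one URL, it may be worth confirming which version came back before adding the text to a corpus. The following is a rough sketch of such a check, not part of the original answer: the helper names `arabic_ratio`, `looks_rejected`, and `is_arabic_page`, and the 0.5 threshold, are illustrative assumptions. It only looks at the basic Arabic Unicode block (U+0600 to U+06FF) and at the "Request Rejected" title quoted in the question.

```python
def arabic_ratio(text):
    """Return the fraction of alphabetic characters in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_rejected(text):
    """Detect the firewall page from the question by its title text."""
    return "Request Rejected" in text


def is_arabic_page(text, threshold=0.5):
    """Assumed heuristic: not a rejection page and mostly Arabic letters."""
    return not looks_rejected(text) and arabic_ratio(text) >= threshold
```

It could be called on the extracted page text, for example `is_arabic_page(soup.get_text())`, and pages that fail the check can be skipped or retried.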

score 0 · Answer 2 · answered Dec 01 '21 at 17:11

Set the language cookie to "ar":

    import requests
    from bs4 import BeautifulSoup

    # Send the language cookie directly instead of visiting the language-selection page
    cookies = dict(language='ar')

    r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx", cookies=cookies)
    soup = BeautifulSoup(r.content, "lxml")
    print(soup.text)
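
For a corpus of many pages, the same cookie can be attached to a Session so it is sent with every request. The sketch below is only an illustration under stated assumptions: `make_arabic_session` and `fetch_pages` are hypothetical names, the URL pattern is inferred from the single article URL in the question, and the list of article IDs and the one-second delay are placeholders.

```python
import time

# Pattern inferred from the question's URL; other IDs may not exist.
BASE_URL = "http://www.mae.gov.dz/news_article/{}.aspx"


def make_arabic_session():
    """Create a requests Session that sends the language cookie on every request."""
    import requests  # imported lazily so fetch_pages works with any session-like object

    session = requests.Session()
    session.cookies.set("language", "ar")  # cookie name and value as in this answer
    return session


def fetch_pages(session, article_ids, delay=1.0):
    """Return {article_id: raw HTML bytes} for every request that returns 200."""
    pages = {}
    for article_id in article_ids:
        response = session.get(BASE_URL.format(article_id), timeout=10)
        if response.status_code == 200:
            pages[article_id] = response.content
        time.sleep(delay)  # pause between requests to be polite to the server
    return pages
```

A call such as `fetch_pages(make_arabic_session(), [6396])` would return the raw HTML, which can then be parsed with `BeautifulSoup(html, "lxml")` as above.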

diggusbickus

