I tried every 'User-Agent' in here, but I still get `urllib.error.HTTPError: HTTP Error 400: Bad Request`. I also tried this, but then I get `urllib.error.URLError: File Not Found`. I have no idea what to do; my current code is:

from bs4 import BeautifulSoup
import urllib.request, json, ast

with open("urller.json") as f:
    cc = json.load(f)  # the file I get the links from; you can try the link below instead
    # cc = ../games/index.php?g_id=23521&game=0RBITALIS

for x in ast.literal_eval(cc):  # cc is a str(list) so I have to convert it
    if x.startswith("../"):

        r = urllib.request.Request("http://www.game-debate.com{}".format(x[2:]),
                                   headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
        # x[2:] because I removed the '../' part from the urls

        rr = urllib.request.urlopen(r).read()
        soup = BeautifulSoup(rr, "html.parser")

        for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
            print(y.text)

Edit: If you try only one link it probably won't show any error; I get the error every time at the 6th link.

– GLHF
  • Do you _have_ to use `urllib`? I just tried `requests.get("http://www.game-debate.com/games/index.php?g_id=23521&game=0RBITALIS")` and it works perfectly. `requests` is far superior in virtually every respect. – Akshat Mahajan Jun 30 '16 at 02:16
  • @AkshatMahajan I edited the question; if you try only one link it will probably be fine, since I get that bad request error every time at the 6th link from the json file – GLHF Jun 30 '16 at 02:18
  • Have you tried printing each URL before making the request? Perhaps the URL is malformed in some obvious way. – John Gordon Jun 30 '16 at 02:19
  • @JohnGordon the link that I get the error is `../games/index.php?g_id=23255&game=12 Labours of Hercules II: The Cretan Bull` – GLHF Jun 30 '16 at 02:29
  • Those embedded spaces may be causing the issue. I don't believe literal spaces are allowed in a URL. – John Gordon Jun 30 '16 at 02:32
  • I thought of that, but there are 3 links before this one and all of them have spaces. I don't usually work with bs and urllib, and this nonsense error is really confusing me. – GLHF Jun 30 '16 at 02:56
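
For reference, a minimal sketch of the `requests` approach suggested in the comments above (assuming the same urller.json file; `requests` percent-encodes invalid characters such as spaces in the URL for you):

import json, ast
import requests
from bs4 import BeautifulSoup

with open("urller.json") as f:
    cc = json.load(f)

for x in ast.literal_eval(cc):
    if x.startswith("../"):
        # requests takes care of percent-encoding the spaces in the URL
        resp = requests.get("http://www.game-debate.com" + x[2:],
                            headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
        soup = BeautifulSoup(resp.text, "html.parser")
        for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
            print(y.text)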

1 Answer


A quick fix is to replace the spaces with +:

url = "http://www.game-debate.com"
r = urllib.request.Request(url + x[2:].replace(" ", "+"),
                           headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})

A better option is to let urllib percent-encode the path for you:

from bs4 import BeautifulSoup
import urllib.request, json, ast
from urllib.parse import quote, urljoin

with open("urller.json") as f:
    cc = json.load(f)  # the file I get the links from
    url = "http://www.game-debate.com"

    for x in ast.literal_eval(cc):  # cc is a str(list) so it has to be converted
        if x.startswith("../"):
            # quote percent-encodes the spaces; safe= keeps /, ?, & and =
            # intact so the query string is not mangled
            r = urllib.request.Request(urljoin(url, quote(x.lstrip("."), safe="/?&=")), headers={
                'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})

            rr = urllib.request.urlopen(r).read()
            soup = BeautifulSoup(rr, "html.parser")

            for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
                print(y.text)

Spaces in a URL are not valid; they need to be percent-encoded as %20 or replaced with +.
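
To illustrate the difference, a short sketch (the path is the failing link from the comments): quote percent-encodes spaces as %20, while quote_plus uses + and is meant for individual query values:

from urllib.parse import quote, quote_plus

path = "../games/index.php?g_id=23255&game=12 Labours of Hercules II: The Cretan Bull"

# quote turns the spaces into %20; safe= keeps the query delimiters intact
print(quote(path.lstrip("."), safe="/?&="))
# /games/index.php?g_id=23255&game=12%20Labours%20of%20Hercules%20II%3A%20The%20Cretan%20Bull

# quote_plus turns spaces into + (intended for query values only)
print(quote_plus("12 Labours of Hercules II: The Cretan Bull"))
# 12+Labours+of+Hercules+II%3A+The+Cretan+Bull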

– Padraic Cunningham