
Hey guys, I'm making a Python web crawler at the moment. I have a link whose last characters are "search?q=", and after that I append a word from a wordlist which I loaded into a list beforehand. But when I try to open the result with urllib2.urlopen(url), it throws an error (urlopen error no host given). When I open that link with urllib normally (i.e. typing in the word that is otherwise pasted in automatically), it works fine. Can you tell me why this is happening?

Thanks and regards

Full error:

  File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 61, in <module>
    getResults()
  File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 40, in getResults
    usock = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 402, in open
    req = meth(req)
  File "C:\Python27\lib\urllib2.py", line 1113, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>

Code:

with open(filePath, "r") as ins:
    wordList = []
    for line in ins:
        wordList.append(line)

def getResults():
    packageID = ""
    count = 0
    word = "Test"
    for x in wordList:
        word = x
        print word
        url = 'http://www.example.com/search?q=' + word
        usock = urllib2.urlopen(url)
        page_source = usock.read()
        usock.close()
        print page_source
        startSequence = "data-docid=\""
        endSequence = "\""
        while page_source.find(startSequence) != -1:
            start = page_source.find(startSequence) + len(startSequence)
            end = page_source.find(endSequence, start)
            print str(start)
            print str(end)
            link = page_source[start:end]
            print link
            if link:
                if link not in packageID:
                    packageID += link + "\r\n"
                    print packageID
            page_source = page_source[end + len(endSequence):]
    count += 1

So when I print the string word, it outputs the correct word from the wordlist.
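One thing worth checking at this point: lines read from a file keep their trailing newline, and print does not make it visible. A minimal sketch of inspecting and cleaning a word before building the URL (the strip() and quote() cleanup is an assumption about the cause, not a confirmed fix):

import urllib

word = wordList[0]
print repr(word)            # repr() reveals a trailing "\n" that print hides

word = word.strip()         # drop surrounding whitespace/newlines
url = "http://www.example.com/search?q=" + urllib.quote(word)
print url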

2 Answers


I solved the problem. I'm simply using urllib now instead of urllib2 and everything works fine. Thank you all :)
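For reference, a minimal sketch of that switch inside the existing loop (stripping and quoting the word is an extra precaution, not something the original code did):

import urllib

word = word.strip()   # lines read from the wordlist file end in "\n"
url = "http://www.example.com/search?q=" + urllib.quote(word)
usock = urllib.urlopen(url)   # urllib is more tolerant of odd URLs than urllib2
page_source = usock.read()
usock.close()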


Note that urlopen() returns a response, not a request.

You may have a broken proxy configuration; verify that your proxies are working:

print urllib.getproxies()

or bypass proxy support altogether with:

response = urllib.urlopen(
    "http://www.example.com/search?q=" + text_to_check,
    proxies={})
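If you would rather stay with urllib2, the same bypass can be expressed with an empty ProxyHandler; a sketch using the text_to_check variable from above:

import urllib2

# An empty proxy map tells urllib2 not to consult any proxy at all.
opener = urllib2.build_opener(urllib2.ProxyHandler({}))
response = opener.open("http://www.example.com/search?q=" + text_to_check)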

Here is a sample way of combining the URL with a word from the wordlist. It joins the list entries, fetches the images from the resulting URL, and downloads them. Loop it around to cover the whole list you have (see the sketch after the code below).

import urllib
import re

print "The URL crawler starts.."

mylist = ["http://www.ebay", "https://www.npmjs.org/"]
wordlist = [".com", "asss"]

x = 1
# Combine the first base URL with the first wordlist word, fetch the page,
# and pull out all img src attributes.
urlcontent = urllib.urlopen(mylist[0] + wordlist[0]).read()
imgUrls = re.findall('img .*?src="(.*?)"', urlcontent)

for imgUrl in imgUrls:
    print imgUrl
    urllib.urlretrieve(imgUrl, str(x) + ".jpg")  # save as 1.jpg, 2.jpg, ...
    x += 1
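As mentioned, to cover the whole lists the fetch can be wrapped in a nested loop. A minimal sketch, assuming unreachable combinations should simply be skipped:

x = 1
for base in mylist:
    for word in wordlist:
        try:
            urlcontent = urllib.urlopen(base + word).read()
        except IOError:
            continue  # skip combinations that do not form a reachable URL
        for imgUrl in re.findall('img .*?src="(.*?)"', urlcontent):
            print imgUrl
            urllib.urlretrieve(imgUrl, str(x) + ".jpg")
            x += 1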

Hope this helps, else post your code and error logs.