
I'm trying to make a web crawler in Python using HTMLParser and urllib. Currently I have two different import problems. The first comes from this code:

import html.parser
import urllib

urlText = []

#Define HTML Parser
class parseText(HTMLParser.HTMLParser):

def handle_data(self, data):
    if data != '\n':
        urlText.append(data)


#Create instance of HTML parser
lParser = parseText()

thisurl = "http://www-rohan.sdsu.edu/~gawron/index.html"
#Feed HTML file into parser
lParser.feed(urllib.urlopen(thisurl).read())
lParser.close()
for item in urlText:
    print (item)

With this code, the Visual Studio error box shows

name 'HTMLParser' is not defined

The second option is exactly the same, but with import HTMLParser instead of import html.parser:

import HTMLParser
import urllib

urlText = []

#Define HTML Parser
class parseText(HTMLParser.HTMLParser):

def handle_data(self, data):
    if data != '\n':
        urlText.append(data)


#Create instance of HTML parser
lParser = parseText()

thisurl = "http://www-rohan.sdsu.edu/~gawron/index.html"
#Feed HTML file into parser
lParser.feed(urllib.urlopen(thisurl).read())
lParser.close()
for item in urlText:
    print (item)

which returns the error

No module named 'markupbase'

I'm losing my mind with these packages. Does anyone know a fix or see the problem? P.S. I'm running this in Visual Studio 2016 with Python 3.5.

David A
    I can't reproduce the problem in your second sample. Please show a full traceback. Also fix your indentation. – Alex Hall Dec 15 '16 at 17:39

1 Answer

I am following the same tutorial to learn web crawling, and I hit the same issues yesterday when I ran that code. After a few Google searches I resolved them. I am new to Python and web crawling, so correct me if I say something wrong.

If you are using Python 3.5, you should import HTMLParser from html.parser, and import urllib.request instead of urllib. At line 7 you have to inherit from just HTMLParser instead of HTMLParser.HTMLParser. At this point your code should look like this:

from html.parser import HTMLParser
import urllib.request

urlText = []

#Define HTML Parser
class parseText(HTMLParser):

    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)


#Create instance of HTML parser
lParser = parseText()

thisurl = "http://www-rohan.sdsu.edu/~gawron/index.html"
#Feed HTML file into parsers
lParser.feed(urllib.request.urlopen(thisurl).read())
lParser.close()
for item in urlText:
    print (item)

Now if you run this code, you will get this error:

TypeError: Can't convert 'bytes' object to str implicitly

That's because HTMLParser.feed() only accepts a str, while urllib.request.urlopen().read() returns the raw data as bytes. So we decode the raw data as UTF-8: at line 19, add .decode('utf8') after read(). The final code looks like this:

from html.parser import HTMLParser
import urllib.request

urlText = []

#Define HTML Parser
class parseText(HTMLParser):

    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)


#Create instance of HTML parser
lParser = parseText()

thisurl = "http://www-rohan.sdsu.edu/~gawron/index.html"
#Feed HTML file into parsers
lParser.feed(urllib.request.urlopen(thisurl).read().decode('utf8'))
lParser.close()
for item in urlText:
    print (item)
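
One caveat on the decode step: 'utf8' is hard-coded here, so a page served in another encoding may fail to decode. Below is a minimal sketch of one way around that, assuming the server declares a charset in its Content-Type header. The names TextCollector, fetch_text and collect are made up for illustration and are not part of the tutorial; the parser also keeps its own list instead of the global urlText.

```python
from html.parser import HTMLParser
import urllib.request


class TextCollector(HTMLParser):
    """Hypothetical variant of parseText that stores text on the instance."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Same filter as in the code above: skip bare newlines
        if data != '\n':
            self.chunks.append(data)


def fetch_text(url, fallback='utf8'):
    """Download url and decode it with the charset from the response headers.

    Falls back to `fallback` when the server does not declare a charset
    (an assumption; the page may still use a different encoding).
    """
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset, errors='replace')


def collect(html):
    """Feed an already-decoded HTML string to the parser and return its text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With this, collect(fetch_text(thisurl)) would play the role of the feed/close/print lines above.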

Converting the bytes to a string with str() at line 19 also avoids the TypeError:

lParser.feed(str(urllib.request.urlopen(thisurl).read()))

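But if I use this, handle_data won't recognize any whitespace such as '\n'. That's because str() on a bytes object returns its printable representation (b'...'), in which a newline appears as the two characters \ and n rather than as a real newline. So the code runs, but the '\n' check never removes anything.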
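
To see the difference between the two conversions without touching the network, here is a small sketch. The byte string raw and the helper texts_from are made up for illustration; raw stands in for what urlopen(...).read() returns.

```python
from html.parser import HTMLParser

# Stand-in for urllib.request.urlopen(thisurl).read()
raw = b"<p>Hello</p>\n<p>World</p>"

as_repr = str(raw)            # the text "b'<p>Hello</p>\\n<p>World</p>'"
as_text = raw.decode('utf8')  # the actual page text, with a real newline

print('\n' in as_repr)    # False: no real newline survives str()
print('\\n' in as_repr)   # True: it became a backslash followed by n
print('\n' in as_text)    # True


def texts_from(html):
    """Collect text the same way as parseText, but into a local list."""
    found = []

    class Collector(HTMLParser):
        def handle_data(self, data):
            if data != '\n':
                found.append(data)

    parser = Collector()
    parser.feed(html)
    parser.close()
    return found


print(texts_from(as_text))  # newline filtered out as intended
print(texts_from(as_repr))  # still contains the escaped newline and the b'...' wrapper
```

So decode() is the conversion to use here; str() only hides the TypeError.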

tontus