I wrote simple class which inherits SGMLParser. The main idea behind this class is to collect all the links from html page and print the line number where this link could be found.
The class looks like this:
class HtmlParser(SGMLParser):
def reset(self):
SGMLParser.reset(self)
self.links = []
def start_a(self, attr):
href = [v for k, v in attr if k == "href"]
self.links.append(href[0])
print(self.getpos())
The problem is that getpos() returns (1,0) on every link. So if run the following code:
parser = HtmlParser()
parser.feed('''
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title></title>
</head>
<body>
<a href="www.foo-bar.com"></a>
<a href="http://foo.bar.com"></a>
<a href="www.google.com"></a>
</body>
</html>''')
parser.close()
print(parser.links)
The output will be:
(1, 0)
(1, 0)
(1, 0)
['www.foo-bar.com', 'http://foo.bar.com', 'www.google.com']
The question: why I can't get the actual line number for the links?