
I am writing an HTMLParser implementation in Python that will process a web page downloaded from the internet.

Here is my code:

import HTMLParser
import urllib2

class Parser(HTMLParser.HTMLParser):

...

parser = Parser()

httpRequest = urllib2.Request("http://www......")
pageContent = urllib2.urlopen(httpRequest)

while True:
    htmlTextPortion = pageContent.read()
    if not htmlTextPortion:
        break  # an empty string means the response has been exhausted
    parser.feed(htmlTextPortion)
My question is: will the read call block until the whole HTML page is downloaded, or will it return whatever chunks of the page have been loaded so far each time?

This is important to me, as I need to start processing the web page as soon as possible, without waiting for the download to finish.

I have heard that the pycurl library supports streaming. Do I definitely need to switch to pycurl, or can I achieve the same functionality with urllib2?

Many thanks...

Andrey Rubliov

1 Answer


urllib2's default handler actually seems to fetch the entire page during the urlopen() call, so read() doesn't block: the whole page is already available. You could probably write your own handler that streams the data (the opener returns a file-like object, exposed via read() on the response, and that could stream), but if another library already has this functionality baked in, I'd use it instead.
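For what it's worth, here is a minimal sketch of what streaming with pycurl could look like, assuming the Parser class from your question; the handle_chunk callback name is just illustrative:

import pycurl

parser = Parser()  # the Parser class from the question

def handle_chunk(chunk):
    # pycurl invokes this callback as each piece of the response body
    # arrives, so parsing starts before the download has finished
    parser.feed(chunk)

curl = pycurl.Curl()
curl.setopt(pycurl.URL, "http://www......")
curl.setopt(pycurl.WRITEFUNCTION, handle_chunk)
curl.perform()  # runs the transfer, calling handle_chunk along the way
curl.close()

The key difference from the urllib2 loop is that pycurl pushes data to your callback as it arrives on the wire, rather than you pulling it with read().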

kindall