
I'd like to download some HTML source with urllib2 or mechanize (using .read()). Unfortunately, the page I want is quite large: with both libraries I only get a string of up to 65747 characters, and the rest of the response is cut off. This really bugs me, and I don't know how to deal with it. Can someone give me a hint?

EDIT: Here's a snippet of the code I use.

import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

dataHTML = ""
fp = opener.open(url)

# Keep reading until read() returns an empty string (EOF)
while 1:
    r = fp.read()
    if r == '':
        break
    dataHTML += r
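For reference, the same accumulate-until-empty pattern works against any file-like response object. The sketch below is written in Python 3; io.BytesIO stands in for the HTTP response so the loop can be run without a live connection, and the chunk size is an illustrative value, not anything mandated by urllib:

```python
import io

def read_all(fp, chunk_size=65536):
    """Collect a file-like object's full contents across multiple read() calls.

    A single read() is not guaranteed to return the whole body, so we loop
    until an empty result signals EOF.
    """
    parts = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        parts.append(chunk)
    return b"".join(parts)

# io.BytesIO simulates a response larger than one read() might return.
fake_response = io.BytesIO(b"x" * 200000)
data = read_all(fake_response)
print(len(data))  # 200000
```

Appending chunks to a list and joining once at the end also avoids the repeated string copies that `dataHTML += r` incurs on large bodies.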
SpaceMonkey
  • Here is full solution: http://stackoverflow.com/questions/1824069/urllib2-not-retrieving-entire-http-response – Jarosław Jaryszew Mar 21 '13 at 15:26
  • The only remaining options, urlretrieve() or readlines() (I haven't tested them), are not really satisfying. Note that none of the other solutions work for me. – SpaceMonkey Mar 21 '13 at 16:20
  • This solution works. I ran in my Python interpreter line for line. http://stackoverflow.com/a/4268012/399704 – Aaron D Mar 21 '13 at 17:13

1 Answer


You can call read() several times:

b = ''
while 1:
    r = f.read()
    if r == '':
        break
    b += r

Does that work better?
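The same loop can also be expressed with Python's two-argument iter(), which calls read() repeatedly until it returns the sentinel value. This is a minimal sketch, assuming Python 3 and using io.BytesIO in place of a real connection; the 8192-byte chunk size is an arbitrary illustrative choice:

```python
import io

def read_fully(f):
    # iter() with a sentinel yields chunks until read() returns b""
    chunks = list(iter(lambda: f.read(8192), b""))
    return b"".join(chunks)

src = io.BytesIO(bytes(range(256)) * 1000)  # 256,000 bytes of sample data
result = read_fully(src)
print(len(result))  # 256000
```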

emil
  • It's still not working. I added my code above. Is my use of build_opener() okay? Also, I have to admit that I use urllib2, but this shouldn't affect your solution. – SpaceMonkey Mar 21 '13 at 23:31