
I'd like to download some HTML source with urllib2 or mechanize (using .read()). Unfortunately, the page I want is quite large: with both libraries I only get a string of up to 65747 characters, and the rest of the response is cut off. This really bugs me, and I don't know how to deal with it. Can someone give me a hint?

EDIT: Here's a snippet of the code I use.

import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

dataHTML = ""
fp = opener.open(url)

# Keep reading until read() returns an empty string (EOF)
while 1:
    r = fp.read()
    if r == '':
        break
    dataHTML += r
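For reference, the same accumulate-until-empty pattern works against any file-like response object. The sketch below is written in Python 3; io.BytesIO stands in for the HTTP response so the loop can be run without a live connection, and the chunk size is an illustrative value, not anything mandated by urllib:

```python
import io

def read_all(fp, chunk_size=65536):
    """Collect a file-like object's full contents across multiple read() calls.

    A single read() is not guaranteed to return the whole body, so we loop
    until an empty result signals EOF.
    """
    parts = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        parts.append(chunk)
    return b"".join(parts)

# io.BytesIO simulates a response larger than one read() might return.
fake_response = io.BytesIO(b"x" * 200000)
data = read_all(fake_response)
print(len(data))  # 200000
```

Appending chunks to a list and joining once at the end also avoids the repeated string copies that `dataHTML += r` incurs on large bodies.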
SpaceMonkey
  • Here is full solution: http://stackoverflow.com/questions/1824069/urllib2-not-retrieving-entire-http-response – Jarosław Jaryszew Mar 21 '13 at 15:26
  • The only remaining options, urlretrieve() or readlines() (I haven't tested them), are not really satisfying. Note that none of the other solutions work for me. – SpaceMonkey Mar 21 '13 at 16:20
  • This solution works. I ran in my Python interpreter line for line. http://stackoverflow.com/a/4268012/399704 – Aaron D Mar 21 '13 at 17:13

1 Answer


You can call read() several times:

b = ''
while 1:
    r = f.read()
    if r == '':
        break
    b += r

Does that work better?
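The same loop can also be expressed with Python's two-argument iter(), which calls read() repeatedly until it returns the sentinel value. This is a minimal sketch, assuming Python 3 and using io.BytesIO in place of a real connection; the 8192-byte chunk size is an arbitrary illustrative choice:

```python
import io

def read_fully(f):
    # iter() with a sentinel yields chunks until read() returns b""
    chunks = list(iter(lambda: f.read(8192), b""))
    return b"".join(chunks)

src = io.BytesIO(bytes(range(256)) * 1000)  # 256,000 bytes of sample data
result = read_fully(src)
print(len(result))  # 256000
```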

emil
  • It's still not working. I added my code above. Is my use of build_opener() okay? Also, I have to admit that I use urllib2, but this shouldn't affect your solution. – SpaceMonkey Mar 21 '13 at 23:31