
I've written a simple script in Python.

It parses the hyperlinks from a webpage, and afterwards retrieves each of those links to parse some information.

I have similar scripts running that re-use the writefunction without any problems, but for some reason this one fails, and I can't figure out why.

General Curl init:

storage = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.COOKIEFILE, "")
c.setopt(pycurl.POST, 0)
c.setopt(pycurl.FOLLOWLOCATION, 1)
#Similar scripts work this way; why doesn't this one?
c.setopt(c.WRITEFUNCTION, storage.write)
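For reference, WRITEFUNCTION simply hands each received chunk to the callable you pass; pycurl never resets the target between transfers. A minimal stand-in for `c.perform()` (hypothetical, using Python 3's `io.StringIO`, which behaves like Python 2's `StringIO.StringIO` here) shows why `storage` only ever grows:

```python
import io

# Hypothetical stand-in for c.perform(): pycurl calls the WRITEFUNCTION
# callback once per received chunk; the callback just consumes each chunk.
def fake_perform(write_callback, chunks):
    for chunk in chunks:
        write_callback(chunk)

storage = io.StringIO()  # Python 3 counterpart of StringIO.StringIO()
fake_perform(storage.write, ["<html>", "page one", "</html>"])
print(storage.getvalue())  # → <html>page one</html>
```

Every chunk from every transfer ends up appended to the same buffer unless you empty or replace it between calls.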

First call to retrieve links:

URL = "http://whatever"
REFERER = URL

c.setopt(pycurl.URL, URL)
c.setopt(pycurl.REFERER, REFERER)
c.perform()

#Write page to file
content = storage.getvalue()
f = open("updates.html", "w")
f.writelines(content)
f.close()
... Here the magic happens and links are extracted ...
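The extraction step is elided above; one way it could look, sketched with the standard-library `HTMLParser` (the class and variable names here are hypothetical, not the script's actual code):

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

# Sample input standing in for content = storage.getvalue()
content = '<p><a href="http://example.com/1">one</a> <a href="http://example.com/2">two</a></p>'
parser = LinkExtractor()
parser.feed(content)
print(parser.urls)  # → ['http://example.com/1', 'http://example.com/2']
```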

Now looping these links:

for i, member in enumerate(urls):
    URL = urls[i]
    print "url:", URL
    c.setopt(pycurl.URL, URL)
    c.perform()

    #Write page to file
    #Still the data from previous!
    content = storage.getvalue()
    f = open("update.html", "w")
    f.writelines(content)
    f.close()
    #print content
    ... Gather some information ...
    ... Close objects etc ...
Daniel Stenberg
  • You could try `c.setopt(c.WRITEFUNCTION, f.write)` in the loop to avoid appending data to the same object. It might be enough if `Curl()` is reusable. – jfs May 05 '13 at 22:55
  • No that doesn't work, I've tried that before; I think it's just passing a reference. Is it possible the string length from the first page is too big? (The webpage is quite large compared to other things I retrieve with Curl and Python.) – honda4life May 06 '13 at 17:18

1 Answer


If you want to download urls to different files in sequence (no concurrent connections):

for i, url in enumerate(urls):
    c.setopt(pycurl.URL, url)
    with open("output%d.html" % i, "w") as f:
        c.setopt(c.WRITEDATA, f) # c.setopt(c.WRITEFUNCTION, f.write) also works
        c.perform()

Note:

  • storage.getvalue() returns everything that has been written to storage since it was created; in your case it contains the output of multiple urls concatenated together.
  • open(filename, "w") overwrites the file (the previous content is gone), i.e., update.html contains whatever is in content on the last iteration of the loop.
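If you would rather keep a single buffer instead of switching WRITEDATA per file, the first note suggests the fix: empty the buffer before each c.perform(). A quick demonstration with Python 3's `io.StringIO` (Python 2's `StringIO.StringIO` accumulates the same way):

```python
import io

storage = io.StringIO()  # same accumulating behaviour as StringIO.StringIO()
storage.write("first page")
storage.write("second page")
# getvalue() returns *everything* written so far, not just the last write:
print(storage.getvalue())  # → first pagesecond page

# To reuse one buffer across transfers, empty it before each c.perform():
storage.seek(0)
storage.truncate(0)
storage.write("third page")
print(storage.getvalue())  # → third page
```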
jfs
  • "storage.getvalue() returns everything that was written to storage from the moment it is created." That's what I wanted to hear, probably I didn't notice it in my other scripts, when opening with a browser it might be ignored, when opening with a text editor it might be visible or something like that. – honda4life May 06 '13 at 19:55