
I've written a simple script in Python.

It parses the hyperlinks from a webpage, and afterwards retrieves each of those links to parse some information.

I have similar scripts running that re-use the writefunction without any problems, but for some reason this one fails, and I can't figure out why.

General Curl init:

storage = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.COOKIEFILE, "")
c.setopt(pycurl.POST, 0)
c.setopt(pycurl.FOLLOWLOCATION, 1)
#Similar scripts work this way; why doesn't this one?
c.setopt(c.WRITEFUNCTION, storage.write)
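For reference, WRITEFUNCTION simply hands each received chunk to the callable you pass; pycurl never resets the target between transfers. A minimal stand-in for `c.perform()` (hypothetical, using Python 3's `io.StringIO`, which behaves like Python 2's `StringIO.StringIO` here) shows why `storage` only ever grows:

```python
import io

# Hypothetical stand-in for c.perform(): pycurl calls the WRITEFUNCTION
# callback once per received chunk; the callback just consumes each chunk.
def fake_perform(write_callback, chunks):
    for chunk in chunks:
        write_callback(chunk)

storage = io.StringIO()  # Python 3 counterpart of StringIO.StringIO()
fake_perform(storage.write, ["<html>", "page one", "</html>"])
print(storage.getvalue())  # → <html>page one</html>
```

Every chunk from every transfer ends up appended to the same buffer unless you empty or replace it between calls.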

First call to retrieve links:

URL = "http://whatever"
REFERER = URL

c.setopt(pycurl.URL, URL)
c.setopt(pycurl.REFERER, REFERER)
c.perform()

#Write page to file
content = storage.getvalue()
f = open("updates.html", "w")
f.writelines(content)
f.close()
... Here the magic happens and links are extracted ...
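The extraction step is elided above; one way it could look, sketched with the standard-library `HTMLParser` (the class and variable names here are hypothetical, not the script's actual code):

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

# Sample input standing in for content = storage.getvalue()
content = '<p><a href="http://example.com/1">one</a> <a href="http://example.com/2">two</a></p>'
parser = LinkExtractor()
parser.feed(content)
print(parser.urls)  # → ['http://example.com/1', 'http://example.com/2']
```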

Now looping these links:

for i, member in enumerate(urls):
    URL = urls[i]
    print "url:", URL
    c.setopt(pycurl.URL, URL)
    c.perform()

    #Write page to file
    #Still the data from previous!
    content = storage.getvalue()
    f = open("update.html", "w")
    f.writelines(content)
    f.close()
    #print content
    ... Gather some information ...
    ... Close objects etc ...
Daniel Stenberg
  • You could try `c.setopt(c.WRITEFUNCTION, f.write)` in the loop to avoid appending data to the same object. It might be enough if `Curl()` is reusable. – jfs May 05 '13 at 22:55
  • No that doesn't work, I've tried that before; I think it's just passing a reference. Is it possible the string length from the first page is too big? (The webpage is quite large compared to other things I retrieve with Curl and Python.) – honda4life May 06 '13 at 17:18

1 Answer


If you want to download urls to different files in sequence (no concurrent connections):

for i, url in enumerate(urls):
    c.setopt(pycurl.URL, url)
    with open("output%d.html" % i, "w") as f:
        c.setopt(c.WRITEDATA, f) # c.setopt(c.WRITEFUNCTION, f.write) also works
        c.perform()

Note:

  • storage.getvalue() returns everything that has been written to storage since it was created; in your case it contains the output of multiple urls concatenated together.
  • open(filename, "w") overwrites the file (the previous content is gone), i.e., update.html contains whatever is in content on the last iteration of the loop.
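If you would rather keep a single buffer instead of switching WRITEDATA per file, the first note suggests the fix: empty the buffer before each c.perform(). A quick demonstration with Python 3's `io.StringIO` (Python 2's `StringIO.StringIO` accumulates the same way):

```python
import io

storage = io.StringIO()  # same accumulating behaviour as StringIO.StringIO()
storage.write("first page")
storage.write("second page")
# getvalue() returns *everything* written so far, not just the last write:
print(storage.getvalue())  # → first pagesecond page

# To reuse one buffer across transfers, empty it before each c.perform():
storage.seek(0)
storage.truncate(0)
storage.write("third page")
print(storage.getvalue())  # → third page
```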
jfs
  • "storage.getvalue() returns everything that was written to storage from the moment it is created." That's what I wanted to hear, probably I didn't notice it in my other scripts, when opening with a browser it might be ignored, when opening with a text editor it might be visible or something like that. – honda4life May 06 '13 at 19:55