Getting HTML with Pycurl

Question

I've been trying to retrieve a page of HTML using pycurl so that I can parse it for relevant information using str.split and some for loops. I know pycurl retrieves the HTML, since it prints it to the terminal. However, if I try to do something like

html = str(c.perform())

The variable will just hold a string which says "None".

How can I use pycurl to get the HTML, or redirect whatever it sends to the console, so that it can be used as a string as described above?

Thanks a lot to anyone who has any suggestions!

score 21 · Accepted Answer · answered Jul 02 '11 at 00:57

This will send a request, store the response body, and print it:

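# Python 2 code: the StringIO module and the print statement are Python 2 only
from StringIO import StringIO
import pycurl

url = 'http://www.google.com/'

storage = StringIO()                        # in-memory buffer for the response
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)    # pycurl hands each chunk of the body to storage.write
c.perform()                                 # returns None; the data goes to the callback
c.close()
content = storage.getvalue()                # the whole body as one string
print content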

If you want to store the response headers as well (they are written to the same buffer as the body), use:

c.setopt(c.HEADERFUNCTION, storage.write)
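If you would rather keep the headers apart from the body, one option is to give HEADERFUNCTION its own callback. The following is only a sketch for Python 3: `parse_header_line` and `fetch_with_headers` are made-up helper names, not part of pycurl, and it assumes pycurl calls the header callback once per raw header line with a bytes argument.

```python
from io import BytesIO


def parse_header_line(raw_line, headers):
    """Store one raw header line (bytes) in the `headers` dict."""
    # HTTP header lines are conventionally decoded as ISO-8859-1.
    line = raw_line.decode("iso-8859-1")
    if ":" not in line:
        # Status line (e.g. "HTTP/1.1 200 OK") or the blank line that ends the headers.
        return
    name, value = line.split(":", 1)
    # Header names are case-insensitive, so normalise them to lower case.
    headers[name.strip().lower()] = value.strip()


def fetch_with_headers(url):
    """Return (headers, body_bytes) for `url`. Hypothetical helper."""
    import pycurl  # imported here so parse_header_line() can be used without pycurl

    body = BytesIO()
    headers = {}
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        c.setopt(c.WRITEFUNCTION, body.write)
        c.setopt(c.HEADERFUNCTION, lambda line: parse_header_line(line, headers))
        c.perform()
    finally:
        c.close()
    return headers, body.getvalue()
```

With this, `headers, body = fetch_with_headers(url)` gives you a dict of headers and the raw body bytes separately. If a header repeats, this sketch keeps only the last value.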

Corey Goldberg


Great! That does exactly what I've been looking for. Though, one line is incorrect. It should say storage = StringIO.StringIO(). Otherwise, an error is raised. Regardless, thanks for your help!! – Sinthet Jul 02 '11 at 01:26
I think it is correct as-is. Notice I do 'from StringIO import StringIO' – Corey Goldberg Jul 02 '11 at 02:06
Ah, that might be it. I checked my source and just imported the entire library. Sorry for the confusion! – Sinthet Jul 02 '11 at 02:59
Any chance you might update this for Python3? Looks like Python3 deprecated StringIO in favor of io.StringIO, which doesn't quite work as above. – Dustin Kirkland Feb 03 '13 at 04:40
For Python 3 use `io.BytesIO` instead, but then `.getvalue()` will return `bytes`, so you should turn them into string with `.decode("utf-8")` – Adam Apr 16 '14 at 11:30
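Putting that last comment together, a Python 3 version might look roughly like the sketch below. The function names `decode_body` and `fetch_html` are invented for illustration, and UTF-8 is only an assumed default; the page's real encoding may differ.

```python
from io import BytesIO


def decode_body(buffer, encoding="utf-8"):
    """Turn the bytes collected in `buffer` into a str."""
    # errors="replace" avoids an exception if the assumed encoding is wrong.
    return buffer.getvalue().decode(encoding, errors="replace")


def fetch_html(url, encoding="utf-8"):
    """Fetch `url` with pycurl and return the body as a str. Hypothetical helper."""
    import pycurl  # imported here so decode_body() can be used without pycurl

    buffer = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        # Under Python 3 pycurl passes bytes to the callback, hence BytesIO.
        c.setopt(c.WRITEFUNCTION, buffer.write)
        c.perform()
    finally:
        c.close()
    return decode_body(buffer, encoding)
```

Calling `fetch_html('http://www.google.com/')` should then hand back the page as an ordinary string that str.split can work on.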

score 6 · Answer 2 · answered Jul 02 '11 at 01:02

The perform() method executes the HTML fetch and passes the result to a write function you specify; it does not return the data itself. You need to provide a buffer to put the HTML into, and set that buffer's write method as the write function. Usually, this can be accomplished using a StringIO object as follows:

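import pycurl
import StringIO   # Python 2 module

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.google.com/")

b = StringIO.StringIO()                   # buffer that collects the response body
c.setopt(pycurl.WRITEFUNCTION, b.write)   # called with each chunk received
c.setopt(pycurl.FOLLOWLOCATION, 1)        # follow HTTP redirects...
c.setopt(pycurl.MAXREDIRS, 5)             # ...but at most five of them
c.perform()
c.close()                                 # release the handle
html = b.getvalue()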

You could also use a file or tempfile or anything else that can store data.
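For the tempfile route, something along these lines should work under Python 3. It is a sketch rather than a tested recipe: `fetch_to_tempfile` is a made-up name, and the `.html` suffix and the redirect limit are just carried over as assumptions.

```python
import tempfile


def fetch_to_tempfile(url):
    """Download `url` into a temporary file and return the file's path."""
    import pycurl  # imported lazily so the module loads even without pycurl

    # Binary mode, because pycurl delivers bytes under Python 3.
    # delete=False keeps the file on disk after the with-block closes it.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".html", delete=False) as f:
        c = pycurl.Curl()
        try:
            c.setopt(c.URL, url)
            c.setopt(c.WRITEFUNCTION, f.write)  # each chunk goes straight to the file
            c.setopt(c.FOLLOWLOCATION, 1)
            c.setopt(c.MAXREDIRS, 5)
            c.perform()
        finally:
            c.close()
        return f.name
```

The caller gets a path back and is responsible for reading and eventually deleting the file, which can be handy when the page is too large to hold comfortably in memory.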
