get many pages with pycurl?

Question

I want to get many pages from a website, like

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

but get the pages' data in Python rather than in files on disk. Can someone please post pycurl code to do this,
or fast urllib2 code (not one page at a time) if that's possible,
or else say "forget it, curl is faster and more robust"? Thanks.

user425996 · Answer 1 · 2010-08-20T05:35:14.243

You have two problems, so let me address both in one example. Note that pycurl already handles the "not one-at-a-time" part for you, with no extra work on your side.

#! /usr/bin/env python

import pycurl
import StringIO

c1 = pycurl.Curl()
c2 = pycurl.Curl()
c3 = pycurl.Curl()
c1.setopt(c1.URL, "http://www.python.org")
c2.setopt(c2.URL, "http://curl.haxx.se")
c3.setopt(c3.URL, "http://slashdot.org")
s1 = StringIO.StringIO()
s2 = StringIO.StringIO()
s3 = StringIO.StringIO()
c1.setopt(c1.WRITEFUNCTION, s1.write)
c2.setopt(c2.WRITEFUNCTION, s2.write)
c3.setopt(c3.WRITEFUNCTION, s3.write)

m = pycurl.CurlMulti()
m.add_handle(c1)
m.add_handle(c2)
m.add_handle(c3)

# Number of seconds to wait for a timeout to happen
SELECT_TIMEOUT = 1.0

# Stir the state machine into action
while 1:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break

# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file descriptors
    # to check.
    m.select(SELECT_TIMEOUT)
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

# Cleanup
m.remove_handle(c3)
m.remove_handle(c2)
m.remove_handle(c1)
m.close()
c1.close()
c2.close()
c3.close()
print "http://www.python.org is ",s1.getvalue()
print "http://curl.haxx.se is ",s2.getvalue()
print "http://slashdot.org is ",s3.getvalue()

Finally, this code is mainly based on an example on the pycurl site.

Maybe you should really read the docs; people spent a huge amount of time on them.

If anyone tries a multithreaded approach at the Python level, the pages are not really fetched in parallel as you might think, because of Python's GIL. Thanks to the curl library being written in C, its multi_perform is truly multithreaded. This is the fastest approach I can think of. – user425996, Aug 21 '10 at 06:49
Here's a good example of CurlMulti: http://fragmentsofcode.wordpress.com/2011/01/22/pycurl-curlmulti-example/ – Javier, Jul 21 '14 at 17:26
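
To apply the CurlMulti approach to the thousands of pages in the question, a small pool of reusable handles is usually enough. The following is a minimal Python 3 sketch (the answer above is Python 2), modeled on the handle-pool pattern in pycurl's bundled examples. The names `page_urls`, `fetch_all` and `max_conn` are illustrative, not part of pycurl. Note that libcurl's multi interface drives all transfers from a single thread using non-blocking sockets.

```python
from io import BytesIO


def page_urls(base, first, last, width=4):
    """Expand a curl-style [first-last] range: inclusive, zero-padded."""
    return [f"{base}{n:0{width}d}" for n in range(first, last + 1)]


def fetch_all(urls, max_conn=10):
    """Fetch urls concurrently; return ({url: bytes}, {url: (errno, errmsg)})."""
    import pycurl  # imported lazily so page_urls works without pycurl

    pending = list(urls)
    remaining = len(pending)
    pages, errors = {}, {}
    multi = pycurl.CurlMulti()
    # A fixed pool of easy handles, reused for one URL after another.
    idle = [pycurl.Curl() for _ in range(min(max_conn, remaining))]

    while remaining:
        # Give every idle handle a URL and hand it to the multi object.
        while pending and idle:
            c = idle.pop()
            c.url = pending.pop(0)
            c.buf = BytesIO()
            c.setopt(pycurl.URL, c.url)
            c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
            multi.add_handle(c)
        # Let libcurl move data on all active transfers.
        while True:
            ret, _ = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        # Collect finished transfers and recycle their handles.
        while True:
            queued, ok, failed = multi.info_read()
            for c in ok:
                pages[c.url] = c.buf.getvalue()
            for c, errno, errmsg in failed:
                errors[c.url] = (errno, errmsg)
            for c in ok + [f[0] for f in failed]:
                multi.remove_handle(c)
                idle.append(c)
            remaining -= len(ok) + len(failed)
            if queued == 0:
                break
        if remaining:
            multi.select(1.0)  # wait for socket activity, at most 1 second

    for c in idle:
        c.close()
    multi.close()
    return pages, errors
```

For the question's site this would be called as `pages, errors = fetch_all(page_urls("http://farmsubsidy.org/DE/browse?page=", 0, 3603))`, leaving every page body in memory as bytes.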

score 3 · Accepted Answer · answered Dec 24 '09 at 22:40

Here is a solution based on urllib2 and threads.

import urllib2
from threading import Thread

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
NUM_RANGE = range(0, 3604)  # pages 0..3603 inclusive, matching curl's [0000-3603]
THREADS = 2

def main():
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()

def split_seq(seq, num_pieces):
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

class Spider(Thread):
    def __init__(self, base_url, nums):
        Thread.__init__(self)
        self.base_url = base_url
        self.nums = nums
    def run(self):
        for num in self.nums:
            url = '%s%s' % (self.base_url, num)
            data = urllib2.urlopen(url).read()
            print data

if __name__ == '__main__':
    main()

Thanks Corey; how do I (thread newbie) wait until they're all done? – denis, Dec 25 '09 at 14:49
Denis, you can call join() on each thread in main(). This will block until the threads are done. – Corey Goldberg, Dec 25 '09 at 16:02
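
The join() suggestion can be sketched as follows, in Python 3 (the answer above is Python 2) and with the pages collected in a dict instead of printed, since the question asks for the data in memory. `fetch_pages`, `worker` and the injectable `fetch` parameter are illustrative names, not part of the original answer.

```python
import threading
import urllib.request


def http_get(url):
    """Default fetcher: return the response body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def fetch_pages(urls, num_threads=2, fetch=http_get):
    """Fetch urls with num_threads threads; return {url: body or exception}."""
    urls = list(urls)
    results = {}
    lock = threading.Lock()

    def worker(chunk):
        for url in chunk:
            try:
                body = fetch(url)
            except Exception as exc:  # keep going if one page fails
                body = exc
            with lock:
                results[url] = body

    # Thread i takes every num_threads-th URL, starting at offset i.
    threads = [threading.Thread(target=worker, args=(urls[i::num_threads],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until that thread has finished
    return results
```

Because `fetch_pages` only returns after every `join()` has returned, the caller can use the result dict without further synchronization.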

David Lemphers · Answer 3 · 2012-05-05T03:52:26.943

1

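Using BeautifulSoup4 and requests:

Grab the head page:

import requests
from bs4 import BeautifulSoup as Soup

page = Soup(requests.get(url='http://rootpage.htm').text)

Create a list of requests and send them concurrently:

from requests import async

# Named reqs rather than requests so the requests module is not shadowed.
reqs = [async.get(url.get('href')) for url in page('a')]
responses = async.map(reqs)

for response in responses:
    dosomething(response.text)  # dosomething is a placeholder for your own handler

Requests requires gevent to do this, by the way. Note that requests.async only exists in old releases of requests (before 1.0); the same functionality now lives in the separate grequests package.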
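
For readers without an old requests release at hand, the same two steps (collect the links on a head page, then fetch them concurrently) can be sketched with the Python 3 standard library alone. This swaps gevent for a thread pool, so it is a different technique from the answer above. `LinkCollector`, `extract_links`, `fetch_concurrently` and the `fetch` parameter are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def http_get(url):
    """Default fetcher: return the response body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def fetch_concurrently(urls, fetch=http_get, max_workers=8):
    """Fetch all urls with a thread pool; results keep the order of urls.

    An exception raised by fetch is re-raised here when its result is read.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

A typical call would be `bodies = fetch_concurrently(extract_links(head_html, head_url))`, where `head_html` and `head_url` are the text and address of the head page.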

edited May 05 '12 at 03:52

answered May 05 '12 at 03:47


score 1 · Answer 4 · answered Oct 07 '12 at 19:57

I recommend using the async module of human_curl.

Here is an example:

from urlparse import urljoin 
from datetime import datetime

from human_curl.async import AsyncClient 
from human_curl.utils import stdout_debug

def success_callback(response, **kwargs):
    """Called when a response succeeds."""
    print("success callback")
    print(response, response.request)
    print(response.headers)
    print(response.content)
    print(kwargs)

def fail_callback(request, opener, **kwargs):
    """Collect errors
    """
    print("fail callback")
    print(request, opener)
    print(kwargs)

with AsyncClient(success_callback=success_callback,
                 fail_callback=fail_callback) as async_client:
    for x in xrange(10000):
        async_client.get('http://google.com/', params=(("x", str(x)),))
        async_client.get('http://google.com/', params=(("x", str(x)),),
                        success_callback=success_callback, fail_callback=fail_callback)

Usage is very simple: when a page loads successfully or fails, async_client calls your callback. You can also specify the number of parallel connections.

score 1 · Answer 5 · answered Dec 24 '09 at 19:06


You can just put that into a bash script inside a for loop.

However, you may have better success parsing each page with Python: http://www.securitytube.net/Crawling-the-Web-for-Fun-and-Profit-video.aspx You will be able to get at the exact data and save it into a database at the same time: http://www.securitytube.net/Storing-Mined-Data-from-the-Web-for-Fun-and-Profit-video.aspx

Dmitry


curl holds a persistent connection open for the entire transfer; a shell loop that opens 3600 fresh TCP connections WILL be a lot slower... – Daniel Stenberg Dec 25 '09 at 22:31
It would still run serially. See my answer for a version that can download many streams in parallel. – Corey Goldberg Dec 26 '09 at 20:57
yes, and quite possibly then using pycurl in several threads would be even faster! ;-) – Daniel Stenberg Dec 26 '09 at 22:26
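
The point about persistent connections can also be sketched in Python 3 with the standard library: one `http.client.HTTPConnection` serves many requests in sequence, provided the server keeps the connection alive (if it does not, `http.client` reconnects on the next request). `fetch_over_one_connection` and the `connect` parameter are illustrative names.

```python
import http.client


def fetch_over_one_connection(host, paths, connect=http.client.HTTPConnection):
    """Request each path over a single connection; return the bodies in order."""
    conn = connect(host)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read fully before the next request is sent.
            bodies.append(resp.read())
    finally:
        conn.close()
    return bodies
```

For the question's site the call would look like `fetch_over_one_connection("farmsubsidy.org", ["/DE/browse?page=0000", "/DE/browse?page=0001"])`. This is still serial, so it addresses only the connection-setup cost, not parallelism.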

score 1 · Answer 6 · answered Feb 06 '11 at 11:40


If you want to crawl a website using Python, you should take a look at Scrapy: http://scrapy.org

dzen

