I have a multi-threaded script that occasionally freezes when it connects to a server but the server never sends anything back. netstat shows a connected TCP socket. This happens even with TIMEOUT set; the timeout works fine in an unthreaded script. Here's some sample code:

import threading
import StringIO
import pycurl

def xmlscraper(url):
    htmlpage = StringIO.StringIO()
    rheader = StringIO.StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.USERAGENT, "user agent string")
    c.setopt(pycurl.CONNECTTIMEOUT, 60)
    c.setopt(pycurl.TIMEOUT, 120)
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.WRITEFUNCTION, htmlpage.write)
    c.setopt(pycurl.HEADERFUNCTION, rheader.write)
    c.setopt(pycurl.HTTPHEADER, ['Expect:'])
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()   # missing from the original snippet; setting options alone sends nothing
    c.close()

pycurl.global_init(pycurl.GLOBAL_ALL)
for url in urllist:  # urllist: a list of URL strings, defined elsewhere
    t = threading.Thread(target=xmlscraper, args=(url,))
    t.start()
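
As a diagnostic aside (an editorial suggestion, not part of the original script): keeping the Thread objects around and polling them with join()/is_alive() shows exactly which URL a hung thread was given. A minimal sketch, reusing xmlscraper and urllist from above:

threads = []
for url in urllist:
    t = threading.Thread(target=xmlscraper, args=(url,))
    t.start()
    threads.append((url, t))

# Wait somewhat longer than the 120 s TIMEOUT, then report stragglers.
for url, t in threads:
    t.join(150)
    if t.is_alive():
        print "still hung after 150 s:", url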

Any help would be greatly appreciated! I've been trying to solve this for a few weeks now.

edit: The urllist has about 10 URLs; it doesn't seem to matter how many there are.

edit2: I just tested the code below against a PHP script that sleeps for 100 seconds.

import threading
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 6)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://xxx.xxx.xxx.xxx/test.php')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

t = threading.Thread(target=testf)
t.start()
t.join()

pycurl in that code seems to time out properly. So I guess it has something to do with the number of URLs? Or the GIL?

edit3:

I think it might have to do with libcurl itself, because sometimes when I check the script, libcurl is still connected to a server for hours on end. If pycurl were timing out properly, the socket would have been closed.
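
If the hang really is libcurl sitting on a connected but silent socket, one thing worth trying (an editorial suggestion, not something from the original post) is libcurl's low-speed abort, which gives up when the transfer rate stays below a floor for too long. A sketch of the extra options, added alongside the timeouts already set in xmlscraper:

c.setopt(pycurl.LOW_SPEED_LIMIT, 1)   # abort if under 1 byte/s ...
c.setopt(pycurl.LOW_SPEED_TIME, 30)   # ... for 30 consecutive seconds
c.setopt(pycurl.FORBID_REUSE, 1)      # close the socket after each transfer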

Incognito

2 Answers


I modified your 'edit2' code to spawn multiple threads, and it works fine on my machine (Ubuntu 10.10 with Python 2.6.6):

import threading
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 3)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost/cgi-bin/foo.py')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

for i in range(100):
    t = threading.Thread(target=testf)
    t.start()

I can spawn 100 threads and they all time out at 3 seconds, as specified.

I wouldn't go blaming the GIL and thread contention yet :)
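
To see the timeouts fire explicitly instead of inferring them, you can catch pycurl.error and time each perform() call; error code 28 is CURLE_OPERATION_TIMEDOUT. A minimal variation on the code above, with the same local test URL:

import threading
import time
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 3)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost/cgi-bin/foo.py')
    c.setopt(pycurl.HTTPGET, 1)
    start = time.time()
    try:
        c.perform()
    except pycurl.error, e:
        # e.args is (errno, message); errno 28 means the timeout fired
        print "timed out after %.1f s: %s" % (time.time() - start, e)

threads = [threading.Thread(target=testf) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()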

Corey Goldberg

Python threads are hamstrung, in some situations, by the Global Interpreter Lock (the "GIL"). It may be that the threads you're starting aren't timing out because they're not actually being run often enough.

This related StackOverflow question might point you in the right direction:

Brian Clapper
  • From what I understand, the GIL only affects Python code. I understood pycurl to simply hand everything over to libcurl, which handles the timeout itself. – Incognito Dec 28 '10 at 22:00
  • The GIL does affect Python threading, though. Check the related question. – Brian Clapper Dec 28 '10 at 22:14
  • Some URLs need cookies, so I can't use cookielib; otherwise I would have stuck with urllib2. – Incognito Dec 28 '10 at 22:18
  • Could you elaborate on "threads you're starting aren't timing out because they're not actually being run often enough"? – Incognito Dec 28 '10 at 22:29
  • The GIL does not affect native library calls, which includes all low-level network blocking, because native library calls release the GIL while they wait. (Some buggy libraries don't do this properly; e.g., I recall PIL having trouble with it, but I expect curl to handle it.) – Glenn Maynard Dec 28 '10 at 23:00
  • True. But the threads, themselves, are being spawned via Python. The individual threads *then* invoke native libraries. However, having said that (and without bothering to test my assertion in the least), I defer to @Corey Goldberg, who *has* tested the code under Python 2.6 and found no problem with timeouts. So, whatever I assert might be possible is irrelevant in the face of actual data. ;-) – Brian Clapper Dec 29 '10 at 02:32
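
To illustrate the point Glenn makes above (an editorial sketch, not from the original thread): if perform() held the GIL for its whole duration, ten threads hitting a slow server would take roughly ten times the timeout in total; because libcurl waits with the GIL released, they finish in about one timeout. The slow URL below is a stand-in you would have to supply:

import threading
import time
import pycurl

SLOW_URL = 'http://localhost/cgi-bin/foo.py'  # stand-in for any stalling endpoint

def fetch():
    c = pycurl.Curl()
    c.setopt(pycurl.TIMEOUT, 3)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, SLOW_URL)
    try:
        c.perform()
    except pycurl.error:
        pass  # expected: the 3 s timeout fires

start = time.time()
threads = [threading.Thread(target=fetch) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the GIL released during perform(), this prints ~3 s, not ~30 s.
print "10 threads finished in %.1f s" % (time.time() - start)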