I have a problem with urlopen (and requests.get).

In my program, a urlopen call made inside a thread (I tested with multiprocessing too) [update: a thread that has been created by an imported module] won't run until the program ends.

By "won't run" I mean it doesn't even start: the timeout (here 3 seconds) never fires, and no connection is made to the website.

Here is my simplified code:

import threading,urllib2,time

def dlfile(url):
  print 'Before request'
  r = urllib2.urlopen(url, timeout=3)
  print 'After request'
  return r

def dlfiles(*urls):
  threads = [threading.Thread(None, dlfile, None, (url,), {}) for url in urls]
  map(lambda t:t.start(), threads)

def main():
    dlfiles('http://google.com')

main()
time.sleep(10)
print 'End of program'

My output:

Before request
End of program
After request

Unfortunately, the simplified code I've posted here on SO works as expected (i.e. "Before request/After request/End of program"), and I haven't been able to reproduce the problem with simplified code yet.

I'm still trying, but in the meantime I'd like to know whether anyone has ever encountered this weird behaviour and what could cause it. Note that everything works fine if I don't use a thread.

Thanks for any help you can provide. I'm kind of lost, and even the interwebs have no idea about this.

UPDATE

Here is how to reproduce the behaviour

threadtest.py

import threading,urllib2,time
def log(a):print(a)
def dlfile(url):
  log('Before request')
  r = urllib2.urlopen(url, timeout=3)
  log('After request')
  return r

def dlfiles(*urls):
  threads = [threading.Thread(None, dlfile, None, (url,), {}) for url in urls]
  map(lambda t:t.start(), threads)

def main():
    dlfiles('http://google.com')

main()
for i in range(5):
    time.sleep(1)
    log('Sleep')
log('End of program')

threadtest-import.py

import threadtest

Then the outputs will be this:

$ python threadtest.py
Before request
After request
Sleep
Sleep
Sleep
Sleep
Sleep
End of program

$ python threadtest-import.py
Before request
Sleep
Sleep
Sleep
Sleep
Sleep
End of program
After request

Now that I've found how to reproduce it: is this behaviour normal? Expected?

And how can I get rid of it? I.e. how can I create, from an imported module, a thread in which urlopen runs as expected?

René
  • Since today the `threadtest-import.py` script gives the normal result "Before request / After request / Sleep*5". I don't understand what is happening here... – René Apr 06 '16 at 19:17
  • I am facing a similar issue. Were you able to figure out the reason for it? Any workaround? – user3351750 Jul 14 '16 at 17:58

2 Answers

Your code is fine. Starting a single thread is expected.

def main():
    dlfiles('http://google.fr')

Here you are passing a single URL.

threads = [threading.Thread(None, dlfile, None, (url,), {}) for url in urls]

The list comprehension produces only one thread, since there is a single element in urls.

Try with:

def main():
    dlfiles('http://google.fr', 'http://google.com', 'http://google.gg')
xiº
  • Maybe I wasn't clear enough, but the code I wrote in the OP is just simplified code that doesn't even reproduce the error. My problem is that in my real program all the urlopen calls (one or more) are made only after the program exits. – René Apr 04 '16 at 15:33

I forgot to post the solution; thanks to @user3351750 for the comment.

The problem is the structure of the files. threadtest-import.py imports threadtest, and while the module is being imported, something* (I don't remember the exact mechanism) blocks. IIRC this has to do with the re module in urllib. Sorry for not being clearer.
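
One plausible candidate for that blocking mechanism, offered as an assumption rather than a confirmed diagnosis, is the interpreter's import lock: while a module's top-level code is still running, a thread that triggers an import itself may have to wait until the import in progress has finished. The sketch below illustrates the effect without any network access. The module name import_lock_demo and the function run_demo are hypothetical names made up for this illustration. To get the same behaviour on Python 2 (one global import lock) and Python 3 (one lock per module), the thread imports the very module that is still being imported.

```python
import importlib
import os
import shutil
import sys
import tempfile
import textwrap
import threading

# Hypothetical module: its top level starts a thread and then keeps working,
# much like threadtest.py does when it is imported.
MODULE_SOURCE = textwrap.dedent('''
    import threading
    import time

    events = []

    def worker():
        events.append("worker: before import")
        import import_lock_demo  # waits: this module is still being imported
        events.append("worker: after import")

    thread = threading.Thread(target=worker)
    thread.start()
    time.sleep(0.5)  # top-level work done while the import is in progress
    events.append("module: top-level code finished")
''')


def run_demo():
    """Import the demo module from a temporary directory and return its event log."""
    tmpdir = tempfile.mkdtemp()
    try:
        with open(os.path.join(tmpdir, "import_lock_demo.py"), "w") as handle:
            handle.write(MODULE_SOURCE)
        sys.path.insert(0, tmpdir)
        try:
            module = importlib.import_module("import_lock_demo")
        finally:
            sys.path.remove(tmpdir)
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
    module.thread.join()  # the import is over, so the worker can finish now
    return module.events


if __name__ == "__main__":
    for event in run_demo():
        print(event)
```

The worker logs its first event immediately, but its second event only appears after the module's top-level code has finished, which is the same ordering as "Before request / ... / End of program / After request" in the question.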

The fix is to put the code of the imported module inside a function, so that importing it only declares things and does no work. I guess this is considered good practice for a reason.

I.e. do this:

import threadtest #do nothing except declarations
threadtest.run() #do the work

Instead of this:

import threadtest #declarations + work

And put the code

main()
for i in range(5):
    time.sleep(1)
    log('Sleep')
log('End of program')

Inside the run function:

def run():
    main()
    for i in range(5):
        time.sleep(1)
        log('Sleep')
    log('End of program')

This way the thing* no longer blocks and everything works as expected.
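
For reference, here is a minimal sketch of how the whole module could look once restructured this way. It is an illustration under assumptions, not the exact code from my program: fake_fetch is a hypothetical stand-in for urllib2.urlopen(url, timeout=3) so the example runs without network access, and the fetch parameter, the results list, and the join calls are additions made for the sketch.

```python
import threading
import time


def log(message):
    print(message)


def fake_fetch(url):
    # Hypothetical stand-in for urllib2.urlopen(url, timeout=3).
    time.sleep(0.1)
    return 'response for ' + url


def dlfile(url, fetch, results):
    log('Before request')
    results.append(fetch(url))  # list.append is safe to call from threads
    log('After request')


def dlfiles(urls, fetch, results):
    threads = [threading.Thread(target=dlfile, args=(url, fetch, results))
               for url in urls]
    for thread in threads:
        thread.start()
    return threads


def run(urls=('http://google.com',), fetch=fake_fetch):
    """Do the actual work; nothing runs at import time."""
    results = []
    threads = dlfiles(urls, fetch, results)
    for thread in threads:
        thread.join()  # wait for every download before finishing
    log('End of program')
    return results


# Running the file directly still works; importing it only defines names.
if __name__ == '__main__':
    run()
```

With this layout, the importing script does `import threadtest` followed by `threadtest.run()`, so the threads are created only after the import has completed.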

René