I'm running into an issue when combining multiprocessing, requests (or urllib2) and nltk. Here is a minimal example:

>>> from multiprocessing import Process
>>> import requests
>>> from pprint import pprint
>>> Process(target=lambda: pprint(
        requests.get('https://api.github.com'))).start()
>>> <Response [200]>  # this is the response displayed by the call to `pprint`.

A bit more detail on what this piece of code does:

  1. Import a few required modules
  2. Start a child process
  3. Issue an HTTP GET request to 'api.github.com' from the child process
  4. Display the result

This works fine. The problem appears once nltk is imported:

>>> import nltk
>>> Process(target=lambda: pprint(
        requests.get('https://api.github.com'))).start()
>>> # nothing happens!

Once NLTK has been imported, the call to requests.get silently crashes the child process. (If you use a named function instead of the lambda and add a few print statements before and after the call, you'll see that execution stops right at the call to requests.get.) Does anybody have any idea what in NLTK could explain this behavior, and how to overcome the issue?
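A minimal sketch of that check, assuming Python 3 on a POSIX system (the original session is Python 2). The names `run_in_child`, `describe_exit`, `healthy_target` and `crashing_target` are hypothetical, and `crashing_target` only simulates a crash by killing itself; it does not reproduce the NLTK problem. The point is that `Process.exitcode` shows whether the child died, even when nothing is printed:

```python
import multiprocessing
import os
import signal


def run_in_child(target):
    """Run `target` in a forked child process and return its exit code."""
    # Assumption: a POSIX system, where the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    child = ctx.Process(target=target)
    child.start()
    child.join()
    return child.exitcode


def describe_exit(exitcode):
    """Translate a Process.exitcode value into a readable message."""
    if exitcode == 0:
        return "child exited normally"
    if exitcode < 0:
        # multiprocessing reports death by signal N as exit code -N.
        return "child was killed by signal %d" % -exitcode
    return "child exited with status %d" % exitcode


def healthy_target():
    print("before the call", flush=True)
    print("after the call", flush=True)


def crashing_target():
    # Stand-in for the failing requests.get call: the child kills itself.
    print("before the call", flush=True)
    os.kill(os.getpid(), signal.SIGKILL)
    print("after the call", flush=True)  # never reached
```

To apply it to the real case, pass a named function that calls requests.get (with nltk imported in the parent) and print `describe_exit(run_in_child(that_function))`.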

Here are the versions I'm using:

$> python --version
Python 2.7.5
$> pip freeze | grep nltk
nltk==2.0.5
$> pip freeze | grep requests
requests==2.2.1

I'm running Mac OS X v. 10.9.5.

Thanks!

Romain G
  • The problem is not related to SSL: replacing the GitHub API URL with 'http://google.com' doesn't change the behaviour with or without nltk imported. – Romain G Jun 10 '15 at 20:45
  • The problem is not related to `requests` either. Replacing the call to `requests.get` with `req = urllib2.Request('http://google.com'); handler = urllib2.urlopen(req); print handler.getcode()` leaves the problem unchanged. – Romain G Jun 10 '15 at 20:49
  • Upgrading nltk to the latest version did not fix the issue either... – Romain G Jun 10 '15 at 21:05
  • Try doing the same thing without multiprocessing, i.e., perform it in same process and see what happens. – Vikas Ojha Jun 10 '15 at 21:17
  • This works; the issue is specific to the request being sent from the child process. This bug was already reported 2 months ago: https://github.com/nltk/nltk/issues/947. However, the version of NLTK I'm running was released in Nov 2012, so I'm surprised nobody noticed it sooner. – Romain G Jun 10 '15 at 21:26
  • Haha, it seems using NLTK and Python Requests in a child process is rare. Try using Thread instead of Process. I was having exactly the same issue with some other library and Requests, and replacing Process with Thread worked for me. Let me know if it works and I will post it as an answer. – Vikas Ojha Jun 10 '15 at 21:30
  • This will work as well ;-) You can post it as an answer, as it may be useful for other people. I won't accept it though, because with threads you introduce the limitation of the GIL. I agree that it is not relevant for this simple example, but in larger applications it may be a concern (and it is one for me, actually). – Romain G Jun 10 '15 at 21:56
  • Have you tried updating your NLTK version? – alvas Aug 06 '15 at 00:46
  • A similar thing happened to me when importing `ipdb`. See [here](http://stackoverflow.com/questions/33877491/python-multiprocessing-process-is-killed-by-http-request-if-ipdb-is-imported) – kilgoretrout Nov 23 '15 at 23:40

3 Answers

It seems that using NLTK and Python Requests in a child process is rare. Try using Thread instead of Process: I was having exactly the same issue with another library and Requests, and replacing Process with Thread worked for me.
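A minimal sketch of that swap, written for Python 3. The helper name `run_in_thread` is hypothetical; it runs any callable in a thread and collects either its return value or the exception it raised. As noted in the comments on the question, threads share the GIL, so this trades parallelism for avoiding the child process:

```python
import threading


def run_in_thread(fn, *args):
    """Run fn(*args) in a separate thread and return its outcome as a dict."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = fn(*args)
        except Exception as exc:  # keep the error instead of losing it
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return outcome
```

With requests installed, the call from the question would become `run_in_thread(requests.get, 'https://api.github.com')`.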

Vikas Ojha

Updating Python and your Python libraries should resolve the problem:

alvas@ubi:~$ pip freeze | grep nltk
nltk==3.0.3
alvas@ubi:~$ pip freeze | grep requests
requests==2.7.0
alvas@ubi:~$ python --version
Python 2.7.6
alvas@ubi:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:    14.04
Codename:   trusty
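When comparing two environments, the same version information can be collected from inside Python. This is only a sketch, assuming Python 3.8 or later (where `importlib.metadata` is in the standard library); the helper name `installed_versions` is hypothetical:

```python
import platform
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report(names):
    """Return the Python version plus the versions of the given packages."""
    report = {"python": platform.python_version()}
    report.update(installed_versions(names))
    return report
```

Calling `environment_report(["nltk", "requests"])` on each machine gives two dictionaries that can be compared directly.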

From a script:

from multiprocessing import Process
import nltk
import time


def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

[out]:

Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned

From the interactive interpreter:

alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Process
>>> import requests
>>> from pprint import pprint
>>> Process(target=lambda: pprint(
...         requests.get('https://api.github.com'))).start()
>>> <Response [200]>

>>> import nltk
>>> Process(target=lambda: pprint(
...         requests.get('https://api.github.com'))).start()
>>> <Response [200]>

It works with Python 3 too:

alvas@ubi:~$ python3
Python 3.4.0 (default, Jun 19 2015, 14:20:21) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Process
>>> import requests
>>> Process(target=lambda: print(requests.get('https://api.github.com'))).start()
>>> 
>>> <Response [200]>

>>> import nltk
>>> Process(target=lambda: print(requests.get('https://api.github.com'))).start()
>>> <Response [200]>
alvas

This issue still seems to be unsolved: https://github.com/nltk/nltk/issues/947. I think it is a serious issue (unless you are only playing with NLTK, doing POCs and trying out models rather than building actual apps). I am running NLP pipelines in RQ workers (http://python-rq.org/) with these versions:

nltk==3.2.1
requests==2.9.1
Sasinda Rukshan