15

I am running a piece of code using a multiprocessing pool. It works on one data set and fails on another, so the issue is clearly data driven. That said, I am not sure where to begin troubleshooting, because the error I receive is the following. Any hints for a starting point would be most helpful. Both sets of data are prepared using the same code, so I don't expect there to be a difference - yet here I am.

Also see the comment from Robert - we differ on OS and Python version (I have 3.4, he has 3.6) and have quite different data sets, yet the error is identical down to the lines in the Python code.

My suspicions:

  1. there is a memory limit per core that is being enforced.
  2. there is some timeout after which the pool decides a worker is never going to finish and gives up.

    Exception in thread Thread-9:
    Traceback (most recent call last):
      File "C:\Program Files\Python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\threading.py", line 911, in _bootstrap_inner
        self.run()
      File "C:\Program Files\Python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\threading.py", line 859, in run
        self._target(*self._args, **self._kwargs)
      File "C:\Program Files\Python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\multiprocessing\pool.py", line 429, in _handle_results
        task = get()
      File "C:\Program Files\Python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\multiprocessing\connection.py", line 251, in recv
        return ForkingPickler.loads(buf.getbuffer())
    TypeError: __init__() missing 1 required positional argument: 'message'
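For later readers: one mechanism that can produce exactly this `TypeError` in `pool.py`'s `_handle_results` thread (an assumption from the traceback, not confirmed here) is a worker raising an exception whose `__init__` takes more required arguments than it passes up to `Exception.__init__`. Such an exception pickles fine but cannot be unpickled when the result travels back to the parent. A minimal stdlib-only sketch, with a hypothetical `CustomError`:

```python
import pickle

class CustomError(Exception):
    def __init__(self, code, message):
        # Only `message` reaches Exception.__init__, so self.args == (message,).
        super().__init__(message)
        self.code = code

err = CustomError(42, "no features in text")
data = pickle.dumps(err)   # pickling succeeds: it records (CustomError, err.args)
try:
    pickle.loads(data)     # unpickling calls CustomError(*args) with one arg, not two
except TypeError as exc:
    print(exc)             # e.g. __init__() missing 1 required positional argument: 'message'
```

So the cryptic error may be about how a worker's exception is reconstructed in the parent, not about the data or the pool itself.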

pythOnometrist
  • 6,531
  • 6
  • 30
  • 50
  • wow. I am having the exact same issue, manifesting the exact same way, at the exact same time. And I'm running on ubuntu. ```File "/home/ubuntu/anaconda3/lib/python3.6/threading.py", line 864, in run self._target(*self._args, **self._kwargs) File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/pool.py", line 429, in _handle_results task = get() File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/connection.py", line 251, in recv return _ForkingPickler.loads(buf.getbuffer()) TypeError: __init__() missing 1 required positional argument: 'message'``` – Robert E Mealey Apr 07 '17 at 01:40
  • I guess we're each other's only hope. will let you know if i figure it out :) – Robert E Mealey Apr 07 '17 at 01:44
  • It's really bizarre - it runs on a dataset with 600k observations and fails on one with 1.4 MM points. Data are generated in exactly the same manner. Quite nutty - I am running it linearly to see if it's a data glitch - the error suggests it's something to do with the multiprocessing module itself - possibly how long it waits for an answer before giving up. – pythOnometrist Apr 07 '17 at 01:44
  • yeah that's pretty much exactly what's happening in my case too. Works on subset of larger dataset, fails on full dataset. – Robert E Mealey Apr 07 '17 at 01:45
  • do you know if there is an accessible memory limit? My machine has 128gb ram - and the process never gets to even a quarter of it - perhaps there is a memory ceiling – pythOnometrist Apr 07 '17 at 01:49
  • No, I don't think so. I've used pools for things like this hundreds of times with a lot bigger memory footprint and never run into this. The timeout idea is interesting though. – Robert E Mealey Apr 07 '17 at 01:55
  • So I've determined that it is caused by adding a call to `detect` function from the `langdetect` module to mapped function. And the first instance of data triggering it is a longer chunk of text than any preceding it in the dataset (ran it on subsets of 100, 500, 400, 450, 401, 425, ..., 404 until i determined it failed on subset of 404 and not on subset of 403... ugh). But running it singlethreaded and timing that function, it returns milliseconds slower on that longer chunk of text. And the mapped function as a whole returns slower on some of the preceding data. So I think... – Robert E Mealey Apr 07 '17 at 02:13
  • ...it has something to do with how that module abuses namespace somehow. Just a gut instinct based on issues with multiprocessing in the past, mostly. I am digging into it now. – Robert E Mealey Apr 07 '17 at 02:18
  • hahahaha I am using langdetect as well. You might be on the right track... not sure why size of data should have an effect on behavior... – pythOnometrist Apr 07 '17 at 02:27
  • are we the same person? – Robert E Mealey Apr 07 '17 at 02:31
  • clearly yes. parallel universes... I went Windows in one... and Ubuntu in another... – pythOnometrist Apr 07 '17 at 02:34
  • keeping with the parallel universe hypothesis - you might be hitting an upper bound and I hit the lower bound - let's see if this run does it. – pythOnometrist Apr 07 '17 at 03:11

3 Answers

12

I think the issue is that langdetect quietly declares a hidden global detector factory here https://github.com/Mimino666/langdetect/blob/master/langdetect/detector_factory.py#L120:

def init_factory():
    global _factory
    if _factory is None:
        _factory = DetectorFactory()
        _factory.load_profile(PROFILES_DIRECTORY)

def detect(text):
    init_factory()
    detector = _factory.create()
    detector.append(text)
    return detector.detect()


def detect_langs(text):
    init_factory()
    detector = _factory.create()
    detector.append(text)
    return detector.get_probabilities()

In my experience this kind of thing can cause issues with multiprocessing by running afoul of the way it shares resources in memory and manages namespaces across the worker and master processes, though the exact mechanism in this case is a black box to me. I fixed it by adding a call to init_factory to my pool initialization function:

import signal
import requests
from requests.adapters import HTTPAdapter
from langdetect.detector_factory import init_factory

def worker_init_corpus(stops_in):
    global sess
    global stops
    # Give each worker its own HTTP session and connection pool.
    sess = requests.Session()
    sess.mount("http://", HTTPAdapter(max_retries=10))
    stops = stops_in
    # Ignore Ctrl-C in workers; the parent process handles the interrupt.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    # Load langdetect's hidden global factory once per worker.
    init_factory()

FYI: the "sess" logic gives each worker its own HTTP connection pool for requests, addressing a similar issue when using that module with multiprocessing pools. If you don't do this, the workers do all their HTTP communication back up through the parent process, because that's where the hidden global connection pool lives by default, and everything is painfully slow. That issue is one reason I suspected a similar cause here.

Also, to further reduce potential confusion: stops is for providing the stopword list I'm using to the mapped function. And the signal call is to force pools to exit nicely when hit with a user interrupt (ctrl-c). Otherwise they often get orphaned and just keep on chugging along after the parent process dies.

Then my pool is initialized like this:

self.pool = mp.Pool(mp.cpu_count()-2, worker_init_corpus, (self.stops,))
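For readers unfamiliar with the initializer argument: here is a stripped-down, self-contained sketch of the same pattern (all names hypothetical), where per-worker state is built once in the initializer instead of being pickled with every task:

```python
import multiprocessing as mp

_resource = None  # per-worker global, populated by the initializer

def init_worker(tag):
    # Runs once in each worker process, right after it starts.
    global _resource
    _resource = "resource-" + tag

def work(x):
    # Each worker reads its own process-local copy of the global.
    return "%s:%d" % (_resource, x)

if __name__ == "__main__":
    with mp.Pool(2, initializer=init_worker, initargs=("demo",)) as pool:
        print(sorted(pool.map(work, [1, 2])))
```

The answer's `worker_init_corpus` plays the role of `init_worker`, with the stopword list passed through `initargs`.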

I also wrapped my call to detect in a try/except LangDetectException block:

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

try:
    posting_out["lang"] = detect(posting_out["job_description"])
except LangDetectException:
    # e.g. "No features in text" on empty input
    posting_out["lang"] = "none"

But this doesn't fix it on its own. Pretty confident that the initialization is the fix.

Robert E Mealey
  • 506
  • 3
  • 14
  • you are the expert here - so will accept. But will leave my response if someone has a similar issue. – pythOnometrist Apr 07 '17 at 03:40
  • expert might be a bit strong. maybe we were having two distinct issues manifesting in what appeared to be the same way. I just verified again that if I remove the try/catch and don't filter out empty docs, I just get LangDetectException raised through the pools, not the behavior we both saw previously and if I remove the init_factory call from the worker_init function, I get the behavior. But I definitely don't know exactly why in this case it is happening and I don't really want to spend any more time on it. Good luck with the rest of your work, Bizarro Windows Me! :) – Robert E Mealey Apr 07 '17 at 03:46
  • :-) Good luck to you too! – pythOnometrist Apr 07 '17 at 19:21
  • 1
    I found an easier solution by using `spacy_langdetect`. I posted an answer here: [link](https://stackoverflow.com/a/56118039/1874449) – Habib Karbasian May 13 '19 at 18:37
  • Four years later, this is still an issue and either this or using `spacy_langdetect` is still the workaround lol – dmn May 26 '21 at 14:57
  • For me, the try-except block simply solved it. Your answer shed light on `lang_detect` as the culprit. Thanks. – f4z3k4s Aug 31 '21 at 12:21
2

Thanks to Robert - focusing on langdetect revealed that possibly some of my text entries were empty:

LangDetectException: No features in text

A rookie mistake - possibly due to encoding errors. Re-running after filtering those out - will keep you (Robert) posted.
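The filtering step can be as simple as dropping empty or whitespace-only entries before they ever reach detect. A sketch with a hypothetical helper (not from either answer):

```python
def drop_empty_docs(docs):
    """Remove entries that would trigger 'LangDetectException: No features in text'."""
    cleaned = []
    for doc in docs:
        if isinstance(doc, bytes):
            # Defensive decode in case encoding errors left raw byte strings.
            doc = doc.decode("utf-8", errors="replace")
        if doc and doc.strip():
            cleaned.append(doc)
    return cleaned

print(drop_empty_docs(["some review text", "", "   ", None, "ok"]))
# ['some review text', 'ok']
```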

pythOnometrist
  • 6,531
  • 6
  • 30
  • 50
  • I don't think that's the issue. I filtered those out and still got the issue. – Robert E Mealey Apr 07 '17 at 03:13
  • I think it's this bit of global abuse here: https://github.com/Mimino666/langdetect/blob/master/langdetect/detector_factory.py#L120 – Robert E Mealey Apr 07 '17 at 03:14
  • hmm - beyond me - not sure I follow at all. I do see some _factory being set to a global something - but given we have identical errors, and mine went away after excluding empty reviews, perhaps it's something else - possibly an encoding issue. What does langdetect say when you simply use map in place of pool.map? It looks like using pool.map_async can get you to raise the error trace – pythOnometrist Apr 07 '17 at 03:33
2

I was throwing a custom exception somewhere in my code, and it was being raised in most of my pool's worker processes. About 90% of the processes went to sleep because the exception occurred in them. But instead of a normal traceback, I got this cryptic error. Mine was on Linux, though.

To debug this, I removed the pool and ran the code sequentially.
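That tactic generalizes well: keep a switch that swaps pool.map for a plain loop, since an exception raised in-process surfaces with its full traceback instead of being mangled by pickling on its way out of a worker. A sketch with hypothetical names:

```python
import multiprocessing as mp

def process(item):
    # Stand-in for the real mapped function.
    if item < 0:
        raise ValueError("bad item: %r" % item)
    return item * 2

def run(items, debug=False):
    if debug:
        # Sequential: exceptions keep a normal, readable traceback.
        return [process(x) for x in items]
    with mp.Pool() as pool:
        return pool.map(process, items)

if __name__ == "__main__":
    print(run([1, 2, 3], debug=True))  # [2, 4, 6]
```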

Rohan Bhatia
  • 1,870
  • 2
  • 15
  • 31