TensorFlow + joblib: limited to 8 processes?

Question

I created a statistical estimator using TensorFlow. I modelled it on sklearn's estimators, so I have a class that packages everything, including importing TensorFlow and starting the TF session (if I import TF outside the class, nothing runs in parallel at all).

I need to run that estimator many times on randomized data to see the empirical distribution, so I am using joblib to parallelize the code that creates the data, creates the estimator object and runs the estimation on the data. I am working on a Linux server with 64 cores (and plenty of memory) where I have run much bigger jobs than this, also using joblib. However, when I try running the TF-based code, I am only able to run 8 processes. If I ask for 9, only 8 show up in top, and when those 8 are done joblib never dispatches the remaining work: it either never returns or fails with the following error message:

"BrokenProcessPool: A process in the executor was terminated abruptly while the future was running or pending."

If I limit the processes to 8, everything works normally. I tried changing joblib's backend to dask.parallel and I see the same behaviour. That backend gives a bit more information, constantly logging messages such as:

"distributed.nanny - WARNING - Worker process 7602 was killed by unknown signal"
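Both messages appear to describe the same situation: a worker process being killed from outside Python rather than raising an exception. As a minimal sketch of that failure mode, assuming Linux (it relies on the fork start method and SIGKILL), the standard library's process pool reports a killed worker in the same way. Note that joblib's loky backend raises its own BrokenProcessPool class, and the function names below are hypothetical:

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash():
    # Simulate a worker being killed by a signal (hypothetical stand-in
    # for whatever kills the TensorFlow workers).
    os.kill(os.getpid(), signal.SIGKILL)


def worker_died_abruptly():
    # Assumption: Linux, so the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        future = pool.submit(crash)
        try:
            future.result()
        except BrokenProcessPool:
            # The pool noticed that a worker vanished without returning.
            return True
    return False
```

Calling worker_died_abruptly() returns True when the pool reports the killed worker, which suggests looking at what kills the workers (signals, resource limits) rather than at joblib itself.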

I would like to be able to run more than 8 processes. Is this a hard limit, or can I change it via some TF parameter? Can I get around this problem in any way? I think the limitation is TensorFlow-related because once 8 processes are running (and they take hours) I cannot run anything else that uses TensorFlow on that machine.

Thanks for your help!!

The following code reproduces the error:

from sklearn.base import TransformerMixin
import numpy as np
from joblib import Parallel, delayed

class MyEstimator(TransformerMixin):
    def __init__(self):
        import tensorflow as tf
        self._tf = tf
        self._graph = tf.Graph()
        with self._graph.as_default():
            self.session = self._tf.Session()
            A0 = np.eye(10, 2)
            self.a_var = a_var = tf.Variable(A0, name='a_var', dtype=tf.float64)
            self._x = x = tf.placeholder(dtype=tf.float64)
            self._y = y= tf.placeholder(dtype=tf.float64)
            w = tf.tensordot(a_var, x, axes=0)
            self.f = tf.reduce_mean((y-w)**2)

    def fit(self, x, y):
        #self.session.run(
        #             self._tf.global_variables_initializer())
        self._f = self.session.run(self.f, feed_dict={self._x:x, self._y: y, self.a_var:np.eye(10, 2)})

        return self

def run_estimator():
    my_est = MyEstimator()
    x = np.random.normal(0,1,10)
    y = np.random.normal(0,1,10)
    my_est.fit(x,y)

Parallel(n_jobs=16)(delayed(run_estimator)() for _ in range(16))

I am working on Linux, Python 3.6.3, TensorFlow 1.7.0, joblib 0.12.

score 1 · Accepted Answer · answered Oct 03 '18 at 11:14

Many months later, I found a solution: using a TensorFlow server, https://www.tensorflow.org/deploy/distributed

from sklearn.base import TransformerMixin
import numpy as np
from joblib import Parallel, delayed

class MyEstimator(TransformerMixin):
    def __init__(self, target):
        import tensorflow as tf
        self._tf = tf
        self._graph = tf.Graph()
        with self._graph.as_default():
            config = self._tf.ConfigProto(
                    intra_op_parallelism_threads=1,
                    inter_op_parallelism_threads=1,
                    device_count={"CPU":4},
                    use_per_session_threads=True)
            config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
            pool = config.session_inter_op_thread_pool.add()
            pool.num_threads = 1

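            # Note: `config` above is built but not passed to the session as written;
            # Session(target, config=config) would be needed for it to take effect.
            self.session = self._tf.Session(target)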
            A0 = np.eye(10, 2)
            self.a_var = a_var = tf.Variable(A0, name='a_var', dtype=tf.float64)
            self._x = x = tf.placeholder(dtype=tf.float64)
            self._y = y= tf.placeholder(dtype=tf.float64)
            w = tf.tensordot(a_var, x, axes=0)
            self.f = tf.reduce_mean((y-w)**2)

    def fit(self, x, y):
        #self.session.run(
        #             self._tf.global_variables_initializer())
        self._f = self.session.run(self.f, feed_dict={self._x:x, self._y: y, self.a_var:np.eye(10, 2)})

        return self

def run_estimator(target):
    my_est = MyEstimator(target)
    x = np.random.normal(0,1,10)
    y = np.random.normal(0,1,10)
    my_est.fit(x,y)
    return 1

import tensorflow as tf
server = tf.train.Server.create_local_server()
Parallel(n_jobs=16)(delayed(run_estimator)(server.target) for _ in range(16))

Do you know the correct line for tf.train.Server.create_local_server() in tensorflow 2.0? — Alejandro, Oct 31 '19 at 21:00
