
I have a large number of variables (2000) that need to be initialized. TensorFlow takes a long time to initialize them, which is a blocker for me right now. I am running TF in distributed mode (between-graph replication).

with tf.variable_scope("f_counts"):
    per_ps_features = []  # a list of lists
    for node in xrange(num_workers):
        with tf.device("/job:ps/task:{}".format(node % num_ps)):
            f = []  # list of features per parameter server
            for ps_node in xrange(num_workers):
                # unique features per node
                f.append(tf.get_variable(
                    name='ps_' + str(node) + 'features_' + str(ps_node),
                    initializer=tf.constant([], dtype=tf.string),
                    dtype=tf.string,
                    validate_shape=False,
                    trainable=False))
            per_ps_features.append(f)

As you can see, for each node a list of num_workers variables is created on one of the PS tasks. This makes the following very slow (it sometimes takes an hour just to create the session):

with tf.train.MonitoredTrainingSession(master=server.target, is_chief=is_chief, config=tf.ConfigProto(log_device_placement=False)) as session:

Is there a workaround or an alternative when, say, num_workers = 200?
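For a sense of scale, here is a minimal TensorFlow-free sketch that mirrors the nested loop in the snippet above and only counts what it would create. The helper name `planned_variables` and the value `num_ps=4` are assumptions made for illustration; they are not part of my actual setup. It suggests that the loop as posted creates num_workers * num_workers variables, so the count grows quadratically with the number of workers.

```python
from collections import Counter


def planned_variables(num_workers, num_ps):
    """Return the (device, name) pairs the nested loop would create.

    Hypothetical helper: it only mirrors the loop structure and the
    naming scheme from the question; no TensorFlow objects are built.
    """
    plan = []
    for node in range(num_workers):
        device = "/job:ps/task:{}".format(node % num_ps)
        for ps_node in range(num_workers):
            name = "ps_" + str(node) + "features_" + str(ps_node)
            plan.append((device, name))
    return plan


# num_ps=4 is an assumed value, chosen only to make the example concrete.
plan = planned_variables(num_workers=200, num_ps=4)
per_device = Counter(device for device, _ in plan)

print("total variables:", len(plan))
print("variables on /job:ps/task:0:", per_device["/job:ps/task:0"])
```

Under these assumptions, every variable contributes its own initializer to the graph that the session has to set up, so the total count is probably the number to watch when scaling num_workers.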

  • Do things get much faster if you reduce the number of variables to initialize? My first guess would be that you have some lagging machine that's slow to come up, so the whole TensorFlow cluster is waiting for it. – Yaroslav Bulatov Sep 06 '17 at 17:45
  • I don't think it's the machines, since reducing the number of variables to initialize speeds things up. – sushanth kumar Sep 06 '17 at 17:59
  • @YaroslavBulatov If you try creating 2000 variables with something like: for ps_node in xrange(2000): f.append(tf.get_variable(initializer=tf.constant([], dtype=tf.string), dtype=tf.string, validate_shape=False, trainable=False, name='features_'+str(ps_node))) you can easily reproduce the delay even on a single node. – sushanth kumar Sep 08 '17 at 17:31
  • Initializing 2000 scalar variables on a single machine shouldn't be much slower than initializing 1 scalar variable. If you can create a self-contained example reproducing the large delay, it could be worth filing a bug. – Yaroslav Bulatov Sep 09 '17 at 19:14

0 Answers