
I tried to use tf.estimator to build a logistic regression model. I used the iris dataset and it ran successfully on my computer. However, when I tried to train this model on a cluster (using train_and_evaluate instead of classifier.train), I ran into the problem below.

Python version: 3.6.8, TensorFlow version: 1.13.1

Here is the code running locally:

The iris dataset contains only numeric data, so feature_columns is a list of NumericColumn:

FUTURES = ['SepalLength', 'SepalWidth','PetalLength', 'PetalWidth', 'Species']
feature_columns = []
for key in FUTURES:
    feature_columns.append(tf.feature_column.numeric_column(key=key))

Define the estimator and pass feature_columns into params:

classifier = tf.estimator.Estimator(
        model_fn=my_model_fn,
        model_dir=models_path,
        params={
            'feature_columns': feature_columns,
            'n_classes': 3,
        })

Define the model_fn:

def my_model_fn(features,labels,mode,params):
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    logits = tf.layers.dense(net, params['n_classes'], activation=None)  # single linear layer -> multinomial logistic regression

    predicted_classes = tf.argmax(logits, 1)
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {'logits': logits}
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
        train_op = optimizer.minimize(loss,global_step=tf.train.get_global_step())  
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)      

    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predicted_classes,
                                   name='acc_op') 
    metrics = {'accuracy': accuracy} 
    tf.summary.scalar('accuracy', accuracy[1])
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

This code works well and produces reasonable results.
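Locally the estimator is driven with classifier.train. The exact input_fn is not shown above, so the snippet below is only a rough sketch of that call; the function name, variable names, batch size and step count are placeholders:

# Rough sketch of the local training call (assumes: import tensorflow as tf).
def local_train_input_fn(features_df, labels, batch_size):
    # features_df: pandas DataFrame of the numeric feature columns, labels: integer class ids
    dataset = tf.data.Dataset.from_tensor_slices((dict(features_df), labels))
    return dataset.shuffle(500).repeat().batch(batch_size)

classifier.train(input_fn=lambda: local_train_input_fn(train_x, train_y, 32), steps=1000)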

-------------------------------------------------------------

Then I want to train it on a cluster. my_model_fn is the same as the previous one, and self._feature_numeric_col is still a list of NumericColumn.

class LogisticReg():
    def __init__(self):
        self._feature_col = x.columns.tolist()  # x: pandas DataFrame of input features
        self._feature_numeric_col = []
        for key in self._feature_col:
            self._feature_numeric_col.append(tf.feature_column.numeric_column(key=key))
        self.estimator = tf.estimator.Estimator(model_fn=self.my_model_fn,
                                                model_dir=self.model_path,
                                                config=self.config,
                                                params={'feature_columns':self._feature_numeric_col})

    def my_model_fn(self, features, labels, mode, params):

        net = tf.feature_column.input_layer(features, params['feature_columns'])
        logits = tf.layers.dense(net, self.n_class, activation=None)

        predicted_classes = tf.argmax(logits, 1)  
        if mode == tf.estimator.ModeKeys.PREDICT:
            predictions = {'logits': logits}
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)

        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.train.AdagradOptimizer(learning_rate=0.1) 
            train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
            return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

        accuracy = tf.metrics.accuracy(labels=labels,predictions=predicted_classes) 

        metrics = {'accuracy': accuracy}  
        tf.summary.scalar('accuracy', accuracy[1])  
        if mode == tf.estimator.ModeKeys.EVAL:
            return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
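self.config above is a tf.estimator.RunConfig whose contents are not shown here. Since the traceback below passes through mirrored_strategy.py, a distribution strategy is involved, so a purely hypothetical sketch of that config could look like this (illustrative values, not the actual settings):

# Hypothetical sketch of self.config -- illustrative only (assumes: import tensorflow as tf).
# In TF 1.13 the distribution strategies live under tf.contrib.distribute.
config = tf.estimator.RunConfig(
    train_distribute=tf.contrib.distribute.MirroredStrategy(),  # or e.g. CollectiveAllReduceStrategy for multi-worker
    save_checkpoints_steps=100)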

Use the train_and_evaluate function instead of train/eval/predict:

# input_fn
def input_fn(self, X, y, mode, batch_size):
    y = y.astype(np.int32)
    X = X.astype(np.float32)
    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y)) # x,y:pandas
    if mode == 'train':
        dataset = dataset.shuffle(500)
        dataset = dataset.repeat()  
    dataset = dataset.batch(batch_size)
    return dataset

# train_spec
train_spec = tf.estimator.TrainSpec(input_fn=lambda: self.input_fn(x_train,y_train,'train',batch_size),
                                    max_steps=n_epochs)
# eval_spec
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: self.input_fn(x_valid, y_valid, 'valid', batch_size),
                                          start_delay_secs=30, throttle_secs=30, steps=None)


tf.estimator.train_and_evaluate(self.estimator, train_spec, eval_spec)
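For reference, one batch from this input_fn can be inspected locally like this (a sanity-check sketch; the batch size is illustrative):

# Sanity check (TF 1.x graph mode): look at one batch produced by input_fn.
dataset = self.input_fn(x_train, y_train, 'train', batch_size=4)
features, labels = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    f, l = sess.run([features, labels])
print({name: (v.shape, v.dtype) for name, v in f.items()})  # each feature: shape (4,), float32
print(l.shape, l.dtype)                                     # labels: shape (4,), int32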

I expect the cluster version to generate output similar to the local one. However, I get this error:

Traceback (most recent call last):
  File "/usr/local/bin/python3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/bin/python3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 852, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/bin/python3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/mnt/glusterfs/model-center/train/classify.py", line 51, in my_model_fn
    net = tf.feature_column.input_layer(features, params['feature_columns'])
  File "/usr/local/bin/python3/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column.py", line 302, in input_layer
    cols_to_output_tensors=cols_to_output_tensors)
  File "/usr/local/bin/python3/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column.py", line 181, in _internal_input_layer
    feature_columns = _normalize_feature_columns(feature_columns)
  File "/usr/local/bin/python3/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column.py", line 2263, in _normalize_feature_columns
    'Given (type {}): {}.'.format(type(column), column))
ValueError: Items of feature_columns must be a _FeatureColumn. Given (type <class 'collections.NumericColumn'>): NumericColumn(key='sepal_length', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None).
  • don't use `train_and_evaluate`? try using `train`/`eval`/`predict` for training, if the problem goes away then your `train_and_evaluate` inputs are incorrect – Oleg Vorobiov Sep 10 '19 at 10:22
  • @okawo train_and_evaluate is for distributed training. – zzzwwwdxd Sep 16 '19 at 06:46
  • can you show the code where you define your `train_spec` and `eval_spec`? – Oleg Vorobiov Sep 16 '19 at 09:07
  • @okawo I already updated the code. – zzzwwwdxd Sep 17 '19 at 08:01
  • try removing the `lambda` when you define `train_spec` and `eval_spec`, i. e. ```tf.estimator.TrainSpec(input_fn=self.input_fn(x_train,y_train,'train',batch_size), max_steps=n_epochs)``` and ```tf.estimator.EvalSpec(input_fn=self.input_fn(x_valid, y_valid, 'valid', batch_size), start_delay_secs=30, throttle_secs=30, steps=None)```. if after this you still have errors please post them. – Oleg Vorobiov Sep 17 '19 at 09:36
  • @okawo the `input_fn` argument expects a function rather than a dataset, so I use a lambda to turn the dataset into a function. I also found two examples which use a similar method, so I think the train_spec and eval_spec I define are correct. Here are the links: https://stackoverflow.com/questions/49619995/how-to-control-when-to-compute-evaluation-vs-training-using-the-estimator-api-of https://cloud.google.com/blog/products/gcp/easy-distributed-training-with-tensorflow-using-tfestimatortrain-and-evaluate-on-cloud-ml-engine – zzzwwwdxd Sep 18 '19 at 01:59
  • in both links that you mentioned, the input handling functions output a tuple `(x, y)` for training, something that your code is missing. although you are using a tensor with probably correct dimensions (you never posted the shape of your input function output), make sure that the output of your input function fulfills this shape structure: `t = (x_train, y_train, 'train', batch_size)`, `t[0].shape == (num_samples, *shape_of_your_x_data)` and `t[1].shape == (num_samples, *shape_of_your_y_data)`. and also make sure that your output has a clear numeric `dtype` (e.g. `np.float32`). – Oleg Vorobiov Sep 18 '19 at 17:11
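A minimal sketch of the explicit `(x, y)` structure the last comment describes, using the same variable names as above (the helper name is hypothetical; both tensor-returning and Dataset-returning input_fns are accepted by Estimator):

# Variant of the input_fn that returns an explicit (features, labels) tuple
# (assumes: import numpy as np, import tensorflow as tf; X is a pandas DataFrame, y a Series).
def tuple_input_fn(X, y, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((dict(X.astype(np.float32)), y.astype(np.int32)))
    dataset = dataset.shuffle(500).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()  # -> (dict of feature tensors, label tensor)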
