I want to reproduce the NASNet results on CIFAR-10 for benchmarking purposes, using the TF-Slim implementation at https://github.com/tensorflow/models/tree/master/research/slim. To train the model from scratch, I added the following lines to the original code in train_image_classifier.py, following the instructions in the comments (lines 31-37) of the script /nets/nasnet/models.py:
After line 247:
  elif FLAGS.learning_rate_decay_type == 'cosine':
    return tf.train.cosine_decay(FLAGS.learning_rate,
                                 global_step,
                                 decay_steps,
                                 name='cosine_decay_learning_rate')
After line 536:
clone_gradients = tf.clip_by_global_norm(clones_gradients, 5.0)
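For reference, here is a minimal sketch of what global-norm clipping computes, written in plain Python rather than TensorFlow so it can be run on its own. It assumes the gradients arrive as (gradient, variable) pairs, which is how clones_gradients is structured as far as I can tell; the helper name clip_pairs_by_global_norm and the toy values are hypothetical. Note that tf.clip_by_global_norm takes a list of tensors and returns a (clipped_list, global_norm) tuple, so the pairs would have to be unzipped first and the clipped result assigned back to the variable the optimizer actually consumes.

```python
import math


def clip_pairs_by_global_norm(grads_and_vars, clip_norm):
    """Scale all gradients jointly so their global L2 norm is at most clip_norm.

    grads_and_vars: list of (gradient, variable_name) pairs, where each
    gradient is a flat list of floats standing in for a tensor.
    Returns (clipped_pairs, global_norm), mirroring the tuple-style return
    of global-norm clipping.
    """
    # Global norm is taken over every element of every gradient.
    global_norm = math.sqrt(
        sum(x * x for grad, _ in grads_and_vars for x in grad))
    # Each gradient is multiplied by clip_norm / max(global_norm, clip_norm),
    # so nothing changes when the norm is already below the threshold.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [([x * scale for x in grad], var) for grad, var in grads_and_vars]
    return clipped, global_norm


# Toy example: global norm = sqrt(3^2 + 4^2 + 12^2) = 13, clipped down to 5.
pairs = [([3.0, 4.0], "w"), ([12.0], "b")]
clipped_pairs, norm_before = clip_pairs_by_global_norm(pairs, clip_norm=5.0)
norm_after = math.sqrt(sum(x * x for grad, _ in clipped_pairs for x in grad))
print(norm_before, norm_after)
```

In this sketch the variable names travel with their gradients, which is the part that is easy to lose when the pairs are passed to a clipping function directly.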
After downloading the CIFAR-10 data and converting it to TFRecord format, I run:
DATASET_DIR=/tmp/data/cifar10
TRAIN_DIR=/tmp/train_logs
python3 train_image_classifier.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_name=cifar10 \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --model_name=nasnet_cifar \
    --preprocessing_name=cifarnet \
    --learning_rate=0.025 \
    --optimizer=momentum \
    --learning_rate_decay_type=cosine \
    --num_epochs_per_decay=600.0 \
    --batch_size=32
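With these flags, the length of the decay schedule and the resulting learning rate can be worked out by hand. The sketch below is plain Python, not the TF-Slim code itself: it assumes the 50000-image CIFAR-10 training split, a single clone, and the default alpha=0.0 of tf.train.cosine_decay, and the function name cosine_decay here is just a local stand-in for that schedule.

```python
import math

NUM_TRAIN_IMAGES = 50000  # CIFAR-10 training split (assumed)
BATCH_SIZE = 32           # --batch_size
NUM_EPOCHS = 600.0        # --num_epochs_per_decay
BASE_LR = 0.025           # --learning_rate

# Steps per epoch times number of epochs, truncated to an integer.
decay_steps = int(NUM_TRAIN_IMAGES / BATCH_SIZE * NUM_EPOCHS)


def cosine_decay(base_lr, step, decay_steps):
    """Cosine schedule with alpha = 0: base_lr at step 0, zero at decay_steps."""
    step = min(step, decay_steps)  # the schedule is clamped past decay_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))


print(decay_steps)  # 937500
for step in (0, decay_steps // 2, decay_steps, 1002284):
    print(step, cosine_decay(BASE_LR, step, decay_steps))
```

Because the step is clamped, the learning rate stays at zero for every step beyond decay_steps. If I read the script correctly, the --max_number_of_steps flag of train_image_classifier.py could be set to the same value so that training stops there instead of running on.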
Training appears to continue past 600 epochs (= 937500 steps), although the parameters are no longer updated, because the cosine decay drives the learning rate to 0 at that point. I then run the evaluation script:
DATASET_DIR=/tmp/data/cifar10
TRAIN_DIR=/tmp/train_logs
python3 eval_image_classifier.py \
    --alsologtostderr \
    --checkpoint_path=${TRAIN_DIR} \
    --dataset_name=cifar10 \
    --dataset_split_name=test \
    --dataset_dir=${DATASET_DIR} \
    --model_name=nasnet_cifar \
    --preprocessing_name=cifarnet
I get the following result:
/home/zelaa/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From eval_image_classifier.py:91: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
2018-02-24 19:22:39.646499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:02:00.0
totalMemory: 11.92GiB freeMemory: 11.81GiB
2018-02-24 19:22:39.646538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0, compute capability: 5.2)
WARNING:tensorflow:From eval_image_classifier.py:155: streaming_accuracy (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.metrics.accuracy. Note that the order of the labels and predictions arguments has been switched.
WARNING:tensorflow:From eval_image_classifier.py:157: streaming_recall_at_k (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed after 2016-11-08.
Instructions for updating:
Please use `streaming_sparse_recall_at_k`, and reshape labels from [batch_size] to [batch_size, 1].
INFO:tensorflow:Evaluating train_logs/model.ckpt-1002284
INFO:tensorflow:Starting evaluation at 2018-02-24-18:22:51
2018-02-24 19:22:52.383834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0, compute capability: 5.2)
INFO:tensorflow:Restoring parameters from train_logs/model.ckpt-1002284
INFO:tensorflow:Evaluation [20/200]
INFO:tensorflow:Evaluation [40/200]
INFO:tensorflow:Evaluation [60/200]
INFO:tensorflow:Evaluation [80/200]
INFO:tensorflow:Evaluation [100/200]
INFO:tensorflow:Evaluation [120/200]
INFO:tensorflow:Evaluation [140/200]
INFO:tensorflow:Evaluation [160/200]
INFO:tensorflow:Evaluation [180/200]
INFO:tensorflow:Evaluation [200/200]
eval/Recall_5[0.9985]
eval/Accuracy[0.9577]
INFO:tensorflow:Finished evaluation at 2018-02-24-18:23:26
So the test error for a single run is 4.23%, which does not match any of the results reported in "Learning Transferable Architectures for Scalable Image Recognition". Is there anything I am missing that prevents me from matching the paper's results?
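As a quick check of the arithmetic, the error figure follows directly from the eval/Accuracy value in the log. The sketch below assumes the standard 10000-image CIFAR-10 test split; the variable names are only for illustration.

```python
accuracy = 0.9577        # eval/Accuracy from the log above
NUM_TEST_IMAGES = 10000  # CIFAR-10 test split (assumed)

# Error rate in percent, rounded to the two decimals the accuracy supports.
test_error_pct = round(100.0 * (1.0 - accuracy), 2)
# Approximate number of misclassified test images.
num_errors = round(NUM_TEST_IMAGES * (1.0 - accuracy))

print(test_error_pct, num_errors)
```

This is a single run from one checkpoint (step 1002284), so some run-to-run variation around this number is to be expected.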