
I have two datasets and I am training a CNN on each using the Caffe library.

The first dataset has plenty of training data: more than 60,000 training images and 16,000 test images. Its solver file is shown below; the batch size for training is set to 32.

train_net: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6/train_0.prototxt"
test_net: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6/test_0.prototxt"
test_iter: 2080
test_interval: 4000
base_lr: 0.0010000000475
display: 10
max_iter: 16000
lr_policy: "multistep"
gamma: 0.10000000149
momentum: 0.899999976158
weight_decay: 0.000500000023749
snapshot: 2000
snapshot_prefix: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6/InceptionNet"
solver_mode: GPU
device_id: 0
debug_info: false
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 4000
stepvalue: 8000
stepvalue: 12000
iter_size: 1
momentum2: 0.999000012875
type: "Adam"
eval_type: "detection"
ap_version: "11point"
num_total_train_images: 62308
pathtolog: "/home/Softwares/Projects/caffe-ssd-2/NumberPlate/InceptionNet/6"
batchsize: 32
meanprecision: 0.5
scratch: 1

The second dataset has far fewer training images: only 2883 training images and 709 test images, and the batch size for training is set to 16, as follows.

train_net: "/home /Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0/train_0.prototxt"
test_net: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0/test_0.prototxt"
test_iter: 177
test_interval: 500
base_lr: 0.0010000000475
display: 10
max_iter: 8000
lr_policy: "multistep"
gamma: 0.10000000149
momentum: 0.899999976158
weight_decay: 0.000500000023749
snapshot: 1000
snapshot_prefix: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0/InceptionNet"
solver_mode: GPU
device_id: 0
debug_info: false
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 2000
stepvalue: 4000
stepvalue: 6000
iter_size: 1
momentum2: 0.999000012875
type: "Adam"
eval_type: "detection"
ap_version: "11point"
num_total_train_images: 2883
pathtolog: "/home/Softwares/Projects/caffe-ssd-2/Nextan/InceptionNet/0"
batchsize: 16
meanprecision: 0.5
scratch: 1

I trained on the same PC with the same GPU and resources. The second dataset gives me "Check failed: error == cudaSuccess (74 vs. 0) misaligned address", but the first dataset trains successfully. What could be wrong?


1 Answer


It is an internal bug in Caffe: in some situations max_workspace is not a multiple of 16, which leaves the workspace buffer misaligned in memory. The first thing I would try is changing the batch size.
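If you want to try patching it yourself before the fix lands, the idea is simply to round max_workspace up to an aligned boundary before the workspace is allocated. A minimal standalone sketch of that rounding (the align_up helper name and the demo values are my own illustration, not Caffe's code; the same arithmetic appears in the comment below):

#include <cassert>
#include <cstddef>
#include <iostream>

// Round `bytes` up to the next multiple of `alignment`.
static size_t align_up(size_t bytes, size_t alignment) {
  return (bytes + alignment - 1) / alignment * alignment;
}

int main() {
  size_t max_workspace = 1000003;               // example: an unaligned byte count
  max_workspace = align_up(max_workspace, 32);  // force 32-byte alignment
  assert(max_workspace % 32 == 0);
  std::cout << max_workspace << std::endl;      // prints 1000032
  return 0;
}

Applied inside Caffe, the equivalent one-liner would go wherever max_workspace is computed in the cuDNN convolution code.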

Here is a pull request tracking the issue: https://github.com/BVLC/caffe/pull/6548

  • Did you solve this issue? Modifying cudnn_conv.cpp doesn't help. – Khan Jan 06 '20 at 19:22
  • Have you tried aligning the address to be a multiple of 32? `size_t m=32; max_workspace = (max_workspace + m-1) / m * m; //align address to be a multiple of m` – ailun0x0e Feb 04 '20 at 15:15