I am using an NVIDIA Jetson TX1 and Caffe to train AlexNet on my own data. I feed the model 104,000 training images and 20,000 validation images, with a batch size of 16 for both the train and test phases.

I run the solver for training and I get this Bus error after 1300 iterations:

.
.
.
I0923 12:08:37.121116 2341 sgd_solver.cpp:106] Iteration 1300, lr = 0.01
*** Aborted at 1474628919 (unix time) try "date -d @1474628919" if you  are using GNU date ***
PC: @                0x0 (unknown)
*** SIGBUS (@0x7ddea45000) received by PID 2341 (TID 0x7faa9fdf70) from  PID 18446744073149894656; stack trace: ***
    @       0x7fb4b014e0 (unknown)
    @       0x7fb3ebe8b0 (unknown)
    @       0x7fb4057248 (unknown)
    @       0x7fb40572b4 (unknown)
    @       0x7fb446e120 caffe::db::LMDBCursor::value()
    @       0x7fb4587624 caffe::DataReader::Body::read_one()
    @       0x7fb4587a90 caffe::DataReader::Body::InternalThreadEntry()
    @       0x7fb458a870 caffe::InternalThread::entry()
    @       0x7fb458b0d4 boost::detail::thread_data<>::run()
    @       0x7fafdf7ef0 (unknown)
    @       0x7fafcfde48 start_thread
Bus error

I am running Ubuntu 14 on the NVIDIA Tegra X1 with 3.8 GB of RAM. As far as I understand, this is a memory issue. Could you explain what is going wrong and how I can solve it? If any other information is needed, please let me know.

user6726469
  • You can check the basic memory requirements of the model in the initialization. That stage gives you the memory requirement for each layer. However, that doesn't increase as the model trains. By any chance, do you checkpoint at 1300 iterations (or a divisor thereof)? – Prune Sep 23 '16 at 23:55
  • Does it always happen at the 1300th iteration? – Shai Jan 09 '17 at 14:13
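
To put the two comments above in concrete terms, here is a minimal arithmetic sketch in Python using only the figures from the question (104,000 training images, batch size 16, crash at iteration 1300). It assumes one batch is consumed per iteration (that is, `iter_size` is 1) and it ignores the data reader's prefetching, so the read position is only approximate. The snapshot intervals passed in are hypothetical, because the solver settings are not shown in the question.

```python
# Figures taken from the question.
TRAIN_IMAGES = 104_000
BATCH_SIZE = 16
CRASH_ITERATION = 1300


def iterations_per_epoch(num_images: int, batch_size: int) -> int:
    """Iterations needed to see every training image once (ceiling division)."""
    return -(-num_images // batch_size)


def images_read_by(iteration: int, batch_size: int) -> int:
    """Approximate number of training records consumed by a given iteration.

    Assumes one batch per iteration and ignores prefetching.
    """
    return iteration * batch_size


def coincides_with_snapshot(iteration: int, snapshot_interval: int) -> bool:
    """True if the iteration is a multiple of a (hypothetical) snapshot interval."""
    return snapshot_interval > 0 and iteration % snapshot_interval == 0


if __name__ == "__main__":
    per_epoch = iterations_per_epoch(TRAIN_IMAGES, BATCH_SIZE)
    consumed = images_read_by(CRASH_ITERATION, BATCH_SIZE)
    print("iterations per epoch:", per_epoch)
    print("records consumed at crash:", consumed)
    print("fraction of first epoch:", consumed / TRAIN_IMAGES)

    # Hypothetical snapshot intervals, for illustration only.
    for interval in (100, 500, 1300):
        print(interval, coincides_with_snapshot(CRASH_ITERATION, interval))
```

If the crash reproduces at the same iteration on every run, the approximate record position gives a starting point for inspecting that region of the LMDB, while a matching snapshot interval would instead point at the checkpointing step.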

0 Answers