
Does mxnet store data/models somewhere outside of R? I keep running into scenarios where the first NN run of the day will produce good results, and every following run (even of the exact same code) will produce NA/NaN for all training steps.

Example: https://github.com/xup6fup/MxNetR-examples/blob/master/1.%20Basic%20models/3.%20softmax%20regression/1.%20Standard%20example.R

I copied and pasted the code as is, ran it and got about 70% accuracy. I noticed that the device was set to cpu, and I have gpu version compiled. So I changed it to gpu, reran ..... all NaN. Clear R session workspace, rerun original code with cpu, all NA.

Restart Rstudio server, rerun exact code.... all NA. It seems like SOMETHING is being stored outside of rstudio server and it interferes with subsequent FeedForward. I have this issue with multiple mxnet tutorials, where often they will work the first time, but subsequently will fail, even with identical code run.

Garglesoap
  • Are you sure you didn't accidentally store some faulty model (line 50) and read it back every time (line 54)? – liori Nov 18 '17 at 22:24
  • That's a good suggestion, but I haven't run up to that line of the code, just to line 44. After fiddling a bit, it seems that restarting the instance or RStudio resets something that allows the NN to run. Using RStudio's 'clear session workspace' must miss something that interferes with subsequent NN runs. – Garglesoap Nov 19 '17 at 05:49
  • Nope, lost it again. Restarting both instance or rstudio still results in NaN training. I know it's saving something somewhere... – Garglesoap Nov 19 '17 at 06:12
  • Just clearing the workspace only cleans R variables. MXNet is a native library which stores the model outside R variables (say, on the GPU), so all of its internal state is indeed kept "outside of R". But this wouldn't explain why it doesn't work after restarting the whole RStudio process: that ought to destroy everything that wasn't actually kept in permanent storage, e.g. on the file system. – liori Nov 19 '17 at 16:55
  • Unless RStudio is saving MXNet data to a workspace that is then loaded with RStudio every time. How does one go about clearing MXNet's internal state? It seems quite counter-intuitive that going outside of R is required. – Garglesoap Nov 23 '17 at 17:28
  • Can you try without RStudio, that is—run `R --vanilla` and source this script from there manually? This way you can check if the problem is with RStudio or not. – liori Nov 23 '17 at 18:51
  • Base R and use of rm(list = ls(all.names = TRUE)) seemed to help a bit, but there are still severe inconsistencies: the Sonar example is the single working tutorial, training accuracy decreases, and sometimes the rm command doesn't seem to work and I have to quit/restart R. The Boston housing, MNIST and Iris tutorials all give NA or NaN training accuracy. – Garglesoap Nov 24 '17 at 02:06

1 Answer


If the library was compiled before Nov 12 2017, it likely contains a bug that was present in the random initialization for some time, which caused the initial weights to be all nearly 0. Near-zero initial weights would explain the NA/NaN training results you're seeing.
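A quick way to check whether your build is affected (as suggested in the comments below) is to sample from mx.runif and inspect the values; the threshold used here is an illustrative choice, not part of the original answer:

```r
library(mxnet)

# Draw a small uniform random matrix. On an affected build, the values
# come back all nearly 0 instead of being spread over [0, 1).
x <- mx.runif(shape = 2:3)
print(as.array(x))

# If every sampled value is essentially zero, the build predates the fix
# and the network's initial weights will also be near zero, which
# typically produces NA/NaN training accuracy.
if (all(abs(as.array(x)) < 1e-6)) {
  message("Affected build: recompile a newer MXNet (>= 0.12.1).")
}
```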

jeremiedb
  • I compiled mine late Oct, but it was the pre-installed version that comes with AWS deep learning. Do I need an even newer version than 0.12.0? Does 'git clone --recursive https://github.com/apache/incubator-mxnet.git' get the latest version? I think I can install that alongside the base AMI mxnet version – Garglesoap Nov 29 '17 at 08:38
  • The bug appeared in 0.11.1 but was only resolved around 0.12.1. To validate if this is the issue, try: mx.runif(shape=2:3). If it's all nearly 0s, then you'll need to recompile following the official instructions. Your above command is correct to get the latest version. – jeremiedb Nov 30 '17 at 01:35
  • You're right, I'm running 0.12.0, although mx.runif(shape=2:3) doesn't return any 0s in the matrix. I tried removing mxnet and cloning 0.12.1, but attempting to build the R package has dragged me back into the Linux quagmire of missing LID lib variables that sucked 50+ hours on the first installation. – Garglesoap Nov 30 '17 at 22:17
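Pulling the comments above together, a rebuild might look roughly like this. The checked-out tag, BLAS choice, and CUDA flag are assumptions for illustration; follow the official installation instructions for your platform:

```shell
# Fetch the source with submodules, as suggested in the comments above.
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet

# Check out a release that postdates the initializer fix (assumed tag).
git checkout 0.12.1
git submodule update --init --recursive

# Build the core library; BLAS/CUDA flags here are illustrative and
# depend on your machine (drop USE_CUDA=1 for a CPU-only build).
make -j"$(nproc)" USE_BLAS=openblas USE_CUDA=1

# Build and install the R package from the freshly built library.
make rpkg
```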