
I have a model to train on a large data set that does not fit into RAM. My plan is to slice the data set, creating a DataSet instance with input vectors and associated labels for each chunk. E.g., if I have 1M input vectors/labels, I'd split them into 10 chunks of 100K records each.
I'd then put each chunk into two INDArray objects (one for inputs, one for labels), create a DataSet, and call model.fit() with that data set, repeating the procedure for every chunk and repeating the whole process until, say, the model's score reaches some value. My questions are:
1. Do I understand the process correctly?
2. Can the INDArray instances be reused? Would it be right to allocate them once and then just fill them up with data set chunks over and over again?

faraway

1 Answer


You don't have to do any of this. Workspaces already solve your allocation problem: http://deeplearning4j.org/workspaces

Just use the standard DataVec -> RecordReaderDataSetIterator -> DataSet pattern. That already handles minibatches for you.
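A rough sketch of that pattern, assuming DL4J/DataVec on the classpath and CSV input; the file path, label column index (100), class count (10), and batch size are all placeholders, not values from the question:

```java
import java.io.File;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class TrainingSketch {
    // Streams minibatches from disk; no manual INDArray allocation needed.
    static void train(MultiLayerNetwork model, File data, int nEpochs) throws Exception {
        RecordReader rr = new CSVRecordReader();
        rr.initialize(new FileSplit(data));
        int batchSize = 128, labelIndex = 100, numClasses = 10; // placeholders
        DataSetIterator iter = new RecordReaderDataSetIterator(rr, batchSize, labelIndex, numClasses);
        for (int epoch = 0; epoch < nEpochs; epoch++) {
            model.fit(iter);  // iterates minibatches for one epoch
            iter.reset();     // rewind the iterator for the next epoch
        }
    }
}
```

The iterator pulls one minibatch at a time from the record reader, so only a single batch is resident in memory rather than the whole data set.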

Adam Gibson
  • Do I need to explicitly use workspaces, or is there a kind of default workspace? Is there an example of the pattern you're referring to? I've mostly seen code for training on images or predefined data sets like MNIST. I have a custom file with 27M input vectors and labels I'd like to train the network on, and each vector requires some pre-processing to put it into an INDArray... – faraway Jul 11 '18 at 08:24
  • Yes, you use it as part of training. None of your further observations are relevant here. We already do all of the memory allocation for you. Look at the workspaces docs for your particular use case. – Adam Gibson Jul 11 '18 at 09:09
  • Are there any examples? I found a DataVec tutorial on converting one CSV to another using Spark. – faraway Jul 11 '18 at 12:47
  • Of what? Anything with a record reader will work: https://github.com/deeplearning4j/dl4j-examples/search?q=RecordReader&unscoped_q=RecordReader this is the main examples repo. Everything is in here. – Adam Gibson Jul 11 '18 at 21:55
  • Sorry to bother you, it may look very obvious to you, but somehow for me it's not. All the examples from your link use either CSVRecordReader or ImageRecordReader out of the box. The problem is that my input vectors and desired outputs reside in a single binary file as fixed size records. The file is neither CSV, nor image. I guess I have to implement a custom RecordReader and I was looking for an example of that. Will try to read ImageRecordReader implementation. – faraway Jul 12 '18 at 07:32
  • You would write your own custom record reader then. The iterator would handle the batching and everything else for you. If you need help implementing a custom record reader, you can look at the existing ones or we are more than glad to help you in the gitter here: https://gitter.im/deeplearning4j/deeplearning4j – Adam Gibson Jul 12 '18 at 08:16
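The parsing logic at the heart of such a custom record reader is plain binary I/O. A minimal sketch, assuming a hypothetical layout of fixed-size records, each holding FEATURE_COUNT floats followed by one int label (the class name, field names, and layout are illustrative, not from the question); in a real custom reader this logic would live inside a DataVec RecordReader implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FixedRecordParser {
    static final int FEATURE_COUNT = 4; // placeholder feature width
    static final int RECORD_BYTES = FEATURE_COUNT * Float.BYTES + Integer.BYTES;

    // Reads one fixed-size record from the stream; returns null at end of data.
    // The label is stored as the last element of the returned row.
    static float[] nextRecord(DataInputStream in) throws IOException {
        if (in.available() < RECORD_BYTES) return null;
        float[] row = new float[FEATURE_COUNT + 1];
        for (int i = 0; i < FEATURE_COUNT; i++) row[i] = in.readFloat();
        row[FEATURE_COUNT] = in.readInt();
        return row;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory "file" of 3 records for demonstration.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int r = 0; r < 3; r++) {
            for (int i = 0; i < FEATURE_COUNT; i++) out.writeFloat(r + i * 0.1f);
            out.writeInt(r % 2); // label
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        List<float[]> records = new ArrayList<>();
        float[] row;
        while ((row = nextRecord(in)) != null) records.add(row);
        System.out.println("records=" + records.size()
                + " label0=" + (int) records.get(0)[FEATURE_COUNT]);
        // prints "records=3 label0=0"
    }
}
```

Because every record has the same byte length, the reader can also seek directly to record i at offset i * RECORD_BYTES, which makes batching and resetting cheap.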