
After fixing the "Error loading ND4J Compressors" problem (thank you, Adam!), I now get the following error: java.lang.RuntimeException: Failed to allocate 4735031021 bytes from HOST memory

(or java.lang.RuntimeException: cudaMalloc failed; Bytes: [4735031021]; Error code [2]; DEVICE [0])

17:31:16.143 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
17:32:10.593 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
17:32:10.625 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows Server 2019]
17:32:10.625 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [8]; Memory: [8,0GB];
17:32:10.625 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
17:32:10.657 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.6.55
17:32:10.657 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3090]; cc: [8.6]; Total memory: [25769279488]
17:32:10.657 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
 MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
17:44:35.415 [main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
17:44:39.735 [main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 0 is 7.222021991720728
Exception in thread "main" java.lang.RuntimeException: Failed to allocate 4735031021 bytes from HOST memory
        at org.nd4j.jita.memory.CudaMemoryManager.allocate(CudaMemoryManager.java:70)
        at org.nd4j.jita.workspace.CudaWorkspace.init(CudaWorkspace.java:88)
        at org.nd4j.linalg.api.memory.abstracts.Nd4jWorkspace.initializeWorkspace(Nd4jWorkspace.java:508)
        at org.nd4j.linalg.api.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:658)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:2040)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2813)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2756)
        at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:174)
        at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:61)
        at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2357)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2315)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2378)
        at FAClassifierLearning.main(FAClassifierLearning.java:120)

It looks like the error comes from model.fit(allTrainingData), after the first iteration.

The error appears only when using the GPU; everything works correctly on the CPU.

When running, I tried passing the parameters -Xmx28g -Dorg.bytedeco.javacpp.maxbytes=30G, but with no success.

My code

//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File("vector.txt")));

//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 5422;   // index of the label column; columns 0..5421 are features
int numClasses = 1170;   // number of distinct class labels
int batchSize = 4000;

DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();

List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();

while (iterator.hasNext()) {
    DataSet allData = iterator.next();
    allData.shuffle();
    SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.9);  // Use 90% of data for training
    trainingData.add(testAndTrain.getTrain());
    testData.add(testAndTrain.getTest());
}

DataSet allTrainingData = DataSet.merge(trainingData);
DataSet allTestData = DataSet.merge(testData);

//We need to normalize our data. We'll use NormalizerStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(allTrainingData);           // Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(allTrainingData);     // Apply normalization to the training data
normalizer.transform(allTestData);         // Apply normalization to the test data. This is using statistics calculated from the *training* set

long seed = 6;
int firstHiddenLayerSize = labelIndex/6;
int secondHiddenLayerSize = firstHiddenLayerSize/4;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed)
        .activation(Activation.TANH)
        .weightInit(WeightInit.XAVIER)
        //.updater(new Sgd(0.1))
        .updater(Adam.builder().build())
        .l2(1e-4)
        .list()
        .layer(new DenseLayer.Builder().nIn(labelIndex).nOut(firstHiddenLayerSize)
                .build())
        .layer(new DenseLayer.Builder().nIn(firstHiddenLayerSize).nOut(secondHiddenLayerSize)
                .build())
        .layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
                .nIn(secondHiddenLayerSize).nOut(numClasses).build())
        .build();

//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));

for(int i=0; i<5000; i++) {
    model.fit(allTrainingData);
}

//evaluate the model on the test set
Evaluation eval = new Evaluation(numClasses);

INDArray output = model.output(allTestData.getFeatures());

eval.eval(allTestData.getLabels(), output);
log.info(eval.stats());

// Save the Model
File locationToSave = new File("trained-model.zip");
model.save(locationToSave, true);

// Save DataNormalization
NormalizerSerializer ns = NormalizerSerializer.getDefault();
ns.write(normalizer, new File("trained-normalizer.bin"));

Updated code that fixes the error (only the changed parts are shown)

...
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();

List<DataSet> trainingData = new ArrayList<>();

while (iterator.hasNext()) {
    trainingData.add(iterator.next());
}

DataSet allTrainingData = DataSet.merge(trainingData);

// We need to normalize our data. We'll use NormalizerStandardize (which gives us mean 0, unit variance):
// Same as in the code above

// MultiLayerConfiguration conf...
// Same as in the code above
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

List<DataSet> allTrainingDataBatched = allTrainingData.batchBy(10000);
for (int i=0; i<5000; i++) {
    for (DataSet dataSet: allTrainingDataBatched) {
        model.fit(dataSet);
    }
}
...
apollox

1 Answer

Your GPU is not able to keep up with whatever you have locally.

HOST memory is your normal CPU RAM. GPU RAM is what's called device memory. Those are separate address spaces, each with its own limits.
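To make the distinction concrete, here is a minimal sketch that prints the limits the JVM itself can see. It uses only standard Java calls; the class name `MemoryLimits` is made up for illustration. It assumes the `org.bytedeco.javacpp.maxbytes` property from the question is what caps JavaCPP's off-heap (host) allocations. Device memory does not show up here at all and has to be checked with an external tool.

```java
// Illustrative helper (hypothetical name): shows which host-side limits are in effect.
class MemoryLimits {
    /** Maximum JVM heap in bytes, as set by -Xmx or the JVM default. */
    static long heapLimitBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Raw value of the JavaCPP off-heap limit property, or "unset" if it was not passed. */
    static String offHeapLimitSetting() {
        return System.getProperty("org.bytedeco.javacpp.maxbytes", "unset");
    }

    public static void main(String[] args) {
        System.out.println("JVM heap limit (bytes): " + heapLimitBytes());
        System.out.println("org.bytedeco.javacpp.maxbytes: " + offHeapLimitSetting());
        // GPU (device) memory is not visible from these calls.
    }
}
```

Running it with the same flags as the training job (for example `-Xmx28g -Dorg.bytedeco.javacpp.maxbytes=30G`) shows whether the settings actually reached the process.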

If you are running on a smaller GPU there might not be much you can do.

A few considerations:
  • Consider shrinking your batch size.
  • Minimize allocations on the GPU.
  • Only create your datasets after you are ready.
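As a rough illustration of why batch size matters, the sketch below estimates how many bytes one batch of features plus one-hot labels occupies. The row, feature and class counts are the ones from the question. The 4 bytes per value (float32) is an assumption, and the class name `BatchMemoryEstimate` is made up. It deliberately ignores activations, gradients, updater state and workspace overhead, so real usage will be higher.

```java
// Back-of-the-envelope estimate (hypothetical helper, not part of DL4J).
class BatchMemoryEstimate {
    static final int BYTES_PER_FLOAT = 4; // assumption: data is stored as float32

    /** Bytes needed for a dense rows x columns float array. */
    static long arrayBytes(long rows, long columns) {
        return rows * columns * BYTES_PER_FLOAT;
    }

    /** Bytes for the feature matrix plus the one-hot label matrix of one batch. */
    static long batchBytes(long rows, long features, long classes) {
        return arrayBytes(rows, features) + arrayBytes(rows, classes);
    }

    public static void main(String[] args) {
        long features = 5422; // feature columns, from the question
        long classes = 1170;  // number of classes, from the question
        long[] batchRows = {230_000, 10_000, 1_000};
        for (long rows : batchRows) {
            double gigabytes = batchBytes(rows, features, classes) / 1e9;
            System.out.printf("%d rows -> %.2f GB%n", rows, gigabytes);
        }
    }
}
```

Passing the whole merged dataset to a single `fit` call puts all ~230000 rows into one batch, while smaller batches only need a fraction of that on the device at any one time.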

Monitor your GPU RAM using whatever tools are available on your platform of choice, such as the Windows Process Explorer or nvidia-smi.

Feel free to post below and I can try to offer more specific advice on your particular GPU.

Adam Gibson
  • Thank you Adam! My current configuration (I run on a server from a VPS provider): Intel Xeon Processor (Cascadelake) 1.50 GHz, RAM 64.0 GB, GPU RTX 3090 24 GB. My app reads a CSV file containing 0s and 1s, 5422 digits per row, ~230000 rows. >>**Consider shrinking your batch size** Which batch size do you mean? The one in this line: `DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build()`? But the error occurs after the iterator has been used, on the line `model.fit(allTrainingData)` – apollox Mar 17 '23 at 07:23
  • >>Minimize allocations on the GPU What allocation size would be better in my case? I tried specifying different values of -Dorg.bytedeco.javacpp.maxbytes (and also not specifying this parameter at all). >>only create your datasets after you are ready Sorry, I don't quite understand... My question shows how I create the datasets (starting from `DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build()` and the next 11 lines). Perhaps this is not the best option? – apollox Mar 17 '23 at 07:29
  • After changing batchSize from 4000 to 2000 and playing with the -Xmx and -Dorg.bytedeco.javacpp.maxbytes params, I now get the error `Exception in thread "main" java.lang.RuntimeException: cudaMalloc failed; Bytes: [4735031021]; Error code [2]; DEVICE [0]` – apollox Mar 17 '23 at 07:32
  • Reducing the input CSV file from 230,000 records to 200,000 (each record has 5422 zeros and ones) solved the problem. While model.fit(allTrainingData) is running, Windows Task Manager shows the following values: RAM - 30/64 GB (46%), dedicated GPU memory - 21.5/24 GB. Evidently almost the entire GPU memory is occupied, and increasing the input file by several thousand records leads to an error. Is it possible to use the free RAM (~30 GB) in addition to the GPU memory, for example by tuning -Xmx? I tried tuning both -Xmx and -Dorg.bytedeco.javacpp.maxbytes, but no luck. – apollox Mar 17 '23 at 10:40
  • Trying to respond to you here one at a time. First, on your batch size: ">>Consider shrinking your batch size: Which batch size do you mean? In this line: `DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build()`" Note that just because you create the iterator doesn't mean anything is loaded. Nothing happens until you call fit, so my advice is still applicable here. – Adam Gibson Mar 17 '23 at 23:26
  • A few more questions. Could you clarify your version? Memory usage can vary a bit between versions. Regarding memory configuration, try setting the GC frequency and reading this: https://deeplearning4j.konduit.ai/multi-project/explanation/configuration/memory#configuring-memory-limits – Adam Gibson Mar 17 '23 at 23:29
  • Thank you Adam! Shrinking batch size reduced GPU RAM consumption. Version 1.0.0-M2.1 – apollox Mar 22 '23 at 07:40