
I am currently trying to use the Hugging Face Trainer in a for-loop-style setting: I train on a single data example and then evaluate, for each example in my dataset - so I initialize a Trainer and call trainer.train() multiple times in my script. The reason I am using Trainer is its easy DeepSpeed integration, which I need to fit a larger model on my GPU.

Right after calling trainer.train(), memory usage on my GPU spikes to ~27 GB and stays there permanently - so on later iterations of the loop, this memory is still there and, combined with the next Trainer's memory, causes an OOM error. I have tried deleting the trainer, its optimizer, its model, and the model itself, and I call torch.cuda.empty_cache() and gc.collect() often in my code. For example, here is the code at the end of each for-loop iteration:

    del model
    torch.cuda.empty_cache()
    gc.collect()

However, I still have not managed to locate what is causing this residual memory. 27 GB is roughly how much the full model should take to load on the GPU, but deleting it does not do anything. Is there any way I can fix this? I suspect del model only removes the Python-side reference and does not free the GPU memory, but I'm not quite sure how to make sure everything is released properly before the next iteration of the for loop.
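
For completeness, here is roughly the fuller teardown I have tried at the end of each iteration (a sketch; the trainer attributes are just the fields I tried clearing, and the final print is only a debugging check):

    import gc
    import torch

    # drop the references the Trainer keeps to the model/optimizer
    trainer.model = None
    trainer.optimizer = None
    trainer.lr_scheduler = None
    del trainer
    del model

    gc.collect()              # free unreachable Python objects first
    torch.cuda.empty_cache()  # then return cached CUDA blocks to the driver

    # debugging check: memory still held by live tensors
    print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB still allocated")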

Would very much appreciate help with this.

Example Code:

    from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer

    tokenizer = LlamaTokenizer.from_pretrained(llama_path)
    output = []
    for example in dataset:
        model = LlamaForCausalLM.from_pretrained(llama_path)  # download Llama-7B
        train_dataset = CustomDataset(example, ...)
        trainer = Trainer(model=model,
                          args=training_args,  # includes the path to the deepspeed config
                          train_dataset=train_dataset,
                          tokenizer=tokenizer)
        trainer.train()
        # *** evaluate model *** (assumes `example` is a dict of input tensors)
        logits = model(**example).logits
        output.append(logits)  # goal is to get these logits for each example
nlp4892
  • `del model` and `gc.collect` won't work most of the time because the `model` variable still has references pointing to it, so Python thinks the model will still be used. If you absolutely have to have a "new state" model with every iteration, just reset the weights/parameters of the model at the beginning of each iteration. This should help: https://stackoverflow.com/questions/63627997/reset-parameters-of-a-neural-network-in-pytorch – Djinn May 05 '23 at 08:06
  • Just for clarification, when I say each iteration, I mean each epoch, not each for-loop iteration. So there won't be a for-loop creating the model and deleting it. The only for-loops will be the one over the epoch count and `data.DataLoader`. You'd have the code set up as normal, as if you were doing a regular, multi-epoch training session, but at the start of each epoch, reset the weights/parameters. The model would train on each epoch as if it were on the first epoch. – Djinn May 05 '23 at 15:08
  • But wouldn't that reset the model training? I want the model to be trained for, say, 5 epochs and then to evaluate it on some other metrics (get logits, etc.). What I mean by a "new model" at each iteration is the following: I have the raw model, I train it on a subset of data, evaluate, and restart the process - train on a different subset of data, evaluate, and so on. – nlp4892 May 05 '23 at 17:33
  • Ah ok. Are you asking about k-fold cross validation then? The input data is split into subsets, which are then split into train/val/test sets. After the set number of epochs selected for training, the val and test sets are used for validation, then another subset of the input data is used for the train/val/test sets. Does that sound like what you're after? – Djinn May 05 '23 at 18:12
  • Not quite. So basically, my research is in model editing - I am trying to set up fine-tuning as a baseline. I fine-tune on one example and evaluate the performance on the test set in my dataset. I am trying to use Trainer to fine-tune; so I call trainer.train() for that one example, and then use the trained model to obtain evaluation numbers. Then, I want to reset the model, fine-tune on the next example using trainer.train(), and so on. Currently, one trainer.train() call works fine, but on the second one I get an OOM error. It appears that Trainer has some residual memory left on the GPUs – nlp4892 May 05 '23 at 18:36
  • That I haven't figured out how to delete. That is what I am trying to figure out - I'm not sure what is causing that memory. I have tried deleting the trainer, its optimizers, the model itself, setting the model to None, moving it to CPU, clearing the CUDA cache, and forcing the Python garbage collector, but for some reason there is still 27 GB of memory sitting there and I have no idea why. Apologies if I didn't make this very clear in the past... – nlp4892 May 05 '23 at 18:38
  • Do you have a simple code example that you're using that could better explain what's going on? Sorry, I think I understand, but I'm confused about how the training loop is set up. Is the Hugging Face Trainer being used as transfer learning, in which you want to further train on a subset of your samples, then reset the model to its original "pre-further-trained" state? Then use that newly reset model on another subset? – Djinn May 05 '23 at 18:42
  • Exactly. I added a quick sample code to the original response. – nlp4892 May 05 '23 at 18:48
  • I reset the model by re-downloading it at the start of each loop iteration, because I don't have space for 2 model copies on one GPU; i.e., I can't use .deepcopy() or anything. I have also tried loading a state dict of the raw downloaded model; the same issue occurred. – nlp4892 May 05 '23 at 18:52
  • Ah ok, yeah the way to reset that without re-downloading it would be, before the loop, to initialize the model and save the original weights, which would be `sd = model.state_dict()`. Then at the start of each loop iteration, assign the original weights back to the model with `model.load_state_dict(sd)`. `sd` would hold the original parameters from before the training loop. When the model is trained, its current `state_dict` will be updated with respect to the current training data subset, and loading the original parameters `sd` would essentially "reset" the model to its original, pretrained state (see the sketch after these comments). – Djinn May 05 '23 at 19:31
  • I would've put that as an answer but I'm still not 100% sure that's the solution you're after :D – Djinn May 05 '23 at 19:34
  • Yeah I tried doing that, but I'm still getting the same issue. I think I'll try and rewrite the code to use manual deepspeed (with deepspeed.initialize(), etc.) - perhaps trainer just isn't meant to be used in the way I am trying to. Thanks so much for the help anyway! – nlp4892 May 05 '23 at 19:36
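
A minimal sketch of the state_dict reset Djinn describes above, using the names from the question's example code (llama_path, dataset, training_args, CustomDataset). This is not a confirmed fix: `copy.deepcopy` is added because `state_dict()` alone returns references to the live parameters, and how this interacts with DeepSpeed's parameter partitioning is untested:

    import copy

    from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer

    tokenizer = LlamaTokenizer.from_pretrained(llama_path)
    model = LlamaForCausalLM.from_pretrained(llama_path)

    # keep a copy of the pretrained weights on CPU; without the deepcopy,
    # training would overwrite the saved tensors in place
    initial_sd = copy.deepcopy(model.state_dict())

    output = []
    for example in dataset:
        # reset to the pretrained state instead of re-downloading the model
        model.load_state_dict(initial_sd)

        train_dataset = CustomDataset(example, ...)
        trainer = Trainer(model=model,
                          args=training_args,
                          train_dataset=train_dataset,
                          tokenizer=tokenizer)
        trainer.train()

        logits = model(**example).logits  # evaluate on this example
        output.append(logits)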

0 Answers