
I'm using aitextgen to fine-tune the 355M GPT-2 model using the train function. The datasets are small txt files consisting of lines like the ones below; the texts are encoded for keyword-based text generation, hence the ~^keywords~@ markers:

<|startoftext|>~^~@"Yes, but one forgets that she is there--or anywhere. She seems as if she were an accident."<|endoftext|>
<|startoftext|>~^man~@"Then jump out and unharness this horse. A man will come for it to- morrow."<|endoftext|>
<|startoftext|>~^mind 's~@"It would upset the house terribly," said Nan; "but I don't mind that. I'm with you, Patty. Let's do it."<|endoftext|>
<|startoftext|>~^Booth sure say wish~@"I wish I were sure that I had," said Booth.<|endoftext|>
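
For clarity, here is a minimal sketch of how lines in this format can be produced. The ~^/~@ delimiters and the start/end tokens match the samples above; the helper function itself is hypothetical, just to illustrate the structure:

    # Hypothetical helper illustrating the encoding used above: a space-separated
    # keyword list sits between "~^" and "~@", followed by the target text,
    # all wrapped in GPT-2's start/end tokens.
    def encode_line(keywords, text):
        return "<|startoftext|>~^" + " ".join(keywords) + "~@" + text + "<|endoftext|>"

    print(encode_line(["Booth", "sure", "say", "wish"],
                      '"I wish I were sure that I had," said Booth.'))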

I use aitextgen's training function like this:

    gpt2 = aitextgen(tf_gpt2="355M", to_gpu=True)

    gpt2.train(dataset,
               line_by_line=True,
               batch_size=1,
               num_steps=50,
               save_every=10,
               generate_every=10,
               learning_rate=1e-3,
               fp16=False)
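
Here, dataset is the path to one of the txt files described above; a placeholder definition for completeness (the filename is not my real one):

    dataset = "encoded_lines.txt"  # placeholder: path to one of the txt files above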

When I run this function, I get this output:

0%|          | 0/10000 [00:00<?, ?it/s]
Windows does not support multi-GPU training. Setting to 1 GPU.
C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\pytorch_lightning\trainer\connectors\callback_connector.py:147: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=False)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=False)`.
  rank_zero_deprecation(
C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\pytorch_lightning\trainer\connectors\callback_connector.py:90: LightningDeprecationWarning: Setting `Trainer(progress_bar_refresh_rate=20)` is deprecated in v1.5 and will be removed in v1.7. Please pass `pytorch_lightning.callbacks.progress.TQDMProgressBar` with `refresh_rate` directly to the Trainer's `callbacks` argument instead. Or, to disable the progress bar pass `enable_progress_bar = False` to the Trainer.
  rank_zero_deprecation(
C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\pytorch_lightning\trainer\connectors\callback_connector.py:167: LightningDeprecationWarning: Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed in v1.7. Please set `Trainer(enable_model_summary=False)` instead.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  0%|          | 0/50 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\transformers\modeling_utils.py", line 1364, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location="cpu")
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 882, in _load
    result = unpickler.load()
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 857, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 845, in load_tensor
    storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 205852672 bytes.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Josh\Python Projects\FYP\src\[py file name].py", line 34, in <module>
    gpt2 = aitextgen(tf_gpt2 = "355M", to_gpu= True)
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\aitextgen\aitextgen.py", line 166, in __init__
    self.model = GPT2LMHeadModel.from_pretrained(model, config=config)
  File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\transformers\modeling_utils.py", line 1368, in from_pretrained
    if f.read().startswith("version"):
MemoryError

I have tried several things, including clearing the CUDA cache with torch.cuda.empty_cache() and splitting the files into even smaller ones, but none of them worked.

I'm running this on my local machine (RTX 3070, 32 GB RAM). I checked Task Manager and the RAM usage barely hits 50%. Is there anything wrong with my code that's causing the memory errors?
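
Update: looking more closely at the traceback, the failure occurs inside multiprocessing\spawn.py, i.e. while a spawned child process re-imports my script. On Windows, the spawn start method re-runs all top-level code in each worker process, so the model load at module level (line 34 in the traceback) would execute again in every worker, each one allocating the 355M checkpoint in host RAM. In case that is the cause, here is a minimal sketch of the guarded structure (the file path is a placeholder):

    # Sketch: guard the entry point so that spawned worker processes on Windows
    # do not re-run the 355M model load when they re-import this module.
    from aitextgen import aitextgen

    def main():
        dataset = "encoded_lines.txt"  # placeholder: path to the training txt file

        gpt2 = aitextgen(tf_gpt2="355M", to_gpu=True)
        gpt2.train(dataset,
                   line_by_line=True,
                   batch_size=1,
                   num_steps=50,
                   save_every=10,
                   generate_every=10,
                   learning_rate=1e-3,
                   fp16=False)

    if __name__ == "__main__":
        main()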

Cephylist

  • "DefaultCPUAllocator: not enough memory" -- you are running out of host memory, not GPU memory – talonmies Jan 03 '22 at 07:04
  • I get that, but this shouldn't be the case. My RAM usage is 50%, and the dataset is a mere 3 MB; loading it shouldn't be an issue. – Cephylist Jan 03 '22 at 07:15
  • You can see that the traceback is blowing up in the deserialization routine, and your "mere 3 MB" dataset is trying to allocate 205 MB for one tensor. If the dataset isn't corrupted somehow, then you are trying to load a huge and extremely sparse dataset, which is really running you out of memory – talonmies Jan 03 '22 at 07:24
  • I tried running the code on Colab and it seems to work (originally, I was using Spyder); the model was successfully trained. The only difference is that in the Colab code I'm loading the file from my Google Drive, while in the Spyder code I'm loading it from a local file path. I'm an NLP newbie, so my apologies if I'm not being detailed enough in explaining my issues. – Cephylist Jan 03 '22 at 08:24
  • I have no idea about the inner workings of these models either, just the Python-to-GPU parts, and I see no problem there. Sorry I can't be of more help – talonmies Jan 03 '22 at 08:55

0 Answers