
I'm using HuggingFace and debugging NLLB-MoE with the VSCode debugger.

But the model parameters are so big that loading them takes a long time for every single execution.

Can I just load the parameters into CPU DRAM once and then use them whenever needed?
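
What I'm imagining is something like the sketch below, keeping all weights pinned in CPU memory and skipping the redundant random initialization. I haven't verified that these from_pretrained options actually help for this particular checkpoint, so treat the arguments as assumptions:

import torch
from transformers import AutoModelForSeq2SeqLM

# Sketch (untested for nllb-moe-54b): keep the weights in CPU DRAM and
# avoid building a randomly initialized copy before loading the checkpoint.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-moe-54b",
    low_cpu_mem_usage=True,     # stream weights instead of initializing twice
    torch_dtype=torch.float16,  # halve the memory footprint, if fp16 is acceptable
    device_map={"": "cpu"},     # place every module on the CPU
)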

The code is below; you can measure how much time it takes just to load the weights.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

article = "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage."
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"], max_length=50
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
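
The workaround I've been considering is to load the model once in a long-running process and attach the VSCode debugger to it, so the weights stay resident in CPU DRAM across debugging sessions instead of being reloaded on every run. This is just a sketch and the debugpy usage here is my assumption, not something I've verified with NLLB-MoE:

import debugpy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Pay the loading cost only once; as long as this process stays alive,
# the weights remain in CPU DRAM.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

# Let VSCode attach ("Python: Remote Attach" on port 5678) whenever
# I want to step through generate(), without restarting the script.
debugpy.listen(5678)
debugpy.wait_for_client()
debugpy.breakpoint()

Is something along these lines workable for a model of this size, or is there a better way to avoid reloading the weights for every debug run?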
