
I'm using HuggingFace and debugging NLLB-MoE with the VSCode debugger.

But the model parameters are so big that loading them takes a long time for every single execution.

Can I just load the parameters into CPU DRAM once and then use them whenever needed?
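
What I'm imagining is something like the sketch below, keeping all weights pinned in CPU memory and skipping the redundant random initialization. I haven't verified that these from_pretrained options actually help for this particular checkpoint, so treat the arguments as assumptions:

import torch
from transformers import AutoModelForSeq2SeqLM

# Sketch (untested for nllb-moe-54b): keep the weights in CPU DRAM and
# avoid building a randomly initialized copy before loading the checkpoint.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-moe-54b",
    low_cpu_mem_usage=True,     # stream weights instead of initializing twice
    torch_dtype=torch.float16,  # halve the memory footprint, if fp16 is acceptable
    device_map={"": "cpu"},     # place every module on the CPU
)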

The code is below; you can measure how much time it takes just to load the weights.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

article = "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage."
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"], max_length=50
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
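
The workaround I've been considering is to load the model once in a long-running process and attach the VSCode debugger to it, so the weights stay resident in CPU DRAM across debugging sessions instead of being reloaded on every run. This is just a sketch and the debugpy usage here is my assumption, not something I've verified with NLLB-MoE:

import debugpy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Pay the loading cost only once; as long as this process stays alive,
# the weights remain in CPU DRAM.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

# Let VSCode attach ("Python: Remote Attach" on port 5678) whenever
# I want to step through generate(), without restarting the script.
debugpy.listen(5678)
debugpy.wait_for_client()
debugpy.breakpoint()

Is something along these lines workable for a model of this size, or is there a better way to avoid reloading the weights for every debug run?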
