
I am trying to reduce memory usage and speed up inference for my own fine-tuned transformer. I came across the pruning tutorial on the Hugging Face site and am referring to the snippet below. The trainer.train() call is missing from the tutorial, so I added it. It runs without error, but there is no reduction in memory: model.get_memory_footprint() reports the same value before and after pruning (Model memory footprint: 503695916 bytes). The same goes for inference speed. I also tried different pruning configurations (global pruning, other pruning types, higher target sparsities), but it did not help. Can someone help me?

from optimum.intel.neural_compressor import INCTrainer
from neural_compressor import WeightPruningConfig
from transformers import AutoModelForSequenceClassification, TrainingArguments
from transformers.data.data_collator import default_data_collator

# Magnitude pruning: zero out the 20% of weights with the smallest absolute
# value, layer by layer ("local" scope), between training steps 0 and 15.
pruning_config = WeightPruningConfig(
    pruning_type="magnitude",
    start_step=0,
    end_step=15,
    target_sparsity=0.2,
    pruning_scope="local",
)

save_dir = "prunedModel"

trainer = INCTrainer(
    model=model,
    pruning_config=pruning_config,
    args=TrainingArguments(
        save_dir,
        max_steps=500,
        num_train_epochs=1.0,
        do_train=True,
        do_eval=True,
        metric_for_best_model="f1",
        greater_is_better=True,
    ),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=processor,
    data_collator=default_data_collator,
)
train_result = trainer.train()  # <-- Added by me
trainer.save_model(save_dir)    # <-- Added by me
optimized_model = AutoModelForSequenceClassification.from_pretrained(save_dir)

memory_footprint = optimized_model.get_memory_footprint()
print(f"Model memory footprint: {memory_footprint} bytes")

Expected behavior: per the tutorial, the model should be pruned, so the unpruned and pruned models should have different sizes, but both report the same memory footprint.

  • Is pruning weights in this fashion intended to save memory? It seems like the weights which have zero magnitude will still take the same amount of memory. I think pruning is intended to reduce on-disk size after compression. – Nick ODell Feb 25 '23 at 01:42
  • How can I check whether the model is actually pruned if the disk size stays the same even after magnitude pruning? Also, this article suggests sparsity gives a 25X speedup during inference; if there is no change in the model's disk size, what factors make the model 25X faster? https://medium.com/intel-analytics-software/structured-pruning-for-transformer-based-models-116e949ef12c – Jyoti yadav Feb 25 '23 at 16:18
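
As a hedged illustration of the point in the comments above (assuming the optimized_model from the question): the zeros do not reduce RAM usage or the raw checkpoint size, but they compress well, so the on-disk size drops once the checkpoint is compressed:

import gzip, io, torch

buf = io.BytesIO()
torch.save(optimized_model.state_dict(), buf)  # serialize the (still dense) weights
raw = buf.getvalue()
print(f"raw checkpoint size:  {len(raw)} bytes")
print(f"gzip-compressed size: {len(gzip.compress(raw))} bytes")  # shrinks when many weights are zero

As far as I can tell, the 25X figure in the linked article comes from structured pruning, where whole blocks or heads are removed and a sparsity-aware runtime skips the corresponding computation; unstructured magnitude pruning does not change the dense matrix shapes, so standard dense kernels run at the same speed.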

0 Answers