Answering my own question... (apparently encouraged)
I achieved this using a transient file (`NamedTemporaryFile`), which does the trick. I was hoping to find an in-memory solution (i.e., passing the `BytesIO` directly to `from_pretrained`), but that would require a patch to the `transformers` codebase. A pure-PyTorch workaround that does stay in memory is sketched at the end of this answer.
```python
import boto3
import json
from contextlib import contextmanager
from io import BytesIO
from tempfile import NamedTemporaryFile

from transformers import PretrainedConfig, PreTrainedModel


@contextmanager
def s3_fileobj(bucket, key):
    """
    Yields a file object from the filename at {bucket}/{key}

    Args:
        bucket (str): Name of the S3 bucket where your model is stored
        key (str): Relative path from the base of the bucket, including the
            filename and extension of the object to be retrieved.
    """
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    yield BytesIO(obj["Body"].read())


def load_model(bucket, path_to_model, model_name='pytorch_model'):
    """
    Load a model at the given S3 path. It is assumed that your model is stored at the key:

        '{path_to_model}/{model_name}.bin'

    and that a config has also been generated at the same path named:

        '{path_to_model}/config.json'
    """
    tempfile = NamedTemporaryFile()
    with s3_fileobj(bucket, f'{path_to_model}/{model_name}.bin') as f:
        tempfile.write(f.read())
        tempfile.flush()  # ensure the bytes are on disk before from_pretrained reads them

    with s3_fileobj(bucket, f'{path_to_model}/config.json') as f:
        dict_data = json.load(f)
        config = PretrainedConfig.from_dict(dict_data)

    model = PreTrainedModel.from_pretrained(tempfile.name, config=config)
    return model


model = load_model('my_bucket', 'path/to/model')
```
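
As a footnote on the in-memory approach: a workaround that never touches disk is to deserialize the weights yourself with `torch.load` (which accepts any file-like object) and load them into a model built from the config. This is only a sketch under some assumptions I'm making: the checkpoint is a plain PyTorch state dict whose keys match the model class, and you know the concrete class up front (`BertConfig`/`BertModel` below are stand-ins for whatever your model actually is):

```python
import json

import torch
from transformers import BertConfig, BertModel  # stand-ins: substitute your model's concrete classes


def load_model_in_memory(bucket, path_to_model, model_name='pytorch_model'):
    # Build the model skeleton from the config stored alongside the weights
    with s3_fileobj(bucket, f'{path_to_model}/config.json') as f:
        config = BertConfig.from_dict(json.load(f))
    model = BertModel(config)  # randomly initialised at this point

    # torch.load accepts a file-like object, so the BytesIO never hits disk.
    with s3_fileobj(bucket, f'{path_to_model}/{model_name}.bin') as f:
        state_dict = torch.load(f, map_location='cpu')

    # Pass strict=False here if the checkpoint carries extra head weights
    model.load_state_dict(state_dict)
    model.eval()
    return model
```

The trade-off is that you hard-code the model class instead of letting `from_pretrained` infer it, but everything stays in memory end to end.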