
I am working with a transformers pipeline to get BERT embeddings for my input. Without a pipeline I am able to get a constant output size, but not with the pipeline, since I was not able to pass tokenizer arguments to it.

How can I pass tokenizer arguments to my pipeline?

from transformers import AutoTokenizer, AutoModel, pipeline

# BERT model and tokenizer definitions
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = ['hello world']

# Normally I would initialize the tokenizer like this and get a constant output size
tokens = tokenizer(inputs, padding='max_length', truncation=True, max_length=500, return_tensors="pt")
model(**tokens)[0].detach().numpy().shape
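# -> (1, 500, 768): batch size, padded sequence length, hidden size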


# Using the pipeline, there is no obvious way to pass the tokenizer arguments
nlp = pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)

# Another option I tried: passing the arguments when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT", padding='max_length', truncation=True, max_length=500, return_tensors="pt")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

nlp = pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)

# Calling the pipeline
nlp("hello world")

I have tried several approaches, like the options listed above, but was not able to get a constant output size. One can achieve a constant output size by setting the tokenizer arguments, but I have no idea how to pass those arguments through the pipeline.

Any ideas?


1 Answer


The max_length tokenization parameter is not supported by default (i.e., no padding to max_length is applied), but you can create your own pipeline class and override this behavior:

from transformers import AutoTokenizer, AutoModel
from transformers import FeatureExtractionPipeline
from transformers.tokenization_utils import TruncationStrategy

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = ['hello world']

class MyFeatureExtractionPipeline(FeatureExtractionPipeline):
    def _parse_and_tokenize(
        self, inputs, max_length, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs
    ):
        """
        Parse arguments and tokenize
        """
        # Parse arguments
        if getattr(self.tokenizer, "pad_token", None) is None:
            padding = False
        inputs = self.tokenizer(
            inputs,
            add_special_tokens=add_special_tokens,
            return_tensors=self.framework,
            padding=padding,
            truncation=truncation,
            max_length=max_length
        )
        return inputs

mynlp = MyFeatureExtractionPipeline(model=model, tokenizer=tokenizer)
o = mynlp("hello world", max_length = 500, padding='max_length', truncation=True)
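The keyword arguments of the pipeline call are forwarded by the base Pipeline's __call__ to _parse_and_tokenize, which is how max_length, padding, and truncation reach the tokenizer.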

Let us compare the size of the output:

print(len(o))
print(len(o[0]))
print(len(o[0][0]))

Output:

1
500
768
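That is one sequence in the batch, padded to 500 tokens, with a 768-dimensional hidden state per token, so the output size is now constant regardless of the input length.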

Please note that this will only work with transformers 4.10.x and earlier. The team is currently refactoring the pipeline classes, and future releases will require different adjustments (i.e., this will stop working as soon as the refactored pipelines are released).
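For reference, the refactored FeatureExtractionPipeline in later releases documents a tokenize_kwargs argument for exactly this purpose, so on a recent transformers version something along these lines should work (a minimal sketch; check the pipeline documentation of your installed version):

from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

nlp = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

# Tokenizer arguments are forwarded via tokenize_kwargs in the refactored pipelines
o = nlp("hello world", tokenize_kwargs={"padding": "max_length", "truncation": True, "max_length": 500})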
