
I have a codebase that was working fine, but today when I tried to run it, I noticed that tokenizer.encode_plus stopped returning attention_mask. Has it been removed in the latest release, or do I need to do something else?

The following piece of code was working for me.

encoded_dict = tokenizer.encode_plus(
    truncated_query,
    span_doc_tokens,
    max_length=max_seq_length,
    return_overflowing_tokens=True,
    pad_to_max_length=True,
    stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
    truncation_strategy="only_second",
    return_token_type_ids=True,
    return_attention_mask=True,
)

But now, encode_plus returns only dict_keys(['input_ids', 'token_type_ids']). I also noticed that the returned input_ids are not padded to max_length.

cronoik
Wasi Ahmad

1 Answer


I figured out the issue. I had updated the tokenizers package to 0.7.0, the latest version. However, the latest version of transformers works with tokenizers 0.5.2. After rolling back to 0.5.2, the issue disappeared. With pip show, I see the following.

Name: transformers
Version: 2.8.0
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers
Name: tokenizers
Version: 0.5.2
Summary: Fast and Customizable Tokenizers
Home-page: https://github.com/huggingface/tokenizers
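The rollback can be done with pip. This is a minimal sketch assuming a pip-based environment; adjust if you use conda or a lock file:

```shell
# Pin tokenizers to the version that transformers 2.8.0 expects
pip install tokenizers==0.5.2

# Confirm the installed versions match the output shown above
pip show transformers tokenizers
```

Pinning the exact version in requirements.txt (tokenizers==0.5.2) avoids the same breakage on the next environment rebuild.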
Wasi Ahmad