
I have a codebase that was working fine, but today when I tried to run it, I noticed that tokenizer.encode_plus stopped returning attention_mask. Has it been removed in the latest release, or do I need to do something else?

The following piece of code was working for me.

encoded_dict = tokenizer.encode_plus(
    truncated_query,
    span_doc_tokens,
    max_length=max_seq_length,
    return_overflowing_tokens=True,
    pad_to_max_length=True,
    stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
    truncation_strategy="only_second",
    return_token_type_ids=True,
    return_attention_mask=True,
)

But now, encode_plus returns only dict_keys(['input_ids', 'token_type_ids']). I also noticed that the returned input_ids are not padded to max_length.

cronoik
Wasi Ahmad

1 Answer


I figured out the issue. I had updated the tokenizers package to 0.7.0, the latest version. However, the latest version of transformers works with tokenizers 0.5.2. After rolling back to 0.5.2, the issue disappeared. With pip show, I see the following.

Name: transformers
Version: 2.8.0
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers
Name: tokenizers
Version: 0.5.2
Summary: Fast and Customizable Tokenizers
Home-page: https://github.com/huggingface/tokenizers
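The rollback can be done with pip. This is a minimal sketch assuming a pip-based environment; adjust if you use conda or a lock file:

```shell
# Pin tokenizers to the version that transformers 2.8.0 expects
pip install tokenizers==0.5.2

# Confirm the installed versions match the output shown above
pip show transformers tokenizers
```

Pinning the exact version in requirements.txt (tokenizers==0.5.2) avoids the same breakage on the next environment rebuild.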
Wasi Ahmad