
I am working on a low-resource language and need to build a classifier. I used the tokenizers library to train four tokenizers: WLV (WordLevel), BPE, UNI (Unigram), and WPC (WordPiece). I have saved each one to a JSON file.
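For reference, the WLV one was trained and saved along these lines (the corpus file and the special-token list here are placeholders for my actual setup):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# WordLevel (WLV) model; the other three swap in BPE, Unigram and WordPiece
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# [PAD] sits at index 3 here, matching the pad_id=3 used further down
trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer_WLV.json")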

I load each tokenizer with the Tokenizer.from_file function:

from tokenizers import Tokenizer

tokenizer_WLV = Tokenizer.from_file('tokenizer_WLV.json')
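All four load the same way (the file names follow the tokenizer_<NAME>.json pattern), so I keep them in a dict:

loaded_tokenizers = {
    name: Tokenizer.from_file(f'tokenizer_{name}.json')
    for name in ('WLV', 'BPE', 'UNI', 'WPC')
}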

and I can see that it loads properly. However, only the encode method exists.

So if I run tokenizer_WLV.encode(s1), I get output like

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

and I can see each token along with its id as follows:

out_wlv = tokenizer_WLV.encode(s1)
print(out_wlv.ids)
print(out_wlv.tokens)
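The other attributes listed in the Encoding repr are accessible the same way, e.g.:

print(out_wlv.offsets)          # (start, end) character spans into s1
print(out_wlv.attention_mask)   # 1 for real tokens, 0 for padding
print(tokenizer_WLV.id_to_token(out_wlv.ids[0]))  # map an id back to its token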

I can also use encode_batch:

def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    # Truncate anything longer than max_seq_len tokens
    tokenizer.enable_truncation(max_length=max_seq_len)
    # Pad shorter sentences on the right with [PAD] (id 3) up to the longest in the batch
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]", direction='right')
    tokenized_sentences = tokenizer.encode_batch(sentences)
    return tokenized_sentences

which results in something like

[Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

I need to build a feature matrix of size m×n, where m is the number of observations and n is the number of unique tokens. encode_plus does this automatically. So I am curious: what is the most efficient way to construct this feature matrix?

Areza

1 Answer


encode_plus is a method of the huggingface transformers tokenizers (but it is already deprecated and should therefore be avoided).

The replacement that the huggingface transformers tokenizers provide is __call__:

tokenizer_WLV(s1)
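Note that the plain Tokenizer returned by Tokenizer.from_file does not implement __call__ itself, so the call above requires wrapping the saved tokenizer in transformers' PreTrainedTokenizerFast first. A minimal sketch (the [PAD] token comes from the question; padding assumes it is in the trained vocabulary):

from transformers import PreTrainedTokenizerFast

# Re-load the trained tokenizer through the transformers API, which adds
# __call__ (the replacement for the deprecated encode_plus)
tokenizer_WLV = PreTrainedTokenizerFast(
    tokenizer_file='tokenizer_WLV.json',
    pad_token='[PAD]',
)

# Single sentence: a dict with input_ids, attention_mask, ...
print(tokenizer_WLV(s1))

# Whole batch, padded to the longest sentence and returned as NumPy arrays:
# one rectangular matrix with one row per observation
batch = tokenizer_WLV(sentences, padding=True, return_tensors='np')
print(batch['input_ids'].shape)  # (number of sentences, padded length)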
cronoik