I am working on a low-resource language and need to build a classifier. I used the tokenizers library to train four tokenizers: WLV, BPE, UNI, and WPC, and saved each of them to a JSON file.
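For reference, each tokenizer was trained and saved roughly like this (a minimal sketch for the WLV case; the corpus file name and the special-token list are placeholders):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# word-level model; the other tokenizers swap in BPE, Unigram, or WordPiece
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt stands in for my training data
tokenizer.save("tokenizer_WLV.json")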
I load each of the tokenizers with the Tokenizer.from_file function:
from tokenizers import Tokenizer

tokenizer_WLV = Tokenizer.from_file('tokenizer_WLV.json')
and I can see that it is loaded properly. However, only the encode method is available, so when I call tokenizer_WLV.encode(s1) I get an output like
Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
and I can see each token along with its id as follows:
out_wlv = tokenizer_WLV.encode(s1)
print(out_wlv.ids)
print(out_wlv.tokens)
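I can also inspect the vocabulary itself, which gives me the n I will need later (get_vocab_size and get_vocab are methods of the Tokenizer object):

print(tokenizer_WLV.get_vocab_size())   # number of unique tokens in the vocabulary
vocab = tokenizer_WLV.get_vocab()       # dict mapping token string -> id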
I can also use encode_batch inside a helper function:
def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    # max_seq_len is currently unused; enable_padding pads to the longest sentence in the batch
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]", direction='right')
    tokenized_sentences = tokenizer.encode_batch(sentences)
    return tokenized_sentences
which results in something like
[Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
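Since padding makes every Encoding in the batch the same length, I can stack the ids directly into an array (a small sketch with numpy), but that gives me m x padded_length rather than m x number-of-unique-tokens:

import numpy as np

encodings = tokenize_sentences(sentences, tokenizer_WLV)
ids = np.array([enc.ids for enc in encodings])              # shape: (num_sentences, padded_length)
mask = np.array([enc.attention_mask for enc in encodings])  # 1 for real tokens, 0 for [PAD]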
I need to build a feature matrix of size m x n, where m is the number of observations and n is the number of unique tokens; encode_plus does this automatically. So I am curious: what is the most efficient way to construct this feature matrix?
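One way I can do it by hand is a token-count matrix (a rough sketch with numpy, reusing tokenize_sentences from above and skipping the [PAD] id 3 that enable_padding inserts), but I am not sure this is the most efficient:

import numpy as np

def encodings_to_matrix(encodings, vocab_size, pad_id=3):
    # m x n matrix: one row per sentence, one column per token id in the vocabulary
    X = np.zeros((len(encodings), vocab_size), dtype=np.int64)
    for row, enc in enumerate(encodings):
        for token_id in enc.ids:
            if token_id != pad_id:  # ignore padding positions
                X[row, token_id] += 1
    return X

X = encodings_to_matrix(tokenize_sentences(sentences, tokenizer_WLV), tokenizer_WLV.get_vocab_size())
print(X.shape)  # (number of sentences, vocabulary size)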