
I'm running into a problem when trying to tokenize with DistilBERT. I'm working in a Jupyter Notebook.

Here's my code:

import re
import numpy as np

# tokenizer (a DistilBERT tokenizer) and corpus (my list of texts) are defined in earlier cells

maxlen = 50

# tokens: clean each text, tokenize, truncate, and add special tokens
maxqnans = int((maxlen - 20) / 2)
corpus_tokenized = ["[CLS] " +
             " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '',
             str(txt).lower().strip()))[:maxqnans]) +
             " [SEP] " for txt in corpus]

# masks: 1 for real tokens, 0 for the padding positions
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(
           txt.split(" "))) for txt in corpus_tokenized]

# padding: pad each sequence with [PAD] up to maxlen
txt_seq = [txt + " [PAD]"*(maxlen - len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

# idx: convert each padded sequence to token ids
idx = [tokenizer.encode(seq.split(" ")) for seq in txt_seq]

# segments: segment id increments after each [SEP]
segments = []
for seq in txt_seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
            i += 1
    segments.append(temp)

# vector: the three model inputs
X_train = [np.asarray(idx, dtype='int32'),
           np.asarray(masks, dtype='int32'),
           np.asarray(segments, dtype='int32')]

The traceback says the problem is in this line of code:

idx = [tokenizer.encode(seq.split(" ")) for seq in txt_seq]

I'm getting the following error:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
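
In case it helps narrow things down, the call pattern reduces to something like this (distilbert-base-uncased here is just an example checkpoint, not necessarily the exact one I use):

from transformers import AutoTokenizer

# example DistilBERT checkpoint; in my notebook the tokenizer is loaded in an earlier cell
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# encode() receives a list of tokens (from .split(" ")) instead of a single string
tokenizer.encode("[CLS] some tweet text [SEP]".split(" "))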

Can anybody help me with this? Thank you!

ccc
1 Answer


The argument you pass to the tokenizer.encode method is a list of strings, so I guess your list contains some elements that are not strings (NaN, for example).

Sorry, I can only guess at this error because I don't have your data. Can you debug your data?
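
If it helps, this is the kind of check I mean (using the corpus and txt_seq variables from your code; adjust the names if they differ in your notebook):

# look for entries in the raw corpus that are not plain strings (e.g. NaN floats)
bad = [(i, x) for i, x in enumerate(corpus) if not isinstance(x, str)]
print(bad[:10])

# same check on the padded sequences that are fed to tokenizer.encode
bad_seq = [(i, s) for i, s in enumerate(txt_seq) if not isinstance(s, str)]
print(bad_seq[:10])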

Tan Phan
  • I checked and all of the elements are strings. That's why I'm pretty confused. I'm using the Kaggle dataset https://www.kaggle.com/crowdflower/twitter-airline-sentiment – ccc Nov 05 '21 at 06:06
  • I am going to try your code. What is your corpus variable? – Tan Phan Nov 05 '21 at 06:25
  • I created a [kernel](https://www.kaggle.com/phanttan/textencodeinput-error) reproducing your error. I am learning Hugging Face/transformers, so I'm happy to meet you. – Tan Phan Nov 06 '21 at 02:30