I tried to run a notebook initialised on Google Colab on Kaggle and ran into some strange behaviour: it gave me something like this:
16 # text2tensor
---> 17 train_seq,train_mask,train_y = textToTensor(train_text,train_labels,pad_len)
18 val_seq,val_mask,val_y = textToTensor(val_text,val_labels,pad_len)
19
<ipython-input-9-ee85c4607a30> in textToTensor(text, labels, max_len)
4 tokens = tokenizer.batch_encode_plus(text.tolist(), max_length=max_len, padding='max_length', truncation=True)
5
----> 6 text_seq = torch.tensor(tokens['input_ids'])
7 text_mask = torch.tensor(tokens['attention_mask'])
8
ValueError: expected sequence of length 38 at dim 1 (got 13)
The error came from the code below:
def textToTensor(text, labels=None, max_len=38):  # max_len is 38
    tokens = tokenizer.batch_encode_plus(text.tolist(), max_length=max_len, padding='max_length', truncation=True)

    text_seq = torch.tensor(tokens['input_ids'])  # ERROR CAME FROM HERE
    text_mask = torch.tensor(tokens['attention_mask'])

    text_y = None
    if isinstance(labels, np.ndarray):
        text_y = torch.tensor(labels.tolist())

    return text_seq, text_mask, text_y

train_seq, train_mask, train_y = textToTensor(train_text, train_labels, pad_len)

train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
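For what it's worth, the ValueError suggests the nested lists in tokens['input_ids'] are ragged — some rows came back length 38 and at least one came back length 13, as if padding was not applied — and torch.tensor requires rectangular input. A minimal, hypothetical check (assert_uniform_length is my own helper name, not from my code) placed before the torch.tensor call would surface this directly:

```python
def assert_uniform_length(input_ids, max_len):
    """Fail early with a clear message if any sequence was not padded to max_len."""
    lengths = sorted({len(seq) for seq in input_ids})
    if lengths != [max_len]:
        raise ValueError(f"expected all sequences of length {max_len}, got lengths {lengths}")

# Hand-made data mimicking the failure: one row shorter than max_len.
try:
    assert_uniform_length([[1] * 38, [2] * 13], 38)
except ValueError as e:
    print(e)
```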
I ran this code again on Colab and it ran smoothly. Could this be because of the version differences or something like that? Can someone please help?
Kaggle Configs:
transformers: '2.11.0'
torch: '1.5.1'
python: 3.7.6
Colab Configs:
torch: 1.7.0+cu101
transformers: 3.5.1
python: 3.6.9
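One difference that may matter (my own guess, worth checking against the docs for each version): the padding='max_length' keyword belongs to the tokenizer API introduced in transformers 3.x, while the 2.x batch_encode_plus API used pad_to_max_length=True instead, so on transformers 2.11.0 the padding argument may simply be ignored and the sequences left unpadded. A hedged sketch of selecting the keyword per installed version (pick_padding_kwargs is a hypothetical helper, not a library function):

```python
def pick_padding_kwargs(transformers_version: str, max_len: int) -> dict:
    """Return batch_encode_plus kwargs matching the installed transformers version."""
    major = int(transformers_version.split('.')[0])
    if major >= 3:
        # New-style API (transformers >= 3.0)
        return {'max_length': max_len, 'padding': 'max_length', 'truncation': True}
    # Old-style API (transformers 2.x): padding is requested via pad_to_max_length
    return {'max_length': max_len, 'pad_to_max_length': True}

print(pick_padding_kwargs('2.11.0', 38))
print(pick_padding_kwargs('3.5.1', 38))
```

The returned dict would then be splatted into the call, e.g. tokenizer.batch_encode_plus(text.tolist(), **pick_padding_kwargs(transformers.__version__, max_len)).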
EDIT:
My train_text is a numpy array of texts, and train_labels is a 1-D numerical array with 4 classes ranging 0-3.
Also: I initialized my tokenizer as:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')