I tried to run a notebook initialised on Google Colab on Kaggle and ran into some strange behaviour: it gave me something like this:
16 # text2tensor
---> 17 train_seq,train_mask,train_y = textToTensor(train_text,train_labels,pad_len)
18 val_seq,val_mask,val_y = textToTensor(val_text,val_labels,pad_len)
19
<ipython-input-9-ee85c4607a30> in textToTensor(text, labels, max_len)
4 tokens = tokenizer.batch_encode_plus(text.tolist(), max_length=max_len, padding='max_length', truncation=True)
5
----> 6 text_seq = torch.tensor(tokens['input_ids'])
7 text_mask = torch.tensor(tokens['attention_mask'])
8
ValueError: expected sequence of length 38 at dim 1 (got 13)
The error came from the code below:
def textToTensor(text, labels=None, max_len=38):  # max_len is 38
    tokens = tokenizer.batch_encode_plus(text.tolist(), max_length=max_len, padding='max_length', truncation=True)

    text_seq = torch.tensor(tokens['input_ids'])  # ERROR CAME FROM HERE
    text_mask = torch.tensor(tokens['attention_mask'])

    text_y = None
    if isinstance(labels, np.ndarray):
        text_y = torch.tensor(labels.tolist())

    return text_seq, text_mask, text_y

train_seq, train_mask, train_y = textToTensor(train_text, train_labels, pad_len)

train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
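For what it's worth, the ValueError suggests the nested lists in tokens['input_ids'] are ragged — some rows came back length 38 and at least one came back length 13, as if padding was not applied — and torch.tensor requires rectangular input. A minimal, hypothetical check (assert_uniform_length is my own helper name, not from my code) placed before the torch.tensor call would surface this directly:

```python
def assert_uniform_length(input_ids, max_len):
    """Fail early with a clear message if any sequence was not padded to max_len."""
    lengths = sorted({len(seq) for seq in input_ids})
    if lengths != [max_len]:
        raise ValueError(f"expected all sequences of length {max_len}, got lengths {lengths}")

# Hand-made data mimicking the failure: one row shorter than max_len.
try:
    assert_uniform_length([[1] * 38, [2] * 13], 38)
except ValueError as e:
    print(e)
```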
I ran this code again on Colab and it ran smoothly. Could this be because of the version differences or something like that? Can someone please help?
Kaggle Configs:
transformers: '2.11.0'
torch: '1.5.1'
python: 3.7.6
Colab Configs:
torch: 1.7.0+cu101
transformers: 3.5.1
python: 3.6.9
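One difference that may matter (my own guess, worth checking against the docs for each version): the padding='max_length' keyword belongs to the tokenizer API introduced in transformers 3.x, while the 2.x batch_encode_plus API used pad_to_max_length=True instead, so on transformers 2.11.0 the padding argument may simply be ignored and the sequences left unpadded. A hedged sketch of selecting the keyword per installed version (pick_padding_kwargs is a hypothetical helper, not a library function):

```python
def pick_padding_kwargs(transformers_version: str, max_len: int) -> dict:
    """Return batch_encode_plus kwargs matching the installed transformers version."""
    major = int(transformers_version.split('.')[0])
    if major >= 3:
        # New-style API (transformers >= 3.0)
        return {'max_length': max_len, 'padding': 'max_length', 'truncation': True}
    # Old-style API (transformers 2.x): padding is requested via pad_to_max_length
    return {'max_length': max_len, 'pad_to_max_length': True}

print(pick_padding_kwargs('2.11.0', 38))
print(pick_padding_kwargs('3.5.1', 38))
```

The returned dict would then be splatted into the call, e.g. tokenizer.batch_encode_plus(text.tolist(), **pick_padding_kwargs(transformers.__version__, max_len)).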
EDIT:
My train_text is a numpy array of texts, and train_labels is a 1-D numerical array with 4 classes ranging 0-3.
Also: I initialized my tokenizer as:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')