RuntimeError: Found dtype Long but expected Float when fine-tuning using Trainer API

Question

I'm trying to fine-tune a BERT model for sentiment analysis (classifying text as positive/negative) with the Hugging Face Trainer API. My dataset has two columns, Text and Sentiment, and looks like this:

Text                     Sentiment
This was good place          1
This was bad place           0

Here is my code:

from datasets import load_dataset
from datasets import load_dataset_builder
from datasets import Dataset
import datasets
import transformers
from transformers import TrainingArguments
from transformers import Trainer

dataset = load_dataset('csv', data_files='./train/test.csv', sep=';')
tokenizer = transformers.BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model = transformers.BertForSequenceClassification.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1", num_labels=1) 
def tokenize_function(examples):
    return tokenizer(examples["Text"], truncation=True, padding='max_length')

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column('Sentiment', 'label')
tokenized_datasets = tokenized_datasets.remove_columns('Text')
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=tokenized_datasets['train']
)
trainer.train()

Running this throws an error:

Variable._execution_engine.run_backward(
RuntimeError: Found dtype Long but expected Float

The error may come from the dataset itself, but can I fix it in my code somehow? I searched the Internet, and this error seems to have been solved previously by "converting tensors to float", but how would I do that with the Trainer API? Any advice is highly appreciated.

Some reference:

https://discuss.pytorch.org/t/run-backward-expected-dtype-float-but-got-dtype-long/61650/10

is it possible your loss function is _binary_ cross-entropy instead of multi-variate cross-entropy? — Shai, Dec 30 '21 at 10:40
Here is my full dataset, I currently use a small sample of it: https://github.com/JereRajala00/training-data — Mr. Engineer, Dec 30 '21 at 16:43
@Shai Shouldn't binary cross-entropy be used with binary classification task? What do you think causes this error? — Mr. Engineer, Dec 30 '21 at 16:53
@Mr.Engineer in contrast to multi-label CE, the binary CE in pytorch expects the labels to be floats in range [0,1] — Shai, Dec 30 '21 at 17:15
Alright... how would I use multi-label CE, or convert values of Sentiment-column into float? — Mr. Engineer, Dec 30 '21 at 20:06

score 3 · Answer 1 · answered Dec 30 '21 at 22:53

Most likely, the problem is with the loss function. This can be fixed by setting up the model correctly, mainly by specifying the correct loss to use. Refer to this code to see the logic for deciding the proper loss.

Your problem has binary labels and thus should be framed as a single-label classification problem. However, because the code you shared sets num_labels=1 and does not set a problem type, the model infers a regression problem. Regression uses a loss that expects float targets, which explains the error: it expected Float but found Long for the target labels.
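To make the inference rule concrete, here is a minimal pure-Python sketch of the decision as I understand it from the transformers sequence-classification heads. The function name infer_problem_type, the string dtypes, and the lookup table are illustrative stand-ins of my own, not library API; check the model source for your installed version.

```python
def infer_problem_type(num_labels, label_dtype, problem_type=None):
    """Return the problem type a sequence-classification head would settle on.

    label_dtype is a plain string such as "long", "int" or "float"
    (a hypothetical stand-in for the dtype of the label tensor).
    """
    if problem_type is not None:
        # An explicit setting always wins over inference.
        return problem_type
    if num_labels == 1:
        return "regression"
    if label_dtype in ("long", "int"):
        return "single_label_classification"
    return "multi_label_classification"


# Loss paired with each problem type, and the label dtype that loss expects.
LOSS_FOR_PROBLEM_TYPE = {
    "regression": ("MSELoss", "float"),
    "single_label_classification": ("CrossEntropyLoss", "long"),
    "multi_label_classification": ("BCEWithLogitsLoss", "float"),
}

# The setup from the question: one output unit, integer labels.
problem = infer_problem_type(num_labels=1, label_dtype="long")
loss_name, expected_dtype = LOSS_FOR_PROBLEM_TYPE[problem]
print(problem, loss_name, expected_dtype)  # regression MSELoss float
```

Under this reading, the question's configuration lands on a float-expecting loss while the CSV supplies integer labels, which is the mismatch reported in the traceback.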

You need to pass the correct problem type.

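model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=1,
    problem_type="multi_label_classification",  # the problem type that selects BCE loss
)

This will make use of BCE loss. Note that in transformers it is "multi_label_classification" that maps to BCE (BCEWithLogitsLoss); "single_label_classification" maps to cross-entropy and needs num_labels of at least 2. For BCE loss the targets must be floats with the same shape as the logits (one value per output unit), so you also have to cast the labels to float. I think you can do that with the dataset API. See this.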

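The other way is to treat it as a multi-class problem with CE loss. For that, just fixing num_labels should be enough: with two output units and integer labels, the problem type is inferred as single-label classification, so the labels can stay as integers.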

model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1", 
    num_labels=2,
)
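As a rough illustration of why the two options want different label dtypes, the sketch below implements both losses in plain Python. The helper names cross_entropy and bce_with_logits are my own and only mimic what the corresponding PyTorch losses compute for a single example; they are not library calls.

```python
import math


def cross_entropy(logits, target_index):
    """Multi-class CE: the integer label selects one entry of the log-softmax."""
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return log_norm - logits[target_index]


def bce_with_logits(logit, target):
    """Binary CE on one logit: the label enters the arithmetic, so it is a float."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Two output units + integer label (num_labels=2, labels stay 0/1 integers).
ce = cross_entropy([0.0, 2.0], target_index=1)

# One output unit + float label (labels cast to 0.0/1.0).
bce = bce_with_logits(2.0, target=1.0)

# For two classes the formulations agree when the first logit is fixed at 0.
print(math.isclose(ce, bce))  # True
```

In the CE version the label is used as an index, so it has to be an integer; in the BCE version it is multiplied into the loss, so it has to be a float. Either framing can express a binary task as long as the label dtype matches the loss.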

Alright, I was able to get past this error by converting the Sentiment column into floats with the pandas library. I'm training the model now; hopefully I will be able to use it for inference. — Mr. Engineer, Dec 31 '21 at 15:16
Also, don't forget to use the correct problem type. Even though MSE may work, for classification you should use "*CE" loss — Umang Gupta, Jan 05 '22 at 17:30

score 0 · Answer 2 · answered Dec 31 '21 at 12:04

Here I am assuming that you are trying to do single-label classification, that is, to predict one class per example rather than several.

But the loss function you use (I don't know which one it is, but it is probably BCE) expects a vector as the label.

So either you convert your labels to vectors, as people suggested in the comments, or you replace the loss function with cross-entropy loss and set the number of labels to 2 (or however many classes you have). Both solutions will work.

If you want to train your model as a multi-label classifier, you can convert your labels to vectors using sklearn.preprocessing:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

dataset = pd.read_csv("filename.csv", encoding="utf-8")

# Map the raw labels to integer class ids 0..n_classes-1
enc_labels = LabelEncoder()
int_encoded = enc_labels.fit_transform(dataset["Sentiment"].to_numpy())

# One-hot encode the class ids (scikit-learn < 1.2 spells this argument sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(int_encoded.reshape(-1, 1))

# Store one float vector per row, e.g. [1.0, 0.0] for class 0
dataset["Sentiment"] = onehot_encoded.tolist()

score 0 · Answer 3 · answered Apr 26 '22 at 19:11

You could cast your data.

If you have it in pandas format, you could do:

df['column_name'] = df['column_name'].astype(float)

If you have it in Hugging Face datasets format, you can do something like this:

from datasets import ClassLabel, Value, load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')

# Copy the current schema, change the types you need, then cast
new_features = dataset.features.copy()
new_features["idx"] = Value('int64')
new_features["label"] = ClassLabel(names=['negative', 'positive'])
# For float targets, Value('float32') should work as the label feature instead
dataset = dataset.cast(new_features)

Before:

dataset.features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}

After:

dataset.features

{'idx': Value(dtype='int64', id=None),
 'label': ClassLabel(num_classes=2, names=['negative', 'positive'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}
