I'm fine-tuning a Spanish RoBERTa model for a new task, text classification (more precisely, sentiment analysis), and I want to know how I can evaluate the model's performance on the test or validation data after fine-tuning.
Code
The preprocessing looks like this:
import pandas as pd

tass_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/es_train.tsv', sep='\t', header=None, usecols=[1,2])
tass_test = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/es_val.tsv', sep='\t', header=None, usecols=[1,2])
from sklearn.model_selection import train_test_split
# Create validation data from the training data
train_texts, val_texts, train_labels, val_labels = train_test_split(list(tass_train[1]), list(tass_train[2]), test_size=0.1)
# Creating lists for the test text and test labels
test_texts = list(tass_test[1])
test_labels = list(tass_test[2])
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-bne")
# Tokenize the texts to create the model inputs
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
# Convert the labels to ints
## Create a dictionary for the mapping
d = {'NEU':0, 'N':1, 'P':2}
## Map the values in the dictionary to the three lists of labels
train_labels = list(pd.Series(train_labels).map(d).astype(int))
val_labels = list(pd.Series(val_labels).map(d).astype(int))
test_labels = list(pd.Series(test_labels).map(d).astype(int))
import tensorflow as tf

# Create the TensorFlow datasets from our encodings
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
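As a quick sanity check on the preprocessing (my own addition, not part of any tutorial), I also looked at what the datasets contain:
# Each element should be a (features dict, scalar label) pair
print(train_dataset.element_spec)
print(len(train_labels), len(val_labels), len(test_labels))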
My fine-tuning looks like this:
# Training with native TensorFlow
from transformers import TFAutoModelForSequenceClassification
## Model Definition
model = TFAutoModelForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne", from_pt=True, num_labels=3)
## Model Compilation
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.metrics.SparseCategoricalAccuracy()
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metric)
## Fitting the data
history = model.fit(train_dataset.shuffle(1000).batch(64), epochs=5, batch_size=64)
I've taken the .fit() call from a HuggingFace tutorial, but what I find strange is that there's no validation/test dataset in it.
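For comparison, the general pattern I understand from the Keras documentation looks roughly like the sketch below, where train_data, val_data and test_data are just placeholders for already batched inputs, not the exact variables defined above:
# Rough sketch of the standard Keras workflow (placeholder names, not the tutorial code)
history = model.fit(train_data, validation_data=val_data, epochs=5)  # reports validation loss/metrics per epoch
results = model.evaluate(test_data)  # e.g. [loss, accuracy] with the metric compiled above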
Output
The following output I get from training is therefore, I guess, the performance on the training set only:
/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py:337: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
"Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/5
16/16 [==============================] - 37s 1s/step - loss: 1.0225 - sparse_categorical_accuracy: 0.4857
Epoch 2/5
16/16 [==============================] - 18s 1s/step - loss: 0.6771 - sparse_categorical_accuracy: 0.7177
Epoch 3/5
16/16 [==============================] - 18s 1s/step - loss: 0.3543 - sparse_categorical_accuracy: 0.8786
Epoch 4/5
16/16 [==============================] - 18s 1s/step - loss: 0.1371 - sparse_categorical_accuracy: 0.9625
Epoch 5/5
16/16 [==============================] - 18s 1s/step - loss: 0.0445 - sparse_categorical_accuracy: 0.9921
Question
How can I evaluate the performance of the model on the test/validation data?
Elaboration
I'm guessing I have to specify that when I call .fit(), so what I did is:
history = model.fit(train_dataset, validation_data=val_dataset, epochs=5, batch_size=64)
But this results in an error:
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (49, 3)).
I've also tried:
results = model.evaluate(test_dataset)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))
This won't work either, and I get the following error:
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (53, 3)).
This error is very similar to the one above.
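In case it helps with diagnosing the shape mismatch, here is how I peeked at a single element of the test dataset (a quick check of my own; I'm only describing the shapes qualitatively):
# Peeking at one raw (unbatched) element of the test dataset
features, label = next(iter(test_dataset))
print({k: v.shape for k, v in features.items()})  # each feature tensor is 1-D (one token sequence)
print(label.shape)                                # the label is a scalar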
Unfortunately, the HuggingFace documentation doesn't mention this issue, as far as I've seen.