I'm fine-tuning a Spanish RoBERTa model for a new task, text classification (more precisely, sentiment analysis), and I want to know how I can evaluate the model's performance on the test or validation data after fine-tuning.
Code
The preprocessing looks like this:
import pandas as pd

tass_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/es_train.tsv', sep='\t', header=None, usecols=[1,2])
tass_test = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/es_val.tsv', sep='\t', header=None, usecols=[1,2])
from sklearn.model_selection import train_test_split
# Create validation data from the training data
train_texts, val_texts, train_labels, val_labels = train_test_split(list(tass_train[1]), list(tass_train[2]), test_size=0.1)
# Creating lists for the test text and test labels
test_texts = list(tass_test[1])
test_labels = list(tass_test[2])
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-bne")
# Tokenize the texts to create the model inputs
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
# Convert the labels to ints
## Create a dictionary for the mapping
d = {'NEU':0, 'N':1, 'P':2}
## Map the values in the dictionary to the three lists of labels
train_labels = list(pd.Series(train_labels).map(d).astype(int))
val_labels = list(pd.Series(val_labels).map(d).astype(int))
test_labels = list(pd.Series(test_labels).map(d).astype(int))
import tensorflow as tf

# Create the TensorFlow datasets from our encodings
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
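As a quick sanity check on the preprocessing (my own addition, not part of any tutorial), I also looked at what the datasets contain:
# Each element should be a (features dict, scalar label) pair
print(train_dataset.element_spec)
print(len(train_labels), len(val_labels), len(test_labels))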
My fine-tuning looks like this:
# Training with native TensorFlow
from transformers import TFAutoModelForSequenceClassification
## Model Definition
model = TFAutoModelForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne", from_pt=True, num_labels=3)
## Model Compilation
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.metrics.SparseCategoricalAccuracy()
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metric)
## Fitting the data
history = model.fit(train_dataset.shuffle(1000).batch(64), epochs=5, batch_size=64)
I've taken the .fit() call from a HuggingFace tutorial, but what I find strange is that there's no validation/test dataset in it.
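For comparison, the general pattern I understand from the Keras documentation looks roughly like the sketch below, where train_data, val_data and test_data are just placeholders for already batched inputs, not the exact variables defined above:
# Rough sketch of the standard Keras workflow (placeholder names, not the tutorial code)
history = model.fit(train_data, validation_data=val_data, epochs=5)  # reports validation loss/metrics per epoch
results = model.evaluate(test_data)  # e.g. [loss, accuracy] with the metric compiled above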
Output
The following output I get from training is therefore, I guess, the performance on the training set only:
/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py:337: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
"Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/5
16/16 [==============================] - 37s 1s/step - loss: 1.0225 - sparse_categorical_accuracy: 0.4857
Epoch 2/5
16/16 [==============================] - 18s 1s/step - loss: 0.6771 - sparse_categorical_accuracy: 0.7177
Epoch 3/5
16/16 [==============================] - 18s 1s/step - loss: 0.3543 - sparse_categorical_accuracy: 0.8786
Epoch 4/5
16/16 [==============================] - 18s 1s/step - loss: 0.1371 - sparse_categorical_accuracy: 0.9625
Epoch 5/5
16/16 [==============================] - 18s 1s/step - loss: 0.0445 - sparse_categorical_accuracy: 0.9921
Question
How can I evaluate the performance of the model on the test/validation data?
Elaboration
I'm guessing I have to specify that when I call .fit(), so what I did is:
history = model.fit(train_dataset, validation_data=val_dataset, epochs=5, batch_size=64)
But this results in an error:
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (49, 3)).
I've also tried:
results = model.evaluate(test_dataset)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))
This won't work either, and I get the following error:
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (53, 3)).
This error is very similar to the one above.
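In case it helps with diagnosing the shape mismatch, here is how I peeked at a single element of the test dataset (a quick check of my own; I'm only describing the shapes qualitatively):
# Peeking at one raw (unbatched) element of the test dataset
features, label = next(iter(test_dataset))
print({k: v.shape for k, v in features.items()})  # each feature tensor is 1-D (one token sequence)
print(label.shape)                                # the label is a scalar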
Unfortunately, the HuggingFace documentation doesn't mention this issue, as far as I've seen.