I built a Transformer-based text matching model on the Quora dataset, but the F1 scores of the models I designed around a Transformer encoder are quite low, around 70%. Even for my reproduction of ESIM, once I replace the encoder with a Transformer the F1 drops to only about 70% as well. How are hyperparameters usually set for Transformer-type models in text matching (e.g., on the Quora dataset)?
Could you give me some suggestions for choosing hyperparameters?
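For reference, this is roughly the kind of setup I mean — a minimal PyTorch sketch of a Siamese Transformer-encoder matcher. The hyperparameter values here (`d_model`, layer count, dropout, etc.) are illustrative starting points commonly suggested for small datasets trained from scratch, not my exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative starting hyperparameters for a from-scratch Transformer on
# a small sentence-pair dataset like Quora — assumptions, not tuned values.
CONFIG = {
    "vocab_size": 30000,      # assumed vocabulary size
    "d_model": 256,           # keep small; big models overfit without pretraining
    "nhead": 4,
    "num_layers": 3,
    "dim_feedforward": 512,
    "dropout": 0.1,
    "max_len": 64,
}

class TransformerMatcher(nn.Module):
    """Siamese Transformer encoder for sentence-pair classification."""

    def __init__(self, cfg):
        super().__init__()
        self.embed = nn.Embedding(cfg["vocab_size"], cfg["d_model"], padding_idx=0)
        self.pos = nn.Embedding(cfg["max_len"], cfg["d_model"])
        layer = nn.TransformerEncoderLayer(
            d_model=cfg["d_model"], nhead=cfg["nhead"],
            dim_feedforward=cfg["dim_feedforward"],
            dropout=cfg["dropout"], batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, cfg["num_layers"])
        # ESIM-style feature combination of the two sentence vectors:
        # [a; b; |a - b|; a * b]
        self.classifier = nn.Sequential(
            nn.Linear(4 * cfg["d_model"], cfg["d_model"]),
            nn.ReLU(),
            nn.Dropout(cfg["dropout"]),
            nn.Linear(cfg["d_model"], 2))

    def encode(self, ids):
        # ids: (batch, seq_len) token ids, 0 = padding
        positions = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        x = self.embed(ids) + self.pos(positions)
        pad_mask = ids.eq(0)  # True where padding
        h = self.encoder(x, src_key_padding_mask=pad_mask)
        # Masked mean pooling over non-padding positions
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        return h.sum(1) / (~pad_mask).sum(1, keepdim=True).clamp(min=1)

    def forward(self, a, b):
        va, vb = self.encode(a), self.encode(b)
        feats = torch.cat([va, vb, (va - vb).abs(), va * vb], dim=-1)
        return self.classifier(feats)  # (batch, 2) logits

model = TransformerMatcher(CONFIG)
a = torch.randint(1, CONFIG["vocab_size"], (4, 16))
b = torch.randint(1, CONFIG["vocab_size"], (4, 16))
logits = model(a, b)  # shape: (4, 2)
```

With this kind of model I would typically pair Adam at a learning rate around 1e-4 with warmup, since Transformers trained from scratch are known to be sensitive to the schedule — but again, those are the values I am asking about.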