I have been trying to build a BERT model for a specific domain. However, my model is trained on non-English text, so I'm worried that the default vocabulary size, 30522, won't be a good fit for my data.
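For context, this is roughly how I plan to train my own vocabulary, using the Hugging Face `tokenizers` library; the file path and the 32000 value are just placeholders I picked, not anything taken from the BERT paper:

```python
# Rough sketch of training a domain-specific WordPiece vocabulary.
# The corpus path and vocab_size below are placeholders, not recommendations.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)  # my corpus is cased, non-English

tokenizer.train(
    files=["my_domain_corpus.txt"],  # placeholder path to raw text, one sentence per line
    vocab_size=32000,                # the number I'm unsure how to choose
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

tokenizer.save_model("my_tokenizer")  # writes vocab.txt for use with BERT
```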
Does anyone know where the number 30522 came from?
I suspect the researchers arrived at it by trading off training time against vocabulary coverage, but a clearer explanation would be appreciated.