Our team is trying to use DePlot, a vision-transformer model that converts charts into tables (text).
The architecture itself isn't especially complex compared to other transformer models, but DePlot is fine-tuned on top of another vision transformer, which is in turn fine-tuned from a base model called Pix2Struct. The problem is that our team needs the model to work in other languages.
Would it be enough to fine-tune DePlot itself, or would we need to fine-tune the underlying base models as well? Also, if we wanted the same chart-to-table functionality in other languages without fine-tuning, what other options would there be?
We've thought about replacing the vocab list with one for another language and training on that, but concluded it wasn't worth it. Any help would be appreciated.
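For reference, the vocab route we looked at would go roughly like this with HuggingFace Transformers and the `google/deplot` checkpoint. This is only a sketch of the idea, not something we've validated; it extends the existing vocabulary rather than replacing it, so the pretrained token embeddings are kept and only the new rows need training:

```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor


def extend_deplot_vocab(new_tokens):
    """Sketch: load DePlot, add target-language tokens, resize the embedding table.

    `new_tokens` is a list of strings (e.g. subwords for the target language).
    Returns the model, the processor, and how many tokens were actually added.
    """
    processor = Pix2StructProcessor.from_pretrained("google/deplot")
    model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

    # Add new tokens to the existing tokenizer instead of swapping the vocab
    # wholesale; tokens already present are silently skipped.
    num_added = processor.tokenizer.add_tokens(new_tokens)

    # Grow the model's embedding matrix so the new token ids have rows;
    # the new rows are randomly initialized and need fine-tuning.
    model.resize_token_embeddings(len(processor.tokenizer))
    return model, processor, num_added
```

After this, fine-tuning on chart/table pairs in the target language would train the new embeddings; a full vocab replacement would instead throw away everything the decoder learned, which is why we concluded it wasn't worth it.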