I'm currently developing a question-answering system in Indonesian using BERT for my thesis. Both the dataset and the questions are in Indonesian.
The problem is that I'm still not clear on the step-by-step process for developing a question-answering system with BERT.
From what I gathered after reading a number of journal articles and papers, the process might look like this:
- Prepare the main dataset
- Load the pre-trained data
- Train on the main dataset starting from the pre-trained data (so that it produces a "fine-tuned" model)
- Cluster the fine-tuned model
- Testing (giving questions to the system)
- Evaluation
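For the evaluation step, this is roughly what I have in mind: a minimal sketch of SQuAD-style exact match and token-level F1 between a predicted answer and a reference answer. The whitespace tokenization, the function names, and the Indonesian example strings are my own assumptions for illustration, not something taken from a specific paper or library.

```python
from collections import Counter


def normalize(text):
    # Assumption: lowercase and split on whitespace only.
    # A real setup might also strip punctuation or use a proper tokenizer.
    return text.lower().split()


def exact_match(prediction, reference):
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    # Token-level F1 based on the multiset overlap of the two answers.
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example, made up for illustration.
prediction = "di Jakarta"
reference = "Jakarta"
print(exact_match(prediction, reference))
print(f1_score(prediction, reference))
```

Averaging both scores over all test questions would give the overall result for the system.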
What I want to ask is:
- Are those steps correct, or are any steps missing?
- Also, if the default pre-trained data that BERT provides is in English while my main dataset is in Indonesian, how can I create my own Indonesian pre-trained data?
- Is it really necessary to perform data/model clustering with BERT?
I'd appreciate any helpful answers. Thank you very much in advance.