
I'm currently developing a question-answering system in Indonesian using BERT for my thesis. Both the dataset and the questions given to the system are in Indonesian.

The problem is, I'm still not clear on the step-by-step process for developing a question-answering system with BERT.

From what I concluded after reading a number of research journals and papers, the process might look like this (a rough code sketch of steps 2 and 3 follows the list):

  1. Prepare the main dataset
  2. Load a pre-trained model
  3. Fine-tune the pre-trained model on the main dataset (so that it produces a "fine-tuned" model)
  4. Cluster the fine-tuned model
  5. Test (give questions to the system)
  6. Evaluate
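
To make this concrete, here is roughly what I picture for steps 2 and 3 (just a sketch; indobenchmark/indobert-base-p1 is an Indonesian checkpoint I found on the Hugging Face Hub, not necessarily the one I will end up using):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Step 2: load a pre-trained Indonesian BERT (assumed checkpoint name).
model_name = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# The question-answering head on top is freshly initialized, so step 3
# would fine-tune `model` on (question, context, answer-span) examples.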

What I want to ask is:

  • Are those steps correct? Or are there any missing steps?
  • Also, if the default pre-trained model that BERT provides is in English while my main dataset is in Indonesian, how can I create my own Indonesian pre-trained model?
  • Is it really necessary to perform data/model clustering with BERT?

I appreciate any helpful answer(s). Thank you very much in advance.

Dhimas Yoga

1 Answer


I would take a look at Hugging Face's question-answering examples. That would at least be a good place to start.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # Recent transformers versions return a model-output object rather
    # than a tuple, so read the span logits by attribute.
    with torch.no_grad():
        outputs = model(**inputs)

    answer_start = torch.argmax(outputs.start_logits)  # most likely start of the answer
    answer_end = torch.argmax(outputs.end_logits) + 1  # most likely end of the answer

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
scarpacci