
Let's say I want to train BERT on sentence pairs (query, answer) against a binary label (1, 0) for the correctness of the answer. Does BERT allow 512 words/tokens each for the query and the answer, or must they total 512 together (query + answer combined)? [510 after subtracting the [CLS] and [SEP] tokens]

Thanks in advance!


1 Answer


Together. In fact, they should total 509, since in addition to the [CLS] there are two [SEP] tokens, one after the question and another after the answer:

[CLS] q_word1 q_word2 ... [SEP] a_word1 a_word2 ... [SEP]

where q_word refers to words in the question and a_word to words in the answer.
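
For concreteness, here is a minimal sketch using the Hugging Face tokenizer (the checkpoint name and example strings are just placeholders): the pair is encoded together, so the 512-token budget covers the question and the answer combined.

    from transformers import BertTokenizer

    # Standard WordPiece tokenizer; the checkpoint name is just an example.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    question = "What is the capital of France?"
    answer = "Paris is the capital and most populous city of France."

    # Passing two texts encodes them as a single pair:
    # [CLS] question tokens [SEP] answer tokens [SEP]
    encoding = tokenizer(question, answer, truncation=True, max_length=512)

    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    print(tokens)        # starts with [CLS], has a [SEP] between the segments, ends with [SEP]
    print(len(tokens))   # at most 512 for the whole pair, not per segment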

Crystina
  • Alright. Thanks for the information. But since my answers are super long (approx. 500 words each), can you suggest a few ways to work around this? – Soumya Ranjan Sahoo Jun 20 '20 at 17:16
  • The first thing to try, of course, is to simply throw away the overlong part (since it's just a small portion, most of the time it should not affect the results much). If the answers are really too long (e.g. > 1000 tokens), maybe you can try splitting them in half and training on the halves separately – Crystina Jun 22 '20 at 14:04
  • Are the white spaces also considered individual tokens by the BERT tokenizer, or are only the words considered tokens here? @Crystina – Soumya Ranjan Sahoo Jul 01 '20 at 16:22
  • Actually neither; the words need to be further tokenized by the WordPiece tokenizer (which Hugging Face provides), and the final tokens are usually sub-words (see the sketch after these comments) – Crystina Jul 02 '20 at 02:48
  • I see. So given a sentence of, let's say, 500 words, I don't really have control over the number of tokens, i.e. chances are that not all 500 words will fit into max_seq_length after the sentence is divided into sub-words? I use the Hugging Face interface; can I control which words get truncated or how the tokens are produced? – Soumya Ranjan Sahoo Jul 04 '20 at 10:50
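
Following up on the comment thread, a minimal sketch (again assuming the Hugging Face transformers tokenizer; the strings and the chosen truncation strategy are only illustrative) of how sub-word splitting affects the token count and how to truncate only the answer side of the pair:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # A single word can be split into several WordPiece sub-words,
    # so 500 words may well produce more than 500 tokens.
    print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']

    question = "Is the answer to this question correct?"
    long_answer = "word " * 600   # stand-in for a very long answer

    # truncation="only_second" keeps the question intact and cuts
    # only the answer when the combined pair exceeds max_length.
    encoding = tokenizer(
        question,
        long_answer,
        truncation="only_second",
        max_length=512,
    )
    print(len(encoding["input_ids"]))   # never more than 512 for the pair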