
I want to pre-train my own LLM from scratch, so first I'm trying to construct the dataset. After several web searches and some research, I got an idea from Hugging Face's open course:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal-LM tokenizer
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers define no pad token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # mlm=False -> causal LM

The approach works like this: suppose you have a sentence tokenized as [a, b, c, d, e, f] and the model's context window length is 3. Then:

  1. cut the whole sentence into two pieces: [a, b, c] and [d, e, f];
  2. the input is [a, b, c], and the corresponding label is simply a clone of the input, which is also [a, b, c]. But the course adds a special emphasis: ⚠️ Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels. So inside the library, the label [a, b, c] is effectively shifted to [b, c, padding]. Finally we get two samples:

[a, b, c] -> [b, c, pad]
[d, e, f] -> [e, f, pad]
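The chunking scheme above can be sketched in plain Python (a toy illustration of my understanding, not the library's actual internals; letter tokens stand in for token IDs):

```python
def chunk_examples(tokens, window):
    """Split a token sequence into non-overlapping windows.

    Labels are a copy of the inputs; the left-shift that aligns each
    label with the *next* token happens inside the model's loss.
    """
    samples = []
    for i in range(0, len(tokens) - window + 1, window):
        chunk = tokens[i:i + window]
        samples.append({"input_ids": chunk, "labels": chunk[:]})
    return samples

pairs = chunk_examples(["a", "b", "c", "d", "e", "f"], window=3)
# Two samples: inputs [a, b, c] and [d, e, f], labels equal to the inputs.
```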

But in my opinion, the proper label for [a, b, c] should be [b, c, d]. Instead of cutting the whole sentence into pieces whose length equals the window size, it would make more sense to slide a window of the context length over the sequence from end to end. In this way, we get these samples:

[a, b, c] -> [b, c, d]
[b, c, d] -> [c, d, e]
[c, d, e] -> [d, e, f]
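The sliding-window variant I have in mind could be sketched like this (again a toy illustration; each label is the input window shifted right by one token, so no padding is needed):

```python
def sliding_window_samples(tokens, window):
    """Generate (input, label) pairs with a stride-1 sliding window.

    Each label is the next-token target for the matching input window.
    """
    samples = []
    for i in range(len(tokens) - window):
        samples.append((tokens[i:i + window], tokens[i + 1:i + 1 + window]))
    return samples

for x, y in sliding_window_samples(["a", "b", "c", "d", "e", "f"], window=3):
    print(x, "->", y)
# Three samples: ([a, b, c], [b, c, d]), ([b, c, d], [c, d, e]), ([c, d, e], [d, e, f])
```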

Both the quantity and quality of samples are improved.

So which is the right way? I've been confused by this for months. Please give me some advice if you have practical experience in this field. Thank you!

JsonBorn