From this website explaining the RoBERTa parameters, I understood that max_position_embeddings should be a power of 2.
Then from this GitHub issue, I understood that we should add 2 to the max_position_embeddings value while setting the RobertaConfig parameters.
So, we will get:
from transformers import RobertaConfig

max_position_embeddings_value = 512  # power of 2
config = RobertaConfig(
    max_position_embeddings=max_position_embeddings_value + 2
)
Now, I am wondering how to specify the LineByLineTextDataset parameters.
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='path_text.txt',
    block_size=max_position_embeddings_value,
)
Do the LineByLineTextDataset parameters depend on the RobertaConfig parameters?
- Suppose each line in the path_text.txt file has a length of n. Should I specify max_position_embeddings according to the length n? For example, what would max_position_embeddings_value be if n = 10, n = 500, n = 1000, etc.?
- Should block_size = max_position_embeddings_value? Does it depend on the length n?
I wanted to choose max_position_embeddings_value following this rule:
for n > 128 and n < 256, max_position_embeddings_value = 256, as it is the power of 2 closest to n and greater than n
for n > 256 and n < 512, max_position_embeddings_value = 512, as it is the power of 2 closest to n and greater than n
for n > 512 and n < 1024, max_position_embeddings_value = 1024, as it is the power of 2 closest to n and greater than n
etc.
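To make that rule concrete, here is a minimal sketch of what I have in mind (next_power_of_two is just a helper I wrote here for illustration; it is not part of transformers):

from transformers import RobertaConfig

def next_power_of_two(n: int) -> int:
    # smallest power of 2 that is >= n (assumes n >= 1)
    p = 1
    while p < n:
        p *= 2
    return p

n = 500  # e.g. the longest line length in path_text.txt
max_position_embeddings_value = next_power_of_two(n)  # 512
config = RobertaConfig(
    max_position_embeddings=max_position_embeddings_value + 2  # add the 2 mentioned in the GitHub issue
)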
Is it a good approach?