From this website explaining the RoBERTa parameters, I understood that max_position_embeddings should be a power of 2.
Then from this GitHub issue, I understood that we should add 2 to the max_position_embeddings value while setting the RobertaConfig parameters.
So, we will get:
from transformers import RobertaConfig

max_position_embeddings_value = 512  # power of 2
config = RobertaConfig(
    max_position_embeddings=max_position_embeddings_value + 2
)
Now, I am wondering how to specify the LineByLineTextDataset parameters.
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='path_text.txt',
    block_size=max_position_embeddings_value,
)
Do the LineByLineTextDataset parameters depend on the RobertaConfig parameters?
- Suppose each line in the path_text.txt file has a length of n. Should I specify max_position_embeddings according to the length n? For example, what would max_position_embeddings_value be if n = 10, n = 500, n = 1000, etc.?
- Should block_size = max_position_embeddings_value? Does it depend on the length n?
I wanted to choose max_position_embeddings_value following this rule:
for n > 128 and n < 256, max_position_embeddings_value = 256, as it is the power of 2 closest to n and greater than n
for n > 256 and n < 512, max_position_embeddings_value = 512, as it is the power of 2 closest to n and greater than n
for n > 512 and n < 1024, max_position_embeddings_value = 1024, as it is the power of 2 closest to n and greater than n
etc.
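To make that rule concrete, here is a minimal sketch of what I have in mind (next_power_of_two is just a helper I wrote here for illustration; it is not part of transformers):

from transformers import RobertaConfig

def next_power_of_two(n: int) -> int:
    # smallest power of 2 that is >= n (assumes n >= 1)
    p = 1
    while p < n:
        p *= 2
    return p

n = 500  # e.g. the longest line length in path_text.txt
max_position_embeddings_value = next_power_of_two(n)  # 512
config = RobertaConfig(
    max_position_embeddings=max_position_embeddings_value + 2  # add the 2 mentioned in the GitHub issue
)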
Is it a good approach?