
I'm currently trying to fine-tune DistilGPT-2 (with PyTorch and the Hugging Face Transformers library) for a code completion task. My corpus is arranged like the following example:

<|startoftext|>
public class FindCityByIdService {
    private CityRepository cityRepository = ...
<|endoftext|>

My first attempt was to run the following script from transformers library:

python run_clm.py \
     --model_type=gpt2 \
     --model_name_or_path distilgpt2 \
     --do_train \
     --train_file $TRAIN_FILE \
     --num_train_epochs 100 \
     --output_dir $OUTPUT_DIR \
     --overwrite_output_dir \
     --save_steps 20000 \
     --per_device_train_batch_size 4

After running some generation tests, I realized that the model never predicts `\n` for any given context. I imagine that some pre-processing stage (or something similar) is missing. In any case, what should I do so that `\n` is predicted as expected?
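One way to see how this can happen, independent of any model: if the training file is loaded line by line (as the `text` dataset loader does), each line's trailing newline is dropped before tokenization, so the concatenated token stream never contains a `\n` at all. A minimal plain-Python illustration of that effect (no tokenizer involved, example strings are made up):

```python
# Simulate a line-based dataset loader dropping newlines.
corpus = (
    "public class FindCityByIdService {\n"
    "    private CityRepository cityRepository = ...\n"
)

# Line-based loading: each example is a line WITHOUT its trailing newline.
examples = corpus.splitlines()

# Concatenating the examples back together (roughly what batching does)
# produces a stream with no newline characters left in it.
stream = "".join(examples)
assert "\n" not in stream
```

If the tokenizer never sees a `\n` in its input, the model has no newline tokens to learn from, which would explain why generation never emits them.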

HF Forum question

Thanks!!

  • Did you try adding "\n" to the training data? I suppose the model can only learn to predict it if it sees it in the training data. – Moritz Dec 06 '20 at 10:43
  • Having the same issue trying to fine-tune GPT-2. I've got newlines in my training file, but anything I generate from the resulting model never has any newlines. It seems like maybe newlines are stripped from the training data? But I can't find any evidence of that in the code at the moment. – Adrien Dec 13 '20 at 03:25

1 Answer


I think I found a hacky solution for this.

In run_clm.py change:

    def tokenize_function(examples):
        return tokenizer(examples[text_column_name])

to:

    def tokenize_function(examples):
        return tokenizer([example + "\n" for example in examples[text_column_name]])

When the Dataset is initially built, it splits the file by lines without keeping the trailing newline on each line. The group_texts method then concatenates those lines into batches without adding the newlines back. Changing tokenize_function to append `\n` to each line restores those newlines.
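A plain-Python sketch of what the patch changes (hypothetical example lines, no tokenizer involved):

```python
# Examples as produced by line-based loading: trailing newlines stripped.
examples = [
    "public class FindCityByIdService {",
    "    private CityRepository cityRepository = ...",
]

# Without the patch: concatenation loses the line boundaries entirely.
without_fix = "".join(examples)
assert "\n" not in without_fix

# With the patch: append "\n" to each example before tokenizing,
# so the concatenated stream contains newline characters again.
with_fix = "".join(example + "\n" for example in examples)
assert with_fix.count("\n") == 2
```

With the newlines back in the stream, the tokenizer emits newline tokens and the model can learn to predict them.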

Just tested this change on my fine-tuning job and it worked: the resulting model now generates newlines.

Adrien
  • It might also be solved by passing a `csv` file to `run_clm.py`, instead of `txt`. If the texts in its `text` column contain newline symbols, they will also appear in the generated output. Moreover, this may reduce the validation loss if the dataset consists of unrelated short texts since the examples will be split more naturally. – vbyno Apr 05 '21 at 06:48
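For reference, a minimal sketch of why the `csv` route preserves newlines (file contents and the `text` column name here are illustrative assumptions): a CSV writer quotes fields that contain embedded newlines, so each multi-line document survives a round trip as a single row.

```python
import csv
import io

# Two short documents, each containing internal newlines.
docs = [
    "public class A {\n    int x;\n}",
    "public class B {\n    int y;\n}",
]

# Write them to a CSV with a single 'text' column; the writer
# quotes fields, so the embedded newlines are preserved.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["text"])
for doc in docs:
    writer.writerow([doc])

# Reading the CSV back yields the documents with newlines intact.
buf.seek(0)
reader = csv.reader(buf)
next(reader)  # skip the header row
restored = [row[0] for row in reader]
assert restored == docs
```

This is the property the comment relies on: unlike a plain `.txt` file split on lines, a `csv` input keeps each document's internal `\n` characters, so they reach the tokenizer.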