Transformers summarization with Python Pytorch - how to get longer output?

Question

I use the AI-powered summarization example from https://github.com/huggingface/transformers/tree/master/examples/summarization, which gives state-of-the-art results.

Do I need to train the model myself to get summaries longer than those produced with the settings in the original Hugging Face example script?

# --summaries_output_dir is optional
python run_summarization.py \
    --documents_dir $DATA_PATH \
    --summaries_output_dir $SUMMARIES_PATH \
    --no_cuda false \
    --batch_size 4 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true \
    --compute_rouge true

When I run inference with

--min_length 500 \
--max_length 600 \

I get good output for the first 200 tokens, but the rest of the text is:

. . . [unused7] [unused7] [unused7] [unused8] [unused4] [unused7] [unused7]  [unused4] [unused7] [unused8]. [unused4] [unused7] . [unused4] [unused8] [unused4] [unused8].  [unused4]  [unused4] [unused8]  [unused4] . .  [unused4] [unused6] [unused4] [unused7] [unused6] [unused4] [unused8] [unused5] [unused4] [unused7] [unused4] [unused4] [unused7]. [unused4] [unused6]. [unused4] [unused4] [unused4] [unused8]  [unused4] [unused7]  [unused4] [unused8] [unused6] [unused4]   [unused4] [unused4]. [unused4].  [unused5] [unused4] [unused8] [unused7] [unused4] [unused7] [unused9] [unused4] [unused7]  [unused4] [unused7] [unused5] [unused4]  [unused5] [unused4] [unused6]  [unused4]. .  . [unused5]. [unused4]  [unused4]   [unused4] [unused6] [unused5] [unused4] [unused4]  [unused6] [unused4] [unused6]  [unused4] [unused4] [unused5] [unused4]. [unused5]  [unused4] . [unused4]  [unused4] [unused8] [unused8] [unused4]  [unused7] [unused4] [unused8]  [unused4] [unused7]  [unused4] [unused8]  [unused4]  [unused8] [unused4] [unused6]

score 3 · Accepted Answer · answered Feb 20 '20 at 08:45

The short answer is: Yes, probably.

To explain this in a bit more detail, we have to look at the paper behind the implementation. In Table 1, you can clearly see that most of their generated headlines are much shorter than what you are trying to generate. While that alone does not prove that you cannot generate anything longer, we can go deeper and look at the meaning of the [unusedX] tokens, as described by BERT developer Jacob Devlin:

Since [the [unusedX] tokens] were not used they are effectively randomly initialized.
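To check where a generated summary starts degenerating into these tokens, a small helper along the following lines could be used. This is only a minimal sketch: the function name first_unused_position is hypothetical, and it assumes the summary is a plain string whose tokens are separated by whitespace, as in the output shown in the question.

```python
import re

# Matches BERT's reserved placeholder tokens such as [unused4] or [unused7].
UNUSED_TOKEN = re.compile(r"\[unused\d+\]")


def first_unused_position(summary):
    """Return the index of the first whitespace-separated token that
    contains an [unusedN] placeholder, or None if there is none.

    Hypothetical helper for illustration; it is not part of the
    transformers library.
    """
    for index, token in enumerate(summary.split()):
        if UNUSED_TOKEN.search(token):
            return index
    return None


# Made-up example string: four normal tokens followed by placeholders.
example = "a good summary . [unused7] [unused4] [unused8]."
position = first_unused_position(example)
```

Running this on summaries generated with different --max_length values would show whether the usable part always ends at roughly the same length.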

Further, the summarization paper states:

Position embeddings in the original BERT model have a maximum length of 512; we overcome this limitation by adding more position embeddings that are initialized randomly and fine-tuned with other parameters in the encoder.

This is a strong indicator that, past a certain length, the model is likely falling back to the default initialization, which is unfortunately random. The question is whether you can still salvage the existing pre-training and simply fine-tune toward your objective, or whether it is better to start from scratch.
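The mechanism described in the quote from the paper can be illustrated with a toy example. The sketch below is not the paper's or the library's code: plain Python lists stand in for an embedding weight matrix, the function name extend_position_embeddings is hypothetical, and the table sizes and the standard deviation of 0.02 are assumptions chosen for illustration.

```python
import random


def extend_position_embeddings(pretrained, new_max_len, seed=0, std=0.02):
    """Return a position-embedding table with new_max_len rows.

    Rows that exist in `pretrained` are copied unchanged. Every extra
    row is filled with small random values, so it carries no learned
    information until it is fine-tuned.
    """
    rng = random.Random(seed)
    dim = len(pretrained[0])
    extended = [row[:] for row in pretrained]  # keep the pretrained rows
    for _ in range(new_max_len - len(pretrained)):
        extended.append([rng.gauss(0.0, std) for _ in range(dim)])
    return extended


# Toy sizes: 8 "pretrained" positions of dimension 4, extended to 12.
pretrained = [[float(pos)] * 4 for pos in range(8)]
extended = extend_position_embeddings(pretrained, 12)

# Positions 0-7 still hold the pretrained values; positions 8-11 are noise.
```

Only fine-tuning on sufficiently long examples would turn the random rows into something useful, which is why the choice between fine-tuning and training from scratch matters here.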
