I am trying to train the Conformer-RNNT model on the TEDLIUM dataset, and I hit the error below when I execute the training command.
usage: run_speech_recognition_rnnt.py [-h] (--manifest MANIFEST | --data_file DATA_FILE) --data_root DATA_ROOT [--vocab_size VOCAB_SIZE] [--tokenizer {spe,wpe}]
[--spe_type {bpe,unigram,char,word}] [--spe_character_coverage SPE_CHARACTER_COVERAGE] [--spe_bos] [--spe_eos] [--spe_pad]
[--spe_sample_size SPE_SAMPLE_SIZE] [--spe_train_extremely_large_corpus]
[--spe_max_sentencepiece_length SPE_MAX_SENTENCEPIECE_LENGTH] [--spe_no_split_by_unicode_script] [--no_lower_case] [--log]
run_speech_recognition_rnnt.py: error: the following arguments are required: --data_root
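For reference, the usage string implies an argument parser along the following lines. This is my reconstruction from the message above, not code copied from the script; it shows why argparse aborts: --data_root is declared required, and --manifest / --data_file form a required mutually exclusive pair.

import argparse

# Reconstructed from the usage message, not from the actual source.
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--manifest")    # exactly one of --manifest / --data_file
group.add_argument("--data_file")
parser.add_argument("--data_root", required=True)  # the argument the error complains about
parser.add_argument("--vocab_size", type=int)
parser.add_argument("--tokenizer", choices=["spe", "wpe"])
parser.add_argument("--spe_type", choices=["bpe", "unigram", "char", "word"])
# ... remaining --spe_* flags, --no_lower_case, and --log omitted for brevity
args = parser.parse_args()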
The command to train the conformer-rnnt model is shown below:
#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0 python run_speech_recognition_rnnt.py \
--config_path="conf/conformer_transducer_bpe_xlarge.yaml" \
--model_name_or_path="stt_en_conformer_transducer_xlarge" \
--dataset_name="esc-benchmark/esc-datasets" \
--tokenizer_path="tokenizer" \
--vocab_size="1024" \
--max_steps="100000" \
--dataset_config_name="tedlium" \
--output_dir="./" \
--run_name="rnnt-tedlium-baseline" \
--wandb_project="rnnt" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="4" \
--logging_steps="50" \
--learning_rate="1e-4" \
--warmup_steps="500" \
--save_strategy="steps" \
--save_steps="20000" \
--evaluation_strategy="steps" \
--eval_steps="20000" \
--report_to="wandb" \
--preprocessing_num_workers="4" \
--fused_batch_size="4" \
--length_column_name="input_lengths" \
--fuse_loss_wer \
--group_by_length \
--overwrite_output_dir \
--do_train \
--do_eval \
--do_predict \
--use_auth_token
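Note that this command passes --tokenizer_path and --vocab_size, but none of the arguments named in the usage message (no --data_root, no --manifest). As a sanity check, the tokenizer builder can be run on its own with the required arguments supplied explicitly; the manifest path below is a placeholder, and the other values mirror my training command where possible:

import subprocess

# Standalone invocation with the arguments the error message demands.
# The manifest path is a placeholder for an actual TEDLIUM manifest.
subprocess.run(
    [
        "python", "process_asr_text_tokenizer.py",
        "--manifest=manifests/tedlium_train.json",  # placeholder path
        "--data_root=tokenizer",
        "--vocab_size=1024",
        "--tokenizer=spe",
        "--spe_type=bpe",
    ],
    check=True,
)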
run_speech_recognition_rnnt.py launches a tokenizer-building step that is supposed to consume the arguments listed in the usage message, but argparse reports them missing, so they are apparently never passed through. The code below is from a file named process_asr_text_tokenizer.py and is invoked by run_speech_recognition_rnnt.py:
def main():
    data_root = args.data_root
    manifests = args.manifest
    data_file = args.data_file
    vocab_size = args.vocab_size
    tokenizer = args.tokenizer
    spe_type = args.spe_type
    spe_character_coverage = args.spe_character_coverage
    spe_sample_size = args.spe_sample_size
    spe_train_extremely_large_corpus = args.spe_train_extremely_large_corpus
    spe_max_sentencepiece_length = args.spe_max_sentencepiece_length
    spe_split_by_unicode_script = args.spe_split_by_unicode_script
    spe_bos, spe_eos, spe_pad = args.spe_bos, args.spe_eos, args.spe_pad
    lower_case = args.lower_case

    if not os.path.exists(data_root):
        os.makedirs(data_root)

    if args.log:
        logging.basicConfig(level=logging.INFO)

    if manifests:
        text_corpus_path = __build_document_from_manifests(data_root, manifests)
    else:
        text_corpus_path = data_file

    tokenizer_path = __process_data(
        text_corpus_path,
        data_root,
        vocab_size,
        tokenizer,
        spe_type,
        lower_case=lower_case,
        spe_character_coverage=spe_character_coverage,
        spe_sample_size=spe_sample_size,
        spe_train_extremely_large_corpus=spe_train_extremely_large_corpus,
        spe_max_sentencepiece_length=spe_max_sentencepiece_length,
        spe_split_by_unicode_script=spe_split_by_unicode_script,
        spe_bos=spe_bos,
        spe_eos=spe_eos,
        spe_pad=spe_pad,
    )

    print("Serialized tokenizer at location :", tokenizer_path)
    logging.info('Done!')
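This main() reads everything from a module-level args object, so whatever invokes the script has to put the flags on its command line. My working assumption, which I have not verified against the actual call site, is that run_speech_recognition_rnnt.py shells out to the tokenizer builder roughly as sketched below, and that --data_root gets dropped somewhere along the way; every name in this sketch is hypothetical:

import subprocess

# Hypothetical reconstruction of the call site, for illustration only.
def build_tokenizer(manifest_path, output_dir, vocab_size):
    cmd = [
        "python", "process_asr_text_tokenizer.py",
        f"--manifest={manifest_path}",
        f"--vocab_size={vocab_size}",
        "--tokenizer=spe",
        "--spe_type=bpe",
        # If this argument is missing (or output_dir is None), argparse fails
        # with exactly the "--data_root is required" error shown above.
        f"--data_root={output_dir}",
    ]
    subprocess.run(cmd, check=True)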
I expect the model to train for a while and then report metrics, but instead the run fails immediately with the error above. Although it looks like a simple argument-syntax mistake, I don't believe it is: the missing arguments belong to process_asr_text_tokenizer.py rather than to my training command, so the fix likely has to happen wherever the tokenizer script is invoked.
Please help me resolve this issue.