I am trying to train the Conformer-RNNT model on the TEDLIUM dataset, and I hit the error below when I execute the training command.
usage: run_speech_recognition_rnnt.py [-h] (--manifest MANIFEST | --data_file DATA_FILE) --data_root DATA_ROOT [--vocab_size VOCAB_SIZE] [--tokenizer {spe,wpe}]
[--spe_type {bpe,unigram,char,word}] [--spe_character_coverage SPE_CHARACTER_COVERAGE] [--spe_bos] [--spe_eos] [--spe_pad]
[--spe_sample_size SPE_SAMPLE_SIZE] [--spe_train_extremely_large_corpus]
[--spe_max_sentencepiece_length SPE_MAX_SENTENCEPIECE_LENGTH] [--spe_no_split_by_unicode_script] [--no_lower_case] [--log]
run_speech_recognition_rnnt.py: error: the following arguments are required: --data_root
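For reference, the usage string implies an argument parser along the following lines. This is my reconstruction from the message above, not code copied from the script; it shows why argparse aborts: --data_root is declared required, and --manifest / --data_file form a required mutually exclusive pair.

import argparse

# Reconstructed from the usage message, not from the actual source.
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--manifest")    # exactly one of --manifest / --data_file
group.add_argument("--data_file")
parser.add_argument("--data_root", required=True)  # the argument the error complains about
parser.add_argument("--vocab_size", type=int)
parser.add_argument("--tokenizer", choices=["spe", "wpe"])
parser.add_argument("--spe_type", choices=["bpe", "unigram", "char", "word"])
# ... remaining --spe_* flags, --no_lower_case, and --log omitted for brevity
args = parser.parse_args()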
The command to train the conformer-rnnt model is shown below:
#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0 python run_speech_recognition_rnnt.py \
--config_path="conf/conformer_transducer_bpe_xlarge.yaml" \
--model_name_or_path="stt_en_conformer_transducer_xlarge" \
--dataset_name="esc-benchmark/esc-datasets" \
--tokenizer_path="tokenizer" \
--vocab_size="1024" \
--max_steps="100000" \
--dataset_config_name="tedlium" \
--output_dir="./" \
--run_name="rnnt-tedlium-baseline" \
--wandb_project="rnnt" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="4" \
--logging_steps="50" \
--learning_rate="1e-4" \
--warmup_steps="500" \
--save_strategy="steps" \
--save_steps="20000" \
--evaluation_strategy="steps" \
--eval_steps="20000" \
--report_to="wandb" \
--preprocessing_num_workers="4" \
--fused_batch_size="4" \
--length_column_name="input_lengths" \
--fuse_loss_wer \
--group_by_length \
--overwrite_output_dir \
--do_train \
--do_eval \
--do_predict \
--use_auth_token
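Note that this command passes --tokenizer_path and --vocab_size, but none of the arguments named in the usage message (no --data_root, no --manifest). As a sanity check, the tokenizer builder can be run on its own with the required arguments supplied explicitly; the manifest path below is a placeholder, and the other values mirror my training command where possible:

import subprocess

# Standalone invocation with the arguments the error message demands.
# The manifest path is a placeholder for an actual TEDLIUM manifest.
subprocess.run(
    [
        "python", "process_asr_text_tokenizer.py",
        "--manifest=manifests/tedlium_train.json",  # placeholder path
        "--data_root=tokenizer",
        "--vocab_size=1024",
        "--tokenizer=spe",
        "--spe_type=bpe",
    ],
    check=True,
)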
run_speech_recognition_rnnt.py launches a tokenizer-building step that is supposed to consume the arguments listed in the usage message, but argparse reports them missing, so they are apparently never passed through. The code below is from a file named process_asr_text_tokenizer.py and is invoked by run_speech_recognition_rnnt.py:
def main():
    data_root = args.data_root
    manifests = args.manifest
    data_file = args.data_file
    vocab_size = args.vocab_size
    tokenizer = args.tokenizer
    spe_type = args.spe_type
    spe_character_coverage = args.spe_character_coverage
    spe_sample_size = args.spe_sample_size
    spe_train_extremely_large_corpus = args.spe_train_extremely_large_corpus
    spe_max_sentencepiece_length = args.spe_max_sentencepiece_length
    spe_split_by_unicode_script = args.spe_split_by_unicode_script
    spe_bos, spe_eos, spe_pad = args.spe_bos, args.spe_eos, args.spe_pad
    lower_case = args.lower_case

    if not os.path.exists(data_root):
        os.makedirs(data_root)

    if args.log:
        logging.basicConfig(level=logging.INFO)

    if manifests:
        text_corpus_path = __build_document_from_manifests(data_root, manifests)
    else:
        text_corpus_path = data_file

    tokenizer_path = __process_data(
        text_corpus_path,
        data_root,
        vocab_size,
        tokenizer,
        spe_type,
        lower_case=lower_case,
        spe_character_coverage=spe_character_coverage,
        spe_sample_size=spe_sample_size,
        spe_train_extremely_large_corpus=spe_train_extremely_large_corpus,
        spe_max_sentencepiece_length=spe_max_sentencepiece_length,
        spe_split_by_unicode_script=spe_split_by_unicode_script,
        spe_bos=spe_bos,
        spe_eos=spe_eos,
        spe_pad=spe_pad,
    )

    print("Serialized tokenizer at location :", tokenizer_path)
    logging.info('Done!')
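This main() reads everything from a module-level args object, so whatever invokes the script has to put the flags on its command line. My working assumption, which I have not verified against the actual call site, is that run_speech_recognition_rnnt.py shells out to the tokenizer builder roughly as sketched below, and that --data_root gets dropped somewhere along the way; every name in this sketch is hypothetical:

import subprocess

# Hypothetical reconstruction of the call site, for illustration only.
def build_tokenizer(manifest_path, output_dir, vocab_size):
    cmd = [
        "python", "process_asr_text_tokenizer.py",
        f"--manifest={manifest_path}",
        f"--vocab_size={vocab_size}",
        "--tokenizer=spe",
        "--spe_type=bpe",
        # If this argument is missing (or output_dir is None), argparse fails
        # with exactly the "--data_root is required" error shown above.
        f"--data_root={output_dir}",
    ]
    subprocess.run(cmd, check=True)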
I expect the model to train for a while and then report metrics, but instead the run fails immediately with the error above. Although it looks like a simple argument-syntax mistake, I don't believe it is: the missing arguments belong to process_asr_text_tokenizer.py rather than to my training command, so the fix likely has to happen wherever the tokenizer script is invoked.
Please help me resolve this issue.