
I want to train a speech-to-text model with wav2vec2 XLSR (a transformer-based model) for Danish. The usual recommendation is to train on Common Voice with the help of the datasets library, but Common Voice has very little Danish data, so I want to train the model on my own custom data. I have failed to find any clear documentation for this. Can anybody please help me with how to do it, step by step?

2 Answers


I suggest you extend the Common Voice (CV) Danish subset with your own dataset. Analyse the CV dataset first and shape your data like the CV corpus. At this point, file extensions (.wav, .mp3, ...), sample types (float32, int, ...), audio lengths, and of course transcription formats are important. Do not make your corpus sparse.

Place your data into the CV corpus folder and load the dataset; you should then be able to fine-tune the model on the extended data using existing code, as sketched below.
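For the loading step, something like this minimal sketch with the Hugging Face datasets library should work; the CSV file name and its "path"/"sentence" columns are assumptions about your layout, and the "common_voice" dataset identifier may vary with your datasets version:

from datasets import load_dataset, concatenate_datasets, Audio

# the existing Common Voice Danish subset (small, as you noted)
cv_da = load_dataset("common_voice", "da", split="train")
cv_da = cv_da.remove_columns(
    [c for c in cv_da.column_names if c not in ("audio", "sentence")]
)

# your own corpus: a CSV with "path" and "sentence" columns (hypothetical layout)
custom = load_dataset("csv", data_files="my_danish_corpus.csv", split="train")
custom = custom.cast_column("path", Audio())
custom = custom.rename_column("path", "audio")

# decode both sides at 16 kHz, the rate wav2vec2 was pre-trained on,
# so the merged corpus stays uniform
cv_da = cv_da.cast_column("audio", Audio(sampling_rate=16_000))
custom = custom.cast_column("audio", Audio(sampling_rate=16_000))

combined = concatenate_datasets([cv_da, custom])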

Do not create a completely new corpus if you are not an expert on wav2vec.

A note: you should be able to get a reasonable result with less data. What WER did you achieve, and what is your target? Hyper-parameter tuning may be the first thing to look at instead of more data.
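If you are not measuring WER already, it is easy to compute, for example with the jiwer package (the sentences here are made up):

from jiwer import wer

reference = "violence is the last refuge of the incompetent"
hypothesis = "violence is the last refuge of incompetent"

# WER = (substitutions + deletions + insertions) / words in the reference
print(wer(reference, hypothesis))  # one deletion over 8 words -> 0.125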

bekirbakar

I've built a tool to help me fine-tune wav2vec2 models using custom data. Maybe this can help you too: https://github.com/jonatasgrosman/huggingsound.

You can install it using: pip install huggingsound

To fine-tune the XLSR model using a custom dataset, you'll need to do something like this:

from huggingsound import TrainingArguments, ModelArguments, SpeechRecognitionModel, TokenSet

# TrainingArguments and ModelArguments are optional; pass them to
# finetune() as training_args / model_args if you want to override the defaults
model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53")
output_dir = "my/finetuned/model/output/dir"

# first of all, you need to define your model's token set
# (only needed for non-fine-tuned models; a new token set passed for an
# already fine-tuned model is ignored during training)
# since your target language is Danish, you would extend this English
# example with "æ", "ø", and "å"
tokens = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
token_set = TokenSet(tokens)

# define your custom train data
train_data = [
    {"path": "/path/to/sagan.mp3", "transcription": "extraordinary claims require extraordinary evidence"},
    {"path": "/path/to/asimov.wav", "transcription": "violence is the last refuge of the incompetent"},
]

# and finally, fine-tune your model
model.finetune(
    output_dir, 
    train_data=train_data,
    token_set=token_set,
)
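Once fine-tuning finishes, you can sanity-check the result by loading the output directory back into huggingsound and transcribing a held-out file (the path below is a placeholder):

finetuned = SpeechRecognitionModel(output_dir)
transcriptions = finetuned.transcribe(["/path/to/heldout.wav"])
print(transcriptions)  # a list of dicts, each with a "transcription" key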
Jonatas Grosman