
I have extracted audio embeddings from the Google AudioSet corpus (https://research.google.com/audioset/dataset/index.html). Each embedding record contains a list of "bytes_list" values similar to the following:

  feature {
    bytes_list {
      value: "#\226]\006(N\223K\377\207\r\363\333\377\000Y\322v9\351\303\000\377\311\375\215E\342\377J\000\000_\000\370\222:\270\377\357\000\245\000\377\213jd\267\353\377J\033$\273\267\307\035\377\000\207\244Q\000\000\206\000\000\312\356<R\325g\303\356\016N\224\377\270\377\237\240\377\377\321\252j\357O\217\377\377,\330\000\377|\246\000\013\034\000\377\357\212\267\300b\000\000\000\251\236\000\233\035\000\326\377\327\327\377\377\223\0009{"
    }
  }
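
This is the sketch I am using to inspect the records (a sketch only; the file path is a placeholder, and I am assuming the `labels` and `audio_embedding` keys described on the AudioSet download page, where each record is a `tf.train.SequenceExample` holding one 128-byte quantized uint8 vector per second of audio):

  import numpy as np
  import tensorflow as tf

  # Placeholder path; point this at one of the downloaded embedding files.
  raw_dataset = tf.data.TFRecordDataset("bal_train/00.tfrecord")

  for raw_record in raw_dataset.take(1):
      example = tf.train.SequenceExample()
      example.ParseFromString(raw_record.numpy())

      # Class indices for this clip (context feature).
      labels = list(example.context.feature["labels"].int64_list.value)
      # One bytes feature per ~1 s frame of audio.
      frames = example.feature_lists.feature_list["audio_embedding"].feature
      # Decode each byte string into a uint8 vector and stack to shape (T, 128).
      embedding = np.stack(
          [np.frombuffer(f.bytes_list.value[0], dtype=np.uint8) for f in frames]
      )
      print(labels, embedding.shape)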

Now I want to use these audio embeddings to train my own model (a CNN), but I have some confusion about them:

  1. Should I extract STFT and MFCC features from the audio embeddings? If so, how can I do that (is there a way to use librosa)? Or are the audio embeddings already derived from MFCCs?
  2. What is the best way to split the AudioSet corpus into train, test, and validation sets? The data is in TFRecord format, and each TFRecord file contains various audio clip segments with different class labels.
  3. If I want to work with only selected class labels (such as rowing or car sounds), what is the best way to extract those audio segments? (I include a sketch of what I have tried after this list.)
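
For question 3, this is my current attempt at filtering records by label index (a sketch; the indices 288 and 300 below are placeholders, and the actual index-to-name mapping is in class_labels_indices.csv from the AudioSet download):

  import tensorflow as tf

  # Placeholder class indices; look up real ones in class_labels_indices.csv.
  WANTED = tf.constant([288, 300], dtype=tf.int64)

  def has_wanted_label(raw_record):
      # Parse only the 'labels' context feature of the SequenceExample.
      context, _ = tf.io.parse_single_sequence_example(
          raw_record,
          context_features={"labels": tf.io.VarLenFeature(tf.int64)},
      )
      labels = tf.sparse.to_dense(context["labels"])
      # Keep the segment if any of its labels is in the wanted set.
      return tf.reduce_any(
          tf.equal(tf.expand_dims(labels, 1), tf.expand_dims(WANTED, 0))
      )

  dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("bal_train/*.tfrecord"))
  selected = dataset.filter(has_wanted_label)

Is filtering like this a reasonable approach, or is there a more standard way?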

Please also share some helpful resources on working with the Google AudioSet corpus, if possible.

Sabid Habib
  • The embeddings are the output of a pretrained CNN. Why do you want to use them to train your own CNN? – Jon Nordby Mar 17 '22 at 16:35
  • Hi Jon, thanks for the response. How should I approach using the audioset corpus to train my own CNN in that case? – Sabid Habib Mar 17 '22 at 16:45
  • You have to download the audio files from YouTube, and use them together with the labels. There are some tools like audiosetdl that can help: https://github.com/soundsensing/audiosetdl – Jon Nordby Mar 17 '22 at 18:56

0 Answers