
I am working with Python to isolate elements from music. To train a model, I split my audio into frames and assign each frame a label, 1 or 0. Unfortunately, due to rounding errors, my labels always come out one or two frames short.

Converting my audio to frames, I get an MFCC array of shape (13, 3709):

    import librosa

    s = []  # one MFCC matrix of shape (n_mfcc, n_frames) per clip
    for y in audio:
        mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13, n_fft=2048, hop_length=1024)
        s.append(mfcc)

Converting the timestamps in my text file (for the mp3 I am working with) from milliseconds to frame numbers, I get a label vector of length 3708:

    import numpy as np

    output = []
    for block in textCorpus:
        block_start = int(float(block[0]) * 16000 / 1024)   # Converted to frame number
        block_end = int(float(block[1]) * 16000 / 1024)     # Converted to frame number
        singing = block[2]
        block_range = np.arange(block_start, block_end, 1)  # Step size is 1 (per frame number)
        # extraneous code

I have tried using Decimal, math.floor, and math.ceil for my block_start and block_end variables, but I still can't match my audio frame count.

  • I don't know much about librosa but have you checked if 3708 is actually the last index in the vector? Assuming it is indexed from 0 the size and indexes would match in that case – Chachmu Apr 05 '18 at 18:40
  • 3707 is the last index, counting from 0, so the label vector has 3708 entries in total. Likewise, the audio was divided into 3709 frames, with the last index being 3708. –  Apr 05 '18 at 18:56
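
One possible explanation for the 3709 versus 3708 mismatch, sketched below under an assumption the question does not state: that the MFCCs are computed with librosa's default centered framing (`center=True`), under which the frame count is generally `1 + n_samples // hop_length`. Truncating `duration * sr / hop_length` then yields one label fewer than there are frames. The function names and the sample count are hypothetical, chosen only to reproduce the numbers in the question.

```python
def centered_frame_count(n_samples, hop_length):
    """Frame count under the assumed centered framing: 1 + n_samples // hop_length."""
    return 1 + n_samples // hop_length


def truncated_label_count(duration_s, sr, hop_length):
    """Number of labels obtained by truncating duration * sr / hop_length."""
    return int(duration_s * sr / hop_length)


sr, hop = 16000, 1024
n_samples = 3708 * hop + 500      # hypothetical clip length, not from the question
duration_s = n_samples / sr       # clip duration in seconds

n_frames = centered_frame_count(n_samples, hop)
n_labels = truncated_label_count(duration_s, sr, hop)
print(n_frames, n_labels)
```

If this assumption holds, the gap is a fixed offset of one frame rather than accumulated rounding error, which would explain why switching between floor, ceil and Decimal does not close it.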

2 Answers


Use the Fraction class from the fractions module in the standard library: https://docs.python.org/2/library/fractions.html

It is useful for exact rational number arithmetic.
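
A minimal sketch of what this could look like for the conversion in the question. It assumes the timestamps arrive as decimal strings such as "37.903" and are in seconds, as the original `* 16000 / 1024` conversion implies. The helper names `time_to_frame` and `frame_to_time_string` are hypothetical. The value is converted to float only at the point of display, since older Python versions raise a TypeError when a Fraction is passed to a `.3f` format spec.

```python
from fractions import Fraction

FRAMES_PER_SECOND = Fraction(16000, 1024)  # reduces to exactly 125/8


def time_to_frame(timestamp):
    """Convert a decimal timestamp string (assumed seconds) to a frame index."""
    # Fraction parses the decimal string exactly, so no float rounding occurs.
    return int(Fraction(timestamp) * FRAMES_PER_SECOND)


def frame_to_time_string(frame):
    """Format the start time of a frame with three decimals."""
    exact = Fraction(frame) / FRAMES_PER_SECOND
    return '{0:.3f}'.format(float(exact))  # float only for formatting


print(time_to_frame("37.903"))      # 592
print(frame_to_time_string(592))    # 37.888
```

Exact arithmetic removes float rounding from the conversion, but it cannot change the result of truncation itself, so a constant off-by-one against the audio frame count would remain.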

Primusa
  • To be clear, you mean `block_start = (Fraction(block[0]) * 16000 / 1024)` Edit: Which unfortunately gives me an error TypeError: unsupported format string passed to Fraction.__format__ –  Apr 05 '18 at 18:59
  • what is the value of block[0]? – Primusa Apr 05 '18 at 19:01
  • I am suggesting you convert all of your arithmetic here to fractions and get rid of the floats altogether – Primusa Apr 05 '18 at 19:01
  • block[0] is 0.000ms, block[1] is 37.903 and block[2] is 0. All this in turn means that between 0.000ms and 37.903ms in the audio file I am working with, there is no singing. I understand about your suggestion. –  Apr 05 '18 at 19:25
  • alright so the format error that is thrown in your above example can be stopped with converting block[0] to float first – Primusa Apr 05 '18 at 19:28
  • Convert 16000 / 1024 to a fraction with Fraction(16000, 1024), and wrap 0.064 in Fraction as well. Everything should be using exact arithmetic – Primusa Apr 05 '18 at 19:30
  • Worst case scenario, could you just trim the extra frames at the end? You would be losing an almost negligible amount of data – Primusa Apr 05 '18 at 19:40
  • Thanks for the helpful tip, but unfortunately this sends me further off the mark. I guess it's progress though. To be clear `block_start = int(float(block[0]) * Fraction(16000, 1024))` and `block_end = Fraction(block[1]) * Fraction(16000 , 1024)` –  Apr 05 '18 at 19:41
  • block_start = Fraction(float(block[0])) * Fraction(16000, 1024) – Primusa Apr 05 '18 at 19:43
  • Thanks for the help again (I promise I won't keep you much longer :D ). But this is beginning to bug me to no end. You say converting to float first should solve my error. I tried your exact suggestion yet I get a call back error: `ms_start = '{0:.3f}'.format(x * 1024 / 16000) TypeError: unsupported format string passed to Fraction.__format__` Any final suggestions? –  Apr 05 '18 at 19:53
  • Just throwing something out there but maybe multiply everything by 1000 at the start (so work with units of milliseconds) and just avoid non-integers altogether – Primusa Apr 05 '18 at 19:56
  • I am in milliseconds as it is! I was thinking, as you said, of resorting to just trimming the end of the audio frames, but I feel that's cheating. There must be a way. I know there is a way! –  Apr 05 '18 at 20:07

If you're getting the blocks in order, perhaps you could forgo the multiplications and divisions and work with simple additions instead:

    def labelToFrames(textCorpus):
        output    = []
        offset    = 0
        increment = 0.064           # or 1024/16000
        for block in textCorpus:
            block_start = block[0]
            block_end   = block[1]
            singing     = block[2]
            while offset < block_end:
                ms_start = '{0:.3f}'.format(offset)
                offset   = min(block_end, offset + increment)
                ms_end   = '{0:.3f}'.format(offset)
                add_to_output = [ms_start, ms_end, singing]
                output.append(add_to_output)
        return output
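
A related sketch, not part of the answer's code, that walks the blocks in order but counts whole frames instead of adding 0.064 repeatedly, which avoids accumulating float error. It takes the audio frame count as a target and pads any shortfall with the last label. That padding rule, the function name `labels_per_frame`, and the assumption that timestamps are in seconds are illustrative choices, not something the question confirms.

```python
def labels_per_frame(blocks, n_frames, sr=16000, hop_length=1024):
    """Return one label per frame for ordered (start, end, label) blocks."""
    labels = []
    for _start, end, label in blocks:
        # Frame index where this block ends, capped at the target length.
        end_frame = min(n_frames, int(float(end) * sr / hop_length))
        # Fill from the current position up to the end of the block.
        labels.extend([label] * (end_frame - len(labels)))
    # Assumption: frames past the last block reuse the final label.
    if labels and len(labels) < n_frames:
        labels.extend([labels[-1]] * (n_frames - len(labels)))
    return labels


blocks = [("0.000", "37.903", 0), ("37.903", "40.000", 1)]
frame_labels = labels_per_frame(blocks, n_frames=626)
print(len(frame_labels), frame_labels.count(0), frame_labels.count(1))
```

Passing the second dimension of the MFCC array as `n_frames` guarantees that the label vector and the feature matrix have the same length.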
Alain T.