
I am building a prototype of a sound detection app that will ultimately run on a phone (iPhone/Android). It needs to be near real-time so it can respond quickly enough when a particular sound is recognized. I am hoping to use TensorFlow to build and train the model and then deploy it on a mobile device.

What I am unsure about is the best way to feed data to TensorFlow for inference in this case.

Option 1: Feed only newly acquired samples to the model.

Here the model itself keeps a buffer of previous signal samples, to which new samples are appended, and the whole thing gets processed. Something like:

import tensorflow as tf

samples = tf.placeholder(tf.int16, shape=(None,))  # 1-D chunk of new samples
buffer = tf.Variable([], trainable=False, validate_shape=False, dtype=tf.int16)
# tf.concat takes the values first and the axis second
update_buffer = tf.assign(buffer, tf.concat([buffer, samples], 0), validate_shape=False)
detection_op = ....process buffer...
session.run([update_buffer, detection_op], feed_dict={samples: [.....]})

This seems to work, but if samples are pushed to the model 100 times a second, what happens inside tf.assign? The buffer can grow quite large, and if tf.assign allocates a new tensor on every call this may not perform well.
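One way to sidestep the unbounded growth would be a fixed-length buffer that is shifted on each update, so tf.assign always writes a tensor of the same shape. A minimal sketch, with the chunk and buffer sizes assumed purely for illustration:

import tensorflow as tf

CHUNK = 20          # assumed: samples pushed per update (100 updates/s at 2 kHz)
BUFFER_LEN = 4000   # assumed: 2 s of audio at 2 kHz

samples = tf.placeholder(tf.int16, shape=(CHUNK,))
buffer = tf.Variable(tf.zeros([BUFFER_LEN], dtype=tf.int16), trainable=False)
# Drop the oldest CHUNK samples and append the new ones; the variable keeps
# a constant shape, so there is no unbounded allocation inside tf.assign.
update_buffer = tf.assign(buffer, tf.concat([buffer[CHUNK:], samples], 0))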

Option 2: Feed the whole recording to the model

Here the iPhone app keeps the state/recording samples and feeds the whole recording to the model. The input can get quite large, and re-running the detection op on the whole recording means recomputing the same values every cycle.

Option 3: Feed a sliding window of data

Here the app keeps the data for the whole recording, but feeds only the latest slice to the model. E.g. the last 2 s at a 2000 Hz sampling rate == 4000 samples, fed once every 1/100 s (20 new samples each time). The model may also need to keep some running totals for the whole recording.
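Maintaining the window on the app side is cheap; a rough sketch in NumPy, reusing the samples placeholder and detection_op from Option 1 (names assumed for illustration):

import numpy as np

window = np.zeros(4000, dtype=np.int16)    # last 2 s at a 2000 Hz rate

def on_new_samples(chunk):                 # called ~100 times/s with ~20 samples
    global window
    window = np.concatenate([window[len(chunk):], chunk])
    return session.run(detection_op, feed_dict={samples: window})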

Advice?

Kliment Mamykin

1 Answer


I'd need to know a bit more about your application requirements, but for simplicity's sake I recommend starting with option #3. The usual way to approach this problem for arbitrary sounds is:

  • Have some trigger to detect the start of a sound or speech utterance. This can just be sustained audio levels, or something more advanced.
  • Run a spectrogram over a fixed-size window, aligned with the start of the noise (see the sketch after this list).
  • The rest of the network can just be a standard image detection one (usually cut down in size) to classify the sound.
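
As a rough sketch of the middle two steps, assuming a TensorFlow build that ships tf.contrib.signal (the window length and STFT parameters are illustrative only):

import tensorflow as tf

# Assumed: one fixed-size window of audio, aligned with the detected onset.
window = tf.placeholder(tf.float32, shape=(4000,))
stft = tf.contrib.signal.stft(window, frame_length=256, frame_step=128)
spectrogram = tf.abs(stft)                       # magnitude spectrogram
log_spec = tf.log(spectrogram + 1e-6)            # compress the dynamic range
image = tf.expand_dims(tf.expand_dims(log_spec, 0), -1)  # NHWC, batch of one
# ...feed image into a small convolutional classifier...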

There are a lot of variations and other possible approaches. For example, for speech it's typical to use MFCCs as your feature generator and then run an LSTM to separate out phonemes, but since you mention sound detection I'm guessing you don't need anything that advanced.

Pete Warden