I want to perform CTC beam search on matrices of phoneme probabilities (the output of an ASR model). TensorFlow has a CTC beam search implementation, but it is poorly documented and I have not managed to build a working example. I want to write code that uses it as a benchmark.
Here is my code so far:
import numpy as np
import tensorflow as tf


def decode_ctcBeam(matrix, classes):
    matrix = np.reshape(matrix, (matrix.shape[0], 1, matrix.shape[1]))
    aa_ctc_blank_aa_logits = tf.constant(matrix)
    sequence_length = tf.constant(np.array([len(matrix)], dtype=np.int32))
    (decoded_list,), log_probabilities = tf.nn.ctc_beam_search_decoder(
        inputs=aa_ctc_blank_aa_logits,
        sequence_length=sequence_length,
        merge_repeated=True,
        beam_width=25)
    out = list(tf.Session().run(tf.sparse_tensor_to_dense(decoded_list)[0]))
    print(out)
    return out


if __name__ == '__main__':
    classes = ['AA', 'B', 'CH']
    mat = np.array([[0.4, 0, 0.6, 0.2], [0.4, 0, 0.6, 0.2]], dtype=np.float32)
    actual = decode_ctcBeam(mat, classes)
I have trouble understanding parts of this code:
- In the example, mat has shape (2, 4), but the TensorFlow function needs shape (2, 1, 4), so I reshape it with
matrix = np.reshape(matrix, (matrix.shape[0], 1, matrix.shape[1]))
What does this mean mathematically? Are mat and matrix the same data, or am I mixing things up? As I understand it, the 1 in the middle is the batch size.
- The decode_ctcBeam function returns a list; in the example it gives [2], which should mean 'CH' from the defined classes. How do I generalize this and recover the recognized phoneme sequence when I have a larger input matrix and, say, 40 phonemes?
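To make both points concrete, here is a small plain-Python sketch of my current understanding. It is not a definitive answer. The helper names add_batch_axis and indices_to_labels are my own and are not part of TensorFlow. I am assuming that the decoder expects time-major input of shape (max_time, batch_size, num_classes) and treats the last class index as the CTC blank, which is how I read the documentation. I am also assuming that padding in the dense output is marked with -1, which would require passing default_value=-1 to tf.sparse_tensor_to_dense, since its default padding value of 0 cannot be told apart from class 0 ('AA').

```python
def add_batch_axis(frames):
    """Turn a (time, classes) nested list into (time, 1, classes).

    The values are unchanged: each time step's row is simply wrapped
    in a batch of size one, like np.reshape(m, (T, 1, C)) does.
    """
    return [[list(row)] for row in frames]


def indices_to_labels(indices, classes, pad_value=-1):
    """Map decoded class indices to label strings, skipping padding.

    pad_value is an assumption: it must match the default_value used
    when the sparse decoder output is converted to a dense array.
    """
    return [classes[i] for i in indices if i != pad_value]


classes = ['AA', 'B', 'CH']
mat = [[0.4, 0.0, 0.6, 0.2],
       [0.4, 0.0, 0.6, 0.2]]

batched = add_batch_axis(mat)
# Shape is (time, batch, classes) = (2, 1, 4).
print(len(batched), len(batched[0]), len(batched[0][0]))  # 2 1 4

# The decoder output [2] from the example maps to 'CH'.
print(indices_to_labels([2], classes))  # ['CH']
```

If this reading is right, the same mapping should work unchanged for 40 phonemes, as long as classes lists the labels in the same order as the columns of the input matrix, with one extra final column for the blank.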
Looking forward to your answers / comments! Thanks!