IBM Watson's Speech-To-Text API appears to have a persistent issue: the transcript and the alternatives are inconsistent with each other. For instance, an excerpt of an offending transcript reads:

'fall being just like'

and the corresponding alternatives, aggregated by timestamp, are

[following, vaulting, fall, faulting, folding]
[like]

indicating that there are no alternatives corresponding to either 'being' or 'just'. The most egregious example I have seen is a case where the transcript is perfectly fine but the alternatives are empty. The application I am working on assumes that the alternatives are, for the most part, a superset of the transcript (after accounting for smart formatting), so this is a serious issue for me.

Another excerpt is:

'are a team you know 80 - around back great' 

but the alternatives have [are, our, all] between 6.19 and 6.39 and then [back] by itself between 8.18 and 8.54. That leaves a gap of roughly 2 seconds in which the transcript contains words but the alternatives contain nothing.
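For reference, here is a minimal sketch of the coverage check I am describing. It assumes the per-word transcript timestamps are available as `[word, start, end]` triples and that each alternatives entry carries `start_time`, `end_time` and `alternatives` fields, which is how I understand the response to be shaped. The function name `uncovered_words`, the tolerance `tol`, and the timestamps for 'team' are made up for illustration; only the 6.19 to 6.39 and 8.18 to 8.54 intervals come from my data.

```python
def uncovered_words(timestamps, word_alternatives, tol=0.01):
    """Return transcript words whose time span overlaps no alternatives slot.

    timestamps:        list of (word, start, end) triples from the transcript
    word_alternatives: list of dicts with "start_time", "end_time", "alternatives"
    tol:               minimum overlap in seconds to count as covered (assumed value)
    """
    slots = [(wa["start_time"], wa["end_time"]) for wa in word_alternatives]
    missing = []
    for word, start, end in timestamps:
        # Length of the intersection of the two intervals; <= 0 means disjoint.
        covered = any(min(end, e) - max(start, s) > tol for s, e in slots)
        if not covered:
            missing.append((word, start, end))
    return missing


# The "team" interval is hypothetical; the other two intervals are from the excerpt.
timestamps = [("are", 6.19, 6.39), ("team", 6.60, 7.00), ("back", 8.18, 8.54)]
word_alternatives = [
    {"start_time": 6.19, "end_time": 6.39,
     "alternatives": [{"word": "are"}, {"word": "our"}, {"word": "all"}]},
    {"start_time": 8.18, "end_time": 8.54,
     "alternatives": [{"word": "back"}]},
]

print(uncovered_words(timestamps, word_alternatives))
```

With this input, only the word in the gap is reported as uncovered, which is exactly the situation I keep running into.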

The reverse seems to occur sometimes as well: the alternatives contain words that cannot be matched to any word in the transcript. This compounds the problem, because I then cannot even forcibly reconcile the two, for instance by inserting words into the alternatives with placeholder timestamps.
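To make the forced reconciliation concrete, this is roughly what I have in mind, again assuming `[word, start, end]` transcript triples and alternatives entries with `start_time`, `end_time` and `alternatives` fields. The names `reconcile`, `overlap`, the `placeholder` flag and all timestamps below are hypothetical. Transcript words without a matching slot get a placeholder entry, and alternatives slots that overlap no transcript word are returned separately as orphans, which are the ones I have no sensible way to handle.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (<= 0 if disjoint)."""
    return min(a_end, b_end) - max(a_start, b_start)


def reconcile(timestamps, word_alternatives, tol=0.01):
    """Return (patched, orphans).

    patched: the alternatives plus placeholder slots for uncovered transcript words,
             sorted by start time
    orphans: original alternatives slots that overlap no transcript word
    """
    patched = list(word_alternatives)
    for word, start, end in timestamps:
        covered = any(
            overlap(start, end, wa["start_time"], wa["end_time"]) > tol
            for wa in word_alternatives
        )
        if not covered:
            # Placeholder slot: the transcript word is its own only alternative.
            patched.append({
                "start_time": start,
                "end_time": end,
                "alternatives": [{"word": word, "confidence": None}],
                "placeholder": True,
            })
    patched.sort(key=lambda wa: wa["start_time"])

    orphans = [
        wa for wa in word_alternatives
        if not any(
            overlap(start, end, wa["start_time"], wa["end_time"]) > tol
            for _, start, end in timestamps
        )
    ]
    return patched, orphans


# All timestamps here are invented for illustration.
timestamps = [("fall", 1.0, 1.3), ("being", 1.3, 1.6)]
word_alternatives = [
    {"start_time": 1.0, "end_time": 1.3,
     "alternatives": [{"word": "following"}, {"word": "fall"}]},
    {"start_time": 5.0, "end_time": 5.4,
     "alternatives": [{"word": "like"}]},
]

patched, orphans = reconcile(timestamps, word_alternatives)
```

Here 'being' gets a placeholder slot, while the slot at 5.0 is left over as an orphan because nothing in the transcript overlaps it.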

Long story short: why can the transcript and the alternatives sometimes not be reconciled, even after the effects of smart formatting are ignored? In particular, the transcript can contain words that do not appear in the alternatives, and vice versa.

Riley