
I just noticed that the timestamps of SFTranscriptionSegments reset to zero each minute, which makes it impossible to know where in the audio the text is actually located when there are long pauses. Is this something that can be configured or worked around?

I am using SFSpeechRecognizer to transcribe audio files that are potentially longer than one minute. Chopping them into one-minute segments risks splitting words.

I am using SFSpeechRecognizer on macOS Catalina.
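For context, a minimal sketch of the kind of code involved (simplified from the linked answer; the file path and locale are placeholders, not my actual values) — it prints each segment's timestamp, which is where the resets show up:

```objc
#import <Speech/Speech.h>

// Minimal sketch: transcribe a file and log per-segment timing.
NSURL *fileURL = [NSURL fileURLWithPath:@"/path/to/audio.wav"];
SFSpeechRecognizer *recognizer =
    [[SFSpeechRecognizer alloc] initWithLocale:
        [NSLocale localeWithLocaleIdentifier:@"en-US"]];

SFSpeechURLRecognitionRequest *request =
    [[SFSpeechURLRecognitionRequest alloc] initWithURL:fileURL];
request.requiresOnDeviceRecognition = YES;
request.shouldReportPartialResults = NO;

[recognizer recognitionTaskWithRequest:request
                         resultHandler:^(SFSpeechRecognitionResult *result,
                                         NSError *error) {
    if (error) { NSLog(@"Error: %@", error); return; }
    for (SFTranscriptionSegment *segment in result.bestTranscription.segments) {
        // timestamp should be the offset from the start of the audio,
        // but in practice it resets to 0 at roughly one-minute boundaries.
        NSLog(@"%.2f (+%.2f): %@",
              segment.timestamp, segment.duration, segment.substring);
    }
}];
```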

user1573546
  • To add to that, I just noticed that at least at one point the timestamp starts at 0 not exactly after a minute but 13 seconds earlier, making the timestamp information completely worthless. Has anyone experienced this? I am using a simple wav file as input. – user1573546 Feb 04 '20 at 15:26
  • Over the entire duration of my file, roughly 8 minutes, this happens multiple times in a seemingly erratic way, i.e. clock resetting to 0 after a few seconds, producing completely unreliable transcription info. – user1573546 Feb 04 '20 at 15:30
  • Are you using on-device recognition? – TheNextman Feb 05 '20 at 04:47
  • Yes, I am. The code is essentially this with an added RunLoop: https://stackoverflow.com/questions/59920660/sfspeechrecognizer-on-macos-not-available-despite-successful-authorization/59977301#59977301. Thanks again for that hint! – user1573546 Feb 05 '20 at 10:34
  • Can you provide a sample file to try and reproduce the issue with? – TheNextman Feb 05 '20 at 16:43
  • Here's a Dropbox folder with the sound file and the transcription results exported as an XML file. You can see the timestamp jumps I'm experiencing in the file and compare that to results you get. https://www.dropbox.com/sh/2g5qozafz0keeap/AADg_DG_CCK4yV-QI2t187uba?dl=0 – user1573546 Feb 06 '20 at 08:03
  • Did you have any luck on this? We've just come across the same issue. – Allan Poole Feb 13 '20 at 04:36
  • No, unfortunately not. Haven't had time yet to file an official TSI. I am trying personal Apple contacts first but so far no luck. – user1573546 Feb 14 '20 at 10:21
  • Fair enough. If you're able to keep us updated on anything you hear, that'd be much appreciated. We'll do the same. We're currently experimenting with a workaround which is looking tentatively hopeful, but it is rather messy. – Allan Poole Feb 17 '20 at 03:07
  • Will do. Appreciate it if you'd do the same. I was thinking about a workaround with overlapping sub-one-minute segments (overlapping to deal with the issue of cutting off words), which requires some heuristics to get rid of the garbage it may produce around the cuts. Is that what you are doing? – user1573546 Feb 18 '20 at 08:06
  • Kind of. Basically we noticed that the problems started happening whenever there was a significant enough gap in speech that the recogniser output an utterance (collection of segments). Beyond the first utterance, all time stamp start times were essentially unusable (though duration was fine). Our workaround is to take advantage of the fact that the first utterance always seems to be accurate by buffering chunks of the file until we get our first utterance, changing the file offset to the end of the utterance, and repeating. The startTime of each utterance should equal the gap in speech. – Allan Poole Feb 20 '20 at 23:15
  • Just to clarify - we are doing this on iOS. So not sure if you'd have the same results on Mac OS. – Allan Poole Feb 21 '20 at 00:19
  • Any progress? Been dealing with the same issue for months. – Ryan Tremblay Mar 11 '20 at 04:12
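The restart-at-each-utterance approach described in the comments above can be sketched roughly like this (untested; the helper name `transcribeFromOffset` and the timestamp-reset heuristic are my own assumptions, based on the reset-to-zero behaviour reported in this thread):

```objc
#import <Speech/Speech.h>
#import <AVFoundation/AVFoundation.h>

// Rough sketch of the chunking workaround. Error handling, authorization,
// and task lifecycle are omitted. Idea: trust segment timestamps only until
// the first reset, then restart a fresh recognition request from the end of
// the trusted range, accumulating a running base offset.
static void transcribeFromOffset(NSURL *fileURL,
                                 AVAudioFramePosition startFrame,
                                 double baseSeconds,
                                 SFSpeechRecognizer *recognizer) {
    AVAudioFile *file = [[AVAudioFile alloc] initForReading:fileURL error:nil];
    if (!file || startFrame >= file.length) return;  // finished (or failed)
    file.framePosition = startFrame;
    double sampleRate = file.processingFormat.sampleRate;

    SFSpeechAudioBufferRecognitionRequest *request =
        [[SFSpeechAudioBufferRecognitionRequest alloc] init];
    request.shouldReportPartialResults = NO;

    // Feed the rest of the file in small buffers.
    AVAudioPCMBuffer *buffer =
        [[AVAudioPCMBuffer alloc] initWithPCMFormat:file.processingFormat
                                      frameCapacity:4096];
    while ([file readIntoBuffer:buffer error:nil] && buffer.frameLength > 0) {
        [request appendAudioPCMBuffer:buffer];
    }
    [request endAudio];

    [recognizer recognitionTaskWithRequest:request
                             resultHandler:^(SFSpeechRecognitionResult *result,
                                             NSError *error) {
        if (error || !result.final) return;
        NSArray<SFTranscriptionSegment *> *segs =
            result.bestTranscription.segments;
        double trustedEnd = 0;
        for (NSUInteger i = 0; i < segs.count; i++) {
            SFTranscriptionSegment *seg = segs[i];
            if (i > 0 && seg.timestamp < segs[i - 1].timestamp) break;  // reset
            // Corrected absolute position = base offset + in-chunk timestamp.
            NSLog(@"%8.2f: %@", baseSeconds + seg.timestamp, seg.substring);
            trustedEnd = seg.timestamp + seg.duration;
        }
        if (trustedEnd <= 0) return;  // nothing trustworthy; give up
        AVAudioFramePosition nextFrame =
            startFrame + (AVAudioFramePosition)(trustedEnd * sampleRate);
        transcribeFromOffset(fileURL, nextFrame,
                             baseSeconds + trustedEnd, recognizer);
    }];
}
```

Note this trades one long recognition task for several short ones, so any per-request startup latency multiplies with the number of pauses in the file.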

1 Answer


You are not checking the isFinal property of the SFSpeechRecognitionResult. From the documentation:

A Boolean value that indicates whether speech recognition is complete and whether the transcriptions are final.

Until the transcription is final, the same segment can arrive again with the timestamp back to 0. If you check your result, you will see you have a lot of repeated segments.

You need to modify your handler:

[speechRecognizer recognitionTaskWithRequest:urlRequest
                               resultHandler:^(SFSpeechRecognitionResult * _Nullable result,
                                               NSError * _Nullable error) {
    if (result.final && !error)
    {
        NSString *transcriptText = result.bestTranscription.formattedString;
        NSLog(@"Transcript: %@", transcriptText);
    }

    if (error) { /* handle the error */ }
}];
TheNextman
  • That doesn't seem to be the real problem. If I change the code to that, it never gets a final result, just some errors towards the end. Keeping the code the same and just varying the clip length, I come to the conclusion that anything beyond 1 minute leads to undefined behaviour. Or can you successfully transcribe the 8-minute audio file I sent you? – user1573546 Feb 07 '20 at 15:17
  • I did get some errors with on-device (203 - "Retry") which I found weird but didn't investigate further. When I commented out "requires on device", I successfully transcribed the full file with incrementing timestamps from 0 to ~440 – TheNextman Feb 07 '20 at 16:23
  • May also be related to the "partial results" property – TheNextman Feb 07 '20 at 16:24
  • I tried that as well (set requiresOnDeviceRecognition to false) and was able to transcribe the file once with timestamps going over 400 but all subsequent attempts cut off after exactly one minute. Weird. I am looking for an on-device solution anyway but so far it really seems nothing over a minute works reliably although the framework seems to be capable to do that in principle. – user1573546 Feb 07 '20 at 19:20
  • Strange indeed. Given Apple's recent track record and that it's a new API on 10.15, in your shoes I'd submit a TSI (technical support incident). Assuming you are a registered developer, you get 2 for free every year. – TheNextman Feb 07 '20 at 19:23
  • For a 2 1/2 minute file I consistently get just the last 43 seconds as a result. It all feels totally erratic for anything over 1 minute. shouldReportPartialResults is documented to do a different thing, but I tried setting it to YES anyway to see if it made a difference, and it didn't. NO makes more sense in my case anyway. Running out of ideas other than accepting that 1 minute is the de-facto limit. – user1573546 Feb 07 '20 at 19:23
  • Yes, maybe I will try that. Thanks for the tip and generally for your help. – user1573546 Feb 07 '20 at 19:24
  • The documentation does imply a soft limit of 1 minute as we discussed before, but it shouldn't really be relevant on a desktop machine with on-device recognition. You might just be bumping into a badly ported iOS API. Good luck! – TheNextman Feb 07 '20 at 19:46