2

We have some proofread .srt files and want to generate audio from them with AWS Polly. According to the AWS Polly documentation, Polly accepts either plain text or SSML-enhanced text as input. Is there a way to convert an .srt file to SSML-enhanced text?

We want to use .srt files because they are proofread and they record "audio pausing" information in the file. For example:

1
00:00:04,960 --> 00:00:06,880
- [Instructor] Bacteria
are able to inhabit

2
00:00:06,880 --> 00:00:09,220
almost every environment on Earth,

3
00:00:09,500 --> 00:00:12,740
from desert tundra to
tropical rainforests.

There's a gap between 00:00:09,220 and 00:00:09,500; this is the "audio pausing" information we have.

AWS Polly references: https://docs.aws.amazon.com/polly/latest/dg/ssml-to-speech-console.html

If there's no existing way to convert .srt to SSML-enhanced text, should I parse the .srt file myself and generate SSML that Polly can understand?

Xiufeng Chen

2 Answers

2

I've created a Python script that does this: https://github.com/ThioJoe/SRT-To-SSML

It uses the duration attribute (for the prosody tag) and the break tag to theoretically keep the speech synced up with the original subtitles.
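
To illustrate the general idea (this is a minimal sketch of my own, not the code from the linked script), the snippet below parses SRT cues, emits a break tag for every gap between cues, and wraps each cue in a prosody tag using Polly's amazon:max-duration attribute. The function names and the MAX_BREAK_MS constant are hypothetical; the 10-second cap per break is my recollection of Polly's limit, so check it against the docs. Speaker labels such as "- [Instructor]" are not stripped and would be read aloud.

```python
import re
from xml.sax.saxutils import escape

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")
MAX_BREAK_MS = 10_000  # assumption: Polly caps a single <break> at 10 s


def to_ms(stamp):
    """Convert an SRT timestamp such as 00:00:04,960 to milliseconds."""
    h, m, s, ms = (int(x) for x in TIME.match(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Return a list of (start_ms, end_ms, text) tuples, one per cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = (to_ms(t) for t in lines[1].split("-->"))
        cues.append((start, end, " ".join(l.strip() for l in lines[2:])))
    return cues


def breaks(gap_ms):
    """Split one pause into as many <break> tags as the cap requires."""
    out = []
    while gap_ms > 0:
        chunk = min(gap_ms, MAX_BREAK_MS)
        out.append(f'<break time="{chunk}ms"/>')
        gap_ms -= chunk
    return out


def srt_to_ssml(text):
    """Build one SSML document: pauses for gaps, a time budget per cue."""
    parts, cursor = ["<speak>"], 0
    for start, end, words in parse_srt(text):
        parts.extend(breaks(start - cursor))
        parts.append(
            f'<prosody amazon:max-duration="{end - start}ms">'
            f"{escape(words)}</prosody>"
        )
        cursor = end
    parts.append("</speak>")
    return "\n".join(parts)
```

Note that the pauses are computed from the subtitle timestamps, not from how long the synthesized speech actually runs, so any cue that is spoken faster than its slot will pull everything after it forward.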

However there are still some limitations to keep in mind:

  • While there is an official/standard duration attribute that can be used with the prosody tag to specify exactly how long certain speech should last, it seems most services don't support it.
  • Amazon Polly's amazon:max-duration attribute will speed up speech to match that time, but will not slow it down, meaning it still might go out of sync with the original subtitles.

If using a single SSML file doesn't work, one 'brute force' method I can think of would be to generate each subtitle line as a separate audio file, then use something that can stretch or shrink each file to match the duration of its corresponding subtitle line. Then you'd have to add silence equal to the gap between the end of one subtitle line and the start of the next, and finally stitch it all together into one audio file. I'm not sure what tools would be needed for that, though.
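
The arithmetic behind that brute-force approach can be sketched as follows. This is only the planning step, under the assumption that you have already synthesized one clip per subtitle line and measured its length; plan_timeline and its field names are hypothetical, and no audio is actually processed here.

```python
def plan_timeline(cues, clip_ms):
    """Work out how to fit each synthesized clip into its subtitle slot.

    cues:    list of (start_ms, end_ms) taken from the .srt file
    clip_ms: measured length of each synthesized clip, in the same order
    """
    plan, cursor = [], 0
    for (start, end), actual in zip(cues, clip_ms):
        plan.append({
            # silence to insert before this clip (gap since the previous cue ended)
            "silence_before_ms": max(start - cursor, 0),
            # tempo factor: above 1 means speed the clip up, below 1 means slow it down
            "tempo": actual / (end - start),
        })
        cursor = end
    return plan
```

The tempo factor is the kind of value a time-stretching tool (for example ffmpeg's atempo filter) takes as input; the silence values say how much padding to place in front of each clip before concatenating.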

ThioJoe
1

If your end goal is SRT (video subtitles) to audio via Amazon Polly, I'm guessing it's theoretically possible, but SSML isn't really made for this job: you can't guarantee the timing of multiple lines (start, stop, pauses, etc.) well enough for the result to be acceptable when paired with video. You may need to:

  1. Separate each line into its own request/job.
  2. Use the <prosody amazon:max-duration> tag. Calculate the max-duration by subtracting the start time of the current line from the start time of the next line.
  3. Perform audio assembly by merging the audio clips and placing each one at its start time.
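
A rough sketch of how those steps might be prepared is below. It only builds the per-line SSML and records where each clip should be placed; sending the requests to Polly and mixing the audio are left out. The per_line_jobs function and its dictionary keys are hypothetical names, and using the last line's own end time as its budget is my assumption, since there is no next line to measure against.

```python
from xml.sax.saxutils import escape


def per_line_jobs(cues):
    """Build one SSML request per subtitle line.

    cues: list of (start_ms, end_ms, text), sorted by start time.
    """
    jobs = []
    for i, (start, end, text) in enumerate(cues):
        # Time budget: until the next line starts (assumption: the last
        # line falls back to its own end time).
        next_start = cues[i + 1][0] if i + 1 < len(cues) else end
        limit = next_start - start
        ssml = (
            f'<speak><prosody amazon:max-duration="{limit}ms">'
            f"{escape(text)}</prosody></speak>"
        )
        # offset_ms is where the resulting clip goes during assembly
        jobs.append({"offset_ms": start, "ssml": ssml})
    return jobs
```

Because max-duration only speeds speech up, a clip may still come back shorter than its budget; placing every clip at its own offset_ms keeps later lines from drifting when that happens.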

Anyway, if you didn't use Polly and have a FOSS-ish solution for SRT to audio, I'd like to hear it.

junh1024
  • SSML is specifically designed for speech synthesis therefore it is the ideal solution for the job – Miguel Sánchez Villafán Jan 14 '22 at 12:04
  • @MiguelSánchezVillafán, please read their whole question carefully. SSML is bad with timed pauses, which is the entire reason for OP's question (I also edited my answer to clarify this). I suggested workarounds. Do you have any other ideas? – junh1024 Jan 16 '22 at 05:12
  • What do you mean by "bad with timed pauses"? Polly and SSML in general have the break tag available https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#break-tag – Miguel Sánchez Villafán Jan 28 '22 at 17:36
  • @MiguelSánchezVillafán ah, yes it does (now). SRT is actually a video subtitle format, so sync is of paramount importance. I don't think Polly/SSML's current features are suitable for video subtitle > audio yet. I've further edited my answer to clarify, but my overall verdict doesn't change. – junh1024 Jan 30 '22 at 09:26
  • I see, I already knew SRT files were meant for subtitles, but your edits made it clear that SSML timings may not be respected by the Speech Synthesis Engine – Miguel Sánchez Villafán Feb 01 '22 at 13:00