6

Is it possible, programmatically, to take a sample of someone's voice and extract a unique tone/property that could be used to create synthesised speech?

For example, person A records himself. A unique tone is derived from this voice sample and turned into a synthetic voice. This would allow people to use that synthetic voice in text-to-speech software, writing any text they want and having it read in person A's voice.

Is this possible with today's technology? I know there are companies that do this professionally, but in general, is it possible for a piece of software to do this?

desertnaut
  • 57,590
  • 26
  • 140
  • 166
Travier
  • 205
  • 2
  • 9
  • http://en.wikipedia.org/wiki/Siri, http://en.wikipedia.org/wiki/Google_Now, etc... – ElGavilan Apr 08 '14 at 17:31
  • If I understand correctly what you ask then I'd answer "no". You cannot generate a "complete voice", thus a voice usable for arbitrary "words" from a single "tone". You need separate samples for _all_ sounds, typically at least for diphones or better triphones. So a full catalogue of sounds by each speaker. – arkascha Apr 08 '14 at 17:40
  • OK, Thank you very much, arkascha. I was just thinking that, just like every person has a unique fingerprint, maybe different voices are distinguishable by some kind of a property. And ElGavilan, Siri doesn't work like that. It uses narrations recorded by a real woman. – Travier Apr 08 '14 at 19:26
  • As already reported, "no", you cannot do that with a single tone, but you can do with just a few sentences. I am one of the founders of Mivoq (https://www.mivoq.it): our online voice creation service is fully automatic and works with just a few tens of sentences. What you can try with just a few sentences is to search a similar voice in a big voice database, as they do at VocalID (https://www.vocalid.co/how). – Giulio Paci Sep 09 '16 at 18:12
  • I’m voting to close this question because it is not about programming as defined in the [help]. – desertnaut Mar 13 '21 at 10:14

3 Answers

5

Using speaker adaptation methods you can achieve some results with comparatively few training samples, but you should still have at least a few hundred sentences from the person, preferably with phonetic transcriptions.

We once ran this as a small lab exercise in which students recorded their own voices and trained a voice model using HTS (http://hts.sp.nitech.ac.jp/). The simplest approach with HTS is to download the "Speaker dependent training demo" from that page and replace the training speech samples with your own recordings (of the same sentences!), as sketched below. We did this for another language with our own package, though.
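To make the "replace the samples" step concrete, here is a minimal Python sketch of staging your own recordings over the demo's training data. The directory layout, file naming and the assumption that audio is stored as per-prompt WAV files are illustrative guesses, not the demo's actual conventions; check the demo's README for the real layout.

```python
# Hypothetical sketch: overwrite the HTS "speaker dependent training demo"
# samples with your own recordings of the same prompts.
# Paths and naming are assumptions; adapt them to the demo's documented layout.
import shutil
from pathlib import Path

MY_RECORDINGS = Path("my_recordings")       # e.g. my_recordings/a0001.wav, ...
DEMO_WAV_DIR = Path("HTS-demo/data/wav")    # wherever the demo keeps its audio

def replace_demo_samples():
    for demo_wav in sorted(DEMO_WAV_DIR.glob("*.wav")):
        # One recording per prompt, matched by utterance ID (e.g. "a0001").
        utt_id = demo_wav.stem.split("_")[-1]
        mine = MY_RECORDINGS / f"{utt_id}.wav"
        if not mine.exists():
            raise FileNotFoundError(f"missing recording for prompt {utt_id}")
        shutil.copy(mine, demo_wav)          # overwrite the demo sample

if __name__ == "__main__":
    replace_demo_samples()
    print("Recordings staged; now run the demo's normal training recipe.")
```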

I think MaryTTS (http://mary.dfki.de/) has some more convenient tools to assist with this process, but I've never worked with it.

Still, for high-quality voices you should have thousands of recorded sentences.

Markus Toman
  • 151
  • 2
  • 6
0

In 2021 and beyond I suggest using mozilla/tts (https://github.com/mozilla/TTS), which is a good choice if you want to step in and use an existing, proven stack.
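The mozilla/tts codebase has since been continued by the community as Coqui TTS, which exposes a Python API and multi-speaker models that can condition on a short reference recording. The sketch below uses that Coqui package (`pip install TTS`) and its YourTTS model; the model name and call signature reflect my understanding of the Coqui API and may differ in your version, so treat this as an illustration rather than the exact mozilla/tts interface.

```python
# Sketch of zero-shot voice cloning with the Coqui TTS package
# (the community continuation of mozilla/tts). Install with: pip install TTS
from TTS.api import TTS

# YourTTS is a multilingual, multi-speaker model that accepts a short
# reference recording of the target speaker.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

tts.tts_to_file(
    text="Any text you want, read in something close to person A's voice.",
    speaker_wav="person_a_sample.wav",  # a few seconds of clean speech from person A
    language="en",
    file_path="cloned_output.wav",
)
```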

Jankapunkt
  • 8,128
  • 4
  • 30
  • 59
0

Seven years later, you can use your voice for text-to-speech:

Overdub: Ultra realistic text to speech voice cloning https://www.descript.com/overdub

There was a Bloomberg documentary about "Lyrebird", a neural network that learns your voice and can then speak new sentences in it. The Lyrebird team later joined Descript, which now offers this service along with non-linear editing for synthesized audio.

Link to the Bloomberg documentary on YouTube: https://www.youtube.com/watch?v=VnFC-s2nOtI

Tagon
  • 1
  • 1