
Some background: to build a unit selection voice, I followed the steps here: https://github.com/CSTR-Edinburgh/CSTR-Edinburgh.github.io/blob/master/_posts/2016-8-21-Multisyn_unit_selection.md and used a voice definition from here: https://raw.githubusercontent.com/CSTR-Edinburgh/merlin/master/egs/hybrid_synthesis/s1/voice_definition_files/unit_selection/cstr_us_awb_arctic_multisyn.scm. Unfortunately, the wavs were too noisy, so I ended up hand-labelling them and skipping the automatic labelling process.

The voice is ok now but still needs some work. One error that occurs constantly is that festival reports "Missing diphone" for any pause to phone transition, e.g.:

festival> (utt.relation.print (SayText "I can say anything I want.") 'Unit)
Missing diphone: #_ay
 diphone still missing, backing off: #_ay
 backed off: #_ay -> #_ax
 diphone still missing, backing off: #_ax
 backed off: #_ay -> #_#
 diphone still missing, backing off: #_#
 backed off: #_ay ->
Missing diphone: ey_eh
 Interword so inserting silence.
 diphone still missing, backing off: ey_#
 backed off: ey_eh -> ax_#
 diphone still missing, backing off: ax_#
 backed off: ey_eh -> #_#
 diphone still missing, backing off: #_#
 backed off: ey_eh ->
Missing diphone: #_eh
 diphone still missing, backing off: #_eh
 backed off: #_eh -> #_ax
 diphone still missing, backing off: #_ax
 backed off: #_eh -> #_#
 diphone still missing, backing off: #_#
 backed off: #_eh ->
Missing diphone: t_#
 diphone still missing, backing off: t_#
 backed off: t_# -> #_#
 diphone still missing, backing off: #_#
 backed off: t_# ->

I tried replacing sil and sp (from the automatic process) in the labels with pau and h# (in order to correspond with the silences used in festival/lib/radio_phones.scm), and I also tried replacing them with just # but this didn't change anything. The source wav/labs definitely contain the transitions above (e.g. several start with "I can") but festival never seems to use these.

How can I get festival to use the pause to phone transitions in the source data?

Thanks!

John Leonard
  • The problem is you do not have the phone `ey` in your database, not about #. You need to check which phones you actually have in the voice. It is hard to guess, but most likely the phoneset you used is different, probably because you picked the wrong lexicon initially. – Nikolay Shmyrev Jan 28 '18 at 09:07
  • Hey @NikolayShmyrev, thanks for the suggestion but I'm pretty sure `ey` is in the database, if I try saying "the way forward" the missing diphones it complains about are `#_dh` and `d_#` (at the start and end, `ey` isn't mentioned - and I hear it in the output). I thought you could choose the lexicon according to the subject dialect, I went with `cmulex` but I can see `do_final_alignment` asks if it's a valid lexicon as it's not `unilex-edi` or `unilex-rpx`, would you recommend a better one to use? – John Leonard Jan 28 '18 at 13:21
  • Check utt files in this voice http://www.festvox.org/packed/festival/1.96/festvox_cstr_us_awb_arctic_multisyn-1.0.tar.gz and compare with your files. For example arctic_a0001.utt, it has # symbols for silence inside utt file: `35 id _132 ; name # ; end 0.15 ; score 0 ; start F:standard+unisyn_start ; 36 id _133 ; name # ; end 0.45 ; score 0 ; start F:standard+unisyn_start ; 37 id _68 ; name # ; end 0.662 ; score 0 ; start F:standard+unisyn_start ; 40 id _25 ; name er ; end 0.968 ; score 0 ; start F:standard+unisyn_start ; ` Share your utt file. – Nikolay Shmyrev Jan 28 '18 at 15:28
  • Hmm... there are lots of differences [OneDrive link](https://1drv.ms/u/s!ApgrvsrlSAQghspdd0FsxY3xREQKyQ). Here's an excerpt: 43 id _86 ; name pau ; end 0.128 ; score 0 ; start F:standard+unisyn_start ; bad no_pms ; bad_dur outlier ; 44 id _26 ; name g ; end 0.152 ; score 0 ; start F:standard+unisyn_start ; cl_end 0.142 ; bad no_pms ; 45 id _27 ; name uh ; end 0.206 ; score 0 ; start F:standard+unisyn_start ; 46 id _28 ; name d ; end 0.234 ; score 0 ; start F:standard+unisyn_start ; cl_end 0.218 ; I might have to retrace my steps, the labels have `#` but it remains as `pau`... – John Leonard Jan 28 '18 at 23:05
  • Maybe you need to recreate utterances – Nikolay Shmyrev Jan 28 '18 at 23:29
  • I will try, thanks. One odd thing is that if I use `cmulex` when calling make_initial_phone_labs I get `sil` in the mlf. In festival I tried `(lex.list)` and it listed "cmu" not "cmulex" so I tried make_initial_phone_labs with `cmu` instead and the stdout was different but the mlf still contained `sil`. – John Leonard Jan 28 '18 at 23:47
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164069/discussion-between-john-leonard-and-nikolay-shmyrev). – John Leonard Jan 29 '18 at 08:23

1 Answer


When I ran a script based on the Multisyn unit selection recipe, the build_utts step was failing and skipping utterances, because the hand-labelled labels didn't exactly match the phones Festival would have predicted. For example, if the speaker had said "extreme" as eh k s ... but Festival predicted ih k s ..., the build_utts script would fail with an error like:

align missmatch at ih (0.000000) eh (2.810566)
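
Before re-running build_utts, a mismatch like this can be located by comparing the two phone sequences directly. The following is only a rough sketch in Python: `phone_mismatches` is a hypothetical helper, not part of Festival or the Multisyn tools, and it assumes you have already obtained both phone sequences as plain lists (Festival's predicted phones and your hand-labelled ones).

```python
from difflib import SequenceMatcher


def phone_mismatches(predicted, labelled):
    """Return (operation, predicted_phones, labelled_phones) for each region
    where the two phone sequences differ.

    Hypothetical helper: both arguments are plain lists of phone names that
    you have extracted yourself; nothing here calls Festival.
    """
    matcher = SequenceMatcher(a=predicted, b=labelled, autojunk=False)
    return [
        (op, predicted[i1:i2], labelled[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # keep only the regions that disagree
    ]


if __name__ == "__main__":
    # Illustrative sequences for the "extreme" case described above.
    predicted = ["#", "ih", "k", "s", "t", "r", "iy", "m", "#"]
    labelled = ["#", "eh", "k", "s", "t", "r", "iy", "m", "#"]
    for op, pred, lab in phone_mismatches(predicted, labelled):
        print(op, "predicted:", " ".join(pred), "labelled:", " ".join(lab))
```

Each reported region shows which label needs changing (or which pronunciation Festival disagrees with), which saves hunting for it from the timestamps in the error.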

I manually ran the build_utts script for each utterance and adjusted the labels accordingly. If, like me, you are foolish enough to try hand-labelling yourself, here are a couple of tips that helped me:

  • Consider removing any phone closures such as t_cl or d_cl, as these can really confuse the matching.
  • Make sure there is a pause (i.e. #) at the start and end of each utterance. The build_utts script won't complain if one is missing, but when running the voice in Festival you will get an error like:

            -=-=-=-=-=- EST Error -=-=-=-=-=-
            {FND} Feature end not defined
    
            -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
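
Both tips can be checked automatically before building the voice. This is a minimal sketch in Python, not part of Festival: `read_lab` and `check_labels` are hypothetical helpers, and the label layout is an assumption (an optional header ending in a line containing only `#`, then one `end_time colour label` entry per line). Adjust the parsing if your files differ.

```python
PAUSE = "#"  # silence symbol used by the multisyn voice in this thread


def read_lab(text):
    """Return the phone labels from label-file text.

    Assumed layout: optional header terminated by a line containing only
    "#", followed by lines of the form "end_time colour label".
    """
    lines = text.splitlines()
    header_end = [i for i, line in enumerate(lines) if line.strip() == "#"]
    if header_end:
        lines = lines[header_end[0] + 1:]  # skip the header
    phones = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            phones.append(fields[2])  # third field is the phone label
    return phones


def check_labels(phones):
    """Return a list of problems found in one utterance's phone labels."""
    problems = []
    if not phones or phones[0] != PAUSE:
        problems.append("no pause at start")
    if not phones or phones[-1] != PAUSE:
        problems.append("no pause at end")
    closures = [p for p in phones if p.endswith("_cl")]
    if closures:
        problems.append("closure labels present: " + " ".join(closures))
    return problems


if __name__ == "__main__":
    # Made-up example entry, not taken from a real label file.
    example = "separator ;\nnfields 1\n#\n0.15 26 #\n0.30 26 t_cl\n0.35 26 t\n0.50 26 ay\n"
    for problem in check_labels(read_lab(example)):
        print(problem)
```

Running something like this over every label file should flag the utterances that need attention before Festival trips over them at synthesis time.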
    

Thanks to @NikolayShmyrev for pointing me in the right direction. He also recommended using Ossian instead of Festival, since Ossian uses Python rather than Festival's fairly difficult code.

John Leonard