General Information
Introduction
TTS-Cubed is a collection of speech synthesis tools and modules that extends and closely integrates the currently used Festival, Flite and FreeTTS speech synthesis systems. TTS-Cubed is not a stand-alone system, but rather adds functionality to the other systems.
The goal of TTS-Cubed is to provide better mechanisms for building voices in Festival, using the Festvox toolkit, and for converting them to a format suitable for use in Flite and FreeTTS. We find Festival to be a very good system for voice development and testing, but Flite and FreeTTS are better suited to voice deployment and integration into other systems, hence the need for TTS-Cubed. TTS-Cubed also provides a conduit for our group to implement our research and release it, in the hope that it may be useful to other researchers and users of speech synthesis systems.
This distribution of TTS-Cubed includes:
- MultiDiphone - a general purpose diphone unit selection engine based on MultiSyn
  - implemented in Festival, Flite and FreeTTS
  - conversion tools to use Festival voices in Flite and FreeTTS
- Support for using Praat for labeling purposes
- Documentation
- Preliminary voices in the following languages:
  - Afrikaans
  - isiZulu
Authors
- Todo:
- authors or developers?
The principal developers of TTS-Cubed are:
- Aby Louw
- Gerrit Botha
You can contact us through the TTS-Cubed Forums.
License
The following BSD-style license applies to TTS-Cubed:
HLT Research Group
Meraka Institute & University of Pretoria
Copyright (c) 2006
All Rights Reserved
Permission is hereby granted, free of charge, to use and distribute this
software and its documentation without restriction, including without
limitation the rights to use, copy, modify, merge, publish, distribute,
sub license, and/or sell copies of this work, and to permit persons to
whom this work is furnished to do so, subject to the following conditions:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Any modifications must be clearly marked as such.
* Original authors' names are not deleted.
* Neither the name of the Meraka Institute nor the name of the University
of Pretoria nor the names of its contributors may be used to endorse or
promote products derived from this software without specific prior
written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS NAMELY
THE MERAKA INSTITUTE, THE UNIVERSITY OF PRETORIA, AND THE CONTRIBUTORS TO
THIS WORK "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Requirements
Since TTS-Cubed is not a stand-alone program, but rather adds functionality to the Festival, Flite and FreeTTS systems, the requirements for TTS-Cubed are basically the same as for those systems. See the documentation of each system for its specific requirements. TTS-Cubed was built using the following versions of the respective packages:
- Edinburgh Speech Tools (version 1.2.96)
- Festival (version 1.96)
- Festvox (version 2.1)
- Flite (version 1.3)
- FreeTTS (version 1.2)
Although older versions may work, we have not yet tested them. GCC version 3.2 was used as there are known issues compiling Edinburgh Speech Tools and Festival with the latest version of GCC. See the Festival mailing list archive for compilation issues.
Edinburgh Speech Tools, Festival and Festvox are available from http://www.cstr.ed.ac.uk/projects/festival/ or http://festvox.org/latest/ while the Flite and FreeTTS source must also be downloaded.
It is required that you compile your own versions of the above from source code as you will need the libraries and include files to build some programs and voices.
Some of the TTS-Cubed scripts require Praat and Python. The versions tested were:
- Praat (version 4.4.26)
- Python (version 2.4.3)
Praat is available from http://www.praat.org and Python from http://www.python.org
A basic knowledge of Festival, Festvox and of speech processing in general will be required to build TTS-Cubed voices, as well as patience and understanding. To quote from the Festvox README file:
"Building a new voice is a lot of work, and something will probably
go wrong which may require the repetition of some long boring and
tedious process. Even with lots of care a new voice still might
just not work."
Contributions
- Todo:
- bugzilla?
- Todo:
- contact mail
- A BSD-style license as in TTS-Cubed, Festival, Festvox, Flite and FreeTTS.
- You may place your own copyright in your source files.
Acknowledgements
- Todo:
- acknowledgements?, this is copy from FreeTTS page
Installation
This is a basic walk-through of the installation process for each of the required packages. Note that this was only tested on a few of our systems, so if you encounter problems, refer to the respective package's documentation. This installation procedure assumes the use of GCC 3.2 with the binary named gcc32.
It is recommended to create a new directory and install all these packages in this directory.
TTS-Cubed Dependencies
Edinburgh Speech Tools
- Note:
- The termcap library is needed to compile the Edinburgh Speech Tools package.
-
Unpack the archive speech_tools-1.2.96-beta.tar.gz
tar -xzvf speech_tools-1.2.96-beta.tar.gz
-
Go to the directory and run the configure script
cd speech_tools
./configure
-
Use your favourite editor to edit the config file to tell speech tools to use GCC-3.2
emacs config/config
and at the section
## Compiler.
## The definitions are in compilers/$(COMPILER).mak
## Examples: gcc suncc egcs gcc28
COMPILER=gcc
change the compiler definition to use GCC 3.2
COMPILER=gcc32
save and exit. Also edit the compiler specific definition
emacs config/compilers/gcc32.mak
and change the lines
ifndef GCC32
GCC32 = gcc
endif
CC= $(GCC32)
CXX = g++
to the following
ifndef GCC32
GCC32 = gcc32
endif
CC= $(GCC32)
CXX = g++32
save and exit.
-
Now you can build the system by running make
make
and test it with
make test
-
If the tests are successful then you can set the environment variables by editing your .bashrc file
emacs ~/.bashrc
and adding the ESTDIR environment variable, pointing to the path of your installation
export ESTDIR=/home/aby/speech/speech_tools
Save and exit, then load the new environment variables
source ~/.bashrc
Festival
-
Unpack the archive festival-1.96-beta.tar.gz
tar -xzvf festival-1.96-beta.tar.gz
-
Go to the directory and run the configure script
cd festival
./configure
- Note:
- Festival searches for the Edinburgh Speech Tools installation and inherits most of its configuration, thus you don't need to change any config files.
-
Now you can build the system by running make
make
-
Set the environment variables by editing your .bashrc file
emacs ~/.bashrc
and adding the FESTIVALDIR environment variable, pointing to the path of your installation
export FESTIVALDIR=/home/aby/speech/festival
Save and exit, then load the new environment variables
source ~/.bashrc
It is useful to add the Festival binary to your path so that you can call it from anywhere.
Festival Extras
The extra Festival packages (there are more)
- festlex_CMU.tar.gz
- festlex_POSLEX.tar.gz
- festvox_cmu_us_awb_arctic_hts.tar.gz
- festvox_cmu_us_bdl_arctic_hts.tar.gz
- festvox_cmu_us_jmk_arctic_hts.tar.gz
- festvox_cmu_us_slt_arctic_hts.tar.gz
- festvox_kallpc16k.tar.gz
can all just be unpacked in the Festival root directory
cd festival
cd ..
tar -xzvf PACKAGE.tar.gz
. . .
These packages provide Festival with lexicons, HTS voices and a diphone voice. No compilation is necessary.
Festvox
-
Unpack the archive festvox-2.1-current.tar.gz
tar -xzvf festvox-2.1-current.tar.gz
-
Go to the directory and run the configure script
cd festvox
./configure
- Note:
- Festvox searches for the Edinburgh Speech Tools installation and inherits most of its configuration, thus you don't need to change any config files.
-
Now you can build the system by running make
make
-
Set the environment variables by editing your .bashrc file
emacs ~/.bashrc
and adding the FESTVOXDIR environment variable, pointing to the path of your installation
export FESTVOXDIR=/home/aby/speech/festvox
Save and exit, then load the new environment variables
source ~/.bashrc
Flite
-
Unpack the archive flite-1.3-release.tar.gz
tar -xzvf flite-1.3-release.tar.gz
-
Go to the directory and run the configure script
cd flite-1.3-release
./configure
-
Now you can build the system by running make
make
-
Set the environment variables by editing your .bashrc file
emacs ~/.bashrc
and adding the FLITEDIR environment variable, pointing to the path of your installation
export FLITEDIR=/home/aby/speech/flite-1.3-release
Save and exit, then load the new environment variables
source ~/.bashrc
- Todo:
- FreeTTS
TTS-Cubed
TTS-Cubed Tools
-
Unpack the archive tts3-tools-0.9-current.tar.gz
tar -xzvf tts3-tools-0.9-current.tar.gz
-
Go to the directory and run the configure script
cd tts3_tools
./configure
- Note:
- TTS-Cubed Tools searches for the Edinburgh Speech Tools installation and inherits most of its configuration, thus you don't need to edit any config files.
-
Now you can build the system by running make
make
-
Set the environment variables by editing your .bashrc file
emacs ~/.bashrc
and adding the TTS3TOOLS environment variable, pointing to the path of your installation
export TTS3TOOLS=/home/aby/speech/tts3_tools
Save and exit, then load the new environment variables
source ~/.bashrc
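Since every later step relies on these environment variables, it can help to check them before building a voice. The following is a minimal sketch and not part of the TTS-Cubed distribution: the function name check_environment is our own, and the only assumption is that each variable should point to an existing directory.

```python
import os

# Variable names are the ones exported in the installation steps above.
REQUIRED_VARIABLES = ["ESTDIR", "FESTIVALDIR", "FESTVOXDIR", "FLITEDIR", "TTS3TOOLS"]


def check_environment(env=None, is_dir=os.path.isdir):
    """Return a dict mapping each problematic variable to a short description.

    Hypothetical helper: 'env' and 'is_dir' are injectable only so that the
    check can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = {}
    for name in REQUIRED_VARIABLES:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not is_dir(value):
            problems[name] = "not a directory: " + value
    return problems


if __name__ == "__main__":
    for name, problem in sorted(check_environment().items()):
        print(name + ": " + problem)
```

Running the script after source ~/.bashrc prints one line per variable that is missing or that does not point to a directory, and prints nothing when all five are in place.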
TTS-Cubed Festvox
-
Go to your Festvox installation directory, copy the TTS-Cubed Festvox package here and unpack the archive tts3-festvox-0.9-current.tar.gz
cd festvox
cp /home/aby/downloads/tts3-festvox-0.9-current.tar.gz .
tar -xzvf tts3-festvox-0.9-current.tar.gz
-
Edit the make files to add the TTS-Cubed Festvox files in the make procedure
emacs src/Makefile
and after the line
ALL_DIRS= ...
add the following
TTS3_DIRS = multidiphone
ALL_DIRS += $(TTS3_DIRS)
so that it looks something like
ALL_DIRS= db_example intonation duration unitsel \
          ldom vox_diphone vox_files prosody st lts hts_build $(BUILD_DIRS)
TTS3_DIRS = multidiphone
ALL_DIRS += $(TTS3_DIRS)
save and exit. Edit the vox_files directory make file and add the same lines so that it looks something like
ALL_DIRS= us general uk prompts
TTS3_DIRS = multidiphone
ALL_DIRS += $(TTS3_DIRS)
save and exit. You don't need to rebuild the system as the TTS-Cubed Festvox package currently only adds scripts to Festvox, thus the above editing is not strictly necessary but we may add executables in the future.
TTS-Cubed Festival
-
Go to your Festival installation directory, copy the TTS-Cubed Festival package here and unpack the archive tts3-festival-0.9-current.tar.gz
cd festival
cp /home/aby/downloads/tts3-festival-0.9-current.tar.gz .
tar -xzvf tts3-festival-0.9-current.tar.gz
-
Edit the festival config file to add the TTS-Cubed Festival modules
emacs config/config
edit the end of the config file so that it looks something like
## Old diphone code that will be delete, left in only for some
## compatibility
# ALSO_INCLUDE += diphone
## Other (non-Edinburgh) modules may also be specified here (e.g. OGI code),
ALSO_INCLUDE += FT_Vox MultiDiphone
save and exit. Now rebuild Festival
make
- Todo:
- TTS-Cubed Flite and FreeTTS
Building Voices With TTS-Cubed
This section of the documentation serves as an addendum to the Festvox documentation, Building Synthetic Voices. The TTS-Cubed voice building process is an extension of the Festvox methods, so we only describe the parts of the voice building process that are specific to TTS-Cubed style voices.
It is recommended to create a directory dedicated to the voices built with TTS-Cubed, e.g. tts_cubed_voices or something similar.
MultiDiphone Synthesis
MultiDiphone synthesis is similar to MultiSyn synthesis with some implementation changes. The main reason for developing MultiDiphone is to have a code base, independent of other institutions, to which we can apply our research. MultiDiphone synthesis is concatenative, with diphones as the basic units. Unlike traditional diphone synthesis, where the database holds only one example of each diphone type, the prerecorded database contains multiple instances of each diphone type. This eliminates the need to prosodically modify the units, thereby preserving naturalness.
During synthesis a target utterance specification is predicted by various models specified by the voice and the specific language. The units (candidates) in the prerecorded database that best fit the target context are then selected for concatenation. Unit selection is based on a target cost and a join cost. The target cost is calculated by a function that penalises candidates for not fitting certain linguistic contexts of the target utterance, while the join cost measures the acoustic mismatch at the join between two candidates. The sequence of candidates with the smallest total cost is then selected for synthesis.
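The selection step can be illustrated with a small dynamic programming (Viterbi style) search. This is a sketch of the general technique only, not the MultiDiphone implementation: the function select_units, the dictionary keys and the toy cost functions are all hypothetical, and every target is assumed to have at least one candidate.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (total_cost, path) for the cheapest candidate sequence.

    candidates[i] is the list of database units available for targets[i].
    """
    best = []  # best[i][j] = (cheapest cost ending in candidate j, back-pointer)
    for i, target in enumerate(targets):
        row = []
        for cand in candidates[i]:
            cost = target_cost(target, cand)
            back = -1
            if i > 0:
                # Cheapest way to reach this candidate from the previous target.
                join, back = min(
                    (best[i - 1][k][0] + join_cost(prev, cand), k)
                    for k, prev in enumerate(candidates[i - 1])
                )
                cost += join
            row.append((cost, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path


# Toy costs (assumptions for illustration only).
def demo_target_cost(target, cand):
    return 0.0 if target["stress"] == cand["stress"] else 1.0


def demo_join_cost(left, right):
    # Units that were adjacent in the recordings join at no cost.
    if left["utt"] == right["utt"] and right["pos"] == left["pos"] + 1:
        return 0.0
    return abs(left["f0"] - right["f0"]) / 100.0


demo_targets = [{"name": "h_e", "stress": 0}, {"name": "e_l", "stress": 1}]
demo_candidates = [
    [{"id": "a1", "utt": "u1", "pos": 3, "stress": 0, "f0": 120.0},
     {"id": "a2", "utt": "u2", "pos": 7, "stress": 1, "f0": 110.0}],
    [{"id": "b1", "utt": "u1", "pos": 4, "stress": 1, "f0": 125.0},
     {"id": "b2", "utt": "u3", "pos": 1, "stress": 1, "f0": 180.0}],
]
demo_total, demo_path = select_units(
    demo_targets, demo_candidates, demo_target_cost, demo_join_cost)
```

In this toy example the search picks a1 followed by b1: both match the target stress, and they were adjacent in the same recording, so the total cost is zero.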
Currently no prosodic information is used during the target cost calculations. We try to capture the prosody of the target utterance in the context functions of the target cost. Even though this method leads to surprisingly good results, we hope to include some form of prosodic modelling, and specifically intonation modelling, as some of our official languages are tone languages.
Text Selection And Recording
Probably the most important thing to consider when building a voice is the text that is selected for recording. These recordings will then be the database of the voice, containing the candidate units mentioned above.
- Todo:
- link to arctic papers and something more.
We use Audacity for the recordings and record continuously. If the voice artist makes a mistake we just ask them to repeat the prompt, only stopping when the voice artist needs a break. We found that the recordings are much more fluent and that it takes less time than stopping and starting the whole time. It does, however, significantly increase the amount of time needed to edit the recordings into their specific prompts.
Recordings are usually done with a sample rate of 16 kHz. Audacity can also be used to normalise the prompts that have been recorded.
Voice Setup
Go to your voices directory and make a new directory for the voice that you want to build.
cd tts_cubed_voices
mkdir zuluvoice
cd zuluvoice
Run the TTS-Cubed Festvox script to create the directories and copy the relevant scripts necessary for the voice building process. This script takes input in the same form as the other Festvox voice building scripts, INST LANG VOX, where:
- INST is the institute building the language, e.g. cmu, cstr, ogi. If there isn't an appropriate institute, use net.
- LANG is the language, e.g. us, fr etc.
- VOX is the speaker/style identifier, e.g. kal, awb, golem.
To run the script
$FESTVOXDIR/src/multidiphone/setup_multidiphone INST LANG VOX
- Todo:
- emu label link
The script creates the following directories:
bin - Contains various scripts used during the voice building process.
cep - Used during labeling to save the voice cepstrum files.
dic - Not used in MultiDiphone synthesis, created by Festvox general scripts.
emu - Used for viewing labels and pitchmarks with EMU Label. We use Praat.
etc - General configuration files.
export - Data exported to a format suitable for use in TTS-Cubed Flite & FreeTTS modules.
f0 - The fundamental frequency files of the waveforms.
festival - Contains utterance structures.
festvox - The voice and language specification scripts and definitions.
group - Not used in MultiDiphone synthesis, created by Festvox general scripts.
lab - Segment (Phone) label files.
lar - Not used in MultiDiphone synthesis, created by Festvox general scripts.
lpc - Short-time-signals calculated from the wave files, used during synthesis.
mcep - Mel-Cepstrum coefficients, used as part of the join coefficients for join cost calculation.
phr - Not used in MultiDiphone synthesis, created by Festvox general scripts.
pm - Pitchmarks of the short-time-signals, calculated from wave files using Praat.
pm_lab - Not used in MultiDiphone synthesis, created by Festvox general scripts.
pm_praat_filled - Praat format pitchmarks filled with the voice default fundamental frequency.
pm_praat_unfilled - Praat format pitchmarks.
prompt-cep - Used during labeling to save the "closest voice" cepstrum files.
prompt-lab - Segment (Phone) label files of "closest voice", see labeling.
prompt-utt - Utterance structures from "closest voice", see labeling.
prompt-wav - Synthesised prompts from "closest voice", see labeling.
recording - Not used in MultiDiphone synthesis, created by Festvox general scripts.
scratch - Not used in MultiDiphone synthesis, created by Festvox general scripts.
syl - Not used in MultiDiphone synthesis, created by Festvox general scripts.
textgrid - Praat format labels, can contain more than just segment label files.
versions - Not used in MultiDiphone synthesis, created by Festvox general scripts.
wav - Database recordings.
wrd - Not used in MultiDiphone synthesis, created by Festvox general scripts.
Concatenation Costs
The files ./festvox/INST_LANG_VOX_target_cost.scm and ./festvox/INST_LANG_VOX_join_cost.scm define the concatenation cost functions used for the voice.
In ./festvox/INST_LANG_VOX_target_cost.scm the variable tc_subFunctions defines the specific target cost functions and associated weights. These functions take as input the target diphone from the target utterance and a candidate diphone from the database.
The default target cost functions are:
- tc_stress
- Compares stress on any vowel which forms part of the diphone.
- tc_sylPos
- Compares diphone position in syllabic structure.
- tc_wordPos
- Compares diphone position in word structure.
- tc_phrasePos
- Compares diphone position in phrase structure.
- tc_POS
- Compares part-of-speech of diphone parent word.
- tc_leftContext
- Compares left phonemic context of diphone.
- tc_rightContext
- Compares right phonemic context of diphone.
- tc_segmentScore
- Compares the diphone segment score (from ASR alignment)
- tc_badDuration
- Compares diphone duration flags (if the parent phone has bad pitchmarking, then the flag is set)
- tc_scoreNumberSyls
- Compares the number of syllables in diphone word.
- tc_sylWordPos
- Compares the diphone's syllable position in word structure.
You can comment out those functions that you do not want to use. The higher the weight of a specific function the more that mismatch gets penalised.
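To make the role of the weights concrete, the target cost can be thought of as a weighted sum of the mismatches reported by the individual functions. The sketch below is an illustration in Python, not the Scheme code in the voice files: the feature keys, the weights and the plain (unnormalised) sum are all assumptions.

```python
# Each sub-function returns 0.0 for a match and 1.0 for a mismatch.
def tc_stress(target, candidate):
    return 0.0 if target["stress"] == candidate["stress"] else 1.0


def tc_left_context(target, candidate):
    return 0.0 if target["left"] == candidate["left"] else 1.0


def tc_right_context(target, candidate):
    return 0.0 if target["right"] == candidate["right"] else 1.0


# Hypothetical weights, playing the role of tc_subFunctions in the voice files.
# Removing an entry here corresponds to commenting out a function.
TC_SUB_FUNCTIONS = [
    (tc_stress, 10.0),
    (tc_left_context, 4.0),
    (tc_right_context, 3.0),
]


def target_cost(target, candidate, sub_functions=TC_SUB_FUNCTIONS):
    """Weighted sum of mismatches between a target and a candidate diphone."""
    return sum(weight * func(target, candidate) for func, weight in sub_functions)


demo_target = {"stress": 1, "left": "p", "right": "t"}
demo_candidate = {"stress": 0, "left": "p", "right": "k"}
demo_cost = target_cost(demo_target, demo_candidate)
```

Here the candidate mismatches on stress and on right context, so it is penalised by 10.0 + 3.0 = 13.0, while a candidate that matches in every respect costs 0.0.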
In ./festvox/INST_LANG_VOX_join_cost.scm the variable jc_subFunctions defines the specific join cost functions and associated weights. These functions take as input two diphones and calculate what the cost would be of concatenating them. Two diphones that are adjacent in the original recorded database will return a cost of zero.
The default join cost functions are:
- jc_f0Distance
- The absolute difference between the F0 values of the join point.
- jc_powerDistance
- Absolute difference between the mel-cepstrum energy values of the join point.
- jc_spectralDistance
- The Euclidean distance between the mel-cepstrum spectral vectors of the join point.
As with the target costs, these weights can be adjusted to suit your needs.
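The three join cost functions can be sketched in the same way. Again this is an illustration rather than the code in the voice files: the data layout (a unit with "start" and "end" frames holding already normalised join coefficients) and the weights are assumptions.

```python
import math


def jc_f0_distance(left_frame, right_frame):
    return abs(left_frame["f0"] - right_frame["f0"])


def jc_power_distance(left_frame, right_frame):
    return abs(left_frame["energy"] - right_frame["energy"])


def jc_spectral_distance(left_frame, right_frame):
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(left_frame["mcep"], right_frame["mcep"])))


# Hypothetical weights, playing the role of jc_subFunctions.
JC_SUB_FUNCTIONS = [
    (jc_f0_distance, 1.0),
    (jc_power_distance, 2.0),
    (jc_spectral_distance, 1.0),
]


def join_cost(left, right, sub_functions=JC_SUB_FUNCTIONS):
    """Cost of concatenating unit 'left' with unit 'right'."""
    # Units that follow each other in the same recording join perfectly.
    if left["utt"] == right["utt"] and right["index"] == left["index"] + 1:
        return 0.0
    # Otherwise compare the end of the left unit with the start of the right.
    return sum(weight * func(left["end"], right["start"])
               for func, weight in sub_functions)


frame_a = {"f0": 2.0, "energy": 0.5, "mcep": [0.0, 0.0]}
frame_b = {"f0": 1.0, "energy": 0.25, "mcep": [3.0, 4.0]}
unit_left = {"utt": "bm_001", "index": 4, "start": frame_a, "end": frame_a}
unit_far = {"utt": "bm_002", "index": 9, "start": frame_b, "end": frame_b}
unit_next = {"utt": "bm_001", "index": 5, "start": frame_b, "end": frame_b}
```

With these toy values, joining unit_left to unit_far costs 1.0 + 2.0 * 0.25 + 5.0 = 6.5, while joining unit_left to unit_next costs zero because the two units were adjacent in the recording.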
Festvox Files
To create a voice in a new language (one without existing Festvox definitions) one needs to define the following (from Building Synthetic Voices):
- Phone set
- Token processing rules (not strictly required, as long as input text does not consist of numbers etc.)
- Prosodic phrasing method (also not strictly required)
- Word pronunciation (lexicon and/or letter-to-sound rules)
- Intonation (not required in MultiDiphone synthesis)
- Durations (not required in MultiDiphone synthesis)
Thus, the most important things to define for a voice to work are a phone set and a lexicon and/or letter-to-sound (LTS) rules. Refer to Building Synthetic Voices for information regarding these definitions.
Since duration data is not actually used in the unit selection process, after you have defined a phone set you can use the script ./bin/make_dummy_durdata to build a dummy duration statistics file for your phone set.
./bin/make_dummy_durdata
this will create the ./festvox/INST_LANG_VOX_durdata.scm file.
- Todo:
- dictionary maker here and gzeros
./festvox/INST_LANG_VOX_lts_rewrites.scm , but if you like you can use the CART method and convert it to a Flite compilable C source.
The MultiDiphone lexicon file ( ./festvox/INST_LANG_VOX_lexicon.scm ) differs slightly from the normal Festvox one, in that MultiDiphone defines the addenda in a separate file ( ./festvox/INST_LANG_VOX_addenda.scm ).
You must create the file ./etc/utts.data that contains the recorded prompts, with the following format
( bm_001 "isimo sokubonakala bobubanzi bezwe singamameter." )
( bm_002 "kodwa izindawo zaseflorida bezinethiwe." )
( bm_003 "kanjalo nasezingxenye ezisenyakatho nentshonalanga." )
.
.
.
where bm_001.wav in the wav directory is the name of the wave file associated with the utterance bm_001.
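Because the rest of the build depends on every entry in ./etc/utts.data having a matching wave file, it can be useful to read the file back and list the expected wave file names. The following is a minimal sketch, not part of TTS-Cubed: parse_utts_data and wav_path are hypothetical helpers, and the regular expression assumes one double-quoted prompt per entry, with backslash escapes allowed inside the quotes.

```python
import re

# One entry looks like: ( bm_001 "prompt text." )
UTT_ENTRY = re.compile(r'\(\s*(\S+)\s+"((?:[^"\\]|\\.)*)"\s*\)')


def parse_utts_data(text):
    """Return a list of (utterance_id, prompt_text) pairs."""
    return [(m.group(1), m.group(2)) for m in UTT_ENTRY.finditer(text)]


def wav_path(utt_id, wav_dir="wav"):
    """Name of the wave file expected for an utterance id."""
    return wav_dir + "/" + utt_id + ".wav"


SAMPLE = '''( bm_001 "isimo sokubonakala bobubanzi bezwe singamameter." )
( bm_002 "kodwa izindawo zaseflorida bezinethiwe." )
'''

if __name__ == "__main__":
    for utt_id, prompt in parse_utts_data(SAMPLE):
        print(wav_path(utt_id), prompt)
```

In a real voice directory the text would be read from etc/utts.data instead of SAMPLE, and each returned path could be checked with os.path.exists before labeling starts.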
TTS-Cubed currently supports a very simple diphone backoff mechanism. A backoff list, which defines a 1 to 1 phone replacement, is used during synthesis to replace any diphones that are not available in the diphone database.
For example, if diphone b_eh is not found in the database, a replacement for either b or eh is looked for in the backoff list, starting from the top. If the backoff list defines p as a replacement for b, the synthesiser will search for the diphone p_eh, continuing in the backoff list if p_eh is also not available (this time looking for replacements for p_eh ). Thus, there may be a halfphone mismatch in synthesis, but we are assured of a result.
The backoff list must be defined in the file ./festvox/INST_LANG_VOX_diphone_backoff.scm . To ensure that this mechanism always succeeds all diphones must be backed off to the silence phone. The script bin/test_simple_backoff can be used to test this.
If you defined your silence phone as "pau" then
./bin/test_simple_backoff "pau"
will test to see if the backoff mechanism will succeed with the defined backoff list.
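The backoff search described above can be sketched as follows. This is our reading of the mechanism expressed in Python, not the Festival module itself: the function back_off, the list layout and the phone names are illustrative, and phone names are assumed not to contain an underscore.

```python
def back_off(diphone, database, backoff_list):
    """Return an available replacement diphone, or None if backoff fails.

    backoff_list is an ordered list of (phone, replacement) pairs and is
    always searched from the top.
    """
    left, right = diphone.split("_")
    tried = set()
    while True:
        name = left + "_" + right
        if name in database:
            return name
        if name in tried:
            return None  # the rules loop without reaching a known diphone
        tried.add(name)
        for phone, replacement in backoff_list:
            if left == phone:
                left = replacement
                break
            if right == phone:
                right = replacement
                break
        else:
            return None  # no rule applies to either phone


demo_database = {"p_eh", "pau_pau"}
demo_backoff = [("b", "p"), ("p", "pau"), ("eh", "pau"), ("ae", "pau")]
```

With this list, b_eh backs off to p_eh in one step, while b_ae is replaced by p_ae, then pau_ae and finally pau_pau. Without the last rule the search for b_ae would fail, which is why every phone should eventually back off to the silence phone.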
The file ./festvox/INST_LANG_VOX_multidiphone.scm defines the actual voice and is Festival's entry point to the voice. Various parameters are defined in the file, and can be changed as needed. See the comments in the file for information on specific parameters.
Labeling
Festvox provides two techniques for labeling the recorded prompts, namely dynamic time warping alignment and Baum-Welch training to build complete ASR acoustic models. For voices in new languages the second technique is preferred, as there is no synthetic voice to align with. However, if the recorded database is small then the acoustic models aren't very good. Also, when the recorded database is small it is feasible to hand correct misaligned labels.
The file ./festvox/INST_LANG_VOX_phone_conversions.scm provides a cross-phoneset phone mapping scheme. Here you can define a 1 to 1 relationship between a new language's phoneset and an available diphone voice's phoneset. The idea is that the mapping is from a phone of the new language's phoneset to the acoustically closest match in the "closest voice" phoneset (closest voice is defined in ./festvox/INST_LANG_VOX_multidiphone.scm ). By synthesising the prompts with the diphone voice and then replacing the phones with the phones of the new language, dynamic time warping alignment can be performed.
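The mapping idea can be sketched in a few lines. The phone names and the label layout below are invented for illustration, and the real conversions are defined in Scheme in the file named above; the point is only that a 1 to 1 mapping preserves the number and order of phones, so labels produced with the closest voice can be renamed position by position.

```python
def to_closest_voice(phones, conversions):
    """Map new-language phones to phones of the closest voice."""
    missing = sorted({p for p in phones if p not in conversions})
    if missing:
        raise KeyError("no mapping for: " + ", ".join(missing))
    return [conversions[p] for p in phones]


def relabel(closest_labels, original_phones):
    """Put the new-language phone names back onto (end_time, phone) labels."""
    if len(closest_labels) != len(original_phones):
        raise ValueError("label count does not match phone count")
    return [(end, phone)
            for (end, _), phone in zip(closest_labels, original_phones)]


# Hypothetical mapping and labels.
demo_conversions = {"hl": "l", "a": "aa"}
demo_phones = ["hl", "a"]
demo_closest = to_closest_voice(demo_phones, demo_conversions)
demo_labels = relabel([(0.10, "l"), (0.25, "aa")], demo_phones)
```

Here demo_closest holds the phones the closest voice would synthesise, and demo_labels holds the same segment end times with the new language's phone names restored.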
After defining the mapping in ./festvox/INST_LANG_VOX_phone_conversions.scm one can build the prompts with
festival -b festvox/build_multidiphone.scm '(build_prompts "etc/utts.data")'
this will synthesise the prompts defined in ./etc/utts.data with the "closest voice".
Now the two sets of prompts can be aligned with the Festvox script
./bin/make_labs prompt-wav/*.wav
which will generate the label files in ./lab . These label files can be converted to a Praat format (TextGrids) for viewing and hand correction with
./bin/make_lab_textgrid lab/*.lab
Open Praat and load the textgrids in the directory ./textgrid , as well as their associated wave files in ./wav . Select a textgrid and its wave file and select edit to view and edit these files. If you have changed the textgrids they need to be converted back to the Festival "lab" format
./bin/make_textgrid_lab textgrid/*.TextGrid
which will overwrite the original label files.
Building The Voice
Pitchmarks
Getting good pitchmarks is important to the quality of the synthesis, as the short-time-signals used for synthesis are calculated around the pitchmarks. The file ./etc/pitchmarks.defs contains the parameters that specify the fundamental frequency range of the speaker. These must be edited according to your voice artist's range. The typical range for male speakers is 80 Hz to 200 Hz with a default of 100 Hz, and for female speakers 120 Hz to 300 Hz with a default of 200 Hz.
Praat can be used to view a few recordings and to get minimum and maximum pitch values to get a general idea of what the range of the speaker is.
After defining the pitch range of the speaker in the file ./etc/pitchmarks.defs the pitchmarks can be calculated with the script make_pm_praat. The actual pitchmark calculation is done by Praat.
./bin/make_pm_praat wav/*.wav
The pitchmarks are created in the ./pm directory and Praat viewable pitchmarks in ./pm_praat_filled and ./pm_praat_unfilled. The filled pitchmarks directory has pitchmarks added in unvoiced regions at the default pitch. This is necessary as Festival does not distinguish between voiced and unvoiced regions during synthesis. You can view the pitchmarks in ./pm_praat_filled and ./pm_praat_unfilled by loading them into Praat with their associated waveforms.
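The idea behind the filled pitchmarks can be sketched as follows. This is an illustration only and not the rule used by make_pm_praat: we assume that any gap longer than one period of the minimum F0 is unvoiced, and we space the added marks evenly at roughly one period of the default F0. The function name and parameters are hypothetical.

```python
def fill_pitchmarks(marks, end_time, default_f0=100.0, min_f0=80.0):
    """Return pitchmark times (seconds) with unvoiced gaps filled in."""
    period = 1.0 / default_f0            # spacing of the added marks
    longest_voiced_gap = 1.0 / min_f0    # longer gaps are treated as unvoiced
    filled = []
    previous = 0.0
    # end_time is appended as a sentinel so that the tail is filled too.
    for mark in sorted(marks) + [end_time]:
        gap = mark - previous
        if gap > longest_voiced_gap:
            steps = max(1, round(gap / period))
            for k in range(1, steps):
                filled.append(previous + k * gap / steps)
        filled.append(mark)
        previous = mark
    filled.pop()  # drop the sentinel again
    return filled


demo_marks = [0.02, 0.03, 0.08]
demo_filled = fill_pitchmarks(demo_marks, end_time=0.09)
```

For the demo input, one mark is added before the first pitchmark and four are added in the 50 ms gap between 0.03 s and 0.08 s, giving eight marks spaced about 10 ms apart.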
F0
TTS-Cubed uses Praat to calculate the F0 values, which are used in the join cost calculations. The F0's can be calculated by (be patient, this takes a while)
./bin/make_f0_praat wav/*.wav
which creates the F0 files in ./f0 .
LPC Coefficients And Residuals
The LPC coefficients and residuals are calculated at the pitchmark (filled) time points. These are the short-time-signals used in the RELP synthesis. The LPC's and residuals are calculated by the script make_lpc.
./bin/make_lpc wav/*.wav
This script dumps the lpc's and residuals in the ./lpc directory.
Join Coefficients
The join coefficients are calculated at the pitchmark times. They consist of a 12th order mel-cepstrum, its energy and the F0 value at this pitchmark. These values are normalised for the entire database for each channel of the join coefficient and are calculated with
./bin/make_joincoefs wav/*.wav
which will create the mcep's in ./mcep and the normalised join coefficients in ./join
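Per-channel normalisation over the whole database can be sketched as below. The exact scheme used by make_joincoefs is not described here, so zero mean and unit variance per channel is an assumption, as are the function name and the list-of-lists layout (one row per pitchmark, one column per channel).

```python
import math


def normalise_channels(frames):
    """Scale every channel to zero mean and unit variance over all frames."""
    count = len(frames)
    channels = len(frames[0])
    means = [sum(frame[c] for frame in frames) / count for c in range(channels)]
    # A constant channel has zero deviation; use 1.0 to avoid dividing by zero.
    deviations = [
        math.sqrt(sum((frame[c] - means[c]) ** 2 for frame in frames) / count) or 1.0
        for c in range(channels)
    ]
    return [
        [(frame[c] - means[c]) / deviations[c] for c in range(channels)]
        for frame in frames
    ]


# Two frames with two channels of very different magnitude.
demo_frames = [[1.0, 100.0], [3.0, 300.0]]
demo_normalised = normalise_channels(demo_frames)
```

After normalisation both channels of the demo data take the values -1.0 and 1.0, so a channel measured in large units, such as F0 in Hz, no longer dominates the join cost simply because of its scale.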
Utterance Building
Finally the utterances of the voice database can be built
festival -b festvox/build_multidiphone.scm '(build_utts "etc/utts.data")'
This script will dump the utterances in ./festival/utts
Now the voice can be loaded into Festival and used for synthesis
festival festvox/INST_LANG_VOX_multidiphone.scm
Festival Speech Synthesis System 1.96:beta July 2004
Copyright (C) University of Edinburgh, 1996-2004. All rights reserved.
For details type `(festival_warranty)'
festival> (voice_INST_LANG_VOX_multidiphone)
Please wait: Initialising MultiDiphoneVoice "INST_LANG_VOX_multidiphone".
Voice loaded successfully!
#<vox 0x9b55400>
festival> (SayText "hello world")
#<Utterance 0xb71d5278>
festival> (exit)
The voice has different verbosity levels that can be set depending on the required information. The levels are:
- Processing utterance (during loading Festival prints out which utterance it is processing while building the diphone database)
- Ignored phones with bad flags (the number of phones ignored because of "bad" flags that are set)
- Chosen diphone info (Festival prints out which diphones were selected for synthesis and from what utterances they came)
- adding diphone (during loading Festival prints out the diphone it is processing while building the diphone database)
- viterbi info (during synthesis Festival will print out information regarding the viterbi search in unit selection, if the pruning is turned on)
- backoff (if diphone backoff was performed the information will be printed out)
- export info (during voice exporting any information regarding the exporting process will be printed out)
- backoff info (if diphone backoff is being performed all the information will be printed out)
The default level is 0 (set in ./festvox/INST_LANG_VOX_multidiphone.scm ), at which no extra information is printed out. The level can be set before running the voice by changing '(debug_level 0) in that file, or after voice loading (command (voice_INST_LANG_VOX_multidiphone)) in Festival with the command (vox.set_verbosity 2).
