TTS-Cubed


General Information

Introduction

TTS-Cubed is a collection of speech synthesis tools and modules that extends and closely integrates the currently used Festival, Flite and FreeTTS speech synthesis systems. TTS-Cubed is not a stand-alone system, but rather adds functionality to the other systems.

The goal of TTS-Cubed is to provide better mechanisms for building voices in Festival, using the Festvox toolkit, and for converting them to a format suitable for use in Flite and FreeTTS. We find Festival to be a very good system for voice development and testing, but Flite and FreeTTS are better suited to voice deployment and integration into other systems; hence the need for TTS-Cubed. TTS-Cubed also gives our group a way to implement our research in a form that can be released, in the hope that it may be useful to other researchers and users of speech synthesis systems.

This distribution of TTS-Cubed includes:

Authors

Todo:
authors or developers?
TTS-Cubed was built by the HLT Research Group of the Meraka Institute and the University of Pretoria.

The principal developers of TTS-Cubed are:

You can contact us through the TTS-Cubed Forums.

License

The following BSD-style license applies to TTS-Cubed:

                             HLT Research Group                                    
                 Meraka Institute & University of Pretoria                         
                            Copyright (c) 2006                                     
                            All Rights Reserved                                    
                                                                                  
  Permission is hereby granted, free of charge, to use and distribute this        
  software and its documentation without restriction, including without           
  limitation the rights to use, copy, modify, merge, publish, distribute,         
  sub license, and/or sell copies of this work, and to permit persons to          
  whom this work is furnished to do so, subject to the following conditions:      
                                                                                  
   * Redistributions of source code must retain the above copyright notice,       
     this list of conditions and the following disclaimer.                        
   * Any modifications must be clearly marked as such.                            
   * Original authors' names are not deleted.                                     
   * Neither the name of the Meraka Institute nor the name of the University      
     of Pretoria nor the names of its contributors may be used to endorse or      
     promote products derived from this software without specific prior           
     written permission.                                                          
                                                                                  
   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS NAMELY     
   THE MERAKA INSTITUTE, THE UNIVERSITY OF PRETORIA, AND THE CONTRIBUTORS TO      
   THIS WORK "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING, BUT NOT     
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A        
   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER       
   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,       
   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,            
   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;    
   OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,       
   WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR        
   OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF         
   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.                                     

Requirements

Since TTS-Cubed is not a stand-alone program, but rather adds functionality to the Festival, Flite and FreeTTS systems, its requirements are essentially the same as those of these systems. See the documentation of each system for its specific requirements. TTS-Cubed was built using the following versions of the respective packages:

Although older versions may work, we have not yet tested them. GCC version 3.2 was used as there are known issues compiling Edinburgh Speech Tools and Festival with the latest version of GCC. See the Festival mailing list archive for compilation issues.

Edinburgh Speech Tools, Festival and Festvox are available from http://www.cstr.ed.ac.uk/projects/festival/ or http://festvox.org/latest/ while the Flite and FreeTTS source must also be downloaded.

It is required that you compile your own versions of the above from source code as you will need the libraries and include files to build some programs and voices.

Some of the TTS-Cubed scripts require Praat and Python. The versions tested were:

Praat is available from http://www.praat.org and Python from http://www.python.org

A basic knowledge of Festival, Festvox and speech processing in general is required to build TTS-Cubed voices, as well as patience and understanding. To quote from the Festvox README file:

"Building a new voice is a lot of work, and something will probably
go wrong which may require the repetition of some long boring and
tedious process. Even with lots of care a new voice still might
just not work."

Contributions

Todo:
bugzilla?
Todo:
contact mail
We encourage contributions to TTS-Cubed. If you have fixes, code or suggestions, please contact us at tts-cubed-contacts@sourceforge.net. To contribute code you must agree to these terms:

Acknowledgements

Todo:
acknowledgements?, this is copy from FreeTTS page
Refer to acknowledgements to see the list of people and organisations we would like to thank for making this project possible. Most of all, we thank our management for letting us do this, and Alan Black and Kevin Lenzo for doing Flite.


Installation

This is a basic walk-through of the installation process for each of the required packages. Note that it was only tested on a few of our systems, so if you encounter problems, refer to the respective package's documentation. This installation procedure assumes the use of GCC-3.2 with the binary named gcc32.

It is recommended to create a new directory and install all these packages in this directory.

TTS-Cubed Dependencies

Edinburgh Speech Tools

Note:
The termcap library is needed to compile the Edinburgh Speech Tools package.
  1. Unpack the archive speech_tools-1.2.96-beta.tar.gz

    	tar -xzvf speech_tools-1.2.96-beta.tar.gz
    

  2. Go to the directory and run the configure script

    	cd speech_tools
    	./configure
    

  3. Use your favourite editor to edit the config file to tell speech tools to use GCC-3.2

    	emacs config/config
    

    and at the section

    	## Compiler.
    	## The definitions are in compilers/$(COMPILER).mak
    	## Examples: gcc suncc egcs gcc28
    
    	COMPILER=gcc
    

    change the compiler definition to use GCC 3.2

    	COMPILER=gcc32
    

    save and exit. Also edit the compiler specific definition

    	emacs config/compilers/gcc32.mak
    

    and change the lines

    	ifndef GCC32
    	    GCC32 = gcc
    	endif
    
    	CC= $(GCC32)
    	CXX = g++
    

    to the following

    	ifndef GCC32
    	    GCC32 = gcc32
    	endif
    
    	CC= $(GCC32)
    	CXX = g++32
    

    save and exit.

  4. Now you can build the system by running make

    	make
    

    and test it with

    	make test
    

  5. If the tests are successful then you can set the environment variables by editing your .bashrc file

    	emacs ~/.bashrc
    

    and add the ESTDIR environment variable, pointing to the path of your installation

    	export ESTDIR=/home/aby/speech/speech_tools	
    

    save and exit, now set the environment variables

    	source ~/.bashrc	
    

Festival

  1. Unpack the archive festival-1.96-beta.tar.gz

    	tar -xzvf festival-1.96-beta.tar.gz
    

  2. Go to the directory and run the configure script

    	cd festival
    	./configure
    

    Note:
    Festival searches for the Edinburgh Speech Tools installation and inherits most of its configuration, thus you don't need to change any config files.

  3. Now you can build the system by running make

    	make
    

  4. Set the environment variables by editing your .bashrc file

    	emacs ~/.bashrc
    

    and add the FESTIVALDIR environment variable, pointing to the path of your installation

    	export FESTIVALDIR=/home/aby/speech/festival	
    

    save and exit, now set the environment variables

    	source ~/.bashrc	
    

    It is useful to add the Festival binary to your path so that you can call it from anywhere.

Festival Extras

The extra Festival packages (there are more)

can all just be unpacked in the directory that contains the Festival root directory

	cd festival
	cd ..
	tar -xzvf PACKAGE.tar.gz
	.
	.
	.

These packages provide Festival with lexicons, HTS voices and diphone voices. No compilation is necessary.

Festvox

  1. Unpack the archive festvox-2.1-current.tar.gz

    	tar -xzvf festvox-2.1-current.tar.gz
    

  2. Go to the directory and run the configure script

    	cd festvox
    	./configure
    

    Note:
    Festvox searches for the Edinburgh Speech Tools installation and inherits most of its configuration, thus you don't need to change any config files.

  3. Now you can build the system by running make

    	make
    

  4. Set the environment variables by editing your .bashrc file

    	emacs ~/.bashrc
    

    and add the FESTVOXDIR environment variable, pointing to the path of your installation

    	export FESTVOXDIR=/home/aby/speech/festvox	
    

    save and exit, now set the environment variables

    	source ~/.bashrc	
    

Flite

  1. Unpack the archive flite-1.3-release.tar.gz

    	tar -xzvf flite-1.3-release.tar.gz
    

  2. Go to the directory and run the configure script

    	cd flite-1.3-release
    	./configure
    

  3. Now you can build the system by running make

    	make
    

  4. Set the environment variables by editing your .bashrc file

    	emacs ~/.bashrc
    

    and add the FLITEDIR environment variable, pointing to the path of your installation

    	export FLITEDIR=/home/aby/speech/flite-1.3-release	
    

    save and exit, now set the environment variables

    	source ~/.bashrc	
    

Todo:
FreeTTS

TTS-Cubed

TTS-Cubed Tools

  1. Unpack the archive tts3-tools-0.9-current.tar.gz

    	tar -xzvf tts3-tools-0.9-current.tar.gz
    

  2. Go to the directory and run the configure script

    	cd tts3_tools
    	./configure
    

    Note:
    TTS-Cubed Tools searches for the Edinburgh Speech Tools installation and inherits most of its configuration, thus you don't need to edit any config files.

  3. Now you can build the system by running make

    	make
    

  4. Set the environment variables by editing your .bashrc file

    	emacs ~/.bashrc
    

    and add the TTS3TOOLS environment variable, pointing to the path of your installation

    	export TTS3TOOLS=/home/aby/speech/tts3_tools	
    

    save and exit, now set the environment variables

    	source ~/.bashrc	
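
Once all the packages above are installed, it can be useful to confirm that the environment variables are in place before building voices. The following is a minimal Python sketch, not part of TTS-Cubed; the function name missing_settings is hypothetical, and the variable names are the ones used in the installation steps above.

```python
import os
from typing import List, Mapping

# Environment variables set during the installation steps above.
REQUIRED = ("ESTDIR", "FESTIVALDIR", "FESTVOXDIR", "FLITEDIR", "TTS3TOOLS")


def missing_settings(environ: Mapping[str, str]) -> List[str]:
    """Return one message per variable that is unset or not a directory."""
    problems = []
    for name in REQUIRED:
        path = environ.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


if __name__ == "__main__":
    # Print any problems found in the current shell environment.
    for problem in missing_settings(os.environ):
        print(problem)
```

Run it in a new shell after sourcing ~/.bashrc; no output means that all five variables point to existing directories.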
    

TTS-Cubed Festvox

  1. Go to your Festvox installation directory, copy the TTS-Cubed Festvox package here and unpack the archive tts3-festvox-0.9-current.tar.gz

    	cd festvox
    	cp /home/aby/downloads/tts3-festvox-0.9-current.tar.gz .
    	tar -xzvf tts3-festvox-0.9-current.tar.gz
    

  2. Edit the make files to add the TTS-Cubed Festvox files in the make procedure

    	emacs src/Makefile
    

    and after the line ALL_DIRS= ... add the following

    	TTS3_DIRS = multidiphone
    	ALL_DIRS += $(TTS3_DIRS)
    

    so that it looks something like

    	ALL_DIRS= db_example intonation duration unitsel \
            	  ldom vox_diphone vox_files prosody st lts hts_build $(BUILD_DIRS)
    	TTS3_DIRS = multidiphone
    	ALL_DIRS += $(TTS3_DIRS)
    

    save and exit. Edit the make file in the vox_files directory and add the same lines so that it looks something like

    	ALL_DIRS= us general uk prompts
    	TTS3_DIRS = multidiphone
    	ALL_DIRS += $(TTS3_DIRS)
    

    save and exit. You don't need to rebuild the system as the TTS-Cubed Festvox package currently only adds scripts to Festvox, thus the above editing is not strictly necessary but we may add executables in the future.

TTS-Cubed Festival

  1. Go to your Festival installation directory, copy the TTS-Cubed Festival package here and unpack the archive tts3-festival-0.9-current.tar.gz

    	cd festival
    	cp /home/aby/downloads/tts3-festival-0.9-current.tar.gz .
    	tar -xzvf tts3-festival-0.9-current.tar.gz
    

  2. Edit the festival config file to add the TTS-Cubed Festival modules

    	emacs config/config
    

    edit the end of the config file so that it looks something like

    	## Old diphone code that will be delete, left in only for some
    	## compatibility
    	# ALSO_INCLUDE += diphone
    
    	## Other (non-Edinburgh) modules may also be specified here (e.g. OGI code),
    
    	ALSO_INCLUDE += FT_Vox MultiDiphone
    

    save and exit. Now rebuild Festival

    	make
    

Todo:
TTS-Cubed Flite and FreeTTS

Building Voices With TTS-Cubed

This section of the documentation serves as an addendum to the Festvox documentation, Building Synthetic Voices. The TTS-Cubed voice building process is an extension of the Festvox methods, so we only describe those parts of the process that are specific to TTS-Cubed style voices.

It is recommended to create a directory dedicated to the voices built with TTS-Cubed, e.g. tts_cubed_voices or something similar.

MultiDiphone Synthesis

MultiDiphone synthesis is similar to MultiSyn synthesis, with some implementation changes. The main reason for doing MultiDiphone is to have a code base, independent from other institutions, to which we can apply our research. MultiDiphone synthesis is concatenative, with diphones as the basic units. However, the prerecorded database contains multiple instances of each diphone type, whereas in traditional diphone synthesis the database holds only one example of each diphone type. This eliminates the need to prosodically modify the units, thereby preserving naturalness.

During synthesis a target utterance specification is predicted by various models specified by the voice and the specific language. Then the units (candidates) in the prerecorded database that best fit the target context are selected for concatenation. The unit selection is based on a targetcost and a joincost. The targetcost is calculated by a function that penalises the candidates for not fitting certain linguistic contexts of the target utterance, while the joincost measures the acoustic mismatch at the join between two candidates. The sequence of candidates with the smallest total cost is then selected for synthesis.
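
The selection described above can be sketched as a dynamic programming (Viterbi) search. This Python sketch only illustrates the idea and is not the TTS-Cubed implementation: the function select_units, the candidate identifiers, and the two demo cost functions are hypothetical, and it assumes that every target diphone has at least one candidate.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical candidate id: (utterance name, diphone position in that utterance).
Unit = Tuple[str, int]


def select_units(
    targets: Sequence[str],
    candidates: Dict[str, List[Unit]],
    target_cost: Callable[[str, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[float, List[Unit]]:
    """Return (total cost, candidate sequence) with the smallest total cost."""
    # best[i][u] = (cheapest cost of any path ending in u at position i, predecessor)
    best: List[Dict[Unit, Tuple[float, Optional[Unit]]]] = [
        {u: (target_cost(targets[0], u), None) for u in candidates[targets[0]]}
    ]
    for i in range(1, len(targets)):
        column = {}
        for u in candidates[targets[i]]:
            # Cheapest way to reach u from any candidate of the previous target.
            prev, cost = min(
                ((p, c + join_cost(p, u)) for p, (c, _) in best[-1].items()),
                key=lambda pc: pc[1],
            )
            column[u] = (cost + target_cost(targets[i], u), prev)
        best.append(column)
    # Pick the cheapest final candidate and follow the predecessors back.
    last, (total, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
    path = [last]
    for i in range(len(targets) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return total, path


def demo_target_cost(target: str, unit: Unit) -> float:
    return 0.0  # placeholder: a real function would compare linguistic contexts


def demo_join_cost(left: Unit, right: Unit) -> float:
    # Zero when the two units were adjacent in the same recording.
    return 0.0 if left[0] == right[0] and right[1] == left[1] + 1 else 1.0


demo_candidates = {
    "pau_s": [("bm_001", 0), ("bm_002", 0)],
    "s_a": [("bm_002", 1)],
    "a_pau": [("bm_001", 7), ("bm_002", 2)],
}
total, path = select_units(
    ["pau_s", "s_a", "a_pau"], demo_candidates, demo_target_cost, demo_join_cost
)
```

With these demo costs the search prefers the three units that were recorded consecutively in bm_002, since joining them costs nothing.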

Currently no prosodic information is used during the target cost calculations. We try to capture the prosody of the target utterance in the context functions of the target cost. Even though this method leads to surprisingly good results, we hope to include some form of prosodic modelling, specifically intonation modelling, as some of our official languages are tone languages.

Text Selection And Recording

Probably the most important thing to consider when building a voice is the text that is selected for recording. These recordings will then be the database of the voice, containing the candidate units mentioned above.

Todo:
link to arctic papers and something more.
Our experience in recording suggests that a (very) quiet room and a laptop to do the recordings suffice for a good quality voice. It is preferable that the voice artist talks in a monotone voice. This is especially important if the amount of recorded material is small, otherwise units with a large difference in intonation might be selected for concatenation. This sounds very bad!

We use Audacity for the recordings and record continuously. If the voice artist makes a mistake we just ask them to repeat the prompt, only stopping when the voice artist needs a break. We found that the recordings are much more fluent this way, and that it takes less time than repeatedly stopping and starting. It does, however, significantly increase the time needed to edit the recordings into their specific prompts.

Recordings are usually done with a sample rate of 16 kHz. Audacity can also be used to normalise the prompts that have been recorded.

Voice Setup

Go to your voices directory and make a new directory for the voice that you want to build.

	cd tts_cubed_voices
	mkdir zuluvoice
	cd zuluvoice

Run the TTS-Cubed Festvox script to create the directories and copy the relevant scripts necessary for the voice building process. This script takes input in the same form as the other Festvox voice building scripts, INST LANG VOX, where:

To run the script

	$FESTVOXDIR/src/multidiphone/setup_multidiphone INST LANG VOX

Todo:
emu label link
This script creates the following directories and copies the relevant voice building scripts into them:

Concatenation Costs

The files ./festvox/INST_LANG_VOX_target_cost.scm and ./festvox/INST_LANG_VOX_join_cost.scm define the concatenation cost functions used for the voice.

In ./festvox/INST_LANG_VOX_target_cost.scm the variable tc_subFunctions defines the specific target cost functions and associated weights. These functions take as input the target diphone from the target utterance and a candidate diphone from the database.

The default target cost functions are:

You can comment out those functions that you do not want to use. The higher the weight of a specific function the more that mismatch gets penalised.

In ./festvox/INST_LANG_VOX_join_cost.scm the variable jc_subFunctions defines the specific join cost functions and associated weights. These functions take as input two diphones and calculate what the cost would be of concatenating them. Two diphones that are adjacent in the original recorded database will return a cost of zero.

The default join cost functions are:

As with the target costs, these weights can be adjusted to suit your needs.
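
To make the role of the weights concrete, here is a small Python sketch of a cost built from weighted sub-functions. The real cost functions are defined in Scheme in the files named above; everything here (the sub-function names, the feature names, the weights, and the assumption that the weighted penalties are simply summed) is hypothetical and for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Diphone = Dict[str, object]  # hypothetical bundle of features for one diphone
SubFunction = Callable[[Diphone, Diphone], float]


def weighted_cost(sub_functions: List[Tuple[SubFunction, float]],
                  a: Diphone, b: Diphone) -> float:
    """Sum every sub-function's penalty multiplied by its weight."""
    return sum(weight * function(a, b) for function, weight in sub_functions)


def stress_mismatch(target: Diphone, candidate: Diphone) -> float:
    return 0.0 if target["stress"] == candidate["stress"] else 1.0


def position_mismatch(target: Diphone, candidate: Diphone) -> float:
    return 0.0 if target["position"] == candidate["position"] else 1.0


def f0_mismatch(left: Diphone, right: Diphone) -> float:
    return abs(left["f0"] - right["f0"])


def join_cost(sub_functions: List[Tuple[SubFunction, float]],
              left: Diphone, right: Diphone) -> float:
    """Zero for units adjacent in the recorded database, weighted sum otherwise."""
    if left["utt"] == right["utt"] and right["index"] == left["index"] + 1:
        return 0.0
    return weighted_cost(sub_functions, left, right)


# Removing an entry disables that sub-function; a larger weight penalises
# that kind of mismatch more heavily.
tc_sub_functions = [(stress_mismatch, 10.0), (position_mismatch, 5.0)]
jc_sub_functions = [(f0_mismatch, 2.0)]

target = {"stress": 1, "position": "initial"}
candidate = {"stress": 0, "position": "initial"}
tc = weighted_cost(tc_sub_functions, target, candidate)  # only stress differs

left = {"utt": "bm_001", "index": 4, "f0": 0.5}
adjacent = {"utt": "bm_001", "index": 5, "f0": 0.25}
elsewhere = {"utt": "bm_002", "index": 9, "f0": 0.25}
jc_adjacent = join_cost(jc_sub_functions, left, adjacent)
jc_elsewhere = join_cost(jc_sub_functions, left, elsewhere)
```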

Festvox Files

To create a voice in a new language (non existing Festvox definitions) one needs to define the following (from Building Synthetic Voices):

Thus, the most important things to define for a voice to work are a phone set and a lexicon and/or lts rules. Refer to Building Synthetic Voices for information regarding these definitions.

After you have defined a phone set you can use the script ./bin/make_dummy_durdata to build a duration data statistics file of your phone set, as the duration data is not actually used in the unit selection process.

	./bin/make_dummy_durdata

this will create the ./festvox/INST_LANG_VOX_durdata.scm file.

Todo:
dictionary maker here and gzeros
TTS-Cubed Flite currently only supports loading letter-to-sound (lts) rules of the lts rewrites type (see Building Synthetic Voices section "Building letter-to-sound rules by hand"), which you can define in ./festvox/INST_LANG_VOX_lts_rewrites.scm . If you prefer, you can use the CART method instead and convert the result to a Flite compilable C source.

The MultiDiphone lexicon file ( ./festvox/INST_LANG_VOX_lexicon.scm ) differs slightly from the normal Festvox one, in that MultiDiphone defines the addenda in a separate file ( ./festvox/INST_LANG_VOX_addenda.scm ).

You must create the file ./etc/utts.data that contains the recorded prompts, with the following format

	( bm_001 "isimo sokubonakala bobubanzi bezwe singamameter." )
	( bm_002 "kodwa izindawo zaseflorida bezinethiwe." )
	( bm_003 "kanjalo nasezingxenye ezisenyakatho nentshonalanga." )
        .
        .
        .

where bm_001.wav in the wav directory is the name of the wave file associated with the utterance bm_001.
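
Because every later step depends on this file, it can help to check its format before continuing. The following Python sketch is not part of TTS-Cubed; the function name parse_utts_data is hypothetical, and it assumes one prompt per line with no escaped double quotes inside the prompt text.

```python
import re
from typing import Dict

# One prompt per line: ( name "prompt text" )
PROMPT_RE = re.compile(r'^\s*\(\s*(\S+)\s+"(.*)"\s*\)\s*$')


def parse_utts_data(text: str) -> Dict[str, str]:
    """Map each utterance name to its prompt text, rejecting malformed lines."""
    prompts: Dict[str, str] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        match = PROMPT_RE.match(line)
        if match is None:
            raise ValueError(f'line {line_number}: expected ( name "text" )')
        name, prompt = match.groups()
        if name in prompts:
            raise ValueError(f"line {line_number}: duplicate name {name}")
        prompts[name] = prompt
    return prompts


sample = (
    '( bm_001 "isimo sokubonakala bobubanzi bezwe singamameter." )\n'
    '( bm_002 "kodwa izindawo zaseflorida bezinethiwe." )\n'
)
prompts = parse_utts_data(sample)
# Each utterance name corresponds to a wave file of the same name in ./wav
wav_names = [name + ".wav" for name in prompts]
```

The list wav_names can then be compared with the contents of the wav directory to find prompts without recordings.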

TTS-Cubed currently supports a very simple diphone backoff mechanism. A backoff list, which defines a 1 to 1 phone replacement, is used during synthesis to replace any diphones that are not available in the diphone database.

For example, if diphone b_eh is not found in the database, a replacement for either b or eh is looked for in the backoff list, starting from the top. If the backoff list defines p as a replacement for b, the synthesiser will search for the diphone p_eh , and continue in the backoff list if p_eh is also not available (this time looking for replacements for p_eh ). Thus, there may be a halfphone mismatch in synthesis, but we are assured of a result.

The backoff list must be defined in the file ./festvox/INST_LANG_VOX_diphone_backoff.scm . To ensure that this mechanism always succeeds all diphones must be backed off to the silence phone. The script bin/test_simple_backoff can be used to test this.

If you defined your silence phone as "pau" then

	./bin/test_simple_backoff "pau"

will test to see if the backoff mechanism will succeed with the defined backoff list.
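
The backoff procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the TTS-Cubed code: the function back_off and the example rules are hypothetical, the left phone is tried before the right phone within a rule, and phone names are assumed to contain no underscore.

```python
from typing import List, Optional, Set, Tuple


def back_off(diphone: str, database: Set[str],
             rules: List[Tuple[str, str]]) -> Optional[str]:
    """Replace one phone at a time until a diphone in the database is found.

    rules is the ordered backoff list of (phone, replacement) pairs.
    Returns None if the list cannot lead to an available diphone.
    """
    seen = set()
    while diphone not in database:
        if diphone in seen:
            return None  # the rules loop back on themselves
        seen.add(diphone)
        left, right = diphone.split("_", 1)
        for phone, replacement in rules:  # scan from the top of the list
            if phone == left:
                diphone = f"{replacement}_{right}"
                break
            if phone == right:
                diphone = f"{left}_{replacement}"
                break
        else:
            return None  # no rule applies to either phone
    return diphone


database = {"p_eh", "pau_pau"}
rules = [("b", "p"), ("p", "pau"), ("eh", "pau"), ("ih", "pau")]

direct = back_off("b_eh", database, rules)     # b -> p gives p_eh
to_silence = back_off("b_ih", database, rules)  # ends at the silence diphone
no_rule = back_off("b_zz", database, rules)     # zz has no backoff rule
```

The last call shows why every phone needs a path to the silence phone: without one, the search can run out of rules before it finds an available diphone.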

The file ./festvox/INST_LANG_VOX_multidiphone.scm defines the actual voice and is Festival's entry point to the voice. Various parameters are defined in the file, and can be changed as is needed. See the comments in the file for information on specific parameters.

Labeling

Festvox provides two techniques for labeling the recorded prompts, namely dynamic time warping alignment and Baum-Welch training to build complete ASR acoustic models. With voices in new languages the second technique is preferred, as there is no synthetic voice to do alignment with. However, if the recorded database is small, the acoustic models are not very good. A small recorded database also makes it feasible to do hand corrections of misaligned labels.

The file ./festvox/INST_LANG_VOX_phone_conversions.scm provides a cross-phoneset phone mapping scheme. Here you can define a 1 to 1 relationship between a new language's phoneset and an available diphone voice's phoneset. The idea is that each phone of the new language's phoneset is mapped to its acoustically closest match in the "closest voice" phoneset (closest voice is defined in ./festvox/INST_LANG_VOX_multidiphone.scm ). By synthesising the prompts with the diphone voice and then replacing its phones with the phones of the new language, dynamic time warping alignment can be performed.
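
As an illustration of such a 1 to 1 mapping, the Python sketch below converts a phone sequence and reports any phone that has no mapping yet. The actual mapping is defined in Scheme in the file named above; the function convert_phones and the phone names used here are hypothetical.

```python
from typing import Dict, List


def convert_phones(phones: List[str], mapping: Dict[str, str]) -> List[str]:
    """Map each new-language phone to its closest-voice phone."""
    missing = sorted({phone for phone in phones if phone not in mapping})
    if missing:
        raise KeyError(f"no mapping defined for: {', '.join(missing)}")
    return [mapping[phone] for phone in phones]


# Hypothetical mapping from a new phoneset to an existing voice's phoneset.
mapping = {"pau": "pau", "s": "s", "i": "iy", "m": "m", "o": "ow"}
converted = convert_phones(["pau", "s", "i", "m", "o", "pau"], mapping)
```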

After defining the mapping in ./festvox/INST_LANG_VOX_phone_conversions.scm one can build the prompts with

	festival -b festvox/build_multidiphone.scm '(build_prompts "etc/utts.data")'

this will synthesise the prompts defined in ./etc/utts.data with the "closest voice".

Now the two sets of prompts can be aligned with the Festvox script

	./bin/make_labs prompt-wav/*.wav

which will generate the label files in ./lab . These label files can be converted to a Praat format (TextGrids) for viewing and hand correction with

	./bin/make_lab_textgrid lab/*.lab

Open Praat and load the textgrids in the directory ./textgrid , as well as their associated wave files in ./wav . Select a textgrid together with its wave file and choose edit to view and edit them. If you have changed the textgrids, they need to be converted back to the Festival "lab" format

	./bin/make_textgrid_lab textgrid/*.TextGrid

which will overwrite the original label files.

Building The Voice

Pitchmarks

Getting good pitchmarks is important to the quality of the synthesis, as the short-time-signals used for synthesis are calculated around the pitchmarks. The file ./etc/pitchmarks.defs contains the parameters that specify the fundamental frequency range of the speaker. These must be edited according to your voice artist's range. The average range for male speakers is 80 Hz to 200 Hz, with a default of 100 Hz, and for female speakers 120 Hz to 300 Hz, with a default of 200 Hz.

Praat can be used to view a few recordings and to get minimum and maximum pitch values to get a general idea of what the range of the speaker is.

After defining the pitch range of the speaker in the file ./etc/pitchmarks.defs , the pitchmarks can be calculated with the script make_pm_praat. The actual pitchmark calculation is done by Praat.

	./bin/make_pm_praat wav/*.wav

The pitchmarks are created in the ./pm directory and Praat viewable pitchmarks in ./pm_praat_filled and ./pm_praat_unfilled. The filled pitchmarks directory has pitchmarks added in unvoiced regions at the default pitch value. This is necessary as Festival does not distinguish between voiced and unvoiced regions during synthesis. You can view the pitchmarks in ./pm_praat_filled and ./pm_praat_unfilled by loading them into Praat with their associated waveforms.
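
The idea of filling unvoiced regions can be sketched as follows. This is a hedged Python illustration, not what make_pm_praat actually does: the function fill_pitchmarks is hypothetical, and it assumes that any gap longer than the longest plausible pitch period (1 / minimum F0) is unvoiced and should receive evenly spaced marks no further apart than the default period (1 / default F0).

```python
import math
from typing import List, Sequence


def fill_pitchmarks(marks: Sequence[float], end_time: float,
                    min_f0: float, default_f0: float) -> List[float]:
    """Return the marks (in seconds) with unvoiced gaps filled in."""
    max_period = 1.0 / min_f0   # longest gap still treated as voiced
    step = 1.0 / default_f0     # largest spacing used for the added marks
    filled: List[float] = []
    previous = 0.0
    boundaries = list(marks) + [end_time]
    for index, mark in enumerate(boundaries):
        gap = mark - previous
        if gap > max_period:
            # Split the gap into equal intervals that are no longer than step.
            intervals = math.ceil(gap / step)
            filled.extend(previous + gap * k / intervals
                          for k in range(1, intervals))
        if index < len(marks):
            filled.append(mark)  # keep every original mark; end_time is not a mark
        previous = mark
    return filled


# Voiced marks 8 ms apart, surrounded by silence in a 0.2 s recording.
voiced = [0.100, 0.108, 0.116, 0.124]
filled = fill_pitchmarks(voiced, end_time=0.2, min_f0=80.0, default_f0=100.0)
```

After filling, no two neighbouring marks are further apart than the longest allowed period, which is the property needed when the synthesiser treats all regions alike.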

F0

TTS-Cubed uses Praat to calculate the F0 values, which are used in the join cost calculations. The F0's can be calculated by (be patient, this takes a while)

	./bin/make_f0_praat wav/*.wav 

which creates the F0 files in ./f0 .

LPC Coefficients And Residuals

The LPC coefficients and residuals are calculated at the pitchmark (filled) time points. These are the short-time-signals used in the RELP synthesis. The LPC's and residuals are calculated by the script make_lpc.

	./bin/make_lpc wav/*.wav

This script dumps the lpc's and residuals in the ./lpc directory.

Join Coefficients

The join coefficients are calculated at the pitchmark times. They consist of a 12th order mel-cepstrum, its energy and the F0 value at this pitchmark. These values are normalised over the entire database for each channel of the join coefficient and are calculated with

	./bin/make_joincoefs wav/*.wav

which will create the mcep's in ./mcep and the normalised join coefficients in ./join .
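
Per-channel normalisation over the whole database can be sketched as below. The text above does not state which normalisation make_joincoefs applies, so this Python sketch assumes zero-mean, unit-variance scaling; the function normalise_channels is hypothetical, and the demo uses three channels instead of the full set for brevity.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalise_channels(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every channel (column) to zero mean and unit variance.

    frames holds one row per pitchmark, collected from the entire database.
    """
    channels = list(zip(*frames))
    means = [mean(channel) for channel in channels]
    # A constant channel has zero deviation; divide by 1.0 to leave it at zero.
    deviations = [pstdev(channel) or 1.0 for channel in channels]
    return [
        [(value - m) / d for value, m, d in zip(frame, means, deviations)]
        for frame in frames
    ]


# Two frames with three channels each (for example one cepstral value,
# the energy and the F0); the middle channel is constant.
frames = [[1.0, 10.0, 100.0], [3.0, 10.0, 140.0]]
normalised = normalise_channels(frames)
```

Normalising each channel separately keeps a channel with large raw values, such as F0 in Hz, from dominating a distance computed over all channels.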

Utterance Building

Finally the utterances of the voice database can be built

	festival -b festvox/build_multidiphone.scm '(build_utts "etc/utts.data")'

This script will dump the utterances in ./festival/utts .

Now the voice can be loaded into Festival and used for synthesis

	festival festvox/INST_LANG_VOX_multidiphone.scm
	Festival Speech Synthesis System 1.96:beta July 2004
	Copyright (C) University of Edinburgh, 1996-2004. All rights reserved.
	For details type `(festival_warranty)'
	festival> (voice_INST_LANG_VOX_multidiphone)
	Please wait: Initialising MultiDiphoneVoice "INST_LANG_VOX_multidiphone".
	Voice loaded successfully!
	#<vox 0x9b55400>
	festival> (SayText "hello world")
	#<Utterance 0xb71d5278>
	festival> (exit)

The voice has different verbosity levels that can be set depending on the required information. The levels are:

    • Processing utterance (during loading Festival prints out which utterance it is processing while building the diphone database)
    • Ignored d phones with bad (the number of phones ignored because of "bad" flags that are set)
    • Chosen diphone info (Festival prints out which diphones were selected for synthesis and from what utterances they came)
    • adding diphone (during loading Festival prints out the diphone it is processing while building the diphone database)
    • viterbi info (during synthesis Festival will print out information regarding the viterbi search in unit selection, if the pruning is turned on)
    • backoff (if diphone backoff was performed the information will be printed out)
    • export info (during voice exporting any information regarding the exporting process will be printed out)
    • backoff info (if diphone backoff is being performed all the information will be printed out)

The default level is 0 (set in ./festvox/INST_LANG_VOX_multidiphone.scm ), at which no extra information is printed. The level can be set before the voice is loaded by editing the '(debug_level 0) entry in that file, or after the voice has been loaded (with the command (voice_INST_LANG_VOX_multidiphone)) by running the command (vox.set_verbosity 2) in Festival.