The idea of a general, synthesizer-independent mark-up language for labelling text has been under discussion for some time. Festival has supported an SGML-based markup language through multiple versions, most recently STML (sproat97). This is based on the earlier SSML (Speech Synthesis Markup Language), which was supported by previous versions of Festival (taylor96). With this version of Festival we support Sable, a similar mark-up language devised by a consortium from Bell Labs, Sun Microsystems, AT&T and Edinburgh (sable98). Unlike the previous versions, which were SGML based, the implementation of Sable in Festival is now XML based. To the user the difference is negligible, but using XML makes processing of files easier and more standardized. Also, Festival now includes an XML parser, thus reducing the dependencies in processing Sable text.
Raw text has the problem that it cannot always easily be rendered as speech in the way the author wishes. Sable offers a well-defined way of marking up text so that the synthesizer may render it appropriately.
The definition of Sable is by no means settled and is still in development. In this release Festival offers people working on Sable and other XML (and SGML) based markup languages a chance to quickly experiment with prototypes by providing a DTD (document type definition) and the mapping of the elements in the DTD to Festival functions. Although we have not yet (personally) investigated facilities like cascading style sheets and generalized SGML specification languages like DSSSL, we believe the facilities offered by Festival allow rapid prototyping of speech output markup languages.
Primarily we see Sable markup text as a language that will be generated by other programs, e.g. text generation systems, dialog managers etc. A standard, easy-to-parse format is therefore required, even if it seems overly verbose for human writers.
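As an illustration of program-generated markup, here is a minimal Python sketch that wraps plain sentences in SABLE and SPEAKER elements and separates them with BREAK tags. It is not part of Festival: the function name `build_sable` is hypothetical, and the sketch assumes that only these three tags are needed and that the DOCTYPE header is added elsewhere.

```python
import xml.etree.ElementTree as ET


def build_sable(sentences, speaker="male1"):
    """Return Sable-style markup for a list of plain-text sentences.

    Hypothetical helper: one SPEAKER element, with an empty BREAK tag
    between consecutive sentences.
    """
    root = ET.Element("SABLE")
    voice = ET.SubElement(root, "SPEAKER", {"NAME": speaker})
    voice.text = sentences[0] if sentences else ""
    for sentence in sentences[1:]:
        brk = ET.SubElement(voice, "BREAK")
        # Text following an empty element is stored as that element's tail.
        brk.tail = " " + sentence
    # The serializer escapes characters such as & and < in the text.
    return ET.tostring(root, encoding="unicode")


print(build_sable(["Good morning.", "My name is Stuart."]))
```

Building the document through an XML library rather than string concatenation keeps the output well-formed even when the generated text contains markup characters.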
For more information on Sable (and Festival) see
http://www.cstr.ed.ac.uk/projects/sable.html
This document was taken (and slightly modified) from the Festival 1.4.X manual (chapter 10) at
http://www.cstr.ed.ac.uk/projects/festival/manual/festival_10.html
10.1 Sable example: an example of Sable with descriptions
10.2 Supported Sable tags: currently supported Sable tags
10.3 Adding Sable tags: adding new Sable tags
Here is a simple example of Sable marked-up text:
<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
      "Sable.v0_2.dtd"
[]>
<SABLE>
<SPEAKER NAME="male1">

The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.

Good morning <BREAK /> My name is Stuart, which is spelled
<RATE SPEED="-40%">
<SAYAS MODE="literal">stuart</SAYAS> </RATE>
though some people pronounce it
<PRON SUB="stoo art">stuart</PRON>. My telephone number
is <SAYAS MODE="literal">2787</SAYAS>.

I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place,
but no one can pronounce that.

By the way, my telephone number is actually
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
</SPEAKER>
</SABLE>
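Before handing such a file to a synthesizer it can be useful to confirm that it is well-formed XML and to see which tags it uses. The following Python sketch does this for a shortened copy of the example. It is an illustration only: `tag_counts` is a hypothetical helper, the parser checks well-formedness but does not validate against `Sable.v0_2.dtd`, and nothing here reflects how Festival itself reads the file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A shortened copy of the example above.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
      "Sable.v0_2.dtd" []>
<SABLE>
<SPEAKER NAME="male1">
The boy saw the girl in the park <BREAK/> with the telescope.
My name is Stuart, which is spelled
<RATE SPEED="-40%"><SAYAS MODE="literal">stuart</SAYAS></RATE>
though some people pronounce it <PRON SUB="stoo art">stuart</PRON>.
</SPEAKER>
</SABLE>"""


def tag_counts(sable_text):
    """Parse the text and count how often each element name occurs.

    Raises xml.etree.ElementTree.ParseError if the text is not
    well-formed, for example when an empty tag is written as <BREAK>
    instead of <BREAK/>.
    """
    root = ET.fromstring(sable_text)
    return Counter(elem.tag for elem in root.iter())


for tag, count in sorted(tag_counts(SAMPLE).items()):
    print(tag, count)
```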
There is not yet a definitive set of tags, but hopefully such a list will form over the next few months. As adding support for new tags is often trivial, the problem lies much more in defining what tags there should be than in actually implementing them. The following are based on version 0.2 of Sable as described in http://www.cstr.ed.ac.uk/projects/sable_spec2.html, though some aspects are not currently supported in this implementation. Further updates will be announced through the Sable mailing list.
LANGUAGE
The language is specified through the ID attribute. Valid values in
Festival are english, en1, spanish, en, and others depending
on your particular installation.
For example
<LANGUAGE ID="english"> ... </LANGUAGE>
SPEAKER
The voice is selected through the NAME attribute, which takes values
male1, male2, female1, etc. There
is currently no definition of what happens when a voice is selected
which the synthesizer doesn't support. An example is
<SPEAKER NAME="male1"> ... </SPEAKER>
AUDIO
My telephone number is
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
MARKER
The value of the MARK attribute is printed. This is done when that piece
of text is analyzed, not when it is played. To use
this in any real application would require changes to this tag's
implementation.
Move the <MARKER MARK="mouse" /> mouse to the top.
BREAK
The strength of the break is given by the LEVEL attribute, whose value
may be Large, Medium, Small or a number. Note that
this tag is an empty tag and must include the closing part
within its own specification.
<BREAK LEVEL="LARGE"/>
DIV
A TYPE attribute may be specified but it is ignored
by Festival.
PRON
The supported attributes are IPA, for an IPA specification (not
currently supported by Festival); SUB, text to be substituted,
which can be in some form of phonetic spelling; and ORIGIN, where
the linguistic origin of the enclosed text may be identified to assist
in etymologically sensitive letter-to-sound rules.
<PRON SUB="toe maa toe">tomato</PRON>
SAYAS
The MODE attribute can take any of the following values: literal,
date, time, phone, net, postal, currency, math, fraction, measure,
ordinal, cardinal, or name. Further specification
of type for dates (MDY, DMY etc.) may be specified through the
MODETYPE attribute.
As a test of marked-up numbers. Here we have a year <SAYAS MODE="date">1998</SAYAS>, an ordinal <SAYAS MODE="ordinal">1998</SAYAS>, a cardinal <SAYAS MODE="cardinal">1998</SAYAS>, a literal <SAYAS MODE="literal">1998</SAYAS>, and phone number <SAYAS MODE="phone">1998</SAYAS>.
EMPH
A LEVEL attribute may be specified but its value is currently
ignored by Festival (besides, the emphasis Festival generates
isn't very good anyway).
The leaders of <EMPH>Denmark</EMPH> and <EMPH>India</EMPH> meet on Friday.
PITCH
Without his penguin, <PITCH BASE="-20%"> which he left at home, </PITCH> he could not enter the restaurant.
RATE
The address is <RATE SPEED="-40%"> 10 Main Street </RATE>.
VOLUME
Please speak more <VOLUME LEVEL="loud">loudly</VOLUME>, except when I ask you to speak <VOLUME LEVEL="quiet"> in a quiet voice </VOLUME>.
ENGINE
An example is <ENGINE ID="festival" DATA="our own festival speech synthesizer"> the festival speech synthesizer</ENGINE> or the Bell Labs speech synthesizer.
These tags may change in name but they cover the aspects of speech mark up that we wish to express. Later additions and changes to these are expected.
See the files `festival/examples/example.sable' and `festival/examples/example2.sable' for working examples.
Note that the definition of Sable is ongoing and there are likely to be later, more complete implementations of Sable for Festival as independent releases; consult `http://www.cstr.ed.ac.uk/projects/sable.html' for the most recent updates.
We do not yet claim that there is a fixed standard for Sable tags, but we wish to move towards such a standard. In the meantime we have made it easy in Festival to add support for new tags without, in general, having to change any of the core functions.
Two changes are necessary to add a new tag. First, change the
definition in `lib/Sable.v0_2.dtd', so that Sable files may use it.
The second stage is to make Festival sensitive to that new tag. The
example in `festival/lib/sable-mode.scm' shows how a new text mode
may be implemented for an XML/SGML-based markup language. The basic
point is that an identified function will be called on finding a start
tag or end tag in the document. It is the tag-function's job to
synthesize the given utterance if the tag signals an utterance boundary.
The return value from the tag-function is the new status of the current
utterance, which may remain unchanged; or, if the current utterance has
been synthesized, nil should be returned, signalling a new
utterance.
Note that the hierarchical structure of the document is not available in this method of tag-functions. Any hierarchical state that must be preserved has to be kept using explicit stacks in Scheme. This is an artifact of the cross relationship between utterances and tags (utterances may end within start and end tags), and of the desire to have all specification in Scheme rather than C++.
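The explicit-stack idea can be modelled outside Festival. The following is a minimal Python sketch, not Festival's Scheme code: the class `VoiceState` and its method names are hypothetical, and it assumes that an inner SPEAKER tag should override the current voice only until its matching end tag.

```python
class VoiceState:
    """Tracks the current voice across nested SPEAKER tags with a stack."""

    def __init__(self, default="male1"):
        self.current = default
        self._saved = []          # explicit stack of enclosing voices

    def start_speaker(self, attrs):
        # Remember the enclosing voice before switching to the new one.
        self._saved.append(self.current)
        self.current = attrs.get("NAME", self.current)

    def end_speaker(self):
        # Restore whatever voice was active outside this SPEAKER element.
        self.current = self._saved.pop()


state = VoiceState()
state.start_speaker({"NAME": "male2"})
state.start_speaker({"NAME": "female1"})
inner = state.current
state.end_speaker()
outer = state.current
state.end_speaker()
print(inner, outer, state.current)
```

Because the start and end handlers are called as flat events, the stack is the only record of which element encloses which; each end tag restores the voice that was current when its start tag was seen.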
The tag-functions are defined in an elements list. They are identified
with names such as "(SABLE" and ")SABLE" denoting start and end tags
respectively. Two arguments are passed to these tag-functions:
an assoc list of attributes and values as specified in the document,
and the current utterance. If the tag denotes an utterance
break, call xxml_synth on UTT and return nil.
If a tag (start or end) is found in the document and there is no
corresponding tag-function it is ignored.
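The dispatch scheme described above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than Festival's implementation: the utterance is modelled as a plain list of words, `synth` stands in for xxml_synth, `None` plays the role of nil, the names `run`, `end_utterance`, `keep_utterance` and `ELEMENTS` are hypothetical, and treating DIV as an utterance boundary is a choice made for this example only.

```python
import xml.parsers.expat


def run(text, elements):
    """Parse text, calling tag-functions looked up in `elements`.

    Keys look like "(SABLE" for a start tag and ")SABLE" for an end tag.
    Each tag-function takes (attrs, utt, synth) and returns the utterance
    to carry on with, or None once the utterance has been synthesized.
    """
    spoken = []                   # stand-in for audio output
    state = {"utt": []}           # current utterance as a list of words

    def synth(utt):
        if utt:
            spoken.append(" ".join(utt))

    def dispatch(key, attrs):
        func = elements.get(key)
        if func is None:
            return                # no tag-function: the tag is ignored
        result = func(attrs, state["utt"], synth)
        state["utt"] = [] if result is None else result

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True     # deliver adjacent text in one piece
    parser.StartElementHandler = lambda name, attrs: dispatch("(" + name, attrs)
    parser.EndElementHandler = lambda name: dispatch(")" + name, {})
    parser.CharacterDataHandler = lambda data: state["utt"].extend(data.split())
    parser.Parse(text, True)
    return spoken


def end_utterance(attrs, utt, synth):
    synth(utt)                    # the tag marks an utterance boundary
    return None                   # signal that a new utterance starts


def keep_utterance(attrs, utt, synth):
    return utt                    # utterance status is unchanged


ELEMENTS = {
    "(SABLE": keep_utterance,
    ")SABLE": end_utterance,
    "(DIV": end_utterance,        # assumption made for this sketch
    ")DIV": end_utterance,
}

doc = "<SABLE>hello world<DIV>second part</DIV><EMPH>last</EMPH> words</SABLE>"
print(run(doc, ELEMENTS))
```

EMPH has no entry in `ELEMENTS`, so its start and end tags pass through without effect and its text simply joins the current utterance.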
New features may be added to words with a start and end tag by
adding features to the global xxml_word_features. Any
features in that variable will be added to each word.
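The effect of such a global feature list can be illustrated with a small Python model. This is not Festival code: `word_features` merely models xxml_word_features, the handler names and the feature name `EMPH` are hypothetical, and words are modelled as dictionaries.

```python
word_features = []        # models the global xxml_word_features


def start_emph(attrs):
    # Start tag: add a feature that every following word will receive.
    word_features.insert(0, ("EMPH", "1"))


def end_emph():
    # End tag: drop the feature again so later words are unaffected.
    word_features.pop(0)


def make_word(name):
    # Each new word gets a copy of whatever features are currently set.
    word = {"name": name}
    word.update(word_features)
    return word


words = [make_word("The")]
start_emph({})
words.append(make_word("Denmark"))
end_emph()
words.append(make_word("meets"))
print(words)
```

Only the word created between the start and end handlers carries the extra feature, which is how a tag pair can mark the words it encloses without knowing anything about the document hierarchy.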
Note that this method may be used for both XML-based and SGML-based
markup languages (though an external normalizing SGML parser is
required in the SGML case). The type (XML vs SGML) is identified
by the analysis_type parameter in the tts text mode specification.