Copyright 1999 - 2004 W3C (MIT, ERCIM, Keio University), all Rights Reserved. W3C liability, trademark, document use rules apply. See http://www.w3.org/Consortium/Legal/ for details.
The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A.
SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C.
The intended use of SSML is to improve the quality of synthesized content. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like phoneme and prosody (e.g. for speech contour design) may require specialized knowledge.
A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog. If present, the optional DOCTYPE must read as follows:
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
The XML prolog is followed by the root speak element.
The speak element must designate the SSML namespace. This can be achieved by declaring an xmlns attribute or an attribute with an "xmlns" prefix. Note that when the xmlns attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements.
An example of a legal SSML header:
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
The following elements and attributes are defined in this specification.
The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a required attribute specifying the language of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document and must have the value "1.0".
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  ... the body ...
</speak>
The speak element can only contain text to be rendered and the following elements: audio, break, emphasis, p, phoneme, prosody, say-as, sub, s, voice.
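The root-element requirements above can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name check_speak_root and the wording of the messages are illustrative assumptions, not part of SSML, and the check covers only the namespace, version, and xml:lang rules stated above.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def check_speak_root(document: str) -> list[str]:
    """Return a list of problems found on the root element (empty if none)."""
    root = ET.fromstring(document)
    problems = []
    # ElementTree expands namespaced names to "{namespace-uri}local-name".
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append("root element is not speak in the SSML namespace")
    if root.get("version") != "1.0":
        problems.append('version attribute must be "1.0"')
    if root.get(f"{{{XML_NS}}}lang") is None:
        problems.append("xml:lang attribute is required on speak")
    return problems


good = ('<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">Hello</speak>')
bad = '<speak version="2.0">Hello</speak>'
print(check_speak_root(good))
print(check_speak_root(bad))
```

A real processor would normally validate against the DTD or schema instead; this sketch only mirrors the prose rules.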
xml:lang Attribute
The xml:lang attribute can be used in SSML to indicate the natural language of the enclosing element and its attributes and subelements.
Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
xml:lang is a defined attribute for the voice, speak, p, and s elements. For vocal rendering, a language change can have an effect on various other parameters (including gender, speed, age, pitch, etc.) which may be disruptive to the listener. There might even be unnatural breaks between language shifts. For this reason authors are encouraged to use the voice element to change the language. xml:lang is permitted on p and s only because it is common to change the language at those levels.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">日本語が分かりません。</p>
</speak>
There may be variation across conforming processors in the implementation of xml:lang voice changes for different markup elements (e.g. p and s elements).
All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis, break, p and s elements should each be rendered in a manner that is appropriate to the current language.
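The inheritance rule for xml:lang (inner values override outer ones) can be illustrated with a short tree walk. This is a sketch only: effective_languages is a hypothetical helper name, and the sample document is a variation on the example above with an added s element.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element, language) pairs; an inner xml:lang overrides the outer one."""
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)


doc = ET.fromstring(
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    "<p>I don't speak Japanese.</p>"
    '<p xml:lang="ja"><s>日本語が分かりません。</s></p>'
    "</speak>"
)
# Strip the "{namespace}" prefix so only local element names are shown.
langs = [(el.tag.split("}")[1], lang) for el, lang in effective_languages(doc)]
print(langs)
```

The s element carries no xml:lang of its own, so it takes "ja" from the enclosing p rather than "en-US" from speak.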
A p element represents a paragraph. An s element represents a sentence.
xml:lang is a defined attribute on the p and s elements.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>
The use of p and s elements is optional.
The p and s elements can only contain text to be rendered and the following elements: audio, break, emphasis, phoneme, prosody, say-as, sub, voice.
The p element can also contain the s element.
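The content lists for p and s translate directly into a lookup table. The sketch below, assuming Python's standard xml.etree.ElementTree, reports children that the lists above do not permit; the names ALLOWED_CHILDREN and disallowed_children are illustrative, and only p and s are checked.

```python
import xml.etree.ElementTree as ET

INLINE = {"audio", "break", "emphasis", "phoneme", "prosody", "say-as", "sub", "voice"}
ALLOWED_CHILDREN = {
    "s": INLINE,
    "p": INLINE | {"s"},  # p may additionally contain s
}


def local_name(element):
    """Drop the "{namespace}" prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def disallowed_children(root):
    """Return (parent, child) name pairs that violate the p/s content lists."""
    violations = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(local_name(parent))
        if allowed is None:
            continue  # only p and s are checked in this sketch
        for child in parent:
            if local_name(child) not in allowed:
                violations.append((local_name(parent), local_name(child)))
    return violations


sample = ET.fromstring(
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    "<p><s>Fine <emphasis>here</emphasis>.</s><s><p>Not allowed.</p></s></p>"
    "</speak>"
)
print(disallowed_children(sample))
```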
The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.
The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute.
The say-as element can only contain text to be rendered.
interpret-as and format attributes
The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.
In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context.
When the value for the interpret-as attribute is unknown or unsupported by a processor, it must render the contained text as if no interpret-as value were specified.
When the value for the format attribute is unknown or unsupported by a processor, it must render the contained text as if no format value were specified, and should render it using the interpret-as value that is specified.
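The two fallback rules above can be sketched as a small decision function. Everything here is illustrative: resolve_say_as is a hypothetical name, and the SUPPORTED table (including the values "date", "mdy", "dmy", and "digits") is an assumption, since this document does not define any interpret-as or format values. Dropping the format together with an unknown interpret-as is also an assumption, based on the statement that legal format values depend on interpret-as.

```python
def resolve_say_as(interpret_as, fmt, supported):
    """Return the (interpret_as, format) hints a processor would actually apply.

    `supported` maps each supported interpret-as value to the set of format
    values it supports (a hypothetical table for this sketch).
    """
    if interpret_as not in supported:
        # Unknown interpret-as: render as if no interpret-as were specified.
        return (None, None)
    if fmt is not None and fmt not in supported[interpret_as]:
        # Unknown format: ignore the format but keep the interpret-as hint.
        return (interpret_as, None)
    return (interpret_as, fmt)


# Assumed processor capabilities, for illustration only.
SUPPORTED = {"date": {"mdy", "dmy"}, "digits": set()}

print(resolve_say_as("date", "mdy", SUPPORTED))
print(resolve_say_as("date", "xyz", SUPPORTED))
print(resolve_say_as("unknown", "mdy", SUPPORTED))
```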
detail attribute
The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a synthesis processor will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.
The detail attribute can be used for all interpret-as types.
If the detail attribute is not specified, the level of detail that is produced by the synthesis processor depends on the text content and the language.
The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.
The ph attribute is a required attribute that specifies the phoneme/phone string.
This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon, while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.
The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" and vendor-defined strings of the form "x-organization" or "x-organization-alphabet".
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
  <!-- This is an example of IPA using character entities -->
  <!-- Because many platform/browser/text editor combinations do not
       correctly cut and paste Unicode text, this example uses the entity
       escape versions of the IPA characters. Normally, one would directly
       use the UTF-8 representation of these symbols: "təmei̥ɾou̥". -->
</speak>
It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor.
The phoneme element itself can only contain text (no elements).
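The permitted forms of the alphabet attribute can be expressed as a pattern. The following sketch uses Python's re module; the helper name is_valid_alphabet is hypothetical, and restricting the organization and alphabet parts to ASCII letters and digits is an assumption, since the text above gives only the overall shape of vendor-defined strings.

```python
import re

# "ipa", or a vendor-defined "x-organization" / "x-organization-alphabet".
# Assumption: each part after "x-" consists of ASCII letters and digits only.
_ALPHABET = re.compile(r"ipa|x-[A-Za-z0-9]+(-[A-Za-z0-9]+)?")


def is_valid_alphabet(value: str) -> bool:
    """Check the lexical form of an alphabet attribute value."""
    return _ALPHABET.fullmatch(value) is not None


for candidate in ("ipa", "x-acme", "x-acme-sampa", "sampa", "x-"):
    print(candidate, is_valid_alphabet(candidate))
```

A lexically valid vendor string may still be an error for a given processor if that processor does not know the alphabet.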
The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string.
The sub element can only contain text (no elements).
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <sub alias="World Wide Web Consortium">W3C</sub>
  <!-- World Wide Web Consortium -->
</speak>
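The substitution performed by sub can be sketched as a text extraction that uses the alias value in place of the element's content. The function name spoken_text is hypothetical, and the sketch deliberately ignores the special handling that other elements (such as phoneme or audio) would need in a real processor.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def spoken_text(element):
    """Collect the text to be spoken, using alias in place of sub content."""
    if element.tag == f"{{{SSML_NS}}}sub":
        return element.get("alias", "")
    parts = [element.text or ""]
    for child in element:
        parts.append(spoken_text(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


doc = ET.fromstring(
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    'Welcome to the <sub alias="World Wide Web Consortium">W3C</sub> site.'
    "</speak>"
)
print(spoken_text(doc))
```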
The voice element is a production element that requests a change in speaking voice. Attributes are:
xml:lang: optional language specification attribute.
gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text.
variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second male child voice).
name: optional attribute indicating a processor-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. As a result a name must not contain any white space.
Although each attribute individually is optional, it is an error if no attributes are specified when the voice element is used.
The voice element is commonly used to change the language. When there is not a voice available that exactly matches the attributes specified in the document, or there are multiple voices that match the criteria, the voice selection algorithm must be used.
Approximately speaking, the xml:lang attribute has the highest priority and all other attributes are equal in priority but below xml:lang (the complete algorithm can be found in the original document).
voice attributes are inherited down the tree including to within elements that change the language.
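The approximate priority rule (xml:lang first, all other attributes equal) can be illustrated with a simple ranking. This is not the complete voice selection algorithm, which is defined in the original specification; it is a rough sketch in which pick_voice, the dictionary layout, and the voice inventory are all hypothetical.

```python
def pick_voice(request, voices):
    """Pick a voice: a language match outranks any number of other matches."""
    def score(voice):
        lang_match = (request.get("xml:lang") is None
                      or voice.get("xml:lang") == request["xml:lang"])
        others = sum(
            1 for key in ("gender", "age", "variant", "name")
            if key in request and voice.get(key) == request[key]
        )
        # Tuples compare element by element, so lang_match dominates.
        return (lang_match, others)
    return max(voices, key=score)


# Hypothetical voice inventory for illustration.
VOICES = [
    {"name": "Mike", "xml:lang": "en-US", "gender": "male", "age": 30},
    {"name": "Hana", "xml:lang": "ja", "gender": "female", "age": 6},
    {"name": "Amy", "xml:lang": "en-US", "gender": "female", "age": 30},
]

chosen = pick_voice({"xml:lang": "en-US", "gender": "female", "age": 6}, VOICES)
print(chosen["name"])
```

Here "Hana" matches two of the requested attributes but not the language, so the sketch prefers "Amy", which matches the language and one other attribute.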
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female child's voice -->
  <voice gender="female" variant="2">
    Its fleece was white as snow.
  </voice>
  <!-- processor-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
  <voice gender="female">
    Any female voice here.
    <voice age="6">
      A female child voice here.
      <p xml:lang="ja">
        <!-- A female child voice in Japanese. -->
      </p>
    </voice>
  </voice>
</speak>
Relative changes in prosodic parameters should be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.
The voice element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.
The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:
level: the optional level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the synthesis processor from emphasizing words that it might typically emphasize. The values "none", "moderate", and "strong" are monotonically non-decreasing in strength.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis> bank account!
</speak>
The emphasis element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:
strength: the strength attribute is an optional attribute having one of the following values: "none", "x-weak", "weak", "medium" (default value), "strong", or "x-strong". This attribute is used to indicate the strength of the prosodic break in the speech output. The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between words. The stronger boundaries are typically accompanied by pauses. "x-weak" and "x-strong" are mnemonics for "extra weak" and "extra strong", respectively.
time: the time attribute is an optional attribute indicating the duration of a pause to be inserted in the output, in seconds or milliseconds (e.g. "3s" or "250ms").
If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element was supplied.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  Take a deep breath <break/> then continue.
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you! <break strength="weak"/> Please repeat.
</speak>
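A processor has to turn the time value of a break into a duration. The sketch below converts values such as "3s" to milliseconds; the helper name break_time_ms is hypothetical, and the exact lexical form accepted (an unsigned number immediately followed by "s" or "ms") is an assumption based on the example above.

```python
import re

# Assumption: an unsigned decimal number immediately followed by "ms" or "s".
_TIME = re.compile(r"([0-9]+\.?[0-9]*|\.[0-9]+)(ms|s)")


def break_time_ms(value: str) -> float:
    """Convert a break time value such as "3s" or "250ms" to milliseconds."""
    match = _TIME.fullmatch(value)
    if match is None:
        raise ValueError(f"not a time value: {value!r}")
    number, unit = match.groups()
    return float(number) * (1000.0 if unit == "s" else 1.0)


print(break_time_ms("3s"))
print(break_time_ms("250ms"))
```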
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:
pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate.
duration: a value in seconds or milliseconds for the desired time to take to read the element contents.
volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.
Although each attribute individually is optional, it is an error if no attributes are specified when the prosody element is used. The "x-foo" attribute value names are intended to be mnemonics for "extra foo". All units ("Hz", "st") are case-sensitive.
A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.
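The four legal number formats can be captured in a single regular expression. This is a minimal sketch with a hypothetical helper name, is_number; it checks only the lexical form described above.

```python
import re

# "n", "n." and "n.n" are covered by the first alternative, ".n" by the second.
_NUMBER = re.compile(r"[0-9]+\.?[0-9]*|\.[0-9]+")


def is_number(text: str) -> bool:
    """Check whether text is a number in the sense defined above."""
    return _NUMBER.fullmatch(text) is not None


for candidate in ("1", "1.", ".5", "1.5", ".", "1e3", "-1"):
    print(repr(candidate), is_number(candidate))
```

Signs and exponents are rejected because a number here is an unsigned value without exponentials; signs belong to the relative-change syntax instead.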
Relative changes for the attributes above can be specified
as a percentage (a number preceded by "+" or "-" and followed by "%"), e.g. "+10%" or "-5.5%", or
as a relative number:
For the rate attribute, relative changes are a number.
For the volume attribute, relative changes are a number preceded by "+" or "-", e.g. "+10", "-5.5".
For the pitch and range attributes, relative changes can be given in semitones (a number preceded by "+" or "-" and followed by "st") or in Hertz (a number preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz". A semitone is half of a tone (a half step) on the standard diatonic scale.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  The price of XYZ is <prosody rate="-10%">$45</prosody>
</speak>
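To make the relative forms for pitch concrete, the sketch below applies a pitch value to a base frequency. The function name apply_pitch is hypothetical; treating a semitone as a frequency factor of 2^(1/12) assumes equal temperament, which the text above does not state, and label values such as "high" are left out because their meaning is processor-specific.

```python
import re

_CHANGE = re.compile(r"([+-]?)([0-9]+\.?[0-9]*|\.[0-9]+)(Hz|st|%)")


def apply_pitch(base_hz: float, value: str) -> float:
    """Return the pitch in Hz that results from applying value to base_hz."""
    match = _CHANGE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported pitch value: {value!r}")
    sign, number, unit = match.groups()
    if not sign:
        if unit != "Hz":
            raise ValueError(f"relative change needs a sign: {value!r}")
        return float(number)  # an unsigned "Hz" value is absolute
    amount = float(number) if sign == "+" else -float(number)
    if unit == "Hz":
        return base_hz + amount
    if unit == "st":
        # Assumption: equal temperament, one semitone = a factor of 2**(1/12).
        return base_hz * 2.0 ** (amount / 12.0)
    return base_hz * (1.0 + amount / 100.0)  # percentage


for value in ("120Hz", "+10Hz", "-5.5Hz", "+12st", "-10%"):
    print(value, apply_pitch(100.0, value))
```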
The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    good morning
  </prosody>
</speak>
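The contour rules (ignore positions outside 0% to 100%, copy the nearest target to a missing 0% or 100%) can be sketched as a small parser. The name parse_contour is hypothetical, targets are kept as uninterpreted strings, and interpolation between targets is left out because it is processor-specific.

```python
import re

# One "(position%,target)" pair; the target is kept as an opaque string.
_PAIR = re.compile(
    r"\(\s*([+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+))%\s*,\s*([^\s()]+)\s*\)"
)


def parse_contour(contour: str):
    """Return a sorted list of (percent, target) pairs covering 0% to 100%."""
    targets = []
    for position, target in _PAIR.findall(contour):
        percent = float(position)
        if 0.0 <= percent <= 100.0:  # positions outside 0%..100% are ignored
            targets.append((percent, target))
    targets.sort(key=lambda pair: pair[0])
    if targets and targets[0][0] > 0.0:
        targets.insert(0, (0.0, targets[0][1]))  # copy nearest target to 0%
    if targets and targets[-1][0] < 100.0:
        targets.append((100.0, targets[-1][1]))  # copy nearest target to 100%
    return targets


print(parse_contour("(0%,+20Hz) (10%,+30%) (40%,+10Hz)"))
```

For the contour in the example above, the last target at 40% is copied to 100% because no target is given for the end of the text.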
The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.
The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
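The precedence rules for prosody attributes amount to discarding the overridden attributes before rendering. A minimal sketch, with the hypothetical name effective_prosody and attributes represented as a plain dictionary:

```python
def effective_prosody(attrs: dict) -> dict:
    """Drop prosody attributes that are overridden by a higher-precedence one."""
    result = dict(attrs)  # leave the caller's dictionary untouched
    if "duration" in result:
        result.pop("rate", None)   # duration takes precedence over rate
    if "contour" in result:
        result.pop("pitch", None)  # contour takes precedence over pitch
        result.pop("range", None)  # ... and over range
    return result


print(effective_prosody({"rate": "fast", "duration": "2s", "pitch": "high"}))
```

Attributes that are absent from the result simply mean no change, matching the default described above.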
The prosody element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.
The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available.
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- Empty element -->
  Please say your name after the tone. <audio src="beep.wav"/>
  <!-- Container element with alternative text -->
  <audio src="prompt.au">What city do you want to fly from?</audio>
  <audio src="welcome.wav">
    <emphasis>Welcome</emphasis> to the Voice Portal.
  </audio>
</speak>
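The fallback behaviour of audio (play the recording if possible, otherwise speak the element content) can be sketched as follows. The function render_audio and the can_play callback are hypothetical stand-ins for whatever a processor uses to fetch and decode audio; the alternative content is reduced to plain text here, whereas a real processor would render its markup.

```python
import xml.etree.ElementTree as ET


def render_audio(element, can_play):
    """Return ("audio", src) if the source is playable, else ("speech", text)."""
    src = element.get("src")
    if src is not None and can_play(src):
        return ("audio", src)
    # Fall back to the alternative content; markup is flattened to text here.
    fallback = "".join(element.itertext()).strip()
    return ("speech", fallback)


audio = ET.fromstring(
    '<audio src="welcome.wav"><emphasis>Welcome</emphasis> to the Voice Portal.</audio>'
)
print(render_audio(audio, can_play=lambda src: True))
print(render_audio(audio, can_play=lambda src: False))
```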
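An audio element is successfully rendered:
if the referenced audio source is played, or
if the synthesis processor is unable to play the audio source but the alternative content is successfully rendered, or
if the processor can detect that text-only output is required and the alternative content is successfully rendered.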
The audio element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.
End of document.