Speech Synthesis Markup Language (SSML) Version 1.0

W3C Recommendation 7 September 2004

This document is a modified and shortened version of:
 
http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
 
Please read the original document for acknowledgements, references and more detailed information.
 
Latest version:
http://www.w3.org/TR/speech-synthesis/

Editors:
Daniel C. Burnett, Nuance Communications
Mark R. Walker, Intel
Andrew Hunt, ScanSoft

Abstract

The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

0. Table of Contents

1. Introduction

This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A.

SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

The intended use of SSML is to improve the quality of synthesized content. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like phoneme and prosody (e.g. for speech contour design) may require specialized knowledge.

2. SSML Document Form

A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog. If present, the optional DOCTYPE must read as follows:

<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> 

The XML prolog is followed by the root speak element.

The speak element must designate the SSML namespace. This can be achieved by declaring an xmlns attribute or an attribute with an "xmlns" prefix. Note that when the xmlns attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements.

An example of a legal SSML header:

<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">

3. Elements and Attributes

The following elements and attributes are defined in this specification.

3.1 Document Structure, Text Processing and Pronunciation

3.1.1 speak Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a required attribute specifying the language of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document and must have the value "1.0".

<?xml version="1.0"?>
<speak version="1.0"
         xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  ... the body ...
</speak>

The speak element can only contain text to be rendered and the following elements: audio, break, emphasis, p, phoneme, prosody, say-as, sub, s, voice.

3.1.2 Language: xml:lang Attribute

The xml:lang attribute can be used in SSML to indicate the natural language of the enclosing element and its attributes and subelements.

Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

xml:lang is a defined attribute for the voice, speak, p, and s elements. For vocal rendering, a language change can have an effect on various other parameters (including gender, speed, age, pitch, etc.) which may be disruptive to the listener. There might even be unnatural breaks between language shifts. For this reason authors are encouraged to use the voice element to change the language. xml:lang is permitted on p and s only because it is common to change the language at those levels.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">日本語が分かりません。</p>
</speak>

There may be variation across conforming processors in the implementation of xml:lang voice changes for different markup elements (e.g. p and s elements).

All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis, break, p and s elements should each be rendered in a manner that is appropriate to the current language.

3.1.3 Text Structure: p and s Elements

A p element represents a paragraph. An s element represents a sentence.

xml:lang is a defined attribute on the p and s elements.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>

The use of p and s elements is optional.

The p and s element(s) can only contain text to be rendered and the following elements: audio, break, emphasis, phoneme, prosody, say-as, sub, voice.

The p element can also contain the s element.

3.1.4 say-as Element

The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute.

The say-as element can only contain text to be rendered.

The interpret-as and format attributes

The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.

In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context.

When the value for the interpret-as attribute is unknown or unsupported by a processor, it must render the contained text as if no interpret-as value were specified.

When the value for the format attribute is unknown or unsupported by a processor, it must render the contained text as if no format value were specified, and should render it using the interpret-as value that is specified.

The detail attribute

The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a synthesis processor will usually render punctuations through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuations explicitly, e.g. for reading out coded part numbers or pieces of software code.

The detail attribute can be used for all interpret-as types.

If the detail attribute is not specified, the level of detail that is produced by the synthesis processor depends on the text content and the language.

3.1.5 phoneme Element

The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a required attribute that specifies the phoneme/phone string.

This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon, while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.

The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" and vendor-defined strings of the form "x-organization" or "x-organization-alphabet".

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
  <!-- This is an example of IPA using character entities -->
  <!-- Because many platform/browser/text editor combinations do not
       correctly cut and paste Unicode text, this example uses the entity
       escape versions of the IPA characters.  Normally, one would directly
       use the UTF-8 representation of these symbols: "təmei̥ɾou̥". -->
</speak>

It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor.

The phoneme element itself can only contain text (no elements).

3.1.6 sub Element

The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string.

The sub element can only contain text (no elements).

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <sub alias="World Wide Web Consortium">W3C</sub>
  <!-- World Wide Web Consortium -->
</speak>

3.2 Prosody and Style

3.2.1 voice Element

The voice element is a production element that requests a change in speaking voice. Attributes are:

Although each attribute individually is optional, it is an error if no attributes are specified when the voice element is used.

The voice element is commonly used to change the language. When there is not a voice available that exactly matches the attributes specified in the document, or there are multiple voices that match the criteria, the voice selection algorithm must be used. Approximately speaking, the xml:lang attribute has the highest priority and all other attributes are equal in priority but below xml:lang (the complete algorithm can be found in the original document).

voice attributes are inherited down the tree including to within elements that change the language.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female child's voice -->
  <voice gender="female" variant="2">
  Its fleece was white as snow.
  </voice>
  <!-- processor-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
  <voice gender="female"> 
    Any female voice here.
    <voice age="6"> 
      A female child voice here.
      <p xml:lang="ja"> 
        <!-- A female child voice in Japanese. -->
      </p>
    </voice>
  </voice>
</speak>

Relative changes in prosodic parameters should be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.

The voice element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.

3.2.2 emphasis Element

The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis>
  bank account!
</speak>

The emphasis element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

3.2.3 break Element

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:

If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element was supplied.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  Take a deep breath <break/>
  then continue. 
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you! <break strength="weak"/> Please repeat.
</speak>

3.2.4 prosody Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:

Although each attribute individually is optional, it is an error if no attributes are specified when the prosody element is used. The "x-foo " attribute value names are intended to be mnemonics for "extra foo". All units ("Hz", "st") are case-sensitive.

Number

A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.

Relative values

Relative changes for the attributes above can be specified

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  The price of XYZ is <prosody rate="-10%">$45</prosody>
</speak>

Pitch contour

The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    good morning
  </prosody>
</speak>

The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.

The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.

The prosody element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.

3.3 Other Elements

3.3.1 audio Element

The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
                 
  <!-- Empty element -->
  Please say your name after the tone.  <audio src="beep.wav"/>

  <!-- Container element with alternative text -->
  <audio src="prompt.au">What city do you want to fly from?</audio>
  <audio src="welcome.wav">  
    <emphasis>Welcome</emphasis> to the Voice Portal. 
  </audio>

</speak>

An audio element is successfully rendered:

  1. If the referenced audio source is played, or
  2. If the synthesis processor is unable to execute #1 but the alternative content is successfully rendered, or
  3. If the processor can detect that text-only output is required and the alternative content is successfully rendered.

The audio element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.

End of document.