SAPI XML TTS for Application Developers

By ArunGanesh
August 01, 2002

This article explains the XML tags that help you control the voice in SAPI 5.1.
The SAPI text-to-speech (TTS) extensible markup language (XML) tags fall into the following categories.

1. Voice state control
2. Direct item insertion
3. Voice context control
4. Voice selection
5. Custom pronunciation

With these XML tags you can easily control and enhance the voice and its pronunciation. A wave file accompanies this article; by playing it you can hear the effect of these tags.


Voice state control tags
Five tags control the state of the voice in SAPI TTS: Volume, Rate, Pitch, Emph, and Spell. Let us look at each of them in detail.


The Volume tag controls the volume of a voice.
The Volume tag has one required attribute, Level. Its value should be an integer between zero and one hundred. Values outside this range will be truncated.
<volume level="20">
This text should be spoken at volume level twenty.
</volume>
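As a small illustration of the Level range, here is a minimal Python sketch that builds a Volume tag as a string. The function name `volume_tag` is hypothetical, and reading "truncated" as clamping to the nearest limit is an assumption of this sketch; the engine itself decides what happens to out-of-range values.

```python
def volume_tag(text: str, level: int) -> str:
    """Wrap text in a SAPI <volume> tag.

    Assumption: out-of-range levels are clamped to 0..100 here, before
    the markup ever reaches the speech engine.
    """
    clamped = max(0, min(100, int(level)))
    return f'<volume level="{clamped}">{text}</volume>'


if __name__ == "__main__":
    print(volume_tag("This text should be spoken at volume level twenty.", 20))
    print(volume_tag("This level is clamped to one hundred.", 250))
```

The resulting string is what an application would hand to the voice for speaking; the text is inserted as-is, without any escaping.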

The Rate tag controls the rate of a voice.
The Rate tag has two attributes, Speed and AbsSpeed, one of which must be present.
The value of both attributes should be an integer between negative ten and ten. Zero represents the default rate of a voice, with positive values being faster and negative values being slower. The AbsSpeed attribute sets the absolute rate of the voice, so a value of ten always means rate ten and a value of five always means rate five.
<rate absspeed="5">
This text should be spoken at rate five.
</rate>
<rate absspeed="-5">
This text should be spoken at rate negative five.
</rate>
<rate absspeed="10"/>
All text that follows should be spoken at rate ten.
The Speed attribute controls the relative rate of the voice.
<rate speed="5">
This text should be spoken at rate five.
</rate>
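The choice between the two Rate attributes can be sketched in Python as follows. The helper `rate_tag` and its `absolute` flag are hypothetical names for this illustration; the sketch rejects values outside negative ten to ten instead of guessing how a particular engine would treat them.

```python
def rate_tag(text: str, value: int, absolute: bool = False) -> str:
    """Wrap text in a SAPI <rate> tag.

    absolute=True emits the AbsSpeed attribute, otherwise Speed.
    Values outside -10..10 are rejected (a choice made by this sketch).
    """
    if not -10 <= value <= 10:
        raise ValueError("rate must be an integer between -10 and 10")
    attribute = "absspeed" if absolute else "speed"
    return f'<rate {attribute}="{value}">{text}</rate>'


if __name__ == "__main__":
    print(rate_tag("This text should be spoken at rate five.", 5, absolute=True))
    print(rate_tag("This text is slower than the current rate.", -5))
```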
The Pitch tag controls the pitch of a voice. The Pitch tag has two attributes, Middle and AbsMiddle, one of which must be present. The value of both attributes should be an integer between negative ten and ten. The AbsMiddle attribute controls the absolute pitch of the voice.
<pitch absmiddle="5">
This text should be spoken at pitch five.
</pitch>
<pitch absmiddle="-5">
This text should be spoken at pitch negative five.
</pitch>
<pitch absmiddle="10"/>
All text that follows should be spoken at pitch ten.
The Middle attribute controls the relative pitch of the voice. When relative tags are nested, the inner value is applied on top of the outer one.
<pitch middle="5">
This text should be spoken at pitch five.
<pitch middle="-5">
This text should be spoken at pitch zero.
</pitch>
</pitch>
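The difference between absolute and relative values can be modelled with a few lines of Python. This is only a sketch of the arithmetic suggested by the Middle example (five, then negative five, gives zero); the function name `effective_values` and the additive model are assumptions, and a real engine may also limit the result to its supported range.

```python
def effective_values(adjustments, default=0):
    """Return the value in effect after each nested adjustment.

    Each adjustment is a ("abs", value) or ("rel", value) pair, listed
    from the outermost tag to the innermost one. "abs" replaces the
    current value, "rel" is added to it (assumed additive model).
    """
    current = default
    levels = []
    for kind, value in adjustments:
        if kind == "abs":
            current = value
        elif kind == "rel":
            current += value
        else:
            raise ValueError(f"unknown adjustment kind: {kind!r}")
        levels.append(current)
    return levels


if __name__ == "__main__":
    # <pitch middle="5"> ... <pitch middle="-5"> ... </pitch></pitch>
    print(effective_values([("rel", 5), ("rel", -5)]))   # [5, 0]
    # <pitch absmiddle="10"> ... <pitch middle="-3"> ... </pitch></pitch>
    print(effective_values([("abs", 10), ("rel", -3)]))  # [10, 7]
```

The same model applies to the Speed and AbsSpeed attributes of the Rate tag.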

The Emph tag instructs the voice to emphasize a word or section of text. The Emph tag cannot be empty.
The following words should be emphasized: <emph> ARUN MICRO SYSTEMS </emph>!
The method of emphasis may vary from voice to voice.
The Spell tag forces the voice to spell out all text, rather than using its default word and sentence breaking rules, normalization rules, and so forth. The Spell tag cannot be empty.
<spell> These words should be spelled out. </spell>
These words should not be spelled out.
Direct item insertion tags
Three tags give applications the ability to insert items directly at some level: Silence, Pron, and Bookmark.
The Silence tag inserts a specified number of milliseconds of silence into the output audio stream. This tag must be empty, and must have one attribute, Msec.
Five hundred milliseconds of silence <silence msec="500"/> just occurred.
The Pron tag inserts a specified pronunciation. The voice will process the sequence of phonemes exactly as they are specified. This tag can be empty, or it can have content. The Pron tag has one attribute, Sym, whose value is a string of white space separated phonemes.
<pron sym="h eh 1 l ow & w er 1 l d "/>
<pron sym="h eh 1 l ow & w er 1 l d"> hello world </pron>

The Bookmark tag inserts a bookmark event into the output audio stream. Use this event to signal the application when the audio corresponding to the text at the Bookmark tag has been reached. The Bookmark tag must be empty.
The application will receive an event here,
<bookmark mark="bookmark_one"/>
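To show how an application might prepare text with bookmarks, here is a minimal Python sketch that places one Bookmark tag after each sentence and remembers which name belongs to which sentence. The function `add_bookmarks` and the naming scheme `bookmark_1`, `bookmark_2`, and so on are hypothetical; handling the bookmark event itself is done through the SAPI event interface and is not shown here.

```python
def add_bookmarks(sentences, prefix="bookmark_"):
    """Append a <bookmark> tag to each sentence.

    Returns the marked-up text and a dict mapping each bookmark name
    to the sentence that precedes it.
    """
    parts = []
    marks = {}
    for index, sentence in enumerate(sentences, start=1):
        name = f"{prefix}{index}"
        parts.append(f'{sentence} <bookmark mark="{name}"/>')
        marks[name] = sentence
    return " ".join(parts), marks


if __name__ == "__main__":
    text, marks = add_bookmarks(["First sentence.", "Second sentence."])
    print(text)
    print(marks)
```

When the application later receives a bookmark event, it can look up the name in `marks` to find out which sentence has just been spoken.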

Voice context control tags
Two tags provide context to the current voice: PartOfSp and Context. These tags help the voice determine how to handle the text it is processing.
The PartOfSp tag provides the voice with the part of speech of the enclosed word(s). Use this tag to enable the voice to pronounce a word with multiple pronunciations correctly depending on its part of speech. The PartOfSp tag cannot be empty.
The PartOfSp tag has one attribute, Part, whose value is a string naming a SAPI part of speech. Only SAPI-defined parts of speech are supported: "Unknown", "Noun", "Verb", "Modifier", "Function", and "Interjection".
<partofsp part="noun"> A </partofsp> is the first letter of the alphabet.
Did you <partofsp part="verb"> record </partofsp> that <partofsp part="noun"> record </partofsp>?
The Context tag provides the voice with information, which the voice may then use to determine how to normalize special items, like dates, numbers, and currency.
<context id="date_mdy"> 03/04/02 </context> should be March fourth, two thousand two.
<context id="date_dmy"> 03/04/02 </context> should be April third, two thousand two.
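The ambiguity that the Context tag resolves can be reproduced with Python's standard `datetime` module. This sketch only illustrates the two readings of the same string; the mapping `FORMATS` and the function `read_date` are hypothetical names, and this is not how a SAPI voice performs normalization internally.

```python
from datetime import date, datetime

# Assumed mapping from the two context ids used above to strptime patterns.
FORMATS = {
    "date_mdy": "%m/%d/%y",  # month/day/year
    "date_dmy": "%d/%m/%y",  # day/month/year
}


def read_date(text: str, context_id: str) -> date:
    """Interpret a date string according to the given context id."""
    return datetime.strptime(text.strip(), FORMATS[context_id]).date()


if __name__ == "__main__":
    print(read_date("03/04/02", "date_mdy"))  # 2002-03-04, March fourth
    print(read_date("03/04/02", "date_dmy"))  # 2002-04-03, April third
```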
Voice Selection Tags
Two tags can potentially change the current voice: Voice and Lang.

The Voice tag selects a voice based on the voice's attributes: Age, Gender, Language, Name, Vendor, and VendorPreferred. The tag itself has two attributes: Required and Optional.
If no voice is found that matches all of the required attributes, no voice change will occur.
The default voice should speak this sentence.
<voice required="Gender=Female;Age!=Child">
A female non-child should speak this sentence, if one exists.
<voice required="Age=Teen">
A teen should speak this sentence - if a female, non-child teen is present, she will be selected over a male teen, for example.
</voice>
</voice>
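The rule for the Required attribute (every condition must match, otherwise the voice does not change) can be sketched in Python. The voice list, the dictionary keys, and the functions `parse_required` and `select_voice` are all hypothetical; the sketch handles only the `=` and `!=` conditions seen in the examples and does not model the Optional attribute or the preference carried over from an enclosing Voice tag.

```python
def parse_required(required: str):
    """Split 'Gender=Female;Age!=Child' into (key, value, must_equal) triples."""
    conditions = []
    for clause in required.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        if "!=" in clause:
            key, value = clause.split("!=", 1)
            conditions.append((key.strip(), value.strip(), False))
        else:
            key, value = clause.split("=", 1)
            conditions.append((key.strip(), value.strip(), True))
    return conditions


def select_voice(voices, required: str, current):
    """Return the first voice meeting every condition, else the current voice."""
    conditions = parse_required(required)
    for voice in voices:
        if all((voice.get(key) == value) == must_equal
               for key, value, must_equal in conditions):
            return voice
    return current  # no match: no voice change occurs


if __name__ == "__main__":
    # Hypothetical installed voices, for illustration only.
    voices = [
        {"Name": "Voice A", "Gender": "Male", "Age": "Adult"},
        {"Name": "Voice B", "Gender": "Female", "Age": "Adult"},
    ]
    default = voices[0]
    print(select_voice(voices, "Gender=Female;Age!=Child", default)["Name"])
    print(select_voice(voices, "Age=Teen", default)["Name"])
```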
The Lang tag selects a voice based solely on its Language attribute.
The Lang tag has one attribute, LangId. This attribute should be a LANGID, such as 409 (U.S. English) or 411 (Japanese). Note that these numbers are hexadecimal, but without the typical "0x".
The Lang tag is a shortened version of the Voice tag with the Required attribute containing "Language=xxx". So the following examples should produce exactly the same results:
<voice required="Language=409">
A U.S. English voice should speak this.
</voice>
<lang langid="409">
A U.S. English voice should speak this.
</lang>
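Because the LangId value is hexadecimal without the "0x" prefix, it is easy to confuse with a decimal number. The following Python sketch converts between the two forms; the function names `langid_to_int` and `int_to_langid` are hypothetical.

```python
def langid_to_int(langid: str) -> int:
    """Convert a SAPI LangId string such as '409' (hex, no 0x) to an integer."""
    return int(langid, 16)


def int_to_langid(value: int) -> str:
    """Convert a numeric LANGID back to the hex string used in the tag."""
    return format(value, "x")


if __name__ == "__main__":
    print(langid_to_int("409"))  # 1033, U.S. English
    print(langid_to_int("411"))  # 1041, Japanese
    print(int_to_langid(1033))   # 409
```

So an application that holds the LANGID as the number 1033 should write `langid="409"` in the tag, not `langid="1033"`.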

Custom Pronunciation:
An alternative to using the <P> tag with the DISP and PRON attributes is custom pronunciation. With custom pronunciation, a tag of the form
<P DISP="disp" PRON="pron">word</P>
can be written as

To test the tags above and to create a TTS application using SAPI, please take a look at my article: “SAPI 5.1 in Creating Text to Speech Applications using C#”.
I hope this article is helpful. Happy programming.


Microsoft Speech SDK, version 5.1