This article explains the XML tags that help you control the voice in SAPI 5.1.
The SAPI text-to-speech (TTS) extensible markup language (XML) tags fall into the following categories:
1. Voice state control
2. Direct item insertion
3. Voice context control
4. Voice selection
5. Custom pronunciation
With these XML tags you can control and refine the voice and its pronunciation. A wave file accompanies this article; play it to hear the effect of these tags.
VOICE STATE CONTROL TAGS:
Five tags control the state of the voice in SAPI TTS: Volume, Rate, Pitch, Emph, and Spell. Each is described in detail below.
Volume:
The Volume tag controls the volume of a voice. It has one required attribute, Level, whose value should be an integer between zero and one hundred. Values outside this range will be truncated.
Ex:
<volume level="20">
This text should be spoken at volume level twenty.
</volume>
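The handling of out-of-range levels can be pictured with a small Python sketch. The helper name volume_tag is hypothetical, and clamping to the nearest end of the range is an assumption about what "truncated" means here; SAPI itself decides what happens when it parses the tag.

```python
def volume_tag(text: str, level: int) -> str:
    """Wrap text in a Volume tag (hypothetical helper, not part of SAPI)."""
    # Assumption: "truncated" means clamped to the nearest end of 0..100.
    clamped = max(0, min(100, level))
    return f'<volume level="{clamped}">{text}</volume>'


print(volume_tag("This text should be spoken at volume level twenty.", 20))
print(volume_tag("Too loud.", 150))  # the level is clamped before the tag is built
```

Clamping in the application keeps the generated markup inside the documented range, so the result does not depend on how a particular engine treats bad values.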
Rate:
The Rate tag controls the rate of a voice. It has two attributes, Speed and AbsSpeed, one of which must be present. The value of either attribute should be an integer between negative ten and ten. The AbsSpeed attribute sets the absolute rate of the voice, so a value of ten always produces rate ten and a value of five always produces rate five, regardless of any enclosing Rate tag.
<rate absspeed="5">
This text should be spoken at rate five.
<rate absspeed="-5">
This text should be spoken at rate negative five.
</rate>
</rate>
<rate absspeed="10"/>
All text that follows should be spoken at rate ten.
The Speed attribute controls the relative rate of the voice: its value is applied relative to the rate currently in effect.
Ex:
<rate speed="5">
This text should be spoken at rate five.
</rate>
Zero represents the default rate of a voice, with positive values being
faster and negative values being slower.
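One way to picture the difference between the two attributes is a small Python sketch that tracks the rate in effect as Rate tags nest. The function name and the tuple-based input are hypothetical, and the sketch assumes that a Speed value is added to the enclosing rate and that the result stays between negative ten and ten.

```python
def effective_rates(nesting):
    """Return the rate in effect after each nested Rate tag.

    `nesting` is a list of (attribute, value) pairs, outermost tag first.
    This is a hypothetical model for illustration, not a SAPI API.
    """
    rate = 0  # zero is the default rate of a voice
    result = []
    for attribute, value in nesting:
        if attribute == "absspeed":
            rate = value          # absolute: ignore the enclosing rate
        elif attribute == "speed":
            rate = rate + value   # relative: assumed to add to the enclosing rate
        else:
            raise ValueError(f"unknown Rate attribute: {attribute}")
        rate = max(-10, min(10, rate))  # assumption: keep within -10..10
        result.append(rate)
    return result


print(effective_rates([("absspeed", 5), ("absspeed", -5)]))  # [5, -5]
print(effective_rates([("speed", 5), ("speed", -5)]))        # [5, 0]
```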
Pitch:
The Pitch tag controls the pitch of a voice. It has two attributes, Middle and AbsMiddle, one of which must be present. The value of either attribute should be an integer between negative ten and ten. The AbsMiddle attribute controls the absolute pitch of the voice.
<pitch absmiddle="5">
This text should be spoken at pitch five.
<pitch absmiddle="-5">
This text should be spoken at pitch negative five.
</pitch>
</pitch>
<pitch absmiddle="10"/>
All text that follows should be spoken at pitch ten.
The Middle attribute controls the relative pitch of the voice.
<pitch middle="5">
This text should be spoken at pitch five.
<pitch middle="-5">
This text should be spoken at pitch zero.
</pitch>
</pitch>
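The nesting in the examples above can be traced with a short Python sketch that walks the markup and reports the pitch in effect for each piece of text. It is only an illustration: the function name is hypothetical, the markup is wrapped in a made-up root element so the standard XML parser accepts it, and the empty-tag form (which affects all following text) is not modeled.

```python
import xml.etree.ElementTree as ET


def pitch_per_chunk(markup: str):
    """Return (pitch, text) pairs for text inside nested Pitch tags.

    Hypothetical illustration only; SAPI does its own parsing.
    """
    root = ET.fromstring(f"<root>{markup}</root>")
    chunks = []

    def walk(element, pitch):
        if element.tag == "pitch":
            if "absmiddle" in element.attrib:
                pitch = int(element.attrib["absmiddle"])       # absolute value
            elif "middle" in element.attrib:
                pitch = pitch + int(element.attrib["middle"])  # relative value
        if element.text and element.text.strip():
            chunks.append((pitch, element.text.strip()))
        for child in element:
            walk(child, pitch)
            # Text after a closing child tag belongs to the enclosing element.
            if child.tail and child.tail.strip():
                chunks.append((pitch, child.tail.strip()))

    walk(root, 0)
    return chunks


sample = '<pitch middle="5">pitch five<pitch middle="-5">pitch zero</pitch></pitch>'
print(pitch_per_chunk(sample))  # [(5, 'pitch five'), (0, 'pitch zero')]
```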
Emph:
The Emph tag instructs the voice to emphasize a word or section of text. The Emph tag cannot be empty.
The following text should be emphasized.
<emph> ARUN MICRO SYSTEMS [www.arunmicrosystems.netfirms.com] </emph>!
The method of emphasis may vary from voice to voice.
Spell:
The Spell tag forces the voice to spell out all text, rather than using its
default word and sentence breaking rules, normalization rules, and so forth. The
Spell tag cannot be empty.
<spell>
These words should be spelled out.
</spell>
These words should not be spelled out.
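Because Emph and Spell cannot be empty, an application that builds markup from user input may want to check the content first. The following Python sketch shows one way to do that; wrap_non_empty is a hypothetical helper, not part of SAPI.

```python
def wrap_non_empty(tag: str, text: str) -> str:
    """Wrap text in an Emph or Spell tag, refusing empty content.

    Hypothetical helper for illustration; it is not part of SAPI.
    """
    if tag not in ("emph", "spell"):
        raise ValueError(f"unsupported tag: {tag}")
    if not text.strip():
        raise ValueError(f"the {tag} tag cannot be empty")
    return f"<{tag}>{text}</{tag}>"


print(wrap_non_empty("spell", "These words should be spelled out."))
print(wrap_non_empty("emph", "important"))
```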
Direct item insertion tags
Three tags give applications the ability to insert items directly at some level: Silence, Pron, and Bookmark.
Silence:
The Silence tag inserts a specified number of milliseconds of silence into
the output audio stream. This tag must be empty, and must have one attribute,
Msec.
Five hundred milliseconds of silence <silence msec="500"/> just occurred.
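Applications often think in seconds rather than milliseconds, so a tiny conversion helper can be handy. This Python sketch is only an illustration: silence_tag is a hypothetical name, and rounding to a whole number of milliseconds is an assumption.

```python
def silence_tag(seconds: float) -> str:
    """Build an empty Silence tag from a duration in seconds (hypothetical helper)."""
    if seconds < 0:
        raise ValueError("silence duration cannot be negative")
    msec = round(seconds * 1000)  # assumption: Msec is a whole number of milliseconds
    return f'<silence msec="{msec}"/>'


print("Five hundred milliseconds of silence " + silence_tag(0.5) + " just occurred.")
```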
Pron:
The Pron tag inserts a specified pronunciation. The voice will process the
sequence of phonemes exactly as they are specified. This tag can be empty, or it
can have content. The Pron tag has one attribute, Sym, whose value is a string of whitespace-separated phonemes.
<pron sym="h eh 1 l ow & w er 1 l d "/>
<pron sym="h eh 1 l ow & w er 1 l d"> hello world </pron>
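If the phonemes are held as a list in the application, the Sym string can be assembled by joining them with spaces. The Python sketch below covers both the empty form and the form with content; pron_tag is a hypothetical helper, and the phoneme symbols are taken from the example above.

```python
def pron_tag(phonemes, text=None):
    """Build a Pron tag from a list of phoneme symbols (hypothetical helper).

    With no text the tag is empty; otherwise it wraps the given text.
    """
    sym = " ".join(phonemes)  # Sym is a whitespace-separated phoneme string
    if text is None:
        return f'<pron sym="{sym}"/>'
    return f'<pron sym="{sym}">{text}</pron>'


hello = ["h", "eh", "1", "l", "ow"]
print(pron_tag(hello))
print(pron_tag(hello, "hello"))
```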
Bookmark:
The Bookmark tag inserts a bookmark event into the output audio stream. Use this event to signal the application when the audio corresponding to the text at the Bookmark tag has been reached. The Bookmark tag must be empty, and its Mark attribute names the bookmark.
The application will receive an event here,
<bookmark mark="bookmark_one"/>
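A common use of bookmarks is tracking progress through a longer text. As a rough Python sketch of that idea, the helper below places a bookmark in front of every sentence; the function name, the mark names, and the naive sentence splitting are all assumptions made for this example.

```python
import re


def add_bookmarks(text: str, prefix: str = "sentence_") -> str:
    """Insert an empty Bookmark tag before each sentence (hypothetical helper).

    The mark names (prefix plus a counter) are made up for this example.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    parts = []
    for index, sentence in enumerate(sentences, start=1):
        parts.append(f'<bookmark mark="{prefix}{index}"/>{sentence}')
    return " ".join(parts)


print(add_bookmarks("First sentence. Second sentence!"))
```

The application can then match each bookmark event it receives against the mark names it generated.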
Voice context control tags
Two tags provide context to the current voice: PartOfSp and Context. These tags enable the voice to determine how to deal with the text it is processing.
PartOfSp:
The PartOfSp tag provides the voice with the part of speech of the enclosed word(s). Use this tag to enable the voice to pronounce a word with multiple pronunciations correctly depending on its part of speech. The PartOfSp tag cannot be empty.
The PartOfSp tag has one attribute, Part, whose value is a string naming a SAPI part of speech. Only SAPI-defined parts of speech are supported: "Unknown", "Noun", "Verb", "Modifier", "Function", "Interjection".
<partofsp part="noun"> A </partofsp> is the first letter of the alphabet.
Did you <partofsp part="verb"> record </partofsp> that <partofsp part="noun"> record </partofsp>?
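Since only the SAPI-defined parts of speech are accepted, it can help to validate the value before building the tag. The Python sketch below does this; partofsp_tag is a hypothetical helper, and treating the part name as case-insensitive (the examples above use lowercase) is an assumption.

```python
SAPI_PARTS = ("Unknown", "Noun", "Verb", "Modifier", "Function", "Interjection")


def partofsp_tag(word: str, part: str) -> str:
    """Wrap a word in a PartOfSp tag after checking the part of speech.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    if not word.strip():
        raise ValueError("the PartOfSp tag cannot be empty")
    allowed = {name.lower() for name in SAPI_PARTS}
    if part.lower() not in allowed:
        raise ValueError(f"not a SAPI part of speech: {part}")
    return f'<partofsp part="{part.lower()}"> {word} </partofsp>'


print("Did you " + partofsp_tag("record", "Verb") + " that "
      + partofsp_tag("record", "Noun") + "?")
```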
Context:
The Context tag provides the voice with information, through its ID attribute, that the voice may then use to determine how to normalize special items, like dates, numbers, and currency.
<context id="date_mdy"> 03/04/02 </context> should be March fourth, two thousand two.
<context id="date_dmy"> 03/04/02 </context> should be April third, two thousand two.
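The ambiguity that the Context tag resolves is easy to reproduce outside SAPI. In this Python sketch the same string is parsed under the two orderings; the mapping from context IDs to strptime formats and the read_date name are illustrative assumptions, not how a voice actually normalizes text.

```python
from datetime import datetime

# Illustrative mapping from the two context IDs above to parse formats.
FORMATS = {"date_mdy": "%m/%d/%y", "date_dmy": "%d/%m/%y"}


def read_date(text: str, context_id: str):
    """Interpret a date string the way the given context ID suggests."""
    return datetime.strptime(text.strip(), FORMATS[context_id]).date()


print(read_date(" 03/04/02 ", "date_mdy").isoformat())  # 2002-03-04
print(read_date(" 03/04/02 ", "date_dmy").isoformat())  # 2002-04-03
```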
Voice Selection Tags
Two tags can potentially change the current voice: Voice and Lang.
Voice:
The Voice tag selects a voice based on the voice's attributes: Age, Gender, Language, Name, Vendor, and VendorPreferred. The tag itself has two attributes, Required and Optional, each of which lists the voice attributes to match.
If no voice is found that matches all of the required attributes, no voice change will occur.
Ex:
The default voice should speak this sentence.
<voice required="Gender=Female;Age!=Child">
A female non-child should speak this sentence, if one exists.
<voice required="Age=Teen">
A teen should speak this sentence - if a female, non-child teen is present, she will be selected over a male teen, for example.
</voice>
</voice>
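To make the matching rule concrete, here is a rough Python sketch that checks a single Required string against a list of voices. Everything in it is hypothetical: the function names, the voice dictionaries, and the parsing are simplifications for illustration, and the preference carried over from an enclosing Voice tag is not modeled.

```python
def parse_required(required: str):
    """Split 'Gender=Female;Age!=Child' into (name, operator, value) triples."""
    conditions = []
    for clause in required.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        if "!=" in clause:
            name, value = clause.split("!=", 1)
            conditions.append((name.strip(), "!=", value.strip()))
        else:
            name, value = clause.split("=", 1)
            conditions.append((name.strip(), "=", value.strip()))
    return conditions


def matches(voice: dict, required: str) -> bool:
    """Return True if the voice satisfies every required condition."""
    for name, operator, value in parse_required(required):
        actual = voice.get(name)
        if operator == "=" and actual != value:
            return False
        if operator == "!=" and actual == value:
            return False
    return True


# Made-up voices for the example; real voices come from the installed engines.
voices = [
    {"Name": "VoiceA", "Gender": "Male", "Age": "Adult"},
    {"Name": "VoiceB", "Gender": "Female", "Age": "Teen"},
]
selected = [v["Name"] for v in voices if matches(v, "Gender=Female;Age!=Child")]
print(selected)  # ['VoiceB']
```

If the list comes back empty, the rule above applies: no voice matches all required attributes, so the current voice stays in use.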
Lang:
The Lang tag selects a voice based solely on its Language attribute.
The Lang tag has one attribute, LangId. This attribute should be a LANGID, such as 409 (U.S. English) or 411 (Japanese). Note that these numbers are hexadecimal, written without the usual "0x" prefix.
The Lang tag is a shortened version of the Voice tag with the Required
attribute containing "Language=xxx". So the following examples should produce
exactly the same results:
<voice required="Language=409">
A U.S. English voice should speak this.
</voice>
<lang langid="409">
A U.S. English voice should speak this.
</lang>
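Because the LangId value is hexadecimal without a prefix, it is easy to confuse with a decimal number. The following Python sketch converts in both directions; the function names are hypothetical.

```python
def langid_to_int(langid: str) -> int:
    """Convert a LangId string such as '409' to its numeric value."""
    return int(langid, 16)  # the string is hexadecimal without the "0x" prefix


def int_to_langid(value: int) -> str:
    """Convert a numeric LANGID back to the string form used in the tag."""
    return format(value, "x")


print(langid_to_int("409"))  # 1033
print(langid_to_int("411"))  # 1041
print(int_to_langid(1033))   # 409
```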
Custom Pronunciation:
An alternative to using the <P> tag with the DISP and PRON attributes is to use custom pronunciation. With custom pronunciation, a tag of the form
<P DISP="disp" PRON="pron">word</P>
can be written as
<P>/disp/word/pron;</P>
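The two forms carry the same three pieces of information, which a small Python sketch can make explicit. Both function names are hypothetical, and the sketch simply follows the two patterns shown above without validating the values.

```python
def to_attribute_form(word: str, disp: str, pron: str) -> str:
    """Build the <P> tag with DISP and PRON attributes."""
    return f'<P DISP="{disp}" PRON="{pron}">{word}</P>'


def to_custom_pronunciation(word: str, disp: str, pron: str) -> str:
    """Build the equivalent custom pronunciation form: /disp/word/pron;"""
    return f"<P>/{disp}/{word}/{pron};</P>"


print(to_attribute_form("word", "disp", "pron"))
print(to_custom_pronunciation("word", "disp", "pron"))
```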
Conclusion:
To test all of the tags above and to create a TTS application using SAPI, please take a look at my article “SAPI 5.1 in Creating Text to Speech Applications using C#”.
I hope this article is helpful. Happy programming.
Resource: Microsoft Speech SDK, version 5.1