Simulating Emotional Speech For A Talking Head November 2000

Simulating Emotional Speech for a Talking Head
November 2000
Contents
1 Introduction..............................................................................................1 2 Problem Description................................................................................2
2.1 Objectives..............................................................................................................2 2.2 Subproblems..........................................................................................................2 2.3 Significance...........................................................................................................3
3 Literature Review....................................................................................5
3.1 Emotion and Speech..............................................................................................5 3.2 The Speech Correlates of Emotion........................................................................6 3.3 Emotion in Speech Synthesis................................................................................8 3.4 Speech Markup Languages...................................................................................9 3.5 Extensible Markup Language (XML).................................................................10 3.5.1 XML Features..........................................................................................10 3.5.2 The XML Document................................................................................11 3.5.3 DTDs and Validation..............................................................................12 3.5.4 Document Object Model (DOM).............................................................14 3.5.5 SAX Parsing.............................................................................................15 3.5.6 Benefits of XML......................................................................................16 3.5.7 Future Directions in XML........................................................................17 3.6 FAITH.................................................................................................................18 3.7 Resource Review.................................................................................................20 3.7.1 Text-to-Speech Synthesizer.....................................................................20 3.7.2 XML Parser..............................................................................................22 3.8 Summary.............................................................................................................23
4 Research Methodology..........................................................................25
4.1 Hypotheses..........................................................................................................25
Page i
November 2000
4.2 Limitations and Delimitations.............................................................................26 4.2.1 Limitations...............................................................................................26 4.2.2 Delimitations............................................................................................26 4.3 Research Methodologies.....................................................................................27
5 Implementation......................................................................................28
5.1 TTS Interface.......................................................................................................28 5.1.2 Module Inputs..........................................................................................29 5.1.3 Module Outputs........................................................................................30 5.1.4 C/C++ API...............................................................................................31 5.2 SML Speech Markup Language.......................................................................32 5.2.1 SML Markup Structure............................................................................32 5.3 TTS Module Subsystems Overview....................................................................34 5.4 SML Parser..........................................................................................................35 5.5 SML Document...................................................................................................36 5.5.1 Tree Structure...........................................................................................36 5.5.2 Utterance Structures.................................................................................37 5.6 Natural Language Parser.....................................................................................39 5.6.2 Obtaining a Phoneme Transcription........................................................40 5.6.2 Synthesizing in Sections..........................................................................42 5.6.3 Portability Issues......................................................................................43 5.7 Implementation of Emotion Tags........................................................................44 5.7.1 Sadness.....................................................................................................45 5.7.2 Happiness.................................................................................................46 5.7.3 Anger........................................................................................................47 5.7.4 Stressed Vowels.......................................................................................48 5.7.5 Conclusion...............................................................................................48 5.8 Implementation of Low-level SML Tags............................................................49 5.8.1 Speech Tags.............................................................................................49 5.8.2 Speaker Tag..............................................................................................53 5.9 Digital Signal Processor......................................................................................54 5.10 Cooperating with the FAML module................................................................55 5.11 Summary...........................................................................................................57
Page ii
November 2000
6 Results and Analysis..............................................................................58

6.1 Data Acquisition..................................................................................................58 6.1.1 Questionnaire Structure and Design........................................................58 6.1.2 Experimental Procedure...........................................................................61 6.1.3 Profile of Participants...............................................................................63 6.2 Recognizing Emotion in Synthetic Speech.........................................................64 6.2.1 Confusion Matrix.....................................................................................64 6.2.2 Emotion Recognition for Section 2A.......................................................66 6.2.3 Emotion Recognition for Section 2B.......................................................69 6.2.4 Effect of Vocal Emotion on Emotionless Text........................................73 6.2.5 Effect of Vocal Emotion on Emotive Text..............................................75 6.2.6 Further Analysis.......................................................................................75 6.3 Talking Head and Vocal Expression...................................................................77 6.4 Summary.............................................................................................................81
7 Future Work...........................................................................................82
7.1 Post Waveform Processing..................................................................................82 7.2 Speaking Styles...................................................................................................83 7.3 Speech Emotion Development............................................................................84 7.4 XML Issues.........................................................................................................85 7.5 Talking Head.......................................................................................................86 7.6 Increasing Communication Bandwidth...............................................................87
8 Conclusion...............................................................................................88 9 Bibliography...........................................................................................91 10 Appendix A SML Tag Specification..................................................96

................................................................................................................................101
11 Appendix B SML DTD.....................................................................102 12 Appendix C Festival and Visual C++..............................................104 13 Appendix D Evaluation Questionnaire...........................................107 14 Appendix E Test Phrases for Questionnaire, Section 2B..............113
Page iii
November 2000
List of Figures
15 Figure 1 - An XML document holding simple weather information.11 16 Figure 2 - Sample section of a DTD file...............................................12 17 Figure 3 - XML syntax error - list and item tags incorrectly matched. 13 18 Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag)...........................................................................................................13 19 Figure 5 Well-formed XML document that also follows DTD grammar specification. Will not produce any parse errors..............13 20 Figure 6 - DOM representation of XML example..............................15 21 Figure 7 - FAITH project architecture................................................19 22 Figure 8 - Talking Head being developed as part of the FAITH project at the School of Computing, Curtin University of Technology. 20 23 Figure 9 - Top level outline showing how Festival and MBROLA systems were used together...................................................................21 24 Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head...................................................................................28 25 Figure 11 - Top-level structure of an SML document........................32 26 Figure 12 - Valid SML markup............................................................33 27 Figure 13 - Invalid SML markup.........................................................33 28 Figure 14 - TTS module subsystems.....................................................34 29 Figure 15 - Filtering process of unknown tags....................................36 30 Figure 16 - SML Document structure for SML markup given above. 37
Page iv
November 2000
31 Figure 17 - Utterance structures to hold the phrase the moon. U = CTTS_UtteranceInfo object; W = CTTS_UtteranceInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint object. 39 32 Figure 18 - Tokenization of a part of an SML Document................................................................................................40
33 Figure 19 - SML Document sub-tree representing example SML markup....................................................................................................41 34 Figure 20 - Raw timeline showing server and client execution when synthesizing example SML markup above..........................................43 35 Figure 21 Multiply factors of pitch and duration values for emphasized phonemes............................................................................50 36 Figure 22 - Processing a pause tag........................................................51 37 Figure 23 - The effect of widening the pitch range of an utterance...52 38 Figure 24 - Processing the pron tag......................................................52 39 Figure 25 - Example MBROLA input..................................................55 40 Figure 26 - Example utterance information supplied to the FAML module by the TTS module. Example phrase: And now the latest news.......................................................................................................56 41 Figure 27 - A node carrying waveform processing instructions for an operation.................................................................................................83 42 Figure 28 - Insertion of new submodule for post waveform processing................................................................................................83 43 Figure 29 - SML Markup containing a link to a stylesheet................84 44 Figure 30 - Inclusion of an XML Handler module to centrally manage XML input................................................................................85 45 Figure 31 - Proposed design of TTS Module architecture to minimize bandwidth problems between server and client..................................87
Page v
November 2000
List of Tables
46 Table 1 - Summary of human vocal emotion effects.............................8 47 Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness..........................................................................44 48 Table 3 Speech correlate values implemented for sadness..............45 49 Table 4 - Speech correlate values implemented for happiness..........46 50 Table 5 - Speech correlate values implemented for anger.................47 51 Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch..................................................................................48 52 Table 7- MBROLA command line option values for en1 and us1 diphone databases to output male and female voices.........................54 53 Table 8 - Statistics of participants........................................................63 54 Table 9 - Confusion matrix template....................................................64 55 Table 10 - Confusion matrix with sample data...................................65 56 Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions.........................................65 57 Table 12 Listener response data for neutral phrases spoken with happy emotion........................................................................................66 58 Table 13 Section 2A listener response data for neutral phrases.....67 59 Table 14 - Listener response data for Section 2A, Question 1...........68 60 Table 15 - Listener response data for Section 2A, Question 2...........68 61 Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion.....................................................................70 62 Table 17 - Listener responses for utterances containing emotive text with no vocal emotion............................................................................71 63 Table 18 - Listener responses for utterances containing emotionless text with vocal emotion..........................................................................72
Page vi
November 2000
64 Table 19 - Listener responses for utterances containing emotive text with vocal emotion.................................................................................73 65 Table 20 - Percentage of listeners who improved in emotion recognition with the addition of vocal emotion effects for neutral text. 74 66 Table 21 Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for neutral text...........................................................................................................74 67 Table 22 - Percentage of listeners whose emotion recognition improved with the addition of vocal emotion effects for emotive text. 75 68 Table 23 Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for emotive text...........................................................................................................75 69 Table 24 Listener responses for participants who speak English as their first language. Utterance type is neutral text, emotive voice. 76 70 Table 25 Listener responses for participants who do NOT speak English as their first language. Utterance type is neutral text, emotive voice.........................................................................................76 71 Table 26 Listener responses for participants who speak English as their first language. Utterance type is emotive text, emotive voice. 77 72 Table 27 Listener responses for participants who do NOT speak English as their first language. Utterance type is emotive text, emotive voice.........................................................................................77 73 Table 28 - Participant responses when asked to choose the Talking Head that was more understandable....................................................78 74 Table 29 Participant responses when asked which Talking Head seemed best able to express itself..........................................................79 75 Table 30 Participant responses when asked which Talking Head seemed more natural..............................................................................80 76 Table 31 Participant responses when asked which Talking Head seemed more interesting........................................................................80
Page vii
Chapter 1 Introduction
When we talk we produce a complex acoustic signal that carries information in addition to the verbal content of the message. Vocal expression tells others about the emotional state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of the words. Because of this, listeners expect to hear vocal effects, paying attention not only to what is being said, but how it is said. The problem with current speech synthesizers is that the effect of emotion on speech is not taken into account, producing output that sounds monotonic, or at worst distinctly machine-like. As a result of this, the ability of a Talking Head to express its emotional state will be adversely affected if it uses a plain speech synthesizer to "talk". The objective of this research was to develop a system that is able to incorporate emotional effects in synthetic speech, and thus improve the perceived naturalness of a Talking Head. This thesis reviews the literature in the fields of speech emotion, synthetic speech synthesis, and XML. A discussion on XML is featured prominently in this thesis because it was the vehicle chosen for directing how the synthetic voice should sound. It also had considerable impact on how speech information was processed. The design and implementation details of the project are discussed to describe the developed system. An in-depth analysis of the projects evaluation data is then given, concluding with a discussion of future work that has been identified.
Chapter 2 Problem Description

2.1 Objectives
Development of the project was aimed at meeting two main objectives to support the hypotheses of Section 4.1: 1. To develop a system that can add simulated emotion effects to synthetic speech. This involved researching the speech correlates of emotion that have been identified in the literature. The findings were to be applied to the control parameters available in a speech synthesizer, allowing a specified emotion to be simulated using rules controlling the parameters. 2. To integrate the system within the TTS (text-to-speech) module of a Talking Head. The speech system was to be added to the Talking Head that is part of the FAITH1 project. It is being developed jointly at Curtin University of Technology, Western Australia, and the University of Genoa in Italy (Beard et al, 1999). The text-to-speech module must be treated as a 'black box', which is consistent with the modular design of FAQbot.
2.2 Subproblems
A number of subproblems were identified to successfully develop a system with the stated objectives. 1. Design and implementation of a speech markup language. It was desirable that the markup language be XML-based; the reasons for this will become apparent later in the thesis. The role of the speech markup language (SML) is to
1
Facial Animated Interactive Talking Head
provide a way to specify in which emotion a text segment is to be rendered. In addition to this, it was decided to extend the application of the markup to provide a mechanism for the manipulation of generally useful speech properties such as rate, pitch, and volume. SML was designed to closely follow the SABLE specification, described by Sproat et al (1998). 2. Evaluation of each of the existing text-to-speech (TTS) submodules of the Talking Head was required. Its aim was to determine what could and could not be reused. This included assessing the existing TTS modules API, and the modules that interface with other subsystems of the Talking Head (namely the MPEG-4 subsystem). 3. Cooperative integration with modules that were being concurrently written for the Talking Head, namely the gesture markup language being developed by Huynh (2000). The collaboration between the two subprojects was aimed at providing the Talking Head with synchronization of vocal expressions and facial gestures. An architecture specification for allowing facial and speech synchronization is given by Ostermann et al. (1998). 4. Since the Talking Head is being developed to run over a number of platforms (Win32, Linux, and IRIX 6.3), it was crucial that the new TTS module would not hamper efforts to make the Talking Head a platform independent application.
2.3 Significance
The project is significant because despite the important role of the display of emotion in human communication, current text-to-speech synthesizers do not cater for its effect on speech. Research to add emotion effects to synthetic speech is ongoing, notably by Murray and Arnott (1996), but has been mainly restricted to a standalone system, and not part of a Talking Head as this project set out to do. Increased naturalness in synthetic speech is seen as being important for its acceptance (Scherer, 1996), and this is likely to be the case for applications of Talking Head technology as well. This thesis is attempting to address this need. Advances in this area will also benefit work in the fields of speech analysis, speech recognition and speech synthesis when dealing with natural variability. This is because work with the speech correlates of emotion will help support or disprove speech correlates identified in speech
analysis, help in proper feature extraction for the automatic recognition of emotion in the voice, and generally improve synthetic speech production.
Chapter 3 Literature Review

This section presents a brief review of the literature relevant to the areas the project is concerned with: the effects of emotion on speech, speech emotion synthesis, XML and speech markup languages.
3.1 Emotion and Speech

Emotion is an integral part of speech. Semantic meaning in a conversation is conveyed not only in the actual words we say, but also in how they are expressed (Knapp, 1980; Malandro, Barker and Barker, 1989). Even before they can understand words, children display the ability to recognize vocal emotion, illustrating the importance that nature places on being able to convey and recognize emotion in the speech channel bandwidth. The intrinsic relationship that emotion shares with speech is seen in the direct effect that our emotional state has on the speech production mechanism. Physiological changes such as increased heart rate and blood pressure, muscle tremors and dryness of mouth have been noted to be brought about by the arousal of the sympathetic nervous system, such as when experiencing fear, anger, or joy (Cahn, 1990). These effects of emotion on a persons speech apparatus ultimately affect how speech is produced, thus promoting the view that an emotion carrier wave is produced for the words spoken (Murray and Arnott, 1993). With emotion being described as the organisms interface to the world outside, (Scherer, 1981), considerable interest has been devoted to investigate the role of emotion in speech, particularly regarding its social aspects (Knapp, 1980). One function is to notify others of our behavioural intentions in response to certain events (Scherer, 1981). For example, the contraction of ones throat when experiencing fear will produce a harsh voice that is increased in loudness (Murray and Arnott, 1993), serving to warn and
frighten a would-be assailant, with the body tensing for a possible confrontation. The expression of emotion through speech also serves to communicate to others our judgement of a particular situation. Importantly, vocal changes due to emotion may in fact be cross-cultural in nature, though this may only be true for some emotions, and further work is required to ascertain this for certain (Murray, Arnott and Rohwer, 1996). We also deliberately use vocal expression in speech to communicate various meanings. Sudden pitch changes will make a syllable stand out, highlighting the associated word as an important component of that utterance (Dutoit, 1997). A speaker will also pause at the end of key sentences in a discussion to allow listeners the chance to process what was said, and a phrases pitch will increase towards the end to denote a question (Malandro, Barker and Barker, 1989). When something is said in a way that seems to contradict the actual spoken words, we will usually accept the vocal meaning over the verbal meaning. For example, the expression thanks a lot spoken in an angry tone will generally be taken in a negative way, and not as a compliment, as the literal meaning of the words alone would suggest. This underscores the importance we place on the vocal information that accompanies the verbal content.
3.2 The Speech Correlates of Emotion

Acoustics researchers and psychologists have endeavoured to identify the speech correlates of emotion. The motivation behind this work is based on the demonstrated ability of listeners to recognize different vocal expressions. If vocal emotions are distinguishable, then there are acoustic features responsible for how various emotions are expressed (Scherer, 1996). However, this task has met with considerable difficulty. This is because coordination of the speech apparatus to produce vocal expression is done unconsciously, even when a speaking style is consciously adopted (Murray and Arnott, 1996). Traditionally, there have been three major experimental techniques that researchers have used to investigate the speech correlates of emotion (Knapp, 1980; Murray and Arnott, 1993): 1. Meaningless, neutral content (e.g. letters of the alphabet, numbers etc) is read by actors who express various emotions. 2. The same utterance is expressed in different emotions. approach aids in comparing the emotions being studied. This
3. The content is ignored altogether, either by using equipment designed to extract various speech attributes, or by filtering out the content. The latter technique involves applying a low-pass filter to the speech signal, thus eliminating the high frequencies that word recognition is dependent upon. (This meets with limited success, however, since some of the vocal information also resides in the high frequency range.) The problem of speech parameter identification is further compounded by the subjective nature of these tests. This is evident in the literature, as results taken from numerous studies rarely agree with each other. Nevertheless, a general picture of the speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion (Cahn, 1990; Murray, Arnott and Rohwer, 1996): Pitch contour. The intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance. Pitch is expressed as fundamental frequency (F0). Parameters include average pitch, pitch range, contour slope, and final lowering. Timing. Describes the speed that an utterance is spoken, as well as rhythm and the duration of emphasized syllables. Parameters include speech rate, hesitation pauses, and exaggeration. Voice quality. The overall character of the voice, which includes effects such as whispering, hoarseness, breathiness, and intensity.
It is believed that value combinations of these speech parameters are used to express vocal emotion. Table 1 is a summary of human vocal emotion effects of four of the socalled basic emotions: anger, happiness, sadness and fear (Murray and Arnott, 1993; Galanis, Darsinos and Kokkinakis, 1996; Cahn, 1990; Davtiz, 1976; Scherer, 1996). The parameter descriptions are relative to neutral speech.
Anger
Happiness
Sadness
Fear
Speech rate Pitch average Pitch range Intensity Pitch changes
Faster Very higher much
Slightly faster Much higher Much wider Higher Smooth, upward inflections
Slightly slower Slightly lower Slightly narrower Lower Downward inflections
Much faster Very much higher Much wider Higher Downward terminal inflections Irregular voicing1 Precise
Much wider Higher Abrupt, downward, directed contours Breathy, chesty tone1 Clipped
Voice quality Articulation

1
Breathy, blaring1 Slightly slurred
Resonant1 Slurred
terms used by Murray and Arnott (1993). Table 1 - Summary of human vocal emotion effects.
The summary should not be taken as a complete and final description, but rather is meant as a guideline only. For instance, the table above emphasizes the role of fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as referred in Murray and Arnott, 1993) notes that whispered speech is able to convey emotion, even though whispering makes no use of the voices fundamental frequency. Nevertheless, being able to succinctly describe vocal expression like this has significant benefits for simulating emotion in synthetic speech.
3.3 Emotion in Speech Synthesis

In the past, focus has been placed on developing speech synthesizer techniques to produce clearer intelligibility, with intonation being confined to model neutral speech. However, the speech produced is distinctly machine sounding and unnatural. Speech synthesis is seen as being flawed for not possessing appropriate prosodic variation like that found in human speech. For this reason, some synthesis models are including the effects of emotion on speech to produce greater variability (Murray, Arnott and Rohwer, 1996). Interestingly, Scherer (1996) sees this as being crucial for the acceptance of synthetic speech. The advantage of the vocal emotion descriptions in Table 1 is that the speech parameters can be manipulated in current speech synthesizers to simulate emotional speech without dramatically affecting intelligibility. This approach thus allows emotive effects to be added on top of the output of text-to-speech synthesizers through the use of
carefully constructed rules. Two of the better known systems capable of adding emotionby-rule effects to speech are the Affect Editor, developed by Cahn (1990b), and HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott and Newell, 1988). The systems both make use of the DECtalk text-to-speech synthesizer, mainly because of its extensive control parameter features. Future work is concerned with building a solid model of emotional speech, as this area is seen as being limited by our understanding of vocal expression, and the quality of the speech correlates used to describe emotional speech (Cahn, 1988; Murray and Arnott, 1995; Scherer, 1996). Although not within the scope of the project, it is worth mentioning that research is being undertaken in concept-to-speech synthesis. This work is aimed at improving the intonation of synthetic speech by using extra linguistic information (i.e. tagged text) provided by another system, such as a natural language generation (NLG) system (Hitzeman et al, 1999). Variability in speech is also being investigated in the area of speech recognition with the aim of possibly developing computer interfaces that respond differently according to the emotional state of the user (Dellaert, Polzin and Waibel, 1996). Another avenue for future research could be to incorporate the effects of facial gestures on speech. For instance, Hess, Scherer and Kappas (1988) noted that voice quality is judged to be friendly over the phone when a person is smiling. A model that could cater for this would have extremely beneficial applications for recent work concerned with the synchronization of facial gestures and emotive speech in Talking Heads. Finally, simulating emotion in synthetic speech not only has the potential to build more realistic speech synthesizers (and hence provide the benefits that such a system would offer), but will also add to our understanding of speech emotion itself.
3.4 Speech Markup Languages

Ideally, a text-to-speech synthesizer would be able to accept plain text as input, and speak it in a manner comparable to a human, emphasizing important words, pausing for effect, and pronouncing foreign words correctly. Unfortunately, automatically processing and analyzing plain text is extremely difficult for a machine. Without extra information to accompany the words it is to speak, the speech synthesizer will not only sound unnatural, but intelligibility will also decrease. Therefore, it is desirable to have an annotation scheme that will allow direct control over the speech synthesizers output.
Most research and commercial systems allow for such an annotation scheme, but almost all are synthesizer dependent, thus making it extremely difficult for software developers to build programs that can interface with any speech synthesizer. Recent moves by industry leaders to standardize a speech markup language has led to the draft specification of SABLE, a system independent, SGML-based markup language (Sproat et al, 1998). The SABLE specification has evolved from three existing speech synthesis markup languages: SSML (Taylor and Isard, 1997), STML (Sproat et al, 1997), and Javas JSML.
3.5 Extensible Markup Language (XML)

XML is the Extensible Markup Language created by W3C, the World Wide Web Consortium (Extensible Markup Language, 1998). It was specially designed to enable the use of large document management concepts for the World Wide Web that were embodied in SGML, the Standard Generalized Markup Language. In adopting SGML concepts however, the aim was also to remove features of SGML that were either not needed for Web applications or were very difficult to implement (The XML FAQ, 2000). The result was a simplified dialect of SGML that is relatively easy to learn, use and implement, and at the same time retains much of the power of SGML (Bosak, 1997). It is important to note that XML is not a markup language in itself, but rather it is a meta-language a language for describing other languages. Therefore, XML allows a user to specify the tag set and grammar of their own custom markup language that follows the XML specification.
3.5.1 XML Features

There are three significant features of XML that make it a very powerful meta-language (Bosak, 1997): 1. Extensibility - new tags and their attribute names can be defined at will. Because the author of an XML document can markup data using any number of custom tags, the document is able to effectively describe the data embodied within the tags. This is not the case with HTML, which uses a fixed tag set.
2. Structure the structure of an XML document can be nested to any level of complexity since it is the author that defines the tag set and grammar of the document. 3. Validation if a tag set and grammar definition is provided (usually via a Document Type Definition (DTD)), then applications processing the XML document can perform structural validation to make sure it conforms to the grammar specification. So though the nested structure of an XML document can be quite complex, the fact that it follows a very rigid guideline makes document processing relatively easy.
3.5.2 The XML Document

An XML document is a sequence of characters that contains markup (the tags that describe the text they encapsulate), and the character data (the actual text being marked up). Figure 1 shows an example of a simple XML document.
XML header declaration
<?xml version=1.0?> <weather-report> <date>March 25, 1998</date> <time>08:00</time> <area> <city>Perth</city> <state>WA</state> <country>Australia</country> </area> <measurements> <skies>partly cloudy</skies> <temperature>20</temperature> <h-index>51</h-index> <humidity>87</humidity> <uv-index>1</uv-index> </measurements> </weather-report>
Figure 1 - An XML document holding simple weather information. Markup tag
Character data (marked up text)
One of the main observations that should be made for the example given in Figure 1 is that an XML document describes only the data, and not how it should be viewed. This is unlike HTML, which forces a specific view and does not provide a good mechanism for data description (Graham and Quinn, 1999). For example, HTML tags such as P, DIV, and TABLE describe how a browser is to display the encapsulated text, but are
inadequate for specifying whether the data is describing an automotive part, is a section of a patients health record, or the price of a grocery item. The fact that an XML document is encoded in plain text was a conscious decision made by the XML designers the designing of a system-independent and vendorindependent solution (Bosak, 1997). Although text files are usually larger than comparable binary formats, this can be easily compensated for using freely available utilities that can efficiently compress files, both in terms of size and time. At worst, the disadvantages associated with an uncompressed plain text file is deemed to be outweighed by the advantages of a universally understood and portable file format that does not require special software for encoding and decoding.
3.5.3 DTDs and Validation

The XML specification has very strict rules which describe the syntax of an XML document for instance, the characters allowable within the markup section, how tags must encapsulate text, the handling of white space etc. These rigid rules make the tasks of parsing and dividing the document into sub-components much easier. A well-formed XML document is one that follows the syntax rules set in the XML specification. However, since its author determines the structure of the document, a mechanism must be provided that allows grammar checking to take place. XML does this through the Document Type Definition, or DTD. A DTD file is written in XMLs Declaration Syntax, and contains the formal description of a documents grammar (The XML FAQ, 2000). It defines amongst other things: which tags can be used and where they can occur, the attributes within each tag, and how all the tags fit together. Figure 2 gives a sample DTD section that describes two elements: list and item. The example declares that one or more item tags can occur within a list tag. Furthermore, an item tag may optionally have a type attribute.
<!ELEMENT <!ELEMENT <!ATTLIST type list (item)+> item> item CDATA #IMPLIED>
One or more item tags Attribute is optional
Figure 2 - Sample section of a DTD file
Extending this example, the different levels of validation performed by an XML parser can be seen. Figure 3 shows an XML document that does not meet the syntax specified in the XML specification.
<?xml version=1.0?> <list><item> Item 1 </list></item>
Figure 3 - XML syntax error - list and item tags incorrectly matched.
Figure 4 shows a well-formed XML document (i.e. it follows the XML syntax), but does not follow the grammar specified in the linked DTD file. (The DTD file is the one given in Figure 2).
<?xml version=1.0?> <!DOCTYPE list SYSTEM list-dtd-file.dtd <list> <item>Item 1</item> <item>Item 2</item> </list> <item>Item 3</item>
Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag).
list-dtd-file.dtd (DTD file)
Figure 5 shows a well-formed XML document that also meets the grammar specification given in the DTD file.
<?xml version=1.0?> <!DOCTYPE list SYSTEM list-dtd-file.dtd <list> <item>Item 1</item> <item type=x>Item 2</item> <item>Item 3</item> </list>
Figure 5 Well-formed XML document that also follows DTD grammar specification. Will not produce any parse errors.
The XML Recommendation states that any parse error detected while processing an XML document will immediately cause a fatal error (Extensible Markup Language, 1998) the XML document will not be processed any further, and the application will
not attempt to second guess the authors intent. Note that the DTD does NOT define how the data should be viewed either. Also, the DTD is able to define which sub-elements can occur within an element, but not the order in which they occur; the same applies for attributes specified for an element. For this reason, an application processing an XML document should avoid being dependent on the order of given tags or attributes.
3.5.4 Document Object Model (DOM)

The Document Object Model (DOM) Level 1 Specification states the Document Object Model as a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents (Document Object Model, 2000). It provides a tree-based representation of an XML document, allowing the creation, manipulation, and navigation of any part within the document. However, it is important to note that the DOM specification itself does not specify that documents must be implemented as a tree only, it is convenient that the logical structure of the document be described as a tree due to the hierarchical structure of marked up documents. The DOM is therefore a programming API for documents that is truly structurally neutral as well. Working with parts of the DOM is quite intuitive since the object structure of the DOM very closely resembles the hierarchical structure of the document. For instance, the DOM shown in Figure 6b would represent the tagged text example in Figure 6a. Again, the hierarchical relationships are logical ones defined in the programming API, and are not representations of any particular internal structures (Document Object Model, 2000). Once a DOM tree is constructed, it can be modified easily by adding/deleting nodes and moving sub-trees. The new DOM tree can then be used to output a new XML document since all the information required to do so is held within the DOM representation. A DOM tree will not be constructed until the XML document has been fully parsed and validated.
a.
<weather-report> <date>October 30, 2000</date> <time>14:40</time> <measurements> <skies>Partly cloudy</skies> <temperature>18</temperature> </measurements> </weather-report>
b.
<weather-report> <weather-report>
<date> <date>
<time> <time>
<measurements> <measurements>
October 30, 2000
14:40
<skies> <skies>
<temperature> <temperature>
Partly cloudy
18
Figure 6 - DOM representation of XML example
3.5.5 SAX Parsing

A downside to the DOM is that most XML parsers implementing the DOM make the entire tree reside in memory apart from putting a strain on system resources, it also limits the size of the XML document that can be processed (Python/XML Howto, 2000) (libxml, 2000). Also, say the application only needs to search the XML document for occurrences of a particular word, it would be inefficient to construct a complete inmemory tree to do this.
A SAX handler, on the other hand, can process very large documents since it does not keep the entire document in memory during processing. SAX, the Simple API for XML, is a standard interface for event-based XML parsing (SAX 2.0, 2000). Instead of building a structure representing the entire XML document, SAX reports parsing events (such as the start and end of tags) to the application through callbacks.
3.5.6 Benefits of XML

The following benefits of using XML in applications have been identified (Microsoft, 2000) (SoftwareAG, 2000b): Simplicity XML is easy to read, write and process by both humans and computers. Openness XML is an open and extensible format that leverages on other (open) standards such as SGML. XML is now a W3C Recommendation, which means it is a very stable technology. In addition, XML is highly supported by industry market leaders such as Microsoft, IBM, Sun, and Netscape, both in developer tools and user applications. Extensibility data encoded in XML is not limited to a fixed tag set. This enables precise data description, greatly aiding data manipulators such as search engines to produce more meaningful searches. Local computation and manipulation once data in XML format is sent to the client, all processing can be done on the local machine. The XML DOM allows data manipulation through scripting and other programming languages. Separation of data from presentation this allows data to be written, read and sent in the best logical mode possible. Multiple views of the data are easily rendered, and the look and feel of XML documents can be changed through XSL style sheets; this means that the actual content of the document need not be changed. Granular updates the structure of XML documents allows for granular updates to take place since only modified elements need to be sent from the server to the client. This is currently a problem with HTML since even with the slightest modification a page needs to be rebuilt. Granular updates will help reduce server workload. Scalability separation of data from presentation also allows authors to embed within the structured data procedural descriptions of how to produce different views. This offloads much of the user interaction from the server to the
client computer, reducing the servers workload and thus enhancing server scalability. Embedding of multiple data types XML documents can contain virtually any kind of data type such as image, sound, video, URLs, and also active components such as Java applets and ActiveX. Data delivery since XML documents are encoded in plain text, data delivery can be performed on existing networks, sent using HTTP just like HTML. Combined with the XML features discussed in section 3.5.1, the above list underscores the enormous potential of XML. Indeed, the extent of these benefits makes XML a core component in a wide range of applications: from dissemination of information in government agencies to the management of corporate logistics; providing telecommunication services; XML-based prescription drug databases to help pharmacists advise their customers; simplifying the exchange of complex patient records obtained from different data sources, and much more (SoftwareAG, 2000a).
3.5.7 Future Directions in XML

Section 3 of this chapter has served to give a basic introduction to XML: its design principles, features and benefits, the structure of an XML document, validation using a DTD, and two parsing strategies, DOM and SAX. The discussion has dealt mainly with the XML 1.0 Specification, however, and the reader should be aware that there is actually a family of technologies (Bos, 2000) associated with XML. Some of the XML projects currently underway are (Bos, 2000): Xlink describes a standard way of including hyperlinks within an XML file. XPointer & XFragments syntaxes for pointing to parts of an XML document. Cascading Style Sheets (CSS) making the style sheet language applicable to XML as it is with HTML. XSL advanced language for expressing style sheets.
XML Namespaces specification that describes how an URL can be associated with every tag and attribute within an XML document.
XML Schemas 1 & 2- aimed at helping developers to define their own XML-based formats. Future applications of some of these XML components to this project is discussed later in this thesis. For more information on these emerging technologies, see the XML Cover Pages (Cover, 2000).
3.6 FAITH
A very brief description of the FAITH project is required in order to gain an understanding of where the TTS module fits within the Talking Head architecture. Figure 7 shows a simplified view of the various subsystems that make up the Talking Head.
TTS
Text to synthesise Waveforms FAPs (visemes)
Brain
Personality
FAML
Text questions
FAPs
SERVER CLIENT
MPEG-4
Questions
Waveforms
FAPs
User Interface
Figure 7 - FAITH project architecture.
As Figure 7 shows, the architecture has been designed to fit a client/server framework; the client is responsible for interfacing with the user, where the Talking Head is rendered, audio played, extra information displayed etc. The server accepts a users text input (such as questions, dialog etc), and processing is carried out in the following order: The Brain module, developed by Beard (1999), processes the users text input and forms an appropriate response. The response is then sent to the TTS module. The TTS module is responsible for producing the speech equivalent of the Brains text response and outputs a waveform. It also outputs viseme information in MPEG-4s FAP (Facial Animation Parameter) format, and passes this to modules responsible for generating more FAP values. Visemes are the visual equivalent of speech phonemes (e.g. the mouth forms a specific shape when saying oo). Generated FAP values can be used to move specific points of the Talking Heads face (to produce head movements, blinking, gestures etc), which is the purpose of the next two subsystems. The Personality modules role is to generate MPEG-4 FAP values with the goal of simulating various personalities such as friendliness and dominance (Shepherdson, 2000). The FAML modules role is to generate MPEG-4 FAP values to display various head movements, and facial gestures and expressions specified through a special markup language (Huynh, 2000). As the diagram shows, communication between the client and server is done via an implementation of the MPEG-4 standard (Cechner, 1999). For a more detailed, summarized description of the FAITH project, see Beard et al (2000). (The reader should note however, that the paper describes the old TTS module, and not the (newer) TTS module described in this thesis). Figure 8 shows one of the models rendered on the client side with which the user interfaces with.
Figure 8 - Talking Head being developed as part of the FAITH project at the School of Computing, Curtin University of Technology.
3.7 Resource Review

3.7.1 Text-to-Speech Synthesizer
Much of the preparation for this project required the investigation of an appropriate text to speech synthesizer able to provide the flexibility required to achieve the proposed aims, and therefore deserves mention in this thesis. Initially, investigation focused on commercial text-to-speech systems. Most of the prominent systems offer documentation and interactive sampling via the Internet, including the ability to send and receive custom example output. Some of the TTS systems considered were from the following companies: AT&T Labs, Bell Labs, Elan Informatique (recently acquired by Lernout & Hauspie), Eloquent Technology, Lernout & Hauspie, and the DECtalk synthesizer (now owned by Compaq). However, though the speech quality of all the mentioned systems was of a very high standard, with the exception of the DECtalk synthesizer none of these allowed the control required for this project. At most, speech rate, average pitch and some user commands are available, but are too basic and inadequate for the project. The Microsoft Speech API 4.0 (SAPI) specification does allow for voice quality control such as whispering (through the \vce tag), but these were rarely implemented by the speech engines supporting SAPI. It is therefore no coincidence that the two major research groups of synthetic speech emotion, Cahn (1990), and Murray and Arnott (1995), chose to use DECtalk in their implementations. It offers a very large number of control commands through its API, including the ability to manipulate the utterance at the phoneme level (Hallahan, 1996). Notwithstanding these features, it was found that the freely available systems, the Festival Speech Synthesis System and MBROLA, offered the same capabilities (albeit in an indirect way) plus additional advantages.
Festival is a widely recognized research project developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh, with the aim of offering a free, high quality text-to-speech system for the advancement of research (Black, Taylor and Caley, 1999). The MBROLA project, initiated by the TCTS Lab of the Facult Polytechnique de Mons (Belgium), is a free multi-lingual speech synthesizer developed with aims similar to Festivals (MBROLA Project Homepage, 2000).
Text
NLP (Festival)
Phonemes, pitch and duration
Pitch and Timing Modifier

Modified phonemes, pitch and duration
DSP (MBROLA)
Waveform
Figure 9 - Top level outline showing how Festival and MBROLA systems were used together.
It was decided for this project to use the Festival system as the natural language parser (NLP) component of the module, which accepts text as input and transcribes this to its phoneme equivalent, plus duration and pitch information. This information can be then given to the MBROLA synthesizer, acting as the digital signal processing unit (DSP), which produces a waveform from this information. Although Festival has its own DSP unit, it was found that the Festival + MBROLA combination produces the best quality. It is important to note that the Festival system supports MBROLA in its API. Because of the phoneme-duration-pitch input format required for MBROLA, it provides very fine pitch and timing control for each phoneme in the utterance. As stated before, this level of control is simply unattainable with commercial systems, except
DECtalk. The advantage of using MBROLA over DECtalk, however, is in the fact that once a phonemes pitch is altered in the latter system, the generated pitch contour is overwritten. Cahn (1990) first mentioned this problem, and as a result did not manipulate the utterance at the phoneme level, limiting the amount of control, which ultimately hindered the quality of the simulated emotion. To overcome this, Murray and Arnott (1995) had to write their own intonation model to replace the DECtalk generated pitch contour when they changed pitch values at the phoneme level. Fortunately, this is not an issue with MBROLA, as changes to the pitch and duration levels can be done prior to passing it to MBROLA (as Figure 9 shows). Therefore, it can be seen that the Festival + MBROLA option offers high control comparable to the DECtalk synthesizer, with the benefit of less complexity. The use of Festival and MBROLA also addressed the platform independent subproblem described in Section 2.2. Although developed mainly for the UNIX platform, its source code can be ported to the Win32 platform via relatively minor modifications. The MBROLA Homepage offers binaries for many platforms, including: Win32, Linux, most Unix OS versions, BeOS, Macintosh and more. Before the final decision was made to make use of the Festival system however, an important issue required investigation. The previous TTS module of the Talking Head did not use the Festival system because although it was acknowledged that Festivals output is of a very high quality, the computation time was deemed to be far too expensive to use in an interactive application (Crossman, 1999). For example, the phrase Hello everybody. This is the voice of a Talking Head. The Talking Head project consists of researchers from Curtin University and will create a 3D model of a human head that will answer questions inside a web browser. took about 45 seconds to synthesize on a Silicon Graphics Indy workstation (Crossman, 2000). It is contested however, that the negative impression that could be made of the Festival system from such data may be a little misled. Though execution time may take longer on an SG Indy workstation, informal testing on several standard PCs (Win32 and Linux platforms) showed that the same phrase took less than 5 seconds to synthesize (including the generation of a waveform). Since TTS processing is done on the server side, the system can be easily configured to ensure Festival will carry its processing on a faster machine. Therefore, Festivals synthesis time was not considered a problem.
3.7.2 XML Parser
Since it is expected that the programs input will contain marked up text, an XML parser was required to parse and validate the input, and create a DOM tree structure for easy processing. There are a number of freely available XML parsers, though many are still in development stage and implement the XML specification to varying degrees. One of the more complete parsers is libxml, a freely available XML C library for Gnome (libxml, 2000). Using libxml as the XML parser fulfilled the needs of the project in a number of ways: a) Portability written in C, the library is highly portable. Along with the main program, it has been successfully ported to the Win32, Linux and IRIX platforms. b) Small and simple only a limited range of the XML features are being used, therefore a complex parser was not required. This is not to say that libxml is a trivial library as it offers some powerful features. c) Efficiency Informal testing showed libxml parses large documents in surprisingly little time. Although not used for this project, libxml offers a SAX interface to allow for more memory-efficient parsing (see section 3.5.5). d) Free libxml can be obtained cost-free and license-free. It is important to note that the libxml librarys DOM tree building feature was used to help create the required objects that hold the programs utterance information. However, care was taken to make sure the programs objects were not dependent on the XML parser being used. Instead, a wrapper class, CTTS_SMLParser, used libxml as the XML parser and output a custom tree-like structure very similar to that of the DOM. This ensured that all other objects within the program used the custom structure, and not the DOM tree that libxml outputs. (See Chapter 5 for more details.)
3.8 Summary
This chapter has explored research that was applicable to this project, focusing on how the literature can help with achieving the stated objectives and subproblems of Chapter 2, and supporting the hypotheses of Chapter 4. More specifically, the literature was investigated to find the speech correlates of emotion, seeking clear definitions so that there was a solid base to work from during the implementation phase. The work of prominent researchers in the field of synthetic speech emotion, such as Murray and
Arnott (1995) and Cahn (1990) who have already attempted to simulate emotional speech, was sought in order to gain an understanding of the problems involved, and the approach taken in solving them. The in-depth review on XML served two purposes: a) to describe what XML is and what the technology is trying the address, and b) to expound the benefits of XML so as to justify why SML was designed to be XML-based. A resource review was given to discuss the issues involved when deciding which tools to use for the TTS module, and to address one of the subproblems stated in Section 2.3; that is, that the TTS module should be able to run across the Win32, Linux, and UNIX platforms.
Chapter 4 Research Methodology

The literature review of Chapter 3 enabled the formation of the hypotheses stated in this chapter. It also identified areas where limitations would apply, and defined the scope of the project.
4.1 Hypotheses
The project was developed to test the following hypotheses: 1. The effect of emotion on speech can be successfully synthesized using control parameters. 2. Through the addition of emotive speech: a) Listeners will be able to correctly recognize the intended emotion being synthesized. b) Head. Information will be communicated more effectively by the Talking
It should be noted that hypothesis 2a allows for a significant error rate in recognizing the simulated emotion as we ourselves find it difficult to understand each others nonverbal cues, and are often misunderstood as a result. Malandra, Barker and Barker (1989), and Knapp (1980) discuss difficulties in emotive speech recognition. For the hypothesis to be accepted, however, the recognition rate will be significantly higher than mere chance, showing proof that correct recognition of the simulated emotion is indeed occurring.
4.2 Limitations and Delimitations

4.2.1 Limitations
Two main limitations have been identified: 1. Vocal Parameters - The quality of the synthesized emotional speech will be limited by the ability of the vocal parameters to describe the various emotions. This is a reflection of the current level of understanding of speech emotion itself. 2. Speech Synthesizer Quality - The quality of the speech synthesizer and the parameters it is able to handle will also have a direct effect on the speech produced. For instance, most speech synthesizers are unable to change voice quality features (breathiness, intensity etc) without significantly affecting the intelligibility of the utterance.
4.2.2 Delimitations
The purpose of this research is to determine how well the vocal effects of emotion can be added to synthetic speech it is not concerned with generating an emotional state for the Talking Head based on the words it is to speak. Therefore, the system will not know the required emotion to simulate from the input text alone. This top-level information will be provided through the use of explicit tags, hence the need for the implementation of a speech markup language. Due to the strict time constraints placed on this project, the emotions that are to be simulated by the system were bounded to happiness, sadness, and anger. These three emotions were chosen because of the wealth of study carried out on these emotions (and hence an increased understanding) compared to other emotions. This is because happiness, sadness, and anger (along with fear and grief) are often referred to as the basic emotions, on which it is believed other emotions are built on.
4.3 Research Methodologies

The following Research Methodologies of Mauch and Birch (1993) are applicable to this research: Design and Demonstration. This is the standard methodology used for the design and implementation of software systems. The speech synthesis system is being demonstrated as the TTS module of a Talking Head. Evaluation. The effectiveness of the system was needed to be determined via listener questionnaires, testing how well the TTS module supports the stated hypotheses. Therefore an evaluation research methodology was adopted. Meta-Analysis. The project involves a number of diverse fields other than speech synthesis; namely psychology, paralinguistics, and ethology. The meta-analysis research methodology was used to determine how well the speech emotion parameters described in these fields mapped to speech synthesis.
Chapter 5 Implementation
This chapter discusses the implementation of the TTS module to simulate emotional speech for a Talking Head, plus the stated subproblems of Section 2. The discussion covers how the modules input is processed, and how the various emotional effects were implemented. This will involve a description of the various structures and objects that are used by the TTS module. Since the module relies heavily on SML, the speech markup language that was designed and implemented to enable direct control over the modules output, the chapter discusses SML issues such as parsing and tag processing.
5.1 TTS Interface

Before an in-depth description of each of the TTS modules components is given, it will be beneficial to describe the input and outputs of the system. It was important to be able to describe the system as a very high-level black box; not only for clarity of design, but also to ensure that the replacement of the existing TTS module of the FAITH project would be a smooth one. It also minimizes module and tool interdependency. Figure 10 shows the black box design of the system as the TTS module of a Talking Head.
Waveform Text
TTS Module
Viseme information
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head.
It was mentioned earlier in Section 3.7.1 that the TTS module uses the Festival Speech Synthesis System and the MBROLA Synthesizer. Figure 10 does not show any
of this detail, nor should it. What is important to describe at this level is the modules interface; how the module produces its output is irrelevant to the user of the module.
5.1.2 Module Inputs

Figure 10 shows text as the single input to the TTS module. However, text can be a fairly ambiguous description for input, and indeed, the module caters for two distinct types of text: plain text, and text marked up in the TTS modules own custom Speech Markup Language (SML).
Plain Text The simplest form of input, plain text means that the TTS module will endeavour to render the speech-equivalent of all the input text. In other words, it will be assumed that no characters within the input represent directives for how to generate the speech. As a result of this, speech generated using plain text will have default speech parameters, spoken with neutral intonation.
SML Markup If direct control over the TTS modules output is desired, then the text to be spoken can be marked up in SML, the custom markup language implemented for the module. Although an in-depth description of SML will not be given here (see Section 2 and Appendix A), it was designed to provide the user of the TTS module with the following abilities: Direct control of speech production. For example, the system could be specified to speak at a certain speech rate, pitch, or pronounce a particular word in a certain way (this is especially useful for foreign names). Control over speaker properties. This gives the ability to not only have control of how the marked up text is spoken, but also who is speaking. Speaker properties such as gender, age, and voice can be dynamically changed within SML markup. The effect of the speakers emotion on speech. For example, the markup may specify that the speaker is sad for a portion of the text. As a result, the speech will sound sad. One of the primary objectives of this thesis is to determine how effective the simulated effect of emotion on the voice is.
Another important feature of the TTS module with regards to the input it can receive is that the module is able to handle unknown tags present within the text. This is important because other modules within the Talking Head may (and do) have their own markup languages to control processing within that module. For example, Huynh (2000) has developed a facial markup language to specify many head movements and expressions. If any non-SML tags are present within the input given to the TTS module, they will simply be filtered out of the SML input. A very important note to make is that the filtering out of unknown tags is done before any XML-related parsing is carried out; the XML Recommendation explicitly states that the presence of any unknown tags that do not appear in the documents referenced DTD should immediately cause a fatal error (Extensible Markup Language, 1998). Therefore, it is important that before any processing of the TTS modules input takes place, proper filtering takes place since it is expected that non-SML tags will present in the input. Admittedly, the very fact that the TTS module is given input that may not contain pure SML markup does not reflect a solid design of the system. However, the FAITH project has only just begun to make use of XML-based markup languages for its various modules and the XML processing architecture within the Talking Head is not very mature. One possibly better approach to maintaining several XML-based markup languages (such as SML and FAML) is discussed in the Future Work section (Chapter 7).
5.1.3 Module Outputs

The output of the TTS black box is two streams:
Waveform The TTS module will always produce a sound file which, when played, is the speech equivalent of the text received as input. The sound file is in the WAV 16-bit format. Once a waveform is produced, it is sent via MPEG-4 to the client side of the application where it is played (see Section 3.6).
Visemes The visual equivalent of the phonemes (speech sounds) that are spoken are also output from the TTS module. The viseme output is encoded as stated in the MPEG-4
specification (MPEG, 1999). The phoneme-to-viseme translation submodule is one of the few that were retained from the existing TTS module.
5.1.4 C/C++ API

Modules that call the TTS module do so through its C/C++ API, which was designed to be simple and high-level. In C++ programs, the CTTS_Central class is at the center of the TTS API. The following list describes each of its public methods.
CTTS_Central::SpeakFromFile (const char *Filename) Synthesizes the contents of the file specified by the input
parameter Filename. e.g. SpeakFromFile (infile.sml);

CTTS_Central::SpeakText (const char *Message)
Synthesizes the actual character string pointed to by Message.

SpeakText(Hello World);
e.g.
CTTS_Central::SpeakTextEx
(const
char
*Message,
int Emotion) Synthesizes the actual character string pointed to by Message, simulating the emotion specified by Emotion. Emotion can be
specified using any one of the predefined constants: TTS_NEUTRAL,

TTS_ANGRY, TTS_HAPPY, TTS_SAD.
A special note to make is that the type of input given to each of these functions is not explicitly specified. For instance, does the file given to SpeakFromFile contain plain text, or is it an SML document? Each of the API functions listed above automatically detect the input type using a simple heuristic: if the start of the input file or character string contains an XML header declaration, then it is detected to be SML markup, otherwise the input is treated as plain text. A C API has also been made available, which has the same functionality as the CTTS_Central objects interface. Only, initialization and destruction routines have to be explicitly called. TTS_Initialise () used to initialise the TTS module. The function is called once only.
TTS_SpeakFromFile (const char *Filename) same as CTTS_Central::SpeakFromFile (const char *Filename).
TTS_SpeakText (const char *Message) same as
CTTS_Central::SpeakText (const char *Filename). TTS_SpeakTextEx (const char *Message, int
Emotion) same as CTTS_Central::SpeakTextEx (const char *Message, int Emotion).
TTS_Destroy () used to nicely cleanup the TTS module once it is not need anymore. The function is called once only.
5.2 SML Speech Markup Language

In Section 2.2 it was identified that the design and implementation of a suitable markup language was required so that the emotion of a text segment could be specified, as well as providing a means of manipulating other useful speech parameters. SML is the TTS modules XML-based Speech Markup Language designed to meet these requirements. This section will provide an overview of how an utterance should be marked up in SML. For a description of each SML tag with its associated attributes, see Appendix A. For issues regarding SMLs implementation, see Section 5.7.
5.2.1 SML Markup Structure

An input file containing correct SML markup must contain an XML header declaration at the beginning of the file. Following the XML header, the sml tag encapsulates the entire marked up text, and can contain multiple p (paragraph) tags. Figure 11 shows the basic layout of an input file marked up in SML. Note that all the XML constraints discussed in Section 3.5 apply to SML.
Reference to SML v01 DTD
XML header Root tag
<?xml version="1.0"?> <!DOCTYPE sml SYSTEM "./sml-v01.dtd"> <sml> ... ... ... </sml> Figure 11 - Top-level structure of an SML document. Paragraphs
In turn, each p node can contain one or more emotion tags (sad, angry, happy, and
neutral), and instances of the embed tag; text not contained within an emotion tag is
not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows SML markup that is invalid because it does not follow this rule. Note that unlike lazy HTML, the paragraph (p) tags must be closed properly.

<neutral>Please remain quiet.</neutral> <embed src=sound.wav/> <angry>Who made that noise?</angry> 
Figure 12 - Valid SML markup.
 <sad>I have some sad news:</sad> this part of the markup is not valid SML. Figure 13 - Invalid SML markup.
All tags described in Appendix A can occur inside an emotion tag (except sml, p, and embed). A limitation of SML is that emotion tags cannot occur within other emotion tags. However, unless explicitly specified, most other tags can contain even instances of tags with the same name. For example, a pitch tag can contain another pitch tag as the following example shows.
<pitch range=+100%> Not I, <pitch middle=-15%>said the dog.</pitch> </pitch>
The described structure of an input file containing SML markup can be confirmed by SMLs DTD (see Appendix B). Should the input file not conform to the DTD specification, a parse error will occur and, in accordance with the XML Recommendation, the input will not be processed.
5.3 TTS Module Subsystems Overview

NLP
Festival
Plain text
Phoneme data
Text
SML Parser
libxml
SML Document
Phoneme info
Visual Module
Visemes
Tags + Text/Phonemes
Modified Text/Phonemes
Phoneme data
DSP SML Tags Processor

MBROLA
Waveform
Figure 14 - TTS module subsystems.
As Figure 14 shows, the design of the TTS module subsystems is centered on the SML Document object. The main steps for synthesizing the modules input text involve the creation, processing, and output of the SML Document. This is broken down into the following tasks: a) Parsing. The input text is parsed by the SML Parser, and creates an SML Document object. The SML Parser makes use of libxml. b) Text to Phoneme Transcription. The Natural Language Parser (NLP) is responsible for transcribing the text into its phoneme equivalent, plus providing intonation information in the form of each phonemes duration and pitch values. This information is given to the SML Document object and
stored within its internal structures. The NLP unit makes use of the Festival Speech Synthesis System. c) SML Tag Processing. Any SML tags present in the input text are processed. This usually involves modifying the text or phonemes held within the SML Document. d) Waveform Generation. The phoneme data held within the SML Document is given to the Digital Signal Processing (DSP) unit to generate a waveform. The DSP makes use of the MBROLA Synthesizer. e) Viseme Generation. The Visual Module is responsible for transcribing the phonemes to their viseme equivalent. Again, the phoneme data is obtained from the SML Document. In this thesis, the Visual Module will not be discussed in any further detail since it is has reused much of the old TTS modules subroutines. Crossman (1999) provides a description of the phoneme-to-viseme translation process.
5.4 SML Parser

The SML Parser, encapsulated in the CTTS_SMLParser class, is responsible for parsing the modules text input to ensure that it is both a well-formed XML document and that its structure conforms to the grammar specification of the DTD. If the input is fully validated then an SML Document object is created based on the input. To perform full XML parsing on the input, the XML C Library libxml is used. Apart from validating the input, libxml also constructs a DOM tree (described in Section 3.5.4) that represents the inputs tag structure should no parse errors occur. The SML Document object that is returned by the SML Parser follows the hierarchical structure of the DOM very closely. Therefore, it traverses the DOM and creates an SML Document containing nodes mirroring the DOMs structure. Once the SML Document has been constructed, the DOM is destroyed and the SML Document is returned. It was mentioned in Section 5.1.2 that the TTS module is able to handle unknown tags present within the input markup. This is because the input is filtered to remove all unknown tags before any validation parsing is done by libxml. In doing so, the DOM tree that libxml creates does not hold any unknown tag nodes, and as a consequence neither does the SML Document.
The TTS module keeps track of all SML tag names by keeping a special XML document that holds SML tag information2. Filtering of the input is done by creating a copy of the input file, and copying only those tags that are known. It is important that this filtering process is carried out because the input is envisaged to contain other nonSML tags, such as those belonging to the FAML module. Figure 15 shows the filtering process.
SML tag information Tag lookup Known tag
Input file
Figure 15 - Filtering process of unknown tags.
Copy of input file (Filtered)
5.5 SML Document

As the TTS modules subsystems diagram shows (Figure 14), the SML Document is at the core of the TTS module. Its role is to store all information required for speech synthesis to take place, such as word, phoneme, and intonation data. It also contains the full tag information that appears in the input; in fact, such is the depth of information held that the SML markup could be easily recreated by the information held in the SML Document. The tag data is used to control the manipulation of the text and phoneme data. In this section we will describe the structure of the SML Document, as well the various structures required to perform the above mentioned role. Finally, the data held within the SML Document is then used to produce a waveform. The SML Document object is encapsulated by the TTS_SMLDocument and TTS_SMLNode classes.
5.5.1 Tree Structure

In the last section, it was mentioned that the structure of the SML Document matches very closely that of the XML DOM. The SML Document consists of a hierarchy of nodes that represent the information held in the input SML markup. Therefore, the nodes
2
The XML document is called tag-names.xml, and is held in the special TTS resource directory TTS_rc.
hold markup information, attribute values, and character data. Figure 16 shows the highlevel structure of an SML Document that would be constructed for the accompanying SML markup. Note how each node has a type that specifies what type of node it is. The hierarchical nature of the SML Document implies which text sections will be rendered in what way a parent will affect all its children. So, for the example in Figure 16, the emph node will affect the phoneme data of its (one) child node, the text node containing the text too. The happy node will affect the phoneme data of all its (three) children nodes containing the text Thats not, too, and far away respectively. Tags that were specified with attribute values are represented by element nodes that point to attribute information (this is not shown on Figure 16 for clarity purposes).
<?xml version="1.0"?> <!DOCTYPE sml SYSTEM "./sml-v01.dtd"> <sml> <neutral>I live at <rate speed=-10%>10 Main Street</rate> </neutral> <happy> Thats not <emph>too</emph> far away. </happy> </sml>
sml DOCUMENT_NODE
p ELEMENT_NODE
neutral ELEMENT_NODE
happy ELEMENT_NODE
text TEXT_NODE I live at
rate ELEMENT_NODE
text TEXT_NODE Thats not
emph ELEMENT_NODE
text TEXT_NODE far away.
text SML_TEXT_NODE 10 Main Street
text SML_TEXT_NODE too
Figure 16 - SML Document structure for SML markup given above.
5.5.2 Utterance Structures

Each text node contains its own utterance information, which comprises of word and phoneme related data. The information is held in different layers.
Utterance level the whole phrase held in that node. The CTTS_UtteranceInfo class is responsible for holding information at this level. Word level the individual words of the utterance. The CTTS_WordInfo class is responsible for holding information at this level. Phoneme level the phonemes that make up the words. The CTTS_PhonemeInfo class is responsible for holding information at this level. 4. Phoneme pitch level the pitch values of the phonemes (phonemes can have multiple pitch values). The CTTS_PitchPatternPoint class is responsible for holding information at this level. The way that the above mentioned objects are organized within a text node is as follows: A text node contains one CTTS_UtteranceInfo object. The CTTS_UtteranceInfo object contains a list of CTTS_WordInfo objects that contain word information. In turn, each
CTTS_WordInfo
1.
2. 3.
object
contains
list
of A
CTTS_PhonemeInfo
objects
that
contain
phoneme
information.
CTTS_PhonemeInfo object contains the actual phoneme and its duration (ms).
Each
CTTS_PhonemeInfo
object
then
contains
list
of
CTTS_PitchPatternPoint objects that contain pitch information for each
phoneme. A pitch point is characterized by a pitch value, and a percentage value of where the point occurs within the phonemes duration.
U W the P dh pp
(0,95)
P@ pp (50,101)
pitch value % inside phoneme length
W moon Pm pp (0,102) P uu pp (50,110) P n pp (100,103)
Figure 17 - Utterance structures to hold the phrase the moon. U = CTTS_UtteranceInfo object; W = CTTS_UtteranceInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint object.
5.6 Natural Language Parser

As introduced in Section 5.3, the NLP (Natural Language Parser) module is responsible for transcribing the text to be rendered as speech to its phoneme equivalent. It is also responsible for generating intonation information by providing pitch and duration values for each phoneme in the utterance. The goals this module sets out to achieve are nontrivial, and it is not surprising that this stage takes by far the longest time of any of the stages in the speech synthesis process. Dutoit (1997) gives an excellent discussion of the problems the NLP unit of a speech synthesizer must overcome. Since the phoneme transcription and the intonation information greatly affect the quality of the synthesized speech, it was very important to have an NLP that would produce high quality output. As mentioned in Section 3.7.1, the Festival Speech Synthesis System was chosen to provide these services, which is able to generate output comparable to commercial speech synthesizers.
5.6.2 Obtaining a Phoneme Transcription

As described in Section 5.5, each text node within the SML Document contains utterance objects that will ultimately hold the nodes word and phoneme information. One of the intermediate steps for obtaining a phoneme transcription is to tokenize the input character string into words. For example, the character string On May 5 1985, 1,985 people moved to Livingston would be tokenized in the following words by Festival: On May fifth nineteen eighty five one thousand nine hundred and eighty five people moved to Livingston. This illustrates the complexity of the input that Festival is able to handle, which has a direct effect on user perception of the intelligence of the Talking Head. To tokenize the contents of the SML Document, the tree is traversed and each text nodes content is individually given to Festival. Festival returns the tokens in the character string, and these are stored as words in the corresponding nodes utterance object. Figure 18 shows how each node holds its own token information.
SML Markup
<neutral> 10 oranges cost <emph>$8.30</emph> </neutral> 10 oranges cost ten oranges cost emph neutral
$8.30 eight dollars thirty
Figure 18 - Tokenization of a part of an SML Document.
Once word information is stored within each nodes utterance object, phoneme data can be generated for each word. Obtaining the actual phoneme data (including intonation) is a more complex process however. This is because an entire phrase should be given to Festival in order for correct intonation to be generated. As an example, consider the following SML markup (the corresponding nodes held in the SML Document are shown in Figure 19).
<happy> <rate speed=-15%>I wonder,</rate> you pronounced it <emph>tomato</emph> did you not? </happy>
happy
rate
you pronounced it
emph
did you not?
I wonder,
tomato
Figure 19 - SML Document sub-tree representing example SML markup.
If each text nodes contents is given to Festival one at a time (i.e. first I wonder, then you pronounced it, and so forth), then though Festival will be able to produce the correct phonemes it will not generate proper pitch and timing information for the phonemes. This will result in an utterance whose words are pronounced properly, but contains inappropriate intonation breaks that make the utterance sound unnatural. An appropriate analogy to this would be if a person were shown a pack of cards with words written on them one at a time, and asked to read it out loud. The person, not knowing what words will follow, will not know how to give the phrase an appropriate intonation. Now if the same person is given a card that contains the entire sentence on it, then now knowing what the phrase is saying, the person will read it out loud correctly. The same approach was taken in the solution to this problem. Continuing the above example will help understand how this is done. The SML Document is traversed until an emotion node is encountered. In the example, traversal would stop at the happy node. The contents of its child text nodes are then concatenated to make one phrase. So, the contents of the four text nodes in Figure 19 would be concatenated to form the phrase I wonder, you pronounced it tomato did you not? The phrase is stored in a temporary utterance object held in the happy node.
The phrase is given to Festival, and Festival generates the phoneme transcription as well as intonation information. The entire phoneme data is stored in the happy nodes temporary utterance object. Because each text node already contains word information in their utterance objects, it is a simple process to disperse the phoneme data held in the happy node amongst its children. The temporary utterance object in happy is then destroyed. If this procedure is followed then correct intonation is given to the utterance. Of course, a limitation is that this will not solve the problem of having an emotion change in mid sentence. However, the algorithm makes an assumption that this will not occur frequently, and that if it does, the intonation will not be needed to continue over emotion boundaries and a break is acceptable.
5.6.2 Synthesizing in Sections

Speech synthesis can be a processor intensive task, and can take a significant amount of time and memory when synthesizing larger utterances. Finding any way to minimize the waiting time is highly desirable, especially when the speech production is being waited upon by an interactive Talking Head. There was a concern that if a very large amount of SML markup was given to the TTS module, then the execution time would be unacceptable for someone communicating with the Talking Head. To avoid this from occurring, a solution was implemented that took advantage of the client/server architecture of the Talking Head. Instead of the entire SML Document being synthesized in one go, smaller portions (at the emotion node level) are synthesized one at a time on the server and sent to the client. As the Talking Head on the client begins to speak, the server synthesizes the next emotion tag of the SML Document. By the time the Talking Head has finished talking, the next utterance is ready to be spoken. This way, the actual waiting time is really only for the first utterance and is now dependent on the communication speed between the server and client, and not the synthesis time of the whole document. Figure 20 represents a timeline of the example markup. It should be noted that this section-oriented method of producing speech involves not only the NLP submodule but also all the steps in the synthesis process after the creation of the SML Document.
 <neutral>Utterance 1</neutral> <happy>Utterance 2</happy> <sad>Utterance 3</sad> 
SERVER
Synthesizing neutral tag (Utterance 1) Synthesizing happy tag (Utterance 2) Synthesizing sad tag (Utterance 3) Idle Idle
CLIENT
Playing Utterance 1
Playing Utterance 2
Playing Utterance 3
Figure 20 - Raw timeline showing server and client execution when synthesizing example SML markup above.
5.6.3 Portability Issues

To address the portability issue stated in Section 2.2, it was important Festival be useable over multiple platforms. Because the Festival system has been developed primarily for the UNIX platform, compiling it for IRIX 6.3 was relatively straightforward. Similarly, obtaining a Linux version of Festival was also effortless since Linux RPMs (RedHat Package Manager) containing precompiled Festival libraries are available. Although it has not been tested extensively on the Win32 platform, the Festival developers are confident that the source code is platform independent enough for Festival to compile on Win32 machines without too many changes. Despite this optimism, a considerable amount of the projects effort went into realizing this objective. In fact, changes made to the code were kept track of, and as it grew, a help document for compiling Festival with Microsoft Visual C++ resulted. The document was made available on the World Wide Web to help other developers, and has already received attention. The help document has been included in this thesis (see Appendix C). An online version of the help document can be found at the following URL: http://www.computing.edu.au/~stalloj/projects/honours/festival-help.html.
5.7 Implementation of Emotion Tags

Previous sections have dealt with describing the framework constructed to support the main hypothesis; that is, to simulate the effect of emotion on speech. This section will now discuss the implementation of SMLs emotion tags, which when used to markup text, cause the text to be rendered with the specified emotion. As has already been stated in this thesis, the speech correlates of emotion needed to be investigated in the literature for the main objectives to be met. Section 3.2 described the findings of this research, and a table was constructed that describes the speech correlates for four of the five so-called basic emotions: anger, happiness, sadness, and fear (see Table 1). The table formed the basis for implementing the angry, happy, and
sad SML tags. For ease of reference, the contents of Table 1 for the anger, happiness
and sadness emotions is shown again in the following table:
Anger Speech rate Pitch average Pitch range Intensity Pitch changes Faster Very higher much
Happiness Slightly faster Much higher Much wider Higher Smooth, upward inflections
Sadness Slightly slower Slightly lower Slightly narrower Lower Downward inflections
Much wider Higher Abrupt, downward, directed contours Breathy, chesty tone1 Clipped
Voice quality Articulation

1
Breathy, blaring1 Slightly slurred
Resonant1 Slurred
terms used by Murray and Arnott (1993).
Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness.
To implement the guidelines found in the literature on human speech emotion, Murray and Arnott (1995) developed a number of prosodic rules for their HAMLET system. The TTS module has adopted some of these rules, though slight modifications were required. Also, other similar prosodic rules have been developed through personal experimentation.
5.7.1 Sadness
Basic Speech Correlates Following the literature-derived guideline for the speech correlates of emotion shown in Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were optimized for the TTS module, and are given as percentage values relative to neutral speech.
Parameter Speech rate Pitch average Pitch range Volume Value (relative to neutral speech) -15% -5% -25% 0.6
Table 3 Speech correlate values implemented for sadness.
As a result of the above speech parameter changes, the speech is slower, lower in tone, and is more monotonic (pitch range reduction gives a flatter intonation curve). The volume is reduced for sadness so that the speaker talks more softly. (Implementation details on how speech rate, volume and pitch values are modified can be found in Section 5.8). Prosodic rules The following rules, adopted from Murray and Arnott (1995), were deemed to be necessary for the simulation of sadness. Some parameter values were slightly modified to work best with the TTS module. 1. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pairs have a pitch difference of greater than 10% then the lower of the two pitch values is increased by 5% of the pitch range. 2. Add pauses after long words. If any word in the utterance contains six or more phonemes, then a slight pause (80 milliseconds) is inserted after the word. The following rules were developed specifically for the TTS module. 1. Lower the pitch of every word that occurs before a pause. Such words are lowered by scanning the phoneme data in the particular word, and
lowering the last vowel-sounding phoneme (and any consonant-sounding phonemes that follow) by 15%. This has the effect of lowering the last syllable. 2. Final lowering of utterance. The last syllable of the last word in the utterance is lowered in pitch by 15%.
5.7.2 Happiness
Basic Speech Correlates Following the literature-derived guideline for the speech correlates of emotion shown in Table 2, Table 4 shows the parameter values set for the SML happy tag. The values were optimized for the TTS module, and are given as percentage values relative to neutral speech.
Parameter Speech rate Pitch average Pitch range Volume Value (relative to neutral speech) +10% +20% +175% 1.0 (same as neutral)
Table 4 - Speech correlate values implemented for happiness.
As a result of the above speech parameter changes, the speech is slightly faster, is higher in tone, and sounds more excited since intonation peaks are exaggerated due to the pitch range increase. Prosodic rules The following rules were adopted from Murray and Arnott (1995). Some parameter values were slightly modified to work best with the TTS module. 1. Increase the duration of stressed vowels. The phoneme data is scanned, and the duration of all primary stressed vowel phonemes is increased by 20%. Stressed vowels are discussed in Section 5.7.4. 2. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pairs have a pitch difference of greater than 10% then the lower of the two pitch values is increased by 5% of the pitch range. 3. Reduce the amount of pitch fall at the end of the utterance. Utterances usually have a pitch drop in the final vowel and any following
consonants. This rule increases the pitch values of these phonemes by 15%, hence reducing the size of the terminal pitch fall.
5.7.3 Anger
Basic Speech Correlates Table 5 shows the parameter values set for the SML angry tag. Note that the values were optimized for the TTS module, and are given as percentage values relative to neutral speech.
Parameter Speech rate Pitch average Pitch range Volume factor Value (relative to neutral speech) +18% -15% -15% 1.7
Table 5 - Speech correlate values implemented for anger.
Prosodic rules The following rule was adopted from Murray and Arnott (1995). 1. Increase the pitch of stressed vowels. The phoneme data is scanned, and the pitch of primary stressed vowels is increased by 20%, while secondary stressed vowels are increased by 10%. Inspection of the parameter values in Table 5 will reveal that they differ considerably from the guidelines shown in Table 2. Initially, speech parameters were set for anger as shown in Table 2 but preliminary tests showed that even with different prosodic rules, the angry tag produced output that was too similar to that of the happy tag (both had increases in speech rate, pitch average and pitch range). It was decided to keep the increase in speech rate to denote an increase in excitement, but lower the pitch average. The lower voice seemed to better convey a menacing tone, for the same reason why animals utter a low growl to ward off possible intruders. With the help of the increase in volume, the pitch average lowering also results in a perceived hoarseness in the voice3, although vocal effects could not be implemented due to a limitation of the Festival and MBROLA systems. The combination of the decreased pitch range and the intonation rule results in a flatter intonation curve
Increased hoarseness in the voice for anger is supported in the literature (Murray and Arnott, 1993).
with sharper peaks. This upholds Table 2s description of pitch changes for anger: abruptcontour.
5.7.4 Stressed Vowels

Some of the prosodic rules in the previous sections made use of the term stressed vowel phoneme, and denoted two different types: primary and secondary. The term is used in Murray and Arnott (1995) when describing their prosodic rules, but unfortunately no explanation is given of the heuristic used to determine the type of stress. The frequent occurrence of the term within the prosodic rules however, signified that it was too important to ignore. Therefore, analysis was carried out on the phoneme data output by Festival to ascertain if there are any discriminating factors between vowel sounding phonemes. The results of the analysis showed that indeed, phoneme pitch and duration data can be used to categorize the importance of vowel-sounding phonemes within a word. Table 6 summarizes the findings of the analysis, and shows how three types of vowel-sounding phonemes can be discriminated based on the average phoneme pitch and duration values of the utterance.
Stress type Primary Secondary Tertiary Duration Avg. Yes Yes No > Duration Pitch > Pitch Avg. Yes No Yes
Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch.
Therefore, the prosodic rules made use of the fact that different stress types existed for vowel-sounding phonemes, basing classification on the criteria shown in the table above. Whether or not this follows the stressed phoneme definition of Arnott and Murray (1995) is unclear; the fact remains that Table 6 allowed the implementation of prosodic rules that involved different stressed vowel types, and its success will be demonstrated in this thesis evaluation section (Chapter 6).
5.7.5 Conclusion
While the literature identifies a number of speech correlates of emotion, speech synthesizer limitations did not allow for the implementation of some of them; namely,
intensity, articulation and voice quality parameters. This may have been the main reason why happiness and anger would have been too similar if the recommendations found in the literature had been strictly followed based on acoustic features alone there were too few differences between the two emotions. In discussing HAMLETs prosodic rules, Murray and Arnott (1995) state that the rules were developed to be as synthesizer independent as possible. This was found to be the case, though specific values given in their paper for the DECtalk system obviously could not be used. Still, the values served as a very good indication of what settings needed to be changed for different emotions. The very fact that some of the HAMLET prosodic rules could be implemented in this projects TTS module serves to show that the work of Murray and Arnott (1995) is not speech synthesizer dependent. This has the added advantage that it assures that the emotion rules implemented in the TTS module are not dependent on the Festival Speech Synthesis System and the MBROLA Synthesizer.
5.8 Implementation of Low-level SML Tags

5.8.1 Speech Tags
This section will briefly describe the implementation of SMLs lower level tags that can be used to directly manipulate the output of the TTS module. For instance, using the rate tag can make the TTS module produce slower or faster speech, and the volume tag can increase/decrease the volume of the utterance. The tree-like structure of the SML Document lends itself to recursive tag processing. This is because an SML Document element node representing a tag usually affects a subtree of nodes of an unknown structure. Therefore it is very convenient to define recursive functions that visit all child nodes of the tag node, and process the data held in the text nodes. a) emph Using the emph tag, a word can be emphasized by increasing the pitch and duration of certain phonemes within the word. More specifically, emphasis of a word involves a target phoneme whose pitch and duration information is changed. The target phoneme can be specified via the target attribute (the actual phoneme is specified, not the letter); if not specified, the target becomes the first vowel-sounding phoneme within the word. The pitch and duration values of neighbouring phonemes (if they exist) of the target are
also affected. Figure 21 shows the factors that duration and pitch values are multiplied by.
<emph target=o affect=b level=moderate>sorry</emph> target phoneme duration = 2.0 pitch =1.3
sorry phonemes
s o r ii
neighbours duration = 1.8 pitch = 1.2
Figure 21 Multiply factors of pitch and duration values for emphasized phonemes.
The attributes that can be specified within the emph tag give various options on how the word will be emphasized. For instance, the affect attribute can specify whether only the pitch should change for the affected phonemes (default), or just the duration, or both. The level attribute specifies the strength of the emphasis (e.g. weak (default), moderate, strong). The current limitation of the tag is that only one target phoneme can be specified. However, this can be easily modified so that multiple target phonemes can exist within a word by extending the target attribute and how it is processed. b) embed The embed tag enables foreign file types to be embedded within SML markup. The type of embedded file is specified through the type attribute. Currently, two file types are supported in SML: audio the embedded file is an audio file, and is played aloud. The filename is specified through the src attribute. Embedding audio files within SML markup is useful for sound effects.
<embed type=audio src=sound.wav/>
mml Music Markup Language (MML) markup is being embedded, and signifies that the voice will sing the specified song. Two input files are specified through the following attributes: music_file (contains music description of song), and lyr_file (contains songs lyrics). Processing of
this file involves obtaining the two input filenames, and calling the MML library that will perform synthetic singing. The implementation of MML for synthetic singing was developed by Stallo (2000).
<embed type=mml music_file=rowrow.mml lyr_file=rowrow.lyr/>
c) pause The pause tag inserts a silent phoneme at the end of the last word of the previous text node with a duration specified by the length or msec attributes. Finding the text node previous to the pause node can be is non-trivial, as the algorithm must be able to handle any sub-tree structure. shows a possible structure that the algorithm must be able to handle.
Take a deep <emph>breath</emph> <pause length=medium/> and continue
TEXT_NODE Take a deep
text
ELEMENT_NODE
emph
pause
ELEMENT_NODE length=medium
TEXT_NODE Take a deep
text
TEXT_NODE breath
text
Silent phoneme to be inserted at end of breath
Figure 22 - Processing a pause tag.
d) pitch Pitch average The pitch average is modified by changing the pitch values of every phonemes pitch point(s). The middle attribute specifies by which factor the pitch values need to change. Modifying the pitch value of every phoneme by the same amount has the effect of changing the pitch average.
Pitch range Modifying the pitch range of an utterance requires that the pitch average be known. Therefore, the pitch average of the utterance is calculated before any pitch values can be modified. Each pitch value is then recalculated using the following equation:
NewPitch = (OldPitch Average) * PitchRangeChange
This has the effect of moving pitch points further away from the average line, for pitch values that are both greater and less than the pitch average (as shown in Figure 23). Care is taken that the new pitch values do not go below/higher than predetermined thresholds, or otherwise the voice loses its human quality and sounds too machine-like.
pitch average pitch average Original pitch values New pitch values Figure 23 - The effect of widening the pitch range of an utterance.
e) pron The pron tag is used to specify a particular pronunciation of a word. It pron tag is the only tag that modifies the text content of an SML Document text node, and is processed before the contents is given to the NLP module at the text to phoneme transcription stage. This is because the value of the sub attribute overwrites the contents of the text node. When the phoneme transcription stage is reached, the substituted text is what is given to the NLP module, and so the phoneme transcription reflects the specified pronunciation of the markup. Figure 24 illustrates how this is done for the markup segment:
<pron sub=toe may toe>tomato</pron>
ELEMENT_NODE sub=toe may toe
pron
ELEMENT_NODE sub=toe may toe text content phoneme transcription
pron
NLP
TEXT_NODE tomato
text
TEXT_NODE toe may toe
text
TEXT_NODE toe may toe
text
Sub-tree reflects the structure of the example markup. pron tag has attribute sub value of toe may toe
pron nodes sub attribute value overwrites contents of child text node.
At phoneme transcription stage, text node gives its contents to NLP, and receives phoneme transcription.
Figure 24 - Processing the pron tag.
f) rate
The speech rate is modified very easily by affecting the phoneme duration data in the text node structures. How much the speech should increase/decrease by is specified by the speed attribute, and each phonemes duration is multiplied by a factor reflecting the value of speed.
g) volume
The volume is modified through the usage of the MBROLA v command line option. The level attribute specifies the volume change, and this is converted to a suitable value to pass to MBROLA through the command line. The disadvantage of this way of implementing volume control is that MBROLA applies the volume to the whole utterance it synthesizes. Since sections are passed to MBROLA at the emotion tag level, the volume can vary at most from emotion to emotion. Therefore, dynamic volume change within an emotion tag is not possible, and has been identified as an area for future work (see Section 7.1).
5.8.2 Speaker Tag

The speaker tag allows control over some speaker properties, principally the diphone database MBROLA is to use (see Section 5.9) and the speakers gender. Implementation mainly involved providing a way to change the MBROLA Synthesizers settings. Example SML markup:
<speaker gender=female name=en1>I am a woman</speaker>
Name The value of the name attribute is passed to MBROLA on the command line, and must be the name of an MBROLA diphone database that already exists on the system. Using a different diphone database changes the voice since the recorded speech units contained within the database are from a different source. The TTS module can currently use two diphone databases: en1, and us1. Making use of more diphone databases (when they become available) requires very minimal additions. Unfortunately, specifying the actual name of the MBROLA diphone database in the SML markup has been identified as a design flaw, since an SML user will now be aware that the MBROLA synthesizer is being used and is forced to write markup that directly accesses an MBROLA voice. It is very important that this should be altered in the future.
Gender The MBROLA synthesizer provides the ability to change frequency values and voice characteristics, and thus provides a way of obtaining a male and female voice from the same diphone database. Obtaining a male or female voice requires the specification of the following MBROLA command line options: 1. Frequency ratio specified through the -f command line option. For instance, if -f 0.8 is specified on the MBROLA command line, all fundamental frequency values will be multiplied by 0.8 (voice will sound lower). 2. Vocal tract length ratio specified through the -l command line option. For instance, if the sampling rate of the database is 16000, indicating -l 18000 allows you to shorten the vocal tract by a ratio of 16/18 (which will make the voice sound more feminine). Unfortunately, values for the f and l MBROLA command line options are dependent on the diphone database being used. Table 7 shows the parameter values required to obtain a male and female voice for the en1 and us1 diphone databases.
Gender Male (using en1) Female en1) Female us1) (using Frequency ratio (f) 1.0 1.6 0.9 1.5 Vocal tract ratio (l) 16000 20000 16000 16000
Male (using us1) (using
Table 7- MBROLA command line option values for en1 and us1 diphone databases to output male and female voices.
5.9 Digital Signal Processor

Once all phoneme data is finalized, the SML Document contains the required information to produce a waveform, which is one of the outputs of the black box figure shown in Section 5.1. The Digital Signal Processing (DSP) unit of the TTS module makes heavy use of the MBROLA Synthesizer, which is a speech synthesizer based on the concatenation of diphones. This basically means that an utterance is synthesized by concatenating very small units of recorded speech sounds4.
4
The set of recorded speech sounds is often referred to as the voice data corpus, or a diphone database.
MBROLA accepts as input a list of phonemes, together with prosodic information (in the form of phoneme duration and a piecewise linear description of pitch), and from this is able to produce speech samples. The MBROLA input format is very intuitive and simple for describing phoneme data. Hence the design of the TTS modules utterance structures (discussed in Section 5.5.2) was based on the MBROLA input format. Figure 25 shows example MBROLA input required to produce a speech sample of the word emotion. Each line holds the following information, delimited by white space: phoneme, duration, n pitch-point pairs. A pitch-point pair is represented by two numbers: a percentage value representing how far into the phonemes duration the point occurs, and the actual pitch value.
# 210
phoneme
i 61 0 109 50 110 m 69 ou 140 0 107 50 110 sh 117 @ 48 0 105 50 94 n 91 100 90 # 210
duration
pitch point
Figure 25 - Example MBROLA input.
5.10 Cooperating with the FAML module

A subproblem that was stated in Section 2.2 was the cooperative integration with the FAML (Facial Animation Markup Language) module, which implements a facial gesture markup language for the Talking Head. Collaboration with Huynh (2000) was aimed at achieving synchronization of vocal expressions and facial gestures. The TTS module supports the FAML module in the following ways. 1. The allowance of FAML markup tags to be present within SML markup. This required a filtering process to be implemented so that non-SML tags are not present when parsing of the TTS input takes place. This also required that provisions be put in place so that tag name conflicts would not occur. The way this was resolved was by making sure that there were no tag names that occurred in both the SML tag set and the FAML tag set. For instance, to output an angry expression the SML tag uses the tag name angry, while the FAML uses the tag name anger. This is a simplistic solution however, because it hinders the
choice of good descriptive tag names (if an appropriate name is already being used in another tag set), and similar tag names such as anger and angry will only confuse future users of TTS and FAML markup. A possible solution would be the use of XML Namespaces, which allow different markup languages to contain the same tag names (ambiguity is resolved through the use of resolution identifiers e.g. sml.angry and faml.angry) 2. FAML API functions are called from within the TTS module to initialize the FAML module, and to allow the FAML module to modify the output FAP stream. 3. The creation of temporary utterance files in a format required by the FAML module. This information is needed by the FAML module for the proper synchronization of its generated facial gestures and the TTS modules speech. The utterance file contains word and phoneme information in the format shown in Figure 26.
word >And # 210 a 90 n 49 d 46 >now n 72 au 262 # 210 >the dh 45 @ 40 >latest l 69 ei 136 t 71 i 66 s 85 t 66 >news n 72 y 45 uu 217 z 82 # 210
phoneme and duration (ms)
Figure 26 - Example utterance information supplied to the FAML module by the TTS module. Example phrase: And now the latest news
5.11 Summary
This chapter has been able to demonstrate that the design and implementation of the described TTS module has endeavoured to follow the black box design principle in virtually all of its subsystems. As Figure 7 shows, the dependence of other Talking Head modules on the TTS module is strictly limited to its precisely defined inputs any modification of the internal workings of the TTS module will not affect how the rest of the Talking Head functions. The TTS modules own dependence on tools it uses such as libxml, the Festival Speech Synthesis System, and the MBROLA Synthesizer has been carefully bounded through proper design of C++ classes and structures. There is also no dependence of the DSP unit (which makes use of MBROLA) on the NLP unit (Festival), so minimal changes would be required if another NLP unit was used, but the TTS subsystem design would remain unchanged. The only requirements for a new speech synthesizer to be used in the NLP and DSP units would be the ability to manipulate the utterance at the phoneme data level (including pitch and duration information). The prosodic and speech parameter rules for the emotion tags are totally speech synthesizer independent, except for the volume settings. Because of this, the emotion tags could be easily ported for use in another TTS module using different speech synthesizers. Processing of most of the low-level speech tags makes use of the SML Documents utterance structures only. However, the speaker and volume tags make heavy use of the MBROLA Synthesizer due to the fact that these tags affect the way MBROLA produces the waveform. Future work will look at minimizing the dependence of these two tags on MBROLA.
Chapter 6 Results and Analysis

Evaluation of this project was primarily concerned with testing the hypotheses stated in Section 4.1. Therefore, the evaluation process endeavoured to ascertain how well the system is able to simulate emotion in synthetic speech, and the extent of the effect this has on a Talking Heads ability to communicate. In this chapter, the procedure in which data was acquired to test the hypotheses will be described. This will be followed by a presentation of the data, of which a full analysis is carried out.
6.1 Data Acquisition

6.1.1 Questionnaire Structure and Design
It was decided that evaluation data would be acquired via a questionnaire filled out by participants of a single demonstration. The decision was based on experience gained in the evaluation of a previous version of the Talking Head by Shepherdson (2000). The questionnaire had to be carefully designed to ensure that the questions asked would provide sufficient data to adequately prove or disprove the projects stated hypotheses. The following subsections describe the structure of the questionnaire, and design issues that were considered. For a copy of the actual questionnaire, see Appendix D. Section 1 Personal and Background Details Because it was known beforehand that most, if not all, participants were to be university students, and that a good proportion would be international students, questions were prepared to acquire the participants nationality, and if English was their first spoken language. Combined with their responses in the other sections, it was hoped that this would provide useful data for analysis. Other information such as the participants age and gender was also asked.
Section 2 Emotion Recognition This section was made up of two parts, and borrowed heavily from the testing method described in Murray and Arnott (1995). Both parts dealt with the recognition of emotion in synthetic speech, but followed a different format. Part A had only two test phrases, but each phrase was synthesized under four different emotions: anger, happiness, sadness, and neutral (no emotion). The two phrases were specially chosen because they were deemed to be emotionally undetermined (Murray and Arnott, 1995). Emotionally undetermined phrases are phrases whose emotion cannot be determined simply by the words. For instance, the phrase I received my assignment mark today can be convincingly said under a variety of emotions the speaker may be sad about the mark he or she received, can be feeling happy about it, or can even be angry having felt to be treated unfairly. For Part B, ten test phrases were prepared: five of these were emotionally neutral phrases, and the other five were emotionally biased phrases. An example of a neutral phrase is The book is lying on the table. An example of an emotionally biased, or emotive, phrase is I would not give you the time of day (the words already carry negative connotations, which could probably influence the listener to identify anger or disgust). Part B consisted of the following utterances (in this order): a) b) Five neutral phrases spoken without emotion. Five emotive phrases spoken without emotion.
c) The same phrases as in (a) above, but spoken in one of the four emotions (includes neutral). d) The same phrases as in (b) above, but spoken in one of the four emotions. Both Cahn (1990b) and Murray and Arnott (1995) expressed difficulty in finding appropriate test phrases, and indeed, the same difficulties were encountered when designing the questionnaire for this project, especially in finding neutral phrases that sounded convincing under any of the four emotions. A number of phrases were borrowed from the experiments of the aforementioned researchers, and the rest were original. It is important to note that the participants were not made aware of the different types of phrases that were prepared. See Appendix E for a list of the example test phrases used. An important design issue for this section of the questionnaire was deciding the way in which the participants should indicate their choice. Murray and Arnott (1995) have
identified (and used) two basic methods of user input that are suitable for speech emotion recognition: forced response tests, and free response tests. In a forced response test, the subject is forced to choose from a list of words the one that best describes the emotion that he or she perceives is being spoken. In a free response test, the subject may write down any word that he or she thinks best describes the emotion. In experiments performed by Cahn (1990), the evaluation was based on a forced response test, with only the six emotions that the Affect Editor simulated as possible responses. Participants were also asked to indicate (on a scale of 1 to 10) how strongly they heard the emotion in the utterance, and how sure they were. For this project, it was decided to adopt the forced response test because of its simplicity and to avoid the possible ambiguity of a free response test (for instance, if a participant wrote down exasperated, should it be categorized as angry or disgusted, or neither?). However, a mechanism needed to be provided so as not to limit the possible selection of responses. This was important because only three emotions other than neutral anger, happiness, and sad were simulated by the system. It was feared that should the selection list be confined to just four possible responses, then it could potentially invalidate the data. For instance, any utterance that contained positive words, or contained a positive tone to it, would immediately be perceived as happy only because it would be the only positive emotion in the list. This situation was avoided by adding two distractor emotions in the selection list surprise and disgust. This way, listeners would have more choice. In addition to this, an Other option was added to the selection list to enable the participant to write down their own descriptive term if they so wished. Therefore, a total of seven possible responses were made available for the four different types of emotion being simulated. It is beneficial that the tests in this section had a very similar structure to the forced response tests conducted by Murray and Arnott (1995). The experiment could not be exactly the same since time limitations did not permit as many test utterances to be played, and not all the emotions that were tested in Murray were simulated. Still, the experiments are similar enough to provide at least a loose comparison of results. It would have been interesting if a female voice would have been tested since the voice gender can be changed through the speaker tag (see Section 5.8.2). However, the questionnaire length was strictly limited, and a sufficient number of examples of one gender were needed to obtain useful data; interchanging male and female voices would have introduced complex variables, and making valid conclusions from the data would have been difficult, if not impossible. Still, differences in subject responses to male and
female voices could have possibly provided useful data for analysis and this should certainly be looked into in the future (see Chapter 7). Section 3 Talking Head For this section it was desired to obtain data that could describe the effect that adding vocal emotion to a Talking Head had on a users perception of the Talking Head. To do this it was decided to prepare two movies of the Talking Head: one speaking without vocal emotion effects, and the other including emotion effects in its speech all other variables must remain as much the same as possible (e.g. facial expressions, movements, and the actual words spoken). It was not possible to have the visual information of the Talking Head exactly the same for both examples, since the inclusion of certain speech tags affected the length of the utterance, and so slight movements such as eye blinking may have been different. Facial expressions showing emotion however, were the same for both examples (the exact placement, duration and intensity of the expressions could be controlled via facial markup tags (Huynh, 2000)). The utterance that was synthesized for both examples was excerpted from the Lewis Carroll classic Alices Adventures in Wonderland (Carroll, 1946). The excerpt contained dialogue between three characters of the story: Alice, the March Hare, and the Hatter. The passage was chosen for its wonderful expressiveness (it is a childrens novel), the fact that it included dialogue, and because various emotions such as sad, curiosity, and disappointment could be used when reading the passages. The first examples speech was synthesized without any markup at all, and therefore included Festivals intonation unchanged. For the second example, the text was marked up by hand using a variety of high-level speech emotion tags, and lower-level speech tags such as rate, pitch and emph (see Section 5.8).
Section 4 General Information The last section ascertained if the participant had heard of the term speech synthesis before, and if they had seen a Talking Head before. The section also aimed at receiving comments about the speech examples the participant had heard, and about the Talking Head they had seen.
6.1.2 Experimental Procedure

The demonstration was held in a moderately sized lecture theatre that was categorized by benches arranged in a tiered fashion (maximizing the view of the front), and a large
screen at the front of the theatre. The room also had an adequate sound system enabling each participant to comfortably listen to the demonstration. A total of 45 participants took place in the demonstration, a number large enough to produce adequate data. The demonstration began by very briefly introducing the participants to what the project is about. They were told that by filling out the questionnaire they would be helping in evaluating how well the project addresses the problem it had set out to solve. Nevertheless, it was made clear to all participants that it was not compulsory that they should sit the demonstration. The participants were given an overview of what to expect in the questionnaire, including the fact that they would be played a number of sound files and asked to comment on each one. It was emphasized however, that it was not they who were being tested, but the program itself, and that there were no right or wrong answers. The participants were asked to fill out Section 1 of the questionnaire. They were told the relevance of the questions being asked in that section, but that they were under no obligation to answer questions they did not feel comfortable with. The participants were encouraged to ask questions to clarify any details they felt had not been made clear. The sound and movie demonstrations of Sections 2 and 3 had been pre-rendered and made part of a Microsoft PowerPoint presentation that was shown on the large screen at the front of the lecture theatre. The reason for this was to avoid waiting for the utterances to be generated, and to provide a way of showing the audience the current section and example number that was being played. It also minimized the risk of anything going wrong. Parts A and B of Section 2 consisted of the test utterances described in Section 6.1.1 being played. For each test utterance, the sound file was played twice and the participants were given time to choose which emotion best suited how the speaker sounded. In order to give the participants a chance to listen to the test utterance again and confirm their choice, the five most recent utterances were repeated every fifth example. The repeating of the utterances also served as a mental break for the participants. The next section of the questionnaire consisted of the two Talking Head examples described in Section 6.1.1. The participants were asked to first watch both of the examples, and then fill out that part of the questionnaire. Before the examples were played, the participants were asked to scan the four questions asked in that section so as to aid them in what they should look for the Talking Heads clarity, expressiveness, naturalness, and appeal. However, the participants were not asked to focus on the voice only, even though it was only the voice that changed in the two examples. The participants were then asked to fill out the rest of the questionnaire, which consisted of more general questions (see Appendix D).
Important to note is that there were some factors that did not allow the evaluation to be carried out under ideal conditions. For instance, the very fact that the demonstration was held with a group of participants may have been a cause of distraction for some. Moving towards an ideal situation would have been for each participant to sit the questionnaire one at a time, in front of a computer without anyone else in the room. This would have minimized any distractions, and would have allowed the participant to go at his or her own pace. Given the time and resources available, however, this was not possible. Still, the demonstration was designed to be short enough to keep the participants attention and give each participant the ability to review their answers through playbacks, and conducted in such a way as to give the participants ample time to answer the questions.
6.1.3 Profile of Participants

The first and last section of the questionnaire provided some statistics about the participants who were involved in the evaluation process. Table 8 shows a summary of the information obtained.
Average age: Gender: Male Female Country lived in most: Australia Singapore Indonesia Malaysia Vietnam Other Unanswered English first language: Yes No 23.9 84.00% 16.00% 51.10% 8.90% 8.90% 6.70% 4.40% 8.90% 11.10% 57.80% 42.20%
Table 8 - Statistics of participants
In addition to the statistics shown in the table, it should be noted that all participants were students enrolled in a second year Computer Science introductory graphics unit. All participants were computer literate who used computers at home, school and for work. 71.1% had heard of the term speech synthesis before, while 26.7% had not (2.2% were unknown). Also, 82.2% had seen a Talking Head before, while 15.6% had not (2.2% were unknown).
6.2 Recognizing Emotion in Synthetic Speech

The discussion in this section is concerned with providing a full analysis of the data obtained from the emotion recognition tests Sections 2A and 2B of the questionnaire. Before the discussion can continue however, it will be beneficial to describe the structure and meaning of the confusion matrix, which is how most of the data will be presented.
6.2.1 Confusion Matrix

The advantage of using a confusion matrix to present the data is that it not only shows the correct recognition percentage for a particular emotion, but also the distribution of the emotions that were mistakenly5 chosen. It thus provides a way to clearly show any possible confusion that may have occurred between two or more emotions. Table 9 shows the confusion matrix template that will be used. The column headings represent the possible emotions the participants could choose from. The row headings represent the actual emotions that were simulated. Each cell will contain a percentage value indicating the proportion of listeners who perceived a particular emotion for a particular stimulus.
PERCEIVED EMOTION
Happy Sad Angry Neutral Surprised Disgusted Other
S T I M U L U S
Happy Sad Angry Neutral
Table 9 - Confusion matrix template
For example, in Table 10, the first row of example percentage values show that when an utterance was combined with happy vocal emotion effects generated by the system, 42.2% of listeners perceived the emotion as happy (i.e. correct recognition took place), 2.2% perceived it as sounding sad, 6.7% angry, 17.9% neutral, 23.7% surprised, 4.4% disgusted, and 2.9% specified something else. The data in the example therefore, would indicate happiness was the most recognized emotion for that utterance, and that the utterance was mostly confused with surprise and neutral. Note that the Other category
5
The intended meaning is not that the participants failed to recognize the simulated emotion through a fault of their own, but simply that the participants choice didnt match the emotion being simulated.
also includes those participants who could not decide which emotion was being portrayed.
PERCEIVED EMOTION
S T I M U L U S
42.2%
2.2%
6.7%
17.9%
23.7%
4.4%
2.9%
Table 10 - Confusion matrix with sample data.
From the above example, it can be seen that the data should be read in rows; each row representing an utterance or group of utterances that were simulating a specific emotion, and the cell values for that row show the distribution of the participants responses. Cells that have the same row and column names hold values that represent when the listeners perceived emotion matched the emotion being simulated any other cell represents wrong recognition. A table holding values such as those in Table 11 would therefore be showing ideal data 100% recognition of the simulated speech emotion took place for all emotions.
PERCEIVED EMOTION
S T I M U L U S
100%
0.0% 0.0% 0.0%
0.0%
0.0% 0.0%
0.0% 0.0% 0.0%
0.0% 0.0% 0.0% 0.0%
0.0% 0.0% 0.0% 0.0%
0.0% 0.0% 0.0% 0.0%
100%
0.0% 0.0%
100%
0.0%
100%
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions.
Of course, this is virtually impossible, since it is well documented in the literature that our recognition rate is far less than perfect for when humans speak (Malandra, Barker and Barker, 1989). Knapp (1980) identifies three main factors that describe why this is so: 1. People vary in their ability to express their emotions (not only in speech, but also in other forms of communication such as facial expressions and body language).
2. People vary in their ability to recognize emotional expressions. In a study described in Knapp (1980), listeners ranged from 20 percent correct to over 50% correct. 3. Emotions themselves vary in being able to be correctly recognized. Another study showed anger was identified 63 percent of the time whereas pride was only identified 20 percent of the time. So if obtaining such ideal data for human speech is unrealistic, this is more the case when emotion is being simulated in synthetic speech.
6.2.2 Emotion Recognition for Section 2A

The Data The listener response data for the test utterances spoken with happy vocal emotion effects is given in Table 12. The data shows a relatively poor recognition rate (18.9%) for the happy emotion; 41.1% of listeners perceived the emotion to be neutral, while 22.2% perceived the speaker as sounding surprised.
PERCEIVED EMOTION
S T I M U L U S
18.9%
3.3%
3.3%
41.1%
22.2%
6.7%
4.4%
Table 12 Listener response data for neutral phrases spoken with happy emotion.
P Similarly, Table 13 shows E R C E response E M O T Iall N emotions demonstrated listener I V E D data for O four in the test utterances for Section 2A. Significant values are displayed in a larger font. S T I M U L U S
Happy Happy Sad Angry Neutral
Sad
Angry
Neutral
Surprised Disgusted
Other
18.9%
0.0% 6.7% 6.7%
3.3%
3.3% 1.1%
41.1% 22.2%
7.8% 0.0% 5.6% 3.3%
6.7% 6.7%
4.4% 6.7% 1.1% 6.7%
77.8%
0.0%
44.4% 21.1%
2.2%
21.1%
3.3%
23.3%
54.4%
Table 13 Section 2A listener response data for neutral phrases.
The following observations can be made from the data for each emotion (including happy, which has already been discussed). a) Happy. A poor recognition rate (18.9%), with the stimulus being strongly confused with neutral (41.1%) and surprise (22.2%). b) Sad. A very high recognition rate occurred for sadness (77.8%), with little confusion occurring with other possible emotions. c) Angry. A relatively high percentage of listeners recognized the angry stimulus (44.4%), but a considerable amount confused anger with neutral and disgust (21.1% each). d) Neutral. Most listeners correctly recognized when the utterance was played without emotion (54.4%), but a significant portion (23.3%) perceived the emotion as sad. The Analysis Any percentage value substantially greater than 14% was deemed as significant. This is based on the logic that if all participants had randomly chosen one of the seven emotions, then a particular emotion would have 1/7 (~14%) chance of being chosen. Therefore, if a cell has a percentage substantially greater than 14% then there must have been a factor (or factors) that influenced the listeners choice. Except for happy, all emotions had a recognition rate greater than 14%. However, sadness was the only emotion that enjoyed little confusion with other emotions. So with the exception of sadness, average recognition for the simulated emotion in the utterances was quite low. To give a possible explanation for these values, it will be helpful to look at the data for each question separately. Table 14 shows listener response data for Question 1 of Section 2A, and Table 15 shows listener response data for Question 2.
PERCEIVED EMOTION
S T I M U L U S
6.7% 0.0% 0.0% 0.0%
2.2%
6.7% 2.2%
57.8%
8.9% 6.7%
13.3% 0.0% 4.4% 2.2%
8.9% 11.1%
4.4% 11.1% 2.2% 6.7%
66.7%
0.0%
66.7%
2.2%
20.0%
2.2%
46.7%
40.0%
Table 14 - Listener response data for Section 2A, Question 1. PERCEIVED EMOTION
S T I M U L U S
31.1%
0.0% 13.3% 13.3%
4.4%
0.0% 0.0%
24.4% 31.1%
6.7% 0.0% 6.7% 4.4%
4.4% 2.2%
4.4% 2.2% 0.0% 6.7%
88.9%
0.0% 0.0%
22.2% 35.6%
2.2%
22.2%
4.4%
68.9%
Table 15 - Listener response data for Section 2A, Question 2.
It is very obvious that the two tables hold substantially different data values. Whereas for Question 1 the happy stimulus received a recognition rate of 6.7%, with 57.8% mistaking it for neutral, this changed dramatically for Question 2, with 31.1% of listeners correctly identifying the happy emotion, and only 24.4% classifying the utterance as neutral. There is still considerable confusion occurring for the happy stimulus even for Question 2 (surprise is 31.1%), but the difference between the two questions recognition rates is too large to ignore. Therefore it is evident that the results in this section were very much utterance dependent. From the data, it seems that the phrase The telephone has not rung at all today (Question 1) was said more effectively in an angry tone than I received my assignment mark today (Question 2), and this was despite the care that was taken to choose neutral test phrases. Both Cahn (1990) and Murray and Arnott (1995) report this problem of utterance-dependant results, and it demonstrates the difficult task of obtaining quantitative results on speech emotion recognition. An interesting observation is noting when a particular stimulus received strong listener recognition and the emotion in the same row with the next highest response. For example, anger received strong recognition for Question 1 (see Table 14). For that
utterance, the emotion that anger was most confused with is disgust, which received 20.0%. Close scrutiny of the two tables will also reveal that when happiness was strongly recognized, the emotion it was most confused with is surprised (see row 1 of Table 15). Also, neutral in Table 14 was most confused with sad. The pattern that emerges is the one described in the literature; the pairs that are most often confused with each other are happiness-surprise, sadness-neutral, and anger-disgusted. This is because the speech correlates identified in the literature are very similar for these emotion pairs. As a consequence, emotion recognition will often be confused between these emotion pairs, especially with neutral text. From Table 14 and Table 15, it can also be seen that generally, recognition of the simulated emotions improved for Question 2. This could be due to the listeners becoming accustomed to the synthetic voice and learning to distinguish between the emotions. It could also have been due to the listeners, who were all students, relating more to the phrase of the second question, which was about receiving an assignment mark. This could be clarified with further tests.
6.2.3 Emotion Recognition for Section 2B

As outlined in Section 6.1.1, there were ten test phrases in this section, with each phrase being used to generate two utterances one spoken in a neutral tone and the other in one of the three emotions (happy, sad, or angry). Section 2B was thus made up of four different types of utterances in the following order: a) b) c) d) Emotionless or neutral text phrases spoken with no vocal emotion. Emotive text phrases spoken with no vocal emotion. Emotionless text phrases of (a) spoken with vocal emotion. Emotive text phrases of (b) spoken with vocal emotion.
In this part of the analysis, the data will be presented for each of these four test utterance types.
Emotionless text with no vocal emotion
Table 16 shows listener responses for utterances with emotionless text with no vocal emotion; that is, utterances whose text was emotionally indistinguishable from just the words alone, and spoken with no vocal emotion effects. The text phrases used for this section are phrases 1-5 shown in Appendix E.
S T I M U L U S
PERCEIVED EMOTION
Happy Neutral Sad Angry Neutral Surprised Disgusted Other
2.2%
36.9%
2.7%
44.4%
3.1%
6.7%
4.0%
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion.
The observation that can be made from the data is that there was a strong recognition that the utterances were spoken with no emotion. However, the utterances were also confused with sadness, as it too received a strong listener response (36.9%). Again, like the previous section of the questionnaire (discussed in Section 6.2.2), it was found that there was a great deal of variation in listeners perception that was dependent on the utterance being spoken. For instance, one of the phrases in this section was The telephone has not rung at all today. The majority (73.3%) of listeners perceived the speaker as being sad, while only 20.0% thought it sounded neutral. Instead, the phrase I have an appointment at 2 oclock tomorrow was perceived more neutral (62.2% for neutral, 20.0% for sad). The point being emphasised is that although care was taken to choose emotionally undetermined phrases, the data suggests that this is a very difficult task. Our dependency on context to discriminate emotions is stated in Knapp (1980), and has been confirmed in this evaluation. Albeit in varying degrees, the general trend for this subsection was that the neutral voice was often perceived as sounding sad. This suggests that Festivals intonation, which is modeled to be neutral, may have an underlining sadness. Interestingly, Murray and Arnott (1995) also made this observation with the HAMLET system, which makes use of the MITalk phoneme duration rules described in Allen et al. (1987, Chapter 9).
Emotive text with no vocal emotion The use of utterances of this type endeavoured to determine the influence emotive text has on listener perception of emotion. Table 17 shows the data accrued for utterances spoken with no vocal emotion, but contained D E M Otext O N the speakers feelings can P E R C E I V E emotive T I (i.e. be approximately determined from the text alone). Phrases 6-10 of Appendix E were used for this subsection. S
T I M U L U S
Happy Happy Sad Angry Neutral
Sad
Angry
Neutral
Surprised Disgusted
Other
13.3% 2.2% 1.1% 2.2%
8.9%
0.0% 2.2%
68.9% 37.8% 75.6%
0.0% 0.0% 2.2% 11.1%
0.0% 4.4%
8.9% 4.4% 1.1% 6.7%
48.9%
0.0% 4.4%
42.2% 33.3%
0.0%
20.0%
0.0%
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion.
The following observations can be made from the data: a) Row maxima occurred for three emotions corresponding to the emotion in the text (sadness, anger, and neutral). This shows that for these emotions, a good proportion of the listeners relied on the words in the utterance to distinguish between emotions. However, a significant amount of confusion occurred for sadness and anger; the sadness stimulus was perceived as neutral by 37.8% of listeners, and the anger stimulus was confused with neutral (33.3%) and disgusted (20.0%). b) The phrase containing happy text was very much perceived as having no emotion. It is unclear why this emotions recognition suffered more than the others did; maybe happiness heavily requires confirmation in the speakers voice for it to be identified as such. The significant confusion occurring with happiness, sadness, and anger suggests that although emotion perception can be strongly influenced by the words in the utterance, lack of vocal emotion in the voice meant that the utterances sounded unconvincing, and so emotion recognition was not very strong.
Emotionless text with vocal emotion The confusion matrix for phrases containing emotionless text (phrases 1-5 of Appendix E) spoken with vocal emotion is shown in Table 18; that is, utterances whose text was emotionally indistinguishable from just the words alone, but were spoken with various vocal emotion effects.
PERCEIVED EMOTION S T I M U L U S
Happy Happy Sad Angry Neutral Surprised Disgusted Other
24.4%
1.1%
0.0%
22.2% 40.0%
2.2%
10.0%
Sad Angry
0.0% 0.0%
72.2%
0.0%
0.0%
17.8%
2.2%
0.0% 0.0%
2.2% 4.4%
7.8% 2.2%
91.1%
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion.
The following observations can be made from the data: a) The row maximum for the happiness stimulus was for the surprised emotion (40.0%), while 24.4% thought it sounded happy and 22.2% thought it sounded neutral. Though this is a low listener response for happiness, the high combined response of the happiness and surprised emotions seems to suggest that it had a pleasant tone. That listeners wrote descriptive terms (for the Other option) such as content, informative, proud, and curious seems to confirm this. b) A high correct recognition rate occurred for the sad stimulus (72.2%). Other listeners gave their own descriptive terms such as lethargic, disappointed, and sleepy. c) The anger stimulus received a very high recognition rate (91.1%), with very little confusion occurring with other emotions. Bearing in mind that the utterances for this subsection contained exactly the same text as the neutral text, neutral voice utterances (shown in Table 16), the strong effect vocal emotion has on a listeners perception of emotion in an utterance can be clearly seen. Whereas most listeners chose either neutral or sad for the neutral text, neutral voice utterances, Table 18 shows a much better overall emotion recognition for neutral text, emotive voice utterances.
Emotive text with vocal emotion The confusion matrix for phrases containing emotive text (phrases 6-10 of Appendix E) spoken with vocal emotion is shown in Table 19; that is, utterances whose text alone gave an indication of the speakers emotion, and were further spoken with the appropriate vocal emotion.
PERCEIVED EMOTION S T I M U L U S
66.7%
0.0% 0.0% 6.7%
4.4%
0.0% 4.4%
13.3%
4.4% 0.0% 0.0% 13.3%
2.2% 0.0%
8.9% 8.9% 5.6% 2.2%
62.2%
0.0% 2.2%
24.4%
1.1%
77.8%
0.0%
15.6%
4.4%
71.1%
Table 19 - Listener responses for utterances containing emotive text with vocal emotion.
The following observations can be made from the data: a) As predicted, the confusion matrix shows a strong average recognition rate for all simulated emotions (seen in the high values down the diagonal line, matching row and column names). b) Confusion continued to occur for emotions, but was not as significant as other utterance types. Both of the above observations indicate that correct emotion perception is taking place, upholding the hypotheses related to this section (see Section 4.1). Important to note is the reoccurrence of the confusion pairs mentioned in Section 6.2.2 sad with neutral, and angry with disgusted. Happiness, in Table 19 at least, was not confused much with surprise; however, this may be due to the utterance not lending itself to this confusion as surprise is certainly being confused with happiness in Table 18. Interestingly, some listeners complained in the general comments section that it was very difficult to hear disgust or surprise in the speech utterances (the reader will recall that disgust and surprise were not being simulated, and were included in the questionnaire selection list as distractors). This may suggest that some listeners may have sometimes chosen disgust or surprise because they felt the emotion had to come up sooner or later. Happiness for this section enjoyed a high recognition rate after suffering low recognition rates for other utterance types. For the neutral text, emotive voice test utterances, happy utterances were perceived as having a pleasant tone, but were not explicitly identified as happiness. The high recognition rate for happiness for emotive text, emotive voice type utterances suggests that the emotive text helped clarify the utterance as not just being pleasant but happy.
6.2.4 Effect of Vocal Emotion on Emotionless Text
By studying the two confusion matrices in Table 16 and Table 18, it can be seen that emotion recognition was enhanced for neutral phrases spoken with vocal emotion compared to the same text spoken without vocal emotion. The two confusion matrices however, show only the number of listeners who correctly or incorrectly recognized the simulated emotion. What they do not show is the number of listeners who improved with the addition of vocal emotion effects. To determine the effect of vocal emotion therefore, the data should be filtered to not show listeners who correctly recognized the intended emotion when the utterance was spoken without vocal emotion and then also correctly recognized when the utterance was spoken with vocal emotion. In order to address this, further analysis was carried out on the data to determine the effect the addition of vocal emotion had on listener emotion recognition. To do this, the analysis kept track of listeners who had improved in their recognition of the intended emotion, and also kept track of listeners whose recognition was shown to deteriorate when the utterance was spoken with vocal emotion. Table 20 shows for each simulated emotion, the percentage of listeners who incorrectly recognized the intended emotion when it was spoken without vocal emotion and who then improved in their recognition when the utterance was spoken with vocal emotion. Similarly, Table 21 shows the percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects.
Happy Sad Angry
22.2% 36.7% 84.4%
Happy Sad Angry
2.2% 12.2% 2.2%
Table 20 - Percentage of listeners who improved in emotion recognition with the addition of vocal emotion effects for neutral text.
Table 21 Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for neutral text.
The above results show that an overall significant increase in emotion recognition occurred with the introduction of vocal emotion effects in a neutral utterance; this was true for all the simulated emotions. Anger gained the largest increase in emotion recognition (84.4%). This suggests that if a neutral text utterance is to be perceived as being spoken in anger, then its perception is very much dependent on the vocal emotion effects without vocal emotion, it is simply very difficult to perceive the speaker as being angry. Deterioration for sadness was higher than the other two emotions. The reason for this is not clear, except that confusion between sadness and neutrality occurred with all utterance types.
6.2.5 Effect of Vocal Emotion on Emotive Text

A similar analysis to the one in Section 6.2.4 was carried out to determine the effect of vocal emotion on emotive text. Table 22 and Table 23 show the results of this analysis.
Happy Sad Angry
57.8% 31.1% 41.1%
Happy Sad Angry
4.4% 17.8% 6.7%
Table 22 - Percentage of listeners whose emotion recognition improved with the addition of vocal emotion effects for emotive text.
Table 23 Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for emotive text.
From the above results, it can be seen that all emotions received a significant increase in emotion recognition once vocal emotion was added to the utterance, with the greatest improvement occurring for happiness (57.8%). Anger did not improve so much with emotive text as with neutral text. This shows that emotive text also had a strong influence on determining the correct emotion being simulated. As with Table 21, Table 23 also shows that the deterioration of recognition with sadness was significantly higher than other emotions; sadness-neutral confusion was a problem throughout all the tests. Still, the substantial effect of vocal emotion even on emotive text phrases can be clearly seen through this analysis.
6.2.6 Further Analysis

In Section 6.1.1, it was mentioned that questionnaire participants were asked to provide personal and background information. This associated information with participants emotion recognition scores would enable further analysis for instance, were there any differences in emotion recognition between male and female participants? Or, were there any differences between participants who spoke English as their first language versus participants who didnt? Unfortunately, the gender distribution of the participants was too unbalanced to enable any concrete analysis to take place 84.4% were male, 15.6% were female. On the other hand, participant population was quite evenly divided for those who spoke English as their first language and those who didnt 57.8% and 42.2% respectively. For the language analysis, apart from minor differences in emotion recognition of the intended stimulus, correct emotion perception was the same for both classes of
participants. One discrepancy was found however, with differences too significant to overlook: it was noted from the data that participants who did not speak English as their first language confused sadness with neutrality significantly more than people who did speak English as their first language. This was noted for neutral text, emotive voice and emotive text, emotive voice type utterances. Table 24 shows listener responses for the sadness stimulus for participants who speak English as their first language, while Table 25 shows the same listener responses for people who do not speak English as their first language. Both Tables are for neutral text, emotive voice utterance types.
S T I M U L U S
PERCEIVED EMOTION
Happy Sad Sad Angry Neutral Surprised Disgusted Other
0.0%
82.7%
0.0%
5.8%
0.0%
1.9%
9.6%
Table 24 Listener responses for participants who speak English as their first language. Utterance type is neutral text, emotive voice. PERCEIVED EMOTION
S T I M U L U S
0.0%
55.3%
0.0%
34.2%
0.0%
2.6%
7.9%
Table 25 Listener responses for participants who do NOT speak English as their first language. Utterance type is neutral text, emotive voice.
The pattern is also reflected in the following data for emotive text, emotive voice utterance types (Table 26 and Table 27).
S T I M U L U S
PERCEIVED EMOTION
Sad
0.0%
76.9%
3.8%
11.5%
0.0%
0.0%
0.0%
Table 26 Listener responses for participants who speak English as their first language. Utterance type is emotive text, emotive voice.
S T I M U L U S
PERCEIVED EMOTION
0.0%
42.1%
5.3%
42.1%
0.0%
0.0%
10.5%
Table 27 Listener responses for participants who do NOT speak English as their first language. Utterance type is emotive text, emotive voice.
The significant difference in emotion perception for sadness suggests that cultural issues are coming into play; this is an issue that has received a good deal of interest in the non-verbal communication literature (Knapp, 1980) (Malandra, Barker and Barker, 1989). The high confusion between sadness and neutrality for participants without English as their first language may account for the high deterioration values for sadness seen in Table 21 and Table 23. Maybe comforting to know is that if there was confusion for sadness, neutral was chosen, as supported by Murray and Arnott (1995). It would be alarming indeed if the data would show that sadness could be easily confused with, say, disgust, as the consequences in real-life communication would be disastrous.
6.3 Talking Head and Vocal Expression

This section will endeavour to determine the effect vocal emotion has on peoples perception of the Talking Head that was described previously. Specifically, it will address hypothesis 2b that states: through the addition of emotive speech, information will be communicated more effectively by the Talking Head (see Section 4.1). In the demonstration of the two example movies of the Talking Head (described in Section 6.1.1), the following comparisons were asked to be made by the participants. As a memory refresher, the first Talking Head example was without vocal expression, and the second Talking Head example included vocal expression. The reader is reminded that Huynhs (2000) facial markup tags were used to perform facial expressions during the reading. The markup tags and their placement were identical for both examples.
Understandability. If one Talking Head version was determined to be generally easier to understand than the other, then it can be assumed that listener comprehension would benefit. Expressiveness. The Talking Head version that would be determined to be better able to express itself would communicate more effectively its feelings, the mood of a story, the seriousness/lightheartedness of information etc. Naturalness. Lester and Stone (1997) show that believability of pedagogical agents (an application of Talking Heads) plays an important role in user motivation. Along with Bates (1994), it is shown that life-like quality and the display of complex behaviours (including emotion) is a crucial factor for believability. Interest. Users who find a speaker interesting will give their attention to what is being said. If this occurs, the speaker has the opportunity to communicate his or her ideas better. Understandability Table 28 shows the response of participants who compared how well they could understand the Talking Head in each example, and chose which one they felt was easier to understand.
Understandability Talking Head 1 Talking Head 2 Neither
2.2% 82.2% 15.6%
Table 28 - Participant responses when asked to choose the Talking Head that was more understandable.
The data shows that the overwhelming majority of participants thought the second Talking Head demonstration (the one that included vocal expression) was easier to understand. From participants justifications for their choice, it was remarked that not only did most participants think the second demonstration was easier to understand, but that many thought the first demonstration was very difficult to understand. Participants wrote that the second demonstration was easier to understand because it spoke slower and because it had better expression in its voice. In fact, the speech rate of the second demonstration did not slow down; rather, more pauses between sentences and character
dialogues were present. This highlights the importance that silence has in our verbal communication. Although it could be thought that the second demonstration was perceived as being easier to understand because the story was told again, and so therefore the participants were hearing the story for the second time, the fact that the reasons for their choice were widely echoed shows that the voice was a determining factor. Expressiveness Table 29 shows the response of participants when asked which Talking Head demonstration seemed best able to express itself. Again, the second demonstration was overwhelmingly favoured over the first demonstration. Motivators for favouring the second demonstration were because it was perceived the storytelling was better structured, and because there was more variation in its tone. Many participants commented on how the variability and changes in pitch helped the turn-taking of the characters and to distinguish ideas behind [the] statements in the story. Still others commented on the dynamic use of volume, and that the tone of the second demonstration was more appropriate for the story.
Expressiveness Talking Head 1 Talking Head 2 Neither
4.4% 86.7% 8.9%
Table 29 Participant responses when asked which Talking Head seemed best able to express itself.
Another factor that was stated by most participants who favoured the second demonstration was that the vocal and facial expressions were better synchronised, and that the vocal expressions stood out more. The synchronisation of vocal expression and facial gestures is deemed to be very important for communication by Cassell et al (1994a), and Cassell et al (1994b); it is encouraging that most participants were able to notice this and view it as desirable. Interestingly, participants who did not favour the second demonstration of the Talking Head commented that it was too expressive (in a caricature way).
Naturalness
Table 30 shows the response of participants when asked which Talking Head seemed more natural. The main reason that the majority of participants that voted for the second demonstration gave was that the speech now matched the facial expressions better, and so seemed more realistic. For this question, there were a considerable amount of participants who commented that the mouth and lip movements were unconvincing and therefore needed more work for the Talking Head to seem realistic.
Naturalness Talking Head 1 Talking Head 2 Neither
15.6% 71.1% 13.3%
Table 30 Participant responses when asked which Talking Head seemed more natural.
Interest When asked which Talking Head demonstration seemed more interesting, participants gave a variety of reasons for their choice. Most participants opted for the second demonstration, and Table 31 shows the data for this question.
Interesting Talking Head 1 Talking Head 2 Neither
2.2% 84.4% 13.3%
Table 31 Participant responses when asked which Talking Head seemed more interesting.
Many participants wrote that because the first demonstration was difficult to understand, they lost interest in what it was saying. Conversely, the second demonstration was easier to understand, and as a result, they didnt have to concentrate as much. Others wrote that the Talking Head in the second demonstration seemed more alert and that it was happier. Several participants commented that the second demonstration was more interesting because it seemed it had more human qualities, and that the facial expressions were noticed more because of the improved speech. For the results shown in this section, it is safe to say that the overwhelming majority of participants favoured the second demonstration of the Talking Head over the first demonstration. What is gratifying is the fact that the participants werent told to focus on the Talking Heads voice but rather look at it on the whole, and yet most people
attributed the improvement between the two examples to the vocal expression of the Talking Head. A very important note to make however is that the results are from data that was obtained by comparing two versions of the Talking Head: with and without simulated vocal emotion. Further tests would need to be done to quantitatively determine how effective a communicator the Talking Head is with its improved speech. What is important is that the results have been able to show that with vocal emotion, the Talking Head is a better communicator, basing the argument that is has enhanced understandability, expressiveness, naturalness, and user interest.
6.4 Summary
This chapter briefly described how the evaluation research methodology was employed to test the hypotheses stated in Section 4.1. The TTS module was tested for how well listeners could recognize the simulated emotions, and the data was seen to largely support the stated hypotheses. The variables that can affect speech emotion recognition were investigated and discussed, and the test data was also seen to support both previous work done in the field of synthetic speech emotion (namely Murray and Arnott, 1995), and the general literature from the fields of paralinguistics and non-verbal behaviour. The last hypothesis of Section 4.1 was also tested to investigate if a Talking Head is able to communicate information more effectively when it is given the capability of speech expression. Through a series of demonstrations, this testing section showed that viewers overwhelmingly rated the Talking Head with vocal expression as much easier to understand, more expressive, natural, and more interesting to look and listen to. It is proposed that these factors contribute in making the Talking Head a more effective communicator.
Chapter 7 Future Work

Through the development of this project, a number of possible avenues for future work have been identified. The following sections came about either as a result of the strict time constraints of the project, and hence certain features were unable to be implemented, or because it was discovered that many areas offered great depth in which much future work could continue. This is especially true for XML, whose usability and potential is indeed enormous. It is envisaged that many of the following sections each contain enough depth for entire projects.
7.1 Post Waveform Processing

Another submodule could be added to the TTS module to perform post waveform processing after the Digital Signal Processor has generated the waveform. This could bring a range of benefits, from adding novelty filter effects to the voice to simulating various environments (such as a large hall, outdoors, bathroom effect etc). But perhaps the greatest benefit that post waveform processing could accomplish would be the implementation of dynamic volume within an utterance. The volume tag would be extended to include gradual increasing/decreasing volume abilities, and would help vary intensity levels within an utterance. The problem that the implementation of any post waveform effects would need to solve would be determining which part(s) of the waveform need to be altered. This would be determined within the SML Tag Processor, which could output a list of operation nodes describing the operation(s) that the Post Waveform Processor needs to perform, and the waveform parts that are to be affected by the operation. Figure 27 shows the structure of such a node, while Figure 28 shows an (initial) design of part of the TTS module with the addition of the Post Waveform Processor submodule.
Operation Parameters Start/End Figure 27 - A node carrying waveform processing instructions for an operation.
SML Document
Tags + Text/Phonemes Modified Phonemes Phoneme data
DSP SML Tags Processor

MBROLA
Processing directives
Waveform
Waveform Processing List Processing directives
Post Waveform Processor
Modified Waveform
Figure 28 - Insertion of new submodule for post waveform processing.
7.2 Speaking Styles

An interesting area for future work would be the investigation of developing different speaking styles. Work in this area is quite important, as the literature is very rich in the discussion of how and why we adopt different speaking styles. For instance, Knapp (1980) shows how research in the field of paralinguistics suggests that the way we speak changes depending on who we are talking to (e.g. a group of people or one on one, someone of the opposite sex, someone from a different age group etc). Research has also shown that male and female speakers have different speech intonation patterns, and catering for this in the SML speaker tags output would be beneficial (Knapp, 1980). Malandra, Barker and Barker (1989) describe how speaking styles, namely conversational style and dynamic style, affect the listeners perception of the speaker.
For instance, one of the characteristics of dynamic style is a faster rate of speech, and studies have shown that higher ratings of intelligence, knowledge, and objectivity are ascribed to the speaker (Miller, 1976 as referenced in Malandra, Barker and Barker, 1989). Contrariwise, a speaker adopting a conversational style of speaking (which is characterized by a more consistent rate and pitch), is rated to be more trustworthy, better educated, and more professional by listeners (Pearce and Conklin, 1971 as referenced in Malandra, Barker and Barker, 1989). If these results can be reproduced, this will clearly have major repercussions for a Talking Head. An interesting way that speaking styles could be specified is through the use of Cascading Style Sheets (CSS), making use of Extensible Stylesheet Language (XSL) technology. The SML markup itself would not need to be changed to adopt a different speaking style, but rather the XSL document that defines the style and voice of the speaker. This confirms one of the benefits of XML that was stated earlier, where the XML file describes the data and does not force the presentation of the data.
<?xml-stylesheet ...?> <?xml-stylesheet ...?> <sml> <sml>
link to stylesheet
<xsl:stylesheet ...> <xsl:stylesheet ...> ... ... </xsl:stylesheet ...> </xsl:stylesheet ...>
... ...
</sml> </sml>
Stylesheet
SML File
Figure 29 - SML Markup containing a link to a stylesheet.
7.3 Speech Emotion Development

The TTS module has implemented three emotions thus far (excluding neutral): happiness, anger, and sadness. Fear and grief are the two remaining of the five so-called basic emotions which could be implemented, and this certainly would not be a comprehensive list of vocal emotions that humans display. Therefore, there is much work that could be carried out in this area alone. However, future work in other (more complex) emotions may not produce the same results found in this thesis. This would be due to a number of reasons: a) More complex emotions are less understood. b) As a consequence of (a) speech correlates for complex emotions are much harder to identify, and as a result are featured less prominently in the literature.
c) It could well be that work in this area will find that speech correlates may not be as important for discriminating similar complex emotions (e.g. pride and satisfaction), and that it is rather what we say that provides the best cues for such speaker emotions. This is upheld by Knapp (1980), who states as we develop, we rely on context to discriminate emotions with similar characteristics.
7.4 XML Issues

Because FAITH has only just begun to incorporate XML input into its various modules, a solid framework for handling markup of various XML-based languages has not been properly defined. Section 5.1.2 raised an important issue concerning the handling of XML data; the TTS input needs to be filtered since the input may contain non-SML tags. Though this process works, this design is a little loose since all modules within the Talking Head receive XML data that is often not relevant to the module. Also, inefficient handling of the XML input occurs since individual modules are required to re-parse the same data. A more structured approach would be to have a separate module whose role would be to manage the XML data. Its task would be to then pass relevant XML data to the other modules, possibly in the form of a pre-built DOM. With this in place, each module in the system will have clearer definitions of its inputs, knowing that all information it receives is applicable to the module. Figure 30 shows where the XML Handler fits in with the other modules of the Talking Head. Note that the design also enables the system to not be as TTS driven as it currently is (see Section 7.5).
SML
TTS
XML Data (SML, FAML, etc)
TTS Output
Brain
XML Handler
FAML
FAML
FAML Output
Other
Other
Other Output
Figure 30 - Inclusion of an XML Handler module to centrally manage XML input.
A complete re-think of how XML data is handled is needed if XML is to play a major role in the way the Talking Head processes information (which is what we believe will eventuate). There is a need for the framework to be put in place as soon as possible in order to aid the implementation of future modules so that handling of the input is done in a controllable manner, and that the full potential of XML is realized.
7.5 Talking Head

The evaluation of the TTS module showed that there is a need to improve the viseme production of the Talking Head. This is logical since the speech and facial gestures are now able to express various emotions, but the mouth movements during speech currently stay the same. The effect of emotion on mouth movements is therefore an area that should be explored. This will include the need to investigate how certain facial expressions can be preserved during speech, especially those including mouth movements. For instance, currently it is quite difficult for the Talking Heads mouth to be in the shape of a smile and talk at the same time without causing undesirable effects. Another limitation that has been identified concerning the Talking Head is that its head movements are purely TTS driven. Currently, when speech is rendered, the FAP stream containing just the visemes is passed to the Personality and FAML modules so that they can add head movement and facial gesture directives down the FAP stream. The consequence of this process is that if the Talking Head does not say anything, no head movements or facial expressions can be generated. This results in a Talking Head that is completely immobile unless its talking, which gives a very mechanical feel to it. What is required is a complete redesign of how speech and the FAP stream output is generated. That is, the generation of the FAP stream for head movements and facial gestures should be partially divorced from speech production to give the Talking Head the ability to move, blink, and perform facial gestures and expressions even when it is not talking. This has not been a major issue during evaluation stages of the Talking Head in Shepherdson (2000), Huynh (2000), and this thesis because testing has been mainly done through the use of demonstration examples no thorough testing has yet been done in an interactive environment. However, the Talking Head has reached a level of maturity where it is now being demonstrated in public events, and hopefully this will result in constructive feedback to help prioritize (and shed light on) problems that need to be solved.
7.6 Increasing Communication Bandwidth

Work is underway to make the FAITH Projects Talking Head accessible from within a web browser, and therefore have a presence on the World Wide Web (Levy, 2000). With the current architecture, applications making use of the Talking Head will suffer bandwidth problems due to the large transfer of volume from the server to the client. Even with high-speed connections, users will still experience delays since many waveforms need to be sent to the client as the Talking Head speaks. A solution to this problem, of course, would be to implement streaming audio, but this would require considerable effort. This thesis proposes an alternate, simpler solution that would greatly reduce the amount of traffic between the server and the client. Figure 31 shows the proposed architecture of a modified TTS module that bridges across the server and client.
NLP (Festival) DSP (MBROLA)
Text to be rendered
Phoneme Info (text)
Waveform
TTS Module
SML Processing
TTS Module
Viseme Generator
Visemes
SERVER SIDE
CLIENT SIDE
Figure 31 - Proposed design of TTS Module architecture to minimize bandwidth problems between server and client.
The idea of sending text to the client instead of a waveform is not new, but other solutions state that to do this the entire TTS module should be on the client side. The text to be rendered is sent to the client, and the speech is fully synthesized there. However, a complex NLP module such as Festivals is quite large, and it would be very undesirable to force users to install such an application on their systems. The architecture in Figure 31 shows a system where all phoneme transcription and SML tag processing is still carried out on the server. Instead of producing a waveform and sending this to the client however, the phoneme information is sent across in MBROLAs input format (described in Section 5.9) to the TTS Module on the client side, where the waveform is produced by the DSP. The MBROLA Synthesizer is not a very large program, and has binaries available for most platforms. Note that although the diagram shows the Festival and MBROLA systems, any synthesizer that can deal with phoneme information at the input and output level can be used.
Chapter 8 Conclusion
This thesis has focused on the primary goal of simulating emotional speech for a Talking Head. The objectives reflected this by aiming to develop a speech synthesis system that is able to add the effects of emotion on speech, and to implement this system as the TTS module of a Talking Head. The literature was explored and investigation of research in the fields of non-verbal behaviour, paralinguistics, and speech synthesis allowed (with some confidence) the following hypotheses of Section 4.1 to be formed: 1. The effect of emotion on speech can be successfully synthesized using control parameters. 2. Through the addition of emotive speech: a) Listeners will be able to correctly recognize the intended emotion being synthesized. b) Head. Information will be communicated more effectively by the Talking
The review of the literature also helped to both identify and provide possible solutions to the subproblems stated in Section 2.2, namely the need to develop a speech markup language, and synchronizing speech and facial gestures. Throughout the entire design and implementation phase of the TTS module, the literature was a solid basis upon which design decisions were made. For instance, SMLs speech emotion tags implemented the speech correlates of emotion found in the literature and previous work on synthetic speech emotion, chiefly by Murray and Arnott (1995) and Cahn (1990). The evaluation of the TTS module was an integral part of this thesis. Therefore, an evaluative research methodology was employed to determine if the system supported or disproved the stated hypotheses. In order for testing to be carried out in a clear and
concise way that could be directly linked with the hypotheses, the questionnaire-based evaluation process was organized into two main sections: a) Speech emotion recognition, and
b) The extent of the effect vocal expression has on a Talking Heads ability to communicate. The results from the speech emotion recognition sections support hypotheses 1 and 2a; that is, strong recognition of the simulated emotions took place and so simulation of the emotive speech was successful. The evaluation was able to show that a good recognition rate of emotion occurred, comparable to that described in the literature for both synthetic speech emotion, and for human speech. The investigation of different combinations of neutral/emotive text and neutral/emotive vocal expression was important in that it helped identify the importance of the words we decide to use and how we choose to say them. Importantly, the results in Chapter 6 are seen to also confirm the literature; both in terms of the effects of the different variables involved (words and vocal expression), and also in terms of the emotions that are often confused together. As found in Murray and Arnott (1995) and Cahn (1990), the results were seen to be very utterance dependent and may have contributed to both correct recognition and also confusion between emotions. The reason for confusion between certain emotions is reported in the literature to be dependent on many factors, including the text in the phrase and its context, the speakers voice, and who is hearing the utterance, which brings gender and cultural issues in play (Malandra, Barker and Barker, 1980; Knapp, 1980). Results of the Talking Head experiments also proved to be positive with the overwhelming majority of viewers stating that the Talking Head was much easier to understand, more expressive, natural, and more interesting when vocal expression was added to its speech. It would be very interesting to investigate listener emotion recognition rates for the same utterances spoken by the Talking Head itself. This way, the importance of the combination of the visual channel (facial gestures) and the audio channel (speech) could be studied. There is much work yet to be done in the field of synthetic speech emotion, with Chapter 7 identifying a number of key areas which should be investigated. The main limitation of the project was the inability to implement dynamic volume control and speech correlates affecting the voice quality (such as nasality, hoarseness, breathiness etc). Notwithstanding these limitations however, this thesis has been able to demonstrate the effect synthetic speech emotion has on user perception when it is applied to a Talking
Head. It is envisaged it would bring similar benefits to other applications using speech synthesis. It is through this demonstration that the significance and value of this project is seen to have been confirmed.
Bibliography
Allen, J., Hunnicutt, M. S. and Klatt, D. (1987). From Text to Speech: The MITalk System, Cambridge University Press, Cambridge. Bates, J. (1994). The Role of Emotion in Believable Agents, Communications of the ACM, vol. 37, pp. 122-125. Beard, S. (1999). FAQBot, Honours Thesis, Curtin University of Technology, Bentley, Western Australia. Beard, S., Crossman, B., Cechner, P. and Marriott A. (1999). FAQbot, Proceedings of Pan Sydney Area Workshop on Visual Information Processing, Nov 1999, University of Sydney, Australia. Beard, S., Marriott, A. and Pockaj, R. (2000). A Humane Interface, OZCHI 2000 Conference on Human-Computer Interaction: Interfacing reality in the new millennium. (4-8 Dec, 2000), Sydney, Australia. Black, A. W., Taylor, P. and Caley, R. (1999). The Festival Speech Synthesis System System Documentation, Edition 1.4, for Festival Version 1.4.0, 17th June 1999, [Online], Available: www.cstr.ed.ac.uk/projects/festival/manual/festival_toc.html. Bos, B. (2000). XML in 10 points. http://www.w3.org/XML/1999/XML-in-10-points. [Online]. Available:
Bosak, J. (1997) XML, Java, and the Future of the Web. [Online]. Available: http://www.webreview.com/pub/97/12/19/xml/index.html Cahn, J. E. (1988). From Sad to Glad: Emotional Computer Voices, Proceedings of Speech Tech '88, Voice Input/Ouput Applications Conference and Exhibition, April 1988, New York City, pp. 35-37. Cahn, J. E. (1990). The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, vol. 8, pp. 1-19. Cahn, J. E. (1990b). Generating expression in synthesized speech, Technical Report, Massachusetts Institute of Technology Media Laboratory, MA, USA. Carroll, L. (1946). Alices Adventures in Wonderland, Random House, New York.
Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S., Becket, T., and Achorn, B. (1994a). Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents, In Proceedings of ACM SIGGRAPH 94. Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S., and Achorn, B. (1994b). Modeling the interaction between speech and gestures, In Proceedings of the 16th annual Conference of the Cognitive Science Society, pp. 119124. Cechner, P. (1999), NO-FAITH: Transport Service for Facial Animated Intelligent Talking Head, Honours Thesis, Curtin University of Technology, Bentley, Western Australia. Cover, R. (2000). The XML Cover Pages. [Online]. Available: http://www.oasisopen.org/cover/xml.html. Davitz, J. R. (1964). The Communication of Emotional Meaning, McGraw-Hill, New York. Dellaert, F., Polzin, T. and Waibel, A. (1996). Recognizing Emotion in Speech, Proceedings of ICSLP 96 the 4th International Conference on Spoken Language Processing, 3-6 October 1996, Philadelphia, PA, USA. Document Object Model (DOM) Level 1 Specification (Second Edition) (2000). [Online]. Available: http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929. Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Acadesmic Publishers. Extensible Markup Language (XML) 1.0, (1998). http://www.w3.org/TR/1998/REC-xml-19980210. [Online]. Available:
Galanis, D., Darsinos, V. and Kokkinakis, G. (1996). Investigating Emotional Speech Parameters for Speech Synthesis, International Conference on Electronics, Circuits, and Systems (ICECS 96), pp. 1227-1230. Graham, I. S. and Quinn, L. (1999). XML Specification Guide. Wiley Computing Publishing, New York. Hallahan, W. I. (1996). DECtalk Software: Text-to-Speech Technology and Implementation, Digital Technical Journal. Heitzeman, J., Black, A. W., Mellish, C., Oberlander, J., Poesio, M. and Taylor, P. (1999). An Annotation Scheme for Concept-to-Speech Synthesis, Proceedings of the European Workshop on Natural Language Generation, Toulouse, France, pp. 5966.
Hess, U., Scherer, K. L. and Kappas, A. (1988). Multichannel Communication of Emotion, in Facets of Emotion Recent Research, Edited by Klaus R. Scherer, Lawrence Erlbaun Associates, Publishers, Hillsdale, New Jersey. Huynh, Q. (2000). Evaluation into the validity and implementation of a virtual news presenter for broadcast multimedia, Honours Thesis (yet to be published), Curtin University of Technology, Bentley, Western Australia. Knapp, M. L. (1980). Essentials of Nonverbal Communication, Holt, Rinehart and Winston, pp. 203-229. Lester, J. C. and Stone, B. A. (1997). Increasing Believability in Animated Pedagogical Agents, in First International Conference on Autonomous Agents, Marina del Rey, CA, USA, pp. 16-21. Levy, Y. (2000). WebFAITH (Web-based Facial Animated Interactive Talking Head), Honours thesis, Curtin University of Technology, Bentley, Western Australia. libxml (2000). The XML http://www.xmlsoft.org. C library for Gnome. [Online]. Available:
Malandra, L., Barker, L., and Barker, D. (1989). Nonverbal Communication, 2nd edition, Random House, pp. 32-50. Mauch, J. E. and Birch, J. W. (1993). Guide to the Successful Thesis and Dissertation, 3rd edition, Marcel Dekker, New York. MBROLA Project Homepage (2000). http://tcts.fpms.ac.be/synthesis/mbrola.html. Microsoft (2000). Benefiting From XML. http://msdn.microsoft.com/xml/general/benefits.asp. [Online], [Online]. Available: Available:
MPEG (1999). Overview of the MPEG-4 Standard. ISO/IEC JTC1/SC29/WG11 N2725. Seoul, South Korea. Murray, I. R. and Arnott, J. L. and Newell A. F. (1988). HAMLET Simulating emotion in synthetic speech, Proceedings of Speech '88, The 7th FASE Symposium (Institute of Acoustics), Edinburgh, August 1988, pp. 1217-1223 Murray, I. R. and Arnott, J. L. (1993). "Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion", Journal of the Acoustical Society of America, February 1993, vol. 2, pp. 1097-1108. Murray, I. R. and Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech, Speech Communications, vol. 16, pp. 369-390.
Murray, I. R. and Arnott, J. L. (1996). Synthesizing emotions in speech: is it time to get excited?, in Proceedings of ICSLP 96 the 4th International Conference on Spoken Language Processing, Philadelphia, PA, USA, 3-6 October 1996, pp. 1816-1819. Murray, I. R. and Arnott, J. L. and Rohwer E. A. (1996). Emotional stress in synthetic speech: Progress and future directions, Speech Communication, vol. 20, pp. 85-91. Ostermann, J., Beutnagel, M., Fischer, A., Wang, Y. (1998). Integration of Talking Heads and Text-to-Speech Synthesizers for Visual TTS, in Proceedings of International Conference on Speech and Language Processing (ICSLP 98), Sydney, Australia, December 1998. Python/XML Howto (2000). [Online]. Available: http://www.python.org/doc/howto/xml. SAX 2.0 (2000). Sax: The Simple API for XML, version 2. [Online]. Available: http://www.megginson.com/SAX. Scherer, K. L. (1981). Speech and Emotional States, in Speech Evaluation in Psychiatry, edited by J. K. Darby (Grune and Stratton, New York). Scherer, K. L. (1996). Adding the Affective Dimension: a New Look in Speech Analysis and Synthesis, Proceedings of the International Conference on Speech and Language Processing (ICSLP 96), Philadelphia, USA, October 1996. Shepherdson, R. H. (2000). The Personality of a Talking Head, Honours Thesis, Curtin University of Technology, Bentley, Western Australia. SoftwareAG (2000a). XML Application Examples. [Online]. http://www.softwareag.com/xml/applications/xml_app_at_work.htm. SoftwareAG (2000b). XML Benefits. http://www.softwareag.com/xml/about/xml_ben.htm. [Online]. Available: Available:
Sproat, R., Taylor, P., Tanenblatt, M. and Isard, A. (1997). A Markup Language for Text-to-Speech Synthesis, in Proceedings of the Fifth European Conference on Speech Communication and Technology (Eurospeech 97), vol. 4, pp. 1747-1750. Sproat, R., Hunt, A., Ostendorf, M., Taylor, P., Black, A., and Lenzo, K. (1998). SABLE: A Standard for TTS Markup, in Proceedings of International Conference on Speech and Language Processing (ICSLP98), pp. 1719-1724. Stallo, J. (2000). Canto: A Synthetic Singing Voice. Technical Report, Curtin University of Technology, Western Australia. Available: http://www.computing.edu.au/~stalloj/projects/canto/. Taylor, P. and Isard, A. (1997). SSML: A speech synthesis markup language, Speech Communication, vol. 21, pp. 123-133.
The XML FAQ (2000). Frequently Asked Questions about the Extensible Markup Language. [Online]. Available: http://www.uc.ie/xml/.
Appendix A SML Tag Specification

The following list is a description of each of SMLs elements in alphabetical order. As with any XML element, all SML elements are case sensitive; therefore, all SML elements must appear in lower case, otherwise they will be ignored. The SML DTD is given in Appendix B. For associated implementation issues, see Sections 5.7 and 5.8.
SML Tags
angry
Description: Simulates the effect of anger on the voice (i.e. generates a voice that sounds angry). Attributes: none. Properties: Can contain other non-emotion elements. Example:
<angry>I would not give you the time of day</angry>
embed
Description: Gives the ability to embed foreign file types within an SML document such as sound files, MML files etc., and for them to be processed appropriately. Attributes:
Name type Description Specifies the type of file that is being embedded. (Required) Values audio embedded file is an audio file. mml an mml file is embedded. src music_fil e If type = audio, then src attribute gives path to audio file. If type = mml, then music_file specifies the path of the mml music A character string. A character string.
file. lyr_file If type = mml, then lyr_file specifies the path of the mml lyrics file. A character string.
Properties: empty. Example:

<embed type=mml music_file=songs/aaf.mml lyr_file=songs/aaf.lyr/>
emph
Description: Emphasizes a syllable within a word. Attributes:
Name target Description Specifies which phoneme in contained text will be the target phoneme. If target is not specified, default target will be the first phoneme found within the contained text. The strength of the emphasis. (Default level is weak). Specifies if the element is to affect the contained texts phoneme pitch values, or duration values, or both. (Default is pitch only). Values A character string representing a phoneme symbol. Uses the MPRA phoneme set. weakest, weak, moderate, strong. p affect pitch only. d affect duration only. b affect both pitch and duration.
level affect
Properties: Cannot contain other elements. Example:

I have told you <emph affect=b level=moderate>so</emph> many times.
happy
Description: Simulates the effect of happiness on the voice (i.e. generates a voice that sounds happy). Attributes: none. Properties: Can contain other non-emotion elements. Example:
<happy>I have some wonderful news for you.</happy>
neutral
Description: Gives a neutral intonation to the spoken utterance. Attributes: none. Properties: Can contain other non-emotion elements. Example:
<neutral>I can sometimes sound bored like this.</neutral>
p
Description: Element used to divide text into paragraphs. Can only occur directly within an sml element. The p element wraps emotion tags. Attributes: none. Properties: Can contain all other elements, except itself and sml. Example:
 <sad>Today its been raining all day,</sad> <happy>But theyre calling for sunny skies tomorrow.</happy> 
pause
Description: Inserts a pause in the utterance. Attributes:
Name length msec smoot h Description Specified the length of the utterance using descriptive value. Specifies the length of the utterance in milliseconds. Specifies if the last phonemes before this pause need to be lengthened slightly. Values short, medium, long. A positive number. yes, no (default = yes)
Properties: empty.
Example:
Ill take a deep breath <pause length=long/> and try it again.
pitch
Description: Element that changes pitch properties of contained text. Attributes:
Name middle range Description Increases/decreases pitch average of contained text by N% Increases/decreases pitch contained text by N%. range of Values (+/-)N%, highest, high, medium, low, lowest. (+/-)N%
Properties: Can contain other non-emotion elements. Example: Not I, <pitch middle=-20%>said the dog</pitch>
pron
Description: Enables manipulation of how something is pronounced. Attributes:
Name sub Description The string the contained text should be substituted with. Required. Values A character string.
Properties: Cannot contain other elements. Example:

You say tomato, I say <pron sub=toe may toe>tomato</pron>
rate
Description: Sets the speech rate of the contained text. Attributes:
Name speed Description Sets the speed, can be a percentage value higher or lower than the current, or a descriptive term. Required. Values (+/-)N%, fastest, fast, medium, slow, slowest.
Properties: Can contain other non-emotion elements. Example:

I live at <rate speed=-10%>10 Main Street.</rate>
sad
Description: Simulates the effect of sadness on the voice (i.e. generates a voice that sounds sad). Attributes: none. Properties: Can contain other non-emotion elements. Example:
<sad>Honesty is hardly ever heard.</sad>
sml
Description: Root element that encapsulates all other SML tags. Attributes: none. Properties: root node, can only occur once. Example:
<sml> <happy>The sml tag encapsulates all other tags</happy> </sml>
speaker
Description: Specify the speaker to use for the contained text. Attributes:
Name gender name Description Sets the gender of the speaker. Default = male. Specifies the diphone database that Values male, female. A character string.
should be used (e.g. en1, us1 etc).
Properties: Can contain other non-emotion tags. Example:

<speaker gender=female>Im a young woman.</speaker>
volume
Description: Sets the speaking volume. Note, this element sets only the volume, and does not change voice quality (e.g. quiet is not a whisper). Attributes:
Name level Description Sets the volume level of the contained text. Can be a percentage value, or a descriptive term. Values (+/-)N%, loud. soft, normal,
Properties: Can contain other non-emotion tags. Example:

<volume level=soft>You can barely hear me.</volume>
Appendix B SML DTD

<!-####################################################################### ## # Speech Markup Language (SML) DTD, version 1.0. # # Usage: # <!DOCTYPE sml SYSTEM "./sml-v01.dtd"> # # Author: John Stallo # Date: 11 September, 2000. # ######################################################################## --> <!ENTITY % LowLevelElements "pitch | rate | emph | pause | volume | pron | speaker"> <!ENTITY % HighLevelElements "angry | happy | sad | neutral | embed"> <!ELEMENT sml (p)+> <!ELEMENT p (%HighLevelElements;)+> <!ELEMENT speaker (#PCDATA | %LowLevelElements;)*> <!ATTLIST speaker gender (male | female) "male" name (en1 | us1) "en1"> <!-####################################### # HIGH-LEVEL TAGS ####################################### -->  <!ELEMENT angry (#PCDATA | %LowLevelElements;)*> <!ELEMENT happy (#PCDATA | %LowLevelElements;)*> <!ELEMENT sad (#PCDATA | %LowLevelElements;)*> <!ELEMENT neutral (#PCDATA | %LowLevelElements;)*>  <!ELEMENT embed EMPTY>
<!ATTLIST embed type CDATA #REQUIRED src CDATA #IMPLIED music_file CDATA #IMPLIED lyr_file CDATA #IMPLIED> <!-####################################### # LOW-LEVEL TAGS ####################################### --> <!ELEMENT pitch (#PCDATA | %LowLevelElements;)*> <!ATTLIST pitch middle CDATA #IMPLIED base (low | medium | high) "medium" range CDATA #IMPLIED> <!ELEMENT rate (#PCDATA | %LowLevelElements;)*> <!ATTLIST rate speed CDATA #REQUIRED>  <!ELEMENT emph (#PCDATA)> <!ATTLIST emph level (weakest | weak | moderate | strong) "weak" affect CDATA #IMPLIED target CDATA #IMPLIED> <!ELEMENT pause EMPTY> <!ATTLIST pause length (short | medium | long) "medium" msec CDATA #IMPLIED smooth (yes | no) "no"> <!ELEMENT volume (#PCDATA)> <!ATTLIST volume level CDATA #REQUIRED> <!ELEMENT pron (#PCDATA)> <!ATTLIST pron sub CDATA #REQUIRED>
Appendix C Festival and Visual C++

Compiling Festival with Visual C++
I've recently been able to get Festival 1.4.1 and the Edinburgh Speech Tools Library successfully compiled using Visual C++ 6.0 under Microsoft Windows NT 4.0. There were many but simple problems, so I started to keep track of what I did to get it to work so the process could be repeated if, God forbid, I needed to recompile it in the future. Having said this, you'll realise I can't make any guarantees that the changes described in this document will work, or are all that is required for compiling Festival on your system, but I've made this available with the hope it may prove helpful to someone. First things first, make sure you read the README and INSTALL files that come with Festival. These two files provide important information about the Festival package, on setting it up and compiling. Also, read the manual! You can find an online version here. Also, in every file that you modify, leave some sort of comment stating what was changed, why it was changed, how to undo the modification, and who changed it! I encapsulated my modifications in the following way: ///////////////////////////////////// // STALLO MODIFICATION START modified code... // STALLO MODIFICATION END ///////////////////////////////////// An important note to be made is that these modifications were not suggested by anyone from Festival, and so some parts of this document may very well be erroneous. All I know is that I've been able to get it working with these changes in place! My advice would be to first follow the instructions given in Festival's INSTALL and README files. If you come across any problems, start looking through this document to see if there's anything that may help you. Something I remarked is that most of the problems were in compiling the Speech Tools library. Once this was done, compiling Festival met with very little difficulty.
Building the Edinburgh Speech Tools Library

The Edinburgh Speech Tools Library should be compiled before Festival (all files are found under the speech_tools directory). Configuration Before carrying out the configuration instructions given in INSTALL: 1. 2. Make sure you have a recent version of the Cygnus Cygwin make utility working on your system. You can download the Cygwin package here. We used version b20.1. Make changes to the following files: a. b. config/system.sh - Statement obtaining OSREV didn't work on our system. Hardcode value (i.e. OSREV=20.1) config/system.mak - This file shouldn't really be hand edited as it's created automatically from the config file, but if the system gets confused, you may need to change the value of OSREV from 1.0 to 20.1.
3.
Now carry out the configuration instructions.
Creating VCMakefiles From memory, there were no mishaps here. Only, again, make sure the make utility you're using is recent, preferably from the Cygnus Cygwin package version b20.1 (we had trouble with earlier versions). Building To build the system: nmake /nologo /fVCMakefile However, you'll probably incur a number of compile errors. The following is a list of changes I had to make: SIZE_T redefinition in siod/editline.h. SIZE_T is redefined in editline.h since it is already defined in a Windows include file. Fix by inserting compiler directive to only define SIZE_T if it hasn't been defined before, or simply don't define it (i.e. use #if 0 ... #endif). "unknown size" error for template in base_class/EST_TNamedEnum.cc. nmake doesn't seem to like the syntax used in template void EST_TValuedEnumI<ENUM,VAL,INFO::initialise(const void *vdefs, ENUM (*conv)(const char *)). Go to the line it's grumbling about, and do the following: comment this line out: (const struct EST_TValuedEnumDefinition *defs = (const struct EST_TValuedEnumDefinition *)vdefs; insert these 2 lines (__my_defn can be anything, just make sure it doesn't conflict with another typedef!): typedef EST_TValuedEnumDefinition __my_defn; const __my_defn *defs = (const __my_defn *)vdefs;
Multiply defined symbols in base_class/inst_templ/vector_fvector_t.cc EST_TVector.cc is already included in vector_f_.cc, so a conflict is occurring. I'm not sure nmake would be the only one to complain about this though. Comment out this line: #include "../base_class/EST_TVector.cc"
T * const p_contents never initialised in EST_TBox constructor (include/EST_TBox.h) nmake wants p_contents to be initialised in the constructor using an initializer list. Comment EST_TBox constructor out: EST_TBox(void) {}; Define EST_TBox constructor with an initializer list: EST_TBox (void) : p_contents (NULL) {}; I don't think it matters what p_contents is initialised to since the inline comment says that the constructor will never be called anyway.
Building Festival
If I recall correctly, once the Speech Tools Library was successfully compiled, there were no problems compiling Festival. However, once you run any of the executables such as festival.exe, pay attention to any
error messages appearing at startup. If most functions are unavailable to you in the Scheme interpreter, it's probably because it can't find the lib directory that contains the Scheme (.scm) files.
Using the Festival C/C++ API

(Linking the Festival libraries with your own source code) Follow the directions given in the Festival manual for using Festival's C/C++ API. You'll need to include the appropriate header files, and make sure you have the include paths set correctly within the Visual C++ environment (these are done through Project-->Settings-->C/C++). Try compiling the sample program given in the manual that uses the Festival functions.
Compiling
If you've included the right header files, and the include paths are correct, you should have no trouble. However, make sure your program source files have a .cpp extension, and not just .c - you'll get all sorts of compilation errors. (Take it from someone who lost a lot of time over this silly error!) Also, make sure SYSTEM_IS_WIN32 is defined by adding it to the pre-processor definitions in Project-->Settings-->C/C+ +.
Linking
The manual gives instructions on which Festival and Speech Tool libraries your program needs to link with. (Note: the manual mentions .a UNIX library files. You should look for .lib files instead.) Add linking information through Project-->Settings-->Link. In addition to the libraries specified in the manual, you'll need to link two others libraries: wsock32.lib - for the socket functions. winmm.lib - for multimedia functions such as PlaySound().
These two libraries should already be on your system, and Visual C++ should already be able to find them. That's it! Good luck with your work :) If you've found this document to be helpful, please drop me an email - I'd like to know if it's proving useful for others. If you have found errors in this document, or have other suggestions for getting around problems, I'd especially like to hear from you! I also welcome any questions you may have. You can contact me at stalloj@cs.curtin.edu.au. Finally, my warm thanks to Alan W Black, Paul Taylor, and Richard Caley for making the Festival Speech Synthesis System freely available to all of us :)
Appendix D Evaluation Questionnaire

SIMULATING EMOTIONAL SPEECH FOR A TALKING HEAD QUESTIONNAIRE
PLEASE NOTE: You do NOT have to take part in this questionnaire. If you find any of these questions intrusive feel free to leave them unanswered. Any data collected will remain strictly confidential, and anonymity will be preserved.
SECTION 1 - PERSONAL
Age: ____________
AND
BACKGROUND DETAILS
t Male
Gender: t Female
In which country have you lived for most of your life? ______________________________________________________________________________________ Is English your first spoken language? t Yes Where do you use computers? t Home t Work t No
t School
t I Dont
t Other _____________
The next section will commence shortly. PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO THANKYOU
SECTION 2 EMOTION RECOGNITION

Part A.
For each of the speech samples played, choose ONE which you think is most appropriate for how the speaker sounds. 1. The telephone has not rung at all today. a) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------b) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------c) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------d) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------THE PREVIOUS 4 EXAMPLES WILL NOW BE REPLAYED ----------------------------------------------------------------------------------------------------------------------2. I received my assignment mark today. a) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------b) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------c) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------d) t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ ----------------------------------------------------------------------------------------------------------------------THE PREVIOUS 4 EXAMPLES WILL NOW BE REPLAYED -----------------------------------------------------------------------------------------------------------------------
Part B.
For each of the speech sample played, choose ONE which you think is most appropriate for how the speaker sounds. 1. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------2. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------3. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------4. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------5. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED -----------------------------------------------------------------------------------------------------------------------6. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------7. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------8. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------9. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------10. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED ------------------------------------------------------------------------------------------------------------------------
11.
t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------12. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------13. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------14. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------15. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED -----------------------------------------------------------------------------------------------------------------------16. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------17. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------18. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------19. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------20. t No emotion t Happy t Sad t Angry t Disgusted t Surprised t Other __________________ -----------------------------------------------------------------------------------------------------------------------THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED ------------------------------------------------------------------------------------------------------------------------
SECTION 3 TALKING HEAD

Please look and listen at the Talking Head examples before filling out this section. [TALKING HEAD DEMONSTRATIONS] Of the two examples just played: a) Was one example easier to understand than the other? (Please tick ONE box only) t The FIRST example was easier to understand. t The SECOND example was easier to understand. t Neither was easier to understand than the other. Reason for your choice: _________________________________________________________ _________________________________________________________ _________________________________________________________ b) Did one speaker seem best able to express itself? (Please tick ONE box only) t The FIRST speaker seemed best able to express itself. t The SECOND speaker seemed best able to express itself. t Neither was able to express itself better than the other. Reason for your choice: _________________________________________________________ _________________________________________________________ _________________________________________________________ c) Did one example seem more natural than the other? (Please tick ONE box only) t The FIRST example seemed more natural. t The SECOND example seemed more natural. t Neither seemed more natural than the other. Reason for your choice: _________________________________________________________ _________________________________________________________ _________________________________________________________ d) Did one example seem more interesting to you? (Please tick ONE box only) t The FIRST example seemed more interesting. t The SECOND example seemed more interesting. t Neither seemed more interesting than the other. Reason for your choice: _________________________________________________________ _________________________________________________________ _________________________________________________________
SECTION 4 GENERAL
Prior to this demonstration:
Had you ever heard of speech synthesis?
t Yes t Yes
t No
t No
Had you seen an animated Talking Head before?
If so, can you remember its name? ______________________________
Any further comments you could give about the Talking Head and/or its voice would be greatly appreciated. __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________
END OF QUESTIONNAIRE THANK YOU VERY MUCH FOR YOUR HELP
Appendix E Test Phrases for Questionnaire, Section 2B

Neutral phrases 1. I saw her walking along the beach this morning. 2. The book is lying on the table. 3. My creator is a human being. 4. I have an appointment at 2 oclock tomorrow. 5. Jim came to basketball training last night.
Emotional phrases 6. I have some wonderful news for you. (Happiness) 7. I cannot come to your party tomorrow. (Sadness) 8. I would not give you the time of day. (Anger) 9. Smoke comes out of a chimney. (Neutral) 10. Dont tell me what to do. (Anger)

Simulating Emotional Speech For A Talking Head November 2000

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Simulating Emotional Speech For A Talking Head November 2000

Uploaded by

Copyright:

Available Formats

Simulating Emotional Speech for a Talking Head