November 2000
Contents
1 Introduction ... 1
2 Problem Description ... 2
  2.1 Objectives ... 2
  2.2 Subproblems ... 2
  2.3 Significance ... 3
3 Literature Review ... 5
  3.1 Emotion and Speech ... 5
  3.2 The Speech Correlates of Emotion ... 6
  3.3 Emotion in Speech Synthesis ... 8
  3.4 Speech Markup Languages ... 9
  3.5 Extensible Markup Language (XML) ... 10
    3.5.1 XML Features ... 10
    3.5.2 The XML Document ... 11
    3.5.3 DTDs and Validation ... 12
    3.5.4 Document Object Model (DOM) ... 14
    3.5.5 SAX Parsing ... 15
    3.5.6 Benefits of XML ... 16
    3.5.7 Future Directions in XML ... 17
  3.6 FAITH ... 18
  3.7 Resource Review ... 20
    3.7.1 Text-to-Speech Synthesizer ... 20
    3.7.2 XML Parser ... 22
  3.8 Summary ... 23
4 Research Methodology ... 25
  4.1 Hypotheses ... 25
  4.2 Limitations and Delimitations ... 26
    4.2.1 Limitations ... 26
    4.2.2 Delimitations ... 26
  4.3 Research Methodologies ... 27
5 Implementation ... 28
  5.1 TTS Interface ... 28
    5.1.2 Module Inputs ... 29
    5.1.3 Module Outputs ... 30
    5.1.4 C/C++ API ... 31
  5.2 SML Speech Markup Language ... 32
    5.2.1 SML Markup Structure ... 32
  5.3 TTS Module Subsystems Overview ... 34
  5.4 SML Parser ... 35
  5.5 SML Document ... 36
    5.5.1 Tree Structure ... 36
    5.5.2 Utterance Structures ... 37
  5.6 Natural Language Parser ... 39
    5.6.1 Obtaining a Phoneme Transcription ... 40
    5.6.2 Synthesizing in Sections ... 42
    5.6.3 Portability Issues ... 43
  5.7 Implementation of Emotion Tags ... 44
    5.7.1 Sadness ... 45
    5.7.2 Happiness ... 46
    5.7.3 Anger ... 47
    5.7.4 Stressed Vowels ... 48
    5.7.5 Conclusion ... 48
  5.8 Implementation of Low-level SML Tags ... 49
    5.8.1 Speech Tags ... 49
    5.8.2 Speaker Tag ... 53
  5.9 Digital Signal Processor ... 54
  5.10 Cooperating with the FAML module ... 55
  5.11 Summary ... 57
7 Future Work ... 82
  7.1 Post Waveform Processing ... 82
  7.2 Speaking Styles ... 83
  7.3 Speech Emotion Development ... 84
  7.4 XML Issues ... 85
  7.5 Talking Head ... 86
  7.6 Increasing Communication Bandwidth ... 87
11 Appendix B - SML DTD ... 102
12 Appendix C - Festival and Visual C++ ... 104
13 Appendix D - Evaluation Questionnaire ... 107
14 Appendix E - Test Phrases for Questionnaire, Section 2B ... 113
List of Figures
Figure 1 - An XML document holding simple weather information ... 11
Figure 2 - Sample section of a DTD file ... 12
Figure 3 - XML syntax error - list and item tags incorrectly matched ... 13
Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag) ... 13
Figure 5 - Well-formed XML document that also follows DTD grammar specification. Will not produce any parse errors ... 13
Figure 6 - DOM representation of XML example ... 15
Figure 7 - FAITH project architecture ... 19
Figure 8 - Talking Head being developed as part of the FAITH project at the School of Computing, Curtin University of Technology ... 20
Figure 9 - Top level outline showing how Festival and MBROLA systems were used together ... 21
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head ... 28
Figure 11 - Top-level structure of an SML document ... 32
Figure 12 - Valid SML markup ... 33
Figure 13 - Invalid SML markup ... 33
Figure 14 - TTS module subsystems ... 34
Figure 15 - Filtering process of unknown tags ... 36
Figure 16 - SML Document structure for SML markup given above ... 37
Figure 17 - Utterance structures to hold the phrase "the moon". U = CTTS_UtteranceInfo object; W = CTTS_WordInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint object ... 39
Figure 18 - Tokenization of a part of an SML Document ... 40
Figure 19 - SML Document sub-tree representing example SML markup ... 41
Figure 20 - Raw timeline showing server and client execution when synthesizing example SML markup above ... 43
Figure 21 - Multiply factors of pitch and duration values for emphasized phonemes ... 50
Figure 22 - Processing a pause tag ... 51
Figure 23 - The effect of widening the pitch range of an utterance ... 52
Figure 24 - Processing the pron tag ... 52
Figure 25 - Example MBROLA input ... 55
Figure 26 - Example utterance information supplied to the FAML module by the TTS module. Example phrase: "And now the latest news." ... 56
Figure 27 - A node carrying waveform processing instructions for an operation ... 83
Figure 28 - Insertion of new submodule for post waveform processing ... 83
Figure 29 - SML Markup containing a link to a stylesheet ... 84
Figure 30 - Inclusion of an XML Handler module to centrally manage XML input ... 85
Figure 31 - Proposed design of TTS Module architecture to minimize bandwidth problems between server and client ... 87
List of Tables
Table 1 - Summary of human vocal emotion effects ... 8
Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness ... 44
Table 3 - Speech correlate values implemented for sadness ... 45
Table 4 - Speech correlate values implemented for happiness ... 46
Table 5 - Speech correlate values implemented for anger ... 47
Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch ... 48
Table 7 - MBROLA command line option values for en1 and us1 diphone databases to output male and female voices ... 54
Table 8 - Statistics of participants ... 63
Table 9 - Confusion matrix template ... 64
Table 10 - Confusion matrix with sample data ... 65
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions ... 65
Table 12 - Listener response data for neutral phrases spoken with happy emotion ... 66
Table 13 - Section 2A listener response data for neutral phrases ... 67
Table 14 - Listener response data for Section 2A, Question 1 ... 68
Table 15 - Listener response data for Section 2A, Question 2 ... 68
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion ... 70
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion ... 71
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion ... 72
Table 19 - Listener responses for utterances containing emotive text with vocal emotion ... 73
Table 20 - Percentage of listeners who improved in emotion recognition with the addition of vocal emotion effects for neutral text ... 74
Table 21 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for neutral text ... 74
Table 22 - Percentage of listeners whose emotion recognition improved with the addition of vocal emotion effects for emotive text ... 75
Table 23 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for emotive text ... 75
Table 24 - Listener responses for participants who speak English as their first language. Utterance type is neutral text, emotive voice ... 76
Table 25 - Listener responses for participants who do NOT speak English as their first language. Utterance type is neutral text, emotive voice ... 76
Table 26 - Listener responses for participants who speak English as their first language. Utterance type is emotive text, emotive voice ... 77
Table 27 - Listener responses for participants who do NOT speak English as their first language. Utterance type is emotive text, emotive voice ... 77
Table 28 - Participant responses when asked to choose the Talking Head that was more understandable ... 78
Table 29 - Participant responses when asked which Talking Head seemed best able to express itself ... 79
Table 30 - Participant responses when asked which Talking Head seemed more natural ... 80
Table 31 - Participant responses when asked which Talking Head seemed more interesting ... 80
Chapter 1 Introduction
When we talk, we produce a complex acoustic signal that carries information in addition to the verbal content of the message. Vocal expression tells others about the emotional state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of the words. Because of this, listeners expect to hear vocal effects, paying attention not only to what is being said, but to how it is said. The problem with current speech synthesizers is that the effect of emotion on speech is not taken into account, producing output that sounds monotone or, at worst, distinctly machine-like. As a result, the ability of a Talking Head to express its emotional state will be adversely affected if it uses a plain speech synthesizer to "talk".

The objective of this research was to develop a system that is able to incorporate emotional effects in synthetic speech, and thus improve the perceived naturalness of a Talking Head. This thesis reviews the literature in the fields of speech emotion, speech synthesis, and XML. XML features prominently in this thesis because it was the vehicle chosen for directing how the synthetic voice should sound; it also had considerable impact on how speech information was processed.

The design and implementation details of the project are discussed to describe the developed system. An in-depth analysis of the project's evaluation data is then given, concluding with a discussion of the future work that has been identified.
2.2 Subproblems
A number of subproblems were identified in order to develop a system that meets the stated objectives.

1. Design and implementation of a speech markup language. It was desirable that the markup language be XML-based; the reasons for this will become apparent later in the thesis. The role of the speech markup language (SML) is to provide a way to specify the emotion in which a text segment is to be rendered. In addition, it was decided to extend the markup to provide a mechanism for manipulating generally useful speech properties such as rate, pitch, and volume. SML was designed to closely follow the SABLE specification, described by Sproat et al. (1998).

2. Evaluation of each of the existing text-to-speech (TTS) submodules of the Talking Head. Its aim was to determine what could and could not be reused. This included assessing the existing TTS module's API, and the modules that interface with other subsystems of the Talking Head (namely the MPEG-4 subsystem).

3. Cooperative integration with modules that were being written concurrently for the Talking Head, namely the gesture markup language being developed by Huynh (2000). The collaboration between the two subprojects was aimed at providing the Talking Head with synchronized vocal expressions and facial gestures. An architecture specification for facial and speech synchronization is given by Ostermann et al. (1998).

4. Since the Talking Head is being developed to run over a number of platforms (Win32, Linux, and IRIX 6.3), it was crucial that the new TTS module would not hamper efforts to make the Talking Head a platform-independent application.
2.3 Significance
The project is significant because, despite the important role that the display of emotion plays in human communication, current text-to-speech synthesizers do not cater for its effect on speech. Research into adding emotion effects to synthetic speech is ongoing, notably by Murray and Arnott (1996), but has mainly been restricted to standalone systems rather than forming part of a Talking Head, as this project set out to do. Increased naturalness in synthetic speech is seen as important for its acceptance (Scherer, 1996), and this is likely to be the case for applications of Talking Head technology as well. This thesis attempts to address that need. Advances in this area will also benefit work in the fields of speech analysis, speech recognition, and speech synthesis when dealing with natural variability. This is because work with the speech correlates of emotion will help support or disprove speech correlates identified in speech analysis, help in proper feature extraction for the automatic recognition of emotion in the voice, and generally improve synthetic speech production.
frighten a would-be assailant, with the body tensing for a possible confrontation. The expression of emotion through speech also serves to communicate to others our judgement of a particular situation. Importantly, vocal changes due to emotion may in fact be cross-cultural in nature, though this may only be true for some emotions, and further work is required to ascertain this for certain (Murray, Arnott and Rohwer, 1996). We also deliberately use vocal expression in speech to communicate various meanings. A sudden pitch change will make a syllable stand out, highlighting the associated word as an important component of the utterance (Dutoit, 1997). A speaker will also pause at the end of key sentences in a discussion to allow listeners the chance to process what was said, and a phrase's pitch will rise towards the end to denote a question (Malandro, Barker and Barker, 1989). When something is said in a way that seems to contradict the actual spoken words, we will usually accept the vocal meaning over the verbal meaning. For example, the expression "thanks a lot" spoken in an angry tone will generally be taken in a negative way, and not as a compliment, as the literal meaning of the words alone would suggest. This underscores the importance we place on the vocal information that accompanies the verbal content.
3. The content is ignored altogether, either by using equipment designed to extract various speech attributes, or by filtering out the content. The latter technique involves applying a low-pass filter to the speech signal, thus eliminating the high frequencies that word recognition depends upon. (This meets with limited success, however, since some of the vocal information also resides in the high frequency range.)

The problem of speech parameter identification is further compounded by the subjective nature of these tests. This is evident in the literature, as results taken from numerous studies rarely agree with each other. Nevertheless, a general picture of the speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion (Cahn, 1990; Murray, Arnott and Rohwer, 1996):

Pitch contour. The intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance. Pitch is expressed as fundamental frequency (F0). Parameters include average pitch, pitch range, contour slope, and final lowering.

Timing. Describes the speed at which an utterance is spoken, as well as its rhythm and the duration of emphasized syllables. Parameters include speech rate, hesitation pauses, and exaggeration.

Voice quality. The overall character of the voice, which includes effects such as whispering, hoarseness, breathiness, and intensity.
It is believed that combinations of values of these speech parameters are used to express vocal emotion. Table 1 summarizes human vocal emotion effects for four of the so-called basic emotions: anger, happiness, sadness, and fear (Murray and Arnott, 1993; Galanis, Darsinos and Kokkinakis, 1996; Cahn, 1990; Davitz, 1976; Scherer, 1996). The parameter descriptions are relative to neutral speech.
Parameter     | Anger                               | Happiness                  | Sadness              | Fear
Speech rate   | Faster                              | Slightly faster            | Slightly slower      | Much faster
Pitch average | Very much higher                    | Much higher                | Slightly lower       | Very much higher
Pitch range   | Much wider                          | Much wider                 | Slightly narrower    | Much wider
Intensity     | Higher                              | Higher                     | Lower                | Higher
Pitch changes | Abrupt, downward, directed contours | Smooth, upward inflections | Downward inflections | Downward terminal inflections
Voice quality | Breathy, chesty tone (1)            | -                          | Resonant (1)         | Irregular voicing (1)
Articulation  | Clipped                             | -                          | Slurred              | Precise

(1) Terms used by Murray and Arnott (1993).
Table 1 - Summary of human vocal emotion effects.
The summary should not be taken as a complete and final description, but rather is meant as a guideline only. For instance, the table above emphasizes the role of fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as referred to in Murray and Arnott, 1993) notes that whispered speech is able to convey emotion, even though whispering makes no use of the voice's fundamental frequency. Nevertheless, being able to succinctly describe vocal expression like this has significant benefits for simulating emotion in synthetic speech.
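Described this way, the correlates map naturally onto a small set of control parameters. The following is a minimal C++ sketch of such a parameter set; the structure, field names, and numeric values are illustrative assumptions only, not the representation used by the system described later in this thesis.

#include <string>

// Relative speech-correlate settings for one emotion, expressed as
// scales against neutral speech (all values are illustrative only).
struct EmotionCorrelates {
    double speechRateScale;    // e.g. 1.1 = slightly faster than neutral
    double pitchAverageScale;  // e.g. 1.5 = much higher average F0
    double pitchRangeScale;    // e.g. 2.0 = much wider pitch range
    double intensityScale;     // e.g. 1.2 = higher intensity
    std::string pitchChanges;  // qualitative description of the contour
};

// Hypothetical entry for happiness, following the shape of Table 1.
const EmotionCorrelates kHappiness = {1.1, 1.5, 2.0, 1.2,
                                      "smooth, upward inflections"};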
carefully constructed rules. Two of the better known systems capable of adding emotion-by-rule effects to speech are the Affect Editor, developed by Cahn (1990b), and HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott and Newell, 1988). Both systems make use of the DECtalk text-to-speech synthesizer, mainly because of its extensive control parameter features. Future work is concerned with building a solid model of emotional speech, as this area is seen as being limited by our understanding of vocal expression and by the quality of the speech correlates used to describe emotional speech (Cahn, 1988; Murray and Arnott, 1995; Scherer, 1996). Although not within the scope of the project, it is worth mentioning that research is being undertaken in concept-to-speech synthesis. This work is aimed at improving the intonation of synthetic speech by using extra linguistic information (i.e. tagged text) provided by another system, such as a natural language generation (NLG) system (Hitzeman et al., 1999). Variability in speech is also being investigated in the area of speech recognition, with the aim of possibly developing computer interfaces that respond differently according to the emotional state of the user (Dellaert, Polzin and Waibel, 1996). Another avenue for future research could be to incorporate the effects of facial gestures on speech. For instance, Hess, Scherer and Kappas (1988) noted that voice quality is judged to be friendly over the phone when a person is smiling. A model that could cater for this would have extremely beneficial applications for recent work concerned with the synchronization of facial gestures and emotive speech in Talking Heads. Finally, simulating emotion in synthetic speech not only has the potential to build more realistic speech synthesizers (and hence provide the benefits that such a system would offer), but will also add to our understanding of speech emotion itself.
Most research and commercial systems allow for such an annotation scheme, but almost all are synthesizer dependent, thus making it extremely difficult for software developers to build programs that can interface with any speech synthesizer. Recent moves by industry leaders to standardize a speech markup language have led to the draft specification of SABLE, a system-independent, SGML-based markup language (Sproat et al., 1998). The SABLE specification has evolved from three existing speech synthesis markup languages: SSML (Taylor and Isard, 1997), STML (Sproat et al., 1997), and Java's JSML.
2. Structure - the structure of an XML document can be nested to any level of complexity, since it is the author who defines the tag set and grammar of the document.

3. Validation - if a tag set and grammar definition is provided (usually via a Document Type Definition (DTD)), then applications processing the XML document can perform structural validation to make sure it conforms to the grammar specification. So, although the nested structure of an XML document can be quite complex, the fact that it follows a very rigid guideline makes document processing relatively easy.
<?xml version="1.0"?>
<weather-report>
  <date>March 25, 1998</date>
  <time>08:00</time>
  <area>
    <city>Perth</city>
    <state>WA</state>
    <country>Australia</country>
  </area>
  <measurements>
    <skies>partly cloudy</skies>
    <temperature>20</temperature>
    <h-index>51</h-index>
    <humidity>87</humidity>
    <uv-index>1</uv-index>
  </measurements>
</weather-report>
Figure 1 - An XML document holding simple weather information.
One of the main observations that should be made for the example given in Figure 1 is that an XML document describes only the data, and not how it should be viewed. This is unlike HTML, which forces a specific view and does not provide a good mechanism for data description (Graham and Quinn, 1999). For example, HTML tags such as P, DIV, and TABLE describe how a browser is to display the encapsulated text, but are
inadequate for specifying whether the data describes an automotive part, a section of a patient's health record, or the price of a grocery item. The fact that an XML document is encoded in plain text was a conscious decision made by the XML designers: the design of a system-independent and vendor-independent solution (Bosak, 1997). Although text files are usually larger than comparable binary formats, this can easily be compensated for using freely available utilities that compress files efficiently, both in terms of size and time. At worst, the disadvantages associated with an uncompressed plain text file are deemed to be outweighed by the advantages of a universally understood and portable file format that does not require special software for encoding and decoding.
Extending this example, the different levels of validation performed by an XML parser can be seen. Figure 3 shows an XML document that does not meet the syntax specified in the XML specification.
<?xml version="1.0"?>
<list><item>
Item 1
</list></item>
Figure 3 - XML syntax error - list and item tags incorrectly matched.
Figure 4 shows a well-formed XML document (i.e. it follows the XML syntax), but does not follow the grammar specified in the linked DTD file. (The DTD file is the one given in Figure 2).
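Figure 2 itself is not reproduced here. Purely as an illustration, a minimal DTD consistent with the list/item examples that follow might read as below; the exact content model used in the thesis's DTD file is an assumption.

<!ELEMENT list (item+)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item type CDATA #IMPLIED>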
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "list-dtd-file.dtd">
<list>
  <item>Item 1</item>
  <item>Item 2</item>
</list>
<item>Item 3</item>
Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag).
Figure 5 shows a well-formed XML document that also meets the grammar specification given in the DTD file.
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "list-dtd-file.dtd">
<list>
  <item>Item 1</item>
  <item type="x">Item 2</item>
  <item>Item 3</item>
</list>
Figure 5 - Well-formed XML document that also follows the DTD grammar specification. Will not produce any parse errors.
The XML Recommendation states that any parse error detected while processing an XML document will immediately cause a fatal error (Extensible Markup Language, 1998): the XML document will not be processed any further, and the application will not attempt to second-guess the author's intent. Note that the DTD does NOT define how the data should be viewed either. Also, the DTD is able to define which sub-elements can occur within an element, but not the order in which they occur; the same applies for attributes specified for an element. For this reason, an application processing an XML document should avoid being dependent on the order of given tags or attributes.
a.

<weather-report>
  <date>October 30, 2000</date>
  <time>14:40</time>
  <measurements>
    <skies>Partly cloudy</skies>
    <temperature>18</temperature>
  </measurements>
</weather-report>

b. [Tree diagram: the DOM representation of the document in (a), with a weather-report root node whose child element nodes are date, time, and measurements; measurements contains skies and temperature, and each element node points to a text node holding its character data ("October 30, 2000", "14:40", "Partly cloudy", "18").]

Figure 6 - DOM representation of XML example.
A SAX handler, on the other hand, can process very large documents since it does not keep the entire document in memory during processing. SAX, the Simple API for XML, is a standard interface for event-based XML parsing (SAX 2.0, 2000). Instead of building a structure representing the entire XML document, SAX reports parsing events (such as the start and end of tags) to the application through callbacks.
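As an illustration only (SAX was not used for this project; see Section 3.7.2), a minimal event-based parse using the SAX interface of libxml2, a later incarnation of the libxml library discussed below, might look like the following sketch. The handler fields and the xmlSAXUserParseFile call are taken from the libxml2 API; the file name and printed output are arbitrary.

#include <cstdio>
#include <cstring>
#include <libxml/parser.h>

// Callback fired for every opening tag encountered in the stream.
static void onStartElement(void *, const xmlChar *name, const xmlChar **) {
    std::printf("start of <%s>\n", (const char *)name);
}

// Callback fired for every closing tag.
static void onEndElement(void *, const xmlChar *name) {
    std::printf("end of <%s>\n", (const char *)name);
}

int main() {
    xmlSAXHandler handler;                     // table of event callbacks
    std::memset(&handler, 0, sizeof(handler));
    handler.startElement = onStartElement;
    handler.endElement   = onEndElement;
    // The parser streams the document and reports events; no full tree is kept.
    return xmlSAXUserParseFile(&handler, nullptr, "weather.xml");
}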
client computer, reducing the server's workload and thus enhancing server scalability.

Embedding of multiple data types - XML documents can contain virtually any kind of data type, such as images, sound, video, URLs, and active components such as Java applets and ActiveX.

Data delivery - since XML documents are encoded in plain text, data delivery can be performed on existing networks, sent using HTTP just like HTML.

Combined with the XML features discussed in Section 3.5.1, the above list underscores the enormous potential of XML. Indeed, the extent of these benefits makes XML a core component in a wide range of applications: from the dissemination of information in government agencies to the management of corporate logistics; providing telecommunication services; XML-based prescription drug databases to help pharmacists advise their customers; simplifying the exchange of complex patient records obtained from different data sources; and much more (SoftwareAG, 2000a).
XML Namespaces - a specification that describes how a URL can be associated with every tag and attribute within an XML document.

XML Schemas (Parts 1 and 2) - aimed at helping developers to define their own XML-based formats.

Future applications of some of these XML components to this project are discussed later in this thesis. For more information on these emerging technologies, see the XML Cover Pages (Cover, 2000).
3.6 FAITH
A very brief description of the FAITH project is required in order to gain an understanding of where the TTS module fits within the Talking Head architecture. Figure 7 shows a simplified view of the various subsystems that make up the Talking Head.
Figure 7 - FAITH project architecture. [Diagram: on the server side, the Brain, TTS, Personality, and FAML modules; on the client side, the MPEG-4 subsystem and the User Interface. Text questions pass from the client to the server; text to synthesise passes from the Brain to the TTS module; waveforms and FAPs (including visemes) are returned to the client via MPEG-4.]
As Figure 7 shows, the architecture has been designed to fit a client/server framework; the client is responsible for interfacing with the user, and is where the Talking Head is rendered, audio is played, extra information is displayed, and so on. The server accepts a user's text input (such as questions or dialogue), and processing is carried out in the following order:

The Brain module, developed by Beard (1999), processes the user's text input and forms an appropriate response. The response is then sent to the TTS module.

The TTS module is responsible for producing the speech equivalent of the Brain's text response and outputs a waveform. It also outputs viseme information in MPEG-4's FAP (Facial Animation Parameter) format, and passes this to the modules responsible for generating more FAP values. Visemes are the visual equivalent of speech phonemes (e.g. the mouth forms a specific shape when saying "oo"). Generated FAP values can be used to move specific points of the Talking Head's face (to produce head movements, blinking, gestures, etc.), which is the purpose of the next two subsystems.

The Personality module's role is to generate MPEG-4 FAP values with the goal of simulating various personality traits such as friendliness and dominance (Shepherdson, 2000).

The FAML module's role is to generate MPEG-4 FAP values to display the various head movements, facial gestures, and expressions specified through a special markup language (Huynh, 2000).

As the diagram shows, communication between the client and server is done via an implementation of the MPEG-4 standard (Cechner, 1999). For a more detailed description of the FAITH project, see Beard et al. (2000). (The reader should note, however, that the paper describes the old TTS module, and not the newer TTS module described in this thesis.) Figure 8 shows one of the models rendered on the client side with which the user interacts.
Figure 8 - Talking Head being developed as part of the FAITH project at the School of Computing, Curtin University of Technology.
Festival is a widely recognized research project developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh, with the aim of offering a free, high quality text-to-speech system for the advancement of research (Black, Taylor and Caley, 1999). The MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is a free multi-lingual speech synthesizer developed with aims similar to Festival's (MBROLA Project Homepage, 2000).
Figure 9 - Top-level outline showing how the Festival and MBROLA systems were used together: Text -> NLP (Festival) -> Phonemes, pitch and duration -> DSP (MBROLA) -> Waveform.
It was decided for this project to use the Festival system as the natural language parser (NLP) component of the module: it accepts text as input and transcribes it to its phoneme equivalent, plus duration and pitch information. This information can then be given to the MBROLA synthesizer, acting as the digital signal processing (DSP) unit, which produces a waveform from it. Although Festival has its own DSP unit, it was found that the Festival + MBROLA combination produces the best quality output. It is important to note that the Festival system supports MBROLA in its API. Because MBROLA requires a phoneme-duration-pitch input format, it provides very fine pitch and timing control for each phoneme in the utterance. As stated before, this level of control is simply unattainable with commercial systems, except
DECtalk. The advantage of using MBROLA over DECtalk, however, is that once a phoneme's pitch is altered in the latter system, the generated pitch contour is overwritten. Cahn (1990) first mentioned this problem, and as a result did not manipulate the utterance at the phoneme level; this limited the amount of control available, which ultimately hindered the quality of the simulated emotion. To overcome this, Murray and Arnott (1995) had to write their own intonation model to replace the DECtalk-generated pitch contour when they changed pitch values at the phoneme level. Fortunately, this is not an issue with MBROLA, as changes to the pitch and duration values can be made before they are passed to MBROLA (as Figure 9 shows). Therefore, the Festival + MBROLA option offers a degree of control comparable to the DECtalk synthesizer, with the benefit of less complexity. The use of Festival and MBROLA also addressed the platform-independence subproblem described in Section 2.2. Although Festival was developed mainly for the UNIX platform, its source code can be ported to the Win32 platform with relatively minor modifications. The MBROLA Homepage offers binaries for many platforms, including Win32, Linux, most Unix versions, BeOS, Macintosh, and more. Before the final decision was made to use the Festival system, however, an important issue required investigation. The previous TTS module of the Talking Head did not use the Festival system because, although it was acknowledged that Festival's output is of a very high quality, the computation time was deemed far too expensive for an interactive application (Crossman, 1999). For example, the phrase "Hello everybody. This is the voice of a Talking Head. The Talking Head project consists of researchers from Curtin University and will create a 3D model of a human head that will answer questions inside a web browser." took about 45 seconds to synthesize on a Silicon Graphics Indy workstation (Crossman, 2000). It is contended, however, that the negative impression of the Festival system that could be formed from such data may be misleading. Though execution time may be long on an SGI Indy workstation, informal testing on several standard PCs (Win32 and Linux platforms) showed that the same phrase took less than 5 seconds to synthesize (including the generation of a waveform). Since TTS processing is done on the server side, the system can easily be configured to ensure Festival carries out its processing on a faster machine. Therefore, Festival's synthesis time was not considered a problem.
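For reference, MBROLA's input is a plain-text list of phonemes, one per line, each with a duration in milliseconds followed by optional (position %, pitch in Hz) pairs. The fragment below is only an illustrative sketch of that format; the phonemes and values are invented, and the thesis's actual example input appears in Figure 25.

; phoneme  duration(ms)  [position%  pitch(Hz)] ...
_    100
h    80    0 110
@    60    50 120
l    70
@U   180   20 130   80 115
_    100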
Since it is expected that the program's input will contain marked-up text, an XML parser was required to parse and validate the input, and create a DOM tree structure for easy processing. There are a number of freely available XML parsers, though many are still in the development stage and implement the XML specification to varying degrees. One of the more complete parsers is libxml, a freely available XML C library for Gnome (libxml, 2000). Using libxml as the XML parser fulfilled the needs of the project in a number of ways:

a) Portability - written in C, the library is highly portable. Along with the main program, it has been successfully ported to the Win32, Linux, and IRIX platforms.

b) Small and simple - only a limited range of XML features is used, so a complex parser was not required. This is not to say that libxml is a trivial library, as it offers some powerful features.

c) Efficiency - informal testing showed that libxml parses large documents in surprisingly little time. Although not used for this project, libxml offers a SAX interface to allow for more memory-efficient parsing (see Section 3.5.5).

d) Free - libxml can be obtained cost-free and licence-free.

It is important to note that the libxml library's DOM tree building feature was used to help create the required objects that hold the program's utterance information. However, care was taken to make sure the program's objects were not dependent on the XML parser being used. Instead, a wrapper class, CTTS_SMLParser, uses libxml as the XML parser and outputs a custom tree-like structure very similar to the DOM. This ensures that all other objects within the program use the custom structure, and not the DOM tree that libxml outputs. (See Chapter 5 for more details.)
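As a sketch of the kind of DOM-style parsing and traversal described above, the following C++ fragment uses the modern libxml2 API; the calls available to the project in 2000 may have differed, and the internals of CTTS_SMLParser are not shown in this thesis, so this is illustrative only.

#include <cstdio>
#include <libxml/parser.h>
#include <libxml/tree.h>

// Recursively print the element hierarchy of a parsed document.
static void printElementNames(xmlNodePtr node, int depth) {
    for (xmlNodePtr cur = node; cur != nullptr; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE)
            std::printf("%*s%s\n", depth * 2, "", (const char *)cur->name);
        printElementNames(cur->children, depth + 1);  // descend into child nodes
    }
}

int main() {
    xmlDocPtr doc = xmlParseFile("input.sml");        // parse and build the tree
    if (doc == nullptr) return 1;                     // parse error: stop, per the XML rules
    printElementNames(xmlDocGetRootElement(doc), 0);  // walk the element hierarchy
    xmlFreeDoc(doc);                                  // release the document tree
    xmlCleanupParser();
    return 0;
}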
3.8 Summary
This chapter has explored research applicable to this project, focusing on how the literature can help with achieving the stated objectives and subproblems of Chapter 2, and with supporting the hypotheses of Chapter 4. More specifically, the literature was investigated to find the speech correlates of emotion, seeking clear definitions so that there was a solid base to work from during the implementation phase. The work of prominent researchers in the field of synthetic speech emotion, such as Murray and Arnott (1995) and Cahn (1990), who have already attempted to simulate emotional speech, was sought in order to gain an understanding of the problems involved and the approaches taken in solving them. The in-depth review of XML served two purposes: a) to describe what XML is and what the technology is trying to address, and b) to expound the benefits of XML so as to justify why SML was designed to be XML-based. A resource review was given to discuss the issues involved in deciding which tools to use for the TTS module, and to address one of the subproblems stated in Section 2.2; that is, that the TTS module should be able to run across the Win32, Linux, and UNIX platforms.
4.1 Hypotheses
The project was developed to test the following hypotheses:

1. The effect of emotion on speech can be successfully synthesized using control parameters.

2. Through the addition of emotive speech:
   a) Listeners will be able to correctly recognize the intended emotion being synthesized.
   b) Information will be communicated more effectively by the Talking Head.
It should be noted that hypothesis 2a allows for a significant error rate in recognizing the simulated emotion, since we ourselves find it difficult to understand each other's nonverbal cues and are often misunderstood as a result. Malandro, Barker and Barker (1989), and Knapp (1980), discuss difficulties in emotive speech recognition. For the hypothesis to be accepted, however, the recognition rate must be significantly higher than mere chance, showing that correct recognition of the simulated emotion is indeed occurring.
4.2.2 Delimitations
The purpose of this research is to determine how well the vocal effects of emotion can be added to synthetic speech; it is not concerned with generating an emotional state for the Talking Head based on the words it is to speak. Therefore, the system will not know the required emotion to simulate from the input text alone. This top-level information will be provided through the use of explicit tags, hence the need for the implementation of a speech markup language. Due to the strict time constraints placed on this project, the emotions to be simulated by the system were limited to happiness, sadness, and anger. These three emotions were chosen because of the wealth of study carried out on them (and hence an increased understanding) compared to other emotions. This is because happiness, sadness, and anger (along with fear and grief) are often referred to as the basic emotions, on which it is believed other emotions are built.
Chapter 5 Implementation
This chapter discusses the implementation of the TTS module to simulate emotional speech for a Talking Head, and addresses the stated subproblems of Section 2.2. The discussion covers how the module's input is processed and how the various emotional effects were implemented. This involves a description of the various structures and objects used by the TTS module. Since the module relies heavily on SML, the speech markup language that was designed and implemented to enable direct control over the module's output, the chapter also discusses SML issues such as parsing and tag processing.
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head. [Diagram: the TTS module as a black box, with its outputs including viseme information.]
It was mentioned earlier in Section 3.7.1 that the TTS module uses the Festival Speech Synthesis System and the MBROLA synthesizer. Figure 10 does not show any of this detail, nor should it. What is important to describe at this level is the module's interface; how the module produces its output is irrelevant to the user of the module.
Plain Text - The simplest form of input. Plain text means that the TTS module will endeavour to render the speech equivalent of all the input text; in other words, it is assumed that no characters within the input represent directives for how to generate the speech. As a result, speech generated from plain text will use default speech parameters and be spoken with neutral intonation.
SML Markup - If direct control over the TTS module's output is desired, then the text to be spoken can be marked up in SML, the custom markup language implemented for the module. Although an in-depth description of SML will not be given here (see Section 5.2 and Appendix A), it was designed to provide the user of the TTS module with the following abilities:

Direct control of speech production. For example, the system can be instructed to speak at a certain speech rate or pitch, or to pronounce a particular word in a certain way (this is especially useful for foreign names).

Control over speaker properties. This gives the ability to control not only how the marked-up text is spoken, but also who is speaking. Speaker properties such as gender, age, and voice can be changed dynamically within SML markup.

The effect of the speaker's emotion on speech. For example, the markup may specify that the speaker is sad for a portion of the text; as a result, that speech will sound sad. One of the primary objectives of this thesis is to determine how effective the simulated effect of emotion on the voice is.
Another important feature of the TTS module, with regard to the input it can receive, is that the module is able to handle unknown tags present within the text. This is important because other modules within the Talking Head may (and do) have their own markup languages to control processing within those modules. For example, Huynh (2000) has developed a facial markup language to specify many head movements and expressions. If any non-SML tags are present within the input given to the TTS module, they are simply filtered out of the SML input. It is very important to note that the filtering of unknown tags is done before any XML-related parsing is carried out; the XML Recommendation explicitly states that the presence of any unknown tags that do not appear in the document's referenced DTD should immediately cause a fatal error (Extensible Markup Language, 1998). Therefore, before any processing of the TTS module's input takes place, proper filtering must be carried out, since non-SML tags are expected to be present in the input. Admittedly, the very fact that the TTS module is given input that may not contain pure SML markup does not reflect a solid design of the system. However, the FAITH project has only just begun to make use of XML-based markup languages for its various modules, and the XML processing architecture within the Talking Head is not very mature. A possibly better approach to maintaining several XML-based markup languages (such as SML and FAML) is discussed in the Future Work section (Chapter 7).
Waveform - The TTS module will always produce a sound file which, when played, is the speech equivalent of the text received as input. The sound file is in 16-bit WAV format. Once a waveform is produced, it is sent via MPEG-4 to the client side of the application, where it is played (see Section 3.6).
Visemes - The visual equivalents of the phonemes (speech sounds) that are spoken are also output from the TTS module. The viseme output is encoded as stated in the MPEG-4 specification (MPEG, 1999). The phoneme-to-viseme translation submodule is one of the few that were retained from the existing TTS module.
e.g. CTTS_Central::SpeakTextEx (const char *Message, int Emotion) - synthesizes the character string pointed to by Message, simulating the emotion specified by Emotion. Emotion can be one of the supported emotion values (neutral, happiness, sadness, or anger).
A special note to make is that the type of input given to each of these functions is not explicitly specified. For instance, does the file given to SpeakFromFile contain plain text, or is it an SML document? Each of the API functions listed above automatically detects the input type using a simple heuristic: if the start of the input file or character string contains an XML header declaration, it is treated as SML markup; otherwise the input is treated as plain text. A C API has also been made available, which has the same functionality as the CTTS_Central object's interface; the only difference is that initialization and destruction routines have to be called explicitly.

TTS_Initialise () - used to initialise the TTS module. The function is called once only.
TTS_SpeakFromFile (const char *Filename) - same as CTTS_Central::SpeakFromFile (const char *Filename).
TTS_Destroy () - used to clean up the TTS module once it is no longer needed. The function is called once only.
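A minimal usage sketch of the C API follows. Only the three functions named above are used; their return types are not stated in the text and are assumed void here, and the input file name is invented. The sketch will not link without the TTS module itself.

// Hypothetical caller of the TTS module's C API (sketch only).
extern "C" {
    void TTS_Initialise();                  // assumed void return types
    void TTS_SpeakFromFile(const char *Filename);
    void TTS_Destroy();
}

int main() {
    TTS_Initialise();                   // called once, before any synthesis
    TTS_SpeakFromFile("welcome.sml");   // input type (plain text or SML) is auto-detected
    TTS_Destroy();                      // called once, when the module is no longer needed
    return 0;
}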
<?xml version="1.0"?> <!DOCTYPE sml SYSTEM "./sml-v01.dtd"> <sml> <p>...</p> <p>...</p> <p>...</p> </sml> Figure 11 - Top-level structure of an SML document. Paragraphs
In turn, each p node can contain one or more emotion tags (sad, angry, happy, and neutral) and instances of the embed tag; text not contained within an emotion tag is not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows SML markup that is invalid because it does not follow this rule. Note that, unlike in lazily written HTML, the paragraph (p) tags must be closed properly.
<p>
  <neutral>Please remain quiet.</neutral>
  <embed src="sound.wav"/>
  <angry>Who made that noise?</angry>
</p>

Figure 12 - Valid SML markup.
<p>
  <sad>I have some sad news:</sad>
  this part of the markup is not valid SML.
</p>

Figure 13 - Invalid SML markup.
All tags described in Appendix A can occur inside an emotion tag (except sml, p, and embed). A limitation of SML is that emotion tags cannot occur within other emotion tags. However, unless explicitly specified otherwise, most other tags can even contain instances of tags with the same name. For example, a pitch tag can contain another pitch tag, as the following example shows.
<pitch range="+100%">
  Not I,
  <pitch middle="-15%">said the dog.</pitch>
</pitch>
The described structure of an input file containing SML markup is confirmed by SML's DTD (see Appendix B). Should the input file not conform to the DTD specification, a parse error will occur and, in accordance with the XML Recommendation, the input will not be processed.
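The full DTD is reproduced in Appendix B. Purely as an illustration of the structure just described, a fragment consistent with it might look like the following; the content models shown are assumptions inferred from this section, not the actual sml-v01.dtd.

<!ELEMENT sml (p+)>
<!ELEMENT p (neutral | happy | sad | angry | embed)+>
<!ELEMENT neutral (#PCDATA | rate | pitch | emph | pron | pause)*>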
Figure 14 - TTS module subsystems. [Diagram: the SML Parser, built on libxml, turns the input text into an SML Document; the Natural Language Parser exchanges text and phoneme information with the SML Document; tag processing modifies the stored text and phonemes; the phoneme data is then passed to the DSP to produce a waveform, and to the Visual Module to produce visemes.]
As Figure 14 shows, the design of the TTS module subsystems is centred on the SML Document object. The main steps for synthesizing the module's input text involve the creation, processing, and output of the SML Document. This is broken down into the following tasks (a code-level sketch of this flow follows the list):

a) Parsing. The input text is parsed by the SML Parser, which creates an SML Document object. The SML Parser makes use of libxml.

b) Text-to-Phoneme Transcription. The Natural Language Parser (NLP) is responsible for transcribing the text into its phoneme equivalent, plus providing intonation information in the form of each phoneme's duration and pitch values. This information is given to the SML Document object and stored within its internal structures. The NLP unit makes use of the Festival Speech Synthesis System.

c) SML Tag Processing. Any SML tags present in the input text are processed. This usually involves modifying the text or phonemes held within the SML Document.

d) Waveform Generation. The phoneme data held within the SML Document is given to the Digital Signal Processing (DSP) unit to generate a waveform. The DSP makes use of the MBROLA synthesizer.

e) Viseme Generation. The Visual Module is responsible for transcribing the phonemes to their viseme equivalents. Again, the phoneme data is obtained from the SML Document. The Visual Module will not be discussed in any further detail in this thesis, since it has reused much of the old TTS module's subroutines; Crossman (1999) provides a description of the phoneme-to-viseme translation process.
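The following compilable C++ sketch summarizes tasks a) to e) as a single control flow. Apart from CTTS_SMLParser, every class, method, and type name here is an illustrative assumption standing in for the module's real components.

#include <string>

// Stub types standing in for the module's real classes (names are assumptions).
struct SMLDocument {};
struct Waveform {};
struct Visemes {};

struct CTTS_SMLParser {            // wraps libxml (Section 3.7.2)
    SMLDocument Parse(const std::string &) { return {}; }
};
struct NaturalLanguageParser {     // uses Festival: text -> phonemes, pitch, duration
    void TranscribeText(SMLDocument &) {}
};
struct DigitalSignalProcessor {    // uses MBROLA: phoneme data -> waveform
    Waveform GenerateWaveform(const SMLDocument &) { return {}; }
};
struct VisualModule {              // phoneme-to-viseme translation
    Visemes GenerateVisemes(const SMLDocument &) { return {}; }
};
void ProcessSMLTags(SMLDocument &) {}   // emotion and low-level tag processing

// Illustrative top-level flow corresponding to steps a)-e).
void SynthesizeInput(const std::string &inputText) {
    CTTS_SMLParser parser;
    SMLDocument doc = parser.Parse(inputText);        // a) parse into an SML Document
    NaturalLanguageParser nlp;
    nlp.TranscribeText(doc);                          // b) text-to-phoneme transcription
    ProcessSMLTags(doc);                              // c) SML tag processing
    DigitalSignalProcessor dsp;
    Waveform wav = dsp.GenerateWaveform(doc);         // d) waveform generation
    VisualModule visual;
    Visemes vis = visual.GenerateVisemes(doc);        // e) viseme generation
    (void)wav; (void)vis;
}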
The TTS module keeps track of all SML tag names by keeping a special XML document that holds SML tag information2. Filtering of the input is done by creating a copy of the input file and copying across only those tags that are known. It is important that this filtering process is carried out because the input is expected to contain other non-SML tags, such as those belonging to the FAML module. Figure 15 shows the filtering process.
Figure 15 - Filtering process of unknown tags. [Diagram: each tag in the input file is looked up against the SML tag information; only known tags are copied to the filtered output.]
2. The XML document is called tag-names.xml and is held in the special TTS resource directory, TTS_rc.
hold markup information, attribute values, and character data. Figure 16 shows the high-level structure of an SML Document that would be constructed for the accompanying SML markup. Note how each node has a type that specifies what kind of node it is. The hierarchical nature of the SML Document implies which text sections will be rendered in which way: a parent will affect all its children. So, for the example in Figure 16, the emph node will affect the phoneme data of its (one) child node, the text node containing the text "too". The happy node will affect the phoneme data of all its (three) children nodes, containing the text "That's not", "too", and "far away" respectively. Tags that were specified with attribute values are represented by element nodes that point to attribute information (this is not shown in Figure 16 for clarity).
<?xml version="1.0"?> <!DOCTYPE sml SYSTEM "./sml-v01.dtd"> <sml> <p> <neutral>I live at <rate speed=-10%>10 Main Street</rate> </neutral> <happy> Thats not <emph>too</emph> far away. </happy> </p> </sml>
Figure 16 - SML Document structure for the SML markup given above. [Diagram: a tree with the sml DOCUMENT_NODE at the root, a p ELEMENT_NODE beneath it, neutral and happy ELEMENT_NODEs beneath p, a rate ELEMENT_NODE under neutral, an emph ELEMENT_NODE under happy, and text nodes holding the character data.]
1. Utterance level - the whole phrase held in that node. The CTTS_UtteranceInfo class is responsible for holding information at this level.

2. Word level - the individual words of the utterance. The CTTS_WordInfo class is responsible for holding information at this level.

3. Phoneme level - the phonemes that make up the words. The CTTS_PhonemeInfo class is responsible for holding information at this level.

4. Phoneme pitch level - the pitch values of the phonemes (phonemes can have multiple pitch values). The CTTS_PitchPatternPoint class is responsible for holding information at this level.

The above-mentioned objects are organized within a text node as follows. A text node contains one CTTS_UtteranceInfo object. The CTTS_UtteranceInfo object contains a list of CTTS_WordInfo objects that hold word information. In turn, each CTTS_WordInfo object contains a list of CTTS_PhonemeInfo objects that hold phoneme information. A CTTS_PhonemeInfo object contains the actual phoneme and its duration (ms). Each CTTS_PhonemeInfo object then contains a list of CTTS_PitchPatternPoint objects describing the pitch pattern of the phoneme. A pitch point is characterized by a pitch value, and a percentage value of where the point occurs within the phoneme's duration.
[Diagram: a U node for the whole phrase, W nodes for the words "the" and "moon", P nodes for phonemes such as "dh" and "@", and pp pitch points such as (0, 95) and (50, 101), each pairing a pitch value with a percentage position inside the phoneme's length.]
Figure 17 - Utterance structures to hold the phrase "the moon". U = CTTS_UtteranceInfo object; W = CTTS_WordInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint object.
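A compact C++ sketch of this hierarchy is given below. The four class names come from this section; the individual field names and types are assumptions made for illustration.

#include <string>
#include <vector>

// One point on a phoneme's pitch pattern (field names are assumptions).
struct CTTS_PitchPatternPoint {
    int percentIntoPhoneme;   // where the point occurs within the phoneme (0-100%)
    int pitchValue;           // the pitch value at that point
};

struct CTTS_PhonemeInfo {
    std::string phoneme;                              // the actual phoneme, e.g. "dh"
    int durationMs;                                   // duration in milliseconds
    std::vector<CTTS_PitchPatternPoint> pitchPoints;  // possibly several per phoneme
};

struct CTTS_WordInfo {
    std::string word;                                 // e.g. "the"
    std::vector<CTTS_PhonemeInfo> phonemes;           // phonemes making up the word
};

struct CTTS_UtteranceInfo {
    std::vector<CTTS_WordInfo> words;                 // the whole phrase held in a text node
};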
Once word information is stored within each node's utterance object, phoneme data can be generated for each word. Obtaining the actual phoneme data (including intonation) is, however, a more complex process. This is because an entire phrase should be given to Festival in order for correct intonation to be generated. As an example, consider the following SML markup (the corresponding nodes held in the SML Document are shown in Figure 19).
<happy> <rate speed=-15%>I wonder,</rate> you pronounced it <emph>tomato</emph> did you not? </happy>
Figure 19 - SML Document nodes for the markup above: a happy element node whose children are a rate element node (holding the text "I wonder,"), a text node ("you pronounced it"), an emph element node (holding the text "tomato"), and a text node ("did you not?").
If each text node's contents are given to Festival one at a time (i.e. first "I wonder,", then "you pronounced it", and so forth), then although Festival will be able to produce the correct phonemes, it will not generate proper pitch and timing information for them. This will result in an utterance whose words are pronounced properly, but which contains inappropriate intonation breaks that make the utterance sound unnatural. An appropriate analogy would be a person shown a pack of cards with words written on them, one at a time, and asked to read them out loud. The person, not knowing what words will follow, will not know how to give the phrase an appropriate intonation. If the same person is instead given a card that contains the entire sentence, then, now knowing what the phrase is saying, the person will read it out loud correctly. The same approach was taken in the solution to this problem. Continuing the above example will help to illustrate how this is done. The SML Document is traversed until an emotion node is encountered. In the example, traversal would stop at the happy node. The contents of its child text nodes are then concatenated to make one phrase. So, the contents of the four text nodes in Figure 19 would be concatenated to form the phrase "I wonder, you pronounced it tomato did you not?" The phrase is stored in a temporary utterance object held in the happy node.
The phrase is given to Festival, and Festival generates the phoneme transcription as well as intonation information. The entire phoneme data is stored in the happy node's temporary utterance object. Because each text node already contains word information in its utterance object, it is a simple process to disperse the phoneme data held in the happy node amongst its children. The temporary utterance object in happy is then destroyed. If this procedure is followed, correct intonation is given to the utterance. Of course, a limitation is that this does not solve the problem of having an emotion change in mid-sentence. However, the algorithm makes the assumption that this will not occur frequently, and that if it does, the intonation will not need to continue over emotion boundaries and a break is acceptable.
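A simplified sketch of this concatenate-and-disperse step is given below. The TextNode type and the helper functions are hypothetical stand-ins for the module's real structures and routines; the real dispersal uses the word information each text node already holds.

// Assumed sketch of phrase-level transcription for an emotion node.
#include <string>
#include <vector>

// Minimal stand-in for the structure sketched in Section 5.5.2.
struct CTTS_UtteranceInfo { /* words, phonemes, pitch points */ };

struct TextNode {
    std::string        text;       // the node's character data
    CTTS_UtteranceInfo utterance;  // word/phoneme data for this node
};

// Festival is given the whole phrase so intonation is generated over the
// complete sentence rather than fragment by fragment. Stubbed here.
CTTS_UtteranceInfo transcribeWithFestival(const std::string& /*phrase*/)
{
    return CTTS_UtteranceInfo();
}

// Split the phrase-level phoneme data back out to the child text nodes,
// using the word information each node already holds. Stubbed here.
void dispersePhonemes(const CTTS_UtteranceInfo& /*phrase*/, std::vector<TextNode*>& /*children*/)
{
}

void synthesizeEmotionNode(std::vector<TextNode*>& childTextNodes)
{
    // 1. Concatenate the children's text into one phrase.
    std::string phrase;
    for (TextNode* n : childTextNodes)
        phrase += n->text + " ";

    // 2. Transcribe the whole phrase at once (correct intonation).
    CTTS_UtteranceInfo temp = transcribeWithFestival(phrase);

    // 3. Hand each child its share of the phoneme data; the temporary
    //    utterance held in the emotion node is then discarded.
    dispersePhonemes(temp, childTextNodes);
}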
In the timeline of Figure 20, the server synthesizes the neutral tag (Utterance 1), the happy tag (Utterance 2) and the sad tag (Utterance 3) in turn before going idle, while the client plays Utterance 1, Utterance 2 and Utterance 3 as they become available.
Figure 20 - Raw timeline showing server and client execution when synthesizing example SML markup above.
Parameter     | Anger                               | Happiness                  | Sadness
Speech rate   | Faster                              | Slightly faster            | Slightly slower
Pitch average | Very much higher                    | Much higher                | Slightly lower
Pitch range   | Much wider                          | Much wider                 | Slightly narrower
Intensity     | Higher                              | Higher                     | Lower
Pitch changes | Abrupt, downward, directed contours | Smooth, upward inflections | Downward inflections
Voice quality | Breathy, chesty tone1               |                            | Resonant1
Articulation  | Clipped                             |                            | Slurred
Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness.
To implement the guidelines found in the literature on human speech emotion, Murray and Arnott (1995) developed a number of prosodic rules for their HAMLET system. The TTS module has adopted some of these rules, though slight modifications were required. Also, other similar prosodic rules have been developed through personal experimentation.
5.7.1 Sadness
Basic Speech Correlates Following the literature-derived guideline for the speech correlates of emotion shown in Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were optimized for the TTS module, and are given as percentage values relative to neutral speech.
Parameter     | Value (relative to neutral speech)
Speech rate   | -15%
Pitch average | -5%
Pitch range   | -25%
Volume        | 0.6
As a result of the above speech parameter changes, the speech is slower, lower in tone, and more monotonic (the pitch range reduction gives a flatter intonation curve). The volume is reduced for sadness so that the speaker talks more softly. (Implementation details on how speech rate, volume and pitch values are modified can be found in Section 5.8.)

Prosodic rules
The following rules, adopted from Murray and Arnott (1995), were deemed to be necessary for the simulation of sadness. Some parameter values were slightly modified to work best with the TTS module.
1. Eliminate abrupt changes in pitch between phonemes (a code sketch of this rule follows this list). The phoneme data is scanned, and if any phoneme pair has a pitch difference of greater than 10% then the lower of the two pitch values is increased by 5% of the pitch range.
2. Add pauses after long words. If any word in the utterance contains six or more phonemes, then a slight pause (80 milliseconds) is inserted after the word.
The following rules were developed specifically for the TTS module.
1. Lower the pitch of every word that occurs before a pause. Such words are lowered by scanning the phoneme data in the particular word, and lowering the last vowel-sounding phoneme (and any consonant-sounding phonemes that follow) by 15%. This has the effect of lowering the last syllable.
2. Final lowering of the utterance. The last syllable of the last word in the utterance is lowered in pitch by 15%.
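The pitch-smoothing rule referred to above might look roughly like the following sketch. For simplicity it assumes a single pitch value per phoneme (the real structures hold multiple pitch points), and it interprets the 10% comparison as a relative difference; both are assumptions rather than details taken from the thesis.

// Assumed sketch of the "eliminate abrupt pitch changes" rule.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Phoneme {
    float pitch;       // simplified: one pitch value per phoneme
    int   durationMs;  // duration in milliseconds
};

// If two adjacent phonemes differ in pitch by more than 10% (relative),
// raise the lower of the two pitch values by 5% of the utterance's pitch range.
void smoothAbruptPitchChanges(std::vector<Phoneme>& phonemes, float pitchRange)
{
    for (std::size_t i = 0; i + 1 < phonemes.size(); ++i) {
        float& a  = phonemes[i].pitch;
        float& b  = phonemes[i + 1].pitch;
        float  lo = std::min(a, b);
        float  hi = std::max(a, b);
        if (lo > 0.0f && (hi - lo) / lo > 0.10f) {
            if (a < b) a += 0.05f * pitchRange;
            else       b += 0.05f * pitchRange;
        }
    }
}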
5.7.2 Happiness
Basic Speech Correlates Following the literature-derived guideline for the speech correlates of emotion shown in Table 2, Table 4 shows the parameter values set for the SML happy tag. The values were optimized for the TTS module, and are given as percentage values relative to neutral speech.
Parameter     | Value (relative to neutral speech)
Speech rate   | +10%
Pitch average | +20%
Pitch range   | +175%
Volume        | 1.0 (same as neutral)
As a result of the above speech parameter changes, the speech is slightly faster, higher in tone, and sounds more excited since intonation peaks are exaggerated by the pitch range increase.

Prosodic rules
The following rules were adopted from Murray and Arnott (1995). Some parameter values were slightly modified to work best with the TTS module.
1. Increase the duration of stressed vowels. The phoneme data is scanned, and the duration of all primary stressed vowel phonemes is increased by 20%. Stressed vowels are discussed in Section 5.7.4.
2. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pair has a pitch difference of greater than 10% then the lower of the two pitch values is increased by 5% of the pitch range.
3. Reduce the amount of pitch fall at the end of the utterance. Utterances usually have a pitch drop in the final vowel and any following consonants. This rule increases the pitch values of these phonemes by 15%, hence reducing the size of the terminal pitch fall.
5.7.3 Anger
Basic Speech Correlates Table 5 shows the parameter values set for the SML angry tag. Note that the values were optimized for the TTS module, and are given as percentage values relative to neutral speech.
Parameter     | Value (relative to neutral speech)
Speech rate   | +18%
Pitch average | -15%
Pitch range   | -15%
Volume factor | 1.7
Prosodic rules
The following rule was adopted from Murray and Arnott (1995).
1. Increase the pitch of stressed vowels. The phoneme data is scanned, and the pitch of primary stressed vowels is increased by 20%, while secondary stressed vowels are increased by 10%.
Inspection of the parameter values in Table 5 will reveal that they differ considerably from the guidelines shown in Table 2. Initially, speech parameters were set for anger as shown in Table 2, but preliminary tests showed that even with different prosodic rules, the angry tag produced output that was too similar to that of the happy tag (both had increases in speech rate, pitch average and pitch range). It was decided to keep the increase in speech rate to denote an increase in excitement, but to lower the pitch average. The lower voice seemed to better convey a menacing tone, for the same reason that animals utter a low growl to ward off possible intruders. With the help of the increase in volume, lowering the pitch average also results in a perceived hoarseness in the voice3, although vocal effects could not be implemented due to a limitation of the Festival and MBROLA systems. The combination of the decreased pitch range and the intonation rule results in a flatter intonation curve with sharper peaks. This upholds Table 2's description of pitch changes for anger: abrupt, downward, directed contours.

3 Increased hoarseness in the voice for anger is supported in the literature (Murray and Arnott, 1993).
Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch.
Therefore, the prosodic rules made use of the fact that different stress types exist for vowel-sounding phonemes, basing classification on the criteria shown in the table above. Whether or not this follows the stressed phoneme definition of Murray and Arnott (1995) is unclear; the fact remains that Table 6 allowed the implementation of prosodic rules that involved different stressed vowel types, and its success will be demonstrated in this thesis's evaluation section (Chapter 6).
5.7.5 Conclusion
While the literature identifies a number of speech correlates of emotion, speech synthesizer limitations did not allow for the implementation of some of them; namely,
intensity, articulation and voice quality parameters. This may have been the main reason why happiness and anger would have been too similar if the recommendations found in the literature had been strictly followed: based on acoustic features alone, there were too few differences between the two emotions. In discussing HAMLET's prosodic rules, Murray and Arnott (1995) state that the rules were developed to be as synthesizer-independent as possible. This was found to be the case, though the specific values given in their paper for the DECtalk system obviously could not be used. Still, the values served as a very good indication of which settings needed to be changed for different emotions. The very fact that some of the HAMLET prosodic rules could be implemented in this project's TTS module serves to show that the work of Murray and Arnott (1995) is not speech synthesizer dependent. This has the added advantage of assuring that the emotion rules implemented in the TTS module are not dependent on the Festival Speech Synthesis System and the MBROLA Synthesizer.
also affected. Figure 21 shows the factors that duration and pitch values are multiplied by.
<emph target=o affect=b level=moderate>sorry</emph>

Figure 21 - Multiply factors of pitch and duration values for emphasized phonemes. The word "sorry" is transcribed into the phonemes s, o, r, ii; the target phoneme (o) has its duration multiplied by 2.0 and its pitch by 1.3, while its neighbouring phonemes have their durations multiplied by 1.8 and their pitches by 1.2.
The attributes that can be specified within the emph tag give various options on how the word will be emphasized. For instance, the affect attribute can specify whether only the pitch should change for the affected phonemes (the default), or just the duration, or both. The level attribute specifies the strength of the emphasis (e.g. weak (the default), moderate, strong). The current limitation of the tag is that only one target phoneme can be specified. However, this could easily be modified so that multiple target phonemes can exist within a word by extending the target attribute and how it is processed.

b) embed
The embed tag enables foreign file types to be embedded within SML markup. The type of embedded file is specified through the type attribute. Currently, two file types are supported in SML:

audio - the embedded file is an audio file, and is played aloud. The filename is specified through the src attribute. Embedding audio files within SML markup is useful for sound effects.
<embed type=audio src=sound.wav/>
mml - Music Markup Language (MML) markup is being embedded, and signifies that the voice will sing the specified song. Two input files are specified through the following attributes: music_file (contains the music description of the song), and lyr_file (contains the song's lyrics). Processing of
this file involves obtaining the two input filenames, and calling the MML library that will perform synthetic singing. The implementation of MML for synthetic singing was developed by Stallo (2000).
<embed type=mml music_file=rowrow.mml lyr_file=rowrow.lyr/>
c) pause
The pause tag inserts a silent phoneme at the end of the last word of the previous text node, with a duration specified by the length or msec attributes. Finding the text node previous to the pause node can be non-trivial, as the algorithm must be able to handle any sub-tree structure. The figure below shows a possible structure that the algorithm must be able to handle; a sketch of one such search follows the figure.
Take a deep <emph>breath</emph> <pause length=medium/> and continue
Sub-tree for the markup above: a text node ("Take a deep"), an emph element node with a child text node ("breath"), a pause element node with length=medium, and a text node ("and continue").
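One way of locating the text node that precedes the pause node, whatever the surrounding sub-tree looks like, is a reverse document-order search such as the sketch below. This is an assumed formulation written for illustration, not a transcription of the module's actual algorithm, and the Node type is a simplified stand-in for the SML Document's nodes.

// Assumed sketch: find the text node immediately preceding a pause node.
#include <cstddef>
#include <string>
#include <vector>

enum NodeType { ELEMENT_NODE, TEXT_NODE };

struct Node {
    NodeType            type;
    std::string         data;      // tag name for elements, character data for text nodes
    Node*               parent = nullptr;
    std::vector<Node*>  children;
};

// Deepest, right-most text node inside 'n' (or nullptr if none).
Node* lastTextDescendant(Node* n)
{
    if (n->type == TEXT_NODE)
        return n;
    for (auto it = n->children.rbegin(); it != n->children.rend(); ++it)
        if (Node* t = lastTextDescendant(*it))
            return t;
    return nullptr;
}

// Text node immediately preceding 'pauseNode' in document order.
Node* previousTextNode(Node* pauseNode)
{
    for (Node* cur = pauseNode; cur->parent != nullptr; cur = cur->parent) {
        Node* parent = cur->parent;
        // Locate 'cur' among its siblings, then scan the siblings to its left.
        std::size_t idx = 0;
        while (idx < parent->children.size() && parent->children[idx] != cur)
            ++idx;
        for (std::size_t i = idx; i-- > 0; )
            if (Node* t = lastTextDescendant(parent->children[i]))
                return t;
    }
    return nullptr;  // no text node precedes the pause node
}

For the markup above, this search starts at the pause node, steps left to the emph sub-tree, and returns the text node "breath", which is where the silent phoneme would be appended.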
d) pitch
Pitch average
The pitch average is modified by changing the pitch values of every phoneme's pitch point(s). The middle attribute specifies the factor by which the pitch values need to change. Modifying the pitch value of every phoneme by the same amount has the effect of changing the pitch average.
Pitch range
Modifying the pitch range of an utterance requires that the pitch average be known. Therefore, the pitch average of the utterance is calculated before any pitch values can be modified. Each pitch value is then recalculated using the following equation:
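One formulation consistent with this description, stated here as an assumption rather than as the thesis's exact equation, is:

new_pitch = pitch_average + (old_pitch - pitch_average) x range_factor

where range_factor is derived from the pitch range setting; values greater than 1 widen the pitch range and values less than 1 narrow it.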
This has the effect of moving pitch points further away from the average line, for pitch values both greater and less than the pitch average (as shown in Figure 23). Care is taken that the new pitch values do not go above or below predetermined thresholds; otherwise the voice loses its human quality and sounds too machine-like.
Figure 23 - The effect of widening the pitch range of an utterance (original and new pitch values shown relative to the pitch average).
e) pron
The pron tag is used to specify a particular pronunciation of a word. The pron tag is the only tag that modifies the text content of an SML Document text node, and it is processed before the contents are given to the NLP module at the text-to-phoneme transcription stage. This is because the value of the sub attribute overwrites the contents of the text node. When the phoneme transcription stage is reached, the substituted text is what is given to the NLP module, and so the phoneme transcription reflects the pronunciation specified in the markup. Figure 24 illustrates how this is done for the markup segment:
<pron sub=toe may toe>tomato</pron>
Figure 24 - Processing of the pron tag. The sub-tree reflects the structure of the example markup, with the pron node holding the sub attribute value "toe may toe" and a child text node containing "tomato". The pron node's sub attribute value overwrites the contents of its child text node, and at the phoneme transcription stage the text node gives its contents to the NLP unit and receives the phoneme transcription of the substituted text.
f) rate
The speech rate is modified very easily by affecting the phoneme duration data in the text node structures. How much the speech rate should increase or decrease by is specified by the speed attribute, and each phoneme's duration is multiplied by a factor reflecting the value of speed.
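A sketch of how the rate tag could be applied is given below; the mapping from the speed attribute to a duration factor is an assumption made for illustration.

// Assumed sketch of rate-tag processing over a text node's phonemes.
#include <vector>

struct Phoneme { int durationMs; };

// durationFactor is derived from the rate tag's speed attribute; for example,
// a speed of -10% (slower speech) could map to a factor of about 1.1,
// making every phoneme roughly 10% longer.
void applyRate(std::vector<Phoneme>& phonemes, double durationFactor)
{
    for (Phoneme& p : phonemes)
        p.durationMs = static_cast<int>(p.durationMs * durationFactor);
}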
g) volume
The volume is modified through the MBROLA -v command line option. The level attribute specifies the volume change, and this is converted to a suitable value to pass to MBROLA on the command line. The disadvantage of this way of implementing volume control is that MBROLA applies the volume to the whole utterance it synthesizes. Since sections are passed to MBROLA at the emotion tag level, the volume can vary at most from emotion to emotion. Therefore, dynamic volume change within an emotion tag is not possible, and this has been identified as an area for future work (see Section 7.1).
Name
The value of the name attribute is passed to MBROLA on the command line, and must be the name of an MBROLA diphone database that already exists on the system. Using a different diphone database changes the voice, since the recorded speech units contained within the database are from a different source. The TTS module can currently use two diphone databases: en1 and us1. Making use of more diphone databases (when they become available) requires only very minimal additions. Unfortunately, specifying the actual name of the MBROLA diphone database in the SML markup has been identified as a design flaw, since an SML user must now be aware that the MBROLA synthesizer is being used and is forced to write markup that directly accesses an MBROLA voice. It is very important that this be altered in the future.
Gender
The MBROLA synthesizer provides the ability to change frequency values and voice characteristics, and thus provides a way of obtaining a male and a female voice from the same diphone database. Obtaining a male or female voice requires the specification of the following MBROLA command line options:
1. Frequency ratio - specified through the -f command line option. For instance, if -f 0.8 is specified on the MBROLA command line, all fundamental frequency values will be multiplied by 0.8 (the voice will sound lower).
2. Vocal tract length ratio - specified through the -l command line option. For instance, if the sampling rate of the database is 16000, specifying -l 18000 shortens the vocal tract by a ratio of 16/18 (which will make the voice sound more feminine).
Unfortunately, the values for the -f and -l MBROLA command line options are dependent on the diphone database being used. Table 7 shows the parameter values required to obtain a male and a female voice for the en1 and us1 diphone databases.
                       | Male (using en1) | Female (using en1) | Male (using us1) | Female (using us1)
Frequency ratio (-f)   | 1.0              | 1.6                | 0.9              | 1.5
Vocal tract ratio (-l) | 16000            | 20000              | 16000            | 16000

Table 7 - MBROLA command line option values for the en1 and us1 diphone databases to output male and female voices.
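By way of illustration, a female voice based on the en1 database could then be requested with an MBROLA command line of roughly the following form; the .pho and .wav file names are placeholders, and it is assumed the en1 database file is reachable under that name.

mbrola -f 1.6 -l 20000 en1 utterance.pho utterance.wav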
The set of recorded speech sounds is often referred to as the voice data corpus, or a diphone database.
MBROLA accepts as input a list of phonemes, together with prosodic information (in the form of phoneme durations and a piecewise linear description of pitch), and from this it is able to produce speech samples. The MBROLA input format is very intuitive and simple for describing phoneme data. Hence the design of the TTS module's utterance structures (discussed in Section 5.5.2) was based on the MBROLA input format. Figure 25 shows example MBROLA input required to produce a speech sample of the word "emotion". Each line holds the following information, delimited by white space: phoneme, duration, and n pitch-point pairs. A pitch-point pair is represented by two numbers: a percentage value representing how far into the phoneme's duration the point occurs, and the actual pitch value.
Figure 25 - Example MBROLA input to produce a speech sample of the word "emotion" (each line holds a phoneme, its duration in milliseconds, and zero or more pitch-point pairs; the first line, "# 210", specifies an initial silence of 210 ms).
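For reference, MBROLA input of this kind looks like the following lines. The phonemes roughly spell out "emotion", but the durations and pitch points shown here are invented for illustration and are not the values that appeared in Figure 25.

# 210
i 60 0 120
m 70
ou 100 30 130 80 125
sh 90
@ 55
n 75 100 110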
choice of good descriptive tag names (if an appropriate name is already being used in another tag set), and similar tag names such as anger and angry will only confuse future users of TTS and FAML markup. A possible solution would be the use of XML Namespaces, which allow different markup languages to contain the same tag names (ambiguity is resolved through the use of resolution identifiers, e.g. sml.angry and faml.angry).
2. FAML API functions are called from within the TTS module to initialize the FAML module, and to allow the FAML module to modify the output FAP stream.
3. The creation of temporary utterance files in a format required by the FAML module. This information is needed by the FAML module for the proper synchronization of its generated facial gestures and the TTS module's speech. The utterance file contains word and phoneme information in the format shown in Figure 26.
>And
# 210
a 90
n 49
d 46
>now
n 72
au 262
# 210
>the
dh 45
@ 40
>latest
l 69
ei 136
t 71
i 66
s 85
t 66
>news
n 72
y 45
uu 217
z 82
# 210
Figure 26 - Example utterance information supplied to the FAML module by the TTS module. Example phrase: And now the latest news
5.11 Summary
This chapter has demonstrated that the design and implementation of the described TTS module has endeavoured to follow the black box design principle in virtually all of its subsystems. As Figure 7 shows, the dependence of other Talking Head modules on the TTS module is strictly limited to its precisely defined inputs; any modification of the internal workings of the TTS module will not affect how the rest of the Talking Head functions. The TTS module's own dependence on the tools it uses, such as libxml, the Festival Speech Synthesis System, and the MBROLA Synthesizer, has been carefully bounded through proper design of C++ classes and structures. There is also no dependence of the DSP unit (which makes use of MBROLA) on the NLP unit (Festival), so minimal changes would be required if another NLP unit were used, and the TTS subsystem design would remain unchanged. The only requirement for a new speech synthesizer to be used in the NLP and DSP units is the ability to manipulate the utterance at the phoneme data level (including pitch and duration information).

The prosodic and speech parameter rules for the emotion tags are totally speech synthesizer independent, except for the volume settings. Because of this, the emotion tags could easily be ported for use in another TTS module using different speech synthesizers. Processing of most of the low-level speech tags makes use of the SML Document's utterance structures only. However, the speaker and volume tags make heavy use of the MBROLA Synthesizer, since these tags affect the way MBROLA produces the waveform. Future work will look at minimizing the dependence of these two tags on MBROLA.
Section 2 - Emotion Recognition
This section was made up of two parts, and borrowed heavily from the testing method described in Murray and Arnott (1995). Both parts dealt with the recognition of emotion in synthetic speech, but followed a different format. Part A had only two test phrases, but each phrase was synthesized under four different emotions: anger, happiness, sadness, and neutral (no emotion). The two phrases were specially chosen because they were deemed to be emotionally undetermined (Murray and Arnott, 1995). Emotionally undetermined phrases are phrases whose emotion cannot be determined simply from the words. For instance, the phrase "I received my assignment mark today" can be convincingly said under a variety of emotions: the speaker may be sad about the mark he or she received, may be happy about it, or may even be angry at feeling unfairly treated. For Part B, ten test phrases were prepared: five of these were emotionally neutral phrases, and the other five were emotionally biased phrases. An example of a neutral phrase is "The book is lying on the table". An example of an emotionally biased, or emotive, phrase is "I would not give you the time of day" (the words already carry negative connotations, which could probably influence the listener to identify anger or disgust). Part B consisted of the following utterances (in this order):
a) Five neutral phrases spoken without emotion.
b) Five emotive phrases spoken without emotion.
c) The same phrases as in (a) above, but spoken in one of the four emotions (includes neutral).
d) The same phrases as in (b) above, but spoken in one of the four emotions.
Both Cahn (1990b) and Murray and Arnott (1995) expressed difficulty in finding appropriate test phrases, and indeed, the same difficulties were encountered when designing the questionnaire for this project, especially in finding neutral phrases that sounded convincing under any of the four emotions. A number of phrases were borrowed from the experiments of the aforementioned researchers, and the rest were original. It is important to note that the participants were not made aware of the different types of phrases that were prepared. See Appendix E for a list of the example test phrases used. An important design issue for this section of the questionnaire was deciding the way in which the participants should indicate their choice. Murray and Arnott (1995) have
identified (and used) two basic methods of user input that are suitable for speech emotion recognition: forced response tests and free response tests. In a forced response test, the subject is forced to choose from a list of words the one that best describes the emotion that he or she perceives is being spoken. In a free response test, the subject may write down any word that he or she thinks best describes the emotion. In experiments performed by Cahn (1990), the evaluation was based on a forced response test, with only the six emotions that the Affect Editor simulated as possible responses. Participants were also asked to indicate (on a scale of 1 to 10) how strongly they heard the emotion in the utterance, and how sure they were.

For this project, it was decided to adopt the forced response test because of its simplicity and to avoid the possible ambiguity of a free response test (for instance, if a participant wrote down "exasperated", should it be categorized as angry or disgusted, or neither?). However, a mechanism needed to be provided so as not to limit the possible selection of responses. This was important because only three emotions other than neutral (anger, happiness, and sadness) were simulated by the system. It was feared that should the selection list be confined to just four possible responses, the data could potentially be invalidated. For instance, any utterance that contained positive words, or had a positive tone to it, would immediately be perceived as happy only because it would be the only positive emotion in the list. This situation was avoided by adding two distractor emotions to the selection list: surprise and disgust. This way, listeners would have more choice. In addition to this, an "Other" option was added to the selection list to enable participants to write down their own descriptive term if they so wished. Therefore, a total of seven possible responses were made available for the four different types of emotion being simulated.

It is beneficial that the tests in this section had a very similar structure to the forced response tests conducted by Murray and Arnott (1995). The experiment could not be exactly the same, since time limitations did not permit as many test utterances to be played, and not all the emotions tested by Murray and Arnott were simulated. Still, the experiments are similar enough to provide at least a loose comparison of results. It would have been interesting to also test a female voice, since the voice gender can be changed through the speaker tag (see Section 5.8.2). However, the questionnaire length was strictly limited, and a sufficient number of examples of one gender were needed to obtain useful data; interchanging male and female voices would have introduced complex variables, and making valid conclusions from the data would have been difficult, if not impossible. Still, differences in subject responses to male and
female voices could possibly have provided useful data for analysis, and this should certainly be looked into in the future (see Chapter 7).

Section 3 - Talking Head
For this section it was desired to obtain data that could describe the effect that adding vocal emotion to a Talking Head had on a user's perception of the Talking Head. To do this it was decided to prepare two movies of the Talking Head: one speaking without vocal emotion effects, and the other including emotion effects in its speech, with all other variables remaining as similar as possible (e.g. facial expressions, movements, and the actual words spoken). It was not possible to have the visual information of the Talking Head exactly the same for both examples, since the inclusion of certain speech tags affected the length of the utterance, and so slight movements such as eye blinking may have been different. Facial expressions showing emotion, however, were the same for both examples (the exact placement, duration and intensity of the expressions could be controlled via facial markup tags (Huynh, 2000)). The utterance that was synthesized for both examples was excerpted from the Lewis Carroll classic Alice's Adventures in Wonderland (Carroll, 1946). The excerpt contained dialogue between three characters of the story: Alice, the March Hare, and the Hatter. The passage was chosen for its wonderful expressiveness (it is a children's novel), the fact that it included dialogue, and because various emotions such as sadness, curiosity, and disappointment could be used when reading the passages. The first example's speech was synthesized without any markup at all, and therefore included Festival's intonation unchanged. For the second example, the text was marked up by hand using a variety of high-level speech emotion tags, and lower-level speech tags such as rate, pitch and emph (see Section 5.8).
Section 4 - General Information
The last section ascertained whether the participant had heard of the term speech synthesis before, and whether they had seen a Talking Head before. The section also aimed to elicit comments about the speech examples the participant had heard, and about the Talking Head they had seen.
screen at the front of the theatre. The room also had an adequate sound system, enabling each participant to listen to the demonstration comfortably. A total of 45 participants took part in the demonstration, a number large enough to produce adequate data.

The demonstration began by very briefly introducing the participants to what the project was about. They were told that by filling out the questionnaire they would be helping to evaluate how well the project addressed the problem it had set out to solve. Nevertheless, it was made clear to all participants that it was not compulsory for them to sit the demonstration. The participants were given an overview of what to expect in the questionnaire, including the fact that they would be played a number of sound files and asked to comment on each one. It was emphasized, however, that it was not they who were being tested, but the program itself, and that there were no right or wrong answers. The participants were asked to fill out Section 1 of the questionnaire. They were told the relevance of the questions being asked in that section, but also that they were under no obligation to answer questions they did not feel comfortable with. The participants were encouraged to ask questions to clarify any details they felt had not been made clear.

The sound and movie demonstrations of Sections 2 and 3 had been pre-rendered and made part of a Microsoft PowerPoint presentation that was shown on the large screen at the front of the lecture theatre. The reason for this was to avoid waiting for the utterances to be generated, and to provide a way of showing the audience the current section and example number being played. It also minimized the risk of anything going wrong. Parts A and B of Section 2 consisted of the test utterances described in Section 6.1.1 being played. For each test utterance, the sound file was played twice and the participants were given time to choose which emotion best suited how the speaker sounded. In order to give the participants a chance to listen to the test utterances again and confirm their choices, the five most recent utterances were repeated every fifth example. The repeating of the utterances also served as a mental break for the participants.

The next section of the questionnaire consisted of the two Talking Head examples described in Section 6.1.1. The participants were asked to first watch both of the examples, and then fill out that part of the questionnaire. Before the examples were played, the participants were asked to scan the four questions asked in that section, so as to aid them in what they should look for: the Talking Head's clarity, expressiveness, naturalness, and appeal. However, the participants were not asked to focus on the voice only, even though it was only the voice that changed in the two examples. The participants were then asked to fill out the rest of the questionnaire, which consisted of more general questions (see Appendix D).
It is important to note that some factors did not allow the evaluation to be carried out under ideal conditions. For instance, the very fact that the demonstration was held with a group of participants may have been a cause of distraction for some. A more ideal arrangement would have been for each participant to sit the questionnaire one at a time, in front of a computer, without anyone else in the room. This would have minimized distractions and would have allowed each participant to go at his or her own pace. Given the time and resources available, however, this was not possible. Still, the demonstration was designed to be short enough to keep the participants' attention, to give each participant the ability to review their answers through playbacks, and was conducted in such a way as to give the participants ample time to answer the questions.
In addition to the statistics shown in the table, it should be noted that all participants were students enrolled in a second year Computer Science introductory graphics unit. All participants were computer literate and used computers at home, at school, and for work. 71.1% had heard of the term speech synthesis before, while 26.7% had not (2.2% unknown). Also, 82.2% had seen a Talking Head before, while 15.6% had not (2.2% unknown).
For example, in Table 10, the first row of example percentage values shows that when an utterance was combined with happy vocal emotion effects generated by the system, 42.2% of listeners perceived the emotion as happy (i.e. correct recognition took place), 2.2% perceived it as sounding sad, 6.7% angry, 17.9% neutral, 23.7% surprised, 4.4% disgusted, and 2.9% specified something else. The data in the example would therefore indicate that happiness was the most recognized emotion for that utterance, and that the utterance was mostly confused with surprise and neutral. Note that the Other category also includes those participants who could not decide which emotion was being portrayed.

5 The intended meaning is not that the participants failed to recognize the simulated emotion through a fault of their own, but simply that the participant's choice did not match the emotion being simulated.
PERCEIVED EMOTION
Stimulus | Happy | Sad  | Angry | Neutral | Surprised | Disgusted | Other
Happy    | 42.2% | 2.2% | 6.7%  | 17.9%   | 23.7%     | 4.4%      | 2.9%
From the above example, it can be seen that the data should be read in rows; each row represents an utterance or group of utterances simulating a specific emotion, and the cell values for that row show the distribution of the participants' responses. Cells that have the same row and column names hold values representing cases where the listener's perceived emotion matched the emotion being simulated; any other cell represents incorrect recognition. A table holding values such as those in Table 11 would therefore show ideal data: 100% recognition of the simulated speech emotion for all emotions.
PERCEIVED EMOTION
Stimulus | Happy | Sad  | Angry | Neutral | Surprised | Disgusted | Other
Happy    | 100%  | 0.0% | 0.0%  | 0.0%    | 0.0%      | 0.0%      | 0.0%
Sad      | 0.0%  | 100% | 0.0%  | 0.0%    | 0.0%      | 0.0%      | 0.0%
Angry    | 0.0%  | 0.0% | 100%  | 0.0%    | 0.0%      | 0.0%      | 0.0%
Neutral  | 0.0%  | 0.0% | 0.0%  | 100%    | 0.0%      | 0.0%      | 0.0%
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions.
Of course, this is virtually impossible, since it is well documented in the literature that recognition rates are far from perfect even when humans speak (Malandra, Barker and Barker, 1989). Knapp (1980) identifies three main factors that explain why this is so:
1. People vary in their ability to express their emotions (not only in speech, but also in other forms of communication such as facial expressions and body language).
2. People vary in their ability to recognize emotional expressions. In a study described in Knapp (1980), listeners ranged from 20% correct to over 50% correct.
3. Emotions themselves vary in how well they can be correctly recognized. Another study showed that anger was identified 63% of the time, whereas pride was identified only 20% of the time.
So if obtaining such ideal data for human speech is unrealistic, this is even more the case when emotion is being simulated in synthetic speech.
PERCEIVED EMOTION
Stimulus | Happy | Sad  | Angry | Neutral | Surprised | Disgusted | Other
Happy    | 18.9% | 3.3% | 3.3%  | 41.1%   | 22.2%     | 6.7%      | 4.4%

Table 12 - Listener response data for neutral phrases spoken with happy emotion.
Similarly, Table 13 shows listener response data for all four emotions demonstrated in the test utterances for Section 2A. Significant values are displayed in a larger font.
Table 13 - Listener response data for all four emotions demonstrated in the test utterances of Section 2A (recognition rates: happy 18.9%, sad 77.8%, angry 44.4%, neutral 54.4%).
The following observations can be made from the data for each emotion (including happy, which has already been discussed).
a) Happy. A poor recognition rate (18.9%), with the stimulus being strongly confused with neutral (41.1%) and surprise (22.2%).
b) Sad. A very high recognition rate occurred for sadness (77.8%), with little confusion occurring with other possible emotions.
c) Angry. A relatively high percentage of listeners recognized the angry stimulus (44.4%), but a considerable number confused anger with neutral and disgust (21.1% each).
d) Neutral. Most listeners correctly recognized when the utterance was played without emotion (54.4%), but a significant portion (23.3%) perceived the emotion as sad.

The Analysis
Any percentage value substantially greater than 14% was deemed significant. This is based on the logic that if all participants had randomly chosen one of the seven emotions, then a particular emotion would have a 1/7 (~14%) chance of being chosen. Therefore, if a cell has a percentage substantially greater than 14%, there must have been a factor (or factors) that influenced the listeners' choice. Except for happy, all emotions had a recognition rate greater than 14%. However, sadness was the only emotion that enjoyed little confusion with other emotions. So, with the exception of sadness, average recognition of the simulated emotion in the utterances was quite low. To give a possible explanation for these values, it will be helpful to look at the data for each question separately. Table 14 shows listener response data for Question 1 of Section 2A, and Table 15 shows listener response data for Question 2.
Table 14 - Listener response data for Section 2A, Question 1.

Table 15 - Listener response data for Section 2A, Question 2.
It is very obvious that the two tables hold substantially different data values. Whereas for Question 1 the happy stimulus received a recognition rate of only 6.7%, with 57.8% mistaking it for neutral, this changed dramatically for Question 2, with 31.1% of listeners correctly identifying the happy emotion and only 24.4% classifying the utterance as neutral. There is still considerable confusion for the happy stimulus even in Question 2 (surprise is 31.1%), but the difference between the two questions' recognition rates is too large to ignore. Therefore it is evident that the results in this section were very much utterance dependent. From the data, it seems that the phrase "The telephone has not rung at all today" (Question 1) was said more effectively in an angry tone than "I received my assignment mark today" (Question 2), and this was despite the care that was taken to choose neutral test phrases. Both Cahn (1990) and Murray and Arnott (1995) report this problem of utterance-dependent results, and it demonstrates the difficulty of obtaining quantitative results on speech emotion recognition. An interesting observation can be made by noting when a particular stimulus received strong listener recognition and looking at the emotion in the same row with the next highest response. For example, anger received strong recognition for Question 1 (see Table 14). For that
utterance, the emotion that anger was most confused with was disgust, which received 20.0%. Close scrutiny of the two tables will also reveal that when happiness was strongly recognized, the emotion it was most confused with was surprise (see row 1 of Table 15). Also, neutral in Table 14 was most confused with sad. The pattern that emerges is the one described in the literature: the pairs that are most often confused with each other are happiness-surprise, sadness-neutral, and anger-disgust. This is because the speech correlates identified in the literature are very similar for these emotion pairs. As a consequence, emotion recognition will often be confused between these emotion pairs, especially with neutral text. From Table 14 and Table 15, it can also be seen that, generally, recognition of the simulated emotions improved for Question 2. This could be due to the listeners becoming accustomed to the synthetic voice and learning to distinguish between the emotions. It could also have been due to the listeners, who were all students, relating more to the phrase of the second question, which was about receiving an assignment mark. This could be clarified with further tests.
In this part of the analysis, the data will be presented for each of these four test utterance types.
Table 16 shows listener responses for utterances with emotionless text with no vocal emotion; that is, utterances whose text was emotionally indistinguishable from just the words alone, and spoken with no vocal emotion effects. The text phrases used for this section are phrases 1-5 shown in Appendix E.
PERCEIVED EMOTION
Stimulus | Happy | Sad   | Angry | Neutral | Surprised | Disgusted | Other
Neutral  | 2.2%  | 36.9% | 2.7%  | 44.4%   | 3.1%      | 6.7%      | 4.0%
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion.
The observation that can be made from the data is that there was a strong recognition that the utterances were spoken with no emotion. However, the utterances were also confused with sadness, which received a strong listener response (36.9%). Again, as in the previous section of the questionnaire (discussed in Section 6.2.2), it was found that there was a great deal of variation in listener perception that was dependent on the utterance being spoken. For instance, one of the phrases in this section was "The telephone has not rung at all today". The majority (73.3%) of listeners perceived the speaker as being sad, while only 20.0% thought it sounded neutral. In contrast, the phrase "I have an appointment at 2 o'clock tomorrow" was perceived as more neutral (62.2% for neutral, 20.0% for sad). The point being emphasised is that although care was taken to choose emotionally undetermined phrases, the data suggests that this is a very difficult task. Our dependency on context to discriminate emotions is stated in Knapp (1980), and has been confirmed in this evaluation. Albeit in varying degrees, the general trend for this subsection was that the neutral voice was often perceived as sounding sad. This suggests that Festival's intonation, which is modeled to be neutral, may have an underlying sadness. Interestingly, Murray and Arnott (1995) made the same observation with the HAMLET system, which makes use of the MITalk phoneme duration rules described in Allen et al. (1987, Chapter 9).
Emotive text with no vocal emotion
The use of utterances of this type endeavoured to determine the influence emotive text has on listener perception of emotion. Table 17 shows the data accrued for utterances spoken with no vocal emotion, but which contained emotive text (i.e. the speaker's feelings can be approximately determined from the text alone). Phrases 6-10 of Appendix E were used for this subsection.
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion.
The following observations can be made from the data:
a) Row maxima occurred for three emotions corresponding to the emotion in the text (sadness, anger, and neutral). This shows that for these emotions, a good proportion of the listeners relied on the words in the utterance to distinguish between emotions. However, a significant amount of confusion occurred for sadness and anger; the sadness stimulus was perceived as neutral by 37.8% of listeners, and the anger stimulus was confused with neutral (33.3%) and disgusted (20.0%).
b) The phrase containing happy text was very much perceived as having no emotion. It is unclear why this emotion's recognition suffered more than the others did; perhaps happiness heavily requires confirmation in the speaker's voice for it to be identified as such.
The significant confusion occurring with happiness, sadness, and anger suggests that although emotion perception can be strongly influenced by the words in the utterance, the lack of vocal emotion in the voice meant that the utterances sounded unconvincing, and so emotion recognition was not very strong.
Emotionless text with vocal emotion
The confusion matrix for phrases containing emotionless text (phrases 1-5 of Appendix E) spoken with vocal emotion is shown in Table 18; that is, utterances whose text was emotionally indistinguishable from the words alone, but which were spoken with various vocal emotion effects.
PERCEIVED EMOTION
Stimulus | Happy | Sad   | Angry | Neutral | Surprised | Disgusted | Other
Happy    | 24.4% | 1.1%  | 0.0%  | 22.2%   | 40.0%     | 2.2%      | 10.0%
Sad      | 0.0%  | 72.2% | 0.0%  | 17.8%   | 0.0%      | 2.2%      | 7.8%
Angry    | 0.0%  | 0.0%  | 91.1% | 2.2%    | 0.0%      | 4.4%      | 2.2%
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion.
The following observations can be made from the data:
a) The row maximum for the happiness stimulus was for the surprised emotion (40.0%), while 24.4% thought it sounded happy and 22.2% thought it sounded neutral. Though this is a low listener response for happiness, the high combined response for the happiness and surprised emotions seems to suggest that the utterance had a pleasant tone. That listeners wrote descriptive terms (for the Other option) such as content, informative, proud, and curious seems to confirm this.
b) A high correct recognition rate occurred for the sad stimulus (72.2%). Other listeners gave their own descriptive terms such as lethargic, disappointed, and sleepy.
c) The anger stimulus received a very high recognition rate (91.1%), with very little confusion occurring with other emotions.
Bearing in mind that the utterances for this subsection contained exactly the same text as the neutral text, neutral voice utterances (shown in Table 16), the strong effect vocal emotion has on a listener's perception of emotion in an utterance can be clearly seen. Whereas most listeners chose either neutral or sad for the neutral text, neutral voice utterances, Table 18 shows a much better overall emotion recognition for neutral text, emotive voice utterances.
Emotive text with vocal emotion
The confusion matrix for phrases containing emotive text (phrases 6-10 of Appendix E) spoken with vocal emotion is shown in Table 19; that is, utterances whose text alone gave an indication of the speaker's emotion, and which were further spoken with the appropriate vocal emotion.
Table 19 - Listener responses for utterances containing emotive text with vocal emotion.
The following observations can be made from the data:
a) As predicted, the confusion matrix shows a strong average recognition rate for all simulated emotions (seen in the high values down the diagonal, where row and column names match).
b) Confusion continued to occur between emotions, but was not as significant as for the other utterance types.
Both of the above observations indicate that correct emotion perception is taking place, upholding the hypotheses related to this section (see Section 4.1). Important to note is the reoccurrence of the confusion pairs mentioned in Section 6.2.2: sad with neutral, and angry with disgusted. Happiness, in Table 19 at least, was not confused much with surprise; however, this may be due to the utterances not lending themselves to this confusion, as surprise is certainly being confused with happiness in Table 18. Interestingly, some listeners complained in the general comments section that it was very difficult to hear disgust or surprise in the speech utterances (the reader will recall that disgust and surprise were not being simulated, and were included in the questionnaire selection list as distractors). This may suggest that some listeners sometimes chose disgust or surprise because they felt the emotion had to come up sooner or later. Happiness for this section enjoyed a high recognition rate after suffering low recognition rates for the other utterance types. For the neutral text, emotive voice test utterances, happy utterances were perceived as having a pleasant tone, but were not explicitly identified as happiness. The high recognition rate for happiness for emotive text, emotive voice utterances suggests that the emotive text helped clarify the utterance as not just being pleasant but happy.
By studying the two confusion matrices in Table 16 and Table 18, it can be seen that emotion recognition was enhanced for neutral phrases spoken with vocal emotion compared to the same text spoken without vocal emotion. The two confusion matrices, however, show only the number of listeners who correctly or incorrectly recognized the simulated emotion. What they do not show is the number of listeners who improved with the addition of vocal emotion effects. To determine the effect of vocal emotion, the data should therefore be filtered so as not to count listeners who correctly recognized the intended emotion both when the utterance was spoken without vocal emotion and when it was spoken with vocal emotion. In order to address this, further analysis was carried out on the data to determine the effect the addition of vocal emotion had on listener emotion recognition. The analysis kept track of listeners who improved in their recognition of the intended emotion, and also of listeners whose recognition deteriorated when the utterance was spoken with vocal emotion. Table 20 shows, for each simulated emotion, the percentage of listeners who incorrectly recognized the intended emotion when it was spoken without vocal emotion and who then improved in their recognition when the utterance was spoken with vocal emotion. Similarly, Table 21 shows the percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects.
Table 20 - Percentage of listeners who improved in emotion recognition with the addition of vocal emotion effects for neutral text.
Table 21 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for neutral text.
The above results show that an overall significant increase in emotion recognition occurred with the introduction of vocal emotion effects in a neutral utterance; this was true for all the simulated emotions. Anger gained the largest increase in emotion recognition (84.4%). This suggests that if a neutral text utterance is to be perceived as being spoken in anger, then its perception is very much dependent on the vocal emotion effects; without vocal emotion, it is simply very difficult to perceive the speaker as being angry. Deterioration for sadness was higher than for the other two emotions. The reason for this is not clear, except that confusion between sadness and neutrality occurred with all utterance types.
Table 22 - Percentage of listeners whose emotion recognition improved with the addition of vocal emotion effects for emotive text.
Table 23 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for emotive text.
From the above results, it can be seen that all emotions received a significant increase in emotion recognition once vocal emotion was added to the utterance, with the greatest improvement occurring for happiness (57.8%). Anger did not improve as much with emotive text as it did with neutral text. This shows that emotive text also had a strong influence on determining the correct emotion being simulated. As with Table 21, Table 23 shows that the deterioration in recognition for sadness was significantly higher than for the other emotions; sadness-neutral confusion was a problem throughout all the tests. Still, the substantial effect of vocal emotion, even on emotive text phrases, can be clearly seen through this analysis.
participants. One discrepancy was found, however, with differences too significant to overlook: it was noted from the data that participants who did not speak English as their first language confused sadness with neutrality significantly more often than participants who did speak English as their first language. This was noted for both neutral text, emotive voice and emotive text, emotive voice type utterances. Table 24 shows listener responses for the sadness stimulus for participants who speak English as their first language, while Table 25 shows the same listener responses for participants who do not speak English as their first language. Both tables are for neutral text, emotive voice utterance types.
PERCEIVED EMOTION
Stimulus | Happy | Sad   | Angry | Neutral | Surprised | Disgusted | Other
Sad      | 0.0%  | 82.7% | 0.0%  | 5.8%    | 0.0%      | 1.9%      | 9.6%

Table 24 - Listener responses for participants who speak English as their first language. Utterance type is neutral text, emotive voice.
PERCEIVED EMOTION
Stimulus | Happy | Sad   | Angry | Neutral | Surprised | Disgusted | Other
Sad      | 0.0%  | 55.3% | 0.0%  | 34.2%   | 0.0%      | 2.6%      | 7.9%

Table 25 - Listener responses for participants who do NOT speak English as their first language. Utterance type is neutral text, emotive voice.
The pattern is also reflected in the following data for emotive text, emotive voice utterance types (Table 26 and Table 27).
PERCEIVED EMOTION
Stimulus | Happy | Sad   | Angry | Neutral | Surprised | Disgusted | Other
Sad      | 0.0%  | 76.9% | 3.8%  | 11.5%   | 0.0%      | 0.0%      | 0.0%

Table 26 - Listener responses for participants who speak English as their first language. Utterance type is emotive text, emotive voice.
PERCEIVED EMOTION
Stimulus | Happy | Sad   | Angry | Neutral | Surprised | Disgusted | Other
Sad      | 0.0%  | 42.1% | 5.3%  | 42.1%   | 0.0%      | 0.0%      | 10.5%

Table 27 - Listener responses for participants who do NOT speak English as their first language. Utterance type is emotive text, emotive voice.
The significant difference in emotion perception for sadness suggests that cultural issues are coming into play; this is an issue that has received a good deal of interest in the non-verbal communication literature (Knapp, 1980; Malandra, Barker and Barker, 1989). The high confusion between sadness and neutrality for participants without English as their first language may account for the high deterioration values for sadness seen in Table 21 and Table 23. It is perhaps comforting that when confusion did occur for sadness, neutral was the emotion chosen, which is consistent with Murray and Arnott (1995). It would be alarming indeed if the data showed that sadness could easily be confused with, say, disgust, as the consequences in real-life communication would be disastrous.
Understandability. If one Talking Head version was determined to be generally easier to understand than the other, then it can be assumed that listener comprehension would benefit.
Expressiveness. The Talking Head version determined to be better able to express itself would communicate its feelings, the mood of a story, the seriousness/lightheartedness of information, etc. more effectively.
Naturalness. Lester and Stone (1997) show that the believability of pedagogical agents (an application of Talking Heads) plays an important role in user motivation. Along with Bates (1994), it is shown that life-like quality and the display of complex behaviours (including emotion) is a crucial factor for believability.
Interest. Users who find a speaker interesting will give their attention to what is being said. If this occurs, the speaker has the opportunity to communicate his or her ideas better.

Understandability
Table 28 shows the responses of participants who compared how well they could understand the Talking Head in each example, and chose which one they felt was easier to understand.
Table 28 - Participant responses when asked to choose the Talking Head that was more understandable.
The data shows that the overwhelming majority of participants thought the second Talking Head demonstration (the one that included vocal expression) was easier to understand. From participants' justifications for their choice, it was apparent that not only did most participants think the second demonstration was easier to understand, but many also thought the first demonstration was very difficult to understand. Participants wrote that the second demonstration was easier to understand because it spoke more slowly and because it had better expression in its voice. In fact, the speech rate of the second demonstration was not slowed; rather, more pauses were present between sentences and character dialogues. This highlights the importance that silence has in our verbal communication. Although it could be argued that the second demonstration was perceived as easier to understand simply because the participants were hearing the story for the second time, the fact that the reasons given for their choice were so widely echoed shows that the voice was a determining factor.
Expressiveness
Table 29 shows the responses of participants when asked which Talking Head demonstration seemed best able to express itself. Again, the second demonstration was overwhelmingly favoured over the first. Motivators for favouring the second demonstration were that the storytelling was perceived to be better structured and that there was more variation in its tone. Many participants commented on how the variability and changes in pitch helped the turn-taking of the characters and helped to "distinguish ideas behind [the] statements" in the story. Still others commented on the dynamic use of volume, and that the tone of the second demonstration was more appropriate for the story.
Table 29 - Participant responses when asked which Talking Head seemed best able to express itself.
Another factor stated by most participants who favoured the second demonstration was that the vocal and facial expressions were better synchronised, and that the vocal expressions stood out more. The synchronisation of vocal expression and facial gestures is deemed to be very important for communication by Cassell et al. (1994a, 1994b); it is encouraging that most participants were able to notice this and view it as desirable. Interestingly, participants who did not favour the second demonstration of the Talking Head commented that it was too expressive (in a caricatured way).
Naturalness
Table 30 shows the responses of participants when asked which Talking Head seemed more natural. The main reason given by the majority of participants who voted for the second demonstration was that the speech now matched the facial expressions better, and so the Talking Head seemed more realistic. For this question, a considerable number of participants commented that the mouth and lip movements were unconvincing and therefore needed more work for the Talking Head to seem realistic.
Table 30 - Participant responses when asked which Talking Head seemed more natural.
Interest
When asked which Talking Head demonstration seemed more interesting, participants gave a variety of reasons for their choice. Most participants opted for the second demonstration, and Table 31 shows the data for this question.
Table 31 - Participant responses when asked which Talking Head seemed more interesting.
Many participants wrote that because the first demonstration was difficult to understand, they lost interest in what it was saying. Conversely, the second demonstration was easier to understand and, as a result, they didn't have to concentrate as much. Others wrote that the Talking Head in the second demonstration seemed more alert and that it was happier. Several participants commented that the second demonstration was more interesting because it seemed to have more human qualities, and that the facial expressions were noticed more because of the improved speech. From the results shown in this section, it is safe to say that the overwhelming majority of participants favoured the second demonstration of the Talking Head over the first. What is gratifying is that the participants weren't told to focus on the Talking Head's voice but rather to consider it as a whole, and yet most people attributed the improvement between the two examples to the vocal expression of the Talking Head. An important note to make, however, is that these results were obtained by comparing two versions of the Talking Head: with and without simulated vocal emotion. Further tests would need to be done to quantitatively determine how effective a communicator the Talking Head is with its improved speech. What is important is that the results have shown that, with vocal emotion, the Talking Head is a better communicator, on the grounds that it has enhanced understandability, expressiveness, naturalness, and user interest.
6.4 Summary
This chapter briefly described how the evaluation research methodology was employed to test the hypotheses stated in Section 4.1. The TTS module was tested for how well listeners could recognize the simulated emotions, and the data was seen to largely support the stated hypotheses. The variables that can affect speech emotion recognition were investigated and discussed, and the test data was also seen to support both previous work in the field of synthetic speech emotion (namely Murray and Arnott, 1995) and the general literature from the fields of paralinguistics and non-verbal behaviour. The last hypothesis of Section 4.1 was also tested, to investigate whether a Talking Head is able to communicate information more effectively when it is given the capability of speech expression. Through a series of demonstrations, this testing showed that viewers overwhelmingly rated the Talking Head with vocal expression as much easier to understand, more expressive, more natural, and more interesting to look at and listen to. It is proposed that these factors contribute to making the Talking Head a more effective communicator.
Figure 27 - A node carrying waveform processing instructions for an operation (node fields: Operation, Parameters, Start/End).
[Figure: the SML document supplies tags, text/phonemes and processing directives; phoneme data is modified, and the waveform is turned into a modified waveform.]
For instance, one of the characteristics of a dynamic speaking style is a faster rate of speech, and studies have shown that listeners ascribe higher ratings of intelligence, knowledge, and objectivity to such a speaker (Miller, 1976, as referenced in Malandra, Barker and Barker, 1989). Conversely, a speaker adopting a conversational style (characterized by a more consistent rate and pitch) is rated as more trustworthy, better educated, and more professional by listeners (Pearce and Conklin, 1971, as referenced in Malandra, Barker and Barker, 1989). If these results can be reproduced, this will clearly have major repercussions for a Talking Head. An interesting way that speaking styles could be specified is through style sheets, using Extensible Stylesheet Language (XSL) technology in the same spirit as Cascading Style Sheets (CSS). The SML markup itself would not need to be changed to adopt a different speaking style; only the XSL document that defines the style and voice of the speaker would change. This reinforces one of the benefits of XML stated earlier: the XML file describes the data and does not force the presentation of the data.
[Figure: an SML file whose prolog contains an <?xml-stylesheet ...?> link to a separate stylesheet document with an <xsl:stylesheet ...> root element; the stylesheet, not the SML markup, defines the speaking style.]
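As a rough sketch of how such a link might look (the stylesheet file name and its contents are assumptions for illustration, not part of the current SML implementation), the SML document would reference the stylesheet through the standard xml-stylesheet processing instruction, while the speaking-style definitions would live in the stylesheet itself:

<?xml version="1.0"?>
<!-- "conversational.xsl" is an illustrative file name -->
<?xml-stylesheet type="text/xsl" href="conversational.xsl"?>
<sml>
  <p>
    <neutral>The markup stays the same; only the stylesheet changes the speaking style.</neutral>
  </p>
</sml>

<!-- conversational.xsl (skeleton only) -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- template rules mapping SML elements onto low-level rate and pitch settings would go here -->
</xsl:stylesheet>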
c) It could well be that work in this area will find that speech correlates are not as important for discriminating similar complex emotions (e.g. pride and satisfaction), and that it is rather what we say that provides the best cues to such speaker emotions. This is upheld by Knapp (1980), who states that, as we develop, we rely on context to discriminate between emotions with similar characteristics.
[Figure: proposed XML handling architecture in which the Brain contains a central XML Handler; incoming XML data (SML, FAML, etc.) is routed to the TTS, FAML and other modules, each producing its own output (TTS output, FAML output, other output).]
A complete re-think of how XML data is handled is needed if XML is to play a major role in the way the Talking Head processes information (which is what we believe will eventuate). The framework needs to be put in place as soon as possible in order to aid the implementation of future modules, so that handling of the input is done in a controllable manner and the full potential of XML is realized.
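Purely as an illustration of the kind of combined input such a framework might accept (the wrapper element name and the FAML fragment are assumptions, not part of the current implementation), a single document could carry markup for several modules, with the XML Handler routing each branch to the appropriate module:

<!-- hypothetical combined input; only the sml branch reflects the implemented markup -->
<head-input>
  <sml>
    <p><happy>Hello, and welcome back.</happy></p>
  </sml>
  <faml>
    <!-- facial animation markup, handled by the FAML module -->
  </faml>
</head-input>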
Figure 31 - Proposed design of TTS Module architecture to minimize bandwidth problems between server and client. [Diagram: on the server side, the TTS Module performs SML processing on the text to be rendered; on the client side, a TTS Module produces the waveform and a Viseme Generator produces the visemes.]
The idea of sending text to the client instead of a waveform is not new, but other solutions state that to do this the entire TTS module should be on the client side: the text to be rendered is sent to the client, and the speech is fully synthesized there. However, a complex NLP module such as Festival's is quite large, and it would be very undesirable to force users to install such an application on their systems. The architecture in Figure 31 shows a system where all phoneme transcription and SML tag processing is still carried out on the server. Instead of producing a waveform and sending it to the client, however, the phoneme information is sent across in MBROLA's input format (described in Section 5.9) to the TTS Module on the client side, where the waveform is produced by the DSP. The MBROLA synthesizer is not a very large program and has binaries available for most platforms. Note that although the diagram shows the Festival and MBROLA systems, any synthesizer that can deal with phoneme information at the input and output level can be used.
Chapter 8 Conclusion
This thesis has focused on the primary goal of simulating emotional speech for a Talking Head. The objectives reflected this by aiming to develop a speech synthesis system that is able to add the effects of emotion to speech, and to implement this system as the TTS module of a Talking Head. The literature was explored, and investigation of research in the fields of non-verbal behaviour, paralinguistics, and speech synthesis allowed (with some confidence) the following hypotheses of Section 4.1 to be formed:
1. The effect of emotion on speech can be successfully synthesized using control parameters.
2. Through the addition of emotive speech:
a) Listeners will be able to correctly recognize the intended emotion being synthesized.
b) Information will be communicated more effectively by the Talking Head.
The review of the literature also helped to both identify and provide possible solutions to the subproblems stated in Section 2.2, namely the need to develop a speech markup language and the need to synchronize speech and facial gestures. Throughout the entire design and implementation phase of the TTS module, the literature was a solid basis upon which design decisions were made. For instance, SML's speech emotion tags implemented the speech correlates of emotion found in the literature and in previous work on synthetic speech emotion, chiefly by Murray and Arnott (1995) and Cahn (1990). The evaluation of the TTS module was an integral part of this thesis. Therefore, an evaluative research methodology was employed to determine whether the system supported or disproved the stated hypotheses. In order for testing to be carried out in a clear and
concise way that could be directly linked with the hypotheses, the questionnaire-based evaluation process was organized into two main sections: a) Speech emotion recognition, and
b) The extent of the effect vocal expression has on a Talking Head's ability to communicate.
The results from the speech emotion recognition sections support hypotheses 1 and 2a; that is, strong recognition of the simulated emotions took place, and so the simulation of emotive speech was successful. The evaluation was able to show that a good recognition rate of emotion occurred, comparable to that described in the literature for both synthetic speech emotion and human speech. The investigation of different combinations of neutral/emotive text and neutral/emotive vocal expression was important in that it helped identify the importance of both the words we decide to use and how we choose to say them. Importantly, the results in Chapter 6 are seen to confirm the literature, both in terms of the effects of the different variables involved (words and vocal expression) and in terms of the emotions that are often confused with one another. As found in Murray and Arnott (1995) and Cahn (1990), the results were seen to be very utterance dependent, and this may have contributed to both correct recognition and confusion between emotions. The reason for confusion between certain emotions is reported in the literature to depend on many factors, including the text of the phrase and its context, the speaker's voice, and who is hearing the utterance, which brings gender and cultural issues into play (Malandra, Barker and Barker, 1989; Knapp, 1980). Results of the Talking Head experiments also proved to be positive, with the overwhelming majority of viewers stating that the Talking Head was much easier to understand, more expressive, more natural, and more interesting when vocal expression was added to its speech. It would be very interesting to investigate listener emotion recognition rates for the same utterances spoken by the Talking Head itself; this way, the importance of the combination of the visual channel (facial gestures) and the audio channel (speech) could be studied. There is much work yet to be done in the field of synthetic speech emotion, with Chapter 7 identifying a number of key areas that should be investigated. The main limitation of the project was the inability to implement dynamic volume control and the speech correlates affecting voice quality (such as nasality, hoarseness, and breathiness). Notwithstanding these limitations, this thesis has been able to demonstrate the effect synthetic speech emotion has on user perception when it is applied to a Talking Head. It is envisaged that it would bring similar benefits to other applications using speech synthesis. It is through this demonstration that the significance and value of this project is seen to have been confirmed.
Bibliography
Allen, J., Hunnicutt, M. S. and Klatt, D. (1987). From Text to Speech: The MITalk System, Cambridge University Press, Cambridge.
Bates, J. (1994). The Role of Emotion in Believable Agents, Communications of the ACM, vol. 37, pp. 122-125.
Beard, S. (1999). FAQBot, Honours Thesis, Curtin University of Technology, Bentley, Western Australia.
Beard, S., Crossman, B., Cechner, P. and Marriott, A. (1999). FAQbot, Proceedings of Pan Sydney Area Workshop on Visual Information Processing, Nov 1999, University of Sydney, Australia.
Beard, S., Marriott, A. and Pockaj, R. (2000). A Humane Interface, OZCHI 2000 Conference on Human-Computer Interaction: Interfacing reality in the new millennium, 4-8 Dec 2000, Sydney, Australia.
Black, A. W., Taylor, P. and Caley, R. (1999). The Festival Speech Synthesis System: System Documentation, Edition 1.4, for Festival Version 1.4.0, 17 June 1999. [Online]. Available: www.cstr.ed.ac.uk/projects/festival/manual/festival_toc.html.
Bos, B. (2000). XML in 10 points. [Online]. Available: http://www.w3.org/XML/1999/XML-in-10-points.
Bosak, J. (1997). XML, Java, and the Future of the Web. [Online]. Available: http://www.webreview.com/pub/97/12/19/xml/index.html.
Cahn, J. E. (1988). From Sad to Glad: Emotional Computer Voices, Proceedings of Speech Tech '88, Voice Input/Output Applications Conference and Exhibition, April 1988, New York City, pp. 35-37.
Cahn, J. E. (1990). The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, vol. 8, pp. 1-19.
Cahn, J. E. (1990b). Generating expression in synthesized speech, Technical Report, Massachusetts Institute of Technology Media Laboratory, MA, USA.
Carroll, L. (1946). Alice's Adventures in Wonderland, Random House, New York.
Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S., Becket, T. and Achorn, B. (1994a). Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents, in Proceedings of ACM SIGGRAPH 94.
Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S. and Achorn, B. (1994b). Modeling the interaction between speech and gestures, in Proceedings of the 16th Annual Conference of the Cognitive Science Society, pp. 119-124.
Cechner, P. (1999). NO-FAITH: Transport Service for Facial Animated Intelligent Talking Head, Honours Thesis, Curtin University of Technology, Bentley, Western Australia.
Cover, R. (2000). The XML Cover Pages. [Online]. Available: http://www.oasis-open.org/cover/xml.html.
Davitz, J. R. (1964). The Communication of Emotional Meaning, McGraw-Hill, New York.
Dellaert, F., Polzin, T. and Waibel, A. (1996). Recognizing Emotion in Speech, Proceedings of ICSLP 96, the 4th International Conference on Spoken Language Processing, 3-6 October 1996, Philadelphia, PA, USA.
Document Object Model (DOM) Level 1 Specification (Second Edition) (2000). [Online]. Available: http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929.
Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers.
Extensible Markup Language (XML) 1.0 (1998). [Online]. Available: http://www.w3.org/TR/1998/REC-xml-19980210.
Galanis, D., Darsinos, V. and Kokkinakis, G. (1996). Investigating Emotional Speech Parameters for Speech Synthesis, International Conference on Electronics, Circuits, and Systems (ICECS 96), pp. 1227-1230.
Graham, I. S. and Quinn, L. (1999). XML Specification Guide, Wiley Computing Publishing, New York.
Hallahan, W. I. (1996). DECtalk Software: Text-to-Speech Technology and Implementation, Digital Technical Journal.
Heitzeman, J., Black, A. W., Mellish, C., Oberlander, J., Poesio, M. and Taylor, P. (1999). An Annotation Scheme for Concept-to-Speech Synthesis, Proceedings of the European Workshop on Natural Language Generation, Toulouse, France, pp. 59-66.
Hess, U., Scherer, K. L. and Kappas, A. (1988). Multichannel Communication of Emotion, in Facets of Emotion: Recent Research, edited by Klaus R. Scherer, Lawrence Erlbaum Associates, Publishers, Hillsdale, New Jersey.
Huynh, Q. (2000). Evaluation into the validity and implementation of a virtual news presenter for broadcast multimedia, Honours Thesis (yet to be published), Curtin University of Technology, Bentley, Western Australia.
Knapp, M. L. (1980). Essentials of Nonverbal Communication, Holt, Rinehart and Winston, pp. 203-229.
Lester, J. C. and Stone, B. A. (1997). Increasing Believability in Animated Pedagogical Agents, in First International Conference on Autonomous Agents, Marina del Rey, CA, USA, pp. 16-21.
Levy, Y. (2000). WebFAITH (Web-based Facial Animated Interactive Talking Head), Honours Thesis, Curtin University of Technology, Bentley, Western Australia.
libxml (2000). The XML C library for Gnome. [Online]. Available: http://www.xmlsoft.org.
Malandra, L., Barker, L. and Barker, D. (1989). Nonverbal Communication, 2nd edition, Random House, pp. 32-50.
Mauch, J. E. and Birch, J. W. (1993). Guide to the Successful Thesis and Dissertation, 3rd edition, Marcel Dekker, New York.
MBROLA Project Homepage (2000). [Online]. Available: http://tcts.fpms.ac.be/synthesis/mbrola.html.
Microsoft (2000). Benefiting From XML. [Online]. Available: http://msdn.microsoft.com/xml/general/benefits.asp.
MPEG (1999). Overview of the MPEG-4 Standard, ISO/IEC JTC1/SC29/WG11 N2725, Seoul, South Korea.
Murray, I. R., Arnott, J. L. and Newell, A. F. (1988). HAMLET: Simulating emotion in synthetic speech, Proceedings of Speech '88, The 7th FASE Symposium (Institute of Acoustics), Edinburgh, August 1988, pp. 1217-1223.
Murray, I. R. and Arnott, J. L. (1993). Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion, Journal of the Acoustical Society of America, February 1993, vol. 2, pp. 1097-1108.
Murray, I. R. and Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech, Speech Communication, vol. 16, pp. 369-390.
Murray, I. R. and Arnott, J. L. (1996). Synthesizing emotions in speech: is it time to get excited?, in Proceedings of ICSLP 96, the 4th International Conference on Spoken Language Processing, Philadelphia, PA, USA, 3-6 October 1996, pp. 1816-1819.
Murray, I. R., Arnott, J. L. and Rohwer, E. A. (1996). Emotional stress in synthetic speech: Progress and future directions, Speech Communication, vol. 20, pp. 85-91.
Ostermann, J., Beutnagel, M., Fischer, A. and Wang, Y. (1998). Integration of Talking Heads and Text-to-Speech Synthesizers for Visual TTS, in Proceedings of the International Conference on Speech and Language Processing (ICSLP 98), Sydney, Australia, December 1998.
Python/XML Howto (2000). [Online]. Available: http://www.python.org/doc/howto/xml.
SAX 2.0 (2000). SAX: The Simple API for XML, version 2. [Online]. Available: http://www.megginson.com/SAX.
Scherer, K. L. (1981). Speech and Emotional States, in Speech Evaluation in Psychiatry, edited by J. K. Darby, Grune and Stratton, New York.
Scherer, K. L. (1996). Adding the Affective Dimension: a New Look in Speech Analysis and Synthesis, Proceedings of the International Conference on Speech and Language Processing (ICSLP 96), Philadelphia, USA, October 1996.
Shepherdson, R. H. (2000). The Personality of a Talking Head, Honours Thesis, Curtin University of Technology, Bentley, Western Australia.
SoftwareAG (2000a). XML Application Examples. [Online]. Available: http://www.softwareag.com/xml/applications/xml_app_at_work.htm.
SoftwareAG (2000b). XML Benefits. [Online]. Available: http://www.softwareag.com/xml/about/xml_ben.htm.
Sproat, R., Taylor, P., Tanenblatt, M. and Isard, A. (1997). A Markup Language for Text-to-Speech Synthesis, in Proceedings of the Fifth European Conference on Speech Communication and Technology (Eurospeech 97), vol. 4, pp. 1747-1750.
Sproat, R., Hunt, A., Ostendorf, M., Taylor, P., Black, A. and Lenzo, K. (1998). SABLE: A Standard for TTS Markup, in Proceedings of the International Conference on Speech and Language Processing (ICSLP 98), pp. 1719-1724.
Stallo, J. (2000). Canto: A Synthetic Singing Voice, Technical Report, Curtin University of Technology, Western Australia. Available: http://www.computing.edu.au/~stalloj/projects/canto/.
Taylor, P. and Isard, A. (1997). SSML: A speech synthesis markup language, Speech Communication, vol. 21, pp. 123-133.
The XML FAQ (2000). Frequently Asked Questions about the Extensible Markup Language. [Online]. Available: http://www.uc.ie/xml/.
SML Tags
angry
Description: Simulates the effect of anger on the voice (i.e. generates a voice that sounds angry). Attributes: none. Properties: Can contain other non-emotion elements. Example:
<angry>I would not give you the time of day</angry>
embed
Description: Gives the ability to embed foreign file types within an SML document such as sound files, MML files etc., and for them to be processed appropriately. Attributes:
type (required): Specifies the type of file that is being embedded. Values: audio (the embedded file is an audio file), mml (an MML file is embedded).
src: If type = audio, gives the path to the audio file. Value: a character string.
music_file: If type = mml, specifies the path of the MML music file. Value: a character string.
lyr_file: If type = mml, specifies the path of the MML lyrics file. Value: a character string.
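Example (illustrative only; the file paths and empty-element form are not from the original):
<!-- illustrative file paths -->
<embed type="audio" src="sounds/doorbell.wav"/>
<embed type="mml" music_file="songs/lullaby.mml" lyr_file="songs/lullaby.lyr"/>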
emph
Description: Emphasizes a syllable within a word. Attributes:
target: Specifies which phoneme in the contained text will be the target phoneme. If target is not specified, the default target is the first phoneme found within the contained text. Value: a character string representing a phoneme symbol (MRPA phoneme set).
level: The strength of the emphasis (default level is weak). Values: weakest, weak, moderate, strong.
affect: Specifies whether the element is to affect the contained text's phoneme pitch values, duration values, or both (default is pitch only). Values: p (pitch only), d (duration only), b (both pitch and duration).
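Example (illustrative wording only; no target is given, so the default target phoneme is used):
<emph level="strong" affect="b">never</emph>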
happy
Description: Simulates the effect of happiness on the voice (i.e. generates a voice that sounds happy). Attributes: none. Properties: Can contain other non-emotion elements. Example:
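The wording below is illustrative, reusing the happiness test phrase from the evaluation questionnaire:
<happy>I have some wonderful news for you.</happy>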
neutral
Description: Gives a neutral intonation to the spoken utterance. Attributes: none. Properties: Can contain other non-emotion elements. Example:
<neutral>I can sometimes sound bored like this.</neutral>
p
Description: Element used to divide text into paragraphs. Can only occur directly within an sml element. The p element wraps emotion tags. Attributes: none. Properties: Can contain all other elements, except itself and sml. Example:
<p> <sad>Today its been raining all day,</sad> <happy>But theyre calling for sunny skies tomorrow.</happy> </p>
pause
Description: Inserts a pause in the utterance. Attributes:
length: Specifies the length of the pause using a descriptive value. Values: short, medium, long.
msec: Specifies the length of the pause in milliseconds. Value: a positive number.
smooth: Specifies whether the last phonemes before this pause need to be lengthened slightly. Values: yes, no (default = yes).
Properties: empty.
Example:
Ill take a deep breath <pause length=long/> and try it again.
pitch
Description: Element that changes pitch properties of contained text. Attributes:
middle: Increases/decreases the pitch average of the contained text by N%. Values: (+/-)N%, highest, high, medium, low, lowest.
range: Increases/decreases the pitch range of the contained text by N%. Value: (+/-)N%.
Properties: Can contain other non-emotion elements. Example: Not I, <pitch middle=-20%>said the dog</pitch>
pron
Description: Enables manipulation of how something is pronounced. Attributes:
sub (required): The string the contained text should be substituted with. Value: a character string.
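Example (illustrative only):
I read about it on the <pron sub="world wide web">WWW</pron>.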
rate
Description: Sets the speech rate of the contained text. Attributes:
speed (required): Sets the speed; can be a percentage value higher or lower than the current rate, or a descriptive term. Values: (+/-)N%, fastest, fast, medium, slow, slowest.
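Example (illustrative only):
<rate speed="-20%">Please listen very carefully.</rate>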
sad
Description: Simulates the effect of sadness on the voice (i.e. generates a voice that sounds sad). Attributes: none. Properties: Can contain other non-emotion elements. Example:
<sad>Honesty is hardly ever heard.</sad>
sml
Description: Root element that encapsulates all other SML tags. Attributes: none. Properties: root node, can only occur once. Example:
<sml> <p> <happy>The sml tag encapsulates all other tags</happy> </p> </sml>
speaker
Description: Specify the speaker to use for the contained text. Attributes:
gender: Sets the gender of the speaker (default = male). Values: male, female.
name: Specifies the diphone database to use for the speaker's voice. Value: a character string.
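Example (the database name "us1" is illustrative; available names depend on the installed diphone databases):
<speaker gender="female" name="us1">Now I am speaking with a different voice.</speaker>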
volume
Description: Sets the speaking volume. Note, this element sets only the volume, and does not change voice quality (e.g. quiet is not a whisper). Attributes:
level: Sets the volume level of the contained text; can be a percentage value or a descriptive term. Values: (+/-)N%, soft, normal, loud.
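Example (illustrative only):
I can also speak quietly. <volume level="soft">Like this.</volume>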
<!ATTLIST embed
    type CDATA #REQUIRED
    src CDATA #IMPLIED
    music_file CDATA #IMPLIED
    lyr_file CDATA #IMPLIED>
<!--
#######################################
# LOW-LEVEL TAGS
#######################################
-->
<!ELEMENT pitch (#PCDATA | %LowLevelElements;)*>
<!ATTLIST pitch
    middle CDATA #IMPLIED
    base (low | medium | high) "medium"
    range CDATA #IMPLIED>
<!ELEMENT rate (#PCDATA | %LowLevelElements;)*>
<!ATTLIST rate
    speed CDATA #REQUIRED>
<!-- SPEAKER DIRECTIVES: Tags that can only encapsulate plain text -->
<!ELEMENT emph (#PCDATA)>
<!ATTLIST emph
    level (weakest | weak | moderate | strong) "weak"
    affect CDATA #IMPLIED
    target CDATA #IMPLIED>
<!ELEMENT pause EMPTY>
<!ATTLIST pause
    length (short | medium | long) "medium"
    msec CDATA #IMPLIED
    smooth (yes | no) "no">
<!ELEMENT volume (#PCDATA)>
<!ATTLIST volume
    level CDATA #REQUIRED>
<!ELEMENT pron (#PCDATA)>
<!ATTLIST pron
    sub CDATA #REQUIRED>
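As an illustration only (the sentence wording is not from the original, and the exact content models depend on entity declarations not reproduced above), a fragment using the low-level elements declared here might look as follows:
<p>
  <rate speed="slow">Not I, said the dog.</rate>
  <pause msec="300"/>
  <volume level="soft">Honesty is hardly ever heard.</volume>
</p>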
3. Creating VCMakefiles
From memory, there were no mishaps here. Only, again, make sure the make utility you're using is recent, preferably from the Cygnus Cygwin package version b20.1 (we had trouble with earlier versions).

Building
To build the system:
nmake /nologo /fVCMakefile
However, you'll probably incur a number of compile errors. The following is a list of changes I had to make:

SIZE_T redefinition in siod/editline.h
SIZE_T is redefined in editline.h since it is already defined in a Windows include file. Fix by inserting a compiler directive to only define SIZE_T if it hasn't been defined before, or simply don't define it (i.e. use #if 0 ... #endif).

"Unknown size" error for template in base_class/EST_TNamedEnum.cc
nmake doesn't seem to like the syntax used in template void EST_TValuedEnumI<ENUM,VAL,INFO>::initialise(const void *vdefs, ENUM (*conv)(const char *)). Go to the line it's grumbling about, and do the following:
Comment this line out:
const struct EST_TValuedEnumDefinition<ENUM,VAL,INFO> *defs = (const struct EST_TValuedEnumDefinition<ENUM,VAL,INFO> *)vdefs;
Insert these 2 lines (__my_defn can be anything, just make sure it doesn't conflict with another typedef!):
typedef EST_TValuedEnumDefinition<ENUM,VAL,INFO> __my_defn;
const __my_defn *defs = (const __my_defn *)vdefs;
Multiply defined symbols in base_class/inst_templ/vector_fvector_t.cc
EST_TVector.cc is already included in vector_f_.cc, so a conflict is occurring. (I'm not sure nmake would be the only one to complain about this, though.) Comment out this line:
#include "../base_class/EST_TVector.cc"
T * const p_contents never initialised in EST_TBox constructor (include/EST_TBox.h)
nmake wants p_contents to be initialised in the constructor using an initializer list.
Comment the EST_TBox constructor out: EST_TBox(void) {};
Define the EST_TBox constructor with an initializer list: EST_TBox(void) : p_contents(NULL) {};
I don't think it matters what p_contents is initialised to, since the inline comment says that the constructor will never be called anyway.
Building Festival
If I recall correctly, once the Speech Tools Library was successfully compiled, there were no problems compiling Festival. However, once you run any of the executables such as festival.exe, pay attention to any
error messages appearing at startup. If most functions are unavailable to you in the Scheme interpreter, it's probably because it can't find the lib directory that contains the Scheme (.scm) files.
Compiling
If you've included the right header files, and the include paths are correct, you should have no trouble. However, make sure your program source files have a .cpp extension, and not just .c, or you'll get all sorts of compilation errors. (Take it from someone who lost a lot of time over this silly error!) Also, make sure SYSTEM_IS_WIN32 is defined by adding it to the pre-processor definitions in Project-->Settings-->C/C++.
Linking
The manual gives instructions on which Festival and Speech Tools libraries your program needs to link with. (Note: the manual mentions .a UNIX library files. You should look for .lib files instead.) Add linking information through Project-->Settings-->Link. In addition to the libraries specified in the manual, you'll need to link two other libraries:
wsock32.lib - for the socket functions.
winmm.lib - for multimedia functions such as PlaySound().
These two libraries should already be on your system, and Visual C++ should already be able to find them. That's it! Good luck with your work :) If you've found this document to be helpful, please drop me an email - I'd like to know if it's proving useful for others. If you have found errors in this document, or have other suggestions for getting around problems, I'd especially like to hear from you! I also welcome any questions you may have. You can contact me at stalloj@cs.curtin.edu.au. Finally, my warm thanks to Alan W Black, Paul Taylor, and Richard Caley for making the Festival Speech Synthesis System freely available to all of us :)
SECTION 1 - PERSONAL AND BACKGROUND DETAILS
Age: ____________
Gender: [ ] Male  [ ] Female
In which country have you lived for most of your life? ______________________________________________________________________________________
Is English your first spoken language? [ ] Yes  [ ] No
Where do you use computers? [ ] Home  [ ] Work  [ ] School  [ ] I Don't  [ ] Other _____________
The next section will commence shortly. PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU.
Part B.
For each of the speech samples played, choose the ONE emotion which you think is most appropriate for how the speaker sounds.
1. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
2. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
3. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
4. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
5. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
6. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
7. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
8. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
9. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
10. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
11. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
12. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
13. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
14. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
15. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
16. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
17. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
18. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
19. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
20. [ ] No emotion  [ ] Happy  [ ] Sad  [ ] Angry  [ ] Disgusted  [ ] Surprised  [ ] Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
SECTION 4 - GENERAL
Prior to this demonstration:
[ ] Yes  [ ] No
[ ] Yes  [ ] No
Any further comments you could give about the Talking Head and/or its voice would be greatly appreciated. __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________
Emotional phrases
6. I have some wonderful news for you. (Happiness)
7. I cannot come to your party tomorrow. (Sadness)
8. I would not give you the time of day. (Anger)
9. Smoke comes out of a chimney. (Neutral)
10. Don't tell me what to do. (Anger)