
Minor Project Report

SPEECH TEXT ARTIFICE

(December 2010)

Submitted in partial fulfillment of the requirement


For the award of the degree of

Bachelor of Technology
In
COMPUTER SCIENCE

Guide : Submitted By:


MRS. RACHNA JAIN Aayush Sharma (0171152707)
Vinay Pareek (0221152707)
Lakshay Gaur (0301152707)
Sahil Sikka (0321152707)

BHARATI VIDYAPEETH’S COLLEGE OF ENGINEERING,


A-4, Paschim Vihar, New Delhi-110063
August, 2010

CERTIFICATE

This is to certify that the project report entitled "SPEECH TEXT ARTIFICE" by Aayush
Sharma (07/BV/CS/017), Vinay Pareek (07/BV/CS/022), Lakshay Gaur (07/BV/CS/030) and
Sahil Sikka (07/BV/CS/032), submitted to the 'Department of Computer Science Engineering' of
Bharati Vidyapeeth College of Engineering, New Delhi, is an authentic record of their own
work carried out during the period August 2010 to November 2010 under my guidance, and
the project work was completed to my satisfaction.

The matter embodied in this project has not been submitted earlier for the award of any
degree or diploma, to the best of my knowledge and belief.

Date:

Project Guide
PROF. RACHNA JAIN


ACKNOWLEDGEMENT

We gratefully acknowledge the guidance provided by our project supervisor, Mrs. Rachna
Jain, for her valuable inputs, able guidance, encouragement, whole-hearted cooperation and
constructive criticism throughout the duration of our project.

We take this opportunity to pay regards to all our teachers who have directly or indirectly
helped us in this project. Last but not the least, we express our thanks to all our friends for
their co-operation and support.


SYNOPSIS

SPEECH TEXT ARTIFICE is a speech recognition application designed to operate in
Microsoft Windows.

This application enables users, especially those who are visually or physically challenged,
to accomplish simple tasks in Windows. The TEXT PAD within the application provides the
output, which can be saved and documented for further use. The application uses SAPI
5.0 (Speech Application Program Interface) and incorporates its interfaces to make the
application work. The system is designed in Microsoft .NET C# and uses the Microsoft
Speech Synthesis Engine to carry out speech synthesis for a particular sound file.

SPEECH RECOGNITION BASICS

Speech recognition is the process by which a computer (or other type of machine) identifies
spoken words. Basically, it means talking to your computer, and having it correctly recognize
what you are saying.

Hardware Requirements:
 Sound Cards
 Microphones
 Computers/Processors

Applications:
The possible applications of speech technology include every task related to the
action of the human voice. In this sense, the application fields range across speech
production, storage, transmission and recognition processes. Some of the potential
applications of speech technology are mentioned below:
 Speech synthesis (voice response systems)
 Digital transmission and storage (optimized encryption of signals)
 Speaker verification and identification (control of access, legal applications)
 Aids to the handicapped
 Speech recognition (automatic dictation, command and control)


Types of Speech Recognition:


Speech recognition can be of two types, based on the grammar used for recognition.
(The grammar is, in other words, the list of possible recognition outputs that can be generated.)
 Command & Control
 Dictation
In a command and control scenario, a developer provides a limited set of possible word
combinations, and the speech recognition engine matches the words spoken by the user against
that limited list. In command and control, the accuracy of recognition is very high, so it is
generally better for applications to implement command and control where possible, as the
higher accuracy of recognition makes the application respond better. In dictation mode, the
recognition engine compares the input speech to the whole list of dictionary words. For
dictation mode to achieve high recognition accuracy, it is important that the user has
previously trained the recognition engine by speaking into it.
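The accuracy gap between the two modes can be illustrated with a small sketch: matching a noisy decoded string against a tiny command grammar is far less ambiguous than searching a large dictionary. The function and phrase lists below are purely illustrative, not part of the project's SAPI-based code.

```python
import difflib

def recognize(heard, candidates):
    """Return the candidate phrase most similar to the decoded string, with a 0..1 score."""
    best = max(candidates, key=lambda c: difflib.SequenceMatcher(None, heard, c).ratio())
    score = difflib.SequenceMatcher(None, heard, best).ratio()
    return best, score

# Command & control: a tiny grammar leaves few confusable alternatives,
# so even a noisy decoding resolves to the intended command.
commands = ["open notepad", "open paint", "save file", "exit"]
cmd, confidence = recognize("open notpad", commands)

# Dictation would search the same noisy input over a dictionary of tens of
# thousands of words, where spurious near-matches are far more likely.
```

With the four-phrase grammar, the misheard "open notpad" still resolves to "open notepad" with high similarity; the larger the candidate list, the more such near-collisions appear.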

Components of speech recognition application:


Every speech recognition application consists of
 An engine that translates waves into text
 A list of speech commands.

TEXT TO SPEECH APPLICATION

It takes text as input and produces a stream of audio as output using synthetic voices. This
application can be used for proofreading documents; it basically serves as a reader, in the
literal sense, of text files. We can change the properties of the synthetic voice. The user can,
as per requirement, vary the following properties:

VOLUME: The volume can be set as loud or low as required by the user.
RATE: The user can decide how fast or slow the speech should be.


CONTENT

Certificate………………………………………………………… 2
Acknowledgement……………………………………………….. 3
Synopsis…………………………………………………………... 4
1 Project Overview……………………………………………... 9
1.1 Project Objective……………………………………………………. 9
1.2 Abstract……………………………………………………………… 10
1.3 Project Scope....................................................................................... 10

2 Literature Review……………………………………………. 11
2.1 An Overview Of Speech Recognition……………………………….. 11
2.2 History………………………………………………………………. 12
2.3 Types Of Speech Recognition............................................................. 13
2.3.1 Isolated Speech……………………………………………….. 13
2.3.2 Connected Speech…………………………………………….. 13
2.3.3 Continuous Speech…………………………………………… 13
2.3.4 Spontaneous Speech………………………………………….. 13
2.4 Speech Recognition Process………………………………………... 14
2.4.1 Components Of Speech Recognition System………………... 15
2.5 Uses Of Speech Recognition Programs............................................. 16
2.6 Applications………………………………………………………... 16
2.6.1 From Medical Perspective…………………………………… 16
2.6.2 From Military Perspective…………………………………… 17
2.6.3 From Educational Perspective……………………………….. 17
2.7 Speech Recognition Weakness And Flaws………………………… 17
2.8 The Future Of Speech Recognition………………………………… 18
2.9 Few Speech Recognition Software………………………………… 19
2.9.1 X Voice……………………………………………………… 19
2.9.2 ISIP…………………………………………………………... 19
2.9.3 Ears…………………………………………………………... 19


2.9.4 CMU Sphinx………………………………………………… 20


2.9.5 Nico Ann Toolkit…………………………………………….. 20
3 Methodology And Tools……………………………………. 21
3.1 Fundamentals To Speech Recognition……………………………... 21
3.1.1 Utterances……………………………………………………. 21
3.1.2 Pronunciation………………………………………………… 21
3.1.3 Grammar…………………………………………………….. 21
3.1.4 Accuracy…………………………………………………….. 22
3.1.5 Vocabularies………………………………………………… 22
3.1.6 Training……………………………………………………… 22
3.2 Technical Requirements………………………………………….. 23
3.3 Methodology……………………………………………………….. 23
3.4 Speech API Overview……………………………………………… 24
3.4.1 API Overview……………………………………………….. 24
3.4.2 API For Text-to-Speech……………………………………… 25
3.4.3 API For Speech Recognition………………………………… 25
3.5 Speech Synthesis…………………………………………………… 28
3.5(a) Speech Synthesis..................................................................... 30
3.6 Speech Recognition………………………………………………… 31
3.6(a) Speech Recognizer Class……………………………………. 35
3.6(b) Grammar Class……………………………………………… 35
3.6(c) Using MICROSOFT ROBOTICS DSS Command Node…… 36

4 Documentation………………………………………………. 37
4.1 System Requirements………………………………………………. 37
4.1.1 Minimum Requirements………………………………………. 37
4.1.2 Best Requirements……………………………………………. 37
4.2 Hardware Requirements…………………………………………… 38
4.3 Software Requirements……………………………………………. 38
4.4 Context Diagram…………………………………………………... 39
4.5 Sequence Diagram………………………………………………… 40
4.6 Package Diagram………………………………………………….. 41


4.7 Activity Diagram…………………………………………………. 42

5 Implementation and Testing………………………………… 46


5.1 Working……………………………………………………………… 46
5.2 Initial Test, Results and Discussions…………………………………. 47

6 Conclusion……………………………………………………. 54
6.1 Advantages of software……………………………………………… 54
6.2 Disadvantages……………………………………………………….. 54
6.3 Future Enhancements……………………………………………….. 54

7 References…………………………………………………… 55

Appendices
8 Appendix -A Source Code…………………………………… 57
9 Appendix –B Snapshots……………………………………… 78
10 Appendix -C Glossary………………………………………... 81


LIST OF FIGURES
1. Speech Technology………………………………………………………13

2. Speech Recognition Process……………………………………………..15

3. Speech Synthesis………………………………………………………...29

4. Speech Recognition……………………………………………………...32

5. Context Diagram………………………………………………………...40

6. Sequence Diagram……………………………………………………....41

7. Package Diagram………………………………………………………...42

8. Activity Diagram………………………………………………………...43

9. Test Cases………………………………………………………………...48


CHAPTER 1

PROJECT OVERVIEW

This report presents an overview of speech recognition technology, software development,
and its applications. The first section deals with the description of the speech recognition
process, its applications in different sectors, its flaws and finally the future of the
technology. The later part of the report covers the speech recognition process, the code for
the software and its working. Finally, the report concludes with the different potential uses
of the application and further improvements and considerations.

1.1 Project Objective


1.2 Abstract

Speech recognition is the ability of a computer to recognize general, naturally flowing
utterances from a wide variety of users. It recognizes the caller's answers to move along the
flow of the call.

Speech recognition technology is one of the fastest-growing engineering technologies. It has
a number of applications in different areas and provides potential benefits. Nearly 20% of
the world's people suffer from various disabilities; many of them are blind or unable to use
their hands effectively. In those particular cases, speech recognition systems provide
significant help, so that such users can share information with people by operating a
computer through voice input.

This project is designed and developed keeping that factor in mind, and a little effort has
been made to achieve this aim. Our project is capable of recognizing speech and converting
the input audio into text; it also enables a user to perform operations such as "save", "open"
and "exit" on a file by providing voice input. It also helps the user open system software
such as MS-Paint, Notepad and Calculator.

At the initial level, effort has been made to support the basic operations discussed above,
but the software can be further updated and enhanced to provide more operations.

1.3 Project Scope

This project has speech recognizing and speech synthesizing capabilities. Though it is not a
complete replacement for what we call a NOTEPAD, it is still a good text editor to be used
through voice. The software can also open Windows-based software such as Notepad,
MS-Paint and more.


CHAPTER 2

LITERATURE REVIEW

SPEECH TECHNOLOGY
Three primary speech technologies are used in voice processing applications: stored speech,
text-to-speech and speech recognition. Stored speech involves the production of computer
speech from an actual human voice that is stored in a computer's memory and used in any of
several ways.
Speech can also be synthesized from plain text in a process known as text-to-speech, which
also enables voice processing applications to read from a textual database.
Speech recognition is the process of deriving either a textual transcription or some form of
meaning from a spoken input.
Speech analysis can be thought of as that part of voice processing that converts human speech
to digital forms suitable for transmission or storage by computers.
Speech synthesis functions are essentially the inverse of speech analysis: they reconvert
speech data from a digital form to one that is similar to the original recording and suitable
for playback.
Speech analysis processes can also be referred to as digital speech encoding (or simply
coding), and speech synthesis can be referred to as speech decoding.

2.1 An Overview of Speech Recognition


Speech recognition is a technology that enables a computer to capture the words spoken by a
human with the help of a microphone [1] [2]. These words are later recognized by the speech
recognizer, and in the end the system outputs the recognized words. The process of speech
recognition consists of different steps that will be discussed in the following sections one by
one.
Ideally, a speech recognition engine would recognize all words uttered by a human; in
practice, however, the performance of a speech recognition engine depends on a number of
factors. Vocabularies, multiple users and noisy environments are the major factors counted
as the determining factors for a speech recognition engine [3].

2.2 History
The concept of speech recognition started somewhere in the 1940s [3]; practically, the first
speech recognition program appeared in 1952 at Bell Labs, and was about recognition of a
digit in a noise-free environment [4], [5].
The 1940s and 1950s were the foundational period of speech recognition technology; in this
period work was done on the foundational paradigms of speech recognition, that is,
automation and information-theoretic models [15].
The 1960s saw recognition of small vocabularies (10-100 words) of isolated words, based on
simple acoustic-phonetic properties of speech sounds [3]. The key technologies developed
during this decade were filter banks and time normalization methods [15].
In the 1970s, medium vocabularies (100-1000 words) were recognized using simple
template-based pattern recognition methods.
In the 1980s, large vocabularies (1000 words to unlimited) were used, and speech recognition
problems based on statistical methods, with a large range of networks for handling language
structures, were addressed. The key inventions of this era were the HIDDEN MARKOV
MODEL (HMM) and the stochastic language model, which together enabled powerful new
methods for handling the continuous speech recognition problem efficiently and with high
performance [3].

In the 1990s, the key technologies developed were methods for stochastic language
understanding, statistical learning of acoustic and language models, and methods for the
implementation of large vocabulary speech understanding systems.

After five decades of research, speech recognition technology has finally entered the
marketplace, benefiting users in a variety of ways. The challenge of designing a machine
that truly functions like an intelligent human is still a major one going forward.


2.3 Types of Speech Recognition


Speech recognition systems can be divided into a number of classes based on their ability to
recognize words and the list of words they have. A few classes of speech recognition are
classified as under:

2.3.1 Isolated Speech

Isolated words usually involve a pause between two utterances; this doesn't mean that only
a single word is accepted, but rather that one utterance is required at a time [4].

2.3.2 Connected Speech

Connected words, or connected speech, is similar to isolated speech but allows separate
utterances with minimal pause between them.

2.3.3 Continuous speech

Continuous speech allows the user to speak almost naturally; it is also called computer
dictation.

2.3.4 Spontaneous Speech


At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An
ASR (Automated Speech Recognizer) system with spontaneous speech ability should be able
to handle a variety of natural speech features such as words being run together, "ums" and
"ahs", and even slight stutters.


2.4 Speech Recognition Process


2.4.1 Components of Speech Recognition System

Voice Input

Audio is input to the system with the help of a microphone; the PC sound card produces the
equivalent digital representation of the received audio [8] [9] [10].

Digitization
The process of converting an analog signal into a digital form is known as digitization [8];
it involves both sampling and quantization. Sampling converts a continuous signal into a
discrete signal, while the process of approximating a continuous range of values is known
as quantization.
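The two steps just described can be sketched in a few lines; the tone, sample rate and bit depth below are illustrative values, not the project's actual audio parameters.

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous-time signal and quantize each sample to `bits` bits."""
    n = int(duration_s * sample_rate_hz)
    levels = 2 ** bits
    samples = []
    for i in range(n):
        t = i / sample_rate_hz                      # sampling: discrete time instants
        x = signal(t)                               # x is assumed to lie in [-1.0, 1.0]
        q = round((x + 1.0) / 2.0 * (levels - 1))   # quantization: nearest of 2**bits levels
        samples.append(q)
    return samples

# A 440 Hz tone sampled at 8 kHz with 8-bit quantization (telephony-like values).
tone = lambda t: math.sin(2 * math.pi * 440 * t)
pcm = digitize(tone, duration_s=0.01, sample_rate_hz=8000, bits=8)
```

Raising the sample rate captures higher frequencies; raising the bit depth reduces the rounding error introduced by quantization.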

Acoustic Model
An acoustic model is created by taking audio recordings of speech and their text
transcriptions, and using software to create statistical representations of the sounds that make
up each word. It is used by a speech recognition engine to recognize speech [8]. The
software's acoustic model breaks the words into phonemes [10].
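The word-to-phoneme breakdown rests on a pronunciation lexicon, which can be sketched as a simple lookup table; the word list and ARPAbet-style phoneme symbols here are illustrative only.

```python
# Toy pronunciation lexicon mapping words to phoneme sequences
# (ARPAbet-style symbols, for illustration only).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "text":   ["T", "EH", "K", "S", "T"],
    "open":   ["OW", "P", "AH", "N"],
}

def phonemes(word):
    """Return the phoneme sequence the acoustic model would score for `word`."""
    return LEXICON.get(word.lower(), [])
```

A real acoustic model attaches a statistical sound model (e.g. an HMM) to each phoneme; the lexicon only supplies the sequence to score.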

Language Model
Language modeling is used in many natural language processing applications; in speech
recognition it tries to capture the properties of a language and to predict the next word in the
speech sequence [8]. The software's language model compares the phonemes to words in its
built-in dictionary [10].
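A minimal bigram model makes the "predict the next word" idea concrete; the corpus and counts below are illustrative, far smaller than any real language model's training data.

```python
from collections import Counter, defaultdict

# Estimate bigram counts from a tiny illustrative corpus.
corpus = "open the file save the file open the door".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(prev_word):
    """Most likely next word after `prev_word`, as a language model would rank it."""
    counts = bigrams[prev_word]
    return counts.most_common(1)[0][0] if counts else None
```

After "the", the model prefers "file" (seen twice) over "door" (seen once); a recognizer uses exactly this kind of ranking to break ties between acoustically similar word candidates.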

Speech Engine
The job of the speech recognition engine is to convert the input audio into text [4]; to
accomplish this it uses all sorts of data, software algorithms and statistics. Its first operation
is digitization, as discussed earlier, which converts the audio into a format suitable for
further processing. Once the audio signal is in the proper format, the engine searches for the
best match by considering the words it knows; once the signal is recognized, it returns the
corresponding text string.
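The best-match search can be sketched with a simple edit-distance comparison against the engine's known words; note that a real engine scores acoustic probabilities rather than string distances, and the vocabulary below is illustrative.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

VOCAB = ["notepad", "calculator", "paint", "save", "open", "exit"]

def best_match(decoded):
    """Pick the known word closest to the (possibly noisy) decoded string."""
    return min(VOCAB, key=lambda w: edit_distance(decoded, w))
```

A noisy decoding such as "notpad" is one edit away from "notepad" but several edits from every other vocabulary word, so the engine-style search returns the correct text string.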


2.5 Uses of Speech Recognition Programs


Basically, speech recognition is used for two main purposes. The first and foremost is
dictation, which in the context of speech recognition means translation of spoken words into
text; the second is controlling the computer, that is, developing software capable of letting a
user operate different applications by voice [4][11]. Writing by voice lets a person write 150
words per minute or more, provided he or she can speak that quickly. This aspect of speech
recognition programs creates an easy way of composing text and helps people in that
industry compose millions of words digitally in a short time rather than writing them one by
one; this way they can save time and effort.
Speech recognition is an alternative to the keyboard. If you are unable to write, or just don't
want to type, speech recognition programs help you do almost anything that you used to do
with a keyboard.

2.6 Applications

2.6.1 From medical perspective


People with disabilities can benefit from speech recognition programs. Speech recognition is
especially useful for people who have difficulty using their hands; in such cases speech
recognition programs are very beneficial and can be used to operate computers. Speech
recognition is also used in deaf telephony, such as voicemail-to-text.

2.6.2 From military perspective


Speech recognition programs are important from a military perspective; in the Air Force,
speech recognition has definite potential for reducing pilot workload. Besides the Air Force,
such programs can also be trained for use in helicopters, battle management and other
settings.

2.6.3 From educational perspective


Individuals with learning disabilities who have problems with thought-to-paper
communication (essentially they think of an idea but it is processed incorrectly causing it to
end up differently on paper) can benefit from the software.


Some other application areas of speech recognition technology are described below [13]:

Command and Control

ASR systems that are designed to perform functions and actions on the system are defined as
Command and Control systems. Utterances like "Open Netscape" and "Start a new browser"
will do just that.

Telephony

Some Voice Mail systems allow callers to speak commands instead of pressing buttons to
send specific tones.

Medical/Disabilities
Many people have difficulty typing due to physical limitations such as repetitive strain
injuries (RSI), muscular dystrophy, and many others. For example, people with difficulty
hearing could use a system connected to their telephone to convert the caller's speech to text.

2.7 Speech Recognition weakness and flaws


Besides all these advantages and benefits, a hundred percent perfect speech recognition
system has yet to be developed; a number of factors can reduce the accuracy and
performance of a speech recognition program. The speech recognition process is easy for a
human but a difficult task for a machine. Compared with a human mind, speech recognition
programs seem less intelligent: for a human, the capability of thinking, understanding and
reacting is natural, while for a computer program it is a complicated task. The program first
needs to understand the spoken words with respect to their meanings, and it has to strike a
sufficient balance between words, noise and silences. A human has a built-in capability of
filtering noise from speech, while a machine requires training; the computer requires help
in separating the speech sound from other sounds.


A few factors that are considerable in this regard are [10]:

Homonyms: words that are spelled differently and have different meanings but sound the
same, for example "there" and "their", "be" and "bee". It is a challenge for a machine to
distinguish between such phrases that sound alike.
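Since homophones are acoustically identical, recognizers must fall back on context to pick a spelling. The sketch below illustrates the idea with hand-made context scores; real systems use language-model probabilities, and all values here are illustrative.

```python
# Hand-made scores for (previous word, candidate spelling) pairs,
# standing in for real language-model probabilities.
CONTEXT_SCORES = {
    ("over", "there"): 9, ("over", "their"): 1,
    ("like", "their"): 8, ("like", "there"): 1,
}

def choose_spelling(prev_word, homophones):
    """Pick the homophone spelling the preceding context makes most likely."""
    return max(homophones, key=lambda w: CONTEXT_SCORES.get((prev_word, w), 0))
```

Given "over", the context strongly favors "there"; given "like", it favors "their". The acoustic signal alone could never make this distinction.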

Overlapping speech: a second challenge in the process is to understand speech uttered by
different users; current systems have difficulty separating simultaneous speech from
multiple users.

Noise factor: the program requires the words uttered by a human to be heard distinctly and
clearly. Any extra sound can create interference; you first need to place the system away
from noisy environments and then speak clearly, or else the machine will get confused and
mix up the words.

2.8 The future of speech recognition:

• Accuracy will become better and better.
• Dictation speech recognition will gradually become accepted.
• Greater use will be made of "intelligent systems" which will attempt to guess what the
speaker intended to say, rather than what was actually said, as people often misspeak and
make unintentional mistakes.
• Microphone and sound systems will be designed to adapt more quickly to changing
background noise levels and different environments, with better recognition of extraneous
material to be discarded.


2.9 Few Speech Recognition Softwares

2.9.1 X Voice

X Voice is a dictation/continuous speech recognizer that can be used with a variety of
X Window applications. This software is primarily for users.

Homepage: http://www.compapp.dcu.ie/~tdoris/Xvoice/
http://www.zachary.com/creemer/xvoice.html
Project: http://xvoice.sourceforge.net

2.9.2 ISIP
The Institute for Signal and Information Processing at Mississippi State University has made
its speech recognition engine available. The toolkit includes a front-end, a decoder, and a
training module. It's a functional toolkit. This software is primarily for developers.
The toolkit (and more information about ISIP) is available at:

FTP site: http://www.isip.msstate.edu/project/speech

2.9.3 Ears

Although Ears isn't fully developed, it is a good starting point for programmers wishing
to start in ASR. This software is primarily for developers.

FTP site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/

2.9.4 CMU Sphinx


Sphinx originally started at CMU and has recently been released as open source. This is a
fairly large program that includes a lot of tools and information. It is still "in development",
but includes trainers, recognizers, acoustic models, language models, and some limited
documentation. This software is primarily for developers.

Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html

Source: http://download.sourceforge.net/cmusphinx/sphinx2-0.1a.tar.gz


2.9.5 NICO ANN Toolkit

The NICO Artificial Neural Network toolkit is a flexible back propagation neural
network toolkit optimized for speech recognition applications.
This software is primarily for developers.
Its homepage: http://www.speech.kth.se/NICO/index.html


CHAPTER 3

METHODOLOGY AND TOOLS

3.1 Fundamentals to Speech Recognition


Speech recognition is basically the science of talking with the computer and having it
correctly recognize what is said [17].
To elaborate, we have to understand the following terms [4], [13]:

3.1.1 Utterances
When a user says something, this is an utterance [13]; in other words, speaking a word or
a combination of words that means something to the computer is called an utterance.
Utterances are then sent to the speech engine to be processed.

3.1.2 Pronunciation
A word's pronunciation, as used by a speech recognition engine, represents what the speech
engine thinks a word should sound like [4]. Words can have multiple pronunciations
associated with them.

3.1.3 Grammar
Grammar uses a particular set of rules to define the words and phrases that are going to be
recognized by the speech engine; more concisely, grammar defines the domain within which
the speech engine works [4]. A grammar can be as simple as a list of words, or flexible
enough to support various degrees of variation.
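A command grammar of this kind can be sketched as rule alternatives that expand into the full phrase list the engine will accept; the rule names and words below are illustrative, not taken from the project's grammar files.

```python
import itertools

# A tiny command grammar expressed as rule alternatives
# (rule names and word lists are illustrative).
GRAMMAR = {
    "verb":   ["open", "close"],
    "object": ["notepad", "paint", "calculator"],
}

def expand_grammar():
    """Enumerate every phrase this grammar allows the engine to recognize."""
    return [f"{v} {o}" for v, o in itertools.product(GRAMMAR["verb"], GRAMMAR["object"])]
```

Two verbs and three objects expand to six recognizable phrases; constraining the engine to exactly this list is what gives command-and-control grammars their high accuracy.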

3.1.4 Accuracy
The performance of a speech recognition system is measurable [4]; the ability of the
recognizer can be measured by calculating its accuracy, i.e., how reliably it identifies an
utterance.
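A common way to calculate this accuracy is the word error rate (WER). The sketch below is a standard dynamic-programming formulation of WER, offered as an illustration rather than the project's own evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)
```

Recognizing "open a file now" against the reference "open the file now" is one substitution out of four reference words, giving a WER of 0.25 (i.e., 75% word accuracy).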


3.1.5 Vocabularies

Vocabularies are the lists of words that can be recognized by the speech recognition engine [4].
Generally, smaller vocabularies are easier for a speech recognition engine to identify, while a
large list of words is a more difficult task for the engine.

3.1.6 Training
Training can be used by users who have difficulty speaking or pronouncing certain words;
speech recognition systems with training should be able to adapt to them.

Speaker Dependence vs. Speaker Independence

Speaker dependence describes the degree to which a speech recognition system requires
knowledge of a speaker's individual voice characteristics to successfully process speech. The
speech recognition engine can "learn" how you speak words and phrases; it can be trained to
your voice.
Speech recognition systems that require a user to train the system to his/her voice are known
as speaker-dependent systems. If you are familiar with desktop dictation systems, most are
speaker dependent. Because they operate on very large vocabularies, dictation systems
perform much better when the speaker has spent the time to train the system to his/her voice.

Speech recognition systems that do not require a user to train the system are known as
speaker-independent systems. Speech recognition in the Voice XML world must be
speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling
into your web site. You cannot require that each caller train the system to his or her voice.
The speech recognition system in a voice-enabled web application MUST successfully
process the speech of many different callers without having to understand the individual
voice characteristics of each caller.


3.2 Technical Requirements:

Hardware

1. Microphone
2. Speakers

Software
1. SmartDraw 2000 (for drawing the Gantt chart and speech recognition model)
2. Visual Paradigm for UML 7.1 (for use case and activity diagrams)
3. MS-Paint
4. Microsoft Robotics DSS node
5. Windows XP SP3/Vista/7
6. Microsoft Speech API 5.0
7. Visual Studio with .NET Framework
8. MS Office 2007 (documentation)

3.3 Methodology
As speech recognition is an emerging technology, not all developers are familiar with it.
While the basic functions of both speech synthesis and speech recognition take only a few
minutes to understand (after all, most people learn to speak and listen by age two), there are
subtle and powerful capabilities provided by computerized speech that developers will want
to understand and utilize.
Despite very substantial investment in speech technology research over the last 40 years,
speech synthesis and speech recognition technologies still have significant limitations. Most
importantly, speech technology does not always meet the high expectations of users familiar
with natural human-to-human speech communication. Understanding the limitations - as well
as the strengths - is important for effective use of speech input and output in a user interface
and for understanding some of the advanced features of the Microsoft Speech API.
An understanding of the capabilities and limitations of speech technology is also important
for developers in making decisions about whether a particular application will benefit from
the use of speech input and output.


3.4 SPEECH API Overview:

The SAPI application programming interface (API) dramatically reduces the code overhead
required for an application to use speech recognition and text-to-speech, making speech
technology more accessible and robust for a wide range of applications.

This section covers the following topics:

 API Overview
 API for Text-to-Speech
 API for Speech Recognition

3.4.1 API Overview

The SAPI API provides a high-level interface between an application and speech engines.
SAPI implements all the low-level details needed to control and manage the real-time
operations of various speech engines.

The two basic types of SAPI engines are text-to-speech (TTS) systems and speech
recognizers. TTS systems synthesize text strings and files into spoken audio using synthetic
voices. Speech recognizers convert human spoken audio into readable text strings and files.

API overview


3.4.2 API for Text-to-Speech

Applications can control text-to-speech (TTS) using the ISpVoice Component Object Model
(COM) interface. Once an application has created an ISpVoice object (see the Text-to-Speech
Tutorial), the application only needs to call ISpVoice::Speak to generate speech output from
some text data. In addition, the ISpVoice interface also provides several methods for changing
voice and synthesis properties, such as the speaking rate (ISpVoice::SetRate), the output
volume (ISpVoice::SetVolume) and the current speaking voice (ISpVoice::SetVoice).

Special SAPI controls can also be inserted along with the input text to change real-time
synthesis properties like voice, pitch, word emphasis, speaking rate and volume. This
synthesis markup sapi.xsd, using standard XML format, is a simple but powerful way to
customize the TTS speech, independent of the specific engine or voice currently in use.

The ISpVoice::Speak method can operate either synchronously (returning only when
completely finished speaking) or asynchronously (returning immediately and speaking as a
background process). When speaking asynchronously (SPF_ASYNC), real-time status
information such as the speaking state and current text location can be polled using
ISpVoice::GetStatus. Also, while speaking asynchronously, new text can be spoken either by
immediately interrupting the current output (SPF_PURGEBEFORESPEAK) or by
automatically appending the new text to the end of the current output.
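The synchronous/asynchronous distinction can be mimicked with a small sketch; the class below is a conceptual analogy in Python, not SAPI code, and produces no real audio. The method names only mirror the SAPI ideas.

```python
import threading
import time

class Voice:
    """Conceptual analogy of sync vs. async Speak; no real speech is produced."""

    def __init__(self):
        self.speaking = False

    def _speak(self, text):
        self.speaking = True
        time.sleep(0.01 * len(text))   # stand-in for audio playback time
        self.speaking = False

    def speak_sync(self, text):
        self._speak(text)              # returns only when "speech" has finished

    def speak_async(self, text):
        t = threading.Thread(target=self._speak, args=(text,))
        t.start()                      # returns immediately; caller may poll
        return t

voice = Voice()
job = voice.speak_async("hello world")
# ...the caller can poll voice.speaking here, much like ISpVoice::GetStatus...
job.join()
```

The synchronous call blocks the caller for the full playback time, while the asynchronous call hands the work to a background thread and lets the caller poll a status flag in the meantime.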

In addition to the ISpVoice interface, SAPI also provides many utility COM interfaces for the
more advanced TTS applications.

3.4.3 API for Speech Recognition

Just as ISpVoice is the main interface for speech synthesis, ISpRecoContext is the main
interface for speech recognition. Like the ISpVoice, it is an ISpEventSource, which means
that it is the speech application's vehicle for receiving notifications for the requested speech
recognition events.

An application has the choice of two different types of speech recognition engines
(ISpRecognizer). A shared recognizer that could possibly be shared with other speech
recognition applications is recommended for most speech applications. To create an
ISpRecoContext for a shared ISpRecognizer, an application need only call COM's
CoCreateInstance on the component CLSID_SpSharedRecoContext. In this case, SAPI will
set up the audio input stream, setting it to SAPI's default audio input stream. For large server
applications that would run alone on a system, and for which performance is key, an InProc
speech recognition engine is more appropriate. In order to create an ISpRecoContext for an
InProc ISpRecognizer, the application must first call CoCreateInstance on the component
CLSID_SpInprocRecoInstance to create its own InProc ISpRecognizer. Then the application
must make a call to ISpRecognizer::SetInput (see also ISpObjectToken) in order to set up the
audio input. Finally, the application can call ISpRecognizer::CreateRecoContext to obtain an
ISpRecoContext.

The next step is to set up notifications for the events the application is interested in. As the
ISpRecoContext is also an ISpEventSource, which in turn is an ISpNotifySource, the
application can call one of the ISpNotifySource methods from its ISpRecoContext to indicate
where the events for that ISpRecoContext should be reported. It should then call
ISpEventSource::SetInterest to indicate which events it needs to be notified of. The most
important event is SPEI_RECOGNITION, which indicates that the ISpRecognizer has
recognized some speech for this ISpRecoContext. See SPEVENTENUM for details on the
other available speech recognition events.

Finally, a speech application must create, load, and activate an ISpRecoGrammar, which
essentially indicates what type of utterances to recognize, i.e., dictation or a command and
control grammar. First, the application creates an ISpRecoGrammar using
ISpRecoContext::CreateGrammar. Then, the application loads the appropriate grammar,
either by calling ISpRecoGrammar::LoadDictation for dictation or one of the
ISpRecoGrammar::LoadCmdxxx methods for command and control. Finally, in order to
activate these grammars so that recognition can start, the application calls
ISpRecoGrammar::SetDictationState for dictation or ISpRecoGrammar::SetRuleState or
ISpRecoGrammar::SetRuleIdState for command and control.

When recognitions come back to the application by means of the requested notification
mechanism, the lParam member of the SPEVENT structure will be an ISpRecoResult by
which the application can determine what was recognized and for which ISpRecoGrammar of
the ISpRecoContext.
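The create/load/activate sequence and the recognition notification described above can be
sketched from C# through SAPI's automation layer in the SpeechLib interop (the handler
body and the use of a static dictation topic are illustrative assumptions):

```csharp
using System;
using SpeechLib; // SAPI 5 COM interop (Interop.SpeechLib.dll)

class DictationSketch
{
    static void Main()
    {
        // Shared recognizer: the automation equivalent of creating
        // an ISpRecoContext from CLSID_SpSharedRecoContext
        SpSharedRecoContext context = new SpSharedRecoContext();

        // Create and load a dictation grammar
        ISpeechRecoGrammar grammar = context.CreateGrammar(0);
        grammar.DictationLoad("", SpeechLoadOption.SLOStatic);

        // Fired when the engine recognizes speech (SPEI_RECOGNITION)
        context.Recognition += (int stream, object pos,
                                SpeechRecognitionType type,
                                ISpeechRecoResult result) =>
        {
            Console.WriteLine(result.PhraseInfo.GetText(0, -1, true));
        };

        // Activate the grammar so recognition can start
        grammar.DictationSetState(SpeechRuleState.SGDSActive);
        Console.ReadLine(); // keep listening until Enter is pressed
    }
}
```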


An ISpRecognizer, whether shared or InProc, can have multiple ISpRecoContexts associated
with it, and each one can be notified in its own way of events pertaining to it. An
ISpRecoContext can have multiple ISpRecoGrammars created from it, each one for
recognizing different types of utterances.


3.5 SPEECH SYNTHESIS

A speech synthesizer converts written text into spoken language. Speech synthesis is also
referred to as text-to-speech (TTS) conversion.

The major steps in producing speech from text are as follows:


• Structure analysis: Process the input text to determine where paragraphs,
sentences and other structures start and end. For most languages, punctuation and
formatting data are used in this stage.
• Text pre-processing: Analyze the input text for special constructs of the
language. In English, special treatment is required for abbreviations, acronyms, dates, times,
numbers, currency amounts, email addresses and many other forms. Other languages need
special processing for these forms and most languages have other specialized requirements.
The remaining steps convert the spoken text to speech.

• Text-to-phoneme conversion: Convert each word to phonemes. A phoneme is a basic
unit of sound in a language. US English has around 45 phonemes, including the consonant
and vowel sounds. For example, "times" is spoken as four phonemes: "t ay m s". Different
languages have different sets of sounds (different phonemes). For example, Japanese has
fewer phonemes, including sounds not found in English, such as the "ts" in "tsunami".

• Prosody analysis: Process the sentence structure, words and phonemes to determine
appropriate prosody for the sentence. Prosody includes many of the features of speech other
than the sounds of the words being spoken. This includes the pitch (or melody), the timing (or
rhythm), the pausing, the speaking rate, the emphasis on words and many other features.
Correct prosody is important for making speech sound right and for correctly conveying the
meaning of a sentence.

• Waveform production: Finally, the phonemes and prosody information are used to
produce the audio waveform for each sentence. There are many ways in which the speech can
be produced from the phoneme and prosody information. Most current systems do it in one of
two ways: concatenation of chunks of recorded human speech, or formant synthesis using
signal processing techniques based on knowledge of how phonemes sound and how prosody
affects those phonemes. The details of waveform generation are not typically important to
application developers.

The Text to Speech service is designed to provide your applications with a verbal interface. It
can be used in conjunction with the Speech Recognizer service for two-way communication
with the computer.

As a type of speech engine, much of a synthesizer's functionality comes from the
SpeechSynthesizer class in the System.Speech.Synthesis namespace and from other classes
and interfaces in that namespace.


3.5 (a) SpeechSynthesizer:

Inheritance Hierarchy:
System.Object
System.Speech.Synthesis.SpeechSynthesizer
Namespace: System.Speech.Synthesis
Assembly: System.Speech (in System.Speech.dll)
Syntax: public sealed class SpeechSynthesizer : IDisposable

The SpeechSynthesizer type exposes the following members.

Constructor:

Name: SpeechSynthesizer
Description: The central class of the System.Speech.Synthesis namespace; constructing an
instance obtains a speech synthesizer. Calling SetOutputToWaveFile followed by Speak
writes the synthesizer's formatted output to a file instead of the audio device.

Properties:

Name     Description
Rate     Gets or sets the speaking rate (-10 to 10).
State    Gets the current speaking state of the synthesizer.
Voice    Gets information about the currently selected voice.
Volume   Gets or sets the output volume (0 to 100).
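A minimal sketch of this class (standard System.Speech API; the spoken sentences and the
output file name are illustrative):

```csharp
using System;
using System.Speech.Synthesis; // requires a reference to System.Speech.dll

class SynthesizerSketch
{
    static void Main()
    {
        using (SpeechSynthesizer synth = new SpeechSynthesizer())
        {
            synth.Rate = 0;    // -10 (slowest) to 10 (fastest)
            synth.Volume = 80; // 0 to 100

            // Speak through the default audio device
            synth.SetOutputToDefaultAudioDevice();
            synth.Speak("Speech Text Artifice is ready.");

            // Redirect the same kind of output to a WAV file
            synth.SetOutputToWaveFile("output.wav");
            synth.Speak("This sentence is written to output.wav.");
        }
    }
}
```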


3.6 SPEECH RECOGNITION

Speech recognition (SR) converts spoken words to written text and as a result can be used to
provide user interfaces that use spoken input. The Speech Recognizer service enables you to
include speech recognition support for your application. Speech recognition requires a special
type of software, called an SR engine.

The SR engine may be installed with the operating system or at a later time with other
software. Speech-enabled packages such as word processors and web browsers may install
their own engines or use existing ones. Additional engines are also available from
third-party manufacturers. These engines are typically designed to support only a specific
language and may also target a certain vocabulary, for example a vocabulary specializing in
medical or legal terminology.


Operations:

The Speech Recognizer service supports the following requests and notifications.

Operation Description

Get Returns the entire state of the Speech Recognizer service.

InsertGrammarEntry Inserts the specified entry (or entries) of the supplied grammar
into the current grammar dictionary. If certain entries exist
already a Fault is returned and the whole operation fails
without the current dictionary being modified at all.

UpdateGrammarEntry Updates entries that already exist in the current grammar


dictionary with the supplied grammar entries. If certain entries
in the supplied grammar do not exist in the current dictionary
no Fault is returned. Instead, only the existing entries are
updated.

UpsertGrammarEntry Inserts entries from the supplied grammar into the current
dictionary if they do not exist yet or updates entries that already
exist with entries from the supplied grammar.

DeleteGrammarEntry Deletes those entries from the current grammar dictionary
whose keys are equal to one of the supplied grammar entries. If
a key from the supplied grammar entries does not exist in the
current dictionary no Fault is returned, but any matching entries
are deleted.

SetSrgsGrammarFile Sets the grammar type to SRGS file and tries to load the
specified file, which has to reside inside your application's
/store folder (directory). If loading the file fails, a Fault is
returned and the speech recognizer returns to the state it was in
before it processed this request. SRGS grammars require
Windows Vista or Windows 7 and will not work with
Windows XP or Windows Server 2003.


EmulateRecognize Sets the SR engine to emulate speech input but by using Text
(string). This is mostly used for testing and debugging.

GrammarType Specifies the type of grammar the SR engine will use, either a
simple Dictionary grammar or SRGS grammar.

SpeechDetected Indicates that speech (audio) has been detected and is being
processed.

SpeechRecognized Indicates that speech has been recognized.

SpeechRecognitionRejected Indicates that speech was detected, but not recognized as one of
the words or phrases in the current grammar dictionary. The
duration of the speech is available as DurationInTicks.

To support SR you define a grammar (the words and phrases to be recognized) and then use
the notifications provided by the service to determine what the SR engine recognized as the
spoken input. The Speech Recognizer service supports simple dictionary-style grammars.

System.Speech.Recognition Namespace
The Recognition namespace contains Windows Desktop Speech technology types for
implementing speech recognition.

The Windows Desktop Speech Technology software offers a basic speech recognition
infrastructure that digitizes acoustical signals, and recovers words and speech elements from
audio input.

Applications use the System.Speech.Recognition namespace to access and extend this basic
speech recognition technology by defining algorithms for identifying and acting on specific
phrases or word patterns, and by managing the run-time behavior of this speech
infrastructure.

Applications manage and use grammars (sets of rules defining how specific combinations
of words and phrases are to be understood) through the general-purpose Grammar class,
which hosts runtime, persisted, or dynamically constructed instances of SrgsDocument.
SrgsDocument instances contain W3C Speech Recognition Grammar Specification (SRGS)
compliant grammar documents.

A simplified means of specifying grammar documents is provided through the
GrammarBuilder and Choices classes. Full support for generating SRGS-compliant
grammars is provided by the members of the System.Speech.Recognition.SrgsGrammar
namespace.

In addition, a special-case grammar to support a conventional dictation model is available
through DictationGrammar objects.

Instances of SpeechRecognizer and SpeechRecognitionEngine objects, supplied with
appropriate Grammar objects, provide the primary access to the Windows Desktop Speech
Technology recognition engines.

The SpeechRecognizer class is used to create client applications making use of a system's
current recognition technology, which is configured through the Audio Input member of the
Control Panel, and a computer's default audio input mechanism.
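A minimal sketch of this shared-recognizer path (standard System.Speech API; the handler
body is illustrative):

```csharp
using System;
using System.Speech.Recognition; // requires a reference to System.Speech.dll

class SharedRecognizerSketch
{
    static void Main()
    {
        // Uses the system's shared recognition engine
        // and the computer's default audio input
        SpeechRecognizer recognizer = new SpeechRecognizer();
        recognizer.LoadGrammar(new DictationGrammar());

        recognizer.SpeechRecognized += (sender, e) =>
            Console.WriteLine("Recognized: " + e.Result.Text);

        Console.ReadLine(); // keep the application alive while listening
    }
}
```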

Building an application using SpeechRecognitionEngine allows more control over the
configuration and type of the recognition engine, which runs in-process. Using
SpeechRecognitionEngine also provides for the dynamic selection of audio input, whether
from devices or files.
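As a sketch of the in-process path (standard System.Speech API; the command words mirror
the notepad commands described elsewhere in this report and are otherwise illustrative):

```csharp
using System;
using System.Speech.Recognition; // requires a reference to System.Speech.dll

class InProcessSketch
{
    static void Main()
    {
        using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())
        {
            // Command-and-control grammar built from a fixed set of choices
            Choices commands = new Choices("open", "save", "clear");
            engine.LoadGrammar(new Grammar(new GrammarBuilder(commands)));

            // Dynamic selection of audio input: a device here,
            // but SetInputToWaveFile would work as well
            engine.SetInputToDefaultAudioDevice();

            engine.SpeechRecognized += (sender, e) =>
                Console.WriteLine("Command: " + e.Result.Text);

            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine(); // listen until Enter is pressed
        }
    }
}
```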


3.6 (a) Speech Recognizer class

Inheritance Hierarchy:
System.Object
System.Speech.Recognition.SpeechRecognizer
Namespace: System.Speech.Recognition
Assembly: System.Speech (in System.Speech.dll)
Syntax: public class SpeechRecognizer : IDisposable

3.6 (b) Grammar class

Inheritance Hierarchy:
System.Object
System.Speech.Recognition.Grammar
System.Speech.Recognition.DictationGrammar
Namespace: System.Speech.Recognition
Assembly: System.Speech (in System.Speech.dll)
Syntax: public class Grammar

Speech Recognizer GUI:

The SpeechRecognizer service represents the core speech recognition service (as opposed to
the SpeechRecognizerGui service which offers the user interface component to the core
service). The core service allows for usage of simple dictionary-style grammars as well as
complex SRGS (Speech Recognition Grammar Specification) grammars, specified in XML.

It does not require any connections, and it will start up when you run the diagram. You can
also optionally start an instance of the Speech Recognizer GUI once you have a DSS node
running by using a web browser and going to the Control Panel page. Starting the service will
automatically attempt to load the default SR engine.


3.6 (c) Steps required to develop a dictionary-style grammar for Speech-To-Text
conversion using the MICROSOFT ROBOTICS DSS command node:

Step 1: Set the Initial State

The SpeechRecognizer service supports the Initial State partner. The initial state is used to
configure:

 What type of grammar is being used


 What the grammar looks like or where it can be loaded from

The default config file has to be called "SpeechRecognizer.config.xml", and it specifies the
commands that will be used by the recognizer.

Step 2: Start and Run the Sample

Start the DSS Command Prompt from the Start > All Programs menu.

Start a DSS Host node and create an instance of the service by typing the following
command:

dsshost /p:50000 /m:"samples\config\SpeechRecognizer.manifest.xml"

Step 3: Start the GUI to Configure Speech Recognition.

At the bottom of the SpeechRecognizerGui web page you can define a grammar. Note that
the SpeechRecognizer only recognizes words and phrases that are in its grammar. If the
grammar is empty, then nothing will be recognized.


CHAPTER 4

DOCUMENTATION

4.1 System Requirements Specification:


The System Requirements Specification describes the various supporting tools that are
required for the system to run. It specifies the software and hardware requirements for
executing the system. The user must have certain hardware and software before using the
system:

4.1.1 Minimum requirements


• PENTIUM 4 processor
• 512 MB of RAM
• Microphone
• Sound card

4.1.2 Recommended requirements


• CORE i3 2.26 GHz Processor
• 1 GB or more of RAM
• Sound cards with very clear signals
• High quality microphones

All the above mentioned components must be installed prior to the execution of the system.

The performance of the system largely depends on the type of hardware used especially the
microphone. There are special Headphones, which provide relatively more accurate results.


4.2 Hardware Requirements

Sound cards
Speech requires relatively low bandwidth, so a high-quality 16-bit sound card is good
enough. Sound must be enabled, and the proper driver should be installed. Sound cards
with the 'cleanest' A/D (analog-to-digital) conversion are recommended, but most often the
clarity of the digital sample depends more on the microphone quality and even more on the
environmental noise. Some speech recognition systems might require specific sound cards.

Microphones
A quality microphone is key when using a speech recognition system. Desktop microphones
are not suitable for speech recognition because they tend to pick up more ambient noise.
The best and most common choice is the headset style: it minimizes ambient noise while
keeping the microphone close to the mouth at all times. Headsets are available with or
without earphones (mono or stereo).

Computer/ Processors
Speech recognition applications can be heavily dependent on processing speed, because a
large amount of digital filtering and signal processing takes place in ASR.

4.3 Software Requirements

Both the developer and the user machines must have the following software installed for the
system to work correctly:

 Visual Paradigm for UML 7.1 (for use case and activity diagrams)
 MS Paint
 MICROSOFT ROBOTICS DSS node
 Windows XP SP3/Vista/7
 Microsoft Speech API 5.0
 Visual Studio with .NET Framework
 MS Office 2007 (documentation)

4.4 CONTEXT DIAGRAM for SPEECH TEXT ARTIFICE:

The context diagram shows how the other systems interact with SPEECH TEXT ARTIFICE.
SAPI is the backbone of the system. The software uses various interfaces provided by SAPI
for speech recognition and speech synthesis. Special SAPI controls can also be inserted along
with the input text to change real-time synthesis properties such as voice, pitch, word
emphasis, speaking rate and volume. The application interacts with SAPI using an API
(Application Programming Interface), and SAPI interacts with the recognition and TTS
engines using a DDI (Device Driver Interface).

CONTEXT diagram


4.5 SEQUENCE DIAGRAM for SPEECH TEXT ARTIFICE:

The following figure describes the sequence in which the processes are performed for the
TEXT to SPEECH as well as the SPEECH to TEXT interface.

For TTS, text is entered as input and then worked upon by the API for synthesis and
processing, which passes it on to the TTS engine; the engine in turn provides SPEECH as
output through the audio hardware.

For STT, speech is entered as input and then processed by SAPI for acoustic recognition, so
that the output can be transferred to the STT engine, which provides suitable output in the
form of TEXT within the text pad provided in the software.

SEQUENCE diagram


4.6 PACKAGE DIAGRAM for SPEECH TEXT ARTIFICE:

PACKAGES are UML constructs that enable you to organize model elements into groups,
making your UML diagrams simpler and easier to understand. Packages are depicted as file
folders and can be used on any of the UML diagrams, although they are most common on
USE CASE and CLASS diagrams because these models have a tendency to grow.

There are three main packages being used here:

 Speech Text Artifice: Contains the tools of the application itself.
 SAPI: Provides the various objects and interfaces required to make a speech-enabled
application.
 Speech SDK: The Microsoft Speech SDK comes with its own recognition engine and
text-to-speech engine. SAPI interacts with these engines to analyze the input.

4.6.1 For SPEECH SYNTHESIS:


4.6.2 For SPEECH RECOGNITION:


4.7 ACTIVITY DIAGRAM for SPEECH TEXT ARTIFICE:

4.7.1

Saving the document


4.7.2 Writing Text

4.7.3 Opening Document


4.7.4 Opening System Software


CHAPTER 5

IMPLEMENTATION AND TESTING

5.1 WORKING
This software is designed to recognize speech and also has capabilities for speaking and
synthesizing, meaning it can convert speech to text and text to speech. The software, named
'SPEECH TEXT ARTIFICE', can write spoken words into the text area of its notepad and
recognize commands such as "save", "open" and "clear". It is also capable of opening
Windows software such as Notepad, MS Paint and Calculator through voice input.

The synthesis part of this software helps in verifying the various operations performed by
the user, for example by reading out the written text and by announcing what type of action
the user is performing, such as saving a document, opening a new file or opening a file
previously saved on the hard disk.


5.2 INITIAL TEST, RESULTS AND DISCUSSIONS:

Various TEST CASES as implemented along with results:

Test Case No: 1

Project Name: Text To Speech

Description: Reading a text file from the system and displaying it in a


text box.

Input No.  Description                                 Output         Expected Output
1          Open an unknown file format                 Error Message  Error Message
2          Open a file not residing on the Hard Disk   Error Message  Error Message
3          Open a valid Text file                      File Open      File Open

Status: PASSED


Test Case No: 2

Project Name: Text To Speech

Description: Changing Voice Properties

Input No.  Description                          Output                        Expected Output
1          Set rate value=5                     Rate increases                Rate increases
2          Set rate value=1                     Rate decreases                Rate decreases
3          Set rate value=11                    Error message                 Error message
4          Set volume value=5                   Volume decreases drastically  Volume decreases drastically
5          Set volume value=100                 Volume increases drastically  Volume increases drastically
6          Set volume value=110                 Error message                 Error message
7          Click Get Voices and choose a voice  Voice changes                 Voice changes

Status: PASSED


Test Case No: 3

Project Name: Text To Speech

Description: Checking Up

Input No.  Description                  Output                             Expected Output
1          Phrase="Ball"                Ball                               Ball
2          Phrase=" " (blank)           Error Message                      Error Message
3          Phrase="Read"                Read                               Read
4          Phrase="UFO"                 Each alphabet is spoken separately Each alphabet is spoken separately
5          Phrase="123"                 One hundred twenty three           One hundred twenty three
6          Phrase="Hello. How are you"  A small pause after Hello          A small pause after Hello

Status: PASSED


Test Case No: 4

Project Name: Text To Speech

Description: Save file

Input No.  Description                 Output               Expected Output
1          Save file "text1.txt"       File Saved           File Saved
2          Quit without clicking save  Prompt to save file  Prompt to save file

Status: PASSED


The various test cases for Speech To Text are as follows:

Test Case No: 1

Project Name: Speech To Text

Description: Accuracy check without User Training

Pass Criteria: 50%

Input No.  Description                 Output             Expected Output
1          Phrase="I am a boy"         "I and a high"     "I am a boy"
2          Phrase="Hello"              "Hello"            "Hello"
3          Phrase="This is to inform"  "The his uniform"  "This is to inform"
4          Phrase="I am a boy" (slow)  "I am a boy"       "I am a boy"
5          Phrase="Nice to meet you"   "Knife to seat"    "Nice to meet you"

Status: FAILED with 40% accuracy.

Result: Pre-emptive training is required.


Test Case No: 2

Project Name: Speech To Text

Description: Accuracy check with user training (half)

Pass Criteria: 50%

Input No.  Description                 Output               Expected Output
1          Phrase="I am a boy"         "I am a boy"         "I am a boy"
2          Phrase="Hello"              "Hello"              "Hello"
3          Phrase="This is to inform"  "This is to inform"  "This is to inform"
4          Phrase="I am a boy" (slow)  "I am a boy"         "I am a boy"
5          Phrase="Cat and mouse"      "Cat and mouse"      "Cat and mouse"
6          Phrase="Anybody there"      "Study"              "Anybody there"

Status: FAILED with 66.66% accuracy.

Result: Training improves accuracy.


Test Case No: 3

Project Name: Speech To Text

Description: User training is done (half), but a person who hasn't done the
training is speaking.

Pass Criteria: 50%

Input No.  Description                 Output                Expected Output
1          Phrase="I am a boy"         "I am a boy"          "I am a boy"
2          Phrase="Hello"              "Hello"               "Hello"
3          Phrase="This is to inform"  "This is to uniform"  "This is to inform"
4          Phrase="I am a boy" (slow)  "I am a boy"          "I am a boy"
5          Phrase="Cat and mouse"      "Cat and house"       "Cat and mouse"
6          Phrase="Anybody there"      "Study"               "Anybody there"

Status: PASSED with 50% accuracy.

Result: Training improves accuracy but every user must load his/her own
profile while dictating.


CHAPTER 6

CONCLUSION

6.1 Advantages of the software

• Text can be written through both keyboard and voice input.
• Voice recognition of notepad commands such as open, save and clear.
• Opens different Windows software based on voice input.
• Less time is consumed in writing text.
• Provides significant help for people with disabilities.
• Lower operational costs.

6.2 Disadvantages

• Low accuracy.
• Performs poorly in noisy environments.

6.3 Future Enhancements


This work can be taken into more detail, and more work can be done on the project in order
to bring modifications and additional features. The current software doesn't support a large
vocabulary; further work will accumulate more samples and increase the efficiency of the
software. The current version of the software supports only a few areas of the notepad, but
more areas can be covered, and effort will be made in this regard.

This project work on speech recognition started with a brief introduction to the technology
and its applications in different sectors. The project part of the report was based on software
development for speech recognition. At a later stage we discussed different tools for bringing
that idea into practical work. After the development of the software, it was tested, the results
were discussed, and a few deficiencies were brought to light. After the testing work, the
advantages of the software were described, and suggestions for further enhancement and
improvement were discussed.


REFERENCES

BOOKS AND INTERNET

[1] “Speech recognition- The next revolution” 5th edition.

[2] Ksenia Shalonova, “Automatic Speech Recognition” 07 DEC 2007


Source:http://www.cs.bris.ac.uk/Teaching/Resources/COMS12303/lectures/Ksenia_Shalonov
a-Speech_Recognition.pdf

[3] http://www.abiityhub.com/speech/speech-description.htm

[4] "Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993. ISBN:


0130151572.

[5] "Speech and Language Processing: An Introduction to Natural Language Processing,


Computational Linguistics and Speech Recognition". D. Jurafsky, J. Martin. 2000. ISBN:
0130950696.

[6] http://www.abiityhub.com/speech/speech-description.htm

[7] Charu Joshi “Speech Recognition”


Source: http://www.scribd.com/doc/2586608/speechrecognition.pdf
Date Added 04/21/2008

[8] John Kirriemuir, "Speech recognition technologies", March 30, 2003

[9] http://electronics.howstuffworks.com/gadgets/high-tech-
gadgets/speechrecognition.htm/printable last updated: 30th October 2009

[10] http://www.jisc.ac.uk/media/documents/techwatch/ruchi.pdf


[11] http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-
recognition3.htm

[12] http://en.wikipedia.org/wiki/Speech_recognition
Visited: 6 DEC 2010

[13] Stephen Cook, "Speech Recognition HOWTO", Revision v2.0, April 19, 2002
Source: http://www.scribd.com/doc/2586608/speechrecognitionhowto

[14] B.H. Juang & Lawrence R. Rabiner, “Automatic Speech Recognition – A Brief History
of the Technology Development” 10/08/2004 .
Source: http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-
final- 10-8.pdf


APPENDIX –A

SOURCE CODE

WINDOW APPLICATION PAGE :


{Program.cs}

using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;

namespace demo11
{
static class Program
{
/// <summary>
/// The main entry point for the application.
/// </summary>
[STAThread]

//the main function that allow application to run in window form.

static void Main()


{
Application.EnableVisualStyles();
Application.SetCompatibleTextRenderingDefault(false);
Application.Run(new Form1());
}
}
}


MAIN APPLICATION PAGE :-


{Form1.cs}

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.ServiceProcess;
using System.Management;
using System.Runtime.InteropServices;
using System.Threading;

namespace demo11
{
public partial class Form1 : Form
{
int kk; string str = "\n";
public Form1()
{

InitializeComponent();
button1.Enabled = false;
menuStrip1.Visible = false;

}
demo11.text2s tt = new text2s();
demo11.speech2t ttt = new speech2t();
demo11.neww tttt = new neww();
demo11.helpp hp = new helpp();
demo11.splash sp = new splash();

private void fILEToolStripMenuItem_Click(object sender, EventArgs e)


{
menuStrip1.Visible = false;
menuStrip2.Visible = true;
}

private void vIEWToolStripMenuItem_Click(object sender, EventArgs e)


{
menuStrip1.Visible = false;
menuStrip3.Visible = true;
}

private void hELPToolStripMenuItem_Click(object sender, EventArgs e)


{
menuStrip1.Visible = false;
menuStrip4.Visible = true;
}

private void eXITToolStripMenuItem_Click(object sender, EventArgs e)


{
this.Close();
}

private void hOWTOUSEToolStripMenuItem_Click(object sender, EventArgs e)


{
hp.MdiParent = this;
hp.WindowState = FormWindowState.Maximized;
hp.Show();
}

private void aBOUTUSToolStripMenuItem_Click(object sender, EventArgs e)


{
MessageBox.Show("SR CONVERTER. ALL RIGHTS ARE RESERVED. FOR
MORE DETAILS CONTACT AAYUSH SHARMA ,BVCOE(0171152707)", "ABOUT
US", MessageBoxButtons.OK, MessageBoxIcon.Information);
}

private void bACKToolStripMenuItem1_Click(object sender, EventArgs e)


{
menuStrip4.Visible = false;
menuStrip1.Visible = true;
}

private void nEWToolStripMenuItem_Click(object sender, EventArgs e)
{
// no action defined in the original listing
}

private void oPENToolStripMenuItem_Click(object sender, EventArgs e)
{
// no action defined in the original listing
}

private void sAVEToolStripMenuItem_Click(object sender, EventArgs e)
{
// no action defined in the original listing
}

private void eXITToolStripMenuItem1_Click(object sender, EventArgs e)


{
menuStrip2.Visible = false;
menuStrip1.Visible = true;


tttt.Hide();
tt.Hide();
ttt.Hide();
}

//menu item that will bring user to main page

private void bACKToolStripMenuItem_Click(object sender, EventArgs e)


{
menuStrip3.Visible = false;
menuStrip1.Visible = true;
tt.Hide();
ttt.Hide();
}

public static bool IsServiceInstalled(string serviceName)


{
// get list of Windows services
ServiceController[] services = ServiceController.GetServices();

// try to find service name


foreach (ServiceController service in services)
{
if (service.ServiceName == serviceName)
return true;
}
return false;
}

private void button1_Click(object sender, EventArgs e)


{
// ManagementObjectSearcher searcher = new ManagementObjectSearcher("select *
from win32_share");


// foreach (ManagementObject obj in searcher.Get())


// {

// string s;
// s = string.IsNullOrEmpty(obj.GetPropertyValue("DeviceName").ToString()) ?
string.Empty : obj.GetPropertyValue("DeviceName").ToString();

// textBox1.Text = s.ToString();

// }

}
private void tEXTTOSPEECHToolStripMenuItem_Click(object sender, EventArgs e)
{

tt.MdiParent = this;
tt.WindowState = FormWindowState.Maximized;
tt.Show();
}

private void sPEECHTOTEXTToolStripMenuItem_Click(object sender, EventArgs e)


{

ttt.MdiParent = this;
ttt.WindowState = FormWindowState.Maximized;
ttt.Show();
}

private void nEWFILEFORTEXTTOSPEECHToolStripMenuItem_Click(object sender,


EventArgs e)
{
tt.MdiParent = this;


tt.WindowState = FormWindowState.Maximized;
tt.Show();

menuStrip1.Visible = false;
menuStrip3.Visible = true;
menuStrip2.Visible = false;
}

private void nEWFILEFORSPEECHTOTEXTToolStripMenuItem_Click(object sender,


EventArgs e)
{
ttt.MdiParent = this;
ttt.WindowState = FormWindowState.Maximized;
ttt.Show();
menuStrip2.Visible = false;
menuStrip1.Visible = false;
menuStrip3.Visible = true;
}

// executed while this form loads; the sound driver's information is shown after a specific time.

private void Form1_Load(object sender, EventArgs e)


{
//if (this.IsDisposed )
//{
// this.Hide();
// this.WindowState = FormWindowState.Minimized;

// sp.Show();
// sp.WindowState = FormWindowState.Normal;


// System.Windows.Forms.Timer tm = new System.Windows.Forms.Timer();

// //Timer tm = new Timer();


// tm.Interval = 4500;
// tm.Tick += MyTick;
// tm.Start(); kk = 3;
//}
//else
//{

//}
menuStrip4.Visible = false;

ManagementObjectSearcher objSearcher = new ManagementObjectSearcher("SELECT * FROM Win32_SoundDevice");

ManagementObjectCollection objCollection = objSearcher.Get();

string str1 = "";

foreach (ManagementObject obj in objCollection)
{
// Thread.Sleep(900);
foreach (PropertyData property in obj.Properties)
{
str = String.Format("{0}:{1}\t\n", property.Name, property.Value);
str = str1 + str;
displayy();
str1 = str;
}
}
}
void displayy()
{
System.Windows.Forms.Timer tm = new System.Windows.Forms.Timer();
tm.Interval = 2500;
tm.Tick += Mytick;
tm.Start();
progressBar1.Increment(3);
}
int i = 0;
public void Mytick(object obj, System.EventArgs e)
{
if (i < 20)
{
textBox1.Text = str;
i++;
}
else
{
((System.Windows.Forms.Timer)obj).Stop();
}
button1.Enabled = true;
}
//void MyTick(object obj, System.EventArgs ea)
//{
// sp.Hide();
// sp.Dispose();
// this.Show();
// this.WindowState = FormWindowState.Maximized;
// this.WindowState = FormWindowState.Normal;
// ((System.Windows.Forms.Timer)obj).Stop();
//}


// After the driver information has been displayed, the user can proceed to the next page.

private void button1_Click_1(object sender, EventArgs e)
{
button1.Hide();
textBox1.Hide();
menuStrip1.Visible = true;
label1.Dispose();
progressBar1.Dispose();
}
}
}


TEXT TO SPEECH PAGE :-


{Text2S.cs}

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using SpeechLib;
using System.IO;
using System.Threading;

namespace demo11
{
public partial class text2s : Form
{
int X = 50;

public text2s()
{
InitializeComponent();
label2.Text = X.ToString();
}

private void TEXT2S_LOAD(object sender, EventArgs e)


{
textBox1.Text = "";
}


// After pressing the Speak button, the application speaks whatever it finds in the main text box.

private void btnListen_Click(object sender, EventArgs e)
{
if (textBox1.Text != "")
{
SpVoice voice = new SpVoice();
voice.Volume = X;
voice.Speak(textBox1.Text, SpeechVoiceSpeakFlags.SVSFlagsAsync);
voice.WaitUntilDone(Timeout.Infinite);
}
else
MessageBox.Show("Please enter text for speech", "Text to Speech", MessageBoxButtons.OK, MessageBoxIcon.Information);
}

// After pressing this button, the speaker volume increases by 10 units.

private void button2_Click(object sender, EventArgs e)
{
if (X < 100)
{
X = X + 10;
button3.Enabled = true ;
}

if (X == 100)
{ button2.Enabled = false; }

label2.Text = X.ToString();
}

// After pressing this button, the speaker volume decreases by 10 units.

private void button3_Click(object sender, EventArgs e)
{
if (X > 10)
{
X = X - 10;
button2.Enabled = true ;
}
if (X == 10)
button3.Enabled = false;

label2.Text = X.ToString();
}

// After pressing this button, an Open dialog appears; the user can select any Notepad (.txt) file and view its data in the text box.

private void button1_Click(object sender, EventArgs e)
{
OpenFileDialog openfiledialog1 = new OpenFileDialog();

string stro = "";

openfiledialog1.Title = "choose path";


openfiledialog1.Filter = "Text Files|*.txt";
openfiledialog1.FilterIndex = 1;

if (openfiledialog1.ShowDialog() != DialogResult.Cancel)
{
stro = openfiledialog1.FileName.ToString();
StreamReader objreader;
objreader = new StreamReader(stro);
textBox1.Text = objreader.ReadToEnd();
objreader.Close();
}
}

// After pressing this button, the user can save the text-box content to a Notepad (.txt) file.

private void button4_Click(object sender, EventArgs e)
{
SaveFileDialog saveFileDialog1 = new SaveFileDialog();

string str = "";

saveFileDialog1.Title = "Specify Destination Filename";
saveFileDialog1.Filter = "Text Files|*.txt";
saveFileDialog1.FilterIndex = 1;
saveFileDialog1.OverwritePrompt = true;

// Write the file only if the user did not cancel the dialog.
if (saveFileDialog1.ShowDialog() != DialogResult.Cancel)
{
str = saveFileDialog1.FileName.ToString();
File.WriteAllText(str, textBox1.Text);
}
}

// This button closes the running window.

private void button5_Click(object sender, EventArgs e)
{
this.Close();

}

// This button clears the content of the text box.
private void button6_Click(object sender, EventArgs e)
{
textBox1.Text = "";
}
}
}


SPEECH TO TEXT :-
{Speech2T.cs}

using Microsoft.Win32;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using SpeechLib;
using System.IO;
using System.Threading;
using System.Speech;
using System.Speech.Recognition;
using System.Speech.Synthesis.TtsEngine;

namespace demo11
{
public partial class speech2t : Form
{
public speech2t()
{
InitializeComponent();
}
public void listener_Reco(int StreamNumber, object StreamPosition, SpeechRecognitionType RecognitionType, ISpeechRecoResult Result)
{
// Retrieve the recognized phrase and append it (plus a space) to the text box.
string heard = Result.PhraseInfo.GetText(0, -1, true);

textBox1.Text += heard;
textBox1.Text = textBox1.Text.ToString() + " ";
}

// After pressing this button, the application listens for the user's speech and writes the recognized words into the text box.

private void btnListen_Click_1(object sender, EventArgs e)
{
// Speech recognition object
SpSharedRecoContext listener;

// Grammar object
ISpeechRecoGrammar grammar;
listener = new SpeechLib.SpSharedRecoContext();
listener.Recognition += new _ISpeechRecoContextEvents_RecognitionEventHandler(listener_Reco);

// Load and activate a dictation grammar so recognition actually starts.
grammar = listener.CreateGrammar(0);
grammar.DictationLoad("", SpeechLoadOption.SLOStatic);
grammar.DictationSetState(SpeechRuleState.SGDSActive);

//SpeechRecognitionEngine RecognitionEngine = new SpeechRecognitionEngine();
//RecognitionEngine.LoadGrammar(new DictationGrammar());
//RecognitionResult Result = RecognitionEngine.Recognize(new SetInputToDefaultAudioDevice());
//StringBuilder Output = new StringBuilder();
//foreach (RecognizedWordUnit Word in Result.Words)
//{
// textBox1.Text = Result.Words.ToString();
//}
}

private void button3_Click(object sender, EventArgs e)
{
textBox1.Text = "";
}

private void sAVEToolStripMenuItem1_Click(object sender, EventArgs e)


{
Stream myStream;
SaveFileDialog saveFileDialog1 = new SaveFileDialog();

saveFileDialog1.Filter = "txt files (*.txt)|*.txt|All files (*.*)|*.*";


saveFileDialog1.FilterIndex = 2;
saveFileDialog1.RestoreDirectory = true;

if (saveFileDialog1.ShowDialog() == DialogResult.OK)
{
if ((myStream = saveFileDialog1.OpenFile()) != null)
{
// Code to write the stream goes here.
myStream.Close();
}
}
}

private void sAVEToolStripMenuItem_Click(object sender, EventArgs e)
{
SaveFileDialog saveFileDialog1 = new SaveFileDialog();


string str = "";

saveFileDialog1.Title = "Specify Destination Filename";
saveFileDialog1.Filter = "Text Files|*.txt";
saveFileDialog1.FilterIndex = 1;
saveFileDialog1.OverwritePrompt = true;

// Write the file only if the user did not cancel the dialog.
if (saveFileDialog1.ShowDialog() != DialogResult.Cancel)
{
str = saveFileDialog1.FileName.ToString();
File.WriteAllText(str, textBox1.Text);
}
}

private void button2_Click(object sender, EventArgs e)


{
textBox1.Clear();
}

private void file1ToolStripMenuItem_Click(object sender, EventArgs e)


{
textBox1.Clear();
}

private void oPENToolStripMenuItem_Click(object sender, EventArgs e)


{
OpenFileDialog openfiledialog1 = new OpenFileDialog();

string stro = "";

openfiledialog1.Title = "choose path";

openfiledialog1.Filter = "Text Files|*.txt";
openfiledialog1.FilterIndex = 1;

if (openfiledialog1.ShowDialog() != DialogResult.Cancel)
{
stro = openfiledialog1.FileName.ToString();
textBox1.Text = stro;
StreamReader objreader;
objreader = new StreamReader(stro);
textBox1.Text = objreader.ReadToEnd();
objreader.Close();
}
}

// After pressing this button, the content of the text box is erased and the user can dictate into a new file.

private void button1_Click(object sender, EventArgs e)


{
textBox1.Clear();
}

// After pressing this button, the user can open any file on the hard disk and append data to it.

private void button4_Click(object sender, EventArgs e)
{
OpenFileDialog openfiledialog1 = new OpenFileDialog();

string stro = "";

openfiledialog1.Title = "choose path";

openfiledialog1.Filter = "Text Files|*.txt";
openfiledialog1.FilterIndex = 1;

if (openfiledialog1.ShowDialog() != DialogResult.Cancel)
{
stro = openfiledialog1.FileName.ToString();
textBox1.Text = stro;
StreamReader objreader;
objreader = new StreamReader(stro);
textBox1.Text = objreader.ReadToEnd();
objreader.Close();
}
}

// After pressing this button, the user can save his/her content to the hard disk as a Notepad (.txt) file for future use.

private void button5_Click(object sender, EventArgs e)
{

SaveFileDialog saveFileDialog1 = new SaveFileDialog();


string str = "";
saveFileDialog1.Title = "Specify Destination Filename";
saveFileDialog1.Filter = "Text Files|*.txt";
saveFileDialog1.FilterIndex = 1;
saveFileDialog1.OverwritePrompt = true;

// Write the file only if the user did not cancel the dialog.
if (saveFileDialog1.ShowDialog() != DialogResult.Cancel)
{
str = saveFileDialog1.FileName.ToString();
File.WriteAllText(str, textBox1.Text);
}
}
}
}


APPENDIX –B

SNAPSHOTS
This is the starting page shown when the application is run. A splash screen appears as soon as the interface launches. After the hardware and sound drivers are checked, the generated information is displayed, and any errors found are reported thereafter.
The main screen of our software offers features such as FILE, VIEW, HELP, and ABOUT US.
Clicking the VIEW menu opens the SPEECH TO TEXT and TEXT TO SPEECH interfaces.

[Screenshots: splash screen, main screen, and the SPEECH TO TEXT / TEXT TO SPEECH interfaces.]

APPENDIX –C

GLOSSARY

cmdAssociate Phrase : This method can be used to add new phrases for applications that have not yet been added to the grammar list.

cmdLoadFromFile : This method loads a grammar file (XML) into a grammar object.

Grammar File : A grammar file is an XML file that stores the phrases that the recognition engine should recognize or listen for.

Grammar Object : A grammar object is one that can load an XML grammar file.
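For illustration, a minimal grammar file of the kind described above might look like the sketch below. The rule names and phrases here are invented examples, not taken from the project's actual grammar file:

```xml
<!-- Illustrative SAPI 5 command-and-control grammar file.
     The rule names and phrases are example values only. -->
<GRAMMAR LANGID="409">
  <RULE NAME="OpenTextPad" TOPLEVEL="ACTIVE">
    <P>open text pad</P>
  </RULE>
  <RULE NAME="CloseWindow" TOPLEVEL="ACTIVE">
    <P>close window</P>
  </RULE>
</GRAMMAR>
```

Once such a file is loaded into a grammar object (for example via cmdLoadFromFile) and activated, the recognition engine listens only for the listed phrases.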

Phoneme : Phoneme is the part of the STA tools concerned with command and control. It takes a particular action whenever it recognizes a phrase.

RC_FalseRecognition : This event is fired whenever a phrase is not recognized.

RC_Recognition : This event is fired whenever a phrase is recognized.

Recognition Engine : An engine that analyzes text spoken through a microphone.

SAPI (Speech Application Programming Interface) : SAPI is the backbone of STA. It provides the methods, interfaces, and objects needed to develop speech applications.

Speech Recognition : Speech recognition is the method by which the human voice is recognized and the consequent action takes place.

Speech Synthesis : Speech synthesis is the method by which written text is spoken by the interface.


User Training : The training wizard present in Windows that trains the recognition engine and maintains the user profile.
