Automatic Grapheme-to-Phoneme Conversion of Arabic Text

Science and Information Conference 2015
July 28-30, 2015 | London, UK
Belal Al-Daradkah and Bashir Al-Diri
School of Computer Science
University of Lincoln
Lincoln, UK
{baldaradkah, baldiri}@lincoln.ac.uk
Abstract: An automated system is presented that converts Arabic graphemes to phonemes (G2P) using the phonological rules of the Arabic language, supported by a dictionary of exceptions. The system was evaluated on a publicly available dataset of 620 fully diacritized Arabic sentences that were manually segmented. Validation was carried out at three levels: sentences, words, and phonemes. At the phoneme level, the system achieved a precision of 99.43% and a sensitivity of 99.20%. All 18 implemented phonemic rules were applied and tested. The developed system can be applied to any fully diacritized Arabic text.
Keywords: Arabic; phonemes; graphemes; phonology; rules.
I. INTRODUCTION
The development of text-to-speech (TTS) systems requires a grapheme-to-phoneme conversion process (phonetization), which is an essential component of any speech processing application [1]. At present, phonetization tends to be done manually [2], which is time consuming, error prone, and dependent on highly skilled linguistic experts [3]. This paper addresses these issues by introducing an automatic phoneme transcription system for Arabic text. Automatic grapheme-to-phoneme conversion is important for speech applications in general, whether synthesis or recognition, and for TTS systems in particular [4]. TTS can be used in many applications, such as education, learning and communication [5]. These applications can help people with special needs, such as blind people, as well as those who are learning foreign languages. They require accurate phonetization in order to derive the correspondence between the orthography and the relevant sounds. Grapheme-to-phoneme transcription, for any language, is affected by the level of correspondence between letters and sounds [6]. This manuscript introduces an automated system that converts Arabic graphemes to their corresponding phonemes using a rule-based technique supported by a special dictionary. This dictionary is used to transcribe a set of exceptional words that do not follow the phonetization rules.
The rest of the manuscript is organized into four sections. Section II presents the current phonetization methods. Section III presents the proposed system and methodology. Section IV presents and discusses our results. Finally, Section V presents our conclusion.
II. BACKGROUND
Phonetization is the transcription of written text into sound symbols. There are three different approaches to this process. The first is the dictionary-based approach, which stores the maximum phonological knowledge in a lexicon. A comprehensive dictionary requires huge storage space and tedious effort to create [7]. The second is the phonological rule-based approach, which uses a comprehensive set of grapheme-to-phoneme rules and a phonetic post-processor to transcribe text into actual sounds. The rule-based approach is language specific and is widely used in speech synthesis [8]. The third is the data-driven approach, whose techniques fall into three classes: a) pronunciation by analogy (PbA), b) statistical methods based on stochastic theory and nearest neighbor, and c) methods based on neural networks [9].
Implementations of phonetization systems differ according to the complexity of the language, which falls into two main classes based on how clearly the letters of the language correspond to its pronunciation sounds. This correspondence can be simple, as in Spanish and Finnish, where the rule-based approach covers most cases and the exception dictionary is small. On the other hand, the correspondence can be complex [8], as in English and French, where the rule-based approach covers a small part of the language and the exception dictionary is very large. Arabic falls between these two classes: most Arabic graphemes can be transcribed into sounds by a set of letter-to-sound rules, augmented by a dictionary of exception words. For this reason it has been classified as a phonetic language [10].
Arabic is the official language and the mother tongue of more than 200 million people across 22 Arab countries [1]. In addition, more than 1 billion non-Arab Muslims around the world use Arabic in worship. There is also an interest in learning and processing Arabic in computational linguistics. Arabic has 28 consonants, 3 long vowels and 3 short vowels (harakat, or tashkeel) called Fatha, Dhamah, and Kasrah [11]. Arabic also has five further diacritical marks: three nunation marks (Tanween), derived by doubling the short vowels, namely Tanween Fateh (an), Dham (un) and Kaser (in), and two other diacritics called Shaddah and Skoon. Shaddah indicates the doubling of a consonant, whereas Skoon indicates the absence of a Haraka, i.e. a consonant that is not followed by a short vowel [12]. Together, the 28 consonants and 6 vowels form the 34 phonemes of the Arabic language.

The literature on Arabic phonetic rules is limited in comparison to other languages. This can be attributed to more than one factor. For instance, Arabic text is usually written without diacritics, which has led to a shortage of comprehensive Arabic phonetic corpora. Numerous studies have attempted to formulate these rules and use them to convert text to sound symbols. For example, Al-Ghamdi et al. [13] used the Arabic phonology rules and some Tajweed rules to convert text to sound symbols, listing a number of phonetic and phonemic rules together with exceptional words. However, this method was not evaluated on publicly available datasets to measure its performance and to identify the best sequence for applying the rules, even though some of the rules contradict each other and could produce inappropriate output. El-Imam [8], [15] proposed a system to phonetize isolated words using a corpus of 6000 words with a small exception dictionary. He addressed a set of challenges in Arabic grapheme-to-phoneme transcription by studying the properties of Arabic phonology, including phonetic rules. Ahmed [2] applied a set of letter-to-sound rules to simplify computer voice production, using a total of 150 allophones and vowel/consonant combinations, and showed that these rules are the backbone of any Arabic text-to-speech application. Selim and Anbar [14] developed a rule-based phonetic transcription system for Arabic text, using 291 non-diacritized words from newspapers. The system, which was validated against a limited number of words, had moderate accuracy.
Previous studies of speech-to-text transcription were oriented towards processing broadcast news data in Modern Standard Arabic, and have since been extended to address a larger variety of broadcast data, which, as a result, requires handling dialectal speech [16]. The explicit modeling of gemination and the introduction of pronunciation variants led to significant improvements in the performance of speech-to-text transcription methods. Harrat et al. [17] developed a speech-to-speech translation system between Modern Standard Arabic and the Algiers dialect. The system included both a text-to-speech module and a grapheme-to-phoneme converter for the Arabic language. They used both a rule-based approach and a statistical approach, yielding accuracies of 92% and 85% respectively, despite the lack of resources for the Arabic language. Their work included a phonetized dictionary for the Algiers dialect.
Arabic letter-to-sound rules can be divided into two classes:
a) phonemic and b) phonetic. Phonemic rules are used for
grapheme-to-phoneme conversion, whereas the phonetic rules
are used to transcribe phonemes to phones. In this paper, we
investigate the first part of these rules, by designing a system to
work on a fully vowelized (diacritic) Arabic text and to be able
to transcribe its graphemes to phonemes. Our future work will
focus on addressing the problem of phonetic rules, in order to
convert the phonemes to phones.
III. PROPOSED METHOD
Initially, the rules were collected from different resources in the literature, and were then classified and prioritized, taking into consideration that the majority of Arabic graphemes have a corresponding phoneme, with few exceptions. In our proposed method, we adopted the rule-based approach for converting Arabic graphemes to phonemes, respecting the Arabic phonology rules. The order in which the rules are applied is important because the output of one rule is the input to the following one. Mapping each grapheme (G) to its corresponding phoneme (F) involves one of the following operations: insertion, deletion or substitution. SAMPA (Speech Assessment Methods Phonetic Alphabet) is used to represent Arabic phonemes, as shown in Table 1.
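TABLE I. THE SAMPA (SPEECH ASSESSMENT METHODS PHONETIC ALPHABET).
[Table body not recoverable: each row paired an Arabic grapheme (G) with its SAMPA phoneme symbol (F), but the grapheme cells were lost. Phoneme symbols that survive, in extraction order: X s x d D R Z s S d` t` D` ? u q k L m n H W J i: i u a: a]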
It is worth mentioning that not all words can be converted to correct phonemic output using phonetization rules alone. There are some distinct exceptions, so we created a special dictionary for these words. This dictionary includes the most frequent words that require special pronunciation and do not follow the rules, together with their correct phonemic transcriptions. Exception words sometimes contain graphemes that are not pronounced, or require sounds that are articulated but not written in the word. This usually happens with words containing an unwritten Alif or Hamzah, e.g., /?rraXma:nu/.

In our system we split sentences into words, and each word is checked against the exception word dictionary. If the word is found, its stored transcription is retrieved; otherwise the word is processed using the structured rules, according to the proper rule sequence, taking into consideration the possibility of handling an adjacent pair of words. A sample of exception words from the dictionary is shown in Table 2.

TABLE II. EXCEPTIONAL WORD SAMPLE TRANSCRIPTIONS.
Special word | Phonemic transcription
(not recovered) | ?rraXma:nu
(not recovered) | ?ula:?ika
(not recovered) | ha:Da:
(not recovered) | Da:lika
(not recovered) | la:kin
(not recovered) | ha:?ula:?i
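The dictionary-first lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the diacritized spellings used as dictionary keys, and the stand-in rule function are hypothetical, and only the SAMPA strings are taken from the sample of exception words. The handling of adjacent word pairs mentioned in the text is not modelled.

```python
from typing import Callable, Dict, List

# Hypothetical exception dictionary. The keys are diacritized spellings
# assumed for illustration; the values are SAMPA transcriptions that
# appear in the sample of exception words.
EXCEPTIONS: Dict[str, str] = {
    "\u0647\u064E\u0630\u064E\u0627": "ha:Da:",          # "this"
    "\u0630\u064E\u0644\u0650\u0643\u064E": "Da:lika",   # "that"
    "\u0644\u064E\u0643\u0650\u0646\u0652": "la:kin",    # "but"
}


def phonetize_sentence(
    sentence: str,
    apply_rules: Callable[[str], str],
    exceptions: Dict[str, str] = EXCEPTIONS,
) -> List[str]:
    """Split a sentence into words; use the dictionary first, rules otherwise."""
    transcriptions = []
    for word in sentence.split():
        if word in exceptions:
            # Exception word: retrieve the stored transcription as-is.
            transcriptions.append(exceptions[word])
        else:
            # Standard word: hand it to the ordered rule set.
            transcriptions.append(apply_rules(word))
    return transcriptions


def placeholder_rules(word: str) -> str:
    """Stand-in for the ordered rule set; it only tags the word."""
    return "<rules:" + word + ">"
```

Joining the returned list with "+" would give one transcribed sentence in the word-separated layout used for the sample output later in the paper.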
A. System Limitations
Our proposed system requires the input Arabic text to be fully vowelized (diacritized). Arabic phonology and pronunciation are highly affected by the short vowels, which determine the pronunciation and add semantic information to the words. In general, short vowels do not appear in written Arabic, since native speakers can infer them from context and meaning. Diacritics are therefore necessary to obtain the correct pronunciation. This is an important issue for all text-to-speech and automatic speech recognition applications [20], as well as for applications designed to teach Arabic to non-native speakers. Fully diacritized Arabic text is therefore required for phonetization.
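Because the system depends on diacritized input, a simple pre-check can flag text that is unlikely to be usable. The sketch below is an assumption-laden heuristic, not part of the described system: it measures the fraction of Arabic letters that are directly followed by one of the eight marks named in the Background section, using their Unicode code points. The function name and the letter range are illustrative choices.

```python
# Unicode code points of the marks named in the text.
SHORT_VOWELS = {"\u064E", "\u064F", "\u0650"}   # Fatha, Dhamah, Kasrah
TANWEEN = {"\u064B", "\u064C", "\u064D"}        # Tanween Fateh, Dham, Kaser
SHADDAH = "\u0651"
SKOON = "\u0652"
DIACRITICS = SHORT_VOWELS | TANWEEN | {SHADDAH, SKOON}


def diacritic_coverage(text: str) -> float:
    """Return the fraction of Arabic letters directly followed by a diacritic."""
    letters = 0
    marked = 0
    for i, ch in enumerate(text):
        # U+0621..U+064A spans the basic Arabic letters (Hamzah to Yaa);
        # the range is an approximation chosen for this sketch.
        if "\u0621" <= ch <= "\u064A":
            letters += 1
            if i + 1 < len(text) and text[i + 1] in DIACRITICS:
                marked += 1
    return marked / letters if letters else 0.0
```

A value near zero indicates undiacritized text. Correctly diacritized text can still score below 1.0, for example because long vowel letters often carry no mark, so any threshold would have to be tuned on real data.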
B. System Description
After analyzing the Arabic phonemic rules, a clear and well-structured algorithm was implemented in order to preserve the priority order of the rules. Our phonetization framework is rule based: it mainly uses the Arabic phonology rules, augmented by the exception word dictionary. Figure 1 shows the workflow of our system.

The fully vowelized text is fed to the system sentence by sentence. Each sentence is converted to an array of words, and each word is checked against the exception word dictionary. The phonology rules are then applied to the standard words. Exception words and phrases, in contrast, are looked up in the dictionary before any rule is applied, and their stored transcriptions are retrieved.

Fig. 1. System workflow

C. Phonetization Rules
Arabic phonetization rules were collected from the literature [13], [15], where they were introduced in various formats. In our implementation, set notation is used to represent these rules, which makes the automatic transcription process easier to understand and implement. The proposed system implements 18 Arabic phonemic rules. Each grapheme might be omitted or replaced by one or more phonemes. The general form of a rule is as follows:

$G \rightarrow F \;/\; X\,G\,Y$,  (1)

where G is the target grapheme, F is the corresponding phoneme, and X and Y are the right and left graphemes of G respectively (X and Y might be null). The order of applying these rules is very important: the additional rules should be applied before the converting rules and the omitting rules. The set C represents the Arabic consonants, whereas the sets $V_l$ and $V_s$ represent the Arabic long and short vowels:

$C = \{\text{Arabic consonants}\}$, $V = V_l \cup V_s$, $V_l = \{\text{Alif}, \text{Waw}, \text{Yaa}\}$, $V_s = \{\text{Fatha}, \text{Dhamah}, \text{Kasrah}\}$.

The following are our definitions of the implemented rules:

Skoon rule: in fully diacritized text each consonant is followed by a Harakeh, which is either Skoon, Shaddah or a short vowel; this rule removes Skoon as follows:

$R_{01}: \{ G_i \rightarrow \varnothing \mid G_i = \text{Skoon} \wedge G_{i-1} \in C \}$.  (2)

Shaddah rule: this rule is applied to a consonant with Shaddah, by omitting the Shaddah and repeating that consonant, so that the consonant is doubled instead of being written once with Shaddah:

$R_{02}: \{ G_i \rightarrow F_i F_{i+1} \mid G_i \text{ carries Shaddah} \}$.  (3)

Hamzah rules: Hamzah can be written in more than one form; it may appear isolated or supported by a carrier letter. By applying this rule, Hamzah is written without any supporting grapheme as follows:

$R_{03}: \{ G_i \rightarrow \text{Hamzah} \mid G_i \in \{\text{supported Hamzah forms}\} \}$.  (4)
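As an illustration of how context rules of the form $G \rightarrow F / X\,G\,Y$ might be coded, the sketch below implements grapheme-level versions of the Skoon and Shaddah rules in Python. It is a minimal sketch under stated assumptions, not the authors' Java implementation: the function names are hypothetical, the rules operate on Unicode code points rather than SAMPA symbols, and the Shaddah step looks back past any vowel mark because the vowel may be stored before or after Shaddah depending on how the text was typed or normalized.

```python
SHADDAH = "\u0651"
SKOON = "\u0652"
# Short vowels, Tanween marks, Shaddah and Skoon (combining marks).
MARKS = {
    "\u064B", "\u064C", "\u064D",   # Tanween Fateh, Dham, Kaser
    "\u064E", "\u064F", "\u0650",   # Fatha, Dhamah, Kasrah
    SHADDAH, SKOON,
}


def apply_shaddah(graphemes: str) -> str:
    """Shaddah rule sketch: drop each Shaddah and repeat its consonant."""
    out = []
    for ch in graphemes:
        if ch != SHADDAH:
            out.append(ch)
            continue
        # Find the base letter the Shaddah belongs to, skipping any
        # vowel mark that was stored before the Shaddah.
        j = len(out) - 1
        while j >= 0 and out[j] in MARKS:
            j -= 1
        if j >= 0:
            out.insert(j, out[j])   # double the consonant
    return "".join(out)


def apply_skoon(graphemes: str) -> str:
    """Skoon rule sketch: delete Skoon, which marks the absence of a vowel."""
    return graphemes.replace(SKOON, "")
```

For a word stored as Raa, Fatha, Baa, Shaddah, Fatha, `apply_shaddah` yields Raa, Fatha, Baa, Baa, Fatha, which corresponds to the doubled consonant the Shaddah rule describes.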
Alif rules: Alif rules are based on the location of Alif in a word or a sentence. If Alif is located at the beginning of the first word in a sentence, then it is mapped to Hamzah as follows:

$R_{04}: \{ G_i \rightarrow \text{Hamzah} \mid G_i = \text{Alif} \wedge i = 1 \}$.  (5)

If the Alif is followed by Lam at the beginning of a word (the definite article Al, Al Atta`reef) and this word is not the first word in the sentence, the Alif is called Hamzah Wasel and is omitted, as follows:

$R_{05}: \{ G_i \rightarrow \varnothing \mid G_i = \text{Alif} \wedge G_{i+1} = \text{Lam} \}$.  (6)

If Alif comes in Maqsurah form it is mapped to Fatha as follows:

$R_{06}: \{ G_i \rightarrow \text{Fatha} \mid G_i = \text{Alif Maqsurah} \}$.  (7)

If Alif appears at the end of a word after the plural Waw (Waw Aljame`), the Alif is omitted, as follows:

$R_{07}: \{ G_i \rightarrow \varnothing \mid G_i = \text{Alif} \wedge G_{i-1} = \text{Waw} \wedge i = n \}$,  (8)

where n is the number of letters in that word.

If the Alif appears after Tanween Fateh then it is omitted, as follows:

$R_{08}: \{ G_i \rightarrow \varnothing \mid G_i = \text{Alif} \wedge G_{i-1} = \text{Tanween Fateh} \}$.  (9)

Taa Marboota rules: Taa Marboota always occurs at the end of a word and may or may not be followed by a vowel. If it is followed by a vowel, it is mapped to /t/:

$R_{09}: \{ G_i \rightarrow t \mid G_i = \text{Taa Marboota} \wedge G_{i+1} \in \{V, \text{Tanween}\} \}$.  (10)

If Taa Marboota appears at the end of a word and is not followed by a vowel then it is mapped to /h/, as follows:

$R_{10}: \{ G_i \rightarrow h \mid G_i = \text{Taa Marboota} \wedge G_{i+1} \notin \{V, \text{Tanween}\} \}$.  (11)

Lam rule: there are two sets of Arabic letters, the Shamsi set and the Qamari set:

$S = \{\text{Shamsi letters}\}$, $Q = \{\text{Qamari letters}\}$.

If Lam is found in the definite article (Al Atta`reef) and is followed by one of the Shamsi letters, it is omitted, as follows:

$R_{11}: \{ G_i \rightarrow \varnothing \mid G_i = \text{Lam} \wedge G_{i+1} \in S \}$.  (12)

If Lam is followed by one of the Qamari letters then it is mapped to its corresponding phoneme as follows:

$R_{12}: \{ G_i \rightarrow F_i \mid G_i = \text{Lam} \wedge G_{i+1} \in Q \}$.  (13)

Tanween rules: Tanween is an "n" sound that is sometimes added to the end of an Arabic word and is written by doubling the short vowel at the end of that word. Tanween is therefore mapped into the corresponding short vowel, followed by the letter Noon:

$R_{13}: \{ G_i \rightarrow a\,n \mid G_i = \text{Tanween Fateh} \}$,  (14)

$R_{14}: \{ G_i \rightarrow u\,n \mid G_i = \text{Tanween Dham} \}$,  (15)

$R_{15}: \{ G_i \rightarrow i\,n \mid G_i = \text{Tanween Kaser} \}$.  (16)

Vowel rules: these rules handle a long vowel that follows its matching short vowel, i.e. Fatha before Alif, Dhamah before Waw and Kasrah before Yaa. The long vowel is mapped to its phoneme and the associated short vowel is omitted:

$R_{16}: \{ G_i \rightarrow F_i,\; G_{i-1} \rightarrow \varnothing \mid G_i \in V_l \wedge G_{i-1} \in V_s \}$.  (17)

A long vowel is mapped to the associated short relevant phoneme:

$R_{17}: \{ G \rightarrow F \mid G \in V_l \cup V_s \cup \{\ldots\} \}$.  (18)

Last word in a sentence rule: in Arabic, when stopping at the end of a sentence, there should be no diacritic; hence this rule removes the short vowel found on the last letter of the last word of a sentence:

$R_{18}: \{ G_i \rightarrow \varnothing \mid G_i \in V_s \}$.  (19)

D. Phonetization Rules Sequence
As previously mentioned, the proper sequencing and prioritization of the rules is crucial. To obtain correct phoneme transcriptions with high accuracy, we apply the rules in a systematic sequence, taking into consideration that the outcome of one rule is the input to the next, and ordering the rules so as to prevent or reduce conflicts between them. Figure 2 illustrates the order in which the rules are executed. The process starts with the additional rules (replacing one grapheme with one or more phonemes), then handles the conversion rules, and ends with the deletion (grapheme removal) rules.

Fig. 2. Proper sequence of the rules
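The idea that each rule consumes the output of the previous one can be sketched as an ordered list of functions. The Python below is a hedged illustration, not the authors' pipeline: it covers only three grapheme-level steps (a Tanween expansion, the Lam rule for Shamsi letters, and Skoon removal), all function names are hypothetical, and the Shamsi letter set is filled in from general knowledge of Arabic rather than from the paper. The Lam step assumes the Lam of the definite article carries no diacritic before a Shamsi letter.

```python
from typing import Callable, List

SKOON = "\u0652"
ALIF, LAM, NOON = "\u0627", "\u0644", "\u0646"
# Each Tanween mark maps to the short vowel it doubles.
TANWEEN_TO_VOWEL = {"\u064B": "\u064E", "\u064C": "\u064F", "\u064D": "\u0650"}
# Assumed Shamsi (sun) letter set: 14 letters from Taa to Noon.
SHAMSI = set(
    "\u062A\u062B\u062F\u0630\u0631\u0632\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0644\u0646"
)


def expand_tanween(word: str) -> str:
    """Additional rule: one Tanween mark becomes short vowel + Noon."""
    return "".join(
        TANWEEN_TO_VOWEL[ch] + NOON if ch in TANWEEN_TO_VOWEL else ch
        for ch in word
    )


def drop_lam_before_shamsi(word: str) -> str:
    """Deletion rule: omit the article's Lam when a Shamsi letter follows."""
    if len(word) >= 3 and word[0] == ALIF and word[1] == LAM and word[2] in SHAMSI:
        return word[0] + word[2:]
    return word


def remove_skoon(word: str) -> str:
    """Deletion rule: remove Skoon marks."""
    return word.replace(SKOON, "")


# Additional rules first, deletion rules last, as the text describes.
PIPELINE: List[Callable[[str], str]] = [
    expand_tanween,
    drop_lam_before_shamsi,
    remove_skoon,
]


def run_pipeline(word: str, steps: List[Callable[[str], str]] = PIPELINE) -> str:
    """Apply each step to the output of the previous one."""
    for step in steps:
        word = step(word)
    return word
```

Keeping the steps in a plain list makes the ordering explicit and easy to rearrange when evaluating alternative rule sequences against a reference transcription.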

IV. RESULTS AND DISCUSSION
Our system was built using the Java programming language on a Dell machine running a 64-bit Windows operating system, with 4 GB of RAM and an Intel Core 2 Quad CPU @ 2.66 GHz. A fully vowelized (diacritized) Arabic text corpus consisting of 620 sentences, created by [18], was used for evaluating the developed system. This dataset was built from the one thousand most frequently used words in Arabic, which cover the Arabic graphemes, in addition to the most common exception words. The sentences cover words from different arts and science fields, including Quran, Hadith, names, poetry, articles, themes and proverbs, and comprise 3440 words and 27030 graphemes, segmented manually. The output of our system consists of phonemic sentences containing 26238 phonemes in total, which are compared against the manually segmented phonemic sentences.
There are 26088 true positive (TP) phonemes, which match the manual reference. There are also 211 false negative (FN) phonemes, which are the ones missed by our system, and 150 false positive (FP) phonemes, which are produced incorrectly by our system. The number of phonemes produced by our system (SO) is the number of matched phonemes (TP) plus the number of incorrect phonemes (FP), while the manual reference (MR) consists of the number of matched phonemes (TP) plus the number of missed phonemes (FN).
The performance of our system is evaluated using the precision and sensitivity of the transcription: the precision is the ratio of TP to SO, and the sensitivity is the ratio of TP to MR. The precision of our system is 26088/(26088+150) = 99.43%; accordingly, the sensitivity is 26088/(26088+211) = 99.20%. Of the 620 sentences, 580 (93.5%) were phonetized completely correctly. The main source of transcription errors is Arabic poetry sentences, which need special rules to maintain the poetic meter according to the science of prosody (Arud). Arud is a method consisting of a set of rules used to segment the phonemes of Arabic poems and find the meter of Arabic poetry [19]. The Arabic poetry sentences were identified and tagged so that they can be segmented according to their own rules and their meter maintained. Other discrepancies in matching will be handled in our future work. Our system can be used to phonetize any fully diacritized Arabic text, since it was evaluated on a comprehensive and rich corpus. This work also evaluated the existing phoneme transcription rules by implementing them and applying them to a manually transcribed standard dataset. Table 3 shows a sample of our system output, which fully matches the manual phoneme transcription.
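The two metrics follow directly from the three counts given in this section. The short Python sketch below recomputes them under the definitions stated here (precision = TP/SO with SO = TP + FP, sensitivity = TP/MR with MR = TP + FN); the function name is an illustrative choice.

```python
from typing import Tuple


def precision_sensitivity(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, sensitivity) from phoneme-level counts."""
    so = tp + fp   # phonemes produced by the system
    mr = tp + fn   # phonemes in the manual reference
    return tp / so, tp / mr


# Counts reported in this section: TP = 26088, FP = 150, FN = 211.
precision, sensitivity = precision_sensitivity(tp=26088, fp=150, fn=211)
print(f"precision   = {precision:.2%}")    # 99.43%
print(f"sensitivity = {sensitivity:.2%}")  # 99.20%
```

The denominators are 26238 for the system output and 26299 for the manual reference; the first agrees with the total number of phonemes reported for the system output.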
TABLE III. SAMPLE OF BOTH OUR SYSTEM OUTPUT AND THE MANUAL SEGMENTED PHONETIZATION
(The Arabic sentence column was not recovered. Words are separated by "+".)
No. | Manual | System
1 | i:amSilX\is`a:nu+?`ala:+?arba?`ati+?arZul | i:amSilX\is`a:nu+?`ala:+?arba?`ati+?arZul
2 | ?`amilna:+?`ala:+qadamin+u:asa:q | ?`amilna:+?`ala:+qadamin+u:asa:q
3 | ?alku:?`u+u:arrukbatu+mafa:s`ilun+muhimmatun+fil?at`ra:f | ?alku:?`u+u:arrukbatu+mafa:s`ilun+muhimmatun+fil?at`ra:f
4 | nasX\abuddama+minalka?`bi+?`indal?at`fa:lirrud`d`a?` | nasX\abuddama+minalka?`bi+?`indal?at`fa:lirrud`d`a?`
5 | ?abun+?ummun+?una:di:kum+fahal+tus`Gu:na+lau:+marrah | ?abun+?ummun+?una:di:kum+fahal+tus`Gu:na+lau:+marrah
6 | ?abi:+?ummi:+?uxti:+u:a?axi:+?uX\ibbukum+minal?a?`ma:q | ?abi:+?ummi:+?uxti:+u:a?axi:+?uX\ibbukum+minal?a?`ma:q
7 | ?`ut`latussabti+u:al?aX\adi+?`a:lami:i:ah | ?`ut`latussabti+u:al?aX\adi+?`a:lami:i:ah
8 | s`i:a:mul?iTnai:ni+mustaX\abbun+u:akaDa:lika+s`i:a:mulxami:s | s`i:a:mul?iTnai:ni+mustaX\abbun+u:akaDa:lika+s`i:a:mulxami:s
9 | sa?azu:ruka+i:au:maTTula:Ta:?i+?au:il?arbi?`a:? | sa?azu:ruka+i:au:maTTula:Ta:?i+?au:il?arbi?`a:?
10 | ?usbu:?`un+u:a:X\idun+la:+i:akfi:lissafar | ?usbu:?`un+u:a:X\idun+la:+i:akfi:lissafar
The transcription rules used in our system are applied with different frequencies. The most frequently used rules were the short vowel and Shaddah rules; Table 4 shows the number of times each implemented rule was used.
We evaluated our system using a publicly available dataset [18]. During the evaluation, we adjusted the order of applying the rules to obtain high performance against the manually segmented phonemes; our rule sequence is slightly different from those previously published. We also implemented 18 Arabic phonetization rules, compared to the 13 rules in Al-Ghamdi et al. [13] and the 11 rules in El-Imam [15]. Tajweed rules will be implemented in the next stage of our research.
TABLE IV. FREQUENCIES OF IMPLEMENTED RULES.
Rule | Frequency (number of rule applications)
Short vowels | 1066
Shaddah | 969
Lam | 961
Hamza | 669
Taa | 477
Tanween | 368
Alif | 239
Long vowels | 95
Last Word and Skoon | 30
V. CONCLUSION
An automated system for transcribing graphemes to phonemes for standard Arabic text was developed. The system follows the rule-based approach to phonetization, with the rules written formally using set notation. The system works in two steps. First, exception words, which do not follow the standard phonological rules, are looked up in a dictionary to retrieve their phonemic transcriptions. Second, the remaining text is phonetized using the Arabic phonological rules in the proper priority order. The developed system was evaluated against a benchmark dataset of manual phonetizations. It showed promising precision and sensitivity, which allows it to be used on any vowelized Arabic corpus. Our future plan is to implement the phonetic rules that convert phonemes to phones, integrated with the Tajweed rules of the Holy Quran. This will help to close the remaining gaps in Arabic letter-to-sound rules.
REFERENCES
[1] ABUSHARIAH, M.A.M., AINON, R.N., ZAINUDDIN, R., ELSHAFEI, M. and KHALIFA, O.O., 2010. Phonetically rich and balanced speech corpus for Arabic speaker-independent continuous automatic speech recognition systems, Information Sciences Signal Processing and their Applications (ISSPA), 2010 10th International Conference on 2010, pp. 65-68.
[2] AHMED, M.E., 1991. Toward an Arabic text-to-speech system. The Arabian Journal for Science and Engineering, 16(4).
[3] AINSWORTH, W., 1973. A system for converting English text into speech. Audio and Electroacoustics, IEEE Transactions on, 21(3), pp. 288-290.
[4] ANDERSEN, O., KUHN, R., LAZARIDS, A., DALSGAARD, P., HAAS, J. and NOTH, E., 1996. Comparison of two tree-structured approaches for grapheme-to-phoneme conversion, Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on 1996, IEEE, pp. 1700-1703.
[5] MEIHAMI, H., 2013. Text-To-Speech Software: a New Perspective in Learning and Teaching Word Stress, Word Intonation, Pitch Contour, and Fluency of English Reading. International Letters of Social and Humanistic Sciences, (08), pp. 24-33.
[6] GIBSON, E.J., PICK, A., OSSER, H. and HAMMOND, M., 1962. The role of grapheme-phoneme correspondence in the perception of words. The American Journal of Psychology, pp. 554-570.
[7] TAJCHMAN, G., FOSTER, E. and JURAFSKY, D., 1995. Building multiple pronunciation models for novel words using exploratory computational phonology. EUROSPEECH 1995, Citeseer.
[8] EL-IMAM, Y.A., 2004. Phonetization of Arabic: rules and algorithms. Computer Speech & Language, 18(4), pp. 339-373.
[9] KUJALA, J.V., 2013. A probabilistic approach to pronunciation by analogy. Computer Speech & Language, 27(5), pp. 1049-1067.
[10] RIZK, M., MOHANNA, Y., MOUGNIEH, H., HAMAD, M., KHALIL, F. and GHADDAR, A., 2011. Arabic Text to Speech Synthesizer: Arabic Letter to Sound Rules. International Review on Computers & Software, 6(1), pp. 72-85.
[11] ALOTAIBI, Y.A., 2012. Comparing ANN to HMM in implementing limited Arabic vocabulary ASR systems. International Journal of Speech Technology, 15(1), pp. 25-32.
[12] HABASH, N., SOUDI, A. and BUCKWALTER, T., 2007. On Arabic Transliteration. Arabic Computational Morphology. Springer, pp. 15-22.
[13] AL-GHAMDI, M.M., AL-MUHTASIB, H. and ELSHAFEI, M., 2004. Phonetic Rules in Arabic Script. Journal of King Saud University-Computer and Information Sciences, 16, pp. 85-115.
[14] SELIM, H. and ANBAR, T., 1987. A phonetic transcription system of Arabic text, Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'87. 1987, IEEE, pp. 1446-1449.
[15] EL-IMAM, Y.A., 1989. An unrestricted vocabulary Arabic speech synthesis system. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(12), pp. 1829-1845.
[16] LAMEL, L., MESSAOUDI, A. and GAUVAIN, J., 2009. Automatic speech-to-text transcription in Arabic. ACM Transactions on Asian Language Information Processing (TALIP), 8(4), pp. 18.
[17] HARRAT, S., MEFTOUH, K., ABBAS, M. and SMALI, K., 2014. Grapheme To Phoneme Conversion-An Arabic Dialect Case, Spoken Language Technologies for Under-resourced Languages 2014.
[18] AL-DIRI, B., SHARIEH, A. and HUDAIB, T., 2004. An Arabic speech corpus: a database for Arabic speech recognition. Dirasat: Pure Sciences, 31(2), pp. 208-219.
[19] ALNAGDAWI, M.A., 2013. Finding Arabic Poem Meter using Context Free Grammar. Journal of Communications and Computer Engineering.
[20] DUTOIT, T., 1997. High-quality text-to-speech synthesis: An overview. Journal of Electrical and Electronics Engineering Australia, 17, pp. 25-36.
