
I.J. Intelligent Systems and Applications, 2014, 12, 69-85
Published Online November 2014 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2014.12.10

A Survey on Statistical Based Single Channel Speech Enhancement Techniques

Sunnydayal. V
National Institute of Technology Warangal, Warangal-506004, India
Email: sunnydayal45@gmail.com

N. Sivaprasad, T. Kishore Kumar
National Institute of Technology Warangal, Warangal-506004, India
Email: nsprasad@nitw.ac.in, kishorefr@gmail.com

Abstract—Speech enhancement is a long-standing problem with various applications such as hearing aids, automatic recognition and coding of speech signals. Single channel speech enhancement techniques are used to enhance speech degraded by additive background noise. Background noise can have an adverse impact on our ability to converse without hindrance in very noisy environments, such as busy streets, a car, or the cockpit of an airplane. Such noise can affect both the quality and the intelligibility of speech. This is a survey paper whose objective is to provide an overview of speech enhancement algorithms that enhance noisy speech corrupted by additive noise. The algorithms covered are mainly based on statistical approaches. Different estimators are compared, and challenges and opportunities of speech enhancement are also discussed. This paper helps in choosing the best statistical technique for speech enhancement.

Index Terms—Speech Enhancement; Wiener Filtering; MMSE Estimator; Bayesian Estimators; Maximum A Posteriori (MAP) Estimators

I. INTRODUCTION

The main aim of speech enhancement is to improve the performance of speech communication systems in noisy environments. Speech enhancement can be applied to a speech recognition system, a mobile radio communication system, or a set of low-quality recordings, or to improve the performance of aids for the hearing impaired. The interference source may be wide-band noise in the form of white or colored noise, a periodic signal, room reverberation, hum noise, or fading noise. The speech signal may be corrupted by one or more noise sources simultaneously. There are two principal perceptual criteria for measuring the performance of a speech enhancement system. The first criterion, the quality of the enhanced signal, measures its clarity, its degree of distortion, and the level of residual noise in the signal. The second criterion is the intelligibility of the enhanced signal; this is an objective measure that gives the percentage of words that listeners can correctly identify. Most speech enhancement systems improve the signal quality at the expense of reducing its intelligibility. Our goal in speech processing is to obtain a more convenient or more useful representation of the information carried by the speech signal.

In general, digital voice communications, automatic speech recognition systems and human-machine interfaces are required to perform reliably in noisy environments. For example, in hands-free operation of cellular phones in vehicles, the speech signal to be transmitted may be contaminated by reverberation and background noise. In many cases these systems work well in nearly noise-free conditions, but their performance deteriorates rapidly in noisy conditions. Therefore, the development of preprocessing algorithms for speech enhancement is always of interest. The goal of speech enhancement varies with the application: to increase intelligibility, to improve the overall speech quality, or to improve the performance of voice communication devices. Speech analysis is used to extract features directly pertinent to different applications; it can be implemented in the time domain (operating directly on the speech waveform) or in the frequency domain (after a spectral transformation of the speech) [1].

Many single-channel speech enhancement approaches have been proposed over the years, including spectral subtraction, Wiener filtering, statistical methods and noise estimation algorithms [2]. Time-domain representations of the speech signal include measurements such as energy, average zero-crossing rate and the autocorrelation functions [3]. In time-domain approaches, which include Kalman-filter-based methods, the enhancement is performed directly on the time-domain noisy speech signal via the application of enhancement filters (i.e., linear convolution). In frequency-domain approaches, a short-time Fourier transform (STFT) [4] is typically applied to the time-domain noisy speech signal; this class of methods includes, among others, spectral subtraction and the Bayesian approaches. The enhancement can also be performed in other domains. Different criteria are used to evaluate speech enhancement, namely the intelligibility and the quality of the processed speech [2]. For example, the so-called subspace approach is obtained by applying a Karhunen-Loeve Transform (KLT) [5] to the time-domain signal, performing the enhancement in that domain, and finally going back to the time domain using an inverse KLT operation.

Copyright © 2014 MECS I.J. Intelligent Systems and Applications, 2014, 12, 69-85

In all these approaches, a modification is made to the noisy speech.
The paper is organized as follows. Section II presents an overview of speech enhancement and a classification of speech enhancement methods. Section III discusses statistical approaches for speech degraded by background noise. Section IV presents the challenges and opportunities in speech enhancement. Section V concludes with a summary.

Approaches for Speech Enhancement

Additive Noise Removal
  - Time Domain Approaches: 1. Comb Filtering, 2. LPC based Filtering, 3. Adaptive Filtering (Kalman Filtering, H-infinity algorithm), 4. HMM Filtering, 5. Neural Networks
  - Transform Domain Approaches: 1. DFT, 2. DCT, 3. Wavelets, 4. Spectral Subtraction
  - Statistical Based Approaches: 1. Wiener Filtering, 2. ML Estimators, 3. Bayesian Estimators, 4. MMSE Estimators (MMSE Magnitude Estimator, MMSE Complex Exponential Estimator), 5. Log MMSE Estimators, 6. MAP Estimators, 7. Perceptually Motivated Bayesian Estimators
  - Subspace Approach
  - Auditory Masking Methods

Reverberation Cancellation
  - Temporal Processing: 1. De-Reverberation Filtering, 2. Envelope Filtering
  - Cepstral Processing: 1. CMS and RASTA processing

Multi Speech (Speaker) Separation
  - 1. CASA Methods, 2. Sinusoidal Modelling

Fig. 2.1 Classification of Speech Enhancement Methods

II. OVERVIEW OF SPEECH ENHANCEMENT

Enhancement means an improvement in the value or quality of something. Applied to speech, it is the improvement in intelligibility and/or quality of a degraded speech signal by using signal processing tools. Speech enhancement is a very difficult problem for two reasons. The first reason is that the nature and characteristics of the noise signals can change randomly, from application to application and over time. The second reason is that the performance measure is defined differently for each application. Two perceptual criteria are widely used to measure the performance of algorithms: quality and intelligibility. The classification of speech enhancement methods is shown in Fig. 2.1.

III. STATISTICAL-MODEL BASED METHODS

A. Wiener Filtering

Let y[n] be a discrete-time noisy sequence

y[n] = x[n] + w[n]    (1)

where x[n] is the desired signal, which we also refer to as the "object", and w[n] is the unwanted background noise. We assume x[n] and w[n] to be uncorrelated, wide-sense stationary random processes with power spectral density functions Sx(ω) and Sw(ω), respectively.

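The additive model (1) and the Wiener gain derived from it below can be sketched in a few lines of numpy. This is an illustrative toy with synthetic white-noise "speech" and noise and power spectra estimated by frame averaging, not an implementation from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)

def psd(frames):
    """Average periodogram over frames (power spectral density estimate)."""
    return np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)

N, F = 256, 400                      # frame length, number of frames
x = rng.normal(0.0, 2.0, (F, N))     # stand-in for the desired signal x[n]
w = rng.normal(0.0, 1.0, (F, N))     # uncorrelated background noise w[n]
y = x + w                            # noisy observation, Eq. (1)

Sx, Sw, Sy = psd(x), psd(w), psd(y)
# Additivity of power spectra for uncorrelated signals: Sy ~= Sx + Sw
assert np.allclose(Sy, Sx + Sw, rtol=0.2)

H = Sx / (Sx + Sw)                   # Wiener gain, one value per bin
x_hat = np.fft.irfft(H * np.fft.rfft(y, axis=1), n=N, axis=1)

# The filtered error power should be below the raw noise power.
print(np.mean((x_hat - x) ** 2) < np.mean(w ** 2))   # True
```

In a real system the frames would come from an STFT of one long recording and Sw would have to be estimated during speech pauses; here both PSDs are computed from oracle signals purely to illustrate the gain formula.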

One approach to recovering the desired signal x[n] relies on the additivity of power spectra:

Sy(ω) = Sx(ω) + Sw(ω)    (2)

To recover an object sequence x[n] corrupted by additive noise w[n], that is, from a sequence y[n] = x[n] + w[n], we find a linear filter h[n] such that the sequence x̂[n] = y[n] * h[n] minimizes the expected value of the squared estimation error. Under the condition that the signals x[n] and w[n] are stationary and uncorrelated, the frequency-domain solution to this stochastic optimization problem is given by

H(ω) = Sx(ω) / (Sx(ω) + Sw(ω))    (3)

which is referred to as the Wiener filter. When the signals x[n] and w[n] meet the conditions under which the Wiener filter is derived, that is, a stationary and uncorrelated object and background, the Wiener filter provides noise suppression without considerable distortion in the object estimate and the background residual. The required power spectra Sx(ω) and Sw(ω) can be estimated by averaging over multiple frames when sample functions of x[n] and w[n] are available. In practice the background and the desired signal are non-stationary in the sense that their power spectra change over time, that is, they can be expressed as time-varying functions Sx(n, ω) and Sw(n, ω). Thus every STFT frame is processed by a different Wiener filter. The time-varying Wiener filter for the frame starting at sample pL can be expressed as

H(pL, ω) = Ŝx(pL, ω) / (Ŝx(pL, ω) + Ŝw(pL, ω))    (4)

where Ŝx(pL, ω) is an estimate of the time-varying power spectrum of x[n].

Jacob Benesty [6] analyzed the quantitative performance behaviour of the Wiener filter in the context of noise reduction. The author showed that in the single channel case the a posteriori SNR (defined after the Wiener filter) is greater than or equal to the a priori SNR (defined before the Wiener filter), indicating that the Wiener filter is always able to achieve noise reduction. The amount of noise reduction is normally proportional to the amount of speech degradation. The authors showed that speech distortion can be better managed in three ways. If some a priori knowledge (such as the linear prediction coefficients) of the clean speech signal is available, it can be exploited to achieve noise reduction while maintaining the level of speech distortion. When no a priori knowledge is available, better control of noise reduction and speech distortion can still be reached by properly manipulating the Wiener filter, leading to a sub-optimal Wiener filter. Amehraye, D. Pastor and A. Tamtaoui [7] proposed a speech enhancement technique that deals with the musical noise resulting from subtractive-type algorithms and particularly Wiener filtering. The authors compared several methods that introduce perceptually motivated modifications of standard Wiener filtering and proposed a new speech enhancement technique. The main aim is to improve the quality of the enhanced speech signal provided by standard Wiener filtering by controlling the latter via a second filter regarded as a psychoacoustically motivated weighting factor. Philipos C. Loizou [8] introduced a speech enhancement approach using a frequency-specific composite gain function for Wiener filtering, motivated by the recently established finding that acoustic cues at low frequencies improve speech recognition in noise under combined electric and acoustic stimulation. With this modification the proposed approach is able to recover more low-frequency (LF) components and enhance speech quality; an adaptive procedure is employed to determine the low-frequency boundary. Jacob Benesty [9] presents a theoretical analysis of the performance of the optimal noise-reduction filter in the frequency domain. Both the clean speech and the noise are modelled with autoregressive (AR) models, building a relation between the AR parameters of the clean speech and noise signals and the Wiener filter. The authors showed that if the noise is not predictable, the Wiener filter is mostly related to the AR parameters of the desired speech signal; when the desired signal is not predictable, the Wiener filter is mostly related to the AR parameters of the noise signal. Chung-Chien Hsu [10] proposed a single-channel speech enhancement algorithm that applies the traditional Wiener filter in the spectro-temporal modulation domain. In this work the multiresolution spectro-temporal analysis and synthesis framework for Fourier spectrograms is extended to the analysis-modification-synthesis (AMS) framework for speech enhancement. Feng Huang [11] proposed a transform-domain Wiener filtering approach for enhancing speech periodicity. The enhancement is performed on the linear prediction residual signal: two sequential lapped frequency transforms are applied to the residual in a pitch-synchronous manner, so that the residual is effectively described by two separate sets of transform coefficients corresponding to the periodic and aperiodic components, respectively. Different filter parameters are designed for the transform coefficients of the periodic and aperiodic components. A template-driven method is employed to estimate the filter parameters for the periodic components, whereas for the aperiodic components the filter parameters are calculated using the local SNR for effective noise reduction.

B. Maximum-Likelihood Estimators

The estimator based on the maximum likelihood principle is termed the Maximum Likelihood Estimator (MLE). It yields an estimate that is approximately the Minimum Variance Unbiased (MVU) estimator. The variance of an estimator should be minimal, because the variance decreases as the estimation accuracy improves. An estimator is unbiased if, on average, it yields the true value of the unknown parameters. Estimation accuracy depends directly on the probability density function (PDF). The quality of the approximation relies on the property that the MLE is asymptotically (for large data records) efficient.

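As a toy numerical illustration of ML estimation by grid search, consider estimating a constant level A from samples x[n] = A + w[n] with Gaussian w[n]; the model, the interval and all values below are assumed for illustration and are not taken from the survey:

```python
import numpy as np

rng = np.random.default_rng(1)

A_true, sigma = 1.3, 0.5
x = A_true + rng.normal(0.0, sigma, 2000)   # observed data x[n] = A + w[n]

# Grid search over the allowable interval [a, b] = [0, 3]: evaluate the
# log-likelihood at each candidate theta and keep the maximizer.
grid = np.linspace(0.0, 3.0, 3001)          # grid spacing 0.001
loglik = [-np.sum((x - theta) ** 2) / (2 * sigma**2) for theta in grid]
A_mle = grid[int(np.argmax(loglik))]

# For this Gaussian model the MLE is the sample mean, so the grid
# search should land on a gridded version of it.
print(abs(A_mle - np.mean(x)) <= 0.001)     # True (within grid spacing)
```

As long as the grid spacing is small relative to the curvature of the likelihood, the search is guaranteed to find the maximizer over the interval, which is exactly the "safe" strategy described in the text.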

The advantage of the MLE is that we can always find it numerically for a given data set, because the MLE is determined as the maximum of a known function, namely the likelihood function. Fig. 3.1 shows the search for the MLE [12].

Fig. 3.1 Grid search for MLE

If, for example, the allowable values of θ lie in the interval [a, b], then we need only maximize p(x; θ) over that particular interval. The "safest" way to do this maximization is to perform a grid search over the interval [a, b]. As long as the grid spacing in θ is small enough, we are guaranteed to find the MLE for the given set of data.

M. L. Malpass [13] proposed enhancing speech in an additive acoustic noise environment by performing a spectral decomposition of a frame of noisy speech and attenuating each spectral component depending on how much the measured speech-plus-noise power exceeds an estimate of the background noise. Using a two-state model for the speech event (speech absent or speech present) together with the maximum likelihood estimator of the magnitude of the acoustic spectrum results in a new class of suppression curves that allows a tradeoff of speech distortion against noise suppression. Takuya Yoshioka [14] proposed a speech enhancement method for signals contaminated by room reverberation and additive background noise. The following conditions are assumed: (1) the spectral components of speech and noise are statistically independent Gaussian random variables; (2) in every frequency bin the convolutive distortion channel is modeled as an auto-regressive system; (3) the speech power spectral density is modeled as an all-pole spectrum, whereas the power spectral density of the noise is assumed to be stationary and given a priori. Under these conditions, the proposed method estimates the parameters of the channel and those of the all-pole speech model based on the maximum likelihood estimation method.

C. Bayesian Estimators

In the classical approach to statistical estimation, the parameter θ of interest is assumed to be a deterministic but unknown constant. The Bayesian approach instead assumes that θ is a random variable whose specific realization we must estimate. It is named the Bayesian approach because its implementation relies directly on Bayes' theorem. The motivation for adopting it is twofold. First, if we have some prior knowledge about θ available, we can incorporate it into our estimator; the mechanism for doing so is to assume that θ is a random variable with a given prior PDF. The classical approach, on the other hand, finds it difficult to make use of any prior knowledge, so the Bayesian approach can improve the estimation accuracy. Second, Bayesian estimation is useful in situations where an MVU estimator cannot be found. The Bayesian approach to data modelling is shown in Fig. 3.2 [12].

Fig. 3.2 Bayesian Approach to data modeling

In the classical approach we might choose a realization of the noise w(n) and add it to a given A; this procedure is repeated M times, each time adding a new realization of w(n) to the same A. In the Bayesian approach, for every realization we would draw A according to its PDF and then generate w(n), as shown in Fig. 3.3, again repeating the procedure M times. In the classical approach we would thus obtain one MSE for every assumed value of A, and the MSE depends on A; in the Bayesian approach a single MSE figure is obtained, averaged over the PDF of A, and the MSE does not depend on A. Ephraim [15] developed a Bayesian estimation approach for enhancing speech signals that have been degraded by statistically independent additive noise. In particular, maximum a posteriori (MAP) and minimum mean square error (MMSE) signal estimators are developed using hidden Markov models (HMMs) for the clean signal and the noise process. The authors showed that the MMSE estimator comprises a weighted sum of conditional mean estimators for the composite states of the noisy signal (pairs of states of the models for the signal and noise), where the weights equal the posterior probabilities of the composite states given the noisy signal. A gain-adapted MAP estimator is developed using the expectation-maximization (EM) algorithm. Chang Hui You [16] proposed a Beta-order minimum mean-square error (MMSE) speech enhancement approach for estimating the short-time spectral amplitude (STSA) of a speech signal, analyzing the characteristics of the Beta-order STSA MMSE estimator and the relation between the value of Beta and the spectral amplitude gain function of the MMSE technique. The effectiveness of a variety of fixed Beta values in estimating the STSA based on the MMSE criterion is investigated. Whenever the speech level is above the noise level, the a priori SNR involves a frame delay and is no longer a smoothed SNR estimate, which is more appropriate for non-stationary signals.

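The classical-versus-Bayesian thought experiment described above (fixing A and redrawing w(n), versus drawing A from its prior on every realization) can be sketched as a small Monte-Carlo simulation. The sample-mean estimator and the Gaussian prior used here are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, sigma = 5000, 10, 1.0          # trials, samples per trial, noise std
est = lambda x: np.mean(x)           # estimator of A: the sample mean

# Classical: A is a fixed unknown constant; one MSE per assumed value of A.
classical_mse = {}
for A in (0.0, 1.0, 2.0):
    errs = [est(A + rng.normal(0, sigma, N)) - A for _ in range(M)]
    classical_mse[A] = np.mean(np.square(errs))

# Bayesian: A is drawn from its prior PDF on every realization;
# a single MSE figure, averaged over the prior of A.
errs = []
for _ in range(M):
    A = rng.normal(0.0, 1.0)         # assumed prior: A ~ N(0, 1)
    errs.append(est(A + rng.normal(0, sigma, N)) - A)
bayes_mse = np.mean(np.square(errs))

# For the unbiased sample-mean estimator both figures approach
# sigma^2 / N = 0.1, but the classical table has one entry per A
# while the Bayesian result is a single number.
print(classical_mse, bayes_mse)
```

For this particular estimator the classical MSE happens not to depend on A; the point of the sketch is the difference in procedure, one MSE per assumed A versus one MSE averaged over the prior of A.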

Whenever the speech level is close to or below the noise level, the a priori SNR estimation equation has a smoothing effect and musical noise is greatly reduced; therefore total suppression is improved. Sriram Srinivasan [17] proposed a Bayesian minimum mean square error approach for the joint estimation of the short-term predictor parameters of speech and noise. In this work the authors used trained codebooks of speech and noise linear predictive coefficients to model the a priori knowledge required by the Bayesian scheme. In contrast to current Bayesian estimation approaches that treat the excitation variances as part of the a priori information, in the proposed technique they are computed online for every short-time segment, based on the observation at hand. Consequently, the method performs well in non-stationary noise conditions. The resulting estimates of the speech and noise spectra are employed in a Wiener filter. Estimation of functions of the short-term predictor parameters is also addressed, in particular one that results in the minimum mean square error estimate of the clean speech signal. The Bayesian approach is given in Fig. 3.3 [12].

Fig. 3.3 Bayesian Approach (decision flow: if the posterior PDF is not known but its first two moments are, use the LMMSE estimator; if the posterior PDF is known and its mean can be computed, use the MMSE estimator; otherwise maximize the posterior PDF to obtain the MAP estimator)

Fig. 3.4 Classical Versus Bayesian Approach (decision flow: starting from a signal processing problem and a data model, available prior knowledge, or a dimensionality problem that prior knowledge can mitigate, leads to the Bayesian approach; otherwise either a new data model is formulated or the classical approach is used)

Achintya Kundu [18] considered a general linear model of signal degradation, modeling the probability density function (PDF) of the clean signal by a Gaussian mixture model (GMM) and the additive noise by a Gaussian PDF, and derived the minimum mean square error (MMSE) estimator, which is non-linear.

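The smoothing-versus-delay behaviour of the a priori SNR discussed above (You [16]) is commonly realized with the decision-directed rule, which goes back to Ephraim and Malah's MMSE-STSA work [31]. The following per-bin sketch uses an assumed smoothing constant and a synthetic periodogram track; it is an illustration of the rule, not code from any surveyed paper:

```python
import numpy as np

def decision_directed(Y2, noise_psd, alpha=0.98):
    """Frame-recursive a priori SNR estimate for one frequency-bin track.

    Y2:        |Y(p, w)|^2 per frame (noisy periodogram values)
    noise_psd: estimate of the noise PSD in this bin
    alpha:     smoothing constant (assumed value; trades musical-noise
               reduction against a one-frame delay in the estimate)
    """
    xi = np.empty_like(Y2)
    snr_post = Y2 / noise_psd                  # a posteriori SNR per frame
    prev_amp2 = 0.0                            # |X_hat|^2 of previous frame
    for p in range(len(Y2)):
        # Weighted mix of the previous clean-amplitude estimate and the
        # instantaneous (half-wave rectified) a posteriori SNR excess.
        xi[p] = alpha * prev_amp2 / noise_psd \
                + (1 - alpha) * max(snr_post[p] - 1.0, 0.0)
        gain = xi[p] / (1.0 + xi[p])           # Wiener-style gain from xi
        prev_amp2 = (gain ** 2) * Y2[p]
    return xi

rng = np.random.default_rng(3)
Y2 = rng.exponential(2.0, 200)        # toy noisy periodogram track
xi = decision_directed(Y2, noise_psd=1.0)
print(xi.min() >= 0.0)                # True: the estimate stays non-negative
```

With alpha close to 1 the estimate is heavily smoothed (reducing musical noise at low SNR); with smaller alpha it tracks the instantaneous SNR more closely, matching the trade-off described in the text.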

Takuya Yoshioka [19] proposed an adaptive noise suppression method for non-stationary noise based on Bayesian estimation. The following conditions are assumed: (1) speech and noise samples are statistically independent, and they follow auto-regressive (AR) processes; (2) the prior distribution of the noise AR model parameters in the current frame equals the posterior distribution of these parameters calculated in the previous frame. Based on these conditions, the proposed method approximates the joint posterior distribution of the AR model parameters and the speech samples by the variational Bayesian method. Moreover, an efficient implementation is described by assuming that each covariance matrix has Toeplitz structure. The difference between the classical approach and the Bayesian approach is shown in Fig. 3.4 [12].

Eric Plourde [20] proposed a new family of Bayesian estimators for speech enhancement where the cost function includes both a power law and a weighting factor. The parameters of the cost function, and thus of the corresponding estimator gain, are chosen based on characteristics of the human auditory system; this choice of parameters is found to decrease the estimator gain at high frequencies. This frequency dependence of the gain improves the noise reduction while limiting the speech distortion, so the new estimators attain better enhancement performance than existing Bayesian estimators (MMSE STSA, LSA, WE), in terms of both objective and subjective measures. In P. J. Wolfe [21], [22], masking thresholds were introduced into the Bayesian estimator's cost function to make it more perceptually meaningful, whereas in P. C. Loizou [23] several perceptually relevant distortion metrics were considered as cost functions. One of the cost functions found to yield the best results in [23] was based on the perceptually weighted error criterion employed in speech coding, where the error spectrum is weighted by a filter that is the inverse of the original speech spectrum. This was extended by proposing a generalization of the MMSE STSA cost function where the error between the estimated and actual clean speech STSA is weighted by the STSA of the clean speech raised to an exponent; the resulting estimator is termed Weighted Euclidean (WE). Another generalization of the MMSE STSA cost function was proposed by You et al. [24] in the Beta-order STSA MMSE estimator, which we denote Beta-SA for convenience. The Beta-SA estimator applies a power law to the estimated and actual clean speech STSA in the squared error of the cost function. The advantage of the proposed method is that it combines both weighting and compression in the cost function. Such estimators are more advantageous at low SNRs because distortions of high-frequency contents of speech, such as fricatives, are less perceptible in heavy noise (low SNR), but could become more perceptible in regions where the noise is weak. At low SNRs the gain is decreased and heavy noise masks the speech signal, so the estimator applies a small gain in order to remove much of the noise.

Eric Plourde [25] proposed a new cost function that generalizes those of existing Bayesian STSA estimators and obtained the corresponding closed-form solution for the optimal clean speech STSA. In this approach, an estimate of the clean speech is derived by minimizing the expectation of a cost function that penalizes errors in the clean speech estimate. The Beta-order STSA MMSE estimator applies a power law to the estimated and actual clean speech STSA in the squared error of the cost function. Jincang Hao [26] assumes the speech model to be a Gaussian mixture model (GMM) in the log-spectral domain, in contrast to most current models, which work in the frequency domain. Precise signal estimation is a computationally intractable problem, so the authors derived three approximations to improve the efficiency of signal estimation. Using the Kullback-Leibler (KL) divergence criterion, the Gaussian approximation transforms the log-spectral domain GMM into the frequency domain. The frequency-domain Laplace technique computes the maximum a posteriori (MAP) estimator for the spectral amplitude; similarly, the log-spectral domain Laplace technique computes the MAP estimator of the log-spectral amplitude. Further, gain and noise spectrum adaptation are implemented using the expectation-maximization (EM) algorithm within the GMM under the Gaussian approximation.

P. Spencer Whitehead [27] employed a robust Bayesian analysis to study the sensitivity of the algorithms to errors in the noise estimate and to improve the SNR. Eric Plourde [28] proposed a Bayesian short-time spectral amplitude (STSA) estimation for single channel speech enhancement in which the spectral components are usually assumed uncorrelated. However, this assumption is inexact, since some correlation is present in practice. The authors investigated a multidimensional Bayesian STSA estimator that assumes correlated spectral components and derived closed-form expressions for an upper and a lower bound on the desired estimator. Joint fundamental frequency and model order estimation is an important problem in many applications such as speech and music processing. Jesper Kjær Nielsen [29] developed an approximate estimation algorithm for these quantities using Bayesian inference. The inference about the fundamental frequency and the model order relies on a probability model that corresponds to minimal prior information; from this probability model, the exact posterior distribution of the fundamental frequency is obtained. Nasser Mohammadiha [30] presented a speech enhancement algorithm that relies on Bayesian Nonnegative Matrix Factorization (NMF); both minimum mean square error (MMSE) and maximum a posteriori (MAP) estimates of the magnitude of the clean speech DFT coefficients are derived. To exploit the temporal continuity of the speech and noise signals, a proper prior distribution is introduced by widening the posterior distribution of the NMF coefficients at the previous time frames.


distribution at the previous time frames. To perform the proposed that the amplitude of a gain normalized Fourier
above ideas, a recursive temporal update scheme is transform coefficient are gamma-like distributed. This
proposed to get the mean of the prior distribution, Further, model leads to a Rayleigh distribution for the amplitude
the uncertainty of the priori data is ruled by the shape of every signal spectral part, and assumes insignificant
parameter of the distribution that is learnt automatically probability for low amplitude realizations. Therefore, this
based on the nonstationarity of the signals. model will cause less suppression of the noise than
different amplitude distribution models (e.g., gamma) that
D. MMSE Estimator
assume high probability for low amplitude Realizations.
Y= h(X) + V, Where Y is noisy vector, h is a known Ephraim [35] derived a short-time spectral amplitude
function, and V a noise vector. (STSA) estimator for speech signals that MMSE of the
The posterior probability of X is given by log spectra and examine it in enhancing noisy speech. It
p y / x  px  was found that the new estimator is effective in
px / y   (5) enhancing the noisy speech, and it considerably improves,
 p y / x  px dx the quality of speech. Zhong-Xuan Yuan [36] proposed a
hybrid algorithm that incorporates the (MMSE algorithm
The mean of X according to p(x/y): and the auditory masking algorithm for speech
enhancement. This hybrid algorithm yield reduction in
xˆ  y   E x / y  
 xpx / ydx (6) residual noise compared with that in the enhanced speech
produced by the MMSE algorithm alone. Guo-Hong Ding
An estimator x̂ for x is a function [37] proposed a unique speech enhancement approach,
known as power spectral density minimum mean-square
xˆ : y  xˆ y  (7) error (PSD-MMSE) estimation-based speech
enhancement, that is implemented within the power
The value of x̂ y  at a specific observed value y is an spectral domain wherever stationary random noise may
estimate of x. The mean square error of an estimator x̂ is be modeled as exponential distribution. Speech
MSE(x̂) = E[(X − x̂(y))²]  (8)

The minimum mean square error (MMSE) estimator x̂_MMSE is the one that minimizes the MSE:

x̂_MMSE(y) = E[X | Y = y]  (9)

The MMSE estimator is unbiased:

E[x̂_MMSE(Y)] = E[X]  (10)

The posterior MSE is defined as (for every y)

MSE(x̂ | y) = E[(X − x̂(y))² | Y = y]  (11)

with minimal value MMSE(y).
Yariv Ephraim [31] proposed a class of speech enhancement systems which capitalize on the major importance of the short-time spectral amplitude (STSA) of the speech signal in its perception. The author derived the MMSE STSA estimator, based on modeling the noise and speech spectral components as statistically independent Gaussian random variables, analyzed the performance of the proposed STSA estimator, and compared it with the Wiener-based STSA estimator. The MMSE STSA estimator examines the quality of signals, or their strengths, under noisy conditions and in regions where the presence of the signal is uncertain. The MMSE STSA estimator is combined with the complex exponential of the noisy phase in constructing the enhanced signal. The a priori probability distributions of the speech and noise Fourier expansion coefficients must be known to derive the MMSE STSA estimator. For example, Zelinski and Noll [32], [33] found that gain-normalized cosine transform coefficients are approximately Gaussian distributed. Porter and Boll [34] showed that magnitude-squared spectra can be modelled by a mixed exponential distribution, and an MMSE estimator is built based on the parametric distributions. Li Deng [38] proposed a unique speech feature enhancement technique based on a nonlinear acoustic environment model, a probabilistic model that effectively incorporates the phase relationship between the clean speech and the corrupting noise in the acoustic distortion process. The core of the enhancement algorithm is the MMSE estimator for the log Mel power spectra of clean speech based on the phase-sensitive environment model. The clean speech and noisy speech joint probabilities are modeled as multivariate Gaussians. Since a noise estimate is needed by the MMSE estimator, a high-quality sequential noise estimation algorithm is additionally developed and presented. Sequential MAP (maximum a posteriori) learning for noise estimation performs better than sequential maximum likelihood learning, both evaluated under the identical phase-sensitive model. Li Deng [39] proposed an algorithm which exploits joint prior distributions in the clean speech model that incorporate both the static and frame-differential dynamic cepstral parameters. Given the noisy observations, full posterior probabilities for clean speech can be computed by incorporating a linearized version of a nonlinear acoustic distortion model. The final form of the derived conditional MMSE estimator is shown to be a weighted sum of three separate terms, and the sum is weighted once more by the posterior for each mixture component in the speech model. The first of the three terms arises naturally from the predictive mechanism embedded in the acoustic distortion model in the absence of any prior data. The remaining two terms result from the speech model using only the static prior and only the dynamic prior, respectively.
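As a concrete illustration of the MMSE STSA estimator discussed above, the Ephraim-Malah gain can be evaluated directly from the a priori SNR ξ and a posteriori SNR γ. This is only a sketch in Python; the function name and the use of SciPy's exponentially scaled Bessel functions are our own choices, not code from the surveyed papers:

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled modified Bessel functions

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE-STSA gain for a priori SNR `xi` and
    a posteriori SNR `gamma` (both linear, for one frequency bin)."""
    nu = xi * gamma / (1.0 + xi)
    # i0e/i1e already contain the exp(-nu/2) factor, which avoids overflow
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(nu) / gamma) * (
        (1.0 + nu) * i0e(nu / 2.0) + nu * i1e(nu / 2.0))
```

The enhanced amplitude is G(ξ, γ)·|Y_k|, combined with the noisy phase as described above; for large ν the gain approaches the Wiener gain ξ/(1 + ξ).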
Copyright © 2014 MECS I.J. Intelligent Systems and Applications, 2014, 12, 69-85
76 A Survey on Statistical Based Single Channel Speech Enhancement Techniques
The main innovations of the author's work are:
1) Incorporation of the dynamic cepstral features within the Bayesian framework for effective speech feature enhancement.
2) A new enhancement algorithm using the full posterior that elegantly integrates the predictive data from the nonlinear acoustic distortion model, the prior based on the static clean speech cepstral distribution, and the prior based on the frame-differential dynamic cepstral distribution.
Rainer Martin [40] proposed a minimum mean-square error estimator of the clean speech spectral magnitude that uses both a parametric compression function in the estimation error criterion and a parametric prior distribution for the statistical model of the clean speech magnitude. The novel parametric estimator contains several well-known magnitude estimators as special solutions and, in addition, estimators that combine the beneficial properties of various known solutions. Yoshihisa Uemura [41] revealed new findings regarding the musical noise generated by minimum mean-square error short-time spectral amplitude (MMSE STSA) processing. The authors proposed an objective metric of musical noise based on the kurtosis change ratio in spectral subtraction (SS). They additionally found an interesting relationship between the degree of generated musical noise, the strength parameter of the spectral subtraction processing, and the shape of the signal's probability density function. The proposed algorithm aims to automatically evaluate the sound quality of various noise reduction methods using the kurtosis change ratio. Philipos C. Loizou [42] presents a unique speech enhancement algorithm which can considerably improve the signal-to-residual spectrum ratio by combining statistical estimators of the spectral magnitudes of the speech and noise. The noise spectral magnitude estimator is obtained from the speech magnitude estimator by transforming the a priori and a posteriori SNR values. By expressing the signal-to-residual spectrum ratio as a function of the estimator's gain function, the author also derived a hybrid strategy which improves the signal-to-residual spectrum ratio when the a priori and a posteriori SNR are detected to be under 0 dB. T. Hasan and Md. K. Hasan [43] proposed a new MMSE estimator for DCT domain speech enhancement assuming two-state possibilities of the signal and noise DCT coefficients, modeling constructive and destructive interference. The optimum non-linear MMSE estimator is derived by considering the conditional events. Yu Gwang Jin [44] proposed a unique speech enhancement algorithm based on data-driven residual gain estimation. The system consists of two stages. At the first stage the noisy input signal is processed by a conventional speech enhancement module, from which the enhanced signal and several other SNR-related parameters are obtained. At the second stage, the residual gain, estimated by a data-driven method, is applied to the enhanced signal to refine it further. Timo Gerkmann [45] derived a minimum mean square error (MMSE) optimal estimator for clean speech spectral amplitudes, which is applied in single channel speech enhancement. As opposed to state-of-the-art estimators, the optimal estimator is derived for a given clean speech spectral phase. Finally, it is shown that the phase contains extra information which can be exploited to distinguish outliers in the noise from the target signal.

E. Log-MMSE Estimator

Jason Wung [46] proposed a unique post-filtering method applied after the log-STSA filter. Since the post-filter is derived from vector quantization of clean speech information, it has an effect similar to imposing clean source spectral constraints on the enhanced speech. When combined with the log-STSA filter, the additional filter noticeably suppresses residual artifacts by effectively lowering the residual noise of decision-directed estimation as well as reducing the musical noise of ML estimation. Bengt J. Borgstrom [47] proposed a family of log-spectral amplitude (LSA) estimators for speech enhancement. Generalized Gamma distributed (GGD) priors are assumed for the speech short-time spectral amplitudes (STSAs), providing mathematical flexibility in capturing the statistical behaviour of speech.

F. MMSE Estimation of the pth-Power Spectrum

Y = X + W  (12)

where Y is the noisy signal, X is the clean signal and W is the noise. The minimum mean square estimate of the power spectrum minimizes

min E[(X_k² − X̂_k²)²]  (13)

where X_k and X̂_k are the clean speech and estimated clean speech amplitudes, respectively. The resulting estimator is

E[X_k² | Y_k] = (ξ_k/(1 + ξ_k)) (1/γ_k + ξ_k/(1 + ξ_k)) Y_k²  (14)

where ξ_k and γ_k denote the a priori and a posteriori SNR of bin k. The MMSE estimate of the pth power minimizes

min E[(X_k^p − X̂_k^p)²]  (15)

The estimated amplitude for the pth order is given by

X̂_k = (√ν_k/γ_k) [Γ(p/2 + 1) Φ(−p/2, 1; −ν_k)]^(1/p) Y_k  (16)

with ν_k = ξ_k γ_k/(1 + ξ_k) and Φ the confluent hypergeometric function, yielding the gain function

G_p(ξ_k, γ_k)  (17)

Yariv Ephraim [48] derived an exact expression for the covariance of the log-periodogram power spectral density estimator for a zero mean Gaussian process.
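Equations (14)–(17) can be checked numerically. Below is a minimal sketch of the pth-order gain of (16), using SciPy's `hyp1f1` for the confluent hypergeometric function Φ; the function name and structure are our own, not taken from the surveyed papers:

```python
import numpy as np
from scipy.special import gamma as gamma_fn, hyp1f1  # Gamma function and Phi

def gain_p(p, xi, gamma_k):
    """Gain G_p(xi, gamma_k) of the pth-order MMSE amplitude estimator (16);
    xi is the a priori SNR, gamma_k the a posteriori SNR."""
    nu = xi * gamma_k / (1.0 + xi)
    return (np.sqrt(nu) / gamma_k) * (
        gamma_fn(1.0 + p / 2.0) * hyp1f1(-p / 2.0, 1.0, -nu)) ** (1.0 / p)
```

For p = 2, Φ(−1, 1; −ν) = 1 + ν, and the squared gain reduces exactly to the power-spectrum estimator (14), which is a useful consistency check.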
G. MMSE Estimators Based on Non-Gaussian Distributions

In the case of MMSE algorithms, we assume that the real and imaginary parts of the clean Discrete Fourier Transform (DFT) coefficients are modeled by a Gaussian distribution. However, the Gaussian assumption holds only asymptotically, for long-duration analysis frames in which the span of the correlation of the signal is much shorter than the DFT size. This assumption holds well for the noise DFT coefficients, but it does not hold for the speech DFT coefficients, which are typically estimated using relatively short (20–30 ms) windows. This is the reason why non-Gaussian distributions are used for modeling the real and imaginary parts of the speech DFT coefficients. Specifically, Gamma or Laplacian probability distributions can be used to model the distributions of the real and imaginary parts of the DFT coefficients. For example, an MMSE estimator can use a Laplacian PDF for the noise DFT coefficients and a Gamma PDF for the speech DFT coefficients, or vice versa. In summary, MMSE estimators that use non-Gaussian PDFs for the speech and noise Fourier transform coefficients yield improvements in performance.
Saeed Gazor [49] proposed an algorithm in which the noisy speech signal is first decorrelated and the clean speech components are then estimated from the decorrelated noisy speech samples. The distributions of the clean speech signals are assumed to be Laplacian and the noise signals are assumed to be Gaussian. The clean speech components are estimated either by maximum likelihood (ML) or by minimum-mean-square-error (MMSE) estimators. These ML and MMSE estimators require some statistical parameters derived from the speech and noise. Rainer Martin [50] proposed an algorithm for estimating the DFT coefficients in the MMSE sense when the prior probability density function of the clean speech DFT coefficients is modelled by a complex Laplacian or a complex bilateral Gamma density. The PDF of the noise DFT coefficients can be modelled as a complex Gaussian or complex Laplacian density. Estimators based on the Gaussian noise model and the super-Gaussian speech model provide a higher segSNR than linear estimators. Estimators based on the Laplacian noise model achieve better performance only for high SNR conditions but provide better residual noise. The proposed estimator requires adaptive a priori SNR smoothing and limiting to achieve high residual noise quality. Richard C. Hendriks [51] proposed a technique for speech enhancement based on the DFT. The generalized DFT magnitude estimator obtained from these results contains, as a special case, the present scheme based on Rayleigh speech priors, whereas the complex DFT estimators generalize existing schemes based on Gaussian, Laplacian and Gamma speech priors. Richard C. Hendriks [52] proposed a deterministic model in combination with the well-known stochastic models for speech enhancement. The authors derived an MMSE estimator under a combined stochastic-deterministic speech model with speech presence uncertainty and showed results for various distributions of the DFT coefficients under the combined stochastic-deterministic speech model. Here, the clean speech DFT coefficients are characterized by deterministic but unknown amplitude and phase values, whereas the noise DFT coefficients are assumed to follow a zero-mean Gaussian probability density function (PDF). The proposed estimator leads to less suppression compared to the case where the speech DFT coefficients are assumed stochastic. Yiannis Andrianakis [53] proposed a speech enhancement algorithm that takes advantage of the time and frequency dependencies of speech signals. These dependencies are incorporated in the statistical model using ideas from the theory of Markov Random Fields. In particular, the speech STFT amplitude samples are modelled with Chi Markov Random Field priors, which are then used in an estimator based on the Iterated Conditional Modes method. The novel prior is additionally coupled with a 'harmonic' neighborhood which, besides the directly adjacent samples on the time-frequency plane, also considers samples that are one pitch frequency apart, in order to benefit from the rich structure of voiced speech time frames. Cohen [54] has shown that consecutive samples inside a frequency bin are highly correlated, and the DD approach [55] for the estimation of the a priori SNR owes its success largely to the exploitation of the dependencies between successive spectral amplitude samples of speech. Correlations exist between consecutive samples on the frequency axis of the STFT; these stem not only from the spectral leakage caused by the tapered windows employed in the calculation of the STFT, but also from the common modulation of the amplitudes of samples that belong to adjacent harmonics of voiced speech frames, as reported in [56]. In the case of the linear mean-squared error (MSE) complex-DFT estimator (e.g. the Wiener filter), its magnitude-DFT (MDFT) counterparts are considered in the context of speech enhancement. Hendriks R.C. [57] therefore proposed a linear MSE MDFT estimator for speech enhancement. The author presented linear MSE MDFT estimators that depend on the speech DFT coefficient distributions, unlike the linear complex-DFT estimators. Bengt Jonas Borgström [58] proposed a stochastic framework for designing optimal short-time spectral amplitude (STSA) estimators for speech enhancement assuming phase equivalence of the noise and speech signals. Assuming additive superposition of speech and noise, which is implicit in the maximum-likelihood (ML) phase estimate, effectively projects the optimal spectral amplitude estimation problem onto a 1-D subspace of the complex spectral plane, thus simplifying the problem formulation. By assuming Generalized Gamma distributions (GGDs) for the a priori distributions of both the speech and noise STSAs, separate families of novel estimators are derived according to either the maximum-likelihood (ML), the minimum mean-square error (MMSE), or the maximum a posteriori (MAP) criterion. The employment of GGDs permits the optimal estimators to be determined in generalized form, so specific solutions may be obtained by substituting statistical parameters corresponding to the expected speech and noise priors. The proposed estimators exhibit strong similarities to well-known STSA solutions.
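The heavy-tailed argument behind these non-Gaussian priors is easy to reproduce. In this small sketch (synthetic data, not real speech; all names are our own), Laplacian-distributed samples stand in for the real parts of speech DFT coefficients, and ML-fitted Gaussian and Laplacian models are compared by average log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
# heavy-tailed surrogate for the real parts of speech DFT coefficients
x = rng.laplace(scale=1.0, size=50_000)

# ML fits: variance for the Gaussian, mean absolute deviation for the Laplacian
var = np.mean(x ** 2)
b = np.mean(np.abs(x))

ll_gauss = np.mean(-0.5 * np.log(2 * np.pi * var) - x ** 2 / (2 * var))
ll_laplace = np.mean(-np.log(2 * b) - np.abs(x) / b)
# the Laplacian model attains the higher average log-likelihood on this data
```

The same comparison run on actual short-frame speech DFT coefficients is what motivates the Laplacian and Gamma priors surveyed above.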
For instance, the Magnitude Spectral Subtracter (MSS) and Wiener filter (WF) are obtained for specific choices of the GGD shape parameters. Several investigations showed that speech enhancement approaches may be improved by speech presence uncertainty (SPU) estimation. Balazs Fodor [59] proposed a new consistent solution for MMSE speech amplitude (SA) estimation under SPU, based on a spread of speech priors given by the generalized gamma distribution. The proposed approach outperforms the SPU-based MMSE-SA estimator relying on a Gaussian speech prior.

H. Maximum A Posteriori (MAP) Estimators

A proportional cost function results in an optimal estimator which is the median of the posterior PDF. For a "Hit-or-Miss" cost function the mode, or maximum location, of the posterior PDF is the optimal estimator. The latter is termed the maximum a posteriori (MAP) estimator. In the MAP estimation approach we choose θ̂ to maximize the posterior PDF, or

θ̂ = arg max_θ p(θ | x)  (18)

This is shown to minimize the Bayes risk for a "Hit-or-Miss" cost function. In finding the maximum of p(θ | x) we can observe that

p(θ | x) = p(x | θ) p(θ) / p(x)  (19)

so an equivalent maximization is that of p(x | θ) p(θ). This is reminiscent of the MLE except for the presence of the prior PDF. Hence the MAP estimator is given by

θ̂ = arg max_θ p(x | θ) p(θ)  (20)

or

θ̂ = arg max_θ [ln p(x | θ) + ln p(θ)]  (21)

Richard C. Hendriks and Rainer Martin [60] proposed a new class of estimators for speech enhancement in the DFT domain. The authors considered a multivariate normal inverse Gaussian (MNIG) distribution for the speech DFT coefficients. This distribution models a wide range of processes, from heavy-tailed to less heavy-tailed processes. Under the MNIG distribution, complex DFT and amplitude estimators are derived. In contrast to alternative estimators, the suppression characteristics of the MNIG-based estimators can be adapted online to the underlying distribution of the speech DFT coefficients. Compared to noise suppression algorithms based on pre-selected super-Gaussian distributions, the MNIG-based complex DFT and amplitude estimators show better performance in terms of segmental SNR. Though super-Gaussian distributions clearly offer a better description of the speech DFT distribution than a Gaussian distribution, the form of the distributions in those methods is often chosen fixed over time and frequency. It can be advantageous to derive clean speech estimators under a distribution that may vary over time and frequency. The speech DFT coefficients are assumed to follow the MNIG distribution, motivated by the fact that speech is a heavy-tailed process. Under this assumption, the speech DFT amplitudes can be shown to be Rayleigh inverse Gaussian (RIG) distributed. Under the MNIG and RIG distributions, the authors derived clean speech MAP estimators for the complex DFT coefficients and the speech amplitudes, respectively. Guo-Hong Ding [61] introduced a maximum a posteriori (MAP) framework to sequentially estimate noise parameters. The estimation is implemented using the first-order vector Taylor series (VTS) approximation to the nonlinear environmental function in the log-spectral domain. Tim Fingscheidt [62] proposed a joint maximum a posteriori estimation of spectral amplitude and phase (JMAP). It allows various speech models (Gaussian, super-Gaussian, etc.), while the noise DFT coefficient PDF is modelled as a Gaussian mixture (GMM). Such a GMM covers not only non-Gaussian stationary noise processes, but also non-stationary processes that change between Gaussian noise modes of different variance with the probability given by the GMM weights.

MMSE Versus MAP

The MMSE estimator is usually harder to derive than other estimators for random parameters, such as the maximum a posteriori (MAP) estimator for speech waveforms. The MMSE estimator has in practice exhibited consistently superior enhancement performance over the MAP estimator for speech waveforms. Although the MMSE estimator is defined for the MSE distortion measure, its optimality additionally extends to other categories of distortion measures. This property does not hold for the MAP estimator. Because the perceptually significant distortion measure for speech is unknown, the wide coverage of distortion categories by the MMSE estimator, with the same optimality, is extremely desirable.

I. General Bayesian Estimators: Perceptually Motivated Bayesian Estimators

Philipos [63] proposed Bayesian estimators of the short-time spectral magnitude of speech based on perceptually motivated cost functions to overcome the shortcomings of the MMSE estimator. The authors derived three classes of Bayesian estimators of the speech magnitude spectrum. The first class of estimators uses spectral peak information; the second class uses a weighted-Euclidean cost function that takes auditory masking effects into account; and the third class is designed to penalize spectral attenuation. From estimation theory, the MMSE estimator minimizes the Bayes risk based on a squared-error cost function. The squared-error cost function is most commonly used because it is mathematically tractable and easy to evaluate. Its disadvantage is that it treats positive and negative estimation errors the same way.
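The MAP rule (18)–(21) and the MMSE rule can be compared numerically on a scalar toy model. This sketch is our own construction, with a Gaussian prior and Gaussian noise chosen so that both estimators coincide with a known closed form:

```python
import numpy as np

# Toy model: theta ~ N(0, sp2), observation y = theta + n with n ~ N(0, sn2)
sp2, sn2, y = 4.0, 1.0, 2.0
theta = np.linspace(-10.0, 10.0, 200_001)

# log posterior, eq. (21): ln p(y|theta) + ln p(theta), up to a constant
log_post = -(y - theta) ** 2 / (2 * sn2) - theta ** 2 / (2 * sp2)

theta_map = theta[np.argmax(log_post)]      # MAP estimate, eq. (18)
w = np.exp(log_post - log_post.max())       # unnormalized posterior on the grid
w /= w.sum()
theta_mmse = (theta * w).sum()              # MMSE estimate (posterior mean)

closed_form = sp2 / (sp2 + sn2) * y         # = 1.6; both estimates match it here
```

With a non-Gaussian (e.g. Laplacian) prior the two estimates separate, which is exactly the practical difference discussed under "MMSE Versus MAP".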
But the perceptual effect of positive errors (i.e., the estimated magnitude is smaller than the true magnitude) and negative errors (i.e., the estimated magnitude is larger than the true magnitude) is not the same in speech enhancement applications. Hence, positive and negative errors need not be weighted equally. To overcome these problems, a perceptually motivated Bayesian estimator of the short-time spectral magnitude of speech was proposed. The derived Bayesian estimator preserves weak (low-energy) segments of speech such as fricatives and stop consonants. MMSE estimators typically do not do well on such low-energy speech segments because of the low segmental SNR. The estimation errors produced by the MMSE and WLR estimators are small near spectral peaks and large in spectral valleys, where residual noise is audible. A Bayesian estimator that emphasizes spectral valleys more than spectral peaks performed better in terms of having less residual noise and better speech quality. Eric Plourde [64] proposed a family of Bayesian estimators for speech enhancement in which the cost function includes both a power law and a weighting factor. Secondly, the parameters of the estimator are set based on perceptual considerations: the masking properties of the ear and the perceived loudness of sound.

J. Incorporating Speech Presence/Absence Probability in Speech Enhancement

Nam Soo Kim [65] proposed a novel speech enhancement technique based on a global soft decision. The proposed approach provides spectral gain modification, speech absence probability (SAP) computation and noise spectrum estimation using the same statistical model assumption. Israel Cohen [66] proposed an estimator for the a priori SNR, and introduced an efficient estimator for the a priori speech absence probability. With a soft-decision approach, the speech presence probability (SPP) is estimated for every frequency bin and every frame. The soft-decision approach exploits the strong correlation of speech presence in neighboring frequency bins of consecutive frames. Generally, estimating the a priori speech absence probability (SAP) yields better results, though interaction between the estimate of the a priori SNR and the estimate of the a priori SAP may deteriorate the performance of the speech enhancement. The advantage of the proposed method is that, for low input SNR and non-stationary noise, it avoids musical residual noise and the attenuation of weak speech signal components. Rainer Martin [67] proposed a novel estimator for the probability of speech presence at every time-frequency point in the short-time discrete Fourier domain. Whereas existing estimators perform quite reliably in stationary noise environments, they typically exhibit a large false alarm rate in non-stationary noise, which leads to a great deal of noise leakage when applied to a speech enhancement task. The proposed estimator eliminates this problem by temporally smoothing the cepstrum of the a posteriori signal-to-noise ratio (SNR), and yields significantly lower speech distortion and less noise leakage in both stationary and non-stationary noise as compared to other estimators. Zhong-Hua Fu and Jhing-Fa Wang [68] address the problem of speech presence probability (SPP) estimation. Since speech is approximately sparse in the time-frequency domain, the time and frequency minimum tracking results are integrated to estimate the noise power spectral density and the a posteriori signal-to-noise ratio. By applying Bayes' rule, the final SPP is estimated, which controls the time-varying smoothing of the noise power spectrum.

K. Methods for Estimating the A Priori Probability of Speech Absence

The a priori probability of speech absence q_k = p(H_0^k) is assumed to be fixed. q_k is set to 0.5 to address the worst case scenario, in which speech and noise are equally likely to occur. Using the conditional probabilities p(Y_k | H_1^k) and p(Y_k | H_0^k), a binary decision b_k is made for frequency bin k according to:

If p(Y_k | H_1^k) > p(Y_k | H_0^k), then b_k = 0 (speech present);
else b_k = 1 (speech absent).

The a priori probability of speech absence for frame m, denoted q_k(m), can then be obtained by smoothing the values of b_k over past frames:

q_k(m) = c b_k + (1 − c) q_k(m − 1)  (22)

where c is a smoothing constant.
Israel Cohen [69] proposed a simultaneous detection and estimation approach for speech enhancement. A detector for speech presence in the short-time Fourier transform (STFT) domain is combined with an estimator, and together they jointly minimize a cost function that considers both detection and estimation errors. Cost parameters control the tradeoff between speech distortion caused by missed detection of speech components and residual musical noise resulting from false detection. High attenuation of speech spectral coefficients due to missed detection errors may significantly degrade speech quality and intelligibility; falsely detecting noise transients as speech-containing bins may produce annoying musical noise. The decision-directed (DD) approach may not be suitable for transient environments, since a high-energy noise burst may yield an instantaneous increase in the a posteriori SNR and a corresponding increase in the a priori SNR; the spectral gain would then be higher than the desired value and the transient noise component would not be sufficiently attenuated. Rainer Martin [70] proposed an improved estimator for the speech presence probability at every time-frequency point in the short-time Fourier transform domain. In contrast to existing approaches, this estimator does not depend on an adaptively estimated, and therefore signal-dependent, a priori signal-to-noise estimate. It thus decouples the estimation of the speech presence probability from the estimation of the clean speech spectral coefficients in speech enhancement. The proposed a posteriori speech presence probability estimator gives probabilities near zero when speech is absent and probabilities near one when speech is present.
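The recursion (22) of Section K can be sketched in a few lines; the function and parameter names are our own:

```python
def update_q(q_prev, p_y_h1, p_y_h0, c=0.1):
    """One frame of the a priori speech-absence recursion (22) for one bin:
    hard decision b_k, then exponential smoothing with constant c."""
    b = 0.0 if p_y_h1 > p_y_h0 else 1.0   # b = 0: speech present, b = 1: absent
    return c * b + (1.0 - c) * q_prev

q = 0.5                                   # worst-case initialisation
for _ in range(100):                      # a run of speech-absent frames
    q = update_q(q, p_y_h1=0.2, p_y_h0=0.8)
# q has been driven toward 1 (speech absent)
```

Starting from the worst-case q = 0.5, a run of speech-absent decisions drives q toward 1, while speech-present decisions drive it toward 0; the constant c controls how fast.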
The authors proposed a detection framework for determining the fixed a priori signal-to-noise ratio. Smoothing the a posteriori SNR has the major advantage of reducing the variance of the speech presence probability estimator.

L. Improvements to the Decision-directed Approach

Israel Cohen [71] proposed a non-causal estimator for the a priori SNR, and a corresponding non-causal speech enhancement algorithm. In contrast to the decision-directed estimator of Ephraim and Malah, the non-causal estimator is capable of discriminating between speech onsets and noise irregularities. Onsets of speech are better preserved, while an additional reduction of musical noise is achieved. The a priori SNR cannot respond too quickly to an abrupt increase in the instantaneous SNR, which yields an increase in musical residual noise. The a priori SNR depends heavily on the strong time correlation between successive speech magnitudes. The authors proposed a recursive a priori SNR estimator. The non-causal estimator incorporates future spectral measurements to better predict the spectral variance of the clean speech. Israel Cohen [72] proposed a statistical model for speech enhancement that takes into consideration the time-correlation between successive speech spectral components. It retains the simplicity of the statistical model, and allows the extension of existing algorithms to non-causal estimation. The sequence of speech spectral variances is a random process, mostly correlated with the speech spectral magnitude sequence. For the a priori SNR, causal and non-causal estimators are derived in agreement with the model assumptions and the estimation of the speech spectral components. Cohen showed that a special case of the causal estimator degenerates to a "decision-directed" estimator with a time-varying frequency-dependent weighting factor. The LSA estimator is superior to the STSA estimator, since it results in much lower residual noise without further affecting the speech itself. Hendriks [73] proposed an estimate of the noisy speech spectrum based on adaptive time segmentation. The authors demonstrated the potential of adaptive segmentation in both the ML and decision-directed methods, yielding a better estimate of the a priori SNR. The variance of the estimated noisy speech power spectrum can be reduced by using the Bartlett method, but the decrease in variance has side effects and also decreases the frequency resolution. The Bartlett method was used across a segment consisting of 3 frames located symmetrically around the frame to be enhanced. It decreases the variance, but there are some disadvantages: (i) the position of the segment with respect to the underlying noisy frame that needs to be enhanced is predetermined; (ii) segments should (ideally) vary with speech sounds: vowel sounds may be considered up to 40-50 ms, consonant sounds up to 5 ms. The decision-directed (DD) approach with adaptive segmentation results in a better approximation than the DD approach without it: no pre-echo is present with adaptive segmentation, but pre-echo is present without it (fixed segments). Cyril Plapous [74] addresses the problem of single-microphone speech enhancement in noisy environments. The well-known decision-directed (DD) approach drastically limits the amount of musical noise, however the estimated a priori SNR is biased since it depends on the speech spectrum estimate of the previous frame. The gain function therefore matches the previous frame, which degrades the noise reduction performance and causes a reverberation effect. The authors proposed a method known as the two-step noise reduction (TSNR) technique, which solves this problem while maintaining the benefits of the decision-directed approach: the a priori SNR estimate is refined by a second step to avoid the bias of the DD approach, thereby removing the reverberation effect. However, at low SNRs, classic short-time noise reduction methods, including the TSNR technique, introduce harmonic distortion in the enhanced speech because of the unreliability of the estimators. This is principally due to the difficult task of noise power spectral density (PSD) estimation in single-microphone schemes. To overcome this problem, the authors proposed the harmonic regeneration noise reduction (HRNR) technique. A nonlinearity is used to efficiently regenerate the degraded harmonics of the distorted signal. The resulting artificial signal is produced in order to refine the a priori SNR used to compute a spectral gain able to preserve the speech harmonics. Yao Ren [75] proposed an MMSE a priori SNR estimator for speech enhancement. This estimator has benefits similar to those of the decision-directed (DD) approach; however, there is no need for an ad-hoc weighting factor to balance the past a priori SNR and the current maximum likelihood SNR estimate with smoothing across frames. Colin Breithaupt [76] presented an analysis of the performance of noise reduction algorithms in low SNR and transient conditions, considering approaches that use the well-known decision-directed SNR estimator. The authors show that the smoothing properties of the decision-directed SNR estimator in low SNR conditions can be described analytically, and that limits on the noise reduction can be expected for widely used spectral speech estimators based on the decision-directed approach. They illustrated that achieving both good preservation of speech onsets in transient conditions on one side and suppression of musical noise on the other can be particularly problematic when decision-directed SNR estimation is employed. Suhadi Suhadi [77] proposed a data-driven approach to a priori SNR estimation that can be used with a large variety of speech enhancement techniques, such as the MMSE (log) spectral amplitude estimator, the super-Gaussian JMAP estimator, or the Wiener filter. The proposed SNR estimator employs two trained artificial neural networks, one for speech presence and the other for speech absence. The classical DD a priori SNR estimator proposed by Ephraim and Malah is decomposed into its two additive parts, which then serve as the two input signals to the neural networks. As an alternative to the neural networks, lookup tables are additionally investigated.
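The decision-directed (DD) recursion that the methods above refine can be sketched as follows; this is a minimal single-bin version, and the names and the clamping constant are our own choices:

```python
import numpy as np

def decision_directed_xi(A_prev, noise_var, gamma_curr, alpha=0.98, xi_min=1e-3):
    """One DD update of the a priori SNR (Ephraim and Malah): a weighted mix of
    the previous frame's clean-amplitude estimate and the current ML estimate."""
    xi = alpha * (A_prev ** 2 / noise_var) \
        + (1.0 - alpha) * np.maximum(gamma_curr - 1.0, 0.0)
    return np.maximum(xi, xi_min)

# Example: previous amplitude 2, unit noise variance, a posteriori SNR 5
# gives xi = 0.98 * 4 + 0.02 * 4 = 4.0
```

A weighting factor alpha close to unity is exactly what limits musical noise but slows the reaction to abrupt SNR changes; the TSNR second step and the data-driven estimators above address this bias.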
nonlinear a priori SNR estimators reduces speech distortion, notably in speech onsets, while retaining a high level of noise attenuation in speech absence.
The performance of most weighting rules is dominantly determined by the a priori SNR, whereas the a posteriori SNR acts merely as a correction parameter in the case of a low a priori SNR. Hence, musical tones are reduced by improving the estimates of the a priori SNR. However, as a result of the constant weighting factor close to unity, the greatly reduced variance of the a priori SNR estimate lacks the ability to react quickly to an abrupt increase of the instantaneous SNR. Pei Chee Yong [78] proposed a modified a priori SNR estimator for speech enhancement. The well-known decision-directed (DD) approach is modified by matching every gain function with the noisy speech spectrum of the current frame instead of the previous one. The proposed algorithm eliminates the transient speech distortion, and it also reduces the impact of the choice of the gain function on the extent of smoothing in the SNR estimate. This reduces both the musical noise and the signal distortion.

Drawbacks of a fixed window:
(i) In signal regions which can be considered stationary for a longer time than the segment used, the variance of the spectral estimator is unnecessarily large.
(ii) If the stationarity of a speech sound is shorter than the fixed segment size, smoothing is applied across stationarity boundaries, blurring transients and rapidly varying speech components and thereby degrading speech intelligibility.

Summary of estimators

A summary of the estimators is given in Table 1 (a comparison between classical estimators) and Table 2 (a comparison between Bayesian estimators).

Difference between causal and non-causal estimators:
The causal a priori SNR estimator is closely related to the decision-directed estimator of Ephraim and Malah. A special case of the causal estimator degenerates to a "decision-directed" (DD) estimator with a time-varying, frequency-dependent weighting factor. The weighting factor is monotonically decreasing as a function of the instantaneous SNR, resulting effectively in a very large weighting factor during speech absence and a smaller weighting factor during speech presence.
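The decision-directed recursion that runs through this discussion can be made concrete with a short sketch. The update below is the classical Ephraim-Malah form, xi(k,l) = alpha*|A(k,l-1)|^2/sigma_d^2(k) + (1-alpha)*max(gamma(k,l)-1, 0), combined with a Wiener gain; the smoothing constant alpha, the floor xi_min, and the toy noise-only input are illustrative choices, not values taken from any one surveyed paper:

```python
import numpy as np

def dd_a_priori_snr(noisy_psd, noise_psd, alpha=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate, frame by frame.

    noisy_psd : |Y(k,l)|^2, shape (frames, bins)
    noise_psd : noise variance estimate, shape (bins,) or (frames, bins)
    """
    noise_psd = np.broadcast_to(noise_psd, noisy_psd.shape)
    gamma = noisy_psd / noise_psd                      # a posteriori SNR
    xi = np.empty_like(gamma)
    amp2_prev = np.zeros(noisy_psd.shape[1])           # |A-hat(k, l-1)|^2
    for l in range(noisy_psd.shape[0]):
        # weighted sum of the previous clean-amplitude estimate and the ML term
        xi[l] = alpha * amp2_prev / noise_psd[l] \
                + (1.0 - alpha) * np.maximum(gamma[l] - 1.0, 0.0)
        xi[l] = np.maximum(xi[l], xi_min)              # floor against musical noise
        gain = xi[l] / (1.0 + xi[l])                   # Wiener gain from xi
        amp2_prev = (gain ** 2) * noisy_psd[l]         # feeds the next frame
    return xi

# toy check: in pure noise (gamma fluctuating around 1) xi stays small
rng = np.random.default_rng(0)
y = rng.standard_normal((50, 8)) + 1j * rng.standard_normal((50, 8))
xi = dd_a_priori_snr(np.abs(y) ** 2, np.full(8, 2.0))
```

When speech is present, gamma rises well above one and xi follows it only with a one-frame lag, which is exactly the sluggishness at speech onsets discussed above.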

Table 1. Comparison between classical estimators

Cramer-Rao Lower Bound (CRLB)
  Data model / assumptions: the PDF p(X; θ) is known.
  Estimator: if the CRLB condition ∂ ln p(X; θ)/∂θ = I(θ)[g(X) − θ] is satisfied, where I(θ) is a p×p matrix depending only on θ and g(X) is a p-dimensional function of the data X, then θ̂ = g(X).
  Optimality / error criterion: θ̂ achieves the CRLB; it is the minimum variance unbiased (MVU) estimator.
  Comments: an efficient estimator may not exist, hence this approach may fail.

Rao-Blackwell-Lehmann-Scheffe (RBLS)
  Data model / assumptions: the PDF p(X; θ) is known.
  Estimator: find the sufficient statistic T(X) by factoring the PDF as p(X; θ) = g(T(X), θ) h(X), where T(X) is a p-dimensional function of X, g is a function depending on T and θ, and h depends only on X.
  Optimality / error criterion: θ̂ is the MVU estimator.
  Comments: "completeness" of the sufficient statistic must be checked; a p-dimensional sufficient statistic may not exist, so this method may fail.

Best Linear Unbiased Estimator (BLUE)
  Data model / assumptions: E[X] = Hθ, where H is a known N×p matrix, and the covariance matrix C of X is known; equivalently, X = Hθ + w.
  Estimator: θ̂ = (Hᵀ C⁻¹ H)⁻¹ Hᵀ C⁻¹ X.
  Optimality / error criterion: θ̂ᵢ, for i = 1, 2, …, p, has minimum variance of all unbiased estimators that are linear in X.
  Comments: if w is a Gaussian random vector, then θ̂ is also the MVU estimator.

Maximum Likelihood Estimator (MLE)
  Data model / assumptions: the PDF p(X; θ) is known.
  Estimator: θ̂ is the value of θ maximizing p(X; θ), where X is replaced by the observed data samples.
  Optimality / error criterion: not optimal in general; under certain conditions on the PDF it is asymptotically (for large data records) the MVU estimator.
  Comments: if an MVU estimator exists, the maximum likelihood procedure will produce it.

Least Squares Estimator (LSE)
  Data model / assumptions: x[n] = s[n; θ] + w[n], n = 0, 1, …, N−1, where the signal s[n; θ] depends on the unknown parameters.
  Estimator: θ̂ is the value of θ that minimizes J(θ) = (X − S(θ))ᵀ (X − S(θ)).
  Optimality / error criterion: none in general.
  Comments: minimizing the LS error does not translate into minimizing the estimation error; if w is a Gaussian random vector, then the LSE is equivalent to the MLE.
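The estimators in Table 1 can be checked against each other numerically for the linear data model X = Hθ + w. The sketch below (design matrix and parameter values chosen purely for illustration) confirms that the LSE coincides with the BLUE when C = σ²I, which for Gaussian w is also the MLE and the MVU estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 2
H = np.column_stack([np.ones(N), np.arange(N) / N])   # known N x p matrix
theta = np.array([1.0, -2.0])                         # unknown parameters
w = rng.standard_normal(N)                            # white Gaussian noise
X = H @ theta + w

# LSE: minimizes J(theta) = (X - H theta)^T (X - H theta)
theta_lse = np.linalg.solve(H.T @ H, H.T @ X)

# BLUE with C = sigma^2 I reduces to the same expression,
# and for Gaussian w this is also the MLE (and MVU).
C_inv = np.eye(N)                                     # white noise: C = I
theta_blue = np.linalg.solve(H.T @ C_inv @ H, H.T @ C_inv @ X)

assert np.allclose(theta_lse, theta_blue)             # identical estimates
```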

The noncausal a priori SNR estimator employs future spectral measurements to better predict the spectral variances of the clean speech. The comparison between the causal and noncausal estimators shows that the differences are primarily noticeable during speech onsets. The causal a priori SNR estimator, as well as the decision-directed estimator, cannot respond too quickly to an abrupt increase in the instantaneous SNR, since this essentially implies a rise in the level of musical residual noise. In contrast, the noncausal estimator, having some subsequent spectral measurements at hand, is capable of discriminating between noise irregularities and speech onsets. The noncausal estimator shows an improvement in segmental SNR and a lower log-spectral distortion than the decision-directed technique and the causal estimator.

Speech enhancement gain functions are computed from (i) an estimate of the noise power spectrum and (ii) an estimate of the noisy power spectrum. The variance of these spectral estimators degrades the quality of the enhanced speech, so the variance needs to be decreased.
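One common way to decrease that variance is to average the spectral estimates over time, for example with a first-order recursive smoother. The sketch below is a generic illustration of this variance-versus-tracking trade-off (the smoothing constant beta and the white-noise input are arbitrary choices, not an algorithm from a specific surveyed paper):

```python
import numpy as np

def smooth_periodogram(noisy_psd, beta=0.8):
    """First-order recursive smoothing of the noisy periodogram.

    Averaging over time lowers the variance of the spectral estimate,
    at the cost of smearing rapid speech transients (the fixed-window
    trade-off discussed above).
    """
    smoothed = np.empty_like(noisy_psd)
    smoothed[0] = noisy_psd[0]
    for l in range(1, len(noisy_psd)):
        smoothed[l] = beta * smoothed[l - 1] + (1.0 - beta) * noisy_psd[l]
    return smoothed

rng = np.random.default_rng(2)
# stationary noise: exponentially distributed periodogram, true PSD = 1 per bin
pgram = rng.exponential(1.0, size=(2000, 16))
sm = smooth_periodogram(pgram)
# steady-state variance drops by roughly (1-beta)/(1+beta) for white input
reduction = sm[500:].var() / pgram.var()
```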

Table 2. Comparison between Bayesian estimators

Minimum Mean Square Error (MMSE) Estimator
  Data model / assumptions: the joint PDF p(X, θ) is known, where θ is considered a random variable; p(X|θ) is specified as the data model and p(θ) as the prior PDF for θ.
  Estimator: θ̂ = E[θ|X], where the expectation is with respect to the posterior PDF p(θ|X) = p(X|θ) p(θ) / ∫ p(X|θ) p(θ) dθ.
  Optimality / error criterion: θ̂ᵢ minimizes the Bayesian MSE, BMSE(θ̂ᵢ) = E[(θᵢ − θ̂ᵢ)²], i = 1, 2, …, p, where the expectation is with respect to p(X, θᵢ).
  Comments: in the non-Gaussian case this can be difficult to implement.

Maximum A Posteriori (MAP) Estimator
  Data model / assumptions: the joint PDF p(X, θ) is known, where θ is considered a random variable; p(X|θ) is specified as the data model and p(θ) as the prior PDF for θ.
  Estimator: θ̂ is the value of θ that maximizes the posterior p(θ|X), or equivalently the value that maximizes p(X|θ) p(θ).
  Optimality / error criterion: maximizes the "hit-or-miss" cost function.
  Comments: for PDFs whose mean and mode (location of the maximum) coincide, the MMSE and MAP estimators are identical.

Linear Minimum Mean Square Error (LMMSE) Estimator
  Data model / assumptions: the first two moments of the joint PDF p(X, θ) are known.
  Estimator: θ̂ = E[θ] + C_θx C_xx⁻¹ (X − E[X]).
  Optimality / error criterion: θ̂ᵢ minimizes the Bayesian MSE of all estimators that are linear functions of X.
  Comments: if X and θ are jointly Gaussian, this is identical to the MMSE and MAP estimators.
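The relationship between the LMMSE and MMSE rows of Table 2 can be illustrated with a scalar jointly Gaussian example; the prior and noise variances below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 100_000
theta = rng.standard_normal(M)               # prior: theta ~ N(0, 1)
x = theta + 0.5 * rng.standard_normal(M)     # data model: x = theta + w, w ~ N(0, 0.25)

# LMMSE estimator from Table 2: theta_hat = E[theta] + C_theta_x C_xx^{-1} (x - E[x])
c_theta_x = 1.0                              # Cov(theta, x) = Var(theta)
c_xx = 1.0 + 0.25                            # Var(x) = Var(theta) + Var(w)
theta_hat = (c_theta_x / c_xx) * x           # here E[theta] = E[x] = 0

# Since theta and x are jointly Gaussian, this linear estimator equals the
# MMSE estimator E[theta|x]; its Bayesian MSE is 1 - c_theta_x**2 / c_xx = 0.2,
# which the empirical MSE should approach for large M.
mse = np.mean((theta - theta_hat) ** 2)
```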

IV. CHALLENGES AND OPPORTUNITIES

As its applications broaden, the definition of speech enhancement is becoming more general. The classification of speech enhancement techniques is shown in Fig 2.1. The classification is mainly based on time-domain [79], transform-domain [80][81] and statistical-based approaches. It should certainly include recovering the signal from the corruption of competing speech, or even from the degradation of a filtered version of the same signal. Signal separation and dereverberation problems are more challenging than the classical noise reduction problem. In a room, in a hands-free context, the signal picked up by a microphone from a talker contains not only the direct-path signal, but also delayed and attenuated replicas of the source signal caused by reflections from the objects and boundaries within the room. Because of this multipath propagation, spectral distortion and echoes are introduced into the observed signal, termed reverberation. Dereverberation is needed to enhance the intelligibility of the speech signal. It is crucial to determine the factors that enable or prohibit humans with normal hearing to listen to and make sense of one speaker from among a multitude of speakers and/or amidst a cacophony of background noise and babble. This ability to sift and select the speech patterns of one speaker from among the many is labeled the cocktail party effect. It is well known that listening in this type of situation with only one ear is annoying, and it is very difficult to attend to one specific signal when many speech signals arrive from all around at the same time; nevertheless, it can be done. In blind source separation with multiple microphones, we try to separate different signals arriving at the same time from various directions. The cocktail party effect might be able to separate the signal of interest from the rest, which is not precisely what blind source separation algorithms do.

V. SUMMARY AND CONCLUSION

In this paper, an overview of speech enhancement is given along with a classification of speech enhancement techniques. Statistical-based techniques are explained together with their properties and limitations. Comparisons between classical and Bayesian estimators are presented. The disadvantages of the fixed-window technique are discussed, and the major differences between causal and non-causal estimators also find a place in the discussion of single channel speech enhancement techniques. Finally, challenges and opportunities for speech enhancement are discussed.

REFERENCES

[1] D. O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, Reading, MA, 1987.
[2] P.C. Loizou, Speech Enhancement: Theory and Practice, 1st Ed., Boca Raton, FL: CRC, 2007.


[3] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.
[4] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 2001.
[5] Yi Hu, P.C. Loizou, "Subspace approach for enhancing speech corrupted by colored noise," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2002, pp. I-573-576.
[6] Jingdong Chen, J. Benesty, Yiteng Huang, S. Doclo, "New insights into the noise reduction wiener filter," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1218-1234, July 2006.
[7] A. Amehraye, D. Pastor, A. Tamtaoui, "Perceptual improvement of wiener filtering," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 2081-2084.
[8] Fei Chen, P.C. Loizou, "Speech enhancement using a frequency-specific composite wiener function," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2010, pp. 4726-4729.
[9] Jingdong Chen, J. Benesty, "Analysis of the frequency-domain wiener filter with the prediction gain," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2010, pp. 209-212.
[10] Chung-Chien Hsu, Tse-En Lin, Jian-Hueng Chen, Tai-Shih Chi, "Spectro-temporal subband wiener filter for speech enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2012, pp. 4001-4004.
[11] Feng Huang, Tan Lee, W.B. Kleijn, "Transform domain wiener filter for speech periodicity enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2012, pp. 4577-4580.
[12] Steven M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, New Jersey, 1993.
[13] R. McAulay, M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 2, pp. 137-145, April 1980.
[14] T. Yoshioka, T. Nakatani, T. Hikichi, M. Miyoshi, "Maximum likelihood approach to speech enhancement for noisy reverberant signals," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 4585-4588.
[15] Y. Ephraim, "Bayesian estimation approach for speech enhancement using hidden markov models," IEEE Trans. Signal Process., vol. 40, no. 4, pp. 725-735, April 1992.
[16] Chang Huai You, Soo Ngee Koh, S. Rahardja, "Beta-order MMSE spectral amplitude estimation for speech enhancement," IEEE Trans. Speech and Audio Process., vol. 13, no. 4, pp. 475-481, July 2005.
[17] S. Srinivasan, J. Samuelsson, W.B. Kleijn, "Codebook-based bayesian speech enhancement for nonstationary environments," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 2, pp. 441-452, Feb. 2007.
[18] A. Kundu, S. Chatterjee, A. Sreenivasa Murthy, T.V. Sreenivas, "GMM based bayesian approach to speech enhancement in signal transform domain," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 4893-4896.
[19] T. Yoshioka, M. Miyoshi, "Adaptive suppression of non-stationary noise by using the variational bayesian method," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 4889-4892.
[20] E. Plourde, B. Champagne, "Auditory-based spectral amplitude estimators for speech enhancement," IEEE Trans. Audio, Speech, and Language Process., vol. 16, no. 8, pp. 1614-1623, Nov. 2008.
[21] P. J. Wolfe, S. J. Godsill, "Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Istanbul, Turkey, 2000, pp. 821-824.
[22] P. J. Wolfe, S. J. Godsill, "A perceptually balanced loss function for short-time spectral amplitude estimation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, 2003, pp. 425-428.
[23] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 857-869, Sep. 2005.
[24] C. H. You, S. N. Koh, S. Rahardja, "Beta-order MMSE spectral amplitude estimation for speech enhancement," IEEE Trans. Speech Audio Process., vol. 13, no. 4, pp. 475-486, Jul. 2005.
[25] E. Plourde, B. Champagne, "Generalized bayesian estimators of the spectral amplitude for speech enhancement," IEEE Signal Processing Letters, vol. 16, no. 6, pp. 485-488, June 2009.
[26] Jiucang Hao, H. Attias, S. Nagarajan, T.J. Sejnowski, "Speech enhancement, gain, and noise spectrum adaptation using approximate bayesian estimation," IEEE Trans. Audio, Speech, and Language Process., vol. 17, no. 1, pp. 24-37, Jan. 2009.
[27] P.S. Whitehead, D.V. Anderson, "Robust bayesian analysis applied to wiener filtering of speech," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2011, pp. 5080-5083.
[28] E. Plourde, B. Champagne, "Multidimensional STSA estimators for speech enhancement with correlated spectral components," IEEE Trans. Signal Process., vol. 59, no. 7, pp. 3013-3024, July 2011.
[29] J.K. Nielsen, M.G. Christensen, S.H. Jensen, "An approximate bayesian fundamental frequency estimator," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2012, pp. 4617-4620.
[30] N. Mohammadiha, J. Taghia, A. Leijon, "Single channel speech enhancement using bayesian NMF with recursive temporal updates of prior distributions," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2012, pp. 4561-4564.
[31] Y. Ephraim, D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109-1121, Dec. 1984.
[32] J. M. Tribolet, R. E. Crochiere, "Frequency domain coding of speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, p. 522, Oct. 1979.
[33] R. Zelinski, P. Noll, "Adaptive transform coding of speech signals," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, p. 306, Aug. 1977.
[34] J. E. Porter, S. F. Boll, "Optimal estimators for spectral restoration of noisy speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1984, pp. 18A.2.1-18A.2.4.
[35] Y. Ephraim, D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443-445, Apr. 1985.
[36] Zhong-Xuan Yuan, Soo Ngee Koh, I.Y. Soon, "Speech enhancement based on hybrid algorithm," IET Electronics Letters, vol. 35, no. 20, pp. 1710-1712, Sept. 1999.


[37] Guo-Hong Ding, Taiyi Huang, Bo Xu, "Suppression of additive noise using a power spectral density MMSE estimator," IEEE Signal Processing Letters, vol. 11, no. 6, pp. 585-588, June 2004.
[38] Li Deng, J. Droppo, A. Acero, "Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise," IEEE Trans. Speech and Audio Process., vol. 12, no. 2, pp. 133-143, March 2004.
[39] Li Deng, J. Droppo, A. Acero, "Estimating cepstrum of speech under the presence of noise using a joint prior of static and dynamic features," IEEE Trans. Speech and Audio Process., vol. 12, no. 3, pp. 218-233, May 2004.
[40] C. Breithaupt, M. Krawczyk, R. Martin, "Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 4037-4040.
[41] Y. Uemura, Yu Takahashi, H. Saruwatari, K. Shikano, K. Kondo, "Musical noise generation analysis for noise reduction methods based on spectral subtraction and MMSE STSA estimation," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2009, pp. 4433-4436.
[42] Yang Lu, P.C. Loizou, "Speech enhancement by combining statistical estimators of speech and noise," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2010, pp. 4754-4757.
[43] T. Hasan, M.K. Hasan, "MMSE estimator for speech enhancement considering the constructive and destructive interference of noise," IET Signal Process., vol. 4, no. 1, pp. 1-11, Feb. 2010.
[44] Yu Gwang Jin, Chul Min Lee, Kiho Cho, "A data-driven residual gain approach for two-stage speech enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2011, pp. 4752-4755.
[45] T. Gerkmann, M. Krawczyk, "MMSE-optimal spectral amplitude estimation given the STFT-phase," IEEE Signal Processing Letters, vol. 20, no. 2, pp. 129-132, Feb. 2013.
[46] J. Wung, S. Miyabe, Biing-Hwang Juang, "Speech enhancement using minimum mean-square error estimation and a post-filter derived from vector quantization of clean speech," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2009, pp. 4657-4660.
[47] B.J. Borgstrom, Abeer Alwan, "Log-spectral amplitude estimation with generalized gamma distributions for speech enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2011, pp. 4756-4759.
[48] Y. Ephraim, William J.J. Roberts, "On second-order statistics of log-periodogram with correlated components," IEEE Signal Processing Letters, vol. 12, no. 9, pp. 625-628, Sept. 2005.
[49] S. Gazor, Wei Zhang, "Speech enhancement employing Laplacian-Gaussian mixture," IEEE Trans. Speech and Audio Process., vol. 13, no. 5, pp. 896-904, Sept. 2005.
[50] R. Martin, "Speech enhancement based on minimum mean-square error estimation and supergaussian priors," IEEE Trans. Speech and Audio Process., vol. 13, no. 5, pp. 845-856, Sept. 2005.
[51] J.S. Erkelens, R.C. Hendriks, R. Heusdens, J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 6, pp. 1741-1752, Aug. 2007.
[52] R.C. Hendriks, R. Heusdens, J. Jensen, "An MMSE estimator for speech enhancement under a combined stochastic-deterministic speech model," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 2, pp. 406-415, Feb. 2007.
[53] Y. Andrianakis, Paul R. White, "A speech enhancement algorithm based on a chi MRF model of the speech STFT amplitudes," IEEE Trans. Audio, Speech, and Language Process., vol. 17, no. 8, pp. 1508-1517, Nov. 2009.
[54] I. Cohen, "Relaxed statistical model for speech enhancement and a priori SNR estimation," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 870-881, Sep. 2005.
[55] Y. Ephraim, D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[56] E. Zavarehei, S. Vaseghi, Q. Yan, "Noisy speech enhancement using harmonic-noise model and codebook-based post-processing," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1194-1203, May 2007.
[57] R.C. Hendriks, R. Heusdens, "On linear versus non-linear magnitude-DFT estimators and the influence of super-gaussian speech priors," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2010, pp. 4750-4753.
[58] B.J. Borgstrom, A. Alwan, "A unified framework for designing optimal STSA estimators assuming maximum likelihood phase equivalence of speech and noise," IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 8, pp. 2579-2590, Nov. 2011.
[59] B. Fodor, T. Fingscheidt, "MMSE speech enhancement under speech presence uncertainty assuming (generalized) gamma speech priors throughout," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2012, pp. 4033-4036.
[60] R.C. Hendriks, R. Martin, "MAP estimators for speech enhancement under normal and rayleigh inverse gaussian distributions," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 3, pp. 918-927, March 2007.
[61] Guo-Hong Ding, "Maximum a posteriori noise log-spectral estimation based on first-order vector Taylor series expansion," IEEE Signal Processing Letters, vol. 15, no. 2, pp. 158-161, Jan. 2008.
[62] B. Fodor, T. Fingscheidt, "Speech enhancement using a joint MAP estimator with gaussian mixture model for (non-)stationary noise," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2011, pp. 4768-4771.
[63] P.C. Loizou, "Speech enhancement based on perceptually motivated bayesian estimators of magnitude spectrum," IEEE Trans. Speech and Audio Process., vol. 13, no. 5, pp. 857-869, Sept. 2005.
[64] E. Plourde, B. Champagne, "Perceptually based speech enhancement using the weighted ß-SA estimator," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 4193-4196.
[65] Nam Soo Kim, Joon-Hyuk Chang, "Spectral enhancement based on global soft decision," IEEE Signal Processing Letters, vol. 7, no. 5, pp. 108-110, May 2000.
[66] I. Cohen, "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator," IEEE Signal Processing Letters, vol. 9, no. 4, pp. 113-116, Apr. 2002.
[67] T. Gerkmann, M. Krawczyk, R. Martin, "Speech presence probability estimation based on temporal cepstrum smoothing," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2010, pp. 4254-4257.


[68] Zhong-Hua Fu, Jhing-Fa Wang, "Speech presence probability estimation based on integrated time frequency minimum tracking for speech enhancement in adverse environments," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2010, pp. 4258-4261.
[69] A. Abramson, I. Cohen, "Simultaneous detection and estimation approach for speech enhancement," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 8, pp. 2348-2359, Nov. 2007.
[70] T. Gerkmann, C. Breithaupt, R. Martin, "Improved a posteriori speech presence probability estimation based on a likelihood ratio with fixed priors," IEEE Trans. Audio, Speech, and Language Process., vol. 16, no. 5, pp. 910-919, July 2008.
[71] I. Cohen, "Speech enhancement using a noncausal a priori SNR estimator," IEEE Signal Processing Letters, vol. 11, no. 9, pp. 725-728, Sept. 2004.
[72] I. Cohen, "Relaxed statistical model for speech enhancement and a priori SNR estimation," IEEE Trans. Speech and Audio Process., vol. 13, no. 5, pp. 870-881, Sept. 2005.
[73] R.C. Hendriks, R. Heusdens, J. Jensen, "Adaptive time segmentation for improved speech enhancement," IEEE Trans. Audio, Speech, and Language Process., vol. 14, no. 6, pp. 2064-2074, Nov. 2006.
[74] C. Plapous, C. Marro, P. Scalart, "Improved signal-to-noise ratio estimation for speech enhancement," IEEE Trans. Audio, Speech, and Language Process., vol. 14, no. 6, pp. 2098-2108, Nov. 2006.
[75] Yao Ren, M.T. Johnson, "An improved SNR estimator for speech enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2008, pp. 4901-4904.
[76] C. Breithaupt, R. Martin, "Analysis of the decision-directed SNR estimator for speech enhancement with respect to low-SNR and transient conditions," IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 2, pp. 277-289, Feb. 2011.
[77] S. Suhadi, C. Last, T. Fingscheidt, "A data-driven approach to a priori SNR estimation," IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 1, pp. 186-195, Jan. 2011.
[78] Pei Chee Yong, S. Nordholm, Hai Huyen Dam, "Trade-off evaluation for speech enhancement algorithms with respect to the a priori SNR estimation," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2012, pp. 4657-4660.
[79] Chaogang Wu, Bo Li, Jin Zheng, "A speech enhancement method based on Kalman filtering," IJWMT, vol. 1, no. 2, April 2011.
[80] Noureddine Aloui, Ben Nasr Mohamed, Adnane Cherif, "Genetic algorithm for designing QMF banks and its application in speech compression using wavelets," IJIGSP, vol. 5, no. 6, May 2013.
[81] Navneet Upadhyay, Abhijit Karmakar, "Spectral subtractive-type algorithms for enhancement of noisy speech: An integrative review," IJIGSP, vol. 5, no. 11, September 2013.

Authors' Profiles

Sunnydayal Vanambathina was born in Vijayawada, India. He received his B.Tech degree in Electronics & Communication Engineering from JNTU Hyderabad, India, in 2007 and his Master of Technology in Signal Processing from National Institute of Technology Calicut in 2010. Presently he is a research scholar in the Department of ECE, National Institute of Technology Warangal. He has two years of teaching experience (2010-2012) as an assistant professor and has published 3 international conference papers. His area of research interest is speech processing.

Siva Prasad Nandyala received the B.Tech degree in Electronics & Communication Engineering from Jawaharlal Nehru Technological University Hyderabad and the Master of Technology in Systems and Signal Processing Electronics from JNTUCEH. Presently he is a research scholar in the Department of ECE, National Institute of Technology Warangal. He worked at WIPRO Technologies in the VLSI domain for two years and has more than 10 publications. His areas of interest include speech recognition, speech emotion recognition and speech enhancement.

Dr. T. Kishore Kumar (Member IETE, India) received the B.Tech degree in Electronics & Communication Engineering and the Master of Technology in Digital Systems and Computer Electronics from Sri Venkateswara University and Jawaharlal Nehru Technological University, India, in 1992 and 1996 respectively, and obtained his Doctorate in Digital Signal Processing from Jawaharlal Nehru Technological University, Hyderabad, India, in 2004. Since 1999 he has been associated with NIT Warangal as an Associate Professor; prior to this he was selected by the Cabinet Secretariat, Prime Minister's Office, Government of India, as a Technical Officer (Tele). His current areas of interest are signal processing and speech processing. At present he is guiding three PhD scholars and two M.Tech students. He has published several papers in the areas of signal and speech processing, and has participated in various courses, such as signal processing, VLSI and curriculum development, organized by reputed institutions in India such as IIT Kharagpur, Intel Bangalore, and IUCEE at Infosys, Mysore.

