CIM 2018 PROCEEDINGS 101
Proceedings of the 22nd CIM, Udine, November 20-23, 2018
In the recent EU-FET project SkAT-VG¹, phoneticians have had an important role in identifying the most relevant components of non-speech voice productions [19]. They identified the broad categories of phonation (i.e., quasi-periodic oscillations due to vocal fold vibrations), turbulence, slow myoelastic vibrations, and clicks, which can be extracted automatically from audio with time-frequency analysis and machine learning [20], and can be made to correspond to categories of sounds as they are perceived [15], and as they are produced in the physical world. Indeed, it has been argued that human utterances somehow mimic "nature's phonemes" [21].

¹ www.skatvg.eu

1.2 Quantum Frameworks

It was Dennis Gabor [1] who first adopted the mathematics of quantum mechanics to explain acoustic phenomena. In particular, he used operator methods to derive the time-frequency uncertainty relation and the (Gabor) function that satisfies minimal uncertainty. Time-scale representations [22] are more suitable to explain the perceptual decoupling of pitch and timbre, and operator methods can be used as well to derive the gammachirp function, which minimizes uncertainty in the time-scale domain [23]. Research in human and machine hearing [3] has been based on banks of elementary (filter) functions, and these systems are at the core of many successful applications in the audio domain.

Despite its deep roots in the physics of the twentieth century, the sound field has not yet embraced the quantum signal processing framework [24] to seek practical solutions to sound scene representation, separation, and analysis. A quantum approach to music cognition was not proposed until very recently [25], when its explanatory power was demonstrated in describing tonal attraction phenomena in terms of metaphorical forces. The theory of open quantum systems has been applied to music to describe the memory properties (non-Markovianity) of different scores [26]. In the image domain, on the other hand, it has been shown how the quantum framework can be effective in solving problems such as segmentation. For example, the separation of figures from background can be obtained by evolving a solution of the time-dependent Schrödinger equation [27], or by discretizing the time-independent Schrödinger equation [28]. An approach to signal manipulation based on the postulates of quantum mechanics can also potentially lead to a computational advantage when using Quantum Processing Units. Results in this direction are being reported for optimization problems [29].

1.3 Research Direction

In the proposed research path, sound is treated as a superposition of states, and the voice-based components (phonation, turbulence, slow myoelastic pulsations) are considered as observables to be represented as operators. The extractors of the fundamental components, i.e., the measurement apparati, are implemented as signal-processing modules that are available both for analysis and, as control knobs, for synthesis. The baseline is found in the results of the SkAT-VG project, which showed that vocal imitations are optimized representations of referent sounds, emphasizing those features that are important for identification. A large collection of audiovisual recordings of vocal and gestural imitations² offers the opportunity to further inquire how people perceive, represent, and communicate about sounds.

² https://www.ircam.fr/projects/blog/multimodal-database-of-vocal-and-gestural-imitations-elicited-by-sounds/

A first assumption underlying this research approach, largely justified by prior art and experience, is that articulatory primitives used to describe vocal utterances are effective as high-level descriptors of sound in general. This assumption leads naturally to an embodied approach to sound representation, analysis, and synthesis. A second assumption is that the mathematics of quantum mechanics, relying on linear operators in Hilbert spaces, offers a formalism that is suitable to describe the objects composing auditory scenes and their evolution in time. The latter assumption is more adventurous, as this path has not been taken in audio signal processing yet. However, the results coming from neighboring fields (music cognition, image processing) encourage us to explore this direction, and to aim at improved techniques for sound analysis and synthesis.

2. SKETCH OF A QUANTUM VOCAL THEORY

An embryonic theory of sound based on the postulates of quantum mechanics, and using high-level vocal descriptors of sound, can be sketched as follows. Let σ be a vector operator that provides information about the phonetic elements along a specific direction of measurement. Phonation, for example, may be represented by σ_z, with eigenstates representing an upper and a lower pitch. Similarly, the turbulence component may be represented by σ_x, with eigenstates representing turbulence of two different distributions. A measurement of turbulence prepares the system in one of two eigenstates for operator σ_x, and a successive measurement of phonation would find a superposition and get equal probabilities for the two eigenstates of σ_z. The two operators σ_z and σ_x may also be made to correspond to the two components of the classic sines+noise model used in audio signal processing. If we add transients/clicks as a third measurement direction (as in the sines + noise + transients model [12]), we can claim that there is no sound state for which the expectation value of the three components is zero: a sort of spin polarization principle as found in quantum mechanics. The evolution of state vectors in time is unitary, and regulated by a time-dependent Schrödinger equation with a suitably chosen Hamiltonian. The eigenvectors of the Hamiltonian allow any state vector to be expanded in that basis, and the time evolution of such an expansion to be computed. A pair of components can be simultaneously measured only if they commute. If they don't, an uncertainty principle can be derived, as was done for time-frequency and time-scale representations [1, 23]. The theory can be extended to cover multiple sources, and the resulting mixed
states can be described via density matrices, whose time evolution can also be computed if a Hamiltonian operator is properly defined. The effect of disturbances and deviations can be accounted for by introducing relaxation times, to model the return to equilibrium. This sketch of a Quantum Vocal Theory needs to be developed and formally laid down.

3. THE PHON FORMALISM

Consider a 3d space with the orthogonal axes

σ_z: phonation, with different pitches;

σ_x: turbulence, with noises of different frequency distributions;

σ_y: myoelastic, slow pulsations with different tempos.

The phon operator σ is a 3-vector operator that provides information about the phonetic elements in a specific direction of the 3d phonetic space.

In this section we present the phon formalism, obtained by direct analogy with the single spin, as presented in accessible presentations of quantum mechanics [30]. We use standard Dirac notation.

3.1 Measurement along σ_z

If the phon is prepared in |r⟩ (turbulent) and the measurement apparatus is then set to measure σ_z, there will be equal probabilities for |u⟩ or |d⟩ phonation as an outcome. Essentially, we are measuring what kind of phonation is in a pure turbulent state. This measurement property is satisfied if

    |r⟩ = (1/√2)|u⟩ + (1/√2)|d⟩ .    (2)

In fact, any state |A⟩ can be expressed as |A⟩ = α_u|u⟩ + α_d|d⟩, where α_u = ⟨u|A⟩ and α_d = ⟨d|A⟩. The probability to measure pitch up is P_u = ⟨A|u⟩⟨u|A⟩ = α_u*α_u, and the probability to measure pitch down is P_d = ⟨A|d⟩⟨d|A⟩ = α_d*α_d (Born rule).

Likewise, if the phon is prepared in |l⟩ and the measurement apparatus is then set to measure σ_z, there will be equal probabilities for |u⟩ or |d⟩ phonation as an outcome. This measurement property is satisfied if

    |l⟩ = (1/√2)|u⟩ − (1/√2)|d⟩ ,    (3)

which is orthogonal to the linear combination (2). In vector form, we have

    |r⟩ = [1/√2, 1/√2]ᵀ ,    |l⟩ = [1/√2, −1/√2]ᵀ ,

and

    σ_x = [0 1; 1 0] .    (4)
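The states and probabilities above can be checked numerically. The following is a minimal sketch in NumPy (the variable names simply mirror the states and the operator of equations (2)-(4); it is an illustration, not part of the formalism):

```python
import numpy as np

# Basis states of phonation: pitch up |u> and pitch down |d>
u = np.array([1, 0], dtype=complex)
d = np.array([0, 1], dtype=complex)

# Turbulent states (2) and (3): |r> and |l>
r = (u + d) / np.sqrt(2)
l = (u - d) / np.sqrt(2)

# sigma_x of (4); |r> and |l> are its eigenstates, with eigenvalues +1 and -1
sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
assert np.allclose(sigma_x @ r, r)
assert np.allclose(sigma_x @ l, -l)
assert np.isclose(np.vdot(r, l), 0)  # (2) and (3) are orthogonal

# Born rule: measuring sigma_z on |r> gives pitch up or down with equal probability
P_u = abs(np.vdot(u, r)) ** 2  # alpha_u* alpha_u
P_d = abs(np.vdot(d, r)) ** 2  # alpha_d* alpha_d
print(round(P_u, 3), round(P_d, 3))  # 0.5 0.5
```

Note that `np.vdot` conjugates its first argument, so `np.vdot(u, r)` is exactly the inner product ⟨u|r⟩.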
3.4 Measurement along an arbitrary direction

Orienting the measurement apparatus along an arbitrary direction n = [n_x, n_y, n_z]′ means taking a weighted mixture:

    σ_n = n_x σ_x + n_y σ_y + n_z σ_z .

3.5 Time evolution

In quantum mechanics the evolution of state vectors in time

    |ψ(t)⟩ = U(t)|ψ(0)⟩    (13)

is governed by the operator U, which is unitary, i.e., U†U = I. Taken a small time increment ε, continuity of the time-development operator gives it the form

    U(ε) = I − iεH ,    (14)

whose energy eigenvalues are E_j = ±1, with energy eigenvectors |E_j⟩.⁴

⁴ We do not need physical dimensional consistency here, so we drop Planck's constant.
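Equations (13) and (14) can be illustrated numerically. In the sketch below, the Hermitian generator H is an arbitrary example chosen for convenience (σ_x, whose eigenvalues are ±1, and which satisfies H² = I so the exact propagator has a closed form); it is not prescribed by the text:

```python
import numpy as np

# An example Hermitian generator: sigma_x, with eigenvalues +1 and -1
H = np.array([[0, 1], [1, 0]], dtype=complex)

# Since H @ H = I, the exact propagator is U(t) = cos(t) I - i sin(t) H
t = 0.7
U = np.cos(t) * np.eye(2) - 1j * np.sin(t) * H

# Unitarity: U† U = I
assert np.allclose(U.conj().T @ U, np.eye(2))

# First-order form (14): U(eps) ~ I - i eps H for a small time increment
eps = 1e-4
U_eps = np.eye(2) - 1j * eps * H
U_exact = np.cos(eps) * np.eye(2) - 1j * np.sin(eps) * H
assert np.allclose(U_eps, U_exact, atol=1e-6)

# Evolution (13): |psi(t)> = U(t)|psi(0)> preserves the norm of the state
psi0 = np.array([1, 0], dtype=complex)
psi_t = U @ psi0
assert np.isclose(np.linalg.norm(psi_t), 1.0)
```

The first-order check makes the continuity argument concrete: for small ε the exact unitary and the linearized form (14) agree up to terms of order ε².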
3.6 Uncertainty

If we measure two observables L and M (in a single experiment) simultaneously, quantum mechanics prescribes that the system is left in a simultaneous eigenvector of the observables only if L and M commute, i.e., [L, M] = 0. Measurement operators along different axes do not commute. For example, [σ_x, σ_y] = 2iσ_z, and therefore phonation and turbulence cannot be simultaneously measured with certainty.

The uncertainty principle, based on the Cauchy-Schwarz inequality in complex vector spaces, prescribes that the product of the two uncertainties is at least as large as half the magnitude of the commutator:

    ΔL ΔM ≥ (1/2) |⟨ψ|[L, M]|ψ⟩| .    (23)

Equation (23) expresses the uncertainty principle. If L = t is the time operator and S = −i d/dt is the frequency operator, and these are applied to the complex oscillator Ae^{iωt}, the time-frequency uncertainty principle results, and uncertainty is minimized by the Gabor function. Starting from the scale operator, the gammachirp function can be derived [23].

4. WHAT TO DO WITH ALL THIS

Assuming that a vocal description of sound follows the postulates of quantum mechanics, the question is how to take advantage of the quantum formalism. How can we practically connect theoretical ideas with audio signal processing?

4.1 Non-commutativity and autostates

We expect that measurement operators along different axes do not commute: this is the case, for example, of measurements of phonation and turbulence. Let A be an audio sample. The measurement of turbulence by the operator T leads to T(A) = A′. A successive measurement of phonation by the operator P gives P(A′) = A″, thus P(A′) = PT(A) = A″. If we perform the measurements in the opposite order, with phonation first and turbulence later, we obtain TP(A) = T(A*) = A**. We expect that [T, P] ≠ 0, and thus that A** ≠ A″. The diagram in figure 1 shows non-commutativity in the style of category theory.

Figure 1. A non-commutative diagram (in the style of category theory), representing the non-commutativity of measurements of phonation (P) and turbulence (T) on audio A.

Measurements of phonation and turbulence can be actually performed using the Harmonic plus Stochastic (HPS) model [11]. The order of operations is visually described in figure 2. The measurement of phonation is performed through the extraction of the harmonic component in the HPS model, while the measurement of turbulence is performed through the extraction of the stochastic component with the same model. The spectrograms for A″ and A** in Figure 3 show the results of such two sequences of analyses on a segment of female speech⁵, confirming that the commutator [T, P] is non-zero.

⁵ https://freesound.org/s/317745/. Hann window of 2048 samples, FFT of 4096 samples, hop size of 1024 samples.

Figure 2. On the left, a voice is analyzed via the HPS model. Then, the stochastic part is submitted to a new analysis. In this way, a measurement of phonation follows a measurement of turbulence. On the right, the measurement of turbulence follows a measurement of phonation.

Essentially, if we adopt the HPS model and skip the final step of addition and inverse transformation, we are left with something that is conceptually equivalent to a quantum destructive measure. Let St be the filter that extracts the stochastic part from a signal. As figure 4 shows, the spectrogram of St(x) is visibly different from the spectrogram of x. Conversely, if we apply St once more, we get a spectrum that does not change much: St²(x) = St(St(x)) ≈ St(x). If we transform back from the second and third spectrograms of figure 4, we get sounds that are very close to each other. In fact, ideally, St²(x) = St(x). It means that, after a measure of the non-harmonic component of some signal, the output signal can be considered as an autostate. If we perform the measure again and again, we still get the same result. Such a measure operation provokes the collapse of a hypothetical underlying wave function, which is originally a superposition of states, and is reduced to a single state upon measurement. The importance of the autostates in this framework is connected with the concept
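The algebra invoked here can be sketched numerically: the commutator [σ_x, σ_y] = 2iσ_z, the idempotence of a measurement that prepares an autostate (St² = St), and the non-commutativity of two measurements along different axes. The toy below uses rank-one projectors as stand-ins for the measurement operators; it is a sketch of the algebraic point, not an implementation of the HPS model:

```python
import numpy as np

sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_y = np.array([[0, -1j], [1j, 0]], dtype=complex)
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)

# Commutator [sigma_x, sigma_y] = 2i sigma_z
comm = sigma_x @ sigma_y - sigma_y @ sigma_x
assert np.allclose(comm, 2j * sigma_z)

# Projectors |v><v| onto a phonation eigenstate |u> and a turbulent state |r>
u = np.array([1, 0], dtype=complex)
r = np.array([1, 1], dtype=complex) / np.sqrt(2)
P = np.outer(u, u.conj())  # stand-in for the phonation measure P
T = np.outer(r, r.conj())  # stand-in for the turbulence measure T

# Idempotence: repeating the same measure leaves the autostate unchanged
assert np.allclose(P @ P, P)
assert np.allclose(T @ T, T)

# Non-commutativity: the order of the two measures matters, PT != TP
assert not np.allclose(P @ T, T @ P)
```

The idempotence assertions mirror the observation that St²(x) = St(x): a projective measure, applied twice, is the same as applied once.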
of quantum measures, which may become practically feasible through a set of audio-signal analysis tools.

Figure 3. On the top, the spectrogram corresponding to a measurement of phonation P following a measurement of turbulence T, leading to PT(A) = A″. On the bottom, the spectrogram corresponding to a measurement of turbulence T following a measurement of phonation P, leading to TP(A) = A**.

Figure 4. Top: spectrum of the original sound signal (a female speech), Center: the stochastic component, derived from harmonic plus stochastic analysis (HPS), as the effect of a destructive measure, and Bottom: the stochastic component of the stochastic component itself. The last two spectra are very close.

We can define other filters that extract more specific information: for example, a glissando filter would extract glissando passages within a sound signal, giving as output the inverse transform of the filtered spectrum, that is, a filtered signal where we can just hear the glissando effect and nothing else. Re-performing the glissando measure on that sound, we still get the same sound: the new sound signal is an autostate of glissando. A suitable collection of filters that act both on the original sounds as well as on their vocal imitations can produce simplified spectra, easier to compare. These may give hints on how information is extracted by voice, permitting recognition of the original sound source. This would confirm the importance of the human voice as a probe to investigate the world of sounds, and Quantum Vocal Theory as a bridge between quantum physics, acoustics, and cognition, with possible further bridges to multisensory perception and interaction. If we consider that voice can imitate not only sounds, but also movements and, sometimes, even visual shapes through crossmodal correspondences [31], new fascinating scenarios open up for investigation.

4.2 Time evolution

We know that time evolution of states is governed by the unitary transformation (13) and by the Schrödinger equation (15). A measurement is represented by an operator that acts on the state and causes its collapse onto one of its eigenvectors. The system remains in a superposition of states until a measurement occurs, whose outcome is inherently probabilistic. The states change under the effect of external forces, which therefore determine the change in the probabilities associated with each state. It is a Hamiltonian such as (20) that controls the evolution process. We can think of the vocal Hamiltonian as containing the potential energies due to restoring forces that are contained in the utterance at a particular moment in time. We could even think of a formal composition of forces acting at different levels, and describe them through a nested categorical depiction.

Similarly to what has been done by Youssry et al. [27], the Hamiltonian can be chosen to be time-dependent yet commutative (i.e., [H(0), H(t)] = H(0)H(t) − H(t)H(0) = 0), so that a closed-form solution to state evolution can be obtained. A time-independent Hamiltonian such as the one leading to (22) would not be very useful, both because forces indeed change continuously and because this would lead to an oscillatory solution. Instead, with a time-varying commutative Hamiltonian the time evolution can be expressed as

    |ψ(t)⟩ = e^{−i ∫₀ᵗ H(τ)dτ} |ψ(0)⟩ .    (24)

A simple choice is that of a Hamiltonian such as

    H(t) = g(t)S ,    (25)

with S a time-independent Hermitian matrix. A function g(t) that ensures convergence of the integral in (24) is the damping

    g(t) = e^{−t} .    (26)

In an audio application, we can consider a slice of time and the initial and final states for that slice. We should look for a Hamiltonian that leads to the evolution of the initial state into the final state. In image segmentation [27], where time is used to let each pixel evolve to a final foreground-background assignment, the Hamiltonian is chosen to be

    H = e^{−t} [0 −i; i 0] f(x) ,    (27)

and f(·) is a two-valued function of a feature vector x that contains information about a neighborhood of the pixel.
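With the choices (25)-(26), the evolution (24) can be integrated in closed form, since ∫₀ᵗ e^{−τ}dτ = 1 − e^{−t}. The sketch below takes S to be the Hermitian matrix appearing in (27) (so that S² = I gives an exact propagator) and an arbitrary example initial state; both are illustrative assumptions:

```python
import numpy as np

# Time-independent Hermitian generator, as in (27)
S = np.array([[0, -1j], [1j, 0]], dtype=complex)

def evolve(psi0, t):
    """State at time t under H(tau) = exp(-tau) * S, via (24).

    H(tau) commutes with itself at all times, so (24) applies with
    a = integral of exp(-tau) from 0 to t = 1 - exp(-t). Since S @ S = I,
    the matrix exponential reduces exactly to cos(a) I - i sin(a) S.
    """
    a = 1.0 - np.exp(-t)
    U = np.cos(a) * np.eye(2) - 1j * np.sin(a) * S
    return U @ psi0

psi0 = np.array([1, 0], dtype=complex)  # example initial state

# Unitarity: the norm of the state is preserved at all times
for t in (0.1, 1.0, 10.0):
    assert np.isclose(np.linalg.norm(evolve(psi0, t)), 1.0)

# The damping (26) makes the evolution converge: as t grows, the state
# approaches a fixed point (a -> 1) instead of oscillating forever
limit = np.cos(1.0) * psi0 - 1j * np.sin(1.0) * (S @ psi0)
assert np.allclose(evolve(psi0, 50.0), limit)
```

The last assertion makes concrete why the damping g(t) is useful: the integral in (24) converges, so the state settles rather than oscillating as it would under a time-independent Hamiltonian.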
Such a function is learned from an example image with a given ground truth. In audio we may do something similar and learn from examples of transformations: phonation to phonation, with or without pitch crossing; phonation to turbulence; phonation to myoelastic, etc. We may also add a coefficient to the exponent in (26), to govern the rapidity of transformation. As opposed to image processing, time is our playground.

The matrix S can be set to assume the structure (20), and the components of potential energy found in an utterance field can be extracted as audio features. For example, pitch salience can be extracted from time-frequency analysis [32] and used as the n_z component for the Hamiltonian. Figure 5 shows the two most salient pitches, automatically extracted from a mixture of male and female voices⁶. Frequent up-down jumps are evident, and they make it difficult to track a single voice. Quantum measurement induces state collapse to |u⟩ or |d⟩ and, from that state, evolution can be governed by (24). In this way, it should be possible to mimic human figure-ground attention [33], and follow each individual voice.

6. REFERENCES

[1] D. Gabor, "Acoustical quanta and the theory of hearing," Nature, vol. 159, no. 4044, p. 591, 1947.

[4] G. De Poli, A. Piccialli, and C. Roads, Representations of musical signals. MIT Press, 1991.

[5] D. Roden, "Sonic art and the nature of sonic events," Review of Philosophy and Psychology, vol. 1, no. 1, pp. 141–156, 2010.

[6] M. Leman, Embodied music cognition and mediation technology. MIT Press, 2008.

[7] S. D. Monache, D. Rocchesso, F. Bevilacqua, G. Lemaitre, S. Baldan, and A. Cera, "Embodied sound design," International Journal of Human-Computer Studies, vol. 118, pp. 47–59, 2018.

[8] W. W. Gaver, "How do we hear in the world? Explorations in ecological acoustics," Ecological Psychology, vol. 5, no. 4, pp. 285–313, 1993.

[9] O. Houix, G. Lemaitre, N. Misdariis, P. Susini, and I. Urdapilleta, "A lexical analysis of environmental sound categories," Journal of Experimental Psychology: Applied, vol. 18, no. 1, p. 52, 2012.
[14] V. Isnard, M. Taffou, I. Viaud-Delmon, and C. Suied, "Auditory sketches: very sparse representations of sounds are still recognizable," PloS one, vol. 11, no. 3, 2016.

[15] G. Lemaitre, A. Jabbari, N. Misdariis, O. Houix, and P. Susini, "Vocal imitations of basic auditory features," The Journal of the Acoustical Society of America, vol. 139, no. 1, pp. 290–300, 2016.

[16] G. Lemaitre and D. Rocchesso, "On the effectiveness of vocal imitations and verbal descriptions of sounds," The Journal of the Acoustical Society of America, vol. 135, no. 2, pp. 862–873, 2014.

[17] M. Perlman and G. Lupyan, "People can create iconic vocalizations to communicate various meanings to naïve listeners," Scientific Reports, vol. 8, no. 1, p. 2634, 2018.

[18] Z. Wallmark, M. Iacoboni, C. Deblieck, and R. A. Kendall, "Embodied Listening and Timbre: Perceptual, Acoustical, and Neural Correlates," Music Perception: An Interdisciplinary Journal, vol. 35, no. 3, pp. 332–363, 2018.

[19] P. Helgason, "Sound initiation and source types in human imitations of sounds," in Proceedings of FONETIK 2014, pp. 83–88, 2014.

[20] E. Marchetto and G. Peeters, "Automatic recognition of sound categories from their vocal imitation using audio primitives automatically derived by SI-PLCA and HMM," in Proceedings of the International Symposium on Computer Music Multidisciplinary Research, 2017.

[21] M. Changizi, Harnessed: How language and music mimicked nature and transformed ape to man. BenBella Books, Inc., 2011.

[22] A. De Sena and D. Rocchesso, "A fast Mellin and scale transform," EURASIP Journal on Applied Signal Processing, no. Article ID 8917, 2007.

[23] T. Irino and R. D. Patterson, "A time-domain, level-dependent auditory filter: The gammachirp," The Journal of the Acoustical Society of America, vol. 101, no. 1, pp. 412–419, 1997.

[24] Y. C. Eldar and A. V. Oppenheim, "Quantum signal processing," IEEE Signal Processing Magazine, vol. 19, no. 6, pp. 12–32, 2002.

[25] P. beim Graben and R. Blutner, "Quantum approaches to music cognition," arXiv:1712.07417, 2018.

[26] M. Mannone and G. Compagno, "Characterization of the degree of Musical non-Markovianity," arXiv:1306.0229, 2013.

[27] A. Youssry, A. El-Rafei, and S. Elramly, "A quantum mechanics-based framework for image processing and its application to image segmentation," Quantum Information Processing, vol. 14, no. 10, pp. 3613–3638, 2015.

[28] Ç. Aytekin, E. C. Ozan, S. Kiranyaz, and M. Gabbouj, "Extended quantum cuts for unsupervised salient object extraction," Multimedia Tools and Applications, vol. 76, no. 8, pp. 10443–10463, 2017.

[29] F. Neukart, G. Compostella, C. Seidel, D. von Dollen, S. Yarkoni, and B. Parney, "Traffic flow optimization using a quantum annealer," Frontiers in ICT, vol. 4, p. 29, 2017.

[30] L. Susskind and A. Friedman, Quantum Mechanics: The Theoretical Minimum. UK: Penguin Books, 2015.

[31] C. Spence, "Crossmodal correspondences: a tutorial review," Attention, Perception, and Psychophysics, vol. 73, no. 4, pp. 971–995, 2011.

[32] J. Salamon and E. Gomez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 1759–1770, Aug. 2012.

[33] E. Bigand, S. McAdams, and S. Forêt, "Divided attention in music," International Journal of Psychology, vol. 35, no. 6, pp. 270–278, 2000.