Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders

Yin-Jyun Luo$^{1, 3}$, Chin-Cheng Hsu$^{2}$, Kat Agres$^{3,4}$, Dorien Herremans$^{1,3}$

$^{1}$Singapore University of Technology and Design
$^{2}$University of Southern California
$^{3}$Institute of High Performance Computing, A*STAR, Singapore
$^{4}$Yong Siew Toh Conservatory of Music, National University of Singapore
$\tt yinjyun\_luo@mymail.sutd.edu.sg$

Many-to-Many Singer and Vocal Technique Conversion

We present the audio samples corresponding to Fig. 2 in the paper.

The audio files are all inverted from Mel-spectrograms using Griffin-Lim.

Because even the unconverted samples pass through this inversion, the audio labeled "Original Mel-spectrogram" serves as the upper bound on audio quality.
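For readers unfamiliar with the inversion step mentioned above, the following is a minimal NumPy-only sketch of how a waveform can be estimated from a Mel-spectrogram with Griffin-Lim. It is not the implementation used to produce these samples. The sampling rate, FFT size, hop size, number of Mel bands, iteration count, the pseudo-inverse used to map Mel bands back to linear frequency, and all function names are illustrative assumptions.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Centered, Hann-windowed STFT; returns shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1) * window
    length = n_fft + hop * (frames.shape[0] - 1)
    signal, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    signal = signal / np.maximum(norm, 1e-8)
    return signal[n_fft // 2:length - n_fft // 2]  # drop the centering padding

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_to_linear(mel, fb):
    """Approximate the linear magnitude spectrogram (assumption: clipped pseudo-inverse)."""
    return np.maximum(0.0, np.linalg.pinv(fb) @ mel)

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)               # back to the time domain
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))  # keep only the new phase
    return istft(mag * phase, n_fft, hop)

# Toy input (assumption): a one-second 440 Hz tone instead of a singing recording.
sr, n_fft, hop, n_mels = 16000, 512, 128, 40
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
fb = mel_filterbank(sr, n_fft, n_mels)
mel = fb @ np.abs(stft(tone, n_fft, hop))   # Mel-spectrogram, phase discarded
audio = griffin_lim(mel_to_linear(mel, fb), n_iter=32, n_fft=n_fft, hop=hop)
```

Both the Mel projection and the phase estimate lose information, which is why audio inverted from the original Mel-spectrogram already differs from the recording. Library routines such as `librosa.feature.inverse.mel_to_audio` perform a comparable Mel-to-linear approximation followed by Griffin-Lim.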

We first give examples of the six vocal techniques.

Note that different singers may express a vocal technique differently and may vary their level of expression over time, which makes vocal techniques ambiguous to define and poses challenges for data-driven models.

straight
f1
m9
f1
belt
m6
f7
m3
breathy
f6
f4
f8
lip trill
f9
m8
f9
vibrato
m1
f7
m11
vocal fry
m3
m2
f3

Fig. 2(b) - Vocal Technique Conversion

m8 lip trill
Original Mel-spectrogram
Convert to belt
Convert to breathy
Convert to straight
Convert to vibrato
Convert to vocal fry
m9 straight
Original Mel-spectrogram
Convert to belt
Convert to breathy
Convert to lip trill
Convert to vibrato
Convert to vocal fry

Additional Samples for Vocal Technique Conversion

f4 breathy
Original Mel-spectrogram
Convert to belt
Convert to lip trill
Convert to vibrato
Convert to vocal fry

Fig. 2(a) - Singer Conversion

m3 belt
Original Mel-spectrogram
Convert to f1
Convert to f2
Convert to f4
Convert to m4
Convert to m6
f4 breathy
Original Mel-spectrogram
Convert to f1
Convert to f2
Convert to m3
Convert to m4
Convert to m6

Additional Samples for Singer Conversion

f2 straight
Original Mel-spectrogram
Convert to f1
Convert to f2
Convert to m3
Convert to m4
Convert to m6