Learning Disentangled Representations of
Timbre and Pitch for Musical Instrument Sounds Using
Gaussian Mixture Variational Autoencoders

Supplementary Audio Files and Code

Yin-Jyun Luo$^{1}$, Kat Agres$^{2,3}$, Dorien Herremans$^{1,2}$

$^{1}$Singapore University of Technology and Design
$^{2}$Institute of High Performance Computing, A*STAR, Singapore
$^{3}$Yong Siew Toh Conservatory of Music, National University of Singapore
$\tt yinjyun\_luo@mymail.sutd.edu.sg, kat\_agres@ihpc.a-star.edu.sg, dorien\_herremans@sutd.edu.sg$

Controllable Synthesis of Instrument Sounds Given Pitch and Instrument


In this section, we complement the paper with synthesized Mel-spectrograms and the corresponding audio files.

We generate the Mel-spectrograms as described in the paper and use the Griffin-Lim algorithm to synthesize the waveforms. Note that audio quality is not a focus of this paper; the inferior quality stems mainly from the algorithm used to synthesize the waveforms, as the original and the generated Mel-spectrograms yield similar audio quality. In future work, we will address this by using advanced auto-regressive networks such as WaveNet for audio synthesis.

First, we present audio synthesized from the original Mel-spectrograms. Specifically, we convert a sample to a Mel-spectrogram and resynthesize it back to audio using Griffin-Lim. This provides a reference for the audio quality obtainable with Griffin-Lim in this work.
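To make the resynthesis step concrete, below is a minimal NumPy sketch of the Griffin-Lim phase-recovery loop applied to a magnitude spectrogram. The FFT size, hop length, and iteration count are illustrative choices, not the settings used in the paper (which also operates on Mel-spectrograms rather than the linear spectrogram shown here).

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Hann-windowed short-time Fourier transform (rows = frames)
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=512, hop=128):
    # Overlap-add inverse with squared-window normalization
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):
        frame = np.fft.irfft(spec, n=n_fft)
        out[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=512, hop=128):
    # Start from random phase; alternately project onto the set of
    # signals with the target magnitude and the set of consistent STFTs
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

In practice, library routines such as `librosa.griffinlim` implement the same loop with additional refinements (e.g. momentum).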

French horn
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Piano
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Cello
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Bassoon
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim

Now we demonstrate controllable sound synthesis. As described in Section 4.3 of the paper, we specify the target pitch $\mathbf{y}_m$ and instrument $\mathbf{y}_k$, and sample the pitch code $\mathbf{z}_p$ and timbre code $\mathbf{z}_t$ from the conditional distributions $p(\mathbf{z}_p | \mathbf{y}_m)$ and $p(\mathbf{z}_t | \mathbf{y}_k)$, respectively, where $p(\mathbf{z}_{p} | \mathbf{y}_{m}) = \mathcal{N}(\boldsymbol{\mu}_{\mathbf{y}_{m}}, \mathrm{diag}(\boldsymbol{\sigma}_{\mathbf{y}_{m}}))$ and $p(\mathbf{z}_{t} | \mathbf{y}_{k}) = \mathcal{N}(\boldsymbol{\mu}_{\mathbf{y}_{k}}, \mathrm{diag}(\boldsymbol{\sigma}_{\mathbf{y}_{k}}))$. In the following demonstration, we specify the same pitches for all instruments, play the audio, and display the corresponding Mel-spectrograms.
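The sampling step above can be sketched in a few lines of NumPy. The component means and variances here are random placeholders standing in for the learned Gaussian-mixture parameters, and the class counts and latent dimensionality are illustrative; in the actual model, the concatenated codes would be fed to the trained decoder to produce a Mel-spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder GMVAE parameters: one diagonal Gaussian per pitch class
# and per instrument class (values and sizes are illustrative only)
n_pitches, n_instruments, dim = 82, 13, 16
mu_pitch = rng.normal(size=(n_pitches, dim))
sigma_pitch = np.full((n_pitches, dim), 0.1)
mu_timbre = rng.normal(size=(n_instruments, dim))
sigma_timbre = np.full((n_instruments, dim), 0.1)

def sample_codes(y_m, y_k):
    """Draw z_p ~ p(z_p | y_m) and z_t ~ p(z_t | y_k), each a
    diagonal Gaussian selected by the target pitch / instrument."""
    z_p = mu_pitch[y_m] + sigma_pitch[y_m] * rng.normal(size=dim)
    z_t = mu_timbre[y_k] + sigma_timbre[y_k] * rng.normal(size=dim)
    return z_p, z_t

# Same pitch, two different instruments: only the timbre code changes
z_p, z_t_horn = sample_codes(y_m=40, y_k=1)
_, z_t_piano = sample_codes(y_m=40, y_k=4)
```

Holding $\mathbf{y}_m$ fixed while varying $\mathbf{y}_k$ (or vice versa) is what makes the synthesis controllable along each factor independently.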

English horn
French horn
Tenor Trombone
Trumpet
Piano