We generate the Mel-spectrograms as described in the paper, and use Griffin-Lim to synthesize the waveforms. Note that high audio quality is not the focus of this paper; the inferior quality is mainly due to the algorithm used to synthesize the waveforms, as the original Mel-spectrograms and the generated ones result in similar audio quality. In future work, we will address this by using advanced auto-regressive networks such as WaveNet for audio synthesis.
First, we present audio synthesized from the original Mel-spectrogram: we convert a sample to a Mel-spectrogram and resynthesize it back to audio using Griffin-Lim. This gives a reference for the audio quality obtainable with Griffin-Lim in this work.
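A round trip of this kind can be sketched with librosa, whose Mel inversion uses Griffin-Lim internally to estimate the discarded phase (the file name, sample rate, and STFT/Mel parameters below are illustrative assumptions, not necessarily the settings used in the paper):

```python
import librosa

# Load a waveform and convert it to a Mel-spectrogram.
y, sr = librosa.load("sample.wav", sr=16000)  # hypothetical file and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert the Mel-spectrogram back to audio; Griffin-Lim iteratively
# recovers a phase estimate consistent with the magnitude spectrogram.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60
)
```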
Now we demonstrate controllable sound synthesis. As described in Section 4.3 of the paper, we specify the target pitch $\mathbf{y}_p$ and instrument $\mathbf{y}_t$, and sample the pitch code $\mathbf{z}_p$ and timbre code $\mathbf{z}_t$ from the conditional distributions $p(\mathbf{z}_p | \mathbf{y}_p)$ and $p(\mathbf{z}_t | \mathbf{y}_t)$, respectively, where $p(\mathbf{z}_{p} | \mathbf{y}_{p}) = \mathcal{N}(\mathbf{\mu}_{\mathbf{y}_{p}}, \textrm{diag}(\mathbf{\sigma}_{\mathbf{y}_{p}}))$ and $p(\mathbf{z}_{t} | \mathbf{y}_{t}) = \mathcal{N}(\mathbf{\mu}_{\mathbf{y}_{t}}, \textrm{diag}(\mathbf{\sigma}_{\mathbf{y}_{t}}))$. In the following demonstration, we specify the same set of pitches for all instruments, play the audio, and display the corresponding Mel-spectrograms.
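A minimal sketch of this sampling step is given below; the latent dimension, the numbers of pitch and instrument classes, and the index values are placeholders, and the per-class Gaussian parameters would in practice be the learned ones:

```python
import torch

# Illustrative sizes and randomly initialized parameters; in practice these
# are the learned per-class Gaussian parameters of the latent mixtures.
n_pitches, n_instruments, dim = 82, 13, 64
mu_p = torch.randn(n_pitches, dim)        # pitch-component means
log_sigma_p = torch.zeros(n_pitches, dim)
mu_t = torch.randn(n_instruments, dim)    # timbre-component means
log_sigma_t = torch.zeros(n_instruments, dim)

def sample_code(mu, log_sigma):
    # Draw z ~ N(mu, diag(sigma)) via the reparameterization trick,
    # with exp(log_sigma) acting as the per-dimension scale.
    return mu + torch.exp(log_sigma) * torch.randn_like(mu)

y_p, y_t = 40, 3                          # chosen target pitch / instrument
z_p = sample_code(mu_p[y_p], log_sigma_p[y_p])
z_t = sample_code(mu_t[y_t], log_sigma_t[y_t])
z = torch.cat([z_p, z_t], dim=-1)         # [z_p, z_t] is fed to the decoder
```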
As described in Section 4.4 of the paper, we first infer $\mathbf{z}_p$ and $\mathbf{z}_t$ from the source input, and modify $\mathbf{z}_t$ (denoted $\mathbf{z}_{source}$) by:
$$\mathbf{z}_{transfer} = \mathbf{z}_{source} + \alpha\mathbf{\mu}_{source \rightarrow target},$$
where $\mathbf{\mu}_{source \rightarrow target} = \mathbf{\mu}_{target} - \mathbf{\mu}_{source}$, and $\alpha \in [0, 1]$. We then synthesize the spectrogram by passing $[\mathbf{z}_p, \mathbf{z}_{transfer}]$ to the decoder. See Fig. 4 for an illustration of transferring French horn to piano.
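This interpolation is a one-liner; the sketch below uses placeholder tensors for the inferred timbre code and the component means, and sweeps the same $\alpha$ schedule used in the demonstrations that follow:

```python
import torch

def transfer(z_source, mu_source, mu_target, alpha):
    # Shift the timbre code along the source-to-target direction.
    return z_source + alpha * (mu_target - mu_source)

dim = 64                                    # illustrative latent size
z_source = torch.randn(dim)                 # stands in for the inferred z_t
mu_source, mu_target = torch.randn(dim), torch.randn(dim)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    z_transfer = transfer(z_source, mu_source, mu_target, alpha)
    # [z_p, z_transfer] is then passed to the decoder.
```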
Note that, in practice, we do not need labels of the source instrument and pitch for timbre transfer, as the two latent variables are automatically inferred by $q(\mathbf{z}_p | \mathbf{X})$ and $q(\mathbf{z}_t | \mathbf{X})$, respectively. $q(\mathbf{y}_{t} | \mathbf{X})$ infers the mixture component (the source instrument identity) to which $\mathbf{X}$ belongs, and $\mathbf{\mu}_{source \rightarrow target}$ is then obtained by subtracting the mean of the source mixture component from that of the target.
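In code, this label-free selection of the transfer direction might look as follows; the softmax over random logits is merely a stand-in for the actual posterior produced by $q(\mathbf{y}_t | \mathbf{X})$, and the sizes and target index are illustrative:

```python
import torch

K, dim = 13, 64                             # instrument components / latent size
mu_t = torch.randn(K, dim)                  # learned timbre-component means
posterior = torch.softmax(torch.randn(K), dim=0)  # stand-in for q(y_t | X)

source = int(torch.argmax(posterior))       # inferred source instrument
target = 5                                  # chosen target instrument index
mu_source_to_target = mu_t[target] - mu_t[source]
```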
Following Fig. 5 in the paper, we demonstrate $\texttt{Fhn} \rightarrow \texttt{Pno}$, $\texttt{Pno} \rightarrow \texttt{Vc}$, $\texttt{Vc} \rightarrow \texttt{Bn}$, and $\texttt{Bn} \rightarrow \texttt{Fhn}$.
The source instrument is gradually changed into the target instrument as $\alpha$ increases through $\{0, 0.25, 0.5, 0.75, 1.0\}$.