We generate the Mel-spectrograms as described in the paper, and use Griffin-Lim to synthesize the waveforms. Note that high audio quality is not the focus of this paper; the inferior quality is mainly due to the algorithm used to synthesize the waveforms, as the original Mel-spectrograms and the generated ones result in similar audio quality. In future work, we will address this by using advanced auto-regressive networks such as WaveNet for audio synthesis.
First, we present audio synthesized from the original Mel-spectrogram: we convert a sample to a Mel-spectrogram and resynthesize it back to audio using Griffin-Lim. This gives a reference for the audio quality obtainable with Griffin-Lim in this work.
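A round trip of this kind can be sketched with librosa, whose Mel inversion uses Griffin-Lim internally to estimate the discarded phase (the file name, sample rate, and STFT/Mel parameters below are illustrative assumptions, not necessarily the settings used in the paper):

```python
import librosa

# Load a waveform and convert it to a Mel-spectrogram.
y, sr = librosa.load("sample.wav", sr=16000)  # hypothetical file and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert the Mel-spectrogram back to audio; Griffin-Lim iteratively
# recovers a phase estimate consistent with the magnitude spectrogram.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60
)
```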
Now we demonstrate controllable sound synthesis. As described in Section 4.3 of the paper, we specify the target pitch $\mathbf{y}_p$ and instrument $\mathbf{y}_t$, and sample the pitch code $\mathbf{z}_p$ and timbre code $\mathbf{z}_t$ from the conditional distributions $p(\mathbf{z}_p | \mathbf{y}_p)$ and $p(\mathbf{z}_t | \mathbf{y}_t)$, respectively, where $p(\mathbf{z}_{p} | \mathbf{y}_{p}) = \mathcal{N}(\mathbf{\mu}_{\mathbf{y}_{p}}, \textrm{diag}(\mathbf{\sigma}_{\mathbf{y}_{p}}))$ and $p(\mathbf{z}_{t} | \mathbf{y}_{t}) = \mathcal{N}(\mathbf{\mu}_{\mathbf{y}_{t}}, \textrm{diag}(\mathbf{\sigma}_{\mathbf{y}_{t}}))$. In the following demonstration, we specify the same set of pitches for all instruments, play the audio, and display the corresponding Mel-spectrograms.
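A minimal sketch of this sampling step is given below; the latent dimension, the numbers of pitch and instrument classes, and the index values are placeholders, and the per-class Gaussian parameters would in practice be the learned ones:

```python
import torch

# Illustrative sizes and randomly initialized parameters; in practice these
# are the learned per-class Gaussian parameters of the latent mixtures.
n_pitches, n_instruments, dim = 82, 13, 64
mu_p = torch.randn(n_pitches, dim)        # pitch-component means
log_sigma_p = torch.zeros(n_pitches, dim)
mu_t = torch.randn(n_instruments, dim)    # timbre-component means
log_sigma_t = torch.zeros(n_instruments, dim)

def sample_code(mu, log_sigma):
    # Draw z ~ N(mu, diag(sigma)) via the reparameterization trick,
    # with exp(log_sigma) acting as the per-dimension scale.
    return mu + torch.exp(log_sigma) * torch.randn_like(mu)

y_p, y_t = 40, 3                          # chosen target pitch / instrument
z_p = sample_code(mu_p[y_p], log_sigma_p[y_p])
z_t = sample_code(mu_t[y_t], log_sigma_t[y_t])
z = torch.cat([z_p, z_t], dim=-1)         # [z_p, z_t] is fed to the decoder
```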
As described in Section 4.4 of the paper, we first infer $\mathbf{z}_p$ and $\mathbf{z}_t$ from the source input, and modify $\mathbf{z}_t$ (denoted $\mathbf{z}_{source}$) by:
$$\mathbf{z}_{transfer} = \mathbf{z}_{source} + \alpha\mathbf{\mu}_{source \rightarrow target},$$
where $\mathbf{\mu}_{source \rightarrow target} = \mathbf{\mu}_{target} - \mathbf{\mu}_{source}$, and $\alpha \in [0, 1]$. We then synthesize the spectrogram by passing $[\mathbf{z}_p, \mathbf{z}_{transfer}]$ to the decoder. See Fig. 4 for an illustration of transferring French horn to piano.
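This interpolation is a one-liner; the sketch below uses placeholder tensors for the inferred timbre code and the component means, and sweeps the same $\alpha$ schedule used in the demonstrations that follow:

```python
import torch

def transfer(z_source, mu_source, mu_target, alpha):
    # Shift the timbre code along the source-to-target direction.
    return z_source + alpha * (mu_target - mu_source)

dim = 64                                    # illustrative latent size
z_source = torch.randn(dim)                 # stands in for the inferred z_t
mu_source, mu_target = torch.randn(dim), torch.randn(dim)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    z_transfer = transfer(z_source, mu_source, mu_target, alpha)
    # [z_p, z_transfer] is then passed to the decoder.
```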
Note that, in practice, we do not need labels of the source instrument and pitch for timbre transfer, as the two latent variables are automatically inferred by $q(\mathbf{z}_p | \mathbf{X})$ and $q(\mathbf{z}_t | \mathbf{X})$, respectively. $q(\mathbf{y}_{t} | \mathbf{X})$ infers the mixture component (the source instrument identity) to which $\mathbf{X}$ belongs, and $\mathbf{\mu}_{source \rightarrow target}$ is then obtained by subtracting the mean of the source mixture component from that of the target.
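In code, this label-free selection of the transfer direction might look as follows; the softmax over random logits is merely a stand-in for the actual posterior produced by $q(\mathbf{y}_t | \mathbf{X})$, and the sizes and target index are illustrative:

```python
import torch

K, dim = 13, 64                             # instrument components / latent size
mu_t = torch.randn(K, dim)                  # learned timbre-component means
posterior = torch.softmax(torch.randn(K), dim=0)  # stand-in for q(y_t | X)

source = int(torch.argmax(posterior))       # inferred source instrument
target = 5                                  # chosen target instrument index
mu_source_to_target = mu_t[target] - mu_t[source]
```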
Following Fig. 5 in the paper, we demonstrate $\texttt{Fhn} \rightarrow \texttt{Pno}$, $\texttt{Pno} \rightarrow \texttt{Vc}$, $\texttt{Vc} \rightarrow \texttt{Bn}$, and $\texttt{Bn} \rightarrow \texttt{Fhn}$.
The source instrument is gradually changed into the target instrument as $\alpha$ increases through $\{0, 0.25, 0.5, 0.75, 1.0\}$.