We present the audio samples corresponding to Fig. 2 in the paper.
All audio files are reconstructed from Mel-spectrograms using the Griffin-Lim algorithm.
The audio labeled "original Mel-spectrogram" therefore serves as an upper bound on audio quality.
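For readers who want to reproduce this kind of inversion, the following is a minimal NumPy sketch of Mel-spectrogram inversion with Griffin-Lim. It is an illustration under stated assumptions, not the code used for the samples: the sample rate, FFT size, hop length, number of Mel bands, and iteration count are placeholder values, the function names are hypothetical, and the Mel-to-linear step uses a clipped pseudo-inverse as a simplification.

```python
import numpy as np

def stft(y, n_fft, hop):
    """Centered STFT with a periodic Hann window; returns (freq, frames)."""
    window = np.hanning(n_fft + 1)[:-1]
    y = np.pad(y, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1).T

def istft(S, n_fft, hop):
    """Inverse STFT by windowed overlap-add, normalized by the summed window energy."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(S.T, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[n_fft // 2:len(out) - n_fft // 2]  # undo the centering pad

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters (HTK-style Mel formula); shape (n_mels, n_fft//2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(up, down))
    return fb

def griffin_lim(mag, n_fft, hop, n_iter=32, seed=0):
    """Estimate a phase for the magnitude spectrogram `mag` by iterative projection."""
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        y = istft(mag * angles, n_fft, hop)
        rebuilt = stft(y, n_fft, hop)
        # Keep the phase of the re-analyzed signal, discard its magnitude.
        angles = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(mag * angles, n_fft, hop)

def mel_to_audio(mel, sr, n_fft, hop, n_iter=32):
    """Invert a magnitude Mel-spectrogram (assumption: magnitude, not power or log)."""
    fb = mel_filterbank(sr, n_fft, mel.shape[0])
    # Simplification: approximate the linear magnitude with a clipped pseudo-inverse.
    mag = np.maximum(0.0, np.linalg.pinv(fb) @ mel)
    return griffin_lim(mag, n_fft, hop, n_iter)

# Usage on a synthetic 440 Hz tone (all parameter values are placeholders).
sr, n_fft, hop, n_mels = 16000, 512, 128, 80
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)
mel = mel_filterbank(sr, n_fft, n_mels) @ np.abs(stft(y, n_fft, hop))
y_hat = mel_to_audio(mel, sr, n_fft, hop)
```

Because the phase is estimated rather than recovered, and the Mel projection discards spectral detail, even a ground-truth Mel-spectrogram does not invert to the original waveform exactly. This is why the inverted original acts as a ceiling for the other samples.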
Note that different singers may express a vocal technique differently, and may also vary their level of expression over time. This makes vocal techniques ambiguous to define and poses challenges for data-driven models.