Music Machine Learning


See also Performance Style Classification and Music Understanding.



Wu, He, Liu, Wang, and Dannenberg. “Transplayer: Timbre Style Transfer with Flexible Timbre Control,” in Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, June 2023.

Abstract: Music timbre style transfer aims at replacing the instrument timbre in a solo recording with another instrument, while preserving the musical content. Existing GAN-based methods can only achieve timbre style transfer between two given timbres. Inspired by the practice in voice conversion, we propose TransPlayer, which uses an autoencoder model with one-hot representations of instruments as the condition, and a Diffwave model trained especially for music synthesis. We evaluate our model in both the one-to-one transfer task and the many-to-many transfer task. The results prove that our method is able to provide one-to-one style transfer outputs comparable with the existing GAN-based method, and can transfer among multiple timbres with only one single model.

[Adobe Acrobat (PDF) Version]
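The conditioning idea in this abstract, a single autoencoder whose decoder is steered by a one-hot instrument label, can be illustrated with a small PyTorch sketch. This is a toy model built on assumptions, not the TransPlayer architecture: the layer types, sizes, and four-instrument setup are invented for illustration, and a separate vocoder (DiffWave in the paper) would still be needed to turn predicted mel frames into audio.

```python
# Toy sketch of one-hot instrument conditioning for timbre transfer.
# Not the TransPlayer architecture: layers, sizes, and the 4-instrument
# setup are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedAutoencoder(nn.Module):
    def __init__(self, n_mels=80, n_instruments=4, hidden=256):
        super().__init__()
        self.n_instruments = n_instruments
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # The decoder sees the content code concatenated with the one-hot label.
        self.decoder = nn.GRU(hidden + n_instruments, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, mel, instrument_id):
        # mel: (batch, frames, n_mels); instrument_id: (batch,) integer labels
        content, _ = self.encoder(mel)
        onehot = F.one_hot(instrument_id, self.n_instruments).float()
        onehot = onehot.unsqueeze(1).expand(-1, content.size(1), -1)
        decoded, _ = self.decoder(torch.cat([content, onehot], dim=-1))
        return self.out(decoded)               # mel frames for the target instrument

model = ConditionedAutoencoder()
mel = torch.randn(2, 100, 80)                  # stand-in batch of mel frames
converted = model(mel, torch.tensor([0, 3]))   # decode with two different target timbres
print(converted.shape)                         # torch.Size([2, 100, 80])
```

Decoding the same content code with a different instrument index is what lets one conditioned model handle many-to-many transfer, rather than training a separate GAN for each timbre pair.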


Dai, Chen, Wu, Huang, and Dannenberg. “SingStyle111: A Multilingual Singing Dataset With Style Transfer,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy, Nov 2023.

Abstract: There has been a persistent lack of publicly accessible data in singing voice research, particularly concerning the diversity of languages and performance styles. In this paper, we introduce SingStyle111, a large studio-quality singing dataset with multiple languages and different singing styles, and present singing style transfer examples. The dataset features 111 songs performed by eight professional singers, spanning 12.8 hours and covering English, Chinese, and Italian. SingStyle111 incorporates different singing styles, such as bel canto opera, Chinese folk singing, pop, jazz, and children. Specifically, 80 songs include at least two distinct singing styles performed by the same singer. All recordings were conducted in professional studios, yielding clean, dry vocal tracks in mono format with a 44.1 kHz sample rate. We have segmented the singing voices into phrases, providing lyrics, performance MIDI, and scores with phoneme-level alignment. We also extracted acoustic features such as Mel-Spectrogram, F0 contour, and loudness curves. This dataset applies to various MIR tasks such as Singing Voice Synthesis, Singing Voice Conversion, Singing Transcription, Score Following, and Lyrics Detection. It is also designed for Singing Style Transfer, including both performance and voice timbre style. We make the dataset freely available for research purposes. Examples and download information can be found at https://shuqid.net/singstyle111.

[Adobe Acrobat (PDF) Version]
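As a concrete illustration of the acoustic features mentioned above (Mel-spectrogram, F0 contour, loudness curve), here is a minimal librosa sketch. The analysis parameters are assumptions rather than the dataset's actual extraction settings, and a synthetic tone stands in for a recorded phrase.

```python
# Sketch: computing mel-spectrogram, F0 contour, and loudness (RMS) curves of
# the kind SingStyle111 ships with. Parameters are assumptions, not the
# dataset's own settings; a synthetic tone stands in for a phrase
# (replace with librosa.load("<phrase>.wav", sr=44100) for real data).
import numpy as np
import librosa

sr = 44100
y = librosa.tone(440.0, sr=sr, duration=2.0)

# Log-mel spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=80)
log_mel = librosa.power_to_db(mel)

# F0 contour via pYIN (NaN where the frame is unvoiced)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr, hop_length=512,
    fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"))

# Loudness curve approximated by frame-wise RMS in dB
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

print(log_mel.shape, f0.shape, loudness_db.shape)
```

The dataset ships these features precomputed; the sketch is only meant to show what each one is.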


Zhuo, Yuan, Pan, Ma, Li, Zhang, Liu, Dannenberg, Fu, Lin, Benetos, Chen, Xue, and Guo. “LyricWhiz: Robust Multilingual Lyrics Transcription by Whispering to ChatGPT,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy, Nov 2023.

Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today’s most performant chat-based large language model. In the proposed method, Whisper functions as the “ear” by transcribing the audio, while GPT-4 serves as the “brain,” acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.

[Adobe Acrobat (PDF) Version]
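The “ear plus brain” split described in the abstract can be sketched with the public openai-whisper package and the OpenAI API. This is a simplified stand-in, not the paper's pipeline: the prompt, the three-temperature candidate set, the model names, and the file name are all assumptions, and the real system's selection and post-processing are more elaborate.

```python
# Sketch of a Whisper + GPT-4 lyrics pipeline in the spirit of LyricWhiz.
# The prompt and selection strategy are simplified assumptions, not the
# paper's actual prompts or post-processing; "song.wav" is a placeholder.
import whisper                      # openai-whisper package
from openai import OpenAI

asr = whisper.load_model("large")
# Several decoding passes give the language model candidate transcripts to choose among.
candidates = [asr.transcribe("song.wav", temperature=t)["text"] for t in (0.0, 0.2, 0.4)]

client = OpenAI()                   # expects OPENAI_API_KEY in the environment
prompt = ("These are candidate lyrics transcriptions of one song. "
          "Return the most plausible, corrected lyrics.\n\n"
          + "\n---\n".join(candidates))
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```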


Ma, Yuan, Li, Zhang, Lin, Chen, Ragni, Yin, Benetos, Gyenge, Liu, Xia, Dannenberg, Guo, and Fu. “On the Effectiveness of Speech Self-Supervised Learning for Music,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy, Nov 2023.

Abstract: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec 2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaptation of SSL with two distinctive speech-related models, data2vec 1.0 and HuBERT, and refer to them as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate the MIR task performances with 13 different MIR tasks. Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech. However, we identify the limitations of such existing speech-oriented designs, especially in modelling polyphonic information. Based on the experimental results, empirical suggestions are also given for designing future musical SSL strategies and paradigms.

[Adobe Acrobat (PDF) Version]
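A common recipe for this kind of comparison, shown below under assumptions (it is not the paper's exact protocol), is to freeze a pretrained speech SSL backbone, pool its frame-level features, and train only a shallow probe per MIR task. The HuBERT checkpoint and the 10-class probe are illustrative placeholders.

```python
# Sketch of the "frozen SSL features + shallow probe" recipe often used to
# compare speech SSL models on MIR tasks. The checkpoint and probe are
# illustrative; this is not the paper's exact evaluation protocol.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
backbone = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

audio = torch.randn(16000 * 5)                     # stand-in for 5 s of 16 kHz audio
inputs = extractor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = backbone(**inputs).last_hidden_state  # (1, frames, 768), backbone frozen

clip_embedding = frames.mean(dim=1)                # time-pool to one vector per clip
probe = nn.Linear(clip_embedding.size(-1), 10)     # e.g. a 10-class tagging probe
logits = probe(clip_embedding)                     # only the probe is trained downstream
```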


Li, Yuan, Zhang, Ma, Chen, Yin, Xiao, Lin, Ragni, Benetos, Gyenge, Dannenberg, Liu, Chen, Xia, Shi, Huang, Wang, Guo, and Fu. “MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training,” in The Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, May 2024.

Abstract: Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.

[Adobe Acrobat (PDF) Version]
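To make the “musical teacher” idea concrete: the Constant-Q Transform gives per-frame, pitch-aligned targets that a masked student can be trained to predict. The sketch below derives such frame-level CQT targets and masks a random subset of frames; the parameters and masking rate are assumptions, not MERT's configuration, and the RVQ-VAE acoustic teacher is omitted.

```python
# Sketch of deriving frame-level Constant-Q Transform targets of the kind a
# "musical teacher" could supply during masked pre-training. Parameters and
# the masking scheme are assumptions, not MERT's actual configuration.
import numpy as np
import librosa

sr = 24000
y = librosa.tone(440.0, sr=sr, duration=2.0)                 # stand-in audio clip
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))
targets = librosa.amplitude_to_db(cqt).T                     # (frames, 84) per-frame targets

# MLM-style setup: hide a random subset of frames; the student must predict
# the teacher's targets at exactly those masked positions.
rng = np.random.default_rng(0)
mask = rng.random(targets.shape[0]) < 0.3                    # ~30% of frames masked
masked_targets = targets[mask]
print(targets.shape, masked_targets.shape)
```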