Institutional Repository [SANDBOX]
Technical University of Crete

Music generation using deep learning

Sotiropoulou Ileanna



URI: http://purl.tuc.gr/dl/dias/4EF1404E-A438-4147-8B51-79CA144C6D19
Year 2021
Type of Item Diploma Work
Bibliographic Citation Ileanna Sotiropoulou, "Music generation using deep learning", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2021 https://doi.org/10.26233/heallink.tuc.90298
Summary

Machine learning, and specifically deep learning, has been applied to complex signal processing problems with remarkable results. Recent breakthroughs in audio synthesis involve end-to-end deep neural networks that model speech directly in the audio domain. WaveNet is one such model and is currently considered state-of-the-art in speech synthesis. In this thesis, we investigate the use of WaveNet and WaveRNN as vocoders for music synthesis. Furthermore, we investigate WaveNet's potential to capture emotive patterns and create emotional music. Prior to choosing an optimal set of parameters for each model, it was critical to consider the spectral and structural distinctions between speech and music signals. For the vocoders, we employed mel spectrograms as temporally local conditioning labels for audio reconstruction. The mood-conditional network received no structural guidance and was instead left to generate original audio, conditioned only on a specific mood tag. The models were trained intensively for a minimum of 9 days, with the WaveNet vocoder converging after 19 days. Synthesized waveforms were evaluated subjectively by human judges, as well as objectively with the PESQ algorithm. Additionally, the respondents were asked to evaluate the mood-conditional samples by guessing the mood of each track. While WaveRNN ultimately proved unfit for the nature of our problem, WaveNet-reconstructed waveforms are extraordinarily similar to the originals, with their 5-point Mean Opinion Scores exceeding 4.0 in both subjective and objective evaluation. Remarkably, the majority of respondents also correctly identified the moods of all four tracks. This result leads us to anticipate that, with additional training, WaveNet will be able to respond to emotional cues and automatically create music that is clearly influenced by the range of human emotions.
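The abstract mentions that mel spectrograms served as local conditioning labels for the vocoders. As a rough illustration of what such features look like, the following is a minimal pure-NumPy sketch of a log-mel spectrogram; the specific parameter values (16 kHz sample rate, 1024-sample FFT, 256-sample hop, 80 mel bands) are common defaults assumed for illustration, not taken from the thesis.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80):
    """Triangular filters mapping an FFT power spectrum to n_mels mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):            # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(x, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Log-mel spectrogram of a mono signal x, one column per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + max(len(x) - n_fft, 0) // hop
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + n_fft] * window
        spec[:, t] = np.abs(np.fft.rfft(frame)) ** 2   # power spectrum
    fb = mel_filterbank(sr, n_fft, n_mels)
    return np.log(fb @ spec + 1e-10)                   # shape (n_mels, n_frames)
```

For a one-second 440 Hz sine at 16 kHz, `mel_spectrogram(x)` yields an (80, 59) matrix; in a vocoder setup, each column would condition the samples generated for the corresponding audio frame.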
