Audio Source Separation

Description

  • Dema Ushchapovskyy, Suraj Tirupati

  • June, 2019

Built a system which extracts vocal information from songs.

The approach was to break the songs and their isolated vocal tracks into their frequency spectra by taking short-time Fourier transforms, then train CNN auto-encoders to predict the vocal spectrogram from the full song spectrogram.
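A minimal sketch of this kind of pipeline is shown below, assuming librosa for the STFT and PyTorch for the auto-encoder; the layer sizes, FFT parameters, and function names are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of an STFT + CNN auto-encoder pipeline (illustrative, not the project's code).
import librosa
import numpy as np
import torch
import torch.nn as nn

def magnitude_spectrogram(path, n_fft=1024, hop_length=256):
    """Load audio and return the magnitude of its short-time Fourier transform."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)  # shape: (1 + n_fft // 2, n_frames)

class VocalAutoEncoder(nn.Module):
    """Convolutional encoder-decoder mapping a mixture spectrogram to a vocal spectrogram."""
    def __init__(self):
        super().__init__()
        # Spectrogram patches are assumed cropped/padded so both dimensions are divisible by 4.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
        )

    def forward(self, mix_spec):
        # mix_spec: (batch, 1, freq_bins, time_frames)
        return self.decoder(self.encoder(mix_spec))
```

Training would then minimize a reconstruction loss (e.g. mean squared error) between the predicted and the true vocal spectrogram, and the extracted vocal can be resynthesized by pairing the predicted magnitudes with the mixture's phase and taking an inverse STFT.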

The biggest issue was obtaining the data, because training required a cappella tracks that were perfectly aligned with their respective songs. After finding a dataset of about 100 songs, the final result was an SNR of 0.7 dB on the extracted vocals. An interesting observation was that the SNR per song depended heavily on the genre: genres like pop, where songs are dominated by loud, clear vocals, achieved higher SNRs than something like dance music, where the vocals sit further back in the mix.
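For reference, a common way to compute this kind of SNR is shown below; this is an assumed definition, as the project's exact evaluation code is not given.

```python
# Hedged sketch of a signal-to-noise ratio metric over aligned time-domain signals.
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between the reference a cappella and the extracted vocal estimate."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```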

Technology