KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES

Authors: Anurag Kumar, Maksim Khadkevich, Christian Fügen


Audioset Dataset

Audioset is a large-scale, weakly labeled [2] dataset for sound events. It contains a total of 527 sound events, for which labeled videos from YouTube are provided. The maximum duration of the recordings is 10 seconds, and a large portion of the example recordings are exactly 10 seconds long. However, a considerable number of recordings are shorter.


Audioset Results

As shown in the paper, the proposed weak label CNN approach (\(\mathcal{N}_S\)) outperforms a network trained under the strong label assumption (\(\mathcal{N}_S^{slat}\)). Moreover, \(\mathcal{N}_S\) works smoothly with recordings of variable lengths and is over 30% more computationally efficient during both training and testing (see the paper for the comparison).
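To illustrate how a weak-label model can accept recordings of different lengths, here is a minimal sketch. It assumes the network emits one score vector per segment and that these are pooled over time into a single recording-level prediction. The function name `pool_segment_scores`, the pooling modes, and the toy numbers are all hypothetical and are not taken from the paper; the actual layer used in \(\mathcal{N}_S\) may differ.

```python
from typing import List


def pool_segment_scores(segment_scores: List[List[float]], mode: str = "mean") -> List[float]:
    """Collapse per-segment class scores (T x C) into one score per class.

    Hypothetical helper: T (number of segments) may differ between
    recordings, while the output always has C entries.
    """
    if not segment_scores:
        raise ValueError("need at least one segment")
    num_classes = len(segment_scores[0])
    # Gather the scores of each class across all segments.
    columns = [[seg[c] for seg in segment_scores] for c in range(num_classes)]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Toy example: a short clip (2 segments) and a longer clip (4 segments),
# each with made-up scores for 2 classes.
short_clip = [[0.9, 0.1], [0.7, 0.3]]
long_clip = [[0.2, 0.1], [0.8, 0.0], [0.5, 0.2], [0.1, 0.1]]

print(pool_segment_scores(short_clip, mode="mean"))
print(pool_segment_scores(long_clip, mode="max"))
```

Both clips yield a prediction of the same size regardless of duration, so in this sketch the weak (recording-level) label can be compared directly with the pooled output without assigning a label to every segment.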

Here we provide additional results and analysis, including some analysis of the temporal localization of events within the recordings.