KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES
Authors: Anurag Kumar, Maksim Khadkevich, Christian Fügen
Audioset Dataset
Audioset is a large-scale weakly labeled [2] dataset for sound events. It contains a total of 527 sound events, for which labeled videos from YouTube are provided. The maximum duration of a recording is 10 seconds, and a large portion of the example recordings are of the full 10-second duration; however, a considerable number of recordings are shorter.
In this paper, we worked with the balanced train set for training the models and the Eval set for evaluation.
The balanced train set provides at least 59 examples for each sound event and contains a total of around 22,000 recordings.
The Eval set is the full evaluation set of Audioset. It consists of a total of around 20,000 example recordings, again with at least 59 examples per class.
Audioset is a multi-label dataset: on average, each example recording contains 2.7 classes [1].
Due to the multi-label nature of the recordings, the actual number of examples for several classes is higher. The class-wise distribution of labels for both the balanced train and Eval sets is shown in the figures below.
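The per-class label counts above can be computed directly from the AudioSet segment CSVs. A minimal sketch, assuming the `positive_labels` field holds comma-separated class IDs as in the released CSVs (the toy rows below are illustrative, not real AudioSet entries):

```python
from collections import Counter

def class_label_counts(positive_label_fields):
    """Count how many recordings carry each class label.

    Each entry of `positive_label_fields` is one recording's
    positive_labels string, e.g. "/m/09x0r,/m/05zppz". Because the
    dataset is multi-label, the counts summed over classes exceed
    the number of recordings.
    """
    counts = Counter()
    for field in positive_label_fields:
        for label in field.split(","):
            counts[label.strip()] += 1
    return counts

# Toy example: 3 recordings, 5 labels in total (multi-label)
rows = ["/m/09x0r,/m/05zppz", "/m/09x0r", "/m/05zppz,/m/0jbk"]
counts = class_label_counts(rows)
print(counts["/m/09x0r"], sum(counts.values()))  # 2 5
```

Summing the counts over classes (5) versus the number of recordings (3) illustrates why the effective number of examples per class is higher than the recording count suggests.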
Audioset Results
As shown in the paper, the proposed weak-label CNN approach (\(\mathcal{N}_S\)) outperforms a network trained under the strong label assumption (\(\mathcal{N}_S^{slat}\)). Moreover, \(\mathcal{N}_S\) works smoothly with recordings of variable length and is computationally more efficient by over 30% during both training and testing (see the paper for the comparison).
Here we provide additional results and analysis, including some analysis of temporal localization of events within recordings.
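The core idea that lets \(\mathcal{N}_S\) handle weak labels and variable-length recordings is pooling segment-level outputs into a single recording-level score per class. A minimal sketch with max pooling (the paper details its actual mapping function; the function and names here are illustrative):

```python
def recording_level_scores(segment_scores, pooling="max"):
    """Pool per-segment class scores into recording-level scores.

    segment_scores: list of per-segment score lists, shaped
    (num_segments, num_classes). Under weak labels only the
    recording-level label is known, so segment outputs are pooled;
    a variable-length clip simply yields a different number of
    segments, which is why the approach handles any duration.
    """
    pool = max if pooling == "max" else (lambda xs: sum(xs) / len(xs))
    num_classes = len(segment_scores[0])
    return [pool([seg[c] for seg in segment_scores])
            for c in range(num_classes)]

# A clip split into 4 segments, scored for 3 classes
seg = [[0.1, 0.7, 0.2],
       [0.9, 0.3, 0.1],
       [0.2, 0.4, 0.3],
       [0.1, 0.2, 0.8]]
print(recording_level_scores(seg))  # [0.9, 0.7, 0.8]
```

Max pooling also makes the later localization analysis possible: the segment responsible for the recording-level score indicates where the event occurs.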
Mean AUC improves by an absolute 0.012 (from 0.915 to 0.927, 1.3% relative).
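The mean AUC reported here is the macro average of per-class AUCs. A self-contained sketch using the Mann-Whitney formulation (AUC as the probability that a random positive is scored above a random negative); the toy scores below are illustrative:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(score_matrix, label_matrix):
    """Macro-average AUC over classes (columns)."""
    n_classes = len(score_matrix[0])
    per_class = [auc([row[c] for row in score_matrix],
                     [row[c] for row in label_matrix])
                 for c in range(n_classes)]
    return sum(per_class) / n_classes

# Two classes, four recordings: class 0 ranked perfectly (AUC 1.0),
# class 1 with one inversion (AUC 0.5) -> mean AUC 0.75
scores = [[0.9, 0.4], [0.8, 0.9], [0.3, 0.6], [0.2, 0.1]]
labels = [[1, 1], [1, 0], [0, 1], [0, 0]]
print(mean_auc(scores, labels))  # 0.75
```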
Each figure shows the comparison for 50 sound events. The order of appearance is available here: the first 50 events are in AP1, the next 50 in AP2, and so on.
Localization of Sound Events: Our proposed network can also perform temporal localization of sound events: the network output at each segment can be used to localize events within the recording. Some examples of these localizations are provided here. Note that Audioset does not provide time stamps for sound events, so we cannot produce quantitative results on sound event localization.
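Reading a localization off the segment-level outputs amounts to finding the time spans where a class's segment score is high. A simple sketch, assuming fixed segment duration and hop (the threshold and timing parameters here are illustrative, not the paper's):

```python
def localize_event(segment_scores, segment_dur, hop, threshold=0.5):
    """Return (start, end) intervals where a class's segment-level
    score meets a threshold -- a simple way to turn the network's
    per-segment outputs into event time stamps."""
    events, active_start = [], None
    for i, s in enumerate(segment_scores):
        t = i * hop
        if s >= threshold and active_start is None:
            active_start = t               # event onset
        elif s < threshold and active_start is not None:
            # event ended at the previous segment
            events.append((active_start, (i - 1) * hop + segment_dur))
            active_start = None
    if active_start is not None:           # event runs to clip end
        n = len(segment_scores)
        events.append((active_start, (n - 1) * hop + segment_dur))
    return events

# 1 s segments with 0.5 s hop; per-segment scores for one class
scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.1]
print(localize_event(scores, segment_dur=1.0, hop=0.5))  # [(1.0, 2.5)]
```

Without ground-truth time stamps in Audioset, such intervals can only be inspected qualitatively, as in the examples linked above.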