KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES. pdf
Authors: Anurag Kumar, Maksim Khadkevich, Christian Fügen
ESC-50 Dataset
ESC-50 [4] is a sound event dataset. It consists of a total of 50 sound events. The list of sound events can be found here.
The dataset consists of a total of 2,000 recordings, each 5 seconds long.
It comes pre-divided into 5 folds.
The training set consists of 4 of the 5 folds, and the remaining fold is used for testing. This is repeated with each of the 5 folds held out in turn, and the average accuracy is reported.
The training set is used for network adaptation as well as for training linear SVMs.
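The fold protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `five_fold_splits` and `cross_validated_accuracy`, the fold numbering 1 to 5, and the `train_and_score` callback (which would stand in for network adaptation or linear SVM training followed by scoring) are all assumptions made for the example.

```python
from statistics import mean


def five_fold_splits(fold_ids, n_folds=5):
    """Yield (held_out_fold, train_indices, test_indices) for each fold.

    fold_ids[i] is the fold number (assumed to run from 1 to n_folds)
    of recording i.
    """
    for held_out in range(1, n_folds + 1):
        train_idx = [i for i, f in enumerate(fold_ids) if f != held_out]
        test_idx = [i for i, f in enumerate(fold_ids) if f == held_out]
        yield held_out, train_idx, test_idx


def cross_validated_accuracy(fold_ids, train_and_score, n_folds=5):
    """Average the accuracy returned by train_and_score over all folds.

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    trains on the training indices and returns accuracy on the test indices.
    """
    accuracies = [
        train_and_score(train_idx, test_idx)
        for _, train_idx, test_idx in five_fold_splits(fold_ids, n_folds)
    ]
    return mean(accuracies)


# Example layout: 2,000 recordings spread evenly over 5 folds.
example_fold_ids = [(i % 5) + 1 for i in range(2000)]
example_splits = list(five_fold_splits(example_fold_ids))
```

With 2,000 recordings in 5 equal folds, each split trains on 1,600 recordings and tests on the remaining 400.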
ESC-50 Results
A comparison of our proposed method with state-of-the-art methods is shown in the paper.
Our method not only outperforms previous methods by a considerable margin but also outperforms human accuracy on this dataset.
Even with representations obtained directly from \(\mathcal{N}_S\), that is, without any task-adaptive training, we obtain an average accuracy of 82.8%, compared to 81.3% human accuracy on this dataset.
The best accuracy of 83.5% is obtained using F1 representations (with \(max()\) mapping) from \(\mathcal{N}_T^{I}\) and \(\mathcal{N}_T^{II}\).
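The \(max()\) mapping turns the segment-level representations of a recording into a single recording-level representation. The sketch below assumes that the maximum is taken element-wise across segments, and that each segment is a plain list of floats. The function name `max_pool_segments` is hypothetical and not taken from the paper.

```python
def max_pool_segments(segment_features):
    """Map segment-level vectors to one recording-level vector.

    segment_features is a list of equal-length feature vectors, one per
    segment. The result keeps, for each feature dimension, the largest
    value seen in any segment (element-wise max, an assumption here).
    """
    if not segment_features:
        raise ValueError("need at least one segment")
    dim = len(segment_features[0])
    if any(len(segment) != dim for segment in segment_features):
        raise ValueError("all segments must have the same dimension")
    # zip(*...) groups the values of each feature dimension across segments.
    return [max(column) for column in zip(*segment_features)]


# Two segments with three features each.
example_segments = [
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 0.8],
]
example_recording_vector = max_pool_segments(example_segments)
```

The recording-level vector has the same dimension as each segment vector, so recordings with different numbers of segments still produce inputs of a fixed size for a classifier such as a linear SVM.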
Class-wise results - Below we show the confusion matrix for two cases. Class-wise confusion matrices for all cases are available here. The file name identifies the representation used, e.g. esc50.NT_III.F1.max.png means the \(\mathcal{N}_T^{III}\) network, F1 representations, and the \(max()\) function to map segment-level representations to full-recording-level representations. The figure files here might be visually more pleasing. All numbers have been rounded to 2 decimal places.
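For readers who want to build a comparable class-wise matrix from their own predictions, here is a minimal sketch. It assumes that each row is normalized by the number of recordings of the true class, which the page does not state explicitly, and it rounds entries to 2 decimal places as above. The function name and the integer label encoding are illustrative.

```python
def normalized_confusion_matrix(true_labels, predicted_labels, n_classes):
    """Return a row-normalized confusion matrix rounded to 2 decimals.

    Entry [t][p] is the fraction of recordings with true class t that
    were predicted as class p. Labels are integers in [0, n_classes).
    Row normalization is an assumption made for this example.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Classes with no test recordings get an all-zero row.
        matrix.append([round(c / total, 2) if total else 0.0 for c in row])
    return matrix


# Toy example with 3 classes (ESC-50 itself has 50).
example_true = [0, 0, 0, 1, 1, 2]
example_pred = [0, 0, 1, 1, 1, 0]
example_matrix = normalized_confusion_matrix(example_true, example_pred, 3)
```

The diagonal entries give the per-class accuracy, and the off-diagonal entries show which classes are confused with each other.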