
UNIVERSITÉ DE SHERBROOKE
Faculty of Engineering

Department of Electrical and Computer Engineering

Neural System for Answering Auditory Scene Understanding Questions

Master's Thesis
Specialty: Electrical Engineering

Jérôme Abdelnour

Sherbrooke (Québec) Canada

May 2021

JURY MEMBERS

Jean Rouat, Supervisor

Giampiero Salvi, Examiner

Éric Plourde, Examiner

ABSTRACT

This project introduces the Acoustic Question Answering (AQA) task, in which an intelligent agent must answer a question about the content of an auditory scene. First, a dataset (CLEAR) comprising auditory scenes and question-answer pairs for each of them is built to enable the training of neural-network-based systems. Since this task is analogous to Visual Question Answering (VQA), a preliminary study is carried out using a neural network (FiLM) initially developed for the VQA task. The auditory scenes are first transformed into a spectro-temporal representation so that the FiLM network can process them as images. This study aims to quantify the performance of a system initially designed for visual scenes in an acoustic context. In the same vein, the effectiveness of the visual technique of convolutional coordinate maps (CoordConv) is studied when applied in an acoustic context. Finally, a new neural network adapted to the acoustic context (NAAQA) is introduced. NAAQA achieves better performance than FiLM on the CLEAR dataset while being about 7 times less complex.

Keywords: Neural network, AQA, VQA, Question answering, Convolution, CoordConv, FiLM, Acoustics

TABLE OF CONTENTS

1 INTRODUCTION

2 LITERATURE REVIEW
  2.1 Neural networks
    2.1.1 Perceptron
    2.1.2 Multilayer perceptron
    2.1.3 Learning
    2.1.4 Convolutional networks
    2.1.5 Pooling
    2.1.6 ResNets
  2.2 Visual Question Answering and Feature-wise Linear Modulation (FiLM)
  2.3 Coordinate maps
  2.4 Convolutional networks applied to audio
    2.4.1 Convolutional filters
    2.4.2 Spectro-temporal representation
    2.4.3 Audio networks

3 DATASET AND PRELIMINARY RESULTS
  3.1 CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning
    3.1.1 Abstract
  3.2 Introduction and Related Work
  3.3 Dataset
    3.3.1 Scenes and Elementary Sounds
    3.3.2 Questions
  3.4 Preliminary Experiments
  3.5 Conclusion
    3.5.1 Limitations and Future Directions
  3.6 Acknowledgements
  3.7 APPENDIX: Statistics on the Data Set

4 ACOUSTIC NEURAL NETWORK
  4.1 NAAQA: A Neural Architecture for Acoustic Question Answering
  4.2 Abstract
  4.3 Introduction
  4.4 Related Work
    4.4.1 Text-Based Question Answering
    4.4.2 Visual Question Answering (VQA)
    4.4.3 Acoustic Question Answering (AQA)
    4.4.4 Convolutional neural network on Audio
  4.5 Variable scene duration dataset
  4.6 Method
    4.6.1 Baseline model: Visual FiLM
    4.6.2 NAAQA: Adapting feature extractors to acoustic inputs
    4.6.3 Coordinate maps with acoustic inputs
  4.7 Experiments & Results
    4.7.1 Acoustic Processing
    4.7.2 Experimental conditions
    4.7.3 Initial configuration
    4.7.4 Feature Extraction
    4.7.5 Coordinate Maps
    4.7.6 Complexity reduction
    4.7.7 Impact of dataset composition
  4.8 Discussion
    4.8.1 NAAQA detailed analysis
    4.8.2 Importance of text versus audio modality
    4.8.3 Variability of the input size
  4.9 Conclusions
  4.10 APPENDIX: Network Optimization

5 CONCLUSION
  5.1 Summary
  5.2 Review of the contributions
  5.3 Future work

LIST OF REFERENCES

LIST OF FIGURES

2.1 Perceptron
2.2 Multilayer perceptron
2.3 Convolutional network. The pooling operation is defined in Section 2.1.5
2.4 Receptive fields of convolutional filters. Inspired by [61].
2.5 Example of pooling
2.6 Comparison between convolutional blocks
2.7 Diagram of the FiLM VQA system
2.8 Example of a coordinate regression task. Inspired by [63].
2.9 Coordinate maps
2.10 Different shapes of convolutional filters. Inspired by [78].
2.11 Example of a spectrogram.

3.1 Example of an acoustic scene
3.2 Distribution of answers in the dataset by set type. The color represents the answer category.
3.3 Distribution of question types. The color represents the set type.
3.4 Distribution of template types. The same templates are used to generate the questions and answers for the training, validation and test set.
3.5 Distribution of sound attributes in the scenes

4.1 Overview of the CLEAR dataset generation process
4.2 Common Architecture
4.3 Acoustic feature extraction
4.4 Test accuracy for NAAQA trained on different dataset sizes
4.5 Test accuracy of NAAQA final configuration by question type and the number of relations in the question

LIST OF TABLES

2.1 Architecture of the first processing stage of the image module of the FiLM system
2.2 Architecture of each residual block
2.3 Question types in the CLEVR dataset [48]

3.1 Types of questions with examples and possible answers.

4.1 Types of questions with examples and possible answers
4.2 Dataset statistics
4.3 Impact of different feature extractors
4.4 Impact of the placement of Time and Frequency coordinate maps
4.5 Comparison of Time and Frequency coordinate maps
4.6 Impact of the number of GRU text-processing units G
4.7 Impact of the number of filters C and hidden units H in the classifier
4.8 Impact of reducing the number of Resblocks J and the number of filters M in each Resblock
4.9 Impact of the number of blocks K in the parallel feature extractor and its projection size P

LIST OF ACRONYMS

| Acronym | Definition |
|---|---|
| AQA | Acoustic Question Answering |
| BN | Batch Normalisation |
| CLEAR | Dataset for Compositional Language and Elementary Acoustic Reasoning |
| CLEVR | Dataset for Compositional Language and Elementary Visual Reasoning |
| CNN | Convolutional Neural Network |
| Conv | Convolution |
| CQT | Constant-Q transform |
| dB | Decibel |
| FiLM | Feature-wise Linear Modulation |
| GRU | Gated Recurrent Unit |
| LUFS | Loudness Unit Full Scale |
| MFCC | Mel-frequency cepstral coefficients |
| MLP | Multi-Layer Perceptron |
| NAAQA | Neural Architecture for Acoustic Question Answering |
| QA | Question Answering |
| ReLU | Rectified Linear Unit |
| STFT | Short-Time Fourier transform |
| t-SNE | t-distributed Stochastic Neighbor Embedding |
| VQA | Visual Question Answering |

CHAPTER 1

INTRODUCTION

The NECOTIS research group [85] of the Faculty of Engineering at the Université de Sherbrooke is the coordinating laboratory of the Interactive Grounded Language Understanding project (IGLU [86]), funded by CHIST-ERA [99]. As part of this project, a "game" named GuessWhat?! [25] was developed. It is a two-player game in which each player has a specific role. The first player takes the role of the "oracle": it must choose an object in the scene and keep it secret. The second player, the "guesser", must work out which object the oracle chose. To do so, the two players hold a conversation centred on the chosen object: the guesser asks the oracle a series of yes/no questions in order to deduce which object was chosen. Example questions:

– Is it a person?
– Is the object red?
– Etc.

The oracle's task is to answer a question about the content of an image. This task is called Visual Question Answering (VQA) [7]. Several types of solutions to this problem exist in the literature. Perez et al. [76] (members of the IGLU consortium) used convolutional neural networks [58] and recurrent neural networks [34] to propose a solution to the VQA task. They developed a mechanism that modulates the activation maps of a convolutional network as a function of a textual input (the question).

In this thesis, we propose a new task comparable to VQA: Acoustic Question Answering (AQA). In this task, an intelligent agent must answer a question about the content of an auditory scene. We investigate how neural-network techniques initially developed for visual tasks perform when applied in an acoustic context. More specifically, we are interested in the performance of the FiLM architecture [76] on the Acoustic Question Answering task.

This problem can be formulated as a question:

Can a neural-network architecture developed to answer questions about visual scenes be used to answer questions about auditory scenes?

To answer this question, I designed a dataset for the AQA task. A FiLM neural network was then trained on this dataset. Finally, a new architecture adapted to the auditory context of the task was designed and evaluated.

In summary, the original scientific contributions of this research work are:
– A dataset composed of 50,000 acoustic scenes and 4 question-answer pairs per scene, for a total of 200,000 examples.
– An analysis of the performance of the original FiLM system on the Acoustic Question Answering task.
– A neural-network architecture developed specifically for the Acoustic Question Answering task.
– An analysis of the relevance of coordinate maps in an acoustic context.

This document is structured as follows. Chapter 2 reviews the literature related to the objectives of the project. Chapter 3 defines a first version of the CLEAR dataset, together with preliminary results obtained with the FiLM system. The content of that chapter was published at the NeurIPS 2018 conference under the title CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning. Finally, Chapter 4 defines the final version of the CLEAR dataset, a new neural-network architecture adapted to the acoustic context of the AQA task, and an in-depth analysis of the results. That chapter also analyses the use of coordinate maps in an acoustic context. Its content was submitted to the journal IEEE Pattern Analysis and Machine Intelligence under the title NAAQA: A Neural Architecture for Acoustic Question Answering.

CHAPTER 2

LITERATURE REVIEW

In this chapter, we introduce the elementary concepts needed to understand the articles presented in Chapters 3 and 4 of this thesis. A more specialized literature review is carried out in the journal article presented in Chapter 4.

2.1 Neural networks

2.1.1 Perceptron

Figure 2.1 Perceptron (inputs X1, X2, X3, …, Xn; weights w1, w2, w3, …, wn; bias; activation function)

The perceptron [84] is the basic unit of artificial neural networks. It takes a number of inputs x_n and computes a weighted sum using a set of weights w_n. A non-linear transformation, called the activation function, is then applied to the result so that the input space can be separated more effectively. The following equation defines the operation performed by the perceptron:

$$ z = f\left(\sum_{n=1}^{N} w_n \cdot x_n\right) \tag{2.1} $$

where z is the output of the perceptron, x_n is an input, w_n is the weight associated with that input and f is the activation function.

When the perceptron was first developed, the activation function used was the Heaviside function. This function is discontinuous and is defined by the following equation:

$$ f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases} \tag{2.2} $$


Several other activation functions exist, such as the sigmoid function [35], the hyperbolic tangent function [50] and the ReLU function [46]. The most commonly used activation function nowadays is the ReLU. It is defined by the following equation:

$$ f(x) = \max(0, x) \tag{2.3} $$
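The perceptron and the two activation functions above can be sketched in a few lines of Python. This is only an illustration: the weights, the bias value and the AND-gate example are hypothetical, and the bias term follows Figure 2.1 rather than Equation 2.1, which omits it.

```python
def heaviside(x):
    """Heaviside step: 0 for negative inputs, 1 otherwise."""
    return 1.0 if x >= 0 else 0.0


def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def perceptron(inputs, weights, activation, bias=0.0):
    """Weighted sum of the inputs (plus a bias) followed by an activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical weights and bias, picked by hand so that the unit acts as a logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5
and_outputs = [
    perceptron([a, b], and_weights, heaviside, and_bias)
    for a in (0, 1)
    for b in (0, 1)
]

# Same unit structure with a ReLU activation and no bias.
relu_output = perceptron([2.0, -1.0], [0.5, 0.25], relu)
```

With these hand-picked values, only the input (1, 1) produces a positive weighted sum, so the Heaviside unit fires for that input alone.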

2.1.2 Multilayer perceptron

Figure 2.2 Multilayer perceptron (inputs X1, X2, X3, …, XN; input layer, hidden layers, output layer)

Several perceptrons can be assembled, as illustrated in Figure 2.2, in order to model complex expressions. They are organized in layers: the input layer, the hidden layers (in yellow) and the output layer (in green). Each neuron is connected to all the neurons of the previous layer. Each connection is modulated by a weight w_ij, where i is the index of the neuron and j is the index of the layer. The biases and the activation functions are not included in Figure 2.2 in order to simplify the illustration. The number of hidden layers and the number of neurons per layer are hyper-parameters that must be tuned when designing the network. The number of neurons in the output layer depends on the task for which the network is trained. The network illustrated in Figure 2.2 could, for example, be used for an approximation task or a binary classification task since it contains a single output neuron. For a classification with more than 2 classes, the network could contain C output neurons, where C is the number of classes.
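As a minimal sketch of this layered organization, the forward pass of a small fully connected network can be written as follows. The 3-2-1 layer sizes, the weights and the choice of ReLU and sigmoid activations are assumptions made for the example, not values taken from the thesis.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases, activation):
    """Fully connected layer: every neuron sees every output of the previous layer."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed the inputs through each (weights, biases, activation) layer in turn."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


# Hypothetical network: 3 inputs, one hidden layer of 2 neurons, 1 output neuron.
layers = [
    ([[0.5, -0.5, 1.0], [1.0, 1.0, -1.0]], [0.0, 0.0], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
output = mlp_forward([1.0, 2.0, 3.0], layers)
```

The single sigmoid output lies between 0 and 1, which is why one output neuron is enough for binary classification; a C-class problem would use C rows in the last weight matrix.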

2.1.3 Learning

Figure 2.3 Convolutional network (Convolution, Pooling, Convolution, Pooling). The pooling operation is defined in Section 2.1.5.

A neural network can be trained in a supervised or unsupervised manner. Supervised training is the most widespread strategy and the one used in this thesis. An annotated dataset is required for this type of training. It is an iterative process in which the output of the network is compared with the expected value to compute the difference (error), and a correction is applied to the network weights (W) as a function of this error. There are several ways to compute this error; for example, the mean squared error (MSE) criterion or the cross-entropy criterion can be used. The network weights (W) are optimized iteratively in order to minimize the chosen error criterion (L). The algorithm used is gradient descent [87], defined as follows:

$$ W^{e+1} = W^{e} - \mu \left( \frac{\partial L(X, W)}{\partial W} \right) \tag{2.4} $$

where e is the iteration number, X is the input matrix, W is the weight matrix of the network, L is the error criterion and μ is the learning rate governing the weight update. The value of μ is set during the training phase.
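The update rule of Equation 2.4 can be illustrated on a toy problem. The sketch below assumes a model with a single weight w, the MSE criterion and hand-picked data (y = 3x); the learning rate and iteration count are arbitrary choices for the example.

```python
def mse_gradient(w, xs, ys):
    """dL/dw for L = mean((w * x - y)^2)."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, w0=0.0, learning_rate=0.05, iterations=200):
    """Repeatedly apply w <- w - learning_rate * dL/dw (Equation 2.4 with one weight)."""
    w = w0
    for _ in range(iterations):
        w = w - learning_rate * mse_gradient(w, xs, ys)
    return w


# Toy annotated dataset generated by y = 3x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
w_fit = gradient_descent(xs, ys)
```

Starting from w = 0, each iteration shrinks the distance to the optimum w = 3 by a constant factor, so the fitted weight ends up very close to 3.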

The gradient is propagated through the network from the output towards the input. The partial derivative is computed using the chain rule [82]. The local gradients of the layers are multiplied together, so when these local gradients are smaller than 1 in magnitude, the gradient tends towards zero as the number of layers increases. A gradient that tends towards zero for the lower layers means that the weight values barely change during the iterative update (Equation 2.4). Learning then becomes very slow or even impossible. This problem is commonly called the vanishing gradient problem.
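A small numerical sketch makes the effect visible. It assumes a chain of identical sigmoid layers with unit weights, all evaluated at a pre-activation of 0, where the sigmoid derivative reaches its maximum of 0.25; these choices are illustrative only.

```python
import math


def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def chained_gradient(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the local gradients through num_layers identical sigmoid layers."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= weight * sigmoid_derivative(pre_activation)
    return gradient


# Gradient reaching the first layer for networks of increasing depth.
gradients = {depth: chained_gradient(depth) for depth in (1, 5, 20)}
```

Even in this best case for the sigmoid, each layer multiplies the gradient by 0.25, so the gradient reaching the first layer of a 20-layer chain is below 1e-11.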

2.1.4 Convolutional networks

Convolutional networks have dominated image recognition applications since the AlexNet network [54] won the ImageNet Large Scale Visual Recognition Challenge [47] in 2012. They were nevertheless introduced well before, in 1998, by LeCun et al. [56]. Convolutional networks are built hierarchically [58]: the lower layers capture primitive components and the upper layers combine the components of the previous layer in order to learn increasingly complex concepts. For example, in a network that recognizes chairs, the first layer could capture lines with different orientations, then the second layer assembles these lines to recognize the legs, the seat and the back. Finally, the top layer assembles the leg, seat and back components to recognize the chair. The output of a convolutional layer is called an activation map. The number of activation maps depends on the number of convolutional filters in the layer. Convolutional layers are generally interleaved with pooling layers in order to reduce the size of the activation maps (thereby contributing to the hierarchical process). Figure 2.4 illustrates the receptive field of a neuron in different layers. The pooling operation is detailed in Section 2.1.5. The last layer of a convolutional network is most often a multilayer perceptron. In a network performing classification, this layer learns the relations between the complex components and the different classes to recognize.

Figure 2.4 Receptive fields of convolutional filters. Inspired by [61].
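To make the notion of activation map concrete, the sketch below slides one filter over a tiny image. The 4 x 4 image and the vertical-edge filter are hypothetical, and the operation is written without padding, with a stride of 1 and without flipping the kernel (the cross-correlation form commonly used by deep learning libraries).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the activation map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kernel_h)
                for dj in range(kernel_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4 x 4 image containing a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
vertical_edge_filter = [[-1, 1], [-1, 1]]
activation_map = conv2d_valid(image, vertical_edge_filter)
```

The resulting 3 x 3 activation map is non-zero only in its middle column, where the filter straddles the edge. A layer with several filters would produce one such map per filter.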

2.1.5 Pooling

Figure 2.5 Example of pooling

Input activation map (6 × 6):

12  38   8   6  33   5
58   2  24  11   9  45
86  55  38  22  91  32
92   7  18  67  85  39
23  85  74   4  15  93
17  93  49  62  46  34

Max-pooling output (3 × 3):

58  24  45
92  67  91
93  74  93

Average-pooling output (3 × 3):

27.5  12.3  23.0
60.0  36.3  61.8
54.5  47.3  47.0

The pooling operation is used to sub-sample an activation map. To do so, the activation map is split into a series of N × M rectangles. A single value is kept per rectangle, which reduces the size of the activation map. This operation contributes to the spatial insensitivity of convolutional networks: the more the activation maps are reduced, the more the activations represent abstract concepts and the less they are tied to a position in the input image, as illustrated in Figure 2.4. Several pooling strategies exist. A first strategy is to average the values in each rectangle. Another strategy is to keep the largest value in each rectangle, which retains the value most representative of the feature extracted by the convolutional filter. Figure 2.5 shows an example of both types of pooling with a filter of size 2 × 2.
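Both strategies can be checked against the values of Figure 2.5 with a short sketch. The helper names are hypothetical, and the function assumes square, non-overlapping windows that divide the activation map evenly.

```python
def pool2d(activation_map, size, reduce):
    """Split the map into non-overlapping size x size windows and keep one value per window."""
    rows, cols = len(activation_map), len(activation_map[0])
    return [
        [
            reduce(
                [
                    activation_map[i + di][j + dj]
                    for di in range(size)
                    for dj in range(size)
                ]
            )
            for j in range(0, cols, size)
        ]
        for i in range(0, rows, size)
    ]


def mean(values):
    return sum(values) / len(values)


# Input activation map of Figure 2.5.
activation_map = [
    [12, 38, 8, 6, 33, 5],
    [58, 2, 24, 11, 9, 45],
    [86, 55, 38, 22, 91, 32],
    [92, 7, 18, 67, 85, 39],
    [23, 85, 74, 4, 15, 93],
    [17, 93, 49, 62, 46, 34],
]
max_pooled = pool2d(activation_map, 2, max)
avg_pooled = pool2d(activation_map, 2, mean)
```

The max-pooled map matches the figure exactly; the average-pooled map matches it once values such as 12.25 and 36.25 are rounded to one decimal, as the figure does.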


2.1.6 ResNets

Figure 2.6 Comparison between convolutional blocks. (a) Convolutional block: Block 1 and Block 2, each made of two convolutions. (b) Residual convolutional block: Resblock 1 and Resblock 2, each made of two convolutions whose output is added to the block input.

VGG-type convolutional networks [91] suffer from the vanishing gradient problem described in Section 2.1.3. The more convolutional layers are added, the more the gradient shrinks, and with it the ability of the network to converge. He et al. [37] introduced the concept of residual block with the ResNet network in order to mitigate this problem. To do so, the output of the convolutional block is added to its input, as illustrated in Figure 2.6b. This addition is called a skip connection. The output of a residual block is defined by:

$$ H(x) = x + F(x) \tag{2.5} $$

where x is the input and F(x) is a series of convolution operations.

The gradient of the cost function (L) with respect to the input can be written as follows:

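$$
\begin{aligned}
\frac{\partial L}{\partial x} &= \frac{\partial L}{\partial H} \frac{\partial H}{\partial x} \\
&= \frac{\partial L}{\partial H} \left( 1 + \frac{\partial F}{\partial x} \right) \\
&= \frac{\partial L}{\partial H} + \frac{\partial L}{\partial H} \frac{\partial F}{\partial x}
\end{aligned}
\tag{2.6}
$$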


The gradient of the input with respect to itself is equal to 1. By adding the input to the output of the residual block, the gradient can propagate more easily, which limits the impact of the vanishing gradient problem. With this technique, much deeper convolutional networks can be developed.
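The role of the "1 +" term in Equation 2.6 can be illustrated numerically. The sketch below treats each block as a scalar function with a small local gradient dF/dx; the value 0.1 and the depth of 10 blocks are hypothetical.

```python
def plain_stack_gradient(local_gradients):
    """Gradient through stacked blocks without skip connections: product of the dF/dx terms."""
    gradient = 1.0
    for local_gradient in local_gradients:
        gradient *= local_gradient
    return gradient


def residual_stack_gradient(local_gradients):
    """Same stack with skip connections: each block contributes (1 + dF/dx), as in Equation 2.6."""
    gradient = 1.0
    for local_gradient in local_gradients:
        gradient *= 1.0 + local_gradient
    return gradient


# Hypothetical stack of 10 blocks, each with a small local gradient.
local_gradients = [0.1] * 10
plain = plain_stack_gradient(local_gradients)
residual = residual_stack_gradient(local_gradients)
```

Without skip connections the gradient collapses to about 1e-10, whereas with them it stays above 1, because the identity path always lets part of the gradient through.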

2.2 Visual Question Answering and Feature-wise Linear Modulation (FiLM)

VQA [7] is a type of problem in which an intelligent agent must answer questions related to the content of a previously chosen image. These questions are usually expressed in words. Each visual scene can be composed of a single image or of several images (video). Perez et al. [75, 76] proposed a neural-network-based approach to solve this problem. The solution can be separated into 2 main modules: a module that analyses the textual question and a module that analyses the image.

The textual question module, shown as the GRU boxes in Figure 2.7, takes a sequence of words as input (some questions are more than 40 words long). Since the order of the sequence carries important information about the meaning of the question [34], the module is composed of a recurrent neural network. It is a GRU recurrent network [23] with 4096 hidden units. The final hidden state produces N pairs of parameters (γ, β), one pair for each of the N FiLM layers used in the image module. These parameters "focus the attention" of the image analysis module by modulating the activation maps of the network.
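The flow from a word sequence to the (γ, β) pairs can be sketched with a toy GRU. Everything specific here is an assumption made for illustration: the tiny sizes, the random untrained weights, the omission of biases, the gate convention used in `gru_step`, and the single linear projection `Wout` that maps the final hidden state to the FiLM parameters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def gru_step(x, h, p):
    """One GRU update (biases omitted): update gate z, reset gate r, candidate state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [
        math.tanh(a + b)
        for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated_h))
    ]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


def question_to_film_params(word_vectors, p, num_blocks, num_maps):
    """Run the GRU over the words, then split a projection of the final state into (gamma, beta) pairs."""
    h = [0.0] * len(p["Uz"])
    for x in word_vectors:
        h = gru_step(x, h, p)
    flat = matvec(p["Wout"], h)  # assumed linear projection of the final hidden state
    pairs = []
    for n in range(num_blocks):
        start = 2 * n * num_maps
        gamma = flat[start:start + num_maps]
        beta = flat[start + num_maps:start + 2 * num_maps]
        pairs.append((gamma, beta))
    return pairs


rng = random.Random(0)
embed, hidden, blocks, maps = 3, 4, 2, 5  # toy sizes, far smaller than in the source
params = {name: random_matrix(hidden, embed, rng) for name in ("Wz", "Wr", "Wh")}
params.update({name: random_matrix(hidden, hidden, rng) for name in ("Uz", "Ur", "Uh")})
params["Wout"] = random_matrix(2 * blocks * maps, hidden, rng)
question = [[rng.uniform(-1, 1) for _ in range(embed)] for _ in range(6)]
film_params = question_to_film_params(question, params, blocks, maps)
```

Each entry of `film_params` holds one γ vector and one β vector with one value per activation map, which is the shape a FiLM layer needs to modulate its input.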

The image module, shown as the CNN and ResBlock boxes in Figure 2.7, is composed of a convolutional network that analyses the visual scene. Rather than training from random weight values, Perez et al. chose to use the first layers of a ResNet-101 classifier network [37] pre-trained on the ImageNet dataset [47]. This greatly speeds up the convergence of the convolutional network because the layers responsible for learning primary structures [58] (the first layers of the network) do not need to be learned. Learning them takes considerable time because the updates applied to the layer weights become smaller and smaller the further down one goes in the network architecture (the output of the network is considered here to be the highest layer and the first layer the lowest), owing to the vanishing gradient problem [11]. Perez et al. used the conv4_6 layer of the ResNet network, along with the layers below it, as the first image processing stage. The output of the conv4_6 layer then goes through 2 convolutional layers with ReLU activation functions [108]. Table 2.1 gives the details of these layers.

Figure 2.7 Diagram of the FiLM VQA system, inspired by [76]. On the left, the question module is composed of a GRU recurrent network. It takes as input a question in the form of a sequence of words and computes the β and γ values that are used by the image module shown in the centre of the diagram. The image module takes as input the image representing the scene of interest. This image is processed by a first pre-trained convolutional network (CNN), then by a series of residual blocks. The architecture of a residual block is detailed on the right of the diagram. It contains a series of convolutional layers (Conv), a batch normalisation layer (BN) and a FiLM layer. The FiLM layer takes as parameters the β and γ computed by the question module in order to apply a linear transformation to the activation of the previous layer. This transformation focuses the attention of the convolutional network on different features of the image depending on the question asked. The FiLM layer is followed by the ReLU activation function defined in Section 2.1.1. The output of the last residual block is finally sent to a classifier that chooses the most probable answer to the question from the list of possible answers.

| Layer | Dimension |
|---|---|
| Input image | 3 x 224 x 224 |
| ResNet-101 up to conv4_6 | 1024 x 14 x 14 |
| Conv(3 x 3, 1024 → 128) | 128 x 14 x 14 |
| ReLU | 128 x 14 x 14 |
| Conv(3 x 3, 128 → 128) | 128 x 14 x 14 |
| ReLU | 128 x 14 x 14 |

Table 2.1 Architecture of the first processing stage of the image module of the FiLM system [49, 76]. The Dimension column, A x B x C, gives the dimension of the output of each layer: A is the number of activation maps of dimension B by C. ResNet-101 refers to the ResNet network [37] pre-trained on the ImageNet dataset [47]. In the Layer column, Conv(E x F, G → H) represents a convolutional layer composed of H filters of dimension E by F. The relation G → H gives the number of input feature maps G versus the number of output feature maps H. ReLU represents the "Rectified Linear Unit" activation function [108].
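The dimensions listed in Table 2.1 can be cross-checked with a little arithmetic. The sketch below assumes a stride of 1, a padding of 1 (which is what keeps a 3 x 3 convolution at 14 x 14) and one bias per filter; the source does not state these details, so the parameter counts are indicative only.

```python
def conv_output_size(input_size, kernel_size, padding, stride=1):
    """Spatial size of a convolution output along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv_parameter_count(kernel_h, kernel_w, in_maps, out_maps, bias=True):
    """Number of weights of a convolutional layer, plus one bias per filter if bias is True."""
    return kernel_h * kernel_w * in_maps * out_maps + (out_maps if bias else 0)


# The two 3 x 3 convolutions applied after conv4_6 in Table 2.1.
first_conv_params = conv_parameter_count(3, 3, 1024, 128)
second_conv_params = conv_parameter_count(3, 3, 128, 128)
spatial_size = conv_output_size(14, kernel_size=3, padding=1)
```

Under these assumptions the first convolution holds 1,179,776 parameters and the second 147,584, while the spatial size stays at 14, consistent with the 128 x 14 x 14 outputs of the table.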

A number N of residual blocks is placed after this first processing stage. The number of residual blocks is an implementation choice; Perez et al. chose N = 4, a value determined empirically. Each residual block is defined as shown in Table 2.2.

The main contribution of the work of Perez et al. [76] lies in the FiLM layer. This layer applies a linear transformation to the output of the previous layer as a function of 2 parameters that are learned by the question module. Equation 2.7 shows the transformation applied:

$$ \mathrm{FiLM}(F_{j,c} \mid \gamma_{j,c}, \beta_{j,c}) = \gamma_{j,c} F_{j,c} + \beta_{j,c} \tag{2.7} $$

In this equation, F denotes the activation map (output) of the previous layer. The indices i and c correspond to the index of the input and the index of the activation map, respectively. The parameters γ and β are learned during the training phase. Depending on their values, the activation map F is modulated differently: for example, the modulation can negate, amplify or threshold parts of the activation maps. These modulations focus the attention of the convolutional network on specific parts of the image depending on the question given as input. The authors analysed the parameters γ and β of each residual block after training. Using a t-SNE projection, they observed clusters related to the different question types in the high-level residual blocks [76]. The authors argue that these clusters reflect a form of question-driven "reasoning".

| Index | Layer | Dimension |
|---|---|---|
| (1) | Output of the previous module | 128 x 14 x 14 |
| (2) | Conv(1 x 1, 128 → 128) | 128 x 14 x 14 |
| (3) | ReLU | 128 x 14 x 14 |
| (4) | Conv(3 x 3, 128 → 128) | 128 x 14 x 14 |
| (5) | Batch Normalisation | 128 x 14 x 14 |
| (6) | FiLM layer | 128 x 14 x 14 |
| (7) | ReLU | 128 x 14 x 14 |
| (8) | Residual addition of (3) and (7) | 128 x 14 x 14 |

Table 2.2 Architecture of each residual block [49, 76]. The Dimension column, written A x B x C, gives the output dimension of each layer: A is the number of activation maps of size B by C. The Batch Normalisation layer is detailed in [44]. The FiLM layer is defined by equation 3.1. Conv(E x F, G → H) denotes a convolutional layer made of H filters of size E by F; G → H gives the number of input feature maps G versus the number of output feature maps H. ReLU denotes the "Rectified Linear Unit" activation function [108].

| Question type | Example question |
|---|---|
| Exist | Is there a white cube ? |
| Less than | Is there less red cylinder than big sphere ? |
| Greater than | Is there more sphere than cube ? |
| Count | How many cubes are metallic ? |
| Query material | What is the material of the red cylinder in the right corner ? |
| Query size | What is the size of the yellow sphere ? |
| Query color | What is the color of the brown thing that is left of the big sphere ? |
| Query shape | What is the shape of the green metallic thing ? |
| Equal color | Are the big cube and the small sphere of the same color ? |
| Equal shape | Are the red metallic thing and the brown thing of the same shape ? |
| Equal size | Are the brown cube and the yellow sphere of the same size ? |
| Equal material | Are the sphere and the red cube of the same material ? |
| Equal integer | Are there an equal number of large things and metal spheres ? |

Table 2.3 Question types in the CLEVR dataset [48]
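The FiLM modulation described above can be illustrated with a minimal sketch. It assumes the usual FiLM form, a per-map affine transform γ·F + β; the function names `film` and `relu` and the toy values are hypothetical, and γ and β are simply passed in rather than produced by a trained network.

```python
def film(feature_maps, gammas, betas):
    """Apply gamma * F + beta to each 2-D activation map (one gamma/beta pair per map)."""
    return [
        [[g * v + b for v in row] for row in fmap]
        for fmap, g, b in zip(feature_maps, gammas, betas)
    ]


def relu(feature_maps):
    """Rectified Linear Unit applied element-wise."""
    return [[[max(0.0, v) for v in row] for row in fmap] for fmap in feature_maps]


# Two identical toy activation maps (2 x 2 each).
maps = [
    [[1.0, -2.0], [0.5, 3.0]],
    [[1.0, -2.0], [0.5, 3.0]],
]

# gamma = -1 negates the first map; gamma = 1, beta = 0 leaves the second untouched.
negated = film(maps, gammas=[-1.0, 1.0], betas=[0.0, 0.0])

# beta = -1 followed by ReLU keeps only the values above 1 in the first map.
thresholded = relu(film(maps, gammas=[1.0, 1.0], betas=[-1.0, 0.0]))
```

A larger γ would amplify a map in the same way, which covers the three behaviours (negation, amplification, thresholding) mentioned in the text.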

Figure 2.8 Example of a coordinate regression task (diagram labels: (x, y), Conv). Inspired by [63].

The output of the last residual block is then fed to a classifier that assigns a probability to each possible answer. The predicted answer is the "class" with the highest probability. The set of possible answers is defined by the example answers presented during training, so the network cannot produce an answer that it has not seen during training. The classifier is composed of a 1 x 1 convolutional layer and 2 perceptron layers (512 and 1024 neurons).
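The answer selection described above can be sketched as follows. The use of a softmax to turn the classifier outputs into probabilities is an assumption (the text only states that a probability is assigned to each answer), and the answer vocabulary and scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, answers):
    """Return the answer with the highest probability, and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return answers[best], probs[best]


# Hypothetical closed vocabulary: only answers seen during training can be predicted.
answers = ["yes", "no", "cube", "sphere"]
answer, prob = predict([0.2, 1.5, -0.3, 0.9], answers)
```

Because the output layer has one node per known answer, an answer outside `answers` can never be returned, which is the limitation noted in the text.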

To train the system, the authors used the CLEVR dataset [49]. It comprises 70 000 training images, 15 000 validation images and 15 000 test images. The images are artificially generated scenes containing 3D geometric shapes of different colours and sizes, placed at different locations and with different occlusion parameters. Several question-answer pairs are generated for each scene: 699 989 for training, 149 991 for validation and 14 988 for testing. The question categories are those listed in Table 2.3. In addition to the question-image-answer triplet, the dataset provides context about the question type; this prior information is not used for training here. The system reaches a mean accuracy of 97.7% on the CLEVR test set, which made it the state of the art on this dataset at the time of publication.

2.3 Coordinate maps

Liu et al. [63] designed a task in which a neural network must determine the position of a white square in an image with a black background from its Cartesian coordinates. An example is shown in Figure 2.8. The inverse operation, i.e. determining the Cartesian coordinates of a square from an image, is also explored. Both tasks look very simple, but Liu et al. observed that convolutional neural networks struggle to perform this transformation between the Cartesian space and the pixel space. According to them, the difficulty arises because each filter of a convolutional layer is slid across the input image. This sliding lets the filter respond to an object or a feature wherever it appears in the image (spatial invariance). This is one of the strengths of convolutional networks, especially for image classification, where the goal is to identify an image from its components. The property is problematic, however, for tasks where spatial position matters, such as drawing bounding boxes around objects in a visual scene or regressing Cartesian coordinates.

Figure 2.9 Coordinate maps. The X-axis and Y-axis coordinate maps, whose values range from -1 to 1, are concatenated with the activation maps at the input of the convolution.

Liu et al. [63] propose coordinate maps to let a convolutional layer retain information about the position of an object. The goal is to give the network a set of reference points that help it orient itself spatially. To do so, 2 matrices of fixed values are concatenated with the activation maps at the input of a convolutional layer, as shown in Figure 2.9. The matrices have the same dimensions as the activation maps they are concatenated with. The first matrix is the reference for the ordinate (Y) axis. All its columns are identical: the value is -1 on the first row and increases linearly down to the last row, which holds the value 1. This marks the beginning and the end of the Y axis. The reference is the same for every convolutional layer, whatever the dimension of the activation maps. The second matrix is the reference for the abscissa (X) axis and is built like the first with the axes swapped: the values vary linearly from -1 to 1 across the columns, and all rows are identical.
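The construction of the two coordinate maps can be sketched in a few lines. This is a minimal illustration using nested lists; the function names are hypothetical and the sketch is not the implementation of [63].

```python
def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop (inclusive)."""
    if count == 1:
        return [start]
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def coordinate_maps(height, width):
    """Build the Y map (identical columns) and the X map (identical rows)."""
    ys = linspace(-1.0, 1.0, height)
    xs = linspace(-1.0, 1.0, width)
    y_map = [[ys[row]] * width for row in range(height)]
    x_map = [list(xs) for _ in range(height)]
    return y_map, x_map


def add_coordinate_maps(feature_maps):
    """Concatenate the two maps to a list of equally sized 2-D activation maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    y_map, x_map = coordinate_maps(height, width)
    return feature_maps + [y_map, x_map]


y_map, x_map = coordinate_maps(3, 5)
activations = [[[0.0] * 5 for _ in range(3)] for _ in range(4)]  # 4 maps of 3 x 5
augmented = add_coordinate_maps(activations)                      # 6 maps of 3 x 5
```

The maps depend only on the height and width of the activation maps, so they can be regenerated for each layer regardless of its resolution.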

A convolutional network using coordinate maps reaches an accuracy of 100% on the task illustrated in Figure 2.8, whereas the best network without coordinate maps evaluated in [63] reaches 86%. Liu et al. also tested coordinate maps on more complex tasks and observed a considerable increase in accuracy on an object bounding-box task. Performance remains essentially unchanged on the ImageNet classification task [47]. This shows that the network can learn to ignore the coordinate information, and thus preserve the spatial invariance of convolutional networks, when needed.

2.4 Convolutional networks applied to audio

2.4.1 Convolutional filters

As mentioned in section 2.1.4, convolutional networks are known to learn a hierarchy tied to the organisation of the scene depicted in an image. To use this type of network with audio recordings, several works [21, 26, 72, 73, 79] rely on time-frequency representations of the recordings. Such representations place frequency on the ordinate axis and time on the abscissa axis. In a conventional image, both axes represent a spatial position. This is a major difference that strongly affects how the 2 types of images should be interpreted. When convolving a conventional image, the filter is usually moved along both axes in the same way; this movement does not have the same meaning for a time-frequency representation. With conventional images, filters are most often square. Pons et al. [78, 80] analyse different filter shapes. They explain that a filter of shape m × 1 (Fig. 2.10a) captures the frequency characteristics of the signal, whereas a filter of shape 1 × n (Fig. 2.10b) captures its temporal characteristics. Pons et al. [80] experiment with several temporal filter lengths: a short filter (less than 950 ms) models onsets, whereas a long filter (more than 1500 ms) captures the rhythm of the signal. A square or rectangular filter of shape m × n (Fig. 2.10c) therefore models both frequency and temporal components. This configuration is the most widely used in the literature.
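The difference between an m × 1 and a 1 × n filter can be seen on a toy example. The sketch below uses a hypothetical 4 × 5 "spectrogram" (rows are frequency bins, columns are time frames) and a plain valid cross-correlation; the filter values are illustrative and not taken from [78, 80].

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (valid cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


spec = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],  # steady tone in frequency bin 1
    [0, 0, 0, 0, 0],
    [0, 0, 5, 0, 0],  # short burst in frequency bin 3 at time frame 2
]

freq_filter = [[1], [-1]]  # m x 1 with m = 2: compares adjacent frequency bins
time_filter = [[-1, 1]]    # 1 x n with n = 2: compares adjacent time frames

freq_out = correlate2d_valid(spec, freq_filter)  # 3 x 5 output
time_out = correlate2d_valid(spec, time_filter)  # 4 x 4 output
```

The temporal filter gives no response to the steady tone and a strong response at the start and end of the burst, while the frequency filter responds to the tone at every time frame.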

2.4.2 Spectro-temporal representation

Figure 2.10 Different shapes of convolutional filters applied to an M × N input: (a) frequency filter (m rows, n = 1); (b) temporal filter (m = 1, n columns); (c) frequency and temporal filter (m × n). Inspired by [78].

Several types of time-frequency representation exist. One of the most popular in the literature is the spectrogram [105] (illustrated in Figure 2.11), obtained by applying a short-time Fourier transform (STFT) [27] to the signal. Rui et al. [16] propose applying principal component analysis (PCA [3]) to a spectrogram. Other time-frequency representations include the Constant-Q Transform (CQT) [90], Perceptual Linear Predictive (PLP) analysis [39], Mel-frequency cepstral coefficients (MFCC) [64, 68] and the spikegram [28]. Each representation highlights different characteristics of the sound. For example, the MFCC and CQT representations use a non-linear frequency scale similar to that of human perception [64]. This scale makes it easier to distinguish the vocal components of a signal and offers better resolution at low frequencies than the STFT, which makes the fundamental frequency easier to identify. These representations have been, and still are, widely used in speech recognition. The choice of one representation over another depends on the nature of the sounds at hand.
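A magnitude spectrogram obtained with a short-time Fourier transform can be sketched as below. This is a deliberately naive, standard-library-only illustration: the Hann window, the frame length, the hop size and the test tone are all assumptions chosen for readability, and a real system would use an FFT routine.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Return a frequency x time matrix of DFT magnitudes over windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    # Transpose so that rows are frequency bins and columns are time frames.
    return [list(column) for column in zip(*frames)]


sample_rate = 800  # Hz, hypothetical
tone_hz = 100      # Hz, hypothetical pure tone
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(800)]

spec = stft_magnitude(signal, frame_len=64, hop=32)
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
peak_hz = peak_bin * sample_rate / 64
```

With these values each bin spans 12.5 Hz, so the energy of the tone concentrates in a single row of the spectrogram; taking the logarithm of the magnitudes would give the log-amplitude image used later in this work.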

2.4.3 Audio networks

Convolutional networks have been applied to audio signals, first converted to a time-frequency representation, for several types of tasks. To name only a few, they have been used for sound classification [72, 78, 79, 93], speech recognition [1, 26, 59], music tagging [21] and denoising [73]. As with any neural-network-based system, no single configuration is optimal for all tasks. Pons et al. [79] propose a shallow architecture with only one convolutional layer and one max-pooling layer [89] to model short-term temporal characteristics of musical signals. They also propose adding a recurrent network at the output of the convolutional network: since the convolutional network models short-term characteristics of the signals, a recurrent network could model longer-term characteristics. The authors leave this direction to future work. Park et al. [73], for their part, use 3 convolutional layers with the same number of max-pooling layers to classify noise segments in a signal. As mentioned above, the shape of a filter and the direction in which it moves capture different aspects of the sound. Choi et al. [21] use filters that move along the X (time) axis to learn the temporal distribution of the frequency bands of a musical signal in order to tag it. Pons et al. [80] report that using different filters in each convolutional layer improves system performance. Sprengel et al. [93], in contrast, obtained results only with square filters. Once again, the shape of the filters depends on the type of signals analysed.

Figure 2.11 Example of a spectrogram (vertical axis: frequency; horizontal axis: time).

Finally, other approaches exist for feeding sound representations to a neural network, for example combining a reservoir network with a cochleogram representation to recognise acoustic temporal sequences [9, 13], or using an event-based representation to detect and localise acoustic events [28]. Unlike conventional approaches, these do not need to work with fixed frames placed on the signal. They are developed within the NECOTIS research group [85].


CHAPTER 3

DATASET AND PRELIMINARY RESULTS

Authors and affiliations:
Jerome Abdelnour: master's student, Université de Sherbrooke, Faculty of Engineering, Department of Electrical and Computer Engineering.

Giampiero Salvi: professor, Norwegian University of Science and Technology (NTNU), Department of Electronic Systems.

Jean Rouat: professor, Université de Sherbrooke, Faculty of Engineering, Department of Electrical and Computer Engineering.

Acceptance date: 10 November 2018

Acceptance status: final version published

Venue: NeurIPS 2018 - Visually Grounded Interaction and Language workshop

Reference: [J. Abdelnour, J. Rouat and G. Salvi, "CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning," in Workshop on Visually Grounded Interaction and Language in Advances in Neural Information Processing Systems 31, 2018, https://arxiv.org/abs/1811.10561]

French title: CLEAR : Une base de données pour raisonnement acoustique.

Code: https://github.com/IGLU-CHISTERA/CLEAR-dataset-generation

Contribution to the document:
This article contributes to this thesis by introducing the Acoustic Question Answering task and a first version of the CLEAR dataset. The dataset is the first step of this work since, without it, no neural network can be trained. The article describes the characteristics of the dataset and the techniques used to generate it. It also presents preliminary results obtained with the FiLM system.

Summary (translated from the French):
The Acoustic Question Answering task is introduced for the first time in this article in order to promote research in acoustic reasoning. In this task, an intelligent agent learns to answer a question about the content of an auditory scene. We propose a data generation paradigm for the AQA task inspired by CLEVR [48]. The acoustic scenes have a fixed duration and are generated by assembling 10 sounds drawn from a bank of elementary sounds. The questions and answers for each acoustic scene are generated with manually defined functional programs. The code used to generate the dataset is made publicly available on GitHub. A preliminary analysis of the performance of the FiLM model, initially developed for the VQA task, is reported for this acoustic task.

Modifications made to the article:
The layout was adjusted for consistency with the rest of this document. The article's bibliography was merged into the bibliography at the end of this thesis.


3.1 CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning

3.1.1 Abstract

We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning. In this task an agent learns to answer questions on the basis of acoustic context. In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR [48]. We generate acoustic scenes by leveraging a bank of elementary sounds. We also provide a number of functional programs that can be used to compose questions and answers that exploit the relationships between the attributes of the elementary sounds in each scene. We provide AQA datasets of various sizes as well as the data generation code. As a preliminary experiment to validate our data, we report the accuracy of current state-of-the-art visual question answering models when they are applied to the AQA task without modifications. Although there is a plethora of question answering tasks based on text, image or video data, to our knowledge, we are the first to propose answering questions directly on audio streams. We hope this contribution will facilitate the development of research in the area.

3.2 Introduction and Related Work

Question answering (QA) problems have attracted increasing interest in the machine learning and artificial intelligence communities. These tasks usually involve interpreting and answering text-based questions in view of some contextual information, often expressed in a different modality. Text-based QA uses text corpora as context [40, 45, 81, 92, 101, 102]; in visual question answering (VQA), instead, the questions relate to a scene depicted in still images (e.g. [5, 7, 30, 31, 45, 48, 81, 111, 114]). Finally, video question answering attempts to use both the visual and acoustic information in video material as context (e.g. [17, 22, 51, 97, 104, 106]). In the last case, however, the acoustic information is usually expressed in text form, either with manual transcriptions (e.g. subtitles) or by automatic speech recognition, and is limited to linguistic information [112].

The task presented in this paper differs from the above by answering questions directly on audio streams. We argue that the audio modality contains important information that has not been exploited in the question answering domain. This information may allow QA systems to answer relevant questions more accurately, or even to answer questions that are not approachable from the visual domain alone. Examples of potential applications are the detection of anomalies in machinery where the moving parts are hidden, the detection of threatening or hazardous events, and industrial and social robotics.


Current question answering methods require large amounts of annotated data. In the visual domain, several strategies have been proposed to make this kind of data available to the community [7, 30, 48, 114]. Agrawal et al. [5] noted that the way the questions are created has a huge impact on what information a neural network uses to answer them (this is a well-known problem that can arise with all neural-network-based systems). This motivated research [31, 48, 111] on how to reduce the bias in VQA datasets. The complexity of gathering good labeled data forced some authors [31, 111] to constrain their work to yes/no questions. Johnson et al. [48] worked around this constraint by using synthetic data. To generate the questions, they first generate a semantic representation that describes the reasoning steps needed to answer the question. This gives them full control over the labelling process and a better understanding of the semantic meaning of the questions. They leverage this ability to reduce the bias in the synthesized data. For example, they ensure that none of the generated questions contains hints about the answer.

Inspired by the work on CLEVR [48], we propose an acoustical question answering (AQA) task by defining a synthetic dataset that comprises audio scenes composed of sequences of elementary sounds and questions relating properties of the sounds in each scene. We provide the adapted software for AQA data generation as well as a version of the dataset based on musical instrument sounds. We also report preliminary experiments using the FiLM architecture derived from the VQA domain.

3.3 Dataset

This section presents the dataset and the generation process (see footnote 1). In this first version (version 1.0) we created multiple instances of the dataset with 1000, 10000 and 50000 acoustic scenes, for which we generated 20 to 40 questions and answers per scene. In total, we generated six instances of the dataset. To represent questions, we use the same semantic representation through functional programs that is proposed in [48, 49].

3.3.1 Scenes and Elementary Sounds

An acoustic scene is composed of a sequence of elementary sounds, which we simply call sounds in the following. The sounds are real recordings of musical notes from the Good-Sounds database [10]. We use five families of musical instruments: cello, clarinet, flute, trumpet and violin. Each recording of an instrument has a different musical note (pitch) on the MIDI scale. The data generation process, however, is independent of the specific sounds, so that future versions of the data may include speech, animal vocalizations and environmental sounds. Each sound is described by an n-tuple [Instrument family, Brightness, Loudness, Musical note, Absolute Position, Relative Position, Global Position, Duration] (see Table 3.1 for a summary of attributes and values), where Brightness can be either bright or dark; Loudness can be quiet or loud; Musical note can take any of the 12 values on the fourth octave of the Western chromatic scale (see footnote 2). The Absolute Position gives the position of the sound within the acoustic scene (between first and tenth); the Relative Position gives the position of a sound relative to the other sounds in the same category (e.g. "the third cello sound"). Global Position refers to the approximate position of the sound within the scene and can be either beginning, middle or end.

1. Available at https://github.com/IGLU-CHISTERA/CLEAR-dataset-generation

| Question type | Example | Possible answers | # |
|---|---|---|---|
| Yes/No | Is there an equal number of loud cello sounds and quiet clarinet sounds ? | yes, no | 2 |
| Note | What is the note played by the flute that is after the loud bright D note ? | A, A#, B, C, C#, D, D#, E, F, F#, G, G# | 12 |
| Instrument | What instrument plays a dark quiet sound in the end of the scene ? | cello, clarinet, flute, trumpet, violin | 5 |
| Brightness | What is the brightness of the first clarinet sound ? | bright, dark | 2 |
| Loudness | What is the loudness of the violin playing after the third trumpet ? | quiet, loud | 2 |
| Counting | How many other sounds have the same brightness as the third violin ? | 0 to 10 | 11 |
| Absolute Pos. | What is the position of the A# note playing after the bright B note ? | first to tenth (shared with Relative Pos.) | 10 |
| Relative Pos. | Among the trumpet sounds which one is a F ? | first to tenth (shared with Absolute Pos.) | |
| Global Pos. | In what part of the scene is the clarinet playing a G note that is before the third violin sound ? | beginning, middle, end (of the scene) | 3 |
| Total | | | 47 |

Table 3.1 Types of questions with examples and possible answers. In the original layout, the variable parts of each question are emphasized in bold italics. In the dataset many variants of questions are included for each question type, depending on the kind of relations the question implies. The number of possible answers is reported in the last column. Each possible answer is modelled by one output node in the neural network. Note that for absolute and relative positions, the same nodes are used with different meanings: in the first case we enumerate all sounds, in the second case only the sounds played by a specific instrument.
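The attribute tuple and the two position attributes described in this section can be sketched as follows. This is a hypothetical data structure, not the one used by the released generation code; it takes "same category" to mean "same instrument", following the example in the text, and it leaves out Global Position and Duration because the text does not give the rules used to compute them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sound:
    instrument: str
    brightness: str
    loudness: str
    note: str
    absolute_position: int = 0  # 1-based rank among all sounds of the scene
    relative_position: int = 0  # 1-based rank among sounds of the same instrument


def assign_positions(sounds):
    """Fill in the absolute and relative positions for sounds in playing order."""
    seen = Counter()
    for index, sound in enumerate(sounds, start=1):
        seen[sound.instrument] += 1
        sound.absolute_position = index
        sound.relative_position = seen[sound.instrument]
    return sounds


scene = assign_positions([
    Sound("cello", "dark", "loud", "C"),
    Sound("flute", "bright", "quiet", "A#"),
    Sound("cello", "dark", "quiet", "G"),
])
```

In this toy scene the last sound is the third sound overall but "the second cello sound", which is why absolute and relative position questions can share the same output nodes with different meanings.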

2. For this first version of CLEAR the cello only includes 8 notes: C, C#, D, D#, E, F, F#, G.

Figure 3.1 Example of an acoustic scene. We show the spectrogram, the waveform and the annotation of the instrument for each elementary sound. A possible question on this scene could be "What is the position of the flute that plays after the second clarinet ?", and the corresponding answer would be "Fifth". Note that the agent must answer based on the spectrogram (or waveform) alone.

We start by generating a clean acoustic scene as follows: first, the encoding of the original sounds (sampled at 48 kHz) is converted from 24 to 16 bits. Then silence is detected and removed when the energy, computed as $10 \log_{10} \sum_i x_i^2$ over windows of 100 ms, falls below -50 dB, where $x_i$ are the sound samples normalized between ±1. We then measure the perceptual loudness of the sounds in dB LUFS using the method described in the ITU-R BS.1770-4 international normalization standard [98] and implemented in [95]. We attenuate by 10 dB the sounds whose loudness lies in the intermediate range between -30.5 dB LUFS and -24 dB LUFS, to increase the separation between loud and quiet sounds. We obtain a bank of 56 elementary sounds. Each clean acoustic scene is generated by concatenating 10 sounds chosen randomly from this bank.
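The silence-removal step can be sketched as below, using the window length and threshold quoted above. The use of non-overlapping windows and the function names are assumptions; the released generation code may proceed differently.

```python
import math


def frame_energy_db(frame):
    """Energy of a frame in dB: 10 * log10(sum of squared samples)."""
    energy = sum(x * x for x in frame)
    return 10 * math.log10(energy) if energy > 0 else float("-inf")


def remove_silence(samples, sample_rate, window_s=0.1, threshold_db=-50.0):
    """Drop every non-overlapping window whose energy falls below the threshold.

    Samples are assumed to be normalized between -1 and 1.
    """
    window = int(round(window_s * sample_rate))
    kept = []
    for start in range(0, len(samples), window):
        frame = samples[start:start + window]
        if frame_energy_db(frame) >= threshold_db:
            kept.extend(frame)
    return kept


sample_rate = 1000  # Hz, hypothetical low rate to keep the example small
tone = [0.5] * 100      # 100 ms of signal
silence = [0.0] * 100   # 100 ms of digital silence
faint = [1e-4] * 100    # 100 ms at -60 dB energy, below the threshold
trimmed = remove_silence(tone + silence + faint + tone, sample_rate)
```

Only the two 100 ms windows containing the tone survive, so `trimmed` holds 200 samples.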

Once a clean acoustic scene has been created, it is post-processed to generate a more difficult and realistic scene. White, uncorrelated, uniformly distributed noise is first generated with an amplitude range set to the maximum values allowed by the encoding. Its amplitude is then attenuated by a factor f, where 20 log10 f is randomly sampled from a uniform distribution between -90 dB and -80 dB, and the noise is added to the scene. Although the noise is weak and almost imperceptible to the human ear, it guarantees that there is no pure silence between elementary sounds. The resulting scene is finally filtered to simulate room reverberation using SoX (see footnote 3). For each scene, a different room reverberation time is drawn from a uniform distribution over [50 ms, 400 ms].
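The noise-addition step can be sketched as below. It assumes samples normalized between -1 and 1 (so that full-scale noise spans that range) and a hypothetical function name; the reverberation step performed with SoX is not covered.

```python
import random


def add_background_noise(samples, rng, min_db=-90.0, max_db=-80.0):
    """Add uniform white noise attenuated by a factor f with 20*log10(f) in [min_db, max_db]."""
    attenuation_db = rng.uniform(min_db, max_db)
    factor = 10 ** (attenuation_db / 20)  # convert the dB value to an amplitude factor
    noisy = [x + factor * rng.uniform(-1.0, 1.0) for x in samples]
    return noisy, factor


rng = random.Random(0)      # fixed seed so the example is reproducible
clean = [0.0] * 1000        # a stretch of pure silence
noisy, factor = add_background_noise(clean, rng)
```

The resulting factor lies between roughly 3.2e-5 and 1e-4, so the noise stays far below the level of the instrument sounds while removing exact zeros from the scene.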

3.3.2 Questions

Questions are structured as a logical tree, introduced in CLEVR [48] as a functional program. A functional program defines the reasoning steps required to answer a question given a scene definition. We adapted the original work of Johnson et al. [48] to our acoustical context by updating the function catalog and the relationships between the objects of the scene. For example, we added the before and after temporal relationships.
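As an illustration, a functional program can be pictured as a chain of small functions applied to the scene definition. The function names, the scene format and the example question below are hypothetical; they are not the function catalog of CLEVR or CLEAR.

```python
scene = [
    {"instrument": "trumpet", "note": "C", "loudness": "loud"},
    {"instrument": "cello", "note": "F", "loudness": "quiet"},
    {"instrument": "violin", "note": "A", "loudness": "loud"},
]


def filter_attribute(scene, indices, key, value):
    """Keep the selected sounds whose attribute `key` equals `value`."""
    return [i for i in indices if scene[i][key] == value]


def relate_after(scene, indices):
    """Return the sounds playing after the single selected sound."""
    (anchor,) = indices  # raises ValueError unless exactly one sound is selected
    return list(range(anchor + 1, len(scene)))


def query(scene, indices, key):
    """Read one attribute of the single selected sound."""
    (target,) = indices
    return scene[target][key]


def run_program(scene):
    """Program for: "What is the note of the violin playing after the trumpet?"."""
    everything = list(range(len(scene)))
    trumpet = filter_attribute(scene, everything, "instrument", "trumpet")
    after_trumpet = relate_after(scene, trumpet)
    violin = filter_attribute(scene, after_trumpet, "instrument", "violin")
    return query(scene, violin, "note")


answer = run_program(scene)
```

Each intermediate result is a set of sounds, so the same chain also records which reasoning steps were needed, which is what makes the later analysis by question type possible.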

In natural language, there is more than one way to ask a question with a given meaning. For example, the question "Is the cello as loud as the flute ?" is equivalent to "Does the cello play as loud as the flute ?". Both questions correspond to the same functional program even though their text representations differ. Therefore the structures we use include, for each question, a functional representation and possibly many text representations, used to maximize language diversity and minimize the bias in the questions. We have defined 942 such structures.

3. http://sox.sourceforge.net/sox.html

A template can be instantiated using a large number of combinations of elements. Not all of them generate valid questions. For example, "Is the flute louder than the flute ?" is invalid because it does not provide enough information to compare the correct sounds, regardless of the structure of the scene. Similarly, the question "What is the position of the violin playing after the trumpet ?" would be ill-posed if several violins play after the trumpet. The same question would be considered degenerate if there is only one violin sound in the scene, because it could be answered without taking into account the relation "after the trumpet". A validation process [48] is responsible for rejecting both ill-posed and degenerate questions during the generation phase.
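The two rejection criteria can be sketched for the example question above. This is a simplified, hypothetical check on a scene given as a list of instrument names; the handling of a missing or repeated anchor instrument is an assumption, and the actual validation process of [48] operates on functional programs.

```python
def classify_question(scene, target, anchor):
    """Classify "What is the position of the <target> playing after the <anchor>?".

    `scene` is the list of instrument names in playing order.
    """
    anchor_indices = [i for i, name in enumerate(scene) if name == anchor]
    if len(anchor_indices) != 1:
        return "ill-posed"  # assumption: the anchor sound itself must be unique
    after_anchor = [
        i for i, name in enumerate(scene)
        if name == target and i > anchor_indices[0]
    ]
    if len(after_anchor) != 1:
        return "ill-posed"  # zero or several targets satisfy the relation
    if scene.count(target) == 1:
        return "degenerate"  # the relation is not needed to find the target
    return "valid"


several_violins = classify_question(["trumpet", "violin", "cello", "violin"], "violin", "trumpet")
single_violin = classify_question(["cello", "trumpet", "violin"], "violin", "trumpet")
well_formed = classify_question(["violin", "trumpet", "violin"], "violin", "trumpet")
```

Only the last scene yields a question that is both answerable and requires the relation "after the trumpet" to be resolved.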

Thanks to the functional representation, we can use the reasoning steps of the questions to analyze the results. This would be difficult if we only used the text representation without human annotations. If we consider the kind of answer, questions can be organized into 9 families, as illustrated in Table 3.1. For example, the question "What is the third instrument playing ?" belongs to the "Query Instrument" family, as its function is to retrieve the instrument's name.

Alternatively, we could classify the questions based on the relationships required to answer them. For example, "What is the instrument after the trumpet sound that is playing the C note ?" is still a "query_instrument" question but, compared to the previous example, requires more complex reasoning. The appendix reports and analyzes statistics and properties of the database.

3.4 Preliminary Experiments

To evaluate our dataset, we performed preliminary experiments with a FiLM network [76]. It is a good candidate as it has been shown to work well on the CLEVR VQA task [48], which shares the same question structure as our CLEAR dataset. To represent acoustic scenes in a format compatible with FiLM, we computed spectrograms (log amplitude of the spectrum at regular intervals in time) and treated them as images. Each scene corresponds to a fixed-resolution image because we designed the dataset to include acoustic scenes of the same duration. The best results were obtained by training on 35 000 scenes and 1 400 000 questions/answers. This yields 89.97% accuracy on the test set, which comprises 7 500 scenes and 300 000 questions. For the same test set, a classifier always choosing the majority class would obtain only 7.6% accuracy.
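The majority-class baseline mentioned above can be computed as in the following sketch. The toy answer lists are hypothetical and only show the calculation; whether the majority class was determined on the training or the test answers is not stated in the text, and the sketch assumes the training answers.

```python
from collections import Counter


def majority_baseline(train_answers, test_answers):
    """Always predict the most frequent training answer and measure test accuracy."""
    majority_answer, _ = Counter(train_answers).most_common(1)[0]
    hits = sum(answer == majority_answer for answer in test_answers)
    return majority_answer, hits / len(test_answers)


train = ["yes", "no", "yes", "cello", "yes", "first"]
test = ["no", "yes", "flute", "yes"]
baseline_answer, baseline_accuracy = majority_baseline(train, test)
```

With 47 possible answers spread over several question families, such a baseline stays low, which puts the accuracy reported for FiLM in perspective.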

3.5 Conclusion

We introduce the new task of acoustic question answering (AQA) as a means to stimulate AI and reasoning research on acoustic scenes. We also propose a paradigm for data generation that is an extension of the CLEVR paradigm: the acoustic scenes are generated by combining a number of elementary sounds, and the corresponding questions and answers are generated based on the properties of those sounds and their mutual relationships. We generated a preliminary dataset comprising 50k acoustic scenes composed of 10 musical instrument sounds, and 2M corresponding questions and answers. We also tested the FiLM model on the preliminary dataset, obtaining at best 89.97% accuracy in predicting the right answer from the question and the scene. Although these preliminary results are very encouraging, we consider this a first step in creating datasets that will promote research in acoustic reasoning. The following is a list of limitations that we intend to address in future versions of the dataset.

3.5.1 Limitations and Future Directions

In order to be able to use models that were designed for VQA, we created acoustic scenes that have the same length in time. This allows us to represent the scenes as images (spectrograms) of fixed resolution. In order to promote models that can handle sounds more naturally, we should relax this assumption and create scenes of variable lengths. Another simplifying assumption (somewhat related to the first) is that every scene includes an equal number of elementary sounds. This assumption should also be relaxed in future versions of the dataset. In the current implementation, consecutive sounds follow each other without overlap. In order to implement something similar to occlusions in the visual domain, we should let the sounds overlap. The number of instruments is limited to five and all produce sustained notes, although with different sound sources (bow for the cello and violin, reed vibration for the clarinet, fipple for the flute and lips for the trumpet). We should increase the number of instruments and consider percussive and decaying sounds, as in drums, piano or guitar. We also intend to consider other types of sounds (ambient and speech, for example) to increase the generality of the data. Finally, the complexity of the task can always be increased by adding more attributes to the elementary sounds, adding complexity to the questions, or introducing different levels of noise and distortion in the acoustic data.


3.6 Acknowledgements

We would like to acknowledge the NVIDIA Corporation for donating a number of GPUs and the Google Cloud Platform research credits program for computational resources. Part of this research was financed by the CHIST-ERA IGLU project, the CRSNG and Michael-Smith scholarships, and by the University of Sherbrooke.


3.7 APPENDIX : Statistics on the Data Set

This appendix reports some statistics on the properties of the data set. We considered the data set comprising 50k scenes and 2M questions and answers to produce the analysis. Figure 3.2 reports the distribution of the correct answer to each of the 2M questions. Figures 3.3 and 3.4 report the distribution of question types and available template types, respectively. The fact that those two distributions are very similar means that the available templates are sampled uniformly when generating the questions.

Finally, Figure 3.5 shows the distribution of sound attributes in the scenes. It can be seen that most attributes are nearly evenly distributed. In the case of brightness, calculated in terms of spectral centroids, sounds were divided into clearly bright, clearly dark and ambiguous cases (referred to by "None" in the figure). We only instantiated questions about the brightness on the clearly separable cases.

[Figure 3.2: three bar charts (Training, Validation, Test). X axis: each possible answer, grouped by answer category (Brightness, Count, Instrument, Loudness, Yes/No, Musical Note, Position, Position Global). Y axis: frequency, from 0 to 0.08.]

Figure 3.2 Distribution of answers in the dataset by set type. The color represents the answer category.


[Figure 3.3: bar chart. X axis: question type (Compare Integer, Exist, Query Brightness, Query Instrument, Query Loudness, Query Musical Note, Query Position Absolute, Query Position Global, Query Position Instrument, Count). Y axis: frequency, from 0 to 0.2. One bar per set (Training, Validation, Test).]

Figure 3.3 Distribution of question types. The color represents the set type.

[Figure 3.4: bar chart titled "Template Type Distribution". X axis: template type (Compare Integer, Exist, Query Brightness, Query Instrument, Query Loudness, Query Musical Note, Query Position Absolute, Query Position Global, Query Position Instrument, Count). Y axis: frequency, from 0 to 0.2.]

Figure 3.4 Distribution of template types. The same templates are used to generate the questions and answers for the training, validation and test sets.


[Figure 3.5: four bar charts, one bar per set (Training, Validation, Test). Instrument Distribution (Cello, Clarinet, Flute, Trumpet, Violin), Brightness Distribution (Bright, Dark, None), Loudness Distribution (Loud, Quiet) and Note Distribution (A, A#, B, C, C#, D, D#, E, F, F#, G, G#). Y axis: frequency.]

Figure 3.5 Distribution of sound attributes in the scenes. The color represents the set type. Sounds with a "None" brightness have an ambiguous brightness which could not be classified as "Bright" or "Dark".

CHAPTER 4

ACOUSTIC NEURAL NETWORK

Authors and affiliation:

Jerome Abdelnour: master's student, Université de Sherbrooke, Faculty of Engineering, Department of Electrical and Computer Engineering.

Giampiero Salvi: professor, Norwegian University of Science and Technology (NTNU), Department of Electronic Systems.

Jean Rouat: professor, Université de Sherbrooke, Faculty of Engineering, Department of Electrical and Computer Engineering.

Submission date: April 28, 2021

Acceptance status: Pending

Submission code: TPAMI-2021-04-0616

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence

French title: NAAQA : une architecture neuronale pour la tâche Question/Réponse sur scènes auditives.

Contribution to the document: This article contributes to the thesis by introducing a new version of the CLEAR dataset that is more difficult than the previous one. The article offers a detailed analysis of the performance of the FiLM system on the Acoustic Question Answering task. It also analyzes in detail the effectiveness of coordinate maps (CoordConv) when applied in an acoustic context. Finally, the article introduces a new neural network model (NAAQA) adapted to the acoustic context of the task, along with an analysis of the results.

Summary: We introduce a new version of the CLEAR dataset composed of variable-duration scenes and built from a greater variety of elementary sounds than the previous version. These characteristics make the task defined by the CLEAR dataset much more difficult. We perform an in-depth analysis of a neural network initially developed in a visual context. We introduce a new neural network (NAAQA) that uses 1D time and frequency convolutions to process the 2D spectro-temporal representations of the acoustic scenes. NAAQA reaches an accuracy of 91.6% on the CLEAR dataset while being about 7 times less complex than the previously studied FiLM network. We also analyze the effectiveness of coordinate maps in this acoustic context. We show that time coordinate maps improve the temporal localization capability of the network, which increases accuracy by 17 percentage points.

4.1 NAAQA : A Neural Architecture for Acoustic Question Answering

4.2 Abstract

The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA that emphasizes the specific challenges of acoustic inputs, e.g. variable duration scenes. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of time and frequency 1D convolutions to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. NAAQA achieves 91.6% accuracy on the AQA task with ∼7 times fewer parameters than the previously explored VQA model. We provide a detailed analysis of the results for the different question types. We also study the effectiveness of coordinate maps in this acoustic context and show that time coordinate maps augment temporal localization capabilities, which enhances the performance of the network by ∼17 percentage points.

4.3 Introduction

Question answering (QA) tasks are examples of constrained and limited scenarios for research in reasoning. The agent's task in QA is to answer questions based on context. Text-based QA uses text corpora as context [40, 45, 81, 92, 101, 102]. In visual question answering (VQA) the questions are related to a scene depicted in still images [5, 7, 30, 31, 48, 111, 114]. Finally, video question answering attempts to use both the visual and acoustic information in video material as context [17, 22, 51, 97, 104, 106]. However, the use of the acoustic channel is usually limited to linguistic information that is expressed in text form, either with manual transcriptions (e.g. subtitles) or by automatic speech recognition [112].

In most studies, reasoning is supported by spatial and symbolic representations in the visual domain [20, 69]. However, reasoning and logic relationships can also be studied via representations of sounds [18]. Including the auditory modality in studies on reasoning is of particular interest for research in artificial intelligence, but also has implications in real world applications [19]. In [77], audio was used in combination with video and depth information to recognize human activities. It was shown that sound can be more discriminative than the corresponding visual cues. As an example, imagine using an espresso machine. Besides possibly a display, all information about the different phases of producing coffee, from grinding the beans, to pressing the powder into the holder and brewing the coffee with high pressure hot water, is conveyed by the sounds. Detection of abnormalities in machinery where the moving parts are hidden, or the detection of threatening or hazardous events, are other examples of the importance of the audio information for cognitive systems.

The audio modality provides important information that has not yet been exploited in QA reasoning. Audio allows QA systems to answer relevant questions more accurately, or even to answer questions that are not approachable from the visual domain alone. In [2] we introduced, to our knowledge, the first task in acoustic question answering. The agent's goal was to answer questions related to acoustic scenes composed of a sequence of elementary sounds. The questions foster reasoning on the properties of the elementary sounds and their relative and absolute position in the scene. As a benchmark for this task, we created the CLEAR dataset.

The modeling in [2] was adapted from a similar task in VQA. However, the structure of acoustic data is fundamentally different from that of visual data. This is illustrated in [110], where two standard data sets in computer vision (MNIST) and speech technology (Google Speech Commands) are compared via t-SNE [65].

In this paper we address the specific nature of acoustic observations in the context of Acoustic Question Answering (AQA) by introducing a number of extensions and improvements both to the data and to the methods to solve the AQA task. The main contributions are summarized as follows: we propose

– a neural architecture that leverages specific properties of acoustic inputs,
– a more challenging version of the CLEAR dataset which comprises scenes of variable duration (compared to fixed duration in [2]),
– an analysis of the relevance of coordinate maps in an acoustic context, and
– a detailed comparison of the performance of the Visual FiLM [76] network initially designed for the VQA task with the substantially smaller proposed NAAQA model.

The rest of the paper is organized as follows: Section 4.4 reports on recent related work, Section 4.5 describes our dataset, Section 4.6 presents the QA models we have tested, Section 4.7 gives details on the experimental settings, Section 4.8 discusses the results and, finally, Section 4.9 concludes the paper.

Figure 4.1 Overview of the CLEAR dataset generation process. Highlighted in red: 10 randomly sampled sounds from the elementary sounds bank are assembled to create an acoustic scene. The attributes of each elementary sound are depicted in blue. The question template (orange) and the elementary sounds attributes are combined to instantiate a question. The answer is generated by applying each step of the question functional program (purple) on the acoustic scene definition (blue). The impact of the reverberations can be seen in the changes of the signals' envelopes.


4.4 Related Work

4.4.1 Text-Based Question Answering

The question answering task was introduced as part of the Text Retrieval Conference [101]. In text-based question answering, both the questions and the context are expressed in text form. Answering these questions can often be approached as a pattern matching problem in the sense that the information can be retrieved almost verbatim in the text (e.g. [40, 45, 81, 92]).

4.4.2 Visual Question Answering (VQA)

Visual Question Answering aims to answer questions based on a visual scene. Several VQA datasets are available to the scientific community [7, 30, 32, 33, 43, 48, 67, 103, 109, 114]. However, designing an unbiased dataset is non-trivial. Agrawal et al. [5] observed that the type of questions has a strong impact on the results of neural network based systems, which motivated research to reduce the bias in VQA datasets [6, 24, 31, 48, 66, 111]. Gathering good labeled data is also non-trivial, which led Zhang et al. and Geman et al. [31, 111] to constrain their work to yes/no questions. To alleviate this problem, Johnson et al. [48] proposed the use of synthetic data for both questions and visual scenes. The resulting CLEVR dataset has been extensively used to evaluate neural networks for VQA applications [8, 41, 42, 76, 100, 107], which helped foster research on VQA. To create visual scenes, the authors automated a 3D modelling software. This allows for an unlimited supply of labeled data, eliminating the time and effort needed for manual annotations. For the questions, they first manually designed semantic representations for each type of question. These representations describe the reasoning steps needed to answer a question (e.g. "find all cubes | that are red | and metallic"). The semantic representations are then instantiated based on the visual scene composition, thus creating a question and an answer for a given scene. This gives complete control over the labelling process.

4.4.3 Acoustic Question Answering (AQA)

To the best of our knowledge, we were the first to introduce the AQA task in [2]. Fayek et al. [29] also presented reflections on acoustic reasoning via question answering. They proposed a network comprising 2 sets of FiLM layers [76] per Resblock to modulate the activation maps based on both textual and audio feature contexts. This second set of FiLM layers makes this model much more complex in terms of number of parameters than the initial network proposed by Perez et al. [76]. As one of our goals is to solve the CLEAR task with a minimal number of parameters, this model is out of scope for this study.


Using synthetic data in the design of a dataset has substantial advantages. Data can be automatically annotated, which saves time and reduces complexity. The number of training examples that can be generated is only limited by the available computational resources. Controlling the generation process gives a complete understanding of the properties and relations of the objects in a scene. This understanding can be leveraged to reduce bias in the dataset and to generate complex questions and their corresponding answers. The CLEAR dataset is generated using semi-synthetic data with a methodology similar to [48].

4.4.4 Convolutional Neural Networks on Audio

Convolutional neural networks (CNN) have dominated the visual domain in recent years. More recently, they have also been applied to a number of problems in the acoustic domain such as acoustic scene classification [4, 12, 36, 60], music genre classification [60, 70], instrument classification [80, 113], sound event classification and localization [14] and speech recognition [60]. Some authors [36, 70, 80, 113] use intermediate representations such as STFT [71], MFCC [64] or CQT [15] spectrograms while others [4, 60] work directly with the raw audio signal.

Square convolutional and pooling kernels are often used to solve visual tasks such as VQA, visual scene classification and object recognition [37, 54, 91, 96]. Some authors [12, 38, 55] have successfully used visually motivated CNNs with square filters to solve audio related tasks. Time-frequency representations of audio signals are however structured very differently from visual representations. Pons et al. [78] explore the performance of different structures of convolutional kernels when working with audio signals. They propose the use of 1D convolution kernels to capture time-specific or frequency-specific features. They demonstrate that combining 1D time and 1D frequency filters can reach an accuracy similar to that of 2D convolutions while using far fewer parameters. They also explore rectangular kernels which capture both time and frequency features at different scales.

In this work, a more challenging version of the CLEAR dataset which comprises more elementary sounds and scenes of variable duration (compared to fixed duration in [2]) is introduced. The performance of a network initially designed for the VQA task (Visual FiLM) [76] is evaluated on the AQA task, and modifications to the architecture are proposed to leverage specific properties of acoustic inputs. The effectiveness of using coordinate maps in an acoustic context is also analyzed. Finally, the architecture is optimized to reduce complexity in terms of number of parameters.


| Question type | Example | Possible answers | # |
| Note | What is the note played by the flute that is after the loud bright D note? | A, A#, B, C, C#, D, D#, E, F, F#, G, G# | 12 |
| Instrument | What instrument plays a dark quiet sound in the end of the scene? | bass, cello, clarinet, flute, trumpet, violin | 5 |
| Brightness | What is the brightness of the first clarinet sound? | bright, dark | 2 |
| Loudness | What is the loudness of the violin playing after the third trumpet? | quiet, loud | 2 |
| Absolute Position | What is the position of the A# note playing before the bright B note? | first, second ... fifteenth | 15 |
| Relative Position | Among the trumpet sounds which one is a F? | (same answers as Absolute Position) | |
| Global Position | In what part of the scene is the clarinet playing a loud G note? | beginning, middle, end (of the scene) | 3 |
| Counting | How many other sounds have the same brightness as the third violin? | 0, 1 ... 15 | 16 |
| Counting Instruments | How many different instruments are playing before the second trumpet? | (same answers as Counting) | |
| Exist | Is there a bass playing a bright C# note? | yes, no | 2 |
| Counting comparison | Is there an equal number of loud cello sounds and quiet clarinet sounds? | (same answers as Exist) | |
| Total | | | 57 |

Table 4.1 Types of questions with examples and possible answers. The variable parts of each question are emphasized in bold italics. The number of possible answers per question type is reported in the last column. Certain questions have the same possible answers, the meaning of which depends on the type of question.

4.5 Variable scene duration dataset

In this section, we briefly review the principal aspects of the dataset and present an updated version for a task that emphasizes the challenges with AQA as opposed to VQA. A graphical overview of the CLEAR dataset generation process is depicted in Figure 4.1. Each record in the dataset is a unique combination of a scene, a question and an answer.

To build acoustic scenes, we prepared a bank of elementary sounds composed of real musical instrument recordings extracted from the Good-Sounds [83] dataset. The bank comprises 135 unique recordings (compared to 56 in previous work [2]) sampled at 48 kHz, including 6 different instruments (bass, cello, clarinet, flute, trumpet and violin), 12 notes (chromatic scale) and 3 octaves. The acoustic scenes are built by concatenating between 5 and 15 randomly chosen sounds from the elementary sound bank into a sequence (previous work [2] comprised fixed duration scenes). Silence segments of random duration are added in-between elementary sounds. The acoustic scenes are then corrupted by filtering them to simulate room reverberation and by adding white uncorrelated uniform noise. Both additive noise and reverberation vary from scene to scene, which reduces overfitting and provides a more natural acoustic scene.
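The assembly step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the generation script of the dataset: the function name `build_scene`, the maximum silence duration and the noise amplitude are hypothetical, sounds are plain lists of samples, and the reverberation filter is left out.

```python
import random

SAMPLE_RATE = 48_000  # Hz, as stated in the text


def build_scene(sound_bank, rng, min_sounds=5, max_sounds=15,
                max_silence_s=0.5, noise_amplitude=0.01):
    """Concatenate randomly chosen sounds separated by random silences,
    then add uniform white noise. Reverberation is omitted in this sketch.

    max_silence_s and noise_amplitude are hypothetical values.
    """
    n_sounds = rng.randint(min_sounds, max_sounds)  # bounds are inclusive
    scene = []
    for i in range(n_sounds):
        if i > 0:
            # Silence of random duration between two consecutive sounds.
            n_silence = rng.randint(0, int(max_silence_s * SAMPLE_RATE))
            scene.extend([0.0] * n_silence)
        scene.extend(rng.choice(sound_bank))
    # Additive uniform white noise, drawn independently for every sample.
    noisy = [s + rng.uniform(-noise_amplitude, noise_amplitude) for s in scene]
    return n_sounds, noisy


if __name__ == "__main__":
    # Toy bank of two constant "sounds"; real sounds would be recordings.
    bank = [[0.5] * 10, [-0.5] * 20]
    count, samples = build_scene(bank, random.Random(0), max_silence_s=0.001)
    print(count, len(samples))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which matters when the same scene must be regenerated for several questions.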

Each elementary sound in a scene is characterized by an n-tuple: [Instrument, Brightness, Loudness, Musical Note, Duration, Absolute position in scene, Relative position in scene, Global position]. The Brightness property is computed by using the timbralmodels [74] library. A threshold is used to define the label of the sound (Dark or Bright). The Loudness labels are assigned based on the perceptual loudness as defined by the ITU-R BS.1770-4 international normalization standard [98]. Again, a threshold is used to determine if the sound is Quiet or Loud. All attribute values are listed in Table 4.1 as possible answers to the questions explained below.

For each scene, a number of questions is generated using CLEVR-like [48] templates. A template defines the reasoning steps required to answer a question based on the composition of the scene (e.g. "find all instances of violin | that plays before trumpet | that is the loudest"). 942 templates were designed for this AQA task. Not all template instantiations result in a valid question. The generated questions are filtered to remove ill-posed questions similarly to [48]. Table 4.1 shows examples of questions with their answers.
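To make the idea of a template as a chain of reasoning steps concrete, here is a minimal sketch that answers the "Note" example of the question table on a toy scene. The step names (`filter_by`, `relate_after`, `query`) and the dictionary layout of the scene are hypothetical and are not taken from the dataset generation code.

```python
# Toy scene definition: one dictionary of attributes per elementary sound,
# listed in order of appearance. The layout is an assumption of this sketch.
scene = [
    {"instrument": "violin", "note": "D", "brightness": "bright", "loudness": "loud"},
    {"instrument": "flute", "note": "A#", "brightness": "dark", "loudness": "quiet"},
    {"instrument": "cello", "note": "F", "brightness": "dark", "loudness": "loud"},
]


def filter_by(indices, scene, **attrs):
    """Keep the sounds whose attributes match all the requested values."""
    return [i for i in indices
            if all(scene[i][key] == value for key, value in attrs.items())]


def relate_after(indices, scene):
    """Select every sound that plays after the single selected sound."""
    (ref,) = indices  # fails when the selection is not unique (ill-posed question)
    return list(range(ref + 1, len(scene)))


def query(indices, scene, attribute):
    """Return one attribute of the single remaining sound."""
    (idx,) = indices
    return scene[idx][attribute]


def answer_note_question(scene):
    """'What is the note played by the flute that is after the loud bright D note?'"""
    selection = list(range(len(scene)))
    selection = filter_by(selection, scene,
                          loudness="loud", brightness="bright", note="D")
    selection = relate_after(selection, scene)
    selection = filter_by(selection, scene, instrument="flute")
    return query(selection, scene, "note")


if __name__ == "__main__":
    print(answer_note_question(scene))
```

The unpacking in `relate_after` and `query` raises an error when a step does not select exactly one sound, which is one simple way to detect instantiations that should be filtered out as ill-posed.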

The a priori probability of answering correctly with no information about the question or the scene, and assuming a uniform distribution of classes, is 1/57 ≈ 1.75%. This probability is higher if we introduce information about the question. For example, if we know that the question is of the type Exist or Counting comparison, there are only two possible answers (yes or no) and the probability of answering correctly by chance is 0.5. The majority class accuracy (always answering the most common answer) is 7.5%. Some statistics on the CLEAR dataset are presented in Table 4.2.
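The chance levels quoted above follow directly from the answer counts reported in the table of question types. The short sketch below redoes that arithmetic; the dictionary keys are informal labels chosen for this illustration.

```python
# Number of possible answers per group of question types, as reported
# in the last column of the question-type table.
answers_per_group = {
    "note": 12,
    "instrument": 5,
    "brightness": 2,
    "loudness": 2,
    "position": 15,
    "global position": 3,
    "count": 16,
    "yes/no": 2,
}

# Chance level when nothing is known about the question or the scene.
total_answers = sum(answers_per_group.values())
uniform_chance = 1 / total_answers

# Chance level when the type of question (and so its answer set) is known.
chance_given_type = {group: 1 / n for group, n in answers_per_group.items()}

if __name__ == "__main__":
    print(total_answers, f"{uniform_chance:.2%}")
    for group, chance in chance_given_type.items():
        print(f"{group}: {chance:.2%}")
```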

The generation process was built with extensibility in mind. Different versions of the dataset with fewer or more objects per scene can be generated by using different parameters for the generation script. It is also possible to modify the elementary sound bank to generate datasets for AQA in other domains, speech or environmental sounds, for example. The code for generating the dataset is available on GitHub 1.

4.6 Method

Our goal is to devise a system that takes as inputs a variable length text string representing a question and a variable duration audio signal representing an acoustic scene. The system's task is to learn to interpret the question in the context of the acoustic scene and to generate an answer. We first describe the Visual FiLM architecture [76] that is used as a baseline, then the NAAQA model and finally how coordinate maps [63] can be applied in an acoustic context. Both NAAQA and Visual FiLM share an overall common architecture which is depicted in Figure 4.2.

1. https://github.com/IGLU-CHISTERA/CLEAR-dataset-generation


| Dataset global statistics | |
| Nb of questions | 200 000 |
| Nb of scenes | 50 000 |
| Nb of unique vocabulary words | 91 |

| Dataset detailed statistics | Mean | Min | Max |
| Nb of sounds per scene | 10 | 5 | 15 |
| Elementary sound duration | 0.85 s | 0.69 s | 1.11 s |
| Scene duration | 10.69 s | 4.46 s | 17.82 s |
| Nb of words per question | 17 | 6 | 28 |
| Nb of unique words per question | 12 | 5 | 19 |

Table 4.2 Dataset statistics

4.6.1 Baseline model : Visual FiLM

The FiLM architecture [76] is inspired by Conditional Batch Normalization architectures [44]. It achieved state of the art results on the CLEVR [48] VQA task at the time of publication. We refer to this neural network as Visual FiLM or baseline model in subsequent sections. The network takes a visual scene and a text-based question as inputs and predicts an answer to the question for the given scene. The question goes through the text-processing module which uses G unidirectional gated recurrent units (GRUs) to extract context from the text input (yellow area in Figure 4.2). The visual scene is processed by the convolutional module (blue area in the figure). The first step of this module is feature extraction (orange box), performed by a ResNet101 model [37] pre-trained on ImageNet [88]. The extracted features are processed by a convolutional layer with batch normalization [44] and ReLU [46] activation followed by J Resblocks illustrated in detail in the red area in the figure. Unless otherwise specified, batch normalization and ReLU activation functions are applied to all convolutional layers. Each Resblock j comprises convolutional layers with M filters that are linearly modulated by FiLM layers through the two M × 1 vectors βj (additive) and γj (multiplicative). This modulation emphasizes the most important feature maps and inhibits the irrelevant maps given the context of the question. βj and γj are learned via fully connected layers using the text embeddings extracted by the text processing module as inputs (purple area in the figure). The affine transformation in the batch normalization before the FiLM layer is deactivated. The FiLM layer applies its own affine transformation using the learned βj and γj to modulate features. Several Resblocks can be stacked to increase the depth of the model, as illustrated in Figure 4.2. Finally, the classifier module is composed of a 1 × 1 convolutional layer [62] with C filters followed by max pooling and a fully connected layer with H hidden units and an output size O equal to the number of possible answers (gray in Figure 4.2). A softmax layer predicts the probabilities of the answers.

In order to use the Visual FiLM as a baseline for our experiments, we extract a 2D spectro-temporal representation of the acoustic scenes as depicted at the bottom of Figure 4.2. This is fed to the model as if it were an image, which is the simplest way to adapt the unmodified visual architecture to acoustic data.
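The linear modulation performed by a FiLM layer can be illustrated with a minimal sketch in plain Python. It assumes that each of the M feature maps is a 2D list and that γj and βj have already been predicted from the question; the function name `film` is hypothetical and the sketch is not the implementation used in the experiments.

```python
def film(feature_maps, gamma, beta):
    """Scale and shift each feature map, then apply ReLU.

    feature_maps: list of M feature maps, each a 2D list of floats
    gamma, beta:  lists of M floats (multiplicative and additive terms),
                  assumed here to be already predicted from the question
    """
    assert len(feature_maps) == len(gamma) == len(beta)
    modulated = []
    for fmap, g, b in zip(feature_maps, gamma, beta):
        # The same (g, b) pair is applied to every position of a given map.
        modulated.append([[max(0.0, g * v + b) for v in row] for row in fmap])
    return modulated


if __name__ == "__main__":
    maps = [[[1.0, -2.0]], [[3.0, 4.0]]]  # M = 2 maps of shape 1 x 2
    # The first map is amplified, the second one is switched off entirely.
    print(film(maps, gamma=[2.0, 0.0], beta=[0.5, -1.0]))
```

A multiplicative term close to zero combined with a negative additive term silences a whole feature map after the ReLU, which is how a question can inhibit maps that are irrelevant to it.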

4.6.2 NAAQA : Adapting feature extractors to acoustic inputs

As in Visual FiLM, the first step in the NAAQA model is feature extraction (orange box in Figure 4.2). Because acoustic signals present unique properties, two feature extraction modules that are specifically tailored to modeling sounds are introduced:

– the Parallel feature extractor (Figure 4.3a) processes time and frequency features independently in parallel pipelines.

– the Interleaved feature extractor (Figure 4.3b) captures time and frequency features in a single convolutional pipeline.

In both cases, the feature extractor is trained end-to-end with the rest of the network and uses a combination of 1D convolutional filters to process a 2D spectro-temporal representation. The 1D filters process the time and frequency axes independently, as opposed to the 2D filters typically used in image processing.

The Parallel feature extractor (Figure 4.3a) captures time and frequency features in parallel pipelines. The frequency pipeline (green in the figure) comprises a series of K frequency blocks. Each block is composed of a 1D convolution with Nk 3 × 1 kernels and 2 × 1 strides followed by a 1 × 2 max pooling. With a stride larger than 1 × 1, the convolution operation downsamples the frequency axis and the pooling operation downsamples the time axis. This downsampling strategy allows features in both parallel pipelines to be of the same dimensions. The time pipeline (yellow in the figure) is the same as the frequency pipeline except that the convolutional kernel operates along the time dimension and the pooling along the frequency dimension. The convolution kernel is 1 × 3 and the pooling kernel is 2 × 1. The activation maps of both pipelines are concatenated channel-wise and a representation combining both the time and frequency features is created using a 1 × 1 convolution [62] with P filters and a stride of one. The feature maps dimensionality is either compressed or expanded depending on the number of filters P in the 1 × 1 convolution. The 1 × 1 convolution can also be removed, thus leaving it up to the next 3 × 3 convolution to fuse the time and frequency features.
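The claim that both pipelines end up with feature maps of the same dimensions can be checked with simple shape arithmetic. The sketch below makes several assumptions that the text does not state: a padding of 1 on the convolved axis, the usual floor rule for output sizes, a pooling stride equal to the pooling kernel, and a hypothetical input of 256 frequency bins by 400 time frames.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one axis.

    A padding of 1 is an assumption of this sketch.
    """
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Output length of a max pooling whose stride equals its kernel size."""
    return size // kernel


def frequency_pipeline_shape(n_freq, n_time, n_blocks):
    """K frequency blocks: 3x1 convolution (stride 2x1), then 1x2 pooling."""
    for _ in range(n_blocks):
        n_freq = conv_out(n_freq)  # the convolution downsamples frequency
        n_time = pool_out(n_time)  # the pooling downsamples time
    return n_freq, n_time


def time_pipeline_shape(n_freq, n_time, n_blocks):
    """K time blocks: 1x3 convolution (stride 1x2), then 2x1 pooling."""
    for _ in range(n_blocks):
        n_time = conv_out(n_time)  # the convolution downsamples time
        n_freq = pool_out(n_freq)  # the pooling downsamples frequency
    return n_freq, n_time


if __name__ == "__main__":
    # Hypothetical input: 256 frequency bins by 400 time frames, K = 3.
    print(frequency_pipeline_shape(256, 400, 3))
    print(time_pipeline_shape(256, 400, 3))
```

Under these assumptions the two pipelines agree whenever the sizes stay even from block to block; with odd sizes the convolution and the pooling round differently, so an implementation would need matching padding before the channel-wise concatenation.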

[Figure 4.2: block diagram. Components shown: text-processing module (question embedding, G GRU units, hidden states, one Linear G×2M layer per Resblock producing βj and γj); convolutional module (feature extractor, M Conv 3×3 with BN & ReLU, FiLMed Resblocks 1 to J); details of a FiLMed Resblock (M Conv 3×3, ReLU, BatchNorm with no affine, FiLM, residual addition); classifier module (C Conv 1×1 with BN & ReLU, global max pooling, Linear C×H, Linear H×O). Example question: "What instrument plays a B note before the trumpet that plays a loud F# note?"; example answer: "Cello".]

Figure 4.2 Common Architecture. Two inputs: a spectro-temporal representation of an acoustic scene and a textual question. The spectro-temporal representation goes through a feature extractor (Parallel feature extractor detailed in Section 4.6.2 for NAAQA and ResNet101 pretrained on ImageNet for Visual FiLM) and then a series of J Resblocks that are linearly modulated by βj and γj (learned from the question input) via FiLM layers. Coordinate maps are inserted before the convolution operations drawn with a pink border. The output is a probability distribution over the possible answers.

[Figure 4.3a: input spectrogram, Frequency Blocks 1 to K and Time Blocks 1 to K in parallel, channel-wise concatenation, P Conv 1×1 with stride 1×1. Frequency block: Nk Conv 3×1 with stride 2×1, then max pooling 1×2. Time block: ends with max pooling 2×1.]

(a) Parallel feature extraction. The input spectrogram is processed by 2 parallel pipelines. The first pipeline (in green) captures frequency features using a series of K 1D convolutions with Nk filters and a stride of 2 × 1. Since the stride is larger than 1 × 1, each convolution downsamples the frequency axis. The 1 × 2 max pooling then downsamples the time axis. The second pipeline (in yellow) captures time features using the same structure with transposed filter size. Features from both pipelines are concatenated and fused using a 1 × 1 convolution with P filters to create a joint time-frequency representation.

[Figure 4.3b: input spectrogram, Interleaved Blocks 1 to K, P Conv 1×1 with stride 1×1.]

(b) Interleaved feature extraction. 1D time (in yellow) and frequency (in green) convolutions are applied alternately on the input spectrogram, building a time-frequency representation after each block. The order of the convolutions in each block can be reversed. The extractor is composed of K blocks where each convolution has Nk filters, followed by a 1 × 1 convolution with P filters.

Figure 4.3 Acoustic feature extraction

The Interleaved feature extractor (Figure 4.3b) processes the input spectrogram in a single pipeline composed of a series of K interleaved blocks (purple in the figure). Each block comprises a 1 × 3 time convolution with Nk filters and stride 1 × 2 followed by a 3 × 1 frequency convolution with Nk filters and stride 2 × 1. A 1 × 1 convolution with P filters processes the output of the last block to either compress or expand its dimensionality. As an alternative configuration, the order of the convolution operations in each block can be reversed so that the block first operates along the frequency axis and then along the time axis. Compared to the Parallel feature extractor, learned time-frequency representations are created after each block instead of only at the end of the pipeline.

For both extractors, the convolutions in the first block comprise N1 convolutional filters and the number of filters is doubled after each block. Using more blocks K results in more downsampling of the feature maps, which brings down the computational cost of the model.
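The doubling rule gives Nk = N1 · 2^(k−1) filters in block k, while each block halves both axes. A small sketch of that bookkeeping follows; the value N1 = 32 is a hypothetical choice made only for the example, and the exact halving assumes sizes that remain even.

```python
def filters_per_block(n_first, n_blocks):
    """Number of filters N_k in each block when N_1 doubles after each block."""
    return [n_first * 2 ** k for k in range(n_blocks)]


def downsampling_factor(n_blocks):
    """Factor by which each axis shrinks after K blocks that each halve it."""
    return 2 ** n_blocks


if __name__ == "__main__":
    # Hypothetical configuration: N_1 = 32 filters and K = 3 blocks.
    print(filters_per_block(32, 3))
    print(downsampling_factor(3))
```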

4.6.3 Coordinate maps with acoustic inputs

The baseline model makes use of the CoordConv [63] technique when tackling the Visual Question Answering task (pink border boxes in Figure 4.2). This technique provides a reference coordinate system that is consistent across all input features. It comprises 2 feature maps filled with (constant, untrained) coordinate information that are concatenated channel-wise to the input features of a convolutional layer. The first concatenated map is a 2D matrix whose values depend only on the row index: they start from -1 on the first row and increase linearly up to 1 on the last row, so that every column is identical. The second map is equivalent to the first except that the values increase linearly from the first column to the last and every row is identical. These coordinate maps help the convolutional layer learn localization features, at least in the visual domain. Coordinate maps can be inserted before any convolution operation at any depth of the neural network.

In the visual domain both axes of an image encode spatial information. With spectro-temporal representations, the Y axis corresponds to frequency and the X axis corresponds to time. In this context, we call the first map the frequency coordinate map and the second map the time coordinate map. Koutini et al. [53] proposed Frequency-Aware convolutions, which are equivalent to concatenating only a frequency coordinate map to the input of a convolution layer. In this work, the performance of both frequency and time coordinate maps is investigated. All spectro-temporal representations in the CLEAR dataset have the same range for the frequency axis but the range for the time axis varies depending on the duration of the acoustic scenes. Also, there are questions directly related to time localization whereas frequency localization is more related to pattern recognition. We hypothesize that time coordinate maps might have a larger impact on performance because they provide a relative time scale that the model can use to enhance its temporal localization capabilities.
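As an illustration, the two coordinate maps can be built as follows. This is a minimal sketch in plain Python, assuming that rows index frequency and columns index time; the helper names (`linspace`, `coordinate_maps`, `concat_channelwise`) are hypothetical.

```python
def linspace(start, stop, n):
    """n evenly spaced values from start to stop, both included."""
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def coordinate_maps(n_freq, n_time):
    """Return (frequency_map, time_map), each with n_freq rows and n_time columns."""
    freq_axis = linspace(-1.0, 1.0, n_freq)
    time_axis = linspace(-1.0, 1.0, n_time)
    # Frequency map: the value depends only on the row (frequency) index.
    frequency_map = [[freq_axis[row]] * n_time for row in range(n_freq)]
    # Time map: the value depends only on the column (time) index.
    time_map = [list(time_axis) for _ in range(n_freq)]
    return frequency_map, time_map


def concat_channelwise(features, n_freq, n_time):
    """Append both coordinate maps to a list of feature maps (channels)."""
    frequency_map, time_map = coordinate_maps(n_freq, n_time)
    return features + [frequency_map, time_map]


if __name__ == "__main__":
    freq_map, time_map = coordinate_maps(3, 5)
    print(freq_map)
    print(time_map)
```

Because the maps are constant and untrained, they add two input channels to the convolution that follows without adding any parameters of their own.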


4.7 Experiments & Results

We first describe the preprocessing of the acoustic signals and the experimental conditions. We then study the impact of different feature extractors on the CLEAR task. We examine how the insertion of coordinate maps at different depths in the architecture affects performance. Finally, we perform an ablation study to minimize the complexity of NAAQA and we study the impact of the dataset size on the training process.

4.7.1 Acoustic Processing

The raw acoustic signal sampled at 48 kHz is transformed to create a 2D time-frequency representation where the frequency axis is log-scaled. After experimenting with different window sizes and time shifts, we found that spectrograms using a Hanning window of 512 samples (∼10.6 ms) and a time shift of 2048 samples (∼42.7 ms) gave the best results. Using a time shift larger than the window creates a spectro-temporal representation for which the time axis has been downsampled. This provides a cost-effective representation, given that the elementary sounds used to create the scenes are slowly varying in time. We refer to this downsampled spectrogram as spectrogram in the remaining parts of the paper. A 512-point FFT is used, yielding 256 frequency bins per window. N spectra are concatenated, where N is equal to the number of windows. Since the duration (and therefore N) is not constant, we pad the spectrogram time axis with zeros so that all spectrograms have the same dimensions. The power spectrum is normalized using the mean and standard deviation of the training data. This normalization scheme brings the mean close to 0 and the standard deviation close to 1, which has been shown to speed up convergence [57].
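The framing arithmetic above can be checked with a short sketch. The sampling rate, window size, time shift and FFT size come from the text; the way the number of windows N is computed (full windows only, no padding at the signal edges) and the helper names are assumptions made for illustration.

```python
SAMPLE_RATE = 48_000  # Hz, from the text
WINDOW = 512          # samples per Hanning window, from the text
HOP = 2048            # samples between window starts (time shift), from the text
N_FFT = 512           # FFT size, from the text


def samples_to_ms(n_samples, sample_rate=SAMPLE_RATE):
    """Convert a number of samples to a duration in milliseconds."""
    return 1000.0 * n_samples / sample_rate


def num_windows(n_samples, window=WINDOW, hop=HOP):
    """Number of full windows N, assuming no padding at the signal edges."""
    if n_samples < window:
        return 0
    return (n_samples - window) // hop + 1


def pad_time_axis(spectrogram, target_windows):
    """Zero-pad each frequency row on the right up to target_windows columns."""
    return [row + [0.0] * (target_windows - len(row)) for row in spectrogram]


window_ms = samples_to_ms(WINDOW)   # about 10.67 ms
hop_ms = samples_to_ms(HOP)         # about 42.67 ms
freq_bins = N_FFT // 2              # 256 bins per window, as stated in the text
n_one_second = num_windows(SAMPLE_RATE)  # windows in a 1-second signal
```

Because the time shift is four times the window length, three quarters of the samples are never analyzed, which is the downsampling of the time axis described above.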

4.7.2 Experimental conditions

The networks presented in subsequent sections are trained on the CLEAR dataset, which comprises 50 000 scenes and 4 questions per scene for a total of 200 000 records, of which 140 000 (70%) are used for training, 30 000 (15%) for validation and 30 000 (15%) for testing. All models were trained for a maximum of 40 epochs or until the validation loss stopped decreasing for 6 consecutive epochs. The Adam optimizer [52] with a learning rate of 3 × 10⁻⁴ was used to optimize the cross-entropy loss. L2 regularization with a weight decay of 5 × 10⁻⁶, dropout [94] with a drop probability of 0.25 and batch normalization [44] are used to regularize the learning process. A batch size of 128 is used for all experiments. The results are reported in terms of accuracy, that is, the percentage of correct answers out of the total. Since initialization of deep architectures has a profound impact on training convergence, we developed a Python library, torch-reproducible-block 2, to control the model initial conditions and design reproducible experiments. To ensure the robustness of

2. https://github.com/NECOTIS/torch-reproducible-block


Test accuracy by extractor type

Question type       Interleaved Time   Parallel       Resnet101      Interleaved Freq
                    (Fig 4.3b)         (Fig 4.3a)     (Baseline)     (Fig 4.3b)
Overall             90.9 ±0.21         89.8 ±0.30     77.9 ±0.14     59.6 ±12.47
Instrument          96.8 ±0.19         95.6 ±0.59     85.0 ±0.56     48.0 ±19.78
Note                93.8 ±0.47         92.3 ±1.00     67.7 ±1.54     34.2 ±20.30
Brightness          98.6 ±0.17         97.9 ±0.05     93.8 ±0.20     82.0 ±6.59
Loudness            98.2 ±0.21         97.7 ±0.24     93.6 ±0.46     80.3 ±8.13
Exist               88.6 ±0.24         87.3 ±0.28     81.6 ±0.46     70.7 ±6.16
Absolute Position   96.4 ±0.53         95.3 ±0.06     74.4 ±1.83     51.9 ±20.78
Global Position     96.9 ±0.56         96.5 ±0.12     83.8 ±3.76     73.9 ±11.36
Relative Position   70.5 ±1.40         59.4 ±2.79     56.4 ±1.70     43.9 ±4.99
Count               64.6 ±0.80         62.9 ±1.01     51.2 ±0.66     40.1 ±4.50
Count Comparison    64.0 ±0.39         63.0 ±0.62     64.1 ±0.64     55.6 ±1.53
Count Instruments   41.4 ±1.89         38.1 ±3.17     37.9 ±0.62     37.9 ±2.83

Table 4.3 Impact of different feature extractors. The initial configuration is used for the rest of the network (defined in Section 4.7.3). Test accuracy is reported by question type (examples for each question type are available in Table 4.1).

the results, each model is trained 5 times with 5 different random seeds 3. To reduce the effect of getting stuck in local minima, the accuracies presented in subsequent sections are the average of the 3 best runs.
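The reporting scheme (5 seeds, average of the 3 best runs) can be sketched as follows. The accuracy values are made up for illustration, the helper name is hypothetical, and the use of the sample standard deviation for the reported ± value is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev


def best_runs_summary(accuracies, keep=3):
    """Mean and sample standard deviation of the `keep` best runs."""
    best = sorted(accuracies, reverse=True)[:keep]
    return mean(best), stdev(best)


# Hypothetical test accuracies (%) for 5 random seeds; one run got stuck.
runs = [90.7, 91.1, 60.2, 90.9, 89.5]
avg, std = best_runs_summary(runs)
```

Keeping only the best runs discards the seed that failed to converge, which would otherwise dominate both the mean and the spread.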

4.7.3 Initial configuration

The initial configuration for the proposed model comprises G = 4096 GRU units, J = 4 Resblocks with M = 128 filters each and a classifier composed of a 1 × 1 convolution with C = 512 filters and H = 1024 hidden units in the fully connected layer. This configuration includes both time and frequency coordinate maps in each location highlighted in pink in Figure 4.2. The above hyper-parameters are also used for the baseline model described in Section 4.6.1.
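For reference, these hyper-parameters can be gathered in a small configuration object. This is only a sketch: the class and field names are hypothetical and are not taken from the NAAQA code base; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaaqaConfig:
    """Hypothetical container for the hyper-parameters named in the text."""
    gru_units: int           # G
    resblocks: int           # J
    resblock_filters: int    # M
    classifier_filters: int  # C
    classifier_hidden: int   # H
    coordinate_maps: str = "both"  # time and frequency maps at every location


INITIAL_CONFIG = NaaqaConfig(
    gru_units=4096,
    resblocks=4,
    resblock_filters=128,
    classifier_filters=512,
    classifier_hidden=1024,
)
```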

4.7.4 Feature Extraction

We investigated different feature extraction strategies with the initial configuration. Results are presented in Table 4.3.

The baseline uses a Resnet101 module [37] pre-trained on Imagenet [88] as feature extractor. This extractor expects a 3-channel input but the spectro-temporal representation of the acoustic scene comprises only 1 channel. To work around this constraint, the amplitudes of the frequency bins of the spectrogram are transformed into RGB values using the viridis matplotlib colormap 4, thus creating a 3-channel input. Spectrograms have a very different structure than visual scenes and, as expected, the pre-learnt knowledge gathered in a visual context does not transfer directly to the acoustic context. This configuration

3. https://github.com/NECOTIS/NAAQA
4. https://matplotlib.org/examples/color/colormaps_reference.html


does however achieve a surprisingly good overall accuracy of 77.9%. It performs quite well with questions related to brightness and loudness, with 93.8% and 93.6% respectively. One possible explanation is that both brightness and loudness are directly related to the distribution of colors along the frequency axis in the spectrograms. These features are efficiently encoded by the visual representations learned by the Resnet101 model. The baseline does not perform as well with questions related to absolute position (74.4%), which is somewhat surprising because these questions correspond to a similar pattern matching problem on the time axis instead of the frequency axis. It does however perform better with global position questions (83.8%). This is logical because this type of question refers to an approximate position (beginning, middle, end), which constrains the number of possible answers to 3 instead of 15. The baseline model has difficulties with questions related to notes (67.7%). This might be explained by the fact that a note is identified by its fundamental frequency and harmonics, which can be far apart on the frequency axis. Such widely spread structure is typical of acoustic signals, whereas visual models are trained to recognize localized features and therefore struggle to recognize notes.

The Parallel feature extractor (Figure 4.3a), which processes the 2D spectrogram (256×N×1) using independent time and frequency pipelines, reaches an overall accuracy of 89.8%. It performs well on all question types except relative position and count. These results show that building complex time and frequency features separately and fusing them at a later stage is a good strategy to learn acoustic features for this task.

The Interleaved feature extractor (Figure 4.3b) processes the 2D spectrogram with a series of 1D convolutions. The order of the 1D convolutions in each block has a significant impact on performance. When the first 1D convolution in each block is done along the frequency axis (Interleaved Freq.), the network reaches an overall accuracy of 59.6%. It performs especially poorly with questions related to notes (34.2%) and instruments (48.0%). The performance on position questions is also the lowest among all the extractors. A possible explanation relates to the nature of the sounds in the CLEAR dataset. Since these sounds are mostly sustained musical notes, their temporal evolution is rather stable. Therefore, the time dimension does not contain much information related to the identification of individual sounds. The time axis however contains important information relative to the scene as a whole, which is exploited by higher level layers (Resblocks) to model the connections between different sounds. Because its stride is greater than 1, each 1D convolution downsamples the axis on which it is applied. When the first convolution is a frequency convolution, the frequency axis of the resulting features is downsampled, which reduces the useful information that can be captured by the time convolution that follows.
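The effect of the convolution order on the intermediate feature map can be illustrated with the standard output-length formula for a strided convolution. This is a sketch only: the kernel size, stride and padding below are hypothetical values chosen for illustration, not the ones used in NAAQA, and the function names are invented.

```python
def conv_output_length(length, kernel, stride, padding=0):
    """Output length of a 1D convolution along one axis (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def interleaved_block_shapes(n_freq, n_time, first_axis,
                             kernel=3, stride=2, padding=1):
    """Return the (freq, time) shape after each of the two 1D convolutions.

    kernel, stride and padding are hypothetical values for illustration.
    """
    if first_axis not in ("freq", "time"):
        raise ValueError("first_axis must be 'freq' or 'time'")
    shape = [n_freq, n_time]
    order = [0, 1] if first_axis == "freq" else [1, 0]
    shapes = []
    for axis in order:
        # Each 1D convolution only downsamples the axis it is applied on.
        shape[axis] = conv_output_length(shape[axis], kernel, stride, padding)
        shapes.append(tuple(shape))
    return shapes


freq_first = interleaved_block_shapes(256, 100, first_axis="freq")
time_first = interleaved_block_shapes(256, 100, first_axis="time")
```

Both orders end with the same output shape, but in the frequency-first order the time convolution operates on an input whose frequency axis has already been halved, whereas in the time-first order it sees the full frequency resolution.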


            Coordinate maps                          Accuracy (%)
Extractor   1st Conv   Resblocks   Classifier   Train          Val            Test
–*          Both*      Both*       Both*        94.1 ±0.62     91.2 ±0.33     90.9 ±0.21
–           Both       –           –            93.3 ±0.18     91.0 ±0.22     90.9 ±0.18
–           –          Time        –            93.7 ±0.49     90.5 ±0.70     90.5 ±0.54
–           Time       –           –            93.6 ±0.81     90.7 ±0.92     90.4 ±0.74
–           –          Both        –            93.5 ±0.59     90.2 ±0.77     90.0 ±0.85
Time        –          –           –            93.6 ±1.08     90.0 ±0.95     89.8 ±1.12
Freq        –          –           –            84.6 ±0.96     76.1 ±3.97     75.7 ±3.74
–           –          –           Both         84.5 ±5.66     76.2 ±3.56     75.2 ±2.29
–           –          –           Freq         84.2 ±2.42     74.7 ±1.33     73.9 ±1.54
–           –          –           –            81.2 ±4.13     73.4 ±1.14     73.5 ±1.07
–           Freq       –           –            80.5 ±4.82     73.1 ±5.39     70.5 ±5.65
–           –          Freq        –            82.2 ±2.09     72.2 ±1.86     69.7 ±2.27
–           –          –           Time         70.9 ±17.56    63.7 ±13.95    61.8 ±12.63
Both        –          –           –            46.5 ±0.37     45.8 ±0.09     45.7 ±0.10

Table 4.4 Impact of the placement of Time and Frequency coordinate maps. All possible positions are illustrated by the pink border boxes in Figure 4.2. The value Both indicates that both Time and Frequency coordinate maps were inserted at the given position. The interleaved feature extractor is used with hyper-parameters from the initial configuration (defined in Section 4.7.3). The row marked with * corresponds to the placement in the initial configuration.

When the order of the convolutions is reversed (Interleaved Time), useful information is better captured and the network reaches the best overall accuracy with 90.9%. The most notable improvement is observed with questions related to relative position, with an increase of 11.1 percentage points compared to the Parallel extractor (70.5% vs 59.4%).

All configurations evaluated in this section struggle with the concept of counting, especially when asked to count the number of different instruments playing in a specific part of the scene. They also have difficulties with questions related to the relative position of the instruments. We further discuss these difficulties in Section 4.8.1. The Interleaved feature extractor is the one that performs the best and constitutes the basis of NAAQA in all subsequent experiments.

4.7.5 Coordinate Maps

We analyzed the impact of the placement of Time and Frequency coordinate maps at different depths in the network. The pink border boxes in Figure 4.2 show all the locations where coordinate maps can be inserted prior to convolution operations. In addition to these locations, the insertion of coordinate maps through the Interleaved feature extractor


Test accuracy by coordinate map type

Question type       Time & Freq    Time only      None           Freq only
Overall             90.9 ±0.18     90.4 ±0.74     73.5 ±1.07     70.5 ±5.65
Instrument          97.0 ±0.26     96.9 ±0.42     90.9 ±0.50     88.2 ±3.36
Note                94.0 ±0.14     93.4 ±0.82     84.3 ±0.73     80.1 ±5.40
Brightness          98.5 ±0.31     98.2 ±0.12     95.8 ±0.32     94.9 ±1.29
Loudness            97.9 ±0.15     98.1 ±0.49     95.7 ±0.54     93.7 ±1.69
Exist               88.8 ±0.32     88.6 ±0.60     84.0 ±1.06     81.0 ±3.24
Absolute Position   96.3 ±0.29     94.9 ±1.00     48.1 ±2.40     43.6 ±11.25
Global Position     97.5 ±0.06     96.7 ±0.90     78.3 ±3.11     74.4 ±8.31
Relative Position   67.1 ±5.47     69.2 ±4.84     48.9 ±4.23     49.9 ±5.85
Count               65.6 ±1.14     64.2 ±1.38     51.7 ±1.65     50.2 ±6.52
Count Comparison    62.7 ±1.09     63.6 ±1.02     61.3 ±0.63     60.4 ±2.40
Count Instruments   38.9 ±2.50     39.7 ±1.89     29.6 ±1.07     31.1 ±0.62

Table 4.5 Comparison of Time and Frequency coordinate maps. Coordinate maps are inserted in the first convolution after the feature extractor (the overall accuracies match the 1st Conv rows of Table 4.4). Test accuracy is reported by question type (examples for each question type are available in Table 4.1).

was also explored. All possible locations were evaluated via grid-search. For each location, we inserted either a Time coordinate map, a Frequency coordinate map or both. Results are detailed in Table 4.4. Coordinate maps have the biggest impact on performance when inserted in the first convolution after the feature extractor and in the Resblocks. This could be because the fusion of the textual and acoustic features, and therefore most of the reasoning, is performed in the Resblocks. The network might be using the additional localization information to inform the modulation of the convolutional feature maps based on the context of the question.

The impact of the kind of coordinate maps on each question type is further analyzed in Table 4.5. To facilitate comparison, all coordinate maps are inserted at the same location (in the first convolution after the feature extractor). When the network does not include any coordinate maps it reaches only 73.5%. Inserting a Frequency coordinate map does not increase performance (70.5%). As hypothesized, Time coordinate maps significantly increase the overall accuracy, to 90.4%. The accuracy increases for all question types but the gain is most notable with questions related to Absolute position, where the accuracy goes from 48.1% to 94.9%. This is a similar result to [63], where coordinate maps were shown to help convolutional networks locate objects in visual scenes. In the context of a spectro-temporal representation, this ability translates to enhanced temporal localization, which naturally helps to answer questions related to position. Questions related to Notes also


receive a significant boost from Time coordinate maps, going from 84.3% to 93.4%. The fact that Frequency coordinate maps have a negligible impact on questions related to Notes is rather surprising since notes are mostly defined by the frequency content at a given point in time. Overall accuracy is slightly increased when combining both Time and Frequency coordinate maps, reaching 90.9%.

4.7.6 Complexity reduction

One aim of this study is to determine the smallest configuration (in terms of number of parameters) that can achieve reasonable performance on this acoustic task. Most parameters in NAAQA are used for the feature modulation in the FiLM layers. The number of feature modulation parameters is proportional to the number of GRU text-processing units G. The number of Resblocks J dictates the number of FiLM layers in the model and therefore the number of modulation coefficients to compute. The number of convolutional filters M in each block also impacts the modulation calculation. The classifier module, which comprises a 1 × 1 convolution with C filters and a fully connected layer with H hidden units, accounts for almost 10% of the model parameters. Detailed results of the optimization process and values for all explored hyper-parameters (G, J, M, C, H) are reported in Appendix 4.10.

NAAQA's original configuration described in Section 4.7.3 comprises 5.61M parameters and achieves 90.9%. The most notable complexity reduction comes from the reduction of the number of GRU units G. With G = 512, the number of parameters is reduced to 1.94M and the accuracy increases to 91.1%, showing that similar performance can be achieved with almost 3 times fewer parameters. We found that a model comprising ∼19 times fewer parameters than the original configuration (0.30M parameters) can achieve better results than the baseline, with 84.7% vs 77.9%. The final NAAQA configuration comprises ∼7 times fewer parameters (0.81M) than the original configuration and achieves 91.5%. It is composed of an Interleaved feature extractor with K = 3 blocks and P = 32, G = 512 GRU units, J = 4 Resblocks with M = 64 filters, and a classifier module with C = 256 and H = 1024.
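The reduction factors quoted above follow directly from the parameter counts given in the text, as the short check below shows. The dictionary keys and the function name are labels chosen for this sketch.

```python
# Parameter counts in millions, as quoted in the text.
PARAMS_M = {
    "initial": 5.61,
    "G=512": 1.94,
    "smallest above baseline": 0.30,
    "final": 0.81,
}


def reduction_factor(initial, reduced):
    """How many times fewer parameters the reduced model has."""
    return initial / reduced


factors = {
    name: round(reduction_factor(PARAMS_M["initial"], count), 1)
    for name, count in PARAMS_M.items()
    if name != "initial"
}
```

The three factors come out at roughly 2.9, 18.7 and 6.9, consistent with "almost 3", "∼19" and "∼7" in the paragraph above.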

4.7.7 Impact of dataset composition

We investigated the impact of the dataset composition and, specifically, the importance of the number of questions per scene. In order to eliminate the effect of the overall size of the data, multiple versions of the dataset were generated by varying the number of scenes while keeping the total number of records fixed. We experiment with 10k, 20k, 50k, 100k, 200k and 400k scenes (S). The number of questions per scene for each configuration is


Figure 4.4 Test accuracy for NAAQA trained on different dataset sizes. The total number of records is either 100k, 200k or 400k. The x-axis is the number of scenes and the number of questions per scene used to build the dataset. The y-axis is the accuracy on the test set. The horizontal line is the mean accuracy. The dataset configuration highlighted in green is the configuration used for all previous experiments.

defined by D/S, where D is the total number of records. We also investigate the impact of the total dataset size by repeating this experiment with D = 100k, 200k and 400k records.

The network is trained on each of these sets using 70% of the data. To facilitate comparison, the model is evaluated on the same test set as in previous experiments. Results are in Figure 4.4.
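As a worked example of the relation between the total number of records D, the number of scenes S and the number of questions per scene, the sketch below enumerates the configurations for D = 400k. The function name is hypothetical; the scene counts are those listed in the text.

```python
def questions_per_scene(total_records, n_scenes):
    """Questions per scene for a dataset of D records built from S scenes."""
    return total_records / n_scenes


SCENE_COUNTS = [10_000, 20_000, 50_000, 100_000, 200_000, 400_000]

# Configurations for D = 400k records.
per_scene_400k = [questions_per_scene(400_000, s) for s in SCENE_COUNTS]

# Configuration used in the previous experiments: 200k records, 50k scenes.
reference = questions_per_scene(200_000, 50_000)
```

For a fixed D, doubling the number of scenes halves the number of questions asked about each scene, which is the trade-off explored in this experiment.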

The total number of training records is the most determining factor for the test accuracy. All experiments trained on 400k records perform better than the ones trained on 200k records, and the same holds for the ones trained on 200k records versus 100k records. This is unsurprising since more training examples often lead to better performance. We also expected that having more questions for each scene in the training set would increase the network performance. The results however show that the number of questions per scene did not substantially affect accuracy, especially with larger sets.


Figure 4.5 Test accuracy of NAAQA final configuration by question type and the number of relations in the question. The presence of before or after in a question constitutes a temporal relation. The accuracy is N/A for relative position and count compare since these types of question do not include relations. The hyper-parameters used are described at the end of Section 4.7.6.

4.8 Discussion

4.8.1 NAAQA detailed analysis

NAAQA performs well on this AQA task with 91.6% accuracy. It does however struggle with certain types of question, as shown in Figure 4.5. When asked to count the number of sounds with specific attributes, NAAQA reaches only 63.3% accuracy. It attains similar accuracy when asked to compare the number of instances of acoustic objects (more, fewer or equal number) with specific attributes (65.2%). This shows that the network learned to compare quantities but cannot achieve better accuracy because of its limited ability to count. The network can successfully recognize individual instruments in the scene (96.1%) but cannot count the number of different instruments playing in a given part of the scene (39.1%). This type of question is rather complex to answer. Consider for example the


question: "How many different instruments are playing after the third cello playing a C# note?". The model has to first find the cello playing the C# note, then identify all acoustic objects that are playing after this sound, determine which instruments are of the same family and finally count the number of different families. This requires the network to focus on many different acoustic objects. In comparison, the network only needs to focus on 1 object when answering a question like "What is the note played by the dark loud violin?". The same applies for instrument, brightness, loudness and exist questions. The difficulty of focusing on multiple acoustic objects in the scene might be an explanation as to why NAAQA struggles with counting. It could also explain why the model struggles with questions related to the relative position of an instrument (66.67%). To answer the question "Among the flute sounds, which one plays an F note?", the model must find all the flutes playing in the scene, determine which one plays an F note, count the number of flutes playing before it and translate the count to a position 5. This also requires the network to focus on multiple acoustic objects.

Certain questions include temporal relations between sounds (before and after), as exemplified in Table 4.1. The number of relations in a question correlates with the number of sounds that need to be focused on to answer a question. Figure 4.5 shows the accuracy based on the number of relations for each question type. Questions that require the network to focus on a single acoustic object (brightness, loudness, instrument, note, global position and absolute position) benefit from the presence of a relation in the question. This might be explained by the fact that the question contains more information about the scene, which helps to focus on the right acoustic object. However, the presence of a relation in questions that already require the network to focus on multiple objects (exist, count and count comparison) is detrimental. This again supports the idea that having to focus on too many objects in the scene hinders the network's performance.

4.8.2 Importance of text versus audio modality

In Section 4.7.7, we observed that a higher number of questions per scene during training did not improve the test accuracy. This contradicted our hypothesis that increasing the number of questions per scene would increase the performance of the model. It also suggests that the network uses both the scene and question to solve the task without favoring one or the other. To validate the fact that NAAQA makes good use of both input modalities (spectrogram and text), the network was also trained using only one modality. To do so, either all values of the input spectrograms were set to 1 or all questions were set to the unknown token during training. No more than 48% of accuracy on the test set was

5. This is only one possible strategy to answer the question. There may be other ways.


achieved for either of these scenarios, which indicates that the network uses information from both modalities to achieve high performance and that the dataset cannot be solved using only one modality with this architecture. This is however much higher than the probability of answering the correct answer by chance (1/57 = 1.75%), or by majority vote (7.5%) introduced in Section 4.5. Although we have carefully designed the dataset in order to avoid bias, some form of bias is intrinsic in QA problems. For example, an acoustic sequence containing only n elementary sounds automatically rules out all counting answers above n, regardless of which question is asked. Similarly, asking a yes/no question limits the number of possible answers to two, regardless of the acoustic scene. The denominator used in the chance level (57) will therefore be lower on average. It remains to investigate if our results are consistent with this argument, or if they suggest a residual bias in the dataset. We leave this investigation for future work.
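The chance levels discussed above can be reproduced with a one-line computation. The sketch below only illustrates the argument: the function name is hypothetical, and the yes/no case is the example given in the text of a question that restricts the answer set.

```python
def chance_accuracy(n_answers):
    """Accuracy (%) of a uniform random guess among n_answers candidates."""
    return 100.0 / n_answers


# Uniform guess over the full answer vocabulary of 57 possible answers.
global_chance = chance_accuracy(57)

# A yes/no question restricts the candidate answers to two.
yes_no_chance = chance_accuracy(2)
```

The gap between the two values shows how strongly the question alone can narrow the answer space, which is the source of the intrinsic bias mentioned above.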

4.8.3 Variability of the input size

Batch normalization has been proven to help regularize the training of neural networks [44]. However, using batch normalization requires all inputs in a batch to have the same dimensions in order to compute the element-wise means and standard deviations. Acoustic signals in the CLEAR dataset have variable duration. To work around this constraint, we padded the spectrograms with zeros so that they all have the same dimensions. Another solution is to resize all spectrograms to a fixed size using bilinear interpolation, as is commonly done in the visual domain. From a purely acoustic point of view, we hypothesized that padding spectrograms would preserve the relative time axis between all scenes and would yield better results than resizing. We revisited this hypothesis by training our best architecture on resized spectrograms. Surprisingly, the network performed better by about 1 percentage point when trained on the resized spectrograms. We do not have a definitive answer as to why we observe this behaviour. It might be due to the fact that padded spectrograms contain less information since a portion of them is filled with zeros. The network could also be using the different time resolutions in the resized spectrograms as a cue to cheat the reasoning process. However, this could also happen when the scenes are padded since the padded section is unique to each scene. We analysed the network activations and did not find that the padded section was overly activated. We leave it up to future work to further analyze this.
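The difference between the two strategies can be made concrete with a toy example. This is a simplified sketch, not the code used in the experiments: the function names are hypothetical, and the resizing is done with linear interpolation along the time axis only (with both end points preserved), whereas bilinear interpolation as mentioned above acts on both axes.

```python
def pad_time(spectrogram, target):
    """Zero-pad each frequency row on the right up to `target` time steps."""
    return [row + [0.0] * (target - len(row)) for row in spectrogram]


def resize_time(spectrogram, target):
    """Linearly interpolate each frequency row to `target` time steps."""
    resized = []
    for row in spectrogram:
        n = len(row)
        new_row = []
        for j in range(target):
            # Position of output step j on the original time axis.
            pos = j * (n - 1) / (target - 1) if target > 1 else 0.0
            lo = int(pos)
            hi = min(lo + 1, n - 1)
            frac = pos - lo
            new_row.append(row[lo] * (1.0 - frac) + row[hi] * frac)
        resized.append(new_row)
    return resized


toy = [[0.0, 2.0, 4.0]]          # one frequency row, three time steps
padded = pad_time(toy, 5)        # time scale preserved, zeros appended
stretched = resize_time(toy, 5)  # time scale changes with the scene duration
```

Padding keeps one time step equal to the same duration in every scene, while resizing stretches short scenes so that the duration of a time step varies from scene to scene.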

4.9 Conclusions

In this study, we propose a version of the CLEAR dataset that emphasizes the specific challenges in acoustic question answering by introducing scenes of variable duration and a greater variety of elementary sounds. We showed that Visual FiLM [76], which is a


neural network designed for visual question answering, achieves relatively good accuracy on this acoustic task although it struggles with certain types of questions. We propose the NAAQA architecture, where the feature extraction is optimized for the acoustic task. Its Interleaved feature extractor makes use of 1D convolutions to independently process each axis of 2D time-frequency representations. The initial NAAQA architecture reaches 90.9% accuracy on the CLEAR task with 5.61M parameters. The final NAAQA configuration reaches 91.6% accuracy with ∼8 times fewer parameters (0.67M). The effectiveness of coordinate maps in this acoustic context was also studied. We showed that time coordinate maps enhance the network's temporal localization capabilities, which translates into greater performance on this task. We hope that these results and the software we provide will stimulate research in acoustic reasoning.


4.10 APPENDIX: Network Optimization

GRU units (G)            Overall Accuracy (%)                   Nb of
             Train          Val            Test           Parameters
512          94.2 ±0.23     91.5 ±0.54     91.4 ±0.39     ∼1.94 M
1,024        93.5 ±0.93     91.3 ±0.53     91.2 ±0.46     ∼2.46 M
256          92.8 ±0.70     91.3 ±0.49     91.1 ±0.07     ∼1.68 M
4,096*       94.1 ±0.62     91.2 ±0.33     90.9 ±0.21     ∼5.61 M
2,048        92.7 ±1.03     87.4 ±5.29     86.5 ±5.69     ∼3.51 M

Table 4.6 Impact of the number of GRU text-processing units G. The interleaved extractor is used with hyper-parameters from the initial configuration described in Section 4.7.3 except for G. The row marked with * corresponds to the initial value for G. The number of GRU units has a strong impact on the number of parameters. Setting G = 512 actually increases accuracy while reducing the number of parameters by a factor of ∼2.8.

Nb Filters   Hidden                 Overall Accuracy (%)                    Nb of
(C)          Layer (H)   Train           Val             Test            Parameters
256          1024        93.8 ±0.49      91.4 ±0.36      91.6 ±0.08      ∼1.64 M
512*         1024*       94.2 ±0.23      91.5 ±0.54      91.4 ±0.39      ∼1.94 M
128          1024        94.3 ±0.48      91.3 ±0.45      91.2 ±0.52      ∼1.50 M
128          256         93.2 ±0.61      91.0 ±0.25      91.1 ±0.55      ∼1.35 M
256          256         93.1 ±0.56      91.0 ±0.66      91.0 ±0.60      ∼1.40 M
128          512         92.9 ±0.41      90.9 ±0.34      90.8 ±0.29      ∼1.40 M
256          512         93.4 ±0.67      90.9 ±0.55      90.7 ±0.28      ∼1.48 M
512          512         77.4 ±21.55     76.2 ±21.30     76.1 ±21.25     ∼1.65 M

Table 4.7 Impact of the number of filters C and hidden units H in the classifier. The interleaved extractor is used with hyper-parameters from the initial configuration described in Section 4.7.3 except for C, H and G = 512. The row marked with * corresponds to the initial values for C and H. Classifier parameters have a small impact on both accuracy and the number of parameters. Setting C = 256 gives similar accuracy and reduces the number of parameters by 0.3M compared to the initial parameters. The configuration C = 512 and H = 512 often gets stuck in local minima, which explains the large STD.


Nb Resblocks   Nb Filters in            Overall Accuracy (%)                   Nb of
(J)            Resblock (M)    Train          Val            Test           Parameters
4              128             94.0 ±0.54     91.3 ±0.40     91.7 ±0.07     ∼1.64 M
4              64              93.7 ±0.04     91.0 ±0.26     91.1 ±0.04     ∼0.83 M
3              128             93.5 ±0.35     90.6 ±0.81     90.6 ±0.68     ∼1.35 M
3              64              92.8 ±0.90     90.5 ±0.61     90.3 ±0.61     ∼0.73 M
2              128             93.4 ±0.15     90.7 ±0.11     90.2 ±0.14     ∼1.05 M
2              64              92.0 ±0.39     89.6 ±0.25     89.6 ±0.14     ∼0.62 M
3              32              91.6 ±0.49     88.8 ±0.25     88.6 ±0.24     ∼0.51 M
4              32              91.8 ±0.58     89.3 ±1.05     88.6 ±1.36     ∼0.55 M
2              32              90.2 ±1.05     86.5 ±0.77     86.7 ±0.65     ∼0.47 M
1              128             92.0 ±0.22     87.1 ±0.35     86.3 ±0.13     ∼0.76 M
1              64              91.1 ±1.55     86.0 ±1.13     86.0 ±1.31     ∼0.51 M
1              32              87.5 ±0.25     81.1 ±0.46     81.4 ±0.26     ∼0.42 M

Table 4.8 Impact of reducing the number of Resblocks J and the number of filters M in each Resblock. The interleaved extractor is used with hyper-parameters from the initial configuration described in Section 4.7.3 except for J, M, C = 256, H = 1024 and G = 512. With J = 4 and M = 64, the network reaches great accuracy with only 0.67M parameters (vs 5.61M for the initial configuration). The accuracy drops when J < 3 and M = 32. The configuration J = 3 and M = 128 often gets stuck in local minima, which explains the large STD.


Nb Blocks   Projection             Accuracy (% ± 0.60)                    Nb of
(K)         Size (P)     Train          Val            Test           Parameters
3           32           93.7 ±0.34     91.5 ±0.03     91.5 ±0.15     ∼0.81 M
3           64           93.7 ±0.04     91.0 ±0.26     91.1 ±0.04     ∼0.83 M
3           128          93.1 ±0.24     91.0 ±0.66     91.1 ±0.36     ∼0.87 M
3           –            93.5 ±0.30     91.1 ±0.24     90.8 ±0.29     ∼0.81 M
2           –            92.9 ±0.39     89.9 ±0.13     90.0 ±0.13     ∼0.80 M
2           32           92.9 ±0.72     89.7 ±0.62     89.9 ±0.55     ∼0.81 M
2           128          91.9 ±1.80     89.5 ±0.75     89.5 ±1.03     ∼0.87 M
2           64           91.9 ±1.14     89.4 ±0.57     89.4 ±0.58     ∼0.83 M
4           32           93.4 ±0.47     86.9 ±2.43     83.1 ±0.07     ∼0.83 M
4           64           89.0 ±0.00     81.6 ±0.00     80.8 ±0.00     ∼0.86 M
4           –            91.3 ±2.00     83.5 ±6.44     79.9 ±4.09     ∼0.85 M
4           128          88.2 ±2.37     77.3 ±2.70     74.8 ±2.62     ∼0.90 M

Table 4.9 Impact of the number of blocks K in the interleaved feature extractor and its projection size P. The interleaved extractor is used with the hyper-parameters J = 4, M = 64, C = 256, H = 1024, G = 512. More blocks in the extractor result in more downsampling of the feature maps. Too much downsampling (K = 4) negatively affects accuracy.

CHAPTER 5

CONCLUSION

5.1 Summary

This research introduces a new task that aims to evaluate the reasoning capabilities of neural networks in an acoustic context. In the Acoustic Question Answering (AQA) task, an intelligent agent must answer a question about the content of an auditory scene. This task is analogous to Visual Question Answering (VQA), where an agent must answer a question about the content of a visual scene. Since the two tasks are similar, we ask how systems developed for the VQA task perform when applied to the AQA task.

This leads to the following question:

Can a neural network architecture developed to answer questions about visual scenes be used to answer questions about auditory scenes?

To answer this question, we first developed a dataset comprising a number of auditory scenes along with questions and answers for each scene. A first version of this dataset, CLEAR, was introduced in a conference paper [2]. In this version, all scenes have the same duration. This dataset introduced the Acoustic Question Answering (AQA) task to the scientific community. The dataset generation code was also published on GitHub 1. A first experiment with the FiLM network [76] was carried out as part of that paper. FiLM was originally developed to tackle the Visual Question Answering (VQA) task. The preliminary results of this experiment show the promise of this type of architecture for the AQA task.

To make the task more realistic, we developed a second version of the CLEAR dataset. It comprises scenes of variable duration and a greater variety of elementary sounds. We submitted an article introducing this new version of the dataset to the journal IEEE Pattern Analysis and Machine Intelligence. In it, we analyze in depth the performance of the FiLM network on this new version of the dataset. FiLM reaches an accuracy of 78%, which is impressive given that part of the network is pre-trained on visual data. The network is however very complex, with more than 6 million parameters. In the same article we also introduced a new neural network based on the FiLM architecture [76]. NAAQA uses 1D convolutions to analyze 2D spectro-temporal representations of auditory scenes. This model reaches an accuracy of 91% with about 8 times fewer parameters. Following these experiments, we can state that a neural network architecture developed to answer questions about visual scenes can be used successfully to answer questions about auditory scenes. Finally, we analyzed the effectiveness of coordinate maps for the AQA task. We observed that time coordinate maps have a large impact on the performance of our network. They allow the network to better understand the temporal dimension of the spectrograms.

1. https://github.com/IGLU-CHISTERA/CLEAR-dataset-generation

5.2 Review of contributions

The first contribution of this research is the CLEAR dataset. This dataset is the foundation of the Acoustic Question Answering task. The official version of the dataset contains 50,000 acoustic scenes and 4 question-answer pairs per scene, for a total of 200,000 examples. The generation code was designed to be easily extensible. It is therefore very simple to generate a different number of scenes and questions. It is also easy to change the elementary sounds from which the scenes are generated; doing so produces a completely different dataset. The generation code is available on GitHub 2 and the dataset on IEEE Dataport 3.

The second contribution of this research is the analysis of the performance of the visual system FiLM on the Acoustic Question Answering task. We carried out a detailed analysis of this model's performance in Chapter 4. We showed there that, although this architecture was developed for a visual task, it achieves good performance on this acoustic task.
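For readers unfamiliar with FiLM [76], its core operation is a feature-wise affine transformation: each feature map is scaled by a coefficient γ and shifted by a coefficient β, both predicted from the question. The Python sketch below illustrates only this modulation step. The function name, the list-based feature maps, and the values of γ and β are assumptions made for illustration; it is not the code evaluated in this work.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel, indexed as [frequency_bin][time_frame]


def film(feature_maps: List[FeatureMap],
         gammas: List[float],
         betas: List[float]) -> List[FeatureMap]:
    """Feature-wise linear modulation: scale and shift each channel independently."""
    if not (len(feature_maps) == len(gammas) == len(betas)):
        raise ValueError("one (gamma, beta) pair is required per channel")
    return [
        [[gamma * value + beta for value in row] for row in channel]
        for channel, gamma, beta in zip(feature_maps, gammas, betas)
    ]


# Two toy channels of size 1 x 2. In FiLM the coefficients would be predicted
# from the question; here they are hard-coded, made-up values.
maps = [[[1.0, 2.0]], [[3.0, 4.0]]]
modulated = film(maps, gammas=[2.0, 0.0], betas=[0.5, 1.0])
```

In this toy case, the first channel is amplified and shifted, while a γ of 0 replaces the second channel with the constant β. This is how the question can emphasise or suppress individual feature maps.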

2. https://github.com/IGLU-CHISTERA/CLEAR-dataset-generation

3. https://ieee-dataport.org/open-access/clear-dataset-compositional-language-and-elementary-acoustic-reasoning

The third contribution of this work is a neural network developed specifically for the Acoustic Question Answering task. This network was introduced to the scientific community in the article presented in Chapter 4. It is based on the architecture of the visual network FiLM and reaches an accuracy of 91% on the task defined by CLEAR. In terms of number of parameters, it is about 8 times less complex than the FiLM network.

The last contribution of this work is an analysis of the relevance of coordinate maps in an acoustic context. In the article presented in Chapter 4, we analysed the optimal placement of coordinate maps in the NAAQA architecture. We also compared the effectiveness of temporal coordinate maps with that of frequency coordinate maps. Following these experiments, we can state that temporal coordinate maps have a significant impact on the performance of our network for the AQA task. They increase the network's temporal localisation ability and, as a result, its performance on questions related to temporal characteristics.
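As an illustration of what a coordinate map is, the following Python sketch builds a temporal map and a frequency map and appends the temporal one to a set of feature maps as an extra channel. The function names, the toy values, and the linear scaling of coordinates to the range [-1, 1] are assumptions made for this example; the exact construction and placement used in NAAQA are those described in Chapter 4.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel, indexed as [frequency_bin][time_frame]


def linear_coordinates(size: int) -> List[float]:
    """Evenly spaced coordinates from -1 to 1 (a single position maps to 0)."""
    if size == 1:
        return [0.0]
    return [2.0 * i / (size - 1) - 1.0 for i in range(size)]


def temporal_coordinate_map(n_freq: int, n_time: int) -> FeatureMap:
    """Channel whose value depends only on the time frame index."""
    time_coords = linear_coordinates(n_time)
    return [list(time_coords) for _ in range(n_freq)]


def frequency_coordinate_map(n_freq: int, n_time: int) -> FeatureMap:
    """Channel whose value depends only on the frequency bin index."""
    return [[coord] * n_time for coord in linear_coordinates(n_freq)]


def append_temporal_map(channels: List[FeatureMap]) -> List[FeatureMap]:
    """Return the input channels followed by one temporal coordinate channel."""
    n_freq, n_time = len(channels[0]), len(channels[0][0])
    return channels + [temporal_coordinate_map(n_freq, n_time)]


# Toy input: 1 channel, 2 frequency bins, 5 time frames (values are made up).
features = [[[0.1, 0.2, 0.3, 0.4, 0.5],
             [0.5, 0.4, 0.3, 0.2, 0.1]]]
with_time = append_temporal_map(features)
```

Every position of the added channel carries its own time coordinate, so a convolution applied afterwards can read "where in time" a pattern occurs directly from its input rather than having to infer it.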

5.3 Future work

The auditory scenes of the CLEAR dataset are composed of instrumental sounds. It would be interesting to add other types of sounds to the scenes, such as speech excerpts or environmental sounds. These types of sounds are richer in information and therefore more difficult to analyse. They are also closer to what an intelligent agent could observe in the real world. Would the NAAQA network still perform as well with these new types of sounds?

Adding new types of sounds would also make it possible to add new types of questions. For example, with speech excerpts in the scenes, it would be possible to ask questions such as:

– Are there women speaking in the scene?
– How many different people can be heard in the scene?
– What is the name of the person describing X?
– Etc.

Once again, the same question arises regarding the performance of the architecture proposed in NAAQA with these new types of questions. Would the network be able to perform speech recognition in order to answer the questions?

It would also be interesting to quantify the bias in the questions of the CLEAR dataset. We carried out a quick analysis by training NAAQA on the questions only (without the acoustic scenes). We observed an accuracy of about 45% when the network is trained without the acoustic scenes, which suggests that the network needs both modalities to perform well on the task defined by CLEAR. NAAQA was, however, built to analyse both modalities at the same time. It would be interesting to repeat the same experiment, this time with a network designed to answer textual questions only.
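One simple way to start quantifying question bias, sketched below in Python, is a prior-only baseline that never looks at the audio and always predicts the most frequent training answer for a given question type. This is a different and much simpler technique than the question-only NAAQA training reported above. The data layout (pairs of question type and answer), the function names, and the toy examples are hypothetical and do not reflect the actual format of the CLEAR files.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (question_type, answer); hypothetical layout


def fit_answer_prior(train: List[Example]) -> Dict[str, str]:
    """Learn the most frequent answer for each question type, ignoring the audio."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for question_type, answer in train:
        counts[question_type][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def prior_accuracy(prior: Dict[str, str], test: List[Example], fallback: str) -> float:
    """Fraction of test questions answered correctly by the prior alone."""
    correct = sum(1 for qtype, answer in test if prior.get(qtype, fallback) == answer)
    return correct / len(test)


# Made-up toy data, only to show how the baseline is computed.
train_set = [("yes/no", "yes"), ("yes/no", "yes"), ("yes/no", "no"),
             ("count", "2"), ("count", "2"), ("count", "1")]
test_set = [("yes/no", "yes"), ("yes/no", "no"), ("count", "2"), ("count", "3")]

prior = fit_answer_prior(train_set)
baseline = prior_accuracy(prior, test_set, fallback="yes")
```

The gap between such a baseline and the accuracy of a model that has access to the acoustic scenes gives a rough indication of how much of the task can be solved from the answer distribution alone.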

LIST OF REFERENCES

[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional Neural Networks for Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, pages 1533–1545, 2014.

[2] Jérôme Abdelnour, Giampiero Salvi, and Jean Rouat. CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning. In Workshop on Visually Grounded Interaction and Language (VIGIL) in Advances in Neural Information Processing Systems (NeurIPS), 2018. Available at https://arxiv.org/abs/1811.10561.

[3] Hervé Abdi and Lynne J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, pages 433–459, 2010.

[4] Sajjad Abdoli, Patrick Cardinal, and Alessandro Lameiras Koerich. End-to-end environmental sound classification using a 1D convolutional neural network. Expert Systems with Applications, pages 252–263, 2019.

[5] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the Behavior of Visual Question Answering Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1955–1960, 2016.

[6] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4971–4980, 2018.

[7] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.

[8] Drew Arad Hudson and Christopher D. Manning. Compositional Attention Networks for Machine Reasoning. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[9] Ismael Balafrej and Jean Rouat. P-CRITICAL: A Reservoir Autoregulation Plasticity Rule for Neuromorphic Hardware. arXiv, abs/2009.05593, 2020.

[10] Giuseppe Bandiera, Oriol Romani Picas, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, and Xavier Serra. Good-sounds.org: a framework to explore goodness in instrumental sounds. In Proceedings of the International Society for Music Information Retrieval conference (ISMIR), pages 414–419, 2016.

[11] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, pages 157–166, 1994.

[12] Venkatesh Boddapati, Andrej Petef, Jim Rasmusson, and Lars Lundberg. Classifying environmental sounds using image recognition networks. In Procedia Computer Science, pages 2048–2056, 2017.


[13] Simon Brodeur and Jean Rouat. Regulation toward Self-organized Criticality in a Recurrent Spiking Neural Reservoir. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), pages 547–554, 2012.

[14] Mathilde Brousmiche, Jean Rouat, and Stephane Dupont. SECL-UMons Database for Sound Event Classification and Localization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 756–760, 2020.

[15] Judith Brown and Miller Puckette. An efficient algorithm for the calculation of a constant Q transform. Journal of the Acoustical Society of America, page 2698, 1992.

[16] Rui Cai, Lie Lu, Hong-jiang Zhang, and Lian-hong Cai. Improve audio representation by using feature structure patterns. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 345–348, 2004.

[17] Jinwei Cao, Jose Antonio Robles Flores, Dmitri Roussinov, and Jay F Nunamaker. Automated question answering from lecture videos: NLP vs. pattern matching. In Proceedings of the Annual Hawaii International Conference on System Sciences (HICSS), page 43b, 2005.

[18] Marc Champagne. Sound reasoning (literally): Prospects and Challenges of current acoustic logics. Logica Universalis, pages 331–343, 2015.

[19] Marc Champagne. Teaching Argument Diagrams to a Student Who Is Blind. In Proceedings of the International Conference on Theory and Application of Diagrams, pages 783–786, 2018.

[20] Shi-Kuo Chang and Erland Jungert. Symbolic Projection for Image Information Retrieval and Spatial Reasoning. Signal processing and its applications. Elsevier, 1996.

[21] Keunwoo Choi, George Fazekas, and Mark Sandler. Automatic Tagging using Deep Convolutional Neural Networks. In Proceedings of the International Society for Music Information Retrieval conference (ISMIR), pages 805–811, 2016.

[22] Tat-Seng Chua. Question answering on large news video archive. In Image and Signal Processing and Analysis, 2003. ISPA 2003. Proceedings of the 3rd International Symposium on, pages 289–294, 2003.

[23] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv, abs/1412.3555, 2014.

[24] Anubrata Das, Samreen Anjum, and Danna Gurari. Dataset bias: A case study for visual question answering. Proceedings of the Association for Information Science and Technology, pages 58–67, 2019.

[25] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. Guesswhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4466–4475, 2017.


[26] Li Deng, Ossama Abdel Hamid, and Dong Yu. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6669–6673, 2013.

[27] Lütfiye Durak and Orhan Arikan. Short-time fourier transform: two fundamental properties and an optimal implementation. IEEE Transactions on Signal Processing, pages 1231–1242, 2003.

[28] Yousof Erfani, Ramin Pichevar, and Jean Rouat. Audio Watermarking Using Spikegram and a Two-Dictionary Approach. IEEE Transactions on Information Forensics and Security, pages 840–852, 2017.

[29] Haytham M. Fayek and Justin Johnson. Temporal Reasoning via Audio Question Answering. IEEE Transactions on Audio Speech and Language Processing, pages 2283–2294, 2020.

[30] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Proceedings of Neural Information Processing Systems (NeurIPS), pages 2296–2304, 2015.

[31] Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, pages 3618–3623, 2015.

[32] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. IQA: Visual Question Answering in Interactive Environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4089–4098, 2018.

[33] Yash Goyal, Tejas Khot, Douglas Summers Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.

[34] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.

[35] Jun Han and Claudio Moraga. The influence of the sigmoid function parameters on the speed of backpropagation learning. In International Workshop on Artificial Neural Networks, pages 195–201, 1995.

[36] Yoonchang Han and Kyogu Lee. Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation. arXiv, abs/1607.02383, 2016.

[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.


[38] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcom Slaney, Ron J. Weiss, and Kevin Wilson. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2017.

[39] Florian Hönig, Georg Stemmer, Christian Hacker, and Fabio Brugnara. Revising perceptual linear prediction (PLP). In Proceedings of the European Conference on Speech Communication and Technology (INTERSPEECH), pages 2997–3000, 2005.

[40] Eduard H Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk, and Chin-Yew Lin. Question Answering in Webclopedia. In Proceedings of the Text REtrieval Conference (TREC), pages 53–56, 2000.

[41] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable Neural Computation via Stack Neural Module Networks. In Proceedings of European Conference on Computer Vision (ECCV), pages 55–71, 2018.

[42] Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned Graph Networks for Relational Reasoning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 10293–10302, 2019.

[43] Drew A. Hudson and Christopher D. Manning. GQA: A New Dataset for Real-world Visual Reasoning and Compositional Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709, 2019.

[44] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of International Conference on Machine Learning (ICML), pages 448–456, 2015.

[45] Mohit Iyyer, Jordan Boyd Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. A neural network for factoid question answering over paragraphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 633–644, 2014.

[46] Kevin Jarrett, Koray Kavukcuoglu, MarcAurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2146–2153, 2009.

[47] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[48] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997, 2017.

[49] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and Executing Programs for Visual Reasoning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3008–3017, 2017.


[50] Bekir Karlik and A Vehbi Olgac. Performance analysis of various activation functions in generalized mlp architectures of neural networks. International Journal of Artificial Intelligence and Expert Systems, pages 111–122, 2011.

[51] Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. DeepStory: Video story QA by deep embedded memory networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 2016–2022, 2017.

[52] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[53] Khaled Koutini, Hamid Eghbal zadeh, and Gerhard Widmer. Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.

[54] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.

[55] Anurag Kumar and Bhiksha Raj. Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data. arXiv, abs/1707.02530, 2017.

[56] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.

[57] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.

[58] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In Proceedings of International Conference on Machine Learning (ICML), pages 609–616, 2009.

[59] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Proceedings of Neural Information Processing Systems (NeurIPS), pages 1096–1104, 2009.

[60] Jongpil Lee, Taejun Kim, Jiyoung Park, and Juhan Nam. Raw Waveform-based Audio Classification Using Sample-level CNN Architectures. arXiv, abs/1712.00866, 2017.

[61] Haoning Lin, Zhenwei Shi, and Zhengxia Zou. Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-scale Fully Convolutional Network. Remote Sensing, page 480, 2017.

[62] Min Lin, Qiang Chen, and Shuicheng Yan. Network In Networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[63] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Proceedings of Neural Information Processing Systems (NeurIPS), pages 9605–9616, 2018.

[64] Beth Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In Proceedings of the International Society for Music Information Retrieval conference (ISMIR), 2000.

[65] Laurens Van Der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, pages 2579–2605, 2008.

[66] Varun Manjunatha, Nirat Saini, and Larry S. Davis. Explicit Bias Discovery in Visual Question Answering Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9562–9571, 2019.

[67] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[68] Jorge Martinez, Hector Perez, Enrique Escamilla, and Masahisa Mabo Suzuki. Speaker recognition using Mel frequency Cepstral Coefficients (MFCC) and Vector quantization (VQ) techniques. In Proceedings of the International Conference on Electrical Communications and Computers (CONIELECOMP), pages 248–251, 2012.

[69] Amirouche Moktefi and Sun-Joo Shin. Visual Reasoning with Diagrams. Studies in Universal Logic. Springer, 2013.

[70] Juhan Nam, Keunwoo Choi, Jongpil Lee, Szu-Yu Chou, and Yi-Hsuan Yang. Deep Learning for Audio-based Music Classification and Tagging: Teaching Computers to Distinguish Rock from Bach. IEEE Signal Processing Magazine, pages 41–51, 2019.

[71] S. Hamid Nawab and Thomas F. Quatieri. Short-Time Fourier Transform, pages 289–337. Prentice-Hall, Inc., 1987.

[72] Sergio Oramas, Oriol Nieto, Francesco Barbieri, and Xavier Serra. Multi-Label Music Genre Classification from Audio, Text, and Images Using Deep Features. In Proceedings of the International Society for Music Information Retrieval conference (ISMIR), pages 23–30, 2017.

[73] Taejin Park and Taejin Lee. Music-noise segmentation in spectrotemporal domain using convolutional neural networks. In Proceedings of the International Society for Music Information Retrieval conference (ISMIR), 2015.

[74] Andy Pearce, Tim Brookes, and Russell Mason. Timbral Models, AudioCommons project, Deliverable D5.7. 2018.

[75] Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, and Aaron Courville. Learning Visual Reasoning Without Strong Priors. In Workshop on Speech and Language Processing at International Conference on Machine Learning (ICML), 2017.

[76] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3942–3951, 2018.


[77] Alessandro Pieropan, Giampiero Salvi, Karl Pauwels, and Hedvig Kjellström. Audio-Visual Classification and Detection of Human Manipulation Actions. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3045–3052, 2014.

[78] Jordi Pons, Thomas Lidy, and Xavier Serra. Experimenting with musically motivated convolutional neural networks. In Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1–6, 2016.

[79] Jordi Pons and Xavier Serra. Designing efficient architectures for modeling temporal features with convolutional neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2472–2476, 2017.

[80] Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, and Xavier Serra. Timbre Analysis of Music Audio Signals with Convolutional Neural Networks. In Proceedings of the European Signal Processing Conference (EUSIPCO), pages 2744–2748, 2017.

[81] Deepak Ravichandran and Eduard Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 41–47, 2002.

[82] Omar Hernández Rodríguez and Jorge M Lopez Fernandez. A semiotic reflection on the didactics of the chain rule. The Montana Mathematics Enthusiast, pages 321–332, 2010.

[83] Oriol Romani Picas, Hector Parra Rodriguez, Dara Dabiri, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, and Xavier Serra. A real-time system for measuring sound goodness in instrumental sounds. In Proceedings of the AES Convention, 2015.

[84] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, page 386, 1958.

[85] Jean Rouat. Groupe de recherche NECOTIS, 2005. https://www.gel.usherbrooke.ca/necotis/, accessed March 15, 2021.

[86] Jean Rouat. Interactive Grounded Language Understanding (IGLU), 2016. https://iglu-chistera.github.io/, accessed December 20, 2020.

[87] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, pages 533–536, 1986.

[88] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, pages 211–252, 2015.

[89] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), pages 92–101, 2010.

[90] Christian Schörkhuber and Anssi Klapuri. Constant-Q transform toolbox for Music processing. In Proceedings of the Sound and Music Computing Conference, 2010.


[91] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[92] Martin M Soubbotin and Sergei M Soubbotin. Patterns of Potential Answer Expressions as Clues to the Right Answers. In Proceedings of the Text REtrieval Conference (TREC), volume 500-250, 2001.

[93] Elias Sprengel, Martin Jaggi, Yannic Kilcher, and Thomas Hofmann. Audio based bird species identification using deep learning techniques. In Proceedings of LifeCLEF, pages 547–559, 2016.

[94] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, pages 1929–1958, 2014.

[95] Christian J. Steinmetz and Joshua D. Reiss. PyLoudNorm: A simple yet flexible Loudness meter in Python. In Proceedings of the 150th AES Convention, 2021.

[96] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[97] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4631–4640, 2016.

[98] International Telecommunication Union. Algorithms to measure audio programme loudness and true-peak audio level (ITU-R BS.1770-4). Technical report, Electronic Publication, 2015.

[99] Union Européenne. CHIST-ERA, 2012. http://www.chistera.eu/, accessed December 20, 2020.

[100] Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Probabilistic Neural Symbolic Models for Interpretable Visual Question Answering. In Proceedings of the 36th International Conference on Machine Learning, pages 6428–6437, 2019.

[101] Ellen M Voorhees. The TREC-8 Question Answering Track Report. In Proceedings of the Text REtrieval Conference (TREC), pages 77–82, 1999.

[102] Ellen M Voorhees and Dawn M Tice. Building a question answering test collection. In Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR), pages 200–207, 2000.

[103] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. FVQA: Fact-Based Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 2413–2427, 2018.

[104] Yu-Chieh Wu and Jie-Chi Yang. A robust passage retrieval algorithm for video question answering. IEEE Transactions on Circuits and Systems for Video Technology, pages 1411–1421, 2008.


[105] Lonce Wyse. Audio Spectrogram Representations for Processing with Convolutional Neural Networks. In Proceedings of the First International Workshop on Deep Learning and Music joint with IJCNN, pages 37–41, 2017.

[106] Hui Yang, Lekha Chaisorn, Yunlong Zhao, Shi-Yong Neo, and Tat-Seng Chua. VideoQA: Question Answering on news video. In Proceedings of the eleventh ACM international conference on Multimedia, pages 632–641, 2003.

[107] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Joshua B. Tenenbaum. Neural-symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In Proceedings of Neural Information Processing Systems (NeurIPS), pages 1039–1050, 2018.

[108] Matthew D Zeiler, M Ranzato, Rajat Monga, Min Mao, Kun Yang, Quoc Viet Le, Patrick Nguyen, Alan Senior, Vincent Vanhoucke, Jeffrey Dean, et al. On rectified linear units for speech processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3517–3521, 2013.

[109] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From Recognition to Cognition: Visual Commonsense Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[110] Cheng Zhang, Cengiz Öztireli, Stephan Mandt, and Giampiero Salvi. Active Mini-batch Sampling using Repulsive Point Processes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5741–5748, 2019.

[111] Peng Zhang, Yash Goyal, Douglas Summers Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5014–5022, 2016.

[112] Ted Zhang, Dengxin Dai, Tinne Tuytelaars, Marie-Francine Moens, and Luc Van Gool. Speech-based Visual Question Answering. arXiv, abs/1705.00464, 2017.

[113] Weiping Zheng, Zhenyao Mo, Xiaotao Xing, and Gansen Zhao. CNNs-based Acoustic Scene Classification using Multi-spectrogram Fusion and Label Expansions. arXiv, abs/1809.01543, 2018.

[114] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei Fei. Visual7W: Grounded Question Answering in Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4995–5004, 2016.

