Title: Automatic Voice Emotion Recognition of Child-Parent Conversations in Natural Settings

Author: *Effie Lai-Chong Law, School of Informatics, University of Leicester, Leicester, U.K., email: [email protected]

Samaneh Soleimani, School of Informatics, University of Leicester, Leicester, U.K., email: [email protected]

Dawn Watkins, School of Law, University of Leicester, Leicester, U.K., email: [email protected]

Joanna Barwick, School of Law, University of Leicester, U.K., email: [email protected]

Abstract: While voice communication of emotion has been researched for decades, the accuracy of automatic voice emotion recognition (AVER) still needs to improve. In particular, intergenerational communication has been under-researched, as indicated by the lack of an emotion corpus on child-parent conversations. In this paper, we report our work of applying Support-Vector Machines (SVMs), established machine learning models, to analyze audio recordings of 20 child-parent dyads, who discussed everyday life scenarios presented through a tablet-based video game. Among the many issues facing the emerging work on AVER, we explored two critical ones: the methodological issue of optimizing its performance against computational costs, and the conceptual issue of the emotionally neutral state. We used the minimalistic/extended acoustic feature set extracted with openSMILE and a small/large set of annotated utterances for building models, and analyzed the prevalence of the class neutral. Results indicated that the bigger the combined sets, the better the training outcomes. Regardless, the classification models yielded modest average recall when applied to the child-parent data, indicating their low generalizability. Implications for improving AVER and its potential uses are drawn.

Keywords: Vocal emotion; Child-parent conversation; Recognition accuracy; Emotion corpora; Emotion neutrality; IEMOCAP

Number of words: main body (without references): 9147; entire document (with references): 12474


Automatic Voice Emotion Recognition of Child-Parent Conversations in Natural Settings

Abstract. While voice communication of emotion has been researched for decades, the accuracy of automatic voice emotion recognition (AVER) still needs to improve. In particular, intergenerational communication has been under-researched, as indicated by the lack of an emotion corpus on child-parent conversations. In this paper, we present our work of applying Support-Vector Machines (SVMs), established machine learning models, to analyze 20 pairs of child-parent dialogues on everyday life scenarios. Among the many issues facing the emerging work on AVER, we explored two critical ones: the methodological issue of optimizing its performance against computational costs, and the conceptual issue of the emotionally neutral state. We used the minimalistic/extended acoustic feature set extracted with openSMILE and a small/large set of annotated utterances for building models, and analyzed the prevalence of the class neutral. Results indicated that the bigger the combined sets, the better the training outcomes. Regardless, the classification models yielded modest average recall when applied to the child-parent data, indicating their low generalizability. Implications for improving AVER and its potential uses are drawn.

Keywords: Vocal emotion; Child-parent conversation; Recognition accuracy; Emotion corpora; Emotion neutrality; IEMOCAP

1 INTRODUCTION

What emotions are captured in the following excerpt of a child-parent (C-P) audiotaped conversation on a sensitive question: Are any of these people - parents, adult relatives, children, police, neighbours, and teachers - allowed to hit children?

P1: No absolutely not.
C1: Um. I know parents aren't. Children aren't. Teachers aren't. Adult relatives aren't. Neighbours aren't. Not sure about police.
P2: No. The police are not allowed to.
C2: Aren't they?
P3: No.
C3: So no one is?

Three approaches have been used to analyze the valence (positive, negative and neutral) of individual lines or segments: first, human coders with access to both the voice recording and the text transcript; second, sentiment text-based analysis with the software Stanford CoreNLP [90]; third, automatic voice analysis with an emotion corpus and machine learning techniques [49]. Inconsistency is observed in the results for the first two segments, P1 and C1. For the human coders, P1 is negative ("cold anger" [43] or subdued irritation) and C1 is positive with a sense of pride. For text analysis, both P1 and C1 are neutral leaning towards negative. For voice analysis, both P1 and C1 are positive. Results for the other lines are a mixed bag of (in)consistency among the three approaches. These examples of human- versus machine-based approaches seem to attest that automatic voice emotion recognition (AVER) is yet to mature in terms of accuracy, despite about six decades of systematic research on voice communication of emotion in different fields such as psychology, linguistics, engineering, and computer science (see reviews in [21] and [43]).

As vocal expression is an age-old evolutionary mechanism for interpersonal communication [18], average humans possess the ability to detect cues in speech utterances to infer the emotional states of speakers [43]. AVER is a burgeoning interdisciplinary research area aiming to mimic this innate human ability [77] with the use of computational intelligence. It addresses the acoustic but not the semantic channel of speech [91]. However, this unimodal approach has been augmented to be multimodal (e.g., text, video) with some increase in accuracy, albeit still low (e.g., see review in [85]). Nonetheless, we argue for the importance of first focusing on optimizing the performance of AVER, as voice is the most common carrier of emotions [1].


Indeed, the role of AVER in human-computer interaction (HCI) has become increasingly salient, thanks to the soaring interest in voice user interfaces [80]. For instance, the ability of a robot to recognize the emotion expressed in a human user’s utterance and respond appropriately can help it to gain trust from the user [61]. Call centre operators can regulate their voices when they are made aware of the emotional state of a caller through an AVER device.

Furthermore, we argue that AVER can facilitate the analysis of the above conversation between the ten-year-old child (C) and his parent (P), which took place at their own home when they were playing a video game in which a non-player character presented them with the contentious question. We posit that by mapping out the emotional trajectory of a dyadic conversation, specific emotion markers can help researchers identify points of interest efficiently rather than sifting through hours of audio recordings. For example, if the results of automatic emotion analysis indicate that at certain moments both child and parent express "hot anger" (i.e. negative valence with high arousal [21]), then the corresponding lines of the conversation should be thoroughly inspected. Nevertheless, to qualify AVER as a useful research approach, it is imperative to demonstrate that it works by attaining an acceptable level of recognition accuracy. But how high the accuracy needs to be to be deemed acceptable remains debatable [68].

For the vocal emotion analysis, we adopted a corpus-based approach leveraging established machine learning models – support vector machines (SVMs) – with the emotion corpus IEMOCAP [8] (Section 4.2.1). Our work involved automatically extracting the most emotion-relevant acoustic features using openSMILE [17] and then training SVMs based on the extracted feature set (see Section 4.2 for details). Specifically, we applied SVMs to a new challenge – emotion analysis of child-parent conversation – an area that has hitherto been under-researched but is increasingly attracting attention and effort [66]. Nevertheless, there is still no child-parent emotion corpus, be it with acted or spontaneous utterances. The dataset we captured in our empirical study can serve as a precursor for such a much-needed corpus.

We employed the INTERSPEECH 2009 Emotion Challenge Set with the original set of 384 features (e.g. frequency-related, energy-related, spectral), which was extended to a much larger set of 6552 features [52]. In fact, the number of such features can range from 88 (i.e. a minimalistic set [15]) through 1582 [88] to 6373 [53]. A bigger feature set typically results in better accuracy than its smaller counterpart: the larger the set, the higher the probability that relevant features are included. On the other hand, the probability of having redundant or irrelevant features is also higher, undermining the classification performance and increasing computational cost [97] as well as the risk of model overfitting [15]. We addressed empirically the question of how to balance these competing factors to optimize AVER efficiency and accuracy.

Another intriguing issue in the process of emotion analysis is the implication of neutral expressions. While many existing speech databases (36 out of the 57 reviewed in [92]) include neutral as an emotional state, disputes remain on how to define a neutral vocal expression. As each of these corpora uses a specific model with a specific number and set of emotion labels, a typical approach to comparing their performances is to reduce their representations to binary formats. As described in [15], the neutral state in the corpora EMO-DB [7] and GeSiE [46] is mapped to positive valence and low arousal. Level of Interest 2 (loi2) [16], which is regarded as neutral, is mapped to positive valence but high arousal [15]. Given the lack of a clear justification for such mappings and their inconsistency, a neutral state appears to be a buffer token that can arbitrarily be assigned to other categories. Furthermore, results of others' work as well as our own (Section 5) on manual and automatic analysis of voice emotions indicated a high percentage of the neutral category. This observation implies conceptual and methodological problems for defining and classifying the neutral state, raising questions such as: What is meant by emotionally neutral? Is neutral a label for something that neither human nor machine can decipher, or is it a matter of training?

Overall, the two main research questions (RQs) of our work are: RQ1: What is the accuracy of automatic voice emotion recognition of child-parent conversations in natural settings? RQ2: How do specific methodological (i.e. the size of the feature set and dataset) and conceptual (i.e. the definition of neutral emotion) factors determine this accuracy?


To answer these questions, we analyzed, automatically as well as manually (i.e. for benchmarking), the emotions conveyed in the conversations of child-parent dyads when they were discussing issues pertaining to everyday life scenarios presented in a video game. The contributions we aimed to achieve through this research work are:

Scope of AVER applicability: Providing empirical evidence for the applicability of AVER for child-parent conversation analysis, which still relies on laborious manual approaches, and drawing practical implications for utilizing AVER to enhance intergenerational communication;

Computational costs of AVER: Inferring the methodological implication for addressing the tradeoff between accuracy and the infrastructure and time required;

Neutral as emotion state: Identifying and gaining insights into the issue on the ambiguity of the neutral state from the conceptual and methodological perspective;

Emotion corpus gap: Offering a precursor for a child-parent emotion corpus to bridge the gap in the existing databases.

The rest of the paper is structured as follows: Section 2 presents the related work on five aspects. Section 3 describes the video game and procedure for the empirical study in which child-parent dyads were involved. Section 4 elaborates the AVER process and Section 5 reports the results. Section 6 is the Discussion, where we revisit the four planned contributions above and the limitations, followed by a conclusion in Section 7.

2 RELATED WORK

In this section, we present five topics that are relevant to our research work: emotional experiences, intergenerational communication, voice data transcription and segmentation, voice emotion analysis, and neutral emotion.

2.1 Emotional Experiences

Human emotion as a research topic has been in and out of focus in the field of psychology (see review in [41]). Arguably, the recent surge of interest in emotions in the wider research community has been stimulated by the shift of emphasis onto emotional experience in relation to the use of technology, as characterized by the work in User Experience (UX) (e.g. [23, 34, 36]).

Nonetheless, each of the two closely related concepts – emotion and experience – is hard to define on its own as well as when combined. In an attempt to review existing definitions of emotion in the 1980s, Kleinginna and Kleinginna [29] identified at least 92 instances. More recent efforts have resulted in component-based definitions. Among them, the oft-cited one is Scherer's [44]: "an episode of interrelated, synchronized changes in the states of all or most of five organismic subsystems in response to the evaluation of an external or internal stimulus event as relevant to major concerns of the organism" (p.697). The five subsystems are cognitive, neurophysiological, executive, expressive, and experiential [44]. Aligning with the appraisal theory of emotion [45], experience arises from evaluating one's needs and goals in relation to contextual factors, including social others, and comprises feelings, thoughts and actions [23]. Experience can also be seen as a continuous self-talk [20] with pain and pleasure of different intensities evolving over time [23]. Overall, the relations between cognition and emotion are highly complex [41], but clearly their developments are influenced by emotional experiences gained in social interactions [20].

Indeed, there exist a number of studies on understanding how parents shape the cognitive and emotional development of very young children, mostly infants and pre-schoolers, through verbal behaviours (e.g., [19, 30, 32, 35, 42, 57]). However, these studies relied on conventional manual methods of data collection and analysis. Such methods can be prohibitively time-consuming and very costly when applied to a vast body of conversational data. The prolonged time gap between collecting data and yielding results may cause researchers to miss the opportunity to maximize the potential impact of the work. This hurdle can be alleviated by automating the analysis process, leveraging progressively more sophisticated signal processing and machine learning models (see recent reviews on sentiment analysis tools [9, 55, 58]). Both verbal and nonverbal data (i.e. word, voice, facial expression, and psychophysiological responses) are relevant to automatic emotion analysis. Each data type involves a nontrivial body of related work. Hence, we focus on voice only in the ensuing discussion, given our main research question.

2.2 Intergenerational Communication

To the best of our knowledge, very few studies have been conducted in applying AVER to examine intergenerational communication in natural settings. One related study was done by Stolar and her colleagues [60], but they manually coded emotions elicited in parent-adolescent conversations in a quiet laboratory room where each dyad was asked to discuss three pre-selected topics for 20 minutes each. They then applied probabilistic models to analyze how the dyad influenced each other's emotions. Their controlled lab-based (cf. our natural) setting had a higher risk of confounding due to emotion regulation (i.e. suppressing certain emotional expressions when aware of being under observation) [63].

As compared with the number of studies on vocal communication of emotion for children with special needs (e.g. autism) ([26, 28, 53]) for which research efforts are clearly well-justified and much needed, research addressing emotions evoked in ordinary pre-teens and their parents when discussing issues pertinent to their everyday lives is scanty but likewise worthy. We regard this line of inquiry as significant, given the role of parents in shaping their children’s development, as ascertained by decades of educational and psychological research (e.g. [5, 6, 10, 12, 31]). Nonetheless, a caveat is that this paper does not aim to validate any child development models. While it is worthwhile to analyse ideas articulated in a child-parent dyad to see how they are interrelated, this is beyond the scope of this paper as it entails a very comprehensive analysis of the domain-specific content (i.e. legal competence for this study); a separate paper is needed to do full justice to this particular inquiry. In fact, as mentioned above, AVER focuses on the acoustic but not on the semantic channel of speech [91].

2.3 Voice Data Transcription and Segmentation

Like most initiatives for automation, a major rationale for voice emotion analysis is to enhance the efficiency of the process, relieving humans from tedium and expediting the harvest of findings. Based on the earlier survey [43] and more recent studies (e.g., [46]), the average recognition accuracy of untrained adult coders in decoding emotions in vocal expressions was about 60% (range: 56%-66%). There were notable differences across emotions: ‘anger’, the most recognizable, had the highest accuracy, and ‘disgust’, the most confusing, had the lowest, even less than what would be expected by chance. Clearly, different research protocols and tools contribute to differences in the recognition accuracy of humans and of machines [15].

Full automation of voice emotion analysis is not yet in place; raw data require tedious pre-processing before they can be analysed with dedicated software tools. Depending on the research goal, knowing what emotions are elicited in dialogues may exempt transcription, but segmentation of audio files is still required. To understand why certain emotions are elicited, both transcription and segmentation are needed. These data preparation processes can be slow and expensive. For transcription, while various attempts to improve the performance of speech-to-text recognition (STR) technology have been undertaken since the late 1990s (reviews by [56] and [25]), some common STR tools achieved less than 40% accuracy [35] and others needed intensive training to reach an acceptable rate of more than 80% for supporting learning [56]. Overall, the low reliability and accuracy of STR technology remain a concern. As the quality of raw data is critical for subsequent analysis processes, manual transcription and segmentation with high accuracy and reliability should be applied. The literature on voice emotion analysis (e.g. [15, 46]) shows that segmentation has largely been a manual process, though the unit of analysis may vary (e.g. fixed duration, turn-taking, meaning). Especially tricky for automatic segmentation are the varying length (or even lack) of natural pauses to be used as segment delimiters and the simultaneous talking of speakers [49]. These issues are relevant to our data (Section 4.2).


2.4 Voice Emotion Analysis

2.4.1 Vocal cues for emotion

Research on emotional cues manifested in voices has a history of about 50 years [64]. The main theoretical concept behind vocal emotion analysis is that affective states have a specific effect on the somatic nervous system, which in turn changes the tension of the musculature, influencing voice production [48]. Acoustic features include pitch (level, range and contour of the fundamental frequency F0), intensity (or the energy of the voice) and duration [48]. AVER entails extracting these and other acoustic features (or parameters) to understand the patterns of different emotional states [49]. Vocal cues as indicators of emotions have been researched for different applications, for instance, voice emotion recognition in negotiation systems to visualise feedback [38], tutoring applications to improve communication [39], entertainment and game programs [27], emotionally-intelligent car systems [51], emotion tracking applications integrated in virtual agents [62], and vocal chat media [11]. However, the sets of acoustic features used for analysis varied across studies, making it very difficult to perform a meta-analysis on recognition accuracy [15].
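To make these cues concrete, the sketch below extracts a pitch (F0) contour and a frame-wise energy measure from a single audio file using the open-source librosa library. This is purely illustrative: the file name and parameter values are hypothetical, and it is not the toolchain used in our study (we used openSMILE; Section 4.2.2).

```python
# Illustrative extraction of pitch (F0) and intensity (RMS energy) cues
# from one utterance with librosa; file name and parameters are examples only.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical audio file

# Fundamental frequency (F0) contour via probabilistic YIN
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Intensity proxy: root-mean-square energy per frame
rms = librosa.feature.rms(y=y)[0]

print("F0 mean (Hz):", np.nanmean(f0), "F0 range (Hz):", np.nanmax(f0) - np.nanmin(f0))
print("Mean energy:", rms.mean(), "Duration (s):", len(y) / sr)
```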

2.4.2 Emotion corpora

In the last decade, there has been a proliferation of corpora of vocal emotional expressions such as EMO-DB [7], IEMOCAP [8], GEMEP [2], and TUM-AVIC [50], to name just a few. Different research groups, driven by specific research questions, have constructed corpora with specific methodological choices [13], including: who (i.e., trained actors vs laypeople; number of speakers), what (i.e., a set of selected emotions; verbal content), how (i.e., scripted or free-format speech), where (i.e., lab or field), and which (i.e., training algorithms; number of human coders) (see [2, 8] for examples of emotion corpus construction and [15] for a review). No standard or reference set of vocal emotional expressions is available [15, 76], resulting in variations and inconsistencies. A salient example is the in(ex)clusion of the neutral state as a permissible emotional category. Of particular concern to our work is the lack of an emotion corpus on child-parent dialogues. FAU AIBO [59] is one of the few corpora with voices of children, who, however, converse with a robot dog rather than with a human partner. While some proprietary vocal corpora may include children's voices, they are inaccessible to the research community [15].

2.4.3 SVMs vs. Deep Learning

SVMs are one of the state-of-the-art machine learning approaches, and have been used broadly in earlier (e.g. [86, 98]) as well as more recent work on voice emotion analysis (e.g., [15, 96]). SVMs are a form of supervised learning, relying on labelled data and selected sets of features to train classifiers. Deep Learning [70], an emerging machine learning approach, can learn automatically which features are needed for classification. Nonetheless, Deep Learning may not be better than SVMs in terms of classification performance, especially for a small dataset, and it needs substantial infrastructure (e.g., supercomputers) to train on large datasets in a reasonable time. In a recent comprehensive review [85] it is stated that: "Though deep learning based methods are quite popular in text and visual affect recognition … the hand-crafted feature computation methods, e.g., OpenSMILE, are still very popular and widely used in the audio affect classification research." (p.118).

2.5 Neutral Emotion

Voices can be classified as neutral when they are perceived as devoid of emotions. Whether neutral is an emotional state is debatable. The ambiguity is reflected in incoherent approaches in emotion research. It is common that 'neutral' is used as an extra category to accommodate expressions whose content has no emotional value (e.g. random numbers) or whose emotion may be so subdued that a human listener or a machine cannot decipher it [50].

In [37], neutral vocalization is operationalized as words that "are spoken in a monotone voice, with minimal inflection and in a matter-of-fact fashion" (p.3225). In [50], the label 'neutral' is chosen when a participant is involved in a conversation but it cannot be judged whether she is interested in the topic. In acknowledging the complexity and difficulty of defining neutral speech, some researchers argue that as neutral expressions are frequently interpreted as a low-intensity, weak emotion without any recognizable cues [76, 82, 83], they should be labelled as such in order to ensure coding consistency across human coders [59]. This assumption, however, is questionable, because human coders may be tempted to classify those low-intensity units as neutral when they are unsure about their emotional values (cf. the "I don't know" option for a close-ended question [78]).

Furthermore, a so-called 'neutral baseline' is often used to contrast with other emotive values; the direction and extent to which they deviate from such a baseline indicate the nature and strength of the emotion. It is akin to the notion of 'degree of emotive markedness': "… more or less of x, where x is regarded as an implicitly neutral, unmarked midpoint of an emotive continuum, such as positive/negative… " ([74], p.353). Such emotive markers are actually physiological measures encoding valence/arousal fluctuations in voice to be perceived and labelled by humans. The threshold of neutrality seems to be set arbitrarily by a specific research group for a specific project. In other words, a reference framework enabling neutral emotion to be consistently classified is essentially lacking. While some corpora contain a high ratio of neutral instances (FAU AIBO [59]: 71% of 48401 words are neutral), others (e.g. GEMEP [2]) do not include neutral as a category at all. In our study, we included the neutral class to analyse its implications.

As neutral vocal emotion is defined in relation to the extremes of an emotion spectrum, it is relevant to know how ‘full-blown emotion’ [87] is characterized. Paradoxically, a stereotypical portrayal of full-blown emotion is speechlessness or incoherency in speech [76]. This observation may be attributed to the assumptions that the five subsystems of emotion (Section 2.1) are competing for resources and that at the full-blown end the experiential, neurophysiological and executive subsystems become so dominant as to suppress or interrupt the other two subsystems, the expressive and cognitive. Hence, one can infer that emotions conveyed by naturally fluent speech tend to be weak or mixed, as their expressions are consciously regulated [76].

3 EMPIRICAL STUDY

3.1 Video game

The child-parent conversations were collected as part of an 18-month research project aiming to assess children's (8-11 years old) awareness of law in their everyday lives with an Android tablet-based game, which was designed based on the findings of participatory design workshops with children. Four settings with which children are familiar were identified - School, Shop, Park, and Friend's home - as micro-worlds of the game. In each of these micro-worlds, several scenarios are presented where children are expected to apply their legal competence to interpret the situations and select an action out of the given options accordingly. The lines shown at the beginning of the Introduction (Section 1) are responses to the scenario "physical chastisement" in Park, where a woman is portrayed as being about to hit a child (Fig. 1). A player is asked to indicate whether a person is (not) allowed to hit children by dragging her/his head to the 'Yes' or 'No' test tube (or, if undecided, to the 'Don't know' test tube), and to explain the decisions. To avoid prolonging the paper, only one scenario per micro-world (Figures 1 - 4) is presented to illustrate the main design concepts of the video game (see the project website1 for details).

1 https://www2.le.ac.uk/departments/law/research/law-in-childrens-lives

*** Insert Figure 1 – 4 here ***

The video game consists of 13 real-life scenarios. Each scenario has 2 to 3 quantitative questions with different response formats, and 11 of the scenarios have a follow-up question “Why do you think that?” posed by a non-player character, an alien, to elicit reasoning underlying a choice. No length limit was imposed on the gameplay, and all responses were recorded in the game.

3.2 Procedure and Participants

The project consisted of two studies – one took place in children's schools (School Study) and the other in their homes (Home Study). The former involved eight schools and 634 children, whereas the latter involved 73 (out of the 634) children and their parents. The empirical studies were approved by the Research Ethics Committee of the University of Leicester, and consent forms were signed by both children and parents.

For School Study, children played the digital game on an individual basis in their classroom under the supervision of their teachers and researchers. Each child was asked to wear a headset to minimize distraction from their classmates and to facilitate the capture of their vocal responses in the game. They were debriefed at the end of the gameplay, gaining factual knowledge related to certain questions (e.g., the age of criminal responsibility) and discussing some attitude-based ones (e.g. the appropriateness of a school excursion for boys only). For each school, ten children were randomly selected to take part in Home Study. With a prior agreement to play the game at home, the children took home a pack containing the tablet game, a digital voice recorder (Olympus WS-852 PC), an instruction sheet and an evaluation form. Participants were instructed to discuss the scenarios and justify the decisions they made. The main goal of Home Study was to find out whether children would change their responses to the questions in the game as a result of discussing them with their parents, but addressing this goal is not the focus of this paper. This paper is primarily based on Home Study, whose results are not reported in any of the project's publications, which are based on School Study.

To respect the privacy of the participants, no demographic data were collected; data on the children's gender were provided by their teachers, and their age was estimated from the year group (Year 4, 5 or 6) to which they belonged. We left the families to decide which parent would play the game with the child; we did not aim to study gender as a variable. We randomly picked two to three cases from each of the eight schools, resulting in 20 recordings of child-parent conversations for the meticulous emotion analyses, both human- and machine-based.

4 EMOTION ANALYSIS APPROACHES

The manual transcription and segmentation of audio recordings and the human coding were very costly in terms of time and effort. To analyse the emotion expressions in the child-parent dialogues, the audio files were transcribed and formatted on a turn-taking basis (cf. the excerpt presented in the Introduction). Segmentation was also based on turn-taking. Rarely was there a pause between consecutive turns. As the child and parent tended to talk simultaneously, automatic segmentation was rendered impractical. On average, the ratio of recording to transcribing time was 1:8 and of recording to segmenting time about 1:7; these tasks were performed by research assistants not professionally trained but experienced in them.

4.1 Human Coding

To facilitate the benchmarking of the AVER approach, for each segment, human coders classified the valence (positive, negative, neutral) and arousal (high, low, neutral) as the first level of coding. Reducing codes in this way is a common practice in emotion research. For instance, in applying the complex scheme LIFE [24], which contains ten emotion codes and numerous others, [60] collapsed the ten into three and added "silence".

Two human coders were involved: one is a UX researcher focusing on measuring emotions and the other is a linguist researching language and culture; they have worked in their respective fields for more than a decade. Prior to the launch of the coding task, the two coders discussed emotion expressions from the UX and linguistic perspectives to set a common ground. Each coder had access to the transcripts and audio files. The following steps were undertaken:

(i) read through the transcript in one go; (ii) code the valence of each segment of the transcript; (iii) listen to the entire audio recording in one go; (iv) code the arousal of each segment of the audio recording while listening to it; (v) revisit the transcript-based valence and recode, if necessary, while listening to the audio segment.


Steps (i) and (iii) were to contextualize the conversation and to identify the general atmosphere. Step (ii) was to ensure that due attention was paid to the words, which would otherwise be masked by the intensity (arousal) of the voice [43]; this justified the sequence of (iv) and (v). The last step allowed fine-tuning of the coding. The rationale for these rounds of coding was to maximize the recognition accuracy on the part of the human listeners, but the process was obviously very laborious. Throughout, regular meetings were held between the two coders to negotiate the discrepancies in coding. For the first five cases, the inter-rater reliability was moderate (Cohen's kappa = 0.53). The discrepancies lay mainly in the neutral cases. The kappa improved to 0.71 after rounds of discussion and recoding.

4.2 Voice Emotion Analysis

The emotional content of the audio recordings was analysed using SVMs, supervised machine learning algorithms trained on the patterns of emotions in the given signals. An English emotion corpus was used to train and develop a classification model, which was subsequently employed to evaluate the emotional values of the recorded child-parent conversations.

4.2.1 Emotional corpus

A plethora of emotion corpora has been developed with different characteristics (e.g., speaker attributes, number of utterances, language, modality, setting). As there is no parent-child emotion corpus, we selected the most relevant one: Interactive Emotional Dyadic Motion Capture (IEMOCAP) [8]. Developed in 2008, IEMOCAP consists of about 12 hours of audio-visual data of dyadic interactions from ten trained adult actors (five male, five female). The actors performed scripted and improvised emotional scenarios designed to evoke specific types of emotions. The recorded data were manually segmented into 10039 utterances, each being a chunk spoken by one actor. The average length and duration of an utterance are 11.4 words and 4.5 seconds. The utterances were manually coded with two annotation schemes ([8], p.345): "discrete categorical based" with ten states and "continuous attribute based" with three dimensions on a 5-point scale: valence (negative-positive), arousal (calm-excited), and dominance (weak-strong). There have been ongoing debates on the strengths and limitations of discrete and dimensional models of emotion (see [41, 44]).

In this paper, we used four emotion categories (happy, angry, sad, neutral) to be consistent with most previous work on IEMOCAP, including its originator [81]. We applied the dimensional model and mapped the ratings to three classes following a recent scheme [96] (i.e. ratings in [1-2) as Negative/Low, [2-3) as Neutral, and [3-5] as Positive/High; Table 1) to facilitate the benchmarking.

Table 1. Distribution of emotion states in binary dimension

Dimension    Positive/High    Negative/Low
Valence      Happy            Sad, Angry
Arousal      Angry, Happy     Sad
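For clarity, the sketch below spells out the mapping described above: a 5-point dimensional rating is assigned to Negative/Low for values in [1, 2), Neutral for [2, 3), and Positive/High for [3, 5]. The function name and example ratings are ours and purely illustrative.

```python
# Illustrative mapping of 5-point valence/arousal ratings to three classes,
# following the scheme described above ([1,2) -> Negative/Low,
# [2,3) -> Neutral, [3,5] -> Positive/High).
def map_rating(rating: float) -> str:
    if 1 <= rating < 2:
        return "Negative/Low"
    if 2 <= rating < 3:
        return "Neutral"
    if 3 <= rating <= 5:
        return "Positive/High"
    raise ValueError(f"Rating {rating} is outside the 1-5 scale")

# Example: averaged annotator ratings for three hypothetical utterances
for r in (1.5, 2.5, 4.0):
    print(r, "->", map_rating(r))
```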

One key factor contributing to the computational cost of automatic emotion analysis is the database size (i.e., the number of utterances). The larger the database, the more time and resources are required for training the machine learning algorithms. We studied this factor by using subsets of IEMOCAP utterances. Indeed, rarely is the entire emotion corpus used to generate the classification model, as shown in earlier as well as recent work deploying IEMOCAP (e.g., [73, 93, 96]). To enable comparison of our results with those of the earlier studies, we selected two subset sizes that fall within the related ranges. However, as there was an imbalanced distribution of samples among the three classes, we applied SMOTE (Synthetic Minority Over-sampling Technique [75]) to produce random subsets of the corpus with a more even distribution: a smaller one of about 550 utterances per class (~1650 in total) and a larger one of about 2200 utterances per class (~6600 in total); the sizes for valence differ slightly from those for arousal (Table 2).


Table 2. Number of utterances in the three classes of valence/arousal in the small/large subsets

Subset   Dimension   Negative/Low   Neutral      Positive/High
Small    Valence     538 (32%)      546 (33%)    577 (35%)
Small    Arousal     546 (33%)      533 (33%)    563 (34%)
Large    Valence     2168 (32%)     2401 (36%)   2107 (32%)
Large    Arousal     2209 (34%)     2111 (33%)   2103 (33%)
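The class balancing step can be sketched as follows with the SMOTE implementation in the imbalanced-learn package; the feature matrix, labels and random seed here are placeholders and do not reproduce our exact sampling procedure.

```python
# Illustrative class balancing with SMOTE (imbalanced-learn), assuming X holds
# acoustic feature vectors and y the class labels
# (0: Negative/Low, 1: Neutral, 2: Positive/High). Data are placeholders.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X = rng.normal(size=(3000, 88))                           # placeholder features
y = rng.choice([0, 1, 2], size=3000, p=[0.5, 0.3, 0.2])   # imbalanced labels

print("Before:", Counter(y))
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("After: ", Counter(y_bal))   # classes now have roughly equal counts
```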

4.2.2 Feature extraction

Another key factor determining the computational cost is the size of the acoustic feature set used to train and test classification models and to apply them to new, unseen data. To extract acoustic features from a database, the open-source openSMILE extraction tool [17] is widely used. Our approach was to use the predefined openSMILE set Emo-Large with 6552 features [84], the largest feature set known to date. For comparison, we used the minimalistic set of 88 acoustic features known as eGeMAPS [15], which was designed to address two issues with large feature sets. First, large brute-force feature sets tend to over-adapt classifiers to the training data in machine learning problems, reducing their generalizability to unseen (test) data [54]. Second, interpreting the underlying mechanisms of thousands of features is very difficult, if not impossible [15].
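As an illustration of what such feature extraction looks like in practice, the sketch below computes the 88 eGeMAPS functionals for one segment using the Python wrapper of openSMILE (the "opensmile" package); this wrapper is not necessarily the interface we used, the file name is hypothetical, and the larger Emo-Large set is obtained through openSMILE's bundled configuration files instead.

```python
# Illustrative extraction of the 88 eGeMAPS functionals with the Python
# wrapper of openSMILE; file name is an example only. The 6552-feature
# Emo-Large set is produced from openSMILE's own configuration files instead.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # minimalistic 88-feature set
    feature_level=opensmile.FeatureLevel.Functionals,   # one feature vector per file
)
features = smile.process_file("segment_001.wav")        # pandas DataFrame with 88 columns
print(features.shape)
```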

4.2.3 Classification

Using the WEKA toolkit [22], SVMs with a linear kernel were trained using the Sequential Minimal Optimization (SMO) algorithm. SVM classifiers are state-of-the-art algorithms for emotion recognition: they generalize well to unknown data and are capable of handling high-dimensional feature spaces. To optimize the recognition results of SVM classifiers, it is necessary to identify a good value for the margin (or the complexity parameter known as C) around the boundaries between classes. Consequently, we tested a range of values for C in an N-fold (here N = 10) stratified cross-validation paradigm. In this paradigm, the development set is randomly split into N folds of the same size; N-1 folds are used for training the classifier and one fold is used for testing. This process is repeated N times, and the overall performance is taken as the average of the performances achieved in the N runs. In stratified cross-validation, the generated folds contain roughly the same proportions of the emotional labels.
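A minimal sketch of this training scheme, using scikit-learn rather than WEKA/SMO: a linear-kernel SVM is evaluated with stratified 10-fold cross-validation over a grid of candidate C values, scored by macro-averaged recall (i.e. UAR). The data, labels and C grid are placeholders.

```python
# Illustrative C tuning for a linear-kernel SVM with stratified 10-fold
# cross-validation, scored by unweighted average recall (macro recall).
# This mirrors the WEKA/SMO set-up described above, but with scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1650, 88))        # placeholder: 1650 utterances x 88 features
y = rng.choice([0, 1, 2], size=1650)   # placeholder class labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for C in (0.01, 0.1, 1.0, 10.0):       # candidate complexity values
    clf = SVC(kernel="linear", C=C)
    uar = cross_val_score(clf, X, y, cv=cv, scoring="recall_macro").mean()
    print(f"C = {C}: UAR = {uar:.3f}")
```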

We computed the recall (= TP / (TP + FN), where TP is True Positive and FN is False Negative) for each of the three classes for both valence and arousal and averaged them, resulting in the respective unweighted average recall (UAR). For a balanced number of instances per class, as is the case here (cf. Table 2), UAR is an appropriate metric. Table 3 shows the UARs for the four conditions defined by the sizes of the acoustic feature set and of the utterance dataset used for training and testing the classifiers.

Table 3. Results of UAR based on the training data over the three classes of valence and arousal, using the small/large subsets and two feature sets (Minimalistic: 88 vs. Extended: 6552)

Small subset (1650)
Feature set        Dimension   Negative/Low   Neutral   Positive/High   UAR
Minimalistic: 88   Valence     0.51           0.62      0.52            55%
Minimalistic: 88   Arousal     0.65           0.50      0.59            58%
Extended: 6552     Valence     0.67           0.59      0.60            62%
Extended: 6552     Arousal     0.82           0.64      0.63            70%

Large subset (6550)
Feature set        Dimension   Negative/Low   Neutral   Positive/High   UAR
Minimalistic: 88   Valence     0.59           0.65      0.32            54%
Minimalistic: 88   Arousal     0.68           0.56      0.46            57%
Extended: 6552     Valence     0.83           0.74      0.73            77%
Extended: 6552     Arousal     0.77           0.46      0.59            61%


One observation is that the larger the feature set, the higher the UAR; this corroborates the findings of existing work and points to the issue of over-fitting (e.g. [15, 54]). However, the effect of the size of the utterance set is peculiar. Instead of the expected increase, for the minimalistic feature set the UAR becomes lower, albeit by only 1%, when the larger utterance set is used. For the extended feature set, the UAR increases by 15% for valence but decreases by 9% for arousal when the larger utterance set is used.

Overall, the UARs yielded are similar to those of previous work, although that of the extended-feature-large-utterance-set for valence (i.e. 77%) is higher (Section 6.2). Nonetheless, the increase in accuracy comes with an increase in computational cost in terms of the infrastructure and time used (Table 4; see Section 6.2 for a discussion of this issue).

Table 4. Infrastructure and time used for each of the four feature-utterance-set conditions

Features   Segments   Time        Infrastructure
88         1661       ~15 mins    Laptop: 3.3 GHz Intel Core i7, 16GB RAM
88         6676       ~15 mins    Laptop: 3.3 GHz Intel Core i7, 16GB RAM
6552       1661       ~90 mins    Laptop: 3.3 GHz Intel Core i7, 16GB RAM
6552       6676       ~7 hours    HPC SPECTRE: 28-core 2.6 GHz Intel Xeon, 264GB RAM

5 RESULTS

5.1 Descriptive Statistics

The total duration of the 20 audio recordings was 454.96 minutes (Mean = 22.75, SD = 9.85, range = 11.41 – 56.57). They amounted to a total of 4297 segments, of which 2174 were uttered by the children and 2123 by their parents (Table 5); the segment length varied from one to 207 words.

Overall, averaging over the utterances made by the 20 children and their parents, 30% and 22% were positively and negatively valenced, whereas 40% and 17% showed high and low levels of arousal, respectively. 47% and 43% of the segments were classified as emotionally and energetically neutral (Table 6).

Table 5. The number of segments per dyad

         Mean    SD     Range
Child    108.7   44.9   46 - 205
Parent   106.2   51.4   22 - 225
All      214.9   95.5   72 - 430

Table 6. Mean (SD) [Range] of percentages of segments of 20 Child-Parent dyads classified by human coders

          Valence (%)                                            Arousal (%)
          Positive          Negative         Neutral             High              Low              Neutral
Child     34 (12) [13-55]   21 (7) [9-33]    45 (16) [19-71]     39 (18) [12-78]   17 (11) [3-41]   43 (21) [14-85]
Parent    27 (11) [8-52]    24 (14) [5-57]   49 (14) [26-80]     41 (13) [18-67]   16 (14) [0-54]   43 (17) [15-80]

5.2 Accuracy of Voice Emotion Analysis

We computed both the Unweighted/Weighted Average of Recall (UAR/WAR), given the uneven distribution of the segments over the three classes of valence/arousal with higher percentages of Neutral (Table 6).

WAR = (R1*C1 + R2*C2 + R3*C3) / (C1 + C2 + C3)   {Equation 1}

where Ri and Ci are the recall and the number of segments of class i, respectively.
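To make the two metrics explicit, the sketch below computes UAR and WAR from per-class recalls and per-class segment counts; the numbers are placeholders rather than our results.

```python
# Illustrative computation of UAR and WAR (Equation 1) from per-class
# recalls (R_i) and per-class segment counts (C_i); values are placeholders.
recalls = [0.50, 0.35, 0.60]   # recall for Negative/Low, Neutral, Positive/High
counts  = [300, 900, 600]      # number of segments per class

uar = sum(recalls) / len(recalls)
war = sum(r * c for r, c in zip(recalls, counts)) / sum(counts)
print(f"UAR = {uar:.2f}, WAR = {war:.2f}")
```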

Table 7. Mean (SD) [Range] of automatic voice emotion recognition (AVER) accuracy of Child-Parent dialogues with the small utterance set and 6552 features

          UAR Valence               UAR Arousal               WAR Valence               WAR Arousal
Child     0.35 (0.03) [0.31-0.41]   0.34 (0.03) [0.29-0.40]   0.35 (0.11) [0.20-0.57]   0.42 (0.15) [0.14-0.76]
Parent    0.35 (0.04) [0.29-0.42]   0.36 (0.06) [0.29-0.53]   0.30 (0.08) [0.14-0.45]   0.42 (0.13) [0.21-0.67]

Table 8. Mean (SD) [Range] of automatic voice emotion recognition (AVER) accuracy of Child-Parent dialogues with the large utterance set and 6552 features

          UAR Valence               UAR Arousal               WAR Valence               WAR Arousal
Child     0.46 (0.04) [0.35-0.53]   0.37 (0.03) [0.30-0.42]   0.49 (0.09) [0.41-0.66]   0.46 (0.17) [0.22-0.68]
Parent    0.44 (0.05) [0.37-0.52]   0.40 (0.07) [0.32-0.55]   0.39 (0.06) [0.31-0.55]   0.41 (0.14) [0.19-0.63]

As indicated in Table 3, the 6552-feature set performed better than the 88-feature one for both the small and the large utterance set. Consequently, we used the respective classification models to analyze the child-parent data. Overall, the UARs and WARs were modest irrespective of the size of the utterance set, although those of the large utterance set (Table 8) were generally higher than the corresponding values of the small set (Table 7). WAR is more relevant, given the uneven number of segments per class. Moreover, the infrastructure used in the two conditions was different: laptop vs. HPC and 90 minutes vs. 7 hours (cf. Table 4). The gain in accuracy does not seem well justified by the higher computational cost. Given the unsatisfactory performance of applying the classification model based on the extended-feature-small-utterance-set to the child-parent data, we studied the related confusion matrix (Table 9). The 13% accuracy for the class Negative/Low (code = 0) and 6% for Neutral (code = 1) were contrastingly low as compared with 85% for Positive/High (code = 2). A similar pattern was observed for the extended-feature-large-utterance-set, though with a lesser contrast, with respective values of 33%, 14% and 89%.

Table 9. Confusion matrix for applying the classification model based on the extended-feature-small-set to the C-P data (0: Negative/Low, 1: Neutral, 2: Positive/High)

                      Classified
Actual         0      1      2
Valence  0     44     9      331
Valence  1     164    66     820
Valence  2     77     17     642
Arousal  0     62     16     347
Arousal  1     208    74     854
Arousal  2     66     27     487
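The per-class recalls quoted above can be reproduced directly from Table 9; the short sketch below does so for both the valence and the arousal matrices.

```python
# Recomputing per-class recall from the confusion matrices in Table 9
# (rows: actual class, columns: classified class;
#  0: Negative/Low, 1: Neutral, 2: Positive/High).
import numpy as np

valence = np.array([[44, 9, 331], [164, 66, 820], [77, 17, 642]])
arousal = np.array([[62, 16, 347], [208, 74, 854], [66, 27, 487]])

for name, cm in (("Valence", valence), ("Arousal", arousal)):
    recall = cm.diagonal() / cm.sum(axis=1)   # TP / (TP + FN) per class
    print(name, np.round(recall, 2))
# Averaging valence and arousal per class gives roughly 13%, 6% and 85%,
# as reported in the text.
```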

Table 9 is better interpreted in juxtaposition with Table 6, which shows that human coders rated 47% of the segments as Neutral in valence (averaging 45% and 49%) and 43% as Neutral in arousal. However, the classification model misclassified a substantial number of Neutral segments as Positive Valence and High Arousal (see Section 6.1 for a discussion of plausible reasons for this misclassification).

5.3 Binary Mapping

Many of the existing emotion corpora have a set of overlapping as well as distinct emotion labels [15]. To compare the accuracy across the corpora, labels are mapped to binary dimensions (cf. Table 1). While this mapping approach is commonly adopted in Affective Computing [15], as we pointed out earlier, there is no clear theoretical or methodological justification underpinning it.

To study the effect of this seemingly arbitrary mapping, we mapped the 3-class labels derived from the extended-feature-small-utterance-set to binary ones. The accuracies of the binary classification were better (Table 10), but they remained low when the neutral state was mapped to Negative/Low (N0), for both Child and Parent. In contrast, the accuracies were notably higher when the neutral state was mapped to Positive/High (N1). The differences can be understood in light of IEMOCAP's better fit for Positive/High (Table 9). With the conventional mapping, the results of binary classification are those shown in the red-lined box, implying that both valence and arousal are sensitive to the type of binary mapping used.

Table 10. Mapping the neutral class to binary classification

Binary mapping                  Dimension   Child   Parent
N0: Neutral as Negative/Low     Valence     0.42    0.37
                                Arousal     0.45    0.44
N1: Neutral as Positive/High    Valence     0.71    0.67
                                Arousal     0.78    0.76
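The two mappings can be made concrete with the sketch below, which folds the neutral class into Negative/Low (N0) or Positive/High (N1) before scoring; the labels, predictions and use of plain accuracy here are placeholders for illustration only.

```python
# Illustrative N0/N1 binary mappings of 3-class labels
# (0: Negative/Low, 1: Neutral, 2: Positive/High) before scoring.
from sklearn.metrics import accuracy_score

def to_binary(labels, neutral_as):
    """Map 3-class labels to binary; neutral_as is 0 (N0) or 2 (N1)."""
    return [neutral_as if lab == 1 else lab for lab in labels]

y_true = [0, 1, 2, 2, 1, 0, 2]   # placeholder human codes
y_pred = [2, 2, 2, 2, 1, 0, 1]   # placeholder classifier output

for name, neutral_as in (("N0: Neutral as Negative/Low", 0),
                         ("N1: Neutral as Positive/High", 2)):
    acc = accuracy_score(to_binary(y_true, neutral_as),
                         to_binary(y_pred, neutral_as))
    print(f"{name}: accuracy = {acc:.2f}")
```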

6 DISCUSSION

In the Introduction (Section 1), we highlighted four contributions that this research work aimed to achieve. This section is structured accordingly. First, we argue for the applicability of AVER to child-parent conversation analysis, identifying reasons for the modest performance of AVER and proposing scenarios for future applications (Section 6.1). Second, we discuss the computational costs of AVER by examining the relationship between the number of features and utterances used and the classification accuracies observed in nine related studies, raising the concern about the lack of standards for evaluating the cost-effectiveness of different conditions for AVER (Section 6.2). Third, we discuss the notion of neutral emotion, which can have significant conceptual and methodological implications for future research on AVER and its related applications (Section 6.3). Fourth, we discuss how our emotion corpus on child-parent dialogues should be developed further to enhance its strengths and enable access by the wider research community (Section 6.4). In addition, we reflect on the limitations of our empirical study and delineate our research agenda for this emergent area (Section 6.5).

6.1 Scope of AVER Applicability

Results show that the AVER performance was modest, especially when applying the SVM classification model based on the extended-feature-small-utterance-set. However, the recalls for the three classes (Table 3) suggest that the model itself was good. To verify whether the poor accuracy was due to differences between the distributions of the training and the testing (unseen) data, we visualized the distributions of the two datasets: the model-development dataset and the unseen dataset. The visualization indicates some differences. Basically, we selected the two most important attributes by using the WEKA CorrelationAttributeEval algorithm and plotted them as the two axes of a graph. As illustrated in Figure 5, the green (light) dots represent the model-development data and the grey (dark) ones the unseen data.

*** Insert Figure 5 here ***


Furthermore, we used Wilcoxon signed-rank tests to see how the two distributions vary in terms of their feature spaces. The feature selection was performed using the WEKA correlation feature selection (CFS) algorithm to identify the attributes most highly correlated with the classification. Results show that the training data have a significantly different feature space from that of the testing data (e.g., using a subset of the child-parent dialogues on the most correlated attribute, "energy_mean": Z = 5.9, p < .0001). The observed discrepancy can be attributed to the fact that our model was not trained on a scenario-specific corpus, given the lack of a relevant child-parent corpus.
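As an illustration of this kind of distribution check, the sketch below compares a single selected feature (e.g. a mean-energy attribute) between the training data and the unseen data with a nonparametric rank test from SciPy. Note that our reported test is the Wilcoxon signed-rank test; because the two samples in this sketch are unpaired and of different sizes, it uses the related Mann-Whitney U (rank-sum) test instead, and all data are placeholders.

```python
# Illustrative comparison of one feature's distribution between the training
# (corpus-based) data and the unseen child-parent data. The study reports a
# Wilcoxon signed-rank test; as the samples here are unpaired and of unequal
# sizes, this sketch uses the related Mann-Whitney U rank test.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
train_energy_mean = rng.normal(loc=0.0, scale=1.0, size=1650)    # placeholder
unseen_energy_mean = rng.normal(loc=0.6, scale=1.2, size=430)    # placeholder

stat, p = mannwhitneyu(train_energy_mean, unseen_energy_mean)
print(f"U = {stat:.1f}, p = {p:.2e}")   # a small p suggests differing feature spaces
```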

There exist studies using SVMs for voice emotion analysis of individual speakers with good results (e.g. [65]); applying SVMs to child-parent dialogues is rather new [66]. Arguably, the emerging Deep Learning (DL) approach may work better for voice/image data [67] (Section 2.4.3). The advantage of DL is that we do not need to determine ourselves which features to use, as DL can learn them automatically. However, with the SVM-based technique we used, we did not search for features either: we extracted a predefined set of vocal features that are known to be relevant to emotions.

We foresee utilizing the AVER approach to enhance children's mental health. An AVER-based application can be used as a parental training tool that enables parents to perceive voice emotion signals accurately during child-parent conversations, thereby providing emotionally-sensitive matched responses to engage children in open and desirable dialogues on a range of topics. The success of such a tool relies heavily on the development of effective machine learning models and on the design of a usable voice-sensing wearable device into which the models are integrated, presenting instantaneous and meaningful vocal emotion analysis results and recommendations to its users for regulating their emotions and behaviors, when applicable.

6.2 Computational Costs of AVER

To study the impact of varying the sizes of the feature and utterance sets on the performance of the classification model, we experimented with the four combinations. Results showed that the accuracy gain (Table 3) came at the expense of computational resources (Table 4). To calibrate our results, we made rough comparisons with prior work that also used SVMs and IEMOCAP. Nine such studies published in the period 2011-2017 were identified (Table 11). They used a range of combined feature/segment sets: six used 4 ('happy', 'sad', 'angry', 'neutral') or 5 emotion categories ('excited' included), two used the dimensional scheme, and one applied both [96].

None had a setting identical to ours. The closest was [72], with a smaller, albeit still high, number of features and a similar number of utterances. Compared with our extended-feature-large-utterance-set, our UARs of 61% and 77% (Table 3) were higher than the corresponding 52% and 57% of [72]. In [73], the authors shared our goal of comparing the performances of four different feature sets (i.e. [15, 61, 88]) with the same utterance set of 5531. Nonetheless, the relative gain in accuracy (58% vs. 62%) from the small set of 62 features to the large set of 6373 features was limited (i.e., the last row of Table 11).

Table 11. Studies applying SVMs to IEMOCAP (*: 4 or 5 categories; A: Arousal; V: Valence)

Ref.   Features             Segments   Emotions*   UAR
[98]   384                  5480       4           51%
[86]   85                   5531       5           61%
[81]   210                  1252       5           55%
[89]   384                  2762       4           57%
[93]   1582                 10037      A, V        65%, 54%
[72]   4368                 6829       A, V        52%, 57%
[96]   1582                 5531       4, A, V     60%, 62%, 62%
[79]   192                  5530       5           65%
[73]   62, 88, 384, 6373    5531       4           58%, 59%, 61%, 62%


These nine studies differ from our work in two major aspects. First, they did not apply their SVM classifiers to a new source of unseen validation data, as we did with the child-parent dialogues; they split a subset of the IEMOCAP utterances between training and testing. Second, they did not provide information on the infrastructure or time used for building their classification models (cf. Table 4). Hence, based on the UARs alone, the performance of our SVM classifiers derived from the 6552-feature-6550-utterance set was reasonably good (Table 3), and the modest generalization (Table 8) can be attributed to overfitting. In the realm of machine learning, generalization refers to how well an algorithmic model built from the training data applies to unseen data. Overfitting happens when the details and random fluctuations in the training data are learned by a model as concepts which do not apply to new data and thus negatively impact the model's ability to generalize [100]. Nonetheless, with limited prior work against which to benchmark our findings, it is hard to know whether the discrepant accuracy between the same and a different source of testing (unseen) data is a common phenomenon.

To demonstrate the effectiveness of the minimalistic 88-feature set (eGeMAPS), [15] conducted experiments with several emotion corpora (IEMOCAP was not included); the best average UARs were 66% and 80% for binary valence and arousal, respectively. However, the overall UAR varied from 34% to 81% when non-binary schemes were used with eGeMAPS on individual corpora with different numbers of utterances. The point is that the lack of standardized frameworks for building and testing classifiers makes it difficult to calibrate, as well as to understand, the discrepant performances across studies.
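For readers wishing to reproduce the feature extraction step, the eGeMAPS functionals can be obtained with the openSMILE toolkit [17]; the sketch below assumes its Python wrapper (the opensmile package) and uses a placeholder audio file name:

```python
# Sketch: extracting the 88 eGeMAPS functionals for one utterance with the
# opensmile Python package; the wav file name is a placeholder.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # minimalistic 88-feature set
    feature_level=opensmile.FeatureLevel.Functionals,  # one vector per utterance
)
features = smile.process_file("utterance_001.wav")
print(features.shape)   # (1, 88): one row of functionals for the utterance
```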

6.3 Neutral Emotion

As discussed in Section 2.5, neutral emotion is a disputed concept. Because a consensual definition of emotional neutrality is lacking, there is no standardized way to operationalize it. This may explain why a substantial portion (77%) of the neutral utterances was misclassified as Positive/High, as shown in the confusion matrix (Table 9). In contrast, the confusion matrix of IEMOCAP shows that the neutral class had a good recall of 74% [8]. One possible explanation for this contrast is that our child-parent utterances were more natural than the IEMOCAP ones: in the former, neutrality was perceived as having low prosodic variability, whereas in the latter it was encoded with exaggerated clarity and a slower speech rate [95], irrespective of whether the actors were in the scripted or improvised scenarios, as they were aware of the potential uses of their emotive expressions. There have been ongoing debates on the pros and cons of contrived versus natural emotive data in an emotion corpus [44, 76, 95].
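For clarity, the per-class recall values cited above follow from a confusion matrix in the usual way; the sketch below uses an invented 3x3 matrix (not our actual counts, and the class names are illustrative) with rows as true classes and columns as predicted classes:

```python
# Sketch: per-class recall from a confusion matrix (invented counts).
# Rows = true class, columns = predicted class.
import numpy as np

classes = ["Negative/Low", "Neutral", "Positive/High"]   # illustrative class names
cm = np.array([[50, 10, 40],
               [ 5, 15, 60],
               [ 8, 12, 80]])
recall = cm.diagonal() / cm.sum(axis=1)   # recall_c = correctly predicted c / all true c
for label, r in zip(classes, recall):
    print(f"{label}: recall = {r:.2f}")
```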

Recognizing neutral emotion is indeed challenging, especially as the neutral state is not always well defined and is often confused with other emotions [89]. Some researchers even dismiss the utility or necessity of the notion of emotional neutrality (e.g., [3]). Nonetheless, having a clear definition of neutrality matters, especially when applying AVER to mental health. In psychology, in which emotion research is rooted, there has been, to the best of our knowledge, limited research on defining or operationalizing neutral emotion beyond taking the middle value of a scale or using a neither/nor criterion. Clearly, more research along this line of inquiry is called for.

6.4 Child-Parent Emotion Corpus

An emotion corpus made of naturally occurring child-parent dialogues will likely improve accuracy when a classifier derived from it is applied to new, unseen real-life data, addressing the generalization issue observed in our findings reported above.

Developing an emotion corpus on a specific theme can be laborious and costly [13]. First, it involves capturing raw data with representative target groups, which, in our case, are young children and their parents. Ideally, high-performance recording equipment is deployed to maximise the quality of the raw data. However, this criterion is challenging to meet when the data are captured in a natural setting (i.e., "in the wild" [101]) rather than in a controlled lab-based setting. It then entails speech enhancement techniques (e.g., [102, 103]) to clean data collected in noisy environments such as a typical family home with children and household activities.
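As one possible pre-processing step under these constraints (a simple spectral-gating approach rather than the deep-network enhancement of [102, 103]; the file names are placeholders and the third-party noisereduce and soundfile packages are assumed), home recordings could be denoised along these lines:

```python
# Sketch: spectral-gating noise reduction for a noisy home recording.
# File names are placeholders; this is not the enhancement method of [102, 103].
import noisereduce as nr
import soundfile as sf

audio, sr = sf.read("home_recording.wav")
if audio.ndim > 1:                          # mix stereo down to mono for simplicity
    audio = audio.mean(axis=1)
cleaned = nr.reduce_noise(y=audio, sr=sr)   # estimate a noise profile and gate it
sf.write("home_recording_cleaned.wav", cleaned, sr)
```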


The second challenge of building an emotion corpus is annotating the data with emotional labels, be they dimensional (valence, arousal) or categorical (e.g., sad, happy, angry). As mentioned earlier, the ability of untrained coders to decode emotions in vocal expressions is moderate, with an average accuracy of only about 60% [44]. Hence, multiple coders are usually employed to ensure high reliability of coding. Crowdsourcing can be used to amass annotations from a large number of contributors [104]. This approach, however, entails additional ethical considerations, including de-identifying the data and presenting coders with extracted audio/video features rather than raw recordings for the annotation task. Alternatively, a large cohort of coders can be trained systematically to ensure a high level of inter-rater reliability.
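As a small illustration of the reliability check that multiple coders make possible (the label sequences below are invented), agreement between two coders on the same utterances can be summarized with Cohen's kappa:

```python
# Sketch: inter-rater agreement between two coders on the same six utterances.
# Labels are invented for illustration.
from sklearn.metrics import cohen_kappa_score

coder_a = ["neutral", "happy", "angry", "neutral", "sad", "happy"]
coder_b = ["neutral", "happy", "neutral", "neutral", "sad", "happy"]
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")   # values above ~0.6 are commonly read as substantial
```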

Like all other AI and machine learning research, AVER raises ethical issues of paramount importance (e.g., [105]), especially as some AVER-enabled devices, such as customer-service chatbots, are already in use in real-life contexts. To enhance recognition accuracy, multimodal rather than unimodal data have increasingly been used in recent years [85]: voices, facial expressions, and physiological as well as behavioural measures (e.g., heart rate, eye movement) are collected and stored [106]. The far-reaching implications of handling massive databases of personal data should be discussed not only in the AI and machine learning research community but in society at large.

Overall, with its meticulously labelled data, our child-parent database is a precursor of a full-fledged emotion corpus (Section 4.1). However, more conceptual, technical and ethical work needs to be done before it can be opened up to the wider research community to stimulate further research and practice in application areas such as health and wellbeing.

6.5 Limitations

Like many research studies, our work has limitations arising from personal and technical constraints.

Tradeoff between natural and controlled settings: The dyads were left on their own to choose who, how and when to record the conversation at home. How far the supplied digital recorder was placed from a conversing dyad might have influenced the quality of the recording. While the 20 pairs used for analysis were of acceptable quality to the human coders, variations in quality could have been picked up by the machines.

Choice of feature sets and databases: To build the classification models, we selected contrasting numbers of acoustic features and used only one database. Had resources allowed, we could have used a medium-sized feature set (cf. [88]) and other databases for cross-corpus AVER (i.e., using one database for training and another for testing to allow benchmarking of various factors). While this endeavour has been undertaken by some researchers [54], the results were not encouraging, given the variations in the corpora. Nonetheless, the implication of broadening the scope of spoken content with natural speech [54] is highly relevant to our future work.

In reflecting on these limitations, we consider how they can be addressed in our future work. One possible means of improving the modest recognition accuracy is to apply other machine learning methods such as deep learning, although some recent attempts have not shown notable improvement (e.g., [94]). Nonetheless, the associated computational costs, which are rarely discussed in the work with SVMs, should be taken into account when selecting an approach. Another issue that needs further exploration is the neutral state: Is its high prevalence genuine? If yes, is it an evolutionary mechanism that protects humans from being slaves to their emotions? If no, does it imply that we should develop means to train emotion-sensitive ears to identify the nuances of expressive communication? We will tackle these challenges in our future work.

7 CONCLUSION

AVER is a burgeoning research area with many potentials (e.g., human-robot interaction) and challenges (e.g., low-to-moderate accuracy). We explored this area through two research questions (RQs), listed in the Introduction. Specifically, we assessed the accuracy of automatic voice emotion recognition of child-parent conversations in natural settings (RQ1) and studied the methodological and conceptual factors determining that accuracy (RQ2).


The modest accuracy of our model could be explained as a manifestation of overfitting, given its high performance on the training data but low generalization to unseen data. This observation was also attributable to the lack of a child-parent emotion corpus, an issue that we were motivated to address by building a precursor. Furthermore, we compared the accuracies of four combinations of small and large feature sets and utterance sets. The results led us to conclude that the larger the feature set and utterance set, the higher the accuracy. However, the modest extent of improvement might not justify the additional computational costs in terms of time and infrastructure. A more sophisticated approach to estimating cost-effectiveness is needed to address this trade-off. In addition, we demonstrated the idiosyncrasy of the neutral emotional state, which accounted for a significant percentage of the misclassifications.

Summing up, we can answer RQ1 with a reasonable level of confidence: AVER is applicable to emotion analysis of child-parent conversations, although there is room for improving model fit. We were also able to answer RQ2 with insights into two under-researched issues, namely the justifiability of computational costs and the fuzziness of emotional neutrality. Above all, having witnessed the ongoing development of work on AVER, we are optimistic that it will become an effective tool for the emotion analysis of child-parent conversations in particular and for emotional health and wellbeing in general.

ACKNOWLEDGEMENTS

This work was supported by the Economic and Social Research Council [grant number ES/M000443/1]. We would like to express our gratitude to the children and parents who took part in the study. We would also like to thank Dr. Leandro Minku, University of Birmingham, for his advice on the machine learning techniques, and the anonymous reviewers for comments that greatly improved the manuscript.


REFERENCES

[1] Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3), 614.
[2] Bänziger, T., Mortillaro, M., & Scherer, K. R. (2012). Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception. Emotion, 12(5), 1161.
[3] Bänziger, T., Patel, S., & Scherer, K. R. (2014). The role of perceived voice and speech characteristics in vocal emotion communication. Journal of Nonverbal Behavior, 38(1), 31-52.
[4] Batliner, A., Hacker, C., Steidl, S., Nöth, E., D'Arcy, S., Russell, M. J., & Wong, M. (2004). "You Stupid Tin Box" - children interacting with the AIBO robot: A cross-linguistic emotional speech corpus. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC) 2004, Lisbon, Portugal.
[5] Belsky, J. (1984). The determinants of parenting: A process model. Child Development, 83-96.
[6] Bornstein, M. (Ed.). (2002). Handbook of Parenting, Volume 1: Children and Parenting (2nd Ed.). Erlbaum.
[7] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005, September). A database of German emotional speech. In Proceedings of INTERSPEECH, pp. 1517-1520.
[8] Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335.
[9] Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28(2), 15-21.
[10] Collins, W. A., Maccoby, E. E., Steinberg, L., Hetherington, E. M., & Bornstein, M. H. (2000). Contemporary research on parenting: The case for nature and nurture. American Psychologist, 55(2), 218.
[11] Dai, W., Han, D., Dai, Y., & Xu, D. (2015). Emotion recognition and affective computing on vocal social media. Information & Management, 52(7), 777-788.
[12] Demo, D. H., & Cox, M. J. (2000). Families with young children: A review of research in the 1990s. Journal of Marriage and the Family, 62, 876-895.
[13] Devillers, L., & Martin, J. C. (2011). Emotional corpora: From acquisition to modeling. Emotion-Oriented Systems, 77-105.
[14] Ekkekakis, P. (2013). The Measurement of Affect, Mood, and Emotion: A Guide for Health-Behavioral Research. Cambridge University Press.
[15] Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., Devillers, L. Y., Epps, J., Laukka, P., Narayanan, S. S., & Truong, K. P. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190-202.
[16] Eyben, F., Weninger, F., & Schuller, B. W. (2013, August). Affect recognition in real-life acoustic conditions - a new perspective on feature selection. In Proceedings of INTERSPEECH (pp. 2044-2048).
[17] Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia (pp. 1459-1462).
[18] Fitch, W. T. (2000). The evolution of speech: A comparative review. Trends in Cognitive Sciences, 4(7), 258-267.
[19] Fivush, R. (2014). Emotional content of parent-child conversations about the past. In Memory and Affect in Development: The Minnesota Symposia on Child Psychology, Vol. 26, pp. 39-78. Psychology Press.
[20] Forlizzi, J., & Battarbee, K. (2004). Understanding experience in interactive systems. In Proceedings of the 5th Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, pp. 261-268.
[21] Goudbeek, M., & Scherer, K. (2010). Beyond arousal: Valence and potency/control cues in the vocal expression of emotion. The Journal of the Acoustical Society of America, 128(3), 1322-1336.
[22] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
[23] Hassenzahl, M. (2008). User experience (UX): Towards an experiential perspective on product quality. In Proceedings of the 20th ACM Conference on l'Interaction Homme-Machine, pp. 11-15.
[24] Hops, H., Davis, B., & Longoria, N. (1995). Methodological issues in direct observation: Illustrations with the Living in Familial Environments (LIFE) coding system. Journal of Clinical Child Psychology, 24(2), 193-203.
[25] Johnson, M., Lapkin, S., Long, V., Sanchez, P., Suominen, H., Basilakis, J., & Dawson, L. (2014). A systematic review of speech recognition technology in health care. BMC Medical Informatics and Decision Making, 14(1), 94.
[26] Jones, C. R., Pickles, A., Falcaro, M., Marsden, A. J., Happé, F., Scott, S. K., ... & Simonoff, E. (2011). A multimodal approach to emotion recognition ability in autism spectrum disorders. Journal of Child Psychology and Psychiatry, 52(3), 275-285.
[27] Jones, C., & Deeming, A. (2008). Affective human-robotic interaction. In Affect and Emotion in Human-Computer Interaction (pp. 175-185). Springer Berlin Heidelberg.
[28] Kawas, S., Karalis, G., Wen, T., & Ladner, R. E. (2016). Improving real-time captioning experiences for deaf and hard of hearing students. In Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility (pp. 15-23).
[29] Kleinginna, P. R., & Kleinginna, A. M. (1981). A categorized list of emotion definitions, with suggestions for a consensual definition. Motivation and Emotion, 5(4), 345-379.
[30] Kochanska, G., & Kim, S. (2013). Difficult temperament moderates links between maternal responsiveness and children's compliance and behavior problems in low-income families. Journal of Child Psychology and Psychiatry, 54(3), 323-332.
[31] Kohlberg, L., & Hersh, R. H. (1977). Moral development: A review of the theory. Theory into Practice, 16(2), 53-59.
[32] Lagattuta, K. H., Elrod, N. M., & Kramer, H. J. (2016). How do thoughts, emotions, and decisions align? A new way to examine theory of mind during middle childhood and beyond. Journal of Experimental Child Psychology, 149, 116-133.
[33] Lasecki, W. S., Miller, C. D., Naim, I., Kushalnagar, R., Sadilek, A., Gildea, D., & Bigham, J. P. (2017). Scribe: Deep integration of human and machine intelligence to caption speech in real time. Communications of the ACM, 60(11).
[34] Law, E. L. C., Roto, V., Hassenzahl, M., Vermeeren, A. P., & Kort, J. (2009). Understanding, scoping and defining user experience: A survey approach. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 719-728). ACM.
[35] Luce, M. R., Callanan, M. A., & Smilovic, S. (2013). Links between parents' epistemological stance and children's evidence talk. Developmental Psychology, 49(3), 454.
[36] McCarthy, J., & Wright, P. (2004). Technology as Experience. MIT Press.
[37] Mumme, D. L., Fernald, A., & Herrera, C. (1996). Infants' responses to facial and vocal emotional signals in a social referencing paradigm. Child Development, 67(6), 3219-3237.
[38] Nowak, M., Kim, J., Kim, N. W., & Nass, C. (2012). Social visualization and negotiation: Effects of feedback configuration and status. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (pp. 1081-1090).
[39] Pfister, T., & Robinson, P. (2011). Real-time recognition of affective states from nonverbal features of speech and its application for public speaking skill analysis. IEEE Transactions on Affective Computing, 2(2), 66-78.
[40] Prensky, M. (2003). Digital game-based learning. Computers in Entertainment (CIE), 1(1), 21-21.
[41] Robinson, M. D., Watkins, E. R., & Harmon-Jones, E. (Eds.). (2013). Handbook of Cognition and Emotion. Guilford Press.
[42] Sabbagh, M. A., & Callanan, M. A. (1998). Metarepresentation in action: 3-, 4-, and 5-year-olds' developing theories of mind in parent-child conversations. Developmental Psychology, 34(3), 491.
[43] Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1), 227-256.
[44] Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science Information, 44(4), 695-729.
[45] Scherer, K. R., Schorr, A., & Johnstone, T. (Eds.). (2001). Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press.
[46] Scherer, K. R., Sundberg, J., Tamarit, L., & Salomão, G. L. (2015). Comparing the acoustic expression of emotion in the speaking and the singing voice. Computer Speech & Language, 29(1), 218-235.
[47] Scherer, K. R., Schuller, B., & Elkins, A. (2017). Computational analysis of vocal expression of affect: Trends and challenges. Social Signal Processing (pp. 56-68).
[48] Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99(2), 143-165.
[49] Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2011). Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, 53(9), 1062-1087.
[50] Schuller, B., Müller, R., Eyben, F., Gast, J., Hörnler, B., Wöllmer, M., ... & Konosu, H. (2009). Being bored? Recognising natural interest by extensive audiovisual integration for real-life application. Image and Vision Computing, 27(12), 1760-1774.
[51] Schuller, B., Rigoll, G., Grimm, M., Kroschel, K., Moosmayr, T., & Ruske, G. (2007). Effects of in-car noise-conditions on the recognition of emotion within speech. Fortschritte der Akustik, 33(1), 305.
[52] Schuller, B., Steidl, S., & Batliner, A. (2009). The INTERSPEECH 2009 emotion challenge. In Proceedings of INTERSPEECH 2009, 312-315.
[53] Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F., ... & Salamin, H. (2013). The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social signals, conflict, emotion, autism. In Proceedings of INTERSPEECH 2013, pp. 148-152.
[54] Schuller, B., Vlasenko, B., Eyben, F., Wöllmer, M., Stuhlsatz, A., Wendemuth, A., & Rigoll, G. (2010). Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Transactions on Affective Computing, 1(2), 119-131.
[55] Serrano-Guerrero, J., Olivas, J. A., Romero, F. P., & Herrera-Viedma, E. (2015). Sentiment analysis: A review and comparative analysis of web services. Information Sciences, 311, 18-38.
[56] Shadiev, R., Hwang, W. Y., Chen, N. S., & Yueh-Min, H. (2014). Review of speech-to-text recognition technology for enhancing learning. Journal of Educational Technology & Society, 17(4), 65.
[57] Sigel, I. E., McGillicuddy-DeLisi, A. V., & Goodnow, J. J. (Eds.). (2014). Parental Belief Systems: The Psychological Consequences for Children. Psychology Press.
[58] Soleimani, S., & Law, E. L. C. (2017, June). What can self-reports and acoustic data analyses on emotions tell us? In Proceedings of the 2017 ACM Conference on Designing Interactive Systems (pp. 489-501).
[59] Steidl, S. (2009). Automatic Classification of Emotion Related User States in Spontaneous Children's Speech. Erlangen, Germany: University of Erlangen-Nuremberg.
[60] Stolar, M. N., Lech, M., Sheeber, L. B., Burnett, I. S., & Allen, N. B. (2013). Introducing emotions to the modeling of intra- and inter-personal influences in parent-adolescent conversations. IEEE Transactions on Affective Computing, 4(4), 372-385.
[61] Tahon, M., Delaborde, A., & Devillers, L. (2011). Real-life emotion detection from speech in human-robot interaction: Experiments across diverse corpora with child and adult voices. In Proceedings of INTERSPEECH 2011. Available at: https://hal.inria.fr/hal-01404151/
[62] Vogt, T., André, E., & Bee, N. (2008). EmoVoice - A framework for online recognition of emotions from voice. Perception in Multimodal Dialogue Systems, 188-199.
[63] Webb, T. L., Miles, E., & Sheeran, P. (2012). Dealing with feeling: A meta-analysis of the effectiveness of strategies derived from the process model of emotion regulation. Psychological Bulletin, 138(4), 775.
[64] Williams, C. E., & Stevens, K. N. (1972). Emotions and speech: Some acoustical correlates. The Journal of the Acoustical Society of America, 52(4B), 1238-1250.
[65] Takeuchi, H., Subramaniam, L. V., Nasukawa, T., & Roy, S. (2007). Automatic identification of important segments and expressions for mining of business-oriented conversations at contact centers. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 458-467).
[66] Schuller, B., Steidl, S., Batliner, A., Bergelson, E., Krajewski, J., Janott, C., ... & Warlaumont, A. S. (2017). The INTERSPEECH 2017 Computational Paralinguistics Challenge: Addressee, cold & snoring. In Proceedings of INTERSPEECH 2017 (pp. 3442-3446).
[67] Lim, G. H., Hong, S. W., Lee, I., Suh, I. H., & Beetz, M. (2013). Robot recommender system using affection-based episode ontology for personalization. In Proceedings of IEEE RO-MAN 2013 (pp. 155-160).
[68] Kay, M., Patel, S. N., & Kientz, J. A. (2015, April). How good is 85%? A survey tool to connect classifier evaluation to acceptability of accuracy. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (pp. 347-356). ACM.
[69] Semwal, N., Kumar, A., & Narayanan, S. (2017, February). Automatic speech emotion detection system using multi-domain acoustic feature selection and classification models. In Proceedings of the IEEE International Conference on Identity, Security and Behavior Analysis (ISBA) 2017 (pp. 1-6).
[70] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
[71] Burton, M., Pavord, E., & Williams, B. (2014). An Introduction to Child and Adolescent Mental Health. SAGE.
[72] Abdelwahab, M., & Busso, C. (2015, April). Supervised domain adaptation for emotion recognition from speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015 (pp. 5058-5062).
[73] Aldeneh, Z., & Provost, E. M. (2017, March). Using regional saliency for speech emotion recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2017 (pp. 2741-2745).
[74] Caffi, C., & Janney, R. W. (1994). Toward a pragmatics of emotive communication. Journal of Pragmatics, 22(3-4), 325-373.
[75] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[76] Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40(1-2), 5-32.
[77] Darwin, C., & Prodger, P. (1998). The Expression of the Emotions in Man and Animals. Oxford University Press, USA.
[78] Krosnick, J. A. (2018). Questionnaire design. In The Palgrave Handbook of Survey Research (pp. 439-455). Palgrave Macmillan, Cham.
[79] Kurpukdee, N., Koriyama, T., Kobayashi, T., Kasuriya, S., Wutiwiwatchai, C., & Lamsrichan, P. (2017, December). Speech emotion recognition using convolutional long short-term memory neural network and support vector machines. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2017 (pp. 1744-1749). IEEE.
[80] Lewis, J. R. (2016). Practical Speech User Interface Design. CRC Press.
[81] Mariooryad, S., & Busso, C. (2013). Exploring cross-modality affective reactions for audiovisual emotion recognition. IEEE Transactions on Affective Computing, 4(2), 183-196.
[82] Paulmann, S., Ott, D. V., & Kotz, S. A. (2011). Emotional speech perception unfolding in time: The role of the basal ganglia. PLoS One, 6(3), e17694.
[83] Pell, M. D. (2002). Evaluation of nonverbal emotion in face and voice: Some preliminary findings on a new battery of tests. Brain and Cognition, 48(2-3), 499-504.
[84] Pfister, T., & Robinson, P. (2010, August). Speech emotion classification and public speaking skill assessment. In International Workshop on Human Behavior Understanding (pp. 151-162). Springer, Berlin, Heidelberg.
[85] Poria, S., Cambria, E., Bajpai, R., & Hussain, A. (2017). A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37, 98-125.
[86] Rozgic, V., Ananthakrishnan, S., Saleem, S., Kumar, R., & Prasad, R. (2012, December). Ensemble of SVM trees for multimodal emotion recognition. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC) 2012 Asia-Pacific (pp. 1-4). IEEE.
[87] Scherer, K. R. (1999). Appraisal theory. In Dalgleish, T., & Power, M. (Eds.), Handbook of Cognition and Emotion (pp. 637-663).
[88] Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., & Narayanan, S. (2010). The INTERSPEECH 2010 paralinguistic challenge. In Proceedings of INTERSPEECH 2010 (pp. 2794-2797).
[89] Shah, M., Chakrabarti, C., & Spanias, A. (2014, June). A multi-modal approach to emotion recognition using undirected topic models. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) 2014 (pp. 754-757).
[90] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1631-1642).
[91] Stolar, M. N., Lech, M., Bolia, R. S., & Skinner, M. (2017, December). Real time speech emotion recognition using RGB image classification and transfer learning. In Proceedings of the IEEE 11th International Conference on Signal Processing and Communication Systems (ICSPCS) 2017 (pp. 1-8).
[92] Swain, M., Routray, A., & Kabisatpathy, P. (2018). Databases, features and classifiers for speech emotion recognition: A review. International Journal of Speech Technology, 21(1), 93-120.
[93] Tian, L., Moore, J. D., & Lai, C. (2015, September). Emotion recognition in spontaneous and acted dialogues. In Proceedings of the IEEE International Conference on Affective Computing and Intelligent Interaction (ACII) (pp. 698-704).
[94] Tripathi, S., & Beigi, H. (2018). Multi-modal emotion recognition on IEMOCAP dataset using deep learning. arXiv preprint arXiv:1804.05788.
[95] Truesdale, D. M., & Pell, M. D. (2018). The sound of passion and indifference. Speech Communication, 99, 124-134.
[96] Xia, R., & Liu, Y. (2017). A multi-task learning framework for emotion recognition using 2D continuous space. IEEE Transactions on Affective Computing, 8(1), 3-14.
[97] Zhang, T., Ding, B., Zhao, X., & Yue, Q. (2018). A fast feature selection algorithm based on swarm intelligence in acoustic defect detection. IEEE Access, 6, 28848-28858.
[98] Lee, C. C., Mower, E., Busso, C., Lee, S., & Narayanan, S. (2011). Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9-10), 1162-1171.
[99] Smith, J., Tsiartas, A., Wagner, V., Shriberg, E., & Bassiou, N. (2018, April). Crowdsourcing emotional speech. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5139-5143). IEEE.
[100] Brownlee, J. (2016). Master Machine Learning Algorithms. Ebook.
[101] Kossaifi, J., Walecki, R., Panagakis, Y., Shen, J., Schmitt, M., Ringeval, F., ... & Hajiyev, E. (2019). SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild. arXiv preprint arXiv:1901.02839.
[102] Triantafyllopoulos, A., Keren, G., Wagner, J., Steiner, I., & Schuller, B. (2019). Towards robust speech emotion recognition using deep residual networks for speech enhancement. In Proceedings of INTERSPEECH 2019, 1691-1695.
[103] Lakomkin, E., Zamani, M. A., Weber, C., Magg, S., & Wermter, S. (2018, October). On the robustness of speech emotion recognition for human-robot interaction with deep neural networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 854-860.
[104] Schuller, B. W. (2018). Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Communications of the ACM, 61(5), 90-99.
[105] Douglas-Cowie, E., Cowie, R., Sneddon, I., Cox, C., Lowry, O., Mcrorie, M., ... & Amir, N. (2007, September). The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (pp. 488-500).
[106] Schuller, B., Ganascia, J. G., & Devillers, L. (2016, May). Multimodal sentiment analysis in the wild: Ethical considerations on data collection, annotation, and exploitation. In Actes du Workshop on Ethics in Corpus Collection, Annotation & Application (ETHI-CA2), LREC, pp. 29-34.


Figure 1: The game scenario of physical chastisement in Park
Figure 2: The game scenario of sword fighting in School
Figure 3: The game scenario of home alone in Friend's Home
Figure 4: The game scenario of snail in bottle in Shop
Figure 5: Visualization of the distribution of the training and testing data


Figure 2: The game scenario of sword fighting in School

Question: Two girls were sword-fighting with plastic rulers in their classroom. One of them was injured as a piece of the plastic snapped off into her eye. Who is responsible for this?

Figure 4: The game scenario of snail in bottle in Shop

Question: The child discovers a dead snail in the drink bottle. What can you do about that?

Figure 3: The game scenario of home alone in Friend’s Home

Question: Friend welcomes the child, “Come in! My parents are out – they won’t be back for at least one hour”. At what age can you be left at home on your own?

Figure 1: The game scenario of physical chastisement in Park


Figure 5: Visualization of the distribution of the training and testing data