A HUMAN-HUMAN TASK-ORIENTED DIALOGUE CORPUS FOR INTERACTION QUALITY MODELLING



Abstract

Speech is the main means of human communication. It can tell a lot about the speaker: emotions, level of intelligence, age, psychological portrait and more. Such information can be useful in various areas: in call centres for improving the quality of service, in the design of spoken dialogue systems for better adaptation of the system to user behaviour, and in automating some processes of analysing the psychological state of people in situations with a high level of responsibility, for example in space programmes. One such characteristic is Interaction Quality. Interaction Quality is a quality metric used in the field of spoken dialogue system design to evaluate the quality of interaction between a computer and a human. As in spoken dialogue systems, Interaction Quality can also be applied to evaluate the quality of a dialogue between people. As for any study in the field of speech analytics, modelling Interaction Quality for dialogues between people requires a speech corpus of human-human task-oriented dialogues. Although a large number of speech corpora exist, for some tasks, such as Interaction Quality modelling, it is still difficult to find a suitable speech corpus. That is why we decided to create our own speech corpus based on dialogues between the customers and agents of one company. This paper describes the current version of this corpus. It contains 53 dialogues, corresponding to 1165 exchanges. The corpus includes audio features, paralinguistic information and expert annotations. In the future we plan to extend this corpus by increasing both the number of features and the number of observations.

Full text

1. Introduction

Speech analytics is applied to extract different kinds of information from speech data. Human speech can tell a lot about a person: their emotions, intelligence, age, psychological portrait and other properties. At the dialogue level it can reveal, for example, the cooperativeness of the speakers, the involvement of each speaker in the dialogue and the topic of the discussion. Speech analytics is useful for call centres in such tasks as estimating customer satisfaction and detecting problems in an agent's work. Moreover, characteristics such as customer satisfaction, emotions and Interaction Quality (IQ) are important for designing Spoken Dialogue Systems (SDS) that adapt to user behaviour over the course of the dialogue. These characteristics can also be used for the automatic assessment of relationships between people based on their speech. This is especially important for space programmes, where crew members spend a lot of time in a confined space inside the space station.

IQ is a quality metric used in the field of SDS to evaluate the quality of human-computer (HC) interaction. The IQ metric was proposed by Schmitt et al. in [1]. This metric can be useful not only for measuring the quality of interaction between humans and computers, but for human-human (HH) dialogues as well. A model of IQ for HH task-oriented conversations can then help make SDS more flexible, more human-like and friendlier.

As for any investigation in the field of speech analytics, modelling IQ for HH conversations requires a specific corpus of task-oriented dialogues. Such a corpus can be developed from calls to call centres offering support, information or help services. However, it is difficult to get access to such call databases, as the calls contain speakers' private information. This is why we developed our own speech corpus based on a call database. The corpus consists of calls between company workers and customers. In this paper we present a first overview of this corpus.

This paper is organized as follows. A brief description of related work (existing HH task-oriented conversation corpora) is presented in Section 2. Section 3 gives information about the developed HH task-oriented conversation corpus for IQ modelling for HH dialogues. Section 4 describes the manually annotated variables in the corpus and compares the rules for annotating IQ for HC and HH task-oriented spoken dialogues. In Section 5 we describe future work for extending this corpus both in terms of variables and observations. Finally, we present our conclusions in Section 6.

2. Related work: existing corpora

Organisations such as ELRA (European Language Resources Association) [2] and LDC (Linguistic Data Consortium) [3] offer large corpus databases for different purposes in the field of speech analytics, such as:
- emotion recognition;
- speech recognition;
- language identification;
- speaker identification;
- speaker segmentation;
- speaker verification;
- topic detection and others.

Although there is a huge number of corpora, some researchers are forced to develop specific corpora for their research. DECODA is a call-centre human-human spoken conversation corpus consisting of dialogues from the call centre of the Paris public transport authority. It consists of 1514 dialogues, corresponding to about 74 hours of speech [4].
The corpus described in [5; 6] consists of 213 manually transcribed conversations of a help-desk call centre in the banking domain. Unfortunately, it includes only text data without audio files. Another example of a task-oriented corpus is described in [7]: the CallSurf corpus of EDF (a French power supply company) consists of almost 5800 calls (620 hours) between customers and operators of an EDF Pro call centre.

There are many other corpora described in various papers, but some of them are difficult to find or to get access to. Some existing HH task-oriented conversation corpora are not appropriate for our task of IQ modelling for different reasons, and some of them are not accessible. The main reasons which forced us to design our own corpus are as follows:
- corpora such as DECODA and CallSurf are in French, which makes the labelling process difficult without knowing French;
- some corpora, such as the one described in [5; 6], do not include audio files, which leads to a loss of the information enclosed in audio features;
- some corpora, unfortunately, are not accessible to the public because their content is private information.

3. Corpus description

The current version of the corpus consists of 53 task-oriented dialogues in English between the customers and agents of one company, corresponding to about 87 minutes of signal. The raw audio data was provided in mono format. The average duration of a dialogue is 99.051 seconds. The distribution of dialogue durations is presented in fig. 1.

Fig. 1. Distribution of the dialogues by their duration

First of all, it was necessary to perform speaker diarization. Speaker diarization consists of speaker segmentation and speaker clustering; in other words, it determines who speaks in each speech fragment. We tried such open-source diarization toolkits as LIUM [8] and SHoUT [9]. Unfortunately, their results were not suitable for us, because we needed diarization without errors. Therefore the diarization was performed manually with the help of Audacity, a free, open-source, cross-platform application for recording and editing sound [10]. The audio files were then split with FFmpeg, a free software project which includes libraries for working with multimedia data [11]. In this way 1791 audio-file fragments, containing the speech of customers, agents or overlapping speech, were extracted. At this stage, information such as gender, type of speaker (customer or agent) and the presence of overlapping speech was annotated manually, although it could have been done automatically, but with some error.
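As an illustration of this splitting step, the sketch below cuts a dialogue recording into speaker fragments with FFmpeg, assuming the manual diarization yields (speaker, start, duration) entries; the file names and the segment list are hypothetical.

```python
import subprocess

# Hypothetical result of the manual diarization for one dialogue:
# (speaker label, start time in seconds, duration in seconds).
segments = [
    ("agent",    0.0, 4.2),
    ("customer", 4.2, 3.1),
    ("overlap",  7.3, 0.8),
]

def split_dialogue(wav_path: str, segments, out_prefix: str) -> None:
    """Cut each diarized segment out of the dialogue recording with FFmpeg."""
    for i, (speaker, start, duration) in enumerate(segments):
        out_path = f"{out_prefix}_{i:04d}_{speaker}.wav"
        subprocess.run(
            ["ffmpeg", "-y",
             "-ss", str(start),     # seek to the segment start
             "-i", wav_path,
             "-t", str(duration),   # keep only the segment duration
             out_path],
            check=True,
        )

split_dialogue("dialogue_01.wav", segments, "dialogue_01")
```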
In the next stage all these fragments were manually joined into exchanges. Each exchange consists of a customer turn and an agent turn. Fragment concatenations can be divided into three groups: sequential, chain and mixed. For example, for a dialogue with the scheme ACACAC, where A is an agent's turn and C is a customer's turn, the sequential type of concatenation looks like AC-AC-AC, the chain type looks like AC-CA-AC-CA-AC, and the mixed type may look like AC-CA-CA-AC. To concatenate turns into exchanges we applied the mixed type. In this way we obtained 1165 exchanges.

For extracting features for IQ modelling for HH conversations we used the three parameter levels described in [12]:
- the exchange level, containing information about the current exchange;
- the window level, containing information about the last n exchanges;
- the dialogue level, containing information from the beginning of the dialogue up to the current exchange.

The scheme of the three parameter levels is depicted in fig. 2.

Fig. 2. The three parameter levels (figure from [12])

On the exchange level there are four blocks of features:
- features describing the exchange as a whole;
- features describing the speech of the agent in the exchange;
- features describing the speech of the customer in the exchange;
- features describing overlapping speech in the exchange.

The features on the exchange level are listed in tab. 1. The features on the window level and the dialogue level are the same and are listed in tab. 2; the difference is only in the number of exchanges included in the computation.

Table 1. Features on the exchange level

Features describing the exchange as a whole:
- Agent_speech: is there agent's speech in the exchange?
- Customer_speech: is there customer's speech in the exchange?
- Overlapping: is there overlapping speech in the exchange?
- {#} overlapping_exchange: number of overlapping speech moments in the exchange
- Start_time: the time elapsed from the beginning of the dialogue to the start of the exchange
- First_speaker: who starts the exchange?
- Duration: duration of the exchange
- First_exchange: is the exchange the first one in the dialogue?
- Pause_duration: total duration of the pauses between the customer's and agent's speech in the exchange
- {%} pause_duration: the percentage of the total pause duration in the exchange
- Type_of_turns_concatenation: the type of turn concatenation in the exchange
- Pause_before_duration: duration of the pause between the current and previous exchanges

Features describing agent speech / customer speech / overlapping speech:
- Audio features: all audio features were extracted with openSMILE (Speech & Music Interpretation by Large Space Extraction), an open-source feature-extraction utility for automatic speech, music and paralinguistic recognition research [13]. For feature extraction we applied the openSMILE configuration emo_IS09.conf, which yields 384 features obtained by applying statistical functionals to 16 low-level descriptor contours
- Split_duration: duration of the agent's/customer's/overlapping speech in the exchange
- Split_overlapping: is there overlapping speech within the agent's or customer's speech?
- Start_time_split: the time elapsed from the beginning of the dialogue to the start of the agent's/customer's/overlapping speech
- {%} duration: the percentage of the agent's/customer's/overlapping speech duration in the exchange
- Gender: gender of the agent and customer
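The audio-feature extraction named in tab. 1 can be sketched as a call to openSMILE's SMILExtract command-line tool with the emo_IS09.conf configuration; the file paths below, including the location of the configuration file inside the openSMILE distribution, are assumptions rather than the exact setup used for the corpus.

```python
import subprocess

def extract_is09_features(wav_path: str, out_arff: str,
                          config: str = "config/emo_IS09.conf") -> None:
    """Run openSMILE's SMILExtract on one speech fragment.

    The emo_IS09.conf configuration (the INTERSPEECH 2009 emotion
    challenge feature set) produces 384 functional features per input
    file; here they are appended to an ARFF file.
    """
    subprocess.run(
        ["SMILExtract",
         "-C", config,      # feature configuration
         "-I", wav_path,    # input audio fragment
         "-O", out_arff],   # output feature file
        check=True,
    )

extract_is09_features("dialogue_01_0000_agent.wav", "exchange_features.arff")
```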
Table 2. Features on the window and dialogue levels

- Total_duration: total duration of the exchanges
- Mean_duration: mean duration of an exchange
- A_duration, C_duration, O_duration: total duration of the agent's/customer's/overlapping speech
- Pauses_duration: total duration of the pauses between the customer's and agent's speech in the exchanges
- A_mean_duration, C_mean_duration, O_mean_duration: mean duration of the agent's/customer's/overlapping speech
- Pause_mean_duration: mean duration of the pauses between the customer's and agent's speech in the exchanges
- A_percent_duration, C_percent_duration, O_percent_duration: the percentage of the agent's/customer's/overlapping speech duration
- Pauses_percent_duration: the percentage of the pause duration between the customer's and agent's speech
- #A_start_dialogue, #C_start_dialogue, #O_start_dialogue: number of exchanges where the first speech is the agent's/customer's/overlapping speech
- #overlapping: number of fragments with overlapping speech
- Mean_num_overlapping: mean number of overlaps
- Pauses_between_exchanges_duration: total duration of the pauses between exchanges
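To make the window and dialogue levels concrete, the following sketch shows how a few of the features in tab. 2 could be aggregated from per-exchange values; the exchange fields and the window size n = 3 are illustrative assumptions, not the implementation used for the corpus.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Exchange:
    duration: float           # duration of the exchange (s)
    agent_duration: float     # agent speech within the exchange (s)
    customer_duration: float  # customer speech within the exchange (s)
    pause_duration: float     # pauses between customer and agent (s)

def aggregate(exchanges: List[Exchange]) -> Dict[str, float]:
    """Compute a few tab. 2 features over a list of exchanges."""
    total = sum(e.duration for e in exchanges)
    return {
        "Total_duration": total,
        "Mean_duration": total / len(exchanges),
        "A_duration": sum(e.agent_duration for e in exchanges),
        "C_duration": sum(e.customer_duration for e in exchanges),
        "Pauses_duration": sum(e.pause_duration for e in exchanges),
    }

def window_level(history: List[Exchange], n: int = 3) -> Dict[str, float]:
    """Window level: only the last n exchanges are taken into account."""
    return aggregate(history[-n:])

def dialogue_level(history: List[Exchange]) -> Dict[str, float]:
    """Dialogue level: everything from the start of the dialogue."""
    return aggregate(history)

history = [Exchange(5.0, 2.5, 2.0, 0.5), Exchange(7.5, 4.0, 3.0, 0.5)]
print(window_level(history), dialogue_level(history))
```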
The manually annotated variables, such as emotions and IQ, are discussed in the next section.

4. Annotation (target variables)

The corpus has been manually annotated with a number of target variables, such as emotions and the IQ score.

Emotions. Three sets of emotion categories were selected from [14]. The first set consists of angry (1), sad (2), neutral (3) and happy (4). The second set includes anxiety (1), anger (2), sadness (3), disgust (4), boredom (5), neutral (6) and happiness (7). The third set contains fear (1), anger (2), sadness (3), disgust (4), neutral (5), surprise (6) and happiness (7). All audio fragments were annotated by one expert rater. The distribution of the emotion labels for each set is presented in fig. 3. For emotion labelling a web form was designed; it is presented in fig. 4. The same web form was also used to join the split audio fragments into exchanges. For visualizing the diarization results we used Flotr2, a library for drawing HTML5 charts and graphs [15].

Fig. 3. Distributions of the emotion labels for the three sets of emotions

Fig. 4. The web form for emotion annotation and for manually joining agent/customer turns into an exchange

IQ labels. For IQ score annotation we adopted the rater guidelines from [1]. Because IQ labelling for HC interaction always starts with "5", which for HH conversations can lead to a loss of useful information, we added a scale of IQ changes, which reflects the change in value between the previous and the current exchange. The rater guidelines, both for the absolute scale and for the scale of changes, are given in tab. 3. All exchanges were annotated with IQ scores by one expert rater. For IQ labelling we designed the web form depicted in fig. 5.

Table 3. Rater guidelines for annotating IQ on the absolute scale and on the scale of changes (rules that differ between the two scales are indicated; all other rules apply to both scales equally)

1. The rater should try to assess the interaction as a whole as objectively as possible, but pay more attention to the customer's point of view in the interaction.
2. An exchange consists of the agent's and the customer's turns.
3. Absolute scale: the IQ score is defined on a 5-point scale with "1 = bad", "2 = poor", "3 = fair", "4 = good" and "5 = excellent". Scale of changes: the IQ score is defined on a 6-point scale with "-2", "-1", "0", "1", "2" and "abs 1"; the first five points reflect the change in IQ from the previous exchange to the current exchange, and "abs 1" means "1 = bad" on the absolute scale.
4. IQ is to be rated for each exchange in the dialogue. The history of the dialogue should be kept in mind when assigning the score; for example, a dialogue that has proceeded fairly poorly for a long time should require some time to recover.
5. Absolute scale: a dialogue always starts with an IQ score of "5". Scale of changes: a dialogue always starts with an IQ score of "0".
6. In general, the score from one exchange to the following exchange is increased or decreased by one point at most.
7. Exceptions, where the score can be decreased by two points, are e. g. hot anger or sudden frustration. The rater's perception is decisive here.
8. If the dialogue obviously collapses due to agent or customer behaviour, the score can be set immediately to "1" on the absolute scale or to "abs 1" on the scale of changes. An example is a sudden hang-up out of understandable frustration.
9. Anger does not need to influence the score, but it can. The rater should try to figure out whether the anger was caused by the dialogue behaviour or not.
10. If a customer realizes that he should adapt his dialogue strategy to obtain the desired result or information and succeeds in that way, the IQ score can be raised by up to two points per turn. In other words, the customer realizes that he caused the poor IQ himself.
11. If a dialogue consists of several independent queries, the quality of each query is to be rated independently. The previous dialogue history should not be considered when a new query begins. However, the score provided for the first exchange should be equal to the last label of the previous query.
12. If a dialogue proceeds fairly poorly for a long time, the rater should consider increasing the score more slowly if the dialogue starts to recover. Also, in general, he should observe the remaining dialogue more critically.
13. If a constantly low-quality dialogue finishes with a reasonable result, the IQ can be increased.

Fig. 5. The web form for annotating IQ on the absolute scale and on the scale of changes by the expert rater

Fig. 6. Distributions of the IQ scores

Then we converted the scale of changes into the absolute scale. The distributions of the IQ scores for both scales (the absolute scale and the absolute scale converted from the scale of changes) are presented in fig. 6.
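The conversion can be sketched as a cumulative sum over the change labels; the starting value of 5, the clipping to [1, 5] and the treatment of "abs 1" as a reset to 1 are assumptions of this sketch, since the exact procedure is not spelled out here.

```python
def changes_to_absolute(change_labels, start=5, lo=1, hi=5):
    """Convert per-exchange IQ change labels to absolute IQ scores.

    `change_labels` holds the rater's labels per exchange: integers
    -2..2 for the change with respect to the previous exchange, or the
    string "abs 1", which resets the score to 1 (dialogue collapse).
    Starting at `start` and clipping to [lo, hi] are assumptions.
    """
    scores = []
    current = start
    for label in change_labels:
        if label == "abs 1":
            current = lo
        else:
            current = max(lo, min(hi, current + label))
        scores.append(current)
    return scores

# Example: a dialogue that degrades, collapses and partly recovers.
print(changes_to_absolute([0, -1, -1, "abs 1", 1, 1]))  # [5, 4, 3, 1, 2, 3]
```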
5. Future work

We plan to extend this corpus both in the feature set and in the number of observations. The feature set will be extended, for example, by adding audio features such as shimmer, jitter, formants and others, computed in openSMILE or Praat [16], and by adding manually annotated features such as task completion. Moreover, we plan to apply automatic speech recognition software to our audio files.

The use of the two scales (the absolute scale and the scale of changes) has revealed that applying the method of IQ estimation starting with "5" to HH conversations can lead to a loss of information in the modelling process. An example of such a situation is depicted in fig. 7.

Fig. 7. An example of a possible loss of information due to IQ estimation starting with "5"

In HH task-oriented conversations, for example in call centres, there are a number of reasons why a dialogue may not start with the IQ label "5": a long waiting time, or customer claims against the company accompanied by aggressive customer behaviour, due to which the IQ cannot initially be good. Moreover, in the form presented above, the IQ is a very subjective estimate. That is why we plan to change the rater guidelines for IQ annotation: to describe all possible situations in each IQ category and the transitions from one category to another. This will help to decrease subjectivity in the assessment of IQ. To decrease subjectivity further, we also plan to increase the number of expert raters.

6. Conclusion

Although there is a large number of speech corpora, for some tasks in speech analysis, such as IQ modelling for HH task-oriented conversations, it is still difficult to find an appropriate specific corpus. In this paper we described the current version of a corpus for IQ modelling for HH task-oriented conversations. In spite of some drawbacks, it can be useful for further investigations. Extending this corpus will help to design a more accurate IQ model, which in turn can help to improve the quality of service in call centres, to make SDS friendlier, more flexible and more human-like, and to automate some processes of analysing the psychological state of people in situations with a high level of responsibility, for example in space programmes.

About the authors

A. V. Spirina

Ulm University

Email: s_nastia@mail.ru
Albert-Einstein-Allee 43, Ulm, 89081, Germany

M. Yu. Sidorov

Ulm University

Albert-Einstein-Allee 43, Ulm, 89081, Germany

R. B. Sergienko

Ulm University

Albert-Einstein-Allee 43, Ulm, 89081, Germany

E. S. Semenkin

Reshetnev Siberian State Aerospace University

31 Krasnoyarskii Rabochii Av., Krasnoyarsk, 660037, Russian Federation

W. Minker

Ulm University

Albert-Einstein-Allee 43, Ulm, 89081, Germany

References

  1. Schmitt A., Schatz B., Minker W. Modeling and predicting quality in spoken human-computer interaction. Proceedings of the SIGDIAL 2011 Conference. Association for Computational Linguistics, 2011, P. 173-184.
  2. European Language Resources Association. Available at: http://elra.info/Language-Resources-LRs.html (accessed 03.10.2015).
  3. Linguistic Data Consortium. Available at: https://catalog.ldc.upenn.edu/ (accessed 03.10.2015).
  4. Bechet F., Maza B., Bigouroux N., Bazillon T., El-Bèze M., De Mori R., Arbillot E. DECODA: a call-center human-human spoken conversation corpus. International Conference on Language Resources and Evaluation (LREC), 2012, P. 1343-1347.
  5. Pallotta V., Delmonte R., Vrieling L., Walker D. Interaction Mining: the new frontier of Call Center Analytics. CEUR Workshop Proceedings, 2011, P. 1-12.
  6. Rafaelli A., Ziklik L., Doucet L. The Impact of Call Center Employees’ Customer Orientation Behaviors on Service Quality. Journal of Service Research, 2008, Vol. 10, No. 3, P. 239-255.
  7. Lavalley R., Clavel C., Bellot P., El-Bèze M. Combining text categorization and dialog modeling for speaker role identification on call center conversations. INTERSPEECH, 2010, P. 3062-3065.
  8. Meignier S., Merlin T. LIUM SpkDiarization: An Open Source Toolkit For Diarization. Proceedings of CMU SPUD Workshop, 2010.
  9. SHoUT. Available at: http://shout-toolkit.sourceforge.net/ (accessed 03.10.2015).
  10. Audacity. Available at: http://audacityteam.org/ (accessed 03.10.2015).
  11. FFmpeg. Available at: https://www.ffmpeg.org/ (accessed 03.10.2015).
  12. Schmitt A., Ultes S., Minker W. A Parameterized and Annotated Corpus of the CMU Let’s Go Bus Information System. International Conference on Language Resources and Evaluation (LREC), 2012, P. 3369-3373.
  13. Eyben F., Weninger F., Gross F., Schuller B. Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. Proceedings of ACM Multimedia (MM), 2013, P. 835-838.
  14. Sidorov M., Schmitt A., Semenkin E. Automated Recognition of Paralinguistic Signals in Spoken Dialogue Systems: Ways of Improvement. Journal of Siberian Federal University, Mathematics and Physics, 2015, Vol. 8, No. 2, P. 208-216.
  15. Flotr2. Available at: http://www.humblesoftware.com/flotr2/ (accessed 03.10.2015).
  16. Praat: doing phonetics by computer. Available at: http://www.fon.hum.uva.nl/praat/ (accessed 03.10.2015).


© Spirina A. V., Sidorov M. Yu., Sergienko R. B., Semenkin E. S., Minker W., 2016

This article is available under the Creative Commons Attribution 4.0 International License.
