
1 Introduction

In the last few years, automatic speech recognition (ASR) technology has made substantial progress, so that it is no longer a bottleneck in the development of spoken dialogue systems. Some industrial services already offer a vocal interaction mode, but their efficiency and naturalness remain poor. An important aspect that still needs to be improved is floor management. Time is shared in a rigid way: the user has to wait for the system to finish its sentence before speaking, and vice versa. This is not natural and, as we show in this paper, it is not the most efficient way to interact. In human communication, the listener understands the speaker on the fly [8]. This ability manifests itself through different turn-taking phenomena, and some of those most commonly studied in the field of dialogue systems have been implemented in this work.

Systems that replicate this behaviour are called incremental dialogue systems. Several sequential architectures have been published for designing such systems [1, 3, 7]. In this paper, we use a multi-layer architecture [5] in which turn-taking management is handled by a Scheduler module added as an extra layer on top of the traditional Dialogue Manager (DM). We show that incremental dialogue processing offers possibilities to make dialogue systems more robust to noise. To demonstrate our work, a user simulator that interacts with a personal agenda assistant has been developed. Dialogue efficiency is measured through dialogue duration and task completion [2, 9]. Moreover, the turn-taking phenomena are evaluated separately in order to identify which ones are critical in terms of efficiency improvement.

Section 2 introduces the simulated environment and the chosen task. Section 3 describes the experiment and the results; finally, Sect. 4 concludes and outlines future work.

2 Simulated Environment

The simulated environment is composed of three modules: the user simulator, the Scheduler and the service. In order to simulate strategies that do not require incremental processing, the user simulator interacts directly with the service. For incremental strategies, on the other hand, the Scheduler is inserted as an intermediate module between the two [5]. It is in charge of floor-taking decisions, whereas the service handles higher-level semantic decisions (computing responses based on the user simulator's utterances).

The service is a personal agenda manager. It can accomplish three types of tasks: adding, modifying or deleting an event from the agenda. A task is thus defined by four slots: the type of action (ADD, MODIFY or DELETE), the title of the event, its date and its time slot. Overlaps between events are not tolerated. Initially, the agenda contains a few events. The user simulator is given a list of events that should be added during the dialogue. Each event has a priority and a set of alternative dates and time slots that can be used in case of a conflict. The user simulator tries to fit as many of the highest-priority events as possible into the agenda.
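As an illustration, the task and agenda entries could be represented as follows. This is a minimal Python sketch; the names, types and the exact slot encoding are our own assumptions, not taken from the implementation described here:

    from dataclasses import dataclass, field
    from enum import Enum

    class Action(Enum):
        ADD = "ADD"
        MODIFY = "MODIFY"
        DELETE = "DELETE"

    @dataclass
    class Event:
        title: str
        date: str                      # e.g. "2016-06-14"
        slot: tuple                    # (start_hour, end_hour), e.g. (9, 10)
        priority: int = 0              # higher means more important
        alternatives: list = field(default_factory=list)  # fallback (date, slot) pairs

    @dataclass
    class Task:
        action: Action                 # one of the three task types
        event: Event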

2.1 User Simulator

A simple algorithm has been implemented so that the user simulator can compute its next action. When trying to add a new event, if the target slot is free, the event is added; otherwise, the simulator checks the alternative slots. If they are all taken, it checks whether one of the conflicting events can be moved. If not, the priority of the event to add is compared to the lowest priority among the conflicting events. If it is higher, the least important conflicting event is deleted and replaced with the new event; otherwise, the new event is forgotten.
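The following sketch illustrates this decision procedure. The agenda interface (is_free, conflicting_events) is hypothetical and only assumed for the example:

    def next_user_action(agenda, event):
        """Schedule `event` following the priority rules described above."""
        # 1. Try the preferred slot, then each alternative slot.
        for date, slot in [(event.date, event.slot)] + event.alternatives:
            if agenda.is_free(date, slot):
                return ("ADD", event, date, slot)
        # 2. All slots taken: try to move one of the conflicting events
        #    to one of its own alternative slots.
        conflicts = agenda.conflicting_events(event.date, event.slot)
        for other in conflicts:
            for date, slot in other.alternatives:
                if agenda.is_free(date, slot):
                    return ("MOVE", other, date, slot)
        # 3. Otherwise, replace the least important conflicting event if
        #    the new event has a strictly higher priority.
        weakest = min(conflicts, key=lambda e: e.priority)
        if event.priority > weakest.priority:
            return ("REPLACE", weakest, event)
        # 4. Give up: the new event is forgotten.
        return ("DROP", event)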

After the next user act is determined, it is sent incrementally to an ASR output simulator. The increment chosen here is the word, for the sake of simplicity. Each incremental step is called a micro-turn. The ASR output simulator maintains an N-best list (the top recognition hypotheses with their confidence scores) corresponding to the partial utterance produced so far by the user. A Word Error Rate (WER) parameter controls the noise level.
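A minimal sketch of such an ASR output simulator is given below, assuming a hand-built confusion table; the scoring scheme is illustrative and deliberately simplistic:

    def asr_step(nbest, true_word, wer, confusions, n=5):
        """One micro-turn: extend every partial hypothesis with the next
        word (correct or confused) and keep the n best-scoring ones."""
        # Candidate readings of the incoming word: the true word, plus
        # confusable words sharing the probability mass given by the WER.
        others = confusions.get(true_word, [])
        candidates = [(true_word, 1.0 - wer)]
        candidates += [(w, wer / len(others)) for w in others]
        extended = [(hyp + [w], score * p)
                    for hyp, score in nbest
                    for w, p in candidates]
        extended.sort(key=lambda pair: pair[1], reverse=True)
        return extended[:n]

    # Usage: nbest = [([], 1.0)]
    #        for word in user_utterance: nbest = asr_step(nbest, word, 0.1, table)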

Real ASR modules are not monotonic: a new chunk of audio signal can modify a large part of, if not the whole, current best output hypothesis. This happens when the language model suddenly recognises a pattern that is more likely to correspond to what the user just said. To simulate this phenomenon, whenever a new concept appears in a hypothesis, the corresponding confidence score is boosted (so that it has a chance to become the top hypothesis). As a consequence, the last words of a partial request are more likely to change than those appearing earlier in the utterance, and no decision should be made based on them. Thus, the Scheduler removes the last few words from the current partial utterance and makes its decision based on the remaining prefix, which will be called the last stable partial utterance. The number of words removed will be referred to as the stability margin (SM). In [6], a partial utterance that has lasted for more than 0.6 seconds has more than a 90% chance of staying unchanged. In this work, we assume a speech rate of 200 words per minute [10]. Hence, two words are spoken in 0.6 seconds and we set SM = 2.
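Under these assumptions, the stability margin and the last stable partial utterance can be computed as follows (sketch):

    SPEECH_RATE = 200 / 60        # words per second (200 words per minute [10])
    STABILITY_TIME = 0.6          # seconds after which a prefix is ~90% stable [6]
    SM = round(SPEECH_RATE * STABILITY_TIME)   # = 2 words

    def last_stable(partial_utterance, sm=SM):
        """Drop the sm most recent words, which the ASR may still revise,
        and return the remaining stable prefix."""
        return partial_utterance[:-sm] if sm else partial_utterance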

2.2 Turn-Taking Rules

Five of the most studied turn-taking phenomena in the field of incremental dialogue have been implemented in this work. The first four require a decision from the Scheduler, and the last one depends on the user's behaviour only. At each micro-turn, the Scheduler has to decide whether to remain silent (WAIT), repeat the last word of the last stable partial utterance (REPEAT) or retrieve the last response obtained from the service and utter it (SPEAK). In the following, we detail the rules implemented in the Scheduler for each phenomenon (and in the user simulator for the last one).

FAIL_RAW: When speaking, the user has no guarantee that the system understands his message (due to noise or the use of off-domain words). Therefore, in order to prevent the user from speaking too long without being understood, if no key concept has been detected after a long enough utterance, the Scheduler performs a SPEAK (as no key concept is present in the current partial utterance, the service's last response is "Sorry. I don't understand."). The threshold depends on the last system act: it is set to 6 words for open questions and time-slot questions, 3 for yes/no questions and 4 for dates, so that the user can utter some out-of-domain words and the key concepts have time to stabilise.
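In code, this rule could look as follows (the act-type names are illustrative):

    # Thresholds (stable words without any key concept) per last system act.
    FAIL_RAW_THRESHOLD = {"open_question": 6, "timeslot_question": 6,
                          "yes_no_question": 3, "date_question": 4}

    def fail_raw(stable_words, concepts, last_system_act):
        """SPEAK a non-understanding response if the user has talked long
        enough without any key concept being detected."""
        if not concepts and len(stable_words) >= FAIL_RAW_THRESHOLD[last_system_act]:
            return "SPEAK"    # service answers: "Sorry. I don't understand."
        return "WAIT"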

INCOHERENCE_INTERP: Even if understood, the user's utterance can be problematic if it is incoherent with the dialogue context (for example, trying to modify a non-existent event in the agenda). Therefore, as soon as the service raises an incoherence alert, the Scheduler waits for SM more words; if the partial request at time \(t-SM\) is a prefix of the one at time \(t\) (no changes due to ASR instability), it performs a SPEAK.
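The stability test used here (and by BARGE_IN_RESP below) amounts to a simple prefix check, sketched as:

    def is_stable(partial_then, partial_now):
        """True if the partial request seen SM words ago is a prefix of the
        current one, i.e. the ASR has not rewritten it in the meantime."""
        return partial_now[:len(partial_then)] == partial_then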

FEEDBACK_RAW: The last word's confidence score is estimated as the ratio between the score of the last partial request and that of the one before it. If this ratio is below a threshold, the last word is repeated (REPEAT action) SM words later, provided it is still present in the partial utterance.
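A sketch of this rule follows; the threshold value is our own illustrative choice, not one specified here:

    def feedback_raw(prev_score, curr_score, threshold=0.5):
        """Estimate the last word's confidence as the score ratio between
        the last partial request and the one before it; REPEAT if low."""
        if prev_score > 0 and curr_score / prev_score < threshold:
            return "REPEAT"   # echo the last stable word back to the user
        return "WAIT"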

BARGE_IN_RESP (System): When the partial request contains enough information for the service to generate a response that improves the dialogue, the system can barge in before the user ends his utterance. The Scheduler performs a SPEAK SM words later, if the partial utterance that generated the response is a prefix of the current one (no changes due to ASR instability).

BARGE_IN_RESP (User): This phenomenon corresponds to a decision made by the user. We suppose that the user is familiar enough with the system to barge in as soon as she has enough information, without letting the system finish its utterance.
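Putting the Scheduler-side rules together, a micro-turn decision could be sketched as below, reusing the helpers defined above. The state object and the precedence between rules are assumptions of ours, not a description of the actual Scheduler:

    def scheduler_micro_turn(state):
        """One micro-turn decision over WAIT / REPEAT / SPEAK (sketch)."""
        # BARGE_IN_RESP (system): a useful response is ready and its
        # triggering prefix has survived SM more words.
        if state.response_ready and is_stable(state.trigger_prefix, state.partial):
            return "SPEAK"
        # INCOHERENCE_INTERP: the service raised an alert SM words ago.
        if state.incoherence_alert and is_stable(state.alert_prefix, state.partial):
            return "SPEAK"
        # FEEDBACK_RAW: the last stable word has a low confidence ratio.
        if feedback_raw(state.prev_score, state.curr_score) == "REPEAT":
            return "REPEAT"
        # FAIL_RAW: too many words without any key concept; otherwise WAIT.
        return fail_raw(state.stable_words, state.concepts, state.last_system_act)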

3 Experiment

Dialogue efficiency is evaluated according to two criteria: dialogue duration (given a speech rate of 200 words per minute) and task completion. If it takes too long for the user to accomplish a task (add, modify or delete), she hangs up. The corresponding time threshold is sampled at each new task from a distribution with a 3-minute mean. The first part of the experiment is dedicated to three generic strategies used in traditional dialogue systems: system initiative (SysIni), user initiative (UsrIni) and mixed initiative (MixIni). In this work, they have been instantiated as follows: in SysIni, the user is asked for the different chunks of information one by one (action, description, date and time slot). On the contrary, in UsrIni, all the necessary information must be given in a single request. Finally, MixIni behaves like UsrIni and, if that fails, switches to SysIni. In this part, the user simulator interacts directly with the service.
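For concreteness, the duration measure and the hang-up threshold can be simulated as below; only the 3-minute mean is stated here, so the Gaussian shape and standard deviation are assumptions of this sketch:

    import random

    WORDS_PER_MINUTE = 200

    def utterance_minutes(n_words):
        """Duration contribution of an utterance at 200 words per minute."""
        return n_words / WORDS_PER_MINUTE

    def sample_patience(mean=3.0, std=1.0):
        """Per-task hang-up threshold in minutes (distribution assumed)."""
        return max(0.5, random.gauss(mean, std))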

Next, the impact of incrementality is studied. The Scheduler, embedding the rules introduced in Sect. 2.2, is inserted between the user simulator and the service. In addition, the user simulator is configured to interrupt the system (BARGE_IN_RESP from the user's side). In SysIni, the utterances are short, which is not suited to incremental processing. Thus, we study incrementality in the case of UsrIni (UsrIni+Incr) and MixIni (MixIni+Incr).

To specify the task achieved during the dialogues, we define a scenario as two lists of events: the first corresponds to the events that exist in the agenda before the dialogue, whereas the second contains the events to add during the dialogue. Our experiment is based on three handcrafted scenarios (leading to dialogues of different complexity because of overlaps between the two lists of events). In order to analyse the effect of noise on these strategies, we vary the WER between 0 and 0.3 with a step of 0.03. For each scenario and each WER value, 1000 dialogues have been run.
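The overall evaluation loop then amounts to the following sketch, where scenarios, strategy and the run_dialogue driver are hypothetical placeholders for the components described above:

    import numpy as np

    wers = np.arange(0.0, 0.31, 0.03)          # WER from 0 to 0.3, step 0.03
    results = {}
    for scenario in scenarios:                 # three handcrafted scenarios
        for wer in wers:
            durations, completions = [], []
            for _ in range(1000):              # 1000 dialogues per setting
                d = run_dialogue(scenario, strategy, wer)
                durations.append(d.duration)
                completions.append(d.task_completion)
            results[(scenario.name, wer)] = (np.mean(durations),
                                             np.mean(completions))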

Fig. 1. Mean dialogue duration and task completion for generic strategies

Fig. 2. Mean dialogue duration and task completion for different turn-taking phenomena

Fig. 3. INCOHERENCE_INTERP strategy in a more adapted task

For low values of WER, the SysIni strategy is more tiresome and less efficient than UsrIni (Fig. 1). However, for high noise levels, it outperforms UsrIni, showing that under noise it is more efficient to communicate the information chunk by chunk. MixIni combines the advantages of both strategies, as it performs like UsrIni for low WER levels and better than both of them in noisy conditions. Incremental behaviour has been introduced to both UsrIni and MixIni (in the case of SysIni, the user's utterances are supposed to be short, so adding incrementality is not relevant). Like mixed initiative, incrementality also proves to be a way of making UsrIni more robust to noise, with even better results. Finally, we show that MixIni with incremental behaviour performs best.

In Fig. 2, the performance of each turn-taking phenomenon is shown separately (with MixIni as a baseline). FEEDBACK_RAW, BARGE_IN_RESP from the user's side and FAIL_RAW are the phenomena with the greatest impact on dialogue efficiency. INCOHERENCE_INTERP is better suited to tasks where the user's utterance is likely to conflict with the dialogue context. To illustrate this, in Fig. 3, the scenario has been slightly modified: the user tries to move an event five times before finding a free slot. Therefore, most of her requests refer to an existing event, and if the event title is altered by ASR noise, an incoherence is detected. INCOHERENCE_INTERP reduces the dialogue duration. Task completion is not improved significantly, but consistently over the different WER levels (starting from WER = 0.15). In this task, MixIni already performs very well, so the margin for improvement is small.

BARGE_IN_RESP from the system's side does not make the system more robust to noise, as it does not handle errors. It is useful with users who tend to produce unnecessarily long utterances: [4] shows that when users are interrupted, they tend to make more concise utterances, focusing on the main information.

4 Conclusion and Future Work

A simulated environment has been used to show that incremental dialogue processing offers new possibilities for making dialogue systems more robust to noise. First, three non-incremental strategies are compared: system initiative, user initiative and a combination of the two, mixed initiative, which is shown to achieve the best performance. We then show that there is still room for improvement using incremental processing. In future work, we plan to use reinforcement learning to show that optimal turn-taking management can be learnt automatically.