
1 Introduction

With the rapid development of artificial intelligence, human-computer interaction technology is becoming increasingly mature. Terminal products equipped with conversational agents have become more diverse, and such agents are now widely embedded in smartphones and many intelligent home appliances. With conversational agents in common everyday use, more research is needed to understand the experiences that users are having [1].

Recent research has explored users’ experiences with conversational agents, focusing on system capabilities and usability. These studies evaluated user satisfaction with specific functions, such as searching for information [2]. However, the perceptual requirements of voice interaction are more complex than those of visual or hardware interaction, so designing feedback strategies for voice interaction is particularly important. Proper interaction logic, association with the dialogue context, and context perception all play important roles in user experience. Previous research on feedback strategies mainly focused on feedback form and feedback time; researchers paid little attention to the affective experience conveyed by the expression of feedback.

Emotions are at the heart of human experience [3]. Understanding users’ affective experience is crucial to designing compelling CAs [1]. In China, many dialogue scripts are produced by product managers alone, without help from user experience specialists. This practice can prevent positive experiences. It is therefore important to explore the expression ways of conversational agents from the perspective of affective experience.

In addition, combining the results of previous usability tests and focus groups, we found that the time of interaction and the functions used also affected users’ expectations of agent expression.

Therefore, we conducted this research on how to design the expression ways of conversational agents based on affective experience. In this paper, we adopt the “Wizard of Oz” technique to explore the performance of three different expression ways (general, implicit, and explicit) at different times and with different functions. Specifically, our contributions are:

  1. We show that the user’s affective experience should be considered in the design of expression ways.

  2. We discover gender differences in users’ experience of conversational agents even under the same female speaker.

  3. Our experimental method accounts for real-world scenarios: we explore expression way design with consideration of time of use and function.

We hope these results will help AI designers and will promote further studies about principles for human-AI interaction.

2 Related Work

In this section, we describe previous work as background for our study, both within the scope of feedback strategies and in the context of communication theory.

2.1 Feedback Strategies in Voice Interaction

To realize human-computer voice interaction, many scholars have studied the feedback strategies of voice interaction. We divide these studies into three categories: research on feedback form, on feedback time, and on feedback expression.

Studies of feedback form have paid particular attention to multi-modal human-computer interaction based on voice. Regarding voice, Nass [4] pointed out that female voices were more acceptable to people seeking help with problems and were perceived as kinder when helping others make decisions, whereas male voices subconsciously conveyed a sense of command, authority, and dominance, as in “solve problems in this way”. This provided theoretical support for the gender setting of conversational agents. Of course, the selected voice should also be consistent with the behavioral language set in advance; improper dubbing might be worse than no voice at all, because voice carries social meaning [5]. Microsoft Cortana, for example, was designed to visualize emotions (through changes in color and shape) and then inform users [6].

Studies of feedback time first focused on output timing. Cathy Pearl argued that the system needs to know when the user has stopped talking [7]. She described three kinds of timeouts in the VUI experience (end-of-speech timeout, no-speech timeout, and too-much-speech timeout) and gave ranges for each timing parameter.

Some studies involve feedback expression. First of all, personality design can be regarded as the basic part of feedback expression. When Microsoft designed Cortana, its personality design “breaks the illusion that Cortana is a human, which is just fine by Microsoft, while also making it a more useful tool” [8]. In a study by Clifford Nass and Scott Brave, users performed tasks with a driving simulator while the system voice commented on their driving performance. Error messages blaming the driver might seem like a small thing, but they could affect users’ perception of the system and even their performance [4]. Cathy Pearl’s book also offers many design strategies for feedback expression across various chapters, covering conversational design, confirmations, error handling, disambiguation, and other aspects.

In general, research on voice interaction feedback strategies has been concerned mainly with the functional level; from the perspective of user experience, it has focused on the usability and usefulness of products. Researchers have often ignored the affective relationship between users and products. However, as products continue to improve, users also hope that products can meet their affective needs.

2.2 Interpersonal Communication and Social Style Model

For conversational agents, especially those regarded as assistants or friends rather than tools, a good affective experience means that users feel they are communicating with a “normal person”. As Cohen wrote, “As designers, we don’t get to create the underlying elements of conversation (e.g., we must follow human conventions)” [9]. That is to say, human-computer interaction must also obey some requirements of interpersonal communication.

A previous study showed that there were six prominent motives in interpersonal communication: pleasure, affection, inclusion, escape, relaxation, and control. The motives of pleasure, affection, and relaxation were most closely related to communication satisfaction [10]. To achieve good communication, it is particularly important to understand the characteristics of social styles, and to pay attention to the behavioral tendencies and communication modes of people with different styles.

Sheng [11] believed that the social style model could be divided into four styles in the coordinate system of dominant/amiable and explicit/implicit dimensions: facilitating style, promoting style, controlling style and analytic style.

The characteristics that describe dominating behavior are: talkative, seemingly confident, quick to draw conclusions, direct, challenging, assertive, competitive, and making decisions quickly.

The characteristics that describe easy-going behavior are: quiet, seemingly uncertain, inclined to ask questions, not straightforward, sensitive, people-oriented, gentle, and making decisions slowly.

The characteristics that describe explicit tendencies are: interrupting others, appearing extroverted, people-oriented, a large range of movements, natural self-expression, expressive, fun-loving, and passionate.

The characteristics that describe implicit tendencies are: task-oriented, listening carefully, introverted, indifferent, few movements, methodical, unemotional, self-restrained, rigid, and serious.

Similarly, the DISC personality model is based on the work of psychologist William M. Marston [12]. This model explores people’s priorities and how those priorities influence behavior [13]. DISC is an acronym for Dominance, Influence, Steadiness, and Conscientiousness.

We focused on feedback expression design (called expression way design in this paper). We designed three levels of expression ways: the general way, the implicit way, and the explicit way. These levels are introduced in the next section.

3 Method

In this section, we begin by describing the experimental setup. Next, we discuss some important parts in our experiment.

3.1 The Experimental Setup

We conducted a within-subjects experiment with 20 users of conversational agents to explore the performance of different expression ways at different times and with different functions. The independent variables were expression way, time, and function. Each variable had three levels, producing a 3 × 3 × 3 factorial design.
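The condition structure of such a factorial design can be sketched as follows. This is a minimal illustration, and the level labels are shorthand for the levels described in this section, not identifiers used in the study itself:

```python
from itertools import product

# The three independent variables and their three levels each,
# as described in the experimental setup above.
EXPRESSION_WAYS = ["general", "implicit", "explicit"]
FUNCTIONS = ["weather", "music", "news"]
TIMES = ["morning", "after work", "before bed"]

# Every participant experiences every combination (within-subjects design).
conditions = list(product(EXPRESSION_WAYS, FUNCTIONS, TIMES))
print(len(conditions))  # 27
```

The 27 combinations correspond to the 27 experimental responses each participant rated (the 3 additional practice replies described in Sect. 3.3 bring the total heard to 30).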

The levels of expression way were the general way, the implicit way, and the explicit way. These levels were mainly derived from a dimension of the social style model mentioned above [11]. From that description, we can see that the metric most relevant to users’ affective experience is the “explicit-implicit” dimension. Therefore, we used “explicit-implicit” as two levels of this independent variable, together with the general way, which contained few affective expressions of concern for users. These three expression ways made up the first variable.

The function levels consisted of weather, music, and news, because early research results showed users’ heavy use of these functions. The time levels were determined by considering when working people typically interact with the agent: in the morning, after a day’s work, and before going to bed.

Regarding method, we used the “Wizard of Oz” technique to simulate a real communication environment, and all materials were converted into synthesized audio files using a female voice. Ethics approval to conduct this research was obtained from Beihang University.

Wizard of Oz Technique.

In the field of human-computer interaction, the Wizard of Oz technique is a research method in which subjects interact with a computer system that they believe to be autonomous [14] but which is actually operated, or partially operated, by an unseen human being [15]. We used this technique to load the different responses (expression ways) into the test demo control platform. The experimenter controlled the process and results of the human-computer interaction by operating the control platform, so that users would believe they were interacting with a real conversational agent. When people earnestly engage with what they perceive to be a conversational agent, they form more complete mental models and interact with the experience in more natural ways [16].

Dependent Variable.

In this study, we measured user experience on five aspects (affection, confidence, naturalness, social distance, and satisfaction). The items were selected from an array of sources, primarily evaluations of conversational agents and measurements of user experience. Affection and confidence are important indicators for evaluating users’ affective experience of conversational agents’ responses [17]. Naturalness can be used to evaluate a conversational agent’s responses on two aspects, expression and voice [18]; we evaluated expression in this experiment. Social distance is significant for grasping effective communication strategies and for harmonizing interpersonal relationships. Satisfaction is widely used as a comprehensive index of user experience. All items used seven-point semantic differential scales.

Experimental Materials.

Studies have shown that what differs in Chinese culture is the lower frequency, intensity, and duration with which emotions are typically experienced [19]. Through a summary of existing responses to the agents’ common functions and discussions with experts, we found that current agent responses can be decomposed into three parts. The first is the fact part, which realizes the agent’s basic functions, such as reporting weather information or playing music. The judgment part makes a judgment about the current situation based on the facts, such as “it will rain tomorrow” or “it’s late now”. Finally, the suggestion part expresses the agent’s care for the user after making a judgment, such as “remember to bring an umbrella” or “get to sleep”.

Combining these parts with the expression ways: in the general way, agents only stated facts; in the other ways, agents could add judgments and suggestions. For this study, we make the operational definitions explained in Table 1, with an example shown in Table 2. The example may lose some nuance in translation because of differences between the English and Chinese language environments.

Table 1. The explanations to three parts of a response.
Table 2. An example in weather function.
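The fact/judgment/suggestion decomposition can be illustrated with a small sketch. The sentences and the exact mapping of parts to the implicit and explicit ways are illustrative stand-ins (the actual operational definitions are in Table 1 and the original materials were in Chinese):

```python
def compose_response(way: str) -> str:
    """Assemble an agent response from the three parts described above.

    The part-to-way mapping here is an assumption for illustration:
    general = fact only; implicit adds a judgment; explicit adds an
    overt suggestion expressing care for the user.
    """
    fact = "Tomorrow will be rainy, 12 to 18 degrees."       # fact part
    judgment = "It looks like a wet commute."                # judgment part
    suggestion = "Remember to bring an umbrella."            # suggestion part

    if way == "general":
        return fact
    if way == "implicit":
        return f"{fact} {judgment}"
    if way == "explicit":
        return f"{fact} {judgment} {suggestion}"
    raise ValueError(f"unknown expression way: {way}")

print(compose_response("explicit"))
```

Composing responses from reusable parts like this also mirrors how the Wizard of Oz control platform could store one fact per function and switch only the affective additions between conditions.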

3.2 Participants

Twenty participants (10 male, 10 female) were recruited from a university and several companies via an internal questionnaire. In line with best practice, participants were recruited until saturation occurred. Demographic data showed that participants ranged in age from 18 to 50 and consisted of students and staff across a wide range of subjects. All participants indicated that they had previously used conversational agents, and a majority (60%) said they had previously used smart speakers. Participants were paid 50 RMB for taking part in the one-hour experiment. All participants signed an informed consent form after the experimental procedures had been explained to them.

3.3 Procedure

Each participant was invited individually into the laboratory. The procedure comprised five main phases in a fixed order: completing the demographic and conversational agent usage questionnaire; familiarizing themselves with the smart speaker (for participants who had never used one); hearing responses by giving orders to the speaker; judging the user experience of each response; and recalling the overall procedure followed by a face-to-face interview. Participants were not informed about later phases. These phases are detailed below.

After completing informed consent, participants were invited to fill in the demographic and conversational agent usage questionnaire. Participants who had never used smart speakers then had a chance to interact with the speakers for 10 min; they were told to use functions that had no relevance to later phases.

Users’ instructions were given in advance. Users could hear a response by giving an order to the speaker. They were then asked to quantitatively rate each response on five aspects: affection, confidence, naturalness, social distance, and satisfaction. Participants made judgments on a Likert-type scale with 1 as complete disagreement and 7 as complete agreement. Each participant heard 30 responses: each function had three replies in each period, and 3 additional replies served as exercises before the formal experiment started, to eliminate the practice effect. After rating the 3 different responses (in 3 different expression ways) for the same function and time, participants were asked to select the best one among them, so that the results of the two methods could be cross-checked. Owing to the number of materials, the evaluation aspects, and the equipment limitations imposed by the method, the experimental sequence was fixed. To reduce the sequence effect, each response was scored immediately after it was heard; in addition, after each function was completed, an interview for that function was conducted first, so the interval between responses was relatively long. After discussion with our mentor, we judged the sequence effect to have limited impact.

Next, participants attempted to recall the overall procedure they had heard during the experiment, and were then invited to attend a face-to-face interview. The semi-structured interview was designed to be short, taking between 8 and 10 min to complete [20]. The questions for each function were:

  • Q1. When will you use the conversational agent for the weather forecast (music, and news)?

  • Q2. Please tell us your demands for the function at different times in one day.

  • Q3. Have you found the differences between these responses? Please tell me about your findings.

  • Q4. What impressed you most in these responses?

At the end of the interview, participants had the chance to share any other thoughts they wished. On completion of the interview, participants were thanked and provided with monetary rewards for their participation in our study.

3.4 Data Analysis

First, validity and reliability analyses were conducted on the data to examine the scale we constructed.

For quantitative data, we used SPSS 25 to generate descriptive statistics (mean, standard deviation) and ran one-way ANOVAs for each dependent variable to determine the effect of expression way. We used an alpha level of .05 for all statistical tests. ANOVAs are robust to violations of normality; where assumptions were violated, we state results with caution.
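The one-way ANOVA used here can be reproduced outside SPSS. A minimal sketch with simulated 7-point ratings (made-up numbers, not the study’s data) for the three expression ways:

```python
import numpy as np
from scipy import stats

# Simulated 7-point ratings for one dependent variable under the three
# expression ways; 20 participants per condition. Illustrative data only.
rng = np.random.default_rng(0)
general = rng.integers(2, 6, size=20)    # ratings 2..5
implicit = rng.integers(3, 7, size=20)   # ratings 3..6
explicit = rng.integers(4, 8, size=20)   # ratings 4..7

# One-way ANOVA across the three groups; reject the null of equal
# means when p < .05 (the alpha level stated above).
f_stat, p_value = stats.f_oneway(general, implicit, explicit)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

Note that because the study used a within-subjects design, a repeated-measures ANOVA would account for participant-level correlation; the independent-groups version above is shown only to illustrate the F-test mechanics.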

For qualitative data, we invited four experienced user researchers to discuss the interview records. A focus group was used to extract keywords and classify them; when disagreement arose, the researchers discussed until they reached agreement. The frequency of keywords was tallied to summarize the important factors in users’ demands. Q3 was also used to clean the data: if the answer to Q3 was no, that piece of data was to be excluded from analysis. In fact, all users identified the differences between responses, so the data of all 20 participants were analyzed and presented. In the following section, we present the findings from our analysis.

4 Results

In this section, we did not analyze the main effects of time and function. In our view, these two variables were not independent variables in the usual sense; they mainly helped us observe the effects of the different expression ways, and it would not be meaningful to examine a main effect of time or function while ignoring expression way. We also do not describe the interaction effects between independent variables, because the results were non-significant. We first describe the quantitative statistics from the one-way ANOVAs, and then the results of the semi-structured interviews.

4.1 Validity and Reliability Analysis

The data were first subjected to Kaiser-Meyer-Olkin (KMO) and Bartlett test analyses to test the scale’s structural validity, yielding KMO = 0.894 and Bartlett test values of χ2 = 1939.115, df = 10, p < 0.001. This means the data are suitable for factor analysis. Factor analysis is performed to reveal whether the items of a scale group into fewer, mutually exclusive factors; items in the same group are assigned a name according to their content (Gorsuch, 1983). Factor analysis is also used to test whether a scale is one-dimensional (Balcı, 2009). The analysis revealed a single factor with an eigenvalue of 3.820, explaining 76.407% of the variance. These results show that the five aspects together measure affective experience well.
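The Bartlett test and the single-factor check can be computed directly from the item correlation matrix. The sketch below uses simulated ratings with one common factor, not the study’s data (whose own values are KMO = 0.894 and χ2(10) = 1939.115):

```python
import numpy as np
from scipy import stats

# Simulated ratings matrix X: rows = observations, columns = 5 scale items,
# all driven by one latent factor plus noise. Illustrative data only.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
X = latent + 0.5 * rng.normal(size=(200, 5))

n, k = X.shape
R = np.corrcoef(X, rowvar=False)  # item correlation matrix (k x k)

# Bartlett's test of sphericity: H0 is that R is an identity matrix.
chi2 = -(n - 1 - (2 * k + 5) / 6) * np.log(np.linalg.det(R))
df = k * (k - 1) // 2             # 10 for five items, matching the paper
p_value = stats.chi2.sf(chi2, df)
print(f"chi2({df}) = {chi2:.1f}, p = {p_value:.4g}")

# A dominant first eigenvalue of R indicates a single factor, as the
# one-dimensional scale reported above.
eigvals = np.linalg.eigvalsh(R)[::-1]  # descending order
print(f"first factor explains {100 * eigvals[0] / k:.1f}% of variance")
```

With five strongly intercorrelated items, the determinant of R is far below 1, the χ2 statistic is large, and the first eigenvalue dominates, reproducing the qualitative pattern reported for the scale.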

As for reliability, Cronbach’s alpha was found to be 0.922, indicating that the internal consistency of the scale is good.
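Cronbach’s alpha has a simple closed form over the item and total-score variances. A minimal sketch on simulated ratings (illustrative data, not the study’s; the function name is ours):

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Cronbach's alpha for a ratings matrix X (rows = observations,
    columns = scale items): alpha = k/(k-1) * (1 - sum(item var)/total var)."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)       # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)   # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five internally consistent items driven by one latent factor.
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 1))
X = latent + 0.4 * rng.normal(size=(200, 5))
print(f"alpha = {cronbach_alpha(X):.3f}")
```

Values above roughly 0.9, like the 0.922 reported for the scale, are conventionally read as excellent internal consistency.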

In conclusion, the scale we constructed for measuring affective experience is valid and reliable.

4.2 Expression Ways

Participants’ overall measurements of the different expression ways are shown in Table 3.

Table 3. The measurements of expression ways in total by one-way ANOVA.

Table 3 shows a trend that responses with affective expressions (implicit or explicit) performed better than those in the general way, with statistically significant differences on all aspects. However, no significant difference was found between the implicit and explicit ways except for affection.

To test whether the results were influenced by function and period, we analyzed the five dependent variables for each function and period. Because the results were similar across functions, we take the music function as an example and describe its results in detail (see Fig. 1), noting differing results in the other functions where relevant.

Fig. 1. Measurements of three expression ways in music in total (top left), in the morning (top right), after a day’s work (bottom left) and before going to bed (bottom right).

The results for music were the same as the overall results (affection: F = 5.599, p = .004; social distance: F = 9.527, p < .001). There was again a trend that responses with affective expressions (implicit or explicit) performed better than those in the general way. Consistent with music, the weather function showed significant differences in affection (F = 3.989, p = .020), social distance (F = 12.331, p < .001), and satisfaction (F = 6.319, p = .002). In the news function, all factors were statistically significant (affection: F = 13.074, p < .001; confidence: F = 4.641, p = .011; naturalness: F = 8.679, p < .001; social distance: F = 11.873, p < .001; satisfaction: F = 6.583, p = .002).

From the perspective of time, responses with affective expressions clearly performed better in all three periods, but there was no significant difference between the implicit and explicit ways, as in weather.

Furthermore, as Fig. 1 depicts, the explicit expression way scored higher than the other two ways when experienced after a day’s work. The same pattern appeared in weather and news; the difference was that in news the explicit expression way scored best in all three periods, see Table 4.

Table 4. The measurements of expression ways in news.

4.3 Gender Differences

In the study, we controlled the gender variable by equalizing the numbers of male and female subjects. Did gender have a moderating effect on the results? In this part, we analyze the effect of gender. Figure 2(a) presents the descriptive statistics of the experience measures overall. The overall average scores from males were higher than those from females, with statistically significant differences on all factors (affection: F = 12.870, p < .001; confidence: F = 22.078, p < .001; naturalness: F = 17.906, p < .001; social distance: F = 24.673, p < .001; satisfaction: F = 25.260, p < .001). To investigate whether the results differed by scenario, the five factors were calculated for each function, see Fig. 2(b) to 2(d).

In the weather function, there were statistically significant differences on all factors (affection: F = 6.453, p = .012; confidence: F = 9.689, p = .002; naturalness: F = 5.420, p = .021; social distance: F = 6.564, p = .011; satisfaction: F = 21.012, p < .001). The news function showed the same situation (affection: F = 7.999, p = .005; confidence: F = 4.859, p = .029; naturalness: F = 7.153, p = .008; social distance: F = 16.551, p < .001; satisfaction: F = 6.882, p = .009). Figure 2(c) shows the same trend in music, but only three factors reached statistical significance (confidence: F = 8.361, p = .004; naturalness: F = 5.743, p = .018; social distance: F = 4.678, p = .032). All the figures above show that male participants seemingly had a better experience than female participants, and the same trend held across the different periods.

Fig. 2. Measurements of three functions in general (top left), in weather (top right), music (bottom left) and news (bottom right).

4.4 Qualitative Statistics

Qualitative data were collected on multiple choices and users’ interviews. In this part, we present these results respectively.

Multiple Choice.

The results of the multiple choices are shown in Fig. 3 and Fig. 4. Figure 3 depicts the choices made by participants for the different functions; Fig. 4 depicts the results from the perspective of time.

Fig. 3. The results of multiple choices from a functional perspective.

Fig. 4. The results of multiple choices in the view of time.

Notably, many participants selected the general responses for music. In the interviews, half of the participants who selected general responses said that they did not want to listen to much speech before going to bed when using the music function; they preferred the music to play directly (in the general way, agents played the music directly). In their view, synthesized speech could be a disturbance that ruined the atmosphere, especially before going to bed and after a day’s work. This also led to the results shown in Fig. 4. In short, the current quality of synthesized speech limits the expression of emotion by conversational agents.

Interestingly, the quantitative findings seemed to contradict the qualitative findings. In interviews, participants who chose the general way (in which the music played directly without the agent saying anything) thought that the atmosphere created by the music was good enough and that nothing needed to be said; in that situation, an unnatural synthesized voice was a disturbance. Moreover, checking the raw data, we found that the scores from participants making those statements were very close between the general and implicit responses, while the scores from the other participants showed a bigger gap. These two observations explain the contradiction.

Q1. When will you use the conversational agent for the weather (music, and news)?

Figure 5 displays when the three functions are used. Users usually listen to weather forecasts in the morning and before sleep. Although most users listen to music in the afternoon, some also do so in the morning and before sleep. As for news, few users choose to listen to the news before going to bed. Design suggestions can be put forward based on these situations, especially the most frequent times.

Fig. 5. The results of using time of three functions.

Q2. Please tell us your demands for the function at different times in one day.

Table 5 illustrates the frequency of users’ demands for each function. From the table, we can summarize some points for improving user experience. For weather, it is important to give useful tips at the right time; for example, agents can remind users to bring an umbrella before going to work and report road conditions before they leave home. For music, playing the right music at the right time may be the most important thing, so agents should learn each user’s behavior and habits over time. For news, users value timeliness highly and prefer reading headlines.

Table 5. The frequency of users’ demands.

Q3. Have you found the differences between these responses? Please tell me about your findings.

This question was used as a filter to clean the data. All participants could tell the differences among the responses, which showed satisfactory discrimination of the experimental materials.

Q4. What impressed you most in these responses?

Most participants said that they were impressed by the implicit and explicit responses, which is a good sign. But they were also impressed by the awkward, unnatural voice, which made them feel uncomfortable. So naturalness is an important and basic element that strongly influences affective experience. In a word, users prefer natural voices and humanized expressions. Recent studies have shown that certain user groups tend to personify their agents [21], particularly older adults and children [22].

5 Discussion

This section discusses the main findings and the implications for existing and future conversational agents. Details are discussed below.

5.1 Discussions of Affective Experience

In the study, we used five factors to measure users’ experience with conversational agents. Satisfaction, as a comprehensive indicator, reflects the user’s experience in general. Affection and social distance are subjective factors measuring the user’s affective experience; confidence and naturalness are objective factors evaluating performance on objective aspects. We tried to use the real-world responses of conversational agents as the general way, although these responses had to be slightly adjusted to control other variables. Overall, the objective indicators performed better: the overall score for confidence was ahead of the other indicators, but the synthesized tone seriously hurt naturalness, as users generally felt the voice was somewhat stiff and unnatural. The subjective factors scored relatively lower. This suggests that enterprises currently do not pay enough attention to users’ affective experience, and that designers do not take it into account when designing feedback expression.

From the results, we can also see users’ preference for explicit expression ways after a day’s work, a trend that existed in all three functions in the experiment. In the interviews, participants said they usually felt tired after a day’s work and agreed that explicit responses made them feel more relaxed and comfortable. Statistics show that the average working hours of Chinese people are higher than in most countries of the world, reaching 2200 h per year [23]; long working hours can easily leave users depressed and tired.

From the overall results, we can also see that explicit responses had the best comprehensive performance; participants felt most satisfied with them most of the time. However, in the context of Chinese culture, Chinese people tend to be more implicit in interpersonal communication. Previous studies showed that subjects with a Chinese cultural background were more responsive to robots that use implicit communication styles [24], and that Chinese participants preferred an implicit communication style more than German participants did [25]. This is inconsistent with our results. One possible explanation is that although human-computer interaction must comply with some requirements of interpersonal communication, it still differs from interpersonal communication: in our interviews, although participants personified the conversational agents, they still regarded them as assistants rather than humans. Another possible explanation is that the limitations of conversational agents themselves lead users to prefer explicit responses. The current dialogue mode requires users to order agents concisely, directly, and clearly so that the agents can understand them easily; this situation also exists in Chinese agents. Before the study began, we conducted usability testing to understand the current usage of conversational agents and found that users’ natural expression led to recognition problems because the sentences were too long and complex, so the agents could not respond correctly.

5.2 Possible Reasons of Gender Heterogeneity

In the study, we found an interesting phenomenon: male participants scored significantly higher than female participants. This attracted our attention, and there is a possible reason for it.

The possible reason is that the female voice we used is more attractive to male participants. Studies have shown that humans generally prefer opposite-sex voices: men prefer high-pitched female voices, and women prefer low-pitched male voices [26, 27]. This preference is closely tied to the evolutionary significance of voice: a feminized voice signals female fertility [28], while a masculine voice signals a male’s good genes and resources [29]. However, since we did not run the experiment with a male voice, we cannot verify this explanation. Designers should nonetheless consider adding male voice options for female users; in our earlier usability testing, many female users made similar requests, asking for male voices in conversational agents.

5.3 Limitation

This paper identifies some important characteristics of conversation and how these factors influence users’ affective experience. We performed this study in China, so our findings may not transfer to other cultures. The results are based on data from 20 users; a larger sample is required for more convincing results in the future. In addition, only a female voice was used, so we do not know how a male voice would affect users’ affective experience. Finally, asking more in-depth questions during the interview would provide more information to help explain the quantitative data.

6 Conclusion

This study contributes to the literature on conversational agents by taking affective experience into account in feedback expression design. Specifically, we performed a quantitative evaluation of three expression ways across different functions and periods, together with a qualitative analysis of users’ experiences. We found that the user’s affective experience should be considered in the design of expression ways, and that explicit responses performed better in most situations. Male participants seemed more satisfied with the agent’s feedback, as reflected in their higher scores on each dependent variable compared with women. These findings are significant as they provide a foundation for future design and research in this area in China.