1 Introduction

Technological trends and the computational evolution of the last decades have strongly influenced the way we communicate. In this context, collaborative virtual environments (CVEs), computer-generated interfaces that give users the feeling of being together in an environment different from the one they are actually in (Ellis 1995; Schroeder 2010), provide a shared place with common objects where multiple users can collaborate at a distance. CVEs thus represent a medium that brings remote people into spatial and social proximity, facilitating communication awareness (Wolff et al. 2008).

Two types of actors can inhabit a CVE: computer users and intelligent virtual agents (IVAs). IVAs are interactive characters that exhibit human-like qualities. They can communicate with humans or other IVAs using natural human modalities, with the purpose of simulating real-time perception, cognition and action (Aylett et al. 2013). Both types of actors are embodied in the virtual scenario through avatars, their graphical representation in the CVE and their means of interacting with the virtual world and with others.

As in real life, actors’ interaction in CVEs is achieved via verbal and/or nonverbal channels. The IVAs’ display of nonverbal communication represents a broad research area mainly aimed at the design of human-like artificial behavior for both robots and avatars (e.g. Breazeal et al. 2005; Gobron et al. 2012). IVAs’ nonverbal interaction has also been recreated for collaborative situations.

In Hanna and Richards (2015), the authors adjust parameters such as the spatial extent of an expression, the time taken to perform it, the repetition of movements, and close physical postures to endow IVAs with distinct personalities expressed through verbal and nonverbal behavior. According to Jofré et al. (2016), nonverbal behavior is as primary a factor in human–avatar interaction as it is in human–human interaction; in their work, they presented a human–computer interaction (HCI) system for virtual reality (VR) based on nonverbal cues. Nevertheless, for CVEs, studies about the nonverbal cues that users can display through their avatars, such as deictic gestures or gazes, are scarce. Despite some isolated efforts toward the automatic recognition of nonverbal interaction, it remains an open issue.

In real-life conditions, Shahrour and Russell (2015) applied body trackers for the automatic recognition of people’s nonverbal interaction, in order to establish the topic of conversation and the cultural background of the subjects. In a collaborative situation, Hayashi et al. (2014) analyzed collaborative learning through eye-tracking glasses, microphone audio detection, and a pen device, collecting data on gazes, speech intervals, and task implementation to determine participation and learning attitude.

For digital environments, the recent interest in affective computing has intensified the study of users’ nonverbal behavior with a focus on affective states (Zeng et al. 2009). Honold et al. (2012) presented a framework to acquire, analyze, and present nonverbal communication in order to classify emotions. Also, particularly for users, an overview of the principal conditions of nonverbal interaction contextualized within CVEs was presented in Peña et al. (2015).

In the context of CVEs, Peña (2014) diagnosed collaborative learning through the automatic recognition of nonverbal cues in order to create a facilitator for collaboration. Another proposal to assess collaboration in CVEs was presented in Casillas et al. (2016), where a multilayer model combining fuzzy classification and rule-based inference was used to evaluate group collaboration. Neither of them deals with the automatic collection of nonverbal interaction from a general point of view.

It is worth highlighting that CVEs present a particular scenario in which the display of the users’ nonverbal cues is constrained by the technology, which nevertheless facilitates the collaborative process better than other media technologies.

Understanding and effectively using nonverbal behavior is crucial in every social experience, both in artificial and real life. Verbal and nonverbal channels work together in the communication process, in which nonverbal interaction is used to repeat, conflict with, complement, substitute, accent, moderate, and/or regulate verbal communication (Knapp and Hall 2010). Nonverbal communication provides a tremendous quantity of informational cues. Furthermore, collaborative work requires participants to be conscious of the presence of the other participants and to understand what they are doing (Ammi and Katz 2015), for which communication is a key aspect. Hence, we propose that the flow of collaboration can be understood through the interaction that takes place during the collaborative activity.

CVEs can support activities such as socializing or gaming; however, the work presented here is centered on the nonverbal interaction required to accomplish collaborative tasks. Also, because CVEs are predominantly visual, they are well suited to tasks that involve the use of space and objects; otherwise, this technology might not be necessary (Schroeder 2010; Spante et al. 2003). Likewise, unlike in real life, the focus in a CVE is narrowed to a few things and constantly engaged, because there is an ongoing reason for being there (Schroeder 2010). With these considerations in mind, a domain ontology to establish the actors’ interactions in CVEs is presented here. The ontological model is composed of a taxonomy and its relations (object and data properties) for the processes that retrieve nonverbal interaction cues from a CVE. In addition, some guidelines for higher-level indicators of a collaborative session are discussed. The model is then applied and presented in a case study.

We consider that this domain ontology provides insights for a better understanding of nonverbal interaction in CVEs and constitutes the support to automate its classification. In turn, this will provide the means for computer analysis during the collaborative session and/or afterward.

2 Nonverbal interaction cues in CVEs

The richness of face-to-face verbal and nonverbal interaction is not readily available in CVEs, a condition that has to be carefully considered to describe this domain.

As Knapp and Hall (2010) pointed out, defining nonverbal communication as communication effected by means other than words is generally useful, although not completely precise. Separating verbal and nonverbal behavior is virtually impossible: for example, hand gestures are often classified as nonverbal communication, yet sign languages are mostly linguistic, and not all spoken words are clearly or singularly verbal, as with onomatopoeic words. Knapp and Hall (2010) also stated that another way to classify nonverbal behavior is by looking at what people study in this regard. They found that the theory and research associated with nonverbal communication focus on three primary units:

  1. The communication environment, those elements that impinge on the human relationship but are not directly part of it, constituted by environmental factors such as furniture, lighting conditions or temperature, and by Proxemics, the study of the spatial environment (Hall 1968).

  2. The communicators’ physical characteristics, such as body shape or skin color, including artifacts such as clothes, hairstyle, or jewelry.

  3. Body movement and position, known as Kinesics, which includes gestures, movements of the limbs, posture, touching behavior, facial expressions, eye behavior, and vocal behavior.

Table 1 shows the nonverbal interaction units of study transposed to their expression in CVEs; see Peña et al. (2015) for details.

Table 1 Nonverbal interaction units of study transposed to CVEs (Peña et al. 2015)

Following the methodology proposed by Munoz et al. (2010), a domain ontology was developed. This methodology is based on the plan, do, check/study, act (PDSA) cycle, which results in an ordered sequence of steps that is easy to understand and track during the ontology design. The Protégé™ software was used in the design process.

A first intuitive and broad partition of the interaction cues is into verbal and nonverbal interaction. To illustrate verbal interaction, the utterance was used as its unit. The rest of the ontology’s taxonomy elements for the interaction cues in CVEs follow the Knapp and Hall (2010) breakdown of the primary nonverbal units of study. The nonverbal interaction cues were then grouped into Kinesics, environmental factors, Paralanguage and Proxemics, see Fig. 1.

Fig. 1 Main units for interaction cues taxonomy

This taxonomy is not exhaustive, but it comprises the most common interaction cues for the collaborative process of actors in a CVE. To automate cue retrieval, we propose to distinguish each cue’s process. Each cue represents a flow of states that changes according to what is going on in the CVE session, which constitutes its process. Each state registers the start, peak, and final time of a cue, and these can be retrieved from the log files of a CVE session. These processes are discussed in the next sections. The suggested cues are those considered suitable for automated retrieval.

2.1 Kinesics cues

For Kinesics, it is necessary to make a clear distinction between the two types of CVE actors, that is, IVAs and users. An IVA can display a considerable number of nonverbal cues, since they are reproduced by animations. On the other hand, the avatar control performed by the user constrains the number and spontaneity of the nonverbal cues that can be displayed (Capin et al. 1997).

According to Capin et al. (1997), a user can control an avatar using three different approaches:

  1. Directly controlled, through body sensors (Jovanov et al. 2009; Lange et al. 2011).

  2. User-guided, when the user guides the avatar by defining tasks and movements, usually through a computer input device such as the mouse or the keyboard.

  3. Semi-autonomous, when the avatar has an internal state that depends on its goals and its environment, and the user can modify this state. The semi-autonomous display of nonverbal cues is then achieved by animation.

Depending on the approach employed to control the avatar, a gesture or a body posture is determined in different ways. A gesture is the communication of a message achieved by moving a body part or parts; yet even though a gesture starts with a body part movement, a body part movement does not necessarily engender a gesture. In the directly controlled or user-guided approaches, the most common avatar body parts with independent movement are one arm and the head (Wolff et al. 2005). Using these cases as an example, two states can be established: a simple body movement or a gesture display. Figure 2 presents a detailed UML state diagram of these states.

Fig. 2 Body part movements

Each 3D object can be located in the virtual environment through its pivot axes, information that can be retrieved at any moment from the CVE. When the user interacts with the environment (HCI), a log file can be generated. A body part movement, also identified from the log files, establishes a starting point; this state is kept until a gesture is detected or the movement stops. In the UML diagram (Fig. 2), the record in the log file with a timestamp accompanies the messages sent from one state to another (i.e., ^Record.MovementCue Time_Start = TimeLog).

However, gesture detection is not a straightforward activity; it requires an understanding of the gesture’s distinctive characteristics. For example, a common arm gesture is the deictic gesture, used to point at something. For this particular case, Nickel and Stiefelhagen (2007) determined that a person usually holds a pointing gesture only until the attention of others is drawn to the pointed object, around a couple of seconds. Therefore, for automation purposes, the sustained selection of an object for a couple of seconds can be considered a pointing gesture. As for head movements, two gestures can be clearly distinguished: nodding, to show agreement or comprehension, and headshaking, to indicate disagreement or incomprehension. Both are characterized in Cerrato and Skhiri (2003) by the number and intensity of the movements, which will support automation.
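
As a minimal illustration of this rule, the sketch below flags a sustained object selection as a pointing gesture. The LogEvent structure, the action labels and the two-second threshold are assumptions for illustration and are not part of the application described later.

```java
import java.util.List;

// Minimal sketch: a selection held for roughly two seconds or longer is treated
// as a pointing gesture. LogEvent and its fields are hypothetical.
public class PointingGestureDetector {

    public record LogEvent(String userId, String action, long timestampMs) {}

    /** Returns true if the user kept an object selected for at least the threshold. */
    public static boolean isPointingGesture(List<LogEvent> userEvents, long thresholdMs) {
        Long selectionStart = null;
        for (LogEvent e : userEvents) {
            if ("SELECT".equals(e.action())) {                 // selection begins
                selectionStart = e.timestampMs();
            } else if ("DESELECT".equals(e.action()) && selectionStart != null) {
                long held = e.timestampMs() - selectionStart;  // how long the selection was held
                if (held >= thresholdMs) {
                    return true;                               // sustained selection -> pointing
                }
                selectionStart = null;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<LogEvent> events = List.of(
                new LogEvent("P1", "SELECT", 1_000),
                new LogEvent("P1", "DESELECT", 3_500));        // held for 2.5 s
        System.out.println(isPointingGesture(events, 2_000));  // prints true
    }
}
```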

When the avatar is semi-autonomously controlled, cue detection is different: the input that triggers the linked animation can be detected in the log file. Figure 3 shows a detailed UML state diagram for this situation. The animation can be either a gesture or a body action such as jumping or sitting. In this case, the message from one state to another represents the trigger input.

Fig. 3 States of a user’s input that triggers an avatar gesture or body action

A body action might lead to a body posture. The user typically achieves body postures through a key combination. Thus, they can be read directly from the log files and determined in the same way as the semi-autonomously controlled actions, as can be observed in Fig. 3.

2.2 Environmental factor cues

The communication environment corresponds to the different objects in the CVE, which shape the architectural design of the scenario. Hall (1968) differentiated three Proxemics features related to objects:

  1. Fixed features: the space organized by unmoving boundaries, such as a room.

  2. Semi-fixed features: movable objects that can change the space organization.

  3. Dynamics: movable objects.

While the fixed features cannot be modified during the session, the modification of semi-fixed features and dynamics in the workspace can play a role during collaborative interaction, especially in a CVE object-task-oriented session.

In the virtual world, for the actors to interact with an object, they have to select it first, which denotes the action of pointing at or grabbing that object. After selecting the object, it can be either deselected or manipulated. The manipulation of an object represents its state modification, generally by moving or rotating it (Mine et al. 1997). The object manipulation process is presented as a UML state diagram in Fig. 4. The interaction starts with the selection of the object; if the object is not deselected, it can then be moved, rotated, or transformed in other ways, such as being resized or colored. The object can go from one transformation state to another as long as it remains selected. The manipulation ends when the actor deselects the object.

Fig. 4 Object manipulation process states
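
A compact way to encode the process of Fig. 4 is an explicit state machine. The sketch below is only an illustration; the event labels are hypothetical and transformations are allowed only while the object remains selected.

```java
// Sketch of the object manipulation states of Fig. 4.
// Event names (SELECT, MOVE, ROTATE, RESIZE, COLOR, DESELECT) are hypothetical labels.
public class ObjectManipulation {

    public enum State { IDLE, SELECTED, MOVED, ROTATED, TRANSFORMED }

    private State state = State.IDLE;

    public State handle(String event) {
        switch (event) {
            case "SELECT"          -> { if (state == State.IDLE) state = State.SELECTED; }
            case "MOVE"            -> { if (state != State.IDLE) state = State.MOVED; }
            case "ROTATE"          -> { if (state != State.IDLE) state = State.ROTATED; }
            case "RESIZE", "COLOR" -> { if (state != State.IDLE) state = State.TRANSFORMED; }
            case "DESELECT"        -> state = State.IDLE;   // manipulation ends
            default                -> { /* ignore unrelated events */ }
        }
        return state;
    }

    public static void main(String[] args) {
        ObjectManipulation om = new ObjectManipulation();
        for (String e : new String[]{"SELECT", "MOVE", "ROTATE", "DESELECT"}) {
            System.out.println(e + " -> " + om.handle(e));
        }
    }
}
```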

2.3 Proxemics cues

The Proxemics units of study, territory and personal space, are represented by the actors’ navigation in the CVE. It is important to highlight that in virtual environments (VEs) some physical restrictions do not apply; for example, navigation can be performed by flying or by using techniques such as tele-transportation.

As shown in the UML diagram in Fig. 5, navigation starts once the avatar moves from its location. An avatar’s change in position is reflected by a new X, Y and/or Z axis position in the CVE. If the avatar goes in an ascending direction (e.g. the Y axis), air navigation is established, connector {1} in Fig. 5. If the avatar moves on the ground (e.g. the X or Z axes), land navigation is established, connector {2} in Fig. 5. Either ground or ascending movements can involve a direction change, represented by the reorientation connector {3} in Fig. 5. The connectors {1}, {2} and {3} of the diagram are explained next.

Fig. 5 Navigation process states

The air navigation process, connector {1} in Fig. 5, is detailed in Fig. 6. When the avatar interrupts the ascending movement, it might descend, stay floating in the air, or fly, and switch among those states. When the avatar descends to the ground, the air navigation ends, connector {4}, which returns to a new “last location” in Fig. 5.

Fig. 6 Air navigation process states

For land navigation, connector {2} in Fig. 5, the starting point is a standing position. From the standing position, the avatar can move, or move faster than usual, which is understood as running, and switch between those states. The land navigation ends with a switch to air navigation or when the session ends, as depicted in Fig. 7.

Fig. 7 Land navigation process states

Also, in both air and land navigation, the avatar can be reoriented by changing its facing direction, either to the right or to the left, as depicted in Fig. 8, and then return to navigation, connector {4} in Fig. 5.

Fig. 8 Reorientation during the navigation process states

Other types of navigation, such as falling, jumping, landing, or tele-transportation, are usually achieved through a combination of keys and an animation. In these cases, the semi-autonomous control approach following the states presented in Fig. 3 can be applied, but ending with an avatar’s new location.
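
To illustrate how these navigation states could be derived from logged positions, the following rough classifier compares two consecutive avatar positions. The axis convention (Y as the vertical axis, as in the example above), the sampling interval, the airborne flag and the speed thresholds are all assumptions.

```java
// Rough classification of navigation states from two consecutive avatar positions.
// The vertical axis (Y), the thresholds and the airborne flag are assumptions.
public class NavigationClassifier {

    public enum NavState { STATIC, WALK, RUN, ASCEND, DESCEND, FLY, FLOAT }

    public static NavState classify(double[] prev, double[] curr,
                                    double dtSeconds, boolean airborne) {
        double dx = curr[0] - prev[0];
        double dy = curr[1] - prev[1];                       // vertical change
        double dz = curr[2] - prev[2];
        double groundSpeed = Math.hypot(dx, dz) / dtSeconds;
        double verticalSpeed = dy / dtSeconds;

        if (Math.abs(verticalSpeed) > 0.1) {                 // assumed vertical threshold
            return verticalSpeed > 0 ? NavState.ASCEND : NavState.DESCEND;
        }
        if (airborne) {                                      // in the air with no vertical change
            return groundSpeed > 0.1 ? NavState.FLY : NavState.FLOAT;
        }
        if (groundSpeed < 0.1) {
            return NavState.STATIC;
        }
        return groundSpeed > 3.0 ? NavState.RUN : NavState.WALK;  // assumed running threshold
    }
}
```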

2.4 Paralanguage cues

Closely related to verbal communication is Paralanguage, described as the physical mechanisms for producing nonverbal vocal qualities and sounds (Juslin et al. 2005). A vocal expression can contain several nonverbal messages, such as emotion and intention. Paralanguage includes pitch, rhythm, tempo, articulation, and resonance of the voice, as well as vocalizations such as laughing, crying, sighing, swallowing, clearing of the throat, or snoring, among others (Knapp and Hall 2010). Several techniques have been developed for their study and comprehension, though their automatic interpretation remains a challenge (Johar 2015).

A characteristic of the human voice that is easy to extract in a computer system is whether the actor is vocalizing or not (making a pause), the Vocalization_pause cue in Fig. 1. From this cue, other indicators such as the frequency and duration of speech can be extracted, which are useful tools for the analysis of group interaction (e.g. Brdiczka et al. 2005; Dabbs and Ruback 1987).

From this cue, higher-level indicators such as the speech time rate can be obtained, that is, how much an actor speaks in relation to the others in the environment. Also, because we speak to someone in a dialogue interchange, the talking-turn structure can be obtained from the individual Vocalization_pause cues to better understand the interaction. Furthermore, a distinction between individual and group cues might lead to higher-level indicators of interaction, a possibility explored in the next section.
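
A minimal sketch of the speech time rate computation, assuming vocalization intervals per actor have already been extracted from the logs (the Vocalization record and its fields are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: speech time rate per actor, i.e. each actor's vocalization time
// relative to the total vocalization time of the group.
public class SpeechTimeRate {

    public record Vocalization(String actorId, long startMs, long endMs) {}

    public static Map<String, Double> rates(List<Vocalization> vocalizations) {
        Map<String, Long> perActor = new HashMap<>();
        long total = 0;
        for (Vocalization v : vocalizations) {
            long duration = v.endMs() - v.startMs();
            perActor.merge(v.actorId(), duration, Long::sum);   // accumulate per actor
            total += duration;                                   // accumulate for the group
        }
        final long totalTime = total;
        Map<String, Double> rates = new HashMap<>();
        perActor.forEach((actor, ms) ->
                rates.put(actor, totalTime == 0 ? 0.0 : (double) ms / totalTime));
        return rates;
    }
}
```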

It is worth mentioning that the display of a nonverbal cue might not involve other actors in the CVE; such actions are not interaction. However, there is no straightforward way to distinguish these situations, which complicates their automatic detection. For example, when people collaborate, they might make a statement directed to no one in particular, requiring no answer (Heath et al. 1995); however, that statement can influence others and therefore represents an interaction.

The representation of talking-turns within a group of persons was presented in Dabbs and Ruback (1987) to understand the amount and structure of group conversation. Based on this approach, we created a detailed UML state diagram, first for a dyad conversation, see Fig. 9. In a dyad conversation, a person vocalizing alone (taking the floor) can make a pause and then keep the floor, or the pause can lead the conversation partner to take the floor, producing a switching pause. We added to the Dabbs and Ruback (1987) approach a relation from the pause state to the switching pause state. These situations are presented in Fig. 9 through three states: vocalization, pause and switching pause. To distinguish a pause from the small gap between one vocalization and the next, note that in automatic speech recognition the end of an utterance is usually determined by a pause in the range of 500 to 2000 ms (Mine et al. 1997); hence, a two-second silence can be used to automatically determine the end of a talking-turn.

Fig. 9 Dyad conversation process states
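
A sketch of this segmentation for a dyad, assuming per-speaker vocalization intervals and the two-second threshold discussed above; the record fields and the gap labels are ours, not part of the original application.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: label the gaps between consecutive vocalizations of a dyad as a pause
// (same speaker resumes) or a switching pause (the other speaker takes the floor).
// A gap at or above the threshold ends the talking-turn.
public class DyadTurns {

    public record Vocalization(String speaker, long startMs, long endMs) {}

    public static List<String> labelGaps(List<Vocalization> vocs, long turnEndMs) {
        List<Vocalization> sorted = new ArrayList<>(vocs);
        sorted.sort(Comparator.comparingLong(Vocalization::startMs));
        List<String> labels = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            Vocalization prev = sorted.get(i - 1);
            Vocalization next = sorted.get(i);
            long gap = next.startMs() - prev.endMs();
            if (gap <= 0) continue;                           // overlapping speech, no gap
            String label = prev.speaker().equals(next.speaker()) ? "pause" : "switching pause";
            if (gap >= turnEndMs) {
                label += " (talking-turn ended)";             // two-second rule from the text
            }
            labels.add(label);
        }
        return labels;
    }
}
```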

The main contribution of Dabbs and Ruback (1987) was the identification of group vocalization, with its respective group pause and group switching pause. From a person’s vocalization pause, several persons might vocalize at the same time, generating a group vocalization. The group, like a single person, can make a pause that leads to a group pause, or a group switching pause, which means that a different group of persons is now vocalizing. In this case, we considered that the group vocalization could occur either during a pause or while a person is vocalizing.

During group vocalization, a group pause can lead to three situations: (1) a vocalization by a different group of members; (2) the person who had the floor before the group vocalization takes it again; or (3) a different person takes the floor. These situations extend the dyad flow and are presented in the UML state diagram in Fig. 10, with three group states: group vocalization, group pause, and group switching pause.

Fig. 10 Group vocalization process states
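
Group vocalization can be detected by sweeping the individual vocalization intervals and noting when two or more persons vocalize at once. The sketch below illustrates this under that assumption, with hypothetical interval fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: find the spans where two or more persons vocalize at the same time
// (group vocalization), using a simple sweep over interval boundaries.
public class GroupVocalization {

    public record Interval(long startMs, long endMs) {}

    public static List<Interval> groupSpans(List<Interval> vocalizations) {
        TreeMap<Long, Integer> delta = new TreeMap<>();
        for (Interval v : vocalizations) {
            delta.merge(v.startMs(), 1, Integer::sum);    // a speaker starts
            delta.merge(v.endMs(), -1, Integer::sum);     // a speaker stops
        }
        List<Interval> spans = new ArrayList<>();
        int active = 0;
        Long spanStart = null;
        for (Map.Entry<Long, Integer> e : delta.entrySet()) {
            active += e.getValue();
            if (active >= 2 && spanStart == null) {
                spanStart = e.getKey();                   // group vocalization begins
            } else if (active < 2 && spanStart != null) {
                spans.add(new Interval(spanStart, e.getKey()));
                spanStart = null;                         // group vocalization ends
            }
        }
        return spans;
    }
}
```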

In the same way, group states should give us a better understanding of the collaborative interaction process in the cases of navigation, implementation (object manipulation), and body postures such as sitting. Some of these situations are included in the case study.

2.5 Taxonomy including selected cues for automation

Following the discussed processes and considering these cues as suitable for automation, the main units of nonverbal interaction cues (see Fig. 1) can be broken down into the taxonomy shown in Fig. 11. Because the actor’s interaction triggers the taxonomy, the root of the domain is named Actor_Interaction.

Fig. 11 Automatic interaction cues retrieval taxonomy

Actor_Interaction can be one of two classes, either verbal or nonverbal. In the case of verbal interaction, just as an example, the class was named Verbal_Communication_Utterance. For nonverbal interaction, the class is identified as Nonverbal_Interaction_Cues, broken into four subclasses: (1) Environmental_factors, (2) Kinesics_cues, (3) Paralanguage_cues and (4) Proxemics_cues.

Within the Environmental_factors subclass, the Object_cues sub-subclass includes Dynamics_objects and Semi-fixed_objects, which represent the objects that can be manipulated within the environment.

The body movements included in the Kinesics_cues subclass are Face_Expressions, in case they are automatically generated, and the Gestures sub-subclass, composed of Hand_Movements and Head_Movements, because these are the most common avatar body parts under the user’s independent, direct control.

The second sub-subclass in Kinesics_cues is Body_Positions, where four body positions were included: (1) Crouch_Position, (2) Sit_Position, (3) Squat_Position and (4) Stand_Position for standing avatars, all as examples of semi-autonomously controlled cues (see Sect. 2.1).

In the Paralanguage_cues sub-subclass, only Vocalization_pause was included, because of the challenge of automatically recognizing paralanguage cues. For this sub-subclass, Speech_Time_Rate is merely suggested, as the rate of time the user speaks.

Finally, for the Proxemics_cues, the Navigation_cues derived from the actors’ navigation (detailed in Sect. 2.3) were included: Ascending, Descending, Fall, Fly, Jump, Land, Run, Teleportation, Turn_left, Turn_right, and Walk.
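
For reference, a fragment of this class hierarchy could be created programmatically with the OWL API™ (the library used in the case study below) roughly as follows. The namespace IRI and the subset of classes shown are assumptions; the actual ontology was designed in Protégé™.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

// Sketch: building a fragment of the Actor_Interaction taxonomy with the OWL API.
// The namespace and the selection of classes are illustrative assumptions.
public class TaxonomySketch {

    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        String base = "http://example.org/cve-interaction#";          // hypothetical namespace
        OWLOntology ontology = manager.createOntology(IRI.create("http://example.org/cve-interaction"));

        OWLClass actorInteraction = factory.getOWLClass(IRI.create(base + "Actor_Interaction"));
        OWLClass nonverbal = factory.getOWLClass(IRI.create(base + "Nonverbal_Interaction_Cues"));
        OWLClass kinesics = factory.getOWLClass(IRI.create(base + "Kinesics_cues"));
        OWLClass proxemics = factory.getOWLClass(IRI.create(base + "Proxemics_cues"));
        OWLClass navigation = factory.getOWLClass(IRI.create(base + "Navigation_cues"));

        // Subclass axioms mirroring part of the taxonomy in Fig. 11.
        manager.addAxiom(ontology, factory.getOWLSubClassOfAxiom(nonverbal, actorInteraction));
        manager.addAxiom(ontology, factory.getOWLSubClassOfAxiom(kinesics, nonverbal));
        manager.addAxiom(ontology, factory.getOWLSubClassOfAxiom(proxemics, nonverbal));
        manager.addAxiom(ontology, factory.getOWLSubClassOfAxiom(navigation, proxemics));

        System.out.println("Axioms: " + ontology.getAxiomCount());
    }
}
```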

3 Taxonomy application

This case study aims to retrieve the process states in order to obtain individual and group nonverbal cues and classify them within the taxonomy. Thus, a session in a CVE to collaborate on an object-oriented task was implemented as follows:

Participants Three undergraduate male students from a computer science school, aged 20, 20 and 25, voluntarily participated in the study.

Materials, devices and experimental situation The session took place in a room where three Internet-connected Dell™ Alienware X51 computers displayed the desktop CVE. Microphones, earphones, and the TeamSpeak™ application were used for communication. The participants’ manipulation of objects, the avatars’ navigation and movements, and talking time were automatically registered in a text file on each cycle of the application.

A three-dimensional (3D) CVE was generated with the OpenSim™ software and the CtrlAltStudio™ viewer. The session was videotaped using the Fraps™ application, saving the screens as the users saw them; all participants’ screens were recorded.

Design and procedure The task consisted of assembling several pieces into a geometric figure like the one shown in Fig. 12. During the session, the participants could look at the same model formed by plastic pieces.

Fig. 12 Geometric figure

The scenario is an island on which several pieces of different colors were placed around a rectangular white plane, as can be observed in Fig. 13.

Fig. 13 The scenario for the task session

The students had a first-person avatar, meaning that they could not see themselves in the virtual world. Each avatar corresponded to the participant’s gender, and the user’s name was placed above the avatar to facilitate identification and communication.

The users interacted with the CVE using key combinations and the mouse. The participants could perform land navigation, by walking or running, or air navigation, by flying, floating, ascending and descending; and they could select, move or rotate the dynamic objects. Objects could be manipulated from a distance when they were in sight, and the user’s name on top of the object helped to identify who was handling it, as can be seen in Fig. 14.

Fig. 14 The task session

The participants first had a 5-min trial to familiarize themselves with the environment. Afterward, they started the session to perform the specific task. A piece of paper containing instructions for the key and mouse combinations was placed on each participant’s desk as additional help. The plastic figure to be assembled was also placed on each desk. There was no time limit to perform the collaborative task.

The instructions given to them were: “By working together, assemble a figure like this (the plastic figure was then shown to them). Use the microphone with the earphones for communication. Not all the pieces are required. Please assemble the figure on the white plane. Let us know when you finish.”

Data The session lasted 6:44.918 min. The software application saves to text files the X, Y, and Z coordinates of the objects’ and avatars’ positions, when pieces or avatars are moved or rotated, and when the microphone is activated or deactivated by a user’s voice. The data were first processed by an application based on the ontological model, built with the OWL API™, a Java API for creating, manipulating, and serializing OWL ontologies.

The text log file is formed by the application as follows. First, a timestamp is taken from the system for each cycle of the application; the timestamp has hour, minute, second and ten-thousandths of a second. Then the user ID is placed in the log, followed by an ID for the type of action performed by the user (e.g. ON to start the session, CAM for walking, V for flying, A for grabbing an object, GD for turning to the right, MO for object manipulation, and so on; the initials stand for the Spanish word(s) of the action). Then the X, Y and Z axis values are added to the text row, placed between the less-than symbol ‘<’ and the greater-than symbol ‘>’. The manipulated objects in the VR environment were sequentially numbered, so in the text log file this number is placed at the end of the row for identification. An example of some rows of the text log file is presented next:

figure a
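
A parser for rows of this format could be sketched as follows. The exact field separators, the timestamp layout and the example row are assumptions based only on the description above, not on the actual log files.

```java
// Sketch of parsing one row of the described log format: timestamp, user ID,
// action ID, <X, Y, Z> and an optional object number at the end.
// Separators and the example row are assumptions.
public class LogRowParser {

    public record LogRow(String timestamp, String userId, String action,
                         double x, double y, double z, Integer objectId) {}

    public static LogRow parse(String row) {
        int lt = row.indexOf('<');
        int gt = row.indexOf('>');
        String[] head = row.substring(0, lt).trim().split("\\s+");   // timestamp, user, action
        String[] coords = row.substring(lt + 1, gt).split(",");      // X, Y, Z (assumed comma-separated)
        String tail = row.substring(gt + 1).trim();                  // optional object number
        Integer objectId = tail.isEmpty() ? null : Integer.valueOf(tail);
        return new LogRow(head[0], head[1], head[2],
                Double.parseDouble(coords[0].trim()),
                Double.parseDouble(coords[1].trim()),
                Double.parseDouble(coords[2].trim()),
                objectId);
    }

    public static void main(String[] args) {
        // Hypothetical row, not taken from the actual log files.
        System.out.println(parse("10:15:32.1234 P2 MO <12.5, 0.0, 8.3> 14"));
    }
}
```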

3.1 Results

Although the log files from the CVE contain a classification of the nonverbal cues, they were processed following the procedures described in Sect. 2, as mentioned, using the OWL API™ in Java, with the aim of establishing their generalization across CVEs.

In this particular CVE, the main nonverbal communication cues available are: Vocalization_pause from the Paralanguage_cues subclass, Object_cue from Environmental_factors, and the different types of Navigation_cue from the Proxemics_cues subclass of the taxonomy. The application creates two files: one with the classification of each nonverbal interaction cue and another with a report by interaction.

For navigation, a series of boolean functions were programmed to verify the type of navigation using different parameters, such as the previous and current position on the Z axis, and the previous and current orientation in degrees to identify turns. For example, the next code characterizes Walk (land navigation, see Fig. 5):

figure b
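
Since the code referenced as figure b is not reproduced here as text, a minimal re-creation of the described check is sketched below. The parameter names, the surface comparison and the running-speed threshold are assumptions.

```java
// Sketch of the walking check described in the text: the avatar's Z value equals
// a surface height and its speed is below the running threshold.
// Field names, surface handling and the threshold value are assumptions.
public class WalkCheck {

    static final double RUN_SPEED_THRESHOLD = 3.0;   // assumed units per second

    public static boolean isWalking(double avatarZ, double surfaceZ, double speed) {
        boolean onSurface = Double.compare(avatarZ, surfaceZ) == 0;   // avatar is on a surface
        return onSurface && speed > 0 && speed < RUN_SPEED_THRESHOLD; // moving, but not running
    }
}
```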

In the code, the walking condition is true when the Z axis of the avatar is equal to some surface and the speed is below the threshold that represents the running type of navigation. Another example is the next code, this time to identify a reorientation to the left (see Fig. 5), which corresponds to the Turn_left subclass of the Navigation_cues subclass.

figure c
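
Likewise, a minimal re-creation of the left-turn check described for figure c; the angle convention (counter-clockwise as positive) and the normalization are assumptions.

```java
// Sketch of the reorientation check described in the text: a left turn is flagged
// when the avatar's heading angle on the Z axis increases between two log records.
// The counter-clockwise-positive convention is an assumption.
public class TurnLeftCheck {

    public static boolean isTurnLeft(double previousAngleDeg, double currentAngleDeg) {
        double delta = normalize(currentAngleDeg - previousAngleDeg);
        return delta > 0;     // positive change -> counter-clockwise -> left turn
    }

    // Wrap an angle difference into the range (-180, 180] degrees.
    private static double normalize(double angleDeg) {
        double a = angleDeg % 360.0;
        if (a > 180.0) a -= 360.0;
        if (a < -180.0) a += 360.0;
        return a;
    }
}
```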

In this case, the code compares the angle on the Z axis to establish the turn, which yields the true result when the avatar turns left. All the navigation states were properly identified by the application. A segment of the application’s output file for the taxonomy classification is presented next. The timestamp is preserved; then the cue classification by name is placed, along with the X, Y and Z axis values.

figure d

A segment of the application’s interaction activity output report, which groups the starting and ending position of each cue, is also transcribed next:

figure e

3.1.1 Group analysis

The automatic identification of nonverbal interaction cues can be applied to understand the individual performance of an actor (intra-subject) within a CVE, for example, to observe how long the actor was attending to the task in contrast with navigating. Another comparison can be made among actors, for example, who had greater interaction with the others. Finally, cues can be observed at the group level to analyze groups.

The three cues retrieved from the environment (Vocalization_pause, Object_cue and Navigation_cue) were processed to understand group cues and their intersection during the session. In the following examples, only minute 5 was analyzed.

Vocalization and pauses From the log files, a list with the starting and ending vocalization times of the participants was classified. Figure 15 shows a graph of minute 5 of the session, showing the changes of state. In each rectangle, the upper left corner indicates the state (in this case vocalization or group vocalization), and the upper right corner indicates the person or persons that contributed to that state (P1, P2 and/or P3). The next row in the rectangle shows the starting and ending times of the state within the minute, in seconds and milliseconds, and the bottom of the rectangle shows the time elapsed in that state. Pauses led to a change of state. In this particular minute, the participants started to talk around the middle of it.

Fig. 15 Vocalization states for the group

Navigation The navigation states in the session can be followed as shown in Fig. 16. This example covers the three participants (P1, P2 and P3) during the first half of minute 5. In the upper left corner of the rectangles are the states, as follows: LR for left rotation, RR for right rotation, and F for flying. As with vocalization, the next rows in the rectangle show the starting and ending times, and the bottom of the rectangle shows the total time in that state. It can be observed, for example, that P3 was static for 23 s at the beginning of minute 5.

Fig. 16 Navigation states of the three participants

Object manipulation Object manipulation can likewise be followed, as shown in Fig. 17. In this example, the manipulation of only one object is presented. The types of states in the upper left corner of the rectangles are G for grabbing or selecting the object, MO for moving the object, RO for rotating the object, and D for dropping or deselecting the object. In the upper right corner are, first, the person who manipulated the object (i.e. P2) and then the number used to identify the object in the scenario, in this case object number 14 (i.e. O_14). The following rows of text in the rectangle show the starting and ending times of the state, and the bottom of the rectangle shows the time elapsed in that state. Object manipulation can be significant for distinguishing when all the group members are working at the same time, as in an implementation stage.

Fig. 17 Environmental factor states

Finally, a graphic with the overlapping activities in minute 5 is presented in Fig. 18. It shows the percentage of time for each type of cue and for when two or three of them occurred simultaneously. As can be observed, object manipulation during minute 5 was constant; at every moment, at least one of the participants was moving an object. Vocalization was always accompanied by another type of interaction (i.e. manipulation or navigation), although the three types of interaction occurred simultaneously only 1.05% of the time.

Fig. 18 Navigation states for the group
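
Overlap percentages like those of Fig. 18 can be computed from the labeled intervals with a sweep over their boundaries. The sketch below is an illustration with hypothetical interval fields, counting how much of a time window has one, two or three cue types active at once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: fraction of a time window during which zero, one, two or three cue types
// (e.g. vocalization, navigation, object manipulation) are active simultaneously.
public class CueOverlap {

    public record Interval(String cueType, long startMs, long endMs) {}

    public static double[] fractions(List<Interval> intervals, long windowStart, long windowEnd) {
        // Time boundary -> per-cue-type change (+1 start, -1 end), swept in time order.
        TreeMap<Long, Map<String, Integer>> changes = new TreeMap<>();
        for (Interval i : intervals) {
            changes.computeIfAbsent(i.startMs(), k -> new HashMap<>()).merge(i.cueType(), 1, Integer::sum);
            changes.computeIfAbsent(i.endMs(), k -> new HashMap<>()).merge(i.cueType(), -1, Integer::sum);
        }
        Map<String, Integer> active = new HashMap<>();
        double[] timeByCount = new double[4];   // index = number of active cue types (capped at 3)
        long prev = windowStart;
        for (Map.Entry<Long, Map<String, Integer>> e : changes.entrySet()) {
            long t = Math.min(Math.max(e.getKey(), windowStart), windowEnd);
            timeByCount[activeCount(active)] += (t - prev);  // span [prev, t) under current count
            e.getValue().forEach((type, d) -> active.merge(type, d, Integer::sum));
            prev = t;
        }
        timeByCount[activeCount(active)] += (windowEnd - prev);   // tail of the window
        double window = windowEnd - windowStart;
        double[] fractions = new double[4];
        for (int k = 0; k < 4; k++) {
            fractions[k] = timeByCount[k] / window;
        }
        return fractions;
    }

    private static int activeCount(Map<String, Integer> active) {
        long count = active.values().stream().filter(c -> c > 0).count();
        return (int) Math.min(count, 3);
    }
}
```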

Group activities are not part of the taxonomy; however, logical agents can be added to the ontology to make this kind of inference.

4 Discussions and future work

We propose that the flow of collaboration can be understood through the interaction that takes place while a collaborative task is being achieved. Nonverbal interaction during the accomplishment of an object-oriented task is the main activity performed while collaborating. In CVEs, these nonverbal cues have particular characteristics due to the constrained body movements of the users’ avatars, and they therefore need to be distinguished and treated accordingly.

The processes to extract nonverbal cues from a CVE, as units of study, were specified for a taxonomy model to narrow the domain. Each process was detailed in UML state diagrams to determine when the cue starts, how long it lasts, and when it ends. Those states form the interaction cues.

Some of those processes were implemented in an application and applied to a case study focused on a collaborative assembly task, to classify the available cues and prove the feasibility of the model. Moreover, the taxonomy from the ontological model is considered the backbone for the automatic retrieval of nonverbal interaction in CVEs.

In addition, higher-level indicators can be inferred from the model cues, similar to those presented in the Results section. This process highlights group cues in order to define group behavior based on the interaction during the accomplishment of the task.

As future work, the automatic comprehension of collaborative interaction in virtual environments will be implemented. We plan to integrate the ontology with intelligent agents in order to automate the identification of collaboration types, such as division of labor or hierarchical collaboration. Another inference planned for implementation is the identification of collaboration phases, such as planning, implementation, review or control. This process analysis can include individual characteristics such as leadership, or their influence on certain stages of the collaboration session. Along with learning or training purposes, this model can be applied to psychometric tests.