Abstract
This paper introduces a deep learning methodology for analyzing audience engagement in online video events. The proposed deep learning framework consists of six layers and starts with keyframe extraction from the video stream and the detection of participants' faces. Subsequently, the head pose and emotion of each participant are estimated using the HopeNet and JAA-Net deep architectures. Complementary to video analysis, the audio signal is processed by a neural network that follows the DenseNet-121 architecture; its purpose is to detect events related to audience engagement, including speech, pauses, and applause. With the combined analysis of video and audio streams, the interest and attention of each participant are inferred more accurately. An experimental evaluation is performed on a newly generated dataset consisting of recordings from online video events, where the proposed framework achieves promising results. Concretely, the F1 scores were 79.21% for interest estimation according to pose, 65.38% for emotion estimation, and 80% for sound event detection. The proposed framework has applications in online educational events, where it can help tutors assess audience engagement and comprehension while hinting at points in their lectures that may require further clarification. It is also applicable to video streaming platforms that wish to provide video recommendations to online users according to audience engagement.
1 Introduction
Online video events have become very common, mainly due to advances in internet connectivity and video streaming technology, and they proliferated further with the travel restrictions caused by the COVID-19 pandemic. This study proposes a framework that complements the functionalities offered by online video event platforms by monitoring audience engagement at each time instant.
The framework is related to video content analysis, i.e., the capacity to analyze video automatically in order to recognize and determine temporal and spatial occurrences. It is used in many areas, including entertainment, health care [1], sports [2], educational content [3], and security [4]. In [5], a methodology fuses features from video content analysis with early viewership statistics to estimate video popularity six months ahead, while in [6], researchers used similar technologies for video highlight detection according to user interests inferred automatically through video analytics during viewing.
Although there are many methods for video analysis, to the best of our knowledge, no existing work analyzes video events to detect their most interesting parts or to monitor audience engagement. Interesting parts are those featuring significant sound events, such as applause, and segments where the audience is visibly engaged, as evidenced by pose; the least engaging parts are those in which people are yawning or few participants appear engaged according to their pose. Speakers often notice that audience attention decreases over time, and it is difficult to identify the factors that lead to this loss of interest. The proposed framework addresses this problem by identifying the most and least engaging parts of an online video event. In this respect, it is beneficial for recurrent video events such as educational lectures and seminars, as it helps tutors identify the parts that should be better explained or presented in another manner. Of course, the application is not limited to education, as there are several domains, including business and entertainment, where assessing audience engagement can provide valuable feedback to the content creator and help improve the quality of the delivered video content. The key novel contributions of the proposed methodology are:
• A novel end-to-end six-layer deep learning framework for audience engagement analysis in video events;
• Utilization of a dense convolutional network trained with a custom-created dataset for audio event detection analysis;
• Fusion of an audio stream with video features to enhance engagement estimation;
• High accuracy in real-life data;
• Evidence of real-world applicability.
In more detail, computer vision tasks, such as keyframe extraction, face detection, pose estimation, and emotion recognition, together with an audio processing method for sound event detection, are applied to address the problem. The proposed methodology assumes that participants have their cameras turned on, which is common in lectures and conferences. It works even if only a few participants have their cameras on, without requiring all cameras to be open; in such cases, it analyzes only those participants with open cameras, and the results are based on them. Moreover, the methodology extends its analysis beyond video, leveraging the audio stream to identify audio-related activities, which is crucial in such cases. In this way, useful information is produced that can be examined by participants, organizers, or presenters to improve their teaching, presentation approach, and other related activities.
A dataset was generated to evaluate the methodology, and the testing experiments showed that the framework has significant potential in the field of online event analysis. The F1 score, commonly utilized in similar tasks, balances precision and recall and offers robustness to class imbalance. Its straightforward nature facilitates comparisons between the proposed modules and alternatives while quantifying the framework's classification accuracy, making it well suited for evaluating the framework's effectiveness.
The remainder of this paper is organized as follows. Section 2 briefly presents related work, while Sect. 3 details the proposed online event analysis. Section 4 includes the evaluation of the methodology, with the relevant metrics, and Sect. 5 concludes the study along with limitations and future improvements.
2 Related Work
This section provides an overview of the current state of the art for keyframe extraction, face detection, pose, and emotion estimation, alongside audio event detection.
Abed et al. [7] developed a keyframe selection framework for efficient face recognition, while Song and Fan [8] used the same techniques together with object segmentation for video content analysis. The method proposed by Zhang et al. [9] follows a two-phase approach based on entropy and perceptual hash to extract keyframes representative of the main content. Clustering is another method used for keyframe extraction: frames are grouped into clusters, and a single frame is selected to represent each cluster [10]. Sun et al. [11] used feature fusion with k-means clustering, where keyframes are extracted as those that surpass a threshold capturing content variation. Another approach, by Tang and Chen [12], analyzed visual and audio features to identify distinct scenes and extracted one keyframe for each.
Luo et al. [13] proposed a deep cascaded method for face detection. This method iteratively exploits bounding-box regression as a localization technique. Mo et al. [14] used multi-task convolutional neural networks (MTCNN) for face detection and alignment in multiple faces. CNNs were also used by Li et al. [15] for face detection. Researchers used a dual-branch center face detector (DBCFace) based on CNN, and their results were comparable with state-of-the-art methods yet exhibited faster processing speed. In [16], a faster R-CNN was employed, trained on the large-scale WIDER face dataset [17]. By applying this method, researchers achieved state-of-the-art results on two datasets, FDDB [18] and IJB-A [19]. In a related study [20], the authors proposed a methodology for rotated face detection based on progressive calibration networks, achieving real-time detection performance for rotated faces. In [21], researchers tackled the problem using a deep convolutional neural network (CNN) to minimize training and testing times. Their method outperformed previous deep convolutional networks used for face detection in effectiveness and efficiency.
In [22], landmarks were used for pose estimation combined with face alignment, and the researchers created one of the largest datasets for testing. Zhu et al. [23] attempted to solve the problem of estimating pose under challenging situations using a CNN and achieved significant improvements over state-of-the-art methods. In [24], the authors presented a robust facial pose estimation technique based only on landmarks predicted with a high confidence score; CNNs were used to measure this score, and erroneous landmarks were removed. In [25], a pipeline that can localize facial landmarks in non-frontal images was introduced; it is based on an optimized version of the part mixture model and used for head pose estimation in such frames. In [26], pose estimation was conducted using three distinct methods: two are based on landmark detection, while the third can recover, using the proposed dictionary approach, in cases where the other methods fail. Facial alignment and pose estimation are combined in [27], where a lightweight deep neural network called an active shape model network is used, and in [28], researchers propose a system for multiple pose estimation by combining these two computer vision technologies.
In [29], researchers constructed a dataset to train a CNN for estimating the parameters of a 3D morphable model, combining it with an effective back-end emotion classifier. Xi et al. [30] proposed a framework that classifies six basic emotions using deep network transfer learning techniques and multiple temporal models. Shao et al. [31] proposed a deep learning-based attention and relation learning framework that works well with large-scale images and severe occlusions. In [32], emotion analysis is combined with age prediction: a novel set of features is extracted for each image and then passed to a classifier based on recurrent neural networks (RNNs). Niu et al. [33] approached the problem with a novel method that utilizes local information and the relationship between individual local face regions; their approach outperformed state-of-the-art methods and contributed to emotion analysis in wild images.
Audio event detection involves recognizing sound events in a continuous audio clip and providing their respective start and end times. Phan et al. [34] proposed a multi-task regression model that formulated event detection and localization as regression problems using the mean-squared error loss for training. Greco et al. [35] proposed a novel deep network that can detect abnormal events using a CNN with small kernels in its convolutional layers. A CNN with 21 layers was also used in [36], which was fed with sections of the gammatonegram representation; gammatone filters are a linear approximation of the filtering performed by the ear. Their network achieved state-of-the-art results on benchmark datasets and handled 5 s of audio per second. Romanov et al. [37] developed a system that can recognize thirteen types of non-speech events with a low false-positive rate using transfer learning, while Kao et al. [38] used a recurrent neural network (R-CRNN) to tackle the problem. In [39], two-dimensional spectrogram magnitude representations were used: the audio signal was transformed into spectrograms, and the problem was then treated as image classification.
A direct comparison table could not be provided due to the lack of common evaluation fields between the related works and the proposed one. The works in this section concern the individual modules for each task, and no fused methodology employs all of them to analyze an event video. The related work therefore focuses on comparing the adopted deep learning modules with similar alternatives, as the selected networks demonstrated the best accuracy when compared separately on the same datasets.
3 Audience Engagement Analysis
This section presents the introduced audience engagement analysis framework. The framework starts by taking an online event video as input and continues with the pre-processing step of finding keyframes. Subsequently, it continues with face detection for each participant. Then, pose estimation determines how many people look below the horizontal axis. Intuitively, with head pose estimation, audience members who have lost interest and are distracted can be detected.
Subsequently, emotion estimation techniques are applied to recognize the participants' emotions, giving the system operator one more piece of information for each event section. This layer also contains yawn detection: if a person is classified with the yawning label, the emotion is changed to boredom. The audio analysis then extracts crucial information from each event's audio. After all these analyses are completed, the results are fused to indicate the participants' interest in each minute of the video. The flow chart of the proposed system is shown in Fig. 1.
The output is a detailed analysis report of the online event video. This analysis contains information about the participants' engagement during each video session. The interest is estimated based on how many people look bored according to their pose or emotion. Information about how many attendees were present in each period is also provided. In addition, the total number of people expressing each emotion is counted; the considered emotions are happiness, sadness, neutrality, boredom, surprise, disgust, and fear. For each minute, the number of claps is included so that the system operator can observe which parts the participants found crucial and applauded.
Audio and video stream features are fused at the end of the analysis process to provide insight into audience engagement. Audio reveals moments of high engagement, like applause, while video indicates participant involvement through body language. A comprehensive understanding of audience engagement during the event is achieved by integrating the features extracted from audio and video sources. Integrating video and audio data provides a holistic view of the event, surpassing the limitations of analyzing each data source separately.
Besides providing detailed information about participant attention, the methodology can find the video highlights where engagement was highest. First, candidate highlights are produced based on keyframes that indicate when a scene has changed, and then they are sorted according to two factors. The first factor checks whether there are claps in the keyframe's minute; minutes containing claps are sorted to the top, based on the assumption that highlight moments usually occur just before and after them. The second factor is based on engagement, produced according to the participants' poses and emotions. If participants look away from the screen or yawn, they are not considered engaged, while individuals who gaze at the screen or show an emotion other than neutrality are considered engaged. The total number of engaged participants is calculated for each frame, and minute-by-minute averages provide comprehensive insight into overall engagement throughout the video session. Section 4 provides more details on the results obtained from the real-life data.
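The two-factor ranking described above can be summarized in a short sketch. The per-minute summary structure, its field names, and the tie-breaking by average engagement are illustrative assumptions rather than the exact implementation.

```python
from dataclasses import dataclass

@dataclass
class MinuteSummary:
    """Hypothetical per-minute aggregate produced by the earlier modules."""
    minute: int
    has_claps: bool              # from the audio event detection module
    engaged_participants: float  # average number of engaged participants per frame

def rank_highlights(minutes, top_k=4):
    """Sort candidate highlight minutes: those containing claps first, then by engagement."""
    ranked = sorted(minutes, key=lambda m: (m.has_claps, m.engaged_participants), reverse=True)
    return [m.minute for m in ranked[:top_k]]

# Minute 7 contains claps, so it outranks minute 3 despite lower average engagement.
summaries = [MinuteSummary(3, False, 6.2), MinuteSummary(7, True, 4.8), MinuteSummary(12, False, 5.1)]
print(rank_highlights(summaries, top_k=2))  # [7, 3]
```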
The audience engagement analysis is designed to be versatile and applicable to various event types without requiring modifications. While it may seem like an extended approach, its adaptability is inherent in its design, enabling it to be applied to event videos of different formats and contexts. For example, it works the same way when people attend asynchronous videos such as pre-recorded lessons: provided the viewers consent to keeping their cameras on, their emotions and engagement levels can be analyzed. By adapting to the asynchronous nature of pre-recorded content, the methodology offers a versatile and insightful tool for optimizing the experience across various video content platforms. Additionally, the methodology is developed to prioritize anonymity in order to encourage participants' willingness to permit engagement analysis: while determining engagement levels within a frame, specific participant identities and their positions in the video remain undisclosed. Moreover, the framework can easily be modified to exclude a participant who does not want to be part of the engagement analysis and to continue with the remaining participants. This approach maintains confidentiality while still providing valuable insights into overall engagement.
The utilized features were chosen to provide the best balance between informational value and computational complexity. For example, while spatio-temporal information could enhance the analysis, it was not considered due to potential difficulties in acquiring it across various event types and its misalignment with the framework's scope.
Several obstacles may impact the quality of the analysis, such as faces hidden behind masks or sunglasses. Although such accessories are usually removed when attending video events, their presence can affect the analysis: the face is detected and pose estimation is performed, but emotion estimation may be compromised due to the absence of landmarks, in which case the framework proceeds to the next attendee. Camera or microphone failures can also disrupt the analysis. When the microphone fails, the sound analysis may be inaccurate, prompting the framework to move to the next segment. However, individuals experiencing microphone issues usually fix them quickly, minimizing their impact on the analysis.
3.1 Pre-processing Step
The pipeline starts by taking a video as input, and the first module is keyframe extraction. This is an important task related to computational load: instead of analyzing every video frame, only a few representative frames are analyzed. As many frames are extracted per minute as the system operator requests, with each minute divided equally into N parts. Representatives are then extracted with systematic sampling using the formula in Eq. (1), and the entire minute of video is covered by analyzing only these frames. The equation used for the keyframe extraction step is as follows:

$$S = \frac{60 \cdot F}{b} \quad (1)$$

where S represents the step, i.e., the number of frames that must pass before one is extracted, F represents the rounded fps of the video recording, and b is a constant given by the user that represents how many frames are wanted per minute.
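Because the sampling step depends only on the recording frame rate and the requested frames per minute, keyframe extraction reduces to a simple frame-skipping loop. The sketch below assumes OpenCV is used for decoding (the original implementation's library is not stated), and the default of b = 6 frames per minute is only an example.

```python
import cv2

def extract_keyframes(video_path, b=6):
    """Systematic sampling: keep b frames per minute of video (sketch of the pre-processing step)."""
    cap = cv2.VideoCapture(video_path)
    fps = round(cap.get(cv2.CAP_PROP_FPS))  # F: rounded recording frame rate
    step = max(1, (60 * fps) // b)          # S = 60 * F / b frames between keyframes
    keyframes, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:               # keep one frame every S frames
            keyframes.append(frame)
        index += 1
    cap.release()
    return keyframes
```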
3.2 Face Detection
The subsequent step of the framework is detecting the position of the participants' faces in each keyframe. The module used for face detection [40] is a single, unified network composed of a backbone network and two task-specific sub-networks. Given an entire input image, the backbone computes five multi-scale convolutional feature maps. Their dimensions are given by:
where D represents the dimensions and X is a constant equal to 160 for F1 (feature map 1), 80 for F2, 40 for F3, 20 for F4, and 10 for F5. The first four feature maps (F1–F4) are produced using a pre-trained model described in Sect. 4, while the fifth (F5) is randomly initialized using the Xavier method [41]. The point of using feature maps of different scales is to make the network more robust in detecting faces of different sizes. The network architecture is illustrated in Fig. 2.
Once the initial feature maps are computed, they are fed into the first sub-network, which performs convolutional object classification, while bounding box regression is performed by the second sub-network. Features are extracted using a deformable convolutional network (DCN) [42]. The DCN creates new feature maps; during training, once this process ends, the multi-task loss is calculated and cascade regression is applied, following the equation given by:
For every training anchor i, there is a multi-task loss L, while xi, pi, and vi are the predicted values for the bounding box, the five landmarks, and the 1k vertices, and x̂i, p̂i, and v̂i are the corresponding ground truths. Lcl(ni, n̂i) is the face classification loss, where ni is the predicted probability of the anchor being a face and n̂i is 1 when it is a face and 0 when it is not; it is a softmax loss for binary classes. The following unified point regression targets for the multi-level face localization tasks are given by:
where xĵ and yĵ are the ground-truth values of the two box coordinates, the five landmarks, and the 1k 3D vertices in the image space. The value of zĵ represents the ground-truth z coordinates of the 1k 3D vertices: the z coordinate of the nose tip is 0, and all other z coordinates are normalized by the anchor scale. The landmarks and vertices of this module are used only for loss calculation and are not used in the subsequent modules.
3.3 Pose Estimation
A pose estimation method is employed to determine where each participant looks, estimate the head orientation, and identify whether the participant is paying attention. Pose estimation captures the combination of the position and orientation of each human face; hence, this module is third, coming after face detection.
With the location of participants’ faces as input, this module produces the orientation of their heads concerning the principal axis of their webcam to monitor their engagement. The rationale is that distracted participants look far from the center of the screen.
Someone looking at the screen does not necessarily indicate engagement, although in most cases it suggests attention to the event. The same applies to those who are not looking at the screen: they might be taking notes or simply listening to the speaker without looking. Gaze direction is checked because, in most (though not all) cases, it helps in estimating engagement. For instance, the presenter may learn that engagement decreased in a segment where 10 out of 20 individuals were looking away. Occasionally, an individual might glance away for a second exactly when a frame is captured and be labeled as uninterested at that moment. The solution is to increase the number of frames per minute so that a single glance does not significantly affect the results, as in most frames the individual will be looking at the screen.
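As a concrete illustration of how head orientation can be turned into an engagement label, consider the sketch below; the angular limits are illustrative assumptions, not the thresholds used in the framework.

```python
def is_engaged(yaw_deg, pitch_deg, yaw_limit=30.0, pitch_limit=20.0):
    """Label a participant as engaged when the head stays close to the camera's
    principal axis. The limits are illustrative values, not the framework's settings."""
    return abs(yaw_deg) <= yaw_limit and abs(pitch_deg) <= pitch_limit

# A participant looking far below the horizontal axis (e.g., at a phone) is flagged.
print(is_engaged(yaw_deg=5.0, pitch_deg=-45.0))  # False
```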
This module takes as input a frame together with the face locations from the previous step and calculates the pitch, roll, and yaw for each face. The prediction is achieved using Hopenet [43], whose output is one value per axis. In the testing phase, the network takes the image and, using a ResNet-50 backbone, feeds three fully connected layers that share the same convolutional layers. ResNet-50 [44] has 48 convolution layers alongside one max-pooling and one average-pooling layer. Each Euler angle is predicted as a vector of size S_bin, set by the user; the values in every output vector represent the probabilities that the angle falls into each bin, and normalization is applied using the softmax function.
The cross-entropy loss is employed in training alongside the mean-squared error. The architecture used is illustrated in Fig. 3. The network consists of three fully connected layers that predict the angles; they all share the previous convolutional layers. In this manner, three cross-entropy losses are backpropagated into the network, improving learning. Subsequently, a regression loss, the mean-squared error, is added to the network. Three final losses are calculated, each a combination of the respective classification and regression loss. The multi-loss L used in training is given by:

$$L = C(y, \hat{y}) + r \cdot \mathrm{MSE}(y, \hat{y})$$

where L represents the loss and C denotes the cross-entropy loss between y (actual value) and ŷ (predicted value), followed by the regression coefficient r multiplied by the mean-squared error (MSE) of the actual and predicted values. This architecture was chosen because it does not depend on landmark detection, which can be challenging, especially in low-resolution facial images, as in this case.
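A minimal sketch of this combined loss for a single Euler angle is given below, assuming a PyTorch implementation; the number of bins, the bin width, the angle range, and the coefficient r are illustrative values rather than the exact settings of Hopenet or this framework.

```python
import torch
import torch.nn.functional as F

def binned_angle_loss(logits, bin_labels, cont_angles, r=0.001, num_bins=66, bin_width=3.0):
    """Cross-entropy on the binned prediction plus r times the MSE of the continuous
    angle recovered as the expectation over the bins (illustrative parameter values)."""
    cls_loss = F.cross_entropy(logits, bin_labels)                 # classification term
    probs = F.softmax(logits, dim=1)                               # per-bin probabilities
    idx = torch.arange(num_bins, dtype=probs.dtype, device=probs.device)
    expected = torch.sum(probs * idx, dim=1) * bin_width - 99.0    # map bin index back to degrees
    reg_loss = F.mse_loss(expected, cont_angles)                   # regression term
    return cls_loss + r * reg_loss
```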
Different ResNet structures were chosen for pose estimation and face detection due to the distinct requirements of each task. Face detection necessitates a deeper network with a higher computational cost, whereas pose estimation benefits from a less deep structure that balances accuracy and efficiency. Aligning the architecture with the specific demands of each task ensures optimal performance while minimizing computational costs.
3.4 Emotion Estimation
The subsequent pipeline module estimates the participants' emotions, enabling the identification of the feelings expressed by each participant during the event. It takes as input the N frames extracted by the keyframe module and the bounding boxes of the faces that appear in them, provided by face detection. First, the regions around each participant are isolated, and each frame is divided into sub-frames; each sub-frame is then given as input to the methodology. The methodology starts with landmark detection and, according to the landmarks, predicts the action units (AU) using a pre-trained neural network called JAA-Net [45], a deep learning framework consisting of four modules analyzed below. Once the active action units are identified, emotions are predicted based on them. This process is shown in Fig. 4.
Even though the face detection module provides five landmarks, more landmarks are needed for accurate emotion detection, so when a frame is given to the methodology, 68 facial landmarks are detected using Dlib [46]. The entire emotion estimation module is based on facial landmarks, which is why a reliable toolkit was sought to find them. If these landmarks cannot be located, which occurs when the face is poorly illuminated or occluded, the image is not processed further; the emotion is not estimated at all, and the framework continues with the next detected face. If the scale is inappropriate, the image is resized according to the network specifications.
After locating the landmarks in each picture, they are given to JAA-Net [45]. Each frame is first aligned according to each face: if a face is not well centered, the image is rotated about the center of the nose so that the face becomes upright with respect to the horizontal axis. Face alignment aids in more accurate emotion predictions, making it an essential pre-processing method. During training, the face alignment loss is calculated by:

$$L_{align} = \frac{1}{2\,g_0} \sum_{i=1}^{p_{align}} \left[ \left( c_{2i-1} - \hat{c}_{2i-1} \right)^2 + \left( c_{2i} - \hat{c}_{2i} \right)^2 \right]$$

where Lalign represents the alignment loss and g0 is the ground-truth inter-ocular distance used for normalization. c2i-1 and c2i are the ground-truth x and y coordinates of the i-th facial landmark, ĉ2i-1 and ĉ2i are the corresponding predicted values, and palign is the number of facial landmarks. After facial alignment, new landmarks are produced for the aligned face; these landmarks are utilized by the same network to predict the action units for each participant.
Before analyzing JAA-Net, the term action unit should be defined. Facial action units are fundamental actions of muscles or groups of muscles, and depending on which are active, one can estimate each participant's emotion using the facial action coding system (FACS) [47]. They are used for emotion detection and pain recognition [48]. For example, an action unit could consist of the upper and lower parts of the lips, and when active, it could mean that the person has an open mouth; when this action unit is active in conjunction with one more unit, it could mean the person is happy. By employing the FACS, one can determine which action units should be active for each emotion and, utilizing this knowledge, estimate the emotional state of each participant. The loss used for predicting the AUs is given by:
where nau denotes the total number of detected action units, while pj and p̂j represent the ground-truth and predicted occurrence probabilities of the j-th AU, which is 0 when it does not occur and 1 otherwise. The weight wj is introduced to alleviate the data imbalance problem, which frequently occurs because some AUs are activated far less often than others in the dataset. The wj is calculated as follows:
where wj is the weight and rj is the occurrence rate of the j-th AU in the training set.
The JAA-Net deep learning framework consists of four modules. The first module, hierarchical and multi-scale region learning, extracts multi-scale features from local regions of different sizes. The second module, face alignment, is responsible for aligning the face according to the nose tip. The third module, global feature learning, captures the structure and texture information of the whole face. The last module, adaptive attention learning, refines the attention map of each AU to capture features at different locations; these features are then integrated with the face alignment and global features for the final AU detection. The face alignment, global feature learning, and adaptive attention learning modules are jointly optimized and share the hierarchical and multi-scale region learning layers.
Emotions are not strictly tied to engagement, but they can provide extra information to the user. According to [49], specific AUs can classify frustration, confusion, and delight but cannot distinguish boredom from a neutral state, meaning they can provide intriguing information beyond emotion.
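To make the AU-to-emotion step concrete, the sketch below maps sets of active AUs to emotions using simplified FACS-style prototypes; both the prototype sets and the matching rule are illustrative assumptions, not the exact rules implemented in the framework.

```python
# Simplified, illustrative AU-to-emotion prototypes (keys are FACS action-unit numbers).
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},
    "sadness":   {1, 4, 15},
    "surprise":  {1, 2, 5, 26},
    "fear":      {1, 4, 20, 26},
    "disgust":   {9, 15},
}

def estimate_emotion(active_aus):
    """Return the emotion whose full prototype is contained in the active AUs,
    preferring larger matches and falling back to neutrality."""
    best, best_size = "neutrality", 0
    for emotion, prototype in EMOTION_PROTOTYPES.items():
        if prototype <= active_aus and len(prototype) > best_size:
            best, best_size = emotion, len(prototype)
    return best

print(estimate_emotion({6, 12, 25}))  # happiness
```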
3.5 Yawn Estimation
In addition to analyzing the emotions of each participant, an extra module was added to detect those who felt bored and were yawning. This module measures the distance between the lips, and if it is above a threshold, it detects yawning and classifies the person as bored. An analysis of which action units are active when someone is yawning showed no difference from those of a happy person; in particular, AU25 and AU1, which indicate an open mouth and an eyebrow lift, were active in most yawning conditions. Hence, an alternative approach to detecting yawning was necessary. During the design of the yawn detection module, various experiments were conducted, including methodologies incorporating eye detection; however, the simplest solution that proved effective in most cases, while balancing efficiency and computational cost, was selected. Existing models were reused where possible to minimize cost and time.
Therefore, the Dlib model was used again. It is first employed to find the 68 landmarks L that support emotion estimation; the same 68 landmarks can also support yawn detection. Through this model, the points corresponding to the upper and lower lips were identified. Let the top-lip points be tl and the bottom-lip points be bl. After detecting the L points, their average locations \(\overline{tl}\) and \(\overline{bl}\) are calculated to find the centers of the upper and lower lips. Upon obtaining the two centers, represented by C1 and C2, the distance D between them is calculated, and if this distance is above a threshold, the mouth is wide open, meaning that the participant is yawning. The opening of the mouth differs when someone laughs or yawns, so it is better to compare these distances than to use action units in this case. The threshold value was established through fine-tuning experiments to identify the setting that best suits the framework's requirements; while the threshold remains fixed within the framework, it can be readily adjusted to accommodate user-specific needs. The formula used for calculating the distance between the lips is the following:

$$D = \left| \overline{tl} - \overline{bl} \right|$$

where D represents the distance between the lips, \(\overline{tl}\) is the average of the top-lip landmarks, and \(\overline{bl}\) is the average of the bottom-lip landmarks; their absolute difference provides the distance. If D exceeds the threshold, the individual is classified as yawning; otherwise, they are not. The methodology is shown in Algorithm 1.
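A minimal sketch of this rule with Dlib's 68-point shape predictor is shown below; averaging the inner-lip landmarks (indices 61–63 and 65–67) and the pixel threshold of 25 are illustrative choices rather than the exact settings tuned for the framework.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard Dlib model file

def is_yawning(gray_frame, face_rect, threshold=25.0):
    """Compare the vertical distance between the averaged inner upper-lip and
    inner lower-lip landmarks against a threshold (illustrative value)."""
    shape = predictor(gray_frame, face_rect)
    top_lip = np.mean([shape.part(i).y for i in (61, 62, 63)])     # inner upper lip
    bottom_lip = np.mean([shape.part(i).y for i in (65, 66, 67)])  # inner lower lip
    return abs(top_lip - bottom_lip) > threshold
```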
3.6 Audio Event Detection
Audio event detection is the final module in the online event analysis framework. It infers silence, speech, and clapping, and it can also estimate how many times each event occurred, how loud it was, and whether the speaker was calm or neutral while speaking. The audio of each event can provide crucial information to the user; for instance, highlight moments are sometimes followed by applause.
The sound event module starts after the yawn detection module by converting the video file to a WAV file containing only the audio. After this conversion, the WAV file is split into N-second segments S, with overlaps between segments for better analysis. The framework processes each segment separately, assigning labels based on the predominant event occurring within that specific timeframe. The overlapping technique enhances the analysis accuracy, allowing events to be detected irrespective of when they appear within a segment.
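The segmentation with overlap can be sketched as follows, assuming LibROSA is used to load the WAV file; the 5-s segment length matches Sect. 4, while the 2.5-s overlap is an assumption for illustration.

```python
import librosa

def split_with_overlap(wav_path, segment_s=5.0, overlap_s=2.5):
    """Split the extracted WAV into fixed-length, overlapping segments that are
    later classified independently."""
    signal, sr = librosa.load(wav_path, sr=None)       # keep the original sampling rate
    seg_len = int(segment_s * sr)
    hop = int((segment_s - overlap_s) * sr)
    segments = [signal[start:start + seg_len]
                for start in range(0, max(len(signal) - seg_len + 1, 1), hop)]
    return segments, sr
```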
These files are then transformed into images called spectrograms, which feed a pre-trained CNN, discussed in Sect. 4, for classification. Each segment S is analyzed to identify the presence of claps, pauses, or speech. A spectrogram represents the spectrum of frequencies F, showing how it changes over time T. The architecture of the audio analysis system is presented in Fig. 5. The spectrograms were generated using the LibROSA [50] library, and the short-time Fourier transform (STFT) [39] was applied following the formula:

$$FT(m, f) = \sum_{n=-\infty}^{\infty} x(n)\, w(n-m)\, e^{-j 2\pi f n}$$

where FT(m, f) represents the short-time Fourier transform, m is the discrete position of the analysis window along the time axis, and f is the discrete frequency; both are quantized, and a fast Fourier transform (FFT) is applied to perform the STFT. The function to be transformed is x(n), and it is multiplied by a window function w(n) that is non-zero only over a short period. As the window slides along the time axis, the Fourier transform of the resulting signal is computed, yielding a two-dimensional representation of the signal.
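In practice, this step reduces to a few LibROSA calls; the sketch below is an assumed implementation, with the FFT size and hop length as illustrative defaults rather than the paper's parameters.

```python
import numpy as np
import librosa

def segment_to_spectrogram(segment, sr, n_fft=2048, hop_length=512):
    """Turn an audio segment into a dB-scaled magnitude spectrogram that can be
    saved as an image and fed to the CNN classifier."""
    stft = librosa.stft(segment, n_fft=n_fft, hop_length=hop_length)  # short-time Fourier transform
    return librosa.amplitude_to_db(np.abs(stft), ref=np.max)          # log-magnitude spectrogram
```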
These N spectrograms feed a neural network described in Sect. 4.2, and features are extracted. Additionally, the methodology computes the root mean square (RMS) of every chunk C to analyze features F such as the overall volume and the volume of each sound event, allowing the system operator to compare claps. On top of this analysis, an audio emotion analysis is conducted using a multi-layer perceptron classifier trained with the RAVDESS [51] dataset, as detailed in Sect. 4. This classifier detects whether the speaker is calm or neutral during speech; it achieved an F1 score of 86.11% during training and can report the percentage distribution of calm versus non-calm speech, offering valuable context on emotional expression. The output of the audio processing module is a JSON file containing all this information. The audio processing methodology is shown in Algorithm 2.
4 Experimental Results
Experiments were conducted to evaluate how this multi-modal system performs when an online event is given as input. First, scenarios were created for each participant who wanted to take part, and then online recordings were conducted. A presentation was held in each session, and a script was prepared for each participant describing the emotions that should appear, together with reactions such as claps, yawns, and more. All experiments are presented in this section, together with the obtained results.
4.1 Dataset
Five videos formed a dataset with a total duration of 46 min. Initially, a 4-min video was created to examine the methodology and to help develop the following sessions. After this, four more sessions followed, including various participants. Before these sessions, a participation form concerning personal data protection was created; through it, users could consent to the use of their data for this research. The maximum number of participants was 8, and the average was 6.
Scenarios were created for each participant and changed in each session to include various reactions from everyone. During these sessions, there was always one central presenter, and each participant had an open camera to allow the methodology to analyze each face. The possible reactions in the scenarios included yawning, laughing, looking at a cell phone, affirmation, applause, and typical attendance. All participants performed various expressions, creating a more comprehensive dataset.
Each video was then analyzed by extracting one frame every 10 s. These frames were examined, and each participant's emotion, pose, and attendance were manually registered. Subsequently, the frames were analyzed using the methodology, and comparisons were made with the ground truths. The audio analysis of each event was performed on the entire video without using fragmentary samples. The methodology was tested both on the generated dataset and through integration into the LiveMedia platform [52]. The analysis encompassed videos exceeding 10 h, and user evaluations were conducted to assess the methodology's features [53].
4.2 Implementation Details
This section provides more details about the networks used and how they were trained. The module used for face detection is based on a model called RetinaFace [40], which implements RetinaNet [54]. As stated above, this module consists of five feature maps with different dimensions. The first four feature maps are produced using a pre-trained model named ResNet-152 [44]. The dataset used to train this network was ImageNet-11k [55], which uses the WordNet hierarchy, where each node is represented by thousands of images, forming an image database.
The network used for pose estimation is Hopenet. It was trained on a synthetically expanded dataset, following the path of [38], which used the same dataset type to train the landmark prediction model. It was tested on the challenging AFLW2000 [23] dataset, which contains 2000 images annotated with 68 3D landmarks using a 3D model fitted to each face; it includes fine-grained pose annotations and is a prime candidate for such tasks. The other dataset used for testing was BIWI [56], which was gathered in a lab setting using a Kinect v2 device and contains RGB-D video of different subjects with varying head poses.
The deep neural network JAA-Net [45] was trained using the DISFA dataset [57] for emotion estimation. This dataset contains videos of the participants, which can be divided into frames. Each video lasts 242 s, and there are 27 candidates. In addition to the videos, it contains landmarks for each participant and 12 action units manually annotated by researchers specialized in FACS. Half of these action units concern the upper face and the other half the lower face. Each participant watched a video intended to elicit laughter, crying, disgust, and other emotions, so many emotions are included in this dataset. The training set contained 101,744 frames and the testing set 29,070. The training lasted 28 days over 78 epochs, reaching a 63.30% F1 score. The output of this network includes frames that visualize the top 4 action units with the participant as a background. In addition to the visualization, action unit scores are extracted for each image; each score represents the predicted probability of the action unit being active. Scores above 0.5 are set to 1 and those below to 0, and the emotion is estimated according to these binary action units.
The network used for audio event detection is described in [58] and follows the DenseNet-121 [59] architecture. The model was pre-trained with a custom dataset specially created for this purpose from 403 event videos, so it contains claps, speech, and pauses. The number of training samples was 2,187, while the testing samples were 547. This dataset was created by converting the video files into WAV format, splitting them into segments of five seconds each, and then converting the segments into spectrograms for model creation. The training lasted less than 1 h over 9 epochs, reaching a 99.49% F1 score.
A multi-layer perceptron classifier was trained with the RAVDESS [51] dataset for the audio emotion analysis. This dataset consists of 24 candidates, each with 60 samples. Three features were used to detect two emotions, calm and neutral: the Mel frequency cepstral coefficients, which represent the short-term power spectrum of the sound; the chroma, which concerns 12 different categories of voice tone; and the Mel spectrogram, which captures the frequency content of the audio file. These signals were utilized to detect whether the speaker was calm or neutral, adding another feature employed in the analysis.
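A sketch of how such features could be extracted and fed to the classifier is given below, assuming LibROSA and scikit-learn; the feature dimensions, the time-averaging of features, and the network size are assumptions rather than the reported configuration.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def speech_emotion_features(wav_path):
    """Extract the three feature groups mentioned above (MFCC, chroma, Mel) and
    average each over time into a single vector."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel])

# Training sketch (file lists and labels are placeholders):
# X = np.vstack([speech_emotion_features(f) for f in wav_files])
# clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500).fit(X, labels)
```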
4.3 Results
The methodology starts with keyframe extraction, which was set to extract one frame for every 10 s of video. All frames were correctly extracted, and then the face detection module started. In all cases, the faces were accurately identified. In some frames, one extra face was detected because of a portrait hanging behind a participant; this face detection outcome was excluded from the experimental results. A user can recognize such cases by noticing that the pose never changes, indicating a face in a picture rather than in a video.
The formulas used to evaluate this methodology are described in [60]. Initially, accuracy was measured, but a metric was needed that would not give much weight to true-negative values; therefore, the F1 score was calculated for each module, which also requires measuring recall and precision. Findings are provided both for the online event dataset and for the utilized networks. The performance evaluation of face detection, emotion estimation, and audio event detection relied on the F1 score, facilitating comparisons between the results of this framework on real-event data and those on different datasets. Apart from that, this metric was selected for its versatility, aligning seamlessly with all tasks executed by the framework.
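For reference, precision, recall, and the F1 score can be computed directly with scikit-learn; the labels below form a toy example, not the study's data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Per-frame binary labels: 1 = uninterested/bored, 0 = interested (toy example).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # 2 * P * R / (P + R)
print(f"P={precision:.2f}, R={recall:.2f}, F1={f1:.2f}")
```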
The first module examined is pose estimation, which can determine where a participant looks; from this, assumptions about interest can be made. If an individual stares too high or too low, it suggests diminished interest in what the speaker is saying, and the same hypothesis holds for those looking too far left or right. The pose estimation module uses Hopenet [43], which, according to its authors, has a mean absolute error (MAE) of 6.15° on the AFLW2000 [23] dataset. The MAE represents the model's average error, where each error is the absolute difference between the predicted and actual values. This error is low, which makes the network well suited for estimating the participants' poses. On the same dataset, this method outperforms other models such as FAN [22], with an MAE of 9.11°, and 3DDFA [23], with an MAE of 7.39°.
A researcher manually annotated each frame to estimate the participants' engagement according to their pose. Looking at a phone earned the label uninterested, whereas looking at the screen, or close to it, was labeled as interested. Given the ground-truth values, true positives were counted when someone was bored and the methodology, based on pose, also assigned the label bored. True negatives were counted when someone was watching with interest and the algorithm detected this accurately. False positives were counted when someone was marked as bored but was not, while false negatives were counted when a participant was uninterested but the methodology did not detect it.
First, the accuracy was calculated for frames containing at least one person who appeared uninterested, yielding 88.98% for these frames. True negatives occurred when people were interested and looking at the screen, which was the case most of the time. To reduce the influence of these results, the F1 score was calculated over every frame of the dataset. The F1 score for pose-based interest estimation was 79.21%, indicating the potential of this approach. The methodology proved accurate even under challenging lighting conditions and struggled only when the entire face was not visible or was obstructed by a hand.
As shown in Fig. 6, participants were analyzed according to their pose. Some of them were looking at their mobile phones without paying attention to the speaker, who is shown in the second image box. The pose estimation module worked very well in this case, correctly finding everyone's pose. Each participant had different lighting conditions and distances between the face and the camera, but the methodology achieved good results without problems. Another crucial aspect of the methodology is the emotion estimation of the participants. To evaluate it, a researcher manually annotated the emotion of every participant in each extracted frame.
Then, the results were recorded based on the ground-truth and predicted values. Yawning detection was evaluated together with emotion estimation: when a participant was labeled as yawning, they were marked as bored. This is a way of adding a seventh emotion, the other six being happiness, sadness, neutrality, fear, disgust, and surprise. First, only frames in which at least one person had an emotion other than neutrality were examined; the accuracy on these frames was 83.46%. The methodology found many true-negative values in each frame, attributable to the number of people present: in a frame with eight individuals, most had neutral emotions, so there were several true negatives.
A true negative occurs when a person does not have an emotion different from neutral and the methodology correctly identifies this. A true positive was counted when there was an emotion other than neutral and the approach found it. A false positive was counted when the methodology found an emotion but the person had a neutral one, and a false negative when the person had a non-neutral emotion that the approach failed to detect.
The F1 score was calculated for all frames because it does not give weight to true negatives; therefore, it is the same for those that include emotion and those that do not. The F1 score for this module was 65.38%.
According to [45], the emotion estimation module achieved a 62.4% F1 score on the BP4D dataset [61]. This outperforms the methodologies proposed by Niu et al. [33] and Shao et al. [31], with F1 scores of 61.0% and 61.1%, respectively, on the same dataset, and is close to the result obtained on the online event dataset. The difference could be because some action units have higher F1 scores and appear most of the time in the online event dataset. An example is AU12, which concerns the lip corner puller and appears when someone smiles. This action unit is crucial for detecting happiness and frequently occurs in the online event dataset; the researchers obtained an F1 score of 88.2% for it on the BP4D dataset, which could explain the differences between the two datasets.
The module did not perform well when the subject was not appropriately lit. Light is crucial for this module because it works with landmarks: when the landmarks are not well detected, emotion estimation cannot work accurately. Likewise, when the face is far from the camera or only partially visible, the module cannot work properly and the emotions are not estimated correctly. Figure 7 presents a case in which the methodology correctly identified the participant's emotion; the predicted emotion appears in the top left part of the resulting image.
The frame consists of four sub-frames that visualize the participant's top 4 active action units. Note that this image has a slight rotation because the face was not fully aligned; the methodology rotated the frame to the left to achieve correct alignment. The face alignment can also be seen in Fig. 7, which presents two participants: the upper part shows keyframes taken from the video, while the lower part shows the emotion estimation for those frames.
The next module examined was responsible for detecting and classifying sound events. This module provides the system operator with valuable information regarding the audio of the online event, such as how loud the speaker was, whether the speech was calm or neutral, and how loud each sound event was. The detected sound events include speech, applause, and pauses in the absence of speech. The module was trained with samples from events, which improved performance because the cases were similar. Its F1 score on a testing dataset excluded from training, comprising 50 event videos, was 86.03%; this score was measured when the module was created, using event videos that were not online events. According to [39], this methodology has an accuracy of 95.21%, exceeding that of SoreNet [35], which achieved 88.9%, and AENet [35], with 81.5%. All these methodologies used spectrograms to classify sound into the appropriate categories and were compared on the same MIVIA Audio Events Dataset [56].
In the case of online events, the approach worked well, detecting claps and speech appropriately. Occasionally, when claps occurred suddenly while the speaker was talking, the methodology erroneously labeled that segment as speech rather than clapping. The sound of each file is segmented every 5 s; for a segment to be labeled as a clap, the clapping should last more than 2.5 s and the speaker should not be talking. Otherwise, if speech occupies most of the segment, it is labeled as speech.
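This labeling rule can be sketched as follows; the per-event durations within a segment are assumed to come from the detector, and the handling of pauses is an illustrative assumption.

```python
def label_segment(event_durations, segment_s=5.0):
    """Predominance rule for a 5-s segment: 'clap' only when clapping covers more
    than half of the segment and speech does not; otherwise speech or pause wins."""
    clap = event_durations.get("clap", 0.0)
    speech = event_durations.get("speech", 0.0)
    if clap > segment_s / 2 and speech < segment_s / 2:
        return "clap"
    if speech >= segment_s / 2:
        return "speech"
    return "pause"

print(label_segment({"speech": 3.5, "clap": 1.0}))  # speech
```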
Each candidate in the created dataset had to clap randomly during the event. This way, claps are detected at any time and will not be at the end of each video, which is the case in many event recordings. The methodology incorporates a neural network based on the DenseNet-121 architecture, meticulously trained with a custom dataset extracted from 403 event videos containing speech, applause, and pauses. This neural network serves the purpose of distinguishing between these audio events within the videos.
The methodology achieved an F1 score of 80% in finding claps, which means that most of the time it found where they appeared. The difference between this score and the one from training is explained by the fact that online events present more favorable conditions, as each candidate has a personal microphone close to their face. The results are presented in Table 1. This score is achieved using spectrograms, which, as mentioned before, visually represent the frequency changes of a signal over time; thus, improved results are obtained by treating the problem with computer vision techniques. Alongside the table presenting the F1 scores, confusion matrices are included in Fig. 8 to enhance comprehension.
Analyzing videos from online events can provide users with crucial insights into each participant's level of interest throughout the event. It helps the speaker understand where to add engaging material to maintain a high interest level, and the organizers to understand which parts are the most interesting. Furthermore, it is used to extract the top moments of each video. Figure 9 presents a joint application of the proposed method, with the top four video moments shown as circles on the blue bar; these moments are produced based on engagement and correspond to the points where it was highest during the event.
A platform [52] was employed to enhance the evaluation, allowing the application to be deployed and many of the platform's videos from real-world events to be analyzed. Following this deployment, platform users evaluated the framework, yielding promising results; further details on these outcomes are provided in [62]. This evaluation addresses the concern of real-world applicability: the results affirm the practical utility and effectiveness of the proposed framework and provide compelling evidence of its viability in real-world settings where event analysis is required.
For a typical 10-min video analysis with a resolution of 1920 × 1080 and 6 frames per minute, the framework required approximately 40 min end to end. All experiments were run on an Nvidia GTX 1660 GPU. If the system operator requires fewer frames per minute, accepting less information, a much faster analysis can be achieved. The results show that this is a robust framework that can work in multiple situations involving online video events, and it can be a helpful tool for those who want to analyze their talks, especially as hybrid events have increased recently.
5 Conclusion
This paper introduces a methodology for analyzing online video events to observe the engagement of each participant throughout their duration. The proposed framework uses keyframe extraction, face detection, pose and emotion estimation, and sound event detection modules to retrieve important information through content analysis. In a newly created realistic dataset, this framework achieved promising results with an F1 score of 79.21% in interest estimation according to the pose of each candidate, 65.38% F1 in emotion estimation, and 80% F1 in the sound event detection module. The combination of these can provide intriguing information to the system operator about the total interest of the participants during the event. Online video events are increasing, and speakers are trying to make their presentations more approachable and intriguing, which shows the importance of applying content analysis to retrieve practical information.
One limitation of the method is that it assumes each participant has an open webcam focused on the face. If many participants appear in one camera, far from their faces, it is difficult to estimate their emotions and pose, and the methodology will not work correctly. Comparisons with similar proposals are impractical, as none exist. In future work, this system will be deployed with more features on a larger scale to improve the model's prediction accuracy, including distinguishing between boredom and engagement using FACS. Additionally, there are plans to concentrate on key moments and analyze only those frames with the highest interest.
Data Availability
The participants of this study did not give written consent for their data to be shared publicly; therefore, due to the sensitive nature of the research, supporting data are not available.
References
Jiao, Z., Lei, H., Zong, H., Cai, Y., Zhong, Z.: Potential escalator-related injury identification and prevention based on multi-module integrated system for public health. Mach. Vis. Appl. 33, 29 (2022). https://doi.org/10.1007/s00138-022-01273-2
Citraro, L., Márquez-Neila, P., Savarè, S., Jayaram, V., Dubout, C., Renaut, F., Hasfura, A., Shitrit, B., Fua, P.: Real-time camera pose estimation for sports fields. Mach. Vis. Appl. 31, 16 (2020). https://doi.org/10.1007/s00138-020-01064-7
Vrochidis, A., Dimitriou, N., Krinidis, S., Panagiotidis, S., Parcharidis, S., Tzovaras, D.: A multi-modal audience analysis system for predicting popularity of online videos. In: Iliadis, L., Macintyre, J., Jayne, C., Pimenidis, E. (eds.) EANN 2021, 3, 465–476 (2021). https://doi.org/10.1007/978-3-030-80568-5_38
Kokila, M.L.S., Christopher, V.B., Sajan, R.I., Akhila, T.S., Kavitha, M.J.: Efficient abnormality detection using patch-based 3D convolution with recurrent model. Mach. Vis. Appl. 34, 54 (2023). https://doi.org/10.1007/s00138-023-01397-z
Vrochidis, A., Dimitriou, N., Krinidis, S., Panagiotidis, S., Parcharidis, S., Tzovaras, D.: Video Popularity prediction through fusing early viewership with video content. In: Vincze, M., Patten, T., Christensen, H., Nalpantidis, L., Liu, M., (eds.) Computer Vision Systems, ICVS 2021, 12899, 159–168 (2021). https://doi.org/10.1007/978-3-030-87156-7_13
Chen, R., Zhou, P., Wang, W., Chen, N., Peng, P., Sun, X., Wang, W.: PR-Net: preference reasoning for personalized video highlight detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7980–7989 (2021). arXiv:2109.01799
Abed, R., Bahroun, S., Zagrouba, E.: KeyFrame extraction based on face quality measurement and convolutional neural network for efficient face recognition in videos. Multimed. Tools Appl. 80, 23157–23179 (2021). https://doi.org/10.1007/s11042-020-09385-5
Song, X., Fan, G.: Joint key-frame extraction and object segmentation for content-based video analysis. IEEE Trans. Circuits Syst. Video Technol. 16(7), 904–914 (2006)
Zhang, M., Tian, L., Li, C.: Key frame extraction based on entropy difference and perceptual hash. In: IEEE International Symposium on Multimedia (ISM) 2017, pp. 557–560 (2017).
Milan, K.A.P., Jeyaraman, K., Arockia, P.J.R.: Key-frame extraction techniques: a review. Recent Patents Comput. Sci. 11, 1 (2018). https://doi.org/10.2174/2213275911666180719111118
Sun, Y., Li, P., Jiang, Z., Hu, S.: Feature fusion and clustering for key frame extraction. Math. Biosci. Eng. 18(6), 9294–9311 (2021). https://doi.org/10.3934/mbe.2021457
Tang, B., Chen, W.: A description scheme for video overview based on scene detection and face clustering. J. Circuits Syst. Comput. 30(1), 2150002 (2021). https://doi.org/10.1142/S021812662150002X
Luo, D., Wen, G., Li, D., Hu, Y., Huan, E.: Deep-learning-based face detection using iterative bounding-box regression. Multimed. Tools Appl. 77, 24663–24680 (2018). https://doi.org/10.1007/s11042-018-5658-5
Mo, H., Liu, L., Zhu, W., Li, Q., Liu, H., Yin, S., Wei, S.: A multi-task hardwired accelerator for face detection and alignment. IEEE Trans. Circuits Syst. Video Technol. 30(11), 4284–4298 (2020)
Li, X., Lai, S., Qian, X.: DBCFace: towards pure convolutional neural network face detection. IEEE Trans. Circuits Syst. Video Technol. 32(4), 1792–1804 (2022)
Jiang, H., Learned-Miller, E.: Face detection with the faster R-CNN. In: 12th IEEE International Conference on Automatic Face & Gesture Recognition 2017, pp. 650–657 (2017). https://doi.org/10.1109/FG.2017.82
Yang, S., Luo, P., Loy, C.C., Tang, X.: WIDER FACE: A face detection benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5533 (2016). https://doi.org/10.1109/CVPR.2016.596
Jain, V., Learned-Miller, E.: FDDB: A benchmark for face detection in unconstrained settings. University of Massachusetts, Amherst technical report 2, 4 (2010).
Klare, B.F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., Burge, M., Jain, A.K.: Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus benchmark A. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1931–1939 (2015)
Shi, X., Shan, S., Kan, M., Wu, S., Chen, X.: Real-time rotation-invariant face detection with progressive calibration networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2295–2303 (2018). https://doi.org/10.1109/CVPR.2018.00244
Triantafyllidou, D., Nousi, P., Tefas, A.: Fast deep convolutional face detection in the wild exploiting hard sample mining. Big data Res. 11, 65–76 (2018). https://doi.org/10.1016/j.bdr.2017.06.002
Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2D & 3D face alignment problem? (and a Dataset of 230,000 3D Facial Landmarks). In: IEEE International Conference on Computer Vision, pp. 1021–1030 (2017). https://doi.org/10.1109/ICCV.2017.116
Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S. Z.: Face alignment across large poses: a 3d solution. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 146–155 (2016). https://doi.org/10.1109/CVPR.2016.23
Park, J., Heo, S., Lee, K., Song, H., Lee, S.: Robust facial pose estimation using landmark selection method for binocular stereo vision. In: 25th IEEE International Conference on Image Processing (ICIP), 186–190 (2018). https://doi.org/10.1109/ICIP.2018.8451443
Paracchini, M., Marcon, M., Tubaro, S.: Fast and reliable facial landmarks localization in non frontal images. In: 8th European Workshop on Visual Information Processing (EUVIP), pp. 88–92 (2019). https://doi.org/10.1109/EUVIP47703.2019.8946249
Derkach, D., Ruiz, A., Sukno, F.M.: Head pose estimation based on 3-D facial landmarks localization and regression. In: 12th IEEE International Conference on Automatic Face & Gesture Recognition 2017, pp. 820–827 (2017)
Fard, A. P., Abdollahi, H., Mahoor, M.: ASMNet: a lightweight deep neural network for face alignment and pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1521–1530 (2021). https://doi.org/10.1109/CVPRW53098.2021.00168
Yang, X., Jia, X., Yuan, M., Yan, D.M.: Real-time facial pose estimation and tracking by coarse-to-fine iterative optimization. Tsinghua Sci. Technol. 25(5), 690–700 (2020). https://doi.org/10.26599/TST.2020.9010001
Koujan, M. R., Alhabawee, L., Giannakakis, G., Pugeault, N., Roussos, A.: Real-time facial expression recognition “In The Wild” by disentangling 3D expression from identity. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 24–31 (2020). https://doi.org/10.1109/FG47880.2020.00084
Xi, O., Kawaai, S., Goh, E. G. H., Shen, S., Wan, D., Ming, H., Huang, D. Y.: Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In: 19th ACM International Conference on Multimodal Interaction 2017, 577–582 (2017). https://doi.org/10.1145/3136755.3143012
Shao, Z., Liu, Z., Cai, J., Wu, Y., Ma, L.: Facial action unit detection using attention and relation learning. IEEE Trans. Affect. Comput. (2019). arXiv:1808.03457
Rizwan, S.A., Ghadi, Y., Jalal, A., Kim, K.: Automated facial expression recognition and age estimation using deep learning. Comput. Mater. Contin. 71, 3 (2022). https://doi.org/10.32604/cmc.2022.023328
Niu, X., Han, H., Yang, S., Huang, Y., Shan, S.: Local relationship learning with person-specific shape regularization for facial action unit detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11909–11918 (2019). https://doi.org/10.1109/CVPR.2019.01219
Phan, H., Pham, L., Koch, P., Duong, N.Q.K., McLoughlin, I., Mertins, A.: Audio event detection and localization with multitask regression network. Technical Report (2020)
Greco, A., Saggese, A., Vento, M., Vigilante, V.: SoReNet: a novel deep network for audio surveillance applications. In: IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 546–551 (2019). https://doi.org/10.1109/SMC.2019.8914435
Greco, A., Petkov, N., Saggese, A., Vento, M.: AReN: a deep learning approach for sound event recognition using a brain inspired representation. IEEE Trans. Inf. Forensics Secur. 15, 3610–3624 (2020). https://doi.org/10.1109/TIFS.2020.2994740
Romanov, S.A., Kharkovchuk, N.A., Sinelnikov, M.R., Abrash, M. R., Filinkov, V.: Development of a non-speech audio event detection system. In: IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), 1421–1423 (2020).
Kao, C. C., Wang, W., Sun, M., Wang, C.: R-CRNN: Region-based convolutional recurrent neural network for audio event detection. (2018). https://doi.org/10.21437/Interspeech.2018-2323
Papadimitriou, I., Vafeiadis, A., Lalas, A., Votis, K., Tzovaras, D.: Audio-based event detection at different SNR settings using two-dimensional spectrogram magnitude representations. Electronics 9(10), 1593 (2020). https://doi.org/10.3390/electronics9101593
Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., Zafeiriou, S.: Retinaface: Single-stage dense face localisation in the wild. (2019). arXiv:1905.00641
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. Proc. Track, 9, 249–256 (2010). https://proceedings.mlr.press/v9/glorot10a.html. Accessed 22 Jan 2024
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 764–773 (2017). https://doi.org/10.1109/ICCV.2017.89
Ruiz, N., Chong, E., Rehg, J. M.: Fine-grained head pose estimation without key-points. In: IEEE Computer Vision and Pattern Recognition Workshops, 2155–215509 (2018). https://doi.org/10.1109/CVPRW.2018.00281
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Shao, Z., Liu, Z., Cai, J., Ma, L.: JAA-Net: joint facial action unit detection and face alignment via adaptive attention. Int. J. Comput. Vis. 129(2), 321–340 (2021). https://doi.org/10.1007/s11263-020-01378-z
King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
Ekman, P., Rosenberg, E.L.: What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press (1997)
Hinduja, S., Canavan, S., Kaur, G.: Multimodal fusion of physiological signals and facial action units for pain recognition. In: 15th IEEE International Conference on Automatic Face and Gesture Recognition, 577–581 (2020). https://doi.org/10.1109/FG47880.2020.00060
Grafsgaard, J., Wiggins, J.B., Boyer, K.E., Wiebe, E.N., Lester, J.: Automatically recognizing facial expression: predicting engagement and frustration. In: Educational Data Mining, pp. 43–50 (2013)
McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto, O.: librosa: Audio and music signal analysis in python. In: Proceedings of the 14th Python in Science Conference, 8, 18–25 (2015). https://doi.org/10.25080/Majora-7b98e3ed-003
Livingstone, S.R., Russo, F.A.: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5), e0196391 (2018). https://doi.org/10.1371/journal.pone.0196391
LiveMedia Platform. INVENTICS A.E., Home Page. https://www.livemedia.gr (2023). Accessed 18 Jan 2024.
Vrochidis, A., Tsita, C., Dimitriou, N., Krinidis, S., Panagiotidis, S., Parcharidis, S., Chatzis, V.: User perception and evaluation of a deep learning framework for audience engagement analysis in mass events. In: International Conference on Human-Computer Interaction, pp. 268–287, (2023).
Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: IEEE International Conference on Computer Vision (ICCV), 2999–3007 (2017). https://doi.org/10.1109/ICCV.2017.324
Ridnik, T., Ben-Baruch, E., Noy, A.: ImageNet-21K Pretraining for the Masses. (2021). arXiv:2104.10972
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M.: Reliable detection of audio events in highly noisy environments. Pattern Recogn. Lett. 65, 22–28 (2015). https://doi.org/10.1016/j.patrec.2015.06.026
Mavadati, S.M., Mahoor, M.H., Barlett, K., Trinh, P., Cohn, J.F.: DISFA: a spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 4(2), 151–160 (2012). https://doi.org/10.1109/T-AFFC.2013.4
Vafeiadis, A., Kalatzis, D., Votis, K., Giakoumis, D., Tzovaras, D., Chen, L., Hamzaoui, R.: Acoustic scene classification: from a hybrid classifier to deep learning. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017) (2017). https://dora.dmu.ac.uk/handle/2086/15000
Huang, G., Liu, Z., Maaten, L., Weinberger, K.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017). arXiv:1608.06993
Hand, D., Christen, P.: A note on using the F-measure for evaluating record linkage algorithms. Stat. Comput. 28(3), 539–547 (2018). https://doi.org/10.1007/s11222-017-9746-6
Zhang, X., Yin, L., Cohn, J., Canavan, S., Reale, M., Horowitz, A., Liu, P., Girard, J.: BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput. 32, 692–706 (2014). https://doi.org/10.1016/j.imavis.2014.06.002
Vrochidis, A., Tsita, C., Dimitriou, N., Krinidis, S., Panagiotidis, S., Parcharidis, S., Tzovaras, D., Chatzis, V.: User Perception and evaluation of a deep learning framework for audience engagement analysis in mass events. In: International Conference on Human-Computer Interaction, 268–287 (2023). https://doi.org/10.1007/978-3-031-48057-7_17
Acknowledgements
This research has been financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship, and Innovation, under the call RESEARCH–CREATE–INNOVATE (project code: LiveMedia++, T1EDK-04943) and the HORIZON innovation program under grant agreement No 101135556 (project INDUX-R). This publication reflects only the authors’ views.
Author information
Contributions
AV: writing, editing, methodology. ND: conceptualization, methodology, editing. SK: supervision, editing. SP: visualization. SP: validation. DT: supervision.
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.
Informed Consent
Informed consent was obtained for this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Vrochidis, A., Dimitriou, N., Krinidis, S. et al. A Deep Learning Framework for Monitoring Audience Engagement in Online Video Events. Int J Comput Intell Syst 17, 124 (2024). https://doi.org/10.1007/s44196-024-00512-w