1 Introduction

Humans and animals spontaneously connect to other individuals. The neurological mirroring system [1], observed in humans, primates, and other species, including dogs, cats, and birds, provides the physiological mechanism for perception-action coupling. This mechanism is essential for understanding the actions of others, mastering skills by imitation, [2] understanding intentions and emotions, and engaging in inter-species and intra-species empathy. Although the expression of emotions differs between species [3], the proven similarity of reactions between humans and dogs is intuitively attributed to their mirroring systems [4]. For example, dog owners know well that inter-species behaviors of contagious joy, relaxation, and yawning often occur in human--dog interaction.

In this paper, we investigate the automated recognition of emotions shared between humans and dogs, focusing on the following research questions:

  1. What are the critical biases and technical considerations when experimenting on real-world data as compared to data from a controlled environment?

  2. Which canine emotions are most often misclassified, and is this issue critical for the recognition of potentially dangerous situations?

  3. Can the classification of dog emotions be improved by focusing on the DogFACS coding descriptors?

We begin our study by identifying the challenges of emotion recognition in dogs, starting from the literature on emotion recognition and dog behavior. Building on the recent introduction of automated dog emotion recognition, we focus on classifying five specific pleasant and unpleasant emotions in dogs and then consolidate these findings into higher-level categories of danger, i.e., dangerous/non-dangerous inter-species interactions.

The potential applications of this study span diverse domains, ranging from the development of empathy-enabled smart assistants to the enhancement of robotic pets’ empathetic functions [5,6,7]. For example, in environments where social robots and dogs coexist, AI-driven emotion recognition could enable robots to adapt their behavior accordingly and provide timely alerts to humans to ensure that dogs are not under undue stress. [8,9,10,11] These intelligent systems could also interpret canine emotions to alert owners, veterinarians, and animal shelters or rescue organizations to a dog’s distress or discomfort, so they can respond appropriately, strengthening the dog-owner bond and potentially preventing harmful situations. This research also paves the way for advanced real-time systems in prosthetic knowledge, [12] where AI with empathetic capabilities could become critical in ensuring secure and responsive interaction with pets for people with cognitive or sensory challenges [13,14,15]. Last but not least, this technology can contribute to ethological studies and welfare guidelines.

The main contributions of this work can be summarized as follows:

  • Classification generalizability of dog emotions across breeds and environmental conditions through Transfer Learning, enhancing deep-learning reliability in real-world settings and moving beyond controlled laboratory conditions.

  • Enhanced dataset bias mitigation through image preprocessing, providing a systematic comparison of segmentation strategies to eliminate background elements that could lead to biased classification.

  • Systematic comparative analysis of neural network models, including a five-emotion model and the classification of dangerous situations.

1.1 Challenges of emotion recognition in dogs

To better position our task and identify gaps in the literature, in this subsection we analyze the challenges specific to canine emotion recognition, starting from human emotion recognition and then differentiating between the two.

Automated human emotion recognition from images using deep learning is a mature research topic with a solid state of the art, readily implementable with machine learning techniques that identify critical points of the face [16] or with deep neural networks that exploit facial details, such as micro-expressions, in high-resolution images. [17,18,19,20] Despite differences due to ethnicity, gender, and age, human-face-based emotion recognition can rely on strong generalizability, owing to the similarities among human faces and the consistent expression of eyes, eyebrows, mouth, and surrounding skin areas. [21, 22] Although dogs and humans respond differently to emotionally competent stimuli, [3] the same process has proved promising for dogs. [23,24,25]

In dogs, we observe a broader range of critical differences between breeds, including, e.g., head shape, fur length, coat texture, and body proportions. The coat color distribution can also be problematic, e.g., in breeds with spots, which may hinder shape recognition and segmentation algorithms. Even within a single breed, it is much harder to claim that findings and conclusions from a sample population apply to the entire population. Moreover, in dogs we cannot exploit micro-expressions, and the state of the coat in real life (e.g., wet, cut, dirty) may influence the recognition algorithm’s performance. Other work [23, 24] suggests that recognizing the dog’s facial expression remains central to emotion recognition, and that performance can be improved by adding postural information.

When considering the annotation of canine data, a significant distinction from human subjects becomes apparent: unlike humans, dogs cannot communicate their state verbally or in writing. Consequently, dogs’ emotions in still images are inherently human-perceived; as of yet, a physiological dataset that can be reliably mapped to canine emotions remains unavailable. For videos, action units (e.g., DogFACS [26] data) can be mapped to emotions to support emotion explainability. [3] Although a dog’s owner may have a deep understanding of their pet’s behavior, we deem professional evaluations (e.g., from veterinarians, dog trainers, or behavioral or ethology experts) to be more accurate, being less subject to personal biases and perception.

Another challenge in dog emotion recognition arises from the dynamic nature of real-world environments, as opposed to controlled settings. Images or videos captured in real life may contain noise, such as varying backgrounds (e.g., many images may feature happy dogs running in grass fields, sad dogs on floors, or sleepy dogs on couches). Furthermore, frames captured with personal devices may be out of focus, poorly composed (i.e., with the dog positioned away from the center or focal points), or inadequately lit. These factors can introduce significant biases in deep learning, as the background, environment, and camera specifications may unknowingly influence the features assigned to emotion classes.

In the context of our research, our objectives include explicitly identifying potential biases and examining assorted data-cleaning methodologies. Strategically, we account for the cited background biases within the training dataset in ad-hoc experimental settings, subsequently deploying data-cleaning techniques to evaluate how well they mitigate such biases.

Summarizing, three main types of bias are likely to be common in dog emotion datasets, and it is critical to focus on them when creating novel datasets or designing studies with the aim of avoiding them:

  • Systemic bias: video data have a unique style, setting, or dog breed that does not generalize. Intra-video correlations, in other words, might lead the model to learn something specific, unrelated to general dog emotions, which does not apply to unseen videos.

  • Statistical or algorithmic bias: inherent bias in the training data. If certain dogs, breeds, environments, or situations are over-represented in the videos, the model might learn to associate these factors with certain emotions, leading to biases. For example, if most of the videos showing happiness were shot in parks, the model might incorrectly learn to associate parks with happiness.

  • Activity or selection bias: non-uniform distribution of emotions. If certain emotions are over-represented or under-represented in the training data, the model might learn to predict the more common emotions more frequently, biasing the emotion predictions.

1.2 Previous works

Recognition of animal emotions is an important area of research to improve animal welfare and scientific understanding, with important uses ranging from improving conditions for farm animals to assessing the welfare of laboratory rats.

An interesting case study in visual animal emotion recognition is the Grimace Scale, [27,28,29,30] used to measure animal suffering from visual expression, e.g., in farm-bred horses and sheep, or to understand when it is time to euthanize laboratory rats. [31]

Dogs’ emotions have been extensively analyzed in behavioral and ethology studies, in particular to demonstrate empathy between humans and dogs. [32, 33] Automated emotion recognition in dogs is still a novel research application area; the foundation of our research is an emotion model that also reflects human emotional experiences, [21, 22] and an evaluation of similarities and differences is included.

The Dog Facial Action Coding System (DogFACS) [26] is a manual annotation system used in other studies to describe changes in facial appearance based on movements of the underlying facial muscles in dogs. It provides a scientific observational tool for identifying and coding action units of facial movements in dogs: the system is based on the facial anatomy of dogs and has been adapted from the original FACS system for humans, created by Ekman and Friesen in 1978. [22]

In 2017, Catia Correia Caeiro, Daniel S. Mills, and Kun Guo [3] reached the milestone of quantifying and comparing human and domestic dog facial expressions in response to emotionally competent stimuli associated with different categories of emotional arousal. The authors use the DogFACS objective protocol to investigate the facial movements of dogs in response to emotional stimuli. The paper explores the question of whether the observed emotional expressions are a result of the objective context (affect elicitation) or the subjective context (affect prediction). The authors conclude that dogs showed distinctive facial actions based on the category of stimuli, producing different facial movements from humans in comparable states of emotional arousal. In the process, the authors also associated the DogFACS action units with dog emotions in a series of video clips (i.e., the Dog Clips dataset, used in this work) from real-world sample situations of canine emotional arousal. Our work uses these associations between Dog Clips tagged with DogFACS coding and canine emotions for the dual purpose of refining automatic recognition by experimenting with real-world video, and of demonstrating how to reduce the bias potentially present in the contexts in which videos for dog emotions could be collected in future targeted studies.

More recently, a few studies have been conducted on automated emotion recognition in dogs.

The seminal study [23] addressing the topic of automated recognition of emotions in dogs for the first time received wide attention. This preliminary work used images of dogs collected from the Internet, classified by the search engine with keywords related to the main emotions of happiness and anger, plus a rest state (i.e., sleep). The photographs were chosen in a controlled way, deliberately selecting clean ones without noisy elements, e.g., bad lighting, cluttered environments, humans and other animals, or a complex background. Classification was implemented through transfer learning using the AlexNet Convolutional Neural Network, [34] reaching solid results. The paper also offered an additional explanation of canine expressions, analyzed by expert veterinary doctors, including the position of the head and a series of facial-expression variables (e.g., ear movements, eye opening, mouth opening, and teeth visibility). Limitations of this preliminary study include the selective data, the emotional model limited to three emotions, and the still-photograph format, which does not convey dynamics. In addition, the study tested only a single neural network.

In 2022, two interesting papers were published on the topic, building on the outcome of the seminal work. [23] One work analyzed canine emotion recognition from body posture. [25] The paper describes a system based on a machine learning model trained on pose estimation to differentiate the emotional states of dogs. The authors compiled a picture library of full-body dog pictures featuring 400 images, with 100 samples for each of the states Anger, Fear, Happiness, and Relaxation. A new dog key-point detection model was built using the DeepLabCut framework [35] for animal key-point detector training, learning from a total of 13,809 annotated dog images and with the capability to estimate the coordinates of 24 different body-part key points. The application determines the dog’s emotional state with an accuracy between 60% and 70%, a threshold that the authors assess as exceeding the human ability to recognize dog emotions.

The second work is a controlled experiment inducing frustration and positive anticipation in Labradors to create a dataset for this specific breed. [24] The laboratory data, collected under controlled conditions, were labeled using the DogFACS action units for these two emotions. The study compares two approaches: a DogFACS-based approach and a deep-learning approach. The DogFACS-based approach uses DogFACS variables as explainable high-level features, but is time-consuming and requires extensive human annotation. The deep-learning approach uses raw images as input and extracts features using two neural networks (i.e., ResNet and Vision Transformer (ViT)). The authors found that features from a self-supervised pre-trained ViT (i.e., DINO-ViT) were superior to the other alternatives.

2 System background

In this section, we describe the knowledge structure and techniques leveraged in our system architecture, including emotional classes, the source dataset, enhanced data processing, and advanced strategies for emotion classification.

2.1 Emotional classes hierarchy

In our research, we identify five canine emotional states: a neutral state (i.e., relaxation), two states indicative of non-dangerous positive emotions (i.e., happiness, positive anticipation), and two states indicative of potentially threatening negative emotions (i.e., frustration, fear), as present in the labeling of the Dog Clips dataset for emotion recognition. In a second, higher-level analysis, we aggregate them for danger recognition. This nuanced categorization allows for a more granular and scalable analysis of canine emotional expressions.

2.2 Dog Clips dataset

The Dog Clips dataset [3] that we used in this work is composed of 100 videos of different breeds of dogs extracted from public sources and evenly distributed among five emotions, i.e., Fear, Frustration, Happiness, Positive Anticipation (later also referred to as Anticipation), and Relaxation.

Fig. 1

Images from the dataset, showing a sample for each class of emotion and a list of associated DogFACS action coding: (a) Fear: Ears rotator, Panting, Tongue show, Lower lip depressor, Lip Corner Puller, Lips part, Jaw drop, Head up, Eyes turn right, Blink, Head turn right; (b) Frustration: Lips part, Jaw drop; (c) Happiness: Ears rotator, Eyes up, Tongue show; (d) Anticipation: Eyes up, Head tilt left; (e) Relaxation: Ears rotator, Lower lip depressor, Lips part, Jaw drop, Eye closure

The video segments in the original Dog Clips dataset are labeled according to emotion in the work of Catia Correia Caeiro et al. [3] and tagged with the Dog Facial Action Coding System (DogFACS), [26] including dogs of different breeds for generalization. Action coding tags in DogFACS fall under three categories:

  • Action Units (AU): consecutive frame sets representing 11 movements whose muscular basis can be identified;

  • Action Descriptors (AD): consecutive frame sets describing 26 broader movements whose muscular basis is not identified;

  • Ear Action Descriptors: consecutive frame sets describing 5 variations from the neutral ear position due to ear muscular movements.

The Dog Clips dataset is tagged with 42 tags, occasionally with several action tags on a single frame. In the following list, we report our analysis on which tags are shared among differing emotional classes and which are exclusive to a single class.

  • Shared among all emotions (i.e., fear, frustration, happiness, positive anticipation, relaxation):

    • Partial, Upper lip raiser, Jaw drop, Sniff, Mouth stretch, Lip Corner Puller, Head turn left, Inner brow raiser, Ears rotator, Ears flattener, Head up, Eyes turn left, Head tilt right, Blink, Ears downward, Lower lip depressor, Tongue show, Lips part, Head turn right, Eyes up, Nose lick, Eyes turn right, Head down, Ears adductor, Lip wipe, Ears Forward, Lip pucker

  • Shared among happiness, frustration, positive anticipation, fear:

    • Panting, Head tilt left, Eyes down

  • Shared among fear, frustration, happiness:

    • Chewing, Blow

  • Shared among happiness, positive anticipation:

    • Body Shake, Nose wrinkler

  • Shared among happiness, relaxation:

    • Eye closure

  • Exclusively appearing with happiness:

    • Lick, Suck

It is worth noting that the only DogFACS action tags that can be directly and exclusively associated with an emotion are Lick and Suck, both of which are correlated with happiness. However, these actions are not necessarily present in every instance of happiness. Figure 1 includes sample frames from Dog Clips videos, showing in the caption how emotions are correlated with DogFACS action coding units.
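
For illustration, the tag-sharing analysis above can be reproduced with a short script of the following kind; the clip_annotations structure is a hypothetical representation of the Dog Clips annotations, not the dataset’s actual file format.

```python
from collections import defaultdict

# Hypothetical layout of the Dog Clips annotations: one entry per clip,
# each with its emotion label and the set of DogFACS tags observed in it.
clip_annotations = [
    {"emotion": "happiness", "tags": {"Lick", "Tongue show", "Ears rotator"}},
    {"emotion": "fear", "tags": {"Panting", "Blink", "Ears rotator"}},
    # ... remaining clips from the dataset annotations
]

# Map each tag to the set of emotion classes it appears with
tag_to_emotions = defaultdict(set)
for clip in clip_annotations:
    for tag in clip["tags"]:
        tag_to_emotions[tag].add(clip["emotion"])

exclusive = {t: e for t, e in tag_to_emotions.items() if len(e) == 1}
shared_by_all = [t for t, e in tag_to_emotions.items() if len(e) == 5]
print("Tags exclusive to one emotion:", exclusive)
print("Tags shared by all five emotions:", shared_by_all)
```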

A notable consideration for our study is that the dataset inherently contains dynamic elements due to camera motion, which introduces significant variations in both the dog’s perspective and the background across frames. These variations are critical for capturing the full spectrum of the dog’s emotional expressions in varying contexts. Although individual frame differences may be subtle, the collective sequence provides a rich, non-redundant dataset. On other datasets, where the camera’s point of view is stable, data redundancy may lead to overfitting, which can be addressed with a temporal range for frame selection.

2.3 Image preprocessing

A focal point of our approach is the data preprocessing phase, with particular emphasis on image segmentation. This step is crucial for improving the quality and pertinence of feature extraction, a dimension that remains underexplored in the literature in this area. Our research pioneers the use of real-world imagery, rather than traditional laboratory or selectively curated datasets. This preprocessing approach allows us to engage with the complexity and variability present in natural environments. By identifying and discarding the inherent noise in situational data, our study introduces comprehensive bias mitigation and correction strategies.

Specifically, we choose two preprocessing strategies that help the classifier focus on the areas of the dog’s body and head that are considered relevant for emotional classification [3], rather than on the remaining discarded image segments, which represent bias and noise. The nuances of our approach (blur and non-blur, body and head, segmentation and bounding box) are explained in the methodology (see Sect. 3).

2.4 Visual dog emotion classification

Convolutional Neural Networks (CNNs) are widely used in image classification thanks to their flexibility and adaptability, and they are also applied to image-based affective computing on humans. [17, 18, 36] By facilitating knowledge transfer from one domain to another in deep learning, Transfer Learning (TL) allows pre-trained CNNs for general image classification to be reused without re-training the network from scratch, exploiting their pre-trained capability to recognize low-level features (e.g., lines, shapes, patterns, and color distribution).

Low-level features are recognized by the initial layers of the CNN, while the knowledge of high-order features is embedded in the final layers. Through TL, we can use this low-level knowledge within the same attribute space as a form of added knowledge for the domain of canine emotions. We implement the fine-tuning phase of TL by replacing the final three layers with layers dedicated to dog emotion recognition, adjusting the network weights for this specific task.
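
As a minimal sketch of this fine-tuning setup, the following snippet replaces the classification head of an ImageNet-pre-trained VGG19 from torchvision with a five-class layer; the framework and the exact layers replaced are illustrative assumptions, not the authors’ exact configuration.

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (low-level features already learned)
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# Replace the final classification layer with a head for the five dog emotions;
# which layers to swap is an illustrative choice in this sketch
num_classes = 5
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_classes)
```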

The benefits of employing TL in Emotion Recognition studies are well-documented in existing literature, [19, 20, 37,38,39] even within non-visual domains, [40,41,42,43] such as spectrograms derived from speech segments or crowd sound. [44] In particular, previous studies on automated emotion recognition in dogs using neural networks [23,24,25] have consistently relied on this technique. In this work, we test the effectiveness of TL-based neural models on a broader range of emotions with in-class variety, for emotion and danger recognition in dogs.

3 Methodology and experiment workflow

In this section, we explain the methodology and workflow of our experiments. The architecture of the system modules (see Fig. 2) includes frame extraction, data generation, application of emotion knowledge transfer to pre-trained Convolutional Neural Network models, and classification of emotions and dangerous/non-dangerous states.

Fig. 2

Plan and workflow of experiments: frames extraction and filtering strategies (purple), image processing (yellow), classification phase (green) (color figure online)

3.1 Dataset extension

The original Dog Clips dataset has been processed with different strategies to generate two classes of datasets for machine learning:

  • Video-based Partitioning (VP) datasets: in the VP datasets, frames are extracted directly from the Dog Clips videos;

  • Action-coding Partitioning (AP) datasets: in the AP datasets, we used DogFACS action coding as supplementary information to guide the data-cleaning process.

Both data classes are defined to meet typical machine learning constraints: a training/test ratio, which in our setting is 80%/20%, and training/test independence, so that the learning phase does not use data samples from the same videos that will be submitted to the test phase. By strategically choosing frames that represent different stages of the emotional response, we maintain the integrity and diversity of the dataset, while keeping a uniform distribution of training/test samples across emotion classes. In the following paragraphs, our dataset generation is detailed for the two partitioning strategies, explaining step by step the combination of techniques used to generate the novel extended datasets based on different image processing.
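
A minimal sketch of such a video-level split is shown below, assuming a hypothetical mapping from video identifiers to emotion labels; frames inherit the split of their source video, which preserves training/test independence.

```python
import random
from collections import defaultdict

def split_videos(video_labels, test_ratio=0.2, seed=0):
    """video_labels: dict mapping video_id -> emotion label (hypothetical)."""
    rng = random.Random(seed)
    by_emotion = defaultdict(list)
    for vid, emotion in video_labels.items():
        by_emotion[emotion].append(vid)
    train_videos, test_videos = [], []
    for emotion, vids in by_emotion.items():
        rng.shuffle(vids)
        n_test = max(1, round(len(vids) * test_ratio))
        test_videos.extend(vids[:n_test])   # all frames of these videos go to test
        train_videos.extend(vids[n_test:])  # the rest go to training
    return train_videos, test_videos
```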

3.1.1 Video-based datasets


The first partitioning strategy we used is based on extracting frames from the Dog Clips videos without further refinement. For such Video-based Partitioning (VP), the training/test split is applied to each emotion, according to a homogeneous class distribution. The VP-derived datasets are generated from basic frames by selecting the raw ones or processing them to focus on relevant features while reducing noise and biases stemming from the scene background and non-uniform camera behavior. Using all the frames that can include emotional information aligns with the methodology of the original Dog Clips dataset creators, ensuring consistency with manual annotation in future comparative studies. This approach also aligns with the practical application of our research in real-time scenarios, where rapid detection of emotional shifts is crucial for preventing dangerous situations.


VP Raw dataset The VP Raw dataset is obtained without filtering or processing the VP basic frames.

In Table 1, the total number of frames for each class is reported for Video-based Partitioning (VP).

Table 1 Frames per class in Video-based Partitioning (VP)

VP face bounding box dataset The basic frames from the VP dataset undergo an initial form of processing and selection through the Doggie-smile algorithm [45, 46] for dog face recognition, which uses a YOLO convolutional network [47] and crops the bounding box containing the head. This processing generates the VP Face Bounding Box dataset, where CNN classifiers can focus on the dog’s face. The drawback is that potentially useful emotion-related information from the dog’s body posture is excluded.

The Doggie-smile face detector returns six facial landmark points (i.e., two ears, forehead top, two eyes, nose) and their bounding box. We observe that this detector often excludes relevant parts of the face, depending on the dog’s position and facial shape (e.g., influenced by fur). Our experimental findings show that expanding the bounding box by 15% of the frame achieves a balance between including relevant parts of the face and excluding background noise. Figure 3a shows an example frame with dog face landmarks and the corresponding bounding box. Figure 4 illustrates the results of preliminary experiments that aimed to optimize the size of the bounding box by progressively varying its expansion. It should be noted that focusing on the dog’s face in this dataset reduces the number of frames, because a frame is excluded if the face detector does not find the dog’s face; this situation can occur either because there is no dog in the scene or because the dog is in a position where the face is not visible.
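
A minimal sketch of the crop-with-margin step is given below, assuming the detector returns a pixel-coordinate bounding box and the frame is a NumPy image array; the 15% margin is taken relative to the frame size, as described above.

```python
def crop_face(frame, box, margin=0.15):
    """frame: HxWx3 image array; box: (x1, y1, x2, y2) from the face detector."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = int(margin * w), int(margin * h)   # margin as 15% of the frame size
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(w, x2 + dx), min(h, y2 + dy)
    return frame[y1:y2, x1:x2]
```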

Fig. 3

Samples of landmark identification and background segmentation

Fig. 4

Line graph showing emotion classification accuracy by different CNN models as the bounding box size around a dog’s face varies. The x-axis indicates the bounding box size, from small to large (as a percentage), and the y-axis shows accuracy. Line styles and colors differentiate the CNN models


VP face segmentation datasets The generation of this dataset acknowledges that the box-shaped crop applied to the frame may be too rough an approximation. Precisely isolating and focusing on the dog’s face could provide more relevant information for emotion recognition. In each basic frame, the Doggie-smile face detector and a segment analyzer from Meta [48, 49] are used to identify face landmarks and delineate image segments (i.e., areas characterized by uniform patterns), respectively. We select a subset of segments from the analyzed frame that encapsulates the maximum number of landmarks while minimizing the covered surface area. The corresponding segments of the basic frame are preserved, while the remaining portions of the image are replaced with a white background. In Fig. 3b, the resulting face segmentation frame is shown, where a white area replaces the background.
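
The segment-selection step can be sketched as follows, assuming the segmenter returns a list of boolean masks and the detector returns integer landmark coordinates; the greedy strategy shown here is an illustrative approximation of the described minimization.

```python
import numpy as np

def select_segments(masks, landmarks):
    """masks: list of HxW boolean arrays; landmarks: list of integer (x, y) points."""
    chosen = np.zeros_like(masks[0], dtype=bool)
    uncovered = list(landmarks)
    while uncovered:
        # favor segments covering many uncovered landmarks with a small area
        best = max(masks, key=lambda m: (sum(m[y, x] for x, y in uncovered), -m.sum()))
        if not any(best[y, x] for x, y in uncovered):
            break  # remaining landmarks fall outside every segment
        chosen |= best
        uncovered = [(x, y) for x, y in uncovered if not best[y, x]]
    return chosen

def whiten_background(frame, mask):
    out = frame.copy()
    out[~mask] = 255  # everything outside the kept segments becomes white
    return out
```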


VP body segmentation datasets The goal of this dataset category is to isolate the segment corresponding to the entire dog body within the resulting frames. This way, it includes potentially emotion-associated information about the dog’s posture, mirroring on the body the approach applied to face segmentation (see the previous paragraph). We use the DeepLabCut body landmark detector, [35] which returns landmarks for the relevant body parts of the dog (see Fig. 3d), alongside an associated confidence level. Then, we perform image segmentation [49] on the frame and choose a minimal set of segments that balances covering the maximum number of landmarks above a confidence threshold against minimizing the surface area. Figure 3e shows the application of this method, resulting in a frame where the dog’s body is isolated on a white background.


Segmentation blurred datasets The inclusion of a blurred background has been specifically implemented in the segmentation versions of both face and body datasets, to address potential bias arising from training on frames with large white portions. Introducing a blurred background, which varies across frames, aims to mitigate this bias and promote more balanced training. In Fig. 3c and f, the resulting frames with a blurred background from face and body segmentation are shown.
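
A minimal sketch of the blurred-background variant is given below: instead of whitening the discarded area, the segmented dog is composited over a Gaussian-blurred copy of the frame (the kernel size is an illustrative choice).

```python
import cv2
import numpy as np

def blur_background(frame, mask, ksize=51):
    """Composite the segmented dog (mask == True) over a blurred copy of the frame."""
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    mask3 = np.repeat(mask[:, :, None], 3, axis=2)
    return np.where(mask3, frame, blurred)
```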

3.1.2 Action-coding datasets

The second partitioning strategy investigates the potential utility of leveraging the information provided by action coding labels in emotion recognition (see research question n.3 in Sect. 1), according to the annotations from the Dog Clips dataset experts, based on the Dog Facial Action Coding System (DogFACS). If the results of the experiments show that such labels significantly improve performance, we could consider incorporating automated facial action recognition into our system.

The AP dataset is obtained from the basic VP dataset, using DogFACS action coding (see the list in subsect. 2.2) as supplementary information to guide the data-cleaning process. In the initial phase of this process, all frame sections labeled as AD74 (termed ‘Unscorable’) are removed, potentially resulting in gaps between relevant video sections. Note that the original dataset contains video segments that include people; to avoid human expressions or posture affecting the emotion training phase, these frames are also excluded from the AP dataset. Table 2 details the resulting number of frames for each class of emotion for AP. Globally, AP contains 13,772 fewer frames than VP.
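
The action-coding filter can be sketched as follows, assuming a hypothetical per-frame annotation layout with start/end indices per tag; the flag used to mark frames containing people is likewise hypothetical.

```python
def keep_frame(frame_idx, codings):
    """codings: list of dicts like {"tag": "AD74", "start": 120, "end": 180}
    describing annotated frame ranges (hypothetical layout)."""
    for c in codings:
        if c["start"] <= frame_idx <= c["end"]:
            if c["tag"] == "AD74":           # 'Unscorable' section
                return False
            if c["tag"] == "human_visible":  # hypothetical flag for frames with people
                return False
    return True
```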

A set of AP-derived datasets has been produced by transforming the corresponding frames from the VP-derived datasets (see the related paragraphs in subsubsect. 3.1.1):

  • AP Raw dataset, with no transformation;

  • AP Face Bounding Box dataset;

  • AP Face Segmentation dataset, also in blurred background version;

  • AP Body Segmentation dataset, also in blurred background version.

Table 2 Frames per class in Action-coding Partitioning (AP)

3.2 Emotion knowledge transfer

In this section, we explain how the Transfer Learning (TL) method (i.e., knowledge transfer) is applied to a selection of state-of-the-art CNN classifiers, pre-trained for image classification.

We use the following Convolutional Neural Network (CNN) classifiers in our experimental flow, all pre-trained for patterns and features on ImageNet, [50] a dataset containing over 14 million images that cover over 20,000 categories:

  • AlexNet [34]

  • MobileNet [51]

  • VGG16 [52]

  • VGG19 [52]

  • Xception [53]

  • Inception-Resnet V2 (also referred to as Resnet V2 in later tables) [54]

These models were then fine-tuned on both the VP and AP-derived datasets to adapt them to the specific task of recognizing dog emotions.

We replaced the last three classification layers for each CNN, including a fully connected layer with five neurons, corresponding to the classes of our emotion model.

The TL method follows best practices [55], applying a learning rate of \(1\times 10^{-4}\) to the pre-trained CNN layers and increasing the learning rate for the new layers by a factor of 20.

The CNNs undergo six epochs of retraining with a batch size of 256, using the Stochastic Gradient Descent optimizer with momentum [55].
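
A minimal sketch of this training configuration in PyTorch is given below; the momentum value and the data loader are assumptions, and the 20x learning-rate factor is applied to the whole replaced head for simplicity.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone with the replaced five-class head (see Sect. 2.4)
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 5)

base_lr = 1e-4
optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": base_lr},
        {"params": model.classifier.parameters(), "lr": 20 * base_lr},
    ],
    momentum=0.9,  # momentum value is an assumption, not stated in the text
)

# Six epochs of retraining; train_loader is assumed to yield batches of 256 frames
for epoch in range(6):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
```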

The six fine-tuned models are applied to each extended dataset and segmentation strategy, for emotion recognition.

To discriminate between friendly and potentially aggressive states of the dog (i.e., danger) and to assess the danger recognition system’s capabilities, we aggregate the emotional classes into two main macro-classes:

  • Dangerous: including the dogs’ unpleasant emotional states, referring to discomfort (i.e., fear, frustration), thus potentially leading to accidents (e.g., bite attack) in case of human--dog interaction.

  • Non-dangerous: referring to the dog’s pleasant (i.e., happiness, positive anticipation) and neutral (i.e., relaxation) emotional states, thus to be considered safe for human--dog interaction.

It is relevant to highlight that the emotion included in the model is positive anticipation. A general emotion of anticipation may be considered a borderline state between the two macro-classes, because it is neither inherently dangerous nor a comfortable state for the dog: its expression may be considered a state of distress due to the high level of attention involved. Positive anticipation, on the other hand, can undoubtedly be included among the non-dangerous states.
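
For the aggregated evaluation, per-frame emotion predictions are simply mapped to the two macro-classes; a minimal sketch of this mapping follows.

```python
DANGEROUS = {"fear", "frustration"}

def to_danger_label(emotion: str) -> str:
    """Map a five-class emotion prediction to the two macro-classes."""
    return "dangerous" if emotion in DANGEROUS else "non-dangerous"

# usage: aggregate frame-level predictions before computing two-class accuracy
# danger_predictions = [to_danger_label(e) for e in emotion_predictions]
```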

4 Results and discussion

In this section, we present and discuss results for our research questions: to address the critical biases and technical considerations of real-world data versus controlled environments, identify the canine emotions most prone to misclassification and their implications for recognizing dangerous situations, and evaluate the use of CNNs and DogFACS action descriptors for canine emotion classification.

A critical challenge for deep emotion recognition specific to dogs concerns biases and noise stemming from the background, which can easily influence the classification. For instance, if the training set includes many happy dogs running in the grass, the repetitive background features of the environment may bias the classification in the less numerous cases in which a dog in a similar place is sad or potentially aggressive. By applying our preprocessing strategies to both the Video-based Partitioning (VP) and the Action-coding Partitioning (AP), we can provide and compare results for the six models.

Considering accuracy across VP datasets and CNNs, as shown in Table 3, the combination of the Face Bounding Box dataset and fine-tuned VGG19 stands out as the top performer, with an accuracy of 0.605. The improvement is evident when compared to the accuracy of 0.488 achieved on the VP Raw dataset, also by VGG19. It is worth noting that the Face Bounding Box method excludes frames without any visible dog face during the training process. The significance of focusing on the dog’s face for effective canine emotion recognition is reinforced by observing that the Face Bounding Box method is also the frame processing technique producing the best general results across all the CNNs, as shown in italics in the table.

Table 3 Emotion Recognition: comparison of CNN accuracy on various datasets using Video-based Partitioning (VP)

Figure 5 illustrates the comparison of accuracy achieved by various CNNs on emotion and danger recognition, highlighting the consistently high performance attained with this dataset.

Fig. 5

Emotion and danger recognition: bar plot of CNN model accuracies for the face bounding box segmentation using Video-based Partitioning (VP) and Action-coding Partitioning (AP). Bars indicate accuracy for each model

By analyzing the actual frames produced with segmentation, we observe that when this technique is applied either to the face or to the body, spurious segments may be kept, as shown in Fig. 6, where dogs are selected together with parts of the background, thus only partially solving the problem of background noise.

Fig. 6

Segmentation errors illustrated: in both frames, the algorithm incorrectly includes the surface beneath the dog as part of its segmentation


Regarding the recognition of the macro-classes for potentially threatening emotional states, Table 4 and Fig. 5b present the comparative results of aggregated accuracy for the classification of the Dangerous/Non-dangerous categories across VP datasets and fine-tuned CNNs. The absolute best performance for danger recognition is obtained again by VGG19 on the Face Bounding Box dataset, with an accuracy of 0.858. As for emotion recognition, the Bounding Box strategy also produces the best danger recognition results for the majority of CNNs. It is worth noting that almost all algorithm and dataset combinations improve their performance after aggregation, compared to classifying the five emotions. This result is noteworthy because it shows that, across all CNNs, the misclassification of emotions tends to remain within the same dangerous/non-dangerous group, thus not leading to false negatives for the dangerous category.

Table 4 Danger Recognition: comparison of CNN accuracy on various datasets using Video-based Partitioning (VP)

Moreover, besides VGG19 being the absolute best with the Face Bounding Box, we also note that Inception-Resnet V2 obtains the best aggregated performance on all the other four datasets. In non-aggregated emotion recognition, Inception-Resnet V2 was best on two out of five datasets.

In Fig. 8a, the confusion matrix for VGG19 on the Face Bounding Box dataset is shown. The aggregated confusion matrix displayed in Fig. 8b supports the previous observations regarding misclassification errors. Specifically, in the VGG19 confusion matrix the highest number of classification errors for a single class occurs with frustration, misclassified as fear in 39.89% of cases. Happiness is misclassified as positive anticipation in 24.60% of cases and, conversely, positive anticipation is misclassified as happiness in 23.13%; relaxation is misclassified as positive anticipation in 23.34%. It is worth noting that these errors remain within the same macro-class and thus contribute positively to the aggregated results for the correct detection of dangerous dog states.

Table 5 presents the key metrics for emotion and danger recognition achieved by VGG19 on the Face Bounding Box dataset, considering both VP and AP. Notably, the F1 score reflects a balance between precision and recall, reaching a performance of up to 85.76% for VP danger classification.

Table 5 Performance evaluation of the VGG19 CNN on the face bounding box for emotion and danger recognition, VP and AP datasets

Considering the comparative results on the AP datasets in Table 6, VGG16 with the Face Bounding Box dataset obtains the best overall accuracy of 0.620 for emotion recognition. Face Bounding Box processing again generally improves performance, for 4 out of 6 CNNs.

Table 6 Emotion Recognition: comparison of CNN accuracy on various datasets using Action-coding Partitioning (AP)

In the aggregated Dangerous/non-dangerous results (see Table 7), the Face Bounding Box achieves the best AP result of 0.808 with VGG19.

Table 7 Danger Recognition: comparison of CNN accuracy on various datasets using Action-coding Partitioning (AP)

However, apart from these best absolute results, the performance increment observed in the AP experiments compared to the corresponding VP ones is generally modest: results are only slightly higher in a few cases and, in some instances, slightly lower.

Regarding computational complexity (in FLOPS) and efficiency (in time), face bounding box detection averages a swift 0.09 s per image, while more complex face and body segmentation tasks are completed in about 3 s per image. In training and testing our network models, execution time scales with parameter size, with AlexNet and ResNetV2 being the most resource-intensive. However, ResNetV2, Xception, and VGG19 offer the best trade-off between accuracy and processing time. In contrast, MobileNet’s lower complexity does not achieve the desired performance, highlighting a critical balance between model efficiency and accuracy. The high accuracy and low complexity of the Face Bounding Box dataset with VGG19 indicate that deep learning can effectively classify dog emotions with this segmentation strategy.

A noteworthy observation is that, while the differences between the partitioning strategies are not particularly relevant, experiments conducted on segmentation reveal that blurring the background leads to an improvement in accuracy (see Fig. 7). For segmentation, blurring improves neural network performance, with peaks for body segmentation of approximately 10% in the best cases (i.e., AlexNet VP and AP, MobileNet AP, VGG19 AP, and VGG16 VP, for emotion recognition), while for face segmentation the best improvement reaches approximately 21% for VP (i.e., AlexNet, danger recognition) and approximately 26% for AP (i.e., AlexNet, danger recognition). This improvement is consistent across most CNN models, regardless of whether VP or AP strategies are used. The significant enhancement obtained by blurring the background supports the hypothesis that the presence of a white background, while effectively eliminating noisy elements surrounding the dog’s body, introduces a bias during training for emotion recognition. Blurring the background helps mitigate this bias, as shown in Fig. 7, leading to improved results.

Fig. 7

VP and AP Emotions and Danger recognition accuracy, face (blue) and body (orange) for segmentation (dashed) and segmentation blurred (plain color) datasets (color figure online)

4.1 Future directions

In this section, we discuss future improvements that directly complement our study’s present achievements, offering a cohesive overview of our impact and of the continued potential in this field. Our work sets a foundational step in applying deep learning to dog emotion recognition, utilizing the Dog Clips dataset [3], a leap toward realism compared to the traditionally simulated datasets used in human emotion studies. The dataset’s rich, uncontrolled environmental contexts provide a valuable, more authentic source for analyzing genuine emotional expressions, highlighting our study’s immediate contribution to the field. The use of segmentation eliminates the side effect of having noisy frames. Addressing bias has been a pivotal aspect of our study, guided by techniques prioritizing dog face detection and analysis. Future enhancements in dataset diversity, particularly by augmenting breeds that might be underrepresented, could further elevate the robustness and inclusivity of our bias-mitigation approach (Fig. 8).

Fig. 8

Confusion matrices using the VGG19 CNN on the faces bounding box dataset with Video-based Partitioning (VP), represented as heatmaps. Rows indicate true emotion labels, while columns represent predicted emotion classes. The intensity of the colors corresponds to the number of samples, with warmer colors indicating higher frequencies and cooler colors indicating lower frequencies of true-predicted label pairs. (a) shows the emotion recognition matrix with a 5x5 layout, and (b) shows the danger recognition matrix with a 2x2 layout (color figure online)

Our analysis currently focuses on static frames, for a more sustainable approach. However, we acknowledge the potential of motion dynamics between frames. Incorporating sequence models, such as attention layers with transformers, [56] could refine emotion recognition accuracy by leveraging the temporal information inherent in video data. The practical implications of our findings, especially in identifying potentially dangerous scenarios in human--dog interactions, lay a solid groundwork for real-time application testing in future studies.

Finally, we would like to highlight that, since emotions and pain share similar neurological mechanisms, further applications of our approach can relate to the research topic of animal pain assessment, by analyzing the expression of pain through the dog’s head and body as we have done for the expression of danger states.

5 Conclusions

Our study focuses on the analysis and recognition of dog emotions and danger states using images from real-life scenarios, rather than the controlled scenarios that are prevalent in the literature.

Having identified the critical challenges in canine emotion recognition, we applied segmentation strategies as key aspects of our methodology. These strategies proved crucial in improving the focus on relevant features and minimizing misclassification due to background noise, thereby contributing to the robustness of our classification results. Among them, the face bounding box and blurring techniques achieved the best performance.

Transfer learning was applied to pre-trained models such as VGG19 and Inception-Resnet V2, which proved able to cope with unstructured environments and with the heterogeneity of breeds and environmental conditions, achieving remarkable results with a peak accuracy of 0.8577 for threat detection.

Our models, together with segmentation strategies, also showed a desirable polarization in the misclassified cases, recording fewer false negatives in the classification of dangerous behaviors. This orientation toward safety is essential for practical applications where the cost of a false negative could be high, highlighting the reliability of the proposed approach for use in scenarios that require cautious and preventative action.

The investigation of the utility of DogFACS labels has shown that, while they do provide some overall improvement in accuracy, these gains are marginal. We can therefore state that the costly integration of automated DogFACS labeling would not be cost-effective.

In conclusion, our results confirm the applicability of transfer learning in complex and diverse settings and highlight the critical role of segmentation strategies in achieving high accuracy in real-world canine behavior analysis. Our methodology proves effective for safety-oriented, real-world practical applications in human--dog interaction research.