1 Introduction

Humans and animals spontaneously connect to other individuals. The neurological mirroring system [1], observed in humans, primates, and other species, including dogs, cats, and birds, provides the physiological mechanism for perception-action coupling. This mechanism is essential for understanding the actions of others, mastering skills by imitation, [2] understanding intentions and emotions, and engaging in inter-species and intra-species empathy. Although the expression of emotions differs between species [3], the proven similarity of reactions between humans and dogs is intuitively attributed to their mirroring systems [4]. For example, dog owners know well that inter-species behaviors of contagious joy, relaxation, and yawning often occur in human--dog interaction.

In this paper, we investigate the automated recognition of emotions shared between humans and dogs, focusing on the following research questions:

  1. What are the critical biases and technical considerations when experimenting on real-world data as compared to data from a controlled environment?

  2. Which canine emotions are most often misclassified, and is this issue critical for the recognition of potentially dangerous situations?

  3. Can the classification of dog emotions be improved by focusing on the DogFACS coding descriptors?

We begin our study by identifying the challenges of emotion recognition in dogs, starting from the literature on emotion recognition and dog behavior. Building on the recent introduction of automated dog emotion recognition, we focus on classifying five specific pleasant and unpleasant emotions in dogs and then consolidate these findings into higher-level categories of danger, i.e., dangerous/non-dangerous inter-species interactions.

The potential applications of this study span diverse domains, ranging from the development of empathy-enabled smart assistants to the enhancement of robotic pets’ empathetic functions [5,6,7]. For example, in environments where social robots and dogs coexist, AI-driven emotion recognition could enable robots to adapt their behavior accordingly and provide timely alerts to humans to ensure that dogs are not under undue stress. [8,9,10,11] These intelligent systems could also interpret canine emotions to alert owners, veterinarians, and animal shelters or rescue organizations to a dog’s distress or discomfort, so they can respond appropriately, strengthening the dog-owner bond and potentially preventing harmful situations. This research also paves the way for advanced real-time systems in prosthetic knowledge, [12] where AI with empathetic capabilities could become critical in ensuring secure and responsive interaction with pets for people with cognitive or sensory challenges [13,14,15]. Last but not least, this technology can contribute to ethological studies and welfare guidelines.

The main contributions of this work can be summarized as follows:

  • Classification generalizability of dog emotions across breeds and environmental conditions through Transfer Learning, enhancing deep-learning reliability in real-world settings and moving beyond controlled laboratory conditions.

  • Enhanced dataset bias mitigation through image preprocessing, providing a systematic comparison of segmentation strategies to eliminate background elements that could lead to biased classification.

  • Systematic comparative analysis of neural network models, including a five-emotion model and the classification of dangerous situations.

1.1 Challenges of emotion recognition in dogs

To better position our task and identify gaps in the literature, in this subsection we analyze the challenges specific to canine emotion recognition, starting from human emotion recognition and then differentiating between the two.

Automated human emotion recognition from images using deep learning is a mature research topic with a solid state of the art, readily implementable with machine learning techniques that identify critical points of the face [16] or with deep neural networks that exploit facial details, such as micro-expressions, in high-resolution images. [17,18,19,20] Despite differences due to ethnicity, gender, and age, human-face-based emotion recognition can rely on strong generalizability, owing to the similarities among human faces and the consistent expression of eyes, eyebrows, mouth, and surrounding skin areas. [21, 22] Although dogs and humans respond differently to emotionally competent stimuli, [3] the same process has proved promising for dogs. [23,24,25]

In dogs, we observe a broader range of critical differences between breeds, including, e.g., head shape, fur length, coat texture, and body proportions. The coat color distribution can also be problematic, e.g., in breeds with spots, which may hinder shape recognition and segmentation algorithms. Even within a single breed, it is much harder to claim that findings and conclusions from a sample population apply to the entire population. Moreover, in dogs we cannot exploit micro-expressions, and the state of the coat in real life (e.g., wet, cut, dirty) may influence the recognition algorithm’s performance. Other work [23, 24] suggests that recognizing the dog’s facial expression remains central to emotion recognition, and that performance can be improved by adding postural information.

When considering the annotation of canine data, a significant distinction from human subjects becomes apparent: unlike humans, dogs cannot communicate their state verbally or in writing. Consequently, dogs’ emotions in still images are inherently human-perceived; as of yet, a physiological dataset that can be reliably mapped to canine emotions remains unavailable. For videos, action units (e.g., DogFACS [26] data) can be mapped to emotions to support emotion explainability. [3] Although a dog’s owner may have a deep understanding of their pet’s behavior, we deem professional evaluations (e.g., from veterinarians, dog trainers, or behavioral or ethology experts) to be more accurate, being less subject to personal biases and perception.

Another challenge in dog emotion recognition arises from the dynamic nature of real-world environments, as opposed to controlled settings. Images or videos captured in real life may contain noise, such as varying backgrounds (e.g., many images may feature happy dogs running in grass fields, sad dogs on floors, or sleepy dogs on couches). Furthermore, frames captured with personal devices may be out of focus, poorly composed (i.e., with the dog positioned away from the center or focal points), or inadequately lit. These factors can introduce significant biases in deep learning, as the background, environment, and camera specifications may unknowingly influence the features assigned to emotion classes.

In the context of our research, our objectives include explicitly identifying potential biases and examining assorted data-cleaning methodologies. Strategically, we account for the cited background biases within the training dataset in ad-hoc experimental settings, subsequently deploying data-cleaning techniques to evaluate how well they mitigate such biases.

Summarizing, three main types of bias are likely to be common in dog emotion datasets, and it is critical to focus on them when creating novel datasets or designing studies with the aim of avoiding them:

  • Systemic bias: video data have a unique style, setting, or dog breed that does not generalize. Intra-video correlations, in other words, might lead the model to learn something specific, unrelated to general dog emotions, which does not apply to unseen videos.

  • Statistical or algorithmic bias: inherent bias in the training data. If certain dogs, breeds, environments, or situations are over-represented in the videos, the model might learn to associate these factors with certain emotions, leading to biases. For example, if most of the videos showing happiness were shot in parks, the model might incorrectly learn to associate parks with happiness.

  • Activity or selection bias: non-uniform distribution of emotions. If certain emotions are over-represented or under-represented in the training data, the model might learn to predict the more common emotions more frequently, biasing the emotion predictions.

1.2 Previous works

Recognition of animal emotions is an important area of research to improve animal welfare and scientific understanding, with important uses ranging from improving conditions for farm animals to assessing the welfare of laboratory rats.

An interesting case study in visual animal emotion recognition is the Grimace Scale, [27,28,29,30] used to measure animal suffering from visual expression, e.g., in farm-bred horses and sheep, or to understand when it is time to euthanize laboratory rats. [31]

Dogs’ emotions have been extensively analyzed in behavioral and ethology studies, in particular to demonstrate empathy between humans and dogs. [32, 33] Automated emotion recognition in dogs is still a novel research application area; the foundation of our research is an emotion model that also reflects human emotional experiences, [21, 22] and an evaluation of similarities and differences is included.

The Dog Facial Action Coding System (DogFACS) [26] is a manual annotation system used in other studies to describe changes in facial appearance based on movements of the underlying facial muscles in dogs. It provides a scientific observational tool for identifying and coding action units of facial movements in dogs: the system is based on the facial anatomy of dogs and has been adapted from the original FACS system for humans, created by Ekman and Friesen in 1978. [22]

In 2017, Catia Correia Caeiro, Daniel S. Mills, and Kun Guo [3] reached the milestone of quantifying and comparing human and domestic dog facial expressions in response to emotionally competent stimuli associated with different categories of emotional arousal. The authors use the DogFACS objective protocol to investigate the facial movements of dogs in response to emotional stimuli. The paper explores the question of whether the observed emotional expressions are a result of the objective context (affect elicitation) or the subjective context (affect prediction). The authors conclude that dogs showed distinctive facial actions based on the category of stimuli, producing different facial movements from humans in comparable states of emotional arousal. In the process, the authors also associated the DogFACS action units with dog emotions in a series of video clips (i.e., the Dog Clips dataset, used in this work) from real-world sample situations of canine emotional arousal. Our work uses these associations between Dog Clips tagged with DogFACS coding and canine emotions for the dual purpose of refining automatic recognition by experimenting with real-world video, and of demonstrating how to reduce the bias potentially present in the contexts in which videos for dog emotions could be collected in future targeted studies.

More recently, a few studies have been conducted on automated emotion recognition in dogs.

The seminal study [23] addressing the topic of automated recognition of emotions in dogs for the first time received wide attention. This preliminary work used images of dogs collected from the Internet, classified by the search engine with keywords related to the main emotions of happiness and anger, plus a rest state (i.e., sleep). The photographs were chosen in a controlled way, deliberately selecting clean ones without noisy elements, e.g., bad lighting, cluttered environments, humans and other animals, or a complex background. Classification was implemented through transfer learning using the AlexNet Convolutional Neural Network, [34] reaching solid results. The paper also offered an additional explanation of canine expressions, analyzed by expert veterinary doctors, including the position of the head and a series of facial-expression variables (e.g., ear movements, eye opening, mouth opening, and teeth visibility). Limitations of this preliminary study include the selective data, the emotional model limited to three emotions, and the still-photograph format, which does not convey dynamics. In addition, the study tested only a single neural network.

In 2022, two interesting papers were published on the topic, building on the outcome of the seminal work. [23] One work analyzed canine emotion recognition from body posture. [25] The paper describes a system based on a machine learning model trained on pose estimation to differentiate the emotional states of dogs. The authors compiled a picture library of full-body dog pictures featuring 400 images, with 100 samples for each of the states Anger, Fear, Happiness, and Relaxation. A new dog key-point detection model was built using the DeepLabCut framework [35] for animal key-point detector training, learning from a total of 13,809 annotated dog images and with the capability to estimate the coordinates of 24 different body-part key points. The application determines the dog’s emotional state with an accuracy between 60% and 70%, a threshold that the authors assess as exceeding the human ability to recognize dog emotions.

The second work is a controlled experiment inducing frustration and positive anticipation in Labradors to create a dataset for this specific breed. [24] The laboratory data, collected under controlled conditions, were labeled using the DogFACS action units for these two emotions. The study compares two approaches: a DogFACS-based approach and a deep-learning approach. The DogFACS-based approach uses DogFACS variables as explainable high-level features, but is time-consuming and requires extensive human annotation. The deep-learning approach uses raw images as input and extracts features using two neural networks (i.e., ResNet and Vision Transformer (ViT)). The authors found that features from a self-supervised pre-trained ViT (i.e., DINO-ViT) were superior to the other alternatives.

2 System background

In this section, we describe the knowledge structure and techniques leveraged in our system architecture, including emotional classes, the source dataset, enhanced data processing, and advanced strategies for emotion classification.

2.1 Emotional classes hierarchy

In our research, we identify five canine emotional states: a neutral state (i.e., relaxation), two states indicative of non-dangerous positive emotions (i.e., happiness, positive anticipation), and two states indicative of potentially threatening negative emotions (i.e., frustration, fear), as present in the labeling of the Dog Clips dataset for emotion recognition. In a second, higher-level analysis, we aggregate them for danger recognition. This nuanced categorization allows for a more granular and scalable analysis of canine emotional expressions.

2.2 Dog Clips dataset

The Dog Clips dataset [3] that we used in this work is composed of 100 videos of different breeds of dogs extracted from public sources and evenly distributed among five emotions, i.e., Fear, Frustration, Happiness, Positive Anticipation (later also referred to as Anticipation), and Relaxation.

Fig. 1

Images from the dataset, showing a sample for each class of emotion and a list of associated DogFACS action coding: (a) Fear: Ears rotator, Panting, Tongue show, Lower lip depressor, Lip Corner Puller, Lips part, Jaw drop, Head up, Eyes turn right, Blink, Head turn right; (b) Frustration: Lips part, Jaw drop; (c) Happiness: Ears rotator, Eyes up, Tongue show; (d) Anticipation: Eyes up, Head tilt left; (e) Relaxation: Ears rotator, Lower lip depressor, Lips part, Jaw drop, Eye closure

The video segments in the original Dog Clips dataset are labeled according to emotion in the work of Catia Correia Caeiro et al. [3] and tagged with the Dog Facial Action Coding System (DogFACS), [26] including dogs of different breeds for generalization. Action coding tags in DogFACS fall under three categories:

  • Action Units (AU): consecutive frame sets representing 11 movements whose muscular basis can be identified;

  • Action Descriptors (AD): consecutive frame sets describing 26 broader movements whose muscular basis is not identified;

  • Ear Action Descriptors: consecutive frame sets describing 5 variations from the neutral ear position due to ear muscular movements.

The Dog Clips dataset is tagged with 42 tags, occasionally with several action tags on a single frame. In the following list, we report our analysis on which tags are shared among differing emotional classes and which are exclusive to a single class.

  • Shared among all emotions (i.e., fear, frustration, happiness, positive anticipation, relaxation):

    • Partial, Upper lip raiser, Jaw drop, Sniff, Mouth stretch, Lip Corner Puller, Head turn left, Inner brow raiser, Ears rotator, Ears flattener, Head up, Eyes turn left, Head tilt right, Blink, Ears downward, Lower lip depressor, Tongue show, Lips part, Head turn right, Eyes up, Nose lick, Eyes turn right, Head down, Ears adductor, Lip wipe, Ears Forward, Lip pucker

  • Shared among happiness, frustration, positive anticipation, fear:

    • Panting, Head tilt left, Eyes down

  • Shared among fear, frustration, happiness:

    • Chewing, Blow

  • Shared among happiness, positive anticipation:

    • Body Shake, Nose wrinkler

  • Shared among happiness, relaxation:

    • Eye closure

  • Exclusively appearing with happiness:

    • Lick, Suck

It is worth noting that the only DogFACS action tags that can be directly and exclusively associated with an emotion are Lick and Suck, both of which are correlated with happiness. However, these actions are not necessarily present in every instance of happiness. Figure 1 includes sample frames from Dog Clips videos, showing in the caption how emotions are correlated with DogFACS action coding units.
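
For illustration, the tag-sharing analysis above can be reproduced with a short script of the following kind; the clip_annotations structure is a hypothetical representation of the Dog Clips annotations, not the dataset’s actual file format.

```python
from collections import defaultdict

# Hypothetical layout of the Dog Clips annotations: one entry per clip,
# each with its emotion label and the set of DogFACS tags observed in it.
clip_annotations = [
    {"emotion": "happiness", "tags": {"Lick", "Tongue show", "Ears rotator"}},
    {"emotion": "fear", "tags": {"Panting", "Blink", "Ears rotator"}},
    # ... remaining clips from the dataset annotations
]

# Map each tag to the set of emotion classes it appears with
tag_to_emotions = defaultdict(set)
for clip in clip_annotations:
    for tag in clip["tags"]:
        tag_to_emotions[tag].add(clip["emotion"])

exclusive = {t: e for t, e in tag_to_emotions.items() if len(e) == 1}
shared_by_all = [t for t, e in tag_to_emotions.items() if len(e) == 5]
print("Tags exclusive to one emotion:", exclusive)
print("Tags shared by all five emotions:", shared_by_all)
```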

A notable consideration for our study is that the dataset inherently contains dynamic elements due to camera motion, which introduces significant variations in both the dog’s perspective and the background across frames. These variations are critical for capturing the full spectrum of the dog’s emotional expressions in varying contexts. Although individual frame differences may be subtle, the collective sequence provides a rich, non-redundant dataset. On other datasets, where the camera’s point of view is stable, data redundancy may lead to overfitting, which can be addressed with a temporal range for frame selection.

2.3 Image preprocessing

A focal point of our approach is the data preprocessing phase, with particular emphasis on image segmentation. This step is crucial for improving the quality and pertinence of feature extraction, a dimension that remains underexplored in the literature in this area. Our research pioneers the use of real-world imagery, rather than traditional laboratory or selectively curated datasets. This preprocessing approach allows us to engage with the complexity and variability present in natural environments. By identifying and discarding the inherent noise in situational data, our study introduces comprehensive bias mitigation and correction strategies.

Specifically, we choose two preprocessing strategies that help the classifier focus on the areas of the dog’s body and head that are considered relevant for emotional classification [3], rather than on the remaining discarded image segments, which represent bias and noise. The nuances of our approach (blur and non-blur, body and head, segmentation and bounding box) are explained in the methodology (see Sect. 3).

2.4 Visual dog emotion classification

Convolutional Neural Networks (CNNs) are widely used in image classification thanks to their flexibility and adaptability, and they are also applied to image-based affective computing on humans. [17, 18, 36] By facilitating knowledge transfer from one domain to another in deep learning, Transfer Learning (TL) allows pre-trained CNNs for general image classification to be reused without re-training the network from scratch, exploiting their pre-trained capability to recognize low-level features (e.g., lines, shapes, patterns, and color distribution).

Low-level features are recognized by the initial layers of the CNN, while the knowledge of high-order features is embedded in the final layers. Through TL, we can use this low-level knowledge within the same attribute space as a form of added knowledge for the domain of canine emotions. We implement the fine-tuning phase of TL by replacing the final three layers with layers dedicated to dog emotion recognition, adjusting the network weights for this specific task.
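
As a minimal sketch of this fine-tuning setup, the following snippet replaces the classification head of an ImageNet-pre-trained VGG19 from torchvision with a five-class layer; the framework and the exact layers replaced are illustrative assumptions, not the authors’ exact configuration.

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (low-level features already learned)
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# Replace the final classification layer with a head for the five dog emotions;
# which layers to swap is an illustrative choice in this sketch
num_classes = 5
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_classes)
```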

The benefits of employing TL in Emotion Recognition studies are well-documented in existing literature, [19, 20, 37,38,39] even within non-visual domains, [40,41,42,43] such as spectrograms derived from speech segments or crowd sound. [44] In particular, previous studies on automated emotion recognition in dogs using neural networks [23,24,25] have consistently relied on this technique. In this work, we test the effectiveness of TL-based neural models on a broader range of emotions with in-class variety, for emotion and danger recognition in dogs.

3 Methodology and experiment workflow

In this section, we explain the methodology and workflow of our experiments. The architecture of the system modules (see Fig. 2) includes frame extraction, data generation, application of emotion knowledge transfer to pre-trained Convolutional Neural Network models, and classification of emotions and dangerous/non-dangerous states.

Fig. 2

Plan and workflow of experiments: frames extraction and filtering strategies (purple), image processing (yellow), classification phase (green) (color figure online)

3.1 Dataset extension

The original Dog Clips dataset has been processed with different strategies to generate two classes of datasets for machine learning:

  • Video-based Partitioning (VP) datasets: in the VP datasets, frames are extracted directly from the Dog Clips videos;

  • Action-coding Partitioning (AP) datasets: in the AP datasets, we used DogFACS action coding as supplementary information to guide the data-cleaning process.

Both data classes are defined to meet typical machine learning constraints: a training/test ratio, which in our setting is 80%/20%, and training/test independence, so that the learning phase does not use data samples from the same videos that will be submitted to the test phase. By strategically choosing frames that represent different stages of the emotional response, we maintain the integrity and diversity of the dataset, while keeping a uniform distribution of training/test samples across emotion classes. In the following paragraphs, our dataset generation is detailed for the two partitioning strategies, explaining step by step the combination of techniques used to generate the novel extended datasets based on different image processing.
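
A minimal sketch of such a video-level split is shown below, assuming a hypothetical mapping from video identifiers to emotion labels; frames inherit the split of their source video, which preserves training/test independence.

```python
import random
from collections import defaultdict

def split_videos(video_labels, test_ratio=0.2, seed=0):
    """video_labels: dict mapping video_id -> emotion label (hypothetical)."""
    rng = random.Random(seed)
    by_emotion = defaultdict(list)
    for vid, emotion in video_labels.items():
        by_emotion[emotion].append(vid)
    train_videos, test_videos = [], []
    for emotion, vids in by_emotion.items():
        rng.shuffle(vids)
        n_test = max(1, round(len(vids) * test_ratio))
        test_videos.extend(vids[:n_test])   # all frames of these videos go to test
        train_videos.extend(vids[n_test:])  # the rest go to training
    return train_videos, test_videos
```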

3.1.1 Video-based datasets


The first partitioning strategy we used is based on extracting frames from the Dog Clips videos without further refinement. For such Video-based Partitioning (VP), the training/test split is applied to each emotion, according to a homogeneous class distribution. The VP-derived datasets are generated from basic frames by selecting the raw ones or processing them to focus on relevant features while reducing noise and biases stemming from the scene background and non-uniform camera behavior. Using all the frames that can include emotional information aligns with the methodology of the original Dog Clips dataset creators, ensuring consistency with manual annotation in future comparative studies. This approach also aligns with the practical application of our research in real-time scenarios, where rapid detection of emotional shifts is crucial for preventing dangerous situations.


VP Raw dataset The VP Raw dataset is obtained without filtering or processing the VP basic frames.

In Table 1, the total number of frames for each class is reported for Video-based Partitioning (VP).

Table 1 Frames per class in Video-based Partitioning (VP)

VP face bounding box dataset The basic frames from the VP dataset undergo an initial form of processing and selection through the Doggie-smile algorithm [45, 46] for dog face recognition, which uses a YOLO convolutional network [47] and crops the bounding box containing the head. This processing generates the VP Face Bounding Box dataset, where CNN classifiers can focus on the dog’s face. The drawback is that potentially useful emotion-related information from the dog’s body posture is excluded.

The Doggie-smile face detector returns six facial landmark points (i.e., two ears, forehead top, two eyes, nose) and their bounding box. We observe that this detector often excludes relevant parts of the face, depending on the dog’s position and facial shape (e.g., influenced by fur). Our experimental findings show that expanding the bounding box by 15% of the frame achieves a balance between including relevant parts of the face and excluding background noise. Figure 3a shows an example frame with dog face landmarks and the corresponding bounding box. Figure 4 illustrates the results of preliminary experiments that aimed to optimize the size of the bounding box by progressively varying its expansion. It should be noted that focusing on the dog’s face in this dataset reduces the number of frames, because a frame is excluded if the face detector does not find the dog’s face; this situation can occur either because there is no dog in the scene or because the dog is in a position where the face is not visible.
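
A minimal sketch of the crop-with-margin step is given below, assuming the detector returns a pixel-coordinate bounding box and the frame is a NumPy image array; the 15% margin is taken relative to the frame size, as described above.

```python
def crop_face(frame, box, margin=0.15):
    """frame: HxWx3 image array; box: (x1, y1, x2, y2) from the face detector."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = int(margin * w), int(margin * h)   # margin as 15% of the frame size
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(w, x2 + dx), min(h, y2 + dy)
    return frame[y1:y2, x1:x2]
```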

Fig. 3

Samples of landmark identification and background segmentation

Fig. 4

Line graph showing emotion classification accuracy by different CNN models as the bounding box size around a dog’s face varies. The x-axis indicates the bounding box size, from small to large (as a percentage), and the y-axis shows accuracy. Line styles and colors differentiate the CNN models


VP face segmentation datasets The generation of this dataset acknowledges that the box-shaped crop applied to the frame may be too rough an approximation. Precisely isolating and focusing on the dog’s face could provide more relevant information for emotion recognition. In each basic frame, the Doggie-smile face detector and a segment analyzer from Meta [48, 49] are used to identify face landmarks and delineate image segments (i.e., areas characterized by uniform patterns), respectively. We select a subset of segments from the analyzed frame that encapsulates the maximum number of landmarks while minimizing the covered surface area. The corresponding segments of the basic frame are preserved, while the remaining portions of the image are replaced with a white background. In Fig. 3b, the resulting face segmentation frame is shown, where a white area replaces the background.
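
The segment-selection step can be sketched as follows, assuming the segmenter returns a list of boolean masks and the detector returns integer landmark coordinates; the greedy strategy shown here is an illustrative approximation of the described minimization.

```python
import numpy as np

def select_segments(masks, landmarks):
    """masks: list of HxW boolean arrays; landmarks: list of integer (x, y) points."""
    chosen = np.zeros_like(masks[0], dtype=bool)
    uncovered = list(landmarks)
    while uncovered:
        # favor segments covering many uncovered landmarks with a small area
        best = max(masks, key=lambda m: (sum(m[y, x] for x, y in uncovered), -m.sum()))
        if not any(best[y, x] for x, y in uncovered):
            break  # remaining landmarks fall outside every segment
        chosen |= best
        uncovered = [(x, y) for x, y in uncovered if not best[y, x]]
    return chosen

def whiten_background(frame, mask):
    out = frame.copy()
    out[~mask] = 255  # everything outside the kept segments becomes white
    return out
```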


VP body segmentation datasets The goal of this dataset category is to isolate the segment corresponding to the entire dog body within the resulting frames. This way, it includes potentially emotion-associated information about the dog’s posture, mirroring on the body the approach applied to face segmentation (see the previous paragraph). We use the DeepLabCut body landmark detector, [35] which returns landmarks for the relevant body parts of the dog (see Fig. 3d), alongside an associated confidence level. Then, we perform image segmentation [49] on the frame and choose a minimal set of segments that balances covering the maximum number of landmarks above a confidence threshold against minimizing the surface area. Figure 3e shows the application of this method, resulting in a frame where the dog’s body is isolated on a white background.


Segmentation blurred datasets The inclusion of a blurred background has been specifically implemented in the segmentation versions of both face and body datasets, to address potential bias arising from training on frames with large white portions. Introducing a blurred background, which varies across frames, aims to mitigate this bias and promote more balanced training. In Fig. 3c and f, the resulting frames with a blurred background from face and body segmentation are shown.
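
A minimal sketch of the blurred-background variant is given below: instead of whitening the discarded area, the segmented dog is composited over a Gaussian-blurred copy of the frame (the kernel size is an illustrative choice).

```python
import cv2
import numpy as np

def blur_background(frame, mask, ksize=51):
    """Composite the segmented dog (mask == True) over a blurred copy of the frame."""
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    mask3 = np.repeat(mask[:, :, None], 3, axis=2)
    return np.where(mask3, frame, blurred)
```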

3.1.2 Action-coding datasets

The second partitioning strategy investigates the potential utility of leveraging the information provided by action coding labels in emotion recognition (see research question n.3 in Sect. 1), according to the annotations from the Dog Clips dataset experts, based on the Dog Facial Action Coding System (DogFACS). If the results of the experiments show that such labels significantly improve performance, we could consider incorporating automated facial action recognition into our system.

The AP dataset is obtained from the basic VP dataset, using DogFACS action coding (see the list in subsect. 2.2) as supplementary information to guide the data-cleaning process. In the initial phase of this process, all frame sections labeled as AD74 (termed ‘Unscorable’) are removed, potentially resulting in gaps between relevant video sections. Note that the original dataset contains video segments that include people; to avoid human expressions or posture affecting the emotion training phase, these frames are also excluded from the AP dataset. Table 2 details the resulting number of frames for each class of emotion for AP. Globally, AP contains 13,772 fewer frames than VP.
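
The action-coding filter can be sketched as follows, assuming a hypothetical per-frame annotation layout with start/end indices per tag; the flag used to mark frames containing people is likewise hypothetical.

```python
def keep_frame(frame_idx, codings):
    """codings: list of dicts like {"tag": "AD74", "start": 120, "end": 180}
    describing annotated frame ranges (hypothetical layout)."""
    for c in codings:
        if c["start"] <= frame_idx <= c["end"]:
            if c["tag"] == "AD74":           # 'Unscorable' section
                return False
            if c["tag"] == "human_visible":  # hypothetical flag for frames with people
                return False
    return True
```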

A set of AP-derived datasets has been produced by transforming the corresponding frames from the VP-derived datasets (see the related paragraphs in subsubsect. 3.1.1):

  • AP Raw dataset, with no transformation;

  • AP Face Bounding Box dataset;

  • AP Face Segmentation dataset, also in blurred background version;

  • AP Body Segmentation dataset, also in blurred background version.

Table 2 Frames per class in Action-coding Partitioning (AP)

3.2 Emotion knowledge transfer

In this section, we explain how the Transfer Learning (TL) method (i.e., knowledge transfer) is applied to a selection of state-of-the-art CNN classifiers, pre-trained for image classification.

We use the following Convolutional Neural Network (CNN) classifiers in our experimental flow, all pre-trained for patterns and features on ImageNet, [50] a dataset containing over 14 million images that cover over 20,000 categories:

  • AlexNet [34]

  • MobileNet [51]

  • VGG16 [52]

  • VGG19 [52]

  • Xception [53]

  • Inception-Resnet V2 (also referred to as Resnet V2 in later tables) [54]

These models were then fine-tuned on both the VP and AP-derived datasets to adapt them to the specific task of recognizing dog emotions.

We replaced the last three classification layers for each CNN, including a fully connected layer with five neurons, corresponding to the classes of our emotion model.

The TL method follows best practices [55], applying a learning rate of \(1\times 10^{-4}\) to the pre-trained CNN layers and increasing the learning rate for the new layers by a factor of 20.

The CNNs undergo six epochs of retraining with a batch size of 256, using the Stochastic Gradient Descent optimizer with momentum [55].
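
A minimal sketch of this training configuration in PyTorch is given below; the momentum value and the data loader are assumptions, and the 20x learning-rate factor is applied to the whole replaced head for simplicity.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone with the replaced five-class head (see Sect. 2.4)
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 5)

base_lr = 1e-4
optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": base_lr},
        {"params": model.classifier.parameters(), "lr": 20 * base_lr},
    ],
    momentum=0.9,  # momentum value is an assumption, not stated in the text
)

# Six epochs of retraining; train_loader is assumed to yield batches of 256 frames
for epoch in range(6):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
```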

The six fine-tuned models are applied to each extended dataset and segmentation strategy, for emotion recognition.

To discriminate between friendly and potentially aggressive states of the dog (i.e., danger) and to assess the danger recognition system’s capabilities, we aggregate the emotional classes into two main macro-classes:

  • Dangerous: including the dogs’ unpleasant emotional states, referring to discomfort (i.e., fear, frustration), thus potentially leading to accidents (e.g., bite attack) in case of human--dog interaction.

  • Non-dangerous: referring to the dog’s pleasant (i.e., happiness, positive anticipation) and neutral (i.e., relaxation) emotional states, thus to be considered safe for human--dog interaction.

It is relevant to highlight that the emotion included in the model is positive anticipation. A general emotion of anticipation may be considered a borderline state between the two macro-classes, because it is neither inherently dangerous nor a comfortable state for the dog: its expression may be considered a state of distress due to the high level of attention involved. Positive anticipation, on the other hand, can undoubtedly be included among the non-dangerous states.
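
For the aggregated evaluation, per-frame emotion predictions are simply mapped to the two macro-classes; a minimal sketch of this mapping follows.

```python
DANGEROUS = {"fear", "frustration"}

def to_danger_label(emotion: str) -> str:
    """Map a five-class emotion prediction to the two macro-classes."""
    return "dangerous" if emotion in DANGEROUS else "non-dangerous"

# usage: aggregate frame-level predictions before computing two-class accuracy
# danger_predictions = [to_danger_label(e) for e in emotion_predictions]
```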

4 Results and discussion

In this section, we present and discuss results for our research questions: to address the critical biases and technical considerations of real-world data versus controlled environments, identify the canine emotions most prone to misclassification and their implications for recognizing dangerous situations, and evaluate the use of CNNs and DogFACS action descriptors for canine emotion classification.

A critical challenge for deep emotion recognition specific to dogs concerns biases and noise stemming from the background, which can easily influence the classification. For instance, if the training set includes many happy dogs running in the grass, the repetitive background features of the environment may bias the classification in the less numerous cases in which a dog in a similar place is sad or potentially aggressive. By applying our preprocessing strategies to both the Video-based Partitioning (VP) and the Action-coding Partitioning (AP), we can provide and compare results for the six models.

Considering accuracy across VP datasets and CNNs, as shown in Table 3, the combination of the Face Bounding Box dataset and fine-tuned VGG19 stands out as the top performer, with an accuracy of 0.605. The improvement is evident when compared to the accuracy of 0.488 achieved on the VP Raw dataset, also by VGG19. It is worth noting that the Face Bounding Box method excludes frames without any visible dog face during the training process. The significance of focusing on the dog’s face for effective canine emotion recognition is reinforced by observing that the Face Bounding Box method is also the frame processing technique producing the best general results across all the CNNs, as shown in italics in the table.

Table 3 Emotion Recognition: comparison of CNN accuracy on various datasets using Video-based Partitioning (VP)

Figure 5 illustrates the comparison of accuracy achieved by various CNNs on emotion and danger recognition, highlighting the consistently high performance attained with this dataset.

Fig. 5

Emotion and danger recognition: bar plot of CNN model accuracies for the face bounding box segmentation using Video-based Partitioning (VP) and Action-coding Partitioning (AP). Bars indicate accuracy for each model

By analyzing the actual frames produced with segmentation, we observe that when this technique is applied either to the face or to the body, spurious segments may be kept, as shown in Fig. 6, where dogs are selected together with parts of the background, thus only partially solving the problem of background noise.

Fig. 6

Segmentation errors illustrated: in both frames, the algorithm incorrectly includes the surface beneath the dog as part of its segmentation


Regarding the recognition of the macro-classes for potentially threatening emotional states, Table 4 and Fig. 5b present the comparative results of aggregated accuracy for the classification of the Dangerous/Non-dangerous categories across VP datasets and fine-tuned CNNs. The absolute best performance for danger recognition is obtained again by VGG19 on the Face Bounding Box dataset, with an accuracy of 0.858. As for emotion recognition, the Bounding Box strategy also produces the best danger recognition results for the majority of CNNs. It is worth noting that almost all algorithm and dataset combinations improve their performance after aggregation, compared to classifying the five emotions. This result is noteworthy because it shows that, across all CNNs, the misclassification of emotions tends to remain within the same dangerous/non-dangerous group, thus not leading to false negatives for the dangerous category.

Table 4 Danger Recognition: comparison of CNN accuracy on various datasets using Video-based Partitioning (VP)

Moreover, besides VGG19 being the absolute best with the Face Bounding Box, we also note that Inception-Resnet V2 obtains the best aggregated performance on all the other four datasets. In non-aggregated emotion recognition, Inception-Resnet V2 was best on two out of five datasets.

In Fig. 8a, the confusion matrix for VGG19 on the Face Bounding Box dataset is shown. The aggregated confusion matrix displayed in Fig. 8b supports the previous observations regarding misclassification errors. Specifically, in the VGG19 confusion matrix the highest number of classification errors for a single class occurs with frustration, misclassified as fear in 39.89% of cases. Happiness is misclassified as positive anticipation in 24.60% of cases and, conversely, positive anticipation is misclassified as happiness in 23.13%; relaxation is misclassified as positive anticipation in 23.34%. It is worth noting that these errors remain within the same macro-class and thus contribute positively to the aggregated results for the correct detection of dangerous dog states.

Table 5 presents the key metrics for emotion and danger recognition achieved by VGG19 on the Face Bounding Box dataset, considering both VP and AP. Notably, the F1 score reflects a balance between precision and recall, reaching a performance of up to 85.76% for VP danger classification.

Table 5 Performance evaluation of the VGG19 CNN on the face bounding box for emotion and danger recognition, VP and AP datasets

Considering the comparative results on the AP datasets in Table 6, VGG16 with the Face Bounding Box dataset obtains the best overall accuracy of 0.620 for emotion recognition. Face Bounding Box processing again generally improves performance, for 4 out of 6 CNNs.

Table 6 Emotion Recognition: comparison of CNN accuracy on various datasets using Action-coding Partitioning (AP)

In the aggregated Dangerous/non-dangerous results (see Table 7), the Face Bounding Box achieves the best AP result of 0.808 with VGG19.

Table 7 Danger Recognition: comparison of CNN accuracy on various datasets using Action-coding Partitioning (AP)

However, apart from these best absolute results, the performance increment observed in the AP experiments compared to the corresponding VP ones is generally modest: results are only slightly higher in a few cases and, in some instances, slightly lower.

Regarding computational complexity (in FLOPS) and efficiency (in time), face bounding box detection averages a swift 0.09 s per image, while more complex face and body segmentation tasks are completed in about 3 s per image. In training and testing our network models, execution time scales with parameter size, with AlexNet and ResNetV2 being the most resource-intensive. However, ResNetV2, Xception, and VGG19 offer the best trade-off between accuracy and processing time. In contrast, MobileNet’s lower complexity does not achieve the desired performance, highlighting a critical balance between model efficiency and accuracy. The high accuracy and low complexity of the Face Bounding Box dataset with VGG19 indicate that deep learning can effectively classify dog emotions with this segmentation strategy.

A noteworthy observation is that, while the differences between the partitioning strategies are not particularly relevant, experiments conducted on segmentation reveal that blurring the background leads to an improvement in accuracy (see Fig. 7). For segmentation, blurring improves neural network performance, with peaks for body segmentation of approximately 10% in the best cases (i.e., AlexNet VP and AP, MobileNet AP, VGG19 AP, and VGG16 VP, for emotion recognition), while for face segmentation the best improvement reaches approximately 21% for VP (i.e., AlexNet, danger recognition) and approximately 26% for AP (i.e., AlexNet, danger recognition). This improvement is consistent across most CNN models, regardless of whether VP or AP strategies are used. The significant enhancement obtained by blurring the background supports the hypothesis that the presence of a white background, while effectively eliminating noisy elements surrounding the dog’s body, introduces a bias during training for emotion recognition. Blurring the background helps mitigate this bias, as shown in Fig. 7, leading to improved results.

Fig. 7

VP and AP Emotions and Danger recognition accuracy, face (blue) and body (orange) for segmentation (dashed) and segmentation blurred (plain color) datasets (color figure online)

4.1 Future directions

In this section, we discuss future improvements that directly complement our study’s present achievements, offering a cohesive overview of our impact and of the continued potential in this field. Our work sets a foundational step in applying deep learning to dog emotion recognition, utilizing the Dog Clips dataset [3], a leap toward realism compared to the traditionally simulated datasets used in human emotion studies. The dataset’s rich, uncontrolled environmental contexts provide a valuable, more authentic source for analyzing genuine emotional expressions, highlighting our study’s immediate contribution to the field. The use of segmentation eliminates the side effect of having noisy frames. Addressing bias has been a pivotal aspect of our study, guided by techniques prioritizing dog face detection and analysis. Future enhancements in dataset diversity, particularly by augmenting breeds that might be underrepresented, could further elevate the robustness and inclusivity of our bias-mitigation approach (Fig. 8).

Fig. 8

Confusion matrices using the VGG19 CNN on the faces bounding box dataset with Video-based Partitioning (VP), represented as heatmaps. Rows indicate true emotion labels, while columns represent predicted emotion classes. The intensity of the colors corresponds to the number of samples, with warmer colors indicating higher frequencies and cooler colors indicating lower frequencies of true-predicted label pairs. (a) shows the emotion recognition matrix with a 5x5 layout, and (b) shows the danger recognition matrix with a 2x2 layout (color figure online)

Our analysis currently focuses on static frames, for a more sustainable approach. However, we acknowledge the potential of motion dynamics between frames. Incorporating sequence models, such as attention layers with transformers, [56] could refine emotion recognition accuracy by leveraging the temporal information inherent in video data. The practical implications of our findings, especially in identifying potentially dangerous scenarios in human--dog interactions, lay a solid groundwork for real-time application testing in future studies.

Finally, we would like to highlight that, since emotions and pain share similar neurological mechanisms, further applications of our approach can relate to the research topic of animal pain assessment, by analyzing the expression of pain through the dog’s head and body as we have done for the expression of danger states.

5 Conclusions

Our study focuses on the analysis and recognition of dog emotions and danger states using images from real-life scenarios, rather than the controlled scenarios that are prevalent in the literature.

Having identified the critical challenges in canine emotion recognition, we applied segmentation strategies as key aspects of our methodology. These strategies proved crucial in improving the focus on relevant features and minimizing misclassification due to background noise, thereby contributing to the robustness of our classification results. Among them, the face bounding box and blurring techniques achieved the best performance.

Transfer learning was applied to pre-trained models such as VGG19 and Inception-Resnet V2, which proved able to cope with unstructured environments and with the heterogeneity of breeds and environmental conditions, achieving remarkable results with a peak accuracy of 0.8577 for threat detection.

Our models, together with segmentation strategies, also showed a desirable polarization in the misclassified cases, recording fewer false negatives in the classification of dangerous behaviors. This orientation toward safety is essential for practical applications where the cost of a false negative could be high, highlighting the reliability of the proposed approach for use in scenarios that require cautious and preventative action.

The investigation of the utility of DogFACS labels has shown that, while they do provide some overall improvement in accuracy, these gains are marginal. We can therefore state that the costly integration of automated DogFACS labeling would not be cost-effective.

In conclusion, our results confirm the applicability of transfer learning in complex and diverse settings and highlight the critical role of segmentation strategies in achieving high accuracy in real-world canine behavior analysis. Our methodology proves effective for safety-oriented, real-world practical applications in human--dog interaction research.