1. Introduction
Keeping pets has become increasingly popular in recent years, which has led to a surge in stray dogs resulting from abandonment, loss, and uncontrolled breeding. Stray dogs cause numerous problems, including disease transmission, attacks on humans, the disruption of urban cleanliness, and traffic accidents. Although governments apply trap-neuter-vaccinate-return (TNVR) programs and targeted capture, responding to dog attacks remains time-consuming and labor-intensive. In recent years, many surveillance cameras have been installed in essential areas such as roads, intersections, transfer stations, and other public places. However, these cameras cannot issue warning messages before incidents occur. Modern computer vision technology, which is already widely used for object identification, can analyze camera footage and replace human reporting by alerting emergency services when one or more stray dogs appear to be about to attack. Integrating these technologies to detect and analyze dog behavior saves time and processing power, facilitates the real-time collection of dog information, and enables immediate warning alerts.
From 2014 to 2022, researchers used animal motion tracking and gesture recognition to study animal emotions and improve their emotional well-being. Sofia et al. used computer vision technology to assess animal emotions and recognize pain through a comprehensive analysis of facial and body behavior [1]. Identifying animal emotional behaviors is challenging because animals express internal emotional states subjectively [2]. Researchers have traditionally observed or recorded videos of animal behavior and analyzed the behaviors manually. Automatic facial and body pose analysis has enabled the extensive annotation of human emotional states, but fewer studies have focused on the motor behavior of non-human animals. Animal tracking studies include pose estimation, canine behavior analysis, and animal identification and tracking techniques based on deep learning. Analyzing facial expressions and body behaviors to understand animal emotions presents many challenges, and techniques for recognizing animal emotional states and pain are more complex than those for tracking movement.
Recently, researchers have used computer vision and deep learning techniques for canine emotion recognition. Zhu used indoor static cameras to record dogs’ behavior during locomotion, and the proposed architecture combined pose and raw RGB streams to identify pain in dogs [3]. Franzoni et al. and Boneh et al. used images of dogs captured in experiments that elicited emotional states, with the main target being the detection of emotion on the dog’s face [4,5]. Ferres et al. recognized dog emotions from body poses, using 23 key points on the body and face [6]. The image datasets in these studies were limited to a single dog and required high-resolution, clear images of faces and limbs; that is, research on dog emotion recognition using computer vision and deep learning has mainly relied on high-resolution, clear facial images of a single dog. By contrast, when surveillance cameras are the data source, the emotional state of animals must be judged primarily from physical behavior because of the long camera distance and low video resolution. Past research on human emotion recognition has used text, audio, or video data and various models to achieve high accuracy, with facial expressions or body language analyzed for emotion recognition. However, to our knowledge, no studies have combined dog tracking with emotion recognition, owing to the complexity of dog behavior and a lack of readily available imaging data.
Numerous studies on object detection have been conducted [7,8,9,10,11,12]. In object detection, colors, textures, edges, shapes, spatial relationships, and other features are extracted from the data, and machine learning methods are used to classify objects according to these features. Dalal and Triggs combined the histogram of oriented gradients (HOG) feature extractor with a support vector machine (SVM) classifier to achieve human detection [7]. With the development of deep learning in artificial intelligence, convolutional neural networks (CNNs) have been applied in various deep learning technologies. Deep learning became common in computer vision largely because of the 2012 ImageNet Large-Scale Visual Recognition Challenge [13]. AlexNet, the deep learning network architecture proposed by Alex Krizhevsky [14], heralded the era of the CNN model. Subsequently, the VGG, GoogLeNet, and ResNet architectures, all of which are commonly used in innovative technologies, were developed [15,16,17].
Object tracking refers to following objects across consecutive images: after the objects in each image are detected, they are tracked to determine and analyze their movement trajectories. Pedestrians and cars have been the most commonly tracked objects in previous studies [18,19,20,21,22], and the MeanShift method, the Kalman filter, the particle filter, local steering kernel object texture descriptors, the CamShift method, and the optical flow method have commonly been used for tracking [12,18,19,20,21,22]. Several methods have been developed for CNN-based feature extraction and object tracking in video. For example, simple online and real-time tracking with a deep association metric (DeepSORT) combines information about an object’s position and appearance to achieve high tracking accuracy [23].
In most previous studies on human emotion recognition, human emotions have been classified using traditional methods involving feature extractors and classifiers. Some recent studies have explored CNN models for extracting human features. In 2010, Mikolov et al. used recurrent neural networks (RNNs) to deal effectively with time series problems [24]. Regarding human emotion recognition, Ojala et al. and Gu et al. used the local binary pattern method [25,26] and the Gabor wavelet transform method [27], respectively, to recognize facial expressions. Oyedotun et al. proposed a facial expression recognition CNN model that receives RGB data and depth maps as input [28]. Donahue et al. introduced long-term recurrent convolutional networks, which combine CNNs and long short-term memory (LSTM) models to recognize people in videos [29].
Animals have basic emotions that correspond to distinct emotional states and neural structures in their brains [30]. However, the lack of large datasets makes assessing canine emotional states more challenging than assessing those of humans. Nevertheless, a dog’s physiology, behavior, and cognitive mood can be evaluated [31]. Facial expressions, blink rate, twitching, and yawning are among the essential sources of information for assessing animal stress and emotional states [1,32]. In addition to facial behavior, body posture and movement are associated with affective states and pain-related behaviors [33,34]. Open-field tests, novel-object tests, elevated plus mazes, and qualitative behavioral assessments are used to evaluate animals’ pain, discomfort, and emotional mood [35,36]. In recent years, physical and postural behavior has also been used to assess affective states in dogs and horses [1,37,38].
The present study focused on recognizing the emotions of dogs in videos to identify potentially aggressive dogs and relay warning messages in real time. The proposed system first uses the YOLOv3 architecture to detect dogs and their positions in the input videos. To track the dogs, we modified the sizes of the images input into the DeepSORT model, improved the feature extraction model, trained the model on a dog dataset, and refined the final tracking position of each tracked dog. The modified model is called real-time dog tracking with a deep association metric (DeepDogTrack). Finally, the system categorizes the dogs’ emotional behaviors into three types, namely angry (or aggressive), happy (or excited), and neutral (or general) behaviors, based on manual judgments made by veterinary experts and Customs dog breeders. The dog emotion recognition model proposed in this study is called the long short-term deep features of dog memory networks (LDFDMN) model. This model uses ResNet to extract the features of the dog regions tracked in consecutive images; these features are then input into an LSTM model, which performs the emotion recognition.
The contributions of this study are as follows:
An automated system that integrates an LSTM model with surveillance camera footage is proposed for monitoring dogs’ emotions.
A new model for dog tracking (DeepDogTrack) is developed.
A new model for dog emotion recognition (LDFDMN) is proposed.
The proposed system is evaluated according to the results of experiments conducted using various training data, methods, and types of models.
2. Related Work
2.1. The Processing of SORT
The overall SORT process involves detection, estimation, data association, and the creation and deletion of tracked identities.
Detection: First, Faster R-CNN is used for detection and feature extraction. Because SORT was designed to track pedestrians, other objects are ignored, and only detections that are more than 50% likely to be a pedestrian are considered.
Estimation: The SORT estimation model describes the object’s motion and propagates the target’s state into the next frame. First, the Kalman filter is used to predict, at time T + 1, the target state (including size and position) of an object detected at time T. An object’s state model can be expressed as follows:

$$\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^{\top}$$

where (u, v) represents the coordinates of the object’s center at time T; s and r represent the area and aspect ratio of the object’s bounding box at time T; and $\dot{u}$, $\dot{v}$, and $\dot{s}$ represent the corresponding velocities of the center coordinates and the area. When the object is detected in the next frame, its bounding box is used to update the object’s state. If no detection can be associated with the object, the prediction model is not updated.
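For illustration, the following is a minimal sketch (not the authors' implementation) of a constant-velocity Kalman prediction and update step for the SORT-style state $[u, v, s, r, \dot{u}, \dot{v}, \dot{s}]$, written with NumPy; the noise covariances Q and R are placeholder values.

```python
import numpy as np

# SORT-style state: [u, v, s, r, du, dv, ds]; r (aspect ratio) is assumed constant.
F = np.eye(7)                       # constant-velocity transition matrix
F[0, 4] = F[1, 5] = F[2, 6] = 1.0   # u += du, v += dv, s += ds

H = np.zeros((4, 7))                # measurement model: we observe [u, v, s, r]
H[0, 0] = H[1, 1] = H[2, 2] = H[3, 3] = 1.0

Q = np.eye(7) * 1e-2                # process noise (placeholder values)
R = np.eye(4) * 1e-1                # measurement noise (placeholder values)

def predict(x, P):
    """Propagate the state and its covariance to the next frame."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z):
    """Correct the prediction with a detected bounding box z = [u, v, s, r]."""
    y = z - H @ x                             # innovation
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x + K @ y
    P = (np.eye(7) - K @ H) @ P
    return x, P

# Example: a dog detected at center (100, 50) with area 400 and aspect ratio 0.8.
x = np.array([100.0, 50.0, 400.0, 0.8, 0.0, 0.0, 0.0])
P = np.eye(7)
x, P = predict(x, P)                          # position predicted for the next frame
x, P = update(x, P, np.array([102.0, 51.0, 405.0, 0.8]))
```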
Data association: The detection results are used to update each object’s target state; that is, the bounding box of the object at time T is used to predict the object’s new position at time T + 1. First, the model computes the intersection over union (IOU) between the bounding box predicted for each tracked object and the bounding box of each detected object at time T + 1. Thereafter, the model applies the Hungarian algorithm to this cost matrix for matching, enabling multi-object tracking. When the IOU between a detection and a predicted bounding box is less than the threshold value, the assignment is rejected.
Creation and deletion of tracked identities: When an object enters or leaves the scene, its identity information must be added to or deleted from the system. To prevent erroneous tracking, a newly detected object must be consistently detected for several frames after its first appearance before it is added to the system as a new identity. Furthermore, the IOU of the object between each frame and the next frame is calculated; if its value is less than the threshold value, the object is determined to have left the scene, and its identity information is deleted.
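As a concrete illustration of the association step (a sketch under the assumption of an IOU-based cost, not the authors' code), predicted track boxes can be matched to new detections with SciPy's Hungarian solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IOU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, iou_threshold=0.3):
    """Match Kalman-predicted track boxes to detections; reject weak overlaps."""
    cost = np.array([[1.0 - iou(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    track_idx, det_idx = linear_sum_assignment(cost)     # Hungarian algorithm
    matches = [(t, d) for t, d in zip(track_idx, det_idx)
               if iou(predicted_boxes[t], detected_boxes[d]) >= iou_threshold]
    return matches   # unmatched tracks/detections drive deletion/creation

# Example with two tracks and two detections (boxes as [x1, y1, x2, y2]).
tracks = [[10, 10, 50, 60], [100, 100, 150, 180]]
dets = [[12, 11, 52, 61], [101, 98, 149, 178]]
print(associate(tracks, dets))   # -> [(0, 0), (1, 1)]
```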
2.2. The Processing of DeepSORT
The overall DeepSORT process involves detection, estimation, pedestrian feature extraction, data association, and the creation and deletion of tracked identities.
Detection: The DeepSORT model uses the YOLOv3 architecture for pedestrian detection. Because DeepSORT targets pedestrians, other objects are ignored, and only detections that are more than 50% likely to be a pedestrian are considered.
Estimation: The pedestrian’s motion is described and propagated to the next frame. First, the model uses the Kalman filter to predict, at time T + 1, the state (including size and position) of a pedestrian detected at time T. DeepSORT expresses the pedestrian’s state model as eight values, as follows:

$$\mathbf{x} = [u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h}]^{\top}$$

where (u, v) are the coordinates of the pedestrian’s center, and r and h are the aspect ratio and height of the pedestrian’s bounding box at time T; $\dot{u}$, $\dot{v}$, $\dot{r}$, and $\dot{h}$ are the corresponding velocities. At time T, the Kalman filter predicts the pedestrian’s position at time T + 1, $(\hat{u}, \hat{v}, \hat{r}, \hat{h})$, where $\hat{u}$ and $\hat{v}$ are the predicted center coordinates and $\hat{r}$ and $\hat{h}$ are the predicted aspect ratio and height of the pedestrian at time T + 1. When the pedestrian is detected in the next frame, these values are updated to reflect the pedestrian’s target state. If no pedestrian is detected, the predictive model is not updated.
Pedestrian feature extraction: A trained CNN model, which contains two convolution layers, a max pooling layer, and six residual layers, is used to extract the features of each pedestrian at time T + 1 and outputs a 512-dimensional feature vector. The feature vector of the jth pedestrian at time T + 1 is expressed as $\mathbf{r}_j$.
Data association: The pedestrian region at time T is used to predict the pedestrian’s new position at time T + 1. Thereafter, the Mahalanobis distance between the predicted region of the ith pedestrian and the region of the jth detected pedestrian at time T + 1 is calculated as follows:

$$d^{(1)}(i, j) = (\mathbf{d}_j - \mathbf{y}_i)^{\top} \mathbf{S}_i^{-1} (\mathbf{d}_j - \mathbf{y}_i)$$

First, each detected bounding box is converted into the form $(u, v, r, h)$, where (u, v) represents the coordinates of the pedestrian’s center, r is the aspect ratio of the pedestrian, and h is the height of the pedestrian. Here, $\mathbf{y}_i$ represents the predicted position of the ith pedestrian at time T + 1, $\mathbf{d}_j$ represents the detected location of the jth pedestrian at time T + 1, $\mathbf{S}_i$ is the covariance matrix of the ith pedestrian, and n is the total number of pedestrians at time T + 1. A detection indicator based on the Mahalanobis distance is used to obtain the optimal match; the 95% confidence interval of the χ² distribution is used as the detection threshold, which was 9.4877 in the present study.
The Mahalanobis distance is suitable when motion uncertainty is low, that is, when the pedestrian’s position can be predicted reliably. The state distribution of a pedestrian is predicted from one frame, and the pedestrian’s position in the next frame is obtained using the Kalman filter. This method provides only an approximate position, and the positions of pedestrians that are occluded or moving quickly will not be predicted correctly. Therefore, the model uses a CNN to extract an appearance feature vector for each pedestrian and calculates the cosine distance between the extracted vector and the feature vectors stored for each tracked pedestrian in the system. The minimum cosine distance is represented as follows:

$$d^{(2)}(i, j) = \min \left\{ 1 - \mathbf{r}_j^{\top} \mathbf{r}_k^{(i)} \;\middle|\; \mathbf{r}_k^{(i)} \in R_i \right\}$$

where $\mathbf{r}_j$ is the feature vector of the jth detected pedestrian and $R_i$ is the set of feature vectors stored for the ith tracked pedestrian.
Finally, the position and appearance features of the pedestrian are matched and fused. The fused cost matrix $c_{i,j}$ is expressed as follows:

$$c_{i,j} = \lambda \, d^{(1)}(i, j) + (1 - \lambda) \, d^{(2)}(i, j)$$

where λ is a weight. Because shooting with a nonfixed camera may cause the image to shake violently, λ is set to 0 in that case so that only the appearance features are used. Setting λ appropriately also mitigates the problem of occluded pedestrians and reduces identity switching (IDSW) during tracking.
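The following sketch illustrates how the two distances can be fused and matched in practice; it is a simplified illustration under our own data-layout assumptions (dictionaries with "state", "cov_inv", "features", and "feature" keys) and is not the DeepSORT reference code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import mahalanobis

CHI2_GATE = 9.4877   # 95% chi-square threshold for a 4-dimensional state

def fused_cost(tracks, detections, lam=0.0):
    """tracks: dicts with 'state' (u, v, r, h), 'cov_inv', and 'features'
    (a gallery of L2-normalized appearance vectors).
    detections: dicts with 'state' and a single L2-normalized 'feature'."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            # Motion term: squared Mahalanobis distance between predicted and detected state.
            d1 = mahalanobis(det["state"], trk["state"], trk["cov_inv"]) ** 2
            # Appearance term: minimum cosine distance to the track's feature gallery.
            d2 = min(1.0 - float(det["feature"] @ f) for f in trk["features"])
            c = lam * d1 + (1.0 - lam) * d2
            if d1 > CHI2_GATE:          # gate out physically implausible matches
                c = 1e6
            cost[i, j] = c
    return cost

def match(tracks, detections, lam=0.0):
    cost = fused_cost(tracks, detections, lam)
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
```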
The creation and deletion of tracked identities are handled in the same way as in SORT.
2.3. LSTM Model
In traditional neural networks, each neuron is independent and unaffected by the time series. In RNNs, time series data are used as input [24]. However, inputs from earlier time steps exert weaker effects on an RNN’s later outputs than recent inputs do, and when the sequence is too long, the gradient vanishes or explodes. To address this problem, Hochreiter and Schmidhuber proposed the LSTM model in 1997 [39]. An LSTM model comprises numerous LSTM cells, each having three inputs, three gates, and two outputs. The three inputs are the input $x_t$ at time t, the output $h_{t-1}$ at time t − 1, and the long-term memory (LTM) $C_{t-1}$ at time t − 1. The three gates are the input gate $i_t$, the output gate $o_t$, and the forget gate $f_t$. All three gates use sigmoid activation functions to obtain an output value between 0 and 1, simulating the opening and closing of a valve. The input gate uses the input $x_t$ at time t and the output $h_{t-1}$ at time t − 1 to determine whether the LTM $C_t$ should incorporate the candidate memory $\tilde{C}_t$ generated at time t. The output gate determines whether the LTM $C_t$ generated at time t should be output, according to the input $x_t$ at time t and the output $h_{t-1}$ at time t − 1. The forget gate uses the input $x_t$ at time t and the output $h_{t-1}$ at time t − 1 to determine whether the LTM $C_{t-1}$ at time t − 1 should be carried over into the LTM $C_t$ at time t. The two outputs of the LSTM cell are the output $h_t$ and the LTM $C_t$ at time t. The LSTM model has one more output ($C_t$, the LTM) than ordinary RNNs do, which enables it to avoid the gradient problems caused by long time series in ordinary RNNs.
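For reference, the standard LSTM cell update can be written as follows (the common textbook formulation, not a modification specific to this study), where $W$, $U$, and $b$ are learned weights and biases, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{C}_t &= \tanh(W_C x_t + U_C h_{t-1} + b_C), \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\
h_t &= o_t \odot \tanh(C_t).
\end{aligned}$$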
3. Proposed System
This study automatically detects dogs’ movements in surveillance video to predict their emotions. Therefore, the surveillance video must first be converted into a sequence of consecutive images; the dogs in each image are then detected, each dog’s position is tracked across images, and emotional predictions are made from the dog’s movements in the surveillance video.
The proposed system combines CNNs with a deep association metric and RNN technologies to detect, track, and recognize the emotions of dogs. The system process is illustrated in Figure 1. First, dogs in each frame of the input video are detected; then, each dog is tracked; and finally, each dog’s behavior is analyzed to determine which emotion is being expressed. The dogs’ emotions are categorized into three types: angry (or aggressive), happy (or excited), and neutral (or general). The methods used for dog detection, tracking, and emotion recognition are described in the following sections.
3.1. Dog Detection
The first step of object detection is image feature extraction. Originally, suitable filters were designed to extract various features manually. However, since the rise of deep learning, CNNs have commonly been used to extract features automatically, and experiments have revealed that CNN-based object detection methods are highly accurate. Therefore, the system described herein uses the CNN-based YOLOv3 object detection algorithm [40] for dog detection. In addition to using Darknet-53 to extract features, YOLOv3 uses feature pyramid network technology to address the inability of YOLOv2 to detect small objects. YOLOv3 first divides the input image into 13 × 13, 26 × 26, and 52 × 52 grid cells. YOLOv3 is pretrained on Microsoft’s Common Objects in Context (MSCOCO) image dataset [41], which contains 80 object classes, and generates (13 × 13 + 26 × 26 + 52 × 52) tensors of prediction results. Because many overlapping bounding boxes may be obtained, the model applies non-maximum suppression (NMS), and the most reliable, unique bounding box is taken as the predicted result of object detection. The dog detection process is illustrated in Figure 2.
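As an illustration, dog detection with a pretrained YOLOv3 model can be sketched as follows using OpenCV's DNN module; the config/weight file paths are placeholders, class index 16 assumes the standard 80-class MSCOCO ordering, and this is not the authors' exact pipeline.

```python
import cv2
import numpy as np

DOG_CLASS_ID = 16          # "dog" in the standard 80-class MSCOCO label list
CONF_THRESHOLD = 0.5
NMS_THRESHOLD = 0.4

# Placeholder paths to the standard YOLOv3 config and pretrained MSCOCO weights.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

def detect_dogs(frame):
    """Return a list of (x, y, w, h, confidence) boxes for detected dogs."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes, confidences = [], []
    for output in outputs:                       # one output per detection scale
        for det in output:                       # det = [cx, cy, bw, bh, obj, 80 class scores]
            scores = det[5:]
            if np.argmax(scores) == DOG_CLASS_ID and scores[DOG_CLASS_ID] > CONF_THRESHOLD:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(float(scores[DOG_CLASS_ID]))

    keep = cv2.dnn.NMSBoxes(boxes, confidences, CONF_THRESHOLD, NMS_THRESHOLD)
    return [(*boxes[i], confidences[i]) for i in np.array(keep).flatten()]
```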
3.2. Dog Feature Extraction
The model uses a ResNet CNN to extract the features of each dog from the sub-images of all the dogs and the Mask R-CNN architecture to remove the backgrounds of the sub-images (Figure 3).
Dog feature extraction: ResNet uses shortcut connections to overcome the learning bottleneck of stacking many layers. This method retains the input feature map before convolution; after the input feature map passes through two convolution layers, a ReLU function, and a third convolution layer, the output feature map is combined with the retained feature map, preserving the pre-convolution features.
Background removal: The proposed system uses the Mask R-CNN architecture to remove the backgrounds of the dog images [42]. Mask R-CNN adds a fully convolutional mask-prediction branch to the Faster R-CNN architecture so that it can solve both object detection and segmentation problems [43]. Faster R-CNN outputs the classification and coordinate offset of a predicted object, and Mask R-CNN additionally classifies each pixel in the predicted region as foreground or background, as illustrated in Figure 4.
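A minimal sketch of these two steps with torchvision is shown below; the pretrained-weight arguments and the COCO label index for "dog" are assumptions that may vary across torchvision versions, and this is not the authors' implementation.

```python
import torch
import torchvision
from torchvision import transforms

DOG_LABEL = 18   # "dog" in torchvision's 91-class COCO category list (assumption)

# Feature extractor: ResNet-50 with the final classification layer removed.
resnet = torchvision.models.resnet50(weights="DEFAULT")
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# Instance segmentation model used to mask out the background.
maskrcnn = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

to_tensor = transforms.ToTensor()

@torch.no_grad()
def remove_background_and_extract(dog_crop):
    """dog_crop: a PIL image of one detected dog region.
    Returns a 2048-dimensional feature vector of the background-masked crop."""
    img = to_tensor(dog_crop)
    out = maskrcnn([img])[0]
    keep = (out["labels"] == DOG_LABEL) & (out["scores"] > 0.5)
    if keep.any():
        mask = (out["masks"][keep][0, 0] > 0.5).float()   # soft mask -> binary mask
        img = img * mask                                   # zero out the background
    resized = torch.nn.functional.interpolate(img.unsqueeze(0), size=(224, 224),
                                              mode="bilinear", align_corners=False)
    feat = feature_extractor(resized)                      # shape: (1, 2048, 1, 1)
    return feat.flatten()
```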
3.3. Dog Tracking
After a dog is detected, it is tracked to determine its movement trajectory. The dog-tracking system identifies the position of the same dog in consecutive images and connects these positions to form an action path. The system uses the DeepDogTrack model for dog tracking. In addition to using a Kalman filter to predict the dog’s position in the next frame, the model uses a CNN to extract and match the dog’s features in consecutive frames to determine the dog’s motion status. DeepDogTrack is an improved version of the DeepSORT pedestrian tracking model. DeepSORT integrates simple online and real-time tracking (SORT) [44] with CNN technology to extract and match each pedestrian’s features and analyzes each pedestrian’s location and appearance information to achieve accurate tracking. To reduce the system’s computation time and improve the accuracy of dog tracking, the system adopts our novel DeepDogTrack model, which improves the processing flow and adjusts the parameters.
3.3.1. SORT and DeepSORT
SORT is a practical multi-object tracking method that can effectively track objects in consecutive frames. The SORT model uses Faster R-CNN and a Kalman filter to detect an object’s position and to predict the object’s position in the next frame, respectively. Thereafter, the model calculates the IOU-based distance between an object’s detected location and its predicted location in the next frame and uses the Hungarian algorithm [45] for matching to enable multi-object tracking. Therefore, the overall SORT process involves detection, estimation, data association, and the creation and deletion of tracked identities.
Although SORT is a simple and effective multi-object tracking method, it compares only the size and position of a predicted object and does not consider the object’s appearance features. To address this limitation, the proposed system builds on DeepSORT, which improves upon SORT’s association method by accounting for the object’s features, thus enhancing the accuracy of object tracking. DeepSORT applies SORT’s object tracking to pedestrian tracking. It is based on SORT’s multiple-object-tracking (MOT) architecture and uses the Kalman filter to predict a given pedestrian’s position in the next frame. The model calculates the Mahalanobis distance between the predicted pedestrian region and the regions in which other pedestrians may be located. Thereafter, a CNN is used to extract pedestrian features, and the minimum cosine distance between a pedestrian’s features and the features of all the pedestrians in the next frame is calculated. Finally, the Hungarian algorithm is used for matching to enable multi-pedestrian tracking. Accordingly, DeepSORT involves detection, estimation, feature extraction, data association, and the creation and deletion of tracked identities.
3.3.2. Real-Time Dog Tracking with a Deep Association Metric (DeepDogTrack)
Because DeepSORT is typically used to track pedestrians, whose body proportions fit a fixed 64 × 128 input, the input must be a fixed-size image. Appearance features are extracted using a simple CNN model, and the result predicted using the Kalman filter is used as the tracking region of the object. However, the proportions of dogs differ from those of humans. To adapt DeepSORT for tracking dogs and to improve computational efficiency, the DeepDogTrack model takes the detected dog region as input data, and the size of the region is not fixed. To increase the depth of the model and minimize error, a deep residual network (ResNet) is used to extract the dogs’ features, and the model was retrained using the dog dataset to improve its tracking accuracy. The architecture of the proposed DeepDogTrack dog-tracking model is illustrated in Figure 5. The original and improved results are presented in Figure 6.
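For example, because the dog regions are not a fixed size, they can be letterboxed (resized with aspect-ratio-preserving padding) before being passed to the ResNet feature extractor; the target size of 224 × 224 below is an illustrative assumption, not the authors' setting.

```python
import numpy as np
import cv2

def letterbox(crop, size=224):
    """Resize a variable-size dog crop to size x size while preserving its aspect
    ratio, padding the remainder with zeros (unlike DeepSORT's fixed 64 x 128 input)."""
    h, w = crop.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(crop, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((size, size, 3), dtype=crop.dtype)
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas

# Example: a 300 x 120 dog crop becomes a centered 224 x 224 image.
dog_crop = np.zeros((300, 120, 3), dtype=np.uint8)
print(letterbox(dog_crop).shape)   # (224, 224, 3)
```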
3.4. Dog Emotion Recognition
For automatic dog emotion recognition, this study first defines the emotional types of dogs and then proposes a deep learning method for predicting dog emotions.
3.4.1. The Emotions of the Dogs
Dogs go through their developmental stages faster than humans and possess their full emotional range by the age of four to six months (depending on how quickly their breed matures). However, the range of emotions in dogs does not exceed that of a human child aged approximately two to two and a half years. Dogs have all the basic emotions, namely joy, fear, anger, and disgust [46,47,48], and even love; however, based on current research, dogs do not appear to have more complex emotions such as guilt, pride, and shame [46]. Therefore, the emotions a dog experiences can be determined through its body language. A dog’s emotional state is primarily determined from facial behavior, physical behavior, or a combination of the two. However, the data source of this study is surveillance cameras, which record dogs from a long distance and at low resolution. Therefore, the dogs’ emotional states in this study were determined mainly from physical behavior. In addition, because distinguishing fear, anger, and disgust requires subtle facial features, these emotions are uniformly grouped as angry (or aggressive). The proposed model adopts the basic human emotions of anger (or aggression) and happiness (or excitement) [49], but these two emotions correspond to relatively extreme behaviors. To strengthen the evaluation of canine emotional types, a third emotion based on the dog’s physical behavior, called neutral (or general), is included in this study.

Therefore, the emotions of the dogs in this study are categorized into three types, namely angry (or aggressive), happy (or excited), and neutral (or general), according to the manual judgment of veterinary experts and Customs dog breeders. The descriptions and characteristics of the three emotional types of dogs are shown in Table 1.
3.4.2. The Dog Emotion Recognition Model
The dog emotion recognition model proposed herein is the LDFDMN model. After a dog is detected, the dog region and the dog’s features are extracted using the ResNet model. Thereafter, these continuous and time-series-associated features are transmitted to the LSTM model for processing, and the time series output results are obtained. Dog emotion recognition is based on dogs’ continuous behaviors; analyzing these behaviors is therefore essential to the proposed system, and the RNN and LSTM models used to do so are described as follows.
LDFDMN Model
In the proposed system, a ResNet CNN and the DeepDogTrack model are used to extract features from and to track the dog regions, respectively. The tracked dog region is converted into an image set, as illustrated in Figure 7. Each image set depicts the continuous movement of one dog and is used as a dataset for dog emotion recognition. If an image set comprises fewer than 16 images, it is deleted; if it exceeds 16 images, it is trimmed to 16 images. Thereafter, the image set is input into the LDFDMN model, and the dog emotion recognition results are obtained. The architecture of the LDFDMN model is illustrated in Figure 8.
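A minimal sketch of an LDFDMN-style classifier is given below (our own simplified layer sizes and hyperparameters, not the authors' exact architecture): ResNet features for each of the 16 tracked frames are fed to an LSTM, and the final hidden state is mapped to the three emotion classes.

```python
import torch
import torch.nn as nn

class DogEmotionLSTM(nn.Module):
    """Classifies a sequence of per-frame ResNet features into 3 emotions:
    angry (aggressive), happy (excited), or neutral (general)."""

    def __init__(self, feature_dim=2048, hidden_dim=256, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, features):
        # features: (batch, 16, feature_dim) -- one ResNet vector per tracked frame
        _, (h_n, _) = self.lstm(features)       # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])         # logits: (batch, 3)

# Example: a batch of 4 image sets, each with 16 frames of 2048-d ResNet features.
model = DogEmotionLSTM()
logits = model(torch.randn(4, 16, 2048))
predicted_emotion = logits.argmax(dim=1)        # 0 = angry, 1 = happy, 2 = neutral (illustrative)
```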
Dog Emotion Recognition after Background Removal
Each detected dog region also includes non-dog areas, that is, background. If the background area is larger than the dog area, the extracted dog features are affected, reducing the dog emotion recognition rate. Therefore, the proposed model uses a Mask R-CNN model to remove backgrounds from the image set before dog tracking and emotion recognition are performed by DeepDogTrack and the LDFDMN model, respectively.
Video Preprocessing
In this study, we trained the LDFDMN model by using videos collected from YouTube, the Folk Stray Dog Shelter, and the Dog Training Center (hereafter, DTC) of the Customs Administration of Taiwan’s Ministry of Finance. The input data of the LDFDMN model must be a fixed-length feature vector, but the lengths of the videos collected for this study differed, and multiple dogs may have been present in each video. Therefore, each video was divided into multiple sub-images, each of which was resized to 360 × 360 pixels. Sub-images of the same dog were used to create experimental videos in order to analyze the dog’s emotions.
Although the backgrounds of the dog regions are supposed to be removed by the Mask R-CNN before tracking, some sub-images may depict the background instead of the dog because of classification errors, resulting in image sets with fewer than 16 continuous sub-images. To address this problem, the Farneback optical flow method is applied [50], and the missing sub-images in each image set are generated by linear interpolation according to the optical flow values, yielding 16 sub-images per set. The results of the linear interpolation of an image are presented in Figure 9: the optical flow information of the images at times t(0) and t(1) is used to produce a linear interpolation of the image at an intermediate time between t(0) and t(1).
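A sketch of this interpolation step with OpenCV's Farneback implementation is shown below; the flow parameters and the simple backward-warping scheme are illustrative choices, not necessarily those used in the paper.

```python
import cv2
import numpy as np

def interpolate_midframe(frame0, frame1, alpha=0.5):
    """Synthesize an intermediate frame between frame0 (t = 0) and frame1 (t = 1)
    at time t = alpha by warping frame0 along the Farneback optical flow."""
    gray0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)

    # Dense flow from frame0 to frame1 (parameters are typical defaults).
    flow = cv2.calcOpticalFlowFarneback(gray0, gray1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    h, w = gray0.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample frame0 at positions displaced by a fraction (alpha) of the flow.
    map_x = (grid_x - alpha * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - alpha * flow[..., 1]).astype(np.float32)
    return cv2.remap(frame0, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Example with two synthetic frames: a bright square shifted by 5 pixels.
f0 = np.zeros((120, 160, 3), dtype=np.uint8)
f0[40:80, 40:80] = 255
f1 = np.roll(f0, 5, axis=1)
mid = interpolate_midframe(f0, f1)   # approximate frame at t = 0.5
```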
3.5. Dog Emotion Recognition in Surveillance Videos
The proposed system was tested using three dog-tracking methods (DeepSORT, DeepSORT_retrained [a version of the DeepSORT model retrained using the dog dataset], and DeepDogTrack) and two dog emotion recognition methods (sub-images with and without backgrounds). The methods were combined into six models, as listed in Table 2.
5. Conclusions
The primary purpose of this study was to develop a multi-CNN model for dog detection, tracking, and emotion recognition. The dog detection model was trained using the MSCOCO dataset, and the dog tracking and emotion recognition models were trained using videos collected from YouTube, the Folk Stray Dog Shelter, and the DTC. In the dog detection experiment, the detection rates for the TestSet1 and TestSet2 datasets were 97.59% and 95.93%, respectively. The causes of detection errors were obscured facial features, special breeds of dogs, obscured or cropped bodies, and incomplete regions. The effects of these factors can be minimized by reducing the number of object types, increasing the number of dog samples in the training dataset, and making the ground-truth regions more precise. In the dog-tracking experiment, the MOTA values for videos of a single dog and of multiple dogs were as high as 93.02% and 86.45%, respectively. Tracking failures occurred in cases where large parts of the dog’s body were obscured. In the dog emotion recognition experiments, the identification accuracy rates for the two datasets were 81.73% and 76.02%, respectively. The results of the emotion recognition experiment indicate that removing the backgrounds of dog images negatively affects identification accuracy. Furthermore, happy and neutral emotions are similar and therefore difficult to distinguish. In other error cases, the dog’s movements were not apparent, the image was blurred, the shooting angle was suboptimal, or the image resolution was too low. Nevertheless, the experimental results indicate that the method proposed in this paper can correctly recognize the emotions of dogs in videos. The accuracy of the proposed system can be further increased by using more images and videos to train the detection, tracking, and emotion recognition models presented herein. The system can then be applied in real-world contexts to assist in the early identification of dogs that exhibit aggressive behavior.
Research on automatic face and emotion recognition for humans has developed rapidly and matured, and many datasets have been collected. However, because dogs are not easy to control, few datasets exist for dog tracking and emotion recognition. Therefore, to improve the accuracy of tracking and emotion recognition, many more dog-tracking and dog-emotion-recognition datasets will need to be collected in the future.