1. Introduction
Pigs are critical to the global livestock industry and a significant source of protein [
1]. Pork consumption is projected to grow at an average annual rate of 1.2% from 2018 to 2027 [
2], underscoring the importance of health and welfare management for pigs. Early detection of health and welfare issues in pigs is essential for maintaining productivity and preventing the spread of diseases. However, it is impractical to manually monitor subtle behavioral changes in commercial-scale pig farms over extended periods [
3], highlighting the need for efficient and automated health monitoring systems. Such systems are essential to enhance productivity while managing pigs’ welfare and health conditions. With increasing demands for automation and efficiency in the livestock industry, automated health monitoring systems are expected to contribute to real-time management and improved productivity in pig farming [
4].
Models with fast detection speeds and high computational efficiency are required to monitor pigs’ health status in real time using cameras installed in individual pens within commercial pig farms. However, the corrosive environment caused by ammonia emissions in pig farms limits the use of high-performance computing resources. In this context, convolutional neural network (CNN)-based models are suitable for object detection tasks. Among these, You Only Look Once (YOLO) [
5], with its CNN-based end-to-end structure, predicts both object locations and classes simultaneously, offering faster processing than Region-Based Convolutional Neural Networks (R-CNNs) [
6]. While R-CNNs use a step-by-step approach with region proposals to achieve high detection accuracy, their computational demands make them unsuitable for real-time processing. Similarly, transformer-based models [
7], leveraging self-attention mechanisms to learn global relationships, provide higher detection accuracy than CNN-based models but require high-performance computing resources, limiting their use in real-time commercial pig farm settings. Given the need for computational efficiency and real-time processing, YOLO is well suited for automated health monitoring systems.
However, these object detection models rely on red, green, and blue (RGB) images, which can experience reduced detection accuracy in environments with complex lighting conditions or significant background variability [
8]. For instance, in top-view footage captured by fixed cameras in the same pen, lighting variations or natural light can cause differences in brightness depending on object positions. Additionally, materials like flooring within the Region of Interest (RoI) may ambiguously blend with objects. Furthermore, these models, trained on seen environments, exhibit limited generalization capabilities and lower detection accuracy in previously unseen environments [
9].
Figure 1 illustrates these challenges, highlighting the limitations of object detection in unseen environments and the reduced detection accuracy of the YOLOv11n [
10] detector in such scenarios.
Figure 1a is a screenshot captured from [
11], whereas Figure 1d is a screenshot captured from [
12]. In this study, the YOLO detector is applied to both the training data and the unseen environments represented by Figure 1a,d. The results in Figure 1b,e, with green boxes marking the detected objects, demonstrate low detection accuracy in these unseen environments. Upon analyzing the causes, it is found that the training utilized the German pig dataset [
13]. This dataset, collected in German environments, has the characteristic that the floor is darker than the pigs in the foreground. However, this characteristic leads to detection errors when the floor’s brightness is similar to or brighter than that of the pigs, as observed in Figure 1b. Furthermore, similar errors occur in Figure 1e due to brightness variations in objects or walls within pigpens that differ from the training environment.
To overcome these limitations, recently developed foundation models have emerged as transformative tools for object detection and segmentation. These models, trained on extensive datasets, demonstrate high adaptability to diverse tasks and environments. The Segment Anything Model (SAM) [
14] offers robust segmentation performance with minimal user input, effectively separating foreground and background across varied conditions. Large Mask Inpainting (LaMa) [
15] enhances masked areas by leveraging surrounding pixels, achieving high performance in background restoration tasks. Additionally, Depth Anything [
16] uses depth information to accurately distinguish objects’ physical positions from the background, maintaining stable performance even in challenging lighting conditions or complex facility layouts. Analyzing the depth information in Figure 1c,f makes it evident that leveraging depth-based techniques could provide a promising solution to the aforementioned errors.
To address these issues, this study proposes a method for generating Depth-Oriented Gray (DOG) images using foundation models such as the SAM, LaMa, and Depth Anything. This approach establishes the RoI in unseen environments and effectively generalizes the brightness of objects and backgrounds in test images. The proposed method improves detection accuracy by accurately distinguishing between foreground and background, even in complex environments. Foundation models are transformer-based architectures that require significant processing time and high-performance computing resources. However, the proposed method achieves real-time processing by utilizing foundation models only during the initialization phase, significantly enhancing both detection accuracy and speed.
The contributions of this study are as follows:
This study proposes a new method that utilizes foundation models, including the SAM, LaMa, and Depth Anything, to accurately separate the foreground and background based on depth information in pigpens. The method effectively generates depth background images and establishes the RoI, even in test environments with diverse lighting conditions and high background complexity.
The proposed method generates DOG images by combining HSV-Value, inverted HSV-Saturation images, and the generated depth background images. This approach is designed to maintain high detection and generalization performance, leveraging depth information to address the accuracy degradation issues observed in conventional detection models when applied to unseen test environments.
This study proposes a cost-effective approach to utilizing depth information without requiring additional depth sensors by leveraging the Depth Anything model. In unseen test environments, depth background images are generated only once using GPU-based Depth Anything. Subsequently, input images are processed in real time using CPU-based DOG image generation. This method enables operation on systems equipped with low-cost CPUs and GPUs, significantly reducing system setup costs.
2. Related Work
Research on object detection and tracking in various environments has been actively conducted alongside advancements in deep learning technologies, which are increasingly applicable to video monitoring systems. Video monitoring enables real-time detection and analysis of object states, which is pivotal in optimizing management and automation across diverse fields such as agriculture, logistics, and security.
However, complex environmental factors, lighting changes, and overlaps with structures remain significant challenges that can reduce object detection accuracy. These issues highlight the limitations of traditional deep learning models, which often fail to ensure generalization performance in environments different from those they were trained on. As a result, frequent object detection errors undermine the reliability and utility of video monitoring systems [17,18,19].
Existing studies have proposed various approaches to improve object detection accuracy and address these challenges. One prominent approach involves attention mechanisms that guide detection models to focus on relevant regions in an image. For instance, methods that define the Region of Interest (RoI) and Regions of Uninterest (RoUs) help detection models exclude irrelevant background information and focus on key objects [20,21,22]. By allowing models to concentrate on foreground areas during detection, RoI settings effectively enhance detection accuracy. Additionally, segmentation-based approaches that separate the foreground from the background further assist detection models in focusing on object features, offering another effective strategy [23,24,25]. These attention-based methods improve object detection performance in complex environments and demonstrate broad applicability across various domains.
Nevertheless, external environmental factors, such as lighting-induced imbalances, continue to pose significant challenges even for these approaches. To overcome such issues, numerous experiments incorporating various preprocessing and postprocessing techniques have been proposed [26,27,28,29,30]. For example, image transformations emphasize the differences between the foreground and background, enhance image quality, or make key features more distinguishable. These techniques mitigate problems caused by lighting variations or noise in complex environments, contributing to improved detection accuracy. Enhancing contrast and emphasizing key features facilitates clear foreground–background separation, even under diverse lighting conditions or complex backgrounds.
Finally, compared to optical cameras, depth cameras offer the advantage of more effectively distinguishing between the foreground and background [31,32,33,34]. Depth cameras utilize the Time of Flight (ToF) principle, emitting signals and measuring the return time after interacting with objects to generate depth information [35]. This capability positions depth cameras as valuable tools for addressing challenges posed by lighting variations and complex backgrounds. However, despite their advantages, such as low cost, fast processing, and high reliability, depth cameras have limitations, including missing information due to installation location or environmental conditions and additional noise from indoor or outdoor variability. The accuracy of depth information collected by cameras can vary significantly depending on lighting conditions and structural factors in the environment. Therefore, optimizing installation locations is crucial to fully leverage the benefits of depth cameras while overcoming these environmental constraints [36].
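For reference, the ToF principle converts the measured round-trip time of the emitted signal into distance: with propagation speed c (the speed of light for optical ToF sensors) and measured round-trip time Δt, the depth of a surface point is

```latex
d = \frac{c \,\Delta t}{2}
```

so any timing noise introduced by the environment translates directly into depth noise.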
Table 1 summarizes the results of a search using the keywords “pig”, “depth”, and “detection”. This study represents the first research to improve object detection model performance by utilizing depth information to establish the RoI and generate generalized images in unseen environments. In contrast to previous studies, this research proposes an approach that leverages the foundation model Depth Anything to utilize depth information without the additional costs associated with purchasing and installing depth sensors.
Furthermore, this study overcomes the limitations of traditional depth sensor-based detection techniques by introducing a new method that employs foundation models such as the SAM, LaMa, and Depth Anything. Specifically, the proposed approach uses foundation models in the test pigpen environment during the initialization process, utilizing GPUs only once. After initialization, real-time DOG images are generated through simple CPU-based image processing. This practical method addresses the generalization challenges of test images in unseen environments while enhancing detection models’ accuracy and efficiency.
3. Materials and Methods
This study utilizes recently introduced deep learning-based foundation models, including the SAM, LaMa, and Depth Anything. The SAM is a deep learning-based object segmentation model trained on large-scale datasets, capable of generating foreground object masks based on user inputs (e.g., the entire image, bounding boxes, or object points). These masks are passed to LaMa, a deep learning model that seamlessly restores masked regions using surrounding pixel information. LaMa generates natural background images, even for complex or large masked areas. The generated background images are then provided to Depth Anything, which calculates the depth information for each pixel to produce depth background images. Depth Anything estimates depth based on the physical position and distance of objects, creating depth images that remain unaffected by the facilities or lighting conditions of the pigpen environment. These depth background images effectively establish the RoI.
We propose a new DOG image generation method, which involves converting RGB colors to the hue, saturation, value (HSV) color space and leveraging HSV-Value and HSV-Saturation information. In the HSV-Saturation space, white and black are defined as achromatic colors, and since most pigs have colors based on white or black, their HSV-Saturation values are close to zero. This consistent representation of pigs’ brightness in HSV-Saturation images, irrespective of their positions in the RGB image, allows the effective separation of foreground pixels that are otherwise difficult to distinguish using color information alone. However, HSV-Hue, which represents pure color information as an angle, was excluded because it is sensitive to changes in object position and lighting conditions, making it less reliable. This HSV-based approach helps address issues arising from complex lighting conditions and color similarities between objects and the background.
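As a minimal illustration of this color-space reasoning (OpenCV is assumed, the file name is a placeholder, and the variable names are illustrative), the saturation channel can be split off and inverted so that achromatic pigs receive consistently high pixel values:

```python
import cv2

# Read a frame from the pen camera (placeholder path); OpenCV loads it as BGR.
frame = cv2.imread("pigpen_frame.png")

# Convert to HSV and separate the channels.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
h, s, v = cv2.split(hsv)

# White/black pigs are achromatic, so their saturation is near zero;
# inverting the channel maps them to values near 255 regardless of lighting.
s_inv = cv2.bitwise_not(s)
```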
For the HSV-Value and depth background images used in the image generation process, we apply Contrast-Limited Adaptive Histogram Equalization (CLAHE) [
46]. CLAHE enhances contrast by uniformly distributing pixel values and adjusts histograms locally to overcome the limitations of global histogram equalization. This technique prevents noise amplification while effectively improving contrast. By applying CLAHE to both HSV-Value images and depth background images, the histograms are smoothed, and object shapes and boundaries become more distinct. Consequently, boundaries between objects and backgrounds, which are challenging to distinguish using HSV-Value images alone, can be effectively separated by combining these images with depth-based images for accurate RoI determination.
Figure 2 illustrates the entire process proposed in this study. Given an input image, the first step is the RoI Setting Module, executed only once to generate the depth images of the input scene. The generated depth information establishes RoIs that effectively separate the foreground and background in a top-view fixed camera environment. Next, the DOG Generation Module takes the initialized depth background image and established RoIs from the RoI Setting Module as inputs to generate DOG images. This module aims to utilize simple image processing techniques to effectively distinguish RoUs and RoIs based on the initialized RoI information from the test pigpen and to create images emphasizing the differences between the foreground and background pixels within the RoI region. The green boxes in the figure represent the detected objects. This approach achieves generalization for test images in unseen environments, enhances object detection model performance, and enables real-time detection.
3.1. RoI Setting Module
The method for setting the RoI to enhance detection accuracy is illustrated in
Figure 3. First, the SAM estimates the probability that each pixel in the input image belongs to the user-specified object region. The resulting mask, which separates the foreground object, is then passed to LaMa. LaMa, an inpainting model specialized in filling large masked areas, removes the masked foreground region and restores the remaining background. The background image generated by LaMa is subsequently provided to Depth Anything, which predicts a depth value for each pixel to create a depth background image. Unlike RGB-based information, this depth information remains unaffected by facility layouts, lighting conditions, and floor materials, making it particularly effective in environments where RGB data alone are insufficient.
Although the inpainting process may leave shadows in areas where the object has been removed, Depth Anything recognizes those shadows as non-object shadows, thereby accurately generating the depth background. Afterward, CLAHE is applied to the depth background image to enhance contrast, followed by the Otsu [
47] thresholding algorithm to finalize the RoI.
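The initialization described above can be sketched as follows. The foundation-model calls are shown as assumed wrapper functions (run_sam, run_lama, and run_depth_anything are placeholders, not library APIs), while the CLAHE and Otsu steps use standard OpenCV operations; the CLAHE parameters follow those reported for the DOG Generation Module and are assumed here:

```python
import cv2

def build_roi(first_frame, run_sam, run_lama, run_depth_anything):
    """One-off RoI initialization for a fixed top-view camera (sketch)."""
    mask = run_sam(first_frame)                  # foreground (pig) mask from the SAM
    background = run_lama(first_frame, mask)     # inpaint the pigs away with LaMa
    depth_bg = run_depth_anything(background)    # assumed to return an 8-bit depth map of the empty pen

    # Contrast enhancement of the depth background before thresholding.
    clahe = cv2.createCLAHE(clipLimit=4, tileGridSize=(4, 4))
    depth_bg = clahe.apply(depth_bg)

    # Otsu thresholding separates the pen floor (RoI) from surrounding structures.
    _, roi = cv2.threshold(depth_bg, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return roi, depth_bg
```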
3.2. DOG Generation Module
The process for generating DOG images, aimed at generalizing test images and improving the accuracy of object detection models, is illustrated in
Figure 4. The proposed DOG Generation Module leverages the depth background images produced during the preprocessing stage to achieve real-time depth-like image generation. Initially, the pixel values of both the grayscale image and the depth background image are summed. This step mitigates the brightness inconsistencies caused by lighting variations and environmental factors, enabling the enhanced representation of object contours and details.
Subsequently, CLAHE is applied with clipLimit = 4 and tileGridSize = (4, 4) to enhance contrast. clipLimit = 4 effectively prevents excessive noise amplification, preserving the integrity of the image during processing, while tileGridSize = (4, 4) emphasizes differences in smaller regions, allowing precise delineation of the boundaries between the floor and pigpen walls as well as the edges of objects. This contrast enhancement improves the distinction between the foreground and background, allowing the object detection model to effectively identify the RoI. After CLAHE is applied, the histograms of the grayscale image and the depth background image are equalized, and the two images are combined using equal weights. Equal weights are used to maintain a balanced contribution between the grayscale image, which contains objects, and the depth background image, which does not.
Grayscale images often struggle to differentiate between the foreground and background under varying lighting conditions or due to the presence of diverse floor materials. To address this challenge, the depth background information is merged with the grayscale image using equal weighting to create a composite image. This composite image retains the structural details and contours of the objects in the grayscale image while effectively incorporating depth background information. Finally, element-wise multiplication is performed between this composite image and the inverted saturation component of the HSV color space. This process emphasizes low-saturation foreground regions, such as pigs, while suppressing high-saturation regions containing rich color information, resulting in the final DOG image.
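Under the same assumptions (OpenCV, 8-bit single-channel images, and the CLAHE-equalized depth background and RoI produced by the RoI Setting Module), a per-frame version of these steps might look roughly like the sketch below; it follows the described pipeline rather than reproducing the authors' exact implementation:

```python
import cv2

def make_dog(frame, depth_bg_eq, roi, clahe):
    """Per-frame DOG image generation on the CPU (sketch of the described steps)."""
    # HSV-Value (used here as the grayscale representation) and inverted HSV-Saturation.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    s_inv = cv2.bitwise_not(s)

    # CLAHE on the value channel, then equal-weight fusion with the
    # CLAHE-equalized depth background computed during initialization.
    v_eq = clahe.apply(v)
    fused = cv2.addWeighted(v_eq, 0.5, depth_bg_eq, 0.5, 0)

    # Emphasize low-saturation (achromatic) pigs and suppress colorful regions.
    dog = cv2.multiply(fused, s_inv, scale=1.0 / 255.0)

    # Outside the RoI, fall back to the depth background values.
    dog[roi == 0] = depth_bg_eq[roi == 0]
    return dog
```

Here the CLAHE object is created once (e.g., cv2.createCLAHE(clipLimit=4, tileGridSize=(4, 4)), the setting reported above) and reused for every frame.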
Figure 5 illustrates the HSV images derived from the input video. The HSV-Hue component, which represents colors such as red, green, and blue as angles in the color space, is not utilized because it is heavily influenced by color and fails to represent the pixel values of objects consistently. In contrast, HSV-Saturation assigns low values to achromatic regions, and when inverted, achromatic objects are represented with high pixel values. This characteristic provides useful features for the element-wise multiplication. HSV-Value, computed from the maximum of the red, green, and blue channels, vividly highlights the texture of objects.
Algorithm 1 describes the process for generating RoI and DOG images from a sequence of video frames. It involves preprocessing steps to define RoIs, creating depth background images, and generating DOG images to enhance object detection accuracy in unseen environments. By leveraging deep learning-based foundation models (SAM, LaMa, Depth Anything) and simple image processing techniques, the algorithm ensures efficient real-time performance.
Algorithm 1. RoI Setting Module and DOG Generation Module
Input: video frame sequence F = {f1, f2, …, fN}.
Output: Region of Interest (RoI) image; depth background (DB) image; Depth-Oriented Gray (DOG) image.
For each frame fi in F:
  If fi is the first frame:
    Mask = inference with SAM given fi
    Background = inference with LaMa given fi and Mask
    DB = inference with Depth Anything given Background
    DB = apply CLAHE on DB
    RoI = apply Otsu thresholding on DB
  Else:
    (H, S, V) = split fi into the HSV color space
    V_eq = apply CLAHE on V
    Sum = weighted sum(V_eq, DB)
    S_inv = bitwise NOT of S
    DOG = element-wise multiply(Sum, S_inv)
    For pixels where RoI is 0:
      copy the pixel value from DB to DOG
  Return DOG
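Tying the two modules together, a driver loop over the frame sequence (reusing the hypothetical build_roi and make_dog helpers sketched in Sections 3.1 and 3.2) could look as follows; the foundation models are touched only on the first frame, and every subsequent step is plain CPU image processing:

```python
import cv2

def process_video(frames, run_sam, run_lama, run_depth_anything):
    """Sketch of Algorithm 1: GPU-based initialization once, CPU-based DOG generation per frame."""
    clahe = cv2.createCLAHE(clipLimit=4, tileGridSize=(4, 4))
    roi = depth_bg_eq = None
    dog_images = []
    for i, frame in enumerate(frames):
        if i == 0:
            # RoI Setting Module: run the foundation models once for this pen.
            roi, depth_bg_eq = build_roi(frame, run_sam, run_lama, run_depth_anything)
        # DOG Generation Module: lightweight per-frame processing.
        dog_images.append(make_dog(frame, depth_bg_eq, roi, clahe))
    return dog_images
```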
In summary, the proposed methodology integrates these state-of-the-art foundation models and advanced processing techniques to address the challenges of unseen environments. The modular design of the RoI Setting Module and DOG Generation Module enables accurate separation of the foreground and background, while enhancing the generalization capabilities for object detection models. This approach demonstrates both its practicality and potential for deployment in real-world scenarios, such as pigpen monitoring systems.
4. Experiment Results
The experiments were conducted in the following hardware and software environment: an Intel Core i7-7700K @ 4.20 GHz processor (Intel Corporation, Santa Clara, CA, USA) and a GeForce GTX 1660 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The software environment was based on the Ubuntu 22.04 LTS operating system (Canonical Ltd., London, UK), with the deep learning models implemented and tested using PyTorch 2.0.1 (Meta Platforms, Inc., Menlo Park, CA, USA) and CUDA 11.7 (NVIDIA Corporation, Santa Clara, CA, USA).
Figure 6 presents the data and training setup used in this study. The training dataset was sourced from the German pig dataset [
13] and consisted of 985 images, divided into 788 images for training and 197 for validation. The test dataset consisted of 200 images derived from videos acquired in Korea. Accuracy measurements were specifically conducted using data filmed at a pigsty on a farm located in Jochiwon-up, Sejong-si, Chungcheongnam-do, South Korea, involving 23 pigs. The test dataset additionally included images of another Korean pig, Belgian pigs, and Chinese pigs, specifically utilized to evaluate the generalization performance of the proposed method. The model was trained for 300 epochs with a batch size of 16 and a resolution of 640 × 640.
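As a hedged illustration of this training configuration (the Ultralytics interface and the dataset YAML file name are assumptions for the sketch, not details confirmed by the paper), the nano model could be trained roughly as:

```python
from ultralytics import YOLO  # assumed tooling; the paper only states that PyTorch 2.0.1 was used

# Hypothetical dataset config pointing at the 788/197 train/validation split.
model = YOLO("yolo11n.pt")  # nano variant used for real-time detection
model.train(data="german_pigs.yaml", epochs=300, batch=16, imgsz=640)
```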
The dataset involving 23 pigs presented challenges such as indistinguishable separation between pigs and the background due to flooring materials and object overlap caused by feeding troughs located at the top. Additionally, the top-right area of the image was darker than other regions, which could result in detection errors. In contrast, the single image of another Korean pig exhibited clear separation between objects and the background but was prone to errors caused by sunlight. The Belgian pig image faced challenges in the central area, where objects and the background could not be distinctly separated, similarly to the 23-pig dataset. Lastly, the Chinese pig image included a black pig, which was absent in the training dataset (German pig dataset), and objects with dark colors that closely resembled the background, both contributing to potential detection errors.
True Positive (TP): TP represents the number of instances where the model correctly detects existing objects. For example, if an image in a real farm environment contains 23 pigs, and the model accurately detects 20 out of these 23 pigs, the TP value would be 20. A high TP value indicates that the model demonstrates excellent detection performance.
False Positive (FP): FP represents the instances where the model incorrectly detects an object in locations where no object exists. For example, if the model mistakenly identifies an area without a pig as containing a pig, it counts as an FP. A lower FP value indicates higher reliability in the model’s detection performance.
False Negative (FN): FN refers to the number of instances where the model fails to detect existing objects. For example, if a pig is present but the model does not detect it, it counts as an FN. A lower FN value indicates that the model detects objects without missing them.
Average Precision (AP): AP50 refers to the Average Precision at an Intersection over Union (IoU) threshold of 50% and is a metric used to evaluate the overall performance of object detection models. IoU measures the overlap ratio between the detected bounding box and the ground truth bounding box, while AP represents the area under the Precision–Recall curve. AP50 indicates how well a model detects objects with at least 50% overlap between the predicted and actual boundaries (a worked IoU example follows these definitions).
Inference Time: Inference Time represents the time (in milliseconds) a model takes to process a single image. It is a key metric for evaluating the model’s real-time processing capability. For instance, if the Inference Time is 10 ms, the model can process 100 images per second.
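The sketch below makes the AP50 matching criterion concrete: it computes the IoU between a predicted and a ground-truth box (both in the common (x1, y1, x2, y2) convention, with illustrative coordinates) and labels the prediction as a TP only when the overlap reaches the 50% threshold:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Illustrative boxes: the prediction overlaps the ground truth with IoU ≈ 0.83, so it counts as a TP.
pred, gt = (10, 10, 50, 60), (12, 8, 48, 58)
print("TP" if iou(pred, gt) >= 0.5 else "FP")
```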
Table 2 presents the results of testing models trained on grayscale images when the test image types were grayscale, Depth Anything, and the proposed method (DOG). For real-time detection, the smallest model variants were employed, where a model name ending in n denotes nano and one ending in t denotes tiny. The results showed that testing with depth images improved AP50 by 3% to 9.5% compared to testing with the grayscale images of unseen data from Jochiwon. Additionally, the accuracy of all detection models improved compared to that of the baseline, demonstrating more stable detection performance in unseen environments when using depth images. This indicated that depth images can resolve difficulties in object detection when the foreground and background are hard to distinguish.
Comparing the test results of grayscale and DOG images revealed that, even without training on DOG images, detection of unseen data using DOG images improved AP50 by 2.7% to 6.4% over grayscale image detection. For test images with a resolution of 640 × 640, the Inference Time of the Depth Anything small model on a GPU was 193.7 ms. In contrast, the proposed DOG image transformation took 3.6 ms on a CPU, a roughly 53.8-fold increase in processing speed.
Furthermore, the results emphasized the practicality of the proposed method for real-time applications. Unlike Depth Anything, which relied on computationally expensive GPU-based processing, the proposed method achieved significant speed improvements by utilizing lightweight CPU-based operations. This efficiency made it particularly suitable for deployment in low-resource environments, such as farms, where high-performance computing infrastructure was not readily available.
Figure 7 illustrates the object detection results according to the test image types. The detection results of the object detection model varied based on the test image type. In the enlarged prediction results, green boxes represent TP, blue boxes represent FP, and yellow boxes represent FN. As shown, FP and FN occurred due to the ambiguity between the foreground and background in unseen environments, negatively impacting the accuracy of the object detection model. The Depth Anything and DOG Image detection results successfully detected all 23 pigs. However, errors still occurred in detecting the feeding troughs. These errors are attributed to the feeding troughs having a different shape from those in the training environment and having a depth similar to that of pigs. This issue can be resolved by setting the RoI as a red-highlighted area, as in the DOG Image with the RoI approach.
Figure 8 shows the results of applying the proposed method across various environments. Grayscale images demonstrated that detection results can be negatively affected when the brightness of objects and backgrounds is similar. While HSV-Saturation images effectively represented achromatic pigs with pixel values close to zero in various environments, they suffered from the drawback of losing texture information. The depth background image played a crucial role in generating DOG images. Although generating depth images for every frame using Depth Anything was not feasible for real-time processing, combining grayscale images and depth background images with equal weighting created an image similar to a depth image through simple image processing on a CPU. The Otsu algorithm was also applied to the generated depth background image to define the RoI region.
The RoI remained unchanged after initialization in the pigsty environment, where a fixed camera was used. This initial setup helped the object detection model identify the areas to detect, improving detection accuracy. The DOG images created from the above components generalized the foreground (pigs) across various environments while reducing the brightness values of the background. The images enabled a clear distinction between the foreground and background, and the DOG images enabled clearer object boundaries, with the green boxes representing the detected objects, resulting in more robust and precise detection even in challenging unseen environments.
The DOG images not only addressed varying lighting conditions effectively but also enhanced object boundaries, resulting in robust and precise detection performance. For example, in the first column, the gray image shows a heat lamp positioned in the upper left corner near the feeding trough, where the brightness of objects decreases as they move further away from the lamp. Despite this variation in lighting conditions within a single image, the DOG image effectively distinguishes the boundaries between the white flooring and objects while maintaining similar brightness levels among the objects. The second column depicts an environment with natural light, where pigs near the light source appear brighter, and those farther away exhibit darker brightness levels. Similarly to the first column, the DOG image equalized the brightness levels among the objects, even under varying lighting conditions within the same image. In the third column, the gray image presents a scenario where the brightness of the pigs and the floor are similar, making differentiation challenging. However, the DOG image clearly separates the objects from the background, demonstrating its effectiveness. Lastly, in the fourth column, the black pig image highlights pigs darker than the floor. The DOG image adjusted the brightness of the black pigs to a level comparable to the white pigs in other DOG images, further enhancing the distinction between objects and improving overall clarity.
5. Discussion
The results of this study demonstrated that the proposed method, which utilized foundation models during the initialization phase with top-view images as input, was capable of operating in real time on a CPU through simple image processing. The DOG images generated by the proposed method showed superior object detection performance compared to that of grayscale images in various pigpen environments, indicating their effectiveness in distinguishing between the foreground and background even in complex settings. However, in tilted-view environments, the performance of DOG images varied depending on the camera angle and perspective transformation method. In certain cases, distinguishing between the foreground and background became challenging, potentially reducing detection accuracy. These limitations highlighted the need for the further development of robust methodologies that could reliably operate in tilted-view scenarios.
Figure 9 presents the results of applying the proposed method to tilted-view scenarios. It shows the outcomes of applying the proposed method to a tilted-view input image after performing perspective transformation, with the green boxes representing the detected objects. In the input image, pigs closer to the camera appeared larger than those farther away. However, in the transformed image, a distortion was observed in which pigs farther from the camera appeared larger than those closer to it, an artifact of the applied perspective transformation. Additionally, incorrect results from the Otsu thresholding of the depth background image led to the generation of RoI images that obscured parts of the pigs. To address this issue, developing appropriate perspective transformation methods or designing a more robust approach that can reliably operate in tilted-view environments could be a topic for future research.
Table 3 presents the performance of different YOLO model versions when tested on unseen environment data. The models were trained on augmented training data, where DOG images were added to the training dataset to enhance robustness against complex and previously unseen conditions. The results demonstrated how the inclusion of DOG images improved the models’ generalization capability and detection accuracy in challenging environments. Models trained on both Gray and DOG images showed an overall improvement in AP50, ranging from 3.6% to 6.6%, compared to models trained on Gray images. This indicated that incorporating DOG images into the training process contributed to the better handling of challenging scenarios and improved overall detection performance.
However, there are two key challenges associated with augmenting the training data. First, while the proposed method performs effectively in top-view scenarios, it fails to achieve consistent results in tilted-view settings. Second, the initialization process must be performed on all training images, which significantly increases the overall training time.
Figure 10 illustrates the CPU and GPU memory usage during the initialization phase of DOG image generation. The initialization step represents the time required for the RoI Setting Module to operate on a single image. To measure this, the models were loaded prior to recording, and the stages of the initialization step were marked with a green line. Although this process was performed only once per pigpen, it took approximately 10 s. The models used were the SAM (sam2 hiera small), LaMa (big lama), and Depth Anything (Depth Anything V2 small). This initialization step was conducted under the same experimental conditions as those of the main experiments.
6. Conclusions
This study proposed a methodology utilizing DOG images and the Depth Anything model to achieve high object detection performance and cost efficiency even in unseen environments. The approach addresses the performance degradation issues of gray image-based detection models in complex backgrounds and unseen environmental conditions. Cost efficiency is significantly improved by generating depth information through Depth Anything without requiring additional depth sensors.
Experimental results demonstrated that DOG images achieved up to a 6.4% increase in AP50 compared to gray images, with a processing time of 3.6 ms on a CPU, approximately 53.8 times faster than the GPU-based Depth Anything’s depth image generation time of 193.7 ms. These findings confirmed that DOG images resolve foreground–background distinction challenges in complex lighting conditions and backgrounds, proving their applicability in real-time object detection systems.
Future research will focus on developing real-time video-based detection and tracking algorithms and enhancing generalization performance across diverse environments to expand the applicability of real-time monitoring systems in areas such as farm management and smart farming. Following approaches like [
19], it is anticipated that generating DOG images on a CPU and processing object detection models such as YOLO on a GPU in a pipeline could enable real-time video monitoring even in embedded board environments. Furthermore, if the proposed method, initially designed for top-view operation, incorporates camera position estimation and 3D modeling for perspective transformation to a top view for the RoI setting, it is also expected to be applicable in tilted-view environments.