1. Introduction
The need for annotated data has grown significantly in recent years, as deep learning models depend heavily on large data collections for efficient training. As tasks like people detection and tracking become more complex and demand greater precision, extensive datasets become crucial. Object detection methods can be applied widely in various fields, such as manufacturing [
1], sports [
2], and many others. It is especially essential to have datasets that cover various scenarios, environmental conditions, and diverse demographics to ensure the adaptability and robustness of models. The computer vision community has dedicated resources to create datasets like PASCAL VOC [
3] or MS COCO [
4], contributing significantly to research on complex tasks across different scenes. However, while beneficial, these datasets do not cover all possible scenarios.
The collection of extensive and diverse real-world data poses various challenges and limitations. Gathering real-world data is a time-consuming process with higher operational costs, particularly when manually annotating thousands of images with numerous objects, inevitably leading to human errors. The current increase in privacy regulations and restrictions on data collection further complicates matters. Real data often contain sensitive information, including people’s identities, locations, and activities, raising significant privacy concerns, especially in regions like the US and Europe, where regulations like the EU General Data Protection Regulation [
5] have been introduced. Considering the necessity for diverse data to encompass a wide range of scenarios, synthetic data emerges as a valuable solution that can be generated through simulation. The utilization of synthetic data provides the flexibility to create various scenarios with different nuances. With synthetic data, it is possible to have control over the elements present in the scene, visibility of individuals or objects, actions performed by actors, types of locations, and numerous other factors.
Three advanced types of methods for generating labeled data can be distinguished, each offering richer information about the objects to be detected. A summary of these three categories can be found in
Table 1. These methods are performed once before training, and their output can be further augmented. The first type of method uses background-less images of the objects to be learned (named in this work as Img: images generated without 3D software). Such an image is pasted onto another one, yielding more valuable data. In the work in [
6] an algorithm is used in automotive transport for detecting road damage. To ensure the realism of generated data, images of road cracks are only put on specially selected areas. To improve the realism of synthetic data, the work in [
7] focuses on using color-grading images and fitting the size of the pasted object. This ensures a better blend with the background, enhancing visual coherence. The work in [
8] introduces a cycle-GAN algorithm, creating realistic images based on the provided ones. This is used in multi-organ detection in CT images, generating realistic organ images.
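The Img idea described above can be illustrated with a minimal sketch: a background-less cutout is alpha-blended onto a background, and the label follows directly from the paste position. This is not the procedure of any of the cited works; the function name `paste_object`, the nested-list image representation, and the YOLO-style normalized box format are assumptions made for illustration only.

```python
def paste_object(background, cutout, top, left):
    """Alpha-blend an RGBA cutout onto an RGB background.

    Images are nested lists of pixel tuples (hypothetical representation).
    Returns the composite image and a normalized box
    (x_center, y_center, width, height) covering the cutout rectangle.
    """
    bg_h, bg_w = len(background), len(background[0])
    obj_h, obj_w = len(cutout), len(cutout[0])
    if top < 0 or left < 0 or top + obj_h > bg_h or left + obj_w > bg_w:
        raise ValueError("cutout must fit inside the background")

    # Copy the rows so the input background stays untouched.
    composite = [list(row) for row in background]
    for dy in range(obj_h):
        for dx in range(obj_w):
            r, g, b, a = cutout[dy][dx]
            alpha = a / 255.0
            bg_r, bg_g, bg_b = composite[top + dy][left + dx]
            composite[top + dy][left + dx] = (
                round(alpha * r + (1.0 - alpha) * bg_r),
                round(alpha * g + (1.0 - alpha) * bg_g),
                round(alpha * b + (1.0 - alpha) * bg_b),
            )

    # The label is known exactly because the paste position is known.
    box = (
        (left + obj_w / 2) / bg_w,
        (top + obj_h / 2) / bg_h,
        obj_w / bg_w,
        obj_h / bg_h,
    )
    return composite, box
```

Note that the box here spans the full cutout rectangle; a tighter label would require scanning for the non-transparent pixels. Restricting `top` and `left` to selected regions would correspond to the placement constraints mentioned for road-crack images.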
The second popular category of methods presented in [
14] is to use 3D software to render 3D objects on a specific background (Obj: 3D models generated on a flat background). This method works on the assumption that synthetic images do not have to be realistic and ensures that the object is being rendered from multiple angles. Such an approach diversifies the background of an image and provides information about the object from different points of view, which is crucial while detecting more complex shapes. The work in [
13] suggests that the best results are obtained when the rendered objects used become background-less, giving the possibility to change the background, as well as to rotate or change their positions. Such an operation should be performed multiple times to ensure filters learn the features of the object, not the background. The authors of [
10] use an especially prepared environment with various cameras and visual effects. Objects can be rendered from various camera angles with different light settings. Blur and noise effects are also added to 3D scenes.
The last noted group of methods uses a Game Engine for creating a full crowd simulation (Sim: images rendered using simulation; refer to
Figure 1), which is proven to be a valuable tool for validating object detection [
17], person detection [
22], construction site issues [
18], and tracking methods [
31]. However, it faces limitations in preparing diverse data for training. The challenge lies in the difficulty of achieving a wide range of situations essential for effective training. The complexity lies in addressing the challenge of data diversity in synthetic data generation [
13], as certain methods may lead to biased datasets under specific input conditions. Models trained on diverse datasets exhibit enhanced performance when dealing with data beyond the target domain. This is why simulations can only be used to overcome specific problems. One of the possible usages [
21] is to render images of walking ants in a randomly generated environment, which is feasible because generating forest ground from a top view is a simple task. The labels are later used to detect and track the ants and to estimate their positions. Another approach [
19] uses an ocean simulation to detect people in the water. The simulation uses multiple ship models, varies the weather conditions, and spawns trash in the sea. The goal of the algorithm is to detect drowning people and to distinguish them from swimming ones. The trained model will only be used over the ocean, so there is no need to diversify the environment as much.
In the context of Virtual Worlds [
30], synthetic data have found applications as proxies for multi-object tracking analysis. In the context of detecting cars and traffic, virtual crowd simulation is utilized to augment training data and expand datasets [
23], particularly emphasizing highly reflective objects, with a focus on bathroom utilities. The video game environment [
29] serves as a common tool, leveraging synthetic samples to achieve results comparable to models trained on real-world data. The integration of synthetic data generated within virtual simulations into training sets proves instrumental in significantly enhancing the performance of object detection algorithms. An approach involving the generation of synthetic objects on real backgrounds [
28], featuring a high density of detectable objects, aims to emulate real-world clutter effectively. By incorporating multiple synthetic and real datasets alongside a simulation tool [
32], there is potential to create large volumes of affordably annotated synthetic data. This approach can lead to the establishment of domain similarity among these datasets, contributing to more robust and comprehensive training datasets.
Several authors have emphasized the importance of achieving photorealism in simulated data, investigating the impact of synthetic datasets generated through photorealistic rendering techniques. This focus extends to areas such as street scene parsing [
24] and transfer learning, with a predominant reliance on synthetic training data [
27]. In the work in [
15], a new algorithm was designed to enhance realism by initiating from a limited set of real images. Then, it estimates the rendering parameters needed to synthesize similar images when provided with a coarse 3D model of the target object. Furthermore, the generation of synthetic data [
12] involves incorporating randomized illumination, blur, and noise to address the challenges of object detection in complex environments. This approach aims to overcome the limitations associated with existing methods, which heavily depend on large volumes of labeled real data. In the realm of cross-modality learning, a framework [
26] employs a deep convolutional network to establish a non-linear mapping between RGB and thermal data. This enables the learning of features that are both discriminative and resilient to poor illumination conditions.
Models trained exclusively on synthetic data often fall short in performance when tested on real-world datasets, and the process of data synthesis itself likely contributes to the observed domain gap [
33]. To mitigate the disparity between real and synthetic data, several strategies are employed. One approach involves mixing real data into the training set alongside synthetic data. Other strategies include fine-tuning on mixed data after pretraining on a larger dataset (like COCO) that is unrelated to the given problem but comes from a similar domain, or enhancing the quality of synthetic data to align more closely with the target domain. While fine-tuning models with real data can lead to improvements, it does not address the fundamental issue of the domain gap. Addressing the domain gap for semantic segmentation [
34] can be achieved by adapting the representations learned by segmentation networks across synthetic and real domains. Alternatively, domain randomization [
25] can be applied together with fine-tuning on real data. In this approach, simulator parameters such as lighting, pose, and object textures are randomized in non-realistic ways, compelling the neural network to learn the essential features of the object of interest despite the artificial variations introduced during simulation.
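Domain randomization as described above amounts to drawing simulator parameters at random for every rendered scene. The following is a minimal sketch of that sampling step only; the parameter names, the value ranges, and the `TEXTURES` list are hypothetical and are not taken from the cited work or from any particular simulator.

```python
import random

# Hypothetical texture identifiers; deliberately non-realistic choices are allowed.
TEXTURES = ["checkerboard", "noise", "stripes", "flat_color"]


def sample_scene_params(rng):
    """Draw one random scene configuration (all ranges are assumptions)."""
    return {
        "light_intensity": rng.uniform(0.2, 3.0),
        "light_color": tuple(rng.random() for _ in range(3)),
        "object_yaw_deg": rng.uniform(0.0, 360.0),
        "texture": rng.choice(TEXTURES),
        "camera_height_m": rng.uniform(1.0, 10.0),
    }


def sample_dataset_params(num_scenes, seed=0):
    """Generate a reproducible list of scene configurations."""
    rng = random.Random(seed)
    return [sample_scene_params(rng) for _ in range(num_scenes)]
```

Seeding the generator keeps a generated dataset reproducible, which helps when comparing training runs across different synthetic sets.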
The presented investigation shows various methods for preparing synthetic data and explores techniques for training multi-object detection methods for classes with a limited amount of data. The diversity of the generated results is a key focus of synthetic data, with consideration given to the issue of photorealism. This study presents a different approach to training a multi-object detection algorithm using real data, synthetic data, and a transfer learning approach. Then, the impact of the ratio between the amounts of real and synthetic data on the effectiveness of training and classification was analyzed, which makes it possible to determine the amount of synthetic data that must be added to the training set. Finally, based on existing solutions for generating synthetic data (Img, Obj, and Sim), the influence of data type on classification parameters was checked. The novel outcome of the conducted research is the most effective recipe for generating synthetic data and a method for utilizing such data in training models to detect large-sized objects. This conclusion is derived from the latest knowledge and insights in the field, providing a comprehensive guide for practitioners involved in similar applications.
3. Results
All calculations were performed on a computer with the following parameters: Xeon W-3200 processor, number of cores/threads 12/24, processor clock 3.3 GHz, cache memory 19.25 MB, 32 GB RAM, and GPU NVIDIA Quadro RTX6000 24 GB. The dataset was split in the following manner: training set: 50% of labels, validation set: 30% of labels, test set: 20% of labels. Later, cross-validation of the chosen models was performed. To create hybrid datasets, synthetic data were added with various ratios to the training set. It was ensured that the correct ratio was obtained by calculating the number of labels instead of images.
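Counting labels rather than images when building a hybrid set can be sketched as follows. This is only an illustration under stated assumptions: the ratio is interpreted here as the share of synthetic labels in the final hybrid set, and the function names and the greedy image selection are hypothetical, not the procedure used in this study.

```python
def synthetic_labels_needed(real_labels, synthetic_fraction):
    """Number of synthetic labels so that they form `synthetic_fraction`
    of the hybrid set (assumed interpretation of the ratio)."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    return round(real_labels * synthetic_fraction / (1.0 - synthetic_fraction))


def pick_images_by_labels(images, target_labels):
    """Greedily take images until the label target is reached.

    `images` is a list of (image_name, label_count) pairs.
    Returns the chosen names and the total number of labels taken.
    """
    chosen, total = [], 0
    for name, label_count in images:
        if total >= target_labels:
            break
        chosen.append(name)
        total += label_count
    return chosen, total
```

Because images carry different numbers of objects, the achieved label total can overshoot the target slightly; the returned total makes that deviation visible.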
To evaluate the quality of the models’ outcomes, the following evaluation metrics were applied: (a) Precision; (b) Recall; (c) Mean Average Precision (mAP), calculated at an Intersection over Union (IoU) threshold of 0.5 (mAP@0.5); (d) mAP, evaluated using a series of IoU thresholds ranging from 0.5 to 0.95 (mAP@0.5:0.95); (e) F1 score; (f) Box loss metric, indicating how accurately the algorithm can detect an object’s center; (g) Objectness, measured as the probability of an object existing in a proposed region of interest; and (h) Classification, reflecting the assignment of a class label to the detected object.
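The building blocks of these metrics can be sketched in a few lines. The code below computes IoU for two axis-aligned boxes and derives Precision, Recall, and F1 from detection counts; the corner-based box format and the step of 0.05 between IoU thresholds (the common COCO-style convention) are assumptions, and this is not the evaluation code used in the study.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_box = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    inter = box_area(inter_box)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# IoU thresholds from 0.5 to 0.95; the 0.05 step is an assumed convention.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

A detection counts as a true positive at a given threshold when its IoU with a ground-truth box of the same class reaches that threshold, so the stricter thresholds reward more precise localization.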
3.1. Evaluation of Different Training Approaches
As intended, the multi-object detection method was trained first on real data. Subsequently, a transfer learning procedure was executed using solely real data. In the second phase, a combination of synthetic data and real data, called hybrid data, was employed. The transfer learning approach was then repeated with this hybrid data, ensuring an equal distribution of synthetic and real data in the hybrid set. The results demonstrate that incorporating synthetic data in the form of hybrid data significantly enhances detection quality (refer to
Table 3).
Training detection methods exclusively with real data proved to be the least effective, prompting the utilization of the transfer learning approach in this scenario. Further improvements were observed when transfer learning (tf. real), hybrid data (hybrid), and transfer learning for hybrid data (tf. hybr.) were applied, respectively. The results for the Precision, Recall, F1 score, and mAP 0.5 parameters are above 0.9; a large difference is visible only for the more precise mAP 0.5:0.95 parameter, where the transfer learning approach for hybrid data performs best.
On an individual class basis, YOLOv7 demonstrated commendable results for approximate detection (mAP 0.5). However, a marked enhancement in the classification efficiency of individual classes was achieved through the introduction of the transfer learning approach for hybrid data, evident in the mAP parameter 0.5:0.95 in
Table 4.
3.2. Estimation of Synthetic Data Amount
This research also investigated the optimal proportion of generated synthetic data in the training set. For hybrid data and the transfer learning approach, various ratios of synthetic data to real data were examined. Firstly, only synthetic data were used for training, which gave the worst results. On the other hand, when only real data were used, the highest mAP was observed (refer to
Figure 5, top plot, and
Table 5). Still, when an equal amount of synthetic and real data are used, the model performance is slightly worse than for a full real dataset. Even in extreme cases, where 90% synthetic data and 10% real data are used, the results are worse only by an average of 7–11% compared with the full real data. In the second step, synthetic data were added to the full real dataset (refer to
Figure 5, bottom plot, and
Table 5). There is a trend showing that adding synthetic data, and thus enlarging the training set, significantly improves the results. The major boost is observed after adding 25% of synthetic data. This confirms the usefulness of synthetic data in the context of supplementing real data.
3.3. Choice of Synthetic Method Generation
During this phase, diverse methods for generating synthetic data were rigorously tested and detailed in the corresponding
Section 2.2. The YOLOv7 detection method was individually trained on each approach through the transfer learning method. Subsequently, an equivalent number of images were randomly selected from the datasets associated with each approach, ensuring parity in the size of each set. After a comprehensive evaluation of the various strengths and weaknesses inherent in each solution, it was determined that the most favorable results were achieved by combining all approaches of synthetic generation methods (refer to
Table 6).
4. Discussion
The presented work explores diverse methods of generating synthetic data, including crowd simulation, utilizing a plane with a variable background, and incorporating objects onto photo backgrounds. Each approach has its own set of advantages and drawbacks, with typical crowd simulation offering a limited amount of repeatable data, potentially leading to more random outcomes.
The research involved the inclusion of object classes represented in the COCO database, as well as those requiring manual search and marking. Initially, the YOLOv7 network was trained from scratch using real data. Subsequently, a pretrained model was employed to enhance results. Finally, synthetic data, a blend of real and simulation data (hybrid data) in varying proportions, were introduced, and the network training method was repeated similarly to real data. The research also investigated the impact of options for generating simulation data and the quantity of synthetic data relative to real data on classification quality.
The results of the research show that the use of real data alone in the training process is insufficient. Particularly for classes with limited or no available source data, the use of synthetic data emerges as a highly effective alternative. This study employed the well-established method of training the YOLOv7 network, with an extensive analysis of various data generation approaches.
Table 3 illustrates that employing hybrid data for the transfer learning approach yields the best results across all indicated parameters. Notably, this approach significantly enhances the precision of object detection, as evident in the mAP parameter 0.5:0.95.
Furthermore, this research reveals that the influence of the amount of synthetic data relative to real data is minimal (refer to
Figure 5). However, it is worth noting that synthetic data can be a good replacement for and complement to real data when balancing data from multiple classes is required. The reported results use equal amounts of real and synthetic data in the hybrid set. In scenarios with limited real data, however, supplementing the set with a larger volume of synthetic data proves beneficial. That is crucial for ensuring a proper balance of data when not all classes are represented by a sufficient amount of real data. The capacity of synthetic data to provide diverse yet realistic images stands out as a key feature.
The primary criterion for data generation was maximizing diversity, and the use of the Unreal Engine helped maintain photorealism. Assessing the impact of photorealism on subsequent work proved challenging due to the absence of clear parameters for evaluating such realism. Diversity in the data was achieved through the incorporation of different 3D models, multiple models in a single scene, object occlusions, random arrangement of models, random positioning of camera views, diverse backgrounds, and lighting. This diversity, practically unreachable in reality, fills a crucial gap when actual images are unavailable.
Promising methods for data generation are generative models, such as generative adversarial networks [
38] and diffusion models [
39], which are currently used in many applications. This type of methodology currently serves mainly as a data augmentation method, which creates coherent and logical images based on already provided pictures. Future development of these methods may make them an alternative to generating synthetic data and will speed up the process of preparing training data.
The strategy of training on synthetic data significantly affects the results and influences the layers of neural networks [
16]. Notably, the comparison between a detector trained on real data and one trained on synthetic data revealed the highest similarity in the early layers, while the most significant difference was observed in the head part of the network. Feature extractors (first layers) are only responsible for detecting various shapes that are identical for similar problems. This observation was used for a simple yet effective approach [
14], where the layers responsible for feature extraction were frozen in pretrained models on real images. Subsequently, only the remaining layers were trained on synthetic data.
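The freezing strategy described above can be expressed as a simple plan that marks which layers stay fixed and which are trained on synthetic data. The sketch below is framework-independent; the layer names and the choice of how many layers to freeze are hypothetical and do not reflect the architecture or settings of the cited work.

```python
def plan_trainable_layers(layer_names, num_frozen):
    """Map each layer name to True (trainable) or False (frozen).

    The first `num_frozen` layers, assumed to be the feature extractor
    pretrained on real images, are kept frozen.
    """
    if not 0 <= num_frozen <= len(layer_names):
        raise ValueError("num_frozen must be between 0 and the number of layers")
    return {name: index >= num_frozen for index, name in enumerate(layer_names)}


# Hypothetical layer ordering from input to output.
EXAMPLE_LAYERS = ["stem", "stage1", "stage2", "neck", "head"]
EXAMPLE_PLAN = plan_trainable_layers(EXAMPLE_LAYERS, num_frozen=3)
```

In a deep learning framework, such a plan would typically be applied by disabling gradient updates for the parameters of the frozen layers before training on synthetic data.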
The effective application of simulation data, treated as synthetic data, plays a pivotal role in enhancing the quality of training data preparation. On one hand, it facilitates the generation of substantial datasets that enrich real data with diverse sets in various aspects. On the other hand, synthetic data become crucial when real data for a specific class is lacking in any existing dataset. However, a significant challenge lies in appropriately structuring the classifier architecture to bridge the gap between the domains of real and synthetic data. The presented work delves into extensive research on the application of synthetic data in the detection of numerous objects. The selected list of objects includes classes with available data in publicly accessible benchmarks and classes requiring manual resource search.
The utilization of synthetic data as a form of data augmentation is increasingly prevalent today. Image augmentation, as discussed in [
9], is a popular way to boost dataset quality. It involves rotating, scaling, adjusting color, and blending images to create more labeled data, preventing overfitting during training. Such a process can be carried out in each epoch, providing rich data every iteration. With numerous data generation scenarios available, synthetic data finds practical applications in detecting various objects. Current knowledge enables the generation of realistic data, and as successive graphics engines are released, the photorealism of simulations is expected to improve. Nevertheless, a significant challenge for the future lies in achieving appropriate domain transfer, handling domain bias, and preventing overfitting, so that a classifier built solely on synthetic images remains effective on real data.
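Per-epoch augmentation of labeled images requires that the labels be transformed together with the pixels. The following minimal sketch shows two of the operations mentioned above, a 90-degree rotation with the matching box update and a brightness adjustment, on a grayscale image stored as nested lists. The function names, the pixel-coordinate box format (x1, y1, x2, y2), and the random ranges are assumptions for illustration.

```python
import random


def rotate90_clockwise(image, box):
    """Rotate an image 90 degrees clockwise and update its box.

    `image` is a list of rows; `box` is (x1, y1, x2, y2) in pixels.
    """
    height = len(image)
    rotated = [list(row) for row in zip(*image[::-1])]
    x1, y1, x2, y2 = box
    # A point (x, y) moves to (height - y, x) under this rotation.
    return rotated, (height - y2, x1, height - y1, x2)


def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor`, clamped to the 0..255 range."""
    return [[min(255, max(0, round(v * factor))) for v in row] for row in image]


def augment(image, box, rng):
    """Apply a random rotation and brightness change (ranges are assumptions)."""
    if rng.random() < 0.5:
        image, box = rotate90_clockwise(image, box)
    return adjust_brightness(image, rng.uniform(0.7, 1.3)), box
```

Calling `augment` with a fresh random draw in every epoch yields a different variant of each labeled image on each pass, which is the behavior described for per-epoch augmentation.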