2.1. Background
Among the various potential ML methods, rapid progress in convolutional neural network (CNN) methods has attracted the most attention recently. CNN methods are able to efficiently learn distinctive high-level features for object detection in remote sensing [7,13]. CNN algorithms can be divided into two categories: two-stage detectors (the R-CNN family) and single-stage detectors (SSD, YOLO, and RetinaNet) [14,15].
Faster R-CNN is used for generic object detection and has been successfully adapted from its two predecessors, R-CNN and Fast R-CNN [16,17], to solve many recognition problems. Faster R-CNN consists of two modules: a region proposal network (RPN) and a Fast R-CNN detector [14,16,17]. Faster R-CNN improves on the earlier architecture by replacing the selective search algorithm used in Fast R-CNN with the RPN [14,18]. The RPN is a fully convolutional network for proposal generation [14]; the rest of the architecture remains the same as in Fast R-CNN. A system overview of Faster R-CNN is given in Figure 1.
Duporge and Isupova [20] applied a CNN model to automatically detect and count African elephants in a woodland savanna ecosystem in South Africa, using WorldView-3 and -4 satellite data. Dumitrescu, Boiangiu, and Voncilă [21] focused on creating a fast and reliable object detection algorithm trained on scenes depicting people in an indoor environment; their method combines YOLOv4 and Faster R-CNN. Li and Ma as well as Fu and Xu [22,23] used Faster R-CNN to detect people in sequences of video images. Wang et al. [4] used Mask R-CNN, an extension of Faster R-CNN that additionally produces segmentation masks of objects. Ren, Zhu, and Xiao [24] used a modified Faster R-CNN to detect ships and planes in optical remote sensing images (NWPU VHR-10).
Compared to other popular object detection methods, such as YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and RetinaNet, Faster R-CNN offers distinct advantages that make it particularly effective for this application. Its RPN generates higher-quality region proposals than the single-stage approaches used in YOLO, SSD, and RetinaNet. The RPN is designed to propose regions that are likely to contain objects of interest, which is crucial for detecting small targets such as people in urban satellite imagery. The two-stage approach, in which the RPN first proposes regions and a classifier then refines the detections, yields higher precision and recall, which are critical for applications where accurate human detection is paramount [14,16,17,18,19,25]. Although YOLO and SSD are known for their speed, they often sacrifice accuracy, particularly in complex and cluttered scenes. YOLO's single-stage detector processes the entire image in a single pass, providing faster detection but at the cost of reduced accuracy for small objects and more false positives; SSD faces similar issues. RetinaNet introduces focal loss to handle the class imbalance problem (i.e., many more background examples than objects) in object detection [15,26]. In contrast, Faster R-CNN's two-stage process allows more precise localization and classification of people, which is essential when dealing with the varied and detailed backgrounds of urban environments [25,27].
Deep ML methods require a high volume of data, particularly for their training [28,29]. The number of training samples needed for object detection with CNN algorithms can vary widely depending on several factors, including the complexity of the task, the diversity of the objects, the sizes of the objects in the images, and the architecture of the CNN being used. CNN algorithms enable the extraction of image features, which can be effectively processed to reduce the dimensionality of the task [30].
Additional data manipulation methods (data augmentations) are applied to achieve improved results. These techniques help models generalize better by exposing them to a wider variety of image conditions. The following data augmentation techniques are frequently employed for object detection from satellite images: rotation, flipping, shifting (translation), and clipping [11,31,32].
2.2. Study Areas and Data
Prague was selected as the area of interest for the analysis of the occurrence of people on streets. The study area in Prague consists of three subsets: Prague Castle, Charles Bridge, and the Old Town Square, including its surroundings (Figure 2). These locations were selected for two reasons: they contain various surface types (allowing the ability to detect people in various urban conditions to be verified), and people occur frequently at these locations at the acquisition time (Figure 3). One full WorldView-3 scene from the morning of 23 July 2019, covering 25 km², was obtained. The spatial resolution of the WorldView-3 panchromatic and multispectral images is 0.3 m and 1.6 m, respectively.
In these study areas, ground truth data were obtained by visual vectorization. Ground truth data (Table 1) are needed for training and accuracy assessment. After vectorization, an exploratory data analysis was performed, which showed that people in the satellite images appeared as clusters of four neighboring pixels (Figure 3). The digital numbers (DNs) of these pixels were significantly higher than the DNs of the surrounding surface pixels.
The surface types were as follows:
Location 1—interlocking pavement.
Location 2—small square light cobbles, forming large square areas, bordered by small square dark cobbles.
Location 3—small square light cobbles, forming small square areas, bordered by small square dark cobbles.
Location 4—small square light cobbles, forming large rectangular areas, bordered by small square dark cobbles.
Location 5—interlocking pavement.
Location 6—square light cobbles, forming large square areas, bordered by square dark cobbles.
Location 7—interlocking pavement.
Location 8—marble pavement.
Vector digital data on buildings and greenery were obtained from the digital technical maps of IPR Prague (the Prague Institute of Planning and Development) as auxiliary data. These data enabled the creation of masks that improve the quality of people detection by hiding or blocking certain parts of the input imagery, allowing the neural network to focus on a specific area of interest. Excluding irrelevant areas using the mask also reduces computation, leading to a potential increase in detection speed.
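A minimal sketch of how such a mask could be applied with arcpy is given below; the Erase/ExtractByMask combination and all layer and path names are illustrative assumptions, not the exact workflow used:

```python
import arcpy
from arcpy.sa import ExtractByMask

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\projects\prague_people"  # hypothetical workspace

# Remove building and greenery footprints from the study-area polygon,
# keeping only the open surfaces on which people can appear.
arcpy.analysis.Erase("study_area.shp", "buildings_greenery.shp", "open_surfaces.shp")

# Clip the imagery to the remaining open surfaces, so the detector
# only processes the areas of interest.
masked = ExtractByMask("worldview3_rgb.tif", "open_surfaces.shp")
masked.save("worldview3_rgb_masked.tif")
```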
2.3. Methodology
Upon review, Faster R-CNN and YOLO were selected for data processing using their implementation in ArcGIS Pro version 3.3.1. The selection of these most promising models was confirmed by the "AutoDL" function in ArcGIS. The Python programming language and the ArcGIS API for Python were used to automate various subtasks (available at https://github.com/PavelKVSB/people_detection_AI_Satellite_Images, accessed on 2 September 2024). The hardware used was an Intel Core i7-8700 CPU at 3.20 GHz, an NVIDIA GeForce GTX 1070 GPU, and 16 GB of RAM.
To begin, an image containing three RGB bands was created from the panchromatic image, because the neural network algorithms perform better with this image type [33,34]. Next, training data for the Faster R-CNN model were created from the ground truth data. To generate the training data, the following parameters were set: a tile size (the size of the image chips) of 128, a stride size (the distance to move in X and Y when creating the next image chip) of 64, and the PASCAL Visual Object Classes (.xml) metadata format. This part of the processing was carried out in ArcGIS Pro. After the training data were prepared, the model was trained using Jupyter Notebooks (version 7.2.2) with the GPU enabled in ArcGIS Pro. The training data were loaded, and the size of the validation portion (20%) as well as the batch size were set. Typical mini-batch sizes are 16, 32, 64, 128, 256, and 512 [35], depending on CPU/GPU availability. The Faster R-CNN algorithm was imported and the backbone model was configured. In addition, the learning rate finder function was used to select the most suitable learning rate for model training. After these settings were fixed (Table 2), the given model was trained. Variants 13 and 14 achieved the highest values of the average precision score (AP) (almost 42%, Table 2). The AP is the precision averaged across all recall values between 0 and 1 [36]. The average precision is high when both precision and recall are high, and low when either of them is low across a range of confidence threshold values [37]. Table 2 also shows that the average precision score decreases with smaller batch sizes. However, even a large increase in the batch size does not necessarily lead to better results; it is necessary to find the best batch size value, which in our case was 16. After testing the various backbones available for Faster R-CNN, ResNet152 was found to provide the best results. Variants 13 and 14 have higher AP values than all other variants because, in these cases, the training data were generated with a buffer of 15 cm (variant 13) and 30 cm (variant 14). All testing was performed in the ArcGIS Pro environment.
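A condensed sketch of this training workflow using arcpy and the arcgis.learn module is shown below; all paths and layer names are hypothetical, and the keyword values simply mirror the settings described above rather than reproducing the authors' exact scripts:

```python
import arcpy
from arcgis.learn import prepare_data, FasterRCNN

# Stack the panchromatic band three times to obtain a three-band (RGB-like) image.
arcpy.management.CompositeBands(["pan.tif", "pan.tif", "pan.tif"], "pan_rgb.tif")

# Export image chips and PASCAL VOC labels from the ground truth
# (tile size 128 px, stride 64 px); buffer_radius illustrates the 15 cm
# buffer used for variant 13.
arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster="pan_rgb.tif",
    out_folder=r"C:\projects\prague_people\chips",
    in_class_data="ground_truth_people.shp",
    image_chip_format="TIFF",
    tile_size_x=128, tile_size_y=128,
    stride_x=64, stride_y=64,
    metadata_format="PASCAL_VOC_rectangles",
    buffer_radius=0.15,
)

# Load the chips, holding out 20% for validation; a batch size of 16 gave the best AP.
data = prepare_data(r"C:\projects\prague_people\chips",
                    batch_size=16, val_split_pct=0.2)

# Faster R-CNN with the ResNet152 backbone (the best-performing configuration).
model = FasterRCNN(data, backbone="resnet152")

lr = model.lr_find()          # suggest a suitable learning rate
model.fit(epochs=150, lr=lr)  # train for 150 epochs

print(model.average_precision_score())  # AP on the validation chips
model.save("fasterrcnn_resnet152_people")
```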
By utilizing the buffer parameter, the models received a significant boost in their APs, as indicated in Table 2. Furthermore, this clearly demonstrates that the best outcomes were obtained when the backbone was set to ResNet152, the batch size to 16, and the number of epochs to 150. Therefore, these settings were used for further optimization with data augmentation. The following data augmentation functions were used:
Horizontal flip—The model learns to recognize objects regardless of whether they are oriented to the left or right. This is especially beneficial in tasks where the orientation of objects is not fixed, such as object detection. By enhancing the diversity of the input data, this technique reduces the risk of overfitting and improves the model’s ability to generalize to new, unseen data.
Vertical flip—This transformation can introduce variability that helps the model generalize better to different orientations.
Rotation—Random rotation of the image by a maximum of 45° in either direction (clockwise or counterclockwise). This transformation allows the model to become more robust to slight variations in the orientation of objects.
Zoom—Images were randomly zoomed up to 50%. Zooming helps create a variety of scales within the dataset.
Warp—Random generation of different deformations of the images (including skewing, stretching, and shifting) within the specified range of 0.3. Warping simulates natural distortions and irregularities that might be present in real images.
The use of the "get_transforms" function is essential for data augmentation. This function applies a sequence of random transformation operations to images during both training and validation. Because the transforms are re-drawn every time the images pass through the model, each image may appear differently in each batch. This variability increases the effective diversity of the training data, which improves model generalization and reduces the risk of overfitting. As a result, the "get_transforms" function enhances the effectiveness of the training process, improving robustness and performance on new, unseen data.
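A minimal sketch of how these augmentations could be passed to the data loader is shown below, assuming the fastai get_transforms helper that ships with arcgis.learn; the paths and the exact parameter mapping are illustrative:

```python
from fastai.vision.transform import get_transforms
from arcgis.learn import prepare_data

# Random transforms corresponding to the augmentations listed above:
# horizontal flip, vertical flip, rotation up to 45°, zoom up to 50%, warp 0.3.
transforms = get_transforms(
    do_flip=True,       # random horizontal flip
    flip_vert=True,     # random vertical flip
    max_rotate=45.0,    # rotate up to ±45 degrees
    max_zoom=1.5,       # zoom images up to 50%
    max_warp=0.3,       # random perspective warping within the 0.3 range
)

# The transforms are re-drawn each time a chip enters a batch, so every
# epoch sees slightly different versions of the same training data.
data_aug = prepare_data(r"C:\projects\prague_people\chips",
                        batch_size=16, val_split_pct=0.2,
                        transforms=transforms)
```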
Variants 15 and 16 represent the outputs of training with data augmentation. Variant 15 uses the training data without a buffer, and its AP reached 42% (Table 3). When the training data with a 15 cm buffer were used, the AP increased to 49%. Based on the results obtained with the ResNet152 backbone, data augmentation was also applied to other backbones; the additional backbone variants tested are shown in Table 3. None reached better results than ResNet152 with a 15 cm buffer. The best result among these backbones was achieved by MobileNet V3, with an AP of 42.65%. Slightly lower results were recorded for the VGG16 and VGG19 backbones, with APs of around 37%, followed by Inception V3 (31.55%) and Inception V4 (only 0.29%). The DenseNet backbones (121, 161, 169, and 201) did not provide useful results, with APs of approximately 1%.
The training process and validation are depicted by plotting the training and validation losses after fitting the model (Figure 4). In both cases, the loss on the training data decreases, i.e., the model is learning and its predictions are improving. When the loss on the validation data also decreases, the model not only predicts well on the training data but also generalizes to new data, which is important because the model must perform well on a variety of data. The difference between the training and validation losses is also important: if it is large, the model may be overfitted, meaning that it has learned the specifics of the training data too well and cannot generalize to new data. If the curves are close together, it is a good sign that the model generalizes well [38]. The curves for variant 9 show larger differences (Figure 4a) than those for variant 15 (Figure 4b), where data augmentation was applied.
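Continuing the training sketch above, the loss curves and a quick visual check of the predictions can be produced directly from the trained model object (method names follow the arcgis.learn API; the exact arguments are assumptions):

```python
# Plot the training and validation loss curves recorded during fitting
# (analogous to the curves shown in Figure 4).
model.plot_losses()

# Visual spot check: ground truth versus predictions on validation chips,
# keeping only detections with a confidence above 50%.
model.show_results(rows=4, thresh=0.5)
```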
The trained model then needed to be tested on a sample of data that was not used during training. ArcGIS Pro, with its "Detect Objects Using Deep Learning" function, was employed to test the model. Four models were tested: variants 9 and 13 without data augmentation, and variants 15 and 16 with data augmentation. All four models were tested on an image without a mask and on an image with a mask (Figure 5). Predictions with a confidence score higher than 50% were included in the results as detected people.
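A sketch of this inference step using the corresponding geoprocessing tool is shown below; the model package path, raster names, and argument string are assumptions:

```python
import arcpy

arcpy.env.processorType = "GPU"  # run inference on the GPU

# Apply the trained model to an image not used during training; the same call
# is repeated for the masked and unmasked variants of the test image.
arcpy.ia.DetectObjectsUsingDeepLearning(
    in_raster="test_scene_masked.tif",
    out_detected_objects="detected_people",
    in_model_definition=r"C:\projects\prague_people\models\fasterrcnn_resnet152_people.dlpk",
    arguments="padding 64;threshold 0.5;batch_size 16",  # keep detections with confidence > 50%
    run_nms="NMS",
)
```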
Simultaneously, the YOLOv3 model was trained with different settings (Table 4). Based on the results of the Faster R-CNN testing, this testing focused on models with data augmentation and a 15 cm buffer. The batch size was set to 32, 300 epochs were used, and Darknet53 was used as the backbone. The YOLO model reached a maximum AP of 36.32%, much lower than the AP achieved by the Faster R-CNN model with data augmentation and a 15 cm buffer.
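A comparable sketch for the YOLOv3 run is given below; it reuses the transforms object from the augmentation sketch, and arcgis.learn's YOLOv3 uses a DarkNet-53 backbone by default (paths and exact settings are illustrative):

```python
from arcgis.learn import prepare_data, YOLOv3

# Chips with the 15 cm buffer and the augmentation transforms, batch size 32.
data_yolo = prepare_data(r"C:\projects\prague_people\chips",
                         batch_size=32, val_split_pct=0.2,
                         transforms=transforms)

yolo = YOLOv3(data_yolo)                 # DarkNet-53 backbone by default
yolo.fit(epochs=300, lr=yolo.lr_find())  # 300 epochs, as in Table 4
print(yolo.average_precision_score())
```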
In addition, other widely used object detectors, RetinaNet and the Single Shot Detector (SSD), were evaluated using the ArcGIS Pro (version 3.3.1) implementation. The class arcgis.learn.SingleShotDetector() allows the user to create an SSD object detector with the specified parameters, including the grid sizes used for creating anchor boxes, the zoom scales and aspect ratios of the anchor boxes, the backbone model for feature extraction, and the dropout probability. The SSD implementation supports backbones from the ResNet, DenseNet, and VGG families. The class arcgis.learn.RetinaNet() creates a RetinaNet object detector with the specified zoom scales and aspect ratios of anchor boxes, as well as the backbone model. Only backbones from the ResNet family are supported by the RetinaNet implementation. The aspect ratio parameter was set to 1:1 to define the square shape of the labels. Multiple zoom parameters were tested to define an appropriate anchor box scale covering the sizes of the detected people in the image. Table 5 summarizes the parameter settings used for the SSD and RetinaNet models.
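A sketch of how the SSD and RetinaNet detectors can be parameterized via the classes named above is shown below; it reuses the prepared data object from the earlier sketch, and the specific grid, zoom, ratio, and backbone values are illustrative examples in the spirit of Table 5, not the full set of tested combinations:

```python
from arcgis.learn import SingleShotDetector, RetinaNet

# SSD: an 8x8 anchor grid, a 0.5 zoom scale, square (1:1) anchor boxes,
# and one of the supported VGG backbones.
ssd = SingleShotDetector(
    data,
    grids=[8],
    zooms=[0.5],
    ratios=[[1.0, 1.0]],
    backbone="vgg19",
)
ssd.fit(epochs=150, lr=ssd.lr_find())
print(ssd.average_precision_score())

# RetinaNet: only ResNet-family backbones are supported; square anchors again.
retina = RetinaNet(
    data,
    scales=[0.5, 1.0],
    ratios=[1.0],
    backbone="resnet50",
)
retina.fit(epochs=150, lr=retina.lr_find())
print(retina.average_precision_score())
```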
The above testing of the ArcGIS Pro implementations yielded very poor results, with a maximum average precision of only 0.1% for the SSD model employing the VGG19 backbone, a batch size of 16, a zoom of 0.5, and a grid of 8.
Given the poor results of the YOLO, SSD, and RetinaNet models in the ArcGIS Pro implementations, only the Faster R-CNN model was tested further.