Article

Robust Object Categorization and Scene Classification over Remote Sensing Images via Features Fusion and Fully Convolutional Network

1 Department of Computer Science and Software Engineering, Al Ain University, Al Ain 15551, United Arab Emirates
2 Department of Computer Science, Air University, Islamabad 44000, Pakistan
3 Department of Humanities and Social Science, Al Ain University, Al Ain 15551, United Arab Emirates
4 Department of Computer Science, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
5 Department of Computer Engineering, Tech University of Korea, 237 Sangidaehak-ro, Siheung-si 15073, Korea
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(7), 1550; https://doi.org/10.3390/rs14071550
Submission received: 17 February 2022 / Revised: 10 March 2022 / Accepted: 22 March 2022 / Published: 23 March 2022
(This article belongs to the Special Issue Pattern Analysis in Remote Sensing)

Abstract

The latest visionary technologies have made an evident impact on remote sensing scene classification. Scene classification is one of the most challenging yet important tasks in understanding high-resolution aerial and remote sensing scenes. In this discipline, deep learning models, particularly convolutional neural networks (CNNs), have made outstanding accomplishments. Deep feature extraction from a CNN model is a frequently utilized technique in these approaches. Although CNN-based techniques have achieved considerable success, there is still ample room for improvement in their classification accuracies. Certainly, fusion with other features has the potential to substantially improve the performance of remote sensing scene classification. This paper, thus, offers an effective hybrid model that is based on the concept of feature-level fusion. We use the fuzzy C-means segmentation technique to appropriately segment the various objects in the remote sensing images. The segmented regions of the image are then labeled using a Markov random field (MRF). After the segmentation and labeling of the objects, classical and CNN features are extracted and combined to classify the objects. After categorizing the objects, object-to-object relations are studied. Finally, these objects are transmitted to a fully convolutional network (FCN) for scene classification along with their relationship triplets. The experimental evaluation of three publicly available standard datasets reveals the excellent performance of the proposed system.


1. Introduction

Recent advances in imaging technology have substantially increased the resolution of remote sensing (RS) imagery. RS images are currently employed in a variety of research disciplines, including object categorization [1], image reconstruction [2], change detection analysis [3], land-use classification [4], scene classification [5], and environmental monitoring [6]. Scene classification for RS images, which aims to assign a scene category to each RS image on the basis of its semantic information, is crucial in practical applications.
Accurate aerial scene classification generally requires effective feature extraction. Apart from classic methods based on hand-crafted features [7], recent years have seen impressive performance achieved through deep convolutional neural network (CNN)-based approaches [8]. CaffeNet [9], AlexNet [10], VGG Net [11], GoogLeNet [12], and ResNet [13] are all regularly used CNN models, and CNNs have exhibited an exceptional capacity to extract discriminative features from aerial scenes. Despite the outstanding results obtained using CNN-based approaches, the task of extracting useful features from aerial scene imagery continues to face several difficulties.
First, in comparison to natural scenes, aerial scene images exhibit a significant degree of intraclass diversity. Specifically, items belonging to the same scene type may appear in a variety of sizes and orientations. Additionally, the appearance of the same scene may be altered owing to varied imaging conditions, such as the height of the image-capturing equipment and the solar altitude. Second, scene images from distinct classes may contain identical items and similar structures, resulting in a low degree of interclass dissimilarity. In general, a strong representation of aerial imagery is critical for gaining a competitive edge in this field. As a result, the features that we employ and how we apply them are becoming increasingly significant in the domain of aerial scene classification.
In this paper, we present an efficacious framework to significantly enhance the classification accuracy for remote sensing imagery. Initially, we incorporate fuzzy C-means segmentation to partition the scene into homogeneous regions corresponding to the different objects in the scene. After segmentation, a Markov random field (MRF) model is adopted as a postprocessing and labeling technique. During postprocessing, the segmented regions of the image are more clearly segregated as disconnected parts are converted to connected components and, finally, unique labels are assigned to the segmented objects using a probabilistic approach. Once the segments have been labeled, they are used to extract features through classical and CNN-based methods. As a deep feature extractor, we deploy a pretrained CNN, while super-pixel patterns, spectral-spatial features (SSFs), and Haralick texture features are extracted as classical features. A parallel feature fusion is incorporated to fuse all the extracted features. The fused feature set is transmitted to multiple kernel learning (MKL) for object categorization in the remote sensing imagery. These categorized objects are then analyzed for the object-to-object relationships (OORs) present in the scene imagery. Finally, these relationship triplets and categorized objects are fed to a fully convolutional network (FCN) for scene classification. We evaluated our system over three publicly available datasets, and the comparison of our results with various state-of-the-art (SOTA) methods demonstrates significant improvements over other SOTA techniques. The key contributions of this research are as follows:
  • We employed MRF as a postprocessing and labeling technique after segmentation to avoid the challenges encountered with other segmentation techniques and to support accurate scene classification.
  • CNN and classical features including Haralick features, spectral-spatial features, and super-pixel patterns are fused to improve the classification accuracy.
  • MKL significantly enhances the performance of object categorization.
  • Probability-based OOR relations are introduced to contextually analyze the relationship between the objects present in the remote sensing scenes.
  • After object categorization and OOR exploration, FCN is applied for the remote scene classification.
The rest of the paper is organized as follows: Section 2 discusses related works. Section 3 provides an overview of the proposed method, which includes segmentation, labeling, feature extraction, and their fusion. Section 4 gives the details of the datasets used, the experimental design, and the outcomes. Lastly, in Section 5, we provide the conclusions of this study.

2. Related Work

Exploring the locations of multiple objects, their calibration and positioning, and their impact on scenic imagery are complicated issues in the domain of aerial and remote sensing images. We conducted a literature review across multiple domains, including object categorization, object segmentation, labeling, and scene classification, to develop appropriate dynamics and metrics for the presented approach.

2.1. Object Categorization

The area of object categorization involves various challenges for researchers, including locating objects, detecting and analyzing their relationships, finding occluded components, and separating classes for desirable outcomes. Over the last decade, the bag-of-features model has undoubtedly been the most popular and effective paradigm for image categorization and classification, and numerous intriguing works have focused on this concept [14]. Martin [15] developed a Bayesian inference model that assesses each object’s prior knowledge to track several objects and revises the potential mass function, allowing more precise object recognition and a faster convergence rate for accurate classification. In [16], a unique class-specific representation technique for object categorization was offered. Initially, a Gaussian mixture model (GMM) was used to describe the features of the images inside each class. Images and GMM models were then compared in terms of their respective Euclidean distances, which were utilized to represent each image by concatenating the representations of all the classes. In this manner, an image could be expressed by combining the class-specific features as well as the visual components. In [17], an effective technique was presented to classify indoor–outdoor scenes by employing multi-object categorization. Two different approaches were used to segment the images, and object categorization was then performed using multiple kernel learning (MKL) by considering local descriptors combined with the signatures of specific regions. After finding the object relationships, multiclass logistic regression was applied to classify the scenes.
Wong et al. [18] presented an approach for online detection and classification of the object classes in an image. They proposed using kernel learning to rapidly track all the objects in a scene rather than relying on prior knowledge of a single object. The Neovision2 tower benchmark dataset was used to develop a biologically inspired approach for detecting an object’s contours and motion. Sumbul et al. [19] developed a multisource region attention network that computes per-source feature representations, which are then distributed across the network’s members on the basis of their representations of object locations. They employed multispectral approaches to achieve better accuracy.

2.2. Scene Classification

Previously published research utilized low-level cues to categorize objects and scenes, including histograms of oriented gradients [20], statistical analysis of structural information for texture discrimination [21], GIST [22], and the scale-invariant feature transform (SIFT) [23]. However, these solutions depended on technical expertise and expert knowledge to generate feature representations, which limits their ability to represent large amounts of scene data. To overcome the shortcomings of low-level feature-based classification approaches, several methods have been devised to improve the efficiency of scene classification by aggregating the collected local low-level visual cues into a mid-level scene representation. One of the most extensively used systems based on mid-level visual features is the bag of visual words (BoVW) [24]. It constructs a visual dictionary using k-means clustering, and mid-level visual information is then obtained through feature encoding. BoVW and its advanced variants have been used to classify scenes on numerous occasions. Additionally, other mid-level features based on traditional approaches exist, including spatial pyramid matching [25], the improved Fisher kernel [26], and vectors of locally aggregated descriptors [27].
However, the previously described approaches, which rely on low- and mid-level features retrieved from RS imagery, are not particularly sophisticated and, hence, cannot adequately reflect the semantic information contained in images. Recent research has demonstrated that deep learning approaches, particularly CNNs, perform exceptionally well in computer vision applications owing to their great feature extraction capacity. Additionally, RS image scene classification falls under the category of high-level image processing tasks that are strongly connected to computer vision. Early RS images had a poor resolution, and the scenes to be identified are large-area land cover, in contrast to the natural images used in computer vision, which focus on small-scale items; as a result, it was difficult to incorporate deep learning-based algorithms into the categorization of RS image scenes. However, RS images now have a high spatial resolution, and the disparity between RS and natural images has been minimized; hence, the possibility of incorporating different remote sensing visualization techniques into image processing has increased. Numerous CNN-based algorithms for scene classification have been introduced recently [28]. Rather than relying on low- and mid-level cues, CNN-based approaches can extract hierarchical features from RS images. Additionally, the majority of CNN-based approaches make use of models that have been pretrained on ImageNet [29], including AlexNet [10], VGG [11], ResNet [13], and DenseNet [30]. Hu et al. [31] validated the efficiency of CNN models utilizing convolutional layer features. Li et al. [32] suggested a unique filter bank for simultaneously capturing local and global information in order to improve classification results. They investigated the effect of various training procedures on the categorization process. Their system includes three different training approaches: feature extraction and fine-tuning through a pretrained CNN framework, and fully trained networks. The experimental findings revealed that, when compared to the other two procedures, the fine-tuning strategy achieved a better classification accuracy.

3. Proposed System Methodology

This section demonstrates a novel object categorization and scene classification (OCSC) model that categorizes the objects present in remote scene imagery and classifies the scenes on the basis of these categorized objects. Initially, a remote sensing image is segmented by employing the fuzzy C-means (FCM) algorithm. These segmented regions are then further processed to refine the segments and are labeled via MRF. The labeled objects are then analyzed for feature extraction by a CNN, while classical features, including Haralick features, SSFs, and super-pixel patterns, are also extracted. After the fusion of these extracted features, MKL is applied to categorize the unique objects in the remote scene images. Once the categories of the objects are separated, the OOR is computed on the basis of probability triplets. Finally, these OOR probabilities and object categories are taken as the input of the FCN for remote sensing scene classification. Figure 1 illustrates the hierarchical view of our system.

3.1. Preprocessing Stage

Unsharp masking [33] is performed during preprocessing to provide an enhanced image with sharp edges. Three parameters control unsharp masking: amount, radius, and threshold. The amount parameter adjusts the contrast along the edges and is typically specified as a percentage. The radius defines the thickness of the edges. The threshold controls the image’s brightness level. We set the radius and amount parameters to 0.75% and 1.25%, respectively, in our study. The sharpened image is obtained as follows:
$$I_{sh} = I_o + (I_o - I_b) \times amt,$$
where $I_{sh}$ represents the sharpened image, $I_o$ specifies the original image, $I_b$ is the blurred image, and $amt$ is the amount parameter, which denotes the strength of the sharpening effect.
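As an illustration of this preprocessing step, the following minimal Python sketch applies unsharp masking with OpenCV; the paper reports radius and amount as percentages, whereas here the radius is treated as a Gaussian sigma and the amount as a gain factor, which are assumptions made for the example.

```python
import cv2
import numpy as np

def unsharp_mask(image, radius=2.0, amount=1.25):
    """Sharpen an image by adding back a scaled high-frequency residual:
    I_sh = I_o + (I_o - I_b) * amt, with I_b a Gaussian-blurred copy."""
    img = image.astype(np.float32)
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=radius)   # I_b
    sharpened = img + amount * (img - blurred)               # I_o + (I_o - I_b) * amt
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```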

3.2. Object Segmentation via Fuzzy C-Means

This section describes the fuzzy C-means (FCM) approach [34,35] used for segmentation. Initially, homologous components are identified on the basis of pixels that are treated as data points. Rather than being assigned to a single defined cluster, each pixel is considered, following fuzzy logic, to be a member of numerous clusters. The FCM algorithm segments the image by iteratively minimizing an objective function: the ideal image clusters are obtained by minimizing the squared-error objective function $A_N(P, Q)$ as follows:
$$A_N(P, Q) = \sum_{i=1}^{c} \sum_{j=1}^{n} p_{ij}^{r} \left\| x_j - q_i \right\|^2,$$
where $n$ is the number of data points, $r > 1$ is a real-valued fuzziness exponent, $c$ denotes the number of clusters, $p_{ij}$ reflects the membership of pixel $x_j$ in the $i$-th cluster, and $q_i$ expresses the centroid of the $i$-th cluster. The memberships and centroids are updated as
$$p_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\left\| x_j - q_i \right\|_2^2}{\left\| x_j - q_k \right\|_2^2} \right)^{\frac{1}{r-1}}}, \quad p_{ij} \in [0, 1], \ \text{for } i = 1, \ldots, c,$$
$$q_i = \frac{\sum_{j=1}^{n} p_{ij}^{r} x_j}{\sum_{j=1}^{n} p_{ij}^{r}},$$
where $A_N(P, Q)$, which aggregates the distances between each pixel and the cluster centers, is calculated using the membership matrix $P$ and the centroids $Q$. When the minimal distance from a pixel to a cluster center is observed, a high membership value is allocated to that well-suited pixel. With the typical FCM approach, a high computational complexity results from the analysis of spatial values at each iteration, which is used to quantify the distance from the cluster center to the relevant pixel in an image. Figure 2 shows the outcomes of segmenting images from the UCM dataset.
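A minimal NumPy sketch of the FCM updates above is given below; the number of clusters, the fuzziness exponent r = 2, and the iteration count are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np

def fuzzy_c_means(pixels, c=5, r=2.0, n_iter=50, eps=1e-8, seed=0):
    """Minimal FCM on an (N, D) array of pixel values.
    Returns the membership matrix P (c x N) and the centroids Q (c x D)."""
    rng = np.random.default_rng(seed)
    n = pixels.shape[0]
    P = rng.random((c, n))
    P /= P.sum(axis=0, keepdims=True)          # memberships of each pixel sum to 1
    for _ in range(n_iter):
        Pr = P ** r
        Q = (Pr @ pixels) / (Pr.sum(axis=1, keepdims=True) + eps)           # centroid update q_i
        d2 = ((pixels[None, :, :] - Q[:, None, :]) ** 2).sum(axis=2) + eps  # ||x_j - q_i||^2
        inv = d2 ** (-1.0 / (r - 1.0))
        P = inv / inv.sum(axis=0, keepdims=True)                            # membership update p_ij
    return P, Q

# Usage: hard-assign every pixel of an (H, W, 3) image to its max-membership cluster.
# P, Q = fuzzy_c_means(img.reshape(-1, 3).astype(np.float32), c=5)
# labels = P.argmax(axis=0).reshape(img.shape[:2])
```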

3.3. Labeling via Markov Random Field

A Markov random field (MRF) [36,37] can be described formally by a set of sites $S = \{1, \ldots, N\}$ corresponding to the $N$ pixel locations. A collection of random variables $\{w_n\}_{n=1}^{N}$ and a set of neighborhoods $\{N_n\}_{n=1}^{N}$ are associated with the $N$ locations. To qualify as a Markov random field, the model must adhere to the following Markov property:
$$\Pr\left( w_n \mid w_{S \setminus n} \right) = \Pr\left( w_n \mid w_{N_n} \right).$$
As a result, a Markov random field can be considered an undirected model that specifies the probability of the variables as a product of potential functions such that
$$\Pr(w) = \frac{1}{Z} \prod_{j=1}^{J} \phi_j\left( w_{C_j} \right),$$
where $\phi_j[\bullet]$ is the $j$-th potential function, which never yields a negative value. This value is determined by the state of a subset of the variables $C_j \subset \{1, \ldots, N\}$, referred to as a clique in this context. The partition function, denoted by $Z$, is a normalizing constant that ensures the resulting probability distribution is valid. We used the MRF for postprocessing of the segmented regions. Segmented regions with discontinuities are first connected by considering multiple key points on the boundaries and joining these key points so that the segmented regions are accurately separated. Then, regions whose pixels have similar features within a connected boundary are grouped together and assigned a unique label. Figure 3 illustrates the results of MRF labeling on a selection of images from the AID.
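The sketch below illustrates one common way to realize such MRF-based postprocessing, using iterated conditional modes (ICM) with a Potts pairwise term to smooth an initial FCM label map, followed by connected-component labeling; the energy weights and the 4-neighborhood are assumptions made for the example, not the exact scheme used in the paper.

```python
import numpy as np
from scipy import ndimage

def mrf_icm_smooth(labels, unary, beta=1.5, n_iter=5):
    """ICM-style smoothing of an initial label map under a Potts MRF prior.
    labels: (H, W) initial labels (e.g., FCM hard assignment).
    unary:  (H, W, C) per-class costs (e.g., negative log memberships)."""
    h, w, n_classes = unary.shape
    lab = labels.copy()
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # 4-neighborhood N_n
    for _ in range(n_iter):
        best_cost = None
        best_lab = None
        for k in range(n_classes):
            disagree = np.zeros((h, w))
            for dy, dx in offsets:                         # neighbors not labeled k
                disagree += (np.roll(lab, (dy, dx), axis=(0, 1)) != k)
            cost_k = unary[..., k] + beta * disagree       # data term + Potts penalty
            if best_cost is None:
                best_cost, best_lab = cost_k, np.full((h, w), k)
            else:
                better = cost_k < best_cost
                best_cost = np.where(better, cost_k, best_cost)
                best_lab = np.where(better, k, best_lab)
        lab = best_lab
    return lab

# Unique region labels for one smoothed class, via connected components:
# region_ids, n_regions = ndimage.label(lab == 2)
```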

3.4. Feature Extraction

To categorize the objects in remote sensing imagery, various classical and deep features are analyzed. Classical features including Haralick texture features, SSFs, and super-pixel patterns are computed on the basis of statistical techniques while deep learning-based features are extracted using a pretrained CNN model. This section covers the feature computation, fusion, and selection processes in detail.

3.4.1. CNN Features

To extract CNN features [38], VGG-16 (a pretrained CNN model) is incorporated. Deng et al. [39] trained this model on the ImageNet dataset. The model is simple and comprises an input layer and 13 convolutional layers. The input layer accepts images with dimensions of 320 × 320 × 3. There are also five max-pooling layers, followed by three fully connected layers. The window size for max pooling is 2 × 2. The rectified linear unit (ReLU) is used as the activation function in the hidden layers. To extract effective CNN features, a transfer learning method is applied that exploits the already learned features, making the model useful without training a new model from scratch. The general architecture of CNN feature extraction is shown in Figure 4.
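The following Keras sketch shows a typical way to use the frozen pretrained VGG-16 as a deep feature extractor under the transfer-learning setup described above; global average pooling of the last convolutional block into a 512-dimensional vector is an assumption made for this example rather than the exact head used in the paper.

```python
import numpy as np
import tensorflow as tf

# Frozen VGG-16 (ImageNet weights) reused as a deep feature extractor for
# 320 x 320 x 3 inputs; global average pooling of the last convolutional
# block yields a 512-dimensional feature vector per image (assumed head).
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(320, 320, 3))
base.trainable = False
extractor = tf.keras.Sequential([base, tf.keras.layers.GlobalAveragePooling2D()])

def cnn_features(batch):
    """batch: (N, 320, 320, 3) uint8 RGB images -> (N, 512) CNN features."""
    x = tf.keras.applications.vgg16.preprocess_input(batch.astype(np.float32))
    return extractor(x, training=False).numpy()
```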

3.4.2. Haralick Features

Remote sensing images of several objects may appear identical in color but have distinct texture patterns. This inspired us to integrate texture features that behave as local descriptors. To obtain texture features, we used a gray-level cooccurrence matrix. Four local features, termed Haralick features [40], are derived from this cooccurrence matrix. Haralick assumed that the matrix contains texture information, from which texture features can subsequently be computed. Fourteen features can be computed from the cooccurrence matrix; however, only four are commonly used. These four texture features, energy ($E$), contrast ($C$), correlation ($Cor$), and entropy ($H$), are computed mathematically by the following equations:
$$E = \sum_{i} \sum_{j} M(i, j)^2,$$
$$C = \sum_{k=0}^{m-1} k^2 \sum_{|i - j| = k} M(i, j),$$
$$Cor = \frac{\sum_{i} \sum_{j} (i - \mu_i)(j - \mu_j) M(i, j)}{\sigma_i \sigma_j},$$
$$H = -\sum_{i} \sum_{j} M(i, j) \log M(i, j).$$
It was demonstrated that these four parameters were sufficient to produce acceptable results in a classification test. These four parameters are listed with their values in Table 1.
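A possible implementation of these four GLCM-based features with scikit-image is sketched below; the offsets and angles are illustrative choices, skimage's "ASM" property corresponds to the energy defined above, and entropy is computed directly from the normalized matrix since graycoprops does not provide it.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def haralick_features(gray, distances=(1,), angles=(0.0, np.pi / 2)):
    """Energy, contrast, correlation, and entropy from a GLCM of a 2-D uint8 image."""
    glcm = graycomatrix(gray, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    energy = graycoprops(glcm, "ASM").mean()                 # E = sum_ij M(i,j)^2
    contrast = graycoprops(glcm, "contrast").mean()          # C
    correlation = graycoprops(glcm, "correlation").mean()    # Cor
    # Entropy H = -sum_ij M(i,j) log M(i,j), computed from the normalized matrix.
    p = glcm + 1e-12
    entropy = (-(p * np.log(p)).sum(axis=(0, 1))).mean()
    return np.array([energy, contrast, correlation, entropy])
```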

3.4.3. Spectral–Spatial Features (SSFs)

Mathematical morphology [41] is one of the well-known paradigms that equips operators with the ability to generate high-quality SSFs [42]. Erosion and dilation are basic mathematical morphology operations that examine an image’s geometrical structures by comparing them to small patterns called structuring elements.
Attribute filters (AFs): Various flat regions of the image, i.e., areas that have comparable gray levels, are used to extract various types of information specified by the attribute names. An image’s equivalent tree representation can be used to efficiently build attribute filters, as in [43]. By applying a threshold to all of the image’s mapped values, the following upper- and lower-level sets (i.e., flat zones) are created, which can be further classified into subcategories:
$$U_f = \left\{ X : X \in ConComp\left( [f \geq \lambda] \right),\ \lambda \in \mathbb{Z} \right\}, \quad L_f = \left\{ X : X \in ConComp\left( [f < \lambda] \right),\ \lambda \in \mathbb{Z} \right\},$$
where $ConComp(\cdot)$ denotes the connected components of the generic image. An inclusion relationship [33] exists between the connected components obtained from either the lower- or the upper-level sets.
Attribute profiles (APs): APs define a generic collection of profiles that make use of the attribute filter’s flexibility to conduct a more thorough investigation of the scene.
Extended attribute profiles: Because hyperspectral sensors acquire data across many spectral bands, extended attribute profiles (EAPs) based on morphological attribute filters are used to analyze hyperspectral high-resolution images. The EAPs are based on the application of the APs to hyperspectral data.
$$EAP = \left\{ AP(PC_1), AP(PC_2), \ldots, AP(PC_c) \right\},$$
where $PC_i$ denotes the $i$-th principal component obtained by applying principal component analysis to the data.
Extended multi-attribute profiles (EMAPs): Many features can be used to extract spatial elements more effectively; hence, EMAPs combine multiple EAPs into a single data structure.
$$EMAP = \left\{ EAP_{a_1}, EAP_{a_2}, \ldots, EAP_{a_m} \right\}.$$
The spatial information extraction in the EMAP is substantially more powerful than that of a single EAP; however, processing these features incurs a substantial computational cost, as the max-tree and min-tree are generated once for each $PC$ and are processed with various attributes at multiple stages. The visual results of SSFs over aerial images are presented in Figure 5.
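The sketch below gives a simplified, EMAP-style feature stack: PCA reduces the spectral bands to a few principal components, and area-attribute openings and closings at several thresholds stand in for the attribute profiles; a full EMAP would use max-/min-tree filtering with several attribute types, so the thresholds and the area attribute here are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.morphology import area_opening, area_closing

def emap_like_features(image, n_components=3, area_thresholds=(100, 500, 2000)):
    """Simplified EMAP-style stack for an (H, W, B) multiband image.
    PCA gives the PCs; area openings/closings at several thresholds play the
    role of the attribute profiles applied to each PC."""
    h, w, b = image.shape
    pcs = PCA(n_components=n_components).fit_transform(
        image.reshape(-1, b).astype(np.float32)).reshape(h, w, n_components)
    profile = []
    for k in range(n_components):
        band = pcs[..., k]
        # Rescale each PC to 0-255 so the attribute filters apply cleanly.
        band = np.round(255 * (band - band.min()) / (np.ptp(band) + 1e-12)).astype(np.uint8)
        profile.append(band)
        for t in area_thresholds:
            profile.append(area_opening(band, area_threshold=t))   # removes bright regions < t px
            profile.append(area_closing(band, area_threshold=t))   # removes dark regions < t px
    return np.stack(profile, axis=-1)   # (H, W, n_components * (1 + 2 * len(area_thresholds)))
```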

3.4.4. Super-Pixel Pattern

We create super-pixels following [44], a method that is faster and more memory-efficient than other approaches, demonstrates state-of-the-art boundary conformance, and enhances segmentation efficiency. Simple linear iterative clustering (SLIC) is a modification of k-means for super-pixel creation, with two critical differences: first, the optimization time is reduced by narrowing the search area on the basis of the super-pixel size, which leads to significantly fewer distance calculations; second, the complexity is linear in the number of pixels N and does not depend on the number of super-pixels k. A weighted distance metric that combines color and spatial distance makes it possible to regulate the size and coherence of the super-pixels.
Super-pixels correspond to clusters in the combined color-image-plane space, which raises the issue of defining the distance measure $Dist_F$ used to compute the distance between a pixel $i$ and a cluster center $C_k$. The color of every pixel is represented in the $[l\ a\ b]^T$ color space, whose values have a known range. The pixel’s position $[x\ y]^T$, on the other hand, may take a range of values that vary according to the size of the image. We therefore compute two distances, a color distance and a spatial distance, and combine them into a single measure by normalizing each by its respective maximum distance within a cluster, $Nor_{col}$ and $Nor_{spt}$. In doing so, $Dist_F$ is written as
$$dist_{col} = \sqrt{\left( l_j - l_i \right)^2 + \left( a_j - a_i \right)^2 + \left( b_j - b_i \right)^2}, \quad dist_{spt} = \sqrt{\left( x_j - x_i \right)^2 + \left( y_j - y_i \right)^2},$$
$$Dist_F = \sqrt{\left( \frac{dist_{col}}{Nor_{col}} \right)^2 + \left( \frac{dist_{spt}}{Nor_{spt}} \right)^2}.$$
Results of super-pixel patterns computed over remote sensing images from the UCM dataset are presented in Figure 6.
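A minimal scikit-image sketch of SLIC super-pixel extraction follows; the number of segments, the compactness value (which plays the role of the color/spatial trade-off in $Dist_F$), and the per-segment descriptor (mean Lab color plus segment size) are illustrative assumptions.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2lab

def superpixel_pattern(image, n_segments=200, compactness=10.0):
    """SLIC super-pixels for an (H, W, 3) RGB image plus a simple per-segment
    descriptor (mean Lab color and segment size)."""
    segments = slic(image, n_segments=n_segments, compactness=compactness, start_label=0)
    lab = rgb2lab(image)
    descriptors = []
    for s in range(segments.max() + 1):
        mask = segments == s
        descriptors.append(np.r_[lab[mask].mean(axis=0), mask.sum()])
    return segments, np.asarray(descriptors)
```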

3.5. Feature Fusion

The CNN, Haralick, SSF, and super-pixel pattern features are computed separately as $Feature_{CNN}$, $Feature_{Haralick}$, $Feature_{SS}$, and $Feature_{SP}$, respectively. These feature vectors are normalized before fusion, to ensure that the elements of one feature vector do not surpass those of the others, and are then merged as in [45] to form a complete fused feature vector:
$$Feature_F = \left[ Feature_{CNN},\ Feature_{Haralick},\ Feature_{SS},\ Feature_{SP} \right].$$
A high-dimensional feature vector is obtained as a result of the two-level decomposition of complex images during feature analysis. Consequently, inadequate classification is observed when the input feature vectors have high dimensions. Therefore, reducing the size of the feature vectors is important in order to reduce computational costs and improve performance. For this purpose, genetic algorithm (GA)-based [46] feature selection is employed to obtain the reduced-dimensional feature vector $Feature_{Fin}$:
$$Feature_{Fin} = GA\left( Feature_F \right).$$
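A minimal sketch of this parallel fusion step is shown below, assuming L2 normalization of each individual feature vector before concatenation; the GA-based selection itself is only indicated, since its fitness function and encoding are not detailed here.

```python
import numpy as np

def fuse_features(f_cnn, f_haralick, f_ss, f_sp):
    """Normalize each feature vector to unit length so no single feature type
    dominates, then concatenate them into Feature_F."""
    parts = []
    for f in (f_cnn, f_haralick, f_ss, f_sp):
        f = np.asarray(f, dtype=np.float32).ravel()
        parts.append(f / (np.linalg.norm(f) + 1e-12))
    return np.concatenate(parts)

# GA-based selection then reduces Feature_F to Feature_Fin by evolving a binary
# mask over feature indices, e.g. feature_fin = feature_f[mask].
```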

3.6. Object Categorization: Multiple Kernel Learning

The proposed system employs MKL [17] to achieve object categorization on the basis of multiple regions and region signatures in complex remote sensing imagery, as shown in Figure 7. During object categorization, an image $I$ with $c$ clusters, obtained from the segmented and labeled objects displayed in distinct colors, is used to extract the descriptors $D(I)$ that describe the regions $R$ of the image $I$. To compute the signature $x(I)$, a function $f_R$ mapping the local descriptors $D(I)$ to $x(I)$, i.e., $f_R : D(I) \to x(I)$, is incorporated. Mathematically, $f_R$ can be written as follows:
$$Center_c = \frac{1}{|c|} \sum_{I} \sum_{i} D_i^c(I),$$
$$\mu_c = \frac{1}{|c|} \sum_{I} \sum_{i} \frac{D_i^c(I) - Center_c}{\left\| D_i^c(I) - Center_c \right\|},$$
$$\mu_{I,c} = \sum_{i} \frac{D_i^c(I) - Center_c}{\left\| D_i^c(I) - Center_c \right\|} - \mu_c,$$
where $Center_c$ represents the center of cluster $c$, $|c|$ denotes the total number of descriptors in cluster $c$ over all images of a class, $D_i^c(I)$ denotes the descriptors of image $I$ that belong to cluster $c$, $\mu_c$ is the mean of the centered descriptors belonging to $c$, and $\mu_{I,c}$ is the signature computed from image $I$ for cluster $c$. A vector $VEC_{I,c}$ is then obtained from $\mu_{I,c}$, and the signature vector $x(I)$ of image $I$ is computed by concatenating the $VEC_{I,c}$ over all $c$:
$$VEC_I = \left[ VEC_{I,1}, \ldots, VEC_{I,C} \right].$$
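The following NumPy sketch computes the per-cluster signatures and their concatenation as in the equations above, under the assumption that the centred descriptors are length-normalized; the input containers (dictionaries keyed by cluster index) are an arbitrary choice made for the example.

```python
import numpy as np

def image_signature(descriptors_by_cluster, centers, class_means):
    """Concatenated per-cluster signature VEC_I of one image.
    descriptors_by_cluster: dict c -> (n_c, d) array D_i^c(I)
    centers:                dict c -> (d,) cluster center Center_c
    class_means:            dict c -> (d,) class-level mean mu_c"""
    parts = []
    for c in sorted(centers):
        D = descriptors_by_cluster.get(c)
        if D is not None and len(D) > 0:
            centred = D - centers[c]
            centred = centred / (np.linalg.norm(centred, axis=1, keepdims=True) + 1e-12)
            mu_ic = centred.sum(axis=0) - class_means[c]   # mu_{I,c}
        else:
            mu_ic = -class_means[c]                        # image has no descriptors in cluster c
        parts.append(mu_ic)
    return np.concatenate(parts)                           # VEC_I = [VEC_{I,1}, ..., VEC_{I,C}]
```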

3.7. Probability-Based Object-to-Object Relations (OORs)

After recognizing multiple objects in a complex scene, the relationships between these objects are identified. To enhance scene recognition performance, object-to-object relations (OORs) [47] are computed on the basis of contextual information about the objects. As complex scenes comprise multiple co-occurring visual elements, these OORs capture patterns that significantly aid in understanding the scenes. For instance, a car is likely to be seen on a road instead of in the sky or water, while a ship is likely to be in the sea instead of on a road. To determine the OORs, several features and the relative positions of the objects are considered. Initially, to find the weight of the $j$-th target object, for $j \in \{1, 2, \ldots, n\}$, with respect to another relevant $i$-th object, for $i \in \{1, 2, \ldots, n\}$, a dot product is computed as follows:
$$wt_{j,i} = \frac{f_j \cdot f_i}{d(j, i)},$$
where the visual cues of the $j$-th and $i$-th objects are represented by $f_j$ and $f_i$, respectively, and the distance between the $j$-th and $i$-th objects is denoted by $d(j, i)$. Lastly, to determine the relation of the $j$-th object with the other objects, the following expression is used:
$$R_j = \frac{\sum_{i} wt_{j,i}\, f_i}{n},$$
where the visual features of the $i$-th object are denoted by $f_i$ and $n$ is the number of objects. Once the relations between the objects are computed, the classifier predicts the scene labels on the basis of these OORs. Figure 8 presents a schematic view of the OORs.
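A direct NumPy sketch of the two OOR equations above is given below; using object centroids for the pairwise distances $d(j, i)$ and zeroing the self-relation are assumptions made for the example.

```python
import numpy as np

def object_relations(features, centroids):
    """OOR weights and relation vectors for n objects.
    features:  (n, d) visual features f_i
    centroids: (n, 2) object positions used for the distances d(j, i)"""
    n = len(features)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1) + 1e-6
    wt = (features @ features.T) / d        # wt_{j,i} = (f_j . f_i) / d(j, i)
    np.fill_diagonal(wt, 0.0)               # ignore self-relations
    R = (wt @ features) / n                 # R_j = (1/n) * sum_i wt_{j,i} f_i
    return wt, R
```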

3.8. Scene Recognition: Fully Convolutional Network

Once the OORs are determined, the object triplets and probabilities are forwarded to the FCN, which classifies the scenes by incorporating the object categories and the contextual relationships between those objects. The FCN [48] is an architecture that is mostly used for semantic segmentation. It employs locally connected layers, including convolution, pooling, and up-sampling, in a variety of ways. Avoiding dense layers results in fewer parameters (i.e., the networks are faster to train). Additionally, because all connections are local, an FCN can be used with varying image sizes. The network is composed of a down-sampling path for extracting and interpreting context, as well as an up-sampling path for localization.
A fully convolutional network (FCN) with the following hyperparameters is used to classify the remote sensing scenes: a learning rate of 0.01, a batch size of 16, and 32, 64, 128, 264, and 512 filters in conv_block1 through conv_block5, respectively. Although the learning rate could be any floating-point value between 0.0001 and 0.1, a learning rate of 0.01 led to the best results during training for remote scene classification over the benchmark datasets under consideration, i.e., the UCM, AID, and RESISC45 datasets. Similarly, the batch size can range from 1 to 100, but a power of 2 is usually chosen; we chose 16 ($2^4$) owing to its better performance during training. Figure 9 depicts the results of scene classification on a benchmark dataset.
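The sketch below assembles a fully convolutional classifier with the reported filter counts and learning rate; the use of two 3 × 3 convolutions per block, the SGD optimizer, and the 1 × 1 convolution followed by global average pooling as the classification head are assumptions for illustration (the 264 filters in conv_block4 are kept as reported).

```python
import tensorflow as tf

def build_fcn_classifier(n_classes):
    """Fully convolutional scene classifier with the reported filter counts
    (32/64/128/264/512) and learning rate 0.01; dense layers are avoided by
    using a 1 x 1 convolution and global average pooling as the head."""
    inputs = tf.keras.Input(shape=(None, None, 3))          # works for varying image sizes
    x = inputs
    for filters in (32, 64, 128, 264, 512):                 # conv_block1 ... conv_block5
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(n_classes, 1)(x)             # per-location class scores
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Softmax()(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Training would then use the reported batch size of 16:
# model.fit(train_images, train_labels, batch_size=16, epochs=...)
```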

4. Experimental Results

To evaluate the training/testing performance of the proposed model, we used the leave-one-out cross-validation method on three publicly available datasets: the AID, the RESISC45 dataset, and the UCM dataset.

4.1. Datasets Description

4.1.1. Aerial Images Dataset

The Aerial Images Dataset (AID) [49] is a newly created large-scale aerial image collection. The AID comprises 30 classes having 10,000 images. Each class is composed of 200–400 images, and every image contains at least two objects and at most eight objects. The dataset covers the following aerial scene types: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, and viaduct. Figure 10 presents some example images from the AID.

4.1.2. RESISC45 Dataset

The RESISC45 dataset [50] is one of the well-known benchmarks for remote sensing image scene classification. This dataset was created by Northwestern Polytechnical University (NWPU); therefore, it is also named NWPU-RESISC45, and it consists of 31,500 remote sensing images of 45 scene classes. Each class comprises 700 images, with a minimum of two and a maximum of 10 objects per image. These classes include airplane, airport, baseball diamond, basketball court, beach, bridge, forest, golf course, etc. Figure 11 shows a few classes of the NWPU-RESISC45 dataset.

4.1.3. UCM Dataset

The UCM dataset [51] is a benchmark that is publicly available for research purposes. The dataset comprises 21 classes with 100 images in each class. The number of objects in each image may vary from two to seven depending on the class scenario. The dimensions of the images are 256 × 256 pixels. The imagery was manually extracted from the USGS National Map Urban Area Imagery collection for several cities across the United States. The classes are labeled as agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court. Figure 12 illustrates a few examples of the UCM dataset.

4.2. Experimental Evaluation

In this section, we present the recognition accuracies based on the confusion matrices computed over three complex datasets: the AID, UCM dataset, and RESISC45 dataset. For OCSC, we used an FCN as a classifier, and the proposed system was evaluated by the leave-one-subject-out (LOSO) cross-validation technique. Figure 13 demonstrates the results over the UCM dataset with an average of 98.75% scene classification accuracy. Figure 14 presents a classification accuracy of 97.73% over the AID, and Figure 15 demonstrates an average accuracy of 96.57% over the RESISC45 dataset.
Class-wise accuracies can be read using the color code next to each class label on the left of the graph, which is specific to the corresponding class. A mixture of different colors on the right denotes other classes present in the result, i.e., misclassifications. A misclassification appears either as a color in the graph line above that specific class, which is a false positive (FP), or as a color in the mixture below the original class, which is a false negative (FN). For instance, the FL class in the AID shows both FPs and FNs in the graph along with correct predictions, where CH and DS are FPs, while FR and IN are FNs.
The recognition results of the UCM dataset show that GC, HB, and PG had lower accuracies compared to other scene classes. However, the overall recognition accuracy was better and comparable with other state-of-the-art methods. There are a total of 21 classes in the UCM dataset; out of those, we achieved remarkable performance on 18 classes, while the other three classes had good results, nearly equivalent to other existing methods.
Similar to the UCM dataset, we observed better performance over the AID compared to other SOTA techniques, as presented in Figure 14. Most of the classes demonstrated remarkable results in terms of accuracy. Higher accuracy was achieved by more than 20 classes including RV, SQ, SM, ST, and VD, while some other classes (DR, DS, FL, and PD) still need improvement. For instance, the IN class achieved an accuracy of 90% as shown in Figure 14, which demonstrates that 2% of cases were incorrectly recognized as FL and 8% of cases were misclassified as “DS”. Likewise, class-wise accuracies may be studied with the color against each class label on the left of the graph, where a mixture of different colors on the right denotes misclassification.
Analogous to that of the UCM and AID, the OCSC model demonstrated excellent performance when evaluated over the RESISC45 dataset. Figure 15 illustrates that most of the classes depicted exceptional performance in terms of recognition accuracy including PG and MW with accuracies of 99%, where MW was misclassified 1% of the time as MK, while the lowest accuracy was noted for the CL class, where CL was misclassified as CM, DR, and DS 9%, 8%, and 4% of the time, respectively.
In this section, experimental evaluation was performed on benchmarks including the AID, UCM dataset, and RESISC45 dataset. First, the CNN and classical features (i.e., SSFs, Haralick features, and super-pixel patterns) were fed to a commonly used classifier, an artificial neural network (ANN), and its results were obtained. Then, the same features were given to a deep belief network (DBN) for recognition. Finally, the recognition results of these conventional approaches were compared with those of the proposed OCSC model using the FCN. Table 2, Table 3 and Table 4 present the comparison of precision, recall, and F1 score over the AID, RESISC45 dataset, and UCM dataset, respectively.
In this section, we present the precision, recall, and F1 measures computed over the three complex datasets: the AID, UCM dataset, and RESISC45 dataset. We applied an ANN and a DBN for remote sensing scene classification and compared the results with those of the proposed FCN model. Although there were some comparable results in a few classes over the AID, we overall observed a significant improvement compared to the other well-known classifiers. A few classes, including BR and DR, showed better recall using the DBN, while PD had better precision using the DBN; however, the results were overall excellent in all classes using the proposed model. Similarly, the mean values of precision, recall, and F1 score were highest when applying the FCN (proposed model).
A similar pattern was observed when we applied three different classifiers over the RESISC45 dataset. We experienced a better precision value for BR and ST classes, while AT, BC, BH, FW, and OP classes had better recall value compared to the proposed method when a DBN was applied to the same dataset. Nevertheless, the mean precision, recall, and F1 score were the highest amongst the three well-known classifiers.
For a comprehensive evaluation, we compared the proposed system with various existing state-of-the-art (SOTA) methods, including the self-attention feature selection module represented by SAFENet [52], label augmentation via ResNet18 + LA + KL [53], ACNet [54], which explores local and global features integrated with attention techniques for remote scene classification, ARCNet-VGGnet16 [55], Deep Fusion [56], which uses a two-stream deep architecture for high-resolution aerial image classification, Fusion by Addition [57], and Siamese ResNet50 [58]. We compared the mean accuracy of scene classification, and the results are illustrated in Table 5. The proposed OCSC system outperformed the other reported methods in terms of mean accuracy. Specifically, the comparison between BoVW and SAFENet depicts an increase in the accuracy of scene classification that validates the effectiveness of feature fusion in our model. Furthermore, there is also an increase in the scene classification accuracy compared to ACNet over the AID and RESISC45 dataset, although a somewhat lower but comparable accuracy was observed on the UCM dataset.

4.3. Ablation Study

We presented various features including CNN, Haralick, spectral–spatial, and super-pixel patterns. Here, we discuss whether each of these features adds something new to the system, in order to determine if all of them are essential for the OCSC system. To answer this, we conducted experiments to validate the influence of feature fusion, using a greedy approach that incrementally added features to our system starting with the best one, i.e., the CNN features. Initially, we started experiments with CNN features only and achieved scene recognition accuracies of 91.37%, 91.88%, and 90.55% over the AID, UCM dataset, and NWPU-RESISC45 dataset, respectively. Then, we added super-pixel patterns (SPPs) and fused them with the CNN features, observing significantly enhanced performance from 91.37% to 92.69% for the AID, 91.88% to 93.19% for the UCM dataset, and 90.55% to 93.57% for the NWPU-RESISC45 dataset. The improved performance motivated us to further increase the number of fused features. Next, we conducted experiments with the addition of SSFs to the previously fused set of features. Fusing the SSFs with the already fused CNN and SPP features produced better results in terms of accuracy than the previously fused features, with an increase in recognition accuracy from 92.69% to 94.19%, 93.19% to 94.99%, and 93.57% to 95.25% over the AID, UCM dataset, and NWPU-RESISC45 dataset, respectively. Finally, we fused the remaining classical feature, the Haralick features, with the already fused set of features and performed experiments for object categorization and scene classification. Combining all the features produced the best recognition performance, with overall recognition accuracies of 97.73%, 98.75%, and 96.57% for the AID, UCM dataset, and NWPU-RESISC45 dataset, respectively. Figure 16 demonstrates the effectiveness of the features while incorporating a greedy approach for feature fusion over different benchmark datasets for scene classification.
It is clear from the results presented in Figure 16 that the fusion of CNN and classical features produced results comparable to those of CNN-based models. This was slightly different for the UCM dataset, where our approach had a lower but acceptable accuracy when the computational complexity of both techniques is considered. The well-known CNN models are computationally complex compared to FCM. The details of the computational time are given in Table 6. We tested these algorithms on an Intel system with 32 GB RAM and an Intel(R) Core(TM) i7-1065G7 CPU @ 1.30 GHz (1.50 GHz boost), along with an NVIDIA GeForce GPU. The proposed model required the least computational time for the segmentation of remote sensing images compared to CNN.

5. Discussion

The proposed OCSC was designed to achieve object categorization and scene recognition over remote sensing imagery. In this article, we developed a framework that uses FCM for the segmentation of RS images and MRF for labeling of the segmented images. The labeled images were then further analyzed for the extraction of features, including CNN features and classical features (Haralick features, spectral–spatial features, and super-pixel patterns). Here, the CNN features were extracted using a pretrained CNN model (VGG-16), while the classical features were extracted through machine learning techniques and mathematical formulation. These extracted features were then combined using a parallel fusion mechanism and optimized before being transmitted to MKL as input, where the various categories of objects were specified. Once the objects were categorized, the object-to-object relationships were determined, and a fully convolutional network was employed to classify the scenes.
Initially, the segmentation process is the fundamental module for properly classifying remote sensing imagery. Therefore, an effective mechanism of FCM segmentation was incorporated to achieve significant results for segmented regions from the complex high-resolution scene images. After obtaining the segmented regions, as a postprocessing step, an MRF was applied to obtain the labeled objects for further processing of feature extraction. During this labeling phase, the segmented regions were analyzed on the basis of the regions (connected, disconnected), and postprocessing was performed to more accurately isolate the boundaries of the regions segmented in the previous phase. These improved segmented regions were then labeled on the basis of a perceptual grouping mechanism, where each segmented region was assigned a unique label (color).
This complementary labeling module significantly enhanced the object categorization. We conducted experiments for both settings, i.e., employing only FCM for segmentation and additionally applying MRF for postprocessing and labeling of the segmented regions. When only FCM-based segmentation was performed, the object categorization on the benchmark datasets achieved lower accuracy; however, we saw an improvement when we added postprocessing and labeling of the objects using MRF before feature extraction. The performance in terms of object categorization accuracy was significantly increased; the details of these experimental results are presented in the ablation study section. Moreover, our approach of fusing the extracted CNN features and classical features had an impact on the recognition accuracy of the scenes, which led to overall enhanced scene classification. The effect of the different features on object categorization and scene recognition is illustrated in detail in the ablation study section.
We applied an ANN and a DBN for remote sensing scene classification and compared the results with those of the proposed FCN model. Although there were some comparable results in a few classes over the AID, we overall observed a significant improvement compared to the other well-known classifiers. A few classes, including BR and DR, showed better recall using the DBN, while PD had better precision using the DBN; however, overall results were excellent in all classes using the proposed model. Similarly, the mean values of precision, recall, and F1 score were highest when applying the FCN (proposed model).
A similar pattern was observed when we applied three different models over the RESISC45 dataset. We experienced a better precision value for BR and ST classes, while AT, BC, BH, FW, and OP classes had a better recall value compared to the proposed method. Nevertheless, the mean precision, recall, and F1 score were the highest amongst the three well-known classifiers.
While working with the OCSC model, despite the tremendous performance, we were also confronted with some limitations and constraints. Some tiny objects, such as people and animals, eluded our classification. Similarly, multiple vehicles were sometimes recognized as single vehicles when they were occluded by more than 50% in terms of pixels.

6. Conclusions

The proposed OCSC system was designed to achieve object categorization and scene classification over various complex aerial scene images and publicly available benchmark datasets. In this paper, we incorporated FCM followed by MRF to segment and label the aerial images from different remote sensing benchmark datasets. Furthermore, we analyzed these labeled images for extraction of classical and deep features. Moreover, these features were taken as input for object categorization by employing MKL. After the successful categorization of multiple objects present in the remote scene images, the inter-object relationships were computed to finally classify the scenes by applying FCN. The remarkable results of the proposed model show that it outperformed the SOTA remote sensing scene classification techniques.

Author Contributions

Conceptualization, A.A.R. and A.J.; methodology, A.A.R. and Y.Y.G.; software, A.A.R., S.A.A. and T.a.S.; validation, A.A.R., Y.Y.G. and J.P.; formal analysis, T.a.S., S.A.A. and J.P.; resources, Y.Y.G., T.a.S. and J.P.; writing—review and editing, A.A.R., T.a.S. and J.P.; funding acquisition, Y.Y.G., S.A.A. and J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant (2021R1F1A1063634) of the Basic Science Research Program through the National Research Foundation (NRF) funded by the Ministry of Education, Republic of Korea.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Galleguillos, C.; Belongie, S. Context-based object categorization: A critical survey. Comput. Vis. Image Underst. 2010, 114, 712–722.
  2. Wang, G.; Ye, J.C.; Mueller, K.; Fessler, J.A. Image reconstruction is a new frontier of machine learning. IEEE Trans. Med. Imaging 2018, 37, 1289–1296.
  3. Saha, S.; Bovolo, F.; Bruzzone, L. Unsupervised deep change vector analysis for multiple-change detection in VHR images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3677–3693.
  4. Srivastava, P.K.; Han, D.; Rico-Ramirez, M.A.; Bray, M.; Islam, T. Selection of classification techniques for land use/land cover change investigation. Adv. Space Res. 2012, 50, 1250–1265.
  5. Jalal, A.; Ahmed, A.; Rafique, A.A.; Kim, K. Scene Semantic Recognition Based on Modified Fuzzy C-Mean and Maximum Entropy Using Object-to-Object Relations. IEEE Access 2021, 9, 27758–27772.
  6. Manfreda, S.; McCabe, M.F.; Miller, P.E.; Lucas, R.; Pajuelo Madrigal, V.; Mallinis, G.; Dor, E.B.; Helman, D.; Estes, L.; Ciraolo, G.; et al. On the use of unmanned aerial systems for environmental monitoring. Remote Sens. 2018, 10, 641.
  7. Khan, M.A.; Sharif, M.; Akram, T.; Raza, M.; Saba, T.; Rehman, A. Hand-crafted and deep convolutional neural network features fusion and selection strategy: An application to intelligent human action recognition. Appl. Soft Comput. 2020, 87, 105986.
  8. Guo, H.; Liu, J.; Xiao, Z.; Xiao, L. Deep CNN-based hyperspectral image classification using discriminative multiple spatial-spectral feature fusion. Remote Sens. Lett. 2020, 11, 827–836.
  9. Liu, Y.; Liu, Y.; Ding, L. Scene classification based on two-stage deep feature fusion. IEEE Geosci. Remote Sens. Lett. 2017, 15, 183–186.
  10. Han, X.; Zhong, Y.; Cao, L.; Zhang, L. Pre-trained AlexNet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification. Remote Sens. 2017, 9, 848.
  11. Muhammad, U.; Wang, W.; Chattha, S.P.; Ali, S. Pre-trained VGGNet architecture for remote-sensing image scene classification. In Proceedings of the 2018 24th International Conference on Pattern Recognition, Beijing, China, 20–24 August 2018; pp. 1622–1627.
  12. Tang, P.; Wang, H.; Kwong, S. G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition. Neurocomputing 2017, 225, 188–197.
  13. Wang, M.; Zhang, X.; Niu, X.; Wang, F.; Zhang, X. Scene classification of high-resolution remotely sensed image based on ResNet. J. Geovisualization Spat. Anal. 2019, 3, 1–9.
  14. Grzeszick, R.; Plinge, A.; Fink, G.A. Bag-of-features methods for acoustic event detection and classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1242–1252.
  15. Martin, S. Sequential Bayesian inference models for multiple object classification. In Proceedings of the 14th International Conference on Information Fusion, Chicago, IL, USA, 5–8 July 2011; pp. 1–6.
  16. Bo, L.; Sminchisescu, C. Efficient match kernel between sets of features for visual recognition. Adv. Neural Inf. Process. Syst. 2009, 22, 135–143.
  17. Ahmed, A.; Jalal, A.; Kim, K. A novel statistical method for scene classification based on multi-object categorization and logistic regression. Sensors 2020, 20, 3871.
  18. Wong, S.C.; Stamatescu, V.; Gatt, A.; Kearney, D.; Lee, I.; McDonnell, M.D. Track everything: Limiting prior knowledge in online multi-object recognition. IEEE Trans. Image Process. 2017, 26, 4669–4683.
  19. Sumbul, G.; Cinbis, R.G.; Aksoy, S. Multisource region attention network for fine-grained object recognition in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4929–4937.
  20. Mizuno, K.; Terachi, Y.; Takagi, K.; Izumi, S.; Kawaguchi, H.; Yoshimoto, M. Architectural study of HOG feature extraction processor for real-time object detection. In Proceedings of the 2012 IEEE Workshop on Signal Processing Systems, Ann Arbor, MI, USA, 5–8 August 2012; pp. 197–202.
  21. Penatti, O.A.; Valle, E.; Torres, R.D.S. Comparative study of global color and texture descriptors for web image retrieval. J. Vis. Commun. Image Represent. 2012, 23, 359–380.
  22. Oliva, A.; Torralba, A. Building the gist of a scene: The role of global image features in recognition. Prog. Brain Res. 2006, 155, 23–36.
  23. Rashid, M.; Khan, M.A.; Sharif, M.; Raza, M.; Sarfraz, M.M.; Afza, F. Object detection and classification: A joint selection and fusion strategy of deep convolutional neural network and SIFT point features. Multimed. Tools Appl. 2019, 78, 15751–15777.
  24. Jalal, A.; Nadeem, A.; Bobasu, S. Human Body Parts Estimation and Detection for Physical Sports Movements. In Proceedings of the 2019 2nd International Conference on Communication, Computing and Digital Systems (C-CODE), Islamabad, Pakistan, 6–7 March 2019; pp. 104–109.
  25. Liu, B.D.; Meng, J.; Xie, W.Y.; Shao, S.; Li, Y.; Wang, Y. Weighted spatial pyramid matching collaborative representation for remote-sensing-image scene classification. Remote Sens. 2019, 11, 518.
  26. Perronnin, F.; Sánchez, J.; Mensink, T. Improving the Fisher kernel for large-scale image classification. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 143–156.
  27. Yu, J.; Zhu, C.; Zhang, J.; Huang, Q.; Tao, D. Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 661–674.
  28. Mandal, M.; Vipparthi, S.K. Scene independency matters: An empirical study of scene dependent and scene independent evaluation for CNN-based change detection. IEEE Trans. Intell. Transp. Syst. 2020, 23, 2031–2044.
  29. Studer, L.; Alberti, M.; Pondenkandath, V.; Goktepe, P.; Kolonko, T.; Fischer, A.; Liwicki, M.; Ingold, R. A comprehensive study of ImageNet pre-training for historical document image analysis. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 720–725.
  30. Leksut, J.T.; Zhao, J.; Itti, L. Learning visual variation for object recognition. Image Vis. Comput. 2020, 98, 103912.
  31. Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707.
  32. Li, F.; Feng, R.; Han, W.; Wang, L. High-resolution remote sensing image scene classification via key filter bank based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8077–8092.
  33. Deng, G. A generalized unsharp masking algorithm. IEEE Trans. Image Process. 2010, 20, 1249–1261.
  34. Kalist, V.; Ganesan, P.; Sathish, B.S.; Jenitha, J.M.M. Possiblistic-fuzzy C-means clustering approach for the segmentation of satellite images in HSL color space. Procedia Comput. Sci. 2015, 57, 49–56.
  35. Thitimajshima, P. A new modified fuzzy c-means algorithm for multispectral satellite images segmentation. In Proceedings of the IGARSS 2000 IEEE 2000 International Geoscience and Remote Sensing Symposium: Taking the Pulse of the Planet: The Role of Remote Sensing in Managing the Environment, Honolulu, HI, USA, 24–28 July 2000; pp. 1684–1686.
  36. Lai, K.; Bo, L.; Ren, X.; Fox, D. Detection-based object labeling in 3D scenes. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, St. Paul, MN, USA, 14–19 May 2012; pp. 1330–1337.
  37. Zheng, C.; Zhang, Y.; Wang, L. Semantic segmentation of remote sensing imagery using an object-based Markov random field model with auxiliary label fields. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3015–3028.
  38. Zhang, M.; Li, W.; Du, Q.; Gao, L.; Zhang, B. Feature extraction for classification of hyperspectral and LiDAR data using patch-to-patch CNN. IEEE Trans. Cybern. 2018, 50, 100–111.
  39. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  40. Patil, N.K.; Malemath, V.S.; Yadahalli, R.M. Color and texture based identification and classification of food grains using different color models and Haralick features. Int. J. Comput. Sci. Eng. 2011, 3, 3669.
  41. Aptoula, E.; Lefèvre, S. A comparative study on multivariate mathematical morphology. Pattern Recognit. 2007, 40, 2914–2929.
  42. Zhang, L.; Zhang, Q.; Du, B.; Huang, X.; Tang, Y.Y.; Tao, D. Simultaneous spectral-spatial feature selection and extraction for hyperspectral images. IEEE Trans. Cybern. 2016, 48, 16–28.
  43. Ghamisi, P.; Benediktsson, J.A.; Cavallaro, G.; Plaza, A. Automatic framework for spectral-spatial classification based on supervised feature extraction and morphological attribute profiles. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2147–2160.
  44. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282.
  45. Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral image classification with deep feature fusion network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3173–3184.
  46. Yang, R.; Wang, Y.; Xu, Y.; Qiu, L.; Li, Q. Pedestrian Detection under Parallel Feature Fusion Based on Choquet Integral. Symmetry 2021, 13, 250.
  47. Song, X.; Jiang, S.; Wang, B.; Chen, C.; Chen, G. Image representations with spatial object-to-object relations for RGB-D scene recognition. IEEE Trans. Image Process. 2019, 29, 525–537.
  48. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  49. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
  50. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
  51. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
  52. Kim, J.; Chi, M. SAFFNet: Self-Attention-Based Feature Fusion Network for Remote Sensing Few-Shot Scene Classification. Remote Sens. 2021, 13, 2532. [Google Scholar] [CrossRef]
  53. Xie, H.; Chen, Y.; Ghamisi, P. Remote sensing image scene classification via label augmentation and intra-class constraint. Remote Sens. 2021, 13, 2566. [Google Scholar] [CrossRef]
  54. Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention consistent network for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2030–2045. [Google Scholar] [CrossRef]
  55. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1155–1167. [Google Scholar] [CrossRef]
  56. Yu, Y.; Liu, F. A two-stream deep fusion framework for high-resolution aerial scene classification. Comput. Intell. Neurosci. 2018, 2018, 8639367. [Google Scholar] [CrossRef] [Green Version]
  57. Chaib, S.; Liu, H.; Gu, Y.; Yao, H. Deep feature fusion for VHR remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4775–4784. [Google Scholar] [CrossRef]
  58. Liu, X.; Zhou, Y.; Zhao, J.; Yao, R.; Liu, B.; Zheng, Y. Siamese convolutional neural networks for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1200–1204. [Google Scholar] [CrossRef]
  59. He, C.; Zhang, Q.; Qu, T.; Wang, D.; Liao, M. Remote sensing and texture image classification network based on deep learning integrated with binary coding and Sinkhorn distance. Remote Sens. 2019, 11, 2870. [Google Scholar] [CrossRef] [Green Version]
Figure 1. A schematic view of the proposed model over the AID.
Figure 2. Fuzzy C-means segmentation over some images from UCM Dataset: (row 1) the original images; (row 2) the segmented images.
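The fuzzy C-means step in Figure 2 can be reproduced with a few lines of NumPy. The sketch below is a minimal implementation under assumed settings (four clusters, fuzzifier m = 2, a fixed iteration budget), not the exact configuration used for the reported segmentations.

```python
import numpy as np

def fuzzy_c_means(pixels, n_clusters=4, m=2.0, n_iter=50, eps=1e-8, seed=0):
    """Minimal fuzzy C-means: pixels is an (N, D) array of colour vectors."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U (N x C), rows sum to 1.
    u = rng.random((pixels.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = u ** m
        # Update cluster centres as membership-weighted means.
        centers = (um.T @ pixels) / (um.sum(axis=0)[:, None] + eps)
        # Squared distances from every pixel to every centre.
        d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
        # Standard FCM membership update, then row-normalize.
        u = 1.0 / (d2 ** (1.0 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return u.argmax(axis=1), centers

# Example: given an (H, W, 3) float RGB array `image`:
# labels, _ = fuzzy_c_means(image.reshape(-1, 3), n_clusters=4)
# segmentation = labels.reshape(image.shape[:2])
```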
Figure 3. MRF labeling of segmented images over AID: (a) original image; (b) segmented image; (c) labeled via MRF.
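The MRF labeling illustrated in Figure 3 amounts to smoothing an initial label map with a spatial prior. As a rough illustration (not the exact model of the paper), the sketch below refines fuzzy C-means labels with a few iterated-conditional-modes (ICM) sweeps under a Potts prior; the `unary` costs, `beta`, and the sweep count are assumptions.

```python
import numpy as np

def icm_refine(labels, unary, beta=1.5, n_sweeps=3):
    """ICM refinement of a label map under a Potts MRF prior.

    labels : (H, W) int array of initial labels (e.g., from fuzzy C-means).
    unary  : (H, W, C) array of per-pixel costs for each of the C labels.
    """
    h, w, c = unary.shape
    lab = labels.copy()
    for _ in range(n_sweeps):
        for y in range(h):
            for x in range(w):
                # Labels of the 4-connected neighbours inside the image.
                neigh = [lab[ny, nx]
                         for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if 0 <= ny < h and 0 <= nx < w]
                # Potts penalty: count neighbours disagreeing with each candidate label.
                pairwise = np.array([sum(n != k for n in neigh) for k in range(c)])
                lab[y, x] = np.argmin(unary[y, x] + beta * pairwise)
    return lab
```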
Figure 4. CNN feature extraction using pretrained CNN.
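Deep features such as those in Figure 4 are usually taken from an ImageNet-pretrained backbone with its classification head removed. The sketch below shows one common way to do this with torchvision (assuming version ≥ 0.13); the choice of ResNet-50 and the preprocessing constants are standard defaults, not necessarily the exact network used here.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained backbone with the classification head removed.
weights = models.ResNet50_Weights.IMAGENET1K_V1
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()        # keep the 2048-D pooled feature
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_features(path):
    """Return a 2048-D CNN descriptor for one RGB image file."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).squeeze(0).numpy()
```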
Figure 5. Spectral–spatial feature representation: (a) original image (left) and corresponding SSF extraction (right) from UCM dataset; (b) original image (left) and corresponding SSF extraction (right) from AID.
Figure 6. Results of super-pixel patterns on some remote sensing images from the UCM dataset: (a) super-pixel patterns applied to UCM images; (b) homogeneous regions extracted from the super-pixel patterns.
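The super-pixel patterns in Figure 6 can be generated with SLIC [44]; a short scikit-image sketch follows, with the segment count and compactness chosen purely for illustration.

```python
from skimage import io, segmentation, color

image = io.imread("ucm_sample.png")                      # RGB remote sensing tile
superpixels = segmentation.slic(image, n_segments=200,   # SLIC over-segmentation
                                compactness=10, start_label=1)

# Replace each super-pixel with its mean colour to visualise homogeneous regions,
# and overlay the super-pixel boundaries on the original image.
homogeneous = color.label2rgb(superpixels, image, kind="avg", bg_label=0)
boundaries = segmentation.mark_boundaries(image, superpixels)
```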
Figure 7. Object categorization using MKL over an image from RESISC45 dataset.
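Multiple kernel learning, as used for the object categorization in Figure 7, combines several base kernels before SVM classification. The sketch below is a simplified stand-in that mixes two precomputed kernels with a fixed weight w instead of learning the weights; the feature splits and the gamma value are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.svm import SVC

def combined_kernel(Xa, Xb, Ya, Yb, w=0.5, gamma=1e-3):
    """Fixed-weight combination of two base kernels (a simplification of MKL)."""
    return w * rbf_kernel(Xa, Xb, gamma=gamma) + (1 - w) * linear_kernel(Ya, Yb)

# X_* : deep features, Y_* : hand-crafted features of the same objects.
# K_train = combined_kernel(X_train, X_train, Y_train, Y_train)
# K_test  = combined_kernel(X_test,  X_train, Y_test,  Y_train)
# clf = SVC(kernel="precomputed").fit(K_train, labels_train)
# predicted = clf.predict(K_test)
```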
Figure 8. Schematic view of OORs between object triplets present in the remote sensing imagery.
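To make the object-to-object relation (OOR) triplets of Figure 8 concrete, the sketch below derives a coarse (subject, relation, object) triplet from the centroids of two categorized objects; the four-way relation vocabulary is a hypothetical simplification of the relations analyzed in the paper.

```python
import math

def relation_triplet(name_a, centroid_a, name_b, centroid_b):
    """Return a triplet describing where object b lies relative to object a."""
    dx = centroid_b[0] - centroid_a[0]
    dy = centroid_b[1] - centroid_a[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360
    # Quantize the direction into four coarse relations (image y-axis points down).
    if 45 <= angle < 135:
        rel = "below"
    elif 135 <= angle < 225:
        rel = "left of"
    elif 225 <= angle < 315:
        rel = "above"
    else:
        rel = "right of"
    return (name_b, rel, name_a)

# Example: returns ("storage tank", "right of", "airplane")
# print(relation_triplet("airplane", (120, 80), "storage tank", (300, 85)))
```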
Figure 9. Scene recognition results over the RESISC45 dataset, obtained by applying the fully convolutional network to the categorized objects and their analyzed object-to-object relations.
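A fully convolutional classifier in the spirit of [48] replaces fully connected layers with 1 × 1 convolutions and global pooling, so inputs of any spatial size can be scored. The PyTorch sketch below is illustrative only; the layer widths are assumptions rather than the OCSC architecture.

```python
import torch
import torch.nn as nn

class FCNSceneHead(nn.Module):
    """Fully convolutional head: 1x1 convolutions + global average pooling,
    so the network accepts feature maps of any spatial size."""
    def __init__(self, in_channels=2048, n_classes=45):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Conv2d(in_channels, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, n_classes, kernel_size=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feature_map):            # (B, C, H, W)
        scores = self.classifier(feature_map)  # (B, n_classes, H, W)
        return self.pool(scores).flatten(1)    # (B, n_classes)

# logits = FCNSceneHead()(torch.randn(2, 2048, 7, 7))  # -> shape (2, 45)
```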
Figure 10. A few classes of the Aerial Image Dataset (AID) with rich texture features and diverse backgrounds.
Figure 11. A few class representatives of the NWPU-RESISC45 dataset.
Figure 12. A few classes of the UCM dataset.
Figure 13. The recognition accuracy of OCSC model over UCM dataset. AG = agricultural; AP = airplane; BD = baseball diamond; BH = beach; BG = building; CP = chaparral; DR = dense residential; FR = forest; FW = freeway; GC = golf course; HB = harbor; IS = intersection; MR = medium residential; MP = mobile home park; OP = overpass; PG = parking; RV = river; RW = runway; SR = sparse residential; ST = storage tank; TC = tennis court.
Figure 14. The recognition accuracy of OCSC model over AID. AP = airplane; BL = bare land; BB = baseball field; BH = beach; BR = bridge; CR = center; CH = church; CM = commercial; DR = dense residential; DS = desert; FL = farmland; FR = forest; IN = industrial; MW = meadow; MR = medium residential; MT = mountain; PK = park; PG = parking; PD = playground; PN = pond; PT = port; RS = railway station; RT = resort; RV = river; SL = school; SR = sparse residential; SQ = square; SM = stadium; ST = storage tank; VD = viaduct.
Figure 15. The recognition accuracy of OCSC model over RESISC45 dataset. AP = airplane; AT = airport; BB = baseball diamond; BC = basketball court; BH = beach; BR = bridge; CH = church; CP = chaparral; CF = circular farmland; CL = cloud; CM = commercial area; DR = dense residential; DS = desert; FR = forest; FW = freeway; GC = golf course; GT = ground track field; HR = harbor; ID = island; IN = industrial area; IS = intersection; LK = lake; MW = meadow; MR = medium residential; MK = mobile home park; MT = mountain; OP = overpass; PL = palace; PG = parking lot; RW = railway; RS = railway station; RF = rectangular farmland; RV = river; RD = roundabout; RW = runway; SI = sea ice; SP = ship; SB = snow berg; SR = sparse residential; SM = stadium; ST = storage tank; TC = tennis court; TR = terrace; TS = thermal power station; WT = wetland.
Figure 16. Recognition accuracies of OCSC model over three benchmark datasets using feature fusion under greedy approach.
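At its core, the feature-level fusion evaluated in Figure 16 reduces to normalizing the individual descriptors and concatenating them before classification; the sketch below shows that step only (the greedy selection of which descriptors to keep is omitted, and the feature names in the usage comment are placeholders).

```python
import numpy as np

def fuse_features(*feature_blocks, eps=1e-8):
    """L2-normalize each descriptor block and concatenate them into one vector."""
    normalized = [f / (np.linalg.norm(f) + eps) for f in feature_blocks]
    return np.concatenate(normalized)

# fused = fuse_features(cnn_feat, haralick_feat, spectral_spatial_feat)
```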
Table 1. Attribute values of different Haralick features for labeled objects compared with GT with errors on AID.

| Objects | Evaluation | Contrast | Energy | Entropy | Correlation |
| Tennis Court | GT | 121,334 | 0.2151 | 0.2053 | 0.8905 |
| | SG | 130,774 | 0.2108 | 0.2175 | 0.8917 |
| | ER | ±9440 | ±0.0043 | ±0.0122 | ±0.0012 |
| Ship | GT | 191,428 | 0.1919 | 0.4401 | 0.7926 |
| | SG | 191,854 | 0.1961 | 0.4458 | 0.7811 |
| | ER | ±426 | ±0.0042 | ±0.0057 | ±0.0115 |
| Soccer Field | GT | 169,883 | 0.7205 | 0.3933 | 0.4577 |
| | SG | 160,125 | 0.7163 | 0.3875 | 0.4612 |
| | ER | ±9758 | ±0.0042 | ±0.0058 | ±0.0035 |
| Vehicles | GT | 102,657 | 0.4229 | 0.3166 | 0.5926 |
| | SG | 108,941 | 0.4195 | 0.3192 | 0.5933 |
| | ER | ±6284 | ±0.0034 | ±0.0026 | ±0.0007 |

GT = ground truth; SG = segmented; ER = error.
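The Haralick attributes reported in Table 1 (contrast, energy, entropy, correlation) are derived from a gray-level co-occurrence matrix. The scikit-image sketch below (assuming skimage ≥ 0.19 for the `graycomatrix` spelling) is a generic illustration using a single distance and angle, not the paper's exact parameterization.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def haralick_features(gray_region):
    """Contrast, energy, entropy, and correlation of an 8-bit gray-level region."""
    glcm = graycomatrix(gray_region, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {
        "contrast": graycoprops(glcm, "contrast")[0, 0],
        "energy": graycoprops(glcm, "energy")[0, 0],
        "entropy": entropy,
        "correlation": graycoprops(glcm, "correlation")[0, 0],
    }
```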
Table 2. Scene classification results against three classifiers on AID.

| Classes | ANN Precision | ANN Recall | ANN F1 Score | DBN Precision | DBN Recall | DBN F1 Score | FCN (Ours) Precision | FCN (Ours) Recall | FCN (Ours) F1 Score |
| AP | 0.768 | 0.732 | 0.75 | 0.811 | 0.855 | 0.832 | 0.901 | 0.977 | 0.937 |
| BL | 0.883 | 0.765 | 0.82 | 0.754 | 0.815 | 0.783 | 0.965 | 0.965 | 0.965 |
| BB | 0.691 | 0.813 | 0.747 | 0.688 | 0.755 | 0.72 | 0.824 | 0.972 | 0.892 |
| BH | 0.724 | 0.798 | 0.759 | 0.617 | 0.845 | 0.713 | 0.977 | 0.911 | 0.943 |
| BR | 0.817 | 0.841 | 0.829 | 0.754 | 0.933 | 0.834 | 0.899 | 0.931 | 0.917 |
| CC | 0.677 | 0.875 | 0.763 | 0.725 | 0.841 | 0.779 | 0.891 | 0.889 | 0.89 |
| CM | 0.755 | 0.839 | 0.795 | 0.697 | 0.798 | 0.744 | 0.915 | 0.787 | 0.846 |
| DR | 0.695 | 0.759 | 0.726 | 0.711 | 0.899 | 0.794 | 0.872 | 0.854 | 0.911 |
| DT | 0.786 | 0.698 | 0.739 | 0.695 | 0.884 | 0.778 | 0.928 | 0.971 | 0.949 |
| FL | 0.695 | 0.764 | 0.728 | 0.654 | 0.815 | 0.726 | 0.971 | 0.892 | 0.93 |
| FR | 0.754 | 0.856 | 0.802 | 0.632 | 0.856 | 0.727 | 0.915 | 0.977 | 0.945 |
| IN | 0.655 | 0.813 | 0.725 | 0.719 | 0.796 | 0.756 | 0.811 | 0.892 | 0.85 |
| MW | 0.771 | 0.792 | 0.781 | 0.705 | 0.862 | 0.776 | 0.913 | 0.928 | 0.92 |
| MR | 0.798 | 0.733 | 0.764 | 0.733 | 0.784 | 0.758 | 0.986 | 0.966 | 0.976 |
| MN | 0.699 | 0.795 | 0.744 | 0.826 | 0.698 | 0.757 | 0.897 | 0.937 | 0.917 |
| PK | 0.784 | 0.875 | 0.827 | 0.798 | 0.814 | 0.806 | 0.912 | 0.901 | 0.906 |
| PG | 0.789 | 0.839 | 0.813 | 0.771 | 0.761 | 0.766 | 0.977 | 0.887 | 0.93 |
| PD | 0.681 | 0.821 | 0.744 | 0.811 | 0.886 | 0.847 | 0.799 | 0.916 | 0.854 |
| PN | 0.719 | 0.788 | 0.752 | 0.631 | 0.818 | 0.712 | 0.855 | 0.891 | 0.873 |
| RS | 0.725 | 0.811 | 0.766 | 0.801 | 0.875 | 0.836 | 0.925 | 0.917 | 0.921 |
| RT | 0.774 | 0.859 | 0.814 | 0.783 | 0.836 | 0.809 | 0.936 | 0.977 | 0.956 |
| RV | 0.615 | 0.694 | 0.652 | 0.697 | 0.825 | 0.756 | 0.871 | 0.871 | 0.871 |
| SL | 0.664 | 0.851 | 0.746 | 0.665 | 0.851 | 0.747 | 0.995 | 0.951 | 0.973 |
| SR | 0.776 | 0.785 | 0.78 | 0.709 | 0.898 | 0.792 | 0.956 | 0.879 | 0.916 |
| SQ | 0.764 | 0.809 | 0.786 | 0.722 | 0.835 | 0.774 | 0.891 | 0.903 | 0.897 |
| SM | 0.687 | 0.717 | 0.702 | 0.715 | 0.746 | 0.73 | 0.819 | 0.916 | 0.865 |
| ST | 0.694 | 0.839 | 0.76 | 0.812 | 0.816 | 0.814 | 0.977 | 0.921 | 0.948 |
| VT | 0.639 | 0.775 | 0.7 | 0.789 | 0.857 | 0.822 | 0.887 | 0.911 | 0.899 |
| AP | 0.715 | 0.795 | 0.753 | 0.745 | 0.877 | 0.806 | 0.973 | 0.935 | 0.954 |
| BL | 0.636 | 0.699 | 0.666 | 0.781 | 0.798 | 0.789 | 0.985 | 0.905 | 0.943 |
| Mean | 0.728 | 0.794 | 0.758 | 0.732 | 0.831 | 0.776 | 0.914 | 0.921 | 0.916 |
Table 3. Scene classification results against three classifiers on RESISC45 dataset.

| Classes | ANN Precision | ANN Recall | ANN F1 Score | DBN Precision | DBN Recall | DBN F1 Score | FCN (Ours) Precision | FCN (Ours) Recall | FCN (Ours) F1 Score |
| AP | 0.611 | 0.874 | 0.719 | 0.622 | 0.717 | 0.666 | 0.901 | 0.977 | 0.937 |
| AT | 0.637 | 0.769 | 0.697 | 0.783 | 0.865 | 0.822 | 0.899 | 0.845 | 0.871 |
| BD | 0.712 | 0.825 | 0.764 | 0.759 | 0.786 | 0.772 | 0.995 | 0.951 | 0.973 |
| BC | 0.698 | 0.813 | 0.751 | 0.768 | 0.937 | 0.844 | 0.986 | 0.915 | 0.949 |
| BG | 0.672 | 0.749 | 0.708 | 0.651 | 0.831 | 0.757 | 0.967 | 0.903 | 0.934 |
| BH | 0.655 | 0.875 | 0.749 | 0.748 | 0.875 | 0.807 | 0.844 | 0.869 | 0.859 |
| BR | 0.751 | 0.839 | 0.793 | 0.879 | 0.831 | 0.854 | 0.871 | 0.839 | 0.855 |
| CL | 0.697 | 0.781 | 0.737 | 0.728 | 0.729 | 0.728 | 0.872 | 0.921 | 0.896 |
| CH | 0.743 | 0.829 | 0.784 | 0.688 | 0.866 | 0.767 | 0.886 | 0.938 | 0.911 |
| CF | 0.779 | 0.787 | 0.783 | 0.825 | 0.781 | 0.802 | 0.985 | 0.954 | 0.969 |
| CD | 0.702 | 0.854 | 0.771 | 0.716 | 0.698 | 0.707 | 0.901 | 0.977 | 0.937 |
| CA | 0.699 | 0.772 | 0.734 | 0.803 | 0.865 | 0.833 | 0.883 | 0.965 | 0.922 |
| DR | 0.785 | 0.801 | 0.793 | 0.776 | 0.758 | 0.767 | 0.995 | 0.951 | 0.973 |
| DT | 0.734 | 0.791 | 0.761 | 0.868 | 0.801 | 0.833 | 0.986 | 0.937 | 0.961 |
| FT | 0.709 | 0.767 | 0.737 | 0.689 | 0.774 | 0.729 | 0.967 | 0.903 | 0.934 |
| FW | 0.664 | 0.775 | 0.715 | 0.791 | 0.875 | 0.831 | 0.844 | 0.861 | 0.852 |
| GC | 0.637 | 0.739 | 0.684 | 0.711 | 0.839 | 0.770 | 0.977 | 0.839 | 0.903 |
| GT | 0.649 | 0.812 | 0.721 | 0.782 | 0.881 | 0.829 | 0.872 | 0.921 | 0.896 |
| HR | 0.711 | 0.738 | 0.724 | 0.686 | 0.787 | 0.733 | 0.886 | 0.938 | 0.911 |
| IA | 0.753 | 0.813 | 0.782 | 0.658 | 0.824 | 0.732 | 0.985 | 0.954 | 0.969 |
| IN | 0.668 | 0.745 | 0.704 | 0.854 | 0.761 | 0.805 | 0.901 | 0.977 | 0.937 |
| ID | 0.622 | 0.851 | 0.719 | 0.783 | 0.824 | 0.803 | 0.883 | 0.965 | 0.922 |
| LK | 0.677 | 0.751 | 0.712 | 0.852 | 0.699 | 0.768 | 0.995 | 0.951 | 0.973 |
| MD | 0.711 | 0.825 | 0.764 | 0.755 | 0.785 | 0.770 | 0.986 | 0.937 | 0.961 |
| MR | 0.787 | 0.694 | 0.738 | 0.677 | 0.823 | 0.743 | 0.967 | 0.903 | 0.934 |
| MH | 0.689 | 0.785 | 0.734 | 0.711 | 0.785 | 0.746 | 0.844 | 0.875 | 0.859 |
| MN | 0.791 | 0.839 | 0.814 | 0.816 | 0.819 | 0.817 | 0.977 | 0.839 | 0.903 |
| OP | 0.698 | 0.789 | 0.741 | 0.794 | 0.895 | 0.841 | 0.872 | 0.851 | 0.861 |
| PC | 0.655 | 0.818 | 0.727 | 0.729 | 0.852 | 0.786 | 0.886 | 0.938 | 0.911 |
| Pk | 0.785 | 0.755 | 0.770 | 0.688 | 0.815 | 0.746 | 0.985 | 0.954 | 0.969 |
| RL | 0.709 | 0.745 | 0.727 | 0.645 | 0.758 | 0.697 | 0.967 | 0.903 | 0.934 |
| RS | 0.615 | 0.698 | 0.654 | 0.731 | 0.775 | 0.752 | 0.844 | 0.875 | 0.859 |
| RF | 0.822 | 0.739 | 0.778 | 0.779 | 0.881 | 0.827 | 0.977 | 0.839 | 0.903 |
| RV | 0.746 | 0.699 | 0.722 | 0.745 | 0.721 | 0.733 | 0.872 | 0.921 | 0.896 |
| RT | 0.699 | 0.811 | 0.751 | 0.654 | 0.819 | 0.727 | 0.886 | 0.938 | 0.911 |
| RN | 0.775 | 0.782 | 0.778 | 0.697 | 0.754 | 0.724 | 0.985 | 0.954 | 0.969 |
| SI | 0.716 | 0.757 | 0.736 | 0.725 | 0.688 | 0.706 | 0.901 | 0.977 | 0.937 |
| SH | 0.883 | 0.765 | 0.82 | 0.811 | 0.669 | 0.733 | 0.883 | 0.965 | 0.922 |
| SB | 0.788 | 0.801 | 0.794 | 0.735 | 0.715 | 0.725 | 0.995 | 0.951 | 0.973 |
| SR | 0.811 | 0.735 | 0.771 | 0.824 | 0.689 | 0.750 | 0.986 | 0.937 | 0.961 |
| SD | 0.699 | 0.619 | 0.657 | 0.755 | 0.745 | 0.751 | 0.967 | 0.903 | 0.934 |
| ST | 0.754 | 0.785 | 0.769 | 0.846 | 0.778 | 0.811 | 0.844 | 0.875 | 0.859 |
| TC | 0.768 | 0.838 | 0.801 | 0.661 | 0.829 | 0.736 | 0.977 | 0.839 | 0.903 |
| TR | 0.872 | 0.721 | 0.789 | 0.693 | 0.818 | 0.75 | 0.872 | 0.921 | 0.896 |
| TP | 0.689 | 0.738 | 0.713 | 0.778 | 0.736 | 0.756 | 0.886 | 0.938 | 0.911 |
| WD | 0.661 | 0.654 | 0.657 | 0.688 | 0.744 | 0.715 | 0.985 | 0.954 | 0.969 |
| Mean | 0.735 | 0.794 | 0.761 | 0.763 | 0.813 | 0.784 | 0.947 | 0.939 | 0.942 |
Table 4. Scene classification results against three classifiers on UCM dataset.

| Classes | ANN Precision | ANN Recall | ANN F1 Score | DBN Precision | DBN Recall | DBN F1 Score | FCN (Ours) Precision | FCN (Ours) Recall | FCN (Ours) F1 Score |
| AG | 0.837 | 0.658 | 0.737 | 0.699 | 0.899 | 0.787 | 0.967 | 0.903 | 0.934 |
| AP | 0.755 | 0.788 | 0.811 | 0.815 | 0.875 | 0.844 | 0.844 | 0.875 | 0.860 |
| BD | 0.792 | 0.753 | 0.815 | 0.741 | 0.819 | 0.787 | 0.977 | 0.839 | 0.903 |
| BH | 0.873 | 0.707 | 0.781 | 0.784 | 0.944 | 0.857 | 0.872 | 0.921 | 0.896 |
| BG | 0.799 | 0.791 | 0.795 | 0.785 | 0.859 | 0.821 | 0.886 | 0.938 | 0.912 |
| CP | 0.701 | 0.825 | 0.758 | 0.667 | 0.769 | 0.715 | 0.985 | 0.954 | 0.970 |
| DR | 0.766 | 0.811 | 0.788 | 0.719 | 0.688 | 0.704 | 0.901 | 0.977 | 0.938 |
| FR | 0.783 | 0.711 | 0.746 | 0.883 | 0.965 | 0.923 | 0.809 | 0.951 | 0.881 |
| FW | 0.699 | 0.764 | 0.731 | 0.792 | 0.881 | 0.835 | 0.995 | 0.951 | 0.973 |
| GC | 0.715 | 0.795 | 0.753 | 0.763 | 0.896 | 0.825 | 0.986 | 0.937 | 0.961 |
| HB | 0.855 | 0.801 | 0.828 | 0.648 | 0.821 | 0.725 | 0.967 | 0.903 | 0.934 |
| IS | 0.785 | 0.815 | 0.828 | 0.791 | 0.798 | 0.795 | 0.844 | 0.875 | 0.860 |
| MR | 0.821 | 0.802 | 0.83 | 0.737 | 0.809 | 0.772 | 0.977 | 0.839 | 0.903 |
| MP | 0.787 | 0.655 | 0.715 | 0.783 | 0.898 | 0.837 | 0.872 | 0.921 | 0.896 |
| OP | 0.845 | 0.669 | 0.747 | 0.897 | 0.762 | 0.825 | 0.886 | 0.938 | 0.912 |
| PG | 0.769 | 0.759 | 0.764 | 0.799 | 0.711 | 0.753 | 0.985 | 0.954 | 0.970 |
| RV | 0.811 | 0.661 | 0.729 | 0.675 | 0.855 | 0.755 | 0.967 | 0.903 | 0.934 |
| RW | 0.845 | 0.716 | 0.776 | 0.795 | 0.789 | 0.79 | 0.844 | 0.875 | 0.860 |
| SR | 0.775 | 0.797 | 0.806 | 0.819 | 0.773 | 0.796 | 0.977 | 0.839 | 0.903 |
| ST | 0.771 | 0.824 | 0.797 | 0.719 | 0.898 | 0.799 | 0.872 | 0.921 | 0.896 |
| TC | 0.786 | 0.891 | 0.836 | 0.801 | 0.795 | 0.798 | 0.886 | 0.938 | 0.912 |
| Mean | 0.789 | 0.777 | 0.783 | 0.768 | 0.835 | 0.801 | 0.919 | 0.913 | 0.916 |
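The per-class precision, recall, and F1 scores in Tables 2–4 follow the standard definitions; a short scikit-learn sketch for producing such a report from predicted scene labels is given below.

```python
from sklearn.metrics import precision_recall_fscore_support

def per_class_report(y_true, y_pred, class_names):
    """Per-class precision, recall, and F1, plus their unweighted means."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(len(class_names))), zero_division=0)
    for name, pi, ri, fi in zip(class_names, p, r, f1):
        print(f"{name:>4s}  P={pi:.3f}  R={ri:.3f}  F1={fi:.3f}")
    print(f"Mean  P={p.mean():.3f}  R={r.mean():.3f}  F1={f1.mean():.3f}")
```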
Table 5. Comparison of scene classification accuracies of SOTA methods with the proposed OCSC model.

| Author/Method | AID Dataset | UCM Dataset | RESISC45 Dataset |
| SAFFNet [52] | 86.91 ± 0.44 | 86.79 ± 0.33 | 81.32 ± 0.62 |
| ResNet18 + LA + KL [53] | 96.52 | 99.21 | 95.26 |
| DBSNet [59] | 92.93 | 97.90 | -- |
| CaffeNet [49] | 89.53 ± 0.31 | 95.02 ± 0.81 | -- |
| GoogLeNet [49] | 86.39 ± 0.55 | 94.31 ± 0.89 | -- |
| VGG-VD-16 [49] | 89.64 ± 0.36 | 95.21 ± 1.20 | -- |
| Deep Fusion [56] | 94.58 | 98.02 | -- |
| Fusion by Addition [57] | 91.87 | 97.42 | -- |
| Siamese ResNet50 [58] | -- | 94.29 | 95.95 |
| Proposed | 97.73 | 98.75 | 96.57 |

Values are mean accuracy (%).
Table 6. Computation time comparison of proposed segmentation technique with CNN over benchmark datasets.

| Algorithm/Method | Dataset | FCM | FCM + MRF | CNN |
| Average computation time (s) | UCM | 57.7 × 21 = 1211.7 | 85.1 × 21 = 1787.1 | 86.9 × 21 = 1824.9 |
| | AID | 61.5 × 30 = 1845.0 | 87.9 × 30 = 2637.0 | 88.5 × 30 = 2655.0 |
| | RESISC45 | 67.1 × 45 = 3019.5 | 91.5 × 45 = 4117.5 | 92.8 × 45 = 4176.0 |