1. Introduction
Fine-grained visual classification (FGVC) refers to the task of distinguishing subcategories within the same basic-level class. Fine-grained classification differs from traditional classification: the former must model large intra-class variance, while the latter relies on clear inter-class differences. Examples of naturally occurring fine-grained classes include birds [1,2], dogs [3], flowers [4], vegetables [5], and plants [6], while human-made categories include airplanes [7], cars [8], and food [9]. Fine-grained classification is helpful in numerous computer vision and image processing applications, such as image captioning [10], machine teaching [11], and instance segmentation [12].
Fine-grained visual classification is challenging because the differences between species of the same class are minute and subtle, e.g., a crow and a raven, whereas in traditional classification the difference between classes is quite visible, e.g., a lion and an elephant. Fine-grained visual classification of species or objects of any category is a Herculean task for human beings and usually requires extensive domain knowledge to identify the species or objects correctly.
As mentioned earlier, fine-grained classification in image space must cope with high intra-class and low inter-class variance. We provide a few sample images from the dog and bird datasets in Figure 1 to highlight the problem's difficulty. The examples in the figure show images with the same viewpoint, and the colors are also roughly similar. Although the visual variation between them is minimal, all of these belong to different dog and bird categories. In Figure 2, we provide more examples from the same categories. Here, the differences in viewpoint and color are prominent. The visual variation is more significant than in the images in Figure 1, yet these belong to the same class.
Many approaches have been proposed to tackle the problem of fine-grained classification. Earlier works converged on part detection to model the intra-class variations. Later, algorithms exploited three-dimensional representations to handle multiple poses and viewpoints and achieve state-of-the-art results. Recently, with the advent of CNNs, most methods have exploited the modeling capacity of CNNs, either as a component or as a whole.
This paper investigates the capability of traditional CNN networks compared to specially designed fine-grained CNN classifiers. We strive to answer whether current general CNN classifiers can achieve performance comparable to fine-grained ones. To show competitiveness, we employ several fine-grained datasets and report top-1 accuracy for both classifier types. These experiments give general classifiers their proper place in fine-grained performance charts and serve as baselines for future comparisons on FGVC problems.
Our Contributions: We make the following contributions in this article.
We present an overview of Fine-Grained Visual Classification (FGVC) CNN algorithms, which leverage deep learning for nuanced object classification within closely related categories.
We provide a comprehensive review of state-of-the-art traditional classification algorithms, highlighting their limitations in addressing the subtleties of fine-grained distinctions.
We systematically compare, investigate, and evaluate traditional classifiers against FGVC methods across six diverse fine-grained datasets, providing insights into their respective performances across varying complexities and offering a valuable resource for benchmarking and further exploration.
We further provide a forward-looking perspective suggesting a future direction for FGVC algorithms by exploring the integration of traditional classifiers as the backbone.
This paper is organized as follows. Section 2 presents related work on fine-grained classification networks. Section 3 introduces the traditional state-of-the-art algorithms, which are compared against fine-grained classifiers. Section 4 describes the experimental settings and the datasets used for evaluation. Section 5 offers a comparative evaluation of the traditional and fine-grained classifiers. Finally, Section 7 concludes the paper. Trained models and code are available at https://github.com/saeed-anwar/FGSE (accessed on 20 October 2023).
2. Fine-Grained Classifiers
Fine-grained visual classification is an important and well-studied problem. It aims to differentiate between subclasses of the same category, in contrast to the traditional classification problem, where discriminative features are learned to distinguish between classes. Some of the challenges in this domain are the following: (i) the categories are highly correlated, i.e., small differences and small inter-category variance make it difficult to discriminate between subcategories; (ii) similarly, the intra-category variation can be significant due to different viewpoints and poses. Many algorithms, such as [13,14,15,16,17,18,19], have been presented to achieve the desired results. In this section, we highlight the recent approaches. FGVC research can be divided into the following main branches, reviewed in the paragraphs below.
Part-Based FGVC Algorithms. Part-based algorithms rely on the distinguishing features of object parts to improve the accuracy of visual recognition; these include [20,21,22,23,24,25]. Such FGVC methods [26,27] aim to learn the distinct features present in different parts of the object, e.g., the differences in the beak and tail of bird species. Similarly, part-based approaches normalize the variation caused by poses and viewpoints. Many works [1,28,29] assume the availability of bounding boxes at the object level and the part level in all images during both training and testing. To achieve higher accuracy, Refs. [22,30,31] employed both object-level and part-level annotations. These assumptions restrict the applicability of the algorithms to larger datasets. A reasonable alternative setting would be the availability of a bounding box around the object of interest only. Recently, Ref. [21] applied simultaneous segmentation and detection to enhance the performance of segmentation and object part localization. Similarly, a supervised method was proposed [16] that locates the training images most similar to a test image using k-nearest neighbors (KNN); the object part locations from the selected training images are then regressed to the test image.
Bounding Box-Based Methods. Succeeding supervised methods take advantage of object-level and object-part-level annotations during the training phase while requiring no annotations during the testing phase. One such approach is furnished in [32], where only object-level annotations are given during training, and no supervision is provided at the object part level. Similarly, the Spatial Transformer Network (STCNN) [33] handles data representation and outputs the locations of vital regions. Furthermore, recent approaches have focused on removing the limitations of previous works, aiming for conditions where information about object part locations is required in neither the training nor the testing phase. Such FGVC methods are suitable for large-scale deployment and help advance research in this direction.
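To make the mechanism concrete, the following is a minimal PyTorch sketch of a spatial transformer in the spirit of [33], not the exact STCNN: a small localization network predicts the parameters of an affine transform, which warps the input so that subsequent layers see an attended region. All layer sizes and the 224 × 224 input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal spatial-transformer sketch: a localization network predicts
    an affine transform, applied to the input via grid sampling."""
    def __init__(self):
        super().__init__()
        # Tiny localization network (illustrative sizes, not from [33]).
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc = nn.Linear(10 * 52 * 52, 6)  # assumes 224x224 inputs
        # Start at the identity transform so training begins stably.
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

x = torch.randn(2, 3, 224, 224)
print(SpatialTransformer()(x).shape)  # torch.Size([2, 3, 224, 224])
```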
Attention Models. Recently, attention-based algorithms have been employed in FGVC to focus on distinguishing parts via an attention mechanism. Using attention, Ref. [25] presented two attention models to learn appropriate patches for a particular object and determine the discriminative object parts using a deep CNN. The fundamental idea is to cluster the last CNN feature maps into groups; the object patches and object parts are then obtained from the activations of these clustered feature maps. Ref. [25] requires the model to be trained on the category of interest, while we only require a generally trained CNN. Similarly, DTRAM [34] learns to end the attention process for each image after a fixed number of steps. Several methods have been proposed to take advantage of object parts; the most popular is the deformable part model (DPM) [35], which learns the part constellation relative to the bounding box with Support Vector Machines (SVMs). Ref. [36] improved upon [37] and employed DPM to localize the parts using the constellation provided by DPM [35]. The Navigator–Teacher–Scrutinizer Network (NTSNet) [38] uses informative regions in images without employing any annotations. Another recently proposed teacher–student network is the Trilinear Attention Sampling Network (TASN) [39], composed of a trilinear attention module, an attention-based sampler, and a feature distiller.
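The clustering idea of [25] can be illustrated with a short sketch: group the channels of the last convolutional feature map by their spatial activation patterns and average each group into a coarse part-attention map. The snippet below is a simplified illustration of this general idea, not the exact pipeline of [25]; the ResNet-18 backbone and the choice of four clusters are assumptions.

```python
import torch
from sklearn.cluster import KMeans
from torchvision.models import resnet18

# Pretrained backbone truncated before global pooling (ResNet-18 is an assumption).
cnn = resnet18(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(cnn.children())[:-2]).eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))[0]  # (512, 7, 7)

# Cluster channels by their spatial activation pattern: channels that fire
# on the same region (e.g., a beak or a tail) land in the same group.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(feats.flatten(1).numpy())

# Average each cluster's channels into one coarse part-attention map.
part_maps = torch.stack(
    [feats[torch.from_numpy(labels == k)].mean(0) for k in range(4)]
)
print(part_maps.shape)  # torch.Size([4, 7, 7])
```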
No Bounding Box Methods. Contrary to utilizing bounding box annotations, current state-of-the-art methods for fine-grained visual categorization avoid incorporating bounding boxes during the testing and training phases altogether. Refs. [24,40] used two-stage networks for object and object part detection and classification, employing R-CNN and Bilinear CNN, respectively. Part-Stacked CNN [18] adopts the same two-stage strategy as [24,40]; however, the difference lies in stacking the object parts at the end for classification. Ref. [41] proposed the multi-scale RACNN to acquire distinguishing attention and region feature representations. Moreover, HIHCA [42] incorporated higher-order hierarchical convolutional activations via a kernel scheme.
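At its core, the bilinear pooling used by Bilinear CNN [40] is an outer product of two feature maps averaged over all spatial locations, followed by the customary signed square root and L2 normalization. The PyTorch sketch below shows this computation; the feature map shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
    """Pool two feature maps (B, C, H, W) into a (B, Ca*Cb) bilinear vector."""
    b, ca, h, w = fa.shape
    fa = fa.flatten(2)                                 # (B, Ca, H*W)
    fb = fb.flatten(2)                                 # (B, Cb, H*W)
    x = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)    # pooled outer product
    x = x.flatten(1)                                   # (B, Ca*Cb)
    x = torch.sign(x) * torch.sqrt(x.abs() + 1e-10)    # signed square root
    return F.normalize(x)                              # L2 normalization

fa = torch.randn(2, 64, 14, 14)     # stream A features (illustrative shape)
fb = torch.randn(2, 64, 14, 14)     # stream B features (illustrative shape)
print(bilinear_pool(fa, fb).shape)  # torch.Size([2, 4096])
```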
Distance Metric Learning Methods. An alternative to part-based algorithms is distance metric learning, which aims to cluster data points of the same category together while pushing those of different categories away from each other. Ref. [43] trained Siamese networks using deep metrics for signature verification and, in this context, set the trend in this direction. Recently, Ref. [44] employed a multi-stage framework that accepts pre-computed feature maps and learns a distance metric for classification. The pre-computed features can be extracted with DeCAF [45], as these features are discriminative and usable for classification in many tasks. Ref. [46] employs pairwise confusion (PC) with traditional classifiers.
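A minimal example of the objective behind such metric learning is the contrastive loss: embeddings of same-class pairs are pulled together, while different-class pairs are pushed beyond a margin. The sketch below is a generic PyTorch illustration rather than the exact formulation of [43] or [44]; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(za, zb, same_class, margin: float = 1.0):
    """za, zb: (B, D) embeddings of an image pair from a shared encoder;
    same_class: (B,) float tensor, 1.0 for same-class pairs, 0.0 otherwise."""
    d = F.pairwise_distance(za, zb)
    # Pull same-class pairs together; push different-class pairs apart
    # until they are at least `margin` away from each other.
    loss = same_class * d.pow(2) + (1 - same_class) * F.relu(margin - d).pow(2)
    return loss.mean()

za, zb = torch.randn(4, 128), torch.randn(4, 128)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(za, zb, y))
```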
Feature Representation-Based Methods. These methods utilize features from CNNs to capture global information. Many works, including [24,25,32,47], utilized the feature representations of a CNN and employed them in tasks such as object detection [48], understanding [49], and recognition [50]. A CNN captures global information directly, unlike traditional descriptors, which capture local information and require manual engineering to encode a global representation. Destruction and Construction Learning (DCL) [51] takes advantage of a standard classification network and emphasizes discriminative local details; the model then reconstructs the semantic correlation among local regions. Ref. [49] illustrated the reconstruction of the original image from the activations of the fifth max-pooling layer. Max-pooling ensures invariance to small-scale translation and rotation; however, encoding global spatial information might achieve robustness to larger-scale deformations. Ref. [52] combined the features from fully connected layers using VLAD pooling to capture global information. Similarly, Ref. [53] pooled features from convolutional layers instead of fully connected layers for text recognition, based on the idea that convolutional layers are transferable and not domain-specific. Following in the footsteps of [52,53], PDFR [17] encoded the CNN filter responses employing a picking strategy via a combination of Fisher Vectors. However, considering feature encoding as an isolated element is not an optimal choice for convolutional neural networks.
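To make the encoding step concrete, the sketch below shows a minimal VLAD aggregation over local convolutional descriptors: each descriptor is assigned to its nearest codebook centre, and the residuals are summed per centre. This is a generic illustration of the idea behind [52,53], not their exact pipelines; the codebook size and the toy descriptors are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(local_feats: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Aggregate local descriptors (N, D) into a VLAD vector (K*D,):
    per-centre sums of residuals, followed by L2 normalization."""
    assign = codebook.predict(local_feats)
    k, d = codebook.cluster_centers_.shape
    vlad = np.zeros((k, d), dtype=np.float32)
    for i in range(k):
        members = local_feats[assign == i]
        if len(members):
            vlad[i] = (members - codebook.cluster_centers_[i]).sum(0)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Toy data: 49 "convolutional column" descriptors of dimension 64.
rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 64)).astype(np.float32)
codebook = KMeans(n_clusters=8, n_init=10).fit(feats)
print(vlad_encode(feats, codebook).shape)  # (512,)
```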
Feature Integration Algorithms. Recently, feature integration methods have combined features from different layers of the same CNN model. This technique is becoming popular and has been adopted by several approaches. The intuition behind feature integration is to take advantage of the global semantic information captured by fully connected layers and the instance-level information preserved by convolutional layers [54]. Ref. [55] merged the features from intermediate and high-level convolutional activations in their convolutional network to exploit low-level details and high-level semantics for image segmentation. Similarly, for localization and segmentation, Ref. [56] concatenated the feature maps of convolutional layers at each pixel into a vector to form a descriptor. Likewise, for edge detection, Ref. [57] added several feature maps from the lower convolutional layers to guide the CNN and predict edges at different scales.
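The pixel-wise concatenation of [56], often called a hypercolumn, can be sketched in a few lines: feature maps from several depths are upsampled to a common resolution and stacked along the channel axis, so every pixel receives a descriptor mixing low-level detail and high-level semantics. The tapped layers and backbone below are illustrative assumptions, not the configuration of [56].

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Tap three depths of a pretrained network (layer choice is an assumption).
extractor = create_feature_extractor(
    resnet18(weights="IMAGENET1K_V1"),
    return_nodes=["layer1", "layer2", "layer4"],
).eval()

with torch.no_grad():
    feats = extractor(torch.randn(1, 3, 224, 224))

# Upsample every map to a common resolution and concatenate per pixel,
# giving each location a descriptor spanning detail and semantics.
maps = [F.interpolate(f, size=(56, 56), mode="bilinear", align_corners=False)
        for f in feats.values()]
hypercolumns = torch.cat(maps, dim=1)
print(hypercolumns.shape)  # torch.Size([1, 704, 56, 56])
```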