Article

HCAM-CL: A Novel Method Integrating a Hierarchical Cross-Attention Mechanism with CNN-LSTM for Hierarchical Image Classification

School of Mathematics and Computer, Guangdong Ocean University, Zhanjiang 524088, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work and share first authorship.
Symmetry 2024, 16(9), 1231; https://doi.org/10.3390/sym16091231
Submission received: 1 August 2024 / Revised: 11 September 2024 / Accepted: 13 September 2024 / Published: 19 September 2024
(This article belongs to the Section Computer)

Abstract
Deep learning networks have yielded promising insights in the field of image classification. However, the hierarchical image classification (HIC) task, which involves assigning multiple, hierarchically organized labels to each image, presents a notable challenge. In response to this complexity, we developed a novel framework (HCAM-CL), which integrates a hierarchical cross-attention mechanism with a CNN-LSTM architecture for the HIC task. The HCAM-CL model effectively identifies the relevance between images and their corresponding labels while also being attuned to learning the hierarchical inter-dependencies among labels. Our versatile model is designed to manage both fixed-length and variable-length classification pathways within the hierarchy. In the HCAM-CL model, the CNN module is responsible for the essential task of extracting image features. The hierarchical cross-attention mechanism vertically aligns these features with hierarchical levels, uniformly weighing the importance of different spatial regions. Ultimately, the LSTM module is strategically utilized to generate predictive outcomes by treating HIC as a sequence generation challenge. Extensive experimental evaluations on CIFAR-10, CIFAR-100, and design patent image datasets demonstrate that our HCAM-CL framework consistently outperforms other state-of-the-art methods in hierarchical image classification.

1. Introduction

Hierarchical image classification (HIC) [1,2,3] has become an essential tool across a diverse array of applications [4,5,6], enabling the systematic organization of labels within a layered framework. For instance, in film image annotation, an image categorized as ‘comedy’ may be further refined by more specific descriptors such as ‘romantic comedy’ or ‘satirical comedy’. In the domain of design patent image sorting, an item tagged as ‘jacket’ might also be included in the broader classification of ‘clothing’. This mirrors the hierarchical taxonomies found in nature, where cats and dogs are grouped under ‘mammals’ and apples and pears under ‘fruits’, illustrating a structured hierarchy. The research landscape, invigorated by breakthroughs in deep learning techniques [7,8] and the comprehensive ImageNet dataset [9], has witnessed a burgeoning interest in image classification [10] and segmentation [11] in recent years. However, the hierarchical nature of labels in real-world images, with their complex interrelationships, introduces a level of sophistication and challenge that goes beyond the simplicity of conventional image classification, which typically associates each object with a single label.
Early attempts to tackle hierarchical image classification (HIC) problems relied on the use of Convolutional Neural Networks (CNNs) [3,12,13,14]. While these approaches were groundbreaking, they were primarily limited to examining the correlations between images and labels and did not fully account for the structured relationships between the labels themselves. To fill this gap, recent work has introduced unified frameworks that integrate a CNN with a Recurrent Neural Network (RNN) to decode label associations [15]. Despite their improved ability to model label relevance, CNN-RNN approaches require datasets with fixed hierarchical levels, which hinders their adaptability to real-world situations. In [16], an improved CNN-LSTM model was proposed to handle the HIC task with variable-length paths, modeling a more accurate relationship between images and labels. However, these methods were based on the global features of images and the global correlation between labels, which limited the feature description ability and the understanding of the latent dependencies between labels.
With the burgeoning success of attention mechanisms across various domains [17,18], interest has been growing in leveraging these techniques to enhance HIC models. Notably, attention-based HIC frameworks [19,20,21] have been employed to better interpret complex scenes. However, despite their advancements, they still fall short in adequately capturing the hierarchical structure of label correlations.
To address the limitation of low accuracy in current recognition methods and the constraint imposed by fixed-length paths for HIC, this paper presents a novel and adaptable end-to-end CNN-LSTM framework equipped with a hierarchical cross-attention mechanism (HCAM) called HCAM-CL. The HCAM-CL framework is adept at performing hierarchical image classification tasks that involve both fixed- and variable-length categorizations. The model integrates a CNN for global image feature extraction and introduces a hierarchical cross-attention module specifically designed to navigate the complexities between image features and their hierarchical labels, generating joint local feature maps that capture this intricate relationship. An LSTM module further processes this information, decoding the image into an ordered sequence of labels, each paired with its corresponding local feature map. Significant advancements introduced by our HCAM-CL framework include the following:
  • We introduce HCAM-CL as an intuitive, end-to-end solution for HIC, particularly for instances with variable-length hierarchies, by conceptualizing HIC as a sequential generation task.
  • The incorporation of an enhanced hierarchical cross-attention mechanism provides superior differentiation of label dependencies within their hierarchical structures.
  • An empirical evaluation of hierarchical benchmarks, including CIFAR-10, CIFAR-100, and a design patent image dataset, demonstrates the superior capabilities of our model in tackling various HIC challenges.
The paper is organized as follows: Section 2 reviews the relevant literature in the field. Section 3 presents a comprehensive overview of the HCAM-CL model. In Section 4, we present our experimental results and provide a detailed analysis. The paper is concluded in Section 5, where we summarize our findings and contributions.

2. Related Work

With the success of deep learning in image classification tasks, HIC has attracted increasing attention from researchers. This section briefly reviews recent work on HIC, focusing on deep learning-based models.
Traditional image classification models have mostly focused on feature learning for hierarchical classification. Zhao et al. [1] used recursive regularization on different level nodes and learned a sparse matrix to perform hierarchical classification. Lima et al. [2] combined filter and wrapper approaches in a hybrid feature selection algorithm for HIC. However, most of these methods had difficulty in modeling higher-order relationships and were also computationally expensive. More recently, many studies have investigated CNN-based HIC, exploiting the ability of CNNs to capture discriminative features. Yan et al. [12] introduced the HD-CNN model, which separates classification into coarse and fine levels using designated classifiers. Lin et al. [14] applied a bilinear CNN model for granular image classification by fusing features extracted from an outer product of two CNN streams into comprehensive image descriptors. In addition, Zhu et al. [13] developed a Branch CNN (B-CNN) tailored for HIC. Seo et al. [22] used a knowledge-embedded classifier to obtain hierarchical results from the Fashion-MNIST dataset with H-CNN. Zhang et al. [23] fused CNNs with multi-task learning via HB-CNN, enriching both coarse and fine category features. Taoufiq et al. [24] implemented HierarchyNet alongside a novel multiplicative layer, using early layer signals to refine predictions for urban building datasets. Noor et al. [25] proposed a capsule network, ML-CapsNet, for hierarchical image classification, which uses capsules to model the semantic relationships among image features and adds a reconstruction loss to improve the consistency of predictions across the hierarchy. Despite their merits, these methods tend to overlook crucial hierarchical relationships between labels.
In addition, the integration of hierarchical information into loss functions has been explored as an effective strategy for improving classification models [26]. He et al. [27] developed a novel orthogonal loss function explicitly designed to quantify the interrelationships between coarse and fine categorizations within hierarchical frameworks. Further innovating this idea, He et al. [28] proposed a triplet loss function that skillfully separates parent categories while simultaneously coalescing child samples, thereby refining the hierarchical learning process. Kuang et al. [29,30] fused CNN architectures with ontology principles to support the learning of more distinctive coarse-level features within hierarchical classifications. In line with technological advances, knowledge distillation techniques have recently been applied to hierarchical models [31]. One such innovation is a hierarchical CNN model supported by an ontology concept, which has shown great promise in fashion classification and recommendation systems. However, its application has primarily been limited to datasets with predetermined hierarchical levels, thereby limiting its usefulness in diverse and dynamic real-world scenarios.
Studies on hybrid CNN-RNN models have been conducted to address the challenges of hierarchical image classification, leveraging the potent feature extraction capabilities of CNNs alongside the sequential label generation prowess of RNNs. Guo et al. [15] implemented a model that utilized CNNs to distill image features, followed by an RNN structured to process the nuances of class hierarchy. They further enhanced the model by integrating a residual module within the RNN, which augmented its generalization capabilities. Koo et al. [16], embracing the synergy between CNNs and LSTMs, constructed two distinct models for handling HIC tasks, one tailored to fixed-length paths and the other to variable-length paths. In this setup, the CNN component focused on extracting image feature data, while the LSTM was tasked with encoding hierarchical labels, both coarse and fine, as sequences to fine-tune classification outcomes. Despite the marked improvements in HIC efficiency via CNN-RNN collaborations, these methods predominantly extract features from the entire image scope. This approach often leads to an influx of superfluous information, which is then indiscriminately factored into the training process of the hierarchical classification models.
A number of studies have introduced innovations that utilize attention mechanisms to refine the performance of hierarchical classifiers. Chen et al. [19] leveraged such mechanisms to infuse semantic label information into HIC. Their approach employs the prediction of a parent label as prior knowledge to intuitively guide the prediction of its child labels. This methodology achieved state-of-the-art (SOTA) results on the Caltech-UCSD Birds dataset, which features a variable-length hierarchy. Simultaneously, Chen et al. [20] enriched a CNN model with a visual attention module, enabling the initial layer to acquire knowledge that informs the feature learning process of the subsequent layer, a technique specifically aimed at addressing the challenges of long-tailed hierarchical recognition tasks. Similarly, Pizarro et al. [21] developed a B-CNN model supported by an attention mechanism to facilitate hierarchical image classification. This mechanism strategically orchestrates the interaction between branches in a dynamic, data-driven manner, thereby enhancing the model’s ability to accurately identify local class members and the hierarchical connections between concepts.
However, these models mainly focused on modeling the correlation between images and labels while paying limited attention to the intricate dependencies between the labels themselves. In response to this, we present an HCAM-CL method designed to improve the accuracy of hierarchical multi-label image classification. A comprehensive description of our proposed method is given in Section 3.

3. Methodology

3.1. Problem Statement

Given a set of hierarchical multi-label images $I$ and the corresponding labels $Y$, HIC aims to learn a mapping $\Phi: I \rightarrow Y$ that maps each input image $x$ to its label sequence $y = \{y_1, y_2, \ldots, y_C\}$. For each image $x$, the label at level $t$ satisfies $y_t \in \{Q_k \mid 0 \le k \le N\}$, $1 \le t \le C$, where $C$ is the maximum number of hierarchy levels over all samples, $Q_k$ is the $k$th label in the label space of the dataset, and $N$ is the number of labels in the dataset. When the number of hierarchy levels $c$ of the labels corresponding to an input image is lower than the maximum $C$, the labels after the $c$th position are set to NULL.
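To make this padding convention concrete, the following minimal sketch (with hypothetical label names and an assumed maximum depth C = 3) pads variable-length label paths with the NULL token so that every sample has exactly C positions:

```python
# Minimal sketch: padding variable-length hierarchical label paths to a fixed
# depth C with a 'NULL' token. Label names and C = 3 are illustrative assumptions.
C = 3  # maximum number of hierarchy levels across all samples

def pad_label_path(path, max_depth=C, pad_token="NULL"):
    """Return the label path extended with pad_token up to max_depth."""
    return path + [pad_token] * (max_depth - len(path))

# A three-level path is kept as-is; a two-level path gains one NULL position.
print(pad_label_path(["clothes", "jacket", "long jacket"]))  # ['clothes', 'jacket', 'long jacket']
print(pad_label_path(["clothes", "scarf"]))                  # ['clothes', 'scarf', 'NULL']
```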

3.2. The Framework of Our Model

We propose a novel HIC model, HCAM-CL, which consists of three deep learning modules: CNN, the hierarchical cross-attention module, and LSTM. The CNN module serves as the encoder of the framework, which is responsible for extracting high-dimensional features from the input image, and the hierarchical cross-attention module employs a hierarchical multi-head attention mechanism to model relationships and dependencies between images and hierarchical labels. This mechanism symmetrically aligns features with hierarchical levels, uniformly weighing the importance of different spatial regions. The LSTM module serves as the decoder, which aims to decode the fused feature information to the hidden output with label semantics by taking multiple labels of an image as a text sequence. The architecture of the proposed HCAM-CL model is illustrated in Figure 1.

3.2.1. Image Feature Extraction

The goal of HIC is to recognize the corresponding labels for an input image. Therefore, the quality of the extracted features is vital for the performance of the proposed framework. A CNN module, specifically the Inception v3 [32] network, is adopted to extract image features in the proposed HCAM-CL. The Inception v3 network uses convolutional kernels of different sizes to merge features at different scales, which enhances the non-linearity and cross-scale correlation of the image features and effectively alleviates the bottleneck in feature representation.
$$X = \Phi_{\mathrm{Inc}}(x, w)$$
where $x \in \mathbb{R}^{256 \times 256}$ and $y = \{y_1, y_2, \ldots, y_C\}$ represent the input image and the real labels, respectively, and $C$ is the maximum number of label levels in the dataset. $X \in \mathbb{R}^{H \times W \times L}$ represents the output feature map of the feature extraction network, where $H$ and $W$ are the height and width of each channel and $L$ is the number of channels. $w$ is the weight parameter to be learned by the feature extraction network, and $\Phi_{\mathrm{Inc}}(\cdot)$ is the feature extraction function of the Inception v3 network.
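For concreteness, the sketch below shows one way to obtain such a spatial feature map with the Keras implementation of Inception v3. The choice of the 'mixed7' intermediate layer as the feature source is our assumption (the text only specifies Inception v3 and a 12 × 12 × 768 map), and the exact spatial size depends on the input resolution:

```python
import tensorflow as tf

# Sketch: extracting a spatial feature map X of shape (H, W, L) from an input
# image with Inception v3. Using the intermediate 'mixed7' layer (768 channels)
# is an assumption, not the authors' documented choice.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                         input_shape=(256, 256, 3))
feature_extractor = tf.keras.Model(inputs=base.input,
                                   outputs=base.get_layer("mixed7").output)

image = tf.random.uniform((1, 256, 256, 3))   # placeholder input image x
X = feature_extractor(image)                  # spatial feature map (1, H, W, 768)
X = tf.reshape(X, (1, -1, X.shape[-1]))       # flatten the H*W regions for attention
print(X.shape)                                # (1, H*W, 768)
```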

3.2.2. Hierarchical Cross-Attention Mechanism

The challenge of hierarchical image classification (HIC) significantly exceeds the complexity of conventional image classification because the hierarchical relationships between labels must be recognized directly from images. To deal with this complexity, our approach integrates the multi-head attention mechanism with the cross-attention mechanism, both of which have proven effective in general image classification. To further elucidate the correlation between images and their hierarchical label structures, we developed an advanced hierarchical multi-head attention mechanism called the hierarchical cross-attention mechanism. It symmetrically associates image features with hierarchical levels, taking equal account of the importance of different spatial regions. This mechanism captures essential high-level semantic information about both the image and its corresponding labels and identifies the spatial dependencies between labels, thereby enriching the representation of critical information. Central to this mechanism are two key modules: the first multi-head attention layer, which examines the mapping relationship between the global features of the image and its labels, and the second layer, which explores the association between the image features and the aggregated label information. The architecture of our proposed hierarchical cross-attention model, depicting these two layers, is shown in Figure 1.
In the first layer of our hierarchical cross-attention mechanism, the image feature map $X \in \mathbb{R}^{H \times W \times L}$ acts as both the keys ($K$) and values ($V$). Concurrently, the output of the LSTM module at the previous time step $t-1$, representing the current label $w_t$ at time step $t$, serves as the query ($Q$). This design allows each attention head to independently assess the correlation between the global image feature and the present label by generating an image-label attention map that elucidates their relationship from a distinct perspective. For each head $i$, linear transformation matrices $W_{1i}^{Q}$, $W_{1i}^{K}$, and $W_{1i}^{V}$ map $w_t$ and $X$ to the corresponding query, key, and value: $Q_{1i} = W_{1i}^{Q} w_t$, $K_{1i} = W_{1i}^{K} X$, $V_{1i} = W_{1i}^{V} X$. The attention score of each head is computed with the scaled dot-product attention mechanism, followed by scaling and normalization to determine the weight distribution:
$$\mathrm{Attention}_i^{1}(Q_{1i}, K_{1i}, V_{1i}) = \mathrm{softmax}\!\left(\frac{Q_{1i} K_{1i}^{T}}{\sqrt{d_{1k}}}\right) V_{1i}$$
Subsequently, the outputs from all heads are concatenated and passed through another linear transformation matrix $W_1^{O}$ to generate the attention output. This output is reintegrated with the initial input $w_t$ via a residual connection and then undergoes layer normalization, a step that helps to mitigate overfitting and enhances the stability of the model:
$$Z_{t,x} = \mathrm{LayerNorm}\!\left(W_1^{O}\,\mathrm{Concat}\!\left(\mathrm{Attention}_1^{1}, \ldots, \mathrm{Attention}_h^{1}\right) + w_t\right)$$
In the second layer of the hierarchical cross-attention mechanism, we use the output $Z_{t,x}$ of the first layer as the query ($Q$), while the original image features $X$ are reused as the key ($K$) and value ($V$), to further investigate the mapping relationship between image features and hierarchical labels at a deeper level. For each attention head $i$, these inputs pass through a new set of linear transformation matrices, $W_{2i}^{Q}$, $W_{2i}^{K}$, and $W_{2i}^{V}$, and the attention scores of each head are computed, scaled, and normalized to determine the weight distribution:
$$\mathrm{Attention}_i^{2}(Q_{2i}, K_{2i}, V_{2i}) = \mathrm{softmax}\!\left(\frac{Q_{2i} K_{2i}^{T}}{\sqrt{d_{2k}}}\right) V_{2i}$$
where $Q_{2i} = W_{2i}^{Q} Z_{t,x}$, $K_{2i} = W_{2i}^{K} X$, and $V_{2i} = W_{2i}^{V} X$ are the transformations enabling a deeper linkage between the current analytical focus and the hierarchical label context. As in the first layer, the attention outputs of all heads are concatenated and transformed through a linear matrix $W_2^{O}$, producing the layer's integrated output:
$$H_{t,x} = W_2^{O}\,\mathrm{Concat}\!\left(\mathrm{Attention}_1^{2}, \ldots, \mathrm{Attention}_h^{2}\right)$$
Here, $h$ denotes the total number of attention heads, and $H_{t,x}$ is the image feature representation obtained after comprehensively considering the relationships with the hierarchical labels, providing an information-rich fused feature for the subsequent hierarchical classification.
The culmination of the hierarchical cross-attention mechanism is the output of the second multi-head attention layer, $H_{t,x}$. This output is combined with its input $Z_{t,x}$ via a residual connection, and layer normalization is then applied to produce the final output of the hierarchical cross-attention mechanism, denoted $v_t$:
$$v_t = \mathrm{LayerNorm}(H_{t,x} + Z_{t,x})$$
The output offers a comprehensive feature representation that thoroughly considers both the content of the image and the intrinsic relationships of hierarchical labels, which is pivotal for the task of hierarchical image classification. By examining the complex interplay between image content and hierarchical labels, our model is adept at capturing subtle, fine-grained hierarchical relationships. This ability is crucial for enhancing classification accuracy as it ensures that the distinctions and connections between labels across different hierarchy levels are accurately recognized and utilized.
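To make the two-layer structure concrete, the sketch below expresses the mechanism with standard Keras building blocks. The head count, dimensions, and variable names are assumptions for illustration rather than the authors' exact implementation:

```python
import tensorflow as tf

class HierarchicalCrossAttention(tf.keras.layers.Layer):
    """Sketch of the two-layer cross-attention: label query attends over image regions."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # First layer: current label embedding w_t attends over image regions X.
        self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim // num_heads)
        self.norm1 = tf.keras.layers.LayerNormalization()
        # Second layer: the fused representation Z_{t,x} attends over X again.
        self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim // num_heads)
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, w_t, X):
        # w_t: (batch, 1, dim) current label embedding; X: (batch, H*W, dim) image regions.
        z = self.norm1(self.mha1(query=w_t, value=X, key=X) + w_t)  # Z_{t,x}
        h = self.mha2(query=z, value=X, key=X)                      # H_{t,x}
        return self.norm2(h + z)                                    # v_t

# Usage with placeholder shapes (assumed): 144 regions of 768-d features.
hca = HierarchicalCrossAttention()
v_t = hca(tf.random.uniform((2, 1, 768)), tf.random.uniform((2, 144, 768)))
print(v_t.shape)  # (2, 1, 768)
```

Note that the Keras MultiHeadAttention layer already includes the per-head projections and the output projection, which play the roles of the $W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$ matrices above.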

3.2.3. Hierarchical Classification

After obtaining the attention information, we leverage the LSTM module to conduct hierarchical classification. The LSTM module, based on the hierarchical-attention mechanism, can be written as follows:
$$(h_t, c_t) = \mathrm{LSTM}_{att}(v_t, w_t, h_{t-1}, c_{t-1})$$
where $\mathrm{LSTM}_{att}(\cdot)$ denotes a single LSTM unit; $h_t$ is the hidden state at time step $t$; $c_t$ is the output of the memory gate at time step $t$; $h_{t-1}$ and $c_{t-1}$ are the hidden state and memory state of the LSTM, respectively, at the previous time step; $w_t$ is the current label information; and $v_t$ is the attention information corresponding to the current label.
Next, by combining the previous hidden state $h_{t-1}$, the attention information $v_t$, and the previous output $w_t$, we employ two fully connected layers and the softmax function to obtain the confidence $\tilde{y}_t$ for the next label. Finally, cross-entropy is employed as the loss function. They are formulated as follows:
$$\tilde{y}_t = \mathrm{softmax}\big(f(h_{t-1}, v_t, w_t)\big)$$
$$L = -\frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{N} y_{ik}^{t} \log \tilde{y}_{ik}^{t}$$
where $f(\cdot)$ represents the two fully connected layers; $y_{ik}^{t}$ is the ground truth of the label at time step $t$; $\tilde{y}_{ik}^{t}$ is the predicted value of the label at time step $t$; $M$ is the total number of samples; and $N$ is the total number of labels in the dataset.
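A minimal sketch of one decoding step and the loss is given below; the 768-dimensional state matches the implementation details in Section 4.1, while the layer sizes, the ReLU activation, and the use of LSTMCell are our assumptions:

```python
import tensorflow as tf

# Sketch of one LSTM decoding step with the two-layer classification head.
NUM_LABELS = 19   # assumed label-vocabulary size (dataset dependent)
DIM = 768

lstm_cell = tf.keras.layers.LSTMCell(DIM)
fc1 = tf.keras.layers.Dense(DIM, activation="relu")   # first fully connected layer
fc2 = tf.keras.layers.Dense(NUM_LABELS)               # second fully connected layer (logits)

def decode_step(v_t, w_t, h_prev, c_prev):
    """One LSTM step fusing the attention output v_t with the current label w_t."""
    output, (h_t, c_t) = lstm_cell(tf.concat([v_t, w_t], axis=-1), states=[h_prev, c_prev])
    # Confidence for the next label from h_{t-1}, v_t, and w_t.
    y_tilde = tf.nn.softmax(fc2(fc1(tf.concat([h_prev, v_t, w_t], axis=-1))))
    return y_tilde, h_t, c_t

# Example usage with placeholder tensors; cross-entropy over one-hot targets.
h0 = c0 = tf.zeros((2, DIM))
y1, h1, c1 = decode_step(tf.random.uniform((2, DIM)), tf.random.uniform((2, DIM)), h0, c0)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
```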

4. Experimental Results

To demonstrate the effectiveness of our approach, we conducted extensive experiments on the CIFAR-10, CIFAR-100, and design patent image datasets, which are structured for hierarchical classification. The empirical results underline the superior performance of our methodology. The following sections are dedicated to the description of the experimental datasets, the settings used, and the evaluation metrics employed. This is followed by a thorough analysis of our proposed method relative to competing approaches, complemented by insights from ablation studies.

4.1. Experimental Dataset and Settings

To evaluate the hierarchical image classification (HIC) capabilities of our proposed method, we utilized the CIFAR-10, CIFAR-100, and design patent image datasets. The CIFAR-10 dataset consists of 60,000 images, divided into 50,000 for training and 10,000 for testing. For the purposes of our experiment, CIFAR-10 was manually restructured to represent ten different hierarchical categories with a total of 18 different labels. The hierarchical structure created for CIFAR-10 is shown in Figure 2. Similar to CIFAR-10, the design patent image dataset presents images, with each one containing a single target object associated with multiple labels. These labels are linked hierarchically, reflecting the structured relationship inherent in the objects depicted in the images. For example, an image of a jacket will have the labels ‘clothes’ (top level), ‘jacket’ (second level), and ‘long jacket’ (lowest level). Our experiments used a subset of the design patent image dataset, consisting of 63,600 training images and 19,500 validation images, categorized into 19 labels across 12 hierarchical classes. The labels associated with each patent image category are shown in Figure 3. As can be seen from Figure 2 and Figure 3, both datasets exhibit non-uniform hierarchical structures, with some categories exhibiting two-level hierarchies and others exhibiting three-level hierarchies.
In contrast to the CIFAR-10 and design patent image datasets, the CIFAR-100 dataset used in the experiment consists of a predefined two-layer hierarchical structure with 20 major classes and 100 sub-classes. Each sub-class in the training set is represented by 500 images, while the validation set contains 100 images per sub-class.
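For illustration, the predefined two-level CIFAR-100 hierarchy can be represented as a mapping from superclasses to sub-classes; the sketch below shows a single, well-known superclass (the remaining 19 follow the same pattern):

```python
# Sketch: representing the CIFAR-100 coarse-to-fine hierarchy as a dictionary.
# Only one of the 20 superclasses is shown; label paths are (coarse, fine) pairs.
cifar100_hierarchy = {
    "flowers": ["orchid", "poppy", "rose", "sunflower", "tulip"],
}

# Expand the hierarchy into per-class label paths of length 2.
label_paths = [(coarse, fine)
               for coarse, fine_labels in cifar100_hierarchy.items()
               for fine in fine_labels]
print(label_paths[:2])  # [('flowers', 'orchid'), ('flowers', 'poppy')]
```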
To implement our proposed model, we used the TensorFlow framework programmed in Python to train and test the models. The Inception v3 architecture was chosen for the CNN module to extract visual features from images, resulting in a feature vector with dimensions of 12 × 12 × 768. Correspondingly, the LSTM module was configured with a hidden state and a memory unit, both with dimensions of 768. To streamline the prediction process, a special start label, ‘START’, was introduced at the beginning of each label sequence. All labels underwent Word2Vec embedding processing to ensure uniform sequence sizes.
To circumvent the traditional RNN limitation of fixed-length input labels, our model incorporates a padding label, ‘NULL’, to align input sequences to a uniform dimension, which it then transforms back into the original variable-length sequence. The model’s initial learning rate was set to 0.00025 with a learning rate decay factor of 0.94. The entire training process took 100 epochs.
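The stated optimization settings can be reproduced roughly as follows; the decay interval (here, once per epoch) and the choice of Adam are assumptions not specified in the text:

```python
import tensorflow as tf

# Sketch of the stated training schedule: initial learning rate 0.00025 with a
# decay factor of 0.94, trained for 100 epochs.
STEPS_PER_EPOCH = 1000  # placeholder; depends on batch size and dataset split

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.00025,
    decay_steps=STEPS_PER_EPOCH,   # assumed: decay applied once per epoch
    decay_rate=0.94,
    staircase=True,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)  # optimizer choice is an assumption
```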

4.2. Evaluation Metric and Baseline Models

The effectiveness of our model was assessed using two types of accuracy: fine accuracy and coarse accuracy. For fine accuracy, a prediction was considered correct if the entire predicted label sequence exactly matched all labels in the ground truth. For coarse accuracy, a prediction was considered correct if it matched the ground truth at the corresponding hierarchical level, even if it did not match the entire label path. We also used the F1 score of the predicted labels as an additional evaluation metric, which represents the harmonic mean of precision and recall.
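These two notions can be made precise with a short sketch; the list-based label-path representation and the handling of NULL padding are our assumptions, and the per-level coarse accuracy is intended to correspond to the Coarse1/Coarse2 columns in the result tables:

```python
# Sketch of the evaluation metrics on predicted vs. ground-truth label paths.
def fine_accuracy(preds, targets):
    """Fraction of samples whose full predicted path matches the ground truth."""
    exact = sum(p == t for p, t in zip(preds, targets))
    return exact / len(targets)

def coarse_accuracy(preds, targets, level):
    """Fraction of samples whose prediction matches the ground truth at `level`."""
    hits = sum(len(p) > level and len(t) > level and p[level] == t[level]
               for p, t in zip(preds, targets))
    return hits / len(targets)

preds   = [["clothes", "jacket", "long jacket"], ["clothes", "scarf", "NULL"]]
targets = [["clothes", "jacket", "long jacket"], ["clothes", "jacket", "NULL"]]
print(fine_accuracy(preds, targets))       # 0.5
print(coarse_accuracy(preds, targets, 0))  # 1.0 (both correct at the top level)
```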
We compared our model with a number of established baseline models on the CIFAR-10, CIFAR-100, and design patent image datasets, as detailed in Table 1, Table 2 and Table 3. The comparative analysis included a number of successful CNNs and encoder-decoder models known for their performance on HIC tasks: B-CNN [13], H-CNN [22], BA-CNN [21], CNN-RNN [15], CNN-LSTM [16], and ML-CapsNet [25].

4.3. Results and Analysis

In our initial evaluation, we assessed the effectiveness of our proposed HCAM-CL model against six baseline methods using the CIFAR-10 dataset. As indicated in Table 1, our model demonstrated superior performance in managing both variable-length paths (as addressed by the CNN-RNN and CNN-LSTM models) and fixed-length paths (as handled by the B-CNN, H-CNN, ML-CapsNet, and BA-CNN models). Specifically, on the CIFAR-10 dataset, the HCAM-CL model improved fine accuracy by 11.05%, 9.33%, 0.44%, 0.97%, 3.34%, and 0.26%; coarse2 accuracy by 10.05%, 7.91%, 2.42%, 2.72%, 3.64%, and 0.61%; and coarse1 accuracy by 3.05%, 2.08%, 2.90%, 2.78%, 1.30%, and 0.46%, respectively, compared to the other baselines. These findings underscore the efficacy of the hierarchical cross-attention mechanism within our HCAM-CL model for the HIC task.
Furthermore, in terms of multi-label classification performance, our HCAM-CL model showed excellent performance in all relevant evaluation metrics with an F1 score of 0.9345. Compared to the baseline models, our proposed approach improved the F1 score by margins of 7.58%, 5.99%, 0.61%, 2.01%, 2.88%, and 0.30%, respectively. Although it can be seen in Table 1 and Table 2 that the accuracy of the BA-CNN model is not significantly different from that of our model, our model offers greater flexibility in handling variable-length hierarchical image classification tasks. The BA-CNN model is primarily designed for hierarchical image classification with fixed-length labels, which has some limitations in practical applications. For example, when the length of the labels corresponding to the input images is unpredictable, our model is better equipped to handle these variations.
Similarly, we examined the performance of our HCAM-CL model on a subset of the patent image dataset and compared the results with those of previously established best-performing HIC models, as shown in Table 2. In the comparison, our model showed superiority over the other baseline models on all evaluation metrics. It is noteworthy that on the design patent image dataset, our HCAM-CL model offered even more substantial performance improvements when measured against the two variable-length hierarchical models, CNN-RNN and CNN-LSTM, with remarkable accuracy improvements of 28.52% and 12.93%, respectively, at the fine-grained level. This demonstrates the robustness of our HCAM-CL model, which surpasses the generalization capabilities of the baseline methods, particularly highlighted by the weaker performance of CNN-RNN on datasets such as the design patent image dataset with its prevalence of lower-quality images. These results suggest that the cross-attention mechanism contributes to improving the performance by enabling the model to more accurately identify the relationships between images and labels.
In Figure 4, we present a comparative visualization of the fine-grained accuracy achieved by our HCAM-CL model versus other models on both the CIFAR-10 and design patent image datasets. As shown, our model outperforms the competing models on all metric levels for these datasets. Additionally, the model performs even better on the design patent image dataset than on the CIFAR-10 dataset, suggesting that the hierarchical cross-attention mechanism effectively enhances the model’s generalization performance.
Regarding the CIFAR-100 dataset, as shown in Table 3, all methods show slightly reduced accuracy at the fine-grained level compared to the CIFAR-10 and patent image datasets. Nevertheless, our model maintains a notable advantage, outperforming the second-best model by 2.03%. These results demonstrate that our proposed HCAM-CL model significantly enhances the learning of inter-label dependencies, enabling it to predict more coherent label associations compared to the baseline models. Therefore, the model’s superior performance on datasets like the CIFAR-10, CIFAR-100, and design patent image datasets underscores its well-balanced and symmetric capability in handling diverse and complex classification tasks.

4.4. Ablations

To determine the contribution of the cross-attention component within our hierarchical cross-attention mechanism model for hierarchical image classification, we conducted ablation studies, focusing on specific elements of the model, using the patent image dataset.
The designed ablation studies aimed to dissect the influence and value of each component within HCAM-CL. The first study involved the removal of the hierarchical cross-attention mechanism. We replaced the specialized attention module in HCAM-CL with a simple feature fusion approach to determine the extent to which the attention mechanism enhances the overall performance of the model.
We also investigated the effect of reducing the complexity of the attention mechanism. Specifically, we reduced the number of cross-attention layers from two to one, thereby training and evaluating the model with a single-layer cross-attention framework to determine its impact on model effectiveness.
The results of the ablation studies presented in Table 4 led us to the following key findings:
  • Removing the attention mechanism caused the accuracy of the model on the patent image dataset to drop from 82.98% to 70.05%. This significant drop underlines the central role of the cross-attention mechanism in capturing the intricate correlations between images and their associated labels.
  • Reducing the number of cross-attention levels resulted in a drop in model accuracy on the design patent image dataset from 82.98% to 79.54%. This decrease suggests that a multi-layer cross-attention framework enhances the model’s competence in sequence generation tasks.
The results of the ablation study clearly demonstrate that the cross-attention components of HCAM-CL are crucial for enhancing the model’s performance. Additionally, increasing the number of layers in the cross-attention mechanism further contributes to the model’s effectiveness. These findings not only confirm the significance of the key components in HCAM-CL but also provide strategic insights for future refinements of the model.

5. Conclusions

In this study, we introduce a novel HCAM-CL method that integrates a hierarchical cross-attention model with a CNN-LSTM architecture. This method is specifically designed for hierarchical image classification (HIC) tasks, effectively handling both fixed-length and variable-length label paths. Our approach employs a hierarchical cross-attention mechanism to capture the intricate relationships between images and their labels, thereby mapping complex dependencies among labels. This is followed by an LSTM module that decodes the label hierarchy, significantly enhancing both the model’s accuracy and generalization capabilities. Extensive experiments on the CIFAR-10, CIFAR-100, and design patent image datasets confirmed that our approach outperforms alternative baseline models in the field of hierarchical image classification.

Author Contributions

Data Curation, J.L. and J.Z.; Formal Analysis, J.L. and Y.L.; Funding Acquisition, J.S. and Y.L.; Investigation, J.Z. and Y.L.; Methodology, J.S. and J.L.; Project Administration, J.S.; Resources, J.S. and Y.L.; Software, J.L. and J.S.; Supervision, J.S. and Y.L.; Validation, J.S., J.L., Y.L. and J.Z.; Visualization, J.L.; Writing—Original Draft, J.S., J.L. and Y.L.; Writing—Review and Editing, J.S., J.L., Y.L. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a special grant from the program for scientific research start-up funds of Guangdong Ocean University under Grant No. 060302102303, the Industry-University-Research Innovation Fund Project of the Science and Technology Development Center of the Ministry of Education under Grant No. 2020QT13, the Ministry of Education’s Industry-University-Research Collaborative Education Project under Grant No. 239920011, and the National College Students Innovation and Entrepreneurship Training Program under Grant No. 010403102309.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request. The data used in this study can be found at the following websites: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 10 June 2023); https://iplab.gpnu.edu.cn/info/1044/1608.htm (accessed on 22 May 2023).

Conflicts of Interest

The authors declare that this research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Zhao, H.; Hu, Q.; Zhu, P.; Wang, Y.; Wang, P. A Recursive Regularization Based Feature Selection Framework for Hierarchical Classification. IEEE Trans. Knowl. Data Eng. 2019, 33, 2833–2846.
  2. Lima, H.C.S.C.; Otero, F.E.B.; Merschmann, L.H.C.; Souza, M.J.F. A Novel Hybrid Feature Selection Algorithm for Hierarchical Classification. IEEE Access 2021, 9, 127278–127292.
  3. Fu, R.; Li, B.; Gao, Y.; Wang, P. CNN with coarse-to-fine layer for hierarchical classification. IET Comput. Vis. 2018, 12, 892–899.
  4. Kowsari, K.; Sali, R.; Ehsan, L.; Adorno, W.; Ali, A.; Moore, S.; Amadi, B.; Kelly, P.; Syed, S.; Brown, D. HMIC: Hierarchical Medical Image Classification, A Deep Learning Approach. Information 2020, 11, 318.
  5. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 558–567.
  6. Gao, D.; Yang, W.; Zhou, H.; Wei, Y.; Hu, Y.; Wang, H. Deep Hierarchical Classification for Category Prediction in E-commerce System. arXiv 2020, arXiv:2005.06692.
  7. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  8. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90.
  9. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
  10. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  11. Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542.
  12. Yan, Z.; Zhang, H.; Piramuthu, R.; Jagadeesh, V.; DeCoste, D.; Di, W.; Yu, Y. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 2740–2748.
  13. Zhu, X.; Bain, M. B-CNN: Branch convolutional neural network for hierarchical classification. arXiv 2017, arXiv:1709.09890.
  14. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. arXiv 2015, arXiv:1504.07889.
  15. Guo, Y.; Liu, Y.; Bakker, E.M.; Guo, Y.; Lew, M.S. CNN-RNN: A large-scale hierarchical image classification framework. Multimedia Tools Appl. 2017, 77, 10251–10271.
  16. Koo, J.; Klabjan, D.; Utke, J. Combined convolutional and recurrent neural networks for hierarchical classification of images. arXiv 2018, arXiv:1809.09574.
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  18. Li, R.; Lin, C.; Collinson, M.; Li, X.; Chen, G. A hierarchical-attention hierarchical recurrent neural network for dialogue act classification. arXiv 2018, arXiv:1810.09154.
  19. Chen, T.; Wu, W.; Gao, Y.; Dong, L.; Luo, X.; Lin, L. Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 2023–2031.
  20. Chen, Q.; Liu, Q.; Lin, E. A knowledge-guide hierarchical learning method for long-tailed image classification. Neurocomputing 2021, 459, 408–418.
  21. Pizarro, I.; Nanculef, R.; Valle, C. An Attention-Based Architecture for Hierarchical Classification with CNNs. IEEE Access 2023, 11, 32972–32995.
  22. Seo, Y.; Shin, K.-S. Hierarchical convolutional neural networks for fashion image classification. Expert Syst. Appl. 2018, 116, 328–339.
  23. Zhang, X.; Tang, L.; Luo, H.; Zhong, S.; Guan, Z.; Chen, L.; Zhao, C.; Peng, J.; Fan, J. Hierarchical bilinear convolutional neural network for image classification. IET Comput. Vis. 2021, 15, 197–207.
  24. Taoufiq, S.; Nagy, B.; Benedek, C. HierarchyNet: Hierarchical CNN-based urban building classification. Remote Sens. 2020, 12, 3794.
  25. Noor, K.T.; Robles-Kelly, A.; Kusy, B. A capsule network for hierarchical multi-label image classification. In Structural, Syntactic, and Statistical Pattern Recognition; Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR); Springer International Publishing: Cham, Switzerland, 2022.
  26. He, G.; Huo, Y.; He, M.; Zhang, H.; Fan, J. A novel orthogonality loss for deep hierarchical multi-task learning. IEEE Access 2020, 8, 67735–67744.
  27. He, G.; Ji, J.; Zhang, H.; Xu, Y.; Fan, J. Feature Selection-Based Hierarchical Deep Network for Image Classification. IEEE Access 2020, 8, 15436–15447.
  28. He, G.; Li, F.; Wang, Q.; Bai, Z.; Xu, Y. A hierarchical sampling based triplet network for fine-grained image classification. Pattern Recognit. 2021, 115, 107889.
  29. Kuang, Z.; Li, Z.; Zhao, T.; Fan, J. Deep multi-task learning for large-scale image classification. In Proceedings of the IEEE Third International Conference on Multimedia Big Data, Laguna Hills, CA, USA, 19–21 April 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 310–317.
  30. Kuang, Z.; Yu, J.; Yu, Z.; Fan, J. Ontology-driven hierarchical deep learning for fashion recognition. In Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval, Miami, FL, USA, 10–12 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 19–24.
  31. Kuang, Z.; Zhang, X.; Yu, J.; Li, Z.; Fan, J. Deep embedding of concept ontology for hierarchical fashion recognition. Neurocomputing 2020, 425, 191–206.
  32. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
Figure 1. The architecture of the HCAM-CL model.
Figure 2. Hierarchical class tree for CIFAR-10.
Figure 3. Categories and labels of the patent image dataset.
Figure 4. Fine accuracy of different methods on CIFAR-10 and the design patent image set.
Table 1. Comparison of results of our proposed method with those of other baseline methods on CIFAR-10.

Model              Coarse1 Accuracy   Coarse2 Accuracy   Fine Accuracy   F1-Score
CNN-RNN [15]       0.9577             0.8286             0.7801          0.8587
CNN-LSTM [16]      0.9674             0.8500             0.7973          0.8746
B-CNN [13]         0.9592             0.9049             0.8862          0.9284
H-CNN [22]         0.9604             0.9019             0.8809          0.9144
ML-CapsNet [25]    0.9752             0.8927             0.8572          0.9057
BA-CNN [21]        0.9836             0.9230             0.8880          0.9315
Ours (HCAM-CL)     0.9882             0.9291             0.8936          0.9345
Table 2. Comparison of results of our proposed method with those of other baseline methods on the design patent image dataset.

Model              Coarse1 Accuracy   Coarse2 Accuracy   Fine Accuracy   F1-Score
CNN-RNN [15]       0.7685             0.5946             0.5446          0.6546
CNN-LSTM [16]      0.8520             0.7446             0.7005          0.7767
B-CNN [13]         0.9388             0.7849             0.7392          0.8194
H-CNN [22]         0.9476             0.8019             0.7796          0.8268
ML-CapsNet [25]    0.9587             0.8654             0.8059          0.8856
BA-CNN [21]        0.9606             0.8790             0.8184          0.9015
Ours (HCAM-CL)     0.9674             0.8888             0.8298          0.9078
Table 3. Comparison of results of our proposed method with those of other baseline methods on CIFAR-100.

Model              Fine Accuracy
BA-CNN [21]        0.6147
B-CNN [13]         0.6442
ML-CapsNet [25]    0.6462
H-CNN [22]         0.6923
CNN-RNN [15]       0.7226
DHC [6]            0.7591
HCAM-CL            0.7794
Table 4. Comparison of metrics with and without the cross-attention mechanism on the design patent image dataset.

Model                                Coarse1 Accuracy   Coarse2 Accuracy   Fine Accuracy   F1-Score
Without cross-attention              0.8520             0.7446             0.7005          0.7767
With cross-attention                 0.9518             0.8590             0.7954          0.8758
With hierarchical cross-attention    0.9624             0.8888             0.8298          0.9008