Article

SPIRIT: Structural Entropy Guided Prefix Tuning for Hierarchical Text Classification

He Zhu, Jinxiang Xia, Ruomei Liu and Bowen Deng
1 State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, No. 37 Xue Yuan Road, Hai Dian District, Beijing 100191, China
2 School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou 215213, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(2), 128; https://doi.org/10.3390/e27020128
Submission received: 20 December 2024 / Revised: 11 January 2025 / Accepted: 12 January 2025 / Published: 26 January 2025
(This article belongs to the Special Issue Causal Inference in Recommender Systems)

Abstract

Hierarchical text classification (HTC) is a challenging task that requires classifiers to solve a series of multi-label subtasks considering hierarchical dependencies among labels. Recent studies have introduced prompt tuning to create closer connections between the language model (LM) and the complex label hierarchy. However, we find that the model’s attention to the prompt gradually decreases as the prompt moves from the input to the output layer, revealing the limitations of previous prompt tuning methods for HTC. Given the success of prefix tuning-based studies in natural language understanding tasks, we introduce Structural entroPy guIded pRefIx Tuning (SPIRIT). Specifically, we extract the essential structure of the label hierarchy via structural entropy minimization and decode the abstractive structural information as the prefix to prompt all intermediate layers in the LM. Additionally, a depth-wise reparameterization strategy is developed to enhance optimization and propagate the prefix throughout the LM layers. Extensive evaluation on four popular datasets demonstrates that SPIRIT achieves a state-of-the-art performance.

1. Introduction

Hierarchical text classification (HTC) poses a significant challenge, as the classifier must simultaneously comprehend the semantics in the text and the hierarchical correlations inherent in the label set. A popular approach for HTC is to express the label set as a label hierarchy (intrinsically, a directed acyclic graph) and adopt a graph neural network to extract structural information using a dual-encoder framework in conjunction with the LM [1]. Furthermore, to enhance the collaboration between the text and the structure encoders, various deep learning techniques, such as semantic matching [2], information maximization [3], and contrastive learning [4,5], are introduced to HTC. To fully exploit and reuse the capabilities of pretrained language models, recent studies have proposed prompt tuning methods [6,7,8], achieving great improvements in HTC.
However, two observations reveal the limitations of vanilla prompt tuning and spark our interest in prefix tuning for HTC: (1) As depicted in Figure 1, in HPT [6], a prominent prompt tuning method for HTC, the attention scores on the prompt contents display a continual downward trend as the input flows toward the output layer. (2) Prior research indicates that prompt tuning encounters performance bottlenecks on natural language understanding tasks, whereas the involvement of prefix tuning [9,10] yields performance comparable to fine-tuning [11]. Considering these two observations, prefix tuning may be a better choice than prompt tuning for HTC, as the prefix can persistently remind the language model of the label hierarchy.
Another issue preventing prompt learning methods from further breakthroughs in HTC is that natural language-based prompts fail to convey the intricate label hierarchy [6], even when advanced language models such as ChatGPT are used [5]. In contrast, we consider generating prompts from a structural perspective. In light of structural entropy [12], which measures the uncertainty of a graph and captures its essential structure in the form of coding trees, we aim to construct structural prefixes for HTC under the principle of structural entropy minimization. Specifically, we design an algorithm that constructs optimal coding trees, uncovering the underlying structure of label hierarchies. Subsequently, we design a structural prefix network to decode the structural information embedded in the coding trees and generate initial prefix representations. Additionally, a depth-wise reparameterization is devised to propagate the prompt tokens to each hidden layer, allowing the prefixes to be better adapted to the LM and reducing the training burden. In comparison with other supervised-only, contrastive learning, and prompt tuning methods, SPIRIT achieves superior performance on four common datasets. Overall, the contributions of our work can be summarized as follows:
  • We propose a structural entropy guided prefix tuning method named SPIRIT for hierarchical text classification. To the best of our knowledge, this paper is the first introduction of prefix tuning to HTC.
  • To construct hierarchy-aware prefixes, we decode the essential structure of label hierarchies with a structural entropy minimization algorithm, providing abstractive information to be encoded into the prefix.
  • Given the essential structure of label hierarchies, a structural prefix network is proposed to construct initial prefix embeddings for prompting all hidden layers in the LM. Further, a depth-wise reparameterization strategy is devised to adapt the structural prefixes to the LM.
  • Experiments on four datasets demonstrate the effectiveness of SPIRIT. For reproducibility, the source code is available at https://github.com/Rooooyy/SPIRIT (accessed on 11 January 2025).

2. Preliminary

Hierarchical Text Classification. In a specific hierarchical text classification task, there exist $L$ classification subtasks $Y = \{Y_1, Y_2, \ldots, Y_L\}$, where $Y_l = \{y_1, y_2, \ldots\} \in \{0, 1\}^{|Y_l|}$ is a multi-label classification task at the $l$-th level. The total number of labels is denoted as $|Y| = \sum_{l=1}^{L} |Y_l|$. Furthermore, dependencies among labels in adjacent levels, derived from their semantics, can be formulated as a directed graph. This graph, known as the label hierarchy, contains edges only from nodes at the $(l-1)$-th level to those at the $l$-th level (i.e., no self-loops or cross-level links), capturing the hierarchical dependencies embedded in the label set. The ground-truth labels always co-occur as paths from the root node to the leaf nodes. Given a piece of text $D = \{x_1, x_2, \ldots, x_N\}$, the classifier should predict the labels in $Y$ with respect to the hierarchy.
Structural Entropy. Structural entropy [12] extends Shannon entropy to graph structures, measuring the uncertainty of a graph. Coding trees are the carriers of structural entropy, revealing the essential structure of a graph. Formally, the structural entropy of a graph $G = (V, E)$ is defined as follows:
$H^T(G) = -\sum_{\alpha \in T,\, \alpha \neq \lambda} \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^{-})}$,    (1)
where $\alpha$ is a non-root node of coding tree $T$, representing a subset of $V$, and $\alpha^{-}$ is the parent node of $\alpha$ on the coding tree. $g_\alpha$ represents the number of cut edges of $T_\alpha$ on $G$, i.e., the edges from $T_\alpha$ to $V \setminus T_\alpha$. $\mathrm{vol}(G)$ denotes the volume of graph $G$, while $\mathrm{vol}(\alpha)$ and $\mathrm{vol}(\alpha^{-})$ are the total degrees of the nodes in $T_\alpha$ and $T_{\alpha^{-}}$, respectively. The height of the coding tree should be fixed to formulate a certain coding scheme. Accordingly, the $K$-dimensional structural entropy of the graph $G$ is the minimal structural entropy determined by a coding tree $T$ with height no greater than $K$; formally,
$H^K(G) = \min_{\{T \,\mid\, \mathrm{height}(T) \le K\}} H^T(G)$.    (2)
Coding Tree. The coding tree $T$ of a graph $G = (V, E)$ is a tree with the following properties: (1) Every node $\alpha \in T$ is assigned a non-empty subset of $V$, denoted as $T_\alpha$. The root node $\lambda$ of $T$ marks all nodes in $G$, i.e., $T_\lambda = V$. For each leaf node $\gamma \in T$, $T_\gamma$ contains a single node $v \in G$. (2) For each non-leaf node $\alpha \in T$, its $i$-th child node is denoted as $\alpha^{<i>}$, and $T_\alpha$ is the union of its children, i.e., $T_\alpha = \bigcup_i T_{\alpha^{<i>}}$ (as a consequence, the nodes on the same level of $T$ provide a partition of $V$). Additionally, a notation list is given in Appendix A.1.
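To make Equations (1) and (2) concrete, the following is a minimal Python sketch (using networkx) that evaluates the structural entropy of a graph under a given coding tree; the CodingTreeNode class and the helper names are illustrative assumptions rather than code from the paper.

```python
import math
import networkx as nx

class CodingTreeNode:
    """A node alpha of a coding tree T; `members` stores the subset T_alpha of V."""
    def __init__(self, members, parent=None):
        self.members = set(members)   # T_alpha
        self.parent = parent          # the parent node alpha^- (None for the root lambda)
        self.children = []

def vol(graph, members):
    """vol(.): total degree of the nodes in `members` (vol(G) when members covers V)."""
    return sum(deg for _, deg in graph.degree(members))

def cut(graph, members):
    """g_alpha: number of edges from T_alpha to V \\ T_alpha."""
    return sum(1 for u, v in graph.edges() if (u in members) != (v in members))

def structural_entropy(graph, root):
    """H^T(G) of Equation (1): a sum over every non-root node of the coding tree."""
    vol_g = vol(graph, graph.nodes())
    total, stack = 0.0, list(root.children)
    while stack:
        alpha = stack.pop()
        ratio = vol(graph, alpha.members) / vol(graph, alpha.parent.members)
        total -= cut(graph, alpha.members) / vol_g * math.log2(ratio)
        stack.extend(alpha.children)
    return total
```

Equation (2) then corresponds to searching, over all coding trees of height at most $K$, for the tree that minimizes this quantity.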

3. Methodology

In this section, we elaborate on the design of SPIRIT, which is illustrated in Figure 2.

3.1. Input Prompt

Following [6], we construct dynamic soft prompts incorporating hierarchy-aware information. The number of virtual tokens equals the depth of the label hierarchy, and each token corresponds to the hierarchy layer with the same index. Furthermore, each layer's prompt token is followed by a special token for layer-wise classification. Formally, given a document $D = \{x_1, x_2, \ldots, x_N\}$, the input to the LM is as follows:
$\{x_C;\ x_1, x_2, \ldots, x_N;\ t_1, x_P, t_2, x_P, \ldots, t_L, x_P;\ x_S\}$,    (3)
where $x_C$ and $x_S$ denote the embeddings of the <CLS> and <SEP> tokens, respectively, and $x_P$ is initialized with the <MASK> token for layer-wise multi-label prediction. $\{t_1, t_2, \ldots, t_L\}$ are prompt tokens produced by a prompt encoder. Specifically, $L$ virtual nodes are added to the label hierarchy, and each virtual node is linked with all label nodes in the same layer. For representation learning, the virtual nodes are randomly initialized, while the label nodes are initialized with their textual descriptions encoded by the LM. Thereafter, a one-layer graph attention network (GAT) [13] is adopted to model label correlations on the augmented label hierarchy graph. The embeddings of the virtual nodes produced by the GAT are taken as continuous prompts, in which the information of the label hierarchy is condensed. Given a node $u$ on the label hierarchy graph augmented with virtual nodes, the information propagation process of the GAT is formulated as
$t_u = \mathrm{ReLU}\left(\sum_{v \in N(u) \cup u} \frac{W_G\, t_v}{c_u}\right)$,    (4)
where $N(u)$ denotes the neighbors of node $u$, $c_u$ is a normalization constant, and $W_G \in \mathbb{R}^{d_G \times d_G}$ is a trainable parameter. Then, we extract the embeddings of the virtual nodes $\{t_1, t_2, \ldots, t_L\}$ to construct the prompt.
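As an illustration only, the snippet below sketches one propagation step in the spirit of Equation (4); it uses plain degree-normalized aggregation with a shared weight $W_G$ instead of the full multi-head attention of GAT, and the tensor layout and class name are our assumptions.

```python
import torch
import torch.nn as nn

class PromptEncoderLayer(nn.Module):
    """One propagation step over the augmented label hierarchy (Equation (4), simplified:
    degree-normalized neighborhood aggregation in place of full GAT attention)."""
    def __init__(self, dim):
        super().__init__()
        self.w_g = nn.Linear(dim, dim, bias=False)   # W_G

    def forward(self, node_emb, adj):
        # node_emb: (num_nodes, dim); adj: (num_nodes, num_nodes) 0/1 adjacency matrix
        adj_hat = adj + torch.eye(adj.size(0))       # aggregate over N(u) and u itself
        c_u = adj_hat.sum(dim=-1, keepdim=True)      # normalization constant c_u
        return torch.relu(adj_hat @ self.w_g(node_emb) / c_u)

# The rows of the output corresponding to the L virtual nodes are taken as the
# soft prompt tokens t_1, ..., t_L in Equation (3).
```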

3.2. Structural Entropy Guided Prefix

Simply prompting the model at the input layer might not be sufficient for HTC, as the label hierarchy provides rules for classification rather than merely serving as a task instruction. To continually reinforce the LM's memory of the label hierarchy, we replay the knowledge of the labels via prefix tuning [9], which introduces prompt tokens for all intermediate layers in the LM. Intuitively, the intermediate layers should be more attentive to coarse-grained information than the input layer. Further, layers at higher levels should receive and produce denser information than those at lower levels. Based on these assumptions, we utilize the essential structural information provided by coding trees, which preserve minimal structural entropy [12], to construct the prefix tokens. We also propose a depth-wise reparameterization strategy for broadcasting the prefixes to LM layers at different depths.
Coding Tree Construction. Structural entropy measures the uncertainty of a graph. By minimizing structural entropy, a coding tree can be constructed that reveals the essential structure of the graph. Converting the label hierarchy into such a coding tree makes it easier for the intermediate LM layers to understand. To this end, we design an efficient algorithm to construct the optimal coding tree attaining minimal $K$-dimensional structural entropy. Given a graph $G = (V, E)$, let $C = \{C_1, \ldots, C_m\}$ be a partition of $V$, where each $C_i \subseteq V$ is called a community. The algorithm is built on three atom functions as follows:
Definition 1.
Function $F_M(\cdot)$. Given any two communities $C_\alpha, C_\beta \in C$, the merge function $F_M(\alpha, \beta)$ merges $C_\alpha$ and $C_\beta$ into a new community $C_\gamma$, i.e., $C_\gamma = C_\alpha \cup C_\beta$.
Here, we denote the reduction in structural entropy after calling $F_M(\alpha, \beta)$ as $\Delta H^K_{(\alpha, \beta)}(G)$. According to Equation (1), we have
$\Delta H^K_{(\alpha, \beta)}(G) = \frac{1}{\mathrm{vol}(G)} \big\{ [\mathrm{vol}(\alpha) - g_\alpha]\log_2 \mathrm{vol}(\alpha) + [\mathrm{vol}(\beta) - g_\beta]\log_2 \mathrm{vol}(\beta) - [\mathrm{vol}(\gamma) - g_\gamma]\log_2 \mathrm{vol}(\gamma) + [g_\alpha + g_\beta - g_\gamma]\log_2 \mathrm{vol}(G) \big\}$.    (5)
Definition 2.
Function $F_S(\cdot)$. Given a graph $G$ and a partition $C$, the function $F_S(C)$ squeezes $G$ in the following procedure: for every community $C_i$ in $C$, the nodes marked by $C_i$ on graph $G$ are squeezed into a new node $v_i^{+}$, and the edge weights between $v_i^{+}$ and other nodes are the summation of the edge weights of the original nodes in $C_i$.
Definition 3.
Function $F_U(\cdot)$. Given a coding tree $T$ and a partition $C$, $F_U(T, C)$ updates the coding tree $T$ by taking all communities in $C$ as leaf nodes of $T$, leading to an increase in its height.
Initially, we treat each node in the graph as a single community, and then iteratively execute $F_M(\cdot)$ and $F_S(\cdot)$ until a $K$-dimensional coding tree is constructed. Given a sub-graph $G_k$ of $G$ in the $k$-th iteration, we greedily merge the pair of communities with the maximal $\Delta H^K(G_k)$ until no pair satisfies $\Delta H^K(G_k) > 0$, thereby achieving the minimal $K$-dimensional structural entropy. The complete procedure is shown in Algorithm 1. Furthermore, a case study is provided in Appendix A.2 for a better illustration of Algorithm 1.
Algorithm 1: Coding Tree Construction
Theorem 1.
Given a graph $G = (V, E)$ and any target height $K$, Algorithm 1 has time complexity $O(|V|)$.
Proof. 
Lines 1–6 of Algorithm 1 merge the nodes of $T$ from bottom to top. For each iteration $k \in \{1, \ldots, K\}$, $F_M(\cdot)$ generates $\log_2(|C^{k-1}|)$ new nodes for $C^k$. Then, $F_S(\cdot)$ traverses and squeezes these nodes, resulting in the retention of no more than $\log_2(|C^{k-1}|)$ nodes. Therefore, the node size of $T$ is smaller than that of a binary tree $T_B$ constructed on the $|V|$ leaf nodes. The complexity of traversing $T_B$ is calculated as follows:
$O(T_B) = O(|E_{T_B}|) = O(|V_{T_B}| - 1) = O\big(|V_{T_B}^{0}| + |V_{T_B}^{1}| + \cdots + |V_{T_B}^{T_B.\mathrm{height}}| - 1\big) < O(2|V_{T_B}^{0}|) = O(2|V|) = O(|V|)$.    (6)
Algorithm 1 executes $K$ linear traversals and operations on $T$. The total number of operations is twice the node size of $T$. Thus, the complexity of Algorithm 1 is also $O(|V|)$. □
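To make the merge-and-squeeze procedure concrete, the following is a rough Python sketch of the greedy loop built on the three atom functions. It is a simplified illustration under our own assumptions (a naive pairwise search, unit edge weights by default, and a connected graph without isolated nodes), not the authors' implementation of Algorithm 1.

```python
import itertools
import math
import networkx as nx

def delta_h(G, vol_g, comm_a, comm_b):
    """Entropy reduction of Equation (5) when merging communities C_alpha and C_beta."""
    def vol(c):   # total (weighted) degree of a community
        return sum(G.degree(v, weight="weight") for v in c)
    def cut(c):   # total weight of edges leaving the community
        return sum(d.get("weight", 1) for u, v, d in G.edges(data=True) if (u in c) != (v in c))
    merged = comm_a | comm_b
    va, vb, vm = vol(comm_a), vol(comm_b), vol(merged)
    ga, gb, gm = cut(comm_a), cut(comm_b), cut(merged)
    return ((va - ga) * math.log2(va) + (vb - gb) * math.log2(vb)
            - (vm - gm) * math.log2(vm) + (ga + gb - gm) * math.log2(vol_g)) / vol_g

def squeeze(G, communities):
    """F_S: contract each community into a single node, summing edge weights."""
    mapping = {v: i for i, c in enumerate(communities) for v in c}
    H = nx.Graph()
    H.add_nodes_from(range(len(communities)))
    for u, v, d in G.edges(data=True):
        a, b = mapping[u], mapping[v]
        if a != b:
            w = d.get("weight", 1) + (H[a][b]["weight"] if H.has_edge(a, b) else 0)
            H.add_edge(a, b, weight=w)
    return H

def coding_tree_partitions(G, K):
    """Greedy sketch of Algorithm 1: returns, for each tree level, a partition of the nodes
    of the previous level's squeezed graph (the root, which groups everything, is implicit)."""
    levels, current = [], G
    for _ in range(K - 1):
        comms = [{v} for v in current.nodes()]
        vol_g = sum(deg for _, deg in current.degree(weight="weight"))
        improved = True
        while improved and len(comms) > 1:
            improved = False
            (i, j), gain = max(
                ((pair, delta_h(current, vol_g, comms[pair[0]], comms[pair[1]]))
                 for pair in itertools.combinations(range(len(comms)), 2)),
                key=lambda x: x[1])
            if gain > 0:                      # F_M: merge the best pair while it still helps
                comms[i] |= comms.pop(j)
                improved = True
        levels.append(comms)                  # F_U: these communities form one tree level
        current = squeeze(current, comms)     # F_S: contract the graph for the next iteration
    return levels
```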
Structural Prefix Network. After running Algorithm 1 on the label hierarchy, we obtain a coding tree $T = (V, E)$ with height $K$, where $V = \{C^k \mid k \in [0, K]\}$. For representation learning, we initialize the leaf node embeddings $X^0$ with the textual descriptions of their corresponding labels. At this point, the non-leaf node embeddings $\{X^k \mid k \in [1, K]\}$ remain unknown. To obtain the intermediate node embeddings, we design a message-passing network as follows:
$x_\alpha = W_2^k\left(\mathrm{ReLU}\left(\sum_{\alpha' \in C_\alpha} W_1^k\, x_{\alpha'}\right)\right), \quad \alpha \in C^k,\ k \in [1, K]$,    (7)
where $W_1^k, W_2^k \in \mathbb{R}^{d \times d}$ are the learnable weights of the $k$-th layer, and $C_\alpha$ denotes the set of child nodes of $\alpha$.
Depth-wise Reparameterization. In light of the reparameterization used in previous prefix tuning studies [9,11], we propose a depth-wise reparameterization to construct the prefix embeddings. First, we take the average of the node embeddings at each layer of $T$ as the initial prefix $Q$; formally,
$Q = \left\{ q_k = W_q \frac{1}{|C^k|} \sum_{\alpha \in C^k} x_\alpha \;\middle|\; k \in [1, K] \right\}$,    (8)
where $W_q \in \mathbb{R}^{d \times d}$ denotes a group of learnable parameters. Next, given $B$ LM layers, the final prefix is $\{ q_k \cdot \sigma(\rho(k, b)) \mid k \in [1, K],\ b \in [1, B] \}$, where $\sigma(\cdot)$ denotes the sigmoid function and $\rho(k, b)$ is the relative distance between $k$ and $b$; formally,
$\rho(k, b) = \tan\left[\pi\left(\left|\frac{k}{K} - \frac{b}{B}\right| - 0.5\right)(1 - \epsilon)\right]$,    (9)
where $\epsilon = 10^{-5}$ avoids numerical errors. The intuition behind Equation (9) is that node embeddings at lower levels of the coding tree $T$ are more suitable for prompting the LM layers close to the input, while node embeddings at higher levels, which preserve denser information, serve as better prefixes for the layers close to the output.
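The sketch below shows how Equations (7)-(9) fit together at the prefix-construction stage: the level-wise embeddings produced by the message-passing network of Equation (7) are mean-pooled into $q_k$ (Equation (8)) and then rescaled per LM layer with $\sigma(\rho(k, b))$ (Equation (9)). The shapes, the module name, and the way the resulting (B, K, d) tensor is consumed by the LM are our assumptions.

```python
import math
import torch
import torch.nn as nn

class DepthwisePrefix(nn.Module):
    """Sketch of Equations (8) and (9): build level-wise prefixes from coding-tree node
    embeddings and rescale them for each of the B LM layers."""
    def __init__(self, dim, num_lm_layers, eps=1e-5):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.B = num_lm_layers
        self.eps = eps

    def rho(self, k, K, b):
        # Equation (9): relative distance between tree level k and LM layer b
        return math.tan(math.pi * (abs(k / K - b / self.B) - 0.5) * (1 - self.eps))

    def forward(self, level_embeddings):
        # level_embeddings: list of K tensors, each of shape (|C^k|, dim), for k = 1..K
        K = len(level_embeddings)
        q = torch.stack([self.w_q(x.mean(dim=0)) for x in level_embeddings])   # Eq. (8): (K, dim)
        rho = torch.tensor([[self.rho(k + 1, K, b + 1) for b in range(self.B)]
                            for k in range(K)])
        scale = torch.sigmoid(rho)                                             # sigma(rho), (K, B)
        # Final prefix: q_k rescaled per layer; layer b receives the slice prefix[b]
        return scale.T.unsqueeze(-1) * q.unsqueeze(0)                          # (B, K, dim)
```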

3.3. Model Training and Prediction

Hierarchy-constrained Prediction. At the output layer, the LM predicts the logits $S$ of $x_P$ with respect to each label; formally,
$S = \{ S^l \mid l \in [1, L] \}, \quad S^l = \{ s_i^l \mid i \in [1, |Y_l|] \}$,    (10)
where $s_i^l \in \mathbb{R}$ is a single logit. When $s_i^l > 0$, the model predicts the corresponding label $y_i^l$. At the inference stage, when the $l$-th layer prompt is observed, we further restrict the model to predict only labels from $Y_l$.
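A small sketch of the constrained decoding step described above, assuming a hypothetical level_slices list that records which columns of the logit matrix belong to each level's label set $Y_l$:

```python
import torch

def constrained_predict(mask_logits, level_slices):
    """Hierarchy-constrained prediction: the l-th <MASK> position may only emit labels
    from Y_l, and y_i^l is predicted whenever its logit s_i^l > 0."""
    predicted = []
    for l, (start, end) in enumerate(level_slices):        # level_slices[l] indexes Y_l
        level_logits = mask_logits[l, start:end]           # logits of the l-th <MASK> token
        hits = (level_logits > 0).nonzero(as_tuple=True)[0] + start
        predicted.append(hits)
    return torch.cat(predicted)                            # global indices of predicted labels
```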
ZLPR Loss. The majority of previous methods [1,2,4,14] adopt the binary cross-entropy (BCE) loss as the optimization target, which intrinsically degrades HTC into naive multi-label classification. Following recent works [5,6] in HTC, we instead utilize the “Zero-bounded Log-sum-exp & Pairwise Rank-based” (ZLPR) loss proposed by [15]; formally,
$\mathcal{L}_{\mathrm{ZLPR}}^{l} = \log\left(1 + \sum_{i \in \Omega_{\mathrm{pos}}^{l}} e^{-s_i^l}\right) + \log\left(1 + \sum_{j \in \Omega_{\mathrm{neg}}^{l}} e^{s_j^l}\right)$,    (11)
where $\Omega_{\mathrm{pos}}^{l}$ and $\Omega_{\mathrm{neg}}^{l}$ denote the target and non-target labels in $Y_l$, dynamically obtained from the ground truths. The ZLPR loss explicitly captures the label correlations in the label hierarchy, which has proved beneficial for HTC [5,6].
Masked LM Training. We keep the same masking strategy and loss as BERT [16] to train our model, where 15% of the words in the text are randomly masked. The final loss $\mathcal{L}_{total}$ is the sum of the ZLPR losses of the classification subtasks $Y_l$ at all levels and the MLM loss $\mathcal{L}_{\mathrm{MLM}}$; formally,
$\mathcal{L}_{total} = \sum_{l=1}^{L} \mathcal{L}_{\mathrm{ZLPR}}^{l} + \mathcal{L}_{\mathrm{MLM}}$.    (12)
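A compact sketch of Equations (11) and (12), assuming per-level logit vectors and multi-hot target vectors; the sign convention (negated logits for target labels, raw logits for non-target labels) follows the ZLPR formulation of [15].

```python
import torch

def zlpr_loss(logits, targets):
    """Equation (11) for one level: `logits` has shape (|Y_l|,) and `targets` is a
    0/1 multi-hot vector marking the target labels Omega_pos^l."""
    pos = logits[targets.bool()]
    neg = logits[~targets.bool()]
    return torch.log1p(torch.exp(-pos).sum()) + torch.log1p(torch.exp(neg).sum())

def total_loss(level_logits, level_targets, mlm_loss):
    """Equation (12): sum the per-level ZLPR losses and add the MLM loss."""
    return sum(zlpr_loss(s, y) for s, y in zip(level_logits, level_targets)) + mlm_loss
```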

4. Experimental Evaluation

Datasets and Evaluation Metrics. Experiments are conducted on four popular HTC datasets: WOS [17], RCV1-v2 [18], NYTimes [19], and BGC (https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/blurb-genre-collection.html, accessed on 7 May 2024). We randomly split WOS and RCV1-v2 following [6], while keeping the official splits for NYTimes and BGC. The statistics of these datasets are shown in Table 1. Model performance is measured by Micro-F1 and Macro-F1 [20]. Micro-F1 is the harmonic mean of the overall precision and recall over all test instances, while Macro-F1 is the average F1 score over categories. Thus, Micro-F1 emphasizes performance on more frequent labels, while Macro-F1 treats all labels equally.
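For reference, both metrics can be computed from multi-hot prediction matrices with scikit-learn; the snippet below is a toy illustration, not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy example: 3 samples over a flattened label set of 4 classes (multi-hot indicators)
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 1]])

micro = f1_score(y_true, y_pred, average="micro")  # pooled over all decisions; favors frequent labels
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-label F1; treats labels equally
print(f"Micro-F1 = {micro:.4f}, Macro-F1 = {macro:.4f}")
```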
Implementation Details. We adopt BERT [16] as the LM to be prompted and update its parameters with Adam [21], using an initial learning rate of $3 \times 10^{-5}$. For simplicity, the hidden size $d$ of all learnable weights is set to 768 unless otherwise mentioned. The model is trained on a single Nvidia V100 32 GB GPU with a batch size of 24. Training is terminated if the Macro-F1 on the dev set does not increase for five epochs. The optimal heights $K$ of the coding trees are $[3, 2, 2, 3]$ for WOS, RCV1-v2, NYTimes, and BGC, respectively. All text samples are truncated to a maximum length of 500.
Baselines. All baselines adopt BERT [16]; the performance gaps relative to vanilla BERT indicate the significance of their original components. HiAGM [1], HTCInfoMax [3], HiMatch [2], and HiTIN [14] are supervised models based on the dual-encoder framework, each with different techniques to enhance the combination of the two encoders. HGCLR [4] is the pioneering investigation of contrastive learning for HTC, while HJCL [5] proposes an instance-wise and label-wise joint supervised contrastive loss. HPT [6] is the inaugural work to introduce prompt tuning to HTC, proposing a dynamic virtual template and label words to bridge the gap between MLM and HTC. HBGL [7] pretrains a set of label embeddings with a pretext task and feeds those of the ground-truth labels to the LM during training, which can be regarded as a pretrained prompt tuning [22] approach.

4.1. Results and Analysis

The main results are shown in Table 2. SPIRIT brings average improvements of 2.26 and 4.69 points over vanilla BERT in Micro-F1 and Macro-F1, respectively. Compared to all baselines, SPIRIT obtains state-of-the-art performance on seven out of eight metrics. On the remaining metric, the Macro-F1 of SPIRIT is second only to that of HJCL; however, it is notable that HJCL introduces both the ZLPR loss and a supervised contrastive learning loss, intrinsically utilizing more supervision signals, whereas the MLM loss is naturally at a disadvantage due to its self-supervised nature.
Among the prompt tuning methods, SPIRIT outperforms all others across all datasets. Our model shows larger margins on the datasets with more complex label hierarchies (i.e., RCV1-v2 and NYTimes), while the gaps to the runner-up are smaller on WOS. A probable reason is that WOS has only two levels of hierarchy and its ground truths are all single-path, which cannot fully exhibit the advantages of structural prefix tuning. Moreover, the fluctuation of HBGL's ranking across the three datasets suggests that its pretrained label embeddings derive more from semantic information than from structural information.

4.2. Ablation Studies

Prefix Tuning Only Works in the Framework of SPIRIT. To track the source of the performance improvements, we conduct ablation studies on WOS and NYTimes in Table 3. Among the four datasets, WOS has the shallowest label hierarchy (depth 2), while that of NYTimes is the deepest (depth 8). We first replace the structural prefix network (Section 3.2) with a two-layer MLP, $W_2\{\tanh[W_1 \cdot T]\}$, to generate the final prefix, where $T = \{t_1, t_2, \ldots, t_L\}$ are the layer-wise prompts (Section 3.1), $W_1 \in \mathbb{R}^{d \times d}$, and $W_2 \in \mathbb{R}^{(d \cdot B) \times d}$. This is the common prefix tuning network adopted in [9,11]. The performance gaps demonstrate the superiority of structural entropy guided prefixes. Next, we skip Equation (9) and directly feed $B$ copies of $Q$ to the LM (r.p. $Q \rightarrow B \times Q$). The degraded results show the necessity of simulating information densification. Further, we replace $Q$ with a randomly initialized matrix $\xi \in \mathbb{R}^{K \times d \times B}$, providing a lower bound corresponding to random prefix tuning. These results jointly confirm the integrity of SPIRIT.
Effects of the K-dimensional Structural Entropy. The only hyperparameter of SPIRIT to be tuned is $K$, which is both the dimension of the structural entropy and the height of the coding trees. To investigate its effects, we train SPIRIT with $K \in \{2, 3, 4, 5, 6\}$ while keeping other settings the same. Figure 3 shows the test performance of coding trees of different heights on WOS and NYTimes. As $K$ grows, the performance of SPIRIT tends to degrade. Taller coding trees may induce exploding gradients, resulting in unstable optimization.

4.3. Evaluation on Path Consistency

We hold the same view as [8,23] that path consistency is crucial to HTC. We further evaluate models with the P metrics [8] and the C metrics [23], both of which are stricter than the vanilla Micro-F1 and Macro-F1. In the P metrics, a prediction is counted as correct in the confusion matrix only if all labels along a path are predicted accurately. The C metrics require that a node label is considered correctly predicted only when all of its ancestor nodes are also correct. As depicted in Figure 4, our model better captures the label hierarchy at different depths and outperforms HPT in both the P and C metrics across the board. Compared with the ablation model “r.p. $Q \rightarrow B \times Q$”, denoted as “SPIRIT (ones)” in Figure 4, SPIRIT also exhibits superior performance, highlighting the necessity of depth-wise reparameterization (Section 3.2).

4.4. Comparison with Large Language Models

Table 4 presents the performance of two recent large language models, GPT-3.5-Turbo [24] and GPT-4o-0513 [25]. Following [5], we adopt the same prompt template to evaluate their in-context learning ability through the OpenAI API (https://openai.com/api/). We also repeat the experiments with a new prompt template presented in Table 5. The results show that the instruction-tuned large language models are still inferior to the supervised fine-tuned models. Further, the reasoning capability of large language models may not generalize to structural information, and the introduction of a structural encoder remains necessary.

4.5. Qualitative Analysis

We illustrate a case study in Figure 5 and Table 6; the test case is from NYTimes [19]. Benefiting from structural prefix tuning and depth-wise reparameterization, our model exhibits a more profound understanding of the hierarchy, successfully predicting all three ground-truth paths. In contrast, because it receives prompts only at the input layer, HPT rapidly forgets the label hierarchy, resulting in only one predicted path. The attention scores of SPIRIT (ones) surpass those of SPIRIT at layers 3–9 but drop dramatically at layers 10–11. Equally weighted prompts confuse the model output, resulting in only two successfully predicted paths.
To further demonstrate the effects of structural prefixes and depth-wise reparameterization, we plot the attention scores of the text tokens with respect to the structural prefixes as heat maps in Figure 6. The structural prefixes receive more attention at both the input and output layers, demonstrating that structural prefix tuning plays an important role at the input and output stages. In the middle layers, influenced by depth-wise reparameterization, the model ignores the zeroth and second prefixes and pays more attention to the intermediate prefixes, which is in line with our expectations.

5. Related Work

Hierarchical Text Classification. Existing works on HTC can be categorized into local and global approaches [1]. Local approaches build multiple models for labels at different levels of the hierarchy, conveying information from models at the upper levels to those at the bottom [17,26,27,28]. In contrast, global studies treat HTC as a flat multi-label classification problem [1,2,3,20,29,30,31,32]. With the advent of pretrained language models (PLMs), numerous studies attempt to solve HTC with PLMs; techniques derived from pretrained language modeling, such as contrastive learning [4,5] and prompt tuning [6,7], have been successfully applied to HTC. The boundary between local and global methods is blurring, as many efforts are devoted to utilizing a mixture of local and global information [5,7].
Prompt Tuning. There are several widely studied prompt tuning paradigms: (1) naive prompt tuning, where continuous prompts are fed directly into the input layer to be modeled jointly with the original inputs [10,33,34,35]; (2) prefix tuning, where continuous prompts are added not only to the input but also to the hidden states [9,11]; (3) P-tuning, where a prompt encoder captures the correlations among continuous prompts [11,36]; and (4) pretrained prompt tuning, where continuous prompts are pretrained together with the language model [22]. Prompt tuning methods for HTC tend to produce continuous prompts based on the label hierarchy. According to the above taxonomy, HPT [6] can be regarded as a P-tuning method, since it generates virtual prompt tokens with a graph attention network [13] acting as the prompt encoder, while HBGL [7] can be referred to as a pretrained prompt tuning method whose label embeddings are pretrained under a pretext task along with BERT [16].
Structural Entropy. Structural entropy [12] is a natural extension of Shannon entropy [37] to structured systems, capable of measuring a system's structural complexity. Non-Euclidean data, especially graph data, constitute a typical structured system. The structural entropy of a graph is defined as the average length of the codewords obtained by a random walk under a specific coding scheme, namely a coding tree [12]. In the past few years, structural entropy has been successfully applied in various domains [38,39,40,41,42,43,44,45].

6. Conclusions

We introduce SPIRIT, a prefix tuning framework consisting of a structural entropy minimization algorithm, a structural prefix network, and a depth-wise reparameterization strategy. For HTC, structural entropy guided prefixes significantly improve the language model's memory of the label hierarchy, and depth-wise reparameterization effectively adapts the prefixes to LM layers at different depths. Moreover, the ablations with random and repeated prefixes demonstrate the feasibility of structural entropy guided prefix tuning for improving HTC performance.

Author Contributions

Methodology, H.Z.; Software, H.Z. and J.X.; Validation, H.Z., J.X. and B.D.; Formal analysis, B.D.; Investigation, R.L.; Data curation, B.D.; Writing—original draft, H.Z.; Writing—review & editing, H.Z., J.X. and R.L.; Visualization, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HTC   Hierarchical Text Classification
NLP   Natural Language Processing
GNN   Graph Neural Network
MLP   Multi-Layer Perceptron
LM    Language Model

Appendix A. More Details

Appendix A.1. A Notation List for Structural Entropy

To further clarify the definition of structural entropy and coding trees, we provide a notation list in Table A1.
Table A1. Notations of Structural Entropy and Coding Trees.
Notation    Explanation
$G = (V, E)$    A graph $G$ with node set $V$ and edge set $E$.
$T$    A coding tree.
$\alpha$    A non-root node on the coding tree.
$T_\alpha$    A subset of $V$, the composition of which depends on $\alpha$.
$g_\alpha$    The number of edges from $T_\alpha$ to $V \setminus T_\alpha$.
$\mathrm{vol}(\alpha)$    The volume of $T_\alpha$, i.e., the total degree of nodes in $T_\alpha$.
$\alpha^{-}$    The parent node of $\alpha$ on the coding tree $T$.
$\alpha^{<i>}$    The $i$-th child node of $\alpha$ on the coding tree $T$.
$\lambda$    The root node of the coding tree $T$.
$H^T(G)$    The structural entropy of $G$.
$H^K(G)$    The $K$-dimensional structural entropy of $G$ when the maximum height of $T$ is $K$.
$C^k = \{C_1, \ldots, C_m\}$    A partition of $V$ at the $k$-th layer of coding tree $T$.

Appendix A.2. A Case Study of Algorithm 1

As illustrated in Figure A1, Algorithm 1 generates a coding tree with height K = 3 carrying the minimal three-dimensional structural entropy for the 12-node simple graph. The coding tree provides hierarchical partitions of the nodes, revealing the essential structure of the simple graph. The essential structure is abstractive since new nodes are created for partitioning.
Figure A1. An example of coding tree with height K = 3 for synthesized simple graph.

References

  1. Zhou, J.; Ma, C.; Long, D.; Xu, G.; Ding, N.; Zhang, H.; Xie, P.; Liu, G. Hierarchy-Aware Global Model for Hierarchical Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1106–1117. [Google Scholar] [CrossRef]
  2. Chen, H.; Ma, Q.; Lin, Z.; Yan, J. Hierarchy-aware Label Semantics Matching Network for Hierarchical Text Classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL/IJCNLP 2021, Virtual Event, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4370–4379. [Google Scholar] [CrossRef]
  3. Deng, Z.; Peng, H.; He, D.; Li, J.; Yu, P.S. HTCInfoMax: A Global Model for Hierarchical Text Classification via Information Maximization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3259–3265. [Google Scholar] [CrossRef]
  4. Wang, Z.; Wang, P.; Huang, L.; Sun, X.; Wang, H. Incorporating Hierarchy into Text Encoder: A Contrastive Learning Approach for Hierarchical Text Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 7109–7119. [Google Scholar]
  5. U, S.C.L.; He, J.; Gutiérrez-Basulto, V.; Pan, J.Z. Instances and Labels: Hierarchy-aware Joint Supervised Contrastive Learning for Hierarchical Multi-Label Text Classification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 8858–8875. [Google Scholar] [CrossRef]
  6. Wang, Z.; Wang, P.; Liu, T.; Lin, B.; Cao, Y.; Sui, Z.; Wang, H. HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 3740–3751. [Google Scholar] [CrossRef]
  7. Jiang, T.; Wang, D.; Sun, L.; Chen, Z.; Zhuang, F.; Yang, Q. Exploiting Global and Local Hierarchies for Hierarchical Text Classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4030–4039. [Google Scholar] [CrossRef]
  8. Ji, K.; Lian, Y.; Gao, J.; Wang, B. Hierarchical Verbalizer for Few-Shot Hierarchical Text Classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 2918–2933. [Google Scholar] [CrossRef]
  9. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
  10. Qin, G.; Eisner, J. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5203–5212. [Google Scholar] [CrossRef]
  11. Liu, X.; Ji, K.; Fu, Y.; Tam, W.L.; Du, Z.; Yang, Z.; Tang, J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv 2021, arXiv:2110.07602. [Google Scholar]
  12. Li, A.; Pan, Y. Structural Information and Dynamical Complexity of Networks. IEEE Trans. Inf. Theory 2016, 62, 3290–3339. [Google Scholar] [CrossRef]
  13. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  14. Zhu, H.; Zhang, C.; Huang, J.; Wu, J.; Xu, K. HiTIN: Hierarchy-aware Tree Isomorphism Network for Hierarchical Text Classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J.L., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 7809–7821. [Google Scholar] [CrossRef]
  15. Su, J.; Zhu, M.; Murtadha, A.; Pan, S.; Wen, B.; Liu, Y. ZLPR: A Novel Loss for Multi-label Classification. arXiv 2022. [Google Scholar] [CrossRef]
  16. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  17. Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.; Gerber, M.S.; Barnes, L.E. HDLTex: Hierarchical Deep Learning for Text Classification. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 364–371. [Google Scholar]
  18. Lewis, D.D.; Yang, Y.; Rose, T.G.; Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res. 2004, 5, 361–397. [Google Scholar]
  19. Sandhaus, E. The New York Times Annotated Corpus; Linguistic Data Consortium: Philadelphia, PA, USA, 2008. [Google Scholar] [CrossRef]
  20. Gopal, S.; Yang, Y. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013. [Google Scholar]
  21. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; ICLR: Appleton, WI, USA, 2015. [Google Scholar]
  22. Gu, Y.; Han, X.; Liu, Z.; Huang, M. Ppt: Pre-trained prompt tuning for few-shot learning. arXiv 2021, arXiv:2109.04332. [Google Scholar]
  23. Yu, C.; Shen, Y.; Mao, Y. Constrained sequence-to-tree generation for hierarchical text classification. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1865–1869. [Google Scholar]
  24. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  25. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  26. Shimura, K.; Li, J.; Fukumoto, F. HFT-CNN: Learning Hierarchical Category Structure for Multi-label Short Text Categorization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 811–816. [Google Scholar] [CrossRef]
  27. Banerjee, S.; Akkaya, C.; Perez-Sorrosal, F.; Tsioutsiouliklis, K. Hierarchical Transfer Learning for Multi-label Text Classification. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019; Volume 1: Long Papers. Korhonen, A., Traum, D.R., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6295–6300. [Google Scholar] [CrossRef]
  28. Huang, W.; Chen, E.; Liu, Q.; Chen, Y.; Huang, Z.; Liu, Y.; Zhao, Z.; Zhang, D.; Wang, S. Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019. [Google Scholar]
  29. Aly, R.; Remus, S.; Biemann, C. Hierarchical Multi-label Classification of Text with Capsule Networks. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019; Volume 2: Student Research Workshop. Alva-Manchego, F., Choi, E., Khashabi, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 323–330. [Google Scholar] [CrossRef]
  30. Mao, Y.; Tian, J.; Han, J.; Ren, X. Hierarchical Text Classification with Reinforced Label Assignment. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 445–455. [Google Scholar] [CrossRef]
  31. Wu, J.; Xiong, W.; Wang, W.Y. Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4353–4363. [Google Scholar] [CrossRef]
  32. Rojas, K.R.; Bustamante, G.; Oncevay, A.; Cabezudo, M.A.S. Efficient Strategies for Hierarchical Text Classification: External Knowledge and Auxiliary Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2252–2257. [Google Scholar] [CrossRef]
  33. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar]
  34. Zhong, Z.; Friedman, D.; Chen, D. Factual probing is [mask]: Learning vs. learning to recall. arXiv 2021, arXiv:2104.05240. [Google Scholar]
  35. Hambardzumyan, K.; Khachatrian, H.; May, J. Warp: Word-level adversarial reprogramming. arXiv 2021, arXiv:2101.00121. [Google Scholar]
  36. Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT understands, too. AI Open 2024, 5, 208–215. [Google Scholar] [CrossRef]
  37. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  38. Liu, Y.; Liu, J.; Zhang, Z.; Zhu, L.; Li, A. REM: From Structural Entropy to Community Structure Deception. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 12918–12928. [Google Scholar]
  39. Wu, J.; Li, S.; Li, J.; Pan, Y.; Xu, K. A Simple yet Effective Method for Graph Classification. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022; pp. 3580–3586. [Google Scholar] [CrossRef]
  40. Zhang, C.; Zhu, H.; Peng, X.; Wu, J.; Xu, K. Hierarchical Information Matters: Text Classification via Tree Based Graph Neural Network. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, 12–17 October 2022; Calzolari, N., Huang, C., Kim, H., Pustejovsky, J., Wanner, L., Choi, K., Ryu, P., Chen, H., Donatelli, L., Ji, H., et al., Eds.; International Committee on Computational Linguistics: New York, NY, USA, 2022; pp. 950–959. [Google Scholar]
  41. Wu, J.; Chen, X.; Xu, K.; Li, S. Structural entropy guided graph hierarchical pooling. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 24017–24030. [Google Scholar]
  42. Yang, Z.; Zhang, G.; Wu, J.; Yang, J.; Sheng, Q.Z.; Peng, H.; Li, A.; Xue, S.; Su, J. Minimum Entropy Principle Guided Graph Neural Networks. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February–3 March 2023; Chua, T., Lauw, H.W., Si, L., Terzi, E., Tsaparas, P., Eds.; ACM: New York, NY, USA, 2023; pp. 114–122. [Google Scholar] [CrossRef]
  43. Wu, J.; Chen, X.; Shi, B.; Li, S.; Xu, K. SEGA: Structural entropy guided anchor view for graph contrastive learning. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  44. Wang, Y.; Wang, Y.; Zhang, Z.; Yang, S.; Zhao, K.; Liu, J. USER: Unsupervised Structural Entropy-Based Robust Graph Neural Network. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, 7–14 February 2023; Williams, B., Chen, Y., Neville, J., Eds.; AAAI Press: Washington, DC, USA, 2023; pp. 10235–10243. [Google Scholar]
  45. Zou, D.; Peng, H.; Huang, X.; Yang, R.; Li, J.; Wu, J.; Liu, C.; Yu, P.S. Se-gsl: A general and effective graph structure learning framework through structural entropy optimization. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 499–510. [Google Scholar]
Figure 1. In HPT [6], the attention scores from text tokens to prompts significantly drop in the deeper LM layers. Different colored lines represent different samples.
Figure 2. An example of SPIRIT with K = 3. Given a pretrained LM, we first feed input prompt (Section 3.1) to the model, then we utilize the structural entropy guided prefix network (Section 3.2) to generate hierarchy-aware prefix tokens. The LM is fine-tuned with ZLPR loss and MLM loss (Section 3.3). After observation of the prompt and prefix tokens, the model is constrained to predict labels according to the hierarchy (Section 3.3).
Figure 3. Test performance of SPIRIT with $K \in \{2, 3, 4, 5, 6\}$ on WOS and NYTimes.
Figure 4. Evaluation on P metrics and C metrics.
Figure 5. LM attention scores from the text tokens to the prompt contents.
Figure 6. A heat map on attention scores to the prefixes in LM layers. Case: WOS # 619, Ground Truth: [‘Medical’, ‘Menopause’]. Model Prediction: [‘Medical’, ‘Menopause’].
Table 1. Dataset statistics. |Y| is the number of classes in the entire label hierarchy. Avg(|Y|) is the average number of ground-truth labels per sample. ‘#’ denotes the number of cases in the Train/Dev/Test sets.
Dataset     |Y|    Avg(|Y|)  Depth  # Train   # Dev    # Test
WOS         141    2         2      30,070    7518     9397
RCV1-v2     103    3.24      4      20,833    2316     781,265
NYTimes     166    7.60      8      23,345    5834     7292
BGC         146    3.01      4      58,715    14,785   18,394
Table 2. Test performance on four datasets. The best results are marked in bold while the runner-ups are marked with underline. †: results reported by [4,5]. “-” means not reported or not applicable.
Models                      WOS               RCV1-v2           NYTimes           BGC
                            Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1
Supervised Fine-tuned Methods
BERT [16] †                 85.63    79.07    85.65    67.02    78.24    65.62    78.84    61.19
HiAGM(BERT) [1] †           86.04    80.19    85.58    67.93    78.64    66.76    79.48    62.84
HTCInfoMax(BERT) [3] †      86.30    79.97    85.53    67.09    78.75    67.31    79.16    62.94
HiMatch(BERT) [2] †         86.70    81.06    86.33    68.66    -        -        78.89    63.19
HiTIN(BERT) [14]            87.19    81.51    86.71    69.95    79.65    69.31    -        -
Supervised Contrastive Learning Methods
HGCLR [4]                   87.11    81.20    86.49    68.31    78.86    67.96    79.22    64.04
HJCL [5]                    -        -        87.04    70.49    80.52    70.02    81.30    66.77
Supervised Prompt Tuning Methods
HPT [6]                     86.94    81.46    87.21    69.32    80.35    70.28    80.27    67.80
HBGL [7]                    87.36    82.00    87.23    70.17    80.47    70.19    -        -
SPIRIT (Ours)               87.44    82.04    87.42    70.29    80.65    70.73    81.88    68.60
Table 3. Performance when replacing or removing components of SPIRIT on the test sets of WOS and NYTimes. r.p. denotes replacement and r.m. denotes removal.
Ablation Models             WOS                 NYTimes
                            Micro-F1  Macro-F1  Micro-F1  Macro-F1
SPIRIT                      87.44     82.04     80.65     70.73
r.m. SPN (Section 3.2)      87.19     81.84     80.27     70.43
r.p. Q → B × Q              87.04     81.47     80.40     70.35
r.p. Q → ξ                  86.72     81.34     80.07     70.09
r.m. L_MLM                  86.94     81.56     80.24     70.42
r.p. L_ZLPR → L_BCE         86.59     80.47     79.45     68.28
Table 4. †: results reported by [4,5]. *: Due to resource limitation, we tested the performance of GPTs on randomly sampled 10K cases in RCV1.
Models                      WOS               RCV1-v2           NYTimes           BGC
                            Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1
GPT-3.5-Turbo [24]          50.30    26.06    45.43 *  26.72 *  37.34    20.41    52.91    34.94
   +Prompt in Table 5       48.44    24.98    42.16 *  23.67 *  34.78    18.83    50.06    31.55
GPT-4o-0513 [25]            62.29    51.06    55.91 *  36.47 *  41.55    26.60    63.44    43.03
   +Prompt in Table 5       63.17    49.68    59.24 *  33.40 *  43.58    24.78    65.06    42.20
BERT [16] †                 85.63    79.07    85.65    67.02    78.24    65.62    78.84    61.19
SPIRIT (Ours)               87.44    82.04    87.42    70.29    80.65    70.73    81.88    68.60
Table 5. We rewrite the prompt in [5].
Prompt Template:
You will be given a list of categories formatted as:
A -> B -> C … -> X
where A, B, C, X are categories
and ’->’ represents the parent-child relationship.
A -> B means A is a child of B.
B -> C means B is a child of C.
You should also predict A when you predict B.
You should also predict B when you predict C.
[Label Paths]
Classify the following text into the categories above.
Just predict the categories, no explanation is allowed.
Texts: [Input]
Table 6. Complete label set for the case study diagram shown in Figure 5.
Methods    Labels
Ground Truths
  • Art and Design / Classifieds / Job Market / Job Categories
  • Art / Features / Travel / Guides /Activities and Interests
  • Features / Reviews / Art and Design / Arts
HPT
  • Art and Design / Classifieds / Job Market / Job Categories
SPIRIT (ones)
  • Art and Design / Classifieds / Job Market / Job Categories
  • Features / Reviews / Art and Design / Arts
SPIRIT
  • Art and Design / Classifieds / Job Market / Job Categories
  • Art / Features / Travel / Guides /Activities and Interests
  • Features / Reviews / Art and Design / Arts
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
