Next Article in Journal
Generalized Lorentzian Sasakian-Space-Forms with M-Projective Curvature Tensor
Next Article in Special Issue
Language Accent Detection with CNN Using Sparse Data from a Crowd-Sourced Speech Archive
Previous Article in Journal
Case-Based Reasoning and Attribute Features Mining for Posting-Popularity Prediction: A Case Study in the Online Automobile Community
Previous Article in Special Issue
Intent-Controllable Citation Text Generation
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

by
Francisco J. Ribadas-Pena
*,†,
Shuyuan Cao
and
Víctor M. Darriba Bilbao
Department of Computer Science, University of Vigo, Edificio Politécnico, Campus As Lagoas s/n, 32004 Ourense, Spain
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2022, 10(16), 2867; https://doi.org/10.3390/math10162867
Submission received: 24 July 2022 / Revised: 8 August 2022 / Accepted: 9 August 2022 / Published: 11 August 2022

Abstract

:
In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm which uses a large autoencoder trained to map the large label space to a reduced size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal in a large portion of the MEDLINE biomedical document collection which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments we propose and evaluate several document representation approaches and different label autoencoder configurations.

1. Introduction

In Large-Scale Text Categorization (LSTC) we are confronted with textual classification problems where a very large and structured set of possible classes are employed. For the general case, not limited exclusively to text, the term eXtreme Multi-Label categorization (XML) is also often used. Usually, in these cases we are dealing with multi-label learning problems where models learn to predict more than one class or label to be assigned to a given input text.
Conventional approaches in multi-label learning either convert the original multi-label problem into a set of single-label problems or adapt well known single-label classification algorithms to handle multi-label datasets. In the context of LSTC and XML research, evolutions of both types of method, that employ what has been called label embedding (LE) or label compression (LC), have recently emerged, trying to take advantage of label dependencies to improve categorization performance. LE methods try to take advantage of label dependencies to improve categorization performance. The starting premise of LE is to convert the large label spaces to a reduced-dimensional representation space (the embedding space) where the actual classification is performed, the results of which are then transformed back to the original label space.
Autoencoders (AEs) are a classical unsupervised neural network architecture able to learn compressed feature representations from original features. Usually AEs are symmetrical networks with a series of layers that learn to transform their input to a latent space of lower dimension (encoder) and another series of layers that learn to regenerate that input from the latent space (decoder), both of them are connected by a small layer that acts as an information bottleneck. Training is carried out in an unsupervised way, presenting the same training vector in the input layer and in the output layer. AE are typically employed in data pre-processing, discarding the decoder and using the learned encoder to create rich representations of the input data useful in further processing.
Automatic semantic indexing is often modeled as an LSTC or XML problem. This task seeks to automate assigning to a given input document a sets of descriptors or indexing terms taken from a controlled vocabulary in order to improve further searching tasks. MeSH (Medical Subject Headings) is a large semantic thesaurus commonly used in the management of biomedical literature. MeSH labels are semantic descriptors arranged into 16 overlapping concept sub-hierarchies, which are employed to index MEDLINE, a collection of millions of biomedical abstracts.
Given this context, in this paper, a multi-label lazy learning approach is presented to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. This method is an evolution from the traditional k-Nearest Neighbors (k-NN) algorithm which exploits an AE trained to map the large label space to a reduced size latent space and to regenerate the output labels from this latent space. Our contributions are as follows:
  • We have employed MEDLINE as a huge labeled collection to train large label-AEs able to map MeSH descriptors onto a semantic latent space where label interdependence is coded.
  • Our proposal adapts classical k-NN categorization to work in the semantic latent space learned by these AEs and employs the decoder subnet to predict the final candidate labels, instead of applying simple voting schemes like traditional k-NN.
  • Additionally, we have evaluated different document representation approaches, using both sparse textual features and dense contextual representations. We have studied their effect in the computation of inter-document distances that are the starting point to find the set of neighbor documents employed in k-NN classification.
The remainder of this article is organized as follows. Section 2 presents the background and context of this paper. Section 3 describes the details of the proposed method and its components. Finally, Section 4 discusses the experimental results obtained by our proposals and Section 5 summarizes the main conclusions of this work.

2. Related Work

This work is framed at the confluence of three research fields: (1) large-scale multi-label categorization, (2) autoencoders and (3) semantic indexing. This section provides a brief review of the most relevant contributions in the state of the art of these topics in relation to our label autoencoder proposal.

2.1. Multi-Label Categorization

In multi-label learning [1] examples can be assigned simultaneously to several not mutually exclusive classes. This task differs from single-label learning (binary or multi-class) and has its own characteristics that make it more complex, while being able to model many relevant real-world problems. Formally, given L = { l 1 , l 2 , , l l } the finite set of labels in a multi-label learning task and D = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , , ( x n , y n ) } the set of multi-label training instances, where x i is the i-example feature vector and y i L is the set of labels for that example, the multi-label categorization task aims to build a multi-label predictor f : x y , with y L , able to produce good classifications on incoming test instances from T = { x 1 , x 2 , , x m } .
The scientific literature on multiple-label learning [2,3] usually identifies two main groups of approaches when dealing with this problem: algorithm adaptation methods and problem transformation methods. Algorithm adaptation approaches extend and customize single-label machine learning algorithms in order to handle multi-label data directly. Several adaptations of traditional learning algorithms have been proposed in the literature, such as boosting (AdaBoost.MH) [4], support vector machines (RankSVM) [5], multi-label k-nearest neighbors (ML-kNN) [6] and neural networks [7]. On the other hand, problem transformation methods transform a multi-label learning problem into a series of single-label problems which already have well-established resolution methods. The solutions of these problems are then combined to solve the original multi-label learning task. For example, Binary Relevance (BR) [8], Label Powerset (LP) [9] and Classifier Chains (CC) [10] transform multi-label learning problems into binary classification problems.
A relevant aspect in multi-label learning approaches is the treatment given to inter-label dependencies. The simplest methods, such as BR, do not take into account correlation between labels, assuming label independence and neglecting the fact that some labels are more likely to co-exist. This assumption brings advantages in parallelization and training efficiency, but at the cost of lower performance in many real-word tasks that exhibit complex inter-label dependencies. Other approaches, like LP and CC, try to capture the dependencies between labels using different strategies. For example, CC sequentially creates a set of binary classifiers where labels predicted by previous classifiers are part of the features employed in successive classifications.
Recent research in multi-label learning propose a label embedding (LE) or label compression (LC) approach that tries to properly exploit correlation between label information by transforming the label space into a latent label space of reduced dimensionality. The actual categorization is performed in that latent space, where correlation between labels is implicitly exploited and a proper decoding process will map the projected data back onto the original label space. Early work in LE [11,12] typically considered linear embedding functions and worked with fairly small label space sizes. Other approaches overcome the limitations of linear assumptions evolving to non-linear embeddings [13,14,15], including several methods based on conventional or deep neural networks [16,17,18].
Finally, a prominent field in multi-label learning that have been attracting lots of research in recent times is eXtreme Multi-label Classification (XML) [19,20]. XML is a multi-label classification task in which learned models automatically label a data sample with the most relevant subset of labels from a large label set, with sizes ranging from thousands to millions. This is a challenging problem due to the scale of the classification task, label sparsity and complex label correlations. The ability to handle label correlations and the scalability of LE approaches [14] have shown many advantages in XML making embeddings one of the most popular approaches for tackling XML problems.

2.2. Autoencoders in Multi-Label Learning

Autoencoders (AEs) [21,22] are a family of unsupervised feedforward Neural Network architectures that jointly learn an encoding function, which maps an input to a latent space representation, and a decoding function, which maps from the latent space back onto the original space. Figure 1 shows this symmetric encoder-decoder structure, with;
  • An encoder function E n c : X Z , which maps the input vectors into a latent (often lower-dimensional) representation though a set of hidden layers.
  • A decoder function D e c : Z X , which acts as an interpreter of the latent representation and reconstructs the input vectors though a set of hidden layers, usually symmetric with the encoding layers.
  • A middle hidden layer representing in the latent space Z an encoding of the input data.
Figure 1. Architecture of a generic autoencoder.
Figure 1. Architecture of a generic autoencoder.
Mathematics 10 02867 g001
Training the model to reproduce the input data at its output, AE jointly optimizes the parameters of encoder E n c and decoder D e c functions and can learn in its hidden layer richer non-linear encoding features that can represent complex input data in a reduced dimensionality latent space. Most practical applications of AE exploit this latent representation in (1) data compression or hashing tasks, (2) classification tasks, using AE to reduce input features dimensionality with minimal information loss, (3) anomaly detection, by analyzing outliers and abnormal patterns in generated embeddings, (4) visualization tasks on the encoded space or (5) data reconstruction and noise reduction in image processing.
Usage of AEs in single-label and multi-label learning, including XML, has already been reported in many research works and AEs are frequently part of pre-processing steps performing dimensionality reduction in order to improve categorization performance and speed. Methods like AE-kNN [23] train an AE on high dimensional input features from a training dataset and employ the encoder sub-net as an input feature compressor, transforming the original input space into a lower-dimensional one where a conventional instance-based k-NN algorithm does the labeling.
The use of AEs over the label space has been less common in the literature. With the advent and explosion of XML methods, research proposals that try to take advantage of the capabilities of AEs to capture non-linear dependencies among the labels have appeared.
Wicker et al. [16], in a pioneering work in the use of label space AEs, introduce MANIC, a multi-label classification algorithm following the problem transformation approach which extracts non-linear dependencies between labels by compressing them using an AE. MANIC uses the encoder part to replace training labels with a reduced dimension version and then trains a base classifier (a BR model in their proposal) using the compressed labels as new target variables.
C2AE (Canonical-Correlated AutoEncoder) [17] and Rank-AE (Ranking-based Auto-Encoder) [18] follow a similar idea, which was later generalized in [24]. These approaches perform a joint embedding learning by deriving a compressed feature space shared by input features and labels. An input space encoder and a label space encoder, sharing the same hidden space, and a decoder that converts this hidden space back to the original label space are trained together to create a deep latent space that embeds input features and labels simultaneously.

2.3. Semantic Indexing in the Biomedical Domain

Controlled vocabularies provide an efficient way of accessing and organizing large collections of textual documents, specially in domains where a simple text-based representation of information is too ambiguous, like the biomedical or legal domains. Automatic semantic indexing seeks to build systems able to annotate an arbitrary piece of text with relevant controlled vocabulary terms. Aside from pure natural language processing (NLP) based methods, most of them following Named Entity Recognition (NER) or Entity Linking strategies, many approaches to semantic indexing model the assignment of controlled vocabulary terms as a multi-label categorization problem.
The Medical Subject Headings (MeSH) thesaurus [25] is a controlled and hierarchically-organized vocabulary, developed and maintained by the National Library of Medicine (https://www.nlm.nih.gov/, accessed on 24 July 2022), which was created for categorizing and searching citations in MEDLINE and the PubMED database. MEDLINE (https://www.nlm.nih.gov/medline/medline_overview.html, accessed on 24 July 2022) is an NLM bibliographic database that contains more than 29 million references to journal articles in life sciences published from 1966 to present, published in more than 5200 journals worldwide in about 40 languages. Each MEDLINE citation contains the title and abstract of the original article, author information, several metadata items (journal name, publishing dates, etc.) and a set of MeSH descriptors that describe the content of the citation and were assigned by NLM annotators to help indexing and searching MEDLINE records. The task of identifying the MeSH terms that best represent a MEDLINE article is manually performed by human experts. The average number of descriptors per citation in MEDLINE 2021 edition was 12.68.
MeSH vocabulary is composed of MeSH subject headings (commonly known as descriptors) which describe the subject of each article and a set of standard qualifiers (subheadings) which narrow down the MeSH heading topic. Additionally, Check Tags are a special subset of 32 MeSH descriptors that cover concepts mentioned in almost every article (human age groups, sex, research animals, etc.). MeSH descriptors are arranged in a hierarchy with 16 top-level categories that constitute a collection of overlapping topic sub-thesauri. A given descriptor may appear at several locations in the hierarchical tree and can have several broader terms and several narrower terms. The 2021 edition of the MeSH thesaurus is composed of 29,917 unique headings, hierarchically arranged in one or more of the 16 top-level categories, with the distribution shown in Table 1.
Automatic indexing with MeSH poses great research challenges. (1) Beyond its large size, the distribution of descriptors follows a power-law, where a few labels (Check Tags and very general descriptors) appear in a large number of citations, whereas most descriptors are employed to annotate very few abstracts According to statistics in [26], only 1% of all MeSH headings which have more than 5000 occurrences contribute to more than 40% of indexing. (2) Simultaneously indexing within 16 top-level overlapping topic sub-hierarchies leads to complex label interdependency.
Over the years several proposals have attempted to tackle the problem of automatic MeSH indexing. The Medical Text Indexer (MTI) [27] is a tool in permanent development by NLM for internal usage as a preliminary annotation tool of incoming MEDLINE citations. MTI is based on a combination of NLP based concept finding performed with MetaMap [28], k-NN prediction using descriptors from PubMed-related citations and several hand-crafted rules and state-of-the-art machine learning techniques that have been incorporated over years of development.
Semantic indexing with MeSH descriptors has also been boosted in recent years by competitions such as the BioASQ challenge [29], which, since 2013, has been organizing an annual shared-task dedicated to semantic indexing in MEDLINE. Several state-of-the-art methods for MeSH indexing were introduced by teams participating in this challenge, most of them modeling the task as a multi-label learning problem [30]. Some relevant recent developments in MeSH indexing are MeSHLabeler [31], DeepMeSH [32], MeSH Now [33], AttentionMeSH [34], MeSHProbeNet [35], FullMeSH [26], BERTMeSH [36] and k-NN methods using ElasticSearch and MTI such as [37].

3. Materials and Methods

Our proposal models semantic indexing over MeSH as a multi-label categorization problem with the following specific characteristics:
  • MEDLINE provides us with an extensive collection of manually annotated documents to train our models.
  • MeSH is a rich hierarchical thesaurus with a large set of descriptors and complex label co-occurrence.
The approach that we describe in this work tries to take advantage of these characteristics through the use of label autoencoders (label-AEs). Our method starts by training a label-AE using the historical information available in MEDLINE. Once trained, the components of this AE allow us to (1) convert the MeSH descriptors assigned to a MEDLINE citation to an embedded semantic space using the encoder part and (2) use the decoder part and a simple threshold scheme to return from that reduced-dimensional space back to the MeSH descriptor space.
The proposed multi-label classification follows a label embedding approach. Our method aims to take advantage of the reduced-dimensional semantic space learned by the label-AE so that a lazy learning scheme operates on it performing the actual classification. This results in a k-NN classifier enriched with the compact semantic information provided by the AE components. This section details the elements that make up our proposal.

3.1. Similarity Based Categorization (k-NN)

The k-Nearest Neighbor (k-NN) algorithm [38] is a lazy learning method which classifies new samples using previous classifications of similar samples assuming the new ones will fall into the same or similar categories. For a given test instance, x, the k most similar instances (the k-nearest neighbors), denoted as N ( x ) , are taken from the training set according to a certain similarity measure. Votes on the labels of instances in N ( x ) are taken to determine the predicted label for that test instance x.
Approaches based on k-NN have been widely used in large-scale multi-label categorization in many domains, including MEDLINE documents [37,39,40]. This preference for this lazy learning method is mainly due to its scalability, minimum parameter tuning and, despite its simplicity, its ability to deliver acceptable results when a large number of training samples are available.
The basic k-NN method we employ in our proposal follows these steps:
1.
Create an indexable representation from the textual contents of every document (MEDLINE citations in our case) in the training dataset.
Two different approaches for creating these indexable representations, dense and sparse, were evaluated in our study as is shown in Section 3.2.
2.
Index these representations in a proper data structure in order to efficiently query it to retrieve sets of similar documents.
3.
For each new document to annotate, the created index is queried using the indexable representation of the new document.
The list of similar documents retrieved in this step together with their corresponding similarity measures are used to determine the following results:
(a)
expected number of labels to assign to the new document
(b)
ranked list of predicted labels (MeSH descriptors in our case)
The first aspect conforms to a regression problem, which aims to predict the number of labels to be included in the final list, depending on the number of labels assigned to the most similar documents identified during the query phase and on their respective similarity scores. In our method the number of labels to be assigned is predicted by simply averaging the length of label lists in neighbor samples.
The other task is a multi-label classification problem, which aims to predict an output label list based on the labels manually assigned to the most similar documents. Our method creates the ranked list of labels using a simple majority voting scheme. Since this is actually a multi-label categorization task, there are as many voting tasks as there were candidate labels extracted from the neighboring documents retrieved by the indexing data structure. For each candidate label, positive votes come from similar documents annotated with it and negative votes come from neighbors not including it. The topmost candidate labels are returned as classification output.

3.2. Document Representation

As noted in the preceding section, our proposal indexes representations of the training documents in order to implement the similarity function that provides the set of neighbors and their similarity scores. In this work we have evaluated two different approaches in document representation, which determine their respective indexing and query schemes, together with document preprocessing.
  • Sparse representations created by means of several NLP based linguistically motivated index term extraction methods, employed as discrete index terms in an Apache Lucene index (https://lucene.apache.org/, accessed on 24 July 2022).
  • Dense representation created by using contextual sentence embeddings based on Deep Learning language models, stored in a numeric vector index.

3.2.1. Sparse Representations

This approach is essentially a large multi-label k-NN classifier backed by an Apache Lucene index. Lucene is an open-source indexing and searching engine, that implements a vector space model for Information Retrieval, providing several similarity ranking functions, such as BM25 [41]. Textual content of training documents is preprocessed in order to extract a set of discrete index terms which Lucene conveniently stores in an inverted index. When labeling, text from new documents is preprocessed and the extracted index term are treated as query terms and linked together using a global OR operator to conform the final query sent to the indexing engine to retrieve the most similar documents and their corresponding similarity scores.
In our case, we have employed the BM25 similarity function. The scores provided by the indexing engine are similarity measures resulting from the engine’s internal computations and the weighting scheme being employed, which do not have a uniform and predictable upper bound. In order for these similarity scores to behave like a real distance metric, we have applied a normalization procedure, that transforms them into a pseudo-distance in [ 0 , 1 ] .
Regarding sparse document representation we have evaluated several linguistically motivated index term extraction approaches as introduced in [40] for a similar problem in Spanish. We employed the following methods:
  • Stemming based representation (STEMS). This was the simplest approach which employs stop-word removal, using a standard stop-word list and the default stemmer from the Snowball project (http://snowball.tartarus.org, accessed on 24 July 2022).
  • Morphosyntactic based representation (LEMMAS). In order to deal with morphosyntactic variation we have employed a lemmatizer to identify lexical roots and we also replaced stop-word removal with a content-word selection procedure based on part-of-speech (PoS) tags.
    We have delegated the linguistic processing tasks to the tools provided by the spaCy Natural Language Processing (NLP) toolkit (https://spacy.io/, accessed on 24 July 2022). In our case we have employed the PoS tagging and lemmatization information provided by spaCy, using the biomedical English models from the ScispaCy project (https://allenai.github.io/scispacy/, accessed on 24 July 2022).
    Only lemmas from tokens tagged as a noun, verb, adjective, adverb or as unknown words are taken into account to constitute the final document representation, since these PoSs are considered to carry most of the sentence meaning.
  • Noum phrases based representation (NPS). In order to evaluate the contribution of more powerful NLP techniques, we have employed a surface parsing approach to identify syntactic motivated nominal phrases from which meaningful multi-word index terms could be extracted.
    Noun Phrase (NP) chunks identified by spaCy are selected and the lemmas of constituent tokens are joined together to create a multi-word index term.
  • Dependencies based representation (DEPS). We have also employed as index terms triples of dependence-head-modifier extracted by the dependency parser provided by spaCy.
    In our case the spaCy dependency parsing model identifies syntactic dependencies following the Universal Dependencies(UD) scheme. The complex index terms were extracted from the following UD relationships Detailed list of UD relationships (available at https://universaldependencies.org/u/dep/, accessed on 24 July 2022): acl, advcl, advmod, amod, ccomp, compound, conj, csuj, dep, flat, iobj, nmod, nsubj, obj, xcomp, dobj and pobj.
  • Named entities representation (NERS). Another type of multi-word representation taken into account is named entities. We have employed the NER module in spaCy and the ScispaCy models to extract general and biomedical named entities from document content.
  • Keywords representation (KEYWORDS). The last kind of multi-word representation we have included is keywords extracted with statistical methods from the textual content of articles. We have employed the implementation of the TextRank algorithm [42] provided by the textacy library (https://textacy.readthedocs.io, accessed on 24 July 2022).

3.2.2. Dense Representations

The recent rise of powerful contextual language models such as BERT and similar approaches have boosted the performance of multiple language processing tasks and Transformer based solutions dominate the state-of-the-art in many NLP areas. A natural evolution of these contextual word embeddings is to move them towards embeddings at the sentence-level with approaches such as those in the Sentence Transformers [43] project (https://www.sbert.net/, accessed on 24 July 2022) that provides pre-trained models to convert sentences in natural languages into fixed-size dense vectors with enriched semantics.
We have taken advantage of dense semantic representations of whole sentences as a basis for converting a search for similar documents into a search for similar vectors in the dense vector space where documents from the training dataset are represented.
We have employed the sentence-transformers/allenai-specter model (https://huggingface.co/sentence-transformers/allenai-specter, accessed on 24 July 2022) to represent a given MEDLINE abstract as a dense vector. This is a conversion of the AllenAI SPECTER model [44], originally trained to estimate the similarity of two publications, to SentenceTransformers, which exploits the citation graph to generate document-level embeddings of scientific documents. This model returns a 768-dimension vector from inputs in the form paper[title] + ’[SEP]’ + paper[abstract].
Once we have the dense representations of the training documents using this procedure, we use the FAISS [45] library (https://github.com/facebookresearch/faiss, accessed on 24 July 2022) to create a searchable index of these dense vectors. This index allows us to efficiently calculate distances between dense vectors and determine for the dense vector associated with a given test document (our query vector) the list of k closest training dense vectors using the Euclidean distance or other similarity metrics.
With this mechanism of similarity between dense vectors we can apply the k-NN classification procedure described previously. In this case we can use the real distances provided by the FAISS library between the query vector generated from the text to be annotated and the most similar k dense vectors directly.

3.3. Label Autoencoders

Our proposal is a special case of eXtream Multi-Label categorization (XML) using a label embedding approach. In our case a lazy learning method works on a low dimensional projection of the label space build with a label autoencoder (label-AE).
Our method is similar to MANIC [16]. Both of them learn a conventional AE. In MANIC the encoder is applied to the entirety of labels from the training examples and uses thresholds to convert their embeddings to a smaller binary label space in which Binary Relevance classifiers are trained. In our case the encoder only acts on the subset of training examples that are part of the neighbors set, N ( x ) . The embedded vectors of neighbors are averaged and the decoder transforms that average vector to the original label space. The AEs used by C2AE [17] and Rank-AE [18] are very different to ours. They jointly train two input subnets that share the inner embedding layer, one generates embeddings from the input features and the other one generates the same embedding space from labels, These two subnets are trained together with an output subnet that decodes the reduced embedding space to the actual label space. In the annotation phase, only the subnet that creates the embedding from input features and the decoding subnet are employed.
The first step of our proposal involves training a label-AE using the set of labels taken from the training samples. In the experiments reported in this paper, those labels are the lists of MeSH descriptors assigned to the MEDLINE citations in our training dataset. For MeSH, this results in a very large label-AE, with >29 K units in the input layer and another >29 K output neurons. Also, input and output vectors are extremely sparse, with an average of 12 values set to 1. On the other hand, the set of training samples is very large and can reach several million if the entire MEDLINE collection is used.
A tentative preliminary study was performed on a portion of the MEDLINE and MeSH datasets. In those preliminary runs we assessed the reconstruction capability of the trained AEs. As a result, the topology and main parameters of the label-AE scheme used in the experiments reported in this paper have been defined as follows:
  • Encoder with 2 hidden layers of decreasing size.
  • Inner hidden embedding layer.
  • Decoder with 2 hidden layers of increasing size, symmetrical to the encoder.
  • ReLU (Rectified Linear Unit) activation function in hidden layer neurons.
  • Feed-forward fully connected layers with a 0.2 Dropout in each hidden layer and batch normalization.
  • Output layer with SIGMOID activation function (operating as a multi-label classifier).
  • Binary cross-entropy loss function.
The second step of our method is to extract the internal representations for the training documents and store them in the corresponding index. As is shown in Section 3.2 an Apache Lucene textual index is employed for NLP based sparse representations and an FAISS index stores the dense contextual vector representations.
Once we have trained our label-AE and a properly indexed version of the training dataset is available, to annotate a new MEDLINE citation x, we apply the following procedure, illustrated in Figure 2:
1.
The index is queried and the set N ( x ) = { n 1 , n 2 , , n k } with the k documents closest to x is retrieved, along with their respective distances to x, ( d i for each n i N ( x ) ).
  • Depending on the representation being used, title and abstract of x are converted into a sparse set of Lucene indexing terms or into a dense vector.
  • Once the respective index (Lucene or FAISS) is queried, an ordered list of most similar citations is available, together with an estimate of their distances to the query document x.
    BM25 scores converted to a pseudo-distance in [ 0 , 1 ] with Lucene index
    euclidean distance between dense representations with FAISS index
2.
The encoder is applied to translate the set of labels assigned to each neighbor n i N ( x ) into the reduced semantic space, computing z i = E n c ( y n i ) ) x i N ( x ) , with y n i the set of labels in neighbor n i .
3.
We create the weighted average vector z = i = 1 k w i w TOTAL · z i in the embedding space, where w TOTAL = j = 1 k w j .
Several distance weighting schemes have been discussed in k-NN literature [38]. In our case we have employed two: (1) weight neighbors by 1 minus their distance ( w i = 1 d i ) and (2) weight neighbors by the inverse of their distance squared ( w i = 1 d i 2 ).
4.
The decoder is used to convert this average vector z from the embedding space to the original label space as y = D e c ( z )
Various cutting and thresholding schemes can be used to binarize this vector and return the list of predicted labels.
  • Estimate the number of labels to return, r, from the sizes of label sets of documents in N ( x ) , as described in citecual, and return the r predicted labels with the highest score.
  • Apply a threshold on the activation of decoder output neurons to decide which labels have an excitation level high enough to be part of the final prediction.
Figure 2. Categorization using k-NN with label autoencoders.
Figure 2. Categorization using k-NN with label autoencoders.
Mathematics 10 02867 g002

4. Results and Discussion

This section conducts an exhaustive set of experiments on a large portion of the MEDLINE collection. In these experiments we validate the effectiveness of our proposal of a multi-label k-NN text classifier assisted by a label-AE in a complex semantic indexing task. Different parameters and options were evaluated on the test dataset in order to determine the best setting for our system with the aim to answer the following research questions:
  • What is the effect on classification performance of the choice of training document representations? Are there substantial differences between sparse term-based similarity and dense vector-based similarity?
  • What are the best parameterizations for label-AEs (size of embedding representation layer, sizes of encoder and decoder layers, etc)? What are the effects of retrieving different number of neighbor documents on the classification performance and how affects the weighting scheme employed when creating the average embedded vector?
In this section we provide a description of our evaluation data and the performance metrics being employed and discuss the experimental results. The source code used to carry out the reported experiments is available at https://github.com/fribadas/labelAE-MeSH (accessed on 8 August 2022).

4.1. Dataset Details and Evaluation Metrics

Our experiments were conducted on a large textual multi-label dataset created as a subset of the 2021 edition of MEDLINE/PubMed baseline files (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline, accessed on 24 July 2022), which comprises over 6 million citations from 2010 onwards. For convenience, the actual dataset was retrieved from the BioASQ challenge [29] repository (http://www.bioasq.org/, accessed on 24 July 2022) rather than from the original sources. BioASQ organizers retrieved the citations from the MEDLINE sources, extracting the relevant elements (pmid, ArticleTitle, AbstractText, MeshHeadingsList, JournalName and Year) and distributed then conveniently formatted as a JSON document. Table 2 summarizes the most relevant characteristics of the resulting dataset.
In our study we have employed two complementary sets of evaluation metrics that are commonly used in evaluating multi-label and XML problems.
  • The evaluation of binary classifiers typically employs Precision (P), which measures how many predicted labels are correct, Recall (R), which counts how many correct labels the evaluated model is able to predict, and F-score (F), which combines both metrics by calculating the harmonic mean of P and R. In multi-class and multi-label problems these metrics are generalized by calculating their Macro-averaged and Micro-averaged variants. A Macro-averaged measure computes a class-wide average of the corresponding measure while a Micro-averaged one computes the corresponding measure on all examples at once and, in the general case, uses to have the advantage of adequately handling the class imbalance. In our evaluation we followed the BioASQ challenge proposal [29] that employs the Micro-averaged versions of Precision ( M i P ), Recall ( M i R ) and F-score ( M i F ) as main performance metrics, using M i P as a ranking criteria.
  • In XML, where the number of candidate labels is very large, metrics that focus on evaluating the effectiveness in predicting correct labels and generating an adequate ranking in the predicted label set are frequently used. Precision at top k ( P @ k ) computes the fraction of correct predictions in the top k predicted labels. Normalized Discounted Cummulated Gain at top k ( n D C G @ k ) [46] is a measure of the ranking quality at the top k predicted labels, which evaluates the usefulness of a predicted label according its position in the result list. In our experimental results, we report the average P @ k and n D C G @ k on the testing set with k = 5 and k = 10 , in order to provide a measure of prediction effectiveness.

4.2. Experimental Results

In the first place we have evaluated the performance of the different approaches described in Section 3.2 for document representations. Secondly, another set of experiments has evaluated the performance of the k-NN method assisted with label-AEs described in Section 3.3 using different AE configurations.

4.2.1. Dense vs. Sparse Representations

In order to evaluate the influence of the document representation being used on the categorization performance we have performed a battery of experiments comparing the use of the dense representations with contextual vectors (runs dense) and the use of different alternatives for extracting sparse representations. In particular, we have evaluated the performance of single terms extracted by stemming (runs Stems) and lemmatization (runs Lemmas), the combination of the different methods for extracting compound terms (runs multi where we combine NERS, NPS and Keywords) and the joint use of the terms extracted using all the methods described in Section 3.2 (runs all). The effect of the number of neighbors considered in each case has also been evaluated, taking k values in { 5 , 10 , 20 , 30 , 50.100 } .
As can be seen from the results shown in Table 3 and summarized in Figure 3, for these experiments the dense representation performs worse than most sparse representations in all performance metrics being considered and for all values of k. The contribution of multi-word terms in the sparse representations is very limited. Although the best results are obtained by combining all term extraction methods (runs all), it is observed that in all metrics the results obtained using single-word terms of type Stems and Lemmas dominate. We hypothesize that when applying this kind of k-NN method on a relatively large dataset (>6 M documents in our case) the contribution of more sophisticated representation methods is diluted. In smaller datasets the use of very specific and precise multi-word terms can help to greatly improve the representation of a document when searching for similar ones.
In this context it is surprising that an apriori simpler approach such as the extraction of sparse representations and the use of the Apache Lucene similarity performs better than the transformer-based contextual representations that currently dominate in the NLP research. An in-depth review of this phenomenon is beyond the scope of this paper, it may be due to the lack of a prior fine-tuning phase with the employed MEDLINE dataset, a poor suitability as a similarity metric of the Euclidean distance computed by the FAISS library or an inherent limitation of large pre-trained language models based on transformers as is discussed in [47].
With respect to the number of neighbors to consider in the k-NN classification, the best results are usually obtained with k = 20 and k = 30 , which is in line with previous publications [39] in MeSH semantic indexing.

4.2.2. k-NN Prediction with Label Autoencoders

Regarding the experiments evaluating the performance of our proposal of a label-AE as a mechanism for improving k-NN classification, our objective has been to evaluate three aspects: (1) the performance of different label-AE topologies (2) the effect of the distance weighting scheme used to create the average vectors feeding the decoder and (3) the most appropriate threshold values to generate the list of predicted labels from the reconstruction of the label space provided by the decoder.
Table 4 shows the characteristics of the label-AEs we have used in this series of experiments We have employed a fixed neural network architecture, using two fully connected layers in both encoder and decoder and one fully connected layer as embedding layer. We have trained and evaluated an encoder, named small label-AE, that uses a 64-dimensional embedding vector and an initial encoder and final decoder layer with 1024 neurons. We have also employed two AE architectures with a 128-dimensional embedding space with two encoder-decoder sizes, one with layers of 2048 and 256 neurons, called medium label-AE, and another with encoder-decoder layers of 4096 and 512 neurons, denoted as large label-AE. We aimed to evaluate the effect of the size and the number of parameters in the learned label-AEs on their quality in the label encoding and reconstruction tasks.
The detailed results obtained with the small label-AE are shown in Table 5, those for the medium label-AE in Table 6 and those for the large label-AE in Table 7. Regarding the thresholds to be applied on the decoder output to create the list of predicted labels, two values have been evaluated, selecting those labels whose output activation exceed the value 0.5 in one case and the value 0.75 in the other. In this way we intended to evaluate the effect of considering more or less demanding selection criteria in conforming the predicted label list. Finally, the effect of the two distance weighting schemes introduced in Section 3.3 to combine the embedded vectors has been evaluated in the different scenarios. In both cases, weighting by 1 minus distance (difference) and weighting by the inverse of distance squared (square), the dense representation and the sparse representation using all of the term extraction methods have been employed, using a number of neighbors k { 5 , 10 , 20 , 30 , 50 , 100 } . Figure 4 summarizes the M i F , M i P and M i R results for the best configurations of label-AE, threshold, distance weighting scheme and k.
As can be seen in Table 6 the best results in all metrics are obtained with the medium label-AE, showing the 64-dimensional embedding space of the small label-AE as apparently incapable of adequately capturing relationships between MeSH descriptors and reconstructing them later. The comparison between the performances of medium label-AE and large label-AE apparently confirms that 2048 dimensions in the input layer of the encoder are sufficient to provide an embedded representation capable, once reconstructed by the decoder, of offering a performance in terms of M i F similar to that of a basic k-NN method, improving its precision values at the cost of a slightly reduced recall. Regarding the P @ k values and the measurement of ranking quality using n D C G @ k , the label-AE results are also able to equal those of the basic k-NN method. However, in this case it is noteworthy that the label-AE method does not exceed the basic k-NN approach despite its good performance with respect to the M i P metric.
With respect to the thresholds, both values have similar performance without great differences, being slightly better to prefer the stricter output criterion provided by the value 0.75. Performance with sparse representations is still better than with dense context vectors, and there is a slight tendency to get better results using fewer neighbors than the basic k-NN method. Finally, the results using the inverse of distance squared as the distance weighting scheme are superior in all scenarios, because it boosts the contribution of the most similar examples when constructing the average embedded vector.
When comparing the medium label-AE best results from Table 6 with the basic k-NN best results from Table 3 we can see that they show very similar M i F values, which in the case of the mediumlabel-AE model is obtained with relatively high values of M i P at the expense of lower values in M i R , whereas the basic k-NN method offers values more uniform in both metrics. After a detailed analysis of the predictions made by both models we have found that the number of labels predicted by the medium label-AE model is substantially smaller. In our study we have obtained that the average number of labels predicted for each document by the simple k-NN method is 13.34 for the sparse representation and 13.13 for the dense representation. For the medium label-AE model with a threshold of 0.50 its average length is 10.44 labels with the sparse representation and 10.61 with the dense representation, whereas with a threshold of 0.75 we have, respectively, 8.65 and 8.69 labels. This behavior makes the basic k-NN method start with an initial advantage in providing better values for M i R .
We hypothesize that the proposed label-AE assisted k-NN method is capable of (1) providing faithful embedded vectors via its encoder and (2) acceptably reconstructing the output labels from the averaged embedded vectors using its decoder, hence offering high values in M i P , but it leaves behind the less frequent labels, which have few training examples to assert their presence in the encoder and decoder weights. On the other hand, the results seem to indicate that the basic k-NN method is able to satisfactorily circumvent the treatment of infrequent labels, at least in a large datasets such as the one we are dealing with. This is probably due to the very nature of the k-NN method. These infrequent labels appear in documents with very specific contents, which leads to a very particular set of neighbors that target the k-NN classifier to these rare labels.
In order to try to combine the best aspects of both approaches, which are the high M i P of our proposed k-NN with label-AE and the best recall capabilities of the classical k-NN method, we have carried out a battery of additional tests. For this purpose, we have taken as a starting point the labels predicted by the label-AE method and combined them with the predictions provided by the basic k-NN method. To build the final set of output labels, to the labels predicted by the label-AE based method we add labels taken from the basic k-NN prediction until the number of output labels predicted by the basic k-NN is reached.
Table 8 shows the results obtained by combining according to the described scheme the predictions of the basic k-NN model with the predictions provided by the mediumlabel-AE. In the case of the M i F , M i P and M i R metrics all of them are substantially improved with respect to the values obtained with these methods separately. In contrast, the values of P @ k and n D C G @ k are penalized.
Although these results improve those provided by the basic k-NN method and those of the k-NN method assisted with our label-AE, they are far from those offered by the best state-of-the-art semantic indexing systems for MeSH. If we take as a reference the latest editions of the BioASQ challenge [29,48], which proposes an evaluation scenario very similar to the one presented in this work, we see that the best systems are capable of reaching M i F scores somewhat higher than 70%, while the Default MTI (Medical Text Indexer) reached values between 53% in the first edition of the challenge and values in the range 62–64% in the last two editions. The baseline used in the first editions of this challenge, which performed a simple string match of the label text, reached values around 26%.

5. Conclusions and Future Work

In this paper we propose a novel multi-label text categorization method able to deal with a very large and structured label space, that it is suitable to be applied in semantic indexing tasks using controlled vocabularies, such it is the case of the Medical Subject Headings (MeSH) thesaurus. The proposed method trains a large label-AE capable of simultaneously learning an encoder function that transforms the original label space into a reduced-dimensional space, along with a decoder function that transforms vectors from that space back into the original label space. The proposal adapts classical k-NN categorization to work in the semantic latent space learned by this label-AE.
We have proposed and evaluated several document representation approaches, using both sparse textual features and dense contextual representations. We have evaluated their contribution in finding neighboring documents employed in the k-NN classification.
An exhaustive study on a large portion of MEDLINE collection has been carried out to evaluate different strategies in the definition and training of label-AEs for the MeSH thesaurus and to verify the suitability of the proposed classification method. The results obtained confirm the ability of the learned label-AEs to capture the latent semantics of MeSH thesaurus descriptors and leverage that representation space in the k-NN classification.
As a future work, a direct application of the method described in this paper is to test the usefulness of the label-AEs learned for MeSH on related thesauri in other languages. An example of such a thesaurus is the DeCS (Descriptores en Ciencias de la Salud, Health Sciences Descriptors) controlled vocabulary (http://decs.bvsalud.org/, accessed on 24 July 2022), which is a trilingual (In Portuguese, Spanish and English) version of MeSH, retaining its structure and adding a collection of specific descriptors. We hypothesize that it is possible to leverage the semantic information about MeSH condensed in the learned encoders and decoders to advantage of it in multilingual biomedical environments.

Author Contributions

Conceptualization, F.J.R.-P.; software, F.J.R.-P., V.M.D.B. and S.C.; validation, F.J.R.-P. and V.M.D.B.; investigation, F.J.R.-P.; resources, F.J.R.-P. and V.M.D.B.; data curation, F.J.R.-P., V.M.D.B. and S.C.; writing—original draft preparation, F.J.R.-P., V.M.D.B. and S.C.; writing—review and editing, F.J.R.-P. supervision, F.J.R.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Spanish Ministry of Science and Innovation through the project PID2020-113230RB-C22.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data was obtained from the repository of the BioASQ challenge [29] and are available under registration at http://participants-area.bioasq.org/datasets/.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. (IJDWM) 2007, 3, 1–13. [Google Scholar] [CrossRef]
  2. Madjarov, G.; Kocev, D.; Gjorgjevikj, D.; Džeroski, S. An extensive experimental comparison of methods for multi-label learning. Pattern Recognit. 2012, 45, 3084–3104. [Google Scholar] [CrossRef]
  3. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2013, 26, 1819–1837. [Google Scholar] [CrossRef]
  4. Schapire, R.E.; Singer, Y. BoosTexter: A boosting-based system for text categorization. Mach. Learn. 2000, 39, 135–168. [Google Scholar] [CrossRef]
  5. Elisseeff, A.; Weston, J. A kernel method for multi-labelled classification. Adv. Neural Inf. Process. Syst. 2001, 14, 681–687. [Google Scholar]
  6. Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar] [CrossRef]
  7. Zhang, M.L.; Zhou, Z.H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 2006, 18, 1338–1351. [Google Scholar] [CrossRef]
  8. Boutell, M.R.; Luo, J.; Shen, X.; Brown, C.M. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771. [Google Scholar] [CrossRef]
  9. Tsoumakas, G.; Katakis, I.; Vlahavas, I. Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 2010, 23, 1079–1089. [Google Scholar] [CrossRef]
  10. Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains for multi-label classification. Mach. Learn. 2011, 85, 333–359. [Google Scholar] [CrossRef]
  11. Hsu, D.J.; Kakade, S.M.; Langford, J.; Zhang, T. Multi-label prediction via compressed sensing. Adv. Neural Inf. Process. Syst. 2009, 22, 772–780. [Google Scholar]
  12. Tai, F.; Lin, H.T. Multilabel classification with principal label space transformation. Neural Comput. 2012, 24, 2508–2542. [Google Scholar] [CrossRef]
  13. Cisse, M.M.; Usunier, N.; Artieres, T.; Gallinari, P. Robust bloom filters for large multilabel classification tasks. Adv. Neural Inf. Process. Syst. 2013, 26, 933. [Google Scholar]
  14. Bhatia, K.; Jain, H.; Kar, P.; Varma, M.; Jain, P. Sparse local embeddings for extreme multi-label classification. Adv. Neural Inf. Process. Syst. 2015, 28, 495. [Google Scholar]
  15. Rai, P.; Hu, C.; Henao, R.; Carin, L. Large-scale bayesian multi-label learning via topic-based label embeddings. Adv. Neural Inf. Process. Syst. 2015, 28, 1805. [Google Scholar]
  16. Wicker, J.; Tyukin, A.; Kramer, S. A nonlinear label compression and transformation method for multi-label classification using autoencoders. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Auckland, New Zealand, 19–22 April 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 328–340. [Google Scholar]
  17. Yeh, C.K.; Wu, W.C.; Ko, W.J.; Wang, Y.C.F. Learning deep latent space for multi-label classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  18. Wang, B.; Chen, L.; Sun, W.; Qin, K.; Li, K.; Zhou, H. Ranking-Based Autoencoder for Extreme Multi-label Classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, Minnesota, 2019; pp. 2820–2830. [Google Scholar] [CrossRef]
  19. Agrawal, R.; Gupta, A.; Prabhu, Y.; Varma, M. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 13–24. [Google Scholar]
  20. Prabhu, Y.; Varma, M. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 263–272. [Google Scholar]
  21. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
  22. Charte, D.; Charte, F.; García, S.; del Jesus, M.J.; Herrera, F. A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines. Inf. Fusion 2018, 44, 78–96. [Google Scholar] [CrossRef]
  23. Pulgar-Rubio, F.; Charte, F.; Rivera-Rivas, A.; del Jesus, M.J. AEkNN: An AutoEncoder kNN-Based Classifier With Built-in Dimensionality Reduction. Int. J. Comput. Intell. Syst. 2018, 12, 436–452. [Google Scholar] [CrossRef]
  24. Jarrett, D.; van der Schaar, M. Target-Embedding Autoencoders for Supervised Representation Learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
  25. U.S. National Library of Medicine. Medical Subject Headings. 2022. Available online: https://www.nlm.nih.gov/mesh/meshhome.html (accessed on 24 July 2022).
  26. Dai, S.; You, R.; Lu, Z.; Huang, X.; Mamitsuka, H.; Zhu, S. FullMeSH: Improving large-scale MeSH indexing with full text. Bioinformatics 2020, 36, 1533–1541. [Google Scholar] [CrossRef]
  27. Mork, J.; Aronson, A.; Demner-Fushman, D. 12 years on—Is the NLM medical text indexer still useful and relevant? J. Biomed. Semant. 2017, 8, 8. [Google Scholar] [CrossRef]
  28. Aronson, A.R.; Lang, F.M. An overview of MetaMap: Historical perspective and recent advances. J. Am. Med. Inform. Assoc. 2010, 17, 229–236. [Google Scholar] [CrossRef]
  29. Tsatsaronis, G.; Balikas, G.; Malakasiotis, P.; Partalas, I.; Zschunke, M.; Alvers, M.R.; Weissenborn, D.; Krithara, A.; Petridis, S.; Polychronopoulos, D.; et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 2015, 16, 138. [Google Scholar] [CrossRef]
  30. Gargiulo, F.; Silvestri, S.; Ciampi, M.; De Pietro, G. Deep neural network for hierarchical extreme multi-label text classification. Appl. Soft Comput. 2019, 79, 125–138. [Google Scholar] [CrossRef]
  31. Liu, K.; Peng, S.; Wu, J.; Zhai, C.; Mamitsuka, H.; Zhu, S. MeSHLabeler: Improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics 2015, 31, i339–i347. [Google Scholar] [CrossRef]
  32. Peng, S.; You, R.; Wang, H.; Zhai, C.; Mamitsuka, H.; Zhu, S. DeepMeSH: Deep semantic representation for improving large-scale MeSH indexing. Bioinformatics 2016, 32, i70–i79. [Google Scholar] [CrossRef] [PubMed]
  33. Mao, Y.; Lu, Z. MeSH Now: Automatic MeSH indexing at PubMed scale via learning to rank. J. Biomed. Semant. 2017, 8, 1–9. [Google Scholar] [CrossRef] [PubMed]
  34. Jin, Q.; Dhingra, B.; Cohen, W.; Lu, X. AttentionMeSH: Simple, effective and interpretable automatic MeSH indexer. In Proceedings of the 6th BioASQ Workshop A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, Brussels, Belgium, November 2018; pp. 47–56. [Google Scholar]
  35. Xun, G.; Jha, K.; Yuan, Y.; Wang, Y.; Zhang, A. MeSHProbeNet: A self-attentive probe net for MeSH indexing. Bioinformatics 2019, 35, 3794–3802. [Google Scholar] [CrossRef] [PubMed]
  36. You, R.; Liu, Y.; Mamitsuka, H.; Zhu, S. BERTMeSH: Deep contextual representation learning for large-scale high-performance MeSH indexing with full text. Bioinformatics 2020, 37, 684–692. Available online: https://academic.oup.com/bioinformatics/article-pdf/37/5/684/37808596/btaa837.pdf (accessed on 24 July 2022). [CrossRef] [PubMed]
  37. Bedmar, I.S.; Martínez, P.; Martín, A.C. Search and graph database technologies for biomedical semantic indexing: Experimental analysis. JMIR Med. Inform. 2017, 5, e7059. [Google Scholar]
  38. Aha, D.W.; Kibler, D.; Albert, M.K. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66. [Google Scholar] [CrossRef]
  39. Trieschnigg, D.; Pezik, P.; Lee, V.; De Jong, F.; Kraaij, W.; Rebholz-Schuhmann, D. MeSH Up: Effective MeSH text classification for improved document retrieval. Bioinformatics 2009, 25, 1412–1418. [Google Scholar] [CrossRef]
  40. Ribadas-Pena, F.J.; Cao, S.; Kuriyozov, E. CoLe and LYS at BioASQ MESINESP Task: Large-scale multilabel text categorization with sparse and dense indices. In Proceedings of the CLEF (Working Notes), Bucharest, Romania, 21–24 September 2021; pp. 313–323. [Google Scholar]
  41. Robertson, S.E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M.M.; Gatford, M. Okapi at TREC-3. Nist Spec. Publ. 1995, 109, 109. [Google Scholar]
  42. Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411. [Google Scholar]
  43. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PL, USA, 2019. [Google Scholar]
  44. Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; Weld, D. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PL, USA, 2020; pp. 2270–2282. [Google Scholar] [CrossRef]
  45. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with gpus. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  46. Järvelin, K.; Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. Acm Trans. Inf. Syst. (TOIS) 2002, 20, 422–446. [Google Scholar] [CrossRef]
  47. Ranaldi, L.; Fallucchi, F.; Zanzotto, F.M. Dis-Cover AI Minds to Preserve Human Knowledge. Future Internet 2022, 14, 10. [Google Scholar] [CrossRef]
  48. Nentidis, A.; Katsimpras, G.; Vandorou, E.; Krithara, A.; Gasco, L.; Krallinger, M.; Paliouras, G. Overview of BioASQ 2021: The Ninth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF2021), Bucharest, Romania, 21–24 September 2021; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
Figure 3. Summary of performance metrics with sparse vs. dense representations for values of k with best M i F values.
Figure 3. Summary of performance metrics with sparse vs. dense representations for values of k with best M i F values.
Mathematics 10 02867 g003
Figure 4. Summary of M i F , M i P , M i R metrics for values of k and distance weighting with best M i F values in each label-AE configuration (small, medium, large).
Figure 4. Summary of M i F , M i P , M i R metrics for values of k and distance weighting with best M i F values in each label-AE configuration (small, medium, large).
Mathematics 10 02867 g004
Table 1. Descriptor distribution in MeSH top-level categories.
Table 1. Descriptor distribution in MeSH top-level categories.
Number of
SubhierarchyDescriptors
(A) Anatomy3303
(B) Organisms5210
(C) Diseases12,749
(D) Chemicals and Drugs23,589
(E) Analytical, Diagnostic and Therapeutic Techniques and Equipment5327
(F) Psychiatry and Psychology1435
(G) Biological Sciences3794
(H) Physical Sciences582
(I) Anthropology, Education, Sociology and Social Phenomena841
(J) Technology and Food and Beverages765
(K) Humanities216
(L) Information Science552
(M) Persons345
(N) Health Care2860
(V) Publication Characteristics231
(Z) Geographic Locations517
Table 2. Evaluation dataset statistics and MeSH descriptor distribution.
Table 2. Evaluation dataset statistics and MeSH descriptor distribution.
Collection Statistics
# citations6,791,951
# citations in dev dataset10,000
# citations in test dataset10,000
# MeSH descriptors29,483
minmaxavg
descriptors per citation11912.90
descriptor occurrences14,621,007 2972.26
MeSH descriptor distribution
occurrencesnumber of descriptors
≥ 1 M7
≥ 100 K65
≥ 10 K1314
≥ 1 K8752
≥ 10021,310
Descriptor D006801: Humans.
Table 3. Performance metrics with sparse vs. dense representations.
Table 3. Performance metrics with sparse vs. dense representations.
kMiFMiPMiRP@5P@10nDCG@5nDCG@10
Stemms50.49430.48550.50350.72990.55270.77060.6613
100.51910.50820.53050.75500.58040.79990.6925
200.52590.51510.53710.75740.58490.80430.6986
300.52400.51290.53550.75790.58370.80450.6974
500.52230.51110.53400.75270.58050.79910.6934
1000.51470.50310.52690.74130.57310.78980.6857
Lemmas50.49200.48230.50200.72260.54960.76440.6574
100.51820.50780.52900.74850.57680.79310.6878
200.52200.51140.53300.75380.58190.80100.6951
300.52190.51070.53360.75440.58220.80180.6959
500.51970.50870.53120.74880.57800.79600.6906
1000.51430.50290.52620.73700.57040.78540.6821
multi50.45870.44920.46860.69090.51750.73290.6230
100.48750.47770.49770.71110.54410.75840.6529
200.49720.48750.50720.72230.55400.77090.6646
300.49810.48840.50820.71650.55360.76700.6639
500.49450.48450.50490.71330.55000.76430.6605
1000.48970.47960.50020.70260.54370.75310.6519
all50.49450.48560.50360.72760.55300.76810.6610
100.52070.51110.53070.75440.57950.80030.6930
200.52760.51660.53900.76490.58940.81030.7035
300.52740.51630.53890.76110.58610.80790.7009
500.52370.51270.53520.75520.58210.80220.6958
1000.51760.50640.52930.74530.57530.79330.6884
dense50.47790.47250.48340.70560.52990.74790.6380
100.49960.49360.50580.73480.55410.78000.6675
200.50300.49700.50930.73270.55750.78260.6728
300.50340.49700.51000.73500.55560.78430.6718
500.50160.49500.50840.72910.55540.77890.6697
1000.49180.48480.49910.71610.54510.76720.6586
Table 4. Configuration of label autoencoders in our experiments.
Table 4. Configuration of label autoencoders in our experiments.
Input/OutputEncoder-1Encoder-2EmbeddingDecoder-1Decoder-2# Parameters
small29,483102425664256102460,975,467
medium29,48320485121285122048123,035,563
large29,4834096102412810244096250,256,299
Table 5. Performance with small label-AE.
Table 5. Performance with small label-AE.
ThresholdNeighborsWeightingkMiFMiPMiRP@5P@10nDCG@5nDCG@10
0.50 sparse(all)difference50.49400.53380.45970.70420.52180.74920.6330
100.49810.53880.46320.71530.52560.76200.6408
200.49140.52990.45810.70940.52090.75660.6362
300.48860.52900.45400.70810.51730.75460.6322
500.48260.52500.44660.69920.50950.74750.6248
1000.47190.51900.43260.69090.49560.73960.6120
0.50 sparse(all)square50.49500.53190.46290.70450.52210.74770.6318
100.50380.54340.46960.71680.53020.76250.6439
200.49990.53990.46550.71890.52780.76400.6426
300.49650.53670.46190.71930.52560.76380.6405
500.49060.53290.45450.71140.51800.75810.6342
1000.48100.52650.44270.70180.50480.74920.6218
0.50 densedifference50.48210.51400.45390.69290.51290.73900.6234
100.48740.51680.46110.70190.52070.74930.6331
200.48570.51420.46010.70260.51920.75000.6321
300.48220.51010.45710.69870.51710.74670.6297
500.47550.50400.45010.69310.51040.74200.6236
1000.46880.49690.44370.68500.50390.73490.6169
0.50 densesquare50.48250.51420.45450.69430.51260.74020.6234
100.48730.51680.46090.70240.52020.74970.6328
200.48500.51300.45990.70450.51870.75160.6321
300.48110.50890.45610.69540.51530.74400.6279
500.47340.50170.44810.68840.50690.73860.6204
1000.46310.49030.43880.67820.49630.72910.6097
0.75 sparse(all)difference50.48930.62810.40070.69000.48360.73840.6024
100.49560.64330.40300.69850.48690.74950.6100
200.48980.64110.39630.69280.48050.74410.6036
300.48570.64050.39110.68830.47490.73980.5982
500.47940.63970.38330.67870.46620.73200.5896
1000.46770.63500.37020.66820.45080.72190.5749
0.75 sparse(all)square50.49150.62470.40510.69410.48740.74010.6051
100.49980.64360.40850.70380.49140.75270.6131
200.49850.64960.40440.70460.48890.75320.6116
300.49410.64880.39890.70070.48400.74980.6071
500.48790.64600.39190.69220.47590.74370.6002
1000.47730.64260.37960.67870.46080.73160.5855
0.75 densedifference50.48280.61090.39910.68280.47930.73160.5971
100.48420.61720.39830.68850.48130.73940.6020
200.48500.62200.39740.69000.48140.74070.6025
300.48170.61960.39410.68510.47780.73670.5988
500.47530.61300.38810.67890.47120.73150.5926
1000.46730.60830.37940.66920.46160.72320.5834
0.75 densesquare50.48310.61060.39970.68360.47940.73230.5973
100.48510.61820.39910.68930.48200.73990.6026
200.48420.62100.39670.69110.48070.74170.6022
300.48030.61760.39300.68180.47600.73400.5970
500.47180.60960.38490.67420.46770.72800.5894
1000.46170.60310.37400.66110.45520.71630.5768
Table 6. Performance with medium label-AE.
Table 6. Performance with medium label-AE.
ThresholdNeighborsWeightingkMiFMiPMiRP@5P@10nDCG@5nDCG@10
0.50 sparse(all)difference50.51360.55320.47930.72650.54350.77170.6559
100.52610.57810.48270.74210.54980.78960.6681
200.52230.58450.47200.74280.54440.79140.6650
300.52080.59030.46600.74020.53960.78890.6606
500.51530.59150.45650.73470.53020.78410.6525
1000.50560.59360.44030.72170.51570.77390.6396
0.50 sparse(all)square50.51140.53870.48670.71950.54350.76460.6539
100.52800.57160.49050.74200.55560.78900.6718
200.52980.58730.48260.74920.55440.79620.6732
300.52740.59240.47520.74480.54770.79310.6677
500.52400.59670.46700.74320.54090.79190.6627
1000.51550.59970.45210.73350.52820.78390.6513
0.50 densedifference50.49800.52790.47130.71090.53030.75660.6417
100.50770.55140.47050.72330.53560.77130.6513
200.50740.55830.46500.72340.53220.77370.6507
300.50630.56130.46110.71940.52900.77080.6478
500.50020.55930.45240.71520.52160.76760.6418
1000.49240.55730.44110.70610.51170.75920.6317
0.50 densesquare50.49920.52870.47280.71130.53070.75670.6420
100.50800.55120.47100.72210.53450.77020.6502
200.50630.55690.46410.72110.53100.77220.6497
300.50510.56070.45950.71790.52820.77000.6473
500.49770.55790.44920.71090.51780.76340.6377
1000.48640.55320.43400.69830.50410.75270.6245
0.75 sparse(all)difference50.51520.61110.44530.72170.52640.76820.6426
100.52510.64030.44500.73480.52880.78410.6515
200.52030.64940.43400.73240.51980.78380.6454
300.51670.65600.42610.72850.51310.78010.6394
500.50900.65740.41520.72090.50160.77360.6291
1000.49650.65870.39850.70760.48510.76300.6143
0.75 sparse(all)square50.51270.59430.45090.71630.52690.76220.6411
100.52650.63160.45140.73740.53410.78580.6553
200.52860.65120.44480.74160.53200.79070.6557
300.52600.65940.43750.73650.52390.78680.6490
500.51890.66310.42610.73240.51280.78380.6404
1000.50710.66510.40970.71910.49730.77310.6264
0.75 densedifference50.50180.58840.43740.70810.51390.75460.6292
100.50720.61220.43300.71750.51550.76700.6356
200.50710.62450.42680.71660.51030.76850.6335
300.50260.62460.42040.71170.50470.76500.6288
500.49560.62570.41030.70520.49520.76010.6206
1000.48750.62470.39980.69520.48490.75100.6104
0.75 densesquare50.50250.58810.43870.70820.51450.75450.6297
100.50750.61220.43340.71640.51440.76600.6346
200.50700.62500.42650.71470.50970.76740.6330
300.50050.62330.41810.71040.50320.76430.6276
500.49180.62310.40620.70100.49100.75600.6164
1000.48070.61980.39260.68690.47720.74400.6030
Table 7. Performance with large label-AE.
Table 7. Performance with large label-AE.
ThresholdNeighborsWeightingkMiFMiPMiRP@5P@10nDCG@5nDCG@10
0.50 sparse(all)difference50.49620.64570.40290.69340.48660.74390.6077
100.49590.67330.39250.70060.47960.75580.6082
200.48990.68570.38120.69710.47020.75340.6011
300.48430.69010.37310.68900.46210.74690.5935
500.47470.69320.36100.67780.44880.73700.5811
1000.46010.69810.34310.66090.42970.72260.5631
0.50 sparse(all)square50.49730.62850.41140.69240.49520.74270.6132
100.50060.66620.40100.70640.48600.75840.6123
200.49720.68640.38970.70780.47920.76130.6090
300.49210.69170.38190.70300.47150.75800.6030
500.48350.69600.37040.68960.45940.74730.5917
1000.47000.70170.35340.67270.44060.73330.5742
0.50 densedifference50.48740.61790.40240.68890.48590.74140.6065
100.48970.64730.39390.69430.48010.74960.6063
200.48680.66090.38530.69110.47460.74790.6024
300.48220.66460.37830.68790.46660.74530.5958
500.47540.66670.36940.67970.45720.73860.5873
1000.46590.66610.35820.66960.44500.73050.5763
0.50 densesquare50.48840.61810.40360.68940.48680.74140.6069
100.48990.64660.39440.69460.48030.74980.6065
200.48620.66060.38460.68940.47330.74620.6008
300.48140.66420.37760.68670.46510.74390.5941
500.47230.66560.36600.67490.45350.73520.5839
1000.46000.66290.35210.66270.43830.72410.5694
0.75 sparse(all)difference50.47950.69120.36710.67790.45410.73220.5811
100.47550.72120.35460.67720.44240.73770.5769
200.46770.73580.34280.66550.42960.72800.5651
300.46160.74500.33440.65840.42090.72280.5577
500.45080.74810.32260.64460.40640.71110.5441
1000.43390.75110.30510.62620.38680.69450.5245
0.75 sparse(all)square50.48220.67580.37490.67920.46110.73280.5863
100.48210.71430.36380.68540.45100.74270.5836
200.47710.73780.35260.68190.44180.74120.5770
300.47140.74680.34440.67350.43220.73530.5689
500.46110.75100.33260.65990.41900.72390.5564
1000.44320.75320.31400.63730.39740.70530.5360
0.75 densedifference50.47460.66790.36810.67410.45410.73040.5811
100.47290.69710.35780.67710.44610.73690.5790
200.46750.71570.34710.66910.43600.73140.5706
300.46230.71860.34080.66430.42890.72770.5645
500.45450.71990.33210.65330.41850.71880.5550
1000.44480.72260.32120.64100.40520.70850.5425
0.75 densesquare50.47610.66810.36980.67540.45600.73090.5822
100.47340.69830.35810.67600.44610.73610.5788
200.46680.71490.34660.66720.43460.72950.5690
300.46210.71970.34030.66260.42800.72560.5630
500.45160.71820.32940.64980.41520.71620.5519
1000.43810.72150.31460.63280.39750.70100.5344
Table 8. Performance mixing results from basic k-NN with medium label-AE.
Table 8. Performance mixing results from basic k-NN with medium label-AE.
ThresholdNeighborsWeightingkMiFMiPMiRP@5P@10nDCG@5nDCG@10
0.50 sparse(all)difference50.51090.50180.52040.51590.50760.51520.5341
100.53090.52080.54140.53840.53010.53980.5598
200.53450.52360.54600.55300.53830.55340.5694
300.53300.52180.54480.55540.53820.55640.5706
500.53040.51910.54220.55710.53710.55860.5705
1000.52430.51300.53610.55900.53460.56050.5694
0.50 sparse(all)square50.50630.49730.51550.49930.49680.49870.5205
100.52940.51970.53950.53060.52770.53030.5536
200.53850.52770.54980.54840.54040.54960.5700
300.54080.52990.55220.55620.54320.55540.5732
500.53990.52860.55170.55780.54290.55830.5743
1000.53550.52390.54760.55550.53900.55370.5688
0.50 densedifference50.49420.48850.50020.48250.48640.48030.5065
100.50980.50380.51590.50500.50370.50330.5275
200.51520.50890.52170.51010.51160.50710.5339
300.51480.50850.52120.51300.51340.50930.5359
500.51220.50590.51870.50800.50950.50700.5333
1000.50670.49980.51390.50780.50650.50420.5290
0.50 densesquare50.49440.48860.50030.48280.48540.48040.5056
100.50950.50360.51560.50430.50330.50330.5273
200.51460.50830.52110.50860.51120.50530.5330
300.51360.50720.52000.50960.51130.50670.5339
500.50880.50220.51550.50660.50680.50410.5296
1000.49930.49210.50660.50210.50070.49790.5227
0.75 sparse(all)difference50.51350.50430.52300.57060.53540.56940.5714
100.53460.52440.54520.59840.56120.59970.6010
200.54120.53010.55280.61380.56920.61630.6124
300.54090.52950.55280.61990.57120.62330.6166
500.53770.52620.54960.61990.56970.62270.6147
1000.53080.51930.54270.61530.56250.62060.6097
0.75 sparse(all)square50.50550.49660.51480.55000.52350.54780.5546
100.53210.52230.54230.58730.55490.58830.5923
200.54460.53370.55600.61080.57290.61260.6136
300.54830.53720.55990.61720.57680.62080.6198
400.54920.53780.56120.61930.57550.62170.6188
500.54820.53670.56010.61850.57430.62080.6173
1000.54300.53130.55520.60800.56540.60860.6064
0.75 densedifference50.49520.48940.50110.53840.51430.53460.5433
100.51330.50730.51940.56630.53200.56280.5660
200.52210.51570.52870.57560.54460.57170.5778
300.52010.51380.52660.57440.54430.57170.5781
500.51810.51170.52460.57680.54330.57270.5773
1000.51460.50750.52190.57510.53780.56950.5718
0.75 densesquare50.49510.48940.50110.53870.51340.53350.5416
100.51420.50820.52030.56540.53200.56160.5654
200.52110.51470.52770.57570.54430.57220.5780
300.52000.51360.52660.57220.54250.56900.5759
500.51550.50880.52230.57300.53780.56910.5725
1000.50850.50130.51600.56860.53160.56330.5654
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Ribadas-Pena, F.J.; Cao, S.; Darriba Bilbao, V.M. Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders. Mathematics 2022, 10, 2867. https://doi.org/10.3390/math10162867

AMA Style

Ribadas-Pena FJ, Cao S, Darriba Bilbao VM. Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders. Mathematics. 2022; 10(16):2867. https://doi.org/10.3390/math10162867

Chicago/Turabian Style

Ribadas-Pena, Francisco J., Shuyuan Cao, and Víctor M. Darriba Bilbao. 2022. "Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders" Mathematics 10, no. 16: 2867. https://doi.org/10.3390/math10162867

APA Style

Ribadas-Pena, F. J., Cao, S., & Darriba Bilbao, V. M. (2022). Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders. Mathematics, 10(16), 2867. https://doi.org/10.3390/math10162867

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop