1. Introduction
Music is one of the cornerstones of cultural heritage [
1]. Throughout history, the main means of transmitting and preserving this art has been its engravement in so-called music scores, i.e., documents in which music composers graphically encode a piece of music as well as the way to perform it [
2]. This particular visual language is identified as music notation and is known to be considerably different depending on epoch, culture, instrumentation, and genre, among many others [
3].
This engraving process was carried out manually until the end of the 20th century, mostly done by handwriting or typesetting processes; thus, millions of music documents are only available as physical documents [
4]. While large quantities of them have been scanned into digital images, only a few hundreds have been stored in a structured digital format that allows tasks such as indexing, editing, or critical publication [
5]. Since its manual transcription constitutes a costly, tedious, and prone-to-error task, the development of systems capable of automatically performing this process is of remarkable importance, even more so considering the vast amount of existing music archives [
6]. Optical Music Recognition (OMR) is the research field that investigates how to computationally read music notation in printed documents and to store them in a digital structured format [
7].
One of the main challenges in this field lies in the aforementioned heterogeneity of the documents, especially when considering the different notation styles present in historical documents [
8]. Thus, while OMR has been an active research area for decades [
9,
10], such particularity prevents the field from providing general developments for any type of document.
Traditionally, most OMR systems rely on a multi-stage pipeline, which, in a broad sense, is devoted to segmenting the complete score into single units (e.g., staves or symbols), which are eventually recognized [
11] using traditional machine learning or signal processing techniques. However, it must be noted that such approaches achieve competitive recognition rates at the expense of using certain heuristics adequately designed for a particular notation and writing/printing style, being only applicable to the case for which they were devised [
12]. In this regard, this framework becomes considerably limited in terms of scalability given that a new pipeline and a set of heuristics must be designed for each particular document and/or notation.
Recent developments in the so-called
deep learning paradigm have led to a considerable renewal of learning-based approaches [
13]. This area represents the current state-of-the-art in a wide range of applications, with
image classification being a particular example in which traditional machine learning techniques seemed to have reached a glass ceiling [
14]. Hence, it is not strange that OMR has severely benefited from that field to somehow palliate its issues.
One of the key advantages of deep learning is that it allows for the creation of the so-called
holistic or
end-to-end neural solutions to tackle recognition tasks, i.e., systems in which both feature extraction and classification processes are learned jointly [
15]. This particularity prevents the need for designing an entire pipeline for each particular case of study since the traditionally hand-crafted tasks are directly inferred from the data itself. Such a functionality has been largely explored in the literature, with OMR being a case in which it has clearly proven its usefulness [
16], at least in symbol recognition at the isolated staff level. It must be noted that, while generative methods based on Hidden Markov Models have also been considered for holistic OMR as in the works by Pugin [
17] or Calvo-Zaragoza et al. [
18], the performance of the commented neural methods outperforms them. Nevertheless, there is still room for improvement in neural-based holistic OMR by exploiting the inherent particularities of this type of data, which have not been totally explored.
When considering an
agnostic music representation [
19], i.e., a representation based on the graphical content rather than its musical meaning, symbols depict a two-dimensional nature. As a consequence, every single element reports two geometrical pieces of information at once [
20]: the
shape, which encodes the temporal duration of the event (sound or absence of it), and the
height, which indicates the pitch of the event represented with its vertical position in the staff. However, when holistic OMR systems are trained, each possible combination of shape and height is represented as unique categories, which leads to music symbols being treated the same way as text characters [
21]. Based on this premise, some recent research OMR works [
22,
23,
24,
25] have explored the aforementioned particularities of music notation, concluding that the individual exploitation of each dimension generally yields better recognition rates.
In this work, we propose to further explore and exploit the two-dimensional nature of music symbols and, more precisely, we focus on symbol recognition at a staff-line level of monophonic early music documents. Considering the end-to-end neural-based framework by Calvo-Zaragoza et al. [
20] as a starting point, we propose and discuss the separate extraction of shape and height features and then propose several integration policies. The neural end-to-end approach presented in this work fills a gap in the related literature as this double dimension has not been previously addressed in this particular context, thus constituting a clear innovation in the OMR field. The results obtained for two different corpora show that our proposal outperforms the recognition performance of the baseline considered, even in cases in which there is considerably narrow room for improvement.
The rest of the paper is organized as follows:
Section 2 overviews some related proposals for this topic;
Section 3 thoroughly develops our proposed approach;
Section 4 describes the experimental setup;
Section 5 presents and analyzes the results;
Section 6 provides a general debate of the insights obtained in the work; and finally,
Section 7 concludes the present work, along with some ideas for future research.
2. Background
Even though the term OMR encompasses a vast range of scenarios—research might vary depending on the music notation system or the engraving mechanism of the manuscripts—most of the existing literature is framed within a multi-stage pipeline that divides the challenge into a series of independent phases [
26]. Bainbridge and Bell [
9] properly described and formalized the de facto standard workflow, which was later thoroughly reviewed by Rebelo et al. [
10]. This sequential pipeline comprises four main blocks: (i)
image preprocessing, which aims at palliating problems mostly related to the scanning process and paper quality; (ii)
symbol segmentation and classification, which focuses on the detection and actual labeling of different elements of the image meant to be recognized; (iii)
reconstruction of the music notation, which postprocesses the recognition process; and (iv) an
output encoding stage that stores the recognized elements into a suitable symbolic format. In this work, we focus on the symbol segmentation and classification one.
The inclusion of deep learning strategies in the OMR field produced a shift towards the use of end-to-end or holistic systems for symbol recognition [
27]. Some examples of work addressing the recognition process at a staff level may be found in the literature for common Western notation [
12,
16,
28,
29] as well as mensural [
20], neumatic [
30], and ancient organ tablature [
31] handwritten music. While our proposal focuses on staff-line symbol recognition, it must be noted that research efforts are also devoted to addressing the issue of full-page recognition such as the proposal by Castellanos et al. [
21].
In this context of staff-level holistic approaches for symbol recognition, most approaches rely on the use of architectures based on
Convolutional Recurrent Neural Networks (CRNN). Such schemes work on the premise of using the convolutional stage to learn the adequate features for the case at issue, with the recurrent part being devoted to modeling the temporal (or spatial) dependencies of the symbols. This type of design has been mainly exploited using
Sequence-to-Sequence (seq2seq) architectures [
22,
25] or considering the
Connectionist Temporal Classification (CTC) training mechanism [
12,
16,
20,
28]. However, the work by Ríos-Vila et al. [
25] shows that, when targeting the same handwritten corpus, CRNN-seq2seq models are not competitive against CRNN trained with CTC.
As previously introduced, music notation depicts a two-dimensional nature since each symbol is actually defined by a combination of a certain shape or glyph and its height or vertical position within the staff lines. While proven to be beneficial, this particularity has been rarely exploited in the literature, with some examples being the work by Nuñez-Alcover et al. [
23], which shows that separately performing shape and height classification is beneficial in the context of isolated symbol classification of early music manuscripts; by van der Wel and Ullrich [
22] and Ríos-Vila et al. [
25], in which this conclusion was further reinforced in a context of CRNN-seq2seq models; and by Villarreal and Sánchez [
24], which also proved the validity of such an exploitation by integrating the shape and height information at a Language Model level, which was based on Hidden Markov Models rather than on the actual optical estimation.
For all of the above, this paper presents a novel proposal that takes advantage of the two-dimensional nature of music symbols in the context of neural end-to-end OMR systems at the staff level. The basic premise in this case is that, instead of relying on a single CRNN scheme for processing each staff, we can devote two CRNN schemes to concurrently exploit the shape and height information and to eventually merge them at the actual neural level. As will be shown, this separate exploitation of the music symbol information provides a remarkable improvement in terms of recognition performance when compared to the baseline in which this duality is not considered.
3. Methodology
This section describes the recognition framework of the work and the different proposals for properly exploiting the dual nature of agnostic music notation.
The proposed OMR recognition task works at the staff level; thus, we assume that a certain preprocess (e.g., [
32,
33]) already segmented the different staves in a music sheet. Our goal is, given an image of a single staff, to retrieve the series of symbols that appear therein, i.e., our recognition model is a
sequence labeling task [
34].
Formally, let represent a space of music staves. Additionally, let represent a symbol vocabulary and be the complete set of possible sequences that may be obtained from that vocabulary.
Since we are dealing with a supervised learning task, we assume the existence of a set , which relates a given staff to the sequence of symbols . Furthermore, we assume that there is an underlying function , which is the one we aim to approximate as .
In this work, given its competitive performance reported in the literature, we consider the introduced Convolutional Recurrent Neural Network (CRNN) scheme together with the Connectionist Temporal Classification (CTC) training algorithm [
35] for approximating function
. Based on this premise, we derive different neural designs for performing the recognition task.
The rest of this section further develops the idea of the two-dimensional natural nature of music symbols when considering agnostic notation and its application in our case as well as the different neural architectures considered.
3.1. Symbol Representation
As commented upon, the agnostic notation considered allows for defining each music symbol using its individual shape and height graphical components. Note that this duality can be applied to all symbols, even to those that indicate the absence of sound, i.e., rests, because they may also appear at different vertical positions.
Let and be the spaces for the different shape and height labels, respectively. Formally, represents the set of all possible music symbols in which a given ith element is denoted as the 2-tuple . However, while all aforementioned combinations are theoretically possible, in practice, some pairs are very unlikely to appear; hence, . In a practical sense, to facilitate the convergence of the model, we restrict the vocabulary to the 2-tuples elements present in the corpus, i.e., .
Figure 1 shows a graphical example of the commented agnostic representation for a given symbol in terms of its shape (
), height (
), and combined labels (
).
3.2. Recognition Architectures
The architecture of both recognition frameworks, baseline and proposed, are described below.
3.2.1. Baseline Approach
A CTC-trained CRNN is considered the state-of-the-art for the holistic approach in OMR, having been successfully applied in a number of work in the literature [
12,
16,
20,
28]. As aformentioned, this architecture models the posterior probability of generating a sequence of output symbols given an input image.
A CRNN is formed by an initial block of
convolutional layers followed by another group of
recurrent stages [
28]. In such a particular configuration, the
convolutional block is meant to learn adequate features for the case at issue, while the set of
recurrent layers model the temporal or spatial dependencies of the elements from the initial feature-learning block.
The network is trained using the commented upon Connectionist Temporal Classification (CTC) training function [
35], which allows for training the CRNN scheme using unsegmented sequential data. In our case, this means that, for a given staff image
, we only have its associated sequence of characters
as its expected output, without any correspondence at the pixel level or similar input-output alignment. Due to its particular training procedure, CTC requires the inclusion of an additional “
blank” symbol within the
vocabulary, i.e.,
.
During the prediction or decoding phase, CTC assumes that the architecture contains a fully connected network with
outputs and a
softmax activation. While there exist several possibilities for performing this inference phase [
36], we resort to so-called
greedy decoding for comparative purposes with the considered baseline works. Assuming that the recurrent layer outputs sequences of length
K, this approach retrieves the most probable symbol per step. Equation (
1) mathematically describes this process.
where
represents the activation probability of symbol
and time-step
k and where
is the retrieved sequence of length
K.
Eventually, a squash function that merges consecutive repeated symbols and removes the blank label is applied to the sequence obtained as an output of the recurrent layer. Thus, the actual predicted sequence is obtained as , where .
In this work, we consider as the baseline the particular CRNN configuration proposed in the work by Calvo-Zaragoza et al. [
20]. This neural model comprises four convolutional layers for the feature extraction process followed by two recurrent units for the dependency modeling. As commented upon, the output of the last recurrent layer is connected to a dense unit with
output neurons. A graphical representation of this configuration is depicted in
Figure 2.
It must be finally noted that the different parameters for each layer is commented upon in
Section 4, which deals with the actual experimentation carried out.
3.2.2. Proposed Approach
This section presents the different neural architectures proposed for exploiting the individual shape and height properties when considering an agnostic representation of music notation. For that, we modify the base CRNN architecture introduced in
Section 3.2.1 by adding different layers to adequately exploit such pieces of information with the aim of improving the overall recognition rate.
More precisely, our hypothesis is that having two CRNN models that are specialized in retrieving the shape and height features may be beneficial with respect to having a unique system that deals with the task as a whole. The input staff image is individually processed by each model, and the different characteristics obtained by each single model may be gathered at some point of the neural model before the actual classification process.
Based on that premise, we propose three different end-to-end architectures that basically differ on the point in which the two CRNN models are joined: (i) the
PreRNN one, which joins the extracted features by each model right before the recurrent block; (ii) the
InterRNN one, which performs this process after the first recurrent layer; and (iii) the
PostRNN one, which gathers both sources of information after the recurrent block. These proposals are graphically shown in
Figure 3.
As it may be noted, all models depict three differentiated parts. Out of these three parts, two of them constitute complete CRNN models specialized on a certain type of information (either shape or height) while the third one is meant to join the previous sources of information for eventual classification. These parts are thoroughly described below:
Shape branch: a first CRNN model that focuses on the holistic classification of musical symbols in terms of their shape labels. Its input is the initial staff , while its output is a sequence of shape symbols.
Height branch: the other CRNN model devoted to recognition of the vertical position labels. In this case, a sequence of height symbols is retrieved out of the initial staff .
Combined branch: the one that combines the extracted features of the other two branches to perform joint estimation of music symbols in terms of their combined <shape:height> labels. Thus, given an initial input staff , the branch retrieves a sequence of combined labels.
Note that all branches are separately trained using the same set of staves with the CTC learning algorithm, simply differing on the output vocabulary considered. This way, we somehow bias the different shape and height CRNN branches to learn specific features for those pieces of information, whereas in the case of the combined branch, the training stage is expected to learn how to properly merge those separate pieces of information.
In a practical sense, in this work, we consider two different policies for training the introduced models: a first scenario in which the parameters of the entire architecture are learned from scratch, i.e., we sequentially train the different branches without any particular initialization, and a second case in which the shape and height branches are separately trained and, after their convergence, the same procedure of the first scenario is reproduced. In this regard, we may assess how influential the initial training stage of the different branches of the scheme is on the overall performance of the system.
It must be remarked that all proposed architectures perform the feature-merging operation after the convolutional block, having discarded other policies which may affect this stage. The reason for this is that, since the convolutional block is responsible for extracting the appropriate features out of the input image staff, we consider that the potential benefit may be achieved by merging those specialized characteristics in the recurrent stage.
While these new architectures suppose an increase in the network complexity with respect to the base model considered, separate exploitation of the graphic components of the commented agnostic music notation is expected to report an improvement in terms of recognition. Nevertheless, this increase does not imply a need for using more data since each part of the network specializes in a set of image features. Finally, note that, while the training stage may require more time until convergence, the inference phase spans practically the same time-lapse as in the base network since the different branches work concurrently before the merging phase.
5. Results
Having introduced the considered experimental scheme and the different models proposed, we now present and discuss the results obtained. Since the experiments have been performed in a cross-validation scheme, this section provides the average values obtained for each of the cases considered. It must be remarked that the figures reported constitute that of the test data partition for the case in which the validation data achieve their best performances.
The results obtained in terms of the Symbol Error Rate (Sym-ER) for the base neural configuration and the three different proposed architectures for each data corpus considered are shown in
Figure 5. Note that these recognition models consider the
vocabulary case, i.e., each detected element is represented as the 2-tuple that defines both the shape and pitch of the symbol.
An initial remark is that, as it can be appreciated in
Figure 5, all proposed architectures improve the results obtained by their respective baselines except for the case of the
PreRNN architecture in the
raw scenario for the
Capitan corpus. By examining this graph, we may also check that the
InterRNN model is the one that consistently achieves the best error rates for all scenarios considered, tied only with the
PostRNN proposal for the
raw case of the
Il Lauro Secco set.
It must be also noted that, for all cases, the error rates obtained with the Capitan corpus are remarkably higher than the ones obtained with the Il Lauro Secco set. This is a rather expected result since the graphical variability inherent to handwritten data compared to the typeset format supposes a drawback to the recognition algorithm. Moreover, the latter corpus depicts a lower vocabulary size, which also simplifies the complexity of the task.
Having visually checked the behavior of the different schemes considered,
Table 3 numerically reports these Symbol Error Rate (Sym-ER) figures as well as the Sequence Error Rate (Seq-ER) ones for the same scenarios considered for a more detailed analysis.
The trends observed in the sequence error rate are consistent with the ones already provided by the symbol error rate metric in
Figure 5. In general, for this Seq-ER figure of merit, all results improve the base neural configuration considered except for the
PreRNN model in two particular situations: when tackling the
Capitan set in the
raw scenario and for the
Il Lauro Secco corpus in the
pretrained case. The two other models, as commented upon, consistently outperform the baseline considered in all metrics and scenarios posed, with the
InterRNN case being the one achieving slightly better results than the
PostRNN one. A last point to comment upon from these figures is that, while differences between the
raw and
pretrained cases do not generally differ in a remarkable sense, convergence is always faster in the latter approach than in the former one.
We now further analyze this
InterRNN model since, as commented upon, it is the one showing the best overall performance. To do so, we compare its performance when decoupling the predicted sequence of labels into their shape and height components against the base neural architecture when trained specifically for the
and
vocabularies.
Table 4 provides the results of this analysis and reports the figures obtained for the baseline model.
As it may be checked, the results comparing the baseline and best model architectures when decoupling the combined labels into their shape and height components depict the expected behavior: for all cases and metrics, the best model decreases the error rate with respect to baseline. In the particular case of Capitan, the Symbol Error Rate decreases approximately by while, for the Il Lauro Secco, this improvement is around for both the shape and height individual vocabularies. The same trend is observed for the Sequence Error Rate, with the error decrease in the case of the shape vocabulary space for the Capitan corpus is especially noticeable, where the improvement is around .
We now compare the performance of the best model configuration with that of the shape and height models. Regarding the Capitan set, the error rates achieved when decoupling the resulting predictions into the and vocabularies are lower than those of the actual dedicated models. These results suggest that the configuration exploits the individual sources of information in a synergistic manner that leads to such an improvement. Nonetheless, in the case of the Il Lauro Secco corpus, since the shape and height architectures achieve better recognition rates than those of the decoupled evaluation, there is still some room for improvement, which suggests that this may have not been the best architecture for this particular set.
Finally, let us comment that the narrow improvement margin that the different corpora depict must be taken into consideration. Focusing on the Capitan corpus, the baseline model depicts an initial Symbol Error Rate of , which is reduced to with the proposed architecture. While the absolute error reduction is , it must be noted that it supposes an improvement of with respect to the initial figure. Similarly, the Il Lauro Secco improves from an initial value of to , which is an absolute improvement of but implies an error reduction of with respect to that of the baseline model.
Statistical Significance Analysis
While the results obtained report that there is an improvement in the overall performance of the recognition task, at least when considering some of the configurations proposed, we now assess the statistical significance of the improvement. For that analysis, we resort to the nonparametric Wilcoxon signed-rank test [
41] with a significance value of
.
Since the idea to compare whether the proposed architectures improve the base neural model, the analysis considers that each result obtained for each fold of the two corpora constitutes a sample of the distributions to be compared. It must be noted that, in this significance test, we focus on the symbol error rate as the metric to be analyzed.
Considering this assessment scheme, the results obtained are reported in
Table 5.
A statistical assessment of the results obtained shows two clear conclusions, which were already intuitive from the previous analysis. On the one hand, the performance of the PreRNN architectures does not significantly differ from the baseline model. On the other hand, both of the InterRNN and PostRNN models do significantly improve the results of the baseline scheme since, as it can be seen, the error rate is consistently superior in the latter model than in the hybrid schemes.
As a last point to mention, the InterRNN and PostRNN models were also confronted with the Wilcoxon signed-rank test in the same conditions as in the previous analysis. Nevertheless, as expected from the average and deviation error rate figures aforementioned, there was no statistical difference between them.
6. Discussion
Current state-of-the-art OMR technologies, which are based on Convolutional Recurrent Neural Networks (CRNN), typically follow an end-to-end approach that operates at the staff level: they map the series of symbols that appear in an image of a single staff to a sequence of music symbol labels. Most commonly, these methods consider an agnostic music representation that defines every symbol as a 2-tuple element that encodes its two graphic components: shape and height. However, as mentioned in
Section 1, holistic OMR systems [
12,
16,
20] do not take advantage of this double dimension since each possible combination of shape and height is represented as a unique category. This leads to a recognition formulation completely equivalent to the one used in fields as text recognition [
21]; hence, the particularities of this data and notation are neglected in the framework.
Nevertheless, we consider that exploitation of this trademark of music notation could increase the recognition rates, as recent work have provided insights about its benefits [
22,
23,
24,
25]. Considering the end-to-end neural-based framework by Calvo-Zaragoza et al. [
20] as a starting point, we pose the following hypothesis: devoting independent CRNN schemes to separately exploit the two aforementioned graphical components of agnostic labeling and, in time, gathering them at the actual neural level might boost the overall performance of the model. In a practical sense, the considered CRNN architectures for each graphical component are equivalent to that of the starting point to perform a fair performance comparison between both the existing and proposed approaches.
In this work, we empirically evaluated three different integration policies differing in the point in which the specialized branches are merged: the
PreRNN, which merges the information after the convolutional stage; the
InterRNN, which joins the different branches after a first recurrent layer; and the
PostRNN, which gathers the information after a second recurrent stage. These architectures are evaluated on two collections of monophonic early music manuscripts, namely the
Capitan and the
Il Lauro Secco corpora. The results presented in
Section 5 show that the merging point affects the performance, as both the
InterRNN and
PostRNN models significantly improve the results of the baseline considered. Moreover, the
InterRNN model yields the best overall performance, reducing the baseline error rate by
and
for the
Capitan and
Il Lauro Secco corpora, respectively.
In spite of this benefit, the proposed framework entails an increase in the network complexity with respect to the base model considered. Nevertheless, this increase does not imply more training data as each part of the network is meant to specialize in certain features of the staff image. Moreover, regarding processing time, the increase in the complexity of the network does not affect the inference phase, as the different branches work concurrently. The main drawback in this case though is the training lapse, which is certainly expanded. However, note that this drawback is assumable given the remarkable error rate decrease with respect to the figures obtained with the baseline considered.
Finally, let us note the generalization capability of the proposal, regarding the reach and performance of this neural architecture when tackling different corpora. As mentioned in
Section 1, CRNN-based models are rather adaptable since the same architecture is capable of obtaining very competitive recognition rates on two very different corpora. Therefore, the inherent lack of generalization of heuristic systems mentioned in
Section 1 is considered solved under the CRNN paradigm.
7. Conclusions
Holistic symbol recognition approaches have proven their usefulness in the context of Optical Music Recognition (OMR) since, given that no alignment is required between the input score and the sequence of output elements, corpora are relatively easy to create. In such a context, these sets are commonly labeled using either a semantic notation, which codifies the actual musical meaning of each element in the score, or an agnostic representation, which encodes the elements as a combination of a shape tag and the vertical position (height) in the score. While this latter representation lacks the underlying music sense of the former, it has the clear advantage of perfectly suiting an image-based symbol recognition task as the vocabulary is defined directly on visual information. However, in spite of having been considered for a number of work, it is still unclear how to properly take advantage of the fact that each symbol is actually a combination of two individual primitives representing the shape of the element and its height.
This work presents an end-to-end approach that exploits this two-dimensional nature of the agnostic music notation to solve the OMR task at a staff-line level. We considered two Convolutional Recurrent Neural Network (CRNN) schemes to concurrently exploit the shape and height information and to then merge them at the actual neural level. Three different integration policies were empirically studied: (i) the PreRNN one, which joins the shape and height features right before the recurrent block; (ii) the InterRNN one, which does the merging process after the first recurrent layer; and (iii) the PostRNN one, which collects both sources of information after the recurrent block. The results obtained confirm that the gathering point impacts the performance of the model, with the InterRNN and PostRNN models being the ones that significantly decrease the error rate with respect to the baseline considered. Quantitatively, in the best-case scenarios, this error reduction ranges between 14.4% and 25.6% referred to the base neural model.
In light of the obtained conclusions, this work opens new research points to address. In that sense, future work considers extending the approach from this monophonic context to a homophonic one [
42]. Additionally, given the variability of music notation and the relative scarcity of existing labeled data, we aim to explore
transfer learning and
domain adaptation techniques to study different strategies to properly exploit the knowledge gathered from a given corpus on a different one. Moreover, we also consider that other network architectures may provide some additional insights to the ones obtained in this work as well as more competitive recognition rates. Finally, since the
greedy decoding strategy considered does not take advantage of the actual Language Model inferred in the recurrent layer of the network, our premise is that more sophisticated decoding policies may report improvements in the overall performance of the proposal.