1. Introduction
In the context of the current worldwide industrial demand for quality and efficiency in crop and food production, the importance of phenotyping grows every day. Plant phenotyping refers to a quantitative description of the plant’s physiological, biochemical and morphological properties, among others [1]. It consists of the identification of effects on the phenotype as a result of genotype differences and the environmental conditions to which a plant has been exposed [2]. The development of new usable technologies and their direct availability have driven the latest plant phenotyping approaches, which have already been applied in several environments [3]. These technologies have enabled phenotyping tasks to be performed with reduced time and monetary costs (much sought after by industrial actors) and remain the focus of researchers from different lines of investigation, who aim especially to provide realistic, applicable and suitable solutions. Proximal sensing approaches, such as spectroscopy sensors or hyperspectral imaging, have arisen in the last few years as fast, non-destructive resources for gathering crop spectral information that could characterize specific phenotyping traits; in-field methods are particularly relevant because of their desirable capability of providing in situ results.
Viticulture has benefited from recent research that has developed methods and procedures for several vine- and wine-related problems using near-infrared (NIR) spectroscopy. NIR spectroscopy is a powerful technology widely used in several agricultural areas due to its non-destructive nature and multi-parametric capabilities [4]. Spectroscopic sensors have proven to be fast enough for the real-time assessment of several grapevine-related traits, such as grape composition [5], grapevine petiole nutrient concentration [6] or the identification of grape berry sunburn symptoms [7]. Therefore, NIR technology arises as an attractive and promising tool for grapevine phenotyping in precision viticulture, especially considering that this technique is able to characterize more than one parameter using the same spectral measurement.
NIR devices are able to acquire large amounts of spectral data, making it necessary to manage them in efficient and automatic ways. Data mining has become one of the most valuable research fields in the last few years due to its knowledge discovery power, direct applicability in several areas and, especially, its proven effectiveness in the problems where it is applied. Data mining, through machine learning techniques among others, has provided procedures for both descriptive (characterization of the properties of the data) and predictive (learning and induction from the data for forecasting) tasks [9]. Some of the most widespread predictive techniques are decision trees [10], decision forests [11] and, particularly, artificial neural networks (ANNs) [12] and support vector machines (SVMs) [13], employed in several research areas, such as medicine [14], business and industry [15] or biology [16]. Support vector machines [17] are supervised learning methods used for classification and regression through the nonlinear mapping of the input data. SVMs transform the original dataset into a higher dimension using a kernel function and find an optimal separating hyperplane, the one that maximally separates the samples. Rotation forests [18] are machine learning ensemble methods that make use of several classification trees (hence the name) to build a meta-classifier. A rotation forest can be used for either classification or regression, depending on the kind of tree-based algorithm used. A robust regression tree is the M5 learner [19], which, although not as familiar as other estimation methods in spectroscopy, like partial least squares (PLS) [20], has demonstrated robustness and efficiency in other applications, such as pan evaporation prediction [21], low-flow forecasting modeling [22] or the water level-discharge relationship [
Two important grapevine phenotyping topics are varietal discrimination and water status assessment, tasks addressed in the literature in which spectroscopy has played a significant part in the last few years. Current varietal discrimination methods have shortcomings that are relevant from an industrial point of view, e.g., their need for a highly trained expert or their destructive nature [24]. Water status assessment especially suffers from this last issue, as well as from its time- and labor-consuming nature, along with the lower representativeness (limited number of samples measured) derived from it [25]. Grapevine varietal discrimination using spectroscopic data has been recently attempted by hyperspectral imaging under laboratory conditions [24]. Both in-lab and in-field water status assessment via spectroscopy have also been pursued, using several plant water condition indicators, such as stem water potential [27], leaf water potential [29] or leaf stomatal conductance [26]. It is worth highlighting that all of the mentioned studies share one common factor: the use of PLS as the model training algorithm. PLS is a widespread statistical technique commonly used in spectroscopy for the regression of chemometric parameters. Qualitative prediction (e.g., discrimination among discrete classes) can also be achieved using PLS (as in [30], where a binary classification is translated into a regression of two natural numbers) or via a purely discrete classification method, like partial least squares discriminant analysis (PLS-DA) [31]. Still, discrimination models built with PLS-based approaches have not yielded remarkable results when a considerably large number of classes is involved. Hence, it is attractive to apply less often used data mining techniques to the modeling of NIR spectra, which could bring fast, in-field solutions for these two grapevine phenotyping tasks to commercial and industrial users.
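The idea of translating a binary classification into a regression of two natural numbers can be illustrated with a minimal sketch. It is not the method of [30]: ordinary least squares on a single hypothetical feature stands in for PLS, and the feature values, the class codes 1 and 2 and the 1.5 threshold are all assumptions chosen for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical single-feature data; the two classes are encoded as 1 and 2.
feature = [0.10, 0.15, 0.20, 0.60, 0.65, 0.70]
label = [1, 1, 1, 2, 2, 2]
slope, intercept = fit_line(feature, label)

def classify(value, threshold=1.5):
    """Assign the class whose numeric code is closest to the regressed value."""
    return 1 if slope * value + intercept < threshold else 2

predicted = [classify(v) for v in feature]
```

With more than two classes, this numeric encoding imposes an artificial ordering on the class codes, which is one reason a dedicated classifier may be preferable.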
The goal of this study was to evaluate the combined use of different data mining techniques along with a non-destructive NIR portable sensor for the in-field grapevine phenotyping of two concrete traits: the variety classification and the estimation of the plant water status.
4. Discussion
In this work, the appraisal of two important phenotyping features in agriculture (grapevine varietal discrimination and water status assessment) has been addressed with an innovative approach that successfully combines in-field measurement, using a proximal and non-invasive sensor, with different data mining processing methods. The results obtained show the potential of effectively applying data mining techniques to the spectral information retrieved from a non-destructive and proximal NIR sensor for grapevine phenotyping of two key traits.
Regarding variety classification, most of the widely used methods for grapevine varietal discrimination have traditionally been either destructive or time-consuming, like classic ampelometry [40] (which relies on expert visual description and is still prone to a considerable level of bias due to its human nature), DNA analysis [41] or wet chemistry techniques [42] (carried out by trained people and through destructive methods).
In our work, the 10-class variety classification models using SVMs from non-invasively acquired leaf spectra yielded 88.7% and 92.5% of correctly discriminated samples for cross- and external validation, respectively. These high percentages allow one to be reasonably optimistic about the suitability of SVMs for grapevine varietal classification. They are also supported by the high scores of additional classification statistics, such as the average precision (with a perfect score in several cases and high mean values) and AUCs (an average of 0.991 and 0.997 for cross- and external validation, respectively).
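As a reminder of how the percentage of correctly classified samples and the per-class precision follow from a confusion matrix, a minimal sketch is given below. The 3-class matrix is hypothetical and its counts are made up for illustration; they are not results from this study.

```python
def accuracy(cm):
    """Fraction of correctly classified samples (diagonal over total)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def precision_per_class(cm):
    """Per-class precision: true positives over all samples predicted as that class."""
    n = len(cm)
    result = []
    for j in range(n):
        predicted_j = sum(cm[i][j] for i in range(n))
        result.append(cm[j][j] / predicted_j if predicted_j else 0.0)
    return result

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy_cm = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
toy_accuracy = accuracy(toy_cm)
toy_precision = precision_per_class(toy_cm)
```

A class reaches a perfect precision score whenever no sample of another class is assigned to it, as happens for the third class of this toy matrix.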
Only very recently, grapevine varietal classification has been attempted by hyperspectral imaging [24] and with an NIR portable spectrophotometer [43]. In [24], hyperspectral imaging in the range between 280 nm and 1028 nm was used along with PLS for the classification of 300 leaves from three different varieties (Tempranillo, Grenache and Cabernet Sauvignon), under laboratory conditions. The cross-validation method used (Monte Carlo) yielded more than 92% of correctly classified samples in all cases. The outcomes reached in the present work, even when a large number of varieties was selected for the training, highlight the accuracy shown by data mining techniques for the same goal, particularly since the spectra were collected in the field and in a non-destructive way, unlike in [24], where a hyperspectral camera was used indoors under controlled illumination conditions. In [43], the authors used a portable NIR spectrophotometer of the same range as the one in this work for the acquisition of leaf spectra. Artificial neural networks (ANNs) and SVMs trained by sequential minimal optimization were tested as classification algorithms for the development of two grapevine discrimination models following two different approaches: a site-specific model for 20 varieties (yielding 87.25% of correctly classified samples, using ANNs) and a global model using six varieties from different vineyards and seasons (obtaining 77.08%, again with ANNs). The higher percentages obtained in the present study could be explained by the selected SVM algorithm (ν-SVM versus sequential minimal optimization), as well as by the reduced number of classes.
Varietal discrimination using NIR spectroscopy has also been recently performed for waxy corn seed [44] using SVM and in strawberry [45] and plum [46] using PLS-DA. From these works, it is remarkable that a purer data mining technique, SVMs [44], behaved better than the statistical method PLS, commonly used in spectroscopy and chemometrics, supporting the suitability and adaptability of machine learning approaches for a wide range of problems and specifically for NIR spectroscopy. Five- and four-class varietal discrimination using PLS-DA was achieved in [46], obtaining 69% and up to 96.5% of correctly classified samples, respectively; these accuracies are lower than and similar to those of the present grapevine varietal discrimination, bearing in mind that the number of varieties was reduced by half.
The flexibility, generalization capability and accuracy that data mining techniques have shown for discrimination problems in many dissimilar fields, confirmed here by the results of the grapevine varietal classification via SVMs, demonstrate how well data mining algorithms fit classification problems, specifically when working with NIR spectroscopy from proximal sensors.
Current water status assessment methods are mostly destructive, labor intensive and thus expensive, and, in many cases, can only be applied to a limited number of samples, which jeopardizes their representativeness and makes them unsuitable for characterizing the spatial variability of a vineyard’s water status. Therefore, new non-invasive and fast approaches are needed.
For the regression of ψstem conducted through rotation forests and M5 trees, the calibration R² and RMSE reached 0.97 and 0.083 MPa, respectively, while the cross- and external validation results were virtually identical (R² = 0.84; RMSE = 0.165 MPa). A relatively large divergence between calibration and validation results was found, with the validation RMSE nearly double that of the calibration. Still, this difference of 0.082 MPa, although relatively wide, remains small in absolute terms, particularly when compared to the standard deviation of the population ψstem values (0.396 MPa), which is almost five times larger. A high calibration determination coefficient is generally to be expected when using data mining and machine learning techniques. Moreover, the training of decision trees is very sensitive to the examples used as input, which carry great weight for the algorithm (as it tries to extract underlying rules and correlations between the independent and dependent variables), so high scores are likely when testing with the same set on which the algorithm was trained. Calibration results should therefore be treated carefully when applying data mining algorithms and contrasted with values from validation processes. However, the high scores obtained for both cross- and external validation give considerable confidence that overfitting is not a problem.
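The arithmetic behind this comparison can be checked with a short sketch. The three figures are those quoted in the text; the `rmse` and `r_squared` helpers use one common definition of each statistic and are an assumption, since the exact formulas used for the reported values are not restated here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Determination coefficient as 1 - SS_res / SS_tot (one common definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Figures quoted in the text, in MPa.
rmse_calibration = 0.083
rmse_validation = 0.165
sd_population = 0.396

rmse_gap = rmse_validation - rmse_calibration                   # about 0.082 MPa
validation_vs_calibration = rmse_validation / rmse_calibration  # nearly 2
sd_vs_gap = sd_population / rmse_gap                            # almost 5
```

The last ratio is what supports reading the gap as small relative to the natural spread of ψstem in the sampled population.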
Additionally, the good outcomes obtained from the variety-specific model (although slightly lower than those of the multi-varietal one) show the robustness of data mining algorithms for the accurate prediction of ψstem of samples from different seasons and locations when the models are properly trained with both kinds of examples. This suggests that support vector machines are able to assess the grapevine water status within one variety and that the variety can be discarded as a driving factor in good water status prediction. Still, had the variety-specific model had a higher number of samples and/or a wider range in the water status reference parameter (ψstem), its performance would probably have yielded higher R² and lower RMSE values. It should not be omitted that the Tempranillo dataset, compared to the multi-varietal one, contained a lower number of samples (56 vs. 120) and a narrower ψstem range ([−1.85, −0.8] vs. [−1.85, −0.42], MPa).
Stem water potential, as an indicator of plant water stress, has been previously predicted by NIR-based models developed using PLS regression [29], returning determination coefficients between 0.71 and 0.85 (and error values around 0.1 and 0.2 MPa). The in-field multi-varietal study performed in the current work, using rotation forests and M5 trees, returned a similar determination coefficient for cross- and external validation, while using a considerably higher number of varieties. The fact that both studies ([26] and this one) clearly resulted in high ψstem correlations from two different and scarcely overlapping NIR regions suggests that NIR spectral measurements from non-destructive sensors are suitable for water status prediction.
Models for the grapevine ψleaf [28] and ψstem in olive trees [27] were developed by VIS/NIR spectroscopy. These works have in common the use of PLS as the model training method, returning moderate values of cross-validation correlation (R² from 0.45 to 0.74) that are noticeably surpassed by the results from the rotation forest and M5 tree models described in this work, which supports the effective application of data mining techniques to NIR spectral data for the estimation of ψstem and hence the assessment of plant water status. Additionally, it must be highlighted that the spectral range used in this study (1600 to 2400 nm) completely covered the absorption band (O–H) corresponding to the water vibrational band (1940 nm) [47], which could be one of the reasons for the high sensitivity to ψstem changes; thus, a good predictive model could be obtained from this spectral range.
To the best of our knowledge, few studies have made use of data mining algorithms for water status assessment. In [48], the authors built an artificial neural network for the in-lab estimation of relative water content (RWC) from hyperspectral imaging of grapevine leaves in the range from 900 nm to 1700 nm. The authors asserted that the generated models (with an average absolute error below the 3% mark) were leaf side, varietal and even clone dependent. Although no direct comparison can be made with the present work, because RWC was used as a water status indicator instead of ψstem, both results show the accuracy of combining NIR spectroscopy with data mining and machine learning techniques for the reliable assessment of plant water status, and specifically that of grapevine.
The selection of a proper estimation method for a concrete algorithm and dataset is crucial for the evaluation of the results. In multivariate chemometrics, a classic approach of performance evaluation has been the dataset split into calibration (or training) and test partitions [49]. Although the use of the same dataset for the training and testing is not generally recommended, because the obtained results are overly optimistic [9], its value could be considered as an upper limit to what may be expected in other settings (e.g., cross- and external validation).
k-fold and leave-one-out cross-validation methods [50] are widely used in data mining and chemometrics. The selection of k = 5 for the cross-validation in the present work, maintaining the same 80:20 ratio as in the external validation, can lead to higher consistency and reliability of the results obtained in both experiments.
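The correspondence between k = 5 and the 80:20 ratio can be made explicit with a minimal sketch. The sample count of 120 matches the multi-varietal dataset size mentioned above; the contiguous, unshuffled fold assignment and the `k_fold_indices` helper are assumptions for illustration and do not describe how the folds were actually drawn in this work.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample when k does not divide n.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_indices(120, 5))
# Each fold trains on 96 samples and tests on 24, i.e., an 80:20 split.
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

Every sample is held out exactly once across the five folds, so the cross-validated statistics are computed over the whole dataset.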
Also remarkable is the dual use of the spectral measurements obtained with the same NIR sensor. The capability of effectively addressing these two grapevine phenotyping traits from a single leaf spectral measurement, along with its rapid, non-destructive and in-field nature, makes the almost direct implementation of a grapevine phenotyping system on an NIR device a reasonable goal, supported by the precision of the developed models and by the characterization of concrete and sufficient sets of samples for the training.