1. Introduction
Today, agricultural production faces the dual challenge of meeting increasing food demand and mitigating the consequences of the gradual reduction in cultivated land area by enhancing productivity in a sustainable manner [
1]. Additionally, there is a pressing need for more effective, more nutritious, and safer food production methods to safeguard both human health and the planet [
Agricultural automation has been put forward as a solution that can boost productivity and improve quality and resource-use efficiency [
1].
However, generating solutions for agricultural production is a complex task that requires the consideration of several variables. One critical variable is the nutritional status of crops, which is determined by approximately 14 fundamental nutrients that plants require for their growth [
3]. Each of these nutrients is required in specific amounts and plays an essential role in crop metabolism. Among them, nitrogen, phosphorus, and potassium are needed in much larger quantities [
4]. In particular, phosphorus (P) plays a crucial role in plant processes such as growth, reproduction, flowering, and environmental adaptation. Plants absorb P in the form of inorganic phosphate (Pi). However, the concentration of Pi in the soil is typically quite low because it tends to bind strongly to soil particles or form insoluble complexes, rendering most of it immobile and inaccessible for plant uptake [
5]. To maintain high productivity levels, a continuous supply of Pi from fertilizers is required. The contribution of phosphorus, like other nutrients, needs to be carefully regulated according to the specific growth stage of the plant [
4]. Therefore, it is crucial to assess and monitor the nutritional status of the crop throughout its entire life cycle. Traditionally, the assessment of nutritional status has relied on visual inspection, which has inherent limitations in terms of precision, as it is primarily a qualitative approach [
3]. Alternatively, more accurate methods involve analyzing nutrient concentrations in either leaves or soil. However, these techniques can be costly, as they require not only chemical processes but also the transportation of samples and the interpretation of results [
6].
Consequently, many technologies have been explored to overcome these problems. Given that nutritional deficiencies primarily manifest through visual characteristics, several of the explored options are based on automatic image processing methods [
7]. Within these options, artificial vision stands out as a competitive choice due to its versatility and autonomy. Specifically, deep learning techniques employing convolutional neural networks (CNNs) have shown remarkable performance, surpassing traditional approaches based on texture or color analysis of images [
8].
The development of deep learning models typically involves a supervised process called end-to-end learning, which relies on known training data to make predictions on unknown data [
9]. However, there are several limitations to the applicability of CNN-based methods. Perhaps the most important is the amount of data the network needs to learn the characteristics of the images. Obtaining the required number of high-quality images with accurate labeling is a significant challenge, even more so in the agricultural case, where the field environment is often difficult to access and visual signs of interest are not always present or isolated [
6,
10]. To address these challenges, a commonly employed strategy is transfer learning, which involves utilizing networks pre-trained on extensive datasets. This technique not only reduces the amount of data and the computational cost needed to train the network but also allows a model developed for one application domain to be transferred relatively easily to another [
9].
Many works that aim to recognize pathologies on plant leaves use transfer learning as the starting point to develop new models. These works usually propose a comparison between well-developed models and the new model to select the one that performs best for a specific problem. Regarding the recognition of maize diseases, Zhang et al. [
11] proposed an improved model based on GoogLeNet and Cifar10 architectures to classify eight disease types using images collected from both the PlantVillage dataset [
12] and other image search sites. Similarly, Bhatt et al. [
13] classified three disease types with a combination of enhanced models (VGG16, InceptionV2, ResNet50, and MobileNet), using only the PlantVillage data. Both studies reported high maximum classification accuracies. Furthermore, Chen et al. [
14] introduced INC-VGGN, a VGGNet enhanced with the Inception module. The network was trained on a field-collected database composed of images of both maize and rice leaves. The results were subsequently compared with those of other common transfer learning models trained on PlantVillage, and the proposed CNN performed the best. Likewise, Zeng et al. [
15] classified several diseases of maize using a database acquired with a cellphone and a digital camera. They created a model that integrates the ResNet50 architecture with the SK unit (found in SKNet). The results of their method were then compared with the results of state-of-the-art multiscale network models (InceptionV3, InceptionV4, and Inception-ResNet-V2) and showed that their proposal produced competitive results. On the other hand, Verma and Bhowmik [
16] created a new architecture named MDCNN (Maize Disease Detection CNN) and a database composed of publicly available databases and manually acquired leaf images. In this work, the results were also compared to those of several pre-trained networks, with the proposed model achieving the best results.
In the domain of maize leaf nutrition identification using artificial vision, various studies have explored the detection and analysis of nutritional deficiencies. For instance, Zúñiga and Bruno [
7] developed a system that relies on texture and color analysis to recognize deficiency levels of essential nutrients such as nitrogen (N), phosphorus (P), potassium (K), magnesium (Mg), and sulfur (S). K-nearest neighbors, Naive Bayes, and Support Vector Machine (SVM) classifiers were used, and the best results were obtained with SVM, which achieved no more than 82% accuracy. Furthermore, Leena and Saju [
3] classified macronutrient deficiencies (N, P, and K) using an optimized multi-class SVM, with the highest classification accuracy being 90%. Similarly, Guerrero et al. [
17] recognized NPK deficiencies on banana leaves by preprocessing the images through linear and color space transformations and using them as input for a VGG16 model, obtaining a maximum accuracy of 98%. Likewise, Jahagirdar and Budihal [
18] used images of NPK-deficient maize leaves to train an InceptionV3 model, reaching 80% training accuracy. In both of the latter cases, the authors acquired their own datasets, but these are not publicly available.
Meanwhile, there are even fewer studies that specifically concentrate on the identification of single-nutrient deficiencies, such as the work conducted by de Fátima da Silva et al. [
19], wherein magnesium nutrition was assessed with texture classifiers, reaching a maximum classification accuracy of 75%. In addition, Condori et al. [
8] detected levels of nitrogen deficiency by comparing texture and transfer learning models. The main conclusion of this work was that the results of CNN-based models outperform those of texture methods in the majority of experiments.
In the realm of datasets utilized for studying nutrient deficiencies in maize leaves, Peng et al. [
20] stands out as a pioneering contribution. The dataset in [
20] comprises UAV images characterized by extensive spatial coverage and long time series purposefully crafted for distribution analysis of maize in China. Additionally, [
21] presents another noteworthy contribution, wherein images were sourced from well-known leaf databases, systematically curated, and categorized by four common diseases; a distinctive feature of that work is the absence of self-generated images. To the best of our knowledge, there is no publicly available maize leaf database explicitly addressing single-nutrient deficiency, particularly phosphorus deficiency.
In the current research landscape, there is a conspicuous absence of studies dedicated exclusively to the classification of phosphorus deficiency in maize using deep transfer learning techniques. Furthermore, the lack of a well-established, publicly available database on this specific topic exacerbates the research gap. In light of these considerations, the primary objective of this study is to fill this void by applying recent advancements in deep learning. Specifically, we focus on classifying images, acquired in a controlled environment, of maize leaves exhibiting three degrees of phosphorus deficiency: the complete absence of the nutrient, a half dose of the required phosphorus, and an adequate supply of phosphorus.
The structure of this work is as follows:
Section 2 provides an overview of the dataset-building process and details the transfer learning approach utilized. In
Section 3, the results obtained from applying the transfer learning models to the created dataset are reported. The paper closes with the conclusions in
Section 4.
2. Materials and Methods
The workflow employed in this study, which applies deep learning techniques to classify three levels of phosphorus deficiency in maize leaves, is illustrated in
Figure 1.
Firstly, we begin with a data preparation stage, which involves the collection, labeling, preprocessing, and splitting of data. This stage ends with the labeled samples divided into three sets: training, validation, and test. In the second stage, a set of pre-trained models is chosen and implemented in MATLAB version 9.9.0 (R2020b), The MathWorks Inc., Natick, MA, USA. One model is selected for a fine-tuning stage, whose inputs are the training and validation sets and whose output is a trained model. The fine-tuned model is then used to classify new images from the test set. The prediction results are evaluated with classification metrics, and the next model from the aforementioned set is chosen to restart the second stage. Once all pre-trained models have been tested, a comprehensive performance evaluation is conducted based on the metric scores, thereby concluding the workflow.
The following subsections describe each of these procedures in detail.
2.1. Dataset Building
The images of nutrition-deficient maize leaves (
Zea mays L. improved variety ICA—V 109) used in this study were collected from mid-June to early August 2022 in a plastic shed in the area of Natural Systems and Sustainability of Universidad EAFIT, Medellin, Colombia (6°11′53.80″ N, 75°34′43.23″ W). The experimental design comprised ten replications of three phosphorus levels: P absence (-P), half dose (-P50), and complete supply (C), resulting in a total of 30 plants (see
Figure 2).
To induce the phosphorus deficiency levels, Hoagland’s complete solution [
4] was modified to include only macronutrients, with the net contribution of each nutrient adjusted according to the mineral concentrations in the solution.
2.1.1. Image Collection
A total of 3934 images were acquired, covering the seedling, jointing, and flowering growth stages. Images were captured under natural illumination on both sunny and cloudy days to increase the diversity of lighting conditions.
Five acquisition devices were utilized: two regular smartphones of different models, a digital camera, a single-lens reflex camera, and a compact scientific camera. In
Table 1, the specifications of the tested cameras are presented.
However, preliminary experiments determined that images captured by the scientific camera consistently yield superior classification performance. The outcomes of image classification using the GoogLeNet architecture for each camera type are provided in
Appendix A. Consequently, this study exclusively concentrates on the dataset comprising images acquired solely by the scientific camera.
The image collection process was conducted according to the following steps: (1) One leaf per plant, exhibiting prominent visual symptoms (predominantly observed in older leaves), was selected for sampling. Specifically, the mid-leaf area, as depicted in
Figure 3, was the focal region of interest. (2) A white background sheet was carefully positioned: the intent was to prevent the formation of shadows caused by the leaf and to minimize background-related noise. (3) The leaf was securely held, and a total of five photographs were captured for each leaf. Either the capture angle or the leaf section was adjusted between each shot, ensuring diverse perspectives. An illustration of this process is presented in
Figure 4.
The resulting images were saved and labeled according to the treatment and growth stage. Examples of images obtained using this method are shown in
Figure 4.
2.1.2. Image Pre-Processing and Data Augmentation
The original images obtained with the scientific camera underwent automatic size processing using Python (version 3.8.8, Python Software Foundation, Wilmington, DE, USA) code with two concurrent methods: (1) All original images were cropped to a central square with sides equal to the smallest image side (n), i.e., 1020 px; the cropped images were then resized to the 224 × 224 px input size of the pre-trained networks, according to the method shown in Figure 5a. (2) All central-square crops were subsequently divided into four individual images of 510 × 510 px each, which were likewise resized to 224 × 224 px. The process is shown in
Figure 5b.
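As an illustration, the two methods can be sketched in a few lines of Python. The Pillow-based helper below is a hypothetical reconstruction (the function names and the 224 × 224 px target are our assumptions), not the original preprocessing script.

```python
# Minimal sketch of the two concurrent cropping methods (requires Pillow).
from PIL import Image

def center_square(img: Image.Image) -> Image.Image:
    """Crop the largest centered square (side = smallest image side)."""
    w, h = img.size
    n = min(w, h)
    left, top = (w - n) // 2, (h - n) // 2
    return img.crop((left, top, left + n, top + n))

def method_1(img: Image.Image) -> Image.Image:
    """Central square resized directly to the assumed network input size."""
    return center_square(img).resize((224, 224))

def method_2(img: Image.Image) -> list:
    """Central square split into four quadrants, each resized separately."""
    sq = center_square(img)
    n = sq.size[0]
    half = n // 2                       # 1020 px square -> four 510 px tiles
    boxes = [(0, 0, half, half), (half, 0, n, half),
             (0, half, half, n), (half, half, n, n)]
    return [sq.crop(b).resize((224, 224)) for b in boxes]
```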
After the above process, the number of images increased five-fold. However, the automated cropping mechanism introduced certain issues, such as producing blank images or images capturing only a small portion of the leaf, which led to images with limited or irrelevant content; an example can be seen in
Figure 5b. Based on supplementary experiments presented in
Appendix B, these images confuse the neural network, hindering the extraction of pertinent features and consequently degrading the performance metrics. To address this problem, an algorithm was developed to select valid images. The algorithm involved the following steps: First, the image was split into its RGB components. Histogram analysis showed that the blue (B) channel provided the most contrast between leaf and background, so only the B channel was preserved. Next, thresholding was applied to distinguish leaf pixels (set to 255) from the background (set to 0). The algorithm then counted the number of leaf pixels and considered a minimum count of 15,000 pixels as indicative of a significant leaf presence. Finally, images with a pixel count below this threshold were excluded from further analysis. The effectiveness of the filtering process is illustrated in
Figure 6.
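A minimal sketch of this filtering algorithm is given below, assuming NumPy and Pillow. The threshold value THRESH is a placeholder, since only the 15,000-pixel minimum count is specified above; the sketch also assumes that leaf pixels are darker than the white background in the B channel.

```python
# Sketch of the valid-image filter described above (NumPy + Pillow).
import numpy as np
from PIL import Image

THRESH = 128               # assumed B-channel threshold (from histogram analysis)
MIN_LEAF_PIXELS = 15_000   # minimum count treated as significant leaf presence

def is_valid(path: str) -> bool:
    rgb = np.asarray(Image.open(path).convert("RGB"))
    b = rgb[:, :, 2]                      # keep only the blue (B) channel
    mask = np.where(b < THRESH, 255, 0)   # leaf -> 255, white background -> 0
    return int(np.count_nonzero(mask)) >= MIN_LEAF_PIXELS
```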
Following the preprocessing and data augmentation procedures, the resulting dataset contained the number of images indicated in
Table 2.
Finally, the training, validation, and testing image sets were composed in a 7:2:1 ratio, in such a way that the five sub-images obtained from each original image belonged to a single set, thus ensuring independence between the sets. The image totals for each set are detailed in
Table 3.
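This grouping constraint can be implemented as in the following sketch; the naming convention linking sub-images to their source photograph is an assumption for illustration.

```python
# Sketch of a 7:2:1 split that keeps the five sub-images derived from each
# original photograph in a single subset (naming convention hypothetical).
import random
from collections import defaultdict

def group_split(image_paths, seed=0):
    """Group sub-images by source photo, then split the groups 7:2:1."""
    groups = defaultdict(list)
    for p in image_paths:
        # assumed convention: <original_id>_<subimage_index>.jpg
        groups[p.rsplit("_", 1)[0]].append(p)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1, cut2 = int(0.7 * n), int(0.9 * n)   # 70% / 20% / 10% of the groups
    pick = lambda ks: [p for k in ks for p in groups[k]]
    return pick(keys[:cut1]), pick(keys[cut1:cut2]), pick(keys[cut2:])
```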
2.2. Transfer Learning Approach
A deep learning approach is employed to classify the three levels of phosphorus deficiencies. Given the challenges associated with acquiring an ample supply of images and the potential scarcity of publicly available datasets for training convolutional neural networks (CNNs), it is a common practice to adopt transfer learning. Transfer learning is a powerful machine learning technique that involves repurposing an existing trained model for a new—often related—problem. This approach capitalizes on the capability of the initial layers in the original model to detect general features. Subsequently, the output of the last layer is adapted to the specific requirements of the new task. This adjustment is achieved by replacing the last fully connected layer with a new one representing the classes relevant to the new problem. Additionally, it is possible to fine-tune the transfer learning process by selectively freezing or updating specific weights in the initial layers [
22].
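For readers unfamiliar with this recipe, the sketch below illustrates it with PyTorch/torchvision. The study itself used MATLAB's Deep Learning Toolbox (Section 2.3), so this is an illustrative analogue, not the actual implementation.

```python
# Illustration of the transfer-learning recipe described above: freeze the
# pre-trained feature layers, then replace the final classifier.
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

for param in model.parameters():
    param.requires_grad = False            # freeze pre-trained feature layers

# Replace the last fully connected layer with a 3-class head
# (-P, -P50, and C); only this layer is updated during fine-tuning.
model.fc = nn.Linear(model.fc.in_features, 3)
```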
The models used for transfer learning in this study are primarily associated with the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [
23], which has produced some of the most accurate models. These models have served as inspiration for numerous versions and improvements, as well as being the foundation for other models. Considering the existing literature, five architectures were selected based on their frequent utilization and high accuracy. Therefore, the following models were included in this study.
2.2.1. VGG16
These models were introduced in 2014 by Oxford’s Visual Geometry Group [
24] but are still popular today. The VGG networks consist of multiple blocks of stacked convolutional layers with small filters (i.e., 3 × 3) combined with max-pooling layers, followed by fully connected layers. These stacks of small filters are used instead of a single layer with a larger filter (such as 5 × 5 or 7 × 7) in order to increase efficiency and make the decision function more discriminative [
22]. The latter ultimately means that this model type generalizes well to a wide range of tasks [
24]. One of the most popular variants is VGG16, which is composed of 16 weight layers and is available pre-trained on the ImageNet dataset [
25].
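A single VGG-style block might be sketched as follows (channel counts are illustrative); note how two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as one larger filter while adding an extra nonlinearity.

```python
# A VGG-style block: two stacked 3x3 convolutions followed by max-pooling.
import torch.nn as nn

vgg_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```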
2.2.2. ResNet50
Residual networks were first introduced in 2015 by He et al. [
26] and consist of blocks with two or three sequential convolutional layers with a parallel but separate identity layer that connects the input of the first layer to the output of the last one [
22]. These identity paths, called 'skip connections', mitigate the degradation in training and testing error that appears as models grow deeper. Furthermore, they can mitigate the vanishing gradient problem when placed before the activation function [
27]. This study utilizes ResNet50, one of the evolved versions of ResNet: a 50-layer architecture known for its remarkable performance and effectiveness across various tasks.
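The sketch below shows a minimal residual block of this kind (a simplified PyTorch illustration, not ResNet50's exact bottleneck design).

```python
# Minimal residual block: the skip connection adds the block input to its
# output, easing gradient flow in deep networks.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # identity 'skip connection'
```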
2.2.3. GoogLeNet
The GoogLeNet model is a particular manifestation of the Inception architecture. An Inception block splits the input into multiple parallel branches containing convolutional layers with different filter sizes and a pooling layer. These branches are preceded or followed by a dimension-reducing convolution that limits the output depth, and the branch outputs are finally concatenated. This design saves computing resources [
22]. The GoogLeNet structure uses nine Inception modules accompanied by pooling, regularization, and fully connected layers. For additional information, refer to the original paper [
28].
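A simplified Inception-style module might look as follows (branch widths are illustrative placeholders, not GoogLeNet's actual configuration).

```python
# Sketch of an Inception-style module: parallel branches with different
# filter sizes plus a pooling branch, each limited in depth by a 1x1
# convolution, concatenated along the channel axis.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, 1)                        # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1),
                                nn.Conv2d(32, 64, 3, padding=1))  # 3x3 branch
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))  # 5x5 branch
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))          # pool branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```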
2.2.4. DenseNet201
The creators of Dense Convolutional Network (DenseNet) [
29] took inspiration from the residual network idea to introduce dense blocks. These are modules of sequential convolutional layers in which each layer is connected to every other layer in a feed-forward fashion through concatenation. In this way, successive layers receive the feature maps of all preceding ones, improving feature propagation and reuse. This causes the number of channels to grow, yet it reduces the number of parameters compared to a conventional CNN [
30]. Three versions are highlighted: DenseNet121, DenseNet169, and DenseNet201, which are differentiated by the number of layers. The latter is used in this study.
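The concatenation mechanism can be sketched as follows; this is a simplified illustration in which the growth rate and layer count are placeholders and DenseNet's batch-norm/ReLU details are omitted.

```python
# Sketch of a dense block: each layer receives the concatenation of all
# preceding feature maps, so channels grow by the 'growth rate' per layer.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth: int = 32, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth, growth, 3, padding=1)
            for i in range(n_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # concatenate feature maps
        return x
```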
2.2.5. MobileNetV2
MobileNet was first introduced by Howard et al. [
31] using the concept of depthwise separable blocks, which consist of (1) a depthwise convolution, applying a single convolutional filter per input channel, followed by (2) a pointwise convolution, computing a linear 1 × 1 convolution of the input channels. Later, Sandler et al. [
32] improved the original version by incorporating bottleneck blocks between input and output layers; bottleneck blocks are similar to residual connections but are considerably more memory efficient [
33].
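The depthwise separable building block can be sketched as follows (a simplified illustration without the batch normalization and activation layers of the full MobileNet design).

```python
# Depthwise separable convolution: a per-channel (depthwise) 3x3 filter
# followed by a 1x1 (pointwise) convolution that linearly mixes channels.
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                  groups=in_ch),                # depthwise: one filter per channel
        nn.Conv2d(in_ch, out_ch, kernel_size=1),  # pointwise: 1x1 channel mix
    )
```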
2.3. Model Implementations
The transfer learning models are implemented using MATLAB’s Deep Learning Toolbox™ release R2020a, The MathWorks Inc., Natick, MA, USA [
34]. This package provides access to the pre-trained models mentioned earlier, which have been specifically trained on the ImageNet dataset. Some specifications of these models are presented in
Table 4.
The computer code is executed on a machine equipped with an i7-9700K 3.6 GHz processor, 64 GB RAM (Intel, Santa Clara, CA, USA), and an NVIDIA GeForce RTX 2080 40 GB GPU (Santa Clara, CA, USA). To apply the transfer learning approach for each model, the following process is performed (see
Figure 7):
(1) Data with their ground-truth labels are read. The groups of five sub-images are randomly assigned so that the training set contains 70% of the available samples, the validation set 20%, and the test set the remaining 10%, ensuring the independence of the sets. (2) Each model is loaded separately, and its initial layers are frozen to reuse the already-learned general features; moreover, the last fully connected layer is replaced to match the three class outputs. (3) Hyperparameters are predefined with the specific values outlined in
Table 5. These hyperparameters control various aspects of the deep learning model and its training process; their details are explained in the following paragraph. (4) Each model is trained on the training set and validated at each epoch using the validation set; training continues until either the maximum number of epochs is reached or the validation patience criterion is met. (5) The fine-tuned model is used to classify new images from the test set, yielding the predicted labels. (6) The predicted labels are compared with the ground-truth labels to evaluate the model's performance.
In
Table 5, the solver is the optimizer used for the loss function: stochastic gradient descent with a momentum of 0.9. The batch size indicates the number of images processed by the network in each iteration for error computation and weight updates. The initial learning rate is applied at the beginning of training and decreases by a factor of 0.96 per epoch in a stepwise manner. To prevent overfitting, the training process considers both a maximum number of epochs and a validation patience criterion, which monitors the validation error and stops training when it fails to improve within a set number of epochs.
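Transcribed to PyTorch for illustration (the study used MATLAB), this configuration might look as follows; the initial learning rate and patience values shown are placeholders standing in for Table 5, while the momentum of 0.9 and the stepwise 0.96-per-epoch decay follow the text.

```python
# Illustrative PyTorch analogue of the Table 5 training configuration.
import torch

def make_optimizer(model, initial_lr=1e-3):           # placeholder value
    opt = torch.optim.SGD(model.parameters(), lr=initial_lr, momentum=0.9)
    # stepwise decay: multiply the learning rate by 0.96 after every epoch
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.96)
    return opt, sched

MAX_EPOCHS = 30   # maximum number of epochs (per the learning-curve runs)
PATIENCE = 5      # placeholder validation-patience value
```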
The hyperparameters presented in
Table 5 were determined based on findings reported in the existing literature for similar studies: Mohanty et al. [
35], Barbedo [
36], Zhang et al. [
11], Maeda-Gutiérrez et al. [
37], and Nagaoka [
38]. These values have been widely used and are recognized as effective choices for achieving good performance using deep learning models.
3. Results
The proposed transfer learning approach was employed to classify three levels of maize leaf phosphorus deficiency using the aforementioned deep learning models (VGG16, ResNet50, GoogLeNet, DenseNet201, and MobileNetV2). The following subsections describe the evolution of accuracy and loss values during the training stage and present the results used to evaluate the overall performance of the studied models on the dataset built specifically for this study.
3.1. Learning Curves
To evaluate the training performance, accuracy and loss curves are examined for each epoch.
Figure 8,
Figure 9,
Figure 10,
Figure 11 and
Figure 12 depict the training progress on the training set and visually represent how well each model learns. The validation curves in those figures, in turn, provide insight into how well the models generalize.
Each model was run for up to 30 epochs, and at around ten epochs the models started to converge with high accuracy. DenseNet, VGG, and ResNet achieved the highest validation accuracies, followed by MobileNet. As expected, losses decreased as accuracy increased, with values ranging from 0.18 to 0.4.
In addition, model behavior can be diagnosed from the shape of the learning curves. One common dynamic that can be observed in the graphs is overfitting, which occurs when a model has learned the training dataset too closely, making it less able to generalize to unseen data. Based on
Figure 9, the ResNet validation loss curve continues to increase after reaching a minimum. Similarly, the GoogLeNet loss curve (
Figure 10) has two peaks where the loss increases, causing training to stop at only 14 epochs.
Another aspect visible in the learning curves is the gap between the validation and training loss curves, which indicates an insufficient dataset size. It can be observed in
Figure 12 that the MobileNet loss curves show the largest gap, followed by those of ResNet and VGG (
Figure 8 and
Figure 9, respectively).
Finally, the most consistent performance is obtained by the DenseNet model in
Figure 11, since both curves reach a point of stability with a minimal gap between the final values. In addition, training stops at 20 epochs, indicating good learning and generalization of the features in the images.
3.2. Performance Analysis
Once each model is trained, it can be used to infer features of interest in unknown data to test its generalization. To assess the effectiveness of the studied models and to determine whether one model outperforms the others, four performance metrics were utilized, as described below:
Accuracy: This is the most common classification metric. This metric describes the ratio between the number of correct predictions and the size of the data. The metric is defined in Equation (
1).
Precision: This is a performance metric that measures the proportion of correct predictions for a specific class out of all the predictions made by the model for that class. It provides insight into the model’s ability to accurately classify instances for a particular class, regardless of the overall accuracy. Precision focuses on the relevance of the model’s predictions compared to the actual ground truth. This metric is defined in Equation (
2).
Recall: Also known as sensitivity or true positive rate, recall is a performance metric that measures the proportion of correctly predicted instances for a specific class out of all the instances that actually belong to that class. It quantifies the model’s ability to identify and capture the positive instances, or true positives, in relation to the actual ground truth. Recall emphasizes the model’s capability to recognize and recall the relevant instances of a particular class, without considering the incorrect predictions. This metric is computed as presented in Equation (
3).
F1-score: The F1-score is a performance metric that combines precision and recall into a single value by taking their harmonic mean. By incorporating both precision and recall, the F1-score provides a comprehensive evaluation of the model’s ability to achieve both high precision and high recall, promoting a balanced trade-off between the two measures. The metric is defined by Equation (
4).
Since the metrics of precision, recall, and F1-score are computed per class, for n classes there are different ways to combine these scores into an overall value. One way is to calculate the simple arithmetic mean, which is known as the macro-averaged score and is defined by Equation (
5). With this technique, all classes contribute equally to the final averaged metric.
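For reference, the standard definitions of these metrics, consistent with the descriptions above, are given below; TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives for a given class, and metric_i is the per-class value of the metric being macro-averaged.

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)

\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)

\mathrm{Recall}    = \frac{TP}{TP + FN} \quad (3)

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)

\mathrm{Macro\text{-}avg} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{metric}_i \quad (5)
```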
Table 6 presents a comparison of the performance metrics, including macro-averaged precision, recall, F1-score, and accuracy, on the testing set. DenseNet is the model with the best results and is highlighted.
In these terms, GoogLeNet obtained the lowest scores, followed by the VGG and MobileNet architectures. As discussed in the analysis of the learning curves, these models showed signs of overfitting and insufficient dataset size during training, both of which negatively impact performance. Likewise, the most consistent training was achieved by DenseNet and ResNet, which is reflected in their high scores.
To complete the evaluation of the studied models, the confusion matrix is used. This tool records all the predictions made on the test set, allowing the performance for each class to be visualized: the ground truth is arranged along one axis of the matrix against the model's predictions along the other. The confusion matrices for all models are shown in
Figure 13.
Based on the graphics, it can be seen that in almost all cases, the prediction of the -P50 class has the lowest performance values, except for DenseNet (
Figure 13d), which shows its lowest recognition rate for a different label. Concerning this architecture, the color map shows high homogeneity of correct classifications across all classes, i.e., this model has no strong inclination to recognize one class more than another. In the opposite case, it can be seen in
Figure 13c that the GoogLeNet model has a clear classification weakness for the -P50 label, although the other two classes are identified with good accuracy. This same behavior is traced by MobileNet in
Figure 13e but with a slightly higher accuracy rate. Finally, both the VGG and ResNet models had similar behavior (
Figure 13a and
Figure 13b, respectively).
This difficulty in recognizing the -P50 label is observed in almost all models and is explained by the overlap of visual characteristics between this class and the other two: distinguishing a leaf with medium nutrition from one with sufficient or low nutrition is as difficult for a human as it is for a machine.
Nutrition evaluation is not the only issue relevant to ensuring good agricultural production; others, such as maize diseases, have been studied using the same deep learning framework, making it possible to compare our results with those obtained by other studies in the literature. This comparison is presented in
Table 7. The analysis shows that this work places within state-of-the-art results and that only a few studies have attempted to acquire their own images.
4. Conclusions
The detection and identification of plant leaf issues is a relevant task in farm management. The care of each plant leads to a healthy plantation, which results in high production and excellent quality. Despite the development of many deep learning methods for the classification of plant diseases, including leaf nutrition deficiencies, these methods do not perform equally in all situations. For this reason, there is a need to test deep learning model performance on specific tasks, as the learned features are extracted uniquely depending on image characteristics.
In this study, five deep transfer learning architectures pre-trained on the ImageNet database (i.e., VGG16, ResNet50, GoogLeNet, DenseNet201, and MobileNetV2) were trained to classify three phosphorus deficiency levels. The training was conducted on a self-made database comprising images taken by five different acquisition devices (although only the scientific camera's images were selected for this analysis). It was found that DenseNet201 performed best for this specific problem, giving the most consistent training behavior and the best recognition metrics, leading with the highest overall accuracy as well as correct prediction rates uniformly distributed among all classes. This performance places within state-of-the-art results, so further investigations can focus on the performance of this architecture against more recent models.
The second-best model was ResNet50, with better recognition rates for the -P and C labels than for -P50. Finally, the GoogLeNet and VGG16 models had lower overall accuracy. The former, probably due to its large number of layers, either does not learn enough features from the database or requires more data to correctly adjust the network weights; meanwhile, the dataset is possibly unrepresentative for the latter, judging by the shape of its loss curve.
Moreover, the analysis of the learning curves supports this hypothesis and suggests that, for most architectures, either the amount or the quality of the training data is insufficient for the models to generalize the image features perfectly. Therefore, it is necessary to increase the size of the database in the future. Likewise, different regularization techniques can be explored to avoid overfitting.
This study provides a comprehensive evaluation of the performance of the aforementioned models, contributing to the understanding of deep learning models applied to the detection of single-nutrient deficiencies in plants. It also contributes to faster and more economical identification of phosphorus nutrition issues, so that a crop's fertilization schedule can target specific plants, making more rational use of resources and protecting both the farmer's budget and the health of the environment. We remark that the setup proposed in this research can be easily extended to real-time monitoring of other crop types and even to analyzing different kinds of leaf issues that can be inferred from visual inspection.
In future work, we aim to explore the development of a novel model directly from our dataset. This endeavor will involve experiments without pre-training to assess the model’s performance and the potential advantages of a more specific dataset tailored to address agricultural challenges. Additionally, we will investigate the implications of this approach for broader applications in similar agricultural problems.