1. Introduction
Crop pests and diseases reduce yield, degrade quality, and can kill plants outright. Pathogens such as mildew and rust infect crops and cause the plants to wither and wilt, while pests such as aphids and stick insects feed on the plants and jeopardise the harvest. Statistics show that global food losses due to pests and diseases amount to 30% per annum. Infection can also lead to toxin contamination, as in cereals contaminated by fungi, which endangers the health of both humans and animals. Severe outbreaks can sharply reduce food production and thereby affect market supply.
Ensuring food security is vital for both society and the economy. The food industry can act as a catalyst for the development of related industries and create jobs. The cultivation, storage, processing, and transportation of food can generate many employment opportunities. Additionally, trade in food can encourage economic cooperation and exchange between various countries and regions, and its growth can help optimize the allocation of resources and lead to mutual benefits for all involved.
Controlling pests and diseases is a crucial aspect of crop growth [1,2,3]. The first step towards effective control is to identify accurately which pests or diseases are present. Unfortunately, the necessary professional knowledge is largely limited to agricultural experts, and most farmers are not in a position to identify pests and diseases promptly and correctly. Agricultural specialists are too few to cover vast areas of farmland, and they cannot always reach the locations that need prompt attention. Even with online connectivity, remote identification through techniques such as video calls can be hampered by network and equipment problems, and manual identification always carries a risk of error. Enabling farmers to identify pests and diseases at an early stage and to apply targeted control measures is therefore crucial: it limits the spread of the pest or disease and avoids the incorrect control measures that follow from misidentification, giving crops the best available protection. Protecting food security and mitigating the impact of crop pests and diseases is crucial for ensuring social stability and economic development. To this end, it is imperative to implement effective measures such as scientific and technological innovation and policy support to enhance pest and disease resistance and secure food supplies [4,5].
Figure 1 illustrates the extent of pest and disease occurrence, control, and yield loss in China between 1990 and 2020.
Earlier classification models based on compressed sensing networks first preprocess the image with operations such as resizing and flipping. The algorithm then compresses the information, after which the model extracts the relevant features, reconstructs them, and finally classifies them. In compressed sensing algorithms, the reconstruction stage frequently requires a significant amount of computational resources, which limits where the algorithm can be used. Because compressed sensing reconstruction is essentially a combinatorial optimisation problem, a considerable number of iterative calculations are necessary to solve it. The outcome of the reconstruction also depends heavily on the algorithm used, and different algorithms may produce significantly different reconstructions. Choosing or designing the best algorithm for a particular problem is exceedingly challenging, and optimal reconstruction is difficult to guarantee. Finally, the iterative computation makes reconstruction slow, rendering it inadequate for time-sensitive applications such as real-time monitoring.
With the advancement of deep compressed sensing, an increasing number of deep compression models have been put forward. Han et al. (2015) introduced the concept of deep compressed sensing, which they then applied to guide neural network compression. The authors validated the efficacy of this method on several datasets [6]. Kim et al. (2016) investigated the impact of deep compressed sensing on mobile applications, demonstrating through analysis of several mobile tasks that the approach can successfully compress the model and speed up computation [7]. Shi, Jiang, and Zhang (2017) introduced CSNet, a CS reconstruction model in which the sampling matrix is trained as part of the network [8]. In 2020, Shi, Jiang, et al. improved upon this model with TIP-CSNet, which uses a neural network to learn three types of sampling matrices and applies linear and nonlinear reconstruction networks for enhanced reconstruction [9]. Du, Xie, Wang, Shi, et al. (2018) used a feature-level perceptual loss to improve the structural information of restored images [10]. In 2018, Zhang and Ghanem presented ISTA-Net, a deep network for image compressed sensing that is both interpretable and optimizable. The network draws inspiration from the iterative shrinkage-thresholding algorithm and employs nonlinear transforms to solve the proximal mappings associated with sparsity-inducing regularizers [11]. Adler and Boublil (2016) proposed a block-based deep learning method for compressed sensing, in which a fully connected network carries out the linear sensing and nonlinear reconstruction phases [12]. Yao, Dai, Zhang, and their colleagues (2017) proposed the DR2-Net model, which uses a fully connected layer as a linear mapping network and enhances the reconstruction component of the model with multiple residual learning blocks [13]. Kulkarni, Lohit, Turaga, et al. (2016) proposed the ReconNet model, which uses a non-iterative algorithm to reconstruct images directly from random compressed sensing measurements [14]. Canh, Jeon, and colleagues (2019) put forward DoC-DCS, a convolution-based multiscale DCS scheme that learns the decomposition, in contrast to wavelet-domain multiscale DCS, which relies on hand-designed filters [15]. In 2018, Xu et al. proposed LAPRAN, a scalable adversarial network that uses Laplacian pyramid reconstruction to reconstruct the image gradually over multiple stages [16].
Neural networks have a wide range of uses in modern agriculture. Xiang and colleagues employed MobileNetV2 with transfer learning to classify and identify different types of fruit. They ran trials in which different parts of the network were optimised with different learning rates during training, ultimately reaching an accuracy of 85.12% [17]. Jiang and Liu used the EdgeNeXt model as the backbone for detecting the ripeness of lucerne, together with a pyramid attention-based feature pyramid network (PAFPN) and an efficient strip attention (ESA) module for improved precision [18]. Dhiman and colleagues systematically reviewed research on pests and diseases affecting citrus, providing meta-analyses, outlining limitations, and offering directions for future research [19]. In 2023, Mamat and colleagues introduced a technique for enhancing image annotation, which annotated a large number of images accurately. They also used a YOLO model with transfer learning to classify oil palm fruits, achieving a mean average precision of 98.7% [20]. In 2020, Gulzar and colleagues proposed a seed classification system that used a CNN model with transfer learning to classify 14 common seeds. The system achieved 99% classification accuracy during training [21]. Liu, Lin, et al. developed the ATC-YOLOv5 model for detecting and grading passion fruit quality. The model incorporates Multi-Head Self-Attention (MHSA) to improve recognition accuracy while remaining lightweight [22].
Figure 2 shows the classification process of a traditional compressed sensing network.
The model proposed in this paper differs from existing models by eliminating the reconstruction step. It classifies the compressed data directly, so the final classification task is performed in the compressed domain. Compared with prior compressed sensing network models, the proposed model needs only four steps for the classification task: preprocessing, data compression, feature extraction, and classification.
Figure 3 shows our proposed compressed-domain classification process.
The paper adopts convolution in place of the conventional sampling matrix to attain compressed sampling that adjusts to varying sampling rates. This approach enhances the robustness of learning across all sampling rates. Additionally, the parallel computing framework based on deep learning considerably increases computational speed, addressing real-time processing requirements and expanding the algorithm’s potential applications. The following sections present the contributions of this study.
A compression module is presented, made up of three convolutional layers stacked one after the other. This module compresses the information in the model in place of the sampling matrix used in traditional compressed sensing. Within this module, the size of the tensor passed through the network depends on the sampling rate of the compression.
A modified parallel convolution module, referred to as the LS block, is used in the proposed model. The LS block processes the same tensor through several parallel branches and combines the results. Applying different convolutional kernels in parallel widens the network and captures features at several scales at once, while 1 × 1 convolutional kernels reduce dimensionality and so lower the computational cost of the larger kernels.
A residual structure is incorporated into the parallel convolution block described above, forming the parallel convolution block with residuals, namely the LSR block. A shortcut branch carries the input tensor past the parallel convolutions so that it can be summed with their output. The residual connection passes the gradient directly to earlier layers, mitigating the vanishing gradient problem that arises as networks become deeper. It also improves the flow of information between layers, which makes deeper network models easier to train. In addition, the residual structure integrates disparate features and strengthens their expression.
The remainder of this paper is organised as follows. Section 2, Materials and Methods, describes the methods and datasets we used and the methods and models we propose. Section 3, Results, contains the experimental results and data analysis. Section 4, Discussion, contains our discussion of the experimental results, as well as application scenarios for the model and perspectives for future work.
2. Materials and Methods
2.1. Compressed Sensing
Compressed sensing (CS) was initially proposed by Donoho, Candès, and Tao [23,24,25,26,27]. It is a signal processing approach in which a signal is acquired at a sampling rate lower than that required by the Nyquist sampling theorem, so that compression is accomplished during acquisition. For the signal to be recoverable, the measurement process must satisfy the restricted isometry property (RIP). Equation (1) expresses this condition, which requires that for each sparsity level there exists a constant such that the measurements approximately preserve the energy of every sparse signal:
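In its standard textbook form, which is assumed here to be what Equation (1) expresses, the RIP of order k can be written as follows. The symbols Φ (measurement matrix), x (a k-sparse signal), and δ_k (the restricted isometry constant, with 0 < δ_k < 1) follow common usage in the compressed sensing literature and are assumptions of this sketch.

```latex
(1-\delta_k)\,\lVert x \rVert_2^2 \;\le\; \lVert \Phi x \rVert_2^2 \;\le\; (1+\delta_k)\,\lVert x \rVert_2^2
\qquad \text{for all } k\text{-sparse } x .
```

The smaller δ_k is, the closer Φ comes to preserving the length of every k-sparse signal, which is what makes recovery from few measurements possible.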
By exploiting the sparsity or compressibility of the signal, it becomes feasible to recover the signal from a small number of samples. Compressed sensing theory shows that, for sparse or compressible signals, far fewer samples than stipulated by the traditional sampling theorem suffice to restore the original signal, so the Nyquist rate no longer constrains the sampling rate. The compression can be expressed through Equation (2), y = Φx, where y represents the measured signal, Φ is a measurement matrix of size M × N (M ≪ N), and x represents the original signal. The signal can be reconstructed only if Φ satisfies Equation (1) and x is sparse in some domain.
Compressed sensing applies optimization and probability theory to solve the signal recovery problem through systems of linear equations or optimization problems. Commonly used recovery algorithms include Basis Pursuit and Orthogonal Matching Pursuit (OMP). Compressed sensing can significantly reduce the number of samples acquired, lowering the storage and transmission requirements of the sensor. These techniques are currently used in various domains, including medical imaging and communications.
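The measurement step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the paper's implementation: the Gaussian measurement matrix, the signal length, and the helper names `gaussian_measurement_matrix` and `measure` are all assumptions chosen for the example.

```python
import random


def gaussian_measurement_matrix(m, n, seed=0):
    """Return an m x n matrix with i.i.d. N(0, 1/m) entries (an assumed, common choice)."""
    rng = random.Random(seed)
    scale = (1.0 / m) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def measure(phi, x):
    """Compute the measurement vector y = phi @ x using plain Python lists."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


# A length-64 signal with only 3 non-zero entries (sparse in the identity basis).
n, m = 64, 16
x = [0.0] * n
x[5], x[20], x[41] = 1.5, -2.0, 0.7

phi = gaussian_measurement_matrix(m, n)
y = measure(phi, x)          # 16 measurements instead of 64 samples
sampling_rate = m / n        # M / N = 0.25
```

Recovering x from y would require a solver such as Basis Pursuit or OMP, which is deliberately left out of this sketch.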
2.2. Inception Architecture
The Inception structure is a key module of the GoogLeNet network, proposed in 2014 by Szegedy, Liu, et al. [28]. The Inception module, illustrated in Figure 4, is a mini neural network whose modular design makes it easy to extend network width and depth. Each Inception module includes convolutional kernels of varying sizes, namely 1 × 1, 3 × 3, and 5 × 5, which concurrently extract features at different scales. The branches of the module are merged by concatenating their feature maps, which increases network width. A 1 × 1 convolutional layer is used for dimensionality reduction before the larger kernels, decreasing the number of parameters and computations. GoogLeNet is formed by stacking repeated Inception modules and therefore has a regular network structure. Compared to plain convolutional networks, the Inception structure has fewer parameters and computations, resulting in faster training.
The Inception design combines the dimensionality-reducing effect of 1 × 1 convolution with the feature extraction capability of multiscale convolution to construct an efficient and expressive network structure. It is considered one of the most effective classical network designs.
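The saving from 1 × 1 dimensionality reduction can be checked with simple arithmetic. The sketch below counts learnable parameters for a 5 × 5 convolution applied directly and for the same convolution preceded by a 1 × 1 reduction; the channel widths (256, 128, 64) and the helper `conv_params` are hypothetical values picked for illustration, not figures from the paper.

```python
def conv_params(in_ch, out_ch, k_h, k_w, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    return in_ch * out_ch * k_h * k_w + (out_ch if bias else 0)


in_ch, out_ch, reduced = 256, 128, 64  # hypothetical channel widths

# 5 x 5 convolution applied directly to the 256-channel input.
direct = conv_params(in_ch, out_ch, 5, 5)

# 1 x 1 reduction to 64 channels, followed by the 5 x 5 convolution.
bottleneck = conv_params(in_ch, reduced, 1, 1) + conv_params(reduced, out_ch, 5, 5)

saving = 1.0 - bottleneck / direct  # fraction of parameters removed
```

With these assumed widths the reduced branch needs roughly a quarter of the parameters of the direct one, which is the effect the module relies on.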
2.3. Residual Structure
The residual structure is the foundational element of the Residual Network (ResNet), which Kaiming He and colleagues at Microsoft Research introduced in 2015 [29]. Figure 5 shows how the residual structure works: an identity mapping connects the input of a group of layers directly to its output, creating a skip connection.
The residual structure alleviates the vanishing gradient problem: through the skip connections, the vanishing or explosion of gradients in deep networks can be avoided. Instead of learning a new feature mapping, the residual structure learns a residual function, which can speed up network training. It also allows the number of network layers to increase without loss of accuracy. The residual function increases the expressiveness of the network, allowing it to better approximate complex functions. With residual connections, low-level features can be reused, which avoids repeated learning. Because the identity shortcut adds no parameters and involves only an element-wise addition, the depth of the network can be increased at very little additional computational cost.
The residual structure successfully solves the problems of gradient vanishing and network degradation in convolutional neural networks, enabling extremely deep networks and dramatically improving the expressiveness and accuracy of the model. It has become a standard building block of many efficient networks.
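The reason a skip connection helps gradients can be stated in two lines. The notation below is an assumption of this sketch (x for the block input, F for the stacked layers, y for the block output, and 𝓛 for the loss); it follows the usual presentation of residual learning rather than a formula given in this paper.

```latex
y = F(x) + x,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\left( I + \frac{\partial F}{\partial x} \right).
```

The identity term I passes the upstream gradient through unchanged, so even when ∂F/∂x is small the gradient reaching earlier layers does not vanish.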
2.4. Attention Mechanism
The attention mechanism is widely used to improve model performance. It simulates human visual attention: the model automatically learns which parts of the input are more important and focuses on that key information. For tasks with long input sequences, the attention mechanism can capture long-range dependencies. Because it performs weighted selection, it can filter out unimportant content and reduce the amount of subsequent computation. The attention weights can also be visualised to show which parts of the input the model attends to when making predictions, improving the interpretability of the model. The most common attention mechanisms in image classification include the squeeze-and-excitation (SE) module [30], the ECA module [31], the CBAM module [32], and the NAM module [33].
Figure 6 shows the principle and structure of the SE attention mechanism. The four colourless squares form the squeeze part, which compresses each two-dimensional feature map (H × W) into a single real number through global average pooling, transforming the feature map from [h, w, c] to [1, 1, c] so that attention is placed on the channels. The coloured squares that follow form the excitation part, which models the correlation between channels through fully connected layers and outputs one weight per channel of the input feature map. Finally, the scale part multiplies the features of each channel by its normalised weight.
Attention mechanisms in deep learning can be divided into two types: channel attention and spatial attention. Channel attention determines the weight relationship between different channels, strengthening the channels that matter and suppressing those that contribute little. Spatial attention determines the weight relationship between pixels in a spatial neighbourhood, strengthening the pixels in the region of interest so that the algorithm pays more attention to the area under study and less to non-essential regions. Overall, the attention mechanism is a very effective component of neural networks and can significantly improve performance on many tasks.
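The squeeze, excitation, and scale steps of the SE module described above can be sketched with nested Python lists. This is a minimal illustration under stated assumptions: the function name `se_block`, the bias-free fully connected layers, and the tiny hand-picked weights are hypothetical and are not taken from the paper or from any library.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply squeeze-and-excitation to a feature map stored as [C][H][W] nested lists.

    w1 has shape [C_reduced][C] and w2 has shape [C][C_reduced]; biases are omitted.
    """
    # Squeeze: global average pooling turns each H x W channel into one number.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value of a channel by that channel's weight.
    scaled = [
        [[v * wt for v in row] for row in channel]
        for channel, wt in zip(feature_map, weights)
    ]
    return scaled, weights


# Two 2 x 2 channels; the hypothetical weights reduce 2 channels to 1 and back.
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
scaled, weights = se_block(fmap, w1, w2)
```

The output keeps the shape of the input, and only the relative strength of the channels changes, which is what makes this a channel attention mechanism.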
2.5. Data Set
Tomatoes are one of the most widely grown vegetables in the world. The global area under tomato cultivation is about 5 million hectares, and China is the world's largest tomato producer with an area of about 1.3 million hectares [34]. Tomatoes are not only an important vegetable but also an important raw material for industrial processing. The experimental data used in this paper are tomato pest and disease leaf images selected from Plant Village [35]. The Plant Village dataset contains 54,303 images of healthy and diseased leaves divided into 38 classes, covering a large number of common crops. Table 1 provides a summary of the datasets used in this paper. Four classes of tomato leaf images were taken from the Plant Village dataset, comprising three common diseases and one healthy class, and a total of 4746 images were selected as the training and validation sets. Figure 7 shows examples of the three disease classes, (a) early blight, (b) mosaic virus, and (c) yellow leaf curl virus, together with (d) a healthy tomato leaf.
2.6. Data Set Pre-Processing
In neural network training, preprocessing is the first step before the data enters the network. Preprocessing can reduce the dimensionality and complexity of the input data, thereby speeding up training, reducing the consumption of computational resources, and improving model performance. It can also remove invalid, redundant, and noisy data, making the input more reliable and helping the network to learn better feature representations, which improves generalisation and helps to avoid overfitting. Feature standardisation accelerates convergence and makes different input features comparable, so that no single feature dominates. Resampling can balance the proportion of samples in each category, preventing the network from being biased against categories with few samples. Simplified data can also be trained with a smaller network and less powerful hardware. Preprocessing is therefore an important part of improving network performance and avoiding training problems, and choosing an appropriate preprocessing method is critical.
In our experiments, all original 256 px × 256 px RGB images were resized to 224 px × 224 px three-channel RGB images. Additional image instances were created by the following operations: 10% of the images, chosen at random, were flipped horizontally; 10% were flipped vertically; 10% were rotated 90° to the left; 10% were rotated 90° to the right; 10% were reduced in brightness; and 10% were increased in brightness. Table 2 shows the number of augmented images. Examples of the different pre-processing methods are shown in Figure 8.
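The six augmentation operations listed above are simple array manipulations. The sketch below applies them to a tiny single-channel image stored as a list of rows; the function names and the brightness factors are hypothetical, since the paper does not state how much the brightness was changed, and a real pipeline would operate on three-channel 224 × 224 images.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return img[::-1]


def rotate_left(img):
    """Rotate 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def rotate_right(img):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness(img, factor):
    """Scale pixel values by factor and clip the result to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


img = [[10, 20], [30, 40]]

augmented = {
    "flip_h": flip_horizontal(img),
    "flip_v": flip_vertical(img),
    "rot_left": rotate_left(img),
    "rot_right": rotate_right(img),
    "darker": adjust_brightness(img, 0.5),    # assumed factor
    "brighter": adjust_brightness(img, 1.5),  # assumed factor
}
```

Each operation returns a new image and leaves the original untouched, so the augmented copies can be added to the dataset alongside the source images.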
2.7. Proposed Methodology and Model Parameters
In this paper, we propose CSLSNet, a network model for classifying pests and diseases in the compressed domain. The model is divided into four main parts: preprocessing, data compression, feature extraction, and classification. When the data is loaded into the model, it first passes through the preprocessing part, where the images are resized to a uniform pixel size and then subjected to operations such as random flipping, fixed-angle rotation, and increasing or reducing brightness to enhance the robustness of the model. The data is then normalised to the interval [0, 1] or [−1, 1], which speeds up the convergence of the model and also inhibits overfitting to some extent. Compression is performed by the CS block, a compression module consisting of three sequential 2D convolutional layers. The feature extraction part uses 2D convolution and pooling layers together with the LS block and LSR block proposed in this paper, the details of which are presented in the following subsections. The classification part uses fully connected layers with a ReLU activation function, as well as dropout layers. The overall flow of the proposed CSLSNet model is shown in Figure 9.
2.7.1. Compression Section
In conventional compressed sensing, the measurement matrix is required to be incoherent, random, and sparse. We use sequential convolution to play the role of the measurement matrix. Since convolution can be represented as matrix multiplication, a three-layer convolution can be represented as:
Equation (3) is written in terms of the weights and biases of the ith convolutional layer in the CS block. Each convolution can be expressed as a matrix multiplication, which is a linear representation, and thus the convolution fulfils the role of a measurement matrix. Our CS block differs from the one used in earlier work in two ways. First, the original four sequential convolutional layers are replaced by three sequential 2D convolutional layers. Second, the upsampling layer is removed. Upsampling requires pixel interpolation, which adds computation, so removing it reduces the computational complexity of the model. Upsampling and downsampling operations also interrupt the flow of feature information between network layers, whereas using convolutional layers directly allows a smoother transfer of information. The output tensor of the second convolutional layer, which feeds the third, is denoted Y. The relationship between Y and the sampling rate (SR) is given by the following equation:
Equation (4) involves the number of output channels, a constant, and the sampling rate SR, which is the ratio of the compressed data to the original data.
Figure 10 shows a schematic diagram of the CS block.
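The claim that a convolution can stand in for a measurement matrix rests on the fact that convolution is a matrix multiplication in disguise. The one-dimensional sketch below builds that matrix explicitly and checks that both routes give the same result; it is a simplified illustration (one dimension, no bias, no padding), and the helper names are hypothetical.

```python
def conv1d_valid(signal, kernel):
    """Sliding-window convolution as used in deep learning (no kernel flip, no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv_matrix(kernel, n):
    """Build the (n - k + 1) x n matrix whose product with a signal equals conv1d_valid."""
    k = len(kernel)
    rows = []
    for i in range(n - k + 1):
        row = [0.0] * n
        row[i:i + k] = kernel  # the kernel shifted to position i
        rows.append(row)
    return rows


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

direct = conv1d_valid(signal, kernel)
via_matrix = matvec(conv_matrix(kernel, len(signal)), signal)
```

Because the matrix has fewer rows than columns, the output is shorter than the input, which mirrors the M × N (M < N) shape of a measurement matrix.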
2.7.2. LS Module
In the feature extraction part of the model, several LS blocks are stacked to extract features. Figure 11 shows the proposed LS block structure, which is an improvement on the Inception block. The LS block has four convolutional branches. The first branch has three convolutional layers, a 1 × 1 convolution followed by two 3 × 3 convolutions; the second branch consists of a 1 × 1 convolution and a 3 × 3 convolution; the third branch consists of a 1 × 1 convolution; and the fourth branch consists of a 5 × 5 average pooling layer and a 1 × 1 convolution. We replace the max pooling of the Inception block with average pooling because max pooling retains only the maximum value and may lose background information, while average pooling retains information from the whole receptive field and can provide richer features. Finally, the tensors computed by the branches are concatenated to form a single tensor that is propagated forward.
The LS block mainly uses small convolutional kernels, which reduces the computational effort, improves the width and depth of the network, increases the network’s expressiveness, and simultaneously extracts feature information at several different scales within a module, improving the model’s robustness to scale changes. By stacking the modules, a more complex network structure can be constructed, improving the feature expression capability. This approach is more efficient than simply stacking the network, not only with fewer parameters but also with less computation.
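The difference between max pooling and average pooling that motivates the fourth branch of the LS block can be seen on a small grid. The sketch below uses non-overlapping 2 × 2 windows for brevity, which is an assumption of the example; the LS block itself uses a 5 × 5 average pooling layer, and the helper names here are hypothetical.

```python
def pool2d(grid, size, reduce_fn):
    """Non-overlapping size x size pooling over a 2D list using reduce_fn."""
    out = []
    for r in range(0, len(grid) - size + 1, size):
        out_row = []
        for c in range(0, len(grid[0]) - size + 1, size):
            window = [grid[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


grid = [
    [1, 1, 0, 0],
    [1, 9, 0, 4],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]

max_pooled = pool2d(grid, 2, max)   # keeps only the strongest response per window
avg_pooled = pool2d(grid, 2, mean)  # summarises every value in the window
```

In the top-left window, max pooling reports only the outlier 9, while average pooling reports 3.0 and so still reflects the surrounding low values.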
2.7.3. LSR Module
In the feature extraction phase, in addition to stacking LS blocks, we use a parallel convolutional structure combined with a residual structure, as shown in Figure 12. The LSR block has three branches: one shortcut branch and two convolutional branches. In the residual structure, an identity mapping is added so that the input tensor is summed with the results of the other branches before output. This alleviates the vanishing gradient problem, and through the skip connection the vanishing or explosion of gradients in the deep network can be avoided. Instead of learning a new feature mapping, the residual structure learns the residual function, speeding up network training. Of the two convolutional branches, one uses a 1 × 1 convolution followed by two asymmetric convolution kernels, a 1 × 7 convolution and a 7 × 1 convolution, and the other uses a 3 × 3 convolution kernel. The asymmetric convolution kernels improve the representational power of the model: they capture patterns of the image or feature map along different directions, allowing the model to learn a richer representation of the features. Using a pair of asymmetric kernels also requires fewer parameters and fewer multiplications than a full square kernel covering the same region, thus reducing the amount of computation.
Figure 12 shows that the LSR block incorporates both residual and parallel convolutional structures. The model gains greater expressive power from the employment of asymmetric convolutional kernels. This reduces the number of parameters and computation, thereby improving model efficiency and effectiveness.
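The parameter saving from asymmetric kernels follows directly from the kernel shapes. The sketch below compares a 1 × 7 convolution followed by a 7 × 1 convolution with a single 7 × 7 convolution covering the same receptive field; the channel width of 64 and the helper `kernel_weights` are hypothetical, and biases are ignored.

```python
def kernel_weights(in_ch, out_ch, k_h, k_w):
    """Number of weights in a 2D convolution, ignoring biases."""
    return in_ch * out_ch * k_h * k_w


channels = 64  # hypothetical width, kept constant through the branch

# One full 7 x 7 kernel per input/output channel pair.
full_7x7 = kernel_weights(channels, channels, 7, 7)

# A 1 x 7 kernel followed by a 7 x 1 kernel spans the same 7 x 7 region.
asymmetric = (
    kernel_weights(channels, channels, 1, 7)
    + kernel_weights(channels, channels, 7, 1)
)

ratio = asymmetric / full_7x7  # 14 / 49 of the weights
```

The same ratio applies to the number of multiplications per output position, so the asymmetric pair is cheaper in both memory and computation under these assumptions.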
2.7.4. Classification Module
In the classification section, we use fully connected layers and dropout layers [36]. The layers are arranged in sequence, with three fully connected layers and two dropout layers in total. During training, dropout randomly sets the output of individual neurons to 0 according to the specified dropout rate, thus "eliminating" those neurons. The dropout rate was set to 0.5 for the first dropout layer and 0.2 for the second. Because the dimensionality of the fully connected layers decreases towards the output, the lower rate in the later layer keeps the proportion of deactivated neurons in step with the layer size, which improves the model's performance.
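The behaviour of the two dropout layers can be sketched as follows. This is a minimal standard-library illustration that assumes the "inverted dropout" convention, in which surviving activations are rescaled by 1 / (1 − rate) during training so that no rescaling is needed at inference; the paper does not state which convention it uses, and the function name `dropout` is hypothetical.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 1000

after_first = dropout(activations, 0.5, rng=rng)    # rate of the first dropout layer
after_second = dropout(activations, 0.2, rng=rng)   # rate of the second dropout layer
at_inference = dropout(activations, 0.5, training=False)

dropped_first = after_first.count(0.0) / len(after_first)
dropped_second = after_second.count(0.0) / len(after_second)
```

About half of the activations are zeroed at a rate of 0.5 and about a fifth at 0.2, while at inference time every activation passes through unchanged.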
3. Results
In this section, we compare the parameters of each model included in the comparison, encompassing total and trainable parameters. Additionally, we compare classification accuracy under various sensing rates and evaluate models against traditional compressed image algorithms. We also analyze common attention mechanisms for classification. In line with the given evaluation metrics, our proposed models have emerged as the frontrunners.
3.1. Experimental Environment Parameters
The experiments were run on the Windows 11 operating system with an Intel(R) Core(TM) i7-11800H 2.3 GHz CPU, 32 GB of RAM, and an RTX 3060 6 GB GPU to accelerate training. On the software side, PyCharm Community Edition 2023.2 was used as the development environment with Python 3.10.9, and the model was built with PyTorch 2.0 and CUDA 11.8.
3.1.1. Experimental Parameters
Our experiments were primarily conducted on the Plant Village dataset, which consisted of 4746 images that were chosen for training purposes. The test set comprised a total of 474 images. In the experiment, we defined the SR as follows:
In Equation (5), SR represents the sampling rate, x denotes the image sampled by CS-Block compression, and y stands for the original image data. For this experiment, we selected the following sampling rates: 0.05, 0.1, 0.2, 0.3, 0.4, 0.5.
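To give a sense of scale, the sketch below computes how many values remain per image at each of the selected sampling rates. It assumes that the ratio in Equation (5) is taken between element counts of the compressed and original data and that the original is the 224 × 224 × 3 preprocessed image; both are interpretations made for this example, and `compressed_size` is a hypothetical helper.

```python
def compressed_size(height, width, channels, sampling_rate):
    """Number of values kept when sampling_rate = compressed size / original size."""
    return round(height * width * channels * sampling_rate)


original = 224 * 224 * 3  # values per preprocessed image

sampling_rates = (0.05, 0.1, 0.2, 0.3, 0.4, 0.5)
sizes = {sr: compressed_size(224, 224, 3, sr) for sr in sampling_rates}
```

At the lowest rate of 0.05, for instance, the classifier under this reading sees roughly one twentieth of the values in the original image.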
In the field of machine learning, the loss function is utilised to quantify the disparity between predicted and actual results through a numerical measurement of the discrepancy between the predicted and true values. The training of a model aims to optimise the model by minimising the loss function, allowing for predictions to more closely align with true results. The model’s training effectiveness is determined by the size of the loss function, and a well-performing model reduces the loss function to an adequate extent. Our proposed model utilises the cross-entropy loss function, which presumes a predicted probability distribution p and an authentic probability distribution q. Equation (6) defines the cross-entropy.
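A small numerical sketch may help make the cross-entropy concrete. It uses the common form H = −Σ t_i log(p_i), with t the true distribution and p the predicted one; the function name, the epsilon guard, and the example probabilities are assumptions of this illustration rather than details from the paper.

```python
import math


def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """H = -sum_i true_i * log(pred_i); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_dist, pred_dist))


true_label = [0.0, 1.0, 0.0, 0.0]        # one-hot label over four classes
confident = [0.05, 0.85, 0.05, 0.05]     # most probability on the correct class
uncertain = [0.25, 0.25, 0.25, 0.25]     # uniform guess

loss_confident = cross_entropy(true_label, confident)
loss_uncertain = cross_entropy(true_label, uncertain)
```

The confident, correct prediction yields a much smaller loss than the uniform guess, which is the behaviour that minimising the loss rewards.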
Minimising the loss function requires an optimiser, which iteratively updates the parameters through algorithms such as gradient descent. A well-chosen optimiser reaches the minimum of the loss function faster and more consistently, which in turn accelerates the model’s training. Optimisers differ in their update strategies, for example momentum and adaptive learning rates. The proposed model uses the Adam optimiser with the learning rate set to 0.0001.
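The update rule behind Adam can be sketched for a single scalar parameter as follows. This is a minimal illustration and not the training code used in the experiments. The learning rate of 0.0001 is taken from the text, while β1 = 0.9, β2 = 0.999, and ε = 1e-8 are assumed values (they match the PyTorch defaults), and the quadratic objective is hypothetical.

```python
import math


def adam_minimise(grad_fn, theta: float, steps: int, lr: float = 1e-4,
                  beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Run `steps` Adam updates on a scalar parameter and return the result."""
    m = 0.0  # running mean of gradients (momentum term)
    v = 0.0  # running mean of squared gradients (adaptive scaling term)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


if __name__ == "__main__":
    # Hypothetical objective f(theta) = theta**2 with gradient 2 * theta.
    print(adam_minimise(lambda th: 2.0 * th, theta=1.0, steps=1000))
```

In PyTorch, the corresponding configuration would typically be written as `torch.optim.Adam(model.parameters(), lr=1e-4)`.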
3.1.2. Evaluation Indicators
Commonly used assessment metrics for classification tasks include accuracy, precision, recall, F1 score, and the confusion matrix. We first explain the composition of the confusion matrix, which comprises four elements: TP counts positive samples predicted as positive, FP counts negative samples predicted as positive, TN counts negative samples predicted as negative, and FN counts positive samples predicted as negative. Accuracy is the ratio of correctly classified samples to the total number of samples and indicates the classifier’s overall effectiveness. Precision is the ratio of samples predicted as positive that genuinely belong to the positive class. Recall is the proportion of samples truly belonging to the positive class that are correctly predicted as positive, reflecting the coverage of the classification. The F1 score, a measure for binary classification problems, is the harmonic mean of precision and recall. The formulas for each are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Among the samples that the model predicts as positive, the proportion that are truly positive is the precision:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Among the samples that are truly positive, the proportion that the model predicts correctly is the recall:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1 is computed from precision (P) and recall (R):
$$F1 = \frac{2 \times P \times R}{P + R}$$
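The metrics described in this subsection can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, recall, and F1. The function name and the example counts are hypothetical, and for a multi-class task each class would be treated as "positive" in turn (one-vs-rest).

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for one class treated as "positive" (one-vs-rest).
    print(classification_metrics(tp=80, fp=20, tn=90, fn=10))
```

The zero-denominator guards return 0.0 when a class is never predicted or never present, which is one common convention among several.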
3.2. Comparing Different Model Parameters
In this part of the experiment, we compare the parameters of the various models, including the total number of parameters, the number of trainable parameters, and the forward propagation parameter sizes. The number of parameters reflects a model’s expressiveness and complexity. A larger number of parameters provides greater fitting ability and expressiveness, but it also increases the risk of overfitting. Model sizes typically range from hundreds of thousands to hundreds of millions of parameters, depending on the architecture. As the number of layers and the network width increase, the number of parameters grows substantially, and so does the time required for training. It is therefore necessary to weigh performance against parameter count and to select a model with few parameters that still guarantees performance.
Table 3 shows that our proposed model has 11.3 M parameters, compared with 62.3 M for the standard AlexNet model and 21.7 M and 25.5 M for the deeper ResNet-34 and ResNet-50 models, respectively. In the LS and LSR blocks, we mainly use 1 × 1 and 3 × 3 convolutional kernels. In contrast, networks such as ResNet and AlexNet use large convolutional kernels such as 7 × 7 and 11 × 11, which increases both computation and parameter count. Our model therefore holds a significant advantage in both total and trainable parameters. This reduction leads to faster network training and significant savings in resources. We attribute it to the use of small and asymmetric convolutional kernels, which have considerably fewer parameters than larger kernels.
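The effect of kernel size on parameter count can be checked with a short calculation. The sketch below assumes standard (non-grouped) 2-D convolutions with a bias term, for which the count is k_h × k_w × C_in × C_out + C_out. The channel width of 64 is hypothetical and does not reflect the actual layer configuration of the LS and LSR blocks.

```python
def conv2d_params(k_h: int, k_w: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Learnable parameters of a standard 2-D convolution layer."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


if __name__ == "__main__":
    c = 64  # hypothetical channel width, same for input and output
    print("7x7      :", conv2d_params(7, 7, c, c))
    print("3x3      :", conv2d_params(3, 3, c, c))
    print("1x3 + 3x1:", conv2d_params(1, 3, c, c) + conv2d_params(3, 1, c, c))
    print("1x1      :", conv2d_params(1, 1, c, c))
```

At equal channel width, a 7 × 7 layer carries more than five times the parameters of a 3 × 3 layer, and an asymmetric 1 × 3 plus 3 × 1 pair is smaller again than a single 3 × 3 layer.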
Reduced parameters: our proposed model is smaller and demands fewer computational resources, which makes it faster to train and suitable for devices with limited resources and memory. Nevertheless, evaluating model performance requires consideration of factors beyond size and parameter count, including accuracy, generalisation ability, robustness, and applicability to specific tasks. We present additional performance comparisons in later subsections.
3.3. Comparison of Attention Mechanisms
In this study, the main objective is to compare frequently used attention mechanisms and determine the most appropriate one for this specific model. We selected four attention mechanisms, namely SE, CBAM, ECA, and NAM, for ablation experiments, and ultimately opted for SE as the channel attention mechanism. The study evaluates how the classification performance varies under different sampling rates (SR). The four attention mechanisms are compared using the proposed model, and the experimental data comprise the four classified datasets chosen in this research. Each attention mechanism is integrated into the model immediately before the classification module. We conduct experimental assessments at SR values of 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5.
Table 4 displays the performance comparison for the different attention mechanisms, while Figure 13 illustrates how accuracy compares across different SR values.
Based on the data presented in Table 4, it can be observed that the model attains the highest accuracy with the SE module across all sampling rates. At lower sampling rates, specifically when SR is 0.05 and 0.1, CBAM and ECA perform comparably and slightly worse than SE and NAM. When SR is 0.2, the performance of NAM is nearly as good as that of SE, with a difference of only 0.42% between them. However, at SR values of 0.3, 0.4, and 0.5, the SE module is clearly superior to the other three attention modules. The SE module attains an accuracy of 90.08% at an SR of 0.5, indicating that the SE attention mechanism is the most fitting for our proposed model.
By referring to Figure 13, patterns and results at varying sampling rates can be observed more clearly. The classification accuracy is linearly and positively correlated with the sampling rate; hence, increasing SR leads to higher classification accuracy. Additionally, the experimental results demonstrate that the SE channel attention mechanism is more suitable for the classification task in the compressed domain.
Having selected the SE attention mechanism, it is necessary to compare the effects of adding no SE module, one SE module, two SE modules, and three SE modules. This demonstrates the effectiveness of the attention mechanism and the appropriateness of the number of additions. To this end, we set the sampling rate to 0.05 and conducted experiments under four scenarios: (a) an SE module is added before the classification module; (b) no SE module is added; (c) SE modules are added before the classification module and before the feature extraction module; (d) SE modules are added before the classification module, after the compression module, and before the feature extraction module.
Table 5 compares the results of adding SE modules at different positions and in different numbers.
Table 5 shows that incorporating the SE attention mechanism solely before the classification module leads to enhanced classification performance. The accuracy of (a) is 73.42%, which is 3.38 percentage points higher than that of (b), which reaches 70.04% without any SE attention mechanism. However, adding SE modules before the feature extraction module and after the compression module has less effect and is less satisfactory. It can be deduced that the attention mechanism enhances the model’s performance most when placed between the feature extraction module and the classification module.
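For readers unfamiliar with SE, the channel reweighting it performs can be sketched in plain Python. The code follows the usual squeeze-and-excitation recipe (global average pooling, a reducing layer with ReLU, an expanding layer with sigmoid, then channel-wise scaling) and is not the implementation used in the proposed model. The weights, the reduction ratio, and the tiny feature map are hypothetical, and biases are omitted for brevity.

```python
import math


def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation over a feature map given as [channel][row][col].

    w1 has shape (C // r, C) and w2 has shape (C, C // r); both are plain
    nested lists standing in for the two fully connected layers (no bias).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation, step 1: reduce dimensionality and apply ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: restore dimensionality and apply sigmoid gating.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every channel by its gate in (0, 1).
    scaled = [[[g * x for x in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates


if __name__ == "__main__":
    # Hypothetical 2-channel, 2 x 2 feature map and hand-picked weights (r = 2).
    fmap = [[[1.0, 1.0], [1.0, 1.0]],
            [[4.0, 4.0], [4.0, 4.0]]]
    w1 = [[0.5, 0.5]]      # shape (1, 2)
    w2 = [[-1.0], [1.0]]   # shape (2, 1)
    _, gates = se_block(fmap, w1, w2)
    print(gates)  # the second channel receives the larger weight
```

Because the gates depend only on per-channel statistics, the block adds few parameters, which is consistent with placing a single SE module before the classification module.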
3.4. Comparison of Activation Functions
The activation function is a fundamental element of neural networks, and the choice of activation function can notably influence a model’s fitting ability and computational efficiency. Comparative experiments make the differences between frequently used activation functions visible. To determine the most fitting activation function for our model, we tested five of them in this part of our experiments: ReLU, LeakyReLU, SiLU, GELU, and Tanh. To ensure a fair comparison, we kept the sampling rate, SR, fixed throughout each experiment. The experimental results obtained with the sampling rate set to 0.05 are presented in Table 6.
Table 6 displays the correct identifications for each category, alongside the associated Accuracy, Recall, and F1-Score metrics. The ReLU activation function performs best for our model at SR = 0.05. A likely reason is that ReLU produces sparse neuron activations, with the majority outputting 0, which reduces parameter interdependence and, in turn, mitigates the risk of overfitting.
Figure 14 provides a comparison of the outcomes achieved by using different activation functions at a sampling rate of 0.05.
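The five activation functions compared here can be written out in a few lines, which also makes the sparsity argument for ReLU easy to check. The definitions below are the standard ones. The negative slope of 0.01 for LeakyReLU and the exact erf-based form of GELU are assumptions (they match the PyTorch defaults), and the sample inputs are hypothetical.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    return x if x >= 0.0 else negative_slope * x


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))  # exact (erf) form


ACTIVATIONS = {"ReLU": relu, "LeakyReLU": leaky_relu, "SiLU": silu,
               "GELU": gelu, "Tanh": math.tanh}

if __name__ == "__main__":
    inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]  # hypothetical pre-activations
    for name, fn in ACTIVATIONS.items():
        outputs = [fn(x) for x in inputs]
        zeros = sum(1 for y in outputs if y == 0.0)
        print(f"{name:10s} exact zeros: {zeros}  outputs: {[round(y, 4) for y in outputs]}")
```

Only ReLU maps every negative input to exactly zero; the other four pass a small non-zero value through, so their activations are not sparse in the same sense.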
The confusion matrix, depicted in Figure 15, provides a clear visual representation of the classification results of each experiment for every class. The X and Y axes of the matrix correspond to the true and predicted values, respectively, while the diagonal entries indicate the number of correctly identified samples in each class.
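A confusion matrix of this kind can be built from two label lists, as in the minimal sketch below. The indexing convention (first index for the true class, second for the predicted class) is an assumption that may be transposed relative to the plotted figure, and the labels are hypothetical.

```python
def confusion_matrix(y_true: list[int], y_pred: list[int], num_classes: int) -> list[list[int]]:
    """matrix[i][j] counts samples whose true class is i and predicted class is j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix: list[list[int]]) -> list[float]:
    """Diagonal entry divided by the row sum (all samples of that true class)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


if __name__ == "__main__":
    # Hypothetical labels for a 3-class problem.
    y_true = [0, 0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
    cm = confusion_matrix(y_true, y_pred, num_classes=3)
    for row in cm:
        print(row)
    print("correct:", sum(cm[i][i] for i in range(3)), "of", len(y_true))
    print("recall :", per_class_recall(cm))
```

Off-diagonal entries show which classes are confused with one another, which the aggregate accuracy alone does not reveal.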
3.5. Comparison between Different Models
In this part of the experiment, we compare our proposed model with the AlexNet, CSBNet, and GoogLeNet models. To maintain consistent experimental conditions, the CS-Block module was integrated into each model. We set the number of epochs for each network to 50, the learning rate to 0.0001, and the sampling rates to 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5. The experiments were conducted on the identical dataset, and Section 3.1 of this paper describes the hardware and software parameters employed.
Table 7 presents the classification results for each participating model, while Figure 16 illustrates the trend of accuracy at a sampling rate of 0.05. Figure 17 depicts the loss trend for each model at a sampling rate of 0.05.
Table 7 displays the classification results of all models in the compressed domain. Our proposed model, CSLSNet, outperforms all others at every sampling rate. At a sampling rate of 0.05, CSLSNet achieves the highest scores for tomato disease leaf classification, with 73.42% accuracy, 75.75% recall, and a 74.57% F1-Score. Performance remains strong at high sampling rates: at a sampling rate of 0.5, the accuracy reaches a maximum of 90.08%, outperforming the other models by up to 11.81%. This superiority can be attributed in part to the ablation experiments on the attention mechanism, which determined the optimal locations and numbers of attention modules to deploy. Models with attention mechanisms prove adept at capturing useful inter-channel information that enhances task accuracy.
By analysing Figure 16 and Figure 17, it is evident that the model converges rapidly after 10 epochs and that the curve of our proposed model is smoother. The loss and accuracy curves effectively reflect the expressive and fitting ability of a model. A smoother curve, with less variation between epochs, indicates more stable training.
3.6. Comparison of the Performance of Traditional Compression Algorithms and Model Classification
In this subsection, we compare the performance of contemporary network models when classifying data in the compressed domain, using images compressed with the Discrete Cosine Transform (DCT) and Singular Value Decomposition (SVD) algorithms. DCT is a standard transform in signal processing and image compression that decomposes a signal into frequency components. It concentrates most of the signal energy in a handful of low-frequency components, which allows the signal to be represented and compressed efficiently and makes it a foundation for lossless and lossy compression methods. SVD is a crucial matrix decomposition technique that factorises a matrix A into three matrices, $A = U \Sigma V^{T}$. By retaining only the singular vectors that correspond to the K largest singular values, SVD can be used for various operations, including data dimensionality reduction.
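The energy-compaction property of the DCT can be illustrated with a one-dimensional sketch. The code implements the orthonormal DCT-II and its inverse in plain Python, zeroes all but the lowest-frequency coefficients, and reconstructs the signal. It works on a single row for brevity (the 2-D transform used for images applies the same operation along rows and then columns). The pixel values and the choice of how many coefficients to keep are hypothetical and are not tied to the SR settings of the experiments.

```python
import math


def dct(signal: list[float]) -> list[float]:
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, x in enumerate(signal)))
    return out


def idct(coeffs: list[float]) -> list[float]:
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out


def compress(signal: list[float], keep: int) -> list[float]:
    """Zero all but the first `keep` (lowest-frequency) coefficients, then invert."""
    coeffs = dct(signal)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    return idct(truncated)


if __name__ == "__main__":
    row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]  # hypothetical pixel row
    print([round(v, 1) for v in compress(row, keep=4)])  # keep half the coefficients
```

Smooth signals survive aggressive truncation well, whereas fine detail lives in the discarded high-frequency coefficients, which is the loss of detail that the comparison with CS-Block examines.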
Table 8 provides a comparison of DCT results at different SR values, while Table 9 shows a comparison of SVD outcomes under different SR values. Figure 18 displays visual representations of our proposed CS-Block and of the DCT and SVD algorithms at each sampling rate.
Upon observation of Figure 18 and Figure 19, it is evident that, compared with the traditional DCT method, CS-Block preserves a higher level of detail when compressing data. This effect is particularly notable at low sampling rates, where CS-Block maintains detailed features with greater clarity. Retaining more detail during compression is valuable because it provides richer information for model training and improves the performance of classification models. Our main focus is therefore on classification performance at low sampling rates, namely SR values of 0.05, 0.1, 0.2, and 0.3.
Table 8 illustrates a comparison of classification outcomes using DCT algorithm images at different sampling rates, while Table 9 shows a comparison of classification outcomes using SVD algorithm images at different sampling rates.
By examining the comprehensive data in Table 8 and Table 9, it is evident that CSLSNet holds a clear performance advantage in low-sampling-rate image classification under identical sampling rate conditions. The crux of CSLSNet is the combination of parallel convolution and a residual strategy. Parallel convolution widens the network so that it can absorb more intricate details, while the residual structures retain more of the initial data. With this strategy, the model can capture important local information effectively, leading to more precise classification outcomes. Additionally, the classification module of CSLSNet incorporates the channel attention mechanism, which enhances its ability to capture the image’s global information and further improves classification performance. The comparison results of the various models in the DCT and SVD compression domains are depicted in Figure 20.
Experimental findings demonstrate that our proposed CS-Block, a method for reducing image dimensionality, preserves more details at lower sampling rates. Our model also achieves superior performance when conventional dimensionality-reduction methods are used for compression. The residual structure and asymmetric convolution in the LSR block fuse the original data with the extracted data, thereby retaining more image features. Models without residual structures or asymmetric convolution perform worse than ours in the DCT and SVD classification results. Our proposed model attains accuracies of 93.25% and 89.45% for DCT and SVD, respectively, and exhibits superior results.