Article

Development of a Lightweight Model for Rice Plant Counting and Localization Using UAV-Captured RGB Imagery

1 College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
2 YueLuShan Laboratory, Changsha 410128, China
3 Hunan Rice Research Institute, Changsha 410125, China
* Authors to whom correspondence should be addressed.
Agriculture 2025, 15(2), 122; https://doi.org/10.3390/agriculture15020122
Submission received: 20 December 2024 / Revised: 3 January 2025 / Accepted: 6 January 2025 / Published: 8 January 2025
(This article belongs to the Special Issue Application of UAVs in Precision Agriculture—2nd Edition)

Abstract

Accurately obtaining both the number and the location of rice plants plays a critical role in agricultural applications, such as precision fertilization and yield prediction. With the rapid development of deep learning, numerous models for plant counting have been proposed. However, many of these models contain a large number of parameters, making them unsuitable for deployment in agricultural settings with limited computational resources. To address this challenge, we propose a novel pruning method, Cosine Norm Fusion (CNF), and a lightweight feature fusion technique, the Depth Attention Fusion Module (DAFM). Based on these innovations, we modify the existing P2PNet network to create P2P-CNF, a lightweight model for rice plant counting. The process begins with pruning the trained network using CNF, followed by the integration of our lightweight feature fusion module, DAFM. To validate the effectiveness of our method, we conducted experiments using rice datasets, including the RSC-UAV dataset, captured by UAV. The results demonstrate that our method achieves an MAE of 3.12 and an RMSE of 4.12 while utilizing only 33% of the original network parameters. We also evaluated our method on other plant counting datasets, and the results show that our method achieves high counting accuracy while maintaining a lightweight architecture.

1. Introduction

Rice is a pivotal food crop globally, intimately associated with both food security and economic stability [1]. Research on rice is imperative for enhancing agricultural productivity and advancing technological innovations. Accurately counting rice seedlings is essential for devising subsequent management strategies such as transplantation and the administration of water and fertilizers [2]. Such counts enable assessments of density, effectiveness, and overall growth quality within rice fields [3]. Furthermore, early plant counting facilitates the monitoring of genetic variations across fields, which is particularly crucial in large-scale breeding experiments [4]. Currently, rice counting predominantly relies on manual sampling and statistics [5], involving the selection of specific plots for sampling. This traditional method, however, is labor-intensive, time-consuming, and often yields results that are not representative of the entire field. Manual counting is prone to human error and can cause damage to the rice field environment. An optimal solution would be the development of a model capable of automatically counting rice plants, thereby enhancing accuracy and efficiency.
In recent years, the advancement of deep learning technologies has significantly enhanced the development of image-based plant counting models. Predominantly spearheaded by convolutional neural networks, these deep learning models have catalyzed substantial technological advancements in the field of plant counting. Unlike traditional machine learning models, which rely on manually selected phenotypic features such as color and texture, deep learning models leverage their robust feature extraction capabilities. This allows them to effectively differentiate between complex field environments and plant phenotypes, leading to impressive performance across various scenarios. Similar to applications in crowd and vehicle counting, plant counting models predominantly utilize density maps and regression boxes to quantify counts accurately.
Motivation. Although significant advancements have been made in models for counting rice, several unresolved issues persist. First, many existing plant counting models are characterized by excessively large parameters, necessitating substantial computational resources. This requirement renders them impractical in resource-constrained agricultural settings, particularly when deployed on edge devices with limited computing capabilities. Furthermore, prevalent methods such as density map [6,7] approaches rely on per-pixel regression calculations, which fail to provide specific location information for individual plants. This limitation hampers detailed plant-level analysis, which is critical for precise agricultural interventions. Finally, the variability in plant morphology within rice fields, especially during different growth stages, poses additional challenges. Different rice varieties exhibit distinct phenotypic traits that can significantly impact the accuracy of counting and localization methods.
The advent of P2PNet [8] signifies a pivotal shift from conventional counting methods. This point-to-point architecture discards traditional density maps and detection-based techniques; hence, it has been adopted as the baseline model. Works such as those by [9,10] have demonstrated effective adaptations of P2PNet for specialized plant counting applications. Despite these advancements, the augmented versions of P2PNet remain overly cumbersome for deployment in environments with limited computational resources. Moreover, the feature fusion module of P2PNet, which solely merges mid-level and deep-layer features, is insufficient for adequately capturing the varied plant characteristics present in complex rice field environments. This limitation underscores the need for a more comprehensive approach to feature integration to enhance the model’s applicability and efficiency.
Contributions. To address the above issues, this work presents three principal contributions:
  • First, this paper introduces P2P-CNF, a lightweight rice counting model optimized for resource-constrained agricultural settings. Its reduced parameter count makes it well suited to counting and localization tasks and facilitates efficient deployment in fields where computational resources are limited.
  • Second, a novel pruning method, Cosine-Norm Fusion (CNF), is proposed, designed to decrease the model’s parameter count while retaining maximal information from the original model. This pruning technique enables the model to maintain its performance integrity, ensuring minimal loss in functional efficacy.
  • Lastly, the Depth-Attentive Fusion Module (DAFM), a lightweight feature fusion module that significantly enhances the model’s overall performance, is developed. The DAFM utilizes depthwise separable convolution combined with a linear self-attention mechanism to achieve superior results while maintaining a low-parameter footprint.

2. Related Work

This section reviews related work from two perspectives: network pruning and plant counting.

2.1. Network Pruning

In recent years, the relentless advancement of computational capabilities has catalyzed significant developments in deep learning across diverse domains [11], encompassing computer vision [12,13,14,15,16] and natural language processing [17,18], resulting in groundbreaking achievements. Notably, modern visual models such as vision transformers possess parameter counts reaching up to 632 million, while traditional convolutional neural networks (CNNs) like ResNet-150 and VGG16 have parameter counts of 230 million and 528 million, respectively [19,20,21,22,23,24]. These high parameter counts significantly elevate computational costs, thereby necessitating advanced hardware for deployment.
To address these challenges, substantial efforts have been devoted to refining models through various optimization techniques, with network pruning emerging as a particularly effective strategy [25,26]. Network pruning systematically eliminates weights and parameters that minimally impact the model’s output, thereby minimizing performance degradation. This process is primarily categorized into structured [27,28,29] and unstructured pruning [30,31]. Unstructured pruning [30,32,33,34] aims at meticulously optimizing the model by removing specific parameters or connections, although it often requires specialized hardware for full effectiveness. Conversely, structured pruning [28,35,36] targets entire neurons or channels, preserving the network’s architectural framework and facilitating deployment on standard hardware without the need for specialized support. This dichotomy in pruning methodologies underscores a pivotal aspect of contemporary deep learning—balancing model complexity with computational efficiency, which is crucial for practical deployment across varied platforms.
Significant strides have been made in the domain of network pruning, particularly in the methodologies for pruning and merging filters within convolutional neural networks. He et al. [37] investigated the redundancy of filters based on their relationships within the same layer, subsequently devising the Filter Pruning via Geometric Median (FPGM) approach to efficiently eliminate these redundancies. Concurrently, Wang et al. [38] developed a correlation-based pruning algorithm (COP), which enhances model simplicity by normalizing and comparing filters at various levels to a universal scale.
The incorporation of batch normalization (BN) layers [39], which are prevalent in contemporary CNNs, has further influenced pruning strategies. Liu et al. [40] utilized the gamma parameter in BN layers as a scaling factor, integrating the L1 norm into the loss function to induce sparsity and selectively prune channels with minimal gamma values. Expanding upon this framework, You et al. [41] introduced the Tick-Tock approach, applying Taylor expansion to evaluate the significance of gamma values in BN layers and facilitating the removal of inconsequential channels. Furthermore, Zhuang et al. [42] proposed polarization regularization (PR), akin to the method by [40], but with a nuanced approach that avoids indiscriminate sparsity by only targeting less critical channels and amplifying the gamma values of vital channels. An additional penalty term was incorporated to refine the efficacy of the L1 norm. Similarly, structured channel pruning (SCP) [43] methodologically integrates the pruning process with BN layers and ReLU activation functions, illustrating an advanced technique to optimize network efficiency and functionality.
These innovations underscore a trend toward sophisticated, adaptive pruning techniques that not only reduce computational overhead but also maintain or enhance model performance across diverse operational contexts.

2.2. Plant Counting

The counting and localization of plants have been central areas of research within agricultural studies, given their critical importance for both agricultural experimental and production processes [44,45]. Earlier approaches, constrained by limited computational power, predominantly relied on machine learning techniques to achieve plant counting [46,47,48]. These methods heavily depended on manually extracting the phenotypic features of plants, necessitating considerable feature engineering, which ultimately limited the robustness of these models in complex, real-world environments.
With the advent of deep learning, significant progress has been made in overcoming these limitations. There are currently two primary categories of plant counting methodologies: regression-box-based detection algorithms [49] and density-map-based algorithms [50,51]. Thanks to advancements in object detection techniques such as the YOLO series [15] and Faster R-CNN [16], algorithms for plant counting and localization have seen substantial breakthroughs. For example, Buzzy et al. [52] successfully implemented detection and recognition functionalities for tree leaves using YOLO v3, while Wang et al. [53] developed a flower detection and counting system by integrating the Ghost module and P2 detection head into YOLOv8. Similarly, Zhang et al. [54] used UAV-captured images of dense holly trees and constructed a YOLOX network, which demonstrated robust performance across different scales and scenes. Chen [55] extended this work by employing drones to capture sorghum seedlings at varying heights, validating the counting results using support vector machines (SVM) alongside YOLOv5 and YOLOv8, finding YOLOv5 to achieve the highest accuracy.
In parallel, density-map-based algorithms have also been widely adopted in plant counting applications. The TasselNet series [56,57,58] effectively utilized density maps in conjunction with regression techniques to accurately count maize tassels. Bai et al. [6,7] introduced RiceNet and RPNet, leveraging attention mechanisms and positive-negative loss to count rice plants in complex field conditions. In addition, Wu et al. [59] combined UAV imagery with fully convolutional networks and density maps to count rice seedlings, providing a promising approach for yield estimation. Liu et al. [60] proposed IntegrateNet, a novel architecture employing locally supervised density maps and local counting, which demonstrated significant improvements in maize counting accuracy.

3. Materials and Methods

3.1. UAV-Based RGB Image Collection

In this study, UAV-based RGB imagery was employed to monitor rice growth. The UAV flights were conducted over a rice field located in Changsha, Hunan Province, China, characterized by a subtropical monsoon climate. Figure 1 provides a detailed overview of the geographical location, the equipment utilized, and the collected imagery. Data collection was carried out from May 2024, utilizing a DJI M300 RTK drone equipped with a DJI P1 camera. Flight operations were scheduled between 8:00 a.m. and 10:00 a.m. daily to ensure consistent, sufficient natural lighting. The UAV was operated at an altitude of 15 m and a speed of 3 m per second. The onboard RGB camera, positioned perpendicularly to the ground, captured high-resolution images with a focal length of 35 mm. Image acquisition was strategically planned to achieve an 80% frontal overlap and a 70% side overlap among consecutive images, ensuring comprehensive coverage of the surveyed area. The resultant images had a resolution of 8192 × 5460 pixels, facilitating detailed analysis of the crop's developmental stages.

3.2. Rice Plant Counting Dataset Collected by UAV

To address the computational challenges posed by the high-resolution raw images obtained via UAV, the images were processed to isolate specific stages of rice tillering. Each image was systematically cropped to a uniform size of 1400 × 1400 pixels, ensuring a focused dataset suitable for rice counting tasks. This curated dataset, designated as the RSC-UAV (Rice Seedling Counting-UAV) dataset, includes 401 high-quality images of rice plants at various tillering stages, as illustrated in Figure 2. Each image was meticulously annotated using the LabelMe (3.11.2) software to identify the center point of individual rice plants. The annotation process involved manually marking the center point of each rice seedling in the image by visual inspection, ensuring accuracy in the placement of the annotation. These center points serve as the ground truth for training and evaluating the rice counting model. Ground truth refers to the manually labeled data that represent the correct answer for the task at hand; in this case, it consists of the accurately marked center points of each rice plant in the image. It is used to assess the performance of the model by comparing its predictions with these manually annotated coordinates. The dataset was split into a training set comprising 280 images, which collectively contain 49,428 annotated center points, with each image featuring between 43 and 212 rice plants. The testing set consists of 121 images with a total of 20,340 center points, with each image featuring between 67 and 217 rice plants. Figure 3 illustrates the distribution of the training and testing sets, providing a visual overview of the dataset's composition and the density of the rice plant annotations.
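As a minimal sketch of how such point annotations can be read into ground-truth center points, the snippet below parses LabelMe-style JSON files containing point shapes; the directory path in the usage example is a hypothetical placeholder, not part of the released dataset.

```python
import json
from pathlib import Path

def load_center_points(annotation_path):
    """Read a LabelMe-style JSON file and return the annotated center points.

    Assumes each rice plant is stored as a shape of type "point" whose
    "points" field holds a single (x, y) coordinate, as produced when
    annotating center points in LabelMe.
    """
    with open(annotation_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    centers = []
    for shape in data.get("shapes", []):
        if shape.get("shape_type") == "point":
            x, y = shape["points"][0]
            centers.append((float(x), float(y)))
    return centers

if __name__ == "__main__":
    # Example usage: count ground-truth plants per image in a hypothetical train folder.
    for json_file in sorted(Path("RSC-UAV/train").glob("*.json")):
        points = load_center_points(json_file)
        print(f"{json_file.name}: {len(points)} rice plants")
```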

3.3. Other Datasets

URC Dataset: The URC (UAV-based Rice Plant Counting) dataset, collected from 2018 to 2019, is a specialized dataset for rice plant counting [6]. It comprises 355 original high-resolution images, each with dimensions of 5472 by 3648 pixels. Within each image, rice plants are annotated at their center points. The dataset is divided into a training set of 246 images and a test set of 109 images, with individual images containing between 84 and 1125 rice plants.
WED Dataset: The WED (Wheat Ear Detection) dataset, proposed by Madec et al. [4], comprises images each of 6000 by 4000 resolution, featuring over 20 different wheat genotypes. Each image typically contains between 80 and 170 wheat ears. Due to its bounding box annotations, we converted these to center point annotations to better suit our methodology.
MTC Dataset: The MTC (Maize Tassel Counting) dataset, proposed by Lu et al. [56], is dedicated to corn tassel counting. It encompasses images sourced from four experimental fields over the period from 2010 to 2015, totaling 361 images divided into 186 for training and 175 for testing. The original images have resolutions of 3648 × 2736, 4272 × 2848, or 3456 × 2304.

3.4. Method

3.4.1. Problem Definition

Let $I$ denote a rice field image containing $N$ seedlings, where $r_i = (x_i, y_i)$, $i \in \{1, \ldots, N\}$, represents the center coordinates of the $i$-th rice seedling in the image. The complete set of all center points is denoted by $R = \{r_i \mid i \in \{1, \ldots, N\}\}$. Given a rice seedling counting model $f(I; \theta_f)$, the objective of rice seedling counting and localization is to identify and count all seedlings within the image, expressed as $\{\hat{R}, \hat{P}\} = f(I; \theta_f)$. In this formulation, $\hat{R} = \{\hat{r}_j \mid j \in \{1, \ldots, M\}\}$ and $\hat{P} = \{\hat{p}_j \mid j \in \{1, \ldots, M\}\}$ denote the predicted center coordinates of the seedlings and their associated confidence scores, respectively. Here, $\theta_f$ represents the learnable parameter vector of the model, and $M$ is the number of predicted seedlings.
In this study, our objective is to make the model $f(\cdot; \theta_f)$ more lightweight by employing pruning strategies, specifically focusing on reducing the number of parameters and computational costs while maintaining performance in counting and locating rice seedlings. By introducing pruning techniques, we aim to optimize $\theta_f$ such that the model becomes more efficient and deployable, particularly in resource-constrained agricultural settings.
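To make this formulation concrete, the following minimal sketch illustrates the model interface described above; the array shapes, the confidence threshold, and the `model` callable are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def count_and_localize(model, image, conf_thresh=0.5):
    """Apply a point-based counting model f(I; theta_f) to one RGB image.

    `model(image)` is assumed to return predicted center coordinates
    R_hat with shape (M, 2) and confidence scores P_hat with shape (M,).
    Points above the (illustrative) confidence threshold are kept, and the
    predicted plant count is simply the number of retained points.
    """
    r_hat, p_hat = model(image)          # R_hat: (M, 2), P_hat: (M,)
    keep = p_hat >= conf_thresh
    return r_hat[keep], int(keep.sum())  # retained centers and predicted count
```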

3.4.2. P2P-CNF

P2PNet, a novel crowd counting network developed by Tencent YouTu Laboratory, departs from traditional detection algorithms reliant on density maps and regression boxes. It employs a point-to-point network architecture that simplifies the detection process by directly outputting a set of point coordinates representing the centers of objects. The network architecture of P2PNet comprises three principal components: a feature extractor, a feature fusion module, and a dual-branch prediction head. The feature extractor uses the VGG16-BN backbone to derive feature representations from the input image. In the subsequent stage, the feature fusion module integrates these features at both deep and middle levels, channeling them towards the predictive heads: one for regression, which determines the coordinates of targets, and one for classification, which assesses their confidence levels.
P2P-CNF is an advanced lightweight model specifically developed for rice counting, designed for implementation in resource-constrained agricultural settings while preserving optimal model performance. To achieve a balance between model efficiency and accuracy in rice plant counting, several improvements have been integrated into the model:
  • Backbone Network Pruning: We introduce Cosine-Norm Fusion (CNF), a pruning methodology that minimizes information loss, ensuring the model remains lightweight while retaining high accuracy for rice counting.
  • Lightweight Feature Fusion Module: The upgraded feature fusion module DAFM significantly cuts down the number of parameters, enhancing the model’s capability to capture and integrate multi-layered information effectively. This allows for improved recognition of rice plants in diverse field conditions.
The architecture of the P2P-CNF is depicted in Figure 4. Initially, the original P2P network is trained, followed by the application of our CNF method to prune this trained network. After a fine-tuning process, a more compact and efficient network is obtained. The process flowchart is illustrated in Figure 5.
This reduction significantly compacts the network's volume, resulting in a more concise representation of information. The refined feature maps $f_1$ to $f_4$ are then fed into our Depth-Attentive Fusion Module (DAFM). Within the DAFM, features are progressively fused from deeper to shallower layers. Prior to each fusion step, a linear attention mechanism is applied to each layer to enhance the extraction of effective information, thereby optimizing the utilization of multi-scale feature layers. The fused multi-scale features are then forwarded to execute regression and classification tasks at the prediction head. During the prediction process, the Hungarian algorithm is employed to match predicted points with ground-truth points one-to-one, following the definitions in Section 3.4.1. During this phase, an $N \times M$ matrix is generated, where each matrix element $e_{i,j}$ is defined as follows:
$$e_{i,j} = \epsilon \left\| r_i - \hat{r}_j \right\|_2 - \hat{p}_j$$
In this formulation, $\epsilon$ serves as a weighting factor that modulates the influence of the L2 distance, and $\hat{p}_j$ denotes the confidence score associated with the predicted position $\hat{r}_j$. Subsequently, the optimal one-to-one matching is obtained from the values in this matrix.
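As a minimal illustration of this matching step (a sketch assuming NumPy arrays and SciPy's Hungarian solver; the weighting factor value and function names are placeholders rather than the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(gt_points, pred_points, pred_scores, eps=0.05):
    """One-to-one matching of ground-truth and predicted plant centers.

    Builds the N x M cost matrix e[i, j] = eps * ||r_i - r_hat_j||_2 - p_hat_j
    and solves the assignment with the Hungarian algorithm.
    """
    diff = gt_points[:, None, :] - pred_points[None, :, :]  # (N, M, 2)
    dist = np.linalg.norm(diff, axis=-1)                     # (N, M) L2 distances
    cost = eps * dist - pred_scores[None, :]                 # lower cost = better match
    gt_idx, pred_idx = linear_sum_assignment(cost)           # Hungarian assignment
    return list(zip(gt_idx.tolist(), pred_idx.tolist()))
```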
Owing to the adoption of our CNF pruning method, an L1 norm is employed to induce sparsity in the batch normalization (BN) layers. This results in the following loss function:
$$\mathcal{L} = \frac{1}{N} \left( \lambda_1 \sum_{i=1}^{N} \mathcal{L}_{\mathrm{loc}}(r_i, \hat{r}_i) + \sum_{i=1}^{N} \mathcal{L}_{\mathrm{cls}}(\hat{p}_i) \right) + \lambda_2 \sum_{\gamma \in \mathrm{BN}} \left\| \gamma \right\|_1$$
where $\mathcal{L}_{\mathrm{loc}}$ is the regression loss calculated using the Euclidean distance between the actual and predicted positions, and $\mathcal{L}_{\mathrm{cls}}$ represents the classification loss, modeled by cross-entropy. The term $\left\| \gamma \right\|_1$ imposes a sparsity penalty on the $\gamma$ coefficients within the BN layers, weighted by $\lambda_2$.
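A hedged sketch of how such a loss could be assembled in PyTorch is given below; the coefficient values are illustrative placeholders, not the paper's settings.

```python
import torch.nn as nn

def bn_sparsity_penalty(model):
    """Sum of |gamma| over all BatchNorm2d layers (the L1 sparsity term)."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # m.weight is the gamma vector
    return penalty

def total_loss(loc_loss_sum, cls_loss_sum, model, n, lambda1=2e-4, lambda2=1e-4):
    """Combine localization, classification, and BN-sparsity terms as in the equation above."""
    return (lambda1 * loc_loss_sum + cls_loss_sum) / n + lambda2 * bn_sparsity_penalty(model)
```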

3.4.3. Cosine-Norm Fusion

This study introduces Cosine-Norm Fusion (CNF), a novel approach designed for pruning to substantially reduce the number of parameters in neural networks, as shown in Figure 6. Inspired by Liu et al.’s [40] application of scaling factors for channel sparsity and Kim et al.’s [61] dynamic token fusion to optimize transformer architectures, CNF integrates channel-level sparsity with token fusion processes within convolutional neural networks. This method strategically employs norm fusion during the pruning process to minimize information loss, thereby maintaining computational efficiency without significantly compromising accuracy. CNF addresses the enduring challenge of balancing performance with model compactness, proposing a solution that is both innovative and practical for advanced neural network applications.
CNF follows the pruning process established by [40], integrating both the L1 norm and the scaling factor $\gamma$ from batch normalization (BN) layers as regularization terms in the loss function. Leveraging the rapid convergence and strong generalization capabilities of the BN layer, the pruning is optimized by emphasizing unimportant channels. In the BN layer, given the input $x_{in}$, the output $x_{out}$, and a mini-batch $\mathcal{B}$, the transformation is defined as follows:
$$\hat{x} = \frac{x_{in} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}; \qquad x_{out} = \gamma \hat{x} + \beta$$
With this, our optimization objective is defined as follows:
$$L = \sum_{(x, y)} l\left( f(x, W), y \right) + \lambda \sum_{\gamma \in \Gamma} p(\gamma)$$
where $x$ and $y$, respectively, represent the input and target of the network, $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation of the input activation values over the mini-batch $\mathcal{B}$, and $W$ represents the weights of the network. The function $p(\cdot)$ denotes the L1 norm used as a penalty term, and subgradient descent is applied to sparsify the scaling factor $\gamma$. The parameter $\lambda$ is a balancing factor between the main loss and regularization terms, while $\gamma$ and $\beta$ are trainable parameters in the batch normalization layer. To achieve a more compact network, channel pruning is applied based on a global threshold across all layers. Specifically, the threshold is defined as a certain percentile of all scaling factor values in the network. For example, by setting the percentile threshold to 50%, the 50% of channels with the lowest scaling factors are pruned. This approach reduces the number of parameters, the run-time memory requirements, and the computational overhead, resulting in a streamlined model optimized for efficiency.
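The global-percentile channel selection described above can be sketched as follows (an illustrative PyTorch snippet assuming a model built from standard BatchNorm2d layers, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

def bn_channel_keep_masks(model, prune_ratio=0.6):
    """Select channels to keep using a single global percentile of BN scaling factors.

    Collects |gamma| from every BatchNorm2d layer, takes the `prune_ratio`
    quantile as a global threshold (e.g., 0.6 prunes the 60% of channels with
    the smallest scaling factors), and returns a per-layer boolean keep-mask.
    """
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```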
The cosine similarity between the pruned channels and the retained channels at the same network level is then iteratively computed to identify pairs for merging. The merging method proposed by [61] is adopted, with the distinction that we introduced a weight for each merged channel. To facilitate this, a bipartite graph is constructed where edges represent the similarity weights between pruned and retained channels. Using the Hungarian algorithm, the channel pairs with the highest similarity weights, indicating the most similar channels, are determined, and subsequently, norm merging is performed. This process is defined by the following operations:
$$\bar{w} = \frac{\mathrm{similarity} \cdot w_{\mathrm{pruned}} + w_{\mathrm{remaining}}}{2}$$
$$w_{\mathrm{new}} = \frac{\bar{w}}{\left\| \bar{w} \right\|} \cdot \left\| w_{\mathrm{remaining}} \right\|$$
Here, $\bar{w}$ represents the intermediate merged weight, calculated as the average of the similarity-weighted pruned channel and the retained channel. The final weight $w_{\mathrm{new}}$ is then normalized to match the magnitude of $w_{\mathrm{remaining}}$, preserving the scale of the retained channel while incorporating information from the pruned channel.
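A compact sketch of this pairing-and-merging step is given below (NumPy and SciPy, operating on flattened per-channel filter weights; the bipartite pairing via `linear_sum_assignment` and the rescaling follow the two equations above, but variable names and shapes are illustrative assumptions rather than the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_norm_fusion(pruned_w, retained_w):
    """Merge pruned filters into their most similar retained filters.

    pruned_w:   (P, D) pruned filter weights, one flattened filter per row
    retained_w: (R, D) retained filter weights, one flattened filter per row
    Pairs channels by cosine similarity (Hungarian assignment on the
    similarity matrix), averages each pair weighted by that similarity, then
    rescales the result to the norm of the retained filter.
    """
    pn = pruned_w / (np.linalg.norm(pruned_w, axis=1, keepdims=True) + 1e-8)
    rn = retained_w / (np.linalg.norm(retained_w, axis=1, keepdims=True) + 1e-8)
    sim = pn @ rn.T                                    # (P, R) cosine similarities
    p_idx, r_idx = linear_sum_assignment(sim, maximize=True)
    merged = retained_w.copy()
    for i, j in zip(p_idx, r_idx):
        w_bar = (sim[i, j] * pruned_w[i] + retained_w[j]) / 2.0
        scale = np.linalg.norm(retained_w[j]) / (np.linalg.norm(w_bar) + 1e-8)
        merged[j] = w_bar * scale                      # keep the retained channel's norm
    return merged
```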

3.4.4. Depth-Attentive Fusion Module

In the initial P2PNet feature fusion module, fusion was restricted to three feature maps from the middle and deep layers, overlooking the valuable information present in the shallow layers. Furthermore, the use of conventional convolution operations in the fusion process introduced a substantial parameter overhead. To address these limitations and enhance model efficiency, this paper introduces the Depth-Attentive Fusion Module (DAFM). As depicted in Figure 7, the DAFM extracts and integrates four feature maps spanning shallow to deep layers of the backbone network, significantly enriching the model's feature representation capacity.
During the fusion process at each layer, the feature maps first undergo depthwise separable convolution, which adjusts the channel dimensions. This convolution technique, comprising depthwise and pointwise convolutions, reduces computational demands by convolving each input channel separately in the depthwise phase and subsequently combining these outputs through pointwise convolution. This method drastically cuts the computational burden associated with traditional convolutions. Following this, the processed feature maps are fed into the linear attention mechanism. This simplified version of self-attention ensures the model remains lightweight while maintaining high performance. Linear attention operates by projecting the feature map into query (q), key (k), and value (v) vectors using fully connected layers:
$$\begin{bmatrix} q \\ k \\ v \end{bmatrix} = \begin{bmatrix} W_q \\ W_k \\ W_v \end{bmatrix} x_f + \begin{bmatrix} b_q \\ b_k \\ b_v \end{bmatrix}$$
It then computes the original attention scores through the dot product of q and k, followed by a normalization using the softmax function:
$$\mathrm{Attention\ scores} = \mathrm{softmax}\!\left( \frac{q \cdot k^{T}}{\sqrt{d_k}} \right)$$
where $d_k$ is the dimensionality of the key vectors, used to scale the dot product for stabilization. The resulting attention scores are employed to weight and synthesize the value vector $v$, allowing the model to dynamically focus on the most pertinent features according to the computed attention:
$$\mathrm{Output} = \mathrm{Attention\ scores} \cdot v$$
This output captures an integrated representation of the input feature map, substantially enhancing the model’s ability to discern crucial information. Through the integration of the DAFM, our model achieves enhanced feature fusion, enabling superior differentiation of individual rice plants in complex field environments while maintaining a streamlined architecture, thus boosting both the efficiency and accuracy of counting.
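A hedged PyTorch sketch of these building blocks is shown below; it follows the depthwise separable convolution and the attention formulas above, but the layer sizes, the single-head formulation, and the fusion step are illustrative assumptions rather than the authors' exact DAFM implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise (1x1) convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SimplifiedAttention(nn.Module):
    """Single-head self-attention over spatial positions, as in the equations above."""
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Linear(channels, channels * 3)
        self.scale = channels ** -0.5

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, HW, C)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                             # weighted sum of value vectors
        return out.transpose(1, 2).reshape(b, c, h, w)

class FusionStep(nn.Module):
    """One deep-to-shallow fusion step: attend to the current level, then add the upsampled deeper feature."""
    def __init__(self, channels):
        super().__init__()
        self.adjust = DepthwiseSeparableConv(channels, channels)
        self.attend = SimplifiedAttention(channels)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, shallow, deeper):            # deeper is assumed to have half the spatial size
        return self.attend(self.adjust(shallow)) + self.upsample(deeper)
```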

4. Results

4.1. Implementation Details and Evaluation Metric

Similar to the base model P2PNet [8], the Adam optimizer was used during the training phase. The hyperparameter settings for training were kept consistent with those used in P2PNet to ensure comparability between the models. The VGG16-BN backbone was initialized using pre-trained weights provided by PyTorch. The specific experimental runtime environment is detailed in Table 1.
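For reference, backbone initialization and optimizer setup along these lines might look as follows (assuming a recent torchvision; the learning rate is a placeholder, not the paper's setting):

```python
import torch
import torchvision

# VGG16-BN feature extractor initialized from torchvision's ImageNet weights.
backbone = torchvision.models.vgg16_bn(
    weights=torchvision.models.VGG16_BN_Weights.DEFAULT
).features
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)  # illustrative learning rate
```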
To assess the performance of the model, two commonly used evaluation metrics were employed: the mean absolute error (MAE) and root mean squared error (RMSE). These metrics are defined as follows:
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| R_i - \hat{R}_i \right|$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( R_i - \hat{R}_i \right)^2}$$
where $R_i$ represents the actual number of rice plants in the $i$-th image, $\hat{R}_i$ denotes the predicted number of rice plants for the same image, and $N$ is the total number of images in the test set.
We use precision, recall, and the F-measure to evaluate the plant positioning indicators. Their definitions are as follows:
$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
In this context, TP refers to true positives, FP to false positives, and FN to false negatives.
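These metrics can be computed with a short script such as the following sketch (NumPy/SciPy; the one-to-one matching rule and the pixel distance threshold used to decide true positives are illustrative assumptions, since the exact matching criterion is not restated here):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def counting_metrics(true_counts, pred_counts):
    """MAE and RMSE over per-image plant counts."""
    diff = np.asarray(pred_counts, dtype=float) - np.asarray(true_counts, dtype=float)
    return np.mean(np.abs(diff)), np.sqrt(np.mean(diff ** 2))

def localization_metrics(gt_points, pred_points, dist_thresh=20.0):
    """Precision, recall, and F-measure for one image.

    A prediction counts as a true positive if it is matched one-to-one to a
    ground-truth point within `dist_thresh` pixels (threshold is illustrative).
    """
    if len(gt_points) == 0 or len(pred_points) == 0:
        return 0.0, 0.0, 0.0
    dist = np.linalg.norm(gt_points[:, None, :] - pred_points[None, :, :], axis=-1)
    gi, pi = linear_sum_assignment(dist)
    tp = int(np.sum(dist[gi, pi] <= dist_thresh))
    fp = len(pred_points) - tp
    fn = len(gt_points) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```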

4.2. Experiment on the RSC-UAV Dataset

In this section, the method is compared with several state-of-the-art techniques. Specifically, both advanced plant phenotyping methods (RiceNet and RPNet) and mainstream population counting models (CSRNet, FIDTM, PET, and P2PNet) are evaluated, with the results presented in Table 2. For this comparison, we set the sparsity threshold to 60%, meaning that 60% of the parameters are pruned from the backbone network. After pruning, the model's total number of parameters is reduced to 7.31 M, corresponding to roughly 33% of the parameters in the original P2PNet model. As shown in Table 2, the method achieves excellent results while maintaining a lightweight model architecture. The MAE and RMSE of the model are 3.12 and 4.12, respectively, which are second only to the baseline P2PNet. Compared to P2PNet, the model lags behind by only 0.18 in MAE, but with significantly fewer parameters.
When compared with the newer rice plant counting models, RiceNet and RPNet, P2P-CNF demonstrates clear advantages. Specifically, the MAE is reduced by 17% and the RMSE by 14% compared to RPNet, while the number of parameters in the model accounts for only 35% of RPNet. When compared to RiceNet, the MAE decreases by 15% and the RMSE by 13%. Furthermore, compared to several advanced crowd counting models (CSRNet, PET, and FIDTM), the method achieves a leading performance, with the best MAE and RMSE among these models, while maintaining a lower parameter count. This is largely attributed to the DAFM module, which integrates features at multiple levels in a lightweight manner, allowing for the precise identification of rice plants with diverse features. For the comparison of localization performance, we evaluated the precision, recall, and F-measure for PET, FIDTM, P2PNet, and P2P-CNF (pruned by 60%). As shown in Table 3, the precision, recall, and F-measure of P2P-CNF (60% pruned) reached 95.5%, 96.0%, and 95.7%, respectively, which are comparable to the original P2PNet model. Compared to FIDTM and PET, our method demonstrated a slight advantage, with significantly fewer parameters.
To further demonstrate the effectiveness of the method, the detection results of these models are visualized in Figure 8. Figure 8a presents the ground truth, followed by the detection results from different models, displayed from Figure 8b–h. The model, P2P-CNF(60% pruned), exhibits superior recognition accuracy across different rice growth stages, particularly during the late tillering stage. It is worth noting that in the final image of Figure 8b, P2P-CNF successfully identifies rice plants that are staggered together, whereas the baseline P2PNet struggles in this scenario. This improvement can be attributed to the lightweight multi-layer feature fusion module. Overall, the method achieves excellent performance while maintaining a balance between model efficiency and accuracy, making it well-suited for agricultural environments with limited computational resources.

4.3. Impact of Pruning Parameters on RSC-UAV Dataset Performance

In this section, the impact of network slimming on counting performance is investigated by pruning 40% to 80% of the model's backbone parameters. The experiments were conducted on the RSC-UAV dataset, with the results summarized in Table 4. The unpruned P2PNet model yielded an MAE of 2.94 and an RMSE of 3.84. When 40% and 50% of the parameters were pruned from the backbone network, the resulting P2P-CNF models had 10.42 M and 8.63 M parameters, respectively. These models showed only slight increases in MAE and RMSE, indicating minimal degradation in performance compared to the original model.
The best results among the pruned configurations were obtained when 60% of the parameters were pruned, yielding an MAE of 3.12 and an RMSE of 4.12. In this case, the MAE increased by only 0.18 and the RMSE by 0.28 compared to the original model. Furthermore, when compared to the 50% pruning configuration, the MAE decreased by 9% and the RMSE by 8%, demonstrating the effectiveness of further pruning. When 70% of the parameters were pruned, the total number of parameters in the P2P-CNF model dropped to 6.51 M, just 30% of the original model. At this pruning rate, the MAE was 3.21, second only to the 60% pruning configuration. At an 80% pruning rate, the parameters of the model were reduced even further, yet the MAE remained a reasonable 3.66, only 0.72 higher than that of the original P2PNet model.
To provide a more comprehensive analysis of the effect of different pruning ratios on model performance, the detection results and heatmaps for each pruned version were visualized. These visualizations are shown in Figure 9, where Figure 9a presents the ground truth, followed by the results from models with progressively pruned parameters from Figure 9b–f. It is worth noting that in the second image of Figure 9a, several rice plants lack markers in the ground truth; even so, the model was still able to identify these rice plants effectively. As rice plants approached the late tillering stage, the original P2PNet model began to produce duplicate detections, repeatedly labeling the same rice plants. However, the lightweight method successfully mitigated this issue, thanks to the Depth-Attentive Fusion Module (DAFM). This module employs depthwise separable convolutions combined with a linear self-attention mechanism, enabling the model to focus more precisely on the rice plants while maintaining a lightweight architecture. Consequently, the model was better equipped to identify overlapping rice plants and handle occlusions, resulting in improved overall performance in complex scenarios.

4.4. Ablation Experiment

To demonstrate the effectiveness of the CNF pruning module and the DAFM feature fusion module, extensive ablation experiments were conducted on the RSC-UAV dataset, with the results presented in Table 5.
Initially, the baseline model, P2PNet, was applied for rice localization and counting, yielding an MAE of 2.94 and an RMSE of 3.84. When the scale sparse rate was set to 0.4, indicating a 40% reduction in the number of backbone parameters by the CNF pruning method, the MAE increased to 5.01 and the RMSE rose to 6.83, with the model size reduced to 11.68 M parameters. Adding the DAFM to the network after CNF pruning reduced the model's parameter count to 10.4 M and improved performance, with an MAE of 3.67 and an RMSE of 4.57. Compared to the network after CNF pruning alone, the model with the DAFM reduced both the MAE (by 1.34) and the RMSE (by 2.26), indicating a substantial improvement in both localization precision and counting accuracy.
When the scale sparse rate was set to 0.6, the pruning performance of the CNF method reached its optimal level. Without DAFM integration, the MAE was 4.44, the RMSE was 5.93, and the model contained 8.58 M parameters. Integrating DAFM further reduced the MAE and RMSE to 3.12 and 4.12, respectively, while reducing the parameter count by an additional 1.27 M, making the model both lighter and more accurate. At a scale sparse rate of 0.8, most parameters had been pruned from the model. The model without DAFM integration had 6.93 M parameters, with an MAE of 6.59 and an RMSE of 8.77. However, upon integrating DAFM, both MAE and RMSE decreased significantly to 3.66 and 4.94, respectively, with further parameter reduction. This clearly demonstrates the effectiveness of the DAFM module in enhancing model efficiency and performance.
To further investigate the impact of different components on model performance, we utilized heat maps for a more comprehensive analysis. A scale sparse rate of 0.6 was selected, along with the baseline model, for the visualization analysis. As shown in Figure 10, the heat maps provide insightful visualizations. In Figure 10c, it is evident that the backbone network of the model, after pruning with CNF, retains a strong ability to extract meaningful features. Figure 10d illustrates our complete method, P2P-CNF, where the integration of DAFM enhances the model’s feature attention. This improvement is largely attributed to the self-attention mechanism, which enables the model to more effectively focus on the key features of rice plants, resulting in better localization and counting performance.

4.5. Experiments on URC Dataset

To accelerate training, the URC dataset images were resized to half their original width and height prior to training with P2PNet. Subsequently, the pruning methodology was applied, followed by another training phase. The results are documented in Table 6.
Our optimal model achieved an MAE of 5.11 and an RMSE of 6.57. At this stage, the model's parameter count was 7.31 M. Notably, the model's MAE was 0.96 higher than that of the optimal PET model, yet our parameter count constituted only 35% of that of the PET model, and the MAE was 0.65 higher than that of the baseline P2PNet model. These findings suggest that the pruning and feature fusion strategies effectively mitigate the information loss typically associated with parameter reduction. When compared to RiceNet and RPNet, the method achieves a slightly better MAE after pruning 60% of the parameters, while maintaining a significantly lower parameter count.
To further demonstrate the impact of our methods, we compared the baseline P2PNet model and our two most effective strategies on the URC dataset, incorporating the results from two distinct parameter reduction scenarios. Figure 11b–d illustrates their counting performance, while Figure 12b–d displays the heatmap visualizations.

4.6. Experiments on WED Dataset

Given that the WED dataset solely comprises bounding box annotations, these were transformed into point annotations to facilitate a uniform comparison. Furthermore, considering the dataset's high-resolution imagery, the image resolution was reduced to one eighth of its original size prior to training. The training protocol was maintained as previously established, beginning with pre-training using P2PNet, followed by pruning with the proposed method.
Table 7 presents the comparative results on the WED dataset. Notably, RPNet and P2PNet attained the optimal MAE of 3.61, with P2PNet demonstrating a lower RMSE. While our method did not achieve the foremost performance, it ranked closely behind these models, with an MAE of 3.95, merely 0.34 higher than the leading score. Relative to the baseline model P2PNet, P2P-CNF required only 32% of the parameters, significantly enhancing the model’s efficiency. Figure 13b,c illustrates the visual results for the baseline models P2PNet and P2P-CNF (Pruned by 60%). The heatmap clearly shows that despite substantial parameter reduction, our method proficiently maintains focus on the target.

4.7. Experiments on MTC Dataset

Prior to initiating training, the MTC dataset was processed by scaling the shorter sides of the images to 512 pixels and proportionally adjusting the longer sides to expedite training. The results are presented in Table 8. Surprisingly, P2P-CNF exceeded the baseline model P2PNet for the first time, achieving the best MAE of 2.94, comparable to RPNet. Although RPNet's RMSE was marginally lower than that of our model, the model used only 33% of RPNet's parameters, striking a balance between performance and efficiency.
When approximately 80% of the parameters were pruned from the backbone network, P2P-CNF’s parameters were reduced to 5.54 M, just 25% of the baseline P2PNet model. Nonetheless, its performance remained comparable to P2PNet. Despite having more parameters than the lighter TasselNetv2+, our method achieved superior accuracy. Overall, we successfully balanced high performance with a lightweight architecture.
Furthermore, we visualized the results of different parameter configurations of P2P-CNF and the baseline model P2PNet on the MTC dataset. Figure 14b–f display the visualization of the counting results produced by our method. The counting results clearly demonstrate that our method also excels in sparse scenarios.

5. Discussion

In this article, we introduced a novel pruning method, Cosine Norm Fusion (CNF), and a feature fusion technique, the Depth Attention Fusion Module (DAFM). CNF uses cosine similarity calculations during pruning to minimize information loss, while DAFM combines depthwise separable convolutions with a linear self-attention mechanism to create a lightweight feature module. These techniques were incorporated into the point-to-point architecture model, P2PNet, resulting in a lightweight rice counting model, P2P-CNF. A critical aspect of the CNF pruning method is the determination of the global threshold for pruning. The global threshold is set based on a percentile of the scaling factor values from the batch normalization (BN) layers. Our experiments demonstrate that setting the sparsity rate to 0.6 or 0.7, which corresponds to pruning 60% or 70% of the parameters from the backbone network, results in optimal model performance. This reduction in parameters does not significantly affect the model's accuracy. To further mitigate performance degradation caused by parameter reduction and potential imbalances in the network layers, we rely on the DAFM. This module efficiently integrates the information from the remaining channels using a self-attention mechanism, enabling the model to focus more on the most relevant features. As a result, the model can be deployed effectively in agricultural environments, ensuring both high performance and a lightweight architecture.
Furthermore, to gain a deeper understanding of the fusion effect of DAFM, we conducted a thorough investigation by comparing the performance of various fusion strategies following CNF pruning on the RSC-UAV dataset. Specifically, we evaluated the experimental results using different fusion modules, including the original fusion module of P2PNet, the DAFM module, and a version of the DAFM module without the linear self-attention mechanism. The results of these comparisons are presented in Table 9.
It is evident that even the DAFM module without the self-attention mechanism outperforms the original P2PNet feature fusion method in both parameter efficiency and accuracy. This improvement can be attributed to the more diverse feature fusion strategies employed by DAFM. Moreover, the integration of the linear self-attention mechanism further enhances its performance. The attention mechanism specifically focuses on the inherent characteristics of the rice plants, which contributes to a significant increase in accuracy.

6. Conclusions

This paper proposes a lightweight method, P2P-CNF, based on P2PNet. By utilizing CNF for pruning and integrating the DAFM module for feature fusion, the model effectively maintains high accuracy while operating in agricultural environments with limited computational resources. To assess the performance of P2P-CNF, we constructed the RSC-UAV dataset, specifically designed for rice plant counting using UAV imagery, and conducted extensive experiments on multiple datasets, including URC, WED, and MTC. The results demonstrate that P2P-CNF achieves outstanding performance across these diverse datasets, with fewer parameters than current mainstream plant counting models. Looking ahead, future research will focus on further refining lightweight methodologies and striving for even higher accuracy while minimizing the model’s parameter count, ensuring better adaptability to resource-constrained environments.

Author Contributions

Conceptualization, S.T. and H.S.; methodology, H.S.; software, H.S. and K.Z.; validation, H.S. and Y.Y.; formal analysis, H.S. and C.C.; investigation, H.S. and Z.L.; resources, S.T.; data curation, S.T. and K.Z.; writing—original draft preparation, H.S.; writing—review and editing, H.S. and L.Z.; visualization, H.S. and C.C.; supervision, H.S. and L.Z.; project administration, S.T.; funding acquisition, S.T. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62202163, the Scientific Research Project of Hunan Provincial Department of Education under Grant 22A0145, and the National Key Research and Development Plan project "Research and Development of Intelligent Identification Technology for Malignant Weeds in Rice Fields Based on Spectral Analysis" under Grant 2023YFD1401100.

Data Availability Statement

The RSC-UAV data presented in this study are available on request from the corresponding author; the URC data presented in the study are openly available at https://github.com/xdbai-source/Rice-Plant-Counting (accessed on 17 April 2024); the WED data presented in the study are openly available at https://github.com/simonMadec (accessed on 21 October 2018); and the MTC data presented in the study are openly available at https://github.com/poppinace/mtc (accessed on 15 August 2018).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zeigler, R.S.; Barclay, A. The relevance of rice. Rice 2008, 1, 3–10. [Google Scholar] [CrossRef]
  2. Adhikari, B.; Mehera, B.; Haefele, S. Impact of rice nursery nutrient management, seeding density and seedling age on yield and yield attributes. Am. J. Plant Sci. 2013, 4, 146–155. [Google Scholar] [CrossRef]
  3. Wang, X.; Li, Z.; Tan, S.; Li, H.; Qi, L.; Wang, Y.; Chen, J.; Yang, C.; Chen, J.; Qin, Y.; et al. Research on density grading of hybrid rice machine-transplanted blanket-seedlings based on multi-source unmanned aerial vehicle data and mechanized transplanting test. Comput. Electron. Agric. 2024, 222, 109070. [Google Scholar] [CrossRef]
  4. Madec, S.; Jin, X.; Lu, H.; De Solan, B.; Liu, S.; Duyme, F.; Heritier, E.; Baret, F. Ear density estimation from high resolution RGB imagery using deep learning technique. Agric. For. Meteorol. 2019, 264, 225–234. [Google Scholar] [CrossRef]
  5. Li, H.; Li, Z.; Dong, W.; Cao, X.; Wen, Z.; Xiao, R.; Wei, Y.; Zeng, H.; Ma, X. An automatic approach for detecting seedlings per hill of machine-transplanted hybrid rice utilizing machine vision. Comput. Electron. Agric. 2021, 185, 106178. [Google Scholar] [CrossRef]
  6. Bai, X.; Liu, P.; Cao, Z.; Lu, H.; Xiong, H.; Yang, A.; Cai, Z.; Wang, J.; Yao, J. Rice plant counting, locating, and sizing method based on high-throughput UAV RGB images. Plant Phenomics 2023, 5, 20. [Google Scholar] [CrossRef] [PubMed]
  7. Bai, X.; Gu, S.; Liu, P.; Yang, A.; Cai, Z.; Wang, J.; Yao, J. Rpnet: Rice plant counting after tillering stage based on plant attention and multiple supervision network. Crop J. 2023, 11, 1586–1594. [Google Scholar] [CrossRef]
  8. Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Wu, Y. Rethinking counting and localization in crowds: A purely point-based framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3365–3374. [Google Scholar]
  9. Zhao, J.; Kaga, A.; Yamada, T.; Komatsu, K.; Hirata, K.; Kikuchi, A.; Hirafuji, M.; Ninomiya, S.; Guo, W. Improved field-based soybean seed counting and localization with feature level considered. Plant Phenomics 2023, 5, 26. [Google Scholar] [CrossRef]
  10. Yao, M.; Li, W.; Chen, L.; Zou, H.; Zhang, R.; Qiu, Z.; Yang, S.; Shen, Y. Rice Counting and Localization in Unmanned Aerial Vehicle Imagery Using Enhanced Feature Fusion. Agronomy 2024, 14, 868. [Google Scholar] [CrossRef]
  11. Cheng, H.; Zhang, M.; Shi, J.Q. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10558–10578. [Google Scholar] [CrossRef]
  12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  13. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  15. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  17. Vaswani, A. Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  18. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  19. Joo, D.; Kim, D.; Kim, J. Slimming ResNet by Slimming Shortcut. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7677–7683. [Google Scholar]
  20. Li, M.; Ding, D.; Heldring, A.; Hu, J.; Chen, R.; Vecchi, G. Low-rank matrix factorization method for multiscale simulations: A review. IEEE Open J. Antennas Propag. 2021, 2, 286–301. [Google Scholar] [CrossRef]
  21. Saha, R.; Srivastava, V.; Pilanci, M. Matrix compression via randomized low rank and low precision factorization. In Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  22. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  23. Sun, S.; Ren, W.; Li, J.; Wang, R.; Cao, X. Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15731–15740. [Google Scholar]
  24. Bakhtiarifard, P.; Igel, C.; Selvan, R. EC-NAS: Energy consumption aware tabular benchmarks for neural architecture search. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5660–5664. [Google Scholar]
  25. Gao, S.; Huang, F.; Cai, W.; Huang, H. Network pruning via performance maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9270–9280. [Google Scholar]
  26. Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16091–16101. [Google Scholar]
  27. Ma, X.; Fang, G.; Wang, X. Llm-pruner: On the structural pruning of large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 21702–21720. [Google Scholar]
  28. Gao, S.; Zhang, Z.; Zhang, Y.; Huang, F.; Huang, H. Structural alignment for network pruning through partial regularization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17402–17412. [Google Scholar]
  29. Wang, Z.; Li, C.; Wang, X. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14913–14922. [Google Scholar]
  30. Liao, Z.; Quétu, V.; Nguyen, V.T.; Tartaglione, E. Can Unstructured Pruning Reduce the Depth in Deep Neural Networks? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1402–1406. [Google Scholar]
  31. Pietroń, M.; Żurek, D.; Śnieżyński, B. Speedup deep learning models on GPU by taking advantage of efficient unstructured pruning and bit-width reduction. J. Comput. Sci. 2023, 67, 101971. [Google Scholar] [CrossRef]
  32. Wilkinson, L.; Cheshmi, K.; Dehnavi, M.M. Register Tiling for Unstructured Sparsity in Neural Network Inference. Proc. ACM Program. Lang. 2023, 7, 1995–2020. [Google Scholar] [CrossRef]
  33. Frantar, E.; Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 10323–10337. [Google Scholar]
  34. Dhahri, R.; Immer, A.; Charpentier, B.; Günnemann, S.; Fortuin, V. Shaving Weights with Occam’s Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood. arXiv 2024, arXiv:2402.15978. [Google Scholar]
  35. Sun, X.; Shi, H. Towards Better Structured Pruning Saliency by Reorganizing Convolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 2204–2214. [Google Scholar]
  36. Chen, T.; Ji, B.; Ding, T.; Fang, B.; Wang, G.; Zhu, Z.; Liang, L.; Shi, Y.; Yi, S.; Tu, X. Only train once: A one-shot neural network training and pruning framework. Adv. Neural Inf. Process. Syst. 2021, 34, 19637–19651. [Google Scholar]
  37. He, Y.; Liu, P.; Wang, Z.; Hu, Z.; Yang, Y. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4340–4349. [Google Scholar]
  38. Wang, W.; Fu, C.; Guo, J.; Cai, D.; He, X. COP: Customized deep model compression via regularized correlation-based filter-level pruning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3785–3791. [Google Scholar]
  39. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  40. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2736–2744. [Google Scholar]
  41. You, Z.; Yan, K.; Ye, J.; Ma, M.; Wang, P. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  42. Zhuang, T.; Zhang, Z.; Huang, Y.; Zeng, X.; Shuang, K.; Li, X. Neuron-level structured pruning using polarization regularizer. Adv. Neural Inf. Process. Syst. 2020, 33, 9865–9877. [Google Scholar]
  43. Kang, M.; Han, B. Operation-aware soft channel pruning using differentiable masks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5122–5131. [Google Scholar]
  44. Lu, H.; Cao, Z. TasselNetV2+: A fast implementation for high-throughput plant counting from high-resolution RGB imagery. Front. Plant Sci. 2020, 11, 541960. [Google Scholar] [CrossRef] [PubMed]
  45. Karami, A.; Crawford, M.; Delp, E.J. Automatic plant counting and location based on a few-shot learning technique. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5872–5886. [Google Scholar] [CrossRef]
  46. Pape, J.M.; Klukas, C. Utilizing machine learning approaches to improve the prediction of leaf counts and individual leaf segmentation of rosette plant images. Proc. Comput. Vis. Probl. Plant Phenotyping (CVPPP) 2015, 3, 1–12. [Google Scholar]
  47. Giuffrida, M.V.; Minervini, M.; Tsaftaris, S.A. Learning to count leaves in rosette plants. In Proceedings of the Computer Vision Problems in Plant Phenotyping (CVPPP), Swansea, UK, 7–10 September 2015. [Google Scholar]
  48. Qureshi, W.S.; Payne, A.; Walsh, K.; Linker, R.; Cohen, O.; Dailey, M. Machine vision for counting fruit on mango tree canopies. Precis. Agric. 2017, 18, 224–244. [Google Scholar] [CrossRef]
  49. Sam, D.B.; Peri, S.V.; Sundararaman, M.N.; Kamath, A.; Babu, R.V. Locate, size, and count: Accurately resolving people in dense crowds via detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2739–2751. [Google Scholar]
  50. Bai, S.; He, Z.; Qiao, Y.; Hu, H.; Wu, W.; Yan, J. Adaptive dilated network with self-correction supervision for counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4594–4603. [Google Scholar]
  51. Hu, Y.; Jiang, X.; Liu, X.; Zhang, B.; Han, J.; Cao, X.; Doermann, D. Nas-count: Counting-by-density with neural architecture search. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 747–766. [Google Scholar]
  52. Buzzy, M.; Thesma, V.; Davoodi, M.; Mohammadpour Velni, J. Real-time plant leaf counting using deep object detection networks. Sensors 2020, 20, 6896. [Google Scholar] [CrossRef]
  53. Wang, N.; Cao, H.; Huang, X.; Ding, M. Rapeseed flower counting method based on GhP2-YOLO and StrongSORT algorithm. Plants 2024, 13, 2388. [Google Scholar] [CrossRef]
  54. Zhang, Y.; Zhang, W.; Yu, J.; He, L.; Chen, J.; He, Y. Complete and accurate holly fruits counting using YOLOX object detection. Comput. Electron. Agric. 2022, 198, 107062. [Google Scholar] [CrossRef]
  55. Chen, H.; Chen, H.; Huang, X.; Zhang, S.; Chen, S.; Cen, F.; He, T.; Zhao, Q.; Gao, Z. Estimation of sorghum seedling number from drone image based on support vector machine and YOLO algorithms. Front. Plant Sci. 2024, 15, 1399872. [Google Scholar] [CrossRef] [PubMed]
  56. Lu, H.; Cao, Z.; Xiao, Y.; Zhuang, B.; Shen, C. TasselNet: Counting maize tassels in the wild via local counts regression network. Plant Methods 2017, 13, 79. [Google Scholar] [CrossRef]
  57. Xiong, H.; Cao, Z.; Lu, H.; Madec, S.; Liu, L.; Shen, C. TasselNetv2: In-field counting of wheat spikes with context-augmented local regression networks. Plant Methods 2019, 15, 150. [Google Scholar] [CrossRef] [PubMed]
  58. Lu, H.; Liu, L.; Li, Y.N.; Zhao, X.M.; Wang, X.Q.; Cao, Z.G. TasselNetV3: Explainable plant counting with guided upsampling and background suppression. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  59. Wu, J.; Yang, G.; Yang, X.; Xu, B.; Han, L.; Zhu, Y. Automatic counting of in situ rice seedlings from UAV images based on a deep fully convolutional neural network. Remote Sens. 2019, 11, 691. [Google Scholar] [CrossRef]
  60. Liu, W.; Zhou, J.; Wang, B.; Costa, M.; Kaeppler, S.M.; Zhang, Z. IntegrateNet: A deep learning network for maize stand counting from UAV imagery by integrating density and local count maps. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6512605. [Google Scholar] [CrossRef]
  61. Kim, M.; Gao, S.; Hsu, Y.C.; Shen, Y.; Jin, H. Token fusion: Bridging the gap between token pruning and token merging. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1383–1392. [Google Scholar]
  62. Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100. [Google Scholar]
  63. Liang, D.; Xu, W.; Zhu, Y.; Zhou, Y. Focal inverse distance transform maps for crowd localization. IEEE Trans. Multimed. 2022, 25, 6040–6052. [Google Scholar] [CrossRef]
  64. Liu, C.; Lu, H.; Cao, Z.; Liu, T. Point-query quadtree for crowd counting, localization, and more. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1676–1685. [Google Scholar]
Figure 1. (a) Data collection location. (b) The collection equipment and the corresponding captured images.
Figure 2. Samples of rice plants collected across different dates.
Figure 3. Distribution of training and testing datasets.
Figure 4. The overall architecture of P2P-CNF.
Figure 5. Flowchart of the network slimming procedure.
Figure 6. In the pruning process employed by Cosine Norm Fusion (CNF), the scaling factors (γ) are derived from the batch normalization (BN) layers, and regularization is applied to these factors during training. Once the channels deemed less important become sparse, the cosine similarity between the pruned and retained channels is computed to facilitate their fusion.
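As a rough, self-contained sketch of the pruning step summarized in this caption (not the authors' exact implementation: the layer sizes, the pruning ratio, and the unweighted fusion rule below are illustrative assumptions), channel importance can be ranked by the BN scaling factors and each pruned channel merged into its most similar retained channel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical Conv-BN block standing in for one backbone layer.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(32)

prune_ratio = 0.6
gamma = bn.weight.detach().abs()                # |gamma| from the BN layer
order = torch.argsort(gamma, descending=True)   # rank channels by importance
n_keep = int(gamma.numel() * (1 - prune_ratio))
keep_idx, prune_idx = order[:n_keep], order[n_keep:]

with torch.no_grad():
    w = conv.weight                              # shape: (out_ch, in_ch, k, k)
    flat = w.flatten(1)                          # one vector per output channel
    for p in prune_idx:
        # cosine similarity between the pruned channel and every retained channel
        sims = F.cosine_similarity(flat[p].unsqueeze(0), flat[keep_idx], dim=1)
        target = keep_idx[sims.argmax()]
        w[target] += w[p]                        # fuse rather than discard
    pruned_weight = w[keep_idx].clone()          # weights of the slimmed layer

print(pruned_weight.shape)                       # torch.Size([12, 16, 3, 3]) at ratio 0.6
```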
Figure 7. Detailed process of the Depth Attention Fusion Module (DAFM).
Figure 8. The visualization of different models on the RSC-UAV dataset, with (a) showing the ground truth; "real count" refers to the actual number of points. (b–h) display the results predicted by different models, where "pred count" indicates the number of points predicted by each model.
Figure 9. The visualization on the RSC-UAV dataset, with the ground truth shown in (a), the backbone network models at progressively increasing pruning ratios in (b–f), and the original model in (g). The visualization includes both the detection results and their corresponding heatmap representations.
Figure 10. The ablation experiment on the RSC-UAV dataset is illustrated as follows: (a) shows the ground truth; (b) presents the results after pruning 60% of the parameters using CNF; (c) illustrates the P2PNet model using only CNF pruning; (d) demonstrates the complete P2P-CNF method, which incorporates DAFM for enhanced feature fusion.
Figure 11. In the visualization of the URC dataset, (a) represents the ground truth. The counting results of P2PNet, P2P-CNF (pruned by 60%), and P2P-CNF (pruned by 70%) are shown in (b–d), respectively.
Figure 12. In the heatmap visualization on the URC dataset, the original image is shown in (a). Columns (b–d) display the heatmap results of P2PNet, P2P-CNF (pruned by 60%), and P2P-CNF (pruned by 70%), respectively.
Figure 13. Visualization of the results on the WED dataset (a). (b) shows the counting result of P2PNet along with its heatmap visualization, while (c) presents the visualization result of P2P-CNF (60% pruned).
Figure 14. The visualization of the MTC dataset, with the ground truth in (a). Images (b–f) display the counting effect maps of P2P-CNF at varying pruning ratios, followed by the counting effect maps of the baseline model, P2PNet, in (g).
Table 1. Software and hardware configuration for the experiment.
Experimental Parameter | Experimental Environment
Python | 3.8
PyTorch | 2.0.1
CUDA | 12.1
CPU | Intel® Core™ i7-13700KF
GPU | NVIDIA GeForce RTX 4090
Operating system | Windows 11
Learning rate | 0.0001
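A minimal sketch of how the Table 1 environment might be verified and the reported learning rate applied; only the learning rate (0.0001) and the software versions come from the table, while the placeholder model and the choice of optimizer are assumptions:

```python
import torch

print(torch.__version__)   # expected: 2.0.1 (Table 1)
print(torch.version.cuda)  # expected: 12.1 (Table 1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 8, kernel_size=3).to(device)      # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)  # learning rate from Table 1
```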
Table 2. The results of different models on the RSC-UAV dataset.
Model | Venue | MAE | RMSE | Parameters
CSRnet [62] | CVPR 2018 | 6.93 | 8.53 | 16.26 M
P2PNet [8] | ICCV 2021 | 2.94 | 3.84 | 21.57 M
FIDTM [63] | TMM 2022 | 3.20 | 4.02 | 66.58 M
RPNet [7] | Crop J 2023 | 3.78 | 4.81 | 20.64 M
RiceNet [6] | PLPH 2023 | 3.71 | 4.77 | 20.55 M
PET [64] | ICCV 2023 | 3.46 | 4.36 | 20.90 M
P2P-CNF (60% Pruned) | This paper | 3.12 | 4.12 | 7.31 M
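For reference, the MAE and RMSE reported in Table 2 (and in the later tables) are the standard count-regression metrics; a short sketch with made-up per-image counts illustrates how they are computed:

```python
import numpy as np

# Hypothetical per-image ground-truth and predicted counts (illustrative only).
true_counts = np.array([112, 98, 105, 120])
pred_counts = np.array([110, 101, 103, 124])

mae = np.mean(np.abs(pred_counts - true_counts))             # mean absolute error
rmse = np.sqrt(np.mean((pred_counts - true_counts) ** 2))    # root mean squared error
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")                  # MAE = 2.75, RMSE = 2.87
```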
Table 3. Localization experiment results on the RSC-UAV dataset.
Model | Venue | Precision | Recall | F-Measure
P2PNet [8] | ICCV 2021 | 95.5% | 96.1% | 95.8%
P2P-CNF (60% Pruned) | This paper | 95.5% | 96.0% | 95.7%
FIDTM [63] | TMM 2022 | 94.9% | 95.5% | 95.2%
PET [64] | ICCV 2023 | 93.8% | 94.2% | 94.0%
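The F-measure column in Table 3 follows from the listed precision and recall as F = 2PR/(P + R); a quick check against two of the rows:

```python
def f_measure(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.955, 0.961), 3))  # 0.958 -> 95.8% (P2PNet row)
print(round(f_measure(0.938, 0.942), 3))  # 0.940 -> 94.0% (PET row)
```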
Table 4. Results of parameter pruning at different ratios on the RSC-UAV dataset.
Model | MAE | RMSE | Parameters
P2PNet | 2.94 | 3.84 | 21.57 M
P2P-CNF (40% Pruned) | 3.67 | 4.57 | 10.4 M
P2P-CNF (50% Pruned) | 3.45 | 4.59 | 8.63 M
P2P-CNF (60% Pruned) | 3.12 | 4.12 | 7.31 M
P2P-CNF (70% Pruned) | 3.21 | 4.42 | 6.51 M
P2P-CNF (80% Pruned) | 3.66 | 4.94 | 5.59 M
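Table 4 also underlies the parameter figure quoted in the abstract: the 60%-pruned model keeps 7.31 M of P2PNet's 21.57 M parameters, i.e., roughly 33%:

```python
retained = 7.31 / 21.57   # parameters of P2P-CNF (60% pruned) vs. P2PNet (Table 4)
print(f"{retained:.1%}")  # 33.9%
```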
Table 5. Ablation experiment on the RSC-UAV dataset. CNF refers to the pruning method proposed in this paper; the "scale sparse rate" is the sparsity factor, indicating the fraction of parameters pruned from the backbone network by CNF; DAFM denotes the feature fusion method introduced in this paper.
Baseline | CNF | Scale Sparse Rate | DAFM | MAE | RMSE | Parameters
✓ | | | | 2.94 | 3.84 | 21.57 M
✓ | ✓ | 0.4 | | 5.01 | 6.83 | 11.68 M
✓ | ✓ | 0.4 | ✓ | 3.67 | 4.57 | 10.4 M
✓ | ✓ | 0.5 | | 5.19 | 6.65 | 9.93 M
✓ | ✓ | 0.5 | ✓ | 3.45 | 4.59 | 8.63 M
✓ | ✓ | 0.6 | | 4.44 | 5.93 | 8.58 M
✓ | ✓ | 0.6 | ✓ | 3.12 | 4.12 | 7.31 M
✓ | ✓ | 0.7 | | 5.33 | 6.77 | 7.54 M
✓ | ✓ | 0.7 | ✓ | 3.21 | 4.42 | 6.51 M
✓ | ✓ | 0.8 | | 6.59 | 8.77 | 6.93 M
✓ | ✓ | 0.8 | ✓ | 3.66 | 4.94 | 5.59 M
Table 6. The results of different models on the URC dataset.
Model | Venue | MAE | RMSE | Parameters
CSRnet [62] | CVPR 2018 | 10.95 | 12.12 | 16.26 M
TasselNetv2+ [44] | Front Plant Sci 2020 | 9.46 | 11.44 | 0.25 M
P2PNet [8] | ICCV 2021 | 4.46 | 5.88 | 21.57 M
RPNet [7] | Crop J 2023 | 5.53 | 6.69 | 20.64 M
RiceNet [6] | PLPH 2023 | 5.12 | 6.41 | 20.55 M
PET [64] | ICCV 2023 | 4.15 | 5.19 | 20.90 M
P2P-CNF (60% Pruned) | This paper | 5.11 | 6.57 | 7.31 M
P2P-CNF (70% Pruned) | This paper | 6.24 | 7.57 | 6.27 M
P2P-CNF (80% Pruned) | This paper | 7.03 | 8.59 | 5.65 M
Table 7. The results of different models on the WED dataset.
Model | Venue | MAE | RMSE | Parameters
Faster R-CNN [16] | TPAMI 2016 | 4.93 | 6.52 | 41.4 M
CSRnet [62] | CVPR 2018 | 6.37 | 8.35 | 16.26 M
TasselNetv2+ [44] | Front Plant Sci 2020 | 6.59 | 9.01 | 0.25 M
P2PNet [8] | ICCV 2021 | 3.61 | 4.97 | 21.57 M
RPNet [7] | Crop J 2023 | 3.61 | 5.13 | 20.64 M
RiceNet [6] | PLPH 2023 | 4.01 | 5.80 | 20.55 M
PET [64] | ICCV 2023 | 4.22 | 5.29 | 20.90 M
P2P-CNF (60% Pruned) | This paper | 3.95 | 5.48 | 7.05 M
P2P-CNF (70% Pruned) | This paper | 5.76 | 7.10 | 6.13 M
P2P-CNF (80% Pruned) | This paper | 7.65 | 9.37 | 5.56 M
Table 8. The results of different models on the MTC dataset.
Model | Venue | MAE | RMSE | Parameters
Faster R-CNN [16] | TPAMI 2016 | 7.77 | 9.80 | 41.4 M
CSRnet [62] | CVPR 2018 | 6.87 | 8.87 | 16.26 M
TasselNetv2+ [44] | Front Plant Sci 2020 | 5.10 | 8.75 | 0.25 M
P2PNet [8] | ICCV 2021 | 4.02 | 5.76 | 21.57 M
RPNet [7] | Crop J 2023 | 2.94 | 4.66 | 20.64 M
RiceNet [6] | PLPH 2023 | 2.99 | 4.86 | 20.55 M
PET [64] | ICCV 2023 | 4.26 | 5.88 | 20.90 M
P2P-CNF (40% Pruned) | This paper | 2.95 | 4.71 | 10.00 M
P2P-CNF (50% Pruned) | This paper | 3.35 | 4.54 | 8.30 M
P2P-CNF (60% Pruned) | This paper | 2.94 | 5.05 | 7.00 M
P2P-CNF (70% Pruned) | This paper | 3.23 | 4.70 | 6.07 M
P2P-CNF (80% Pruned) | This paper | 4.02 | 5.85 | 5.54 M
Table 9. Experimental results under different scale sparse rates.
Model | Scale Sparse Rate | MAE | RMSE | Parameters
P2P-CNF (original feature fusion) | 0.4 | 5.01 | 6.83 | 11.68 M
P2P-CNF (DAFM without attention) | 0.4 | 4.0 | 5.06 | 10.15 M
P2P-CNF (DAFM) | 0.4 | 3.67 | 4.57 | 10.4 M
P2P-CNF (original feature fusion) | 0.5 | 5.19 | 6.65 | 9.93 M
P2P-CNF (DAFM without attention) | 0.5 | 4.20 | 5.48 | 8.40 M
P2P-CNF (DAFM) | 0.5 | 3.45 | 4.59 | 8.63 M
P2P-CNF (original feature fusion) | 0.6 | 4.44 | 5.93 | 8.58 M
P2P-CNF (DAFM without attention) | 0.6 | 4.17 | 5.27 | 7.05 M
P2P-CNF (DAFM) | 0.6 | 3.12 | 4.12 | 7.31 M
P2P-CNF (original feature fusion) | 0.7 | 5.33 | 6.77 | 7.54 M
P2P-CNF (DAFM without attention) | 0.7 | 4.25 | 5.36 | 6.0 M
P2P-CNF (DAFM) | 0.7 | 3.21 | 4.42 | 6.51 M
P2P-CNF (original feature fusion) | 0.8 | 6.59 | 8.77 | 6.93 M
P2P-CNF (DAFM without attention) | 0.8 | 4.21 | 5.39 | 5.38 M
P2P-CNF (DAFM) | 0.8 | 3.66 | 4.94 | 5.59 M
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
