1. Introduction
The Chinese mitten crab (Eriocheir sinensis), commonly known as the hairy crab or river crab, is renowned for its unique flavor and rich nutritional value, making it one of China’s principal aquaculture products and a favorite among global culinary enthusiasts [1]. According to the “China Fisheries Statistical Yearbook” (2023) [2], China’s 2022 cultivation output of Chinese mitten crabs reached 815,318 tons, with a total value of 50 billion RMB, underscoring its significant role in freshwater fisheries. The market for Chinese mitten crabs is extremely broad, ranging from upscale restaurants to family dining tables and from online marketplaces to offline markets. More than a hundred registered brands and trademarks attest to its status in the aquaculture market. As market demand increases, problems with the quality and safety of Chinese mitten crab products are also growing, such as the misuse of drugs during breeding and the practice of temporarily altering breeding environments to counterfeit branded products [3,4]. This not only threatens consumer food safety but also impedes efforts towards supply chain transparency and brand value establishment.
To enhance the traceability of Chinese mitten crab products and combat counterfeit goods, various solutions have been proposed. For example, anti-counterfeit labels printed with barcodes or QR codes are attached to the crab’s claws, allowing consumers to scan these identifiers with mobile devices to access product information [5]. This method incurs high traceability costs and has poor environmental sustainability. Moreover, the anti-counterfeit tags are logically separate from the Chinese mitten crabs themselves, making it easy for unscrupulous merchants to recycle or counterfeit these identifiers. Studies show that the morphological growth of Chinese mitten crabs is influenced by genetic factors and the surrounding environment, with their carapaces exhibiting unique features such as grooves, protrusions, and textures [6,7]. Similar to human fingerprints, the characteristics of the Chinese mitten crab’s shell are unique and non-replicable [8], and short-term changes in the environment do not alter these morphological features [9]. Therefore, the recognition of the carapace has become an important means of distinguishing different Chinese mitten crabs [10]. Weipeng T. and others were the first to use SURF and FLANN algorithms to extract and match feature points on crab carapaces, performing individual matching verification based on these features, which showed variability among individual crab carapace features, although this method is susceptible to disturbances from uneven lighting and noise [11]. Therefore, the individual recognition of Chinese mitten crabs requires a more accurate, reliable, efficient, and convenient method of recognition.
Given the powerful feature extraction capabilities of deep learning, image recognition based on deep learning has been widely applied in studies on humans and certain animals and plants, such as facial recognition [12,13], animal facial recognition (e.g., pigs and sheep) [14,15,16], and plant disease recognition [17,18]. Some researchers have also used deep learning for the individual recognition of Chinese mitten crab carapace features. Yuying F. and other researchers utilized a pyramid convolutional network to extract carapace image features of Chinese mitten crabs, combining it with a Multilayer Perceptron (MLP) to classify and recognize 100 individual crabs and achieving an accuracy rate of 98.88% [19]. Although this method achieved a high classification accuracy by enhancing the model’s ability to extract image features, it cannot recognize Chinese mitten crabs not included in the training samples; adding new individuals requires retraining the entire network. Guozhong S. and others improved the ResNet101 residual network to extract the features of Chinese mitten crabs, using these features to calculate their similarity with authentic images for verification against a registered database, achieving individual crab traceability with an accuracy rate of 92.1% [20]. This method introduces a new technical approach to recognition technology. However, due to its large model size and computational demands, it requires substantial computational resources and training time, making it less suitable for resource-constrained environments.
Considering the issues mentioned above, as well as the potential applications and scalability of Chinese mitten crab traceability, this paper proposes a lightweight individual recognition method based on the image features of Chinese mitten crabs. This method aims to enhance the traceability of Chinese mitten crab products and combat counterfeit products.
The main contributions of this study are summarized as follows:
A Chinese mitten crab image recognition dataset is constructed, containing data for 122 individual Chinese mitten crabs and 64,050 annotated images.
A method for recognizing Chinese mitten crab carapace features based on MobileNetV2 is proposed, utilizing a lightweight model structure to effectively improve the accuracy and efficiency of crab recognition.
A coordinate attention mechanism is introduced to the module, enhancing the model’s ability to extract detailed features of the crab carapace.
The Additive Angular Margin Loss (ArcFace) is introduced to train the model, enhancing the intra-class compactness and inter-class variability of the extracted features.
The rest of the paper is organized as follows. Section 2 describes the data collection process, data augmentation, carapace recognition process, and the methods. Section 3 presents the experimental setup and analyses used to evaluate the proposed model, including the model training results, ablation experiments, quantitative comparisons with other advanced algorithms, and generalization testing of model feature extraction. Finally, the discussion and conclusions are presented in Section 4 and Section 5, respectively.
2. Materials and Methods
This section describes the materials and methods used in the study, including data collection, data augmentation, the methodological process, the configuration of the feature extraction network, and the specific application of attention mechanisms and loss functions.
2.1. Materials
This subsection provides a detailed description of the data collection methods and the dataset used for training the model. It explains how images of Chinese mitten crabs were collected, as well as the enhancement and partitioning of the dataset.
2.1.1. Image Acquisition
To perform image recognition using deep learning models, training the model is essential, and the collection of datasets serves as a critical foundation. Due to the scarcity of publicly available datasets on the research topic, we constructed our own dataset. This study collected images from 122 artificially bred Chinese mitten crabs at the Aquatic Animal Germplasm Testing Station in Pudong New District, Shanghai (121°39′56.503″ E, 31°0′48.366″ N), numbering each crab starting from zero. An example of the image collection environment and equipment is shown in Figure 1. An MV-CA050-12UC industrial camera (Hikvision, Hangzhou, China) was mounted on a stand and positioned at a vertical height of 0.4 m, centered on the carapace area of the Chinese mitten crab, with the collected images having a resolution of 2048 × 2048 pixels.
2.1.2. Data Augmentation and Partitioning
Deep convolutional neural networks excel in various computer vision tasks, including image recognition. Training these network models typically requires a large number of training images to enhance the model’s generalization ability. However, due to the high economic value of Chinese mitten crabs, obtaining a large amount of data is often challenging. To overcome this issue and enhance the recognition capability, generalizability, and robustness of the model, we augmented the existing dataset. Initially, all original sample images were augmented by scaling to different sizes, resulting in a dataset of 610 images. Then, the data from the first augmentation were further expanded by brightness adjustment, grayscale conversion, random translation, rotation, affine transformation, the addition of salt-and-pepper noise, and random cropping and padding, yielding a total of 64,050 images. The sample division is shown in Table 1, where 58,800 images from 112 samples are divided into training and validation sets in an 8:2 ratio. To verify the model’s recognition ability, validation image pairs were randomly matched following the Labeled Faces in the Wild (LFW) format [21], with 3000 matching and 3000 non-matching pairs. The remaining 5250 images from 10 samples served as a generalization ability test set and were used to test the model’s ability to distinguish unknown samples.
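The image counts above can be cross-checked with simple arithmetic. The following is a minimal sketch; the per-crab figures are derived here from the reported totals under the assumption that images are distributed evenly across crabs, and they are not stated explicitly in the text.

```python
# Counts stated in the text.
num_crabs = 122
scaled_images = 610        # after the first (scaling) augmentation
total_images = 64_050      # after the second augmentation
train_val_crabs = 112
test_crabs = 10

# Derived quantities (assumption: images are spread evenly across crabs).
scaled_per_crab = scaled_images // num_crabs          # images per crab after scaling
variants_per_scaled = total_images // scaled_images   # images produced per scaled image
images_per_crab = total_images // num_crabs           # final images per crab

train_val_images = train_val_crabs * images_per_crab
test_images = test_crabs * images_per_crab

print(scaled_per_crab, variants_per_scaled, images_per_crab)
print(train_val_images, test_images, train_val_images + test_images)
```

Under this assumption, the derived split reproduces the 58,800 and 5250 images reported for the training/validation and generalization sets.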
2.2. Overall Process Flow of the Proposed Method
This study proposes a method for extracting features from the carapace of Chinese mitten crabs and using them for verification and recognition, as illustrated in the flowchart in Figure 2.
2.3. Methods
This subsection introduces the technological advancements integrated into the MobileNetV2 architecture, focusing on modifications such as the coordinate attention mechanism and the ArcFace loss function. These enhancements are crucial for improving the accuracy and efficiency of the model in extracting and recognizing distinct features from crab carapace images.
2.3.1. Feature Extraction Network
MobileNetV2, proposed by Sandler et al., is a neural network model renowned for its lightweight design [22]. The model emphasizes efficiency and incorporates innovative features such as depthwise separable convolutions, inverted residual blocks, and linear bottleneck structures. These designs enable MobileNetV2 to maintain a high level of accuracy while achieving exceptional efficiency, making it widely applicable to various image recognition tasks [23,24,25]. Therefore, this study employed MobileNetV2 as the foundational architecture for the feature extraction network.
Additionally, because the number of crabs in the Chinese mitten crab image traceability task is not fixed, this study modified the output layer of the MobileNetV2 network to enhance the model’s generalization ability in recognizing crab features. Specifically, we replaced the traditional classification output layer with a new fully connected layer designed to output 128-dimensional feature vectors.
This fixed-dimension feature output method provides a standardized and information-rich input for subsequent feature similarity calculations, not only enhancing the model’s ability to handle unseen samples but also facilitating the model’s deployment in real-world application scenarios. The model structure is illustrated in Figure 3.
2.3.2. Coordinate Attention Mechanism
In computer vision, attention mechanisms are inspired by the human brain’s focus on the detailed characteristics of areas of interest, learning target details layer by layer while suppressing irrelevant information, thus significantly highlighting the features of the target area. Attention mechanism modules have been proven to enhance performance in computer vision tasks and are widely used in fields such as image classification, object detection, and image segmentation [26].
To improve the feature extraction capability of the Chinese mitten crab traceability model and focus it on important morphological features, this study introduces the Coordinate Attention mechanism (CA) [27]. The CA refines the attention allocation across the spatial dimensions of the input feature map, enhancing the recognition of key morphological features on the crab carapace, such as grooves, protrusions, and textures. The structure of the CA is shown in Figure 4.
The CA module adopts an effective method to capture channel relationships and positional information, achieving the further suppression of background noise and focusing on the key information of the crab carapace, thereby outputting refined features that more accurately represent the essence of the crab carapace. It decomposes any feature map in the convolutional layer into two different directions for feature encoding, thereby acquiring long-range dependencies in one spatial direction and precise positional information in the other. This method of directly embedding positional information into channel attention can be complementarily applied to the input feature map, thus enhancing the target representation in areas of interest. The specific steps are as follows:
First, the crab carapace feature map is pooled along the horizontal direction (kernel ($H$, 1)) and vertical direction (kernel (1, $W$)) to obtain the positional information of the input feature map along the x axis and y axis. The average pooling in the horizontal direction can be expressed as follows:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
where $x_c$ is the input feature map of channel $c$ and $z_c^h(h)$ represents the average of the features along the width $W$ at a specific height $h$. Similarly, the average pooling in the vertical direction can be expressed as follows:
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
where $z_c^w(w)$ represents the average of the features along the height $H$ at a specific width $w$. These two operations yield feature descriptors that capture information along the horizontal and vertical directions, respectively, with each independently summarizing the statistical information of the entire feature map.
Secondly, the feature maps from the horizontal and vertical pooling results are concatenated and fed into a 1 × 1 convolution to obtain the attention feature map:
$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimensions, $F_1$ is the 1 × 1 convolution, and $\delta$ is a nonlinear activation function. The activated feature map $f$ is then split along the spatial dimension into $f^h$ and $f^w$, which are processed separately in the horizontal and vertical directions:
$$g^h = \sigma\left(F_h\left(f^h\right)\right), \quad g^w = \sigma\left(F_w\left(f^w\right)\right)$$
In the equation, $g^h$ and $g^w$ represent the attention weights along the horizontal and vertical directions, respectively, $F_h$ and $F_w$ are 1 × 1 convolutions, and $\sigma$ is the sigmoid function.
Finally, the input feature map is multiplied by the horizontal and vertical attention weights obtained from the 1 × 1 convolutions and sigmoid operations to output the final features:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
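The steps above can be sketched in plain Python for a single feature map. This is an illustrative sketch rather than the implementation used in this study: the function and parameter names are hypothetical, the weights are random rather than learned, batch normalization is omitted, and ReLU stands in for the nonlinear activation (an assumption).

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1x1(weights, feats):
    """1x1 convolution: feats is [C_in][L], weights is [C_out][C_in]."""
    length = len(feats[0])
    return [[sum(w_row[c] * feats[c][p] for c in range(len(feats)))
             for p in range(length)] for w_row in weights]

def coordinate_attention(x, w1, w_h, w_w):
    """x is a [C][H][W] feature map; returns a reweighted map of the same shape."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Step 1: directional average pooling (one value per row, one per column).
    z_h = [[sum(x[c][h]) / W for h in range(H)] for c in range(C)]
    z_w = [[sum(x[c][h][w] for h in range(H)) / H for w in range(W)]
           for c in range(C)]
    # Step 2: concatenate along the spatial axis, shared 1x1 conv, activation.
    z = [z_h[c] + z_w[c] for c in range(C)]
    f = [[max(0.0, v) for v in row] for row in conv1x1(w1, z)]  # ReLU (assumed)
    # Step 3: split per direction, 1x1 conv and sigmoid give attention weights.
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, [r[:H] for r in f])]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, [r[H:] for r in f])]
    # Step 4: reweight every position by its row weight and its column weight.
    return [[[x[c][h][w] * g_h[c][h] * g_w[c][w] for w in range(W)]
             for h in range(H)] for c in range(C)]

def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
C, Cr, H, W = 4, 2, 3, 5   # Cr is the reduced channel count (hypothetical value)
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
y = coordinate_attention(x, random_matrix(Cr, C), random_matrix(C, Cr),
                         random_matrix(C, Cr))
print(len(y), len(y[0]), len(y[0][0]))
```

Because both attention weights lie in (0, 1), each output value is a damped copy of the corresponding input value, with the damping determined jointly by the row and the column of that position.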
2.3.3. Loss Function
The loss function plays a crucial role in model training, serving as a key factor in ensuring effective learning and accurate predictions by the model. Traditional classification models typically use the Softmax loss function [28], which maps output features to probabilities in the range (0, 1) and classifies the features based on these probabilities. Although it ensures the separability of classes in recognition tasks, it lacks constraints on intra-class and inter-class distances. This limitation makes it less suitable for tasks requiring fine-grained individual identification, such as those distinguishing individual Chinese mitten crabs.
To address this issue, this paper employs the ArcFace loss function [29]. This has proven effective in enhancing feature separability in facial recognition tasks, which is analogous to the challenge we face in identifying individual Chinese mitten crabs based on subtle differences in carapace features. The purpose of ArcFace is to encourage the learned feature vectors of the same class to cluster more closely together in angular space while larger angular distances are maintained between inter-class feature vectors, thereby improving the discriminability of the features. The specific expression is shown as follows:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \ne y_i} e^{s \cos \theta_j}}$$
In the equation, $N$ is the number of samples in the training batch, $y_i$ is the true label of the $i$-th sample, $\theta_{y_i}$ is the angle between the feature vector of the sample and the weight vector of its corresponding category $y_i$, $\theta_j$ is the angle between the feature vector and the weight vector of category $j$, $s$ is the scaling factor used to control the radius of the feature space, and $m$ is the margin added to increase the angular separation between feature vectors, thus enhancing the model’s discriminative ability.
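To make the role of the scaling factor and the margin concrete, the loss can be sketched for inputs that are already cosines between a normalized feature vector and each normalized class weight vector. This is a minimal sketch with hypothetical function names; the default values s = 64 and m = 0.5 are those commonly used in the ArcFace paper and are an assumption here, since the values used in this study are not given in this section. The special handling that practical implementations apply when the angle plus the margin exceeds π is omitted.

```python
import math

def arcface_loss(cosines, label, s=64.0, m=0.5):
    """ArcFace loss for one sample.

    cosines[j] is cos(theta_j) between the normalized feature vector and the
    normalized weight vector of class j; label is the true class index.
    """
    theta_y = math.acos(max(-1.0, min(1.0, cosines[label])))
    logits = [s * c for c in cosines]
    # Add the angular margin to the true class only.
    logits[label] = s * math.cos(theta_y + m)
    # Numerically stable negative log-softmax of the true-class logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def arcface_batch_loss(batch_cosines, labels, s=64.0, m=0.5):
    """Mean ArcFace loss over a batch of samples."""
    losses = [arcface_loss(c, y, s, m) for c, y in zip(batch_cosines, labels)]
    return sum(losses) / len(losses)

# Illustrative cosines for one sample and three classes (made-up values).
example = [0.8, 0.1, -0.2]
print(arcface_loss(example, 0, s=10.0, m=0.0))   # no margin: plain scaled softmax loss
print(arcface_loss(example, 0, s=10.0, m=0.5))   # with margin: larger penalty
```

With the margin set to zero, the expression reduces to a softmax loss on scaled cosines; a positive margin lowers the true-class logit, so the same sample incurs a larger loss until its feature moves closer to its class center.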
2.3.4. Similarity Calculation
After obtaining the image features using the model, the similarity between two crab carapace images can be determined using cosine similarity. Suppose the features of two crab carapace images extracted by the feature extraction network are represented as a tuple ($A$, $B$). The similarity calculation is as shown in the following equation:
$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
In the equation, $A_i$ and $B_i$ are the $i$-th components of the two feature vectors and $n$ represents the dimension of the feature vector.
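The similarity calculation can be sketched as follows. The function names are hypothetical; the threshold of 0.90 is the value reported in Section 3.1, and treating a similarity exactly equal to the threshold as a match is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_crab(a, b, threshold=0.90):
    """Decide whether two feature vectors belong to the same crab."""
    return cosine_similarity(a, b) >= threshold

# Short made-up vectors; the features in this study are 128-dimensional.
print(same_crab([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # same direction
print(same_crab([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal
```

Cosine similarity depends only on the direction of the feature vectors, which matches the angular formulation of the loss used for training.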
2.3.5. Evaluation Metrics
To quantitatively analyze the effectiveness of the model, we used Accuracy to evaluate the model’s performance [30]. The equation is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
In the equation, True Positive (TP) refers to the number of image pairs correctly identified as the same crab, True Negative (TN) refers to the number of image pairs correctly identified as different crabs, False Positive (FP) refers to the number of image pairs incorrectly identified as the same crab, and False Negative (FN) refers to the number of image pairs incorrectly identified as different crabs.
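As an illustration of how the four counts arise in pair verification, the following sketch computes Accuracy from a list of pair similarities and ground-truth labels. The function name and the example values are hypothetical, and the 0.90 threshold is taken from Section 3.1.

```python
def pair_accuracy(similarities, is_same, threshold=0.90):
    """Return (accuracy, (tp, tn, fp, fn)) for a set of image pairs."""
    tp = tn = fp = fn = 0
    for sim, same in zip(similarities, is_same):
        predicted_same = sim >= threshold
        if predicted_same and same:
            tp += 1          # same crab, accepted
        elif not predicted_same and not same:
            tn += 1          # different crabs, rejected
        elif predicted_same and not same:
            fp += 1          # different crabs, wrongly accepted
        else:
            fn += 1          # same crab, wrongly rejected
    return (tp + tn) / (tp + tn + fp + fn), (tp, tn, fp, fn)

# Made-up similarities for four pairs: two matching, two non-matching.
accuracy, counts = pair_accuracy([0.95, 0.80, 0.92, 0.40],
                                 [True, True, False, False])
print(accuracy, counts)
```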
3. Experiment and Analysis
This section reports on the setup, execution, and results of the experiments conducted to test the developed model. It includes detailed analyses of network training outcomes, ablation studies to assess component impacts, comparisons with other algorithms, and tests of the model’s generalization capabilities.
3.1. Experimental Configuration
The experiments in this paper were conducted on a Windows 10 system using Python 3.8 and the PyTorch 1.9.1 deep learning framework, with CUDA 11.1 for GPU computing. The hardware comprised an NVIDIA RTX 3080 Ti graphics card (NVIDIA, Santa Clara, CA, USA) with 12 GB of VRAM and an AMD Ryzen 7 5800X3D 8-Core processor (AMD, Santa Clara, CA, USA).
During the training phase, each RGB crab carapace image in the dataset was resized to 112 × 112 pixels, and the pixel values were normalized to the range of [−1, 1]. This normalization helps to maintain a consistent distribution in the training data, accelerates model convergence, and prevents the problem of gradient vanishing to some extent. All feature embedding dimensions were set to 128. The overall experimental process used SGD as the training optimizer, with the batch size set to 128, and the model was trained for 5000 iterations.
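The exact normalization formula is not stated above. One common mapping from 8-bit pixel values to [−1, 1], assumed here purely for illustration, subtracts and divides by 127.5:

```python
def normalize_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1] (assumed mapping)."""
    return (value - 127.5) / 127.5

def normalize_image(pixels):
    """Apply the mapping to a 2-D list of pixel values (one channel)."""
    return [[normalize_pixel(v) for v in row] for row in pixels]

print(normalize_image([[0, 128], [255, 64]]))
```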
In the testing phase, ten-fold cross-validation was used to evaluate the performance of the algorithm. With the similarity threshold set at 0.90, ten cross-validation tests were run and their results were averaged. This method allows us to more thoroughly explore the model’s generalization ability, thereby ensuring the robustness and reliability of our findings.
3.2. Network Training Results
The training set was input into the improved MobileNetV2 network for training, and the results are shown in Figure 5 and Figure 6.
The loss curve in Figure 5 exhibits a typical pattern of a rapid initial decline followed by gradual stabilization. The sharp reduction in loss during the initial phase (0–1000 iterations) reflects the early learning stage of the model, where the optimizer effectively reduces the high error rate by significantly adjusting the model parameters. This is followed by a period of fluctuation (1000–2000 iterations), indicating that the model is refining its parameters and learning is still ongoing, although the changes are no longer as drastic. As training progressed beyond 2000 iterations, the rate of loss reduction gradually decreased and approached zero, and a plateau formed and persisted for the remainder of the training process (3000–5000 iterations). This plateau indicates that the model reached a convergence point where additional training provided only a minimal improvement in loss metrics, suggesting that the learning capacity of the current model was maximized.
Figure 6 shows the trend of accuracy changes during the model training process. In contrast to the loss values, the accuracy rapidly increases in the early stages of training, reflecting a significant enhancement in the model’s ability to differentiate between classes. After about 2000 iterations, the growth in model accuracy slows and stabilizes, reaching a high level and fluctuating within a narrow range, indicating that the model’s ability to recognize the training data has become saturated.
Combining the analyses of Figure 5 and Figure 6, we can conclude that the proposed model exhibits good learning capabilities and stability during the training period, laying the foundation for its potential application in traceability tasks of Chinese mitten crab images.
3.3. Ablation Experiment
To assess the specific impact of different components on model performance, this study conducted a series of ablation experiments. These experiments aimed to verify the contributions of the coordinate attention mechanism and the choice of loss function to the recognition accuracy of Chinese mitten crabs; the results are shown in Table 2.
In the baseline model MobileNetV2, which only outputs feature vectors, we observed a recognition accuracy of 69.96%, setting the benchmark for subsequent experiments. After introducing the coordinate attention mechanism, the model’s accuracy improved to 75.72%. This significant increase demonstrates the effectiveness of the CA mechanism in enhancing the model’s ability to recognize the morphological features of river crabs. This suggests that the CA mechanism, by finely allocating attention across spatial dimensions, enables the model to focus more on the key morphological features of the crab carapace, thus improving recognition precision.
Furthermore, simply integrating the ArcFace loss function into MobileNetV2 resulted in a further substantial increase in accuracy, reaching 97.71%. The ArcFace loss function optimizes the feature space by introducing angular margins, promoting intra-class compactness and inter-class distinguishability and significantly enhancing the model’s discriminative ability. This result underscores the importance of angular margins in class discrimination and their contribution to enhancing feature differentiation.
Ultimately, combining the strengths of the coordinate attention mechanism and the ArcFace loss function in one model achieved the highest accuracy, at 98.56%. This further proves the complementary nature of CA and ArcFace; their combination not only enhances feature extraction capabilities but also optimizes the separation between categories. This integrated approach provides an effective strategy for achieving the high-accuracy recognition of individual Chinese mitten crabs.
3.4. Comparison of Different Algorithms
To comprehensively assess the efficacy and practicality of the proposed model, this paper conducted comparative experiments with several other advanced lightweight facial recognition models on 6000 validation image pairs. The selected comparison models include widely recognized industry benchmarks such as ShuffleFaceNet [31], MobileFaceNet [32], and VarGFaceNet [33], which demonstrated a good baseline performance in the field of facial recognition. The key metrics that were examined were model size, test time, and accuracy, aiming to provide a comprehensive performance evaluation. The results are shown in Table 3.
An analysis of Table 3 shows that each model exhibited varying degrees of efficiency and accuracy while maintaining a lightweight structure. ShuffleFaceNet, although relatively small (10.3 M), has a processing time of 10.57 s and an accuracy rate of only 84.93%, indicating its limited accuracy on this task. MobileFaceNet has the smallest model size (4.0 M) and the shortest processing time (8.98 s), with an accuracy rate of 87.46%, demonstrating its high efficiency and reasonable accuracy. The VarGFaceNet model is the largest (24.3 M) and has the longest processing time (33.81 s), but its accuracy rate of 92.01% indicates that it sacrifices efficiency to achieve a higher accuracy.
Compared to these models, the model proposed in this paper achieves the best balance between model size (11.0 M) and accuracy (98.56%), reaching the highest accuracy while maintaining a competitive processing time (10.21 s), with an average verification time of only 1.7 milliseconds per image pair. Although it is larger than MobileFaceNet (11.0 M versus 4.0 M), the accuracy is significantly improved. Compared to VarGFaceNet, our model is less than half the size while still achieving the highest recognition accuracy. These experimental results emphasize the effectiveness of the strategies proposed in this study, especially in the verification of Chinese mitten crab carapace recognition while maintaining a lightweight structure.
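The per-pair verification time follows from the total processing time if, as assumed here, the reported 10.21 s covers all 6000 validation image pairs:

```python
total_time_s = 10.21   # processing time reported for the proposed model
num_pairs = 6000       # validation image pairs used in the comparison

ms_per_pair = total_time_s / num_pairs * 1000
print(round(ms_per_pair, 2))
```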
3.5. Model Feature Extraction Generalization Test
For the generalization test set of 5250 crab carapace images, the 128-dimensional feature vectors of the crab carapaces were extracted using the improved MobileNetV2 after training. Then, t-SNE manifold learning [34] was used to reduce the dimensionality of the feature vectors and visualize their distribution. The legend in the figure indicates the Chinese mitten crab numbers, with different colors representing different individual crabs. As shown in Figure 7, the crab carapace features extracted from 5250 images of 10 different crabs are clustered into 10 categories, with compact intra-class distributions and clear inter-class separation, demonstrating that the crab carapace features extracted by the improved MobileNetV2 after training possess excellent recognition and generalization capabilities.
4. Discussion
In this study, we propose and validate a lightweight Chinese mitten crab image verification model based on an improved MobileNetV2 and ArcFace loss function. By incorporating the Coordinate Attention (CA) mechanism and ArcFace loss function, the model not only maintains its light weight and processing speed but also improves in terms of accuracy and generalizability, which was not achieved in previous studies.
Our results support the initial hypothesis that attention mechanisms and angular margin losses can significantly enhance recognition accuracy. This is similar to the findings of Hu et al. [35], who discovered that Squeeze-and-Excitation (SE) blocks could significantly improve network performance by recalibrating the feature responses of channels, thereby enhancing the model’s representational power. Meanwhile, the ArcFace loss was proven by Deng et al. to effectively improve inter-class separability, offering better differentiation among highly similar individuals [29].
The design of the model in this study considers the needs of practical applications; it has an optimized computational efficiency and model size, making it suitable for resource-limited devices, which is particularly important in actual aquaculture scenarios. Moreover, the traceability method provided by this study supports the sustainable development of the aquaculture industry by enhancing the accuracy and efficiency of traceability, helping to establish a more transparent supply chain.
However, there are still some challenges in the field of Chinese mitten crab carapace recognition. Due to the lack of publicly available Chinese mitten crab carapace datasets, the dataset used in this experiment needs further expansion. Future research may involve applying the model to larger and more diverse datasets to further enhance the model’s generalization ability. Additionally, exploring more efficient and accurate algorithms to further improve the accuracy of Chinese mitten crab recognition is an important direction for future research.