1. Introduction
Nasopharyngeal carcinoma (NPC) is a malignant tumor that arises at the top and lateral walls of the nasopharynx and is endemic in southern China, North Africa, and Southeast Asia [1]. According to World Health Organization data from 2021 [2], the number of new NPC cases diagnosed globally reached 133,000. The incidence of NPC in China is higher than the world average, and new NPC cases in China account for about 50% of the world's total. NPC has one of the highest incidences among malignant tumors in China and the highest among otolaryngologic malignancies. Research on NPC therefore needs to continue; for example, a recent study discovered a new synergistic relationship associated with distant metastasis in patients with NPC [3]. Detailed examination of magnetic resonance (MR) imaging is necessary to accurately depict the primary tumor, and preoperative MR is used as a routine clinical procedure to assess tumor progression in NPC diagnosis. Most nasopharyngeal cancers are moderately sensitive to radiation therapy, which is the treatment of choice for NPC. Tumor detection and segmentation in MR images is an important step in computer-aided tumor diagnosis [4]. A reliable automatic detection model can quickly detect tumor areas and effectively reduce radiologists' radiation therapy planning workload.
At present, there are few studies on the detection of NPC. NPC tumors usually occupy a small volume in MR images, the tissue background is closely connected to the tumor, and the border shapes are complex and variable, difficult to distinguish, and sometimes impossible to identify by eye, making the detection of NPC lesions challenging. To address these issues, several segmentation methods for NPC detection have been proposed in previous work. In 2015, Huang et al. [5] introduced a region-based NPC segmentation method, which used clustering and classification to segment nasopharyngeal carcinoma from MR images. In 2018, Mohammed et al. [6] proposed a new method for diagnosing NPC from endoscopic images, which includes a trainable segmentation of NPC tissues, a genetic algorithm to select the best features, and a support vector machine for classifying NPC, and the detection shows high accuracy. The disadvantage is that this method needs to display tumors on several slices, and doctors need to draw separate ROIs on different tumor slices to detect NPC in one patient, which is complex and requires a lot of time and money. In 2019, Zhao et al. [7] proposed a DL method that used a deep convolutional neural network (DCNN) to achieve automatic NPC segmentation on 2D PET-CT images, with a Dice similarity coefficient (DSC), sensitivity, and accuracy of 0.785, 0.764, and 0.789, respectively. However, it collected PET-CT images from only 22 patients, a small number of samples, and increased the complexity of the image data. In 2020, Chen et al. [8] proposed a new multimodal MR fusion network (MMFNet) based on a multi-encoder and introduced a 3D-CBAM attention module to highlight informative features. This method mainly uses different forms of MR images to complete accurate segmentation. The above methods [5,6,7] are complex and demanding in terms of data processing for NPC detection, increasing the complexity of NPC images (multimodality). Their experiments used 256, 381, and 1100 images, respectively, and the data scale is small. Our experiment uses 4694 NPC MR images for lesion detection, making the results more reliable. Existing NPC segmentation methods for MR images need to be improved with respect to data processing cost, data scale, and algorithm performance.
The difficulty of NPC detection is partly due to the small size of the NPC lesion area in proportion to the whole MR image, while the complex background occupies the major part. Moreover, the shape of nasopharyngeal carcinoma is diverse, the boundary between background and tumor is blurred, and the lesion shape is complex and difficult to distinguish, all of which make NPC detection very difficult. In 2022, Wang et al. [9] proposed a new network based on an improved Mask R-CNN framework using global-local attention to detect abnormal lymph nodes in MR images with good performance. Inspired by this work, we introduce an attention mechanism [10] to enhance the feature representation of NPC lesion regions and weaken the influence of background regions.
Traditional NPC detection methods take a long time, and algorithm accuracy is affected by the way image features are extracted [11,12]. These methods rely on manual operation, which is costly and error-prone. In contrast, deep neural networks have powerful automatic representation learning capabilities. As one of the main architectures of DL, the convolutional neural network (CNN) provides superior performance for classification, segmentation, and detection tasks in digital pathological images [13]. CNN-based architectures are therefore often used as tools for faster and more accurate diagnosis by processing multimodal MRI images [14]. Deep neural networks have been widely used in medical image detection and segmentation tasks. Zhang et al. [15] developed a computer-aided detection method based on the deep learning model Faster R-CNN, which has the potential to detect brain metastases with high sensitivity and reasonable specificity. Elakkiya et al. [16] proposed a hybrid deep learning technique, a small object detection generative adversarial network (SOD-GAN) based on R-CNN, to automatically detect and classify cervical precancerous and malignant lesions from deep features without any preliminary classification and segmentation assistance. However, the R-CNN and Faster R-CNN methods only focus on a number of randomly selected candidate regions rather than scanning the entire input image, so they may miss key regions or concentrate on unimportant ones. Both cases lead to erroneous detection and classification results, whereas the YOLO object detector scans the entire input image for classification and region detection. Salman et al. [17] developed an automatic tool for the detection and diagnosis of prostate cancer based on the YOLO algorithm, and empirical results demonstrate that high-performance prostate cancer diagnostic tools can be developed using the object detection approach. Such diagnostic tools can reduce inter-observer variation between pathologists and decrease delays in the diagnostic phase. Salman et al. [17] also show that the YOLO algorithm can perform well in cancer detection. In addition, there has been no research on NPC detection in MR images using the YOLO algorithm. To address the above challenges of NPC detection, we improve the YOLO algorithm by integrating the MWSR and YLCA modules, which locate objects more accurately during lesion detection and provide higher detection performance and faster real-time detection of NPC lesions.
Deep neural network-based object detectors continue to evolve and are used in various applications. Object detectors accomplish both classification and localization by providing the location of the object as well as category labels and confidence scores, which are essential in high-impact real-world applications, and new methods are constantly being proposed. The CNN-based one-stage object detectors OverFeat [18], YOLO [19], SSD [20], and RetinaNet [21] were improved step by step and form the basis for subsequent research in the object detection domain. In 2016, Redmon et al. [19] proposed the YOLO (You Only Look Once) algorithm, which treats object detection as a spatial location regression problem containing category information and forms a new paradigm for object detection. After continuous optimization and innovation, Wang et al. [22] proposed the YOLOv7 detector in 2022 and conducted validation experiments on the PASCAL VOC and MS-COCO datasets. Empirical results demonstrate that YOLOv7 outperforms all known object detectors in the range of 5 FPS to 160 FPS in terms of speed and accuracy. The YOLOv7 algorithm is among the most advanced methods for object detection and classification. We use a detector based on the YOLOv7 algorithm to perform NPC detection, mainly because it determines the lesion area more accurately by analyzing the input features of the entire image. When the detector is trained on images from a specific domain, it performs localization and classification tasks more accurately than previous algorithms, with higher localization and classification rates. Moreover, the algorithm performs real-time detection faster than other algorithms [22].
Medical images are more complex and have greater variability than natural scene images. In recent years, convolutional neural networks (CNNs) have been successfully applied to automatic medical image detection, and the automatic detection of NPC in MR images has effectively reduced the doctor's workload in NPC diagnosis. In this paper, we propose an automatic method (MWSR-YLCA) for the detection and diagnosis of NPC. Specifically, we design two modules within the MWSR-YLCA method, a multi-window-settings resampling (MWSR) module and an improved YOLOv7 with an embedded coordinate attention mechanism (YLCA) module, to detect NPC lesions more accurately. First, the MWSR module processes MR images of NPC through an image resampling method based on multiple window settings, which uses a windowing technique to fuse the optimal window width/window level with nearby image information to enrich the amount of image information. Subsequently, the new YLCA network is constructed by embedding a fused attention mechanism to enhance the feature representation of objects of interest for automatic detection and diagnosis of NPC. Due to the lack of public NPC detection datasets, we trained and evaluated the proposed model on our own collected dataset, which includes 26,000 MR images of 800 patients. Through extensive experiments using 4694 MR images containing lesion annotations, we evaluated the effectiveness of the proposed MWSR-YLCA and obtained high-accuracy NPC lesion detection performance. The main contributions of this paper are as follows.
- (1)
We use a multi-window-setting-based image resampling method (MWSR) to process NPC MR images. This method uses windowing technology to fuse image information from several windows (the optimal window and nearby windows), which reduces information loss from the original image and enriches the image information fed to the model. NPC detection performance using our method is improved compared to detection on the original images, providing a new way of preparing medical MR images for NPC detection.
- (2)
We propose an NPC detection network, YLCA, for the automatic detection and diagnosis of NPC, which builds a new network on the YOLOv7 object detection framework, embeds a fused attention mechanism, and designs an MP-CA block to enhance the feature representation of objects of interest. In extensive experimental evaluation, our detection network obtained the highest mAP (80.2%) and F1 score (0.77) among all compared methods, proving that it is more effective for NPC MR image detection.
2. Method
The MWSR-YLCA method proposed in this paper consists of two main parts that jointly realize the detection of NPC lesions. The first part (MWSR) uses multi-window technology to resample the NPC MR image to obtain a three-channel (RGB) pseudo-color image for model training and evaluation. The second part (YLCA) is based on the YOLOv7 [22] framework; it embeds the coordinate attention (CA [23]) mechanism, constructs the attention convolution module MP-CA, obtains attention features, and fuses them to construct the YLCA network, thereby improving the detection performance of the network.
2.1. Window Technique
Window technology in the field of medical imaging involves the window width (WW) and window level (WL), which are used to select the range of CT values of interest. Because each tissue structure has a different range of CT values, when displaying a certain tissue structure, a window width and window level suited to observing the tissue or lesion should be selected to obtain the best display effect. MR images are reconstructed analog-to-digital grayscale images and can therefore also be given optimal display and various kinds of image post-processing using windowing techniques. However, unlike CT, the gray scale of an MR image does not represent the density of soft tissues and lesions but rather their MR signal intensity, reflecting the length of the relaxation time; therefore, the windowing technique for MR imaging has no fixed window width/level, which needs to be adjusted for each image. The DICOM protocol specifies that medical images be stored at 16 bits (the actual number of bits used may differ), meaning pixel brightness is expressed in gray levels, and the role of windowing is to take the gray values within a certain range of a grayscale image and map them to the display gray levels (usually 256), so as to show more image detail.
Figure 1 shows the MR image window width and window level diagram. First, we set a range: the gray value range of the tissue being observed is listed separately and called the window. Gray values within this range are taken from the MR grayscale range and mapped to the displayed image. Tissue whose gray value is above the window range is displayed as white; tissue below the window range appears black. The size of this MR grayscale range is called the window width (WW), and the central value of the range is called the window level (WL).
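As a minimal illustration of the linear window mapping described above, the following Python sketch maps stored pixel values to display gray levels; the function name and the 8-bit (256-level) output range are our own choices, not part of the original pipeline.

```python
import numpy as np

def apply_window(pixels: np.ndarray, ww: float, wl: float) -> np.ndarray:
    """Map raw stored values to 8-bit display gray levels using a
    window of width `ww` centered at level `wl`."""
    low = wl - ww / 2.0   # values at or below `low` display as black
    high = wl + ww / 2.0  # values at or above `high` display as white
    out = (pixels.astype(np.float32) - low) / (high - low)  # linear mapping
    return (np.clip(out, 0.0, 1.0) * 255).astype(np.uint8)
```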
2.2. YOLOv7
The convolutional neural network (CNN) is a major branch of neural networks and one of the main algorithms of deep learning for image applications [24,25]. It is a deep feedforward neural network with three characteristics, local connectivity, weight sharing, and downsampling, which effectively reduce network complexity and prevent overfitting. The core of a CNN is the convolution kernel, and the network is composed of several convolutional layers, pooling layers, and fully connected layers. The convolutional layers extract different features of the input image through convolution operations. The pooling layers reduce the feature dimensionality of the data by partitioning the features; for images, their main function is to compress image features. The fully connected layers connect the extracted features to generate global features for image classification.
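As a minimal sketch of how these three layer types combine (this is an illustration only, not the network used in this paper; the channel and input sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN illustrating convolution, pooling, and fully connected layers."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: local feature extraction
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # pooling: compress feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 160 * 160, num_classes)  # fully connected: global features

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 3, 640, 640)
        return self.classifier(self.features(x).flatten(1))
```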
The YOLOv7 [22] model was proposed in 2022 and validated on the COCO dataset, where it obtained better performance, standing out for faster speed and higher accuracy compared to the latest object detectors and attracting much attention. The general architecture of YOLOv7 consists of a backbone, neck, and head. The entire network has 106 layers, of which 51 form the backbone and 55 the head, and it consists mainly of modules such as CBS, MP, ELAN, and SPPCSPC. The structure of each module is shown in Figure 2. Compared to previous YOLO models, YOLOv7 was architecturally reformed using E-ELAN [22] and compound model scaling. It outperforms all real-time object detectors in terms of speed and accuracy, improving performance while reducing parameters by 40% and computation by 50%.
2.3. Attention Mechanism
The attention mechanism is derived from the study of human vision. Generally speaking, because humans have a limited capacity to process information, they selectively focus on the more important part of the available information and ignore the rest. The attention mechanism follows the same logic humans use when looking at a picture: we do not take in the whole picture, but focus our attention on its focal point. The core logic of an attention mechanism is to focus on the focal point rather than the whole. Attention mechanisms have been widely used and achieve good performance in a variety of computer vision tasks, such as image classification, image segmentation, and object detection. In neural networks, the attention mechanism is mainly implemented through attention scores. An attention score is a value between 0 and 1, the sum of all scores under an attention mechanism is 1, and each score represents the attention weight assigned to the current item. Attention mechanisms allow the neural network to ignore unimportant feature vectors and focus on computing useful ones, eliminating the interference of unimportant features on the fitted results while improving computation speed. There are many types of attention mechanisms, such as channel attention [26], spatial attention [27], self-attention [28], mutual attention [29], coordinate attention [23], and mixed attention.
For mobile networks, the standard squeeze-and-excitation (SE) attention mechanism effectively constructs interdependence between channels by simply squeezing each two-dimensional feature map, which is significantly effective for improving model performance. However, SE [26] attention only considers encoding the importance of inter-channel information while ignoring positional information, which largely influences the generation of selective attention maps and is important for focusing on feature regions of interest. In this paper, experiments are conducted using embedded coordinate attention (CA) [23], which splits the channel attention of SE [26] into two parallel 1D feature encodings and aggregates the features separately along the two directions. The method embeds localization information in channel attention, enabling it to capture long-range correlations along one spatial direction while maintaining accurate location information along the other, effectively integrating spatial coordinate information into the generated attention maps. These maps are applied to the input feature maps to enhance the representation of objects of interest by supplementing the feature map information, which is essential for locating object regions in computer vision tasks. The schematic diagram of the CA network structure is shown in Figure 3.
CA [23] encodes channel relationships and long-range dependencies with precise location information in two stages: coordinate information embedding and coordinate attention generation.
- (1)
Coordinate Information Embedding
Squeezing in the SE module is used for global information embedding. Given the input feature tensor $X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ in the network, the squeezing step for the $c$-th channel can be formulated as follows:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \tag{1}$$

where $z_c$ is the output associated with the $c$-th channel, and $H$ and $W$ are the height and width of the input feature map. The input $X$ comes directly from a convolution layer with a fixed convolution kernel, and the feature tensor set is obtained by convolution processing.
To enable the attention block to spatially capture long-range interactions with precise location information, the global pooling is decomposed into a pair of one-dimensional feature encoding operations. Given an input $X$, we encode each channel using pooling kernels of size $(H, 1)$ or $(1, W)$ along the horizontal and vertical coordinates, respectively. The outputs of the $c$-th channel at height $h$ and at width $w$ are obtained as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{2}$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{3}$$

These two transformations aggregate features along each of the two spatial directions, yielding a pair of direction-aware discriminative feature maps. This is very different from the SE block of the channel attention method, which produces a single feature vector. The two transformations also help the network locate objects of interest more accurately, allowing the attention module to capture long-range correlations along one spatial direction while maintaining precise positional information along the other.
- (2)
Coordinate Attention Generation
Through the above transformations, we obtain a good global perception and encode accurate location information. To make use of the resulting representations, a second transformation is applied. The aggregated feature maps generated by Equations (2) and (3) are concatenated and then transformed using a shared 1 × 1 convolutional transform function $F_1$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right) \tag{4}$$

where $f \in \mathbb{R}^{C/r \times (H+W)}$ is an intermediate feature map encoding spatial information in the horizontal and vertical directions, $r$ is the reduction ratio used to control the block size as in the SE block, $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension, and $\delta$ is a non-linear activation function. Then, $f$ is split into two independent tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the two spatial dimensions of height and width. Using two other 1 × 1 convolutional transforms $F_h$ and $F_w$, $f^h$ and $f^w$ are transformed into tensors with the same number of channels as the input $X$, respectively, as follows:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \quad g^w = \sigma\left(F_w\left(f^w\right)\right) \tag{5}$$

Here $\sigma$ is the sigmoid function. After the computation of Equation (5), the attention weights $g^h$ and $g^w$ of the input feature map in the height and width directions are obtained. Finally, by multiplying and weighting the original feature map, the final feature map with attention weights in both the width and height directions is obtained. The output $y$ of the coordinate attention block can be written as:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$
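The following PyTorch sketch shows one way to implement Equations (2)-(6). It mirrors the publicly described CA design, but details such as the ReLU non-linearity (the original CA paper uses a hard-swish) and the floor of 8 intermediate channels are assumptions made here for brevity.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention block following Equations (2)-(6)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)             # C/r intermediate channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # Eq. (2): average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # Eq. (3): average over height
        self.conv1 = nn.Conv2d(channels, mid, 1)        # shared transform F1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                # non-linearity (delta)
        self.conv_h = nn.Conv2d(mid, channels, 1)       # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)       # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                            # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (n, c, w, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))  # Eq. (4)
        f_h, f_w = torch.split(f, [h, w], dim=2)        # split along spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))           # Eq. (5): height attention
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # width attention
        return x * g_h * g_w                            # Eq. (6) via broadcasting
```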
2.4. Resampling Based on Multi-Window Settings
In medical imaging, the principle of MR imaging differs from that of CT. Compared with CT, MR images the body's soft tissue better and can provide more information. However, there is no corresponding window width and window level setting for MR images of different soft tissues of the body. Therefore, when labeling a lesion area, professional radiologists need to adjust each image to a suitable window and obtain a better-contrast image under this optimal window width and window level. However, the optimal window selection for the same body part usually differs across doctors, machines, sequences, and processing angles. Given this diversity of window width and window level selection, we believe that the same MR image carries different feature information of varying importance in different windows. If the information in multiple windows can be fused to obtain richer image features, this benefits the deep learning algorithm analyzing those features. From the above analysis, on the one hand, because of the special characteristics of MR images, NPC in MR images cannot be assigned a relatively fixed optimal window region as in CT images; on the other hand, selecting information within any single window may lead to information loss.
In this paper, we perform lesion detection on NPC MR images and need to convert the DICOM images to JPG format for neural network training. Traditional medical image processing methods only acquire the image information under a single window, which leads to a large amount of information loss in the NPC images, so we fuse the image information under multiple windows to obtain a richly layered image for detector training. To improve the detector's utilization of the original image data, transmit more image information to the deep learning model, and let it obtain richer image features, we adopt the image resampling method based on multiple window settings (our previous MWSR research [30]). The DICOM metadata is used to obtain the preset window (default window width and level), and then two other windows to the left and right of the preset window are used to obtain a three-channel pseudo-color image, resulting in a more informative and better-contrasted nasopharyngeal cancer image.
The image resampling based on the multi-window setting proceeds as follows: obtain the preset optimal window width/level ($WW_0$, $WL_0$) from the DICOM-format MR image metadata of the nasopharyngeal cancer scan. Based on the optimal window ($WW_0$, $WL_0$), we set two new window width/level pairs at a certain proportion within the nearby covered gray-level range, namely:

$$WW' = \mu \cdot WW_0, \quad WL' = \mu \cdot WL_0 \tag{7}$$

where $WW'$ and $WL'$ denote the new window width and window level, respectively, and $\mu$ is the weighting factor of the window width and window level.
By observing the image contrast, it was found that the contrast effect was best at $\mu = 0.5$ and $\mu = 1.5$; thus, two new windows ($0.5WW_0$, $0.5WL_0$) and ($1.5WW_0$, $1.5WL_0$) were obtained, and the grayscale images under the three windows were combined as the R, G, and B channels of an RGB image, respectively, to enrich the image information.
As shown in Figure 4, from the range of pixels contained in the MR image (a), the images (b) under the window widths/levels acquired at $\mu$ = 0.5, 1, and 1.5 are extracted, displayed according to their gray scale, and combined as the R, G, and B channels to synthesize the RGB pseudo-color image (c). Specifically, we convert each NPC MR image (DICOM format) in the dataset into grayscale images $I_1$, $I_2$, and $I_3$ (JPG format) under the three window widths/levels ($WW_1$, $WL_1$), ($WW_2$, $WL_2$), and ($WW_3$, $WL_3$), and then the grayscale images under the three windows are synthesized into an RGB pseudo-color image, each corresponding to one of the three RGB channels: $I_1$ corresponds to the B channel, $I_2$ to the G channel, and $I_3$ to the R channel. Processing the MR dataset takes about 4 h in this experimental hardware environment, and the conversion is simple and fast under computer automation.
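A minimal sketch of this conversion, reusing the apply_window helper from Section 2.1. The pydicom usage and the mapping of $\mu$ = 1.5/1.0/0.5 to the R/G/B channels are our assumptions based on the description above, not code from the original pipeline.

```python
import numpy as np
import pydicom

def mwsr_to_rgb(dicom_path: str, mus=(1.5, 1.0, 0.5)) -> np.ndarray:
    """Fuse three windows scaled from the preset window (Equation (7))
    into one (H, W, 3) uint8 RGB pseudo-color image."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    ww0, wl0 = ds.WindowWidth, ds.WindowCenter        # preset (WW0, WL0) from metadata
    if isinstance(ww0, pydicom.multival.MultiValue):  # some files store several presets
        ww0, wl0 = ww0[0], wl0[0]
    ww0, wl0 = float(ww0), float(wl0)
    # One channel per mu; with mus ordered (1.5, 1.0, 0.5), the stack is R, G, B.
    channels = [apply_window(pixels, mu * ww0, mu * wl0) for mu in mus]
    return np.stack(channels, axis=-1)
```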
2.5. YLCA Network
Considering that the attention mechanism can make the neural network focus on computing the most important feature vectors, and that by embedding position information into channel attention, CA [23] can capture not only cross-channel information but also direction-aware and position-sensitive information, locating and identifying objects of interest more accurately, we replace the CBS module with the 1 × 1 convolution in the MP-2 module with a CA block to build the MP-CA module, as shown in Figure 5.
In Figure 5, the CA module uses coordinate attention to process the input features. In YOLOv7, the MP-2 module uses concatenation to fuse the features extracted from the input data by two branches: one consisting of pooling and a 1 × 1 convolution, and one consisting of a 1 × 1 convolution and a 3 × 3 convolution. The MP-CA module constructed in this paper replaces the 1 × 1 convolution in the MP-2 block with the CA module, keeping the inputs and outputs unchanged. The other structures remain unchanged.
In the YOLOv7-based framework, we embedded the fused CA block and MP-CA modules in the backbone and head, respectively, aggregated the primary features extracted at each stage into two independent direction-aware feature maps, encoded them into two attention maps that retain location information, and then applied the two attention maps to the input feature maps to enhance the representation of nasopharyngeal cancer lesion areas, thereby constructing a novel network, YLCA. A schematic diagram of the YLCA network structure is shown in Figure 6, and this model is used to detect lesion areas of nasopharyngeal cancer.
As shown in Figure 6, the YLCA network structure has 107 layers, composed of a backbone and a head: the first 52 layers are the backbone and the last 55 layers are the head. In the backbone, we add the attention mechanism module CA after ELAN as the 51st layer, denoted [−1, 1, CoordAtt, [1024]], where −1 indicates that the output of the previous layer is the input of this layer and the input feature dimension is 1024. In the head, we replace the original MP-2 module with the MP-CA module for attention feature fusion at the 81st layer, expressed as [−3, 1, CoordAtt, [128]], where −3 indicates that the output of the layer three levels above is the input of this layer and the input feature dimension is 128. A schematic diagram of the MP-CA module is also shown in Figure 6; it is used to construct the object detection network YLCA.
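A sketch of the MP-CA idea in PyTorch, using the CoordAtt module from Section 2.3. The exact channel widths and which 1 × 1 CBS is replaced are not fully specified in the text, so the choices below (CA on the max-pooling branch, channel-preserving branches) are assumptions, not the definitive implementation.

```python
import torch
import torch.nn as nn

def cbs(c_in: int, c_out: int, k: int = 1, s: int = 1) -> nn.Sequential:
    """Conv-BN-SiLU block as used throughout YOLOv7."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class MPCA(nn.Module):
    """MP-CA sketch: an MP-2-style block with the 1x1 CBS on the pooling
    branch swapped for coordinate attention (channel counts assumed)."""
    def __init__(self, c: int):
        super().__init__()
        self.branch1 = nn.Sequential(nn.MaxPool2d(2, 2), CoordAtt(c))   # CA replaces the 1x1 CBS
        self.branch2 = nn.Sequential(cbs(c, c, 1, 1), cbs(c, c, 3, 2))  # 1x1 then stride-2 3x3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches halve the spatial size; concatenation doubles channels.
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```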
3. Experimental Settings
3.1. Dataset Description
The experimental data were obtained from the Sun Yat-sen University Cancer Centre. MR images of 800 patients with nasopharyngeal carcinoma acquired from January 2010 to December 2011 were used. These MR images were T2-weighted (T2WI) axial cross-sectional images with the following imaging parameters: fast spin-echo sequence (FSE), TR = 4000 ms, TE = 99 ms, mean slice thickness of 5 mm, slice spacing of 6 mm, and in-plane pixel resolution of 0.74 mm × 0.74 mm.
Of the 800 cases, a total of 26,000 MR images were available. Since the nasopharyngeal lesion area accounts for a small proportion of head MR imaging and the clinical presentation is complex and varied, not every image contains a lesion area, so only some of the images have a labeled cancerous area. We selected 4694 MR images with lesion areas and corresponding annotated images for the experiment, including 3540 images of male patients and 1154 images of female patients (1596 12-bit DICOM images and 3098 16-bit DICOM images). Four experienced imaging physicians formed an expert consensus to provide the tumor area annotations. Data augmentation, including rotation and horizontal flipping, was used. We use the evaluation metrics Precision, Recall, AP (average precision), mAP (mean average precision), F1-score, and confidence to assess NPC detection performance.
3.2. Data Conversion
The medical image data used in this experiment are in DICOM format, and the lesion area corresponding to each MR image in the original data is annotated as a PNG image. In this paper, we use a deep convolutional network for NPC lesion detection, whose annotated regions are rectangular bounding boxes, so we convert the NPC lesion contours outlined by the doctors into rectangular box information and store it in the YOLO annotation format to facilitate detector training. A schematic diagram of the nasopharyngeal cancer lesion labeling process is shown in Figure 7.
In Figure 7, image (a) is the original MR medical image. The specific annotation process consists of taking the pixel-level annotation of the nasopharyngeal cancer lesion area (b), framing the white lesion area (c) with a pixel traversal algorithm, outputting the corresponding coordinates, and converting the coordinate information into the YOLO data format (e), indicating the category Cancer, X-center, Y-center, w, and h, respectively, which is then saved as a txt file. The red dashed arrows in the figure point to the effect of converting the pixel-level annotation of the lesion area (white) into (d) a bounding box covering the real lesion area.
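The mask-to-box conversion can be sketched as follows. Function and variable names here are illustrative, but the output line follows the YOLO format described above (class, x-center, y-center, w, h, normalized to [0, 1]).

```python
import numpy as np

def mask_to_yolo(mask: np.ndarray, class_id: int = 0) -> str:
    """Convert a binary lesion mask (H, W) into one YOLO-format line:
    'class x_center y_center width height', all normalized to [0, 1]."""
    ys, xs = np.nonzero(mask)        # pixel coordinates of the lesion area
    if len(xs) == 0:
        return ""                    # no lesion on this slice
    h, w = mask.shape
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    x_c = (x_min + x_max + 1) / 2.0 / w   # normalized box center
    y_c = (y_min + y_max + 1) / 2.0 / h
    bw = (x_max - x_min + 1) / w          # normalized box size
    bh = (y_max - y_min + 1) / h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {bw:.6f} {bh:.6f}"
```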
3.3. Other Settings
Dataset division: After data processing, a total of 4694 images were used for detector training in the experiment. Considering that detection of the NPC lesion area is essentially single-category object detection and that the test set should not be too large, we divided the NPC MR images of the 800 patients into a training-validation set and a test set at a ratio of 4:1 to ensure a balanced data distribution. We therefore used 3755 images as the training-validation set, of which 3004 images formed the training set and 751 the validation set, and 939 images as the test set.
Object detector setup: The constructed YLCA network was used as the object detector for this experiment. During the experiments, the input image size was 640 × 640 for both training and testing, the batch size was 32, the initial learning rate was 1e-2, the stochastic gradient descent (SGD) optimizer was used, and the whole network was trained for 300 epochs. One iteration (epoch) takes about 2 min to train, and the model is saved after each iteration. For the constructed YLCA detector, performance is optimal at around the 180th epoch.
Experimental environment: The algorithm in this paper is built using the deep learning framework PyTorch and the Python programming language. Training and testing of the model were performed on the Ubuntu 18.04.6 operating system with 128 GB RAM and a high-performance NVIDIA RTX A6000 GPU (48 GB).
3.4. Evaluating Metrics
To ensure the soundness of the experimental results and the fairness of the comparison tests, the algorithm is evaluated using the evaluation metrics widely adopted by existing object detection methods, including Precision, Recall, F1, the PR curve, and mAP.
- (1)
Precision: the proportion of samples correctly predicted as positive among all samples predicted as positive; precision represents the accuracy of the prediction within the positive results.
- (2)
Recall: the proportion of samples correctly predicted as positive among all samples that are actually positive, i.e., the fraction of the actual positives that the model retrieves.
- (3)
F1-score: F1 is the harmonic mean of precision and recall. It balances the influence of precision and recall and evaluates a classifier more comprehensively. A larger F1 indicates a higher-quality model.
- (4)
PR curve: the PR curve is drawn from the precision and recall values to evaluate the model more comprehensively. The larger the area under the PR curve, the higher the average precision of the model.
- (5)
mAP: in this experiment, mAP was used as the main evaluation metric. mAP@0.5 refers to the AP value under the condition that the IoU (overlap between predicted and ground-truth boxes) is greater than 0.5. mAP@0.5:0.95 refers to the mean AP computed over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
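For reference, these metrics follow the standard definitions:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, $P(R)$ is the precision at recall $R$ (the PR curve), and $N$ is the number of object categories; since NPC lesion detection is single-category here, mAP coincides with the AP of the Cancer class.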