1. Introduction
As human society increasingly emphasizes sustainable development, the market demand for environmentally friendly green building materials continues to grow. Wood, being a natural, renewable, and eco-friendly material, is gaining prominence. Its unique grain structure, excellent physical properties, and ease of processing make it highly valuable. Consequently, wood is extensively utilized not only in the construction industry but also in furniture manufacturing, shipbuilding, artistic sculpture, and other fields, underscoring its significant importance [1,2].
However, defects such as cracks, decay, knots, and discoloration inevitably occur as trees grow. These defects not only affect the aesthetics of the wood but also significantly reduce its service life and structural strength, adding a considerable degree of difficulty to its processing. In addition to efforts aimed at preventing defects during wood growth, accurate and efficient detection of defects that emerge during wood processing is crucial. This is vital for ensuring wood quality, improving wood utilization rates, and promoting the healthy development of the timber industry [3].
The development of wood defect detection technology not only helps identify and eliminate defective products promptly during wood processing and use, thereby reducing resource waste and environmental pollution, but also lays a scientific foundation for the rational utilization and processing of wood. Advanced detection technologies allow for accurate assessment of the type, severity, and distribution of wood defects, offering strong support for the classification, grading, and rational use of wood [4].
Wood defect detection techniques can generally be classified into traditional non-destructive testing (NDT) methods and machine-learning-based computer vision techniques. Traditional NDT methods, such as stress waves [5,6,7], X-rays [8,9,10], and ultrasound [11,12], typically identify defects by detecting differences in the physical structure compared to normal wood. However, these methods often fail to identify discoloration-type defects effectively because such defects do not involve significant changes in the wood's physical structure. Furthermore, traditional NDT techniques face several limitations when applied in industrial settings. First, the detection accuracy of these methods needs significant improvement, particularly for complex or hidden defects. Second, their detection efficiency is relatively low, which poses challenges for meeting the high-throughput demands of factories or enterprises. Additionally, these techniques require operators with a high level of expertise to make accurate judgments and interpretations, adding to the operational complexity and limiting their widespread application [13].
Machine-learning-based computer vision techniques include approaches utilizing both traditional machine learning algorithms and deep learning. Traditional machine learning algorithms often use manually extracted features for training and classification. Initially, wood images undergo preprocessing steps such as noise removal and contrast enhancement to improve image quality. Subsequently, features such as color, texture, and geometry are extracted from the images using specific algorithms or techniques. These extracted feature datasets are then used to train various machine learning models, such as decision trees and support vector machines, enabling the models to learn how to accurately classify and localize wood defects [14]. This approach requires prior identification of defective regions and manual involvement in the feature selection and extraction process. Furthermore, its effectiveness is often constrained by the accuracy and comprehensiveness of the extracted features. Additionally, this method may rely heavily on hand-crafted features and prior knowledge, utilizing traditional machine learning algorithms such as the gray-level co-occurrence matrix (GLCM) [15,16], image segmentation [17,18], support vector machines (SVM) [19,20], and wavelet neural networks [21,22]. Deep-learning-based methods utilize neural network models to automatically learn feature representations from raw wood images. Specifically, convolutional neural networks (CNNs), a type of deep learning model, excel at automatically extracting multi-level features from images. By training with large datasets, these models optimize their parameters to achieve accurate detection of wood defects. This approach eliminates the manual feature extraction step, enabling the automatic learning of complex feature representations. Moreover, deep learning methods demonstrate higher efficiency and accuracy, particularly when dealing with large-scale datasets [23]. Wang, M. et al. [24] proposed a feature fusion network model named TSW (i.e., triplet attention mechanism, small target detection head, Wise-IoU loss function)-YOLOv8n, which is based on the YOLOv8 algorithm. This model integrates an attention mechanism and loss function optimization to enhance wood defect detection. The proposed algorithm achieves a high recognition rate, with a mAP of 91.10% and an average detection time of 6 ms. This performance represents a 5.1% improvement in mAP and a reduction of 1 ms in average detection time compared to the original model. However, the detection accuracy of the model for Quartzity defects, one of the common defects in wood, stands at 68.5%, indicating the necessity for further improvements. Cui, W. et al. [25] introduced a novel network named cascade center of gravity YOLOv7 (CCG-YOLOv7) designed to improve the accuracy of detecting small defects on wooden boards. This enhanced network effectively identifies surface imperfections such as knots, scratches, and mold. Additionally, the model exhibits robust performance in detecting these defects under different lighting conditions, which is crucial for industrial production. Lim et al. [26] introduced a set of lightweight and efficient CNN models based on the YOLOv4-Tiny architecture for rapid and precise wood defect detection. They emphasized the redundancy of large and complex CNN models in industrial settings. Through an iterative pruning and recovery process, they reduced model parameters by 88% while maintaining accuracy comparable to state-of-the-art (SOTA) methods. As a result, the model operates efficiently on inexpensive general-purpose embedded processors, facilitating almost real-time inference without external hardware accelerators.
In summary, deep-learning-based object detection algorithms have become the preferred method for batch wood defect detection in factories [27]. This approach does not rely on expensive equipment or require operators to possess extensive professional expertise; instead, it only necessitates inputting images of defective wood into a computer, which then processes these images through a pre-trained model to localize and classify defects automatically, outputting the processed images. This method simplifies the operation process while enhancing inspection efficiency and accuracy, thereby providing robust support for quality control and safe production in the wood processing industry.
In view of this, this paper focuses on the detection of seven common types of wood defects: Live_Knot, Marrow (in the field of wood processing, the term "pith" is more commonly used; however, since the dataset we employ labels this particular defect "Marrow", we retain the nomenclature "Marrow" for consistency in the present study), Resin, Dead_Knot, Knot_with_crack, Knot_missing, and Crack. To address these complex detection challenges, we designed a specialized algorithm for wood defect detection based on the YOLOv8 framework. To handle the significant scale variations among different defects, and even within the same defect type, in wood defect datasets, we integrated the DWR module and Deformable-LKA into the C2f module of YOLOv8, enhancing the model's ability to extract and fuse multi-scale features. Furthermore, the introduction of a dynamic detection head and the InnerMPDIoU loss function significantly improved the model's detection accuracy for small targets and its regression efficiency. Our contributions are as follows:
An improved YOLOv8-based model for wood surface defect detection is proposed.
The C2f modules in the backbone and neck of YOLOv8 are, respectively, substituted with the C2f-DWR and C2f-DLKA modules to improve the network's ability to extract multi-scale defect features and to fuse defect feature information.
The detection head of YOLOv8 is upgraded to a dynamic detection head, and an additional dynamic detection head is added at a shallower layer of the network to enhance the network's ability to detect small targets.
The loss function is replaced with the InnerMPDIoU loss function to help the model achieve faster and more efficient regression.
2. Materials and Methods
2.1. Wood Surface Defect Dataset
The dataset used in our study was a wood surface defect dataset produced by VSB-Technical University of Ostrava [28]. The original dataset consisted of 20,275 images, including 1992 images without any defects and 18,283 images with one or more surface defects. It covered the 10 most common types of wood defects. From the original dataset, 3600 images with a resolution of 2800 × 1024 were screened. Three defect types (Quartzity, Blue stain, and Overgrown) were removed due to their small sample sizes.
To improve the generalization and robustness of the model and to avoid overfitting during training, we applied a series of data augmentation operations to these 3600 images to increase the size and diversity of the dataset. To mimic situations where the camera is not aligned with the wooden board, the images were subjected to rotation, shear, and crop operations. To simulate different lighting conditions, the images were subjected to random adjustments in brightness and saturation. Additionally, to further enrich the dataset and mitigate overfitting, we added Gaussian noise to 2% of the image pixels and randomly combined the various augmentation methods. After augmentation, the total number of images increased to 9456. The dataset was partitioned into training, validation, and test sets in an 8:1:1 ratio. Examples of the augmented images are illustrated in Figure 1 below.
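For readers who wish to reproduce a comparable pipeline, the sketch below shows how such augmentations could be assembled with the Albumentations library; the specific parameter values (rotation limit, shear range, crop size, jitter strengths, and noise variance) are illustrative assumptions rather than the exact settings used in this study, and the Gaussian noise transform only approximates the "2% noisy pixels" operation described above.

```python
import albumentations as A

# A possible augmentation pipeline approximating the operations described above
# (rotation, shear, crop, brightness/saturation jitter, Gaussian noise).
# Parameter values are illustrative, not taken from the paper.
augment = A.Compose(
    [
        A.Rotate(limit=10, p=0.5),                   # camera not aligned with the board
        A.Affine(shear=(-8, 8), p=0.3),              # shear distortion
        A.RandomCrop(height=640, width=640, p=0.3),  # crop (size is an assumption)
        A.ColorJitter(brightness=0.2, contrast=0.0, saturation=0.2, hue=0.0, p=0.5),
        A.GaussNoise(var_limit=(10.0, 50.0), p=0.2), # mild additive Gaussian noise
    ],
    # Keep bounding boxes consistent with the transformed image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Example call (image: HxWx3 array, bboxes: YOLO-format boxes, labels: class ids):
# out = augment(image=image, bboxes=bboxes, class_labels=labels)
```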
A detailed overview of the distribution of defect categories in our screened and augmented wood surface defect dataset is shown in Table 1.
2.2. Review of the YOLOv8 Algorithm
The YOLOv8 algorithm structure (as shown in Figure 2) draws on the ELAN design principles of YOLOv7 and replaces the C3 structure of YOLOv5 with the C2f structure in the backbone and neck networks to enhance gradient flow. In the head, YOLOv8 employs a decoupled head structure to separate the classification and regression tasks. Additionally, YOLOv8 uses the task-aligned assigner strategy for positive sample allocation and introduces distribution focal loss as the regression loss. These improvements result in significant advancements in both loss computation and network structure.
The YOLOv8 algorithm performs well in detecting standard-sized objects [29]. However, its performance tends to decline in special scenarios where object sizes deviate from the standard range. Notably, other algorithms are designed specifically for complex object detection, such as algorithms for detecting small-sized objects and multi-scale objects [30,31], and these tend to outperform YOLOv8 in scenarios requiring the detection of complex targets. The wood surface dataset we used contains numerous small defects that exhibit significant variability in shape and size. Additionally, there is a high degree of similarity between different types of defects, and the difficulty of detection is further exacerbated by the overlap between defects and the multi-scale problem. These complex features pose a significant challenge to the algorithmic model for accurate defect detection.
2.3. C2f-DWR Module
To enhance the model's capability to extract contextual information at multiple scales, Wei et al. proposed a robust DWR segmentation network [32], comprising two main modules: a dilation-wise residual (DWR) module for the higher (deeper) stages of the network and a simple inverted residual (SIR) module for the lower (shallower) stages. The DWR segmentation network divides the original single-step feature extraction into two phases: region residualization and semantic residualization. This methodology enables the model to achieve superior detection accuracy and faster detection rates.
The DWR module adopts a residual structure, as depicted in Figure 3. Within this framework, a two-step method is utilized to effectively extract multi-scale contextual information and then integrate the resulting feature maps. To simplify this acquisition, the previous single-step method for acquiring multi-scale context information is decomposed into a two-step approach.
Step I involves the generation of associated residual features from the input features. In this phase, a series of compact feature maps from regions of varying sizes are created to serve as input for morphological filtering in the subsequent step. As illustrated in Figure 3, this is achieved by employing a 3 × 3 convolution for initial feature extraction, followed by a batch normalization (BN) layer and a ReLU layer.
Step II involves morphological filtering of the features from regions of different sizes, a process referred to as semantic residualization, using multi-rate dilated depth-wise convolutions. Each channel feature undergoes filtering with only a single desired receptive field in step II to avoid redundant receptive fields. In practice, the concise regional feature maps learned in step I are efficiently matched with the receptive field sizes in step II for rapid processing. To carry out this step, the regional feature maps are first grouped, and then each group is convolved with a different dilation rate.
With the two-step region residualization–semantic residualization approach, the multi-rate depth-wise dilated convolution transitions from the demanding task of extracting extensive semantic information to the simpler task of morphological filtering with the desired receptive field on each succinctly expressed feature map, thus streamlining the learning process and allowing multi-scale contextual information to be preserved more efficiently. Following the mapping of the multi-scale context, the multiple outputs are aggregated: all feature maps are concatenated, passed through a batch normalization (BN) layer, and fused by a point-wise convolution into the definitive residual components. Ultimately, these consolidated residuals are superimposed onto the initial feature maps, yielding a stronger and more holistic feature representation. The DWR module thus enables the network to adapt more flexibly to features at different scales, forming a multi-scale feature extraction mechanism capable of accurately recognizing and segmenting objects in images.
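As a concrete illustration of this two-step design, the following PyTorch sketch implements a simplified DWR block; the split into three equal channel groups and the dilation rates (1, 3, 5) are assumptions chosen for clarity and do not reproduce every detail of the original DWRSeg implementation.

```python
import torch
import torch.nn as nn

class DWR(nn.Module):
    """Sketch of a dilation-wise residual (DWR) block: region residualization
    (3x3 conv + BN + ReLU), semantic residualization (grouped depth-wise
    convolutions with different dilation rates), aggregation by BN and a
    point-wise convolution, and a residual connection back to the input."""

    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        # Step I: region residualization - compact regional feature maps.
        self.region = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Step II: semantic residualization - one receptive field per group.
        assert channels % len(dilations) == 0, "channels must split evenly"
        group_c = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv2d(group_c, group_c, 3, padding=d, dilation=d,
                      groups=group_c, bias=False)
            for d in dilations
        )
        # Aggregation: BN + point-wise conv form the final residual.
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1, bias=False),
        )

    def forward(self, x):
        region = self.region(x)
        groups = torch.chunk(region, len(self.branches), dim=1)
        filtered = [branch(g) for branch, g in zip(self.branches, groups)]
        residual = self.fuse(torch.cat(filtered, dim=1))
        return x + residual  # superimpose the residual onto the input features
```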
The C2f module is a crucial component in YOLOv8, representing one of the major advancements in the network structure. Compared to the C3 module in YOLOv5, the C2f module boasts fewer parameters while offering enhanced feature extraction capabilities.
To better adapt the C2f module to learning cross-scale feature transformations and to bolster its multi-scale feature extraction ability, we substitute the standard convolution in the bottleneck structure of the C2f architecture with the DWR module. The improved C2f-DWR configuration is illustrated in Figure 4.
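A minimal sketch of the corresponding bottleneck replacement is given below, reusing the DWR class from the previous sketch; the exact position of the DWR block inside the Ultralytics C2f bottleneck (and whether the first 3 × 3 convolution is retained) is an implementation assumption.

```python
import torch.nn as nn

class BottleneckDWR(nn.Module):
    """C2f bottleneck variant in which the second standard 3x3 convolution
    is replaced by the DWR block defined above (a design assumption)."""

    def __init__(self, c_in: int, c_out: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(inplace=True),
        )
        self.dwr = DWR(c_out)                  # DWR class from the sketch above
        self.add = shortcut and c_in == c_out  # residual only if shapes match

    def forward(self, x):
        y = self.dwr(self.cv1(x))
        return x + y if self.add else y
```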
2.4. C2f-DLKA Module
The study in [33] proposed deformable large kernel attention (D-LKA attention), illustrated in Figure 5. This mechanism is a simplified attention mechanism built around a large convolutional kernel, drawing primarily from the large kernel attention (LKA) architecture [34] and deformable convolution [35] and seamlessly integrating them. LKA decomposes a large kernel convolution into a depth-wise convolution, a depth-wise dilated convolution, and a 1 × 1 convolution, effectively circumventing the significant computational overhead and parameter burden associated with directly employing large convolution kernels while still providing channel adaptability and long-range dependencies, thereby effectively leveraging local context information.
The Deform-DW-D Conv2D and Deform-DW Conv2D modules within the D-LKA attention architecture apply the concept of deformable convolution to the depth-wise convolution and the depth-wise dilated convolution. Deformable convolution (DCN) alters the receptive field from a fixed square to a shape closer to the actual object by introducing learnable offsets into the sampling grid. Consequently, the convolutional region more consistently covers the object's shape, allowing the sampling positions to adjust dynamically to the location and shape of the detected target. This enables better capture of target features and adaptation to variations in target size.
D-LKA attention merges the broad receptive field of a large convolution kernel with the versatility of deformable convolution. This attention mechanism enhances the model’s adaptability to irregularly shaped and multi-scale targets by dynamically adjusting the shape and size of the convolution kernel to align with various image features.
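The sketch below outlines how such a D-LKA block could look in PyTorch, using torchvision's DeformConv2d for the deformable depth-wise convolutions. The 5 × 5 and 7 × 7 (dilation 3) kernel sizes follow the usual LKA decomposition, while the offset-prediction convolutions and their sizes are simplifying assumptions rather than a faithful reproduction of the original implementation.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLKA(nn.Module):
    """Sketch of deformable large kernel attention (D-LKA): a deformable
    depth-wise conv, a deformable depth-wise dilated conv, and a 1x1 conv
    produce an attention map that gates the input features."""

    def __init__(self, dim: int):
        super().__init__()
        # Offset predictors (2 offsets per kernel sampling location).
        self.offset1 = nn.Conv2d(dim, 2 * 5 * 5, 5, padding=2)
        self.offset2 = nn.Conv2d(dim, 2 * 7 * 7, 7, padding=9, dilation=3)
        # Deformable depth-wise convolution (groups=dim).
        self.dw = DeformConv2d(dim, dim, 5, padding=2, groups=dim)
        # Deformable depth-wise dilated convolution.
        self.dw_d = DeformConv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # Point-wise convolution mixing channel information.
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.dw(x, self.offset1(x))        # deformable depth-wise conv
        attn = self.dw_d(attn, self.offset2(attn))  # deformable dilated conv
        attn = self.pw(attn)                       # 1x1 channel mixing
        return x * attn                            # attention map gates the input
```

Embedding this block into the C2f bottleneck in place of the regular convolution follows the same pattern as the C2f-DWR sketch shown earlier.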
Given the challenges posed by multi-scale defects, similarities among different defect types, and defect overlap within the wood defect dataset, D-LKA attention offers several advantages. First, it leverages the self-attention capability of large kernel attention to enhance the model's comprehension of the relationships between targets, thereby addressing issues related to overlap and similarity among wood defects. Second, deformable convolution enables D-LKA attention to flexibly distort the sampling grid, facilitating the model's adaptation to multi-scale wood defects. Hence, to enhance YOLOv8's ability to fully utilize image feature information, we have chosen to further improve and optimize the C2f structure in the network. Specifically, we embed D-LKA attention into the C2f module, replacing the original position of regular convolution within the Darknet bottleneck. The resulting C2f-DLKA module structure is depicted in Figure 6.
2.5. Dynamic-Head-Based Detection Head Module
2.5.1. Dynamic-Head Detection Head
The detection head, a pivotal component in YOLOv8, plays a crucial role in extracting both the location and category information of the target from the convolutional feature map. The YOLOv8 detection head module adopts the prevailing decoupled head structure and transitions from anchor-based to anchor-free methodology. This transition enhances the model’s robustness and adaptability, allowing it to better accommodate various complex scenarios and target characteristics, thus providing a more dependable solution for practical applications. Nevertheless, it still lacks a sufficiently unified perspective to address the detection problem comprehensively.
The study in [36] introduced a dynamic head architecture that incorporates attention mechanisms into the object detection head. This design applies distinct self-attention mechanisms across feature levels to bolster scale sensitivity, across spatial locations to bolster spatial awareness, and within output channels to bolster task comprehension. This approach effectively augments the representation capabilities of the object detection head while maintaining computational efficiency.
The dynamic-head detection framework incorporates individualized attention mechanisms tailored to each feature dimension: level-wise, spatial, and channel-wise. The scale-aware attention component operates on the level dimension, distinguishing the significance of different semantic levels to augment features according to each object's scale. The spatial-aware attention module functions in the spatial plane (height × width), refining discriminative representations at distinct spatial positions. Concurrently, the task-aware attention module is applied channel-wise, allotting distinct functional channels to facilitate various tasks such as classification, bounding box regression, and center/keypoint estimation. This strategy ensures diverse convolutional kernel responses tailored to individual objects. Despite being applied to distinct feature dimensions, these attention mechanisms synergistically complement each other, integrating the detection framework with attention mechanisms in a unified manner.
The dynamic-head detection can be represented as follows:
W(F) = π_C(π_S(π_L(F) · F) · F) · F,
where π_L(·), π_S(·), and π_C(·) are three different attention functions applied on the dimensions L, S, and C, respectively, of the feature tensor F.
Figure 7 illustrates the overall structure of the dynamic-head detection head.
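To make the composition of these three attentions concrete, the following highly simplified PyTorch sketch applies a scale gate, a spatial gate, and a channel (task) gate in sequence to each pyramid level. The real dynamic head aggregates features across levels, uses deformable convolution for the spatial attention, and employs a learned dynamic-ReLU-style function for the task attention; all of these are omitted here, so this is only an illustrative approximation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedDynamicHeadBlock(nn.Module):
    """Toy sketch of pi_C(pi_S(pi_L(F) * F) * F) * F applied per pyramid level."""

    def __init__(self, channels: int):
        super().__init__()
        # Scale-aware gate: one scalar weight derived from global context.
        self.scale_fc = nn.Conv2d(channels, 1, kernel_size=1)
        # Spatial gate (stand-in for the deformable-conv spatial attention).
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # Task-aware gate: channel-wise re-weighting.
        self.task_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats):  # feats: list of L tensors, each (B, C, H, W)
        outs = []
        for x in feats:
            # pi_L: scale-aware attention (global context -> scalar gate)
            scale_w = torch.sigmoid(self.scale_fc(F.adaptive_avg_pool2d(x, 1)))
            x = x * scale_w
            # pi_S: spatial-aware attention (per-location gate)
            x = x * torch.sigmoid(self.spatial_conv(x))
            # pi_C: task-aware attention (per-channel gate)
            x = x * self.task_fc(x)
            outs.append(x)
        return outs
```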
2.5.2. A Small Target Layer in the Detection Head
Given the prevalence of small targets within the wood defect dataset, there are inherent challenges associated with detecting such objects. These challenges stem from factors such as fewer discernible features, subtle semantic characteristics, and susceptibility to feature dilution under repeated convolution and downsampling. Within the YOLO algorithm series, the relatively high downsampling factors often hinder the accurate extraction of feature information for small objects from deeper feature maps. Moreover, the original YOLO architecture incorporates three distinct detection heads (P3, P4, P5), tailored for detecting objects of varying scales: small, medium, and large. Nevertheless, this configuration tends to compromise detection accuracy for small targets, owing to limitations in capturing their fine-grained features.
The COCO dataset defines a small target as one with an area smaller than 32 × 32 pixels, while for the P3 detection head, the feature layer typically has a size of (80, 80), downsampled by a factor of 8 relative to the original input image. This downsampling leaves small targets with feature regions no larger than (4, 4), resulting in insufficient features and poor detection ability for the P3 detection head. To address this limitation, we opt to add a detection head in the P2 layer. First, the P2 layer, being in a shallower part of the network, can capture more fine-grained features, which are essential for discerning the shape and texture of small targets. Second, the P2 layer has higher resolution, providing more spatial information, which is advantageous for detecting small targets by leveraging both shallow and deep feature information [37]. Integrating the small target detection layer enables YOLOv8 to prioritize the feature information of small targets, thereby enhancing its sensitivity in detecting such objects. This ultimately contributes to improved detection accuracy, as the model becomes more proficient in precisely recognizing and pinpointing small targets within the dataset.
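As a quick check of this scale arithmetic: a nominal 32 × 32 small target covers roughly 32/8 × 32/8 = 4 × 4 cells on the stride-8 P3 feature map, but 32/4 × 32/4 = 8 × 8 cells on the stride-4 P2 feature map, i.e., four times as many feature cells for the detection head to work with.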
To further enhance the detection of small targets, we opted to replace all detection heads in YOLOv8 with dynamic heads. With the addition of a detection head in the P2 layer, YOLOv8 now incorporates four dynamic heads. By leveraging self-attention, the dynamic head unifies scale-aware, spatial-aware, and task-aware attention, allowing better integration of contextual information and dynamic adjustment of weights across different feature layers, thereby facilitating the extraction and recognition of multi-scale features. Consequently, the model is better equipped to detect and classify targets within wood defect datasets, particularly those comprising multi-scale and small-target defects with complex morphology.
2.6. Improvement of YOLOv8 Algorithm
First, to address the issue of large scale variation between different defects in the wood defect dataset, and even among defects of the same type, we embedded the DWR (dilation-wise residual) module into the C2f module in the YOLOv8 backbone. This enhancement improves the network's ability to extract multi-scale features. Second, to enable the network to better utilize the features extracted by the backbone and adapt to multi-scale wood defects, we introduced deformable large kernel attention (Deformable-LKA) into the C2f module at the neck of the network. Deformable-LKA combines the advantages of self-attention with the broad receptive field of large kernel attention and incorporates flexible deformable convolution. This allows the sampling grid to be flexibly adjusted, enabling the network to better fuse feature information and helping the detection head capture the complex features and scale variations of wood surface defects with greater accuracy.
Next, we replaced the original detection head of YOLOv8 with a dynamic detection head. This dynamic detection head unifies the object detection head and attention across three dimensions: scale, space, and task. This helps the model reduce the false detection rate and the miss rate of small-target samples. To augment the model’s proficiency in detecting diminutive objects, we incorporated a dynamic detection module into the shallower P2 layer of YOLOv8. This head combines shallow feature information with deeper features for more comprehensive detection.
Finally, we replaced the CIoU loss function used in YOLOv8 with the InnerMPDIoU loss function. This replacement aims to help the model achieve faster and more effective regression. In summary, the method proposed in our study is shown in Figure 8.
2.7. MPDIoU Loss Function Based on the Inner-IoU Idea
The bounding box regression loss function plays a critical role in target detection as it enables the model to learn to predict the precise location of the bounding box, closely aligning it with the actual bounding box of the detected target. This information is essential for accurately determining the location and area of the detected target. Although YOLOv8 utilizes CIoU [38] as its bounding box regression loss function, there are two main limitations associated with CIoU. First, in scenarios where the predicted bounding box and the ground truth bounding box exhibit a similar aspect ratio but varying width and height dimensions, the efficacy of the CIoU loss function tends to decrease. This constraint can hinder the swift convergence and precision of the model's performance [39]. Second, the CIoU framework lacks flexibility in accommodating varying detectors and detection scenarios, thereby affecting its resilience in generalizing to new conditions [40].
Drawing inspiration from the geometric characteristics of horizontal rectangles, the study in [39] introduces a novel similarity metric called MPDIoU, which is rooted in the minimum point distance between bounding boxes. This metric comprehensively accounts for crucial factors, including overlapping and non-overlapping regions, the distance between centroids, and deviations in width and height, while offering a streamlined computational approach. Based on this metric, the MPDIoU-based bounding box regression loss function, L_MPDIoU, is proposed. This new loss function aims to address the limitations of CIoU by providing a more efficient and comprehensive approach to bounding box regression, ultimately improving the convergence speed, accuracy, and generalization ability of the model. The Inner-IoU loss introduced in [40] advances the conventional IoU loss by incorporating an auxiliary bounding box into its computation and introducing a scaling ratio to modulate the dimensions of this auxiliary box, thereby enhancing its effectiveness. This approach effectively speeds up the bounding box regression process. Inner-IoU offers a more precise evaluation of overlapping regions, providing a more accurate assessment. Experiments demonstrate that Inner-IoU exhibits better generalization than traditional IoU on various datasets. Applying Inner-IoU to existing IoU-based loss functions results in faster and more efficient regression.
In view of the above analysis, to enable the model to perform faster and more effective regression, we applied Inner-IoU to the MPDIoU loss function, replacing the original CIoU loss function in YOLOv8. This enhancement aims to improve detection accuracy, enhance model precision, and help the model adapt more successfully to defects of various shapes and sizes. It reduces the miss rate for special defective targets in wood images and provides the model with better generalization performance. The calculation of InnerMPDIoU is summarized in Algorithm 1.
Algorithm 1. InnerMPDIoU as bounding box loss
Parameters: B^prd: predicted box; B^gt: ground truth (GT) box; (x1^prd, y1^prd, x2^prd, y2^prd): predicted box coordinates; (x1^gt, y1^gt, x2^gt, y2^gt): GT box coordinates; (xc^gt, yc^gt): center point of the GT box and the inner GT box; (xc, yc): center point of the anchor and the inner anchor; w^gt: width of the GT box; h^gt: height of the GT box; w: width of the anchor; h: height of the anchor; B^prd_inner: auxiliary predicted box; B^gt_inner: auxiliary GT box; (b_l, b_r, b_t, b_b): auxiliary predicted box coordinates; (b_l^gt, b_r^gt, b_t^gt, b_b^gt): auxiliary GT box coordinates; ratio: scaling factor, typically within the range of values [0.5, 1.5]; W: width of input image; H: height of input image; L_MPDIoU: loss function of MPDIoU. Input: B^prd, B^gt, ratio, W, H. Output: L_InnerMPDIoU
1. For the predicted box B^prd, ensuring x2^prd > x1^prd and y2^prd > y1^prd
2. d1^2 = (x1^prd − x1^gt)^2 + (y1^prd − y1^gt)^2, d2^2 = (x2^prd − x2^gt)^2 + (y2^prd − y2^gt)^2
3. IoU = |B^prd ∩ B^gt| / |B^prd ∪ B^gt|
4. MPDIoU = IoU − d1^2/(W^2 + H^2) − d2^2/(W^2 + H^2)
5. L_MPDIoU = 1 − MPDIoU
6. b_l^gt = xc^gt − (w^gt · ratio)/2, b_r^gt = xc^gt + (w^gt · ratio)/2, b_t^gt = yc^gt − (h^gt · ratio)/2, b_b^gt = yc^gt + (h^gt · ratio)/2
7. b_l = xc − (w · ratio)/2, b_r = xc + (w · ratio)/2, b_t = yc − (h · ratio)/2, b_b = yc + (h · ratio)/2
8. inter = max(0, min(b_r^gt, b_r) − max(b_l^gt, b_l)) · max(0, min(b_b^gt, b_b) − max(b_t^gt, b_t))
9. union = w^gt · h^gt · ratio^2 + w · h · ratio^2 − inter
10. IoU^inner = inter/union
11. L_InnerMPDIoU = L_MPDIoU + IoU − IoU^inner; return L_InnerMPDIoU
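For reference, a possible PyTorch implementation of this loss is sketched below; the function name and tensor layout are assumptions, and the combination rule L_InnerMPDIoU = L_MPDIoU + IoU − IoU_inner follows the Inner-IoU formulation in [40] rather than code released with this paper.

```python
import torch

def inner_mpdiou_loss(pred, target, img_w, img_h, ratio=1.0, eps=1e-7):
    """Sketch of an Inner-MPDIoU bounding box loss.

    pred, target: tensors of shape (N, 4) holding (x1, y1, x2, y2) corners.
    img_w, img_h: input image width and height used by the MPDIoU penalty.
    ratio: Inner-IoU scaling factor for the auxiliary boxes (roughly 0.5-1.5).
    """
    # --- standard IoU of the original boxes ---
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # --- MPDIoU: penalize squared distances between matching corners ---
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    mpdiou = iou - d1 / (img_w ** 2 + img_h ** 2) - d2 / (img_w ** 2 + img_h ** 2)
    loss_mpdiou = 1.0 - mpdiou

    # --- Inner-IoU: IoU of auxiliary boxes scaled by `ratio` around the centers ---
    def inner_box(box):
        cx, cy = (box[:, 0] + box[:, 2]) / 2, (box[:, 1] + box[:, 3]) / 2
        w, h = (box[:, 2] - box[:, 0]) * ratio, (box[:, 3] - box[:, 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, w, h

    pl, pt, pr, pb, pw, ph = inner_box(pred)
    tl, tt, tr, tb, tw, th = inner_box(target)
    inner_inter = (torch.min(pr, tr) - torch.max(pl, tl)).clamp(min=0) * \
                  (torch.min(pb, tb) - torch.max(pt, tt)).clamp(min=0)
    inner_union = pw * ph + tw * th - inner_inter
    inner_iou = inner_inter / (inner_union + eps)

    # Inner-X combination rule: L_Inner-MPDIoU = L_MPDIoU + IoU - IoU_inner
    return loss_mpdiou + iou - inner_iou
```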