3.2.1. FAM (Feature Adaptive Mixer)
A Convolutional Neural Network (CNN) acts as a high-pass filter that extracts locally salient high-frequency information such as texture and fine detail [36]. The self-attention mechanism, in contrast, behaves as a low-pass filter that extracts salient low-frequency information such as global structure and smooth variations [37]. Although traditional purely convolutional methods can extract rich high-frequency features, they cannot capture the spatial contextual information of the image. Conversely, methods based purely on self-attention tend to extract only the low-frequency information of the image and also suffer from high computational complexity and poor generalization. How to exploit the complementary strengths of these two computational paradigms has therefore become a bottleneck for further improving feature extraction capability. The ideas of information distillation and frequency mixing in image super-resolution reconstruction offer useful insights: by mixing low-frequency and high-frequency features, the information flow and expressive power of the model can be effectively enhanced [38,39].
To enhance the accuracy of boundary identification, we propose a module called FAM. It captures more accurate boundary features by enhancing the information flow and expressiveness of the model. It not only addresses the single-scale feature problem but also adopts a multi-branch structure to filter important features out of rich semantic information. Specifically, FAM consists of three main parts: a high-frequency branch, a low-frequency branch, and adaptive fusion, as shown in Figure 4. It separates the high-frequency and low-frequency components of an image so that local and global information can be captured through the respective strengths of convolution and self-attention, and it then fuses them adaptively according to the contribution of each channel. Unlike traditional hybrid methods, we innovatively combine the static high-frequency affinity matrix extracted by convolution with the dynamic low-frequency affinity matrix obtained from self-attention, which strengthens the ability of self-attention to capture both high- and low-frequency information and improves feature generalization. In addition, tailored to the characteristics of these two computational paradigms, we perform adaptive feature selection for multi-frequency mixing in the spatial domain, which dynamically adjusts the fusion according to the feature contribution.
The high-frequency branch is a simple and efficient module whose main function is to obtain local high-frequency features. Since high-frequency information can be captured with small convolutional kernels, we extract local high-frequency features by cascading 1 × 1 and 3 × 3 standard convolutions [40]. To enhance the learning and generalization ability of self-attention, the resulting high-frequency affinity matrix is injected into the low-frequency affinity matrix, compensating for the feature information that self-attention loses due to its linear modeling. Let $X$ denote the input feature map, with $X \in \mathbb{R}^{C \times H \times W}$ by default. After passing the 2D feature map through an identity mapping, its size remains unchanged following standard convolutions of kernel sizes 1 and 3. The high-frequency feature $F_H$ and the high-frequency affinity matrix $A_H$ are generated as follows:
$$F_H = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{1\times 1}(X)\big),$$
$$A_H = \mathrm{Win}(F_H) \otimes \mathrm{Win}(F_H)^{T},$$
where ⊗ represents matrix multiplication, $\mathrm{Win}(\cdot)$ represents the operation of partitioning according to a predefined window size $N$, $T$ denotes matrix transpose, $\mathrm{Conv}_{1\times 1}(\cdot)$ represents a 1 × 1 convolutional operator, $\mathrm{Conv}_{3\times 3}(\cdot)$ represents a 3 × 3 convolutional operator, $F_H$ has the same size as $X$, and $A_H$ is the window-level high-frequency affinity matrix.
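For illustration, a minimal PyTorch-style sketch of this high-frequency branch is given below; the class name HighFreqBranch, the default window size, and the exact tensor layout are our own assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class HighFreqBranch(nn.Module):
    """Sketch of the FAM high-frequency branch (assumed layout: B x C x H x W)."""
    def __init__(self, channels: int, window: int = 8):
        super().__init__()
        self.window = window
        # Cascaded 1x1 and 3x3 standard convolutions extract local high-frequency detail.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor):
        f_h = self.conv3(self.conv1(x))                      # F_H, same size as X
        b, c, h, w = f_h.shape
        n = self.window                                      # assumes H and W divisible by N
        # Partition F_H into non-overlapping N x N windows and flatten each window.
        win = f_h.view(b, c, h // n, n, w // n, n)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, n * n, c)   # (B*windows, N^2, C)
        # Static high-frequency affinity matrix A_H = Win(F_H) @ Win(F_H)^T.
        a_h = win @ win.transpose(1, 2)                      # (B*windows, N^2, N^2)
        return f_h, a_h
```

The affinity is computed per window here so that it can later be added to the window-based self-attention affinity of the low-frequency branch.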
The low-frequency branch plays a pivotal role in capturing global contextual relationships, primarily through a multi-head self-attention mechanism [41]. The method first expands the input feature map $X$ by a factor of three along the channel dimension using a standard 1 × 1 convolution. The 2D feature map is then partitioned into windows of size $N \times N$ and flattened into a 1D sequence whose dimensions account for the number of heads and channels, where $N$ denotes the window size and $h$ represents the number of heads. This sequence is decomposed into Query ($Q$), Key ($K$), and Value ($V$) feature vectors. During the self-attention computation, a learnable positional encoding ($PE$) is introduced to encode the positional information of the image sequence. The resulting low-frequency affinity matrix $A_L$, derived from multi-head self-attention, is combined with the high-frequency affinity matrix $A_H$ to produce a blended affinity matrix $A$. After applying softmax normalization to $A$, a matrix multiplication with $V$ produces the low-frequency feature map $F_L$. The formulas are described as follows:
$$Q, K, V = \mathrm{Win}\big(\mathrm{Conv}_{1\times 1}(X)\big),$$
$$A_L = \frac{Q \otimes K^{T}}{\sqrt{d}} \oplus PE,$$
$$A = A_L \oplus A_H,$$
$$F_L = \mathrm{Softmax}(A) \otimes V,$$
where ⊕ denotes element-wise addition, $\mathrm{Softmax}(\cdot)$ represents the normalization activation function, $Q, K, V \in \mathbb{R}^{h \times N^2 \times \frac{C}{h}}$, where $N$ is the window size, $PE$ is the learnable positional encoding of window size, $d$ is a constant, and $F_L \in \mathbb{R}^{C \times H \times W}$.
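A corresponding sketch of the low-frequency branch, again under our own naming and layout assumptions (window-partitioned multi-head attention with a learnable per-window positional bias), might look as follows:

```python
import torch
import torch.nn as nn

class LowFreqBranch(nn.Module):
    """Sketch of the FAM low-frequency branch: window-based multi-head self-attention
    whose affinity matrix is blended with the high-frequency affinity A_H."""
    def __init__(self, channels: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window, self.heads = window, heads
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)   # expand channels x3
        # Learnable positional encoding PE, shared across windows.
        self.pe = nn.Parameter(torch.zeros(1, heads, window * window, window * window))

    def forward(self, x: torch.Tensor, a_h: torch.Tensor):
        b, c, h, w = x.shape
        n, hd = self.window, self.heads
        qkv = self.qkv(x)                                              # (B, 3C, H, W)
        qkv = qkv.view(b, 3, hd, c // hd, h // n, n, w // n, n)
        qkv = qkv.permute(1, 0, 4, 6, 2, 5, 7, 3)                      # (3, B, gh, gw, hd, n, n, C/hd)
        qkv = qkv.reshape(3, -1, hd, n * n, c // hd)
        q, k, v = qkv[0], qkv[1], qkv[2]                               # (B*windows, heads, N^2, C/heads)
        d = q.shape[-1]
        a_l = q @ k.transpose(-2, -1) / d ** 0.5 + self.pe             # low-frequency affinity A_L
        a = a_l + a_h.unsqueeze(1)                                     # blend with high-frequency A_H
        out = torch.softmax(a, dim=-1) @ v                             # F_L in window layout
        out = out.reshape(b, h // n, w // n, hd, n, n, c // hd)
        out = out.permute(0, 3, 6, 1, 4, 2, 5).reshape(b, c, h, w)     # back to (B, C, H, W)
        return out
```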
High-low frequency adaptive fusion is a fusion mechanism built on spatial feature mapping. Inspired by the feature rescaling of SK-Net [42], different pooling operations are designed to learn the contribution weights that the high-frequency and low-frequency features occupy in the mixed channels, so that the network can select a more appropriate multi-scale feature representation. Specifically, the obtained high-frequency feature $F_H$ and low-frequency feature $F_L$ are first fused directly to obtain the mixed feature $F_M$. Maximum pooling and average pooling are then applied to this mixed feature to obtain the high-frequency attention feature map $F_{max}$ and the low-frequency attention feature map $F_{avg}$, respectively. The two spectral features are concatenated along the channel dimension, and a standard convolution smoothing filter of kernel size $k \times k$ is applied to obtain $F_W$. After activation along the fusion dimension, the high-frequency attention weight map $W_H$ and the low-frequency attention weight map $W_L$ are obtained; they are applied to $F_H$ and $F_L$, respectively, by element-wise multiplication. Finally, the weighted feature maps are added together to obtain the output of the adaptive fusion, $F_{out}$. The relevant formulas are as follows:
$$F_M = F_H \oplus F_L,$$
$$F_{max} = \mathrm{GMP}(F_M), \quad F_{avg} = \mathrm{GAP}(F_M),$$
$$F_W = \mathrm{Conv}_{k\times k}\big([F_{max}; F_{avg}]\big), \quad [W_H, W_L] = \sigma(F_W),$$
$$F_{out} = (W_H \odot F_H) \oplus (W_L \odot F_L),$$
where ⊙ represents matrix element-wise multiplication, $\mathrm{GMP}(\cdot)$ denotes global maximum pooling, $\mathrm{GAP}(\cdot)$ denotes global average pooling, $[\cdot;\cdot]$ denotes channel-level splicing, $\sigma(\cdot)$ denotes the activation function, and $\mathrm{Conv}_{k\times k}(\cdot)$ denotes convolution with a kernel size of $k \times k$.
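The adaptive fusion step can be sketched as below; the 7 × 7 smoothing kernel, pooling over the channel dimension to form the two attention maps, and the softmax over the fusion dimension are assumptions on our part.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch of FAM high/low-frequency adaptive fusion.
    Kernel size 7 and softmax over the two-map dimension are assumptions."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Maps the concatenated max/avg-pooled descriptors to two spatial weight maps.
        self.smooth = nn.Conv2d(2, 2, kernel_size, padding=kernel_size // 2)

    def forward(self, f_h: torch.Tensor, f_l: torch.Tensor):
        f_mix = f_h + f_l                                        # F_M
        f_max, _ = f_mix.max(dim=1, keepdim=True)                # max pooling over channels
        f_avg = f_mix.mean(dim=1, keepdim=True)                  # average pooling over channels
        w = self.smooth(torch.cat([f_max, f_avg], dim=1))        # F_W, shape (B, 2, H, W)
        w = torch.softmax(w, dim=1)                              # activation along fusion dim
        w_h, w_l = w[:, :1], w[:, 1:]                            # W_H, W_L
        return w_h * f_h + w_l * f_l                             # F_out
```

Under these sketches, the three FAM parts would be composed as `f_h, a_h = high(x)`, `f_l = low(x, a_h)`, and `out = fuse(f_h, f_l)`, where `high`, `low`, and `fuse` are instances of the classes above.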
3.2.2. RAF (Relational Adaptive Fusion)
To obtain richer boundary features, fusing feature maps of different scales is regarded as an effective way to improve segmentation results [43]. The commonly used fusion methods are spatial numerical summation and channel-dimension concatenation. However, shallow and deep features do not contribute equally to feature fusion. In general, shallow features have larger values while deeper features have smaller values, leading to differences in their spatial contributions. In addition, since shallow and deep features carry different semantic information, there is also some semantic confusion in the channel dimension. How to improve the effect of feature fusion has therefore become a new direction for optimizing network performance. Inspired by the perceptual fusion of shallow and deep branches in ISDNet [44], we propose a relation-aware dynamic fusion strategy (RAF). This module obtains more complete boundary information by improving the feature granularity; its detailed structure is shown in Figure 5.
Unlike other static multi-scale fusion methods, RAF adaptively adjusts the fusion of shallow and deep features according to the task requirements and data characteristics by explicitly modeling the spatial and channel dependencies between them. While preserving the deep semantic transformation, it makes full use of shallow features to achieve higher-quality feature reconstruction. Specifically, the method first models the spatial numerical differences between shallow and deep features through global average pooling to learn spatial weighting factors. Matrix multiplication is then performed on the spatially modeled feature mappings to obtain a channel relationship matrix, which is flattened and compressed to yield channel weighting factors. Finally, the obtained spatial and channel weighting factors are used to weight the shallow and deep features separately, and the results are fused.
Given the shallow feature map $F_s \in \mathbb{R}^{C \times H \times W}$ and the deep feature map $F_d$, which has half the spatial resolution of $F_s$, RAF first aligns the height and width of the deep feature map with those of the shallow feature map. By explicitly extracting feature information, two one-dimensional attention vectors $v_s$ and $v_d$ containing their respective channel information are obtained. This can be represented by the following formulas:
$$v_s = \mathrm{GAP}(F_s), \quad v_d = \mathrm{GAP}\big(\mathrm{Up}_{2}(F_d)\big),$$
where $\mathrm{GAP}(\cdot)$ denotes Global Average Pooling and $\mathrm{Up}_{2}(\cdot)$ denotes spatial upsampling by a factor of two. In the second step, spatial and channel dependencies are modeled sequentially. The two one-dimensional attention vectors $v_s$ and $v_d$ undergo global average pooling to derive the spatial relationship weight factors $\alpha_s$ and $\alpha_d$, expressed as follows:
$$\alpha_s = \mathrm{GAP}(v_s), \quad \alpha_d = \mathrm{GAP}(v_d).$$
When modeling channel dependencies, considering the semantic differences between channels, the two one-dimensional attention vectors $v_s$ and $v_d$ are compressed to a length of $r$ using perceptrons to reduce semantic errors, resulting in two contraction vectors $z_s$ and $z_d$, where $r$ is typically much smaller than $C$. Based on these two contraction vectors, a channel correlation matrix $R$ is obtained through matrix multiplication. This correlation matrix is then flattened and mapped through multiple perceptron layers to generate a channel weight factor consisting of only two numerical values, $\beta_s$ and $\beta_d$, as shown in the following formula:
$$R = z_s \otimes z_d^{T},$$
$$[\beta_s, \beta_d] = \mathrm{MLP}\big(\mathrm{Flatten}(R)\big),$$
where ⊗ denotes matrix multiplication, $R$ represents the channel relationship matrix, $\mathrm{Flatten}(\cdot)$ denotes the flattening operation, and $\mathrm{MLP}(\cdot)$ denotes a Multi-Layer Perceptron that maps a one-dimensional vector to two channel weighting factors. In the third step, the obtained weight values are separately weighted and fused. The spatial weight factors $\alpha_s$ and $\alpha_d$ of the shallow feature map $F_s$ and deep feature map $F_d$, together with the channel weight factors $\beta_s$ and $\beta_d$, are summed individually. After applying a softmax operation, they yield the weighted values $w_s$ and $w_d$. These are then multiplied element-wise with $F_s$ and the upsampled $F_d$, respectively, and added together to form the final fused feature map $F_{fuse}$. The formula is as follows:
$$[w_s, w_d] = \mathrm{Softmax}\big([\alpha_s + \beta_s,\ \alpha_d + \beta_d]\big),$$
$$F_{fuse} = (w_s \odot F_s) \oplus \big(w_d \odot \mathrm{Up}_{2}(F_d)\big).$$
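For concreteness, a minimal sketch of RAF under our own assumptions (both inputs with C channels, reduction length r = 16, bilinear 2× upsampling of the deep feature) could be written as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAF(nn.Module):
    """Sketch of Relational Adaptive Fusion for a shallow map F_s (B,C,H,W)
    and a deep map F_d (B,C,H/2,W/2); names and shapes are assumptions."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze_s = nn.Linear(channels, r)      # compress v_s to length r
        self.squeeze_d = nn.Linear(channels, r)      # compress v_d to length r
        self.mlp = nn.Sequential(nn.Linear(r * r, r), nn.ReLU(inplace=True), nn.Linear(r, 2))

    def forward(self, f_s: torch.Tensor, f_d: torch.Tensor):
        f_d = F.interpolate(f_d, size=f_s.shape[-2:], mode="bilinear", align_corners=False)
        # Step 1: channel descriptors via global average pooling.
        v_s = f_s.mean(dim=(2, 3))                   # (B, C)
        v_d = f_d.mean(dim=(2, 3))                   # (B, C)
        # Step 2a: spatial relationship weight factors (one scalar per sample).
        a_s = v_s.mean(dim=1, keepdim=True)          # (B, 1)
        a_d = v_d.mean(dim=1, keepdim=True)          # (B, 1)
        # Step 2b: channel relationship matrix from compressed vectors, then MLP -> 2 factors.
        z_s, z_d = self.squeeze_s(v_s), self.squeeze_d(v_d)            # (B, r)
        rel = z_s.unsqueeze(2) @ z_d.unsqueeze(1)                      # R, shape (B, r, r)
        b_sd = self.mlp(rel.flatten(1))                                # (B, 2)
        # Step 3: combine spatial and channel factors, normalize, and fuse.
        w = torch.softmax(torch.cat([a_s, a_d], dim=1) + b_sd, dim=1)  # (B, 2)
        w_s, w_d = w[:, 0].view(-1, 1, 1, 1), w[:, 1].view(-1, 1, 1, 1)
        return w_s * f_s + w_d * f_d                                   # F_fuse
```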
3.2.3. DWLK-MLP (Depthwise Large Kernel Multi-Layer Perceptron)
Enlarging the convolutional receptive field is an effective means of improving semantic segmentation [45]. Recent studies have shown that introducing depthwise (DW) convolution into the MLP (Multi-Layer Perceptron) can effectively combine the properties of self-attention and convolution, thereby enhancing the generalization ability of the model [46]. Compared with an ordinary MLP [41], DW-MLP [47] with a residual structure introduces a 3 × 3 DW convolution into the hidden layer. This effectively aggregates local information, mitigates the drawbacks of the self-attention paradigm, and improves the generalization ability of the model. However, because the hidden layer contains a large number of channels, a single-scale convolution kernel cannot effectively transform channel information with rich scale characteristics. To solve this problem, the multi-scale feedforward network MS-MLP [48] was proposed; it uses DW convolutions with kernel sizes of 1, 3, 5, and 7 to capture multi-scale features, which enhances model performance to some extent. Nevertheless, relying on the MLP alone to further transform multi-scale features offers limited gains in generalization, since the MLP also undertakes the important task of combining and abstracting feature maps at a higher level.
To further improve the completeness of boundary features, we propose the simple and effective DWLK-MLP module, shown in Figure 6. This module enlarges the convolutional receptive field through depthwise-separable large kernel convolution, so that more complete boundaries can be extracted with almost no additional computational overhead. Unlike other methods, DWLK-MLP introduces the idea of large kernel convolution, which can take on higher-level abstract feature extraction by creating a large receptive field. Specifically, we introduce a 23 × 23 depthwise large kernel convolution in front of the activation function, and the final result is obtained by adding the initial feature map to the feature map produced by the large kernel convolution through a skip connection. To reduce the number of parameters and the computational complexity, we decompose the large kernel into a sequence of 5 × 5 and 7 × 7 depthwise convolutions. This exploits the lightweight nature of the depthwise-separable computational paradigm and promotes the fusion of self-attention and convolution, improving network generalization. Extensive experiments have demonstrated that placing the depthwise large kernel convolution before the activation function improves recognition accuracy and robustness more than placing it after the activation function.
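A possible sketch of DWLK-MLP is shown below; the hidden expansion ratio, the decomposition of the 23 × 23 receptive field into a 5 × 5 depthwise convolution followed by a dilated 7 × 7 depthwise convolution (dilation 3), and the use of GELU are assumptions on our side.

```python
import torch
import torch.nn as nn

class DWLKMLP(nn.Module):
    """Sketch of DWLK-MLP: an MLP whose hidden layer is augmented with a
    depthwise large-kernel (23x23 effective) convolution before the activation."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        # 23x23 depthwise receptive field decomposed into 5x5 DW + dilated 7x7 DW (assumed).
        self.dwlk = nn.Sequential(
            nn.Conv2d(hidden, hidden, 5, padding=2, groups=hidden),
            nn.Conv2d(hidden, hidden, 7, padding=9, dilation=3, groups=hidden),
        )
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor):
        h = self.fc1(x)
        h = h + self.dwlk(h)          # skip connection around the large-kernel convolution
        h = self.act(h)               # large-kernel path applied before the activation
        return self.fc2(h)
```

With this assumed decomposition, the effective receptive field is 5 + (7 − 1) × 3 = 23, matching the 23 × 23 kernel at a fraction of the parameters.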