1. Introduction
Dual-band infrared–visible imaging technology is prevalent in the military sector, autonomous driving assistance systems [1], disaster relief robots [2], and small unmanned aerial vehicles (UAVs) [3]. It effectively facilitates visual tasks such as target identification [4], tracking [5], and scene enhancement [1,2,3]. Owing to constraints such as system volume, weight, and cost, a common configuration pairs an infrared camera with a visible light camera to form a heterogeneous image acquisition system. The visual tasks achieved with such a system typically rely on the two-dimensional information of the target scene, for example, visible light or infrared imaging during the day and infrared imaging alone at night. Although recent developments in visible light (low-light) and infrared dual-band fusion imaging have increased the amount of information gathered from the target scene, the depth information obtained remains insufficient. This limitation hampers accurate and objective environmental perception by imaging systems [6]. Therefore, stereoscopic vision technology based on heterogeneous binocular information leverages the complementary nature of dual-band scene information to achieve target identification and tracking in complex environments, while simultaneously providing the relative spatial position, depth, and dimensions of the target scene [7].
2. Related Work
Previously, the feasibility of achieving binocular stereo vision using scene images from different spectral bands has been demonstrated. Visible light images exhibit rich color, texture, and edge details with high contrast, making them suitable for human observation and target discrimination. In contrast, infrared images reflect thermal radiation information from the observed environment [8,9], possess strong smoke transmission capability, and are less affected by lighting conditions. With the spread of deep learning in image processing, stereo matching has evolved from traditional local, global, and semi-global optimization methods to deep learning-based algorithms [10]. Leveraging the complementary advantages of multiband sensors on existing heterogeneous imaging systems has become a significant research direction for binocular stereo-vision technology. Multispectral image stereo matching involves identifying corresponding feature points between heterogeneous image pairs to compute disparity values. Kim et al., in 2015, introduced the dense adaptive self-correlation (DASC) matching descriptor [11], which performed feature point matching on two spectral band images. In 2018, Zhi et al. proposed an unsupervised cross-spectral stereo-matching (CS-Stereo) method based on deep learning [12], consisting of disparity prediction and spectral transformation networks. An evaluation function for material perception was integrated into the disparity prediction network to handle unreliable matching regions such as light sources and glass. Liang et al. improved Zhi's network structure in 2019 [13] by using a spectrally adversarial transformation network (F-cycleGAN) to enhance the quality of disparity prediction. In 2022, Liang et al. added a multispectral fusion subnetwork to the previous two network architectures [14], minimizing cross-spectral differences between visible light and near-infrared images through fusion. The aforementioned networks are better suited to visible light–near-infrared image pairs with minor spectral differences; their performance is not ideal for visible light–thermal infrared image pairs with more significant spectral differences. In 2020, Li et al. proposed a depth prediction network called IVFuseNet, which extracts common features from infrared and visible light images [15]. However, it overlooks semantic image information, limiting its prediction accuracy.
In recent years, iterative networks have demonstrated promising performance in homogeneous image stereo-matching tasks [16,17,18]. Lipson et al., in 2021, proposed RAFT-Stereo [16], which employs local cost values obtained from all-pair correlations to iteratively optimize and predict the disparity map. However, the capacity of this network for extracting and utilizing global information is insufficient; thus, it struggles with local ambiguities in ill-posed regions. In 2023, Xu et al. [17] addressed the limitations of RAFT-Stereo by introducing the IGEV-Stereo network. This network constructs a structure that encodes global geometry and contextual information, along with local matching details, enhancing the effectiveness of the iterative process. The IGEV-Stereo network was designed for the stereo matching of visible light image pairs. It processes input image pairs through a feature extraction subnetwork to obtain two feature maps for the left and right views. These maps undergo correlation calculations to generate a correlation volume, which is subsequently fed into a lightweight encoder–decoder structure to produce a geometry-encoding volume. This volume offers an improved initial disparity map for the iterative convolutional gated recurrent units (ConvGRUs), thus accelerating network updates. Furthermore, it incorporates global geometry and semantic information, enabling the network to better address local ambiguity issues in pathological regions.
In response to the limitations of existing methods for predicting disparities in heterogeneous image pairs, we propose an iterative network for disparity prediction with infrared and visible light images based on common features (CFNet). Building upon the extraction of common features, CFNet comprehensively considers the unique information from each heterogeneous image. It integrates global geometry, local matching, and individual semantic information from the heterogeneous images into a cascaded iterative optimization module. Furthermore, CFNet leverages the geometry-encoding volume produced with a three-dimensional (3D) regularization network, regresses it, and obtains an initial disparity value, thereby expediting convergence and reducing prediction errors.
The remainder of this article is structured as follows: Section 3 introduces the proposed method, detailing the structure and roles of the various sub-modules within the network and the composition of the loss function. Section 4 compares our network's experimental results with those of other methods and provides the outcomes of ablation experiments. Finally, Section 5 provides an overall evaluation of the network.
3. Methods
The proposed CFNet architecture is shown in Figure 1. The input consists of heterogeneous infrared–visible image pairs, which are initially processed through a common feature extraction subnetwork for feature extraction. The green blocks within the blue dashed box represent the common features extracted from both infrared and visible light images. The context subnetwork extracts semantic features from the heterogeneous images, serving as the initial hidden state for the convolutional gated recurrent units (ConvGRUs). The green dashed box contains the multimodal information acquisition subnetwork, wherein a 3D regularization network generates a geometry-encoding volume, and an attention feature volume is obtained using the values of the correlation volume as attention weights. These two features are combined and passed to the next network level. Additionally, the geometry-encoding volume is utilized to derive an initial disparity map, which accelerates network updates. In the cascaded convolutional gated recurrent subnetwork within the red dashed box, each ConvGRU level receives the joint encoding from the common feature extraction subnetwork, the contextual information of the heterogeneous images from the context network, and the disparity update information from the previous ConvGRU level. After multiple ConvGRU computations are performed, the disparity values are updated.
3.1. Common Feature Extraction Subnetwork
Infrared thermal radiation images and visible light images exhibit distinct characteristics (throughout this section, H and W denote the height and width of the original images, and subscripts "l" and "r" designate the left and right views of the feature map groups, respectively). Nevertheless, infrared images of various scenes contain the objects' contour information owing to variations in thermal radiation, while visible light images often exhibit edge contours owing to brightness or color differences. We refer to the similar features extracted from the same scene's infrared–visible image pair using coupled filters as "common features", whereas the distinct differences displayed in their respective images, owing to spectral disparities, are termed "unique features".
The common feature extraction subnetwork employs a dual-stream convolutional structure. In the downsampling stage, the filters in each layer are coupled, allowing for the extraction of common features from the infrared and visible light images. The filters used during the downsampling process in the common feature extraction subnetwork can be classified into three categories: filters for extracting unique features from the infrared image, filters for extracting unique features from the visible light image, and partially coupled filters for extracting common features from the heterogeneous image pair. Within this subnetwork’s dual-branch structure, the ratio of partially coupled filters to the total number of filters at the same sequential position in the convolutional layers is called the coupling ratio, denoted as
$R_i$, and defined as
$$R_i = \frac{k_i}{n_i},$$
where $R_i$ represents the coupling ratio of the $i$-th convolutional layer, $k_i$ denotes the number of partially coupled filters, and $n_i$ indicates the total number of filters in that layer.
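As a concrete illustration of the coupling ratio, the following is a minimal PyTorch sketch of one partially coupled convolutional layer, in which the same k_i filters are applied to both spectral branches and the remaining filters are branch-specific. The module name, channel counts, and strided 3 × 3 kernels are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CoupledConv(nn.Module):
    """Dual-stream convolution with a coupling ratio R = k / n: the same k
    "partially coupled" filters are applied to both spectral branches, and the
    remaining n - k filters are branch-specific (illustrative sketch only)."""

    def __init__(self, in_ch, n_filters, coupling_ratio, stride=2):
        super().__init__()
        k = int(n_filters * coupling_ratio)           # number of coupled filters k_i
        self.shared = nn.Conv2d(in_ch, k, 3, stride, 1) if k > 0 else None
        self.ir_only = nn.Conv2d(in_ch, n_filters - k, 3, stride, 1)
        self.vis_only = nn.Conv2d(in_ch, n_filters - k, 3, stride, 1)

    def forward(self, x_ir, x_vis):
        f_ir, f_vis = self.ir_only(x_ir), self.vis_only(x_vis)
        if self.shared is not None:
            # The coupled filters respond to "common features" present in both bands.
            f_ir = torch.cat([self.shared(x_ir), f_ir], dim=1)
            f_vis = torch.cat([self.shared(x_vis), f_vis], dim=1)
        return f_ir, f_vis

# Coupling ratios grow with depth, e.g., 0, 0.25, 0.25, 0.5, 0.5, 0.75.
layer = CoupledConv(in_ch=32, n_filters=64, coupling_ratio=0.25)
f_ir, f_vis = layer(torch.randn(1, 32, 128, 160), torch.randn(1, 32, 128, 160))
```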
Due to spectral differences, thermal infrared images and visible light images exhibit significant differences in detail, although both contain "common features". Shallow layers of a network extract textural information from images, whereas deeper layers focus more on the structural and semantic information of objects. Therefore, this segment of the network was designed so that the coupling ratio gradually increases as the convolutional layers deepen; the coupling ratios used in this network were 0, 0.25, 0.25, 0.5, 0.5, and 0.75. Compared to IVFuseNet [15], which employs pooling layers for downsampling, our proposed network employs consecutive convolutional layers to simultaneously achieve downsampling and extract higher-level semantic information from the feature maps, enhancing the network's feature extraction and fusion capabilities. Additionally, multiple small-sized convolutional kernels are utilized in place of large-sized kernels, which reduces the parameter count and enhances the acquisition of structural information from the feature maps, thereby improving the model's generalization ability. After consecutive downsampling, a feature map group at 1/32 of the original resolution is obtained. Subsequently, upsampling blocks with skip connections restore the left and right feature map groups to 1/4 of the original resolution, resulting in multiscale feature map groups $\mathbf{f}_{l,i},\, \mathbf{f}_{r,i} \in \mathbb{R}^{C_i \times \frac{H}{i} \times \frac{W}{i}}$ with $i \in \{4, 8, 16, 32\}$. Here, $C_i$ represents the number of feature channels at scale $i$, while the 1/4-resolution maps $\mathbf{f}_l = \mathbf{f}_{l,4}$ and $\mathbf{f}_r = \mathbf{f}_{r,4}$ are utilized to construct the cost volume. The network flow of the downsampling process in the common feature extraction subnetwork is depicted in Figure 2, and its primary structure is presented in Table 1. The red dashed box represents the processing flow for infrared images, whereas the green dashed box corresponds to that for visible light images. The overlapping portion between the two represents the extraction of common features from the image pair using coupled filters.
3.2. Context Subnetwork
The input to the network consists of heterogeneous image pairs representing the left and right views. Owing to significant spectral differences between the images, the left and right views contain distinct contextual information; therefore, this network extracts contextual information separately for each view. The context network comprises two branches with identical structures, each consisting of a series of residual modules. First, the network generates feature map groups for the left and right views at 1/4, 1/8, and 1/16 of the input image resolution, with each feature map group having 64 channels. These feature map groups capture contextual information at different scales. Subsequently, feature map groups of the same size generated from the left and right views are stacked together. Finally, the contextual information obtained at the different scales is used to initialize and update the hidden states of the ConvGRUs; the evolution of the feature map groups is shown in Figure 3.
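A minimal sketch of one context branch is given below, assuming plain convolutional blocks in place of the residual modules and three-channel inputs; the 64-channel, 1/4 to 1/16 scale outputs follow the description above, while everything else is illustrative.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride, 1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ContextBranch(nn.Module):
    """One of the two identical context branches: 64-channel feature maps at
    1/4, 1/8, and 1/16 resolution (residual modules simplified to conv blocks)."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.to_quarter = nn.Sequential(conv_block(in_ch, 32, 2), conv_block(32, 64, 2))
        self.to_eighth = conv_block(64, 64, 2)
        self.to_sixteenth = conv_block(64, 64, 2)

    def forward(self, x):
        c4 = self.to_quarter(x)
        c8 = self.to_eighth(c4)
        c16 = self.to_sixteenth(c8)
        return [c4, c8, c16]

left = ContextBranch()(torch.randn(1, 3, 256, 320))    # left view
right = ContextBranch()(torch.randn(1, 3, 256, 320))   # right view
# Same-scale maps from the two views are stacked and used to initialize the
# ConvGRU hidden states at each level.
hidden_init = [torch.cat([l, r], dim=1) for l, r in zip(left, right)]
```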
3.3. Multimodal Information Acquisition Subnetwork
In the multimodal information acquisition subnetwork, different processes are applied to the feature map groups extracted from the left and right views in order to obtain more comprehensive geometric structure and local matching information from the heterogeneous image pair.
The feature map groups extracted from the left and right views are used to construct a correlation volume. These feature map groups, $\mathbf{f}_l$ and $\mathbf{f}_r$, were divided into $g = 8$ groups along the channel dimension, and the correlation mapping is computed for each group by
$$\mathbf{C}_{\mathrm{corr}}(g, d, x, y) = \frac{1}{N_c / g} \left\langle \mathbf{f}_l^{\,g}(x, y),\, \mathbf{f}_r^{\,g}(x - d, y) \right\rangle,$$
where $x$ and $y$ represent the pixel coordinates of feature points in the feature map; $d$ is the disparity index, with values ranging from 0 to 192; $N_c$ denotes the number of feature channels; and $\langle \cdot, \cdot \rangle$ indicates the inner product.
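The group-wise correlation above can be computed as in the following sketch; the 1/4-resolution disparity range of 48 = 192/4 and the feature sizes are assumptions for illustration.

```python
import torch

def groupwise_correlation(f_left, f_right, num_groups=8, max_disp=48):
    """Group-wise correlation volume for 1/4-resolution features of shape
    [B, Nc, H, W]; each of the num_groups groups contributes one correlation map
    per disparity, normalized by Nc / g as in the formula above."""
    B, Nc, H, W = f_left.shape
    fl = f_left.view(B, num_groups, Nc // num_groups, H, W)
    fr = f_right.view(B, num_groups, Nc // num_groups, H, W)
    volume = f_left.new_zeros(B, num_groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, 0] = (fl * fr).mean(dim=2)
        else:
            # Shift the right-view features by d pixels before the inner product.
            volume[:, :, d, :, d:] = (fl[..., d:] * fr[..., :-d]).mean(dim=2)
    return volume

vol = groupwise_correlation(torch.randn(1, 96, 64, 80), torch.randn(1, 96, 64, 80))
print(vol.shape)   # torch.Size([1, 8, 48, 64, 80])
```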
Since the cost volume $\mathbf{C}_{\mathrm{corr}}$, which is based on feature correlation, captures only local geometric information, it does not facilitate the network's utilization of global image information to achieve better stereo-matching results. Inspired by the combined geometry encoding volume (CGEV) structure of the IGEV-Stereo network [17], a 3D regularization network, denoted as $R$, was employed to further process the cost volume corresponding to the left feature map group, $\mathbf{C}_{\mathrm{corr},l}$. $R$ is a lightweight encoder–decoder network whose upsampling and downsampling modules consist of 3D convolutions; it effectively extracts and propagates feature information across feature map groups of different scales [19], resulting in a geometry-encoding volume $\mathbf{C}_G$ that combines global geometry and semantic information. The generation process is as follows:
$$\mathbf{C}_G = R\!\left(\mathbf{C}_{\mathrm{corr},l}\right).$$
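The role of the 3D regularization network R can be sketched as a small 3D encoder–decoder, as below; the two-stage depth, channel widths, and single skip connection are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class Regularization3D(nn.Module):
    """Lightweight 3D encoder-decoder R: downsampling and upsampling stages built
    from 3D convolutions, producing the geometry-encoding volume C_G."""
    def __init__(self, ch=8):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv3d(ch, 2 * ch, 3, 2, 1), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv3d(2 * ch, 4 * ch, 3, 2, 1), nn.ReLU(inplace=True))
        self.up1 = nn.Sequential(nn.ConvTranspose3d(4 * ch, 2 * ch, 4, 2, 1), nn.ReLU(inplace=True))
        self.up2 = nn.ConvTranspose3d(2 * ch, ch, 4, 2, 1)

    def forward(self, cost):
        d1 = self.down1(cost)
        d2 = self.down2(d1)
        u1 = self.up1(d2) + d1      # skip connection propagates multi-scale information
        return self.up2(u1)         # geometry-encoding volume C_G

c_corr_left = torch.randn(1, 8, 48, 64, 80)    # [B, groups, D, H, W] left cost volume
c_g = Regularization3D()(c_corr_left)          # same shape as the input volume
```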
Then, the cost volume corresponding to the right feature map group, $\mathbf{C}_{\mathrm{corr},r}$, was further encoded for matching and semantic information through the construction of an attention feature volume [20,21]. This was primarily motivated by the significant spectral differences between the input heterogeneous image pair, where the different views contain more distinct semantic information. Using the cost volume values as attention weights efficiently enhances the extraction of image features.
The construction of the attention feature volume initially involves adjusting the channel count of the cost volume $\mathbf{C}_{\mathrm{corr},r}$ using a 3 × 3 convolution operation to obtain a weight matrix $\mathbf{W}$. Subsequently, two consecutive 1 × 1 convolution operations are applied to adjust the channel count of the right feature map group $\mathbf{f}_r$ to 8, followed by activation using the sigmoid function to generate the adjustment matrix $\mathbf{G}$. Finally, the attention feature volume $\mathbf{C}_A$ is computed as
$$\mathbf{C}_A = \mathbf{W} \odot \mathbf{G},$$
where $\odot$ represents the Hadamard product, indicating element-wise multiplication between two matrices.
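One plausible reading of this construction is sketched below; the 3 × 3 convolution is implemented here as a 3D convolution over the right cost volume, and the broadcast of the 8-channel adjustment matrix over the disparity dimension, like the module and variable names, is our assumption.

```python
import torch
import torch.nn as nn

class AttentionFeatureVolume(nn.Module):
    """Sketch of the attention feature volume: cost-volume values act as attention
    weights (W) that modulate a sigmoid-adjusted right feature map group (G)."""
    def __init__(self, num_groups=8, feat_ch=96):
        super().__init__()
        # 3x3 convolution adjusts the channel count of the right cost volume -> W
        self.to_weights = nn.Conv3d(num_groups, num_groups, 3, 1, 1)
        # two consecutive 1x1 convolutions reduce the right features to 8 channels -> G
        self.to_adjust = nn.Sequential(
            nn.Conv2d(feat_ch, 32, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_groups, 1), nn.Sigmoid(),
        )

    def forward(self, cost_right, f_right):
        w = self.to_weights(cost_right)               # [B, 8, D, H, W]
        g = self.to_adjust(f_right).unsqueeze(2)      # [B, 8, 1, H, W]
        return w * g                                  # Hadamard product (broadcast over D)

c_a = AttentionFeatureVolume()(torch.randn(1, 8, 48, 64, 80),
                               torch.randn(1, 96, 64, 80))
```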
We further downsampled $\mathbf{C}_G$ and $\mathbf{C}_A$ to obtain two pyramid-structured feature map groups of the same size. Stacking these two pyramid-structured feature map groups at corresponding positions results in a new pyramid-level-structured feature map group called the joint encoding volume, $\mathbf{C}_J$.
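A possible realization of the joint encoding volume is sketched below; pooling along the disparity dimension to form the pyramid (in the spirit of RAFT-style correlation pyramids) and the number of levels are our assumptions.

```python
import torch
import torch.nn.functional as F

def joint_encoding_volume(c_g, c_a, num_levels=3):
    """Stack the geometry-encoding volume C_G and the attention feature volume C_A
    at matching sizes, then downsample both to build the next pyramid level."""
    pyramid = []
    for _ in range(num_levels):
        pyramid.append(torch.cat([c_g, c_a], dim=1))      # joint encoding at this level
        c_g = F.avg_pool3d(c_g, kernel_size=(2, 1, 1), stride=(2, 1, 1))
        c_a = F.avg_pool3d(c_a, kernel_size=(2, 1, 1), stride=(2, 1, 1))
    return pyramid

c_j = joint_encoding_volume(torch.randn(1, 8, 48, 64, 80),
                            torch.randn(1, 8, 48, 64, 80))
print([v.shape for v in c_j])   # the disparity dimension halves at each level
```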
3.4. Cascaded Convolutional Gated Recurrent Subnetwork
Deep feature maps contain more semantic information and larger receptive fields, making networks more robust in stereo matching within non-textured or repetitively textured regions. However, these feature maps tend to lose fine structural details. To strike a balance between network robustness and the ability to perceive image details [17], the network also employs the ConvGRU structure for the iterative optimization of disparity values.
The initial disparity map $d_0$ is first computed from the geometry-encoding volume $\mathbf{C}_G$ using the soft-argmin method:
$$d_0 = \sum_{d=0}^{D-1} d \times \operatorname{softmax}\!\left(\mathbf{C}_G(d)\right),$$
where the softmax is applied along the disparity dimension and $D$ is the size of the disparity search range.
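A short sketch of the soft-argmin step, assuming the geometry-encoding volume has already been reduced to a single score per disparity candidate (shape [B, D, H, W]):

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(c_g):
    """Initial disparity d0: disparity-wise softmax of C_G followed by the
    expectation over the candidate disparities."""
    prob = F.softmax(c_g, dim=1)                              # over the disparity dimension
    disp = torch.arange(c_g.size(1), device=c_g.device, dtype=c_g.dtype)
    return (prob * disp.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)

d0 = soft_argmin_disparity(torch.randn(2, 48, 64, 80))   # -> [2, 1, 64, 80]
```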
Starting from $d_0$, the ConvGRU modules are utilized for iterative disparity map updates to aid the rapid convergence of the disparity computation. Each ConvGRU level accepts the joint encoding volume $\mathbf{C}_J$, the semantic features extracted with the context subnetwork, and the disparity update information passed from the previous ConvGRU level. As shown in Figure 4, a three-level ConvGRU is employed to process feature maps at 1/16, 1/8, and 1/4 of the original input image size. The information within the feature maps is connected using pooling and upsampling operations, and the outputs of the previous ConvGRU levels are cascaded as input hidden states to the subsequent ConvGRU level. Ultimately, the disparity map is updated using the output of the final ConvGRU level (denoted in green).
After the computations through the multilevel ConvGRU, the disparity update $\Delta d_i$ is obtained and used to update the current disparity value $d_i$ as follows:
$$d_i = d_{i-1} + \Delta d_i.$$
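The update rule can be illustrated with a single-level ConvGRU cell and a residual disparity head, as in the sketch below; the three-level cascade, the lookup of the joint encoding volume, and the exact channel counts are omitted or assumed.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell used to refine the hidden state h from
    which the disparity residual is predicted."""
    def __init__(self, hidden_ch, input_ch):
        super().__init__()
        self.convz = nn.Conv2d(hidden_ch + input_ch, hidden_ch, 3, padding=1)
        self.convr = nn.Conv2d(hidden_ch + input_ch, hidden_ch, 3, padding=1)
        self.convq = nn.Conv2d(hidden_ch + input_ch, hidden_ch, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                        # update gate
        r = torch.sigmoid(self.convr(hx))                        # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

hidden_ch, input_ch = 64, 32
cell = ConvGRUCell(hidden_ch, input_ch)
delta_head = nn.Conv2d(hidden_ch, 1, 3, padding=1)               # hidden state -> delta d_i
h = torch.zeros(1, hidden_ch, 64, 80)                            # from context features in practice
d = torch.zeros(1, 1, 64, 80)                                    # from d0 in practice
for _ in range(22):                                              # 22 update iterations
    x = torch.randn(1, input_ch, 64, 80)                         # stands in for the joint-encoding lookup
    h = cell(h, x)
    d = d + delta_head(h)                                        # d_i = d_{i-1} + delta d_i
```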
3.5. Loss Function
The computation of the loss value can be divided into two parts. Using the initial disparity map $d_0$ and all disparity prediction results $\{d_i\}_{i=1}^{N}$ obtained after each ConvGRU iteration to calculate the $L_1$ loss, the final expression for the loss function is as follows:
$$L = \mathrm{Smooth}_{L_1}\!\left(d_0 - d_{gt}\right) + \sum_{i=1}^{N} \gamma^{\,N-i}\,\left\| d_i - d_{gt} \right\|_1,$$
where $d_0$ represents the initial disparity map; $d_{gt}$ represents the ground-truth disparity map, obtained in this study by transforming the distance information acquired using a LiDAR sensor into a corresponding disparity map aligned with the left view; $\gamma$ is set to 0.9 within the network, and the number of forward passes $N$ for disparity updates was set to 22; and $\mathrm{Smooth}_{L_1}$ serves as a smoothing loss function, calculated as follows:
$$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5\,x^{2}, & \left| x \right| < 1 \\ \left| x \right| - 0.5, & \text{otherwise.} \end{cases}$$
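The loss above can be written compactly as in this sketch; masking of pixels without valid LiDAR ground truth, which a real training loop would need, is omitted.

```python
import torch
import torch.nn.functional as F

def cfnet_loss(d0, iter_disps, d_gt, gamma=0.9):
    """Smooth-L1 term on the initial disparity d0 plus exponentially weighted L1
    terms on the N iterative predictions, following the expression above."""
    loss = F.smooth_l1_loss(d0, d_gt)
    n = len(iter_disps)
    for i, d_i in enumerate(iter_disps, start=1):
        loss = loss + gamma ** (n - i) * (d_i - d_gt).abs().mean()
    return loss

preds = [torch.randn(1, 1, 64, 80) for _ in range(22)]           # N = 22 updates
loss = cfnet_loss(torch.randn(1, 1, 64, 80), preds, torch.randn(1, 1, 64, 80))
```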
5. Conclusions
In addressing the challenge of stereo matching for heterogeneous infrared–visible image pairs, this study presented CFNet, an iterative network for predicting disparity with infrared and visible light images based on common features. Existing methods either fail to exploit the complementary information in heterogeneous images or do not effectively utilize the semantic information in the images; in addition, initializing the disparity to 0 requires more update iterations, which reduces optimization efficiency. Compared to other networks, CFNet integrates a common feature extraction subnetwork with a cascaded convolutional gated recurrent subnetwork, which enables the network to effectively harness the complementary advantages of both spectral domains, incorporating semantic information, geometric structure, and local matching details from the images. This results in more accurate disparity predictions for scenes within heterogeneous image pairs. The disparity prediction performance of CFNet surpassed that of the other methods, as evidenced by superior results on recognized evaluation metrics (RMSE, log10 RMSE, Abs Rel, and Sq Rel). Visualizing the predicted disparity maps further demonstrated the superiority of CFNet compared to other publicly available networks.
Multispectral imaging systems with parallel optical paths currently have extensive applications, typically operating in switching or fusion imaging modes; however, these systems do not effectively utilize their fields of view to acquire disparity information. CFNet directly leverages the heterogeneous infrared–visible image pairs for stereo matching, enabling the system to perceive disparity information from the image pairs without additional sensors. This approach enhances the system's ability to perceive the surrounding environment while avoiding additional hardware complexity. Consequently, the system's overall structure becomes more conducive to integrated design and precise calibration, facilitating the broader adoption of heterogeneous image acquisition systems.