1. Introduction
Unlike traditional two-dimensional (2D) imaging, three-dimensional (3D) imaging technology can capture the depth and angle information of objects. Three-dimensional imaging and display systems have brought revolutionary changes to many fields, providing more comprehensive, accurate, and visualized data and thus promoting innovation and development. Through 3D imaging technology, we can understand and analyze objects and scenes in the real world in a more in-depth and comprehensive way, bringing tremendous potential and opportunities to various industries [1]. Integral imaging is a true 3D display technology characterized by its simple architecture, low manufacturing cost, and good imaging performance. It has been developed for more than a century, and many researchers have dedicated their efforts to improving its 3D display performance in various aspects [2,3,4,5]. Lim et al. used depth-slice images and orthogonal-view images to optimize elemental image arrays and enhance the display resolution [6]. Navarro et al. proposed a smart pseudo-orthogonal transformation method to optimize the depth of field [7]. Kwon et al. improved the resolution by reconstructing and rapidly processing the intermediate view through the image information of adjacent elements [8]. They also enhanced the depth-of-field range by using a multiplexed structure combining a dual-channel beam splitter with a microlens array [9]. Zhang et al. proposed a target contour extraction method and a block-based image fusion method to generate a reconstructed image with extended degrees of freedom and improve the depth of field [10]. Sang et al. utilized a pre-filter function array (PFA) to preprocess elemental images, aiming to improve the reconstruction fidelity of integral imaging [11]. Ma et al. employed a conjugate pinhole camera and a pinhole-based projection model to mitigate distortion and expand the viewing angle of integral imaging [12]. Wang et al. utilized a bifocal lens array to achieve a 3D image display with a depth of field twice that of conventional integral imaging [13]. Cao et al. proposed a high-resolution integral imaging display with an enhanced point light source to improve the image resolution in the vertical and horizontal dimensions [14]. The above research improved the quality and efficiency of integral imaging content generation and display by optimizing the resolution, viewing angle, and depth-of-field range of integral imaging.
With developments in computer science, real–virtual fusion based on integral imaging, which overlays 3D images on real environments, has recently become a research hotspot [15,16]. Hong et al. introduced an integral floating 3D display that adopts a concave lens to combine 3D images and real-world scenes [17]. Hua and Javidi used microscopic integral imaging to create a miniature 3D image and combined it with freeform optics to design a see-through 3D display [18]. Yamaguchi and Takaki used two integral imaging systems to achieve a background occlusion capability [19]. Deng et al. proposed a method for creating a magnified augmented reality (AR) 3D display using integral imaging, in which a convex lens and a half-mirror merge a magnified 3D image with a real-world scene [20]. To increase real-time performance, Wang et al. proposed GPU-based computer-generated integral imaging and presented an AR device for surgical 3D image overlay [21]. Furthermore, developing a high-definition, real–virtual fusion 3D display remains a problem to be resolved [22,23,24]. However, these methods can only present one real scene at a time. In order to achieve a more immersive and rich augmented reality experience, and to capture, process, and combine different real scenes simultaneously, this study proposes a method to import real objects into virtual space. By utilizing virtual capture technology based on integral imaging, a real–virtual fusion 3D display that combines multiple real scenes can be generated in a single process. Since modeling real objects degrades the object display quality, it is necessary to enhance the 3D reconstruction effect.
Structure from Motion (SFM) is a classical 3D reconstruction algorithm [10], first proposed by Longuet-Higgins [25], that can accurately reconstruct real objects and generate computerized 3D models. The Scale Invariant Feature Transform (SIFT) is a widely recognized feature extraction algorithm that has demonstrated excellent application effectiveness and high robustness [26]; however, it tends to perform poorly in image regions with weak textures. With the rise of deep learning, image feature point detection has advanced rapidly. The FAST corner detector was the first algorithm to obtain image corners through machine learning [27]. Yi et al. proposed the learned invariant feature transform (LIFT), an end-to-end convolutional neural network-based feature point detector [28]. Detone et al. proposed the self-supervised interest point (SuperPoint) algorithm for feature point extraction, which is trained in a self-supervised manner [29]. This algorithm detects feature points well in image areas with weak texture. However, the SuperPoint algorithm can only extract integer coordinates of feature points, which limits its accuracy. Wu added a sub-pixel module based on the Softargmax function to implement the sub-pixel detection of feature points [30]. However, in dense feature maps, the edges of the pixel blocks extracted for Softargmax often contain isolated points with high pixel values. Since the Softargmax calculation relies on the weights of the individual pixels within the block, such isolated high-value points are excessively amplified, leading to inaccurate offset values; this inaccuracy propagates to subsequent tasks and reduces the overall accuracy of the algorithm. On the other hand, if the size of the extracted pixel blocks is reduced to avoid isolated points, boundary pixels cannot be fully utilized; because boundary pixels usually carry important edge information, shrinking the pixel blocks may cause this information to be neglected or lost, which also degrades the accuracy of the Softargmax calculation.
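As an illustration of this edge-isolation problem (an illustrative sketch, not the authors' implementation), the following snippet compares the plain Softargmax offset of a 5 × 5 confidence block with and without a spurious high value at its edge; the array names and values are hypothetical.

```python
import numpy as np

def softargmax_offset(block):
    """Plain Softargmax: expected (dx, dy) offset from the block center."""
    n = block.shape[0]
    coords = np.arange(n) - n // 2          # offsets relative to the center pixel
    w = np.exp(block - block.max())         # softmax weights (numerically stable)
    w /= w.sum()
    dy = (w.sum(axis=1) * coords).sum()     # expected row offset
    dx = (w.sum(axis=0) * coords).sum()     # expected column offset
    return dx, dy

# A clean 5x5 block with a single peak at the center.
clean = np.zeros((5, 5))
clean[2, 2] = 5.0

# The same block with an isolated high value on the edge,
# as can occur in feature-dense regions after NMS.
noisy = clean.copy()
noisy[0, 4] = 4.5

print(softargmax_offset(clean))   # ~ (0.0, 0.0): offset stays at the true peak
print(softargmax_offset(noisy))   # noticeably biased toward the isolated edge point
```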
Therefore, this paper proposes to improve the SuperPoint algorithm by adding Gaussian weights and performing sub-pixelization, so as to enhance the matching accuracy of image feature point detection and improve the mapping effect of real scenes. By reducing the weight of the edge area, the interference caused by isolated points with high pixel values can be minimized, while increasing the weight of the central area better captures the positions of the main feature points, improving the accuracy of feature point detection. In addition, sub-pixelization further improves the positioning accuracy of feature points, thereby improving the quality and authenticity of scene reconstruction.
In conclusion, this paper combines integral imaging and 3D reconstruction techniques and proposes a method for generating elemental image arrays for real–virtual fusion scenes. The 3D information of multiple real scenes can be combined in a single process. The SuperPoint feature point detection is improved with an added Gaussian weight and combined with the SIFT algorithm to obtain a rich number of accurate image feature points. Then, the SFM algorithm is used to complete the 3D model reconstruction. Finally, following the principle of integral imaging, the elemental image array of the real–virtual fusion scene can be generated. This eliminates the need to consider the size and matching issues of the scene. The model only needs to be reconstructed once and can be used multiple times, which saves the cost and time of modeling and rendering. This enables the rapid generation of diverse scenes while minimizing the need for additional equipment and resources. Users can modify the real–virtual fusion model anytime according to their needs, achieving a more personalized scene presentation.
This article is organized as follows: in Section 2, the content generation for real–virtual fusion in integral imaging is introduced. Section 2.1 presents a joint feature point detection method aimed at improving the detection effectiveness. Section 2.2 focuses on how to generate real–virtual fused elemental image arrays. Section 3 presents the experimental results and provides a comprehensive discussion. Finally, in Section 4, conclusions are drawn based on the conducted research.
2. Integral Imaging Content Generation for Real–Virtual Fusion
An improved SuperPoint and the SIFT algorithm are used for feature point detection and matching, followed by the utilization of the SFM-based method to achieve the 3D reconstruction of the object. By combining the 3D model of the real object with the virtual model, a mixed-reality scene is created. This allows the integral imaging virtual capture method to capture both real and virtual models simultaneously. The process is shown in Figure 1.
2.1. Joint Feature Point Detection
The SIFT algorithm is renowned for its ability to detect a large number of feature points accurately. However, it exhibits an uneven distribution of feature points and performs poorly in low-texture regions. On the other hand, SuperPoint feature detection, based on a deep learning network, excels at detecting feature points in low-texture regions but lacks positioning precision. By combining the two algorithms, the robust and accurate localization of SIFT and the low-texture performance of SuperPoint complement each other, mitigating their respective shortcomings. As a result, the overall capability of detecting and matching image feature points is significantly improved, leading to a more reliable and comprehensive feature point detection.
SuperPoint is a deep learning-based algorithm for self-supervised feature point detection and description. It extracts image features through deep learning and generates pixel-level image feature points along with their corresponding descriptors. The SuperPoint network can be divided into four main parts: the VGG-style shared encoder, the feature point decoder, the descriptor decoder, and the loss construction, as shown in Figure 2.
The SuperPoint feature detection network outputs a feature point confidence distribution map and a feature descriptor map. Each pixel value on the confidence distribution map represents the probability that the corresponding point in the image is a feature point. SuperPoint obtains feature points using Non-Maximum Suppression (NMS): it detects the set of pixels above a certain threshold, uses these pixels as centers of bounding boxes of size N × N, calculates the overlap between the bounding boxes, and removes feature points that are too close to each other to obtain the final set of feature points. However, these feature points have integer coordinates, so the detection results are not precise [31]. A previous study [30] added a sub-pixel module based on the Softargmax function to the SuperPoint feature point detection algorithm to achieve the sub-pixel detection of feature points. However, since NMS retains only the maximum value within an N × N pixel region and sets all other pixels to 0, the pixel blocks extracted for Softargmax in feature-dense images easily contain isolated points with large pixel values on their edges, which affects the accuracy of the offset calculated by Softargmax. On the other hand, if the size of the extracted pixel block is reduced, the boundary pixels cannot be utilized, which also affects the calculation accuracy of Softargmax.
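A minimal sketch of this NMS-style selection step is given below; it is an illustrative reimplementation rather than the authors' code, and the threshold, window size, and variable names are assumptions.

```python
import numpy as np

def select_keypoints(conf_map, threshold=0.015, n=4):
    """Greedy NMS over a confidence map: keep local peaks, suppress an N-neighborhood."""
    h, w = conf_map.shape
    ys, xs = np.where(conf_map > threshold)         # candidate pixels above threshold
    order = np.argsort(conf_map[ys, xs])[::-1]      # strongest candidates first
    suppressed = np.zeros((h, w), dtype=bool)
    keypoints = []
    for idx in order:
        y, x = ys[idx], xs[idx]
        if suppressed[y, x]:
            continue                                 # already inside a stronger box
        keypoints.append((x, y))                     # keep this (integer) feature point
        y0, y1 = max(0, y - n), min(h, y + n + 1)
        x0, x1 = max(0, x - n), min(w, x + n + 1)
        suppressed[y0:y1, x0:x1] = True              # suppress the surrounding neighborhood
    return np.array(keypoints)
```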
To solve the above problem, we introduce Gaussian weight coefficients into the SuperPoint sub-pixel feature point detection algorithm. It starts by obtaining the integer coordinate feature point set through the feature point decoder of the SuperPoint algorithm. Then, the pixel blocks extracted by the Softargmax function are processed with the addition of Gaussian weight coefficients, typically with a block size of 5 × 5. This approach effectively reduces the weight of the edge regions within the pixel blocks while increasing the weight of the central regions. As a result, it mitigates the impact of isolated points with large pixel values at the edges on the accuracy of the Softargmax calculation and improves the precision of the feature point detection.
Let $G(i,j)$ be the Gaussian weight coefficient, whose size is consistent with the pixel block extracted for Softargmax, with $G(i,j)$ denoting the weight value at position $(i,j)$; then, the expectation of the offset $(\delta_x, \delta_y)$ in the $x$ and $y$ directions is calculated as follows:

$$\delta_x = \sum_{i,j} j \cdot \frac{G(i,j)\, e^{S(i,j)}}{\sum_{m,n} G(m,n)\, e^{S(m,n)}}, \qquad \delta_y = \sum_{i,j} i \cdot \frac{G(i,j)\, e^{S(i,j)}}{\sum_{m,n} G(m,n)\, e^{S(m,n)}},$$

where $P(x,y)$ represents the probability that the pixel at position $(x,y)$ in the image is a feature point, while $S(i,j)$ represents the pixel value at position $(i,j)$ in the extracted pixel block, i.e., the value of $P$ at the corresponding image position, with $(i,j)$ measured relative to the block center.
According to the obtained offset $(\delta_x, \delta_y)$, the final feature point coordinates are calculated as follows:

$$x = x_0 + \delta_x, \qquad y = y_0 + \delta_y,$$

where $x_0$ and $y_0$ are the integer coordinates of the feature point, which also represent the center position of the extracted pixel block. The above method increases the algorithm's complexity due to the addition of the Gaussian weight coefficient, but it can effectively reduce the error generated during feature point sub-pixelization in feature-dense images or regions, resulting in more accurate final feature points.
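The sub-pixel refinement described above can be sketched as follows; this is an illustrative implementation consistent with the reconstructed equations, with the 5 × 5 block size and the Gaussian σ chosen as assumptions (keypoints near the image border are assumed to be skipped).

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """2D Gaussian weight coefficients G(i, j), peaked at the block center."""
    r = np.arange(size) - size // 2
    g = np.exp(-r**2 / (2 * sigma**2))
    return np.outer(g, g)

def refine_keypoint(conf_map, x0, y0, size=5, sigma=1.0):
    """Gaussian-weighted Softargmax offset around an integer keypoint (x0, y0)."""
    half = size // 2
    block = conf_map[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1]   # S(i, j)
    G = gaussian_kernel(size, sigma)                                     # G(i, j)
    w = G * np.exp(block - block.max())     # down-weight edge pixels, stabilize the exponential
    w /= w.sum()
    offs = np.arange(size) - half           # offsets i, j relative to the block center
    dx = (w.sum(axis=0) * offs).sum()       # expected column offset
    dy = (w.sum(axis=1) * offs).sum()       # expected row offset
    return x0 + dx, y0 + dy                 # sub-pixel coordinates
```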
The sub-pixel feature descriptors of the improved SuperPoint feature point detection algorithm are obtained through the descriptor decoder. The input to the descriptor decoder is convolved to generate a descriptor matrix of size H/8 × W/8 × 256. This matrix is then subjected to bicubic interpolation and L2 normalization to obtain a 256-dimensional feature descriptor vector for the sub-pixel feature points.
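For the descriptor branch, a hedged sketch of how sub-pixel descriptors might be sampled is shown below, using cubic spline interpolation from SciPy as a stand-in for the bicubic interpolation described above; the tensor layout and names are assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sample_descriptors(desc_grid, keypoints, cell=8):
    """Sample 256-D descriptors at sub-pixel keypoints from an (H/8, W/8, 256) grid.

    desc_grid : coarse descriptor map produced by the descriptor decoder
    keypoints : array of (x, y) sub-pixel image coordinates
    """
    # Map image coordinates to coarse-grid coordinates (one cell per 8 pixels).
    gy = keypoints[:, 1] / cell
    gx = keypoints[:, 0] / cell
    dims = desc_grid.shape[2]
    out = np.empty((len(keypoints), dims), dtype=np.float32)
    for d in range(dims):
        # order=3 gives cubic spline interpolation of each descriptor channel.
        out[:, d] = map_coordinates(desc_grid[:, :, d], [gy, gx], order=3, mode='nearest')
    # L2-normalize each descriptor to unit length.
    out /= np.linalg.norm(out, axis=1, keepdims=True) + 1e-12
    return out
```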
The feature points extracted by SIFT are high-precision sub-pixel feature points, and those extracted by the improved SuperPoint algorithm are also at the sub-pixel level. This means that the feature points extracted by the two algorithms at the same position will not completely coincide. Therefore, in order to merge the feature point sets extracted by SIFT and the improved SuperPoint, it is necessary to calculate the distance between each feature point in the improved SuperPoint set and each feature point in the SIFT set. If the distance exceeds a threshold, the feature point is retained; otherwise, it is discarded. The formula for this calculation is as follows:

$$R(p_i) = \begin{cases} 1, & \min_{q_j} \left\| p_i - q_j \right\| > T, \\ 0, & \text{otherwise}, \end{cases}$$

where $p_i$ and $q_j$ are the feature points in the improved SuperPoint and SIFT feature point sets, respectively. The value of $R(p_i)$ indicates whether the feature point $p_i$ in the improved SuperPoint feature point set is retained ($R(p_i)=1$) or discarded ($R(p_i)=0$), and $T$ is a constant.
Considering that the dimensions of the feature descriptors extracted by the SIFT and improved SuperPoint algorithms are 128 and 256, respectively, this paper first performs feature point matching within each algorithm and then applies the distance threshold to merge the feature point sets. The matching step calculates the distance ratio between the closest and the second-closest neighboring features and matches the SIFT feature point set and the improved SuperPoint feature point set across the two images, respectively.
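A possible realization of this two-stage procedure (ratio-test matching per descriptor type, then threshold-based merging of the point sets) is sketched below; the ratio of 0.75 and the merge radius are assumed values, and the two helpers are applied independently to SIFT and improved-SuperPoint data.

```python
import numpy as np
from scipy.spatial import cKDTree

def ratio_test_match(desc1, desc2, ratio=0.75):
    """Match descriptors of the same type (SIFT with SIFT, SuperPoint with SuperPoint)."""
    tree = cKDTree(desc2)
    dist, idx = tree.query(desc1, k=2)          # nearest and second-nearest neighbors
    keep = dist[:, 0] < ratio * dist[:, 1]      # Lowe-style ratio test
    return np.flatnonzero(keep), idx[keep, 0]   # matched index pairs

def merge_points(sp_points, sift_points, radius=2.0):
    """Keep an improved-SuperPoint point only if no SIFT point lies within `radius` pixels."""
    tree = cKDTree(sift_points)
    d, _ = tree.query(sp_points, k=1)
    kept_sp = sp_points[d > radius]
    return np.vstack([sift_points, kept_sp])    # merged feature point set
```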
2.2. Real–Virtual Fusion Elemental Image Array Generation
The generation of elemental image arrays mainly involves two methods: real-scene capture and computer-based virtual generation. Real-scene capture can acquire the 3D information of the actual scene but requires camera arrays or lens arrays. The calibration and synchronization of camera arrays are complex and precise processes, while lens arrays are limited in size and shooting range due to manufacturing constraints. The computer-based virtual generation method uses 3D modeling and animation software such as Maya, 3ds Max, and Blender to simulate the shooting process of real lens arrays or camera arrays, establishing an optical mapping model to generate elemental image arrays. This method is not limited by capturing devices, allowing for an infinitely small virtual camera spacing and shooting from within virtual objects [32]. It can realize various shooting structures such as parallel, convergent, and shifted configurations. However, the captured 3D models are virtual models. Therefore, to address the high cost, complex processes, and limited shooting structures of real-scene capture, as well as the purely virtual models of virtual shooting, this section proposes a method for generating a real–virtual fusion elemental image array. It utilizes SFM to establish the 3D models of real objects and merges them with virtual models. The ray tracing technique is used to render the real–virtual fusion elemental image array. This method enables the capture of real scenes without being limited by capturing devices, providing a more diverse source of true 3D videos for integral imaging system research.
Compared with other 3D reconstruction methods, SFM is based on the principles of multi-view geometry and restores the 3D structural information of a scene from images captured at different angles [33]. Using images from different angles, the fundamental matrix and essential matrix can be calculated and then decomposed to extract the relative camera pose information, determining the position and orientation of each camera when the images were captured. Subsequently, using triangulation, the 3D point cloud of the scene can be reconstructed. In the reconstruction process, bundle adjustment is performed to limit error accumulation; bundle adjustment simultaneously adjusts all camera parameters, optimizing camera poses and the positions of 3D points by minimizing reprojection errors. These steps are repeated, incorporating new images, until all multi-view images are reconstructed and the final sparse point cloud is obtained. The advantages of SFM are its ability to handle free camera motion without prior camera calibration and its ability to fuse multi-view information to generate sparse but accurate 3D reconstruction results. It can be used in various scenarios, including indoor, outdoor, static, and dynamic scenes. Therefore, this paper adopts SFM for the 3D reconstruction of real objects. The specific process is shown in Figure 3.
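The core two-view step of this pipeline (essential matrix estimation, pose recovery, and triangulation) can be sketched with OpenCV as follows. This is a simplified illustration under assumed variable names and a known intrinsic matrix K, not the authors' implementation, and bundle adjustment is omitted.

```python
import cv2
import numpy as np

def two_view_reconstruction(pts1, pts2, K):
    """Recover relative pose and triangulate a sparse point cloud from matched keypoints.

    pts1, pts2 : Nx2 arrays of matched sub-pixel feature points in two views
    K          : 3x3 camera intrinsic matrix
    """
    # Estimate the essential matrix with RANSAC to reject mismatches.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    # Decompose E into the relative rotation R and translation t.
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

    # Projection matrices: the first camera at the origin, the second at (R, t).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulate inlier correspondences into homogeneous 3D points.
    good = mask.ravel() > 0
    pts4d = cv2.triangulatePoints(P1, P2, pts1[good].T, pts2[good].T)
    points3d = (pts4d[:3] / pts4d[3]).T      # convert to Euclidean coordinates
    return R, t, points3d
```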
The real object model is imported into virtual space and combined with a virtual model to create a real–virtual fusion scene. After that, using integral imaging virtual acquisition, an elemental image array fusing the real and virtual 3D scenes can be generated. The elemental image array serves as a bridge connecting the integral imaging acquisition and reconstruction processes.
The process of collecting and visualizing spatial scene information using an integral imaging system can be summarized as consisting of two stages, acquisition and display, as shown in Figure 4a [5]. During the acquisition process, media with information recording capabilities are placed in parallel behind the lens array. In this way, when light emitted from an object passes through the lens array and is projected onto the capture device, the image information of the object at a certain spatial angle is saved in the elemental image array. Integral imaging is an optically symmetrical system, in which the acquisition and display processes are optically reversible. In the display process, the elemental image array captured on the recording media is projected onto the lens array through a high-definition display device such as a liquid crystal display (LCD). According to the reversibility of light paths, the light emitted from the display screen, after optical decoding by the lens array, converges at the original location of the captured object, generating a real three-dimensional image. Therefore, in the integral imaging system, both the recording and reproducing processes need to satisfy the lens imaging principle, i.e., the Gaussian imaging formula:
$$\frac{1}{g} + \frac{1}{l} = \frac{1}{f},$$

where $g$ is the distance between the capture or display device and the lenslet array, $l$ is the distance between the lenslet array and the imaged or reconstructed scene, and $f$ is the focal length of the lens array.
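As a hedged numerical illustration (the values below are assumptions, not parameters of the experimental system), a lens array with focal length $f = 10\ \mathrm{mm}$ and a display placed $g = 12\ \mathrm{mm}$ behind it forms the reconstructed image plane at

$$\frac{1}{12\ \mathrm{mm}} + \frac{1}{l} = \frac{1}{10\ \mathrm{mm}} \;\Rightarrow\; l = \frac{12 \times 10}{12 - 10}\ \mathrm{mm} = 60\ \mathrm{mm}.$$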
In order to make the acquired elemental image array display efficiently in the actual light field environment, the camera array parameters in the virtual acquisition process must be set according to the display platform. The optical mapping model is shown in Figure 4b:
$$D_c = \frac{f_c P_c}{L}, \qquad D_d = \frac{g P_L}{l_d}, \qquad D_d = M D_c \;\Rightarrow\; l_d = \frac{g P_L L}{M f_c P_c},$$

where $L$, $f_c$, $P_c$, and $D_c$ are the object distance, camera focal length, camera array pitch, and disparity between adjacent image pairs in acquisition; $l_d$, $g$, $P_L$, and $D_d$ are the image distance, the distance between the lens array and the display plane, the lens pitch, and the disparity between adjacent elemental image pairs in the display; and $M$ is the magnification factor between a captured image and its elemental image.
After establishing the optical mapping model from acquisition to display, the virtual camera array is set according to the display parameters, as shown in Figure 5.
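Under the mapping relations reconstructed above (which are themselves an assumption about the original equations), the acquisition-side parameters could be derived from the display parameters as in the following sketch; all function names and numerical defaults are hypothetical.

```python
def acquisition_from_display(P_L, g, f_display, f_cam, L_obj, M):
    """Derive virtual camera array parameters from display-side parameters.

    P_L       : lens pitch of the display lens array (mm)
    g         : gap between the lens array and the display plane (mm)
    f_display : focal length of the display lens array (mm)
    f_cam     : focal length chosen for the virtual cameras (mm)
    L_obj     : object distance of the virtual scene (mm)
    M         : magnification from a captured image to its elemental image
    """
    # Gaussian imaging formula: distance of the reconstructed image plane.
    l_image = 1.0 / (1.0 / f_display - 1.0 / g)
    # Disparity between adjacent elemental images for a point on that plane.
    D_d = g * P_L / l_image
    # Matching disparity required between adjacent captured images.
    D_c = D_d / M
    # Camera pitch that produces this disparity for an object at L_obj.
    P_c = D_c * L_obj / f_cam
    return {"camera_pitch": P_c, "capture_disparity": D_c, "image_distance": l_image}

# Example with hypothetical values: 10 mm lens pitch, 12 mm gap, 10 mm lens focal length.
print(acquisition_from_display(P_L=10, g=12, f_display=10, f_cam=35, L_obj=1000, M=0.25))
```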
3. Experimental Results
The experiment first verified that the improved SuperPoint feature point detection algorithm can reconstruct the 3D model of an object more accurately and realistically. On this basis, the real–virtual fusion elemental image array combining the reconstructed real object with virtual models, as well as its optical 3D reconstruction effect, was demonstrated experimentally.
A standard large-scale test dataset, castle-P30, and a self-captured small-scale rockery scene with more textures and details were selected as the real scenes. The image sets consisted of 30 building images and 23 rockery images captured from different angles, with sizes of 3072 × 2048 and 2306 × 4608 pixels, respectively. Figure 6 shows a subset of the images from different views.
In order to verify the effectiveness of the proposed method, three sparse point cloud reconstructions were carried out using different feature point detection algorithms. The reconstruction results are shown in Figure 7, and Table 1 lists the number of point-cloud points generated. Figure 7 and Table 1 demonstrate that the proposed method generates approximately one-third more points than the single SIFT algorithm and also improves significantly on [30]. Furthermore, the proposed method exhibits superior accuracy in feature point detection and a stronger capability to generate 3D sparse point clouds, particularly in areas with weak textures such as the base and center of the rockery. This improvement is attributed to the incorporation of Gaussian weight coefficients to process the pixel blocks extracted for the Softargmax function. By reducing the weight of the edge regions and increasing the weight of the central regions, this approach mitigates the impact of isolated high pixel values at the edges of the pixel blocks on the results of the Softargmax calculation.
For the obtained sparse point cloud, this study performed patch reconstruction using patch-based multi-view stereo (PMVS). Through the Poisson surface reconstruction operation in MeshLab, a smooth 3D model of the real object was obtained. The surface texture information of the model was recovered using the captured images of the object. As a result, a highly restored, detailed, and vivid 3D model was achieved, as shown in Figure 8.
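As a hedged illustration of this densification and meshing stage (the study used PMVS and MeshLab; the Open3D calls below are a stand-in showing the same Poisson reconstruction idea, with assumed file names and parameters):

```python
import open3d as o3d

# Load the dense point cloud exported from the multi-view stereo step (hypothetical file name).
pcd = o3d.io.read_point_cloud("rockery_dense.ply")

# Poisson surface reconstruction needs consistently oriented normals.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(20)

# Fit a watertight triangle mesh; a higher depth preserves more surface detail.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)

# Save the smooth model (low-density vertices could be trimmed using `densities`).
o3d.io.write_triangle_mesh("rockery_mesh.ply", mesh)
```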
The 3D model and texture map of the real object rockery were imported into Blender to form a real–virtual fusion scene with the virtual model Mario, as shown in Figure 9. Then, a virtual camera array was established directly in front of the scene to generate the real–virtual fused elemental image array. The parameters of the virtual camera array and the optical lens array are shown in Table 2, and the real–virtual fused elemental image array is shown in Figure 10.
To assess the quality of the displayed real–virtual fused elemental image array, virtual and optical reconstructions for the left, right, center, top, and bottom views were carried out, as shown in Figure 11 and Figure 12.
The experimental results show that combining real scenes with virtual models and utilizing integral imaging virtual capture can generate an elemental image array containing both real and virtual content in a single step, greatly enriching the 3D scene. The introduction of Gaussian weighting coefficients helps to better address the edge-region issue in the SuperPoint sub-pixel feature detection algorithm. The edge region often contains isolated points with large pixel values, which can significantly affect the accuracy of the computation. By reducing the weight of the edge region, the interference caused by these isolated points can be minimized; at the same time, increasing the weight of the central region allows the positions of the main feature points to be captured better, thus improving the accuracy of feature point detection. Therefore, the proposed method is able to adapt to feature point distributions in various scenarios, enhancing the quality and authenticity of the reconstruction results. As seen in Figure 8 and Figure 12, the detailed features of the rockery are well preserved, enabling a more realistic and vivid optical reconstruction of the 3D images. This means that more detailed information is captured during the reconstruction process, making the final result more realistic, which is crucial for the visualization of and interaction with real scenes.
4. Conclusions
In this paper, the process of generating real–virtual fused 3D scenes based on integral imaging is introduced. The improved SuperPoint algorithm enhances the accuracy of feature point detection by adding Gaussian weighting coefficients to the Softargmax-based sub-pixel feature point detection. Through sparse and dense point cloud reconstruction and Poisson surface reconstruction, a highly restored, detailed, and vivid 3D model is achieved. Then, integral imaging is used to generate and reproduce the real–virtual fused 3D images by combining the reconstructed model with the virtual model. The experimental results show that more realistic and vivid real–virtual fusion 3D images can be optically reconstructed. This method enhances user immersion and experience by providing more intuitive and interactive visual effects, allowing users to better understand and explore scenes. It also helps save modeling and rendering costs and time, enabling the rapid generation of diverse scenes while reducing the need for additional equipment and resources. Users can modify virtual models or replace real objects according to their needs, achieving a more personalized scene presentation. Overall, fusing real object modeling with virtual models can provide more realistic, rich, and interactive 3D scenes, offering users a better visual experience and immersion. The proposed method may be widely used in the fields of virtual reality (VR), augmented reality (AR), and training simulations in various industries, allowing learners to gain an experience closer to actual scenarios. This is particularly important for industries that involve complex operations or high-risk work environments, such as aerospace, medical surgery, and hazardous chemical handling. In the future, we will explore the combination of other deep learning methods to further improve the 3D reconstruction effect.