1. Introduction
3D scanning has been important in industry for several decades. Existing methods can be divided into passive and active types. The active type usually employs a device that emits light (typically in the near-infrared band) and receives its reflection to measure distances or sense surface variation, based on the Time-of-Flight (TOF) [1] or light-encoding (e.g., Kinect) [1] principles, respectively. The passive type, in contrast, computes three-dimensional surfaces or distance values from the natural light reflected from object surfaces. The passive method is characterized by low hardware cost but a large computational load, whereas the active method has a high hardware cost and large measurement noise.
“Depth from Focus” (DFF) [2,3] algorithms were developed for 3D measurement or scanning several years ago. In contrast to traditional stereo vision, they use a monocular camera that varies its focal length or imaging-plane position for the same scene. For example, the focal-sweep camera [4] uses a high-speed image sensor translated with respect to the lens (often over a duration of 200~500 ms to capture a stack of 24~60 images), or a liquid lens whose focal length is electronically adjusted as a function of time. Light-field cameras (such as Lytro, CA, USA and Raytrix, Hamburg, Germany) can also capture an instantaneous focal stack by trading off spatial resolution. Since all pictures for DFF are taken from the same viewing direction, the occlusion problem encountered in stereo matching is avoided. Here we investigate DFF techniques to synthesize the All-In-Focus (AIF) image and estimate the corresponding depth map from a multi-focus image stack [5].
Principally, objects located at different distances appear with different degrees of focus in the resulting pictures due to the limited depth of field of the optical system. By varying the camera settings (e.g., the focal length of the lens or the position of the imaging plane) for the same scene at the same distance, a focal image stack can be obtained, in which each image presents different in-focus regions. DFF techniques can reconstruct an AIF image by extracting the best-focused pixels from the focal stack and recording the corresponding frame indices. Since each frame index in the stack corresponds to a focal length or an imaging-plane position, the object-to-lens distance p can be calculated from the well-known optical imaging geometry: 1/p + 1/q = 1/f, where f is the focal length and q is the distance between the lens and the imaging plane. The depth map can then simply be expressed as the index map that leads to the best focus.
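As a minimal illustration of this relation (assuming each stack index maps to a known lens-to-imaging-plane distance q, here in millimeters), the object distance p follows directly from the thin-lens formula:

```python
def object_distance(f_mm: float, q_mm: float) -> float:
    """Solve the thin-lens relation 1/p + 1/q = 1/f for the object distance p."""
    return 1.0 / (1.0 / f_mm - 1.0 / q_mm)

# Hypothetical example: a 50 mm lens with the imaging plane at 52 mm
# focuses on the object plane located 1300 mm in front of the lens.
p = object_distance(50.0, 52.0)  # 1300.0 mm
```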
The success of DFF/AIF techniques relies on a reliable focus measure for image patches. A focus measure operator typically applies a transformation to the original image patch to emphasize its sharpness, and the energy of the transformed patch is then taken as the focus-level estimate. Traditional transforms often estimate the high-frequency content in a local window to indicate the focus level, e.g., Laplacian filtering [6] and variation-based approaches [2]. Chen et al. [3], however, apply Gaussian (low-pass) filtering to blur the target image and then compare the blurred result with the original one; the difference reveals the focus level of the original image. The image quality measure (IQM) [7] computes the average gradient of the pixels within a window. In [8], the modulation transfer function (MTF) is calculated as the ratio between image contrast and sharpness to serve as the focus metric. In [9], the surface areas of the enclosed region around the same pixel in differently focused input images are computed and compared to distinguish focused from blurred regions. In [10], Li et al. present a Multi-scale Image Analysis (MIA) technique to determine the focusing properties of input image pixels; however, their metric still misjudges smooth regions and requires a block-based consistency-verification procedure for correction. All of the above metrics may still yield high focus measurements for blurred or smooth regions because of noise or image degradation, which degrades the reconstructed AIF image when the maximum-selection rule is adopted.
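For concreteness, the sketch below shows one classical focus measure of this family, the local energy of the Laplacian response; it is only an illustrative stand-in for the operators cited above, not the specific method of any of [2,6,7,8,9,10].

```python
import cv2
import numpy as np

def laplacian_focus_measure(gray: np.ndarray, win: int = 9) -> np.ndarray:
    """Per-pixel focus measure: local energy of the Laplacian response.

    Sharply focused areas contain strong high-frequency content and thus a
    large Laplacian energy; blurred or smooth areas score low (and, as noted
    above, may still be misjudged in the presence of noise).
    """
    lap = cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F)
    return cv2.boxFilter(lap * lap, -1, (win, win))
```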
DFF algorithms can be categorized as pixel-, block-, or region-based [11], depending on the area over which the focus measure is computed. In pixel-based algorithms [12,13], a pixel in the AIF image is often calculated as a weighted average of the co-located pixels in the original focal stack; however, such methods yield a low-quality or noisy AIF image in the presence of noise. Pertuz et al. [13] proposed a selective weighting scheme (a linear combination of selected pixels with higher focus measures) to reduce the noise in the AIF image. Other methods apply post-optimization [14] to the resulting weight maps before image fusion is performed. For block- [15,16] or patch-based algorithms, the regular block shape often causes blocking or ringing artifacts and tends to fail near region boundaries. To solve this problem, [11] proposed a region-based algorithm in which the focus measure is calculated for each segmented region of arbitrary shape: the “average image” calculated from the focal stack is segmented by the well-known mean-shift algorithm to define the initial regions. In their work, however, the region definitions are fixed and never refined. Lee et al. [17] also proposed region-adaptive fusion of focal-stack images. A two-level DWT (Discrete Wavelet Transform) is first applied to each frame of the focal stack, and the focus profile of each pixel is then calculated from the detailed high-frequency sub-bands. All pixels are classified into three kinds of regions (according to the number of peaks in the focus profile), and different fusion rules are applied to the different kinds of regions. Note that in their work, pixels classified into the same kind of region are not necessarily extracted from the same image for the AIF result; they merely share the same fusion strategy. Zhang et al. [18] proposed finding the boundaries between focused and defocused regions, from which the source images can be naturally separated into regions with the same focus condition. Their method, however, relies on multi-scale morphological gradient operators to improve the precision of boundary detection and focused-region detection. Adaptive region segmentation can also be achieved via spatial quadtree decomposition [16,19,20], which defines hierarchical regions for the focus measures. In contrast to the arbitrary region shapes in [11] and [18], quadtree methods require a metric to decide whether a block is WTA (Winner-Take-All)-fused or decomposed into four smaller ones; they may, however, suffer from over-segmentation into small regions due to the regular quad decomposition. Moreover, these works focused on fusion of two images only, and no experiments were reported for multi-focus image sets containing more than two images.
In terms of the fusion strategy, AIF algorithms can be categorized as focal-weighting [13], WTA [12,21], or weight-optimized [14]. The focal-weighting method computes a weight of 0.1–1.0 for each image in the stack when synthesizing a specific pixel or block at a given coordinate along the focal axis, i.e., the weights are focal-position-dependent. The weight-optimized algorithms further refine the pixel-dependent weights subject to certain smoothness conditions; although they can obtain better AIF and depth estimates, they are time-consuming and unsuitable for real-time applications. WTA, on the other hand, can be viewed as a special case of the focal-weighting strategy in which only one focal position in the stack is selected and given a weight of 1.0, while all others have zero weight; it is simpler and popular in many applications. Several modified algorithms improve on the WTA scheme. For example, [21] proposed using smoothing kernels of gradually changing size to eliminate visual artifacts in the boundary regions of the initially WTA-fused AIF image. Liu et al. [22], on the other hand, proposed a CNN model for simultaneous activity-level measurement (feature extraction) and fusion-rule design (classification); this deep-learning approach, though new, is not suited to generating the depth of the scene. Xiao et al. [23] first extracted image depth information through an inhomogeneous diffusion equation simulating the optical imaging system, classified pixels into three types of regions (clear, fuzzy, and transition) according to the depth information, and finally generated the fused image from the clear and transition pixels. Their method actually belongs to DFD (Depth from Defocus, in contrast to the DFF considered here), which often suffers from inaccurate depth estimation from a limited number (often one or two) of defocused images. Some methods [24,25,26] construct global focus-detection algorithms, making them free of block artifacts and reducing the loss of contrast in the fused image. For example, references [24,26] proposed decomposing each multi-focus source image into cartoon and texture content; the two contents are fused separately and then combined to obtain the all-in-focus image. Unfortunately, only fusion results for two source images were reported. In [26], the authors also applied their cartoon/texture decomposition and sparse-representation algorithm to multi-modality (such as medical PET/MRI, or infrared/visible) image fusion.
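To make the WTA rule concrete, the following sketch (a generic pixel-wise version, assuming a per-pixel focus measure has already been computed, and not the specific method of any reference above) selects, for every pixel, the stack frame with the highest focus measure; the winning index doubles as a quantized depth index.

```python
import numpy as np

def wta_fuse(stack: np.ndarray, focus: np.ndarray):
    """Winner-take-all fusion along the focal axis.

    stack: (K, H, W) focal stack of grayscale images.
    focus: (K, H, W) per-pixel focus measures for each frame.
    Returns the fused AIF image and the winning frame index per pixel.
    """
    idx = np.argmax(focus, axis=0)                         # (H, W) winner per pixel
    aif = np.take_along_axis(stack, idx[None], axis=0)[0]  # pick winning intensities
    return aif, idx
```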
Although many AIF algorithms have been proposed, a large portion of them target a stack of only two images and are thus unsuitable for extension to depth estimation from the larger image stacks (often several dozen images) found in industrial applications. As introduced above, region-based methods trade off between pixel- and block-based algorithms in terms of complexity and quality; however, a content-adaptive region-determination algorithm has seldom been developed. These two observations motivate this work. In this paper, we extend our prior work [1] to propose a region-based (in the spatial domain) and WTA-based (in the focal domain) algorithm that overcomes the above two problems, and we test it on industrially captured image datasets.
Figure 1 illustrates our iterative region-splitting algorithm for AIF image fusion and depth estimation. Differing from traditional region-based algorithms (e.g., [11,16,18,19,20]), our definition of focusing regions is subject to iterative focal-spatial analysis (rather than spatial analysis only, as in [11,18]). Placing no restriction on the manner of region splitting (unlike the quadtree methods [16,19,20]) also makes our algorithm less prone to blocking artifacts.
First, the image domain is segmented into initial regions based on an “initial AIF image” synthesized by a simple pixel-wise focal-weighting scheme (Section 2). Focus measures are then computed for each region at the different focal positions. By analyzing the focus profile (the curve of the regional focus measure along the focal axis) of each region, we determine whether the region should be split. If the focus profile meets the no-splitting criterion, WTA fusion along the focal axis is performed to obtain the fused result for the region. Otherwise, the region is split into subparts by spatial analysis, and each part undergoes the same recursive process of focus computation and analysis. The depth map is obtained along with the AIF image by assigning to each region the index of the frame with the best focus measure (each index corresponds to a focal position and hence an object distance). To sum up, we propose a region-based (in the spatial domain) and WTA-based (in the focal domain) algorithm for DFF.
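The overall loop of Figure 1 can be outlined as the recursive sketch below. The helpers focus_measure, should_split, and split_region are hypothetical placeholders for the regional focus measure, the no-splitting criterion, and the spatial-analysis splitting step described in the following sections; only the control flow is meant to be illustrative.

```python
import numpy as np

def fuse_region(stack, mask, focus_measure, should_split, split_region):
    """Recursive region fusion following the flow of Figure 1 (sketch only).

    stack:         (K, H, W) focal stack.
    mask:          boolean (H, W) mask of the current region.
    focus_measure: (stack, mask) -> length-K focus profile of the region.
    should_split:  (profile)     -> True if the no-splitting criterion fails.
    split_region:  (stack, mask) -> list of sub-region masks (spatial analysis).
    Returns a list of (mask, best_frame_index) pairs; the AIF image and depth
    map are assembled by copying stack[best_frame_index] into each mask.
    """
    profile = focus_measure(stack, mask)          # focus profile along the focal axis
    if not should_split(profile):
        return [(mask, int(np.argmax(profile)))]  # WTA along the focal axis
    results = []
    for sub_mask in split_region(stack, mask):    # split spatially and recurse
        results += fuse_region(stack, sub_mask, focus_measure,
                               should_split, split_region)
    return results
```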
2. Initial AIF Image Computation and Region Segmentation
First, we define a focal stack
(where
k is the image index corresponding to an imaging-plane position and
K is the number of images contained in the stack). Our aim is to synthesize an AIF image
P and estimate the depth map
D from
I. It is known that the high frequency strength around a pixel can be used as a metric of focusing. High-focusing pixels will be given larger weights in image fusion from the focal stack. The following formula [
14] is used here:
$$w_i^{k}=\frac{\left[\operatorname{erf}\!\left(\lvert\nabla I_i^{k}\rvert/(\sqrt{2}\,\sigma_{k})\right)\,\nu_i^{k}\right]^{K}}{\sum_{m=1}^{K}\left[\operatorname{erf}\!\left(\lvert\nabla I_i^{m}\rvert/(\sqrt{2}\,\sigma_{m})\right)\,\nu_i^{m}\right]^{K}} \qquad (1)$$

where the subscript $i$ represents the pixel index, the superscript $k$ is the image index, $\nabla$ stands for the gradient, $\sigma_{k}^{2}$ indicates the variance of gradients for the $k$-th image, $\operatorname{erf}(\cdot)$ stands for the Gaussian error function, $\nu_i^{k}$ represents the frequency of non-zero gradients around pixel $i$ in the $k$-th image, $w_i^{k}$ is the weight at pixel $i$ of the $k$-th image, and $K$ here is an exponent. Therefore, the $k$-th image in the stack has its corresponding pixel-weighting map.
The weighting map stack can then be adopted to synthesize an initial AIF image $P$ as:

$$P_i=\sum_{k=1}^{K} w_i^{k}\, I_i^{k} \qquad (2)$$

where $I_i^{k}$ stands for the intensity of pixel $i$ in the $k$-th image, and $P_i$ is the value of pixel $i$ in $P$. Obviously, all $w_i^{k}$'s sum to 1.0 for any given pixel $i$.
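As a minimal sketch of this fusion step (assuming the per-pixel weights of Equation (1) have already been computed and stacked into an array), the initial AIF image of Equation (2) is just a normalized weighted average along the focal axis:

```python
import numpy as np

def initial_aif(stack: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Pixel-wise weighted fusion of the focal stack (Equation (2)).

    stack:   (K, H, W) focal stack I.
    weights: (K, H, W) per-pixel focus weights w_i^k (Equation (1)).
    The weights are (re)normalized here so that they sum to 1.0 over k
    for every pixel i, and the initial AIF image P is their weighted sum.
    """
    w = weights / np.maximum(weights.sum(axis=0, keepdims=True), 1e-12)
    return (w * stack).sum(axis=0)
```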
A region-segmentation technique, e.g., the mean-shift segmentation algorithm [27] in OpenCV, is applied to the initial AIF image $P$ to get an initial region set

$$R=\{R_{1},R_{2},\ldots,R_{r}\}, \qquad (3)$$

where $r$ is the number of segmented regions and depends on parameters (e.g., “spatialRadius” and “colorRadius”) that jointly control the color and spatial averaging used to form the segmentation. Compared to the “average image” derived in [11], i.e.,

$$A_i=\frac{1}{K}\sum_{k=1}^{K} I_i^{k}, \qquad (4)$$

our initial AIF image $P$ (Equation (2)) achieves better focusing, yields more accurate segmentation, and hence better/faster convergence of the region splitting (see the experimental section).
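One possible way to obtain such an initial segmentation with OpenCV is sketched below: mean-shift filtering followed by a simple grouping of the filtered colors. The radius values and the grouping step are illustrative assumptions, not the parameter settings or exact procedure of the paper.

```python
import cv2
import numpy as np

def initial_regions(aif_bgr: np.ndarray, spatial_radius: int = 20,
                    color_radius: int = 30) -> np.ndarray:
    """Initial region labels from the initial AIF image via mean-shift.

    spatial_radius and color_radius play the roles of OpenCV's
    "spatialRadius" and "colorRadius" parameters; the values used here
    are placeholders chosen for illustration.
    """
    shifted = cv2.pyrMeanShiftFiltering(aif_bgr, spatial_radius, color_radius)
    # Coarse grouping: pixels that share the same filtered color get one label.
    flat = shifted.reshape(-1, 3)
    _, labels = np.unique(flat, axis=0, return_inverse=True)
    return labels.reshape(aif_bgr.shape[:2]).astype(np.int32)
```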
5. Experimental Results
Four focal stacks for industrial use were captured by adjusting the camera imaging plane to different positions. As shown in Figure 9, the four test image sets are: “Battery” (65 images), “Screw” (48 images), “PCB” (48 images), and “Tool” (7 images), all of 640 × 480 pixels. The parameters that control the final region segmentation and the formation of x-type regions are set empirically to 0.8 (Figure 8), 0.67 * K, and ρ = 53 (Section 4).
Figure 10 shows the region segmentation for the four test images, where x-type regions are colored black. The initial and final numbers of segmented regions are listed below, and the numbers of x-type regions are given in parentheses.
- (1) “Battery”: 299 → 367 (x-type: 15)
- (2) “Screw”: 242 → 256 (x-type: 15)
- (3) “PCB”: 343 → 367 (x-type: 49)
- (4) “Tool”: 102 → 102 (x-type: 0).
It can be observed that most of the x-type regions occur in plain backgrounds that have little texture for focus measurement, e.g., the top background of “Battery” and “Screw”. This, however, does not cause any difficulty in identifying the SP property for the bottom background of “Screw” and the whole background of “Tool”. For “PCB”, the number of regions increases by 24 after splitting, while 49 out of 367 regions are classified as x-type. This means that at least 49 − 24 = 25 regions in the initial segmentation are classified as neither SP nor TP type. According to Figure 10c, the x-type regions concentrate in the green background of the right part of the image, similar to the behavior of the green background in “Battery”. Notable is the result of “Tool”, where no regions are further split or classified as x-type. This good behavior also leads to better AIF synthesis and depth-map estimation (see the results below).
Figure 11 compares four methods for AIF synthesis and depth-map estimation: (1) Helicon, (2) Zerene Stacker, (3) [12], and (4) our proposed method. For visual clarity, all depth maps are scaled to gray levels in 0~255. Helicon, a popular commercial software package, adopts a pixel-weighting strategy, i.e., it calculates a weight for each pixel according to the image content and then obtains the final fused result by weighting co-located pixels from all source images. Zerene Stacker, a commercial package considerably more expensive than Helicon, features accurate and robust alignment, scale correction by interpolation, and an advanced stacking algorithm; it can also generate stereo and 3-D rocking animations from a single stack. Reference [12] is a pixel- and WTA-based algorithm enhanced with post-processing on the resulting AIF image. It also applies a point-spread function to the filtered AIF image to generate the re-focused image and depth map, whose number of levels is actually larger than the original image number K. Since the trial version of Zerene Stacker does not provide the resulting depth maps, they are not shown in Figure 11. The depth maps estimated by [12] differ from Helicon's and ours because of this recalculation using the point-spread function.
It is observed from Figure 11 that both our proposed algorithm and Zerene Stacker achieve better synthesis quality near the electrodes and bodies of “Battery”, though Zerene Stacker seems to perform better on the table surface. For “Screw”, Zerene Stacker's result presents two dark spots, while Helicon wrongly estimates the depths of the background area and blurs the top boundaries of the screw. For “PCB”, Helicon presents geometrical distortions near the top-left portion of the image, defocusing at “1”, and distorted bright spots at “2” and “3”; Zerene Stacker shows defocusing and redundant textures, as indicated by the red circles; [12] gives ripples around object boundaries. For depth estimation, our algorithm has some errors, while Helicon blurs the depth boundaries near objects. For “Tool”, our proposed algorithm and [12] show the best results, while Helicon and Zerene Stacker exhibit slight defocusing; in terms of the depth map, Helicon and [12] show several errors. In Figure 11, the depth maps of [12] are generally noisy (especially in the background areas of “Battery” and “Screw” and the right part of “PCB”) and may contain a pattern copied from the color image.
Figure 12 shows the results when the initial AIF image P calculated by Equation (2) is replaced with the average image calculated by Equation (4) [11], which is then used for the same procedures of initial region segmentation and iterative region splitting. The focusing improvement of Equation (2) over Equation (4) for initial AIF image generation can be seen by comparing the parts in the red circles. Obviously, Equation (2) provides better AIF results than Equation (4).
To evaluate the AIF fusion performance objectively, the metric proposed in [30] (Q, whose values lie between 0.0 and 1.0, with larger values meaning better quality) is measured for the four methods under comparison. Table 1 lists the results, where “*” marks the winner. Our proposed algorithm outperforms the others except on the “Screw” set, which could be due to some spots present in the bottom background. However, Figure 11 reveals our superiority in depth estimation over Helicon for the top and bottom backgrounds of “Screw”.
Table 2 compares the computing times. The computing platform is an Intel Core i7-3770 at 3.40 GHz with 4 GB RAM for Helicon, Zerene Stacker, and our proposed method; the execution time of [12] was, however, measured on a 2.8 GHz CPU with 16 GB RAM. It is observed that our algorithm (with non-optimized code) is faster than Helicon and Zerene Stacker for nearly all four image sets (with average gains, defined as (compared_time − our_time)/compared_time, of 17.81% and 40.43%, respectively), while slightly slower than [12] (−15.54%), especially for small image sets (like “Tool”).
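For clarity, the gain figure quoted above is computed per image set and then averaged; a tiny example with hypothetical timings (not the values of Table 2) is:

```python
def gain(compared_time: float, our_time: float) -> float:
    """Relative speed gain: (compared_time - our_time) / compared_time."""
    return (compared_time - our_time) / compared_time

# Hypothetical timings in seconds, for illustration only.
print(gain(10.0, 8.0))  # 0.20  -> 20% faster than the compared method
print(gain(7.0, 8.0))   # -0.14 -> a negative gain means our method is slower
```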