1. Introduction
Components of indoor scenes are meaningful parts of indoor objects. Detecting indoor scene components and acquiring their spatial relationships (e.g., adjacent to, to the right of, to the left of, etc.) is one of the most important research problems in the computer vision and graphics community. As pointed out in many studies [1,2,3], the acquisition of indoor scene components and their spatial relationships benefits many computer vision tasks, such as indoor scene reconstruction and indoor scene understanding [4,5].
Two main difficulties arise during the detection of indoor scene components and their spatial relationships: (1) indoor scene components often have varied shapes and complex three-dimensional (3D) geometry, and they frequently occlude each other, which makes it challenging to detect complex-shaped components from point clouds; (2) due to the diverse internal structures of indoor objects and their cluttered arrangement, the spatial relationships between indoor scene components are complex and therefore difficult to extract.
Most approaches [6,7,8,9,10,11,12] for the detection of indoor scene components concentrate on approximating the components with primitive shapes (e.g., planes, cylinders, spheres, cuboids, etc.) and exploit 3D primitive shape segmentation algorithms, such as the Hough transform [13,14] and Random Sample Consensus (RANSAC) [15,16], to detect indoor scene components. In these approaches, the primitive shape features of indoor components must be assumed in advance, which is not suitable for complex-shaped indoor scene components.
Many methods [17,18,19] transform the scattered point clouds into 3D voxel grids and use spatial connectivity and geometric features to segment the indoor scene models. However, due to the sparsity of the point clouds, the voxel grids may contain empty voxels, which leads to redundant computations. Moreover, because of the different scales of objects in the indoor scene model and the non-uniform point cloud density, it is difficult to select a resolution that both accurately segments the components and preserves their boundaries.
With the availability of large 3D datasets and the popularity of machine learning techniques, some data-driven segmentation methods [20,21,22,23,24,25,26,27] have been proposed for indoor scene components. In earlier data-driven methods [20,21,27], indoor scene models are first segmented; the segmented results are then classified into different components based on handcrafted features by machine learning techniques, e.g., conditional random fields (CRFs), support vector machines (SVMs), and so on. Motivated by learning features directly from input point clouds, deep neural networks have recently been exploited. Qi et al. [22] designed a novel type of neural network (PointNet) that provides a unified architecture for feature classification directly from point clouds, on which the labelling of object components is performed. Following PointNet, other deep neural networks have been proposed, such as PointNet++ [28], the deep part induction network [23], the regularized graph convolutional neural network (RGCNN) [25], the semantic part decomposition network [29], and so on. Although progress in detecting complex-shaped components is impressive, these methods still struggle to discover new components whose types are not covered in the training sets.
There are also other methods. Balado et al. [30] proposed a method to detect floor elements based on relative distances. In [31,32], surface patches of indoor scene models were merged into components according to the consistency of their local convexity or non-local geometric signatures. Due to the poor connectivity caused by missing parts and outliers in point clouds, convexity-based methods are not reliable for detecting indoor scene components. The detection of complex-shaped indoor scene components thus remains challenging.
The extraction of the spatial relationships between indoor scene components lays a foundation for understanding the indoor scene in a way similar to how humans perceive the environment. Many methods [33,34,35] have been proposed to extract spatial relationships from scene images. In contrast with the spatial relationships in images, those in 3D point clouds are more complex [36,37], and their extraction is more challenging.
Recently, a few methods [38,39] have been proposed to extract spatial relationships from indoor point clouds based on machine learning techniques such as SVMs and latent max-margin learning. However, it is difficult to build a fixed parametric model for training due to the complexity of 3D spatial relationships. To address this problem, Wald et al. [40] recently used deep learning techniques to train and predict spatial relations. The deep learning-based method has shown promise in the extraction of certain spatial relationships. However, spatial relationships in 3D space are complex: it is difficult to obtain salient features that distinguish different spatial relationships and to effectively divide the spatial relationships into different categories based on those features.
On the other hand, some approaches [41,42,43,44] have been proposed to extract spatial relationships from indoor point clouds based on prior spatial knowledge. For example, Zender et al. [42] presented an ontology to encode the spatial and functional knowledge of typical indoor environments. Suchan and Bhatt [43] adopted prior knowledge to extract commonsense spatio-temporal relations. Most existing knowledge-based methods have aimed to provide route navigation for indoor robots or to model specific interactions between humans and indoor objects, and they mainly focus on inter-object or human-centric spatial relationships. As finer-grained elements of scenes, indoor scene components have spatial relationships that are also affected by the structure of indoor objects. Accordingly, it is more difficult to detect the spatial relationships between indoor components.
In this paper, we present a framework to segment out the indoor scene components and detect their spatial relationships. Our method is based on a slice strategy, inspired by the methods in [45,46,47], where components of complex-shaped indoor objects were segmented based on the similarity of 2D profiles. Our key insight is twofold: (1) slices of indoor scene components always have spatial proximity and similar profiles, regardless of whether the components are simply or complexly shaped; (2) the spatial topological relationships between indoor scene components can be effectively preserved by slicing the indoor scene layer by layer.
We use the slice strategy to obtain many slices of indoor scene models and convert each slice into a set of profiles, then progressively merge the profiles of neighboring slices into different components based on spatial proximity and similarity. Next, we geometrically establish relationships between the detected indoor scene components on the basis of two geometric distances. Meanwhile, an ontology is built to model the semantic knowledge about the spatial relationships between indoor scene components, and the geometrically correlated indoor scene components are loaded to populate the ontology. Finally, the spatial semantics of the relationships are inferred, and a semantic graph of spatial relationships (SGSR) is produced to organize the indoor scene components and their spatial relationships.
The contributions of this paper can be summarized as follows:
- (1)
We propose a slice-guided algorithm to detect complex-shaped indoor scene components from point clouds. The detected components are faithful to the meaningful parts of indoor objects;
- (2)
We present a framework for modelling indoor scene components and their spatial relationship structure, which lays a foundation for subsequent object detection, semantic analysis, and understanding of indoor scenes.
The remainder of the paper is organized as follows. Section 2 presents a brief review of the extraction of indoor scene components and their spatial relationships. Section 3 gives an overview of the proposed method. Section 4 describes how to detect indoor scene components on the basis of the clustering of profiles. Section 5 elaborates on the inference of spatial relationships between indoor scene components. The experimental results are presented in Section 6. The limitations of our method and proposals for future research are given in the last section.
4. Detection of Indoor Scene Components
4.1. Slicing and Resampling of Indoor Point Clouds
To slice the indoor scene model effectively, a slicing coordinate system is constructed, in which the center of the bounding box of the ground is taken as the origin, the upward normal of the ground is selected as the z-axis, and two arbitrary orthogonal axes on the ground are chosen as the x-axis and y-axis.
It is observed that most indoor objects are placed upright on the ground. Therefore, the ground can be labeled through a simple direction search. The specific process is as follows. (1) Compute the oriented bounding box (OBB) of the indoor scene model and obtain the outer planes $Pl_i$, $i = 0, 1, 2, \dots, m$ ($m \le 5$), that correspond to the ground, the walls, and the ceiling. (2) Filter out the points belonging to the outer planes and segment the remaining indoor scene model $PC$ into point sets $\{P_i\}$ by a k-nearest-neighbor (KNN) algorithm. (3) Take each $Pl_i$, $i = 0, 1, 2, \dots, m$ ($m \le 5$), as a candidate ground, roughly regard each point set $P_i$ as an object, and generate an OBB for each point set; moreover, under the assumption that most objects are parallel to the ground, we enforce this constraint in the OBB computation, i.e., one orientation of each OBB is parallel to the plane $Pl_i$. (4) The plane $Pl_i$ that has the largest number of OBBs closest to itself is identified as the ground.
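As a minimal illustration of step (4), the following Python sketch votes for the ground among the candidate outer planes: each object OBB votes for the candidate plane it lies closest to. The plane and OBB extraction of steps (1)–(3) is assumed to be done elsewhere, and using OBB centers for the point-to-plane distance is a simplifying assumption.

```python
import numpy as np

def point_plane_distance(p, plane_point, plane_normal):
    """Unsigned distance from point p to the plane (plane_point, plane_normal)."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return abs(np.dot(p - plane_point, n))

def identify_ground(candidate_planes, obb_centers):
    """candidate_planes: list of (plane_point, plane_normal) pairs (numpy arrays).
    obb_centers: (n, 3) array of object OBB centers.
    Returns the index of the candidate plane identified as the ground."""
    votes = np.zeros(len(candidate_planes), dtype=int)
    for c in obb_centers:
        dists = [point_plane_distance(c, pt, nrm) for pt, nrm in candidate_planes]
        votes[int(np.argmin(dists))] += 1  # each OBB votes for its closest plane
    return int(np.argmax(votes))
```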
Motivated by the aim of ensuring that enough geometric features are included in each slice, we characterize an indoor scene slice as an indoor scene section with a thickness of $l$. The thickness $l$ is computed by $l = \lambda \rho$, where $\rho$ denotes the density of the point clouds and $\lambda$ is a density factor. $\rho$ is formulated as

$\rho = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \left\| p_i - p_k \right\|,$

where $p_i$ denotes a point of the indoor point clouds, $p_k$ is the $k$-closest point of $p_i$, and $N$ is the number of points. $K$ is set to 6.
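Under the reconstruction above, $\rho$ is the average distance from each point to its $K$ nearest neighbors. A short sketch of this computation, assuming SciPy's k-d tree, is given below.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_cloud_density(points, K=6):
    """points: (N, 3) array. Returns the mean K-nearest-neighbor distance."""
    tree = cKDTree(points)
    # query K+1 neighbors because each point's nearest neighbor is itself
    dists, _ = tree.query(points, k=K + 1)
    return float(dists[:, 1:].mean())

def slice_thickness(points, lam=0.34, K=6):
    """Slice thickness l = lambda * rho, with the paper's final lambda = 0.34."""
    return lam * point_cloud_density(points, K)
```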
The slicing position is initialized at the point that has the minimum z-axis value in the slicing coordinate system. Starting from the initial slicing position (where the lower slicing plane is located), we iteratively slice the input indoor point clouds from bottom to top using two slicing planes advanced by a step size $h$ along the slicing direction, as seen in Figure 2. For each slice, the plane that is parallel to the two slicing planes and equidistant between them is defined as the projection plane. The point set of each slice is then projected onto the projection plane by setting the z-axis value of each point to the z-axis value of the intersection of the projection plane and the z-axis.
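The slicing and projection steps can be sketched as follows, assuming the points are already expressed in the slicing coordinate system (so slicing advances along the z-axis); the function name is illustrative.

```python
import numpy as np

def slice_and_project(points, l, h):
    """Yield the projected point set of each slice of thickness l,
    advancing the lower slicing plane by step h from the lowest point."""
    z = points[:, 2]
    z_low = z.min()
    while z_low < z.max():
        in_slice = points[(z >= z_low) & (z < z_low + l)]
        if len(in_slice) > 0:
            projected = in_slice.copy()
            # project onto the plane midway between the two slicing planes
            projected[:, 2] = z_low + 0.5 * l
            yield projected
        z_low += h
```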
The projected point set of a slice is first divided into subsets by the clustering algorithm of [51]. Each subset is then thinned using the Moving Least Squares (MLS) method [52] and thereafter resampled into a profile with an interval $d$. The size of $d$ is calculated as

$d = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \left\| p_i - p_k \right\|,$

where $p_i$ is a point of the indoor point clouds and $p_k$ is the $k$-closest point of $p_i$. $K$ is set to 5. Figure 2b shows one of the slices of an object and the resampled point set, i.e., the profile of the slice. Figure 2c shows all the profiles of an indoor object.
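As an illustration of the resampling step, the sketch below walks an already thinned and ordered 2D polyline (the assumed output of the MLS step) and emits points at arc-length interval $d$; the availability of an ordered polyline is an assumption.

```python
import numpy as np

def resample_polyline(polyline, d):
    """polyline: (n, 2) array of ordered 2D points.
    Returns points spaced d apart along the polyline's arc length."""
    seg = np.diff(polyline, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    targets = np.arange(0.0, arc[-1], d)
    # linear interpolation of x and y against arc length
    x = np.interp(targets, arc, polyline[:, 0])
    y = np.interp(targets, arc, polyline[:, 1])
    return np.stack([x, y], axis=1)
```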
Note that some special subsets do not need to be resampled. We divide the minimum bounding box (MBB) of each subset into many sub-rectangles and label the sub-rectangles that include one or more projected points, then count the labeled sub-rectangles and the total sub-rectangles. If the ratio of the number of labeled sub-rectangles to the total number of sub-rectangles is greater than 0.7, the subset does not need to be resampled. More details about dividing the MBB into sub-rectangles can be found in [49]. These special subsets directly constitute a special kind of component of the indoor scene model; we refer to them as horizontal plane components (horizontal planes for short).
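The occupancy test for horizontal planes can be sketched as below; an axis-aligned bounding box stands in for the MBB, and the grid resolution is an illustrative choice.

```python
import numpy as np

def is_horizontal_plane(points_2d, grid=(20, 20), ratio_threshold=0.7):
    """points_2d: (n, 2) projected points of one subset.
    Returns True if the fraction of occupied sub-rectangles exceeds 0.7."""
    mins, maxs = points_2d.min(axis=0), points_2d.max(axis=0)
    extent = np.maximum(maxs - mins, 1e-9)      # avoid division by zero
    idx = np.floor((points_2d - mins) / extent * grid).astype(int)
    idx = np.minimum(idx, np.array(grid) - 1)   # clamp points on the max edge
    occupied = np.zeros(grid, dtype=bool)
    occupied[idx[:, 0], idx[:, 1]] = True       # label occupied sub-rectangles
    return occupied.mean() > ratio_threshold
```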
To obtain an appropriate value of the density factor $\lambda$, we performed experiments on indoor scene models with different densities. By down-sampling the indoor point cloud at different rates, point clouds with different densities can be obtained. Given a tabletop scene model (Figure 3a), we chose the original model, the 50% down-sampled model, and the 25% down-sampled model for the experiments and set $\lambda$ to 0.14, 0.23, and 0.34 for each model. The results are shown in Figure 3b–l. The smaller $\lambda$ is, the thinner the slice, and the thinner the slice, the fewer points it contains; in severe cases, profiles cannot be obtained at all. As can be seen, the profiles are largely missing when $\lambda$ is 0.14 or 0.23, whereas good results are achieved when $\lambda$ is 0.34. In our work, $\lambda$ was set to 0.34.
The step size $h$ depends on the expected number of slices and affects the running time: the smaller $h$ is, the longer the slicing and resampling process takes, as shown in Table 1. Different slicing results of the scene are shown in Figure 4. $h$ was finally set to $1.0\,l$.
4.2. Clustering of Profiles
Let the total resampled point set of the indoor scene model be $S$, and let each profile be $s_i^j$ (i.e., the $j$th profile of the $i$th slice); then $S = \{ s_i^j \mid i = 1, \dots, N;\ j = 1, \dots, M_i \}$.
Given two profiles $s_i^j$ and $s_u^v$ of $S$, their spatial proximity and similarity are evaluated. To judge whether the two profiles are adjacent, their MBBs are calculated and denoted as $MBB_1$ and $MBB_2$, respectively. If $s_i^j$ and $s_u^v$ belong to neighboring slices and $MBB_1$ and $MBB_2$ overlap, $s_i^j$ and $s_u^v$ have spatial proximity.
For two profiles with spatial proximity, their similarity is further evaluated. A similarity measure $D(s_i^j, s_u^v)$ is designed, in which one term computes the distance between the shape context features [53] of the two profiles and another term approximates the scale ratio of the two profiles. If the similarity measure between $s_i^j$ and $s_u^v$ is smaller than a threshold $\varepsilon$, the two profiles belong to the same component. Starting from the initial profiles, the profile pairs in $S$ are iteratively clustered into different components, i.e., the profile clustering-based components.
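A hedged sketch of the pairwise test that drives the clustering is given below; combining the shape-context distance and the scale ratio term as a product, as well as the helper functions, are illustrative assumptions, with the shape-context distance `d_sc` left pluggable.

```python
import numpy as np

def mbb_overlap(box_a, box_b):
    """Axis-aligned 2D box overlap test; each box is (min_xy, max_xy)."""
    return np.all(box_a[0] <= box_b[1]) and np.all(box_b[0] <= box_a[1])

def scale_ratio(profile_a, profile_b):
    """Approximate scale ratio via bounding-box diagonal lengths (>= 1)."""
    da = np.linalg.norm(profile_a.max(axis=0) - profile_a.min(axis=0))
    db = np.linalg.norm(profile_b.max(axis=0) - profile_b.min(axis=0))
    lo, hi = sorted([da, db])
    return hi / max(lo, 1e-9)

def same_component(pa, pb, slice_a, slice_b, d_sc, eps=0.48):
    """pa, pb: (n, 2) profiles; d_sc: a normalized shape-context distance."""
    if abs(slice_a - slice_b) != 1:          # must lie on neighboring slices
        return False
    box_a = (pa.min(axis=0), pa.max(axis=0))
    box_b = (pb.min(axis=0), pb.max(axis=0))
    if not mbb_overlap(box_a, box_b):        # must be spatially proximate
        return False
    return d_sc(pa, pb) * scale_ratio(pa, pb) < eps
```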
To evaluate the effect of the threshold $\varepsilon$ on the clustering results, we set $\varepsilon$ to 0.38, 0.48, and 0.58, respectively; the clustering results are shown in Figure 5. It can be seen that a smaller $\varepsilon$ may result in over-segmentation (see the blue rectangle in Figure 5b), while a bigger $\varepsilon$ may result in under-segmentation (see the blue rectangle in Figure 5d). We set $\varepsilon$ to 0.48 in our work.
Note that some complex-shaped components may be over-segmented due to the profiles on some local surfaces of the components (see the blue rectangle in Figure 6a). To solve this problem, we locally adjust the slicing direction at such local surfaces in a way similar to the method in [45]. Specifically, given three components Comp1, Comp2, and Comp3, if a profile of Comp1 overlaps with neighboring profiles that belong to Comp2 and Comp3, respectively (see Figure 6a), we label the component set as ⟨Comp1, Comp2, Comp3⟩ and re-slice the raw points that correspond to Comp1. Rotated slicing directions are iteratively applied to the raw points (see Figure 6b) until the optimal slicing direction is found. Then we re-cluster the re-generated profiles together with the profiles of Comp2 and Comp3, and update the clustering results (see Figure 6c) according to the minimum-number-of-components principle.
Algorithm 1. Clustering profiles into indoor scene components.
Input: S = {s_i^j}, where s_i^j is the jth profile on the ith projection plane
Output: {Comp_l}
1. l = 0;
2. for i = 1:1:N
3.   for j = 1:1:M_i do Comp_l ← ∅;
4.     if s_i^j is not marked
5.       Comp_l ← s_i^j, mark s_i^j; // start a new component from s_i^j
6.       s_c ← s_i^j; u = i + 1; do
7.         search the spatially proximate and similar profile s_u^v of s_c in slice u;
8.         Comp_l ← Comp_l ∪ s_u^v; mark s_u^v; s_c ← s_u^v;
9.         u = u + 1;
10.      until s_u^v is not found
11.      l = l + 1; end if
12.    end for
13.  end for
14. search component sets ⟨Comp_p, Comp_q, Comp_k⟩ from {Comp_l}, p, q, k ∈ [0, l];
15. for each ⟨Comp_p, Comp_q, Comp_k⟩
16.   apply rotated-direction slicing to the raw points corresponding to Comp_p;
17.   re-generate profiles;
18.   re-cluster these profiles with the profiles of Comp_q and Comp_k, and update {Comp_l};
19. end for
20. output {Comp_l}
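For reference, a compact Python rendering of the greedy pass of Algorithm 1 (steps 1–13) is sketched below, assuming a `same_component` predicate such as the one sketched above (with the shape-context distance and threshold bound); the re-slicing stage (steps 14–19) is omitted.

```python
def cluster_profiles(profiles, same_component):
    """profiles[i] is the list of (n, 2) profile arrays of slice i.
    same_component(pa, pb, slice_a, slice_b) is the pairwise test.
    Returns components as lists of (slice, profile) index pairs."""
    marked = set()
    components = []
    for i in range(len(profiles)):
        for j in range(len(profiles[i])):
            if (i, j) in marked:
                continue
            comp, cur, u = [(i, j)], (i, j), i + 1
            marked.add((i, j))
            while u < len(profiles):
                # find an unmarked, proximate, similar profile in slice u
                match = next((v for v in range(len(profiles[u]))
                              if (u, v) not in marked and
                              same_component(profiles[cur[0]][cur[1]],
                                             profiles[u][v], cur[0], u)),
                             None)
                if match is None:
                    break
                comp.append((u, match)); marked.add((u, match))
                cur, u = (u, match), u + 1
            components.append(comp)
    return components
```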
7. Conclusions
We present a framework to detect complex-shaped indoor components and infer their spatial relationships. Its kernel is a slice-guided component detection algorithm for indoor point clouds. The core insight is that slices of most components of indoor scenes always have similar 2D profiles, which allows for the detection of complex-shaped components regardless of whether these components have regular geometry. Moreover, through the layer-by-layer global slicing, the topological relationships between indoor components are preserved, which also facilitates the construction of spatial relationships between indoor components.
To obtain a spatial structure of indoor scene models, we built an ontology to model the commonsense knowledge about the semantics of spatial relationships between indoor scene components. The spatial relationships between indoor components were inferred, and an SGSR was constructed to represent the components and their spatial relationships.
Through experimental evaluation, we demonstrated the segmentation performance of our proposed method on indoor scene components with complex shapes. We also showed that our method can accurately predict spatial relationships.
A limitation of our method is the determination of the slicing direction: different slicing directions lead to different segmentation results. In a real indoor scene, most objects are placed on the ground in a normal posture; thus, the slicing planes are set perpendicular to the floor normal, and the segmentation results are satisfactory. For objects placed on the ground in an abnormal posture and objects with special shapes, determining the slicing direction and detecting the components remain our future work.