1. Introduction
Energy efficiency in buildings is a multifaceted topic that has gained considerable attention in the last decade. Various national regulations have led to the creation of technical standards designed to make energy consumption transparent and optimized. To implement these legislative resolutions, existing buildings require inspections of their energy distribution. The most widely used technique for performing energy studies in built-up areas is infrared thermography. The method detects the infrared energy emitted from an object, converts it to temperature, and displays an intensity color-coded image of the temperature distribution. Thermal infrared (TIR) images enable us to visualize thermal faults such as air infiltrations or moisture areas and to detect damage to the building structure, for example, cracks, delamination, or loose tiling. Depending on the final aim of the energy audit, thermal data can be collected in an indoor environment [1,2] or by outdoor measurements including airborne platforms [3] and close-range techniques [4]. A broad review of infrared thermography applications for building diagnostics is presented in Kylili et al. [5].
Established thermographic building inspection procedures are performed on-site by a human operator. Such interpretation is not only time consuming but also highly dependent on the expertise of the operator. Hence, the current trend in building energy audits is toward the automatic interpretation of TIR data, minimizing the subjectivity of the assessment and allowing large-scale inspections. To provide a spatial reference for thermal images and to facilitate their interpretation, thermal measurements are often integrated with other data sources. The complementary geometric information typically comes from laser scanning [6,7], photogrammetric spectral imagery [8,9], or existing building models [10]. Meanwhile, sensor solutions are now available that combine a laser scanner with integrated thermal camera technology (Leica BLK360). It is worthy of note that there are also professional plug-on modules on the market that upgrade a smartphone to a low-cost infrared camera (e.g., the FLIR ONE Pro).
Georeferenced TIR images enable us to analyze thermal data, to extract the corresponding information, and to map it to 3D space. With regard to automatic façade interpretation for energy saving, the literature focuses on the automatic detection of thermal leakages, windows, and other structures, performed on 2D textures with a reference to the building. In Hoegner and Stilla [11], heat leakages on building façades are detected using a region-growing algorithm applied to strong image gradients. The outcomes of that work show that the presence of windows and doors in the analyzed image influences the automatic investigation of the façade state and leads to false results. Since glass reflects the temperature of the surroundings in thermal data, captured windows do not present the real temperature of the façade but rather the temperature of the sky and neighboring objects. Therefore, the automatic classification of these façade objects and their removal from the input data is of great importance for the reliability of the subsequent thermal inspection. A procedure for window detection in thermal texture images presented by Iwaszczuk et al. [12] starts with image segmentation using local dynamic thresholds. Masked correlation for corner detection is then applied in order to determine the position and size of the extracted rectangular bounding box. To detect windows and doors in rectified thermal images, the research described in Sirmacek et al. [13] used L-shaped features and perceptual organization rules. Once the detected objects are removed from a wall, heat leakage areas on the investigated façade are marked by applying a region-growing algorithm at local maximum values on the façade surface. In the work by Michaelsen et al. [14], the authors present an attempt at window detection using gestalt grouping. For this purpose, structural knowledge about façade objects, such as their hierarchy, geometry, and mutual relations, is coded in a declarative way using two different systems of production rules.
In the literature, the performance assessment of object detection methods that use TIR information is mostly limited to visual evaluation. Numerical statistics for unsupervised window extraction from thermal data are given in Lin et al. [15]. They show 85% correctness and 82% completeness, calculated on a per-object level. Performance metrics are more often reported for façade opening detection from laser scanning point clouds or RGB images. The method presented in Malihi et al. [16], applied to photogrammetric point clouds, achieves 92% correctness and 96% completeness on a per-object level. The automatic window detection in façade images described in Neuhausen et al. [17] reveals a 95% detection rate with a precision of 97%, depending on the complexity of the building being processed. A more precise assessment, performed at a smaller level (point- or pixel-based), is usually given for the semantic classification of façades. The pixel-based accuracy of window extraction achieved by Markus et al. [18] is 78%, and Cohen et al. [19] achieved 85%. A deep learning approach for façade parsing, described by Liu et al. [20], achieved 93% accuracy for the class window.
Although the thermal reconstruction of 3D scenes based on sensor fusion has often been discussed in the literature, the further automatic processing of the thermal data is seldom presented. Moreover, the published studies of thermal information analysis are performed on TIR textures in 2D space by classical image processing algorithms. Only in the final step are the detected objects back-projected to 3D space using previously provided 3D references. Consequently, these investigation procedures neglect the geometric characteristics of the data and do not exploit the full potential of currently available photogrammetric techniques. On the other hand, the existing algorithms for the semantic interpretation of 3D point clouds are mostly dedicated to laser scanning data and focus only on geometric features, without taking any other available information into consideration. To close this gap, we present a new approach that brings the investigation of thermal data into 3D space. The novel part of the research is the combination of thermal information with other available characteristics of a 3D scene for a thermal analysis executed directly on a 3D point cloud. The goal of this paper is to investigate how spectral and geometric characteristics may support thermal analysis. Furthermore, we aim to evaluate the utility of 3D thermal point clouds for object extraction. Since façade openings, such as windows and doors, impede automatic thermal inspection, the aim of the presented façade classification procedure is their prior detection. The performance of the object extraction is compared and evaluated according to the achieved classification accuracy, completeness, and visualization results.
The input data consist of two types of image sequences, acquired in the thermal infrared and visible RGB spectrum. The Structure-from-Motion (SfM) technique is used on both image types to estimate camera orientations and 3D scene geometry without any initial information. The resulting 3D point cloud is attributed with geometric, RGB, and TIR information. The fusion of the different input information sources offers the opportunity to capitalize on the synergies between them and to improve the classification process. Therefore, the executed experiments are designed to investigate different data combinations and their impact on the final results. In order to focus purely on feature influence and label each point independently, we use a context-free supervised algorithm: the Random Forest. The best feature combination is then used in the final part of the study as an input for Conditional Random Fields, which incorporate neighboring relations between points and their common interaction. The data preparation was done using the commercial software tools PhotoScan (3D point cloud generation) and MeshLab (texture matching). For the classification algorithms, we developed our own software.
The structure of this paper is as follows: we start with a description of the data acquisition and the generation of TIR-attributed 3D point clouds. The next section addresses the methodology for the investigation of the relevance of different information types for the classification performance. This is followed by the explanation of an algorithmic frame applied for the contextual classification of thermal façades. Then, a thorough evaluation of the approach is presented and discussed. The final section summarizes conclusions and gives an outlook on future work.
3. Classification in 3D Space
While the classification of laser scanning data based on geometric features has already been well addressed in the literature (e.g., [28,29]), experiments executed in 3D space on SfM point clouds that also take features other than geometry into consideration are rarely encountered. Therefore, in the first part of the presented classification study, we investigate the relevance of different information types (thermal, geometric, color, and their combinations) for the object extraction. Finally, we aim at an adaptive smoothing of the results, which avoids wrongly classified single points but at the same time does not cause artificial over-smoothing and allows us to detect object edges. The feature combination providing the best classification performance is then used for Conditional Random Field (CRF) object extraction, adding context information.
3.1. Data Structure Approximation and Feature Extraction
For many tasks related to understanding 3D scenes, relative differences between point characteristics are more relevant than the values originally measured by a sensor. Hence, in the presented method, we use two types of point descriptors: direct values belonging to a point, and relative values based on differences between neighboring points. Although many classification approaches presented in the literature are based on a single neighborhood, the application of multiple neighborhoods has been found to be favorable [30,31]. In the set of descriptors, we use a combination of three spherical neighborhoods, starting with a radius of 10 cm (considering the TIR resolution of 2.5 cm), followed by radii of 30 cm and 50 cm. The extracted neighbors serve as a base to describe the local geometric 3D structures and the local differences between points.
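A minimal sketch of this multi-scale neighborhood recovery is given below, assuming the point cloud is available as an (N, 3) NumPy array of coordinates in meters; only the three radii come from the text, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def multiscale_neighbors(points, radii=(0.10, 0.30, 0.50)):
    """For each radius, return per-point lists of neighbor indices."""
    tree = cKDTree(points)
    return {r: tree.query_ball_point(points, r) for r in radii}

# Toy usage on a synthetic cloud:
points = np.random.rand(1000, 3)
neighbors = multiscale_neighbors(points)
print(len(neighbors[0.10][0]))  # number of neighbors of point 0 within 10 cm
```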
The set of descriptors computed for each point in 3D space is presented in Table 1. For the computation of geometric features, we adapt the method based on eigenanalysis, which is widely applied in the classification literature [32,33]. The spatial coordinates of the neighboring points are used to compute a local 3D structure covariance tensor, whose eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3$, together with the eigenvector $\mathbf{e}_{min}$, serve as a base for the computation of local geometry features. Since our work focuses on façade classification, we augmented the geometric feature set by the largest difference of depth values ($d_{max} - d_{min}$) extracted within the given spherical neighborhood $N_p$ with radius $r$.
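The eigenanalysis step can be sketched as follows. This is an illustration under our own naming conventions, not the authors' code; the per-point depth attribute is assumed to be available (e.g., as the distance from the façade plane).

```python
import numpy as np

def local_geometry_features(neigh_xyz, neigh_depth):
    """neigh_xyz: (k, 3) neighbor coordinates; neigh_depth: (k,) depth values."""
    cov = np.cov(neigh_xyz.T)                    # local 3D structure covariance tensor
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    l3, l2, l1 = eigvals                         # so that l1 >= l2 >= l3
    e_min = eigvecs[:, 0]                        # eigenvector of the smallest eigenvalue
    depth_range = neigh_depth.max() - neigh_depth.min()  # d_max - d_min
    return (l1, l2, l3), e_min, depth_range
```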
The second type of investigated features is based on colors. In our work, color features are computed in the HSV color space (the hue, saturation, and value of each 3D point are noted in Table 1 as $H_p$, $S_p$, and $V_p$, respectively). Unlike RGB, HSV separates the color components from the intensity, giving more robust information. Its robustness to lighting changes and its ability to suppress shadows mean that this color space is often employed in computer vision tasks. Regarding point cloud classification, the advantage of HSV over the RGB domain is stated, for example, in Becker et al. [34]. The set of color descriptors applied in our method is composed of 12 values. Besides each component of the color space (hue, saturation, value), we compute the color span, average, and variance at all levels of the multiple neighborhoods.
Finally, the set of features related to the point temperature is extracted. It consists of the intensity index measured by the thermal infrared camera and statistical components showing the differences between adjacent objects. The relative values assigned to each point are computed based on its nearest neighbors, extracted at the three scales of the neighborhood. As with the color features, the relative TIR features comprise the span, average, and variance.
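The relative descriptors (span, average, variance per neighborhood scale) can be sketched for any scalar point attribute; the same routine would apply to each HSV channel and to the TIR intensity. Names are our own, and `neighbor_lists` is assumed to come from the neighborhood recovery shown earlier.

```python
import numpy as np

def relative_stats(values, neighbor_lists):
    """values: (N,) scalar attribute; neighbor_lists: per-point neighbor indices."""
    feats = np.empty((len(values), 3))
    for i, idx in enumerate(neighbor_lists):
        v = values[np.asarray(idx, dtype=int)]
        feats[i] = (v.max() - v.min(), v.mean(), v.var())  # span, average, variance
    return feats
```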
3.2. Context-Free Classification Based on Different Information Types
Given the extracted set of point descriptors, we learn a supervised classifier that predicts the conditional probabilities $P(y \mid \mathbf{x})$ of the different class labels $y$. Among the large number of standard classification approaches, we have chosen the Random Forest classifier, which has been shown to provide an accurate classification and runs efficiently on large point clouds [35]. As a classical standard approach, the Random Forest allows the direct evaluation of the influence of the feature descriptors on the classification result. The Random Forest learning method proposed by Breiman [36] is composed of an ensemble of randomly trained decision trees. Each tree predictor is trained on a random subset of the training data, depending on a random vector of features sampled with the same distribution for all trees in the ensemble. Consequently, the resulting set of decision trees can be considered decorrelated, which improves the generalization and robustness of the classification performance. During the point cloud classification process, each tree in the Random Forest gives a unit vote for the most popular class for each 3D point. The final point label is determined by taking the respective majority vote $N_l$ over the predictions of all decision trees $T$ [37]:

$$P(y = l \mid \mathbf{x}) = \frac{N_l}{T} \quad (1)$$
For the presented experiments, we use a Random Forest consisting of 50 fully grown trees. To estimate this value, we used a standard method based on the relation between the number of trees and the Out-Of-Bag (OOB) error [34]. With the chosen number of trees, the OOB error stabilizes in our experiment, providing an optimal balance between prediction performance and computation time. The number of differentiating features randomly chosen at each split is set to the square root of their total number (64 features calculated at different scales). The points are split based on the impurity information given by the Gini index.
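An equivalent configuration can be expressed in scikit-learn, as in the sketch below; the authors used their own software, so this is only an assumed stand-in with the stated parameters.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=50,      # 50 trees, chosen where the OOB error stabilizes
    max_depth=None,       # fully grown trees
    max_features="sqrt",  # sqrt(64) = 8 candidate features per split
    criterion="gini",     # Gini impurity as the splitting measure
    oob_score=True,       # report the Out-Of-Bag error estimate
)
# rf.fit(X_train, y_train)
# posteriors = rf.predict_proba(X_test)   # per-point class probabilities
```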
3.3. Contextual Classification
Contextual information is an important clue in complex data that can improve classification results. Therefore, after investigating the relevance of different features for the object extraction and choosing the best setup, we enhance the classification process by adding context explicitly. For this purpose, we apply the Conditional Random Field (CRF), which belongs to the group of undirected graphical models, providing a probabilistic framework for context-based classification. The CRF has become a popular technique for class derivation, especially in image processing [38,39,40]; however, its application to 3D point clouds [28,41] has been reported relatively rarely. In the general formulation of the CRF framework, the underlying graph structure $G(\mathbf{n}, \mathbf{e})$ consists of a set of nodes $\mathbf{n}$ and a set of edges $\mathbf{e}$. In the presented case, each node $n_i \in \mathbf{n}$ corresponds to a 3D point, while each edge $e_{ij} \in \mathbf{e}$ represents a contextual relation linking a pair of neighboring nodes $n_i$ and $n_j$. The goal of the classification is to find the most probable configuration of class labels $y_i \in \mathbf{y}$ determined for all points simultaneously, given the observed data $\mathbf{x}$ (the input point cloud). Thus, the CRF has to maximize the posterior probability $P(\mathbf{y} \mid \mathbf{x})$ as follows [42]:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{i \in \mathbf{n}} \varphi_i(\mathbf{x}, y_i) \prod_{i \in \mathbf{n}} \prod_{j \in N_i} \psi_{ij}(\mathbf{x}, y_i, y_j) \quad (2)$$

In Equation (2), $Z(\mathbf{x})$ is a normalization constant, which turns the potentials into probabilities. The terms $\varphi_i(\mathbf{x}, y_i)$ are called the unary potentials; they link the class label of each node $n_i$ to the observed data. The terms $\psi_{ij}(\mathbf{x}, y_i, y_j)$ are called the pairwise potentials; they are responsible for the model of the contextual relations, with $N_i$ denoting the set of neighbors of node $n_i$. In the presented experiments, the unary and pairwise potentials are weighted equally. The general formulation of the CRF allows the application of arbitrary discriminative classifiers with a probabilistic output to model both types of potentials.
Besides differences in potential modelling, CRF variants differ in their definition of the graph structure. Three-dimensional points, unlike image pixels, are irregularly distributed in 3D space; thus, there is no direct definition of the neighborhood from which a CRF graph structure could be computed. In the presented research, we use the neighborhood information already extracted during feature computation (cf. Section 3.1). Each 3D point is linked by edges to all its neighbors within the spherical neighborhood with a radius of 10 cm. Since the Random Forest is considered to be one of the best classifiers, we applied it to model both types of potentials. The unary potentials are already provided by the probabilistic outputs of the best feature combination, computed in the previous step of the research. In order to avoid zero values for unlikely classes, we compute the exponent of the calculated posteriors:

$$\varphi_i(\mathbf{x}, y_i = l) = \exp\left(P(y_i = l \mid \mathbf{x})\right) \quad (3)$$
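Continuing the Random Forest sketch above, the unary potentials of Equation (3) would then amount to exponentiating the RF class posteriors; `rf` and `X_test` are the assumed names from the earlier snippet.

```python
import numpy as np

# posteriors: (N, c) class probabilities from the trained RF, rows summing to 1
posteriors = rf.predict_proba(X_test)
unary = np.exp(posteriors)   # unary[i, l] = phi_i(x, y_i = l) as in Equation (3)
```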
In many CRF applications, the pairwise potential is provided by relatively simple models, such as the Potts model and its enhanced variations, favoring identical labels at neighboring nodes. More complex models are based on the joint posterior probability of two node labels given the observed data $\mathbf{x}$. They avoid over-smoothing and lead to better classification performance, at the cost of a much higher computational effort. In the presented case, a new RF is trained to predict the conditional probabilities of the class label configurations for each edge $e_{ij}$ connecting two neighboring points. For $c$ classes to be discerned during point classification, the classifier has to differentiate between $c^2$ possible configurations of classes. The observed data are represented by an interaction feature vector $\mathbf{g}_{ij}$ computed for each edge $e_{ij}$. The vector is usually provided either by concatenating the feature values of the two points connected by the edge, $\mathbf{f}_i$ and $\mathbf{f}_j$, or by calculating their difference, $\mathbf{f}_i - \mathbf{f}_j$. Similar feature values of neighboring points often result in differences close to zero, thus hindering class differentiation. Therefore, in the presented experiment, the interaction feature vector is provided by concatenating the point features of the edge ends. For the computation of the pairwise potential, we propose using a different set of features than for the calculation of the unary potential. Since the edges of the graph structure link points within a 10 cm spherical neighborhood, features calculated at larger scales do not give a large differentiation boost to the classification of locally similar points. Thus, the used feature set contains 48 features resulting from the concatenation of the descriptors belonging directly to the endpoints and their close neighborhood (24 features each, as presented in Table 1). In a similar manner to Equation (3), the RF pairwise potential is defined by

$$\psi_{ij}(\mathbf{x}, y_i = l, y_j = k) = \exp\left(P(y_i = l, y_j = k \mid \mathbf{g}_{ij})\right) \quad (4)$$

where $l$ and $k$ reflect the label configuration of the adjacent nodes.
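The construction of the edge training data can be sketched as follows, under assumed names: `edges` holds the endpoint indices of each graph edge, `local_features` the 24 local descriptors per point, and the target encodes the $c^2$ label configurations.

```python
import numpy as np

def edge_training_data(edges, local_features, labels, n_classes):
    """edges: (E, 2) endpoint indices; local_features: (N, 24); labels: (N,)."""
    i, j = edges[:, 0], edges[:, 1]
    g = np.hstack([local_features[i], local_features[j]])  # (E, 48) interaction vectors
    y_edge = labels[i] * n_classes + labels[j]             # encode (l, k) as one of c^2 labels
    return g, y_edge

# A second Random Forest is then trained on (g, y_edge); its predict_proba
# output, exponentiated as in Equation (4), serves as the pairwise potential.
```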
Given the model for the posterior according to Equation (2) and the parameters of the unary and pairwise potentials according to Equations (3) and (4), respectively, the goal of inference is to determine the label configuration for which $P(\mathbf{y} \mid \mathbf{x})$ is maximized. For the optimization, we use an iterative message passing algorithm that can be applied to CRFs with arbitrary formulations of the interaction potentials: Loopy Belief Propagation (LBP).
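For illustration, a compact max-product LBP in the log domain is sketched below. This is a generic reimplementation under stated assumptions, not the authors' code; message damping and convergence checks are omitted for brevity.

```python
import numpy as np

def loopy_bp(log_unary, edges, log_pairwise, n_iters=20):
    """log_unary: (N, c); edges: list of (i, j); log_pairwise[e]: (c, c) for edge e."""
    N, c = log_unary.shape
    msgs = np.zeros((2 * len(edges), c))   # directed messages: 2e is i->j, 2e+1 is j->i
    incoming = [[] for _ in range(N)]      # ids of messages arriving at each node
    for e, (i, j) in enumerate(edges):
        incoming[j].append(2 * e)
        incoming[i].append(2 * e + 1)

    def beliefs():
        b = log_unary.copy()
        for n_id in range(N):
            for m in incoming[n_id]:
                b[n_id] += msgs[m]
        return b

    for _ in range(n_iters):
        b = beliefs()
        new_msgs = np.empty_like(msgs)
        for e, (i, j) in enumerate(edges):
            b_i = b[i] - msgs[2 * e + 1]   # exclude the reverse message
            new_msgs[2 * e] = (b_i[:, None] + log_pairwise[e]).max(axis=0)
            b_j = b[j] - msgs[2 * e]
            new_msgs[2 * e + 1] = (b_j[None, :] + log_pairwise[e]).max(axis=1)
        msgs = new_msgs - new_msgs.max(axis=1, keepdims=True)  # stabilize
    return beliefs().argmax(axis=1)        # most probable label per point
```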
4. Results and Discussion
The experimental part of our research starts with the investigation of the relevance of different information types for the classification performance. We examine the utility of the generated 3D point clouds, combining thermal information with supportive color and geometric characteristics. The best setup is then enhanced by the consideration of context, leading to the final classification results. To validate the presented approach in terms of its applicability and performance, the accuracy and quality of the classification outputs were evaluated. Experiments were conducted by applying our procedure to two data sets, façade 1 and façade 2, which were generated as described in the previous sections and differ in the calibration method used to extract their TIR attributes. Both data sets present large and complex building façades captured along 180 m, with 371 façade openings (windows and doors) represented by ~770,000 point samples.
The classification algorithm was executed on point clouds down-sampled to a resolution of 3 cm (around 25 times fewer points than the original data sets stemming from the RGB imagery). This value is related to the lowest resolution of the TIR images on the furthest parts of the façades. The point clouds were provided with reference labelling containing manually marked 3D points of façade openings (windows and doors).
4.1. Investigation of Different Information Types for the Classification Performance
In the experiments, each data set was split into disjoint training and testing sets by a vertical plane. The resulting point clouds are similar with respect to the number of points and the class distribution. The respective data characteristics are collected in Table 2.
Since the number of points belonging to façade openings significantly differs from the number of points of the other object classes, using the whole training set for classifier learning might have a detrimental effect on the classification results [43]. Thus, in order to avoid an unbalanced training data set, we sample the same number of training examples for each class (160,286 and 120,008 for façade 1 and façade 2, respectively).
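Such class-balanced sampling can be sketched as below: the per-class sample count is capped by the rarest class, and a random subset of indices is drawn for each class. Names are illustrative.

```python
import numpy as np

def balance_classes(labels, rng=None):
    """Return indices of a training subset with equally many samples per class."""
    rng = np.random.default_rng(0) if rng is None else rng
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    return np.concatenate([
        rng.choice(np.flatnonzero(labels == c), n_per_class, replace=False)
        for c in classes
    ])
```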
The main objective of the first part of the conducted experiments was to evaluate the utility of different types of information for the classification process. Therefore, the classifier performance was tested against different feature sets, considering the following scenarios:
Thermal infrared only;
Thermal infrared and geometric;
Thermal infrared and colors;
Thermal infrared, geometric, and colors (i.e., all extracted features).
Once the classifier was trained on the training data using the respective set of features, we predicted the labels for the test data and compared them to the reference labelling. The quality assessment was executed on a per-point level. To evaluate the performance of our framework, we computed two final evaluation metrics, completeness (Equation (5)) and correctness (Equation (6)), related to the quality of façade object extraction [44]:

$$\text{completeness} = \frac{\#TP}{\#TP + \#FN} \quad (5)$$

$$\text{correctness} = \frac{\#TP}{\#TP + \#FP} \quad (6)$$

where $\#TP$, $\#FN$, and $\#FP$ are the numbers of true-positive, false-negative, and false-positive 3D points, respectively.
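A minimal per-point implementation of Equations (5) and (6) for a binary opening/non-opening labelling could look as follows; the names and the binary setup are our assumptions.

```python
import numpy as np

def completeness_correctness(y_true, y_pred, positive=1):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    return tp / (tp + fn), tp / (tp + fp)   # Equations (5) and (6)
```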
The corresponding quality metrics obtained for each scenario are collected in Table 3.
The highest indicator values in the whole experiment were obtained for the fusion of all available information types (74% completeness and 92% correctness for façade 1, and 85% completeness and 95% correctness for façade 2). The results of both data sets show a very similar distribution of relative improvements across the consecutive experiments. It is important to note that progress is observed simultaneously in the completeness and correctness of the outputs. The statistics also demonstrate that enhancing the TIR data with color information brings better performance than fusing TIR with geometric information (with up to a 9% difference in completeness and 8% in correctness).
Discrepancies between the final results of the façade 1 and façade 2 processing are easy to capture in a graphical overview of the computed statistics (Figure 3). The completeness values obtained for the two sets differ by up to 20%, a gap that decreases as further features supporting the thermal infrared information are added. Such differences between the data sets are most likely related to the different methods of thermal information extraction during data acquisition. The 3D point cloud of façade 1 was attributed with TIR data calibrated using commercial software, while the second set of thermal images was processed by our own algorithm [22]. The latter workflow proves to be more suitable for the classification purpose, due to the better accuracy of the finally obtained thermal information and its higher consistency within the whole data set (also visible by comparing Figure 2a,b). Despite the clear differences in classification completeness, the correctness indicators are largely correlated between the two data sets in each scenario. The statistics show that even though the quality of the TIR information has a large influence on the number of detected objects, it does not contribute to false object detection.
Important feedback on the classification performance achieved in the various feature scenarios is given by a visual comparison of the resulting 3D point clouds. Figure 4 illustrates the classification outputs with the semantic classes marked. It is easy to notice how the different methods of TIR data calibration and pre-processing affect the classification results. The differences between the two data sets are especially large when TIR is the only considered feature. Adding a second type of information may often improve the classification performance through a synergy effect. Supplementing the thermal data with geometric features, however, does not bring a significant improvement in the visualized results. On the other hand, merging the thermal infrared data with color alone gives a large boost to the classification performance, making the biggest visual difference in the detection results. Still, in this experiment, some portions of points (mostly on the roofs) are falsely recognized as window openings. The fusion of the TIR and color information with the geometric data enables us to reduce the percentage of such misclassified points and to achieve the best classification results.
4.2. Contextual Classification of TIR-Attributed Point Clouds
In order to enhance our experiments by the direct consideration of context, we integrate the Random Forest classifier into a Conditional Random Field framework. The RF probabilities for the classes calculated in the previous step are plugged into the CRF as unary potentials. For the computation of the edges, we use the same split as in the unary potential case: disjoint training and testing data sets. The direct input data for the calculation of the pairwise potential contained 7,171,967 training edges and 9,854,536 test edges for façade 1, and 4,548,491 and 10,254,142 edges, respectively, for façade 2. The training sets are balanced according to the class with the smallest number of samples by a random selection of the same number of samples for all four classes: 110,829 edges for façade 1 and 122,857 edges for façade 2.
Given the probability values for the unary and pairwise potentials, we determine the optimal label configuration by applying Loopy Belief Propagation. To evaluate the final results of the façade opening extraction, the output is compared with the reference labelling. The quality assessment on a per-point level is executed according to Equations (5) and (6) by examining every single point and calculating the classification completeness and correctness measures. In order to take a broader view of the algorithm's performance, we also execute the evaluation on a per-object level. The assessment is based on the overlap concept described in Rutzinger et al. [44]. In a general context, an object is considered to be a true positive if a certain minimum percentage of its area is covered by objects in the other data set. In our research, similarly to the window detection evaluations presented in [15,16], a detected object is considered to be a true positive (TP) if at least 70% of its points are properly classified. If at least 50% of the object points are classified incorrectly, the object is considered to be a false negative (FN).
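The per-object counting rule can be sketched as below, assuming each reference object carries an id per point and a boolean mask marks correctly classified points; only the 70%/50% thresholds come from the text.

```python
import numpy as np

def per_object_counts(object_ids, correct_mask, tp_thresh=0.70, fn_thresh=0.50):
    """object_ids: (N,) reference object id per point; correct_mask: (N,) bool."""
    tp = fn = 0
    for oid in np.unique(object_ids):
        frac_correct = correct_mask[object_ids == oid].mean()
        if frac_correct >= tp_thresh:
            tp += 1                         # >= 70% of points properly classified
        elif 1.0 - frac_correct >= fn_thresh:
            fn += 1                         # >= 50% of points misclassified
    return tp, fn
```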
Figure 5 presents the visualization of the final results. The statistical measures of the classification performance are collected in Table 4.
In comparison with the best performance indices obtained for the RF in the previous experiment, we observe a large improvement in the point-based completeness: from 74% to 82% for façade 1 and from 85% to 90% for façade 2. The consideration of context enables us to extract the full shapes of objects by the proper labelling of previously missed neighboring points. It is important to note that the improvement in the classification completeness was not achieved to the significant detriment of the correctness measure (no change for façade 2, and a one percentage point deterioration for façade 1). The statistical changes are also reflected in the visualization of the final results (Figure 5). We can clearly observe the improvement in the completeness of the shapes of the detected objects. Furthermore, the restriction of the edge features to local ones during the pairwise potential calculation enables us to detect the sharp edges of the windows, which should be especially important for further post-processing and polygon boundary extraction.
The quality assessment executed on a per-object level revealed much higher completeness values than the values computed per point (95% vs. 82% for façade 1, and 97% vs. 90% for façade 2). This is due to the fact that, in both data sets, nearly all of the objects were detected, albeit with some missing points. On the other hand, the object-based correctness was the same as the per-point metric (91% for façade 1) or even lower (88% vs. 95% for façade 2). Such a result indicates that the structures falsely recognized as window openings are mostly very small. Our achieved values of 90% completeness and 95% correctness, compared against the same metrics reported in the literature (85% and 82% [15], or 92% and 96% [16]), confirm the quality of the applied method.
5. Conclusions
We have presented a method for the supervised extraction of façade openings from photogrammetric 3D point clouds attributed with TIR and RGB information. The detected objects are removed from the data, allowing for the reliable subsequent investigation and monitoring of thermal changes on a building façade. The novelty of the research is the direct combination of thermal information with other available characteristics of the data, as well as the classification workflow being performed entirely in 3D space. Unlike thermal analyses processed on 2D textures, the processing of 3D data allows us to benefit from the geometric characteristics of the classified 3D scene. Furthermore, we aimed to investigate how spectral and geometric data may support thermal analysis. The experiments have shown the superiority of color-based features over geometric characteristics as a complementary information source for thermal data. We also observed that differences in the TIR information pre-processing led to significant changes in the classification completeness, while they did not affect the correctness measure. The visual comparison of the processing results clearly shows the advantage of feature fusion over classification based on a single information type. The fusion of all available information, i.e., thermal, geometric, and color attributes, allows the recognition of 74% of object points with an exactness of 92% for façade 1, and 85% of object points with an exactness of 95% for façade 2. Considering context in our experiments improved the point-based classification completeness by eight and five percentage points for façades 1 and 2, respectively. Analyzing the algorithm's performance on a per-object level, we notice larger values of the completeness metric than in the point-based evaluation, together with lower values of correctness. The comparison indicates that the falsely detected structures are mostly very small and that, apart from a small portion of missing points, nearly all of the objects were successfully detected.
In the presented study, we focused on examining the suitability of TIR- and RGB-attributed 3D point clouds for classification in 3D space. Therefore, the algorithm classified single 3D points. Windows and doors, however, exhibit strong regularities in region size and symmetry, which can be exploited, for example, by shape grammars. In this respect, the presented results could provide valuable input for further investigation. In the future, we also plan to extend the studies by taking a larger number of semantic classes into consideration and by classifying different urban materials. Since the values displayed in thermal images depend on the emissivity of the object materials, knowledge about the material type should improve the emissivity calculation and result in a more precise estimation of the surface temperature.