1. Introduction
Techniques for the automatic creation of 3D models directly from photos or laser scans have shown success in constructing the inputs for architectural structural analysis [1,2,3]. In the preservation of heritage buildings, these techniques are even more relevant due to the often complex design of target structures and the historical importance of decorative or facade elements; these elements are not captured when creating 3D models from floorplans without substantial manual work [4]. Automatically creating 3D models of the exteriors of buildings has been well researched in recent years, but progress on interior modeling has been slower [5,6,7]. Capturing highly detailed 3D models directly from the physical ground truth presents a substantial challenge when modeling structurally relevant interior spaces due to the presence of non-architectural elements like furniture [8,9]. Not only are these interior elements irrelevant for structural integrity testing and for understanding the bare building itself, but in occupied buildings they can also present privacy concerns for the current residents [10]. One option for dealing with these interior elements is simply to physically remove them by hand before modeling the building; however, this is inconvenient, expensive, and time-consuming. It is also possible to manually clean the 3D model after capture, but this is likewise time-consuming, thus preventing these techniques from being used at scale; it also requires human annotators to comb through the interior spaces with a high level of focus, potentially exposing private details. The need for the automatic, privacy-preserving creation of 3D models of interior spaces motivates the techniques explored in this paper; we propose a pipeline from lightweight image capture to 3D modeling with the 3D-native removal of non-structural elements.
This article focuses specifically on neural radiance field (NeRF)-based approaches to 3D modeling. NeRFs are a 3D modeling paradigm wherein images of a scene and their corresponding camera poses are used to train a neural network to predict the 3D model of the scene without a 3D ground truth. Though rival photogrammetric approaches show strong reconstruction performance in many scenarios, they have multiple drawbacks: large storage size, no novel view synthesis ability, and no native methods for manipulating or understanding the 3D content. To address these concerns, and to explore the potential of newer methods of 3D reconstruction, we utilized strictly neural radiance field-based approaches following Pepe et al. [11], Llull et al. [12], and Croce et al. [13,14,15], who demonstrate the feasibility of utilizing NeRFs specifically within the cultural heritage domain. In particular, we employed language embedded radiance fields (LERFs) to introduce querying ability to our models and make the identification of extraneous objects possible. Furthermore, if 3D modeling follows the recent trend of 2D imaging, the capture and manipulation of 3D models for cultural heritage preservation will increasingly rely on deep learning-based approaches, making the creation and study of these techniques crucial today.
The subject of our study was the old hospital of Sion (L’Ancien Hôpital de Sion) in Valais, Switzerland. This building has substantial historical significance in Switzerland, as the first hospital in Sion was constructed by the Monks of St. John by at least 1164. The various hospitals of the town went through many iterations before centralizing in 1781 in this particular building, which is shown in Figure 1 [16]. The chapel of the hospital is listed as a national historic monument, and the rest of the building (much of which was built later) is regionally listed [17]. The primary method of construction is natural stone masonry, whose vulnerability to seismic activity has been the subject of much research [18]. This is particularly relevant considering that the hospital is in the most seismically vulnerable region of Switzerland; in fact, an earlier seismic analysis using a manually created 3D model was performed on the hospital in 2015 and is displayed later in the paper in Section 2.2 regarding structural assessment [17]. In 2020, the government of Sion announced a twenty-five million dollar renovation of the hospital designed to stabilize the building and retrofit it to become the new site of the city administration [19]. The building was mostly empty when we began our data capturing process, but around one fourth of the rooms were being utilized by a music school. Though this building has many interesting architectural features, the most beneficial aspect for our research was that it was being emptied in preparation for the future renovations. This allowed us to establish a structural ground truth for the building without any furniture or objects, in addition to scans of the same rooms while they were filled and in use with a myriad of complex objects.
Throughout this article, we explore the efficacy of our modeling techniques in this realistic context by showing the ability of our LERF-based approach to capture the actual underlying 3D structure and to effectively remove various unwanted items from the scene in preparation for the creation of the structural model for integrity testing.
2. Materials and Methods
To achieve our objective of testing the usefulness of NeRF-derived approaches for privacy-preserving automatic point cloud creation, we utilized numerous techniques which we explain in this section. In particular, we cover basic neural radiance fields and the offshoots Nerfacto and LERF. We also detail the downstream LOD modeling methods which will inherit the point clouds, as they are relevant to many of the specifics of how we chose to develop our pipeline and modeling techniques.
2.1. Neural Radiance Fields
Neural radiance fields (NeRFs) are designed to approximate 3D information about a scene given a set of input images of that particular scene and the camera pose relative to each input image [21]. The camera poses are computed from the images by the use of structure-from-motion techniques such as COLMAP [22] or by relying on information from the acquisition device. To be more precise, NeRFs utilize neural networks as a backbone to learn an approximate function which accepts a 3D coordinate vector as well as a viewing direction and outputs a predicted color vector and a density at that coordinate. The most general form of this function is

$$F: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$$

where $\mathbf{x}$ is the coordinate vector, $\mathbf{d}$ is the directional vector, $\mathbf{c}$ is the color vector, and $\sigma$ is the density.
In the training of a NeRF, rays are cast from the camera's viewing position through a target pixel in an input image into a synthetic 3D scene. Three-dimensional points are then sampled along that ray and passed to the neural network described above along with the relative directional vector from the input image. The colors and densities predicted by the network at each sampled point are volumetrically rendered into a predicted color for the target pixel. The predicted pixel is then compared with the target pixel to optimize a reconstruction loss. Through this training process, the neural network learns the latent 3D information which is described by the image dataset for one particular scene, and thus can be used to generate novel views of that scene effectively.
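To make this rendering step concrete, the following is a minimal sketch of volumetric rendering along a single ray, assuming uniform sampling and a stand-in `field` callable for the trained network; real implementations add positional encoding, hierarchical sampling, and batched rays.

```python
import torch

def render_ray(field, origin, direction, near=0.1, far=6.0, n_samples=64):
    """Volumetrically render one ray; `field` maps (points, dirs) -> (rgb, sigma)."""
    t = torch.linspace(near, far, n_samples)           # sample depths along the ray
    pts = origin + t[:, None] * direction              # (n_samples, 3) sample points
    dirs = direction.expand(n_samples, 3)              # same viewing direction per sample
    rgb, sigma = field(pts, dirs)                      # (n_samples, 3), (n_samples,)
    delta = t[1] - t[0]                                # uniform sample spacing
    alpha = 1.0 - torch.exp(-sigma * delta)            # per-sample opacity
    trans = torch.cumprod(                             # transmittance reaching each sample
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                            # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)         # predicted pixel color (3,)
```

During training, the returned color would be compared against the ground-truth pixel with a mean squared error reconstruction loss.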
2.1.1. Nerfacto
Since the publication of the original neural radiance field paper, there has been an explosion of techniques utilizing some variant of the standard NeRF architecture [23]. Though many of these offshoots are of undoubted utility and interest, we based our research on the Nerfacto architecture [24].
Nerfacto was developed as a synthesis of multiple other NeRF approaches during the creation of the Nerfstudio toolset. Nerfstudio is an open source NeRF development software which provides viewers and standard components of various NeRF architectures. In re-implementing multiple architectures (MipNeRF-360 [25], NeRF– [26], Instant-NGP [27], NeRF-W [28], and Ref-NeRF [29]) in a standardized, modular fashion, the authors were able to use various components to create a model, Nerfacto, that is both fast and relatively accurate: approaching or surpassing the state of the art for NeRFs in most quality metrics while training substantially faster [30,31]. There are multiple important differences between a base NeRF and the Nerfacto approach, namely in the optimization of camera positions, ray sampling techniques, scene registration, hashgrid encoding, and the generation of normals. We leave an in-depth explanation of the Nerfacto innovations to the original paper, but we include a diagram of the source of its various components in Figure 2.
2.1.2. Language Embedded Radiance Fields
Generating realistic 3D scenes is an important standalone task, but in order to understand the content of the scene it is necessary to encode semantic features into the 3D representation or a proxy for it. Three-dimensional semantic information underlies many tasks in computer science, and is of course necessary for our architectural object removal task. One approach to injecting semantics, in this case linguistic information, into 3D spaces is to utilize a language embedded radiance field (LERF) [32]. The general concept of a LERF is to instantiate a separate network which mimics the NeRF framework by accepting a 3D coordinate and camera pose, but which is trained to output the appropriate language embedding vector at that position.
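As a toy illustration of this separate language branch, the following is a hedged sketch of an MLP head mapping an encoded position feature to a normalized CLIP-dimensional vector; the dimensions, layer sizes, and the name `LanguageFieldHead` are illustrative assumptions, and the actual LERF conditions its language field on position and scale via a hashgrid encoding.

```python
import torch
import torch.nn as nn

class LanguageFieldHead(nn.Module):
    """Toy stand-in for the LERF language branch: maps an encoded 3D
    position feature to a CLIP-dimensional embedding (512 for ViT-B/32)."""
    def __init__(self, feat_dim=32, clip_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, clip_dim))

    def forward(self, feats):
        emb = self.mlp(feats)
        return emb / emb.norm(dim=-1, keepdim=True)  # normalized like CLIP
```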
In general, 3D semantic features assign some descriptive information about a scene to particular locations within it. Assignment schemes take multiple forms, including object labels attached to bounding boxes, per-voxel assignments of particular objects, or continuous features like those used by LERFs, where there are no discrete boundaries between objects. While 3D features can encode various sorts of data, such as temperature or other sensor measurements, linguistic features specifically are useful because they allow for interaction with various other language-based tools and make querying or interacting with the scene natural for humans. Linguistic features can take multiple forms, such as explicit dictionaries of descriptive captions, but representing them as embedding vectors offers a broader range of querying possibilities. Since language embedding vectors allow fast, effective comparisons of any linguistic information (i.e., a description of a particular object), having linguistic embeddings assigned to the 3D points throughout a scene makes it possible to understand and interact with the scene in a robust way.
The underlying system for creating language embeddings in LERFs is CLIP (Contrastive Language Image Pretraining) [33] or, alternately, its open source relative OpenCLIP [34]. These embeddings form the basis of many systems that rely on a joint understanding of language and images, such as Stable Diffusion, because they are effective on open set or long tail queries which do not fit particularly well into typical classes. This makes CLIP particularly useful for LERF, which utilizes the input images for the neural radiance field to generate its language embeddings. The largest difficulty with this approach is that CLIP embeddings, while image-size agnostic, can only be produced for a full image. In other words, CLIP does not give pixel-wise embeddings or segmented embeddings, but instead a single embedding per image. However, in order to compute the loss, and thus train the network that predicts the CLIP embedding for 3D positions in a similar structure to a regular NeRF, each pixel in the input image must have a corresponding CLIP embedding. The LERF authors found a CLIP embedding value for each pixel by creating a pyramidal stack of image subsets: they divided the original image into smaller sections before subdividing each of those sections into even smaller subsets. After creating the stack of image subsections by repeating this operation up to a particular depth of subdivisions, they used CLIP to find the embeddings of each subset of the original image. When a neural network is used to predict the CLIP embedding for a particular pixel, the loss is calculated for the prediction relative to the embedding of each image subset that contains the target pixel. Through training with this pyramid of image embeddings, the network learns to predict the best embedding at each 3D position due to the overlapping views of each 3D point in all the image subsets: within a single stack, across stacks, and across images.
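As an illustration, the following is a minimal sketch of such a crop pyramid, assuming OpenCLIP with an illustrative checkpoint and a non-overlapping crop schedule; the actual LERF implementation uses overlapping multi-scale crops and caches the embeddings.

```python
import torch
import open_clip
from PIL import Image

# Assumed model and pretraining tag; any OpenCLIP checkpoint would work here.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def crop_pyramid_embeddings(image: Image.Image, depths=(1, 2, 4)):
    """Tile the image into n x n crops for each depth n and embed each crop.
    Returns (box, embedding) pairs; a pixel's supervision set is every
    embedding whose box contains that pixel."""
    w, h = image.size
    results = []
    for n in depths:
        tw, th = w // n, h // n
        for i in range(n):
            for j in range(n):
                box = (i * tw, j * th, (i + 1) * tw, (j + 1) * th)
                crop = preprocess(image.crop(box)).unsqueeze(0)
                with torch.no_grad():
                    emb = model.encode_image(crop)
                results.append((box, (emb / emb.norm(dim=-1, keepdim=True)).squeeze(0)))
    return results
```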
The trained LERF network is then easy to query: the system converts a text query into a CLIP embedding $\phi_{\mathrm{quer}}$ and weighs it against the embedding $\phi_{\mathrm{lang}}$ at each 3D coordinate. The primary caveat to this query is that a distance measure between two embedding vectors of a large enough size is arbitrary without a reference point or comparison, because the embedding space can be rather sparse. The original LERF authors solved this by implementing a default dictionary of vague, unspecific labels dubbed canonical phrases: “object”, “things”, “stuff”, and “texture”. The query relevancy score is then determined by weighing how close the embedding at any position is to the query relative to its distance to the canonical phrases by cosine similarity. Expressed mathematically, this score is

$$\min_{i} \frac{\exp(\phi_{\mathrm{lang}} \cdot \phi_{\mathrm{quer}})}{\exp(\phi_{\mathrm{lang}} \cdot \phi_{\mathrm{quer}}) + \exp(\phi_{\mathrm{lang}} \cdot \phi^{i}_{\mathrm{canon}})}$$

where $\phi^{i}_{\mathrm{canon}}$ is the embedding of the $i$-th canonical phrase.
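A minimal sketch of this relevancy computation over a batch of rendered embeddings might look as follows; the tensor shapes are assumptions, and all embeddings are taken to be L2-normalized.

```python
import torch

def relevancy(phi_lang, phi_quer, phi_canon):
    """phi_lang: (N, D) rendered embeddings at N points; phi_quer: (D,)
    query embedding; phi_canon: (C, D) canonical phrase embeddings.
    Returns (N,) relevancy scores in [0, 1]."""
    q = torch.exp(phi_lang @ phi_quer)        # (N,) query similarity
    c = torch.exp(phi_lang @ phi_canon.T)     # (N, C) canonical similarities
    scores = q[:, None] / (q[:, None] + c)    # pairwise softmax per canon phrase
    return scores.min(dim=1).values           # worst case over canon phrases
```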
2.2. Structural Integrity Assessment
Structural integrity analysis and assessment is a crucial task in civil engineering for ensuring the stability of buildings both before they are built and throughout their lifespan. The backbone of many contemporary computational structural integrity testing techniques is a 3D model of the particular building being tested, such as in Figure 3. Often, 3D models are inferred from floor plans; however, in many situations this is either not possible or sub-optimal. For many buildings, the floor plans are not representative of the physical building (often due to renovations), have a non-standard illustration style which is difficult to automatically parse, or simply do not exist in the first place. Furthermore, some of the buildings in need of structural integrity assessment have been damaged by natural disasters or the passage of time, and the details of this damage are not present in the floor plans created during the initial construction. These difficulties combine to make it essential to create 3D scans directly in reference to the real building.
2.2.1. Modeling Occupied Buildings
Many relevant computational integrity testing techniques derive their models specifically from a scan of the exterior walls of a building. However, since many important structural walls, beams, and other elements are inside the building, it is also crucial to model the interior and exterior together as a unified model. Though this poses some problems even in an unoccupied building, it is a particularly difficult and pernicious problem when the building is currently occupied, as most target buildings are.
In an occupied building, especially the large buildings that are of most interest, it is usually not an option to empty the entire building of furniture. Doing so would be extremely time-intensive, expensive, and logistically challenging for one building, let alone for multiple buildings. As a result, the only practical option for interior scanning of large buildings or sets of buildings is to scan them while they are occupied with non-structural objects such as furniture and artwork. It is therefore crucial to remove these non-structural objects for two primary reasons: functionality of the model for integrity testing and privacy preservation for the current occupants of the building.
Since any non-structural elements present in the model would only degrade the efficacy of the integrity tests, it is self-evident that the extraneous objects which do not affect the actual building must be removed. It is possible to perform this sort of segmentation by hand; however, this is very labor-intensive and thus does not scale well. Another motivation for removing these objects is privacy preservation, especially in the case of modeling residential buildings. Many people will not consent to having the layout and contents of their homes publicly disclosed in a 3D model which will be saved for posterity and future testing. With these two considerations in mind, it is necessary to build modeling pipelines which can detect and remove non-structural elements automatically.
2.2.2. Inpainting versus Removal
A key consideration for any automatic removal pipeline is the choice between inpainting and removal. In some applications, it is imperative to inpaint the removed information, i.e., infer what should actually be in the missing space of the data and attempt to generate some realistic replacement. In our case, however, inpainting is not necessary because the building representation which is ultimately used is based on inferred wall placement rather than the native point cloud or mesh representation. Walls can be inferred from just the points that model the non-occluded sections of the walls or floors behind the removed objects. Since walls can be inferred without inpainting, and because inpainting introduces the danger of generating artifacts which could distort wall placement, we chose to only remove the non-relevant points rather than to infer the sections of the scene which were previously occluded by the removed objects.
2.2.3. Automatic Generation of Simplified Building Geometry
The practice of assessing structural integrity in buildings leads to the automatic generation of building geometries in terms of Level of Detail (LOD) models. These models represent simplified versions of the buildings and denote a reduced-complexity 3D representation [35]. Current research focuses on generating such models for the building exterior. For instance, Pantoja-Rosero et al. [1] presented an automated method for generating LOD3 building models, which capture exterior geometry, including openings, using structure-from-motion and semantic segmentation techniques. In this approach, the point cloud is clustered into planar primitives, and the intersection of these primitives yields potential faces for constructing a polygonal surface model of the building’s exterior. Correct faces are then selected by solving a linear programming problem, considering the goodness of fit relative to the point cloud, the coverage of the point cloud on the faces, and the complexity of the LOD2 model. Subsequently, the LOD2 model is upgraded to LOD3 using a convolutional neural network designed to classify openings in images, such as windows or doors, which are projected into 3D space using camera poses retrieved from the structure-from-motion. A simplified representation of the LOD2 model construction process is provided in Figure 4.
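To illustrate the first step of this pipeline, the following is a hedged sketch of clustering a point cloud into planar primitives with iterative RANSAC in Open3D; the thresholds are illustrative assumptions, and the actual implementation of Pantoja-Rosero et al. may differ.

```python
import open3d as o3d

def extract_planar_primitives(pcd, max_planes=20, dist=0.02, min_inliers=500):
    """Repeatedly fit a plane with RANSAC, peel off its inliers, and recurse
    on the remainder; returns (plane_model, inlier_cloud) pairs."""
    planes, rest = [], pcd
    for _ in range(max_planes):
        if len(rest.points) < min_inliers:
            break
        model, inliers = rest.segment_plane(
            distance_threshold=dist, ransac_n=3, num_iterations=1000)
        if len(inliers) < min_inliers:
            break
        planes.append((model, rest.select_by_index(inliers)))
        rest = rest.select_by_index(inliers, invert=True)  # remaining points
    return planes, rest
```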
A similar approach, involving clustering interior point clouds into planar primitives, can be applied to generate simplified interior geometry, which can then be combined with the exterior to generate LOD4 models. This approach is currently being developed by the authors in parallel with the present work, and some results are displayed in Figure 5.
2.3. LERF-Based Open Set Automatic Removal
In order to solve the challenges of structural integrity modeling outlined above, we introduce a LERF-based automatic removal technique built on top of Nerfstudio functionality, which is detailed in Figure 6. Our approach is effective on a large range of exotic objects and on small datasets where there are few captured views of particular objects. It can be run with no further human input after the initial capture of data and thus, when combined with the techniques detailed above, allows for the privacy-preserving capture of accurate interior structural models of buildings.
2.3.1. Open Set Queries and the Default Dictionaries
A fundamental component of the general effectiveness of CLIP classification, and thus LERF classification, is its strong open set capability (in other words, its ability to classify an extremely diverse set of objects with no limitation on what those objects are). Since we are dealing with removing a wide range of unknown objects from a myriad of views, in our context it is particularly useful to utilize CLIP rather than a more traditional object classifier, even one specialized for interior objects and furniture.
In the traditional LERF workflow, queries are entered through the GUI at render time and then compared against a dictionary of canonical phrases to find a relevancy score for the present query. Since we are seeking automatic detection without supervision, and we have a consistent, coherent, yet broad range of objects we want to remove, we need to present a persistent set of negative queries. Furthermore, since we also know what we want to preserve (walls, floors, structural beams, etc.), we can present a persistent set of positive queries for the elements we want to keep. By comparing the highest relevancy of each of these canonical sets of possible classifications, we can simply determine whether a particular point should be removed or preserved. This set of phrases was subject to substantial prompt engineering to find the optimal dictionaries, and it is presented in Table 1.
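As a sketch of how such persistent dictionaries might be encoded, the following uses the OpenCLIP text encoder; the phrase lists below are illustrative placeholders, not the tuned dictionaries of Table 1.

```python
import torch
import open_clip

# Assumed checkpoint; any OpenCLIP model consistent with the LERF would do.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

POSITIVE = ["wall", "floor", "ceiling", "structural beam"]   # keep
NEGATIVE = ["furniture", "artwork", "clutter", "appliance"]  # remove

def encode_phrases(phrases):
    """Encode a phrase dictionary into L2-normalized CLIP text embeddings."""
    with torch.no_grad():
        emb = model.encode_text(tokenizer(phrases))
    return emb / emb.norm(dim=-1, keepdim=True)

pos_embs, neg_embs = encode_phrases(POSITIVE), encode_phrases(NEGATIVE)
```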
2.3.2. Three-Dimensional Native Segmentation and Point Removal
Many NeRF-based 3D segmentation approaches, such as SPIn-NeRF [36], actually rely on 2D-native segmentation, inpainting, and removal as the backbone of their approach. Though 2D inpainting or segmentation has very strong results on individual images of a scene, this approach introduces a substantial number of artifacts because the segmentation maps are inconsistent on an image-by-image basis within the same scene [37]. This results in a particular piece of furniture being mapped and removed in one image but not another, leading the NeRF model to try to learn to produce a piece of furniture from one view but not from another, and thus creating ghostly floating artifacts where objects ultimately should have been removed. Another paradigm in object removal follows [38], where objects are learned distinctly from the scene and there is a discrete representation for each individual object. This works well in simple scenes with clearly delineated objects, but cannot scale to very complex scenes with large numbers of objects. One of the key advantages of using a LERF-based approach is that each point is evaluated and segmented in the actual 3D space, and thus it avoids many of the issues associated with 2D removal and inpainting in addition to foregoing the need for discrete object representations. When removing points, we calculate a relevancy score for each point in 3D space after training and then simply blacklist every point which is classified as a negative sample when rendering the point cloud from the NeRF. We also experimented with utilizing the blacklist during training to skip the RGB NeRF loss for points in the hashgrid which would eventually be blacklisted anyway, but we determined that the marginal improvement in quality of the non-blacklisted points was not worth the substantial slowdown in speed.
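Continuing the sketch above, the per-point keep/remove decision could be expressed as a vectorized mask over rendered point embeddings; shapes and names are illustrative.

```python
import torch

def blacklist_mask(point_embs, pos_embs, neg_embs):
    """point_embs: (N, D) language embeddings rendered at N candidate points;
    pos_embs/neg_embs: dictionary embeddings from the previous sketch.
    Returns a boolean mask, True where a point should be blacklisted."""
    best_pos = (point_embs @ pos_embs.T).max(dim=1).values  # best keep match
    best_neg = (point_embs @ neg_embs.T).max(dim=1).values  # best remove match
    return best_neg > best_pos

# points: (N, 3) xyz coordinates exported from the NeRF
# cleaned = points[~blacklist_mask(point_embs, pos_embs, neg_embs)]
```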
2.4. Data, Preprocessing, and Postprocessing
Our dataset consists entirely of architectural scans of an existing building which reproduce scenarios similar to our actual use case but where there were no true privacy preservation issues. In terms of image preprocessing, we only performed the steps outlined in the collection section and removed images with a high level of blur from the LERF input dataset. In terms of point cloud postprocessing (after generation and cleaning), we normalized all the point clouds and afterwards, where appropriate, trimmed the point clouds to be the same size. This trimming was performed primarily on the point clouds of the filled rooms, as they always had more points than the automatically cleaned rooms (due to point removal) and more than the empty rooms (due to containing more objects and thus more information). We chose the points to remove in post-processing by finding the points farthest from the origin of the normalized point clouds and trimming those, removing a mean of 12,696 points and leaving an average point cloud size of 1,004,997 points. Since neural radiance fields are known to generate more outlier points than photogrammetric or laser scanning-based approaches, having their outliers removed actually benefited the filled rooms in terms of distance metrics. Since our interest is in making the automatically cleaned rooms as close as possible to the physically empty rooms, and this trimming actually made our task more difficult, we considered it an acceptable trade-off to obtain more accurate distance measures unaffected by disparities in the number of points.
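A minimal sketch of this normalization and distance-based trimming step, assuming a simple center-and-scale normalization (the exact normalization used is not specified above, and `normalize_and_trim` is a hypothetical helper):

```python
import numpy as np

def normalize_and_trim(points: np.ndarray, target_size: int) -> np.ndarray:
    """Center the cloud on its centroid, scale it to a unit extent, then drop
    the points farthest from the origin until target_size points remain."""
    pts = points - points.mean(axis=0)
    pts = pts / np.abs(pts).max()
    if len(pts) > target_size:
        dist = np.linalg.norm(pts, axis=1)
        keep = np.argsort(dist)[:target_size]  # indices of the nearest points
        pts = pts[keep]
    return pts
```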
Collection
For dataset capture, we utilized the free but closed source Polycam 3D capture application with a LiDAR-enabled iPhone 12 Pro due to the higher quality camera poses produced with this approach in comparison to non-LiDAR-enabled cameras. Though more recent iPhone models have substantially higher quality cameras and LiDAR scanners, we still considered the photo and pose quality of the 12 Pro sufficient for evaluating our method. The images were 944 by 738 pixels, and we captured between two hundred and eight hundred images per room depending on the size of the room. We captured our test building in per-room segments due to the limited capacity of the LERF models at their normal scale and because it allowed us to have multiple samples for evaluating the approach. It would be possible to run this same pipeline with larger-scale captures, as there is no theoretical limit to modeling beyond the capacity of the capture device and the size of the neural network; in practice, however, smaller scenes tend to be more stable in training and yield better point clouds. To capture the scenes, we walked through each room with LiDAR and video running, and allowed Polycam to record the camera positions directly from the iPhone. After optimizing the camera positions relative to the LiDAR and images, the raw data format was a series of images, each paired with the relevant camera pose at that moment. We then took the raw data from these captures and utilized the Nerfstudio data processing pipeline to place them in the format accepted by the original LERF implementation which we built our approach on top of (this did not involve any change to the images or poses themselves, but did involve removing blurred images as mentioned in the pre-processing section).
4. Discussion
Though these techniques are not yet fully mature, the automatic modeling and embedded understanding of historical building models by deep learning-enabled techniques is already proving to be a useful tool. They are currently limited by their scale, but we expect that, with further improvement of the underlying algorithms and the propagation of high-memory GPU compute, these techniques will become able to model large buildings in their totality. In terms of expanding on this particular work, multiple avenues present themselves. One is to rely on other forms of capture (such as photogrammetry or laser scanning) for producing the underlying 3D models and then embedding the linguistic features after the fact. Similar techniques have been designed for other purposes, such as ConceptFusion [39] and CLIPFO3D [40], and would likely exhibit open set classification ability similar to LERFs. Though our ambition was to test the efficacy of novel generative methods in addition to utilizing the embedded classification ability, these projection-based techniques could be very useful when the accuracy of the underlying 3D model is paramount or where the object to be modeled is extremely large. Recently, language features have also been embedded into Gaussian splats (another technique for learning 3D representations from a set of poses and images), which also seems rather promising in terms of the quality of object segmentations [41]. A further solution would be to build a multimodal embedding model with a similar training regime to CLIP but which natively accepts 3D data and thus avoids the need to project linguistic information from images into the 3D space. This would involve assembling large datasets, marshalling a huge amount of compute, and developing innovative deep learning models; a substantial effort for even the most dedicated of researchers.
There is also the issue of point cloud inpainting, an active area of research in heritage preservation [42], automatic building modeling [43], and computer science more broadly [44]. Through the novel view synthesis ability of the underlying NeRFs in our approach, we did generatively fill particular areas of our point cloud where there were not enough reference images to construct a strong representation using an approach like photogrammetry; nevertheless, there were still substantial holes in our 3D models. Filling these missing components of our point clouds with other neural network or propagation-based inpainting approaches would likely have increased the realism of our automatically cleaned point clouds; this is an area for future work. As the integration of neural rendering and point cloud understanding techniques advances, it may be possible to inpaint more holistically and directly within the generation of the point cloud itself rather than as a post-processing step.
5. Conclusions
In this paper, we explored the capabilities of language-embedded radiance fields for the automatic generation of point clouds from images and for the removal of non-structural elements from those point clouds. By contextualizing our approach within the ongoing conservation of the Ancien Hôpital de Sion, we showed the current use and tremendous potential of point cloud understanding in the cultural heritage space. The linguistically aware modeling techniques presented here have a strong ability to classify a wide range of objects with a minimal level of manual intervention, making privacy-preserving modeling of the interiors of at-risk buildings a realistic possibility. Due to the inherent difficulty of manually removing objects in the real world or within an already constructed 3D model, this research enables faster, simpler modeling of the interiors of buildings, contributing to better models for general conservation and structural modeling in particular. This in turn has implications for our ability to protect and cherish our cultural heritage in the long run. There are multiple extensions and areas of improvement for this approach, including more complete inpainting of the removed areas and more precise underlying modeling, whether with NeRFs, Gaussian splats, or other linguistic projection techniques. Regardless, language-embedded radiance fields and their technological descendants will undoubtedly have an important impact on the field of cultural heritage preservation in years to come.