Article

Deep Learning-Driven Virtual Furniture Replacement Using GANs and Spatial Transformer Networks

1 Department of Information Systems, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur 50603, Malaysia
2 Department of Computer Science, University of Roehampton, London SW15 5PH, UK
3 Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
4 School of Information Technology, Whitecliffe College, Auckland 1010, New Zealand
* Authors to whom correspondence should be addressed.
Mathematics 2024, 12(22), 3513; https://doi.org/10.3390/math12223513
Submission received: 23 July 2024 / Revised: 14 October 2024 / Accepted: 29 October 2024 / Published: 11 November 2024
(This article belongs to the Special Issue Advances and Applications of Artificial Intelligence Technologies)

Abstract

This study proposes a Generative Adversarial Network (GAN)-based method for virtual furniture replacement within indoor scenes. The proposed method addresses the challenge of accurately positioning new furniture in an indoor space by combining image reconstruction with geometric matching, integrating spatial transformer networks with GANs. The system leverages Mask R-CNN for image segmentation and mask generation, and employs the DeepLabv3+ architecture, the EdgeConnect algorithm, and ST-GAN networks to carry out virtual furniture replacement. The proposed system gives furniture shoppers a virtual shopping experience, making it easier to judge the aesthetic effect of a furniture change without physically moving any furniture. The proposed system has practical applications in the furnishing industry and interior design practice, providing a cost-effective and efficient alternative to physical furniture replacement. The results indicate that the proposed method achieves accurate positioning of new furniture in indoor scenes with minimal distortion or displacement. The proposed system is limited to 2D front-view images of furniture and indoor scenes. Future work would involve synthesizing 3D scenes and expanding the system to replace furniture images photographed from different angles, enhancing the efficiency and practicality of the proposed system for virtual furniture replacement in indoor scenes.

1. Introduction

The visual perception of new furniture in a space and its proper fit within a given area are important considerations when moving furniture. Physically moving furniture to see how it looks can be time-consuming and physically demanding. Image processing algorithms can provide a virtual means of testing new furniture placement, but current methods using tools such as Photoshop (version 20) tend to produce patchy, unrealistic results. Can neural networks learn the pixel values for missing regions and patch them naturally? Can they create a composite image by replacing the existing furniture in an indoor scene with a new furniture image? Advances in image inpainting and image replacement using Generative Adversarial Networks (GANs) provide a solution to this problem. By leveraging large datasets of indoor scenes and furniture images, GANs can produce highly realistic and natural image reconstructions in a matter of minutes. This has potential for household users and furniture companies alike, providing a virtual shopping experience and allowing customers to see the effect of a furniture replacement before investing.
Deep learning techniques such as Convolutional Neural Networks (CNNs), together with large datasets such as ImageNet [1], can now identify and classify objects in images, including furniture items like sofas, chairs, and tables. We can use this capability to remove furniture from an image, which creates a missing region that must be reconstructed realistically through a process called image inpainting. However, without images of the background free of furniture, it is challenging to recreate the background scene naturally. While CNNs trained on indoor scenes can help [2], the reconstructed image may still look patchy. Generative Adversarial Networks (GANs) can address this by having a generator network estimate the missing pixel values and a discriminator network assess the correctness of the recreated image [3]. This adversarial competition between the two networks leads to highly realistic and accurate results, allowing GANs to reconstruct images naturally by drawing on elements of the adjoining background. Recent advances in image processing techniques [4] have made it possible to replace furniture in an indoor scene with a new piece through image inpainting and superimposition. However, accurately positioning a new furniture image within the space of the indoor scene remains a challenge. Existing methods rely on local extraction of background information [5] to fill the missing region after an existing furniture object is removed. Accurately placing a new furniture object in the scene, however, requires considering the physical dimensions of the room and of the furniture object, which has not been fully explored in existing algorithms. Without considering the actual dimensions, the output may not be realistic and accurate, which defeats the purpose of the furniture replacement. Therefore, accurate positioning of the new object is critical to achieving a realistic and accurate recreation of the indoor scene.
In this research paper, we explore the use of Generative Adversarial Networks (GANs) [6] to predict furniture placement in indoor scenes. GANs consist of two neural networks: a generator and a discriminator. The generator creates composite images of furniture within an indoor space, while the discriminator determines if the furniture is correctly positioned. The two networks compete, resulting in more accurate predictions of furniture placement. By combining image reconstruction and precise object placement, GANs could enable a range of applications for modifying indoor and outdoor scenes. Our goal is to investigate deep learning algorithms using GANs to achieve realistic results with both visual accuracy and proper positioning. This approach addresses the limitations of supervised training, which struggles to account for the countless possible furniture positions in an indoor environment.
The primary research objectives of this study are as follows:
  • To detect furniture objects in indoor images and create masks for them, effectively segmenting and isolating the objects from their backgrounds.
  • To reconstruct indoor images using background elements after removing furniture objects, allowing for the seamless integration of new furniture pieces.
  • To accurately position new furniture in indoor images, taking into account room and object dimensions to maintain realistic proportions and placements.
To accomplish these objectives, we will address the following research questions:
  • How can GANs be utilized to reconstruct missing parts of indoor images, considering various background elements such as walls and floors?
  • How can GANs incorporate multiple parameters, including room dimensions, furniture dimensions, and visual placement correctness, to accurately place furniture in indoor scenes?
The remainder of this paper is organized as follows. Section 2 presents a literature review of the current state of research. Section 3 describes the methodology employed to achieve the research objectives. Sections 4 and 5 present the results and discuss the implications of GANs, and Section 6 concludes the findings and outlines directions for future research.

2. Literature Review

Before Convolutional Neural Networks (CNNs) became established, image processing for object detection and classification was mainly conducted with hand-crafted filters, without the advantages of neural networks. In 2012, a CNN [1] won the ImageNet image classification competition, standing out for its high accuracy compared with other segmentation techniques such as image filters and Markov random fields.
Later, Fully Convolutional Networks (FCNs) [2] used convolution operations (multiplication of matrix-form inputs with matrix filters) to generate pixel-wise feature maps, which were then used to classify each pixel into classes. In May 2015, Ronneberger et al. devised the U-Net segmentation architecture [7], which extended the Fully Convolutional Network. Later, the encoder-decoder architecture was used in SegNet by Badrinarayanan et al. [8], which achieved greater pixel-level accuracy than FCNs because it used a set of encoding layers to extract features and decoding layers to recreate higher-resolution outputs; still, the computation time and memory requirements were large. Fast R-CNN used the concept of Regions of Interest (ROIs) to segment various regions of an image and then used a classifier to assign each region to an object category. Its advantage was that it used feature maps to generate a single region of interest for each object and then sent that region of interest to the classifier to identify the object, making it very effective for object detection. Mask R-CNN [9], built on the Faster R-CNN family, not only performs object detection and instance segmentation but also creates a segmentation mask for each object instance in an image. Mask R-CNN is therefore very useful for our project, as it helps detect furniture objects and create a mask for each of them. These masks are later used during image replacement, and we can use a Mask R-CNN model pre-trained on the MS COCO dataset, saving time and training effort.
Atrous (dilated) convolution was used in DeepLabv3 [10] to reduce the number of training parameters, resulting in faster predictions and lower memory requirements. DeepLabv3+ is a suitable model for object detection and classification because of its efficiency in assigning pixel-level labels to regions based on pre-trained models. Since our work deals only with indoor scene images, pre-trained models trained on indoor scenes and furniture objects can be used with DeepLabv3+ to obtain precise segmentation and labeling of regions.
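As a brief illustrative sketch (not code from this study), a dilated convolution in a framework such as PyTorch is obtained simply by setting the dilation argument; the filter then covers a larger receptive field without adding parameters:

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution and a dilated (atrous) 3x3 convolution.
# Both have the same number of parameters, but the dilated one covers
# a 5x5 receptive field (dilation=2) at the same cost.
standard = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
atrous = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 120, 160)           # dummy feature map (N, C, H, W)
print(standard(x).shape, atrous(x).shape)  # both preserve spatial size: (1, 64, 120, 160)
```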
Image inpainting using deep learning and GANs has grown tremendously over the last few years and is now among the most efficient and accurate approaches to filling in missing regions of an image. Pathak et al. [11] proposed Context Encoders, which generate missing regions of an image from their surroundings: an encoder encodes the image into a feature map, and a decoder uses it to complete the missing region, with a GAN used to produce realistic results. More accurate and realistic results were produced by the Globally and Locally Consistent Image Completion (GLCIC) architecture of Iizuka et al. [12], which used two adversarial networks, one trained on the global image and another trained only on the local region to be filled.
In 2019, Nazeri et al. [4] proposed EdgeConnect, which recreates images by first hallucinating the edges of the missing region; an image completion network then uses these edges to obtain pixel values for the missing pixels. For our objectives, since furniture objects are removed as a whole, the problem involves filling in a complete, often large, missing region. Moreover, the information needed to fill the region is available from the areas surrounding its edges, such as the wall and the floor. EdgeConnect is therefore a suitable algorithm for our project.
For image replacement, Ritchie et al. [23] proposed fast and flexible indoor scene synthesis using deep convolutional generative models, in which object placement takes into account spaces where objects should not be placed. Lee et al. [5] devised a learning procedure with two connected modules, one that finds the location for object insertion and another that determines how the inserted object should look; it learns to synthesize a new instance into a semantic map, but it operates on a top view of the indoor scene.
Zhang et al. [13] proposed a generative model called PlaceNet that predicts locations for placing an object in an indoor scene. PlaceNet and the procedure proposed by Ritchie et al. [23] are suited to our objectives, but neither explicitly considers the dimensions of objects in the indoor scene, which limits the accuracy of object placement.
Lee et al. [5] proposed a network consisting of two generator networks and four discriminator networks, where one GAN determines the location and scale of the new object and the other determines its shape and pose. However, it does not handle collisions between objects, which the authors left as future work.
Convolutional Neural Networks (CNNs) are very useful for generating feature maps by combining low-level features into high-level features and then classifying the feature maps to make predictions. However, CNNs do not capture the position and orientation of objects in their predictions; much of the spatial information is lost. This can be addressed with Spatial Transformer Networks (STNs), which provide spatial manipulation of data within a convolutional neural network. Jaderberg et al. [14] proposed Spatial Transformer Networks, which allow neural networks to geometrically transform images.
Generative Adversarial Networks (GANs) were proposed by Goodfellow et al. [6] as a machine learning framework for generating synthetic data with the same statistics as existing data using a combination of two neural networks: one network generates candidate data, and the other tries to classify it as real or fake; both networks learn and improve, producing increasingly realistic outputs. Different GAN architectures have since been devised to solve a wide variety of machine learning problems.
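To make the adversarial idea concrete, the following is a minimal toy sketch of a GAN training loop on synthetic 1-D data; it is purely illustrative and unrelated to the furniture data or the networks used in this study:

```python
import torch
import torch.nn as nn

# Toy 1-D GAN: the generator maps noise to samples, the discriminator
# classifies samples as real or fake; both improve through the minimax game.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, 1) * 2.0 + 3.0     # "real" data drawn from N(3, 2)
    fake = G(torch.randn(64, 16))             # generated samples

    # Discriminator step: push real samples toward 1 and fakes toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to fool the discriminator (fakes labeled as 1).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```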
Lin et al. [3] addressed the problem of applying realistic geometric corrections to an object placed within an indoor scene image using a novel GAN architecture named ST-GAN. ST-GAN combines Spatial Transformer Networks (STNs) with GANs to iteratively find the correct position of an object image when it is superimposed on a background image.
Our project requires inserting a new furniture object into an indoor scene image to replace an existing furniture object while ensuring that the positioning is realistic and accurate. We therefore need to handle the location and scale of the new object within the indoor scene and determine whether the available space is sufficient for it. First, the existing furniture object must be removed and the missing region recreated. To remove only the furniture object, a CNN trained on existing datasets can be used; Mask R-CNN is suitable for detecting furniture objects and generating their masks, and pre-trained model weights save time and effort on training. Next, the missing region has to be recreated so that it does not look as if the object was ever there, yielding an indoor scene ready to receive the new furniture object. This requires image inpainting algorithms, for which EdgeConnect [4] is suitable. DeepLabv3+ [10] is another CNN architecture that can perform semantic segmentation specifically of furniture objects within an image. We therefore use DeepLabv3+ for semantic image segmentation and object removal, EdgeConnect for filling in the missing region, and ST-GAN to insert the new object with the correct positioning and orientation. We improve upon the procedure adopted by Lin et al. (2018) [3] and use ST-GAN to accurately place the object. In ST-GAN, a Spatial Transformer Network (STN) applies geometric corrections to the image to obtain the correct orientation and pose, and a GAN then evaluates the accuracy of the positioning of the image within the indoor scene.
The existing literature on indoor scene generation offers various solutions for object detection, image inpainting, and object placement using deep neural networks (DNNs). However, it is essential to acknowledge that DNNs often encounter challenges due to noisy labels, which can interfere with accurate object recognition and placement. To address this, recent studies, such as “Cross-to-merge training with class balance strategy for learning with noisy labels”, have proposed methods for mitigating the impact of noisy labels [15,16,17,18,19]. This work introduces a strategy that balances classes and trains the network to handle incorrect labels, ultimately improving model robustness.
In this study, we used the discriminator in the ST-GAN, which is a classifier already trained using the SUNCG dataset [2,3,20,21,22,23,24]. This dataset includes labeled data, making it easier for the model to learn object placement in indoor scenes.
Therefore, in this research, we combine the DeepLabv3+ architecture with the EdgeConnect algorithm and the ST-GAN architecture to achieve greater accuracy in positioning new furniture images within an indoor scene.

3. Methodology

In this section, we present a comprehensive methodology for virtual furniture replacement in indoor scene images. Virtual furniture replacement is a complex process that involves several stages, ultimately leading to a new indoor scene image in which an existing furniture image is replaced with a new furniture image.
Figure 1 shows the process flow of our approach towards virtual furniture replacement. Our approach involves several stages: image semantic segmentation, mask creation for indoor scene furniture and new furniture objects, removal of indoor scene furniture, indoor scene image inpainting, and new furniture image placement in the indoor scene. We employ the Mask R-CNN model for image semantic segmentation, which allows us to create masks for furniture objects in both the indoor scene image and the new furniture image. Following this, we use the DeepLabv3+ architecture [10] and EdgeConnect [4] algorithm for image inpainting to fill in the missing regions created after removing the furniture object. Lastly, we implement a combination of Generative Adversarial Networks (GANs) and Spatial Transformer Networks (STNs) to accurately position the new furniture image within the inpainted indoor scene, achieving a realistic and natural-looking final output.

3.1. Image Semantic Segmentation

Image semantic segmentation is a critical step in this process, as it segments an image to create various regions and classifies each region as a specific category of objects. In the context of indoor scenes, an image may contain furniture and other objects that can be segmented and classified as sofas, chairs, and tables. The outcome of this procedure is a dictionary data structure, which encompasses the coordinates of different objects that are part of the image, the count of objects identified, their labels, and the masks for every object.
This process involves several steps, including resizing input images to 160 × 120 pixels and converting them to numpy arrays of size 160 × 120 × 3. The inputs are then sent to the Mask R-CNN model, which consists of a convolutional neural network that computes feature maps and proposes regions of interest (ROIs) for the objects within the input images. These ROIs are then sent to fully connected layers that act as a classifier, assigning the regions to categories such as chairs, tables, sofas, and others.
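A minimal sketch of this preprocessing step is shown below; the file names are placeholders for illustration and not the actual data paths used in this study:

```python
import numpy as np
from PIL import Image

def load_as_array(path, size=(160, 120)):
    """Resize an input image to 160x120 pixels and return it as an RGB numpy
    array. Note: PIL's size is (width, height), so the resulting array has
    shape (120, 160, 3), i.e. height x width x 3."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img)

scene = load_as_array("indoor_scene.jpg")        # hypothetical file names
furniture = load_as_array("new_furniture.jpg")
print(scene.shape, furniture.shape)
```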
Convolutional neural networks (CNNs) are critical in this process, as they convolve the input image with filters to obtain feature maps. Successive convolutions are conducted to obtain higher-level feature maps, leading to the labeling of regions such as floor, wall, ceiling, and furniture objects. CNNs are based on the operations of convolution and pooling: convolution is similar to matrix multiplication, while pooling extracts the important features from a feature matrix. Figure 2 shows the steps within a Convolutional Neural Network.
Mask R-CNN operates as an efficient model for the semantic segmentation of images. It is essentially a convolutional neural network that carries out tasks for object recognition, image segmentation, and object classification. The model builds on Faster R-CNN, which itself evolved from R-CNN (Region-based convolutional neural network).
Mask R-CNN is designed with three stages. In the first stage, the ResNet-101 neural network architecture is used to detect objects within the image. Next, a classifier categorizes the numerous Regions of Interest (ROIs) into objects of different classes and simultaneously establishes bounding boxes for each of these objects. The final stage forms masks for all the detected ROIs of each object.
The output from Mask R-CNN is composed of four elements, with the detection results for the furniture and indoor scene images saved in dictionaries referred to as results_f and results_i, respectively. Each of these dictionaries includes keys for the bounding boxes, class ids, masks, and scores. Here, ‘masks’ holds the masks of the identified objects, ‘class_ids’ holds the class integers of the recognized objects, and ‘scores’ indicates the confidence level or probability for each predicted class.
Thus, image semantic segmentation is a crucial step in the virtual furniture replacement process, and Mask R-CNN is an effective model for performing this task. The segmentation results are used to create masks for the furniture objects in the indoor scene image, which can then be used for removing the indoor scene furniture and for image inpainting.
The Mask R-CNN model is used to detect and classify furniture objects in both the indoor scene image and the new furniture image. This process involves the creation of masks for each detected object, which are saved as image files. These masks are essential for later use in image replacement, where the existing furniture object is replaced with a new furniture object.
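The following sketch illustrates this detection and mask-export step using the open-source Matterport Mask R-CNN implementation with MS COCO weights; the configuration class, file names, and paths are assumptions for illustration and are not the exact code used in this study:

```python
# Sketch of furniture detection and mask export with the Matterport
# Mask R-CNN implementation (https://github.com/matterport/Mask_RCNN),
# assuming the pre-trained MS COCO weights have been downloaded.
import numpy as np
import skimage.io
import mrcnn.model as modellib
from mrcnn.config import Config

class InferenceConfig(Config):          # hypothetical minimal config
    NAME = "coco_inference"
    NUM_CLASSES = 1 + 80                # background + 80 COCO classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)

scene = skimage.io.imread("indoor_scene.jpg")
results_i = model.detect([scene], verbose=0)[0]   # dict: 'rois', 'class_ids', 'scores', 'masks'

# 'masks' has shape (H, W, N); save the mask of the first detected object.
mask = results_i["masks"][:, :, 0].astype(np.uint8) * 255
skimage.io.imsave("scene_furniture_mask.png", mask)
```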

3.2. Image Inpainting

To remove the furniture object from the indoor scene image, the Image Inpainting algorithm is used. This process involves the generation of a mask for the furniture object, which is then used to remove the object from the image. The missing region that is created after the furniture object is removed needs to be filled in using Image Inpainting.
Figure 3 shows the various stages of the DeepLabv3+ architecture. The DeepLabv3+ architecture [10], which is based on ResNet-101, is used for object detection and classification in the image inpainting stage. This architecture uses atrous convolution, which allows for faster feature generation and lower computation costs. A pre-trained model is used, trained on millions of images from the ImageNet database with labeled furniture objects, so no further training is required to segment the furniture object to be removed.
When the furniture object is removed, a missing region is created. This missing region is filled using the EdgeConnect [4] algorithm. This algorithm uses the edges of the missing region to find the missing pixel values, such as the colors and textures of the background, walls, and floor. This algorithm uses the information from the surrounding background to recreate the image in a realistic way.
The EdgeConnect algorithm starts by hallucinating the edges of the missing region using the Canny Edge Detector, creating an edge map. This edge map is then used to estimate the pixel values of the missing region. A Convolutional Neural Network (CNN) computes the pixel values and determines a loss predicated on the pixel values adjacent to the boundaries. This calculated loss subsequently aids in fine-tuning the estimated values of the pixels within the absent region. Consequently, this results in the reconstitution of an image that closely resembles reality.
To create an accurate edge map, a Generative Adversarial Network (GAN) is used to assess the accuracy of the edge map. This leads to highly accurate results. Figure 4 shows the steps in the creation of an edge map.
The process of image inpainting using the EdgeConnect algorithm involves three steps: edge map creation, edge map assessment, and image completion. First, the indoor scene image containing the missing region due to the removal of the furniture object and the furniture object mask is passed through the Canny Edge detector to create an edge map. Next, a GAN is used to generate an accurate edge map. Finally, the incomplete image and the edge map are given to an image completion network consisting of a GAN to recreate the image with the pixel values from the surroundings, resulting in a realistic image without the furniture object.
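For illustration only, the first step (building an edge map of the incomplete image with the Canny detector) can be sketched with OpenCV; the thresholds and file names are assumptions, and the learned edge-generator GAN of EdgeConnect is not reproduced here:

```python
import cv2

scene = cv2.imread("inpaint_input.png", cv2.IMREAD_GRAYSCALE)    # scene with furniture removed
mask = cv2.imread("furniture_mask.png", cv2.IMREAD_GRAYSCALE)    # 255 inside the missing region

# Edge map of the known region; edges inside the missing region are unknown
# and must be hallucinated by EdgeConnect's edge-generator network.
edges = cv2.Canny(scene, 100, 200)
edges[mask > 0] = 0            # blank out the hole so only surrounding edges remain
cv2.imwrite("edge_map.png", edges)
```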

3.3. Image Replacement

The final step of our furniture replacement algorithm involves placing the image of the new furniture into the indoor scene in a realistic way while maintaining accurate dimensions and sizes. The primary challenge of this step is to ensure that the new image is positioned accurately within the available space created by removing the existing furniture object. To achieve this, we utilize Generative Adversarial Networks (GANs) and Spatial Transformer Networks (STNs) to accurately position the image within the indoor scene.
Spatial Transformer Networks (STNs) are a form of Convolutional Neural Networks (CNNs) that undertake geometric adjustments on an image. Feedback from a discriminator trained on a dataset of indoor scene images allows the STNs to learn and perform more accurate geometric corrections on an image. The discriminator is responsible for determining the accuracy of the geometric corrections applied to the furniture image. The ultimate goal is to accurately place the new furniture image onto the indoor scene image, ensuring that the final image appears realistic and natural.
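As a hedged illustration of the underlying mechanism (not the ST-GAN code itself), an STN applies a differentiable affine warp to an image; in PyTorch this is typically expressed with affine_grid and grid_sample, and the warp parameters can then be updated from the discriminator's loss feedback:

```python
import torch
import torch.nn.functional as F

def warp(image, theta):
    """Apply a batch of 2x3 affine warps to a batch of images (N, C, H, W).
    The warp is differentiable, so gradients from a downstream loss can
    update the warp parameters."""
    grid = F.affine_grid(theta, image.size(), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

furniture = torch.randn(1, 3, 120, 160)                   # dummy foreground image
theta = torch.tensor([[[0.9, 0.0, 0.05],                   # slight scale + translation
                       [0.0, 0.9, 0.10]]], requires_grad=True)
warped = warp(furniture, theta)                            # geometrically corrected foreground
```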
The ST-GAN algorithm incorporates a generator that is an STN and a discriminator that has been trained on the SUNCG dataset. This dataset includes 45,622 indoor scenes featuring 5 million 3D object instances across 37 categories. The ST-GAN takes the new furniture image and the inpainted background image as inputs. Within the GAN, the STN applies a geometric transformation to the new furniture image to achieve the correct orientation before passing it to the GAN’s discriminator, as shown in Figure 5.
The STN predicts a geometric correction Δp₁ using a geometric prediction network. The correction is determined based on the dimensions of the foreground furniture object, the indoor scene, the object’s views, and the semantic classification of the various regions in the layout of the indoor scene. Instead of one major correction to position the furniture object, small iterative corrections are performed, making pixel-level alignments to find the correct position. The STN thus generates a series of predictions Δp_i: at the i-th iteration, a new predicted state p_i is formed from the correction Δp_i.
Mathematically, this can be written as [3]
Δp_i = G_i(I_f(p_{i−1}), I_b),
where G_i is the geometric prediction network, I_f is the foreground image, and I_b is the background image [3], and
p_i = p_{i−1} ∘ Δp_i,
where p_i is the predicted state, p_{i−1} is the previous state, and ∘ denotes composing the previous warp state with the new correction.
As shown in Figure 6, in each iteration the Spatial Transformer Network (STN) produces a composite image that is then forwarded to the Generative Adversarial Network (GAN). With each iteration, the GAN’s discriminator, shown in Figure 7, provides loss feedback that improves the geometric prediction network. Initially, a specific dataset is used to train the generator G1; in subsequent iterations, the training of the generator is driven by the loss feedback from the discriminator. As a result, the generator within the ST-GAN behaves as a series of stacked generators: after each iteration i, a new generator network G_i is constructed with the new geometric update Δp_i, while the preceding generators remain unaltered. Therefore, in each subsequent iteration, only the newly formed generator G_i is updated. A new composite image, I_comp(p_i), derived from the predicted state p_i, is passed to the discriminator at every iteration, and the discriminator compares this image against the real data distribution of indoor scenes containing furniture.
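The sketch below mirrors this iterative update rule only; the geometric predictors G_i, the discriminator, and the warp helper (from the earlier sketch) are passed in as hypothetical stand-ins for the trained ST-GAN components:

```python
import torch

def compose(theta_prev, delta):
    """p_i = p_{i-1} ∘ Δp_i : compose two batches of 2x3 affine warps via
    their homogeneous 3x3 matrices."""
    def to_h(t):                                   # (N, 2, 3) -> (N, 3, 3)
        bottom = torch.tensor([0.0, 0.0, 1.0]).view(1, 1, 3).expand(t.size(0), 1, 3)
        return torch.cat([t, bottom], dim=1)
    return torch.bmm(to_h(theta_prev), to_h(delta))[:, :2, :]

def iterative_placement(fg, fg_mask, bg, predictors, discriminator, warp):
    """Sketch of the iterative scheme: each predictor G_i (non-empty list)
    proposes a correction Δp_i, the warp is composed, and the discriminator
    scores the composite (its loss would train G_i in ST-GAN)."""
    theta = torch.tensor([[[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]]])      # identity warp p_0
    for G_i in predictors:
        warped_fg = warp(fg, theta)                # I_f(p_{i-1})
        warped_mask = warp(fg_mask, theta)
        delta = G_i(warped_fg, bg)                 # Δp_i = G_i(I_f(p_{i-1}), I_b)
        theta = compose(theta, delta)              # p_i = p_{i-1} ∘ Δp_i
        composite = warped_fg * warped_mask + bg * (1 - warped_mask)
        realism_score = discriminator(composite)   # loss feedback for G_i
    return composite, theta
```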
The discriminator within the GAN is trained using a vast collection of indoor scene pictures to determine the exact positioning of the furniture image. This training leverages the SUNCG dataset, encompassing more than 40,000 indoor scene images populated with objects from over 30 varied categories.
Thanks to its training on the SUNCG dataset, the discriminator can assess the accuracy of the position of the new furniture image within the background image. This sequence is replicated four times, with the result being an output image that closely resembles the desired final product, in which the image of the new furniture is accurately integrated into the image of the indoor scene.
The process can be represented mathematically as [3]
I_comp = I_f ⊙ M_f + I_b ⊙ (1 − M_f),
where I_comp is the composite image, I_f is the furniture image, I_b is the background image, M_f is the mask of the furniture image, and ⊙ denotes pixel-wise multiplication.
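A minimal numpy sketch of this pixel-wise compositing is given below; the file names are placeholders, and the three images are assumed to have the same size:

```python
import numpy as np
from PIL import Image

fg = np.asarray(Image.open("warped_furniture.png").convert("RGB"), dtype=np.float32)
bg = np.asarray(Image.open("inpainted_scene.png").convert("RGB"), dtype=np.float32)
mask = np.asarray(Image.open("warped_furniture_mask.png").convert("L"), dtype=np.float32) / 255.0
mask = mask[..., None]                      # (H, W, 1) so it broadcasts over the RGB channels

# I_comp = I_f * M_f + I_b * (1 - M_f)
composite = fg * mask + bg * (1.0 - mask)
Image.fromarray(composite.astype(np.uint8)).save("composite.png")
```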
In essence, the integration of GAN and Spatial Transformer Networks forms a structured approach for accurately embedding the new furniture image within the background scene. This combination ensures a final outcome that is both realistic and visually appealing.

3.4. Deployment

The Image Replacement System was implemented using Google Colab, a browser-based tool for writing and running Python code, together with various Python libraries such as TensorFlow v1.0. The algorithms were divided into three stages: Mask R-CNN for image segmentation and mask creation, image inpainting, and ST-GAN for image replacement. Each stage was developed and tested in a separate virtual Python environment. The code was stored on Google Drive, where the input and output images were also saved.
The best deployment solution was found to be Anvil Works (https://anvil.works, accessed on 1 July 2024), a tool that can easily convert Google Colab code into a web app. The User Interface was created using ready-made components, and the input images were uploaded to the website. The User Interface also allowed the user to start the process of replacing the existing furniture object with a new one. Each stage’s intermediate image was displayed on the website, and the final output was shown after all stages were completed. Figure 8 shows the flow of data after deploying the code.
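As a hedged sketch of how such a deployment can be wired up with the Anvil Uplink, the snippet below exposes a callable function to the web app; the uplink key placeholder, file handling, and the run_pipeline wrapper are assumptions for illustration, not the exact code of this study:

```python
import anvil.server
import anvil.media

anvil.server.connect("YOUR-UPLINK-KEY")     # placeholder uplink key from the Anvil app

@anvil.server.callable
def replace_furniture(scene_media, furniture_media):
    """Called from the Anvil web UI with the two uploaded images; runs the
    (hypothetical) three-stage pipeline and returns the composite image."""
    with anvil.media.TempFile(scene_media) as scene_path, \
         anvil.media.TempFile(furniture_media) as furniture_path:
        output_path = run_pipeline(scene_path, furniture_path)   # hypothetical wrapper around the three stages
    return anvil.media.from_file(output_path, "image/png")

anvil.server.wait_forever()
```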
The whole process took approximately three minutes on average. Figure 9 shows the images displayed on the website during the mask creation process. Figure 10 shows the images displayed on the website after the image segmentation process, image inpainting process, and image replacement process using GAN. Figure 11 shows the final image obtained when the image replacement is performed without the use of GAN.
The final output with and without GAN was displayed to show the effects of using GAN for image placement. Once deployed as a data product, anyone could use the Image Replacement System to upload an indoor scene image and a new furniture object for a virtual replacement of the existing furniture object. The system was used to conduct experiments with different indoor scenes and furniture images.
To conduct the experiments, a set of indoor scene images with furniture and single furniture images were collected. For each experiment, a pair of an indoor scene image and a single furniture image were uploaded to the website. The output images and the intermediate images were saved for assessment.
The following images were assessed after each experiment:
  • Input furniture image mask,
  • Input indoor scene furniture image mask,
  • Inpainted image after removing furniture object,
  • Composite image after furniture replacement with GAN,
  • Composite image after furniture replacement without GAN.
Since the evaluation was qualitative, each output image was assessed visually and given a score between 1 and 3 to indicate the quality of the output, with 1 indicating low quality and 3 indicating high quality. Table 1 shows the criteria for each score and the dimensions of the qualitative assessment. Since there were 50 experiments and the maximum score for each quality parameter was 3, the final score for each stage was out of 150.
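A small sketch of how the per-stage totals and percentages reported later in Table 2 can be tallied is shown below; the score list is an illustrative placeholder, not the recorded experimental data:

```python
# Each stage is scored 1-3 per experiment; with 50 experiments the maximum
# per-stage total is 150.
def stage_percentage(scores, max_per_experiment=3):
    total = sum(scores)
    return total, round(100 * total / (max_per_experiment * len(scores)))

example_scores = [3, 2, 3, 1, 3] * 10          # 50 illustrative scores
total, pct = stage_percentage(example_scores)
print(f"total = {total}/150, score = {pct}%")
```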

3.5. Constraints

For the experiments, indoor scene images were selected with a single furniture object belonging to the categories of chairs and sofas. Additionally, furniture objects in both indoor scenes and new furniture images were photographed in front view, and furniture objects in the new furniture images belonged to any of the categories of sofas, chairs, and dining tables.

4. Results

In the scope of this research, a series of 50 tests were performed to substitute one piece of furniture with another in an indoor scene. Every individual experiment utilized an indoor scene image featuring a piece of furniture and a standalone furniture image as the input. Figure 12 shows a sample of the results of each test. It shows the input images, intermediate images, and the final composite image after virtual image replacement.
In Figure 12, the column “Input Furniture Images” shows the input indoor scene with existing furniture on top and the new furniture below. The column “Masks” shows the mask images of each of the input furniture images. The column “Inpainted Image” shows the inpainted indoor scene image after removing the existing furniture. The column “Image Replacement” shows the final image after replacing the existing furniture with the new furniture image.
The three stages of the framework—object detection and mask creation; image inpainting; and image replacement with and without GAN—were evaluated qualitatively based on the accuracy of object detection; the ability to recreate missing regions after removing the furniture object; and the accuracy of the new furniture object’s placement in the image. Table 2 shows the results based on the qualitative assessment.
For object detection and mask creation, the Mask R-CNN program accurately detected furniture objects in the categories of chairs, sofas, and cots, scoring 87% in accuracy. However, it was unable to create masks for the legs of tables and chairs. The program’s ability to recreate the missing regions after removing the furniture object was partially successful, particularly for chairs and single-seater sofas, but less effective for larger sofas. The recreated images had equivalent texture but slight blurring in some areas.
Using GAN for image replacement resulted in a more accurate and natural placement of the new furniture object compared to images without GAN. The difference was due to proper scaling and placement of the furniture object to make it look like a natural part of the indoor scene. Overall, the project scored 77% in terms of producing realistic results for replacing existing furniture objects with new ones. The use of GANs for image inpainting and virtual image placement was found to be advantageous.

5. Discussion

Object detection, classification, and segmentation have become accurate today thanks to large datasets of millions of labeled and annotated images. ImageNet is an example of such a database, with over a million labeled images of objects in more than 30 categories, including chairs, couches, beds, and dining tables. In the first stage of our project, Mask R-CNN is used for object detection. We used pre-trained model weights of a ResNet-101 model trained on the ImageNet collection of images; the model was downloaded and stored in a local directory.
With these pre-trained model weights, furniture object detection and classification were accurately conducted by the Mask R-CNN program without requiring further training. Object detection in a Convolutional Neural Network (CNN) is conducted by extracting features from an image: low-level features are detected through convolution and pooling and then combined to form high-level feature maps, which are passed through a classifier. The weights of the pre-trained model can be saved and reused later in any object detection program, minimizing time and effort. Mask R-CNN was able to use the pre-trained model to detect the regions occupied by the furniture object in the image and then create a mask over the object area. However, we found that the masks of the furniture object are sometimes only partially created, especially for thin structures such as the legs of a chair or a table.
For image inpainting, GANs are effective in creating highly realistic reconstructed images of missing regions, as shown in Figure 13. This is because GANs use a minimax game played between the generator and discriminator, where both try to improve themselves to the best possible level, leading to optimal changes by both neural networks. GANs require the object mask along with the edge map to remove the object accurately and recreate the missing region. Some of the images did not get accurately inpainted, as seen in Figure 14, when objects were not detected, as there was no object mask passed to the GAN.
In the image replacement stage, ST-GAN was used to position the furniture image accurately within the indoor scene. ST-GAN uses a Spatial Transformer Network (STN) to provide the geometric corrections required to fit a furniture image into the indoor scene. The STN does a geometric correction to make the furniture straight and position it accurately, and the composite image is then passed onto the GAN, which assesses it to identify if the positioning is correct. The discriminator used in the ST-GAN is a classifier already trained using the SUNCG dataset. This provides feedback to the generator to improve the network and arrive at a highly realistic output for the image replacement.
The advantage of using STNs is the ability to provide the geometric corrections required to fit a furniture image into the indoor scene. The advantage of using GANs is the iterative process, where the generator and discriminator improve their networks and arrive at a highly realistic output for image replacement. Without GANs, the positioning of the image is not as realistic or accurate. Figure 15 shows the difference between the final images after using GAN and before using GAN.

6. Conclusions

In this study, we developed a web application utilizing Generative Adversarial Networks (GANs) for virtual furniture replacement in indoor scenes. The application enables users to efficiently replace furniture objects in indoor images with just a few clicks, generating highly realistic composite images of new furniture placements.
We employed object detection and instance segmentation algorithms trained on extensive datasets, resulting in accurate detection and classification of furniture objects. By incorporating GANs into image processing techniques such as Deeplabv3+ and Edge Connect, our method effectively generates realistic images, even when objects are partially obscured or set against intricate backgrounds.
The GAN-based approach successfully positions new furniture objects within indoor scenes, performing precise geometric corrections and generating realistic composite images. Our web application offers a user-friendly platform for virtual furniture replacement, streamlining access to advanced GAN-based image processing algorithms. This tool proves valuable for professionals like designers, architects, and interior decorators who need to visualize furniture arrangements before implementing them in real-life settings.
In future research, the application could be expanded to handle indoor scenes with furniture objects photographed from different angles and viewpoints. Additionally, the technology could be adapted for 3D layout reconstruction by removing and replacing furniture objects within the scene. Our study serves as a foundation for further exploration of GAN applications in image replacement for indoor environments.

Author Contributions

Conceptualization, R.V.; Formal analysis, M.A.I.; Methodology, R.V.; Resources, I.A. and N.N.; Writing—original draft, R.V.; Writing—review and editing, M.A. and M.A.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under grant no. (GPIP-1424-611-2024).

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, under grant no. (GPIP-1424-611-2024). The authors, therefore, acknowledge with thanks DSR for technical and financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  2. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
  3. Lin, C.-H.; Yumer, E.; Wang, O.; Shechtman, E.; Lucey, S. ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 9455–9464. [Google Scholar]
  4. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.Z.; Ebrahimi, M. EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  5. Lee, D.; Yang, M.; Kautz, J. Context-Aware Synthesis and Placement of Object Instances. NeurIPS 2018, 1–11. [Google Scholar]
  6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  8. Badrinarayanan, V.V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  9. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  10. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  11. Pathak, D.; Donahue, J. Context Encoders: Feature Learning by Inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  12. Iizuka, S.; Simo-serra, E.; Ishikawa, H. Globally and Locally Consistent Image Completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  13. Zhang, L.; Wen, T.; Min, J.; Wang, J.; Han, D.; Shi, J. Learning Object Placement by Inpainting for Compositional Data Augmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  14. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025. [Google Scholar]
  15. Zhang, Q.; Zhu, Y.; Yang, M.; Jin, G.; Zhu, Y.; Chen, Q. Cross-to-Merge Training with Class Balance Strategy for Learning with Noisy Labels; Expert Systems with Applications; Elsevier: Amsterdam, The Netherlands, 2024. [Google Scholar]
  16. Song, H.; Kim, M.; Lee, J. Learning from Noisy Labels with Deep Neural Networks: A Survey. arXiv 2020, arXiv:2007.08199. [Google Scholar] [CrossRef] [PubMed]
  17. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Tsang, I.; Sugiyama, M. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. arXiv 2018, arXiv:1804.06872. [Google Scholar]
  18. Jiang, L.; Huang, D.; Liu, M.; Yang, W. Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020. [Google Scholar]
  19. Reed, S.E.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; Rabinovich, A. Training Deep Neural Networks on Noisy Labels with Bootstrapping. arXiv 2015, arXiv:1412.6596. [Google Scholar]
  20. Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic Scene Completion from a Single Depth Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  21. Zhao, H.; Puig, X.; Liu, S.; Zhu, S.; Torralba, A. Indoor Scene Generation from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  22. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savva, M. Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the IEEE International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017. [Google Scholar]
  23. Ritchie, D.; Wang, K.; Lin, Y. Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  24. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Figure 1. Virtual Furniture Replacement Process Flow.
Figure 2. Convolutional Neural Network [2].
Figure 3. DeepLab v3+ Architecture [10].
Figure 4. An edge map is created [4].
Figure 5. Structure of GAN for Image Replacement.
Figure 6. STN in ST-GAN [3] (p. 3).
Figure 7. Discriminator D in ST-GAN [3] (p. 4).
Figure 8. Data Flow Diagram.
Figure 9. Creation of Masks for Furniture Objects.
Figure 10. Furniture Object Removal and Image Replacement.
Figure 11. Image Obtained Without Use of GAN.
Figure 12. Virtual Furniture Replacement Results.
Figure 13. Inpainted Images.
Figure 14. Partially Inpainted Images.
Figure 15. Advantages of using GANs for Image Replacement.
Table 1. Qualitative Assessment.
Criterion | Low (1) | Medium (2) | High (3)
Object Detection | The furniture object is not detected. | The furniture object is detected partially. | The object is completely detected.
Image Inpainting | The furniture object in the indoor scene is not removed. | The furniture object in the indoor scene is partially removed, and the missing region is recreated partially. | The furniture object in the indoor scene image is almost completely removed, and the missing region is recreated correctly.
Image Replacement | The new furniture object is not inserted at all or is placed incorrectly in the new image. | The new furniture object is inserted but only partially correctly placed in the new image. | The new furniture object is inserted and placed accurately in the new indoor scene image.
Table 2. Results Based on Qualitative Assessment.
Stage | Total Score (out of 150) | Score (%)
Object detection and mask creation | 131 | 87%
Image Inpainting | 90 | 60%
Image Replacement with GAN | 126 | 84%
Overall Result | 115 | 77%
Image Replacement Without GAN | 30 | 20%
