To achieve image composition, the network must extract feature information from both the foreground objects and the background image. From the background features, it should derive suitable placement information for the foreground objects, including their spatial position; from the foreground features, it should determine the sizes of the foreground objects and whether geometric angle adjustments are necessary. Ultimately, the goal is to produce the final composite images.
Hence, our FTOPNet comprises two stages: the first completes feature information extraction for the foreground objects and the background image, and the second performs the adversarial training required to achieve image composition. In the first stage, the foreground objects are obtained from the input source image and the object mask image. Spatial transformer networks (STNs) [8] are employed for geometric and spatial transformations so that the objects can adapt to various background images. Subsequently, the foreground objects’ encoder performs feature extraction and produces the foreground feature vector (FFV). When extracting features from the background image, rather than relying on direct global feature extraction, we adopt a feature extraction approach inspired by the Swin transformer (SwinViT) [18]. This method extracts features from the background image hierarchically and within different local windows, allowing the network to concentrate on patch blocks suitable for foreground object placement and reducing the computational overhead of obtaining the background feature vector (BFV).
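The geometric and spatial transformation of the foreground can be pictured with the standard STN sampling operations [8]. The following is a minimal PyTorch sketch; it assumes the affine parameters theta are predicted by an STN localization network, which is not shown here.

```python
import torch
import torch.nn.functional as F

def warp_foreground(fg: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Apply a 2x3 affine transform (scale, rotation, translation) to a foreground batch.

    fg:    (B, C, H, W) foreground objects, already masked out of the source image.
    theta: (B, 2, 3) affine parameters, assumed to come from an STN localization network.
    """
    grid = F.affine_grid(theta, fg.size(), align_corners=False)  # sampling grid
    return F.grid_sample(fg, grid, align_corners=False)          # warped foreground
```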
In the second stage, a simple decoder generates a position prediction from only the FFV and BFV. In contrast, the hybrid decoder receives not only the FFV and BFV but also the cross-attention foreground–background association feature vector. Simultaneously, a random vector sampled from a uniform distribution is introduced to enhance the diversity of the foreground object placement [5]. These decoder outputs encode the foreground object’s position coordinates in the background image together with its size. The final prediction result P is obtained by adding the two decoders’ outputs. During training, the accuracy of the encoder-generated results is enhanced by iteratively updating the focus area of the FBCAFFM. The FTOPNet encompasses these two stages, as depicted in Figure 1.
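For orientation, the two-stage flow can be summarized in the following PyTorch-style sketch. The module names, their interfaces, and the noise dimension are illustrative assumptions rather than the exact FTOPNet implementation.

```python
import torch

def ftopnet_forward(fg_img, fg_mask, bg_img, stn, fg_encoder, bfem,
                    fbcaffm, simple_decoder, hybrid_decoder, z_dim=16):
    """Sketch of the two-stage flow; all module interfaces are assumptions."""
    # Stage 1: feature extraction.
    fg_obj = stn(fg_img * fg_mask)                 # geometric/spatial transform of the foreground
    ffv = fg_encoder(fg_obj)                       # foreground feature vector (FFV)
    bfv = bfem(bg_img)                             # background feature vector (BFV)

    # Stage 2: placement prediction trained adversarially.
    cafv = fbcaffm(ffv, bfv)                       # cross-attention fusion feature (CAFV)
    z = torch.rand(fg_img.size(0), z_dim)          # uniform noise for placement diversity
    p_simple = simple_decoder(ffv, bfv, z)         # prediction from FFV and BFV only
    p_hybrid = hybrid_decoder(ffv, bfv, cafv, z)   # prediction that also uses the CAFV
    return p_simple + p_hybrid                     # final prediction P via addition
```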
3.1. Design of the Background Feature Extraction Module
PlaceNet [5], TERSE [3], HIC-GAN [4], and some other networks [20] typically extract global features directly from the background image. However, this approach is susceptible to the loss of local background feature information and may overlook regions suitable for foreground object placement. Therefore, in this work, we reconfigure the background image feature extraction process and introduce the BFEM to improve the accuracy of background feature extraction while reducing computational complexity. Our focus is on identifying regions within the background image that are conducive to foreground object placement, as opposed to excessively emphasizing irrelevant or unsuitable local regions. The BFEM is illustrated in Figure 2.
In the BFEM, a large branch and a small branch extract features at different scales from the background image. In the large branch, the input image is divided into patches by convolution; these patches are flattened and sent to a linear projection layer, where positional information for each patch is added to the large embedding, and the feature vector is then obtained using the large transformer encoder. Similarly, the small branch performs a finer slicing of the image with a smaller patch size, and its feature vector is derived through the small transformer encoder. To fuse the features of the small and large branches across scales, the two sets of embedded patches are merged, and the fused feature vector is generated by the base transformer. These three feature vectors are concatenated to produce the final feature vector of the background image. The purpose is to fully fuse the large-patch, small-patch, and fused feature information, which enables the network to better discover background areas suitable for foreground object placement during training.
The input of the transformer encoder is as follows:
$$x_{p} \in \mathbb{R}^{N \times \left(p^{2} \cdot C\right)}, \qquad N = \frac{HW}{p^{2}},$$
where $C$ represents the number of channels, $p$ denotes the size of each patch in the large and small branches, $H$ and $W$ are the height and width of the background image, and $N$ is the resulting number of patches.
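As a rough illustration of the dual-branch extraction described above, the sketch below embeds the background with two patch sizes, fuses the merged token sequences with a base encoder, and concatenates the three pooled vectors into the BFV. The patch sizes, depths, embedding dimension, and the mean-pooling step are assumptions for illustration, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class DualBranchBFEM(nn.Module):
    """Illustrative two-branch background feature extractor (not the exact BFEM)."""

    def __init__(self, in_ch=3, dim=256, large_patch=16, small_patch=8):
        super().__init__()
        # Strided convolutions act as patch splitting + linear projection.
        self.embed_large = nn.Conv2d(in_ch, dim, large_patch, stride=large_patch)
        self.embed_small = nn.Conv2d(in_ch, dim, small_patch, stride=small_patch)

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)

        self.large_encoder = encoder()   # large-patch branch
        self.small_encoder = encoder()   # small-patch branch
        self.base_encoder = encoder()    # fuses the merged token sequences

    def forward(self, bg):                                        # bg: (B, 3, H, W)
        tl = self.embed_large(bg).flatten(2).transpose(1, 2)      # (B, N_large, dim)
        ts = self.embed_small(bg).flatten(2).transpose(1, 2)      # (B, N_small, dim)
        vl = self.large_encoder(tl)                               # large-branch features
        vs = self.small_encoder(ts)                               # small-branch features
        vm = self.base_encoder(torch.cat([vl, vs], dim=1))        # fused features
        # Pool each branch and concatenate into the background feature vector (BFV).
        return torch.cat([vl.mean(1), vs.mean(1), vm.mean(1)], dim=-1)
```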
3.2. Design of the Foreground–Background Cross-Attention Feature Fusion Module
After extracting the feature information of the foreground object and the background image, simply decoding the placement position and size based on the feature information of both often results in suboptimal outcomes. Therefore, we designed FBCAFFM to enhance the accuracy of foreground object placement and improve the visual realism of composites.
Inspired by the cross-attention module (CAM) employed in the cross-attention multi-scale vision transformer (CrossViT) [19], which splits images into different scales and applies cross-attention along two paths (a small branch and a large branch) with a transformer encoder to derive the final classification result, we adopt a different approach in FBCAFFM. Here, we directly encode the foreground object into the FFV using an encoder. Simultaneously, the background image is encoded along three paths (large, small, and merge) within the BFEM to obtain the BFV. These FFVs and BFVs are then fed directly into the large and small branches of the CAM. Additionally, CLS (class) tokens are introduced to learn abstract feature information within their respective branches. This setup allows the branches to exchange information across different scale levels (background patches), facilitating the extraction of background features that are more conducive to foreground object integration. The FBCAFFM is illustrated in Figure 3.
The FFV and BFV obtained from the feature extraction stage are first reshaped to match the input feature dimensions of the FBCAFFM. With the addition of the CLS tokens, the resulting token sequences form the inputs of the small branch and the large branch, respectively. Then, position embeddings are added, and a multi-scale transformer is employed for feature fusion. Finally, to ensure that the output fusion feature vector can be directly accepted by the decoder in the second stage, its dimension is set to $s \times d$, where $s$ and $d$ denote the number of sampled random variables [5] and the embedding dimension, respectively.
As a result, the cross-attention processing can be expressed as follows:
$$q = x'_{cls}W_{q}, \quad k = x'W_{k}, \quad v = x'W_{v}, \qquad \mathrm{CA}(x') = \mathrm{softmax}\!\left(\frac{qk^{\top}}{\sqrt{C/H}}\right)v,$$
where $x^{s}$ and $x^{l}$ represent the FFV/BFV from the small/large branches and concatenate their respective CLS tokens, as follows:
$$x'^{s} = \left[\, f^{s}\!\left(x^{s}_{cls}\right) \,\Vert\, x^{l} \,\right], \qquad x'^{l} = \left[\, f^{l}\!\left(x^{l}_{cls}\right) \,\Vert\, x^{s} \,\right].$$
This is to obtain the fusion feature vector. Following CrossViT [19], $f^{s}(\cdot)$ and $f^{l}(\cdot)$ are projections to align dimensions, $[\,\cdot\,\Vert\,\cdot\,]$ denotes the operation of concatenation, $C$ and $H$ are the embedding dimension and the number of heads, and $W_{q}$, $W_{k}$, and $W_{v}$ are the linear learnable weights for the query, key, and value, respectively.
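A minimal sketch of this CLS-token cross-attention is given below, assuming PyTorch's built-in multi-head attention. The dimensions, head count, and projection layers are assumptions, and positional embeddings are omitted.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: the small-branch CLS token attends over the large-branch tokens."""

    def __init__(self, dim_s=256, dim_l=512, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(dim_s, dim_l)    # f(.): align the small CLS to the large dim
        self.proj_out = nn.Linear(dim_l, dim_s)   # back-projection after attention
        self.attn = nn.MultiheadAttention(dim_l, heads, batch_first=True)

    def forward(self, x_small, x_large):
        # x_small: (B, 1 + N_s, dim_s), CLS at index 0 (e.g., the FFV tokens)
        # x_large: (B, 1 + N_l, dim_l), CLS at index 0 (e.g., the BFV tokens)
        cls_s = self.proj_in(x_small[:, :1])                    # f(x_cls): (B, 1, dim_l)
        ctx = torch.cat([cls_s, x_large], dim=1)                # [f(x_cls) || x_large]
        fused, _ = self.attn(query=cls_s, key=ctx, value=ctx)   # cross-attention
        cls_new = x_small[:, :1] + self.proj_out(fused)         # residual back-projection
        return torch.cat([cls_new, x_small[:, 1:]], dim=1)      # updated small branch
```

The symmetric direction (the large-branch CLS attending over the small-branch tokens) follows by swapping the roles of the two inputs.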
After obtaining the cross-attention feature vector (CAFV) from the FBCAFFM, we follow the conditional GAN loss [21] and define a fusion adversarial loss that enables the network to fuse the key feature information of the foreground and background and to use it for placement location prediction. It is defined as follows:
$$\mathcal{L}_{\mathrm{fadv}}(G,D) = \mathbb{E}_{f,b,y}\!\left[\log D\!\left(y \mid F_{f}, F_{b}, F\right)\right] + \lambda_{1}\,\mathbb{E}_{f,b,z}\!\left[\log\!\left(1 - D\!\left(\hat{y}_{1} \mid F_{f}, F_{b}, F\right)\right)\right] + \lambda_{2}\,\mathbb{E}_{f,b,z}\!\left[\log\!\left(1 - D\!\left(\hat{y}_{2} \mid F_{f}, F_{b}, F\right)\right)\right],$$
where $D$ denotes the discriminator, $G$ denotes the generator, $f$ denotes the foreground, $b$ denotes the background, $y$ denotes the ground truth of the placement, $z$ is the random variable drawn from a uniform distribution, $F_{f}$, $F_{b}$, and $F$ denote the fusion features of the foreground, the background, and both, $\hat{y}_{1}$ and $\hat{y}_{2}$ denote the predicted placements, and $\lambda_{1}$ and $\lambda_{2}$ denote the hyperparameters, which we set to 0.9 and 0.9.
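As a rough illustration of a conditional adversarial objective of this kind (not the exact loss above), the following sketch scores the real and predicted placements conditioned on the fused features; the discriminator interface is an assumption, and the 0.9 weights mirror the hyperparameters mentioned above.

```python
import torch
import torch.nn.functional as F

def fusion_adversarial_loss(disc, fused, y_real, y_pred1, y_pred2, lam1=0.9, lam2=0.9):
    """Conditional GAN-style losses over placements; disc(placement, condition) -> logits."""
    real = disc(y_real, fused)
    fake1 = disc(y_pred1.detach(), fused)
    fake2 = disc(y_pred2.detach(), fused)
    bce = F.binary_cross_entropy_with_logits
    # Discriminator: real placements -> 1, predicted placements -> 0.
    d_loss = (bce(real, torch.ones_like(real))
              + lam1 * bce(fake1, torch.zeros_like(fake1))
              + lam2 * bce(fake2, torch.zeros_like(fake2)))
    # Generator: make both predicted placements look real.
    g_loss = (lam1 * bce(disc(y_pred1, fused), torch.ones_like(real))
              + lam2 * bce(disc(y_pred2, fused), torch.ones_like(real)))
    return d_loss, g_loss
```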
In addition, a reasonable prediction of foreground object placement is not our ultimate goal; we expect the network to learn further placement information that satisfies diverse requirements without loss of plausibility. Therefore, we first randomly sample the BFV and FFV obtained in the feature extraction stage along different dimensions and dimensionally reconstruct them by fusing the corresponding feature vectors. The further fusion of the two, together with the CAFV from the FBCAFFM, constructs the reconstruction fusion loss, which we denote as $\mathcal{L}_{\mathrm{rf}}$. Meanwhile, following the diversity loss of PlaceNet [5], we consider the placement prediction results $\hat{y}_{1}$ and $\hat{y}_{2}$ from the different generators. Together with the fusion result $y$, more placement schemes can be obtained by calculating the variation of pairwise distances with respect to the random variables [22]. Unlike the single diversity loss [5], which computes the distance between the predicted outcome $y$ and $z$, we merge the distances of the different predicted outcomes $\hat{y}_{1}$, $\hat{y}_{2}$, and $y$ with respect to $z$, which we denote as $\mathcal{L}_{\mathrm{fdiv}}$. Finally, the fusion diversity loss is as follows:
$$\mathcal{L}_{\mathrm{fdiv}} = \gamma \sum_{i,j}\left(\sum_{u \in \{\hat{y}_{1},\,\hat{y}_{2},\,y\}}\left\| \bar{D}^{\,u}_{ij} - \bar{R}^{\,z}_{ij} \right\|_{1} + \left\| \bar{F}_{ij} - \bar{R}^{\,z}_{ij} \right\|_{1}\right),$$
where $\hat{y}_{1}$ and $\hat{y}_{2}$ denote the predicted locations, $z$ denotes the random variable drawn from a uniform distribution, $F_{f}$ and $F_{b}$ denote the feature vectors of the foreground and background, $F$ denotes the fusion feature vector, $N$ and $S$ denote the different numbers of sampled random variables in $\bar{D}$ and $\bar{R}$, $i$ and $j$ indicate the sample indices, and $\gamma$ denotes the hyperparameter that we set to 0.3. Following PlaceNet [5], $\bar{D}$, $\bar{R}$, and $\bar{F}$ are the normalized pairwise distance matrices; they are defined as follows:
$$\bar{D}^{\,u}_{ij} = \frac{\left\| u_{i} - u_{j} \right\|_{1}}{\sum_{k}\left\| u_{i} - u_{k} \right\|_{1}}, \qquad \bar{R}^{\,z}_{ij} = \frac{\left\| z_{i} - z_{j} \right\|_{1}}{\sum_{k}\left\| z_{i} - z_{k} \right\|_{1}}, \qquad \bar{F}_{ij} = \frac{\left\| F_{i} - F_{j} \right\|_{1}}{\sum_{k}\left\| F_{i} - F_{k} \right\|_{1}}.$$
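A compact sketch of these quantities, assuming L1 distances and a list of prediction tensors, is shown below; it illustrates the normalized pairwise-distance idea rather than the exact loss.

```python
import torch

def normalized_pairwise_distances(x: torch.Tensor) -> torch.Tensor:
    """Row-normalized pairwise L1 distance matrix for samples x of shape (n, d)."""
    d = torch.cdist(x, x, p=1)                      # (n, n) pairwise L1 distances
    return d / (d.sum(dim=1, keepdim=True) + 1e-8)  # normalize each row

def fusion_diversity_loss(preds, z, gamma=0.3):
    """Match the spread of each prediction set to the spread of the random inputs.

    preds: list of (n, d) prediction tensors (e.g., the two decoder outputs and y);
    z:     (n, d_z) random vectors drawn from a uniform distribution.
    """
    r_z = normalized_pairwise_distances(z)
    return gamma * sum((normalized_pairwise_distances(p) - r_z).abs().mean() for p in preds)
```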
After designing the fusion adversarial loss and the fusion diversity loss, we use $\theta_{G}$ and $\theta_{D}$ to represent the learnable weights in $G$ and $D$, so the optimization objective can be expressed as follows:
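In practice, such a min–max objective is optimized by alternating discriminator and generator updates. The sketch below assumes the loss helpers above and a generator that returns the two placement predictions together with the fused feature vector; these interfaces and the noise dimension are assumptions for illustration.

```python
import torch

def train_step(generator, discriminator, opt_g, opt_d, fg, bg, y_real,
               adv_loss_fn, div_loss_fn, z_dim=16):
    """One alternating update over the discriminator and generator weights."""
    z = torch.rand(y_real.size(0), z_dim)        # uniform random vectors

    # Discriminator step.
    with torch.no_grad():
        y1, y2, fused = generator(fg, bg, z)
    d_loss, _ = adv_loss_fn(discriminator, fused, y_real, y1, y2)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator while keeping placements diverse.
    y1, y2, fused = generator(fg, bg, z)
    _, g_adv = adv_loss_fn(discriminator, fused, y_real, y1, y2)
    g_loss = g_adv + div_loss_fn([y1, y2], z)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```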