4.1. Fashion Product Dataset and Search
The training and test datasets for men's and women's fashion products were collected by crawling thumbnail images from one of the major online shopping malls in Korea. As shown in Figure 5, fashion product images are well suited to verifying the applicability and flexibility of the proposed approach, because each image can carry several attributes at the same time.
The apparel images were searched and collected with regard to gender, type, and color. In total, 12 categories of men's clothes and 15 categories of women's clothes were searched (shown in Table 6 and Table 7, respectively). Additionally, each clothing category was crawled separately for five colors. Because there were numerous redundant and misclassified images, we first collected 2500 images for each sub-category. Duplicated images were then deleted by a program: among the 67,500 obtained images, 28,035 were identified as duplicates and removed. Furthermore, miscategorized images were manually removed by human operators; 677 images were not related to clothes, and 8922 images were misclassified. After these were removed, 29,866 images remained. The accuracy was calculated after the elimination of misclassified and redundant images, and we did not count duplicates as incorrect results. Excluding redundancy, the accuracy of the crawled images from the shopping mall was approximately 75%, as indicated by Table 7 and Table 8. This deduplication step is essential to prevent the proposed approach from training and testing on the same data simultaneously, and to prevent it from making wrong decisions, such as classifying a false answer as true or vice versa.
The datasets in Table 7 and Table 8 were randomly partitioned at a ratio of 5:1 into training and test sets: 24,885 training images and 4981 test images. We also balanced this ratio among categories when partitioning. Each sample used in training and testing consists of an image–query pair; several queries were randomly generated for each image, since many different queries can correspond to the same image. A sketch of such a split is given below.
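This is a minimal sketch of the per-category 5:1 split described above; the data layout (a dict of image–query pairs per category) and the helper names are assumptions, not the authors' code:

```python
import random

def split_dataset(pairs_by_category, test_ratio=1 / 6, seed=42):
    """pairs_by_category: {category: [(image, query), ...]} -> (train, test)."""
    rng = random.Random(seed)
    train, test = [], []
    for pairs in pairs_by_category.values():  # keep the 5:1 ratio per category
        shuffled = pairs[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_ratio)
        test.extend(shuffled[:cut])
        train.extend(shuffled[cut:])
    return train, test
```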
4.2. Experimental Analysis
The performance of the proposed approach was analyzed by using a confusion matrix as a performance evaluation index. The accuracy (ACC), precision (PPV), recall (TPR), and F1 score were used as performance indices, as shown below (Equations (1)–(4)) [32]. Additionally, the receiver operating characteristic (ROC) curve was used for performance visualization [33].

ACC = (TP + TN)/(TP + TN + FP + FN) (1)

PPV = TP/(TP + FP) (2)

TPR = TP/(TP + FN) (3)

F1 = 2 × PPV × TPR/(PPV + TPR) (4)
| | Model decision: True | Model decision: False |
|---|---|---|
| Actual class: True | TP | FN |
| Actual class: False | FP | TN |
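The four metrics can be computed directly from the confusion-matrix counts; the following helper restates Equations (1)–(4) as code:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the four indices from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # Equation (1): accuracy
    ppv = tp / (tp + fp)                   # Equation (2): precision
    tpr = tp / (tp + fn)                   # Equation (3): recall
    f1 = 2 * ppv * tpr / (ppv + tpr)       # Equation (4): F1 score
    return acc, ppv, tpr, f1

print(confusion_metrics(tp=80, fn=20, fp=10, tn=90))
```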
To evaluate the performance of the proposed method, various combinations of multimodal classifiers were applied, as shown in Table 9. The accuracy of fashion product retrieval was analyzed under four combination methods: concat, ptwise sum, ptwise product, and multimodal compact bilinear (MCB) pooling. Additionally, we analyzed the performance of the classifier according to the number of hidden layers and the type of combination. The combination methods and their characteristics are described as follows (a code sketch of all four follows the list):
- concat: This method concatenates the image feature from the CNN and the query feature from the seq2seq-based RNN into a single feature vector. The feature dimensions of the two vectors need not be the same. However, concatenation does not actually mix the modalities: each component of the combined vector still depends on either the image data or the text data alone. Simple concatenation therefore performs poorly on its own, but its performance improves as FC layers are added. This improvement appears to come from the added FC layers' capacity to interpret the combined data, rather than from the concat method itself.
- ptwise sum, ptwise product: In contrast to the concat method, these methods attempt to combine the image and query data properly, via a pointwise (or elementwise) sum or product. They compute the sum or product of elements at the same position, so the two feature tensors produced by the CNN and RNN must have the same shape. Unlike the ptwise sum, the ptwise product can be interpreted as an inner product of the two tensors, or as a statistical correlation between them. Thus, the ptwise product method exhibits relatively good performance even without FC layers.
- MCB pooling: This method is based on the assumption that the outer product of tensors yields better performance than the inner product. For example, if the image tensor and the text tensor each have a size of 256, the outer product of the two tensors has a dimension of 256 × 256 = 2^16, which is too large for the parameters to be learned. In MCB pooling, however, the two tensors are first projected into tensors of the same dimension [24]. The projected tensors are combined into a single tensor in the frequency domain using a fast Fourier transform (FFT) and then returned to another tensor through an inverse fast Fourier transform (IFFT). Because multiplication in the frequency domain has the same effect as convolution in the original domain, a similar result can be obtained while avoiding the explicit calculation of the outer product.
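The sketch below illustrates all four combination methods for two 256-dimensional feature vectors, using TensorFlow (which the authors use for implementation). The MCB sketch dimension of 1024 and the count-sketch construction follow the spirit of Fukui et al. [24] but are assumptions, not the paper's exact settings:

```python
import numpy as np
import tensorflow as tf

D, SKETCH_DIM = 256, 1024  # feature size and MCB output size (assumed values)

def sketch_matrix(d, out_dim, seed):
    # One random +-1 entry per input dimension implements a count-sketch
    # projection as a [d, out_dim] matrix.
    rng = np.random.RandomState(seed)
    m = np.zeros((d, out_dim), dtype=np.float32)
    m[np.arange(d), rng.randint(out_dim, size=d)] = rng.choice([-1.0, 1.0], size=d)
    return tf.constant(m)

M_IMG = sketch_matrix(D, SKETCH_DIM, seed=0)
M_TXT = sketch_matrix(D, SKETCH_DIM, seed=1)

def fuse(img_feat, txt_feat, method):
    """Combine image and query features of shape [batch, D]."""
    if method == "concat":          # [batch, 2D]; modalities stay separate
        return tf.concat([img_feat, txt_feat], axis=-1)
    if method == "ptwise_sum":      # [batch, D]; shapes must match
        return img_feat + txt_feat
    if method == "ptwise_product":  # [batch, D]; elementwise correlation
        return img_feat * txt_feat
    if method == "mcb":             # [batch, SKETCH_DIM]; approximates outer product
        a = tf.signal.fft(tf.cast(tf.matmul(img_feat, M_IMG), tf.complex64))
        b = tf.signal.fft(tf.cast(tf.matmul(txt_feat, M_TXT), tf.complex64))
        return tf.math.real(tf.signal.ifft(a * b))  # FFT product = convolution
    raise ValueError(method)

img = tf.random.normal([8, D])  # e.g., a batch of 8 image features
txt = tf.random.normal([8, D])
print(fuse(img, txt, "mcb").shape)  # (8, 1024)
```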
As shown in Table 9, the ptwise product + FC + FC method exhibited the best performance. The F1 scores of the methods ranged from 67.381 to 86.6. A larger number of FC layers yielded better results; however, adding FC layers increased the computation time. The performance of the multimodal combination methods increased in the following order: concat, ptwise sum, MCB pooling, and ptwise product. Although MCB pooling outperformed the ptwise product on the VQA problem [27], it showed lower performance in the proposed approach.
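As an illustration, here is a hypothetical Keras head for the best configuration (ptwise product followed by two FC layers and a sigmoid that scores whether a query matches an image); the layer widths, activations, and optimizer are assumptions:

```python
from tensorflow import keras

img_in = keras.Input(shape=(256,), name="image_feature")
txt_in = keras.Input(shape=(256,), name="query_feature")
fused = keras.layers.Multiply()([img_in, txt_in])         # ptwise product
x = keras.layers.Dense(512, activation="relu")(fused)    # FC layer 1
x = keras.layers.Dense(512, activation="relu")(x)        # FC layer 2
match = keras.layers.Dense(1, activation="sigmoid")(x)   # P(query matches image)

model = keras.Model([img_in, txt_in], match)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```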
Figure 6 shows the ROC curve for each combination presented in Table 9. The performance of each combination is indicated by the area under the curve (AUC). The x- and y-axes of the ROC curve both range over [0, 1]. An AUC closer to 1 (a curve closer to the top-left corner) indicates better performance, whereas an AUC below 0.5 indicates a worthless test. The concat and ptwise sum combinations exhibited the smallest AUCs, and the ptwise product combination exhibited the largest AUC.
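For reference, the ROC curve and AUC can be reproduced from the binary match labels and predicted scores, e.g., with scikit-learn (the arrays below are placeholders, not the paper's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # placeholder labels
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.1, 0.7, 0.5])  # placeholder scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
print("AUC =", roc_auc_score(y_true, y_score))     # area under the curve
```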
We compared the proposed approach with conventional methods indirectly. The proposed approach is unique in its multimodal learning of different types of queries and images for fashion product retrieval, so a direct comparison with conventional methods is impossible. Instead, two conventional methods and their variants without multimodal learning were designed, as shown in Table 10 and Table 11. They are similar to the proposed method except that they lack the query encoder; their input model is the image encoder in Table 1. Because these methods cannot accept query vectors, they need a new classification output defined over the dataset. In this paper, there are 135 classes of fashion products (two genders, twelve categories for men, fifteen categories for women, and five colors). Because the men's and women's categories differ, we merged them into 27 categories. As a result, 27 × 5 = 135 classes were defined for the single-label classification, and 75 and five classes were defined for the multi-label classification.
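For clarity, the single-label scheme maps each (category, color) pair to one of the 27 × 5 = 135 class indices; a minimal illustration:

```python
N_CATEGORIES = 27  # 12 men's + 15 women's categories, merged
N_COLORS = 5

def to_class(category_idx, color_idx):
    """Map (category, color) to one of the 27 x 5 = 135 single-label classes."""
    return category_idx * N_COLORS + color_idx

assert to_class(0, 0) == 0 and to_class(26, 4) == 134
```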
As shown in Table 11, the proposed approach outperformed the conventional methods regardless of the number of input models and FC out layers, as the conventional methods do not utilize multimodal information simultaneously. If the number of attribute types or factors increases, the number of classes increases drastically, so the conventional methods require substantial resources and their results deteriorate. In particular, the gap between precision and recall stems from the models' much higher rate of false-negative decisions, since there is only one true answer and 134 false answers. Nevertheless, a further study is still needed to compare the proposed approach directly with other methods of similar architecture.
4.3. Qualitative Results of Case Studies
In this subsection, we present the results of the proposed multimodality-based deep learning approach for fashion product retrieval.
Figure 7 shows the template for the search results. The 10 apparel images with the highest probability for the user query were retrieved from the test dataset and displayed sequentially, and the 10 images with the lowest probability were selected and displayed as well. The ptwise product method with two FC layers was used for the search, as it exhibited the best performance, as shown in Table 9. TensorFlow and Keras were used to implement the proposed approach [34,35].
Figure 8 shows the results of fashion product retrieval for a user query with a single keyword. The first example shows the search results for the query term "men". The 10 best matches include shirts, suits, and hoodies that men usually wear, while the 10 matches with the lowest probability contain clothes that only women wear (e.g., a dress). For the second example (the query term "skirt"), the top 10 matches include women's skirts, whereas the 10 worst matches include men's apparel. Finally, for the query term "green", all the best matches are green fashion products, regardless of gender, and the worst matches are red or white fashion products. This is because the distance between green and red in the image color space is large: the pairwise distances among green (0, 255, 0), red (255, 0, 0), and white (255, 255, 255) are larger than the distances of these colors to black (0, 0, 0), as verified below. Blue is also far from green, but the training dataset contained fewer blue products than red and white ones; we found that the blue products were likewise ranked far from the green ones.
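A quick check of this color-distance claim in RGB space:

```python
import math

def rgb_distance(a, b):
    """Euclidean distance between two RGB triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

green, red, white, black = (0, 255, 0), (255, 0, 0), (255, 255, 255), (0, 0, 0)
print(rgb_distance(green, red))    # ~360.6
print(rgb_distance(green, white))  # ~360.6
print(rgb_distance(green, black))  # 255.0 -- smaller, as the text states
```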
Boolean operators include AND, OR, and NOT. As shown in Figure 9, these logical operators were employed to determine whether the proposed method could interpret the queries properly. For the first and second queries in Figure 9, we checked whether the meanings of AND, OR, and NOT were properly understood. First, for the query "black AND NOT white", only black fashion products without the white color were retrieved. Second, for the OR operation, product images containing either blue or green were correctly retrieved. Similarly, Figure 10 shows the results of apparel retrieval for category-based queries. These search results indicate that the proposed approach correctly retrieves fashion products according to the queries.
Pseudo-SQL-based queries were employed to determine how well the proposed deep learning method can interpret complicated queries, thereby investigating the scalability and flexibility of the method (Figure 11). The first query—"SELECT * FROM men"—is similar in meaning to the single-keyword query "men" shown in Figure 8, although the format is completely different. The search results for this pseudo-SQL query are consistent with the results for the single-keyword query in Figure 8: the results with the highest probability mainly include men's clothes, whereas those with the lowest probability are women's clothes. For the second query—"SELECT * FROM men WHERE red"—the search focused on red apparel for men. The results with the highest probability were appropriate, and those with the lowest probability included women's clothing without the red color. For the last query—"SELECT * FROM men where blue AND shirt"—the results with the highest probability were blue men's shirts, as intended, and the 10 images with the lowest probability included women's clothes without the blue color.
We compared and analyzed the results for the three aforementioned types of queries. As shown in Figure 12, when the queries were designed to find black pants for men, they retrieved similar or identical images. The 10 results with the highest probability included men's black trousers, as intended, and the 10 results with the lowest probability mainly included women's clothing.
We also examined the result of a ternary Boolean operation. Boolean operations with three query terms were not used during training; a third term was arbitrarily added to evaluate the flexibility of the proposed approach. The results for the query "men AND pants AND black" were good, indicating that the proposed deep-learning network works correctly and flexibly even for user queries not included in the training.
We also investigated the fashion product retrieval results for queries that were not used during the training, such as the ternary Boolean query shown in Figure 12. We randomly mixed two query forms or specified nonexistent conditions (e.g., women's leggings among men's clothing). As shown in Figure 13, the query "men → coat AND shirt" is a mixture of the Boolean and category query forms; nevertheless, the search worked well. The query "men → leggings" is unusual, as no such fashion product exists for men. Consistent with the characteristics of leggings, the results included pants, whereas skirts and dresses had the lowest probability.