Intuitively Searching for the Rare Colors from Digital Artwork Collections by Text Description: A Case Demonstration of Japanese Ukiyo-e Print Retrieval
Abstract
1. Introduction
1.1. Retrieval of Artwork
1.2. Human Senses and Colors
1.3. Contributions of Our Work
- A new retrieval framework is proposed for the word-based color retrieval of artwork. The framework utilizes the cross-modal multi-task fine-tuning method on CLIP.
- We propose a new artwork color descriptor and project the color information into the text feature space, so that similar colors can be found, in line with human senses, within a textual semantic space.
- We adapt IDF (inverse document frequency), a commonly used weighting method, to extract image color information, and we propose a label generation method for finding the rarest colors.
- A training data sampling method using a sketch structure is proposed, where images with the same structure can learn more similar feature vector representations.
- We apply the proposed method to retrieve ukiyo-e prints with two modes of color selection: direct main-color retrieval using a color selection board, and rare-color retrieval using text descriptions. By modifying the training setting, the main color can also be retrieved via a language description.
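As a rough illustration of the IDF-based rare-color idea in the contributions above, the sketch below scores each extracted color by its inverse document frequency over the whole collection, so that colors appearing in few images rank as "rare". The function and data names are hypothetical, and the paper's actual label generation may differ in detail:

```python
from collections import Counter
import math

def rarest_colors(doc_colors, top_k=3):
    """Rank each image's colors by inverse document frequency.

    `doc_colors` maps image id -> set of extracted color names.
    Colors that occur in few images get a high IDF and are
    treated as 'rare' for that image.
    """
    n_docs = len(doc_colors)
    df = Counter()
    for colors in doc_colors.values():
        df.update(set(colors))                      # document frequency per color
    idf = {c: math.log(n_docs / df[c]) for c in df} # rare colors -> high IDF
    return {
        img: sorted(colors, key=lambda c: idf[c], reverse=True)[:top_k]
        for img, colors in doc_colors.items()
    }

# Toy collection: 'dayflower blue' appears in only one print, so it is rarest.
docs = {
    "print_a": {"indigo", "beni red", "dayflower blue"},
    "print_b": {"indigo", "beni red"},
    "print_c": {"indigo"},
}
ranked = rarest_colors(docs, top_k=1)
```

With this toy input, `ranked["print_a"]` is `["dayflower blue"]`, since that color occurs in only one of the three images.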
2. Related Work
2.1. Artwork Retrieval Related Work
2.2. Related Deep Learning Models
3. Methodology
3.1. The Space Sampler Module
3.1.1. Image Sketch Extraction
3.1.2. Histogram of Oriented Gradients (HOG) Feature Extraction
3.1.3. Triplet Data Sampling
3.2. Text and Label Generator
3.2.1. Color Information Extraction
3.2.2. IDF Calculation
3.3. Multi-Task Fine-Tuning on the CLIP Model
3.3.1. Cross-Modal Fine-Tuning with Cosine Similarity-Based Pairwise Loss
3.3.2. The Fine-Tuning Image Encoder with Triplet Loss
4. Experiments and Results
4.1. Datasets and Basic Experimental Setup
1. the CN dataset and corresponding RGB color cards, where the color cards were extracted according to their HEX codes and used as input for image feature extraction; and
2. the 100 most recent color descriptions obtained from colornames.org [34], which provides a color naming interface. We also collected the color cards corresponding to these descriptions but used them only for visualizing text embeddings. The collected data samples for evaluation are shown in Table 5.
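A color card like those paired with the CN dataset can be rendered directly from its HEX code. The stdlib-only sketch below builds the flat RGB patch as a nested list; the function names are ours, and the paper's actual preprocessing (patch size, image format) may differ:

```python
def hex_to_rgb(hex_code):
    """Parse a HEX color code such as '#b0bf1a' into an (R, G, B) tuple."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def color_card(hex_code, size=(224, 224)):
    """A solid color card as rows of RGB pixels, the kind of flat
    patch fed to an image encoder (plain arrays, no imaging library)."""
    rgb = hex_to_rgb(hex_code)
    w, h = size
    return [[rgb] * w for _ in range(h)]

card = color_card("#b0bf1a")  # 'Acid green' from the CN dataset
```

In practice such an array would be saved or converted with an imaging library (e.g. Pillow) before being passed to the image encoder.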
4.2. Training Process
4.3. Evaluation Experiments Using Representations Extracted from the Fine-Tuned Text Encoder
4.3.1. Pearson Correlation Coefficient Calculation on the HCN and CN Datasets
4.3.2. Example of Color-Based Similar Description Retrieval by Using Features Extracted from Pre-Trained Language Models
4.4. Evaluation Experiments Using Representations Extracted from a Fine-Tuned Image Encoder
4.4.1. Pearson Correlation Coefficient Calculation on the CN Dataset and the Color Card Image Dataset
4.4.2. Similarity Score Calculation on the Same Ukiyo-e Print Digitized at the Different Institutions
4.4.3. Example of Comparison on Image-to-Image Retrieval
4.4.4. Example of Rare Color Retrieval
4.5. Demo Application Implementation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Serra, J.; Garcia, Á.; Torres, A.; Llopis, J. Color composition features in modern architecture. Color Res. Appl. 2012, 37, 126–133. [Google Scholar] [CrossRef]
- Mojsilovic, A. A computational model for color naming and describing color composition of images. IEEE Trans. Image Process. 2005, 14, 690–699. [Google Scholar] [CrossRef] [PubMed]
- Cotte, M.; Susini, J.; Metrich, N.; Moscato, A.; Gratziu, C.; Bertagnini, A.; Pagano, M. Blackening of Pompeian cinnabar paintings: X-ray microspectroscopy analysis. Anal. Chem. 2006, 78, 7484–7492. [Google Scholar] [CrossRef] [PubMed]
- Stepanova, E. The impact of color palettes on the prices of paintings. Empir. Econ. 2019, 56, 755–773. [Google Scholar] [CrossRef] [Green Version]
- He, X.F.; Lv, X.G. From the color composition to the color psychology: Soft drink packaging in warm colors and spirits packaging in dark colors. Color Res. Appl. 2022, 47, 758–770. [Google Scholar] [CrossRef]
- Sasaki, S.; Webber, P. A study of dayflower blue used in ukiyo-e prints. Stud. Conserv. 2002, 47, 185–188. [Google Scholar] [CrossRef]
- Demo Application Implementation of Color based Ukiyo-e Print Retrieval. Available online: http://color2ukiyoe.net/ (accessed on 4 July 2022).
- Art Research Center, Ritsumeikan University. 2020. ARC Ukiyo-e Database, Informatics Research Data Repository, National Institute of Informatics. Available online: https://doi.org/10.32130/rdata.2.1 (accessed on 4 July 2022).
- Yelizaveta, M.; Tat-Seng, C.; Irina, A. Analysis and retrieval of paintings using artistic color concepts. In Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6 June 2005; IEEE: Piscataway, NJ, USA, 2005. [Google Scholar]
- Smith, J.R.; Chang, S.-F. Tools and techniques for color image retrieval. In Storage and Retrieval for Still Image and Video Databases; International Society for Optics and Photonics: Bellingham, WA, USA, 1996; Chapter 4; Volume 2670. [Google Scholar]
- Collomosse, J.; Bui, T.; Wilber, M.J.; Fang, C.; Jin, H. Sketching with style: Visual search with sketches and aesthetic context. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Handpicked Color Names. Available online: https://github.com/meodai/color-names (accessed on 31 May 2022).
- Ranatunga, D.; Gadoci, B. Color-Names. Available online: https://data.world/dilumr/color-names (accessed on 31 May 2022).
- Newall, M. Painting with impossible colours: Some thoughts and observations on yellowish blue. Perception 2021, 50, 129–139. [Google Scholar] [CrossRef] [PubMed]
- Imgonline. Available online: https://www.imgonline.com.ua/eng/ (accessed on 6 June 2022).
- DeepAI: Image-Similarity Calculator. Available online: https://deepai.org/machine-learning-model/image-similarity (accessed on 31 May 2022).
- Goodall, S.; Lewis, P.H.; Martinez, K.; Sinclair, P.A.S.; Giorgini, F.; Addis, M.J.; Boniface, M.J.; Lahanier, C.; Stevenson, J. SCULPTEUR: Multimedia retrieval for museums. In Proceedings of the International Conference on Image and Video Retrieval, Singapore, 21–23 July 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 638–646. [Google Scholar]
- Sharma, M.K.; Siddiqui, T.J. An ontology based framework for retrieval of museum artifacts. In Proceedings of the 7th International Conference on Intelligent Human Computer Interaction, Pilani, India, 12–13 December 2016; Elsevier: Amsterdam, The Netherlands, 2016; pp. 176–196. [Google Scholar]
- Falomir, Z.; Museros, L.; Sanz, I.; Gonzalez-Abril, L. Categorizing paintings in art styles based on qualitative color descriptors, quantitative global features and machine learning (QArt-Learn). Expert Syst. Appl. 2018, 97, 83–94. [Google Scholar] [CrossRef]
- Kim, N.; Choi, Y.; Hwang, S.; Kweon, I.S. Artrieval: Painting retrieval without expert knowledge. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1339–1343. [Google Scholar]
- Companioni-Brito, C.; Mariano-Calibjo, Z.; Elawady, M.; Yildirim, S. Mobile-based painting photo retrieval using combined features. In Proceedings of the International Conference Image Analysis and Recognition, Waterloo, ON, Canada, 27–29 August 2018; Springer: Cham, Switzerland, 2018; pp. 278–284. [Google Scholar]
- Lee, H.Y.; Lee, H.K.; Ha, Y.H. Spatial color descriptor for image retrieval and video segmentation. IEEE Trans. Multimed. 2003, 5, 358–367. [Google Scholar]
- Zhao, W.; Zhou, D.; Qiu, X.; Jiang, W. Compare the performance of the models in art classification. PLoS ONE 2021, 16, e0248414. [Google Scholar] [CrossRef] [PubMed]
- Wang, F.; Lin, S.; Luo, X.; Zhao, B.; Wang, R. Query-by-sketch image retrieval using homogeneous painting style characterization. J. Electron. Imaging 2019, 28, 023037. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 13–14 August 2021. [Google Scholar]
- Conde, M.V.; Turgutlu, K. CLIP-Art: Contrastive pre-training for fine-grained art classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3956–3960. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Abdou, M.; Kulmizev, A.; Hershcovich, D.; Frank, S.; Pavlick, E.; Søgaard, A. Can language models encode perceptual structure without grounding? a case study in color. arXiv 2021, arXiv:2109.06129. [Google Scholar]
- Xiang, X.; Liu, D.; Yang, X.; Zhu, Y.; Shen, X.; Allebach, J.P. Adversarial open domain adaptation for sketch-to-photo synthesis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; IEEE: Piscataway, NJ, USA, 2021; pp. 1434–1444. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
- Darosh. Colorgram. Available online: https://github.com/obskyr/colorgram.py (accessed on 31 May 2022).
- Hickey, G. The Ukiyo-e Blues: An Analysis of the Influence of Prussian Blue on Ukiyo-e in the 1830s. Master’s Thesis, The University of Melbourne, Melbourne, Australia, 1994. [Google Scholar]
- Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition, Copenhagen, Denmark, 12–14 October 2015; Springer: Cham, Switzerland, 2015; pp. 84–92. [Google Scholar]
- Colornames.org. Available online: https://colornames.org/download/ (accessed on 31 May 2022).
- Kingma, D.P.; Ba, J.S. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 357–366. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Kubo, S. Butterflies around a wine jar. 1818. Surimono, 203 × 277 cm, V&A Collection E136-1898. Photo: Courtesy of the Board of Trustees of Victoria & Albert Museum.
- Achlioptas, P.; Maks, O.; Haydarov, K.; Elhoseiny, M.; Guibas, L. Artemis: Affective language for visual art. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11569–11579. [Google Scholar]
| Image–Label Pair | Label: Anchor | Label: Positive | Label: Negative |
|---|---|---|---|
|  | Totally Different | Overlap or Differ in Alphabetic Case | The Same |
| Proportion | 721 (48%) | 570 (37%) | 223 (15%) |
| For Training and Testing: HCN | For Evaluation: CN | Sample |
|---|---|---|
| Manganese Red | Amaranth |  |
| Hiroshima Aquamarine | Aquamarine |  |
| Wet Ash | Ash Gray |  |
| Dataset | Text Description | Color Card |
|---|---|---|
| CN dataset and corresponding RGB color cards | Acid green (#b0bf1a) |  |
| Color description from colornames.org | Slurp Blue (#38e0dd), Pyonked (#c61fab) | (–) |
| Pre-Trained Model | Pearson Correlation Coefficient Score |
|---|---|
| CLIP (ViT-B/32) | 0.7990 |
| Fine-tuned CLIP (ours) | 0.7996 ↑ |
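The scores above are Pearson correlation coefficients, presumably between similarities measured in the embedding space and similarities measured in color space. For reference, the coefficient between two equal-length vectors can be computed as follows (a minimal stdlib sketch; in practice `scipy.stats.pearsonr` would be used):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # in [-1, 1]; 1 = perfect positive correlation
```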
| HEX Code | Name | HEX Code | Name |
|---|---|---|---|
| 01ba8e | Technology Turquoise | dc13bb | Plurbin |
| 04196b | Bottle Cap Blue Juice | df1c8a | Punk Delilah |
| 0f1612 | Haha Its Not Black | e255f2 | Aortal |
| 11af1e | Its Definitely Not Pink | e5149a | Elle Woods Was Here |
| 16d54d | Grenitha | e5caeb | Light Salvia |
| 173838 | Fredoro | e8c251 | Luigi Yellow |
| 174855 | Business Blue | e98425 | True October |
| 176b76 | Dweebith | ea4835 | Piper Pizza Red |
| 20654d | Pompoono | eca741 | Yeach |
| 2388fb | Dragonbreaker147 | ef4bc3 | Violent Barbie Doll |
| 243a47 | Autumn Storm in the Mountains | f22966 | Vadelma |
| 260b84 | Gibbet | f5221b | Old Liquorish |
| 2d31c3 | Creeping Dusk | f57c7e | White Girl Sunburn Rash |
| 341e50 | Earthy Gearthy Purthy | fc7420 | Dinner in the Desert |
| Pre-Trained Language Model | Query | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---|---|---|---|---|---|---|
| BERT-base-uncased | Electrica Violet | Olayan | Fredoro | Polar Purple | Luigi Yellow | Toxic Mermaid Tears |
| BERT-base-multilingual-uncased |  | Red Purple | Luigi Yellow | Polar Purple | Vadelma | Light Salvia |
| Fine-tuned CLIP (ours) |  | Polar purple | Bright Light Eggplant Purple | Lavenviolet Crush | Could You Be Any More Purple | Oompa Lompie Purple |
| Pre-Trained Model | Pearson Correlation Coefficient Score |
|---|---|
| CLIP (ViT-B/32) | 0.2640 |
| Fine-tuned CLIP (ours) | 0.2642 ↑ |
| Input Pair | IMGonline [15] | DeepAI [16] |
|---|---|---|
| Input image (1,2) | 81.80% | 6 |
| Input image (2,3) | 90.05% | 2 |
| Input image (1,3) | 87.11% | 6 |
| Model | Image 1–Image 2 | Image 2–Image 3 | Image 1–Image 3 | Image 3–Ukiyo-e Print with Similar Structure but Different Colors |
|---|---|---|---|---|
| Fine-tuned CLIP (ours) | 0.98256 | 0.987789 | 0.98914 | 0.897935 |
| CLIP (ViT-B/32) | 0.98254 | 0.987763 | 0.98917 | 0.898041 |
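The similarity scores reported above are, we assume, cosine similarities between image-encoder feature vectors; a minimal sketch of the measure itself:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors: the dot product
    of the vectors divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)  # 1.0 for identical directions, 0.0 for orthogonal
```

In a CLIP-style pipeline the inputs `u` and `v` would be the (typically L2-normalized) embeddings produced by the image encoder for each digitized print.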
HEX codes: 546c94, 5e6b96, 7f8bb3, 6b7ca4, 6774a4, 3c4476
| Query | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| Blue |  |  |  |
| Dayflower blue |  |  |  |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, K.; Wang, J.; Batjargal, B.; Maeda, A. Intuitively Searching for the Rare Colors from Digital Artwork Collections by Text Description: A Case Demonstration of Japanese Ukiyo-e Print Retrieval. Future Internet 2022, 14, 212. https://doi.org/10.3390/fi14070212