4.4. The Baseline for Triplet-Based Deep Cross-Modal Hashing
Given a set of triples $\{(x_i^a, x_i^p, x_i^n)\}_{i=1}^{N}$, the learning goal of triplet sample distance is to shorten the distance between the anchor sample $x_i^a$ and the positive sample $x_i^p$ and to lengthen the distance between the anchor sample $x_i^a$ and the negative sample $x_i^n$, so the distance constraint of triplet samples can be defined as:

$$d(x_i^a, x_i^p) + \alpha < d(x_i^a, x_i^n) \quad (5)$$

where $\alpha$ is the margin between the distance $d(x_i^a, x_i^p)$ from the anchor sample to the positive sample and the distance $d(x_i^a, x_i^n)$ from the anchor sample to the negative sample.
The triplet loss function shown in Formula (6) can be constructed according to the triplet sample distance constraint in Formula (5):

$$\mathcal{L}_{tri} = \sum_{i=1}^{N} \max\big(0,\; d(x_i^a, x_i^p) - d(x_i^a, x_i^n) + \alpha\big) \quad (6)$$

During the training of the network model, it is found that, when the margin $\alpha$ is smaller, the distance $d(x_i^a, x_i^p)$ between the anchor sample $x_i^a$ and the positive sample $x_i^p$ stays close to the distance $d(x_i^a, x_i^n)$. Although the loss function converges quickly to a value close to 0, similar text and image modal samples are difficult to distinguish. When the margin $\alpha$ is larger, the distance $d(x_i^a, x_i^p)$ between the anchor sample $x_i^a$ and the positive sample $x_i^p$ is forced to be much smaller than the distance $d(x_i^a, x_i^n)$; similar text and image modal samples are then easy to distinguish, but the network model is difficult to converge.
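As a reference point, the following is a minimal PyTorch sketch of the standard margin-based triplet loss in Formula (6). The Euclidean distance, the batch-mean reduction, and the default margin value are assumptions for illustration, not details fixed by the paper.

```python
# A minimal sketch of the triplet loss in Formula (6), assuming Euclidean
# distance and batch-mean reduction.
import torch
import torch.nn.functional as F


def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """max(0, d(a, p) - d(a, n) + alpha), averaged over the batch."""
    d_ap = F.pairwise_distance(anchor, positive)  # distance anchor-positive
    d_an = F.pairwise_distance(anchor, negative)  # distance anchor-negative
    return torch.clamp(d_ap - d_an + alpha, min=0.0).mean()
```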
Based on the triplet sample distance constraint, an improved triplet sample distance constraint is proposed. In detail, the improved learning goal constructs two similarity relationships: first, between the anchor sample and the positive and negative samples, so that the distance between the anchor sample and the positive sample is smaller than that between the anchor sample and the negative sample; second, among the positive sample, the anchor sample, and the negative sample, so that the distance between the positive sample and the anchor sample is smaller than that between the positive sample and the negative sample. By learning from the two sets of sample distance relationships, heterogeneous data of the same category are forced close to each other, which improves the learning ability of the network model and effectively ensures that the intra-class distance is smaller than the inter-class distance. The improved triplet sample distance constraint is formally expressed in Formula (7):

$$d(x_i^a, x_i^p) + \alpha < d(x_i^a, x_i^n), \qquad d(x_i^p, x_i^a) + \beta < d(x_i^p, x_i^n) \quad (7)$$
where the margin $\alpha$ between $d(x_i^a, x_i^p)$ and $d(x_i^a, x_i^n)$ is a custom parameter, and the margin $\beta$ between $d(x_i^p, x_i^a)$ and $d(x_i^p, x_i^n)$ is also a custom parameter. These parameters act as balance parameters that control the distance relationships among the anchor, positive, and negative samples.
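A hedged sketch of the improved constraint as a hinge loss follows; the two terms correspond to the two inequalities of Formula (7), and the equal weighting of the two terms is an assumption.

```python
# A sketch of a loss enforcing the improved constraint in Formula (7): one
# hinge anchored at the anchor sample (margin alpha) and one anchored at the
# positive sample (margin beta). Note d(positive, anchor) = d(anchor, positive).
import torch
import torch.nn.functional as F


def improved_triplet_loss(anchor, positive, negative,
                          alpha: float = 0.5, beta: float = 0.5):
    d_ap = F.pairwise_distance(anchor, positive)    # d(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)    # d(anchor, negative)
    d_pn = F.pairwise_distance(positive, negative)  # d(positive, negative)
    loss_anchor = torch.clamp(d_ap - d_an + alpha, min=0.0)
    loss_positive = torch.clamp(d_ap - d_pn + beta, min=0.0)
    return (loss_anchor + loss_positive).mean()
```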
In cross-modal hash learning, the semantic relationship between triplet samples is described by a triplet likelihood function. Assuming that the anchor sample is in the text modality and the positive and negative samples are in the image modality, an improved triplet likelihood function is proposed according to the improved triplet sample distance constraint in Formula (7), as shown in Formula (8):

$$p\big(x_i^a, x_i^p, x_i^n\big) = \sigma\big(d(f_i^a, g_i^n) - d(f_i^a, g_i^p) - \alpha\big)\,\sigma\big(d(g_i^p, g_i^n) - d(g_i^p, f_i^a) - \beta\big) \quad (8)$$
where $f_i^a = f(t_i^a; \theta_x)$ is the feature output of the text modality for the anchor sample $t_i^a$, and $g_i^p = g(v_i^p; \theta_y)$ and $g_i^n = g(v_i^n; \theta_y)$ are the feature outputs of the image modality for the positive and negative samples $v_i^p$ and $v_i^n$; $f(\cdot; \theta_x)$ and $g(\cdot; \theta_y)$ are the text feature extraction network and the image feature extraction network, and $\sigma(\cdot)$ denotes that the probability is calculated by the sigmoid function. $\alpha$ is the margin between the feature distance from the anchor sample to the positive sample and that from the anchor sample to the negative sample, and $\beta$ is the margin between the feature distance from the positive sample to the anchor sample and that from the positive sample to the negative sample.
Based on the improved triplet likelihood function, the heterogeneous association between data of different modalities is established through its negative log-likelihood loss. The triplet loss function $\mathcal{L}_{t \to i}$ from the text modality to the image modality is shown in Formula (10):

$$\mathcal{L}_{t \to i} = -\sum_{i=1}^{N} \log p\big(x_i^a, x_i^p, x_i^n\big) \quad (10)$$

Similarly, the triplet loss function $\mathcal{L}_{i \to t}$ from the image modality to the text modality, with image anchors and text positive and negative samples, is shown in Formula (11).
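Below is a sketch of the text-to-image direction of this loss, assuming the sigmoid-based likelihood form reconstructed in Formula (8); the image-to-text direction is symmetric with the roles of the networks exchanged.

```python
# A sketch of the negative log-likelihood triplet loss (Formulas (8)-(11)).
# -log sigma(x) is computed stably as softplus(-x) = log(1 + exp(-x)).
import torch
import torch.nn.functional as F


def triplet_nll_loss(f_a, g_p, g_n, alpha: float = 0.5, beta: float = 0.5):
    """Text-to-image direction: anchor features f_a come from the text
    network, positive/negative features g_p, g_n from the image network."""
    d_ap = F.pairwise_distance(f_a, g_p)
    d_an = F.pairwise_distance(f_a, g_n)
    d_pn = F.pairwise_distance(g_p, g_n)
    nll_anchor = F.softplus(-(d_an - d_ap - alpha))    # -log sigma(.)
    nll_positive = F.softplus(-(d_pn - d_ap - beta))   # -log sigma(.)
    return (nll_anchor + nll_positive).mean()
```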
Therefore, according to Formulas (10) and (11), the complete form of the loss function of the triplet-based deep cross-modal hashing algorithm is shown in Formula (12):

$$\mathcal{L} = \mathcal{L}_{t \to i} + \mathcal{L}_{i \to t} + \gamma\big(\|B^{(x)} - F\|_F^2 + \|B^{(y)} - G\|_F^2\big) + \eta\big(\|F\mathbf{1}\|_F^2 + \|G\mathbf{1}\|_F^2\big) \quad (12)$$
where $F$ represents the feature vector matrix of the learned text modal data and $G$ represents the feature vector matrix of the learned image modal data; they contain the relative semantic relationships in the triplet labels. $B^{(x)}$ and $B^{(y)}$, respectively, represent the hash code matrices of the text modal and image modal data; the data feature vectors pass the semantic relationships to the corresponding hash codes, where $B^{(x)} = \mathrm{sign}(F)$ and $B^{(y)} = \mathrm{sign}(G)$. Jiang et al. [43] confirmed through extensive experiments that better network performance can be obtained by assuming that the text modal hash code is the same as the image modal hash code during the training of the network. Therefore, the constraint condition $V = B^{(x)} = B^{(y)}$ is added on the basis of the objective loss function shown in Formula (12), and the final complete triplet loss function is shown in Formula (13):

$$\min_{V,\,\theta_x,\,\theta_y} \mathcal{L} = \mathcal{L}_{t \to i} + \mathcal{L}_{i \to t} + \gamma\big(\|V - F\|_F^2 + \|V - G\|_F^2\big) + \eta\big(\|F\mathbf{1}\|_F^2 + \|G\mathbf{1}\|_F^2\big), \quad \text{s.t.}\; V \in \{-1, +1\}^{c \times N} \quad (13)$$
By optimizing the loss function shown in Formula (13), the triplet network can learn the deep neural network parameters and the hash code representation at the same time, realizing end-to-end learning. The first and second terms of the loss function are the improved triplet negative log-likelihood loss functions; in optimizing these two terms, the cross-modal similarity of the data in the original semantic space is preserved. The third term of the loss function, $\gamma(\|V - F\|_F^2 + \|V - G\|_F^2)$, is a regularization term; optimizing it reduces the quantization error, so that the cross-modal hash codes better retain the semantic similarity of the data features. The fourth term of the loss function, $\eta(\|F\mathbf{1}\|_F^2 + \|G\mathbf{1}\|_F^2)$, is also a regularization term; optimizing it ensures the balance of the hash code values, i.e., the numbers of $+1$ and $-1$ elements at each position of the hash code are the same, so that the information contained in each bit of the hash code is maximized.
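The two regularizers can be computed directly from the feature matrices, as in the sketch below; the trade-off weights `gamma` and `eta` and the column-major $c \times N$ layout are assumptions consistent with the notation above.

```python
# A sketch of the quantization and bit-balance regularizers in Formula (13).
# F_feat and G_feat are c x N feature matrices of the text and image networks;
# V is the shared hash code matrix.
import torch


def regularizers(V: torch.Tensor, F_feat: torch.Tensor, G_feat: torch.Tensor,
                 gamma: float = 1.0, eta: float = 1.0) -> torch.Tensor:
    ones = torch.ones(F_feat.size(1), 1, device=F_feat.device)
    # Third term: quantization error between codes and continuous features.
    quantization = (V - F_feat).pow(2).sum() + (V - G_feat).pow(2).sum()
    # Fourth term: bit balance, pushing each bit's row sum toward zero.
    balance = (F_feat @ ones).pow(2).sum() + (G_feat @ ones).pow(2).sum()
    return gamma * quantization + eta * balance
```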
Driven by large image-text datasets, the optimized text feature extraction network parameters $\theta_x$, image feature extraction network parameters $\theta_y$, and hash code matrix $V$ are obtained by using the stochastic gradient descent algorithm together with an alternating iteration strategy, which quickly trains and optimizes the TDCMR network model. Then, the network model branch is selected according to the modality of the input data, and the deep features of the input data are extracted to obtain the cross-modal hash code.
Specifically, the optimization of the triplet loss function shown in Formula (13) is a non-convex problem. Therefore, the stochastic gradient descent algorithm is used together with an alternating optimization strategy to learn the parameters $\theta_x$ and $\theta_y$ and the hash code matrix $V$: when updating one of them, the other two are fixed, and the optimization proceeds alternately until the model converges or the maximum number of iterations is reached.
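The following schematic shows one outer round of this alternating strategy, reusing the `triplet_nll_loss` sketch above. The networks, optimizers, and batch iterables are assumed placeholders; outputs of the fixed network are detached so that only the branch being updated receives gradients.

```python
# A schematic of one alternating round: update theta_x with theta_y and V
# fixed, then theta_y, then V in closed form (see Section 4.4.3).
import torch


def alternating_round(text_net, image_net, opt_x, opt_y,
                      text_batches, image_batches, F_feat, G_feat, gamma=1.0):
    for t_a, v_p, v_n in text_batches:       # update theta_x (theta_y fixed)
        loss = triplet_nll_loss(text_net(t_a),
                                image_net(v_p).detach(),
                                image_net(v_n).detach())
        opt_x.zero_grad(); loss.backward(); opt_x.step()
    for v_a, t_p, t_n in image_batches:      # update theta_y (theta_x fixed)
        loss = triplet_nll_loss(image_net(v_a),
                                text_net(t_p).detach(),
                                text_net(t_n).detach())
        opt_y.zero_grad(); loss.backward(); opt_y.step()
    return torch.sign(gamma * (F_feat + G_feat))   # update V, Formula (20)
```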
4.4.1. Update $\theta_x$
We learn $\theta_x$ with $\theta_y$ and $V$ fixed. For each iteration, a batch of data is randomly sampled from the training dataset as network input, and the back-propagation algorithm is used to learn the text feature extraction network parameters $\theta_x$. The gradient of the loss function with respect to the feature $f_i$ of the $i$-th text data object is calculated as shown in Formula (14). Then, $\frac{\partial \mathcal{L}}{\partial \theta_x}$ is computed by the chain rule to update the parameters $\theta_x$:

$$\frac{\partial \mathcal{L}}{\partial \theta_x} = \frac{\partial \mathcal{L}}{\partial f_i} \cdot \frac{\partial f_i}{\partial \theta_x} \quad (15)$$
4.4.2. Update $\theta_y$
We learn $\theta_y$ with $\theta_x$ and $V$ fixed. For each iteration, a batch of data is randomly sampled from the training dataset as network input, and the back-propagation algorithm is used to learn the image feature extraction network parameters $\theta_y$. The gradient of the loss function with respect to the feature $g_i$ of the $i$-th image data object is calculated as shown in Formula (16). Then, $\frac{\partial \mathcal{L}}{\partial \theta_y}$ is computed by the chain rule to update the parameters $\theta_y$:

$$\frac{\partial \mathcal{L}}{\partial \theta_y} = \frac{\partial \mathcal{L}}{\partial g_i} \cdot \frac{\partial g_i}{\partial \theta_y} \quad (17)$$
4.4.3. Update V
We learn the hash code matrix $V$ with the parameters $\theta_x$ and $\theta_y$ fixed. By the relation between the trace and the norm of a matrix, for any matrix $A$, $\|A\|_F^2 = \mathrm{tr}(A^\top A)$. Since $\|V\|_F^2 = cN$ is constant, minimizing Formula (13) with respect to $V$ is equivalent to the simplified problem shown in Formula (18):

$$\max_{V} \; \mathrm{tr}\big(V^\top \gamma (F + G)\big) = \sum_{i,j} V_{ij} M_{ij}, \quad \text{s.t.}\; V \in \{-1, +1\}^{c \times N} \quad (18)$$

Let

$$M = \gamma (F + G) \quad (19)$$

To maximize Formula (18), each element $V_{ij}$ should keep the same sign as $M_{ij}$:

$$V = \mathrm{sign}(M) = \mathrm{sign}\big(\gamma (F + G)\big) \quad (20)$$
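This closed-form update is a single elementwise operation; the sketch below isolates it. Note that `torch.sign` maps zero entries to 0, so in practice any zeros would need to be assigned to $+1$ or $-1$, which is an implementation detail not fixed by the paper.

```python
# The closed-form V update of Formula (20): tr(V^T M) = sum_ij V_ij M_ij is
# maximized elementwise over V_ij in {-1, +1} by matching the sign of M_ij.
import torch


def update_V(F_feat: torch.Tensor, G_feat: torch.Tensor,
             gamma: float = 1.0) -> torch.Tensor:
    return torch.sign(gamma * (F_feat + G_feat))  # V = sign(gamma (F + G))
```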
Given text modal data and image modal data, the TDCMR model realizes semantic quantization coding for both modalities, and the semantic similarity of heterogeneous multimedia data can then be measured by the Hamming distance. Concretely, the text modal data $T$ generates the semantic quantization coding $b_T$, and the calculation process is shown in Formula (21):

$$b_T = \mathrm{sign}\big(f(T; \theta_x)\big) \quad (21)$$

The image modal data $I$ generates the semantic quantization coding $b_I$, as shown in Formula (22):

$$b_I = \mathrm{sign}\big(g(I; \theta_y)\big) \quad (22)$$
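A short sketch of encoding and Hamming-distance ranking follows; `text_net` and `image_net` stand for the trained extraction networks $f$ and $g$, and codes are kept in $\{-1, +1\}$ form.

```python
# Semantic quantization coding (Formulas (21) and (22)) and Hamming distance.
import torch


def encode(net, x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return torch.sign(net(x))       # c-bit code in {-1, +1}


def hamming(b1: torch.Tensor, b2: torch.Tensor) -> torch.Tensor:
    c = b1.size(-1)
    return (c - b1 @ b2.T) / 2          # dot = c - 2*hamming for {-1,+1} codes
```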
The optimization procedure of the proposed TDCMR is summarized in Algorithm 1.
Algorithm 1 Optimization procedure of the proposed TDCMR
Input: Textual dataset $T$; image dataset $I$; triplet labels.
Output: Text feature extraction network parameters $\theta_x$; image feature extraction network parameters $\theta_y$; hash code matrix $V$.
1: Initialize the parameters $\theta_x$ and $\theta_y$;
2: Set the number of samples taken for each iteration (batch size);
3: Set the maximum number of iterations $N$;
4: for $iter = 1$ to $N$ do
5:  for each batch of text samples do
6:   Randomly sample text samples from $T$ to build a batch dataset;
7:   Take anchor samples from the batch data to build a set of triplet samples;
8:   For each text sample, calculate $f_i$ by forward propagation;
9:   Compute the gradient in Formula (14);
10:   Update the parameters $\theta_x$ by back propagation;
11:  end for
12:  for each batch of image samples do
13:   Randomly sample image samples from $I$ to build a batch dataset;
14:   Take anchor samples from the batch data to build a set of triplet samples;
15:   For each image sample, calculate $g_i$ by forward propagation;
16:   Compute the gradient in Formula (16);
17:   Update the parameters $\theta_y$ by back propagation;
18:  end for
19:  Update $V$ by Formula (20);
20: end for
4.4.4. Algorithm Analysis
Based on the analysis of the training effect of the TDCMR algorithm, the selection of triplet samples is the key to the model training effect. Given a triplet sample $(x^a, x^p, x^n)$, it falls into one of the following categories:
Simple triples: triples with a loss function value of 0. The triplet distances satisfy $d(x^a, x^p) + \alpha < d(x^a, x^n)$, i.e., the distance between the anchor and positive samples is smaller than the distance between the anchor and negative samples by more than the margin, and the negative sample is easy to identify.
Semi-difficult triples: triples with a loss function value close to 0. The triplet distances satisfy $d(x^a, x^p) < d(x^a, x^n) < d(x^a, x^p) + \alpha$; the negative sample $x^n$ is close to the anchor sample $x^a$ but still farther than the positive sample, so the negative sample can still be identified.
Difficult triples: triples with a loss function value greater than 0. The triplet distances satisfy $d(x^a, x^n) < d(x^a, x^p)$; the negative sample $x^n$ is closer to the anchor sample $x^a$ than the positive sample $x^p$, and the negative sample is difficult to identify.
The selection of triples affects the training effect of the model. Simple triples are easy to identify but cannot provide effective information for training the network model. Difficult triples are hard to identify, and training on difficult triples alone easily makes the network model diverge and seriously affects training efficiency. As training proceeds, the number of simple and semi-difficult triples becomes much larger than the number of difficult triples, which makes continuous optimization of the network model difficult in the later stage of training. Therefore, we adopt a two-stage strategy to select triplet samples for training the network model, sketched in the code below. In the early stage of training, semi-difficult triples are selected as training data so that the network model fits and converges. In the later stage of training, difficult triples are selected as training data, and the network model is fine-tuned to obtain the optimal network parameters and improve the training efficiency of the network model.
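A hedged sketch of the two-stage selection as a boolean batch mask; the margin value and the hard cutoff between stages are assumptions, and in practice the stage switch would be scheduled over training epochs.

```python
# Two-stage triplet selection: semi-difficult triples satisfy
# d(a,p) < d(a,n) < d(a,p) + alpha; difficult triples satisfy d(a,n) < d(a,p).
import torch
import torch.nn.functional as F


def triplet_mask(anchor, positive, negative, alpha: float = 0.5,
                 stage: str = "semi") -> torch.Tensor:
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    if stage == "semi":                  # early stage: semi-difficult triples
        return (d_ap < d_an) & (d_an < d_ap + alpha)
    return d_an < d_ap                   # later stage: difficult triples
```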
4.5. The TDCMR-Quadtree-Based Index Method
Figure 3 illustrates the proposed indexing framework. The quadtree and the semantic hash table are integrated along the vertical dimension in a space-first, semantics-second order. The location information is first organized according to the quadtree structure; then, the spatial objects contained in the quadtree leaf nodes are semantically quantized by the cross-modal hash algorithm, and a hash table is associated with each leaf node according to the cross-modal hash codes. The resulting geo-semantic hybrid index, TH-Quadtree, speeds up access to the spatial objects in the geo-multimedia dataset $O$. As shown in Figure 3, the quadtree is a tree-type index structure for accelerating spatial distance computation, which organizes spatial information efficiently; the cross-modal hash codes produced by TDCMR describe the semantic information of geo-multimedia data, and the hash table organizes this semantic information with low storage space and search time. Coupling the quadtree and the semantic hash tables in a space-first manner ensures that geo-multimedia data $k$ nearest neighbor queries can quickly retrieve the spatial objects that satisfy both the spatial limitation and the query semantics. The TH-Quadtree index combines the spatial and semantic information of geo-multimedia data objects. It is a two-layer hybrid index structure composed of two parts: the quadtree of the spatial layer and the hash tables of the semantic layer.
4.5.1. Space Layer
TH-Quadtree is a two-tier hybrid index structure that integrates the spatial and semantic layers along the vertical dimension. In the spatial dimension, the spatial information of spatial objects is represented by two-dimensional latitude and longitude coordinates, which yields a better pruning effect than high-dimensional semantic information. Therefore, the spatial-layer index, the first layer of the TH-Quadtree structure, is constructed using the spatial position relations of the spatial objects in the geo-multimedia dataset. In this paper, a quadtree, which is efficient at organizing two-dimensional spatial information, is used to index the spatial positions of all spatial objects. First, all spatial objects are regarded as a point set in geographical space; each spatial object belongs to a minimum bounding rectangle (MBR), i.e., a node of the quadtree, and all the MBRs are organized into a tree structure of different levels according to the spatial distribution. In general, the geographical space is recursively divided into a hierarchical tree structure; a minimal sketch of such a node is given below. Geo-multimedia data objects are all stored in the leaf nodes, while the root and internal nodes store no spatial objects.
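The sketch below shows a minimal point-quadtree node; the leaf capacity, the child ordering (NW, NE, SW, SE), and the tuple-based point representation are conventions assumed for illustration, not details fixed by the paper.

```python
# A minimal point-quadtree node for the spatial layer: leaves split into four
# children when their capacity is exceeded.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class QuadNode:
    x0: float
    y0: float
    x1: float
    y1: float                                    # MBR of this node
    capacity: int = 4
    points: List[Tuple[float, float]] = field(default_factory=list)
    children: Optional[List["QuadNode"]] = None  # NW, NE, SW, SE

    def insert(self, p: Tuple[float, float]) -> None:
        if self.children is not None:            # internal node: recurse
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:     # leaf overflow: split in four
            xm, ym = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
            self.children = [
                QuadNode(self.x0, ym, xm, self.y1, self.capacity),
                QuadNode(xm, ym, self.x1, self.y1, self.capacity),
                QuadNode(self.x0, self.y0, xm, ym, self.capacity),
                QuadNode(xm, self.y0, self.x1, ym, self.capacity),
            ]
            for q in self.points:                # push points down to leaves
                self._child_for(q).insert(q)
            self.points = []

    def _child_for(self, p: Tuple[float, float]) -> "QuadNode":
        xm, ym = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        return self.children[(0 if p[1] >= ym else 2) + (1 if p[0] >= xm else 0)]
```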
4.5.2. Semantic Layer
For each leaf node of the quadtree spatial layer, a hash table index is associated as the semantic layer, the second layer of the TH-Quadtree index structure, to facilitate pruning in the semantic dimension. The semantic quantization coding of all geo-multimedia data objects in a leaf node, i.e., their cross-modal hash codes, is obtained by the TDCMR cross-modal hash algorithm. Then, the uID identification codes of the geo-multimedia data objects are stored in hash buckets keyed by the $c$-bit binary codes, generating a hash table that contains the semantic information of all spatial objects in the leaf node. Spatial objects located in the same hash bucket have both high spatial similarity, since they lie in the same leaf node, and high semantic similarity, since they share the same semantic quantization coding.
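A sketch of building such a per-leaf hash table follows; representing a $c$-bit code as a tuple of $\pm 1$ values for hashability is an assumption made here for illustration.

```python
# Building the semantic-layer hash table of a leaf node: object uIDs are
# bucketed by their c-bit cross-modal hash code.
from collections import defaultdict


def build_hash_table(leaf_objects):
    """leaf_objects: iterable of (uid, code) pairs, code a tuple of +/-1."""
    table = defaultdict(list)
    for uid, code in leaf_objects:
        table[code].append(uid)     # same bucket => same semantic coding
    return table

# Lookup: candidates sharing the query's code fall in a single bucket.
# candidates = table.get(query_code, [])
```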
4.6. TH-Quadtree-Based Nearest Neighbor Query Algorithm
The main idea of the TH-Quadtree-based geo-multimedia data nearest neighbor query algorithm is as follows. Given a query object $q$, spatial objects are searched in order through the spatial layer and then the semantic layer. Starting from the root node of the index structure, the TH-Quadtree is traversed according to the spatial best match proximity under the best-first principle, continuously obtaining the tree nodes closest to the query object $q$ in spatial position, where the optimal spatial similarity is calculated as shown in Formula (23):

$$SSim(q, N) = \min_{p \in N.\mathrm{MBR}} \mathrm{dist}(q, p) \quad (23)$$

where $SSim(q, N)$ stands for the spatial similarity between node $N$ and the query object $q$; this optimal spatial similarity is the lower bound of the spatial similarity scores between the query object $q$ and the geo-multimedia spatial objects $o$ in node $N$. Based on the optimal spatial similarity, when the query processing reaches a leaf node, the search switches from the spatial layer to the semantic layer. The candidate set semantically related to the query object is obtained quickly by a hash lookup in the hash table associated with the leaf node. Then, for the spatial objects in the candidate set, the spatial similarity and the semantic similarity are fused, and the best spatial objects are selected to update the result set $R$ according to the geo-multimedia data similarity score. During the whole search, the result set $R$ dynamically maintains the traversed spatial objects, and the $k$-th smallest geo-multimedia data similarity score in the current result set serves as the upper bound; the search is terminated when the unvisited nodes satisfy the condition of Formula (24), and the current result set is returned as the optimal query result.
where $SSim(q, N)$ is the lower bound of the spatial similarity distances between the query object $q$ and all spatial objects in the subtree rooted at node $N$. When this lower bound is larger than the known distance upper bound of the result set $R$, none of the unvisited spatial objects has a chance of being better than the Top-$k$ results in the current result set, and the search process terminates.
Since the spatial distance between the query $q$ and any spatial object $o$ in node $N$ is greater than or equal to the spatial distance from $q$ to node $N$, the spatial similarity between $q$ and node $N$ is not higher than that between $q$ and any spatial object $o$ in $N$. Similarly, since node $N$ is the top element of the priority queue $L$, the spatial similarity between $q$ and $N$ is the lower bound of the spatial similarities between $q$ and all currently unvisited nodes. When the termination condition holds, the geo-multimedia data similarity scores of all unvisited spatial objects with respect to the query $q$ are greater than the distance upper bound of the result set. Therefore, compared with the current search results, none of the unvisited spatial objects can be closer to the query $q$; the current result set $R$ is the optimal solution, and the query process can terminate.
In summary, given a geo-multimedia data $k$ nearest neighbor query $q$, the algorithm maintains the distance upper bound of the result set $R$ and a priority queue $L$ sorted by spatial similarity score in ascending order. In the query process, for the top element $N$ popped from the priority queue $L$, the query termination condition is that the lower bound $SSim(q, N)$ exceeds the upper bound of the result set $R$ (Formula (24)).
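The following sketch illustrates this best-first search. The node interface (`children`, `hash_table`, `min_dist`), the object fields (`uid`), and the combined `score()` function that fuses spatial and semantic similarity are assumptions used for illustration; smaller scores mean better matches.

```python
# A hedged sketch of the best-first kNN search over TH-Quadtree: a priority
# queue L orders nodes by their spatial lower bound; at leaves the search
# switches to a hash lookup in the semantic layer.
import heapq
import itertools


def knn_search(root, q_loc, q_code, k, score):
    """score(obj, q_loc, q_code) fuses spatial and semantic similarity."""
    tie = itertools.count()                       # heap tie-breaker
    L = [(root.min_dist(q_loc), next(tie), root)]
    R = []                                        # current top-k: (score, uid)
    while L:
        bound, _, node = heapq.heappop(L)         # best spatial lower bound
        if len(R) == k and bound >= R[-1][0]:
            break                                 # termination, Formula (24)
        if node.children is not None:             # spatial layer: expand node
            for child in node.children:
                heapq.heappush(L, (child.min_dist(q_loc), next(tie), child))
        else:                                     # semantic layer: hash lookup
            for obj in node.hash_table.get(q_code, []):
                R.append((score(obj, q_loc, q_code), obj.uid))
            R.sort()
            del R[k:]                             # keep the k best objects
    return R
```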