Article

Image Retrieval via Canonical Correlation Analysis and Binary Hypothesis Testing †

1 The Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4S1, Canada
2 The Institute for Infocomm Research, A-STAR, Singapore 138632, Singapore
* Author to whom correspondence should be addressed.
† This paper is an extended version of our presentation at the 16th Canadian Workshop on Information Theory, Hamilton, ON, Canada, 2–5 June 2019.
Information 2022, 13(3), 106; https://doi.org/10.3390/info13030106
Submission received: 7 January 2022 / Revised: 17 February 2022 / Accepted: 18 February 2022 / Published: 23 February 2022
(This article belongs to the Special Issue Image Enhancement with Deep Learning Techniques)

Abstract

Canonical Correlation Analysis (CCA) is a classic multivariate statistical technique, which can be used to find a projection pair that maximally captures the correlation between two sets of random variables. The present paper introduces a CCA-based approach for image retrieval. It capitalizes on feature maps induced by two images under comparison through a pre-trained Convolutional Neural Network (CNN) and leverages basis vectors identified through CCA, together with an element-wise selection method based on a Chernoff-information-related criterion, to produce compact transformed image features; a binary hypothesis test regarding the joint distribution of transformed feature pairs is then employed to measure the similarity between two images. The proposed approach is benchmarked against two alternative statistical methods, Linear Discriminant Analysis (LDA) and Principal Component Analysis with whitening (PCAw). Our CCA-based approach is shown to achieve highly competitive retrieval performance on standard datasets, which include, among others, Oxford5k and Paris6k.

1. Introduction

The past two decades have witnessed an explosive growth of online image databases. This growth paves the way for the development of visual-data-driven applications, but at the same time poses a major challenge to Content-Based Image Retrieval (CBIR) technology [1].
Traditional approaches to CBIR mostly rely on the exploitation of handcrafted scale- and orientation-invariant image features [2,3,4,5,6], which have achieved varying degrees of success. Recent advances [7,8] in Deep Learning (DL) for image classification and object detection have generated significant interest in bringing Convolutional Neural Networks (CNNs) to bear upon CBIR. Although CNN models are usually trained for purposes different from CBIR, it is known [9] that features extracted from modern deep CNNs, commonly referred to as DL features, have great potential in this respect as well. Retrieval methods utilizing DL features can generally be divided into two categories: those without and those with fine-tuning of the CNN model [10]. Early applications of CNNs to CBIR resorted almost exclusively to methods in the first category, which use Off-The-Shelf (OTS) CNNs (i.e., popular pre-trained CNNs) for feature extraction (see, e.g., [11,12,13,14]). A main advantage of such methods is the low implementation cost [15,16], which is largely attributed to the direct adoption of pre-trained CNNs. Performance-wise, they are comparable to the state-of-the-art traditional methods that rely on handcrafted features. In contrast, many recent methods, such as [17,18,19], belong to the second category and take advantage of the fine-tuning gain to enhance the discriminatory power of the extracted DL features. A top representative of this category is the end-to-end learning framework proposed in [20]. It outperforms most existing traditional and OTS-CNN-based methods on standard testing datasets; however, this performance improvement comes at the cost of training a complex triple-branched CNN using a large dataset, which might not always be affordable in practice.
Many preprocessing methods have been developed with the goal of better utilizing DL features for image retrieval, among which Principal Component Analysis with whitening [21] (PCAw) and Linear Discriminant Analysis [22] (LDA) are arguably the most well known. Despite being extremely popular, PCA and LDA have their respective weaknesses: dimensionality reduction via PCA often eliminates principal components that have small contribution rates but are nonetheless critical, while the performance of LDA tends to suffer when the differences between mismatched features are small. As such, there is a great need for a preprocessing method with improved robustness against dimensionality reduction and enhanced sensitivity to feature mismatch. In this work, we aim to put forward a potential solution with the desired properties by bringing Canonical Correlation Analysis (CCA) [23] into the picture.
CCA is a multivariate technique for elucidating the associations between two sets of variables. It can be used to identify a projection pair of a given dimension that maximally captures the correlation between the two sets. The applications of CCA are too numerous to list. In cross-modality matching/retrieval alone, extensive investigations have been carried out, as evidenced by a growing body of literature, from those based on handcrafted features [24] to the more recent ones that make use of DL features [25,26,27]. There is also some related development on the theoretical front (see, e.g., [28,29]).
Motivated by considerations of computational efficiency and affordability as well as the weaknesses inherent in the existing preprocessing methods, we develop and present in this paper a new image retrieval method based on OTS deep CNNs. Our method is built primarily upon CCA, but has several notable differences from the related works. For the purpose of dimensionality reduction (i.e., feature compression), the proposed method employs a basis-vector selection technique, which invokes a Chernoff-information-based criterion to rank how discriminative the basis vectors are. Both the basis vectors and their ranking are learned from a training set, which consists of features extracted from a pre-trained CNN; the neural network itself is not retrained or fine-tuned in our work. Given a new pair of features, the ranked basis vectors are used to perform transformation and compression. This is followed by a binary hypothesis test on the joint distribution of pairs of transformed features, which yields a matching score that can be leveraged to identify top candidates for retrieval. We show via extensive experimental results that the proposed CCA-based method is able to deliver highly competitive results on standard datasets, which include, among others, Oxford5k and Paris6k.
This paper is organized as follows. The proposed CCA-based preprocessing method along with the associated matching procedure is detailed in Section 2. Section 3 includes the experimental results and the relevant discussions. We close the paper in Section 4 with some concluding remarks.

2. Proposed Method

The proposed image retrieval method utilizes CCA in an essential way. It leverages a training dataset of features extracted from a pre-trained CNN model (see Figure 1) to learn a set of canonical vectors, which serve as the basis vectors of the feature space. These vectors are used to project the features of a pair of images into a new space, in which a Chernoff-information-based selection method is applied to identify the most discriminative elements of the transformed features. Such elements then undergo a binary hypothesis test to measure the similarity between the features and, consequently, the two images. This process is expounded in the following four subsections (see also Figure 2).

2.1. Image Pre-Processing and Feature Extraction

The CNN model adopted in this work for feature extraction is VGG16 [30]. It takes an input image of maximum size 1024 × 1024 and produces 512 feature maps of maximum size 32 × 32 from its very last pooling layer. A single feature element is extracted from each feature map via pooling. The 512-dimensional vector resulting from the concatenation of these elements is then converted, through centralization and normalization, into a global feature vector that serves as a compact representation of the image (centralization subtracts the mean computed on the training set, while normalization scales the vector to unit length).
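The following is a minimal sketch of this extraction step (not the authors' released code), assuming PyTorch/torchvision and an ImageNet-pretrained VGG16; the function name global_descriptor and the weights identifier are illustrative assumptions.

```python
# Sketch: 512-D global descriptor from the last pooling layer of an off-the-shelf VGG16.
import torch
import torchvision.models as models

vgg16_conv = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # conv/pool layers only

def global_descriptor(image_tensor, train_mean, pooling="SD"):
    """image_tensor: (1, 3, H, W) with H, W <= 1024; train_mean: (512,) mean of training features."""
    with torch.no_grad():
        fmap = vgg16_conv(image_tensor)         # (1, 512, h, w) with h, w <= 32
    fmap = fmap.flatten(2)                      # (1, 512, h*w)
    if pooling == "MAC":                        # global max pooling
        feat = fmap.max(dim=2).values
    elif pooling == "AVE":                      # global average pooling
        feat = fmap.mean(dim=2)
    else:                                       # "SD": global standard-deviation pooling
        feat = fmap.std(dim=2)
    feat = feat.squeeze(0) - train_mean         # centralization (training-set mean)
    return feat / feat.norm()                   # normalization to unit length
```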

2.2. Correlation Analysis and Canonical Vectors

At the heart of the proposed method lie the so-called canonical vectors, which are learned from a large training set of matching and non-matching image features in a manner inspired by CCA. The learning process consists of the following steps.
Step 1: Construct two raw matching matrices
$$X^{(RM)} = [\,x_1, x_2, \ldots, x_L\,], \qquad Y^{(RM)} = [\,y_1, y_2, \ldots, y_L\,],$$
where $L$ is the number of raw matching pairs, and $x_l$ and $y_l$, $l \in \{1, 2, \ldots, L\}$, are a pair of global feature vectors representing two matching images (here, "matching images" means images from the same class, while "non-matching images" means images from different classes).
Using the raw matching pairs $X^{(RM)}$ and $Y^{(RM)}$, a pair of matching-feature matrices is formed:
$$X^{(M)} = [\,x_1, y_1, x_2, y_2, \ldots, x_L, y_L\,],$$
$$Y^{(M)} = [\,y_1, x_1, y_2, x_2, \ldots, y_L, x_L\,].$$
The total number of training pairs becomes $2L$ after the feature order is flipped. This flipping ensures that in Equation (1) below, the diagonal blocks are identical and symmetric, and so are the off-diagonal blocks. Both $X^{(M)}$ and $Y^{(M)}$ are of size $512 \times 2L$. The training data matrix of matching features $H^{(M)}$ is constructed by stacking $X^{(M)}$ on top of $Y^{(M)}$:
$$H^{(M)} = \begin{bmatrix} X^{(M)} \\ Y^{(M)} \end{bmatrix} \in \mathbb{R}^{1024 \times 2L}.$$
The estimated covariance matrix of matching features is given by
$$\Phi^{(M)} = \frac{1}{2L-1} H^{(M)} \big(H^{(M)}\big)^T = \frac{1}{2L-1}\begin{bmatrix} X^{(M)} \\ Y^{(M)} \end{bmatrix}\begin{bmatrix} X^{(M)} \\ Y^{(M)} \end{bmatrix}^T = \begin{bmatrix} \Sigma_{XX}^{(M)} & \Sigma_{XY}^{(M)} \\ \Sigma_{YX}^{(M)} & \Sigma_{YY}^{(M)} \end{bmatrix}, \qquad (1)$$
where
$$\Sigma_{XX}^{(M)} = \frac{X^{(M)}\big(X^{(M)}\big)^T}{2L-1}, \quad \Sigma_{YY}^{(M)} = \frac{Y^{(M)}\big(Y^{(M)}\big)^T}{2L-1}, \quad \Sigma_{XY}^{(M)} = \Sigma_{YX}^{(M)} = \frac{X^{(M)}\big(Y^{(M)}\big)^T}{2L-1} = \frac{Y^{(M)}\big(X^{(M)}\big)^T}{2L-1}.$$
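As a concrete illustration, the following NumPy sketch (an assumed implementation, not the authors' code) builds the flipped matching matrices and the covariance estimate of Equation (1) from raw pairs; the non-matching covariance of Step 2 below is obtained the same way after randomly permuting the columns of one raw matrix.

```python
# Sketch: Step 1, covariance estimation from flipped matching pairs (NumPy).
import numpy as np

def matching_covariance(X_raw, Y_raw):
    """X_raw, Y_raw: 512 x L matrices whose l-th columns form a matching pair."""
    d, L = X_raw.shape
    X_m = np.empty((d, 2 * L))
    Y_m = np.empty((d, 2 * L))
    X_m[:, 0::2], X_m[:, 1::2] = X_raw, Y_raw   # interleave each pair with its flipped version
    Y_m[:, 0::2], Y_m[:, 1::2] = Y_raw, X_raw
    H_m = np.vstack([X_m, Y_m])                 # 1024 x 2L stacked data matrix
    Phi_m = H_m @ H_m.T / (2 * L - 1)           # estimated covariance, Equation (1)
    Sigma_auto = X_m @ X_m.T / (2 * L - 1)      # common diagonal block
    Sigma_xy = X_m @ Y_m.T / (2 * L - 1)        # symmetric off-diagonal block
    return Phi_m, Sigma_auto, Sigma_xy
```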
Step 2: Randomly permuting the columns of one of the raw feature matrices, say turning $Y^{(RM)}$ into $Y^{(RN)}$, yields two raw non-matching matrices. More specifically, each column of $X^{(RM)}$ is successively associated with a randomly selected (without replacement) non-matching column from $Y^{(RM)}$. For example,
$$X^{(RN)} = [\,x_1, x_2, \ldots, x_L\,], \qquad Y^{(RN)} = [\,y_3, y_7, \ldots, y_{L-4}\,].$$
Based on these two raw non-matching matrices, feature-order flipping is performed to generate $X^{(N)}$ and $Y^{(N)}$:
$$X^{(N)} = [\,x_1, y_3, x_2, y_7, \ldots, x_L, y_{L-4}\,],$$
$$Y^{(N)} = [\,y_3, x_1, y_7, x_2, \ldots, y_{L-4}, x_L\,].$$
With a procedure similar to that of Step 1, we can estimate the covariance matrix $\Phi^{(N)}$ of the non-matching features $H^{(N)}$:
$$\Phi^{(N)} = \frac{1}{2L-1} H^{(N)} \big(H^{(N)}\big)^T = \frac{1}{2L-1}\begin{bmatrix} X^{(N)} \\ Y^{(N)} \end{bmatrix}\begin{bmatrix} X^{(N)} \\ Y^{(N)} \end{bmatrix}^T = \begin{bmatrix} \Sigma_{XX}^{(N)} & \Sigma_{XY}^{(N)} \\ \Sigma_{YX}^{(N)} & \Sigma_{YY}^{(N)} \end{bmatrix}. \qquad (2)$$
Note that
$$\Sigma_{XX}^{(N)} = \Sigma_{YY}^{(N)} = \Sigma_{XX}^{(M)} = \Sigma_{YY}^{(M)} = \Sigma_{auto},$$
for they are all covariances of the same set of random image features. As in Equation (1), the diagonal blocks in Equation (2) are identical and symmetric, and so are the off-diagonal blocks.
Step 3: Since $\Sigma_{auto}$ is positive definite, it follows that $\Theta^{-\frac{1}{2}}$ is well defined, where
$$\Theta = \begin{bmatrix} \Sigma_{auto} & 0 \\ 0 & \Sigma_{auto} \end{bmatrix}.$$
We multiply both covariance matrices, $\Phi^{(M)}$ and $\Phi^{(N)}$, on the left and right by $\Theta^{-\frac{1}{2}}$ to de-correlate their diagonal blocks:
$$\hat\Phi^{(M)} = \Theta^{-\frac{1}{2}}\,\Phi^{(M)}\,\Theta^{-\frac{1}{2}} = \begin{bmatrix} I & J^{(M)} \\ J^{(M)} & I \end{bmatrix}, \qquad \hat\Phi^{(N)} = \Theta^{-\frac{1}{2}}\,\Phi^{(N)}\,\Theta^{-\frac{1}{2}} = \begin{bmatrix} I & J^{(N)} \\ J^{(N)} & I \end{bmatrix},$$
where
$$J^{(M)} = \Sigma_{auto}^{-\frac{1}{2}}\,\Sigma_{XY}^{(M)}\,\Sigma_{auto}^{-\frac{1}{2}} = \Sigma_{auto}^{-\frac{1}{2}}\,\Sigma_{YX}^{(M)}\,\Sigma_{auto}^{-\frac{1}{2}}, \qquad J^{(N)} = \Sigma_{auto}^{-\frac{1}{2}}\,\Sigma_{XY}^{(N)}\,\Sigma_{auto}^{-\frac{1}{2}} = \Sigma_{auto}^{-\frac{1}{2}}\,\Sigma_{YX}^{(N)}\,\Sigma_{auto}^{-\frac{1}{2}}.$$
Step 4: Apply eigen-decomposition [31] to $J^{(M)}$:
$$J^{(M)} = U \Lambda U^T,$$
where $U$ is an orthogonal matrix, and $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of $J^{(M)}$. The columns of $U$ are exactly the sought-after canonical vectors. The blockwise left- and right-multiplication of both $\hat\Phi^{(M)}$ and $\hat\Phi^{(N)}$ by $U^T$ and $U$, respectively, gives the following pair of matrices:
$$\begin{bmatrix} U^T U & U^T J^{(M)} U \\ U^T J^{(M)} U & U^T U \end{bmatrix} = \begin{bmatrix} I & \Lambda \\ \Lambda & I \end{bmatrix}, \qquad (3)$$
$$\begin{bmatrix} U^T U & U^T J^{(N)} U \\ U^T J^{(N)} U & U^T U \end{bmatrix} = \begin{bmatrix} I & \Pi \\ \Pi & I \end{bmatrix}, \qquad (4)$$
where $\Pi = U^T J^{(N)} U$. The off-diagonal block $\Lambda$ in Equation (3) is a diagonal matrix, whereas $\Pi$ in Equation (4) is not necessarily so. Nevertheless, it will be seen that in practice $\Pi$ is often close to a zero matrix (as two non-matching image features tend to be uncorrelated) and thus is approximately diagonal as well.
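Steps 3 and 4 can be sketched as follows (an assumed NumPy/SciPy implementation operating on the blocks estimated in the previous sketch; the regularization term eps is an added assumption for numerical stability):

```python
# Sketch: whitening the cross-covariances and extracting the canonical vectors U.
import numpy as np
from scipy.linalg import fractional_matrix_power, eigh

def canonical_vectors(Sigma_auto, Sigma_xy_m, Sigma_xy_n, eps=1e-6):
    d = Sigma_auto.shape[0]
    # Sigma_auto^{-1/2}, lightly regularized so the inverse square root is well defined.
    S_inv_half = fractional_matrix_power(Sigma_auto + eps * np.eye(d), -0.5).real
    J_m = S_inv_half @ Sigma_xy_m @ S_inv_half      # whitened matching cross-covariance
    J_n = S_inv_half @ Sigma_xy_n @ S_inv_half      # whitened non-matching cross-covariance
    lam, U = eigh(J_m)                              # eigen-decomposition J_m = U diag(lam) U^T
    order = np.argsort(-lam)                        # sort eigenvalues in descending order
    lam, U = lam[order], U[:, order]
    Pi = U.T @ J_n @ U                              # transformed non-matching block, Equation (4)
    return U, lam, Pi
```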

2.3. Chernoff Information for Canonical Vector Selection

Note that the learned canonical vectors of matching image features form an orthonormal basis of $\mathbb{R}^{512}$. These vectors are not necessarily equally useful for the purpose of measuring the similarity between the feature vectors of an unknown pair of images; therefore, it is of considerable interest to quantify how discriminative each canonical vector is. To this end, the off-diagonal blocks of the covariance matrix of non-matching image features can be brought into play. Evaluating the Chernoff information ($CI$) [32,33] with respect to the diagonal elements of both $\Lambda$ and $\Pi$ yields a ranking of the most different diagonal element pairs, which can be used to guide the selection of canonical vectors.
Define the following set of 2 × 2 matrices
$$S_t^{(M)} = \begin{bmatrix} 1 & c_t^{(M)} \\ c_t^{(M)} & 1 \end{bmatrix}, \qquad S_t^{(N)} = \begin{bmatrix} 1 & c_t^{(N)} \\ c_t^{(N)} & 1 \end{bmatrix},$$
using the matching coefficient $c_t^{(M)} = [\Lambda]_{tt}$ and the non-matching coefficient $c_t^{(N)} = [\Pi]_{tt}$, $t \in \{1, 2, \ldots, 512\}$, determined by the diagonal elements of $\Lambda$ and $\Pi$:
$$\Lambda = \begin{bmatrix} c_1^{(M)} & 0 & \cdots & 0 \\ 0 & c_2^{(M)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & c_{512}^{(M)} \end{bmatrix}, \qquad \Pi = \begin{bmatrix} c_1^{(N)} & \pi_{1,2} & \cdots & \pi_{1,512} \\ \pi_{2,1} & c_2^{(N)} & \cdots & \pi_{2,512} \\ \vdots & \vdots & \ddots & \vdots \\ \pi_{512,1} & \pi_{512,2} & \cdots & c_{512}^{(N)} \end{bmatrix}.$$
Now let $S_t(\lambda_t) = \big(\lambda_t (S_t^{(M)})^{-1} + (1-\lambda_t)(S_t^{(N)})^{-1}\big)^{-1}$, $\lambda_t \in [0, 1]$, and define
$$D\big(S_t(\lambda_t)\,\|\,S_t^{(M)}\big) = \frac{1}{2}\log_e\frac{|S_t^{(M)}|}{|S_t(\lambda_t)|} + \frac{1}{2}\operatorname{tr}\big((S_t^{(M)})^{-1} S_t(\lambda_t)\big) - 1, \qquad D\big(S_t(\lambda_t)\,\|\,S_t^{(N)}\big) = \frac{1}{2}\log_e\frac{|S_t^{(N)}|}{|S_t(\lambda_t)|} + \frac{1}{2}\operatorname{tr}\big((S_t^{(N)})^{-1} S_t(\lambda_t)\big) - 1,$$
where $\operatorname{tr}(\cdot)$ is the trace operator. Let $\lambda_t = \lambda_t^*$ be the solution of $D(S_t(\lambda_t)\,\|\,S_t^{(M)}) = D(S_t(\lambda_t)\,\|\,S_t^{(N)})$. The Chernoff information $CI(S_t^{(M)}\,\|\,S_t^{(N)})$ is defined as
$$CI\big(S_t^{(M)}\,\|\,S_t^{(N)}\big) = D\big(S_t(\lambda_t^*)\,\|\,S_t^{(M)}\big) = D\big(S_t(\lambda_t^*)\,\|\,S_t^{(N)}\big).$$
An expression for the individual $\lambda_t^*$ is derived in Appendix A.
Given $\lambda_t^*$, the $CI$ of all pairs $(S_t^{(M)}, S_t^{(N)})$ can be evaluated, leading to a ranking (a greater $CI$ corresponds to a higher rank) of the most different pairs of diagonal elements $(c_t^{(M)}, c_t^{(N)})$ and, consequently, of the most discriminative canonical vectors in $U$. Let the $k$ most discriminative vectors serve as the columns of the new canonical vector matrix $\tilde U$. Moreover, select the top-$k$ most different pairs of diagonal elements $(\tilde c_i^{(M)}, \tilde c_i^{(N)})$ and the corresponding $(\tilde S_i^{(M)}, \tilde S_i^{(N)})$, where $i \in \{1, 2, \ldots, k\}$.
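The selection criterion can be sketched as follows (an assumed NumPy/SciPy implementation; it finds each lambda* numerically by root finding rather than through the closed form of Appendix A, and the helper names are illustrative):

```python
# Sketch: ranking canonical vectors by the Chernoff information of their (c_m, c_n) pairs.
import numpy as np
from scipy.optimize import brentq

def kl_gauss(S0, S1):
    # KL divergence D(N(0, S0) || N(0, S1)) for 2x2 covariance matrices.
    return 0.5 * (np.log(np.linalg.det(S1) / np.linalg.det(S0))
                  + np.trace(np.linalg.solve(S1, S0)) - 2)

def chernoff_information(c_m, c_n):
    S_m = np.array([[1.0, c_m], [c_m, 1.0]])
    S_n = np.array([[1.0, c_n], [c_n, 1.0]])
    S = lambda lam: np.linalg.inv(lam * np.linalg.inv(S_m) + (1 - lam) * np.linalg.inv(S_n))
    f = lambda lam: kl_gauss(S(lam), S_m) - kl_gauss(S(lam), S_n)
    lam_star = brentq(f, 1e-9, 1 - 1e-9)        # solves D(S(l)||S_m) = D(S(l)||S_n)
    return kl_gauss(S(lam_star), S_m)

def select_top_k(lam, Pi, k):
    # lam: diagonal of Lambda; Pi: transformed non-matching block; returns the indices
    # of the k canonical vectors with the largest Chernoff information.
    ci = np.array([chernoff_information(lam[t], Pi[t, t]) for t in range(len(lam))])
    return np.argsort(-ci)[:k]
```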

2.4. Similarity Measurement

The selected canonical vectors can be leveraged to measure the similarity between an arbitrary pair of images through a binary hypothesis test. Let $(x_r, y_c)$ be an arbitrary pair of global feature vectors. The exact joint distribution of $(x_r, y_c)$ likely varies from one dataset to another and does not admit an explicit characterization. Here we make the simplifying assumption that $x_r$ and $y_c$ are jointly Gaussian. Specifically, we assume that $(x_r, y_c) \sim \mathcal{N}(0, \Phi^{(M)})$ if they come from two matching images, and $(x_r, y_c) \sim \mathcal{N}(0, \Phi^{(N)})$ otherwise, where $\mathcal{N}(0, \Sigma)$ denotes a multivariate Gaussian distribution [34] with mean $0$ and covariance matrix $\Sigma$. Given $(x_r, y_c)$, the transformed feature vectors are computed as follows:
$$w = [\,w_1, w_2, \ldots, w_k\,]^T = \tilde U^T \Sigma_{auto}^{-\frac{1}{2}}\, x_r, \qquad v = [\,v_1, v_2, \ldots, v_k\,]^T = \tilde U^T \Sigma_{auto}^{-\frac{1}{2}}\, y_c.$$
Since $\Lambda$ is a diagonal matrix, it follows that $(w_1, v_1), (w_2, v_2), \ldots, (w_k, v_k)$ are mutually independent with $(w_i, v_i) \sim \mathcal{N}(0, \tilde S_i^{(M)})$ for $i \in \{1, 2, \ldots, k\}$ in the case where $(x_r, y_c)$ is a matching pair. We shall further assume that $\Pi$ is also a diagonal matrix, which is justified by the fact that in practice $\Pi$ is often very close to a zero matrix (see Figure 3 and Figure 4 for some empirical evidence). As a consequence, $(w_1, v_1), (w_2, v_2), \ldots, (w_k, v_k)$ are mutually independent with $(w_i, v_i) \sim \mathcal{N}(0, \tilde S_i^{(N)})$ for $i \in \{1, 2, \ldots, k\}$ in the case where $(x_r, y_c)$ is a non-matching pair. To check whether the given two images match or not, one can perform a binary hypothesis test regarding the underlying distribution of $(w, v)$: $\prod_{i=1}^{k}\mathcal{N}(0, \tilde S_i^{(M)})$ vs. $\prod_{i=1}^{k}\mathcal{N}(0, \tilde S_i^{(N)})$.
Note that $\mathcal{N}(0, \tilde S_i^{(M)})$ has probability density
$$P_M(w_i, v_i) = \frac{\exp\!\left(-\frac{1}{2}\begin{bmatrix} w_i & v_i \end{bmatrix}\begin{bmatrix} 1 & \tilde c_i^{(M)} \\ \tilde c_i^{(M)} & 1 \end{bmatrix}^{-1}\begin{bmatrix} w_i \\ v_i \end{bmatrix}\right)}{\sqrt{(2\pi)^2\begin{vmatrix} 1 & \tilde c_i^{(M)} \\ \tilde c_i^{(M)} & 1 \end{vmatrix}}}, \qquad (5)$$
while $\mathcal{N}(0, \tilde S_i^{(N)})$ has probability density
$$P_N(w_i, v_i) = \frac{\exp\!\left(-\frac{1}{2}\begin{bmatrix} w_i & v_i \end{bmatrix}\begin{bmatrix} 1 & \tilde c_i^{(N)} \\ \tilde c_i^{(N)} & 1 \end{bmatrix}^{-1}\begin{bmatrix} w_i \\ v_i \end{bmatrix}\right)}{\sqrt{(2\pi)^2\begin{vmatrix} 1 & \tilde c_i^{(N)} \\ \tilde c_i^{(N)} & 1 \end{vmatrix}}}. \qquad (6)$$
We are now in a position to conduct a binary hypothesis test based on the confidence score given below:
$$score_G = \log\frac{\prod_{i=1}^{k} P_M(w_i, v_i)}{\prod_{i=1}^{k} P_N(w_i, v_i)} = \sum_{i=1}^{k}\log\frac{P_M(w_i, v_i)}{P_N(w_i, v_i)}. \qquad (7)$$
Substituting Equations (5) and (6) into Equation (7) gives
$$score_G = \sum_{i=1}^{k}\Big(\log P_M(w_i, v_i) - \log P_N(w_i, v_i)\Big) = \sum_{i=1}^{k}\left(-\frac{w_i^2 - 2 w_i v_i \tilde c_i^{(M)} + v_i^2}{2\pi\big(1-(\tilde c_i^{(M)})^2\big)} + \frac{w_i^2 - 2 w_i v_i \tilde c_i^{(N)} + v_i^2}{2\pi\big(1-(\tilde c_i^{(N)})^2\big)} + \log\sqrt{\frac{1-(\tilde c_i^{(N)})^2}{1-(\tilde c_i^{(M)})^2}}\,\right), \qquad (8)$$
which is equivalent to
$$\sum_{i=1}^{k}\left(-\frac{w_i^2 - 2 w_i v_i \tilde c_i^{(M)} + v_i^2}{1-(\tilde c_i^{(M)})^2} + \frac{w_i^2 - 2 w_i v_i \tilde c_i^{(N)} + v_i^2}{1-(\tilde c_i^{(N)})^2}\right)$$
as the log term and the scalar $2\pi$ have no effect on the ranking. This confidence score reflects the degree of similarity between the two given images: the higher the score, the more likely the images match each other.
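A minimal sketch of this scoring step (assumed NumPy code computing the simplified form above; score_g and the argument names are illustrative) is given below.

```python
# Sketch: Gaussian hypothesis-testing score for a pair of transformed features w, v,
# given the selected matching/non-matching coefficients c_m, c_n (1-D arrays of length k).
import numpy as np

def score_g(w, v, c_m, c_n):
    quad_m = (w**2 - 2 * w * v * c_m + v**2) / (1 - c_m**2)   # matching-hypothesis term
    quad_n = (w**2 - 2 * w * v * c_n + v**2) / (1 - c_n**2)   # non-matching-hypothesis term
    return float(np.sum(quad_n - quad_m))     # higher score => more likely a matching pair

# w and v are obtained as w = U_k.T @ S_inv_half @ x_query and v = U_k.T @ S_inv_half @ y_db,
# where U_k holds the k selected canonical vectors and S_inv_half = Sigma_auto^{-1/2}.
```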

3. Experimental Results

3.1. Training Datasets

We resort to two datasets for training, namely, 120k-Structure from Motion (120k-SfM) and 30k-Structure from Motion (30k-SfM) [35]. Both are preprocessed to eliminate overlaps with the evaluation datasets. A succinct description of these two datasets can be found below:

3.1.1. 120k-Structure from Motion

The 120k-Structure from Motion (120k-SfM) dataset is constructed from the one used in the work of Schonberger et al. [36], which contains 713 3D models with nearly 120k images. The maximum size of each image is 1024 × 1024. The original dataset includes all images from Oxford5k and Paris6k; those images are removed to avoid overlaps (in total, 98 clusters are eliminated).

3.1.2. 30k-Structure from Motion

The 30k-Structure from Motion (30k-SfM) dataset is a subset of 120k-SfM, which contains approximately 30k images and 551 classes. Each image is resized so that its maximum size is 362 × 362.
Each dataset serves its own purpose: 30k-SfM is a small dataset while 120k-SfM is a large one. This enables us to investigate the pros and cons of training sets of different sizes. Compared to 30k-SfM, 120k-SfM supplies richer features to be explored by the methods being tested.

3.2. Training Details

Using each dataset, two lists of matching and non-matching pairs of images are created for training the feature-space analysis (the CNN itself is not trained). Table 1 shows some examples of matching and non-matching pairs. Specifically, we randomly select 10,960 raw pairs from 30k-SfM and 58,502 raw pairs from 120k-SfM. We double the number of matching and non-matching pairs by simultaneously using each raw pair and its flipped version to ensure that the diagonal/off-diagonal blocks of the data covariances in Equations (1) and (2) are identical and symmetric. This can also be seen from Table 1: each pair is used twice, but with its image order flipped.
The feature vector of a given image is extracted from the very last pooling layer of a pre-trained VGG16 via one of the following three pooling strategies: Global Max (MAC) pooling, Global Average (AVE) pooling, and Global Standard Deviation (SD) pooling, which compute, respectively, the maximum value, the average value, and the standard deviation of the feature map in each channel. We conducted separate training for each of these strategies in order to compare performances.
For benchmarking, the proposed method (G-CCA) and its variant (S-CCA) were trained along with three alternative feature-space analysis methods, i.e., PCAw [21], Supervised PCA (SPCA) [37], and Multiclass LDA (MLDA) [38]. G-CCA is depicted in Figure 2, while S-CCA is the same as G-CCA except that the scalar similarity measure is used in the final step (namely, in the last block of Figure 2, $score_G$ is replaced with $score_S = w^T v$). PCAw infers a basis matrix of the feature space from the covariance matrix of the training image features. This basis matrix is used to whiten and compress new image features, which are then leveraged to make a matching/non-matching decision based on the scalar similarity measure. See [12] for a detailed description of the PCAw method and its performance. Furthermore, we compared the proposed method with SPCA, a weighted PCA method that uses a Laplacian matrix to characterize the relationship among the classes in the dataset. We implement SPCA by following the steps in [39]. As to LDA [40], its application to image retrieval has also been thoroughly investigated [41], which is hardly surprising given its popularity in statistical analysis. Here we use its variant MLDA [38] as a competing feature-space analysis method. MLDA is trained using the classes provided by both training datasets. It derives a set of projection vectors that offer the best linear separation of the classes (full separation is achievable if the classes are linearly separable; otherwise, MLDA produces some overlap between the classes). These projection vectors are employed to transform and compress (in the sense of dimensionality reduction) new feature vectors. Scalar similarity is then evaluated for the transformed features to determine whether or not they match.

3.3. Implementation Details

In the experiments, we compare G-CCA and S-CCA with PCAw, SPCA, and MLDA. G-CCA and S-CCA are the methods presented in this paper, while PCAw, SPCA, and MLDA are implemented by following the procedures in [21,38,39]. Here, we discuss some implementation details.
Firstly, S-CCA, PCAw, SPCA, and MLDA use the scalar similarity score as the confidence score, while G-CCA uses the proposed score in Equation (8). Secondly, for all these methods, the feature vectors are obtained via MAC, AVE, and SD pooling, and centralization and normalization are performed. Thirdly, the performance comparisons are conducted for eight dimensions: 512, 450, 400, 300, 200, 100, 50, and 25. Lastly, we calculate the scores between the query image and each image in the test dataset, and obtain the image retrieval results by ranking the scores from high to low. All the methods are evaluated by the mean Average Precision (mAP); we calculate the mAP without enforcing monotonicity of the precision-recall relationship. It can be formulated as follows:
$$\text{mAP} = \frac{\sum_{i=1}^{m}\text{AP}_i}{m} \quad\text{with}\quad \text{AP}_i = \sum_{k=1}^{n} P(k)\,\Delta r(k),$$
where $\text{AP}_i$ is the average precision for the $i$-th query image, $m$ is the total number of query images, $n$ is the total number of images in the testing dataset, $P(k)$ is the precision of the top $k$ results, and $\Delta r(k) = R(k) - R(k-1)$ with $R(k)$ being the recall of the top $k$ results. For calculating the precision $P(k)$ and recall $R(k)$, the positive labels for each query image are provided by the test datasets.
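A small sketch of this evaluation metric (assumed NumPy code; the function names are illustrative) is shown below; it follows the definition above, i.e., without enforcing monotonicity of the precision-recall curve.

```python
# Sketch: AP_i = sum_k P(k) * (R(k) - R(k-1)), and mAP as the mean over queries.
import numpy as np

def average_precision(ranked_labels):
    """ranked_labels: 1-D array of 0/1 relevance labels, ordered by descending score."""
    ranked_labels = np.asarray(ranked_labels, dtype=float)
    n_pos = ranked_labels.sum()
    if n_pos == 0:
        return 0.0
    cum_hits = np.cumsum(ranked_labels)
    precision = cum_hits / np.arange(1, len(ranked_labels) + 1)   # P(k)
    delta_recall = ranked_labels / n_pos                          # R(k) - R(k-1)
    return float(np.sum(precision * delta_recall))

def mean_average_precision(all_ranked_labels):
    return float(np.mean([average_precision(r) for r in all_ranked_labels]))
```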

3.4. Evaluation Datasets and Details

Four datasets, namely, Oxford5k [42], Paris6k [43], R Oxford [44], and R Paris [44], are used to assess the performance of each retrieval method. As the first two datasets are contained in the large raw 120k-SfM dataset, they are excluded from the training dataset via preprocessing. The last two datasets contain new annotations and more difficult query images, and consequently create more challenges for image retrieval; therefore, they can help test the reliability of our approach. A short description of each dataset is given below.

3.4.1. Oxford5k

Oxford5k dataset contains 5063 images and 55 query images for 11 different buildings. It is annotated with bounding boxes for the main objects.

3.4.2. R Oxford

R Oxford dataset contains 4993 images and 70 query images for 11 different buildings. Query images are excluded from the retrieval images. Same as Oxford5k, it is annotated with bounding boxes for the main objects.

3.4.3. Paris6k

Paris6k dataset contains 6412 images and 55 query images for 11 different buildings. It is also annotated with bounding boxes.

3.4.4. R Paris

R Paris dataset contains 6322 images and 70 query images for 11 different buildings. Query images are excluded from the retrieval images. Same as Paris6k, it is annotated with bounding boxes.
The performance of each retrieval method is evaluated using mean Average Precision (mAP) [42]. The positive labels of each query image are provided by the datasets. The standard evaluation protocol is followed for Oxford5k and Paris6k. As for the R Oxford and R Paris datasets, the medium protocol setups in [44] are adopted. We crop all the query images with the provided bounding boxes before feeding them to VGG16. Each method undergoes training and evaluation twice: it is first trained on the small dataset (30k-SfM) and evaluated, and then trained on the large dataset (120k-SfM) and evaluated again. This enables us to study the effect of dataset size and diversity on the methods under comparison.

3.5. Performance Evaluation and Analysis

Before getting into the performance evaluation of the proposed method, it is useful to have some insight into how discriminative the canonical vectors are. Figure 3 and Figure 4 show the profile of the diagonal elements of the off-diagonal blocks in Equations (3) and (4). It can be seen that the values of $c_t^{(N)}$ fluctuate around zero, whereas those of $c_t^{(M)}$ range between 0.1 and 0.9. This observation suggests that there exists a set of canonical vectors that can effectively tell apart matching from non-matching pairs of images, as borne out by the results in the rest of this subsection.
Table 2 reports the baseline performances of MAC, AVE, and SD without dimensionality reduction. Specifically, for these baselines, we directly calculate the scalar similarity between the pooled features (after centralization and normalization) of the query image and each image in the testing dataset. In the evaluation, we consider the proposed method (G-CCA) and its variant with the Gaussian-distribution-based hypothesis testing replaced by scalar similarity (S-CCA). From Table 2, we observe that G-CCA achieves better performance than S-CCA in most cases, the exceptions being Paris6k and AVE pooling on R Paris.
By considering three different pooling strategies, the image retrieval methods are trained on the 30k-SfM dataset and evaluated on all four test sets. Table 3 provides a comprehensive depiction of the experimental results for each retrieval method with different pooling strategies and feature dimensionality choices (compression levels). The results for MLDA are not reported there because MLDA cannot be trained on the 30k-SfM dataset, a consequence of the fact that the differences between classes are too small as far as MLDA training is concerned. From Table 3, three observations can be made. The first concerns the effect of the pooling strategy. Specifically, SD pooling appears to result in the most competitive performance for all methods at every choice of feature dimensionality, while MAC renders G-CCA superior to SPCA and PCAw at low dimensions over all test sets. The second observation is that for the MAC, AVE, and SD pooling strategies, the proposed method outperforms PCAw at low feature dimensionality. As such, the proposed method is a better choice for producing compact features than PCAw regardless of the pooling strategy. The last observation is that G-CCA is more robust against dimensionality reduction than S-CCA.
The performance of the proposed method can be improved by replacing 30k-SfM with the larger training set 120k-SfM. Table 4 shows the corresponding evaluation results for all the methods with different pooling strategies and dimensionality choices (the only exception is SPCA, for which training on 120k-SfM is computationally infeasible, as its Laplacian matrix is too large to be stored on our computer). It is clear that the increased-size training set leads to an improved mAP performance on all test sets and for all pooling strategies. It is also interesting to note that the proposed method outperforms all others on Oxford5k. This uniform superiority across all dimensions is only attained on Paris6k using SD pooling. Although AVE and MAC improve mAP, they cause G-CCA to lose its edge at high dimensions on Paris6k. In contrast, with SD pooling, the proposed method retains its dominant performance at all feature dimensions. On R Oxford and R Paris, the performance of G-CCA is better than that of MLDA at almost all dimensions with MAC, and G-CCA outperforms PCAw at almost every dimension with all three pooling strategies.
Based on Table 3 and Table 4, there are three notable advantages of G-CCA over MLDA, PCAw, and SPCA. The first is that the CCA-based methods can be trained using datasets with small differences between classes, whereas MLDA cannot. The second is that G-CCA typically shows a more graceful performance degradation than PCAw after dimensionality reduction. The last is that, unlike G-CCA, SPCA cannot be trained on large datasets.
Table 5, Table 6 and Table 7 present some retrieval results for visual illustration. In Table 5, a query image from the Oxford5k set is presented to PCAw, SPCA, and G-CCA, trained on the 30k-SfM set, while in Table 6 and Table 7, a query image from the Oxford5k set is presented to PCAw, MLDA, and G-CCA, trained on the 120k-SfM set. We list the top 10 matches for each method, with each list ranked using the matching score associated with the corresponding method. Table 5 and Table 6 show the top 10 retrieved images for different methods with SD pooling, while Table 7 gives examples for G-CCA with different pooling strategies.

4. Conclusions

In view of the success of DL in image classification, a CCA-based method is proposed to exploit DL features for image retrieval applications. By adopting an OTS CNN without fine-tuning, it achieves good retrieval accuracy with a minimal computational overhead. As shown by the experimental results on standard evaluation datasets, the proposed method is performance-wise competitive against traditional and other OTS-CNN-based methods. Moreover, it exhibits improved robustness against dimensionality reduction and enhanced sensitivity to feature mismatch.

Author Contributions

Data curation, K.S.; Methodology, J.L. and J.C.; Software, K.S., X.L., M.A., X.G. and H.L.; Supervision, J.C.; Writing–original draft, K.S.; Writing–review and editing, M.A., J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada through a Discovery Grant.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Chernoff Information between Two 2-Dimensional Gaussian Distributions

For notational simplicity, we suppress subscript t in the following derivation. Consider
$$S^{(M)} = \begin{bmatrix} 1 & c^{(M)} \\ c^{(M)} & 1 \end{bmatrix}, \qquad S^{(N)} = \begin{bmatrix} 1 & c^{(N)} \\ c^{(N)} & 1 \end{bmatrix},$$
where $c^{(M)}$ and $c^{(N)}$ are two corresponding coefficients. Let $S(\lambda) = \big(\lambda (S^{(M)})^{-1} + (1-\lambda)(S^{(N)})^{-1}\big)^{-1}$, $\lambda \in [0, 1]$. Now we proceed to find the solution $\lambda = \lambda^*$ of the equation $D(S(\lambda)\,\|\,S^{(M)}) = D(S(\lambda)\,\|\,S^{(N)})$.
Note that
$$D\big(S(\lambda)\,\|\,S^{(M)}\big) = D\big(S(\lambda)\,\|\,S^{(N)}\big) \iff \log_e\frac{|S^{(M)}|}{|S^{(N)}|} = \operatorname{tr}\Big(\big((S^{(N)})^{-1} - (S^{(M)})^{-1}\big)\, S(\lambda)\Big).$$
We have
$$\big((S^{(N)})^{-1} - (S^{(M)})^{-1}\big)\, S(\lambda) = \frac{1}{\lambda}\Big(\big(\lambda (S^{(M)})^{-1} S^{(N)} + (1-\lambda) I\big)^{-1} - I\Big).$$
It can be verified that
$$\big(\lambda (S^{(M)})^{-1} S^{(N)} + (1-\lambda) I\big)^{-1} = \frac{1}{\theta}\begin{bmatrix} \dfrac{\lambda\big(1 - c^{(M)}c^{(N)}\big)}{1-(c^{(M)})^2} + 1 - \lambda & \dfrac{\lambda\big(c^{(M)} - c^{(N)}\big)}{1-(c^{(M)})^2} \\[2ex] \dfrac{\lambda\big(c^{(M)} - c^{(N)}\big)}{1-(c^{(M)})^2} & \dfrac{\lambda\big(1 - c^{(M)}c^{(N)}\big)}{1-(c^{(M)})^2} + 1 - \lambda \end{bmatrix},$$
where
$$\theta = -\frac{\big(c^{(N)} - c^{(M)}\big)^2}{1-(c^{(M)})^2}\,\lambda^2 + \frac{2 c^{(M)}\big(c^{(M)} - c^{(N)}\big)}{1-(c^{(M)})^2}\,\lambda + 1.$$
As a consequence,
$$\operatorname{tr}\Big(\big((S^{(N)})^{-1} - (S^{(M)})^{-1}\big)\, S(\lambda)\Big) = \frac{2}{\theta}\left(\frac{\big(c^{(N)} - c^{(M)}\big)^2}{1-(c^{(M)})^2}\,\lambda + \frac{c^{(M)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(M)})^2}\right).$$
Therefore, $\lambda = \lambda^*$ is a root in $[0, 1]$ of the following quadratic equation:
$$\alpha\lambda^2 + \beta\lambda + \gamma = 0, \qquad (A1)$$
where
$$\alpha = \frac{\big(c^{(N)} - c^{(M)}\big)^2}{1-(c^{(M)})^2}\,\log_e\frac{|S^{(M)}|}{|S^{(N)}|}, \qquad \beta = \frac{2\big(c^{(N)} - c^{(M)}\big)^2}{1-(c^{(M)})^2} - \frac{2 c^{(M)}\big(c^{(M)} - c^{(N)}\big)}{1-(c^{(M)})^2}\,\log_e\frac{|S^{(M)}|}{|S^{(N)}|}, \qquad \gamma = \frac{2 c^{(M)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(M)})^2} - \log_e\frac{|S^{(M)}|}{|S^{(N)}|}.$$
We shall show that Equation (A1) has a unique root in [ 0 , 1 ] , which is given by
$$\lambda^* = \frac{-\beta + \sqrt{\beta^2 - 4\alpha\gamma}}{2\alpha}. \qquad (A2)$$
Clearly, Equation (A1) must have a root in $[0, 1]$ since $D(S(\lambda)\,\|\,S^{(M)})\big|_{\lambda=0} > 0$, $D(S(\lambda)\,\|\,S^{(N)})\big|_{\lambda=1} > 0$, and $D(S(\lambda)\,\|\,S^{(M)})\big|_{\lambda=1} = D(S(\lambda)\,\|\,S^{(N)})\big|_{\lambda=0} = 0$. So it remains to prove the uniqueness of this root.
First consider the case $(c^{(N)})^2 > (c^{(M)})^2$. It is clear that $\alpha > 0$ and
$$\gamma = \frac{2 c^{(M)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(M)})^2} - \log_e\frac{|S^{(M)}|}{|S^{(N)}|} = \frac{2 c^{(M)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(M)})^2} - \log_e\frac{1-(c^{(M)})^2}{1-(c^{(N)})^2} \le \frac{2 c^{(M)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(M)})^2} - \frac{(c^{(N)})^2 - (c^{(M)})^2}{1-(c^{(M)})^2} = -\frac{\big(c^{(N)} - c^{(M)}\big)^2}{1-(c^{(M)})^2} < 0,$$
where the first inequality is due to $\log_e x \ge \frac{x-1}{x}$. Therefore, the two roots of Equation (A1) must be of different signs, which implies that there exists a unique root in $[0, 1]$ with the expression given by Equation (A2).
Next consider the case $(c^{(N)})^2 < (c^{(M)})^2$. Define $\bar\lambda = 1 - \lambda$. Equation (A1) can be written equivalently as
$$\alpha(1 - \bar\lambda)^2 + \beta(1 - \bar\lambda) + \gamma = 0,$$
i.e.,
$$\alpha\bar\lambda^2 - (2\alpha + \beta)\bar\lambda + (\alpha + \beta + \gamma) = 0. \qquad (A3)$$
Note that
$$2\alpha + \beta = \frac{2\big(c^{(N)} - c^{(M)}\big)^2}{1-(c^{(M)})^2} - \frac{2 c^{(N)}\big(c^{(M)} - c^{(N)}\big)}{1-(c^{(M)})^2}\,\log_e\frac{|S^{(M)}|}{|S^{(N)}|}$$
and
$$\alpha + \beta + \gamma = \frac{2 c^{(N)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(M)})^2} - \frac{1-(c^{(N)})^2}{1-(c^{(M)})^2}\,\log_e\frac{|S^{(M)}|}{|S^{(N)}|}.$$
Therefore, Equation (A3) can be rewritten as
$$\bar\alpha\bar\lambda^2 + \bar\beta\bar\lambda + \bar\gamma = 0, \qquad (A4)$$
where
$$\bar\alpha = \frac{\big(c^{(M)} - c^{(N)}\big)^2}{1-(c^{(N)})^2}\,\log_e\frac{|S^{(N)}|}{|S^{(M)}|}, \qquad \bar\beta = \frac{2\big(c^{(M)} - c^{(N)}\big)^2}{1-(c^{(N)})^2} - \frac{2 c^{(N)}\big(c^{(N)} - c^{(M)}\big)}{1-(c^{(N)})^2}\,\log_e\frac{|S^{(N)}|}{|S^{(M)}|}, \qquad \bar\gamma = \frac{2 c^{(N)}\big(c^{(M)} - c^{(N)}\big)}{1-(c^{(N)})^2} - \log_e\frac{|S^{(N)}|}{|S^{(M)}|}.$$
A similar argument to that for the case $(c^{(N)})^2 > (c^{(M)})^2$ can be used to prove that Equation (A4) has one root in $[0, 1]$ and the other root in $(-\infty, 0)$. This implies that Equation (A1) must have one root in $[0, 1]$ and the other root in $(1, \infty)$; the one in $[0, 1]$ must be given by Equation (A2) (note that $\alpha < 0$ when $(c^{(N)})^2 < (c^{(M)})^2$).
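For a quick numerical check of Equation (A2), the following assumed sketch evaluates the closed-form $\lambda^*$ directly from a pair of coefficients (using $|S^{(M)}| = 1 - (c^{(M)})^2$ and $|S^{(N)}| = 1 - (c^{(N)})^2$); the function name is illustrative.

```python
# Sketch: closed-form lambda* of Equation (A2) for given coefficients c_m, c_n.
import numpy as np

def lambda_star(c_m, c_n):
    log_ratio = np.log((1 - c_m**2) / (1 - c_n**2))     # log(|S_m| / |S_n|)
    denom = 1 - c_m**2
    alpha = (c_n - c_m)**2 / denom * log_ratio
    beta = 2 * (c_n - c_m)**2 / denom - 2 * c_m * (c_m - c_n) / denom * log_ratio
    gamma = 2 * c_m * (c_n - c_m) / denom - log_ratio
    return (-beta + np.sqrt(beta**2 - 4 * alpha * gamma)) / (2 * alpha)

# Example: lambda_star(0.5, 0.0) is about 0.466, which lies in [0, 1] as expected.
```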

References

  1. Wengang, Z.; Houqiang, L.; Qi, T. Recent advance in content-based image retrieval: A literature survey. arXiv 2017, arXiv:1706.06064. [Google Scholar]
  2. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  3. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  4. Tan, X.; Triggs, B. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process. 2010, 19, 1635–1650. [Google Scholar] [PubMed] [Green Version]
  5. Ojansivu, V.; Heikkilä, J. Blur insensitive texture classification using local phase quantization. In International Conference on Image and Signal Processing; Springer: Berlin/Heidelberg, Germany, 2008; pp. 236–243. [Google Scholar]
  6. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. IEEE Conf. Comput. Vis. Pattern Recognit. 2005, 1, 886–893. [Google Scholar]
  7. Batool, A.; Nisar, M.W.; Shah, J.H.; Khan, M.A.; El-Latif, A.A.A. iELMNet: Integrating novel improved extreme learning machine and convolutional neural network model for traffic sign detection. Big Data, 2022; ahead of print. [Google Scholar] [CrossRef]
  8. Nawaz, M.; Nazir, T.; Javed, A.; Tariq, U.; Yong, H.-S.; Khan, M.A.; Cha, J. An efficient deep learning approach to automatic glaucoma detection using optic disc and optic cup localization. Sensors 2022, 22, 434. [Google Scholar] [CrossRef]
  9. Razavian, A.S.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; Institute of Electrical and Electronics Engineers: Columbus, OH, USA, 2014; pp. 806–813. [Google Scholar]
  10. Khan, M.A.; Muhammad, K.; Sharif, M.; Akram, T.; Kadry, S. Intelligent fusion-assisted skin lesion localization and classification for smart healthcare. Neural Comput. Appl. 2021, 1–16. [Google Scholar] [CrossRef]
  11. Tolias, G.; Sicre, R.; Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. arXiv 2015, arXiv:1511.05879. [Google Scholar]
  12. Babenko, A.; Lempitsky, V. Aggregating local deep features for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1269–1277. [Google Scholar]
  13. Lin, J.; Duan, L.-Y.; Wang, S.; Bai, Y.; Lou, Y.; Chandrasekhar, V.; Huang, T.; Kot, A.; Gao, W. Hnip: Compact deep invariant representations for video matching, localization, and retrieval. IEEE Trans. Multimed. 2017, 19, 1968–1983. [Google Scholar] [CrossRef]
  14. Kalantidis, Y.; Mellina, C.; Osindero, S. Cross-dimensional weighting for aggregated deep convolutional features. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 685–701. [Google Scholar]
  15. Azhar, I.; Sharif, M.; Raza, M.; Khan, M.A.; Yong, H.-S. A decision support system for face sketch synthesis using deep learning and artificial intelligence. Sensors 2021, 21, 8178. [Google Scholar] [CrossRef] [PubMed]
  16. Khan, S.; Khan, M.A.; Alhaisoni, M.; Tariq, U.; Yong, H.-S.; Armghan, A.; Alenezi, F. Human action recognition: A paradigm of best deep learning features selection and serial based extended fusion. Sensors 2021, 21, 7941. [Google Scholar] [CrossRef]
  17. Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Largescale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3456–3465. [Google Scholar]
  18. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 5297–5307. [Google Scholar]
  19. Radenović, F.; Tolias, G.; Chum, O. CNN image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 3–20. [Google Scholar]
  20. Gordo, A.; Almazan, J.; Revaud, J.; Larlus, D. End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 2017, 124, 237–254. [Google Scholar] [CrossRef] [Green Version]
  21. Hyvärinen, A.; Hurri, J.; Hoyer, P.O. Principal components and whitening. In Natural Image Statistics; Springer: Berlin/Heidelberg, Germany, 2009; pp. 93–130. [Google Scholar]
  22. Izenman, A.J. Linear discriminant analysis. In Modern Multivariate Statistical Techniques; Springer: Berlin/Heidelberg, Germany, 2013; pp. 237–280. [Google Scholar]
  23. Johnson, R.A.; Wichern, D.W. Canonical correlation analysis. In Applied Multivariate Statistical Analysis, 6th ed.; Pearson: Upper Saddle River, NJ, USA, 2018; pp. 539–574. [Google Scholar]
  24. Gong, Y.; Ke, Q.; Isard, M.; Lazebnik, S. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis. 2014, 106, 210–233. [Google Scholar] [CrossRef] [Green Version]
  25. Yan, F.; Mikolajczyk, K. Deep correlation for matching images and text. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3441–3450. [Google Scholar]
  26. Dorfer, M.; Schlüter, J.; Vall, A.; Korzeniowski, F.; Widmer, G. End-to-end cross-modality retrieval with cca projections and pairwise ranking loss. Int. J. Multimed. Inf. Retr. 2018, 7, 117–128. [Google Scholar] [CrossRef] [Green Version]
  27. Yu, Y.; Tang, S.; Aizawa, K.; Aizawa, A. Category-based deep cca for fine-grained venue discovery from multimodal data. arXiv 2018, arXiv:1805.02997. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  28. Lin, Z.; Peltonen, J. An information retrieval approach for finding dependent subspaces of multiple views. In International Conference on Machine Learning and Data Mining in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2017; pp. 1–16. [Google Scholar]
  29. Yair, O.; Talmon, R. Local canonical correlation analysis for nonlinear common variables discovery. IEEE Trans. Signal Process. 2017, 65, 1101–1115. [Google Scholar] [CrossRef]
  30. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  31. Abdi, H. The eigen-decomposition: Eigenvalues and eigenvectors. In Encyclopedia of Measurement and Statistics; SAGE Publications, Inc.: Thousand Oaks, CA, USA, 2007; pp. 304–308. [Google Scholar]
  32. Nielsen, F. An information-geometric characterization of chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272. [Google Scholar] [CrossRef] [Green Version]
  33. Nielsen, F. Chernoff information of exponential families. arXiv 2011, arXiv:1102.2684. [Google Scholar]
  34. Prince, S.J. Common probability distribution. In Computer Vision: Models, Learning and Inference; Cambridge University Press: Cambridge, England, 2012; pp. 35–42. [Google Scholar]
  35. Radenović, F.; Tolias, G.; Chum, O. Fine-tuning CNN image retrieval with no human annotation. In IEEE Transactions on Pattern Analysis and Machine Intelligence; Institute of Electrical and Electronics Engineers: Manhattan, NY, USA, 2018. [Google Scholar]
  36. Schonberger, J.L.; Radenovic, F.; Chum, O.; Frahm, J.-M. From single image query to detailed 3d reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5126–5134. [Google Scholar]
  37. Koren, Y.; Carmel, L. Robust linear dimensionality reduction. IEEE Trans. Vis. Comput. Graph. 2004, 10, 459–470. [Google Scholar] [CrossRef] [PubMed]
  38. Li, T.; Zhu, S.; Ogihara, M. Using discriminant analysis for multi-class classification. In Third IEEE International Conference on Data Mining; IEEE Computer Society: Los Alamitos, CA, USA, 2003; p. 589. [Google Scholar]
  39. Mirkes, E.M.; Gorban, A.N.; Zinoviev, A. A Supervised PCA. 2016. Available online: https://github.com/Mirkes/SupervisedPCA (accessed on 10 September 2021).
  40. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  41. Swets, D.L.; Weng, J.J. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 8, 831–836. [Google Scholar] [CrossRef] [Green Version]
  42. Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
  43. Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008. [Google Scholar]
  44. Radenovic, F.; Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Figure 1. Modified VGG16 for feature extraction.
Figure 2. Block diagram of the proposed method.
Figure 3. Profile of the diagonal elements of $\Lambda$ and $\Pi$ (i.e., $c_t^{(M)}$ and $c_t^{(N)}$, where $t \in \{1, 2, \ldots, 512\}$) using AVE features. The CCA training was performed on the 120k-SfM dataset.
Figure 4. Profile of the sorted diagonal elements of $\Lambda$ and $\Pi$ (i.e., $\tilde c_t^{(M)}$ and $\tilde c_t^{(N)}$, where $t \in \{1, 2, \ldots, 512\}$) using AVE features. The CCA training was performed on the 120k-SfM dataset.
Figure 5. A 2D visualization of matrix Π .
Figure 6. A 3D visualization of matrix Π .
Table 1. Examples of matching/non-matching pairs.
Matching Pair: [six example image pairs from the same classes]
Non-Matching Pair: [six example image pairs from different classes]
Table 2. Performance comparison of the baseline, S-CCA, and G-CCA on Oxford5k, R Oxford, Paris6k, and R Paris without dimension reduction.
Method         Oxford5k   R Oxford   Paris6k   R Paris
MAC            0.5296     0.3295     0.7455    0.5122
S-CCA + MAC    0.5800     0.3575     0.7726    0.5408
G-CCA + MAC    0.6275     0.3996     0.7455    0.5939
AVE            0.5312     0.2884     0.6467    0.4653
S-CCA + AVE    0.6845     0.4303     0.7845    0.5936
G-CCA + AVE    0.7146     0.4444     0.7507    0.5812
SD             0.6095     0.3834     0.7355    0.5311
S-CCA + SD     0.6943     0.4503     0.8191    0.6199
G-CCA + SD     0.7419     0.4806     0.8164    0.6403
1. The evaluation results are based on 120k-SfM. 2. For the same type of features, the best performances are highlighted in bold.
Table 3. Evaluation results from 30k-SfM on Oxford5k, R Oxford, Paris6k, and R Paris.
Oxford5k        MAC                              AVE                              SD
Dim             SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA
25              0.3589  0.3555  0.2431  0.3873   0.4474  0.4443  0.3091  0.4873   0.4757  0.4838  0.3439  0.4979
50              0.4412  0.4258  0.3174  0.4487   0.4930  0.4933  0.3782  0.5127   0.5086  0.5074  0.4403  0.5690
100             0.5016  0.5027  0.4122  0.5043   0.5599  0.5697  0.5447  0.6034   0.6002  0.6041  0.5191  0.6164
200             0.5628  0.5583  0.4818  0.5501   0.6083  0.6086  0.6157  0.6445   0.6635  0.6619  0.6280  0.6772
300             0.5723  0.5672  0.5280  0.5379   0.6416  0.6307  0.6428  0.6552   0.6753  0.6736  0.6513  0.6830
400             0.5728  0.5715  0.5505  0.5405   0.6517  0.6385  0.6373  0.6525   0.6811  0.6811  0.6703  0.6745
450             0.5670  0.5654  0.5609  0.5364   0.6544  0.6422  0.6393  0.6538   0.6839  0.6849  0.6740  0.6746
512             0.5615  0.5601  0.5580  0.5363   0.6506  0.6388  0.6493  0.6537   0.6766  0.6763  0.6764  0.6743

R Oxford        MAC                              AVE                              SD
Dim             SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA
25              0.2070  0.2226  0.1495  0.2276   0.2702  0.2709  0.1939  0.2590   0.2883  0.2856  0.2116  0.3031
50              0.2823  0.2771  0.1886  0.2914   0.2731  0.2757  0.2206  0.2876   0.3117  0.3123  0.2590  0.3485
100             0.3259  0.3281  0.2484  0.3282   0.3304  0.3197  0.3083  0.3372   0.3885  0.3795  0.3007  0.3848
200             0.3462  0.3545  0.3071  0.3569   0.3569  0.3531  0.3759  0.4002   0.4399  0.4368  0.4021  0.4417
300             0.3595  0.3593  0.3290  0.3413   0.3901  0.3771  0.3911  0.4057   0.4507  0.4420  0.4173  0.4484
400             0.3576  0.3568  0.3424  0.3400   0.3905  0.3796  0.3798  0.4065   0.4526  0.4381  0.4454  0.4538
450             0.3551  0.3544  0.3466  0.3398   0.4002  0.3772  0.3876  0.4052   0.4498  0.4382  0.4435  0.4499
512             0.3442  0.3469  0.3444  0.3396   0.4042  0.3767  0.3963  0.4077   0.4417  0.4383  0.4412  0.4419

Paris6k         MAC                              AVE                              SD
Dim             SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA
25              0.4878  0.5084  0.4133  0.5464   0.4944  0.4330  0.4182  0.4990   0.5633  0.5858  0.4758  0.5969
50              0.6027  0.6208  0.5391  0.6347   0.5692  0.5893  0.5898  0.6153   0.6415  0.6555  0.6084  0.6746
100             0.6691  0.6750  0.5848  0.6808   0.6441  0.6736  0.6559  0.6790   0.7290  0.7267  0.6988  0.7426
200             0.7035  0.6942  0.6384  0.7166   0.6931  0.6994  0.7049  0.7106   0.7719  0.7620  0.7501  0.7811
300             0.7004  0.6980  0.6701  0.7067   0.7109  0.7328  0.7297  0.7118   0.7834  0.7819  0.7739  0.7892
400             0.7076  0.7057  0.6893  0.7052   0.7375  0.7586  0.7418  0.7120   0.8010  0.7970  0.7885  0.7867
450             0.7091  0.7073  0.6964  0.7027   0.7482  0.7679  0.7472  0.7130   0.8067  0.8066  0.7969  0.7871
512             0.7032  0.7060  0.7039  0.7029   0.7508  0.7732  0.7520  0.7133   0.8020  0.8031  0.8036  0.7874

R Paris         MAC                              AVE                              SD
Dim             SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA    SPCA    PCAw    S-CCA   G-CCA
25              0.3966  0.3944  0.3225  0.4212   0.3877  0.3981  0.3085  0.4212   0.4361  0.4410  0.3602  0.4433
50              0.4725  0.4738  0.4063  0.4781   0.4354  0.4442  0.4385  0.4538   0.5006  0.5015  0.4524  0.5056
100             0.5007  0.5021  0.4311  0.5106   0.4820  0.4886  0.4946  0.5082   0.5457  0.5501  0.5258  0.5653
200             0.5183  0.5182  0.4668  0.5370   0.5118  0.5129  0.5302  0.5355   0.5822  0.5827  0.5635  0.5985
300             0.5206  0.5200  0.4894  0.5285   0.5281  0.5306  0.5507  0.5377   0.5966  0.5964  0.5829  0.6045
400             0.5224  0.5219  0.5040  0.5272   0.5504  0.5507  0.5577  0.5379   0.6064  0.6070  0.5958  0.6024
450             0.5222  0.5200  0.5109  0.5255   0.5587  0.5590  0.5620  0.5383   0.6119  0.6121  0.6013  0.6027
512             0.5169  0.5168  0.5154  0.5256   0.5579  0.5588  0.5646  0.5384   0.6051  0.6067  0.6048  0.6028
1. The best performances in each dimension are highlighted in bold.
Table 4. Evaluation results from 120k-SfM on Oxford5k, R Oxford, Paris6k, and R Paris.
Oxford5k        MAC                              AVE                              SD
Dim             MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA
25              0.3603  0.3906  0.2677  0.4019   0.4758  0.4266  0.2644  0.4821   0.4759  0.4790  0.3400  0.5212
50              0.4760  0.4319  0.3802  0.4987   0.5612  0.5033  0.4293  0.5572   0.5375  0.5355  0.4667  0.5956
100             0.5157  0.5275  0.4537  0.5481   0.6017  0.5756  0.5529  0.6402   0.6429  0.6240  0.5593  0.6688
200             0.5887  0.5453  0.5562  0.6231   0.6571  0.6437  0.6498  0.6964   0.6861  0.6410  0.6620  0.7244
300             0.6028  0.5669  0.5697  0.6306   0.6643  0.6474  0.6658  0.7102   0.7030  0.6711  0.6754  0.7382
400             0.5974  0.5810  0.5768  0.6275   0.6688  0.6681  0.6758  0.7139   0.7020  0.6970  0.6864  0.7422
450             0.5939  0.5840  0.5820  0.6279   0.6678  0.6728  0.6781  0.7144   0.6972  0.6986  0.6939  0.7412
512             0.5868  0.5799  0.5800  0.6275   0.6613  0.6711  0.6845  0.7146   0.6958  0.6946  0.6943  0.7419

R Oxford        MAC                              AVE                              SD
Dim             MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA
25              0.2330  0.2503  0.1543  0.2459   0.2712  0.2533  0.1422  0.2441   0.2666  0.3037  0.1782  0.2853
50              0.2989  0.2664  0.2366  0.3025   0.3522  0.2802  0.2337  0.3366   0.3418  0.3357  0.2636  0.3254
100             0.3470  0.3521  0.2724  0.3437   0.3981  0.3318  0.3412  0.4290   0.4002  0.4075  0.3269  0.4073
200             0.3924  0.3510  0.3482  0.3991   0.4324  0.3911  0.3913  0.4411   0.4497  0.4192  0.4085  0.4622
300             0.4006  0.3557  0.3497  0.3986   0.4404  0.3920  0.4056  0.4430   0.4645  0.4454  0.4335  0.4796
400             0.3964  0.3625  0.3526  0.4001   0.4412  0.4106  0.4215  0.4462   0.4673  0.4609  0.4429  0.4812
450             0.3941  0.3613  0.3587  0.3998   0.4363  0.4159  0.4215  0.4443   0.4624  0.4604  0.4394  0.4807
512             0.3881  0.3570  0.3575  0.3996   0.4267  0.4136  0.4303  0.4444   0.4597  0.4501  0.4503  0.4806

Paris6k         MAC                              AVE                              SD
Dim             MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA
25              0.5781  0.4878  0.5109  0.6270   0.5553  0.5013  0.4442  0.5693   0.6204  0.5543  0.5269  0.6611
50              0.6384  0.6153  0.5416  0.6679   0.6362  0.5893  0.5467  0.6314   0.6900  0.6575  0.5935  0.6968
100             0.6916  0.6788  0.6226  0.7339   0.6994  0.6736  0.6657  0.6910   0.7502  0.7313  0.7105  0.7641
200             0.7244  0.7124  0.6765  0.7674   0.7162  0.6994  0.7220  0.7491   0.7845  0.7842  0.7760  0.8043
300             0.7493  0.7214  0.6900  0.7719   0.7299  0.7328  0.7538  0.7491   0.8030  0.8046  0.7973  0.8160
400             0.7548  0.7230  0.7146  0.7729   0.7247  0.7586  0.7729  0.7507   0.8042  0.8143  0.8067  0.8164
450             0.7540  0.7222  0.7729  0.7455   0.7197  0.7679  0.7775  0.7508   0.8003  0.8144  0.8096  0.8161
512             0.7549  0.7288  0.7726  0.7455   0.7111  0.7732  0.7845  0.7507   0.7971  0.8159  0.8164  0.8191

R Paris         MAC                              AVE                              SD
Dim             MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA    MLDA    PCAw    S-CCA   G-CCA
25              0.4321  0.3728  0.3956  0.4787   0.4524  0.3745  0.3607  0.4455   0.4817  0.4136  0.4075  0.5032
50              0.4910  0.4685  0.4214  0.5156   0.4944  0.4495  0.4229  0.4877   0.5373  0.4998  0.4611  0.5415
100             0.5339  0.5096  0.4681  0.5631   0.5003  0.5101  0.5052  0.5340   0.5796  0.5596  0.5472  0.5970
200             0.5526  0.5346  0.5066  0.5910   0.5656  0.5310  0.5437  0.5678   0.6066  0.6002  0.5928  0.6317
300             0.5520  0.5425  0.5124  0.5942   0.5809  0.5566  0.5639  0.5799   0.6195  0.6156  0.6021  0.6408
400             0.5437  0.5406  0.5282  0.5941   0.5843  0.5738  0.5824  0.5857   0.6165  0.6228  0.6072  0.6401
450             0.5399  0.5369  0.5313  0.5941   0.5829  0.5796  0.5844  0.5813   0.6136  0.6187  0.6118  0.6401
512             0.5333  0.5387  0.5408  0.5939   0.5830  0.5828  0.5936  0.5812   0.6093  0.6178  0.6199  0.6403
1. The best performances in each dimension are highlighted in bold.
Table 5. Image retrieval comparison of PCAw, SPCA, and G-CCA.
Query | Top 10 Retrieved Images
A: [query image followed by its top 10 retrieved images]
B: [query image followed by its top 10 retrieved images]
C: [query image followed by its top 10 retrieved images]
1. Top 10 retrieved images from Oxford5k. (A) SD + PCAw. (B) SD + SPCA (C) SD + G-CCA. 2. Correct images are bounded with green boxes, wrong images are bounded with red boxes.
Table 6. Image retrieval comparison of PCAw, MLDA, and G-CCA.
Query | Top 10 Retrieved Images
A: [query image followed by its top 10 retrieved images]
B: [query image followed by its top 10 retrieved images]
C: [query image followed by its top 10 retrieved images]
1. Top 10 retrieved images from Oxford5k. (A) SD + PCAw. (B) SD + MLDA (C) SD + G-CCA. 2. Correct images are bounded with green boxes, wrong images are bounded with red boxes.
Table 7. Image retrieval comparison of G-CCA with MAC, AVE, and SD feature.
Query | Top 10 Retrieved Images
A: [query image followed by its top 10 retrieved images]
B: [query image followed by its top 10 retrieved images]
C: [query image followed by its top 10 retrieved images]
1. Top 10 retrieved images from Oxford5k. (A) MAC + G-CCA. (B) AVE + G-CCA. (C) SD + G-CCA. 2. Correct images are bounded with green boxes, wrong images are bounded with red boxes.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
