1. Introduction
With the rapid development of computer vision and pattern recognition, face images, as one of the most common forms of visual information, offer a wide range of application prospects. Face images can be captured in many fields, including social media, forensic investigations, and surveillance monitoring [1,2,3]. However, these captured face images generally lack labels, which makes it difficult to identify them. The number of unknown identities (clusters) can range from a few dozen to thousands, and how to locate an efficient range, or even the exact number, is an essential problem [4]. In addition, modern clustering algorithms [5,6] require the number of clusters, as well as some adjustable parameters, to be specified in advance. Therefore, it is important to design an efficient facial imagery clustering algorithm for image labeling.
Recently, face image clustering has attracted a lot of interest from researchers [7,8,9,10,11]. L. Zhang et al. [7] presented a clustering method based on spectral clustering and a self-organizing feature mapping (SOM) neural network. C. Qi et al. [8] proposed a deep face clustering method based on a residual graph convolutional network (RGCN). M.S. Abdallah et al. [9] proposed a TV media mining system based on a DCNN to rapidly identify a specific individual in real-time video data. I. Ahn et al. [10] proposed a multiple segmentation technique combined with constrained spectral clustering to label facial images containing objects with complicated boundaries. C. Otto et al. [11] proposed an approximate rank-order clustering algorithm for large-scale face image clustering. Despite the tremendous efforts that have been devoted to image clustering, most previous studies have not adapted to variations in illumination, pose, and complicated backgrounds. In addition, existing studies have failed to estimate the number of clusters and the values of the adjustable parameters.
In general, face clustering involves face representation and a clustering model. A face representation is generally a numeric vector descriptor of a face image. In the literature [12,13,14,15], most traditional hand-crafted face representations lack robustness with respect to adverse factors within the imagery: the quality of the captured images is low due to illumination, pose, complicated backgrounds, and other factors. M. Versaci et al. [16] presented a fuzzy geometrical approach based on unit hypercubes for image contrast enhancement to address this problem. L. Wang et al. [17] learned discriminative target features by aligning the feature domain globally while distinguishing the target clusters locally; their method reduced the impact of illumination, pose, and image quality. We build on the significant progress that has been achieved in unconstrained face representation [18,19,20], which alleviates the influence of these underlying factors by using a convolutional neural network-based face representation. Spectral clustering is a graph-based clustering algorithm. Compared with other clustering algorithms such as hierarchical clustering and K-means, spectral clustering is more robust to uncertainty in data sparsity and distribution. Therefore, we use spectral clustering as the partition model. The main aim of this study is to propose an adaptive facial imagery clustering method, which mainly involves face representations, a spectral clustering model, and reinforcement learning (i.e., Q-learning in this study) [21]. With Q-learning, the clustering model can adaptively search for the optimal setting of the number of clusters and the dimension reduction (i.e., the core adjustable parameter) in the spectral clustering model, which helps to dramatically improve the clustering performance.
The remainder of this paper is organized as follows: In Section 2, we present the details of adaptive clustering; in Section 3, we briefly introduce the face databases used and provide some simulations to show the effectiveness of our proposed method; finally, conclusions and future work are presented in Section 4.
2. Proposed Adaptive Clustering
In this section, we describe the proposed adaptive facial imagery clustering, which includes the following four components: face representations, spectral clustering, performance evaluation, and a feedback module (Q-learning), as shown in Figure 1. In the following subsections, we introduce each aspect in detail.
2.1. Face Representations
Since face images are generally captured in unconstrained situations, traditional face representations lack universal applicability. Considering the superior performance of deep convolutional neural networks (DCNNs) in feature extraction and representation, we leverage a DCNN for our face representation. The specific implementation of face representation includes facial landmark detection and deep face representation, as illustrated in Figure 2 and described in detail below.
Facial landmark detection, also known as face alignment, aims to align faces to a mean face shape based on facial keypoints (mouth, eyes, canthus, etc.). In this way, the positions of these keypoints are almost the same on all aligned faces. Face representations extracted by a DCNN from such aligned faces are more effective and representative, even when a lower-dimensional representation is used to cluster the faces. We adopt the DLIB implementation of Kazemi and Sullivan's method [22], which uses 68 landmarks of the input image to detect a face and return the position of the facial region.
After facial landmark detection, the aligned face is fed into the DCNN to obtain its deep face representation. Specifically, we employ the deep ResNet architecture pretrained in DLIB, which outputs a 128-dimension feature vector. Then, we use L2-normalization [23] to normalize the vector, so that each dimension of the feature vector falls into a specific interval. The use of L2-normalization also improves the performance of the clustering model. After L2-normalization, the 128-dimension normalized vector is the final representation of the face.
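As a concrete reference, this face representation step can be sketched with the public DLIB Python API as follows. The model file names are the standard pretrained DLIB releases; the exact preprocessing used in this study may differ, so the sketch is illustrative rather than the authors' exact pipeline.

```python
import dlib
import numpy as np

# Standard pretrained DLIB models (assumed equivalents of the models used here).
detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_representation(image_path):
    """Detect a face, locate its 68 landmarks, and return an
    L2-normalized 128-dimension deep representation."""
    img = dlib.load_rgb_image(image_path)
    faces = detector(img, 1)                 # upsample once to catch small faces
    if not faces:
        return None
    shape = shape_predictor(img, faces[0])   # 68 facial landmarks
    vec = np.array(face_encoder.compute_face_descriptor(img, shape))
    return vec / np.linalg.norm(vec)         # L2 normalization
```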
2.2. Spectral Clustering
Spectral clustering is an algorithm that evolved from graph theory and has been widely used in different fields such as speech recognition, document analysis, and image segmentation. Implementation of the spectral clustering model includes construction of a similarity matrix and clustering partition. The detailed process is as follows:
1. Given the face representations $X = \{x_1, x_2, \ldots, x_n\}$, calculate the similarity matrix $S$ based on the fully connected method.
2. Construct the adjacency matrix $W$ based on $S$, and then calculate the degree matrix $D$.
3. Calculate the Laplacian matrix $L$:
$$L = D - W$$
4. Construct the normalized Laplacian matrix $L_{\mathrm{sym}}$:
$$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$$
5. Compute the eigenvectors of $L_{\mathrm{sym}}$ and use the dim eigenvectors corresponding to the largest dim eigenvalues. Here, dim is an adjustable parameter, which must be less than the dimension of the representation $x_i$.
6. Create a new face representation matrix $F$, which uses these dim eigenvectors as columns; these eigenvectors are also normalized by L2.
7. With the new representation matrix $F$, the final clustering result is generated by using K-means++.
In the above steps, we use the Gaussian radial basis function to construct $W$ and $S$ as:
$$s_{ij} = w_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$
where $\sigma$ is the radial width parameter, $w_{ij}$ is the adjacency metric, and $s_{ij}$ is the similarity metric of $x_i$ and $x_j$, respectively.
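The partition can be summarized in a short sketch. It follows the standard normalized spectral clustering recipe; the value of $\sigma$ and the eigenvector convention (smallest eigenvalues of $L_{\mathrm{sym}}$, equivalent to the largest eigenvalues of the normalized affinity matrix) are assumptions where the text above is not explicit.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def spectral_clustering(X, k, dim, sigma=1.0):
    """Cluster face representations X (n x 128) into k clusters,
    keeping `dim` eigenvectors of the normalized Laplacian."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))      # fully connected RBF adjacency
    d = W.sum(axis=1)
    D = np.diag(d)                                 # degree matrix
    L = D - W                                      # Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt            # normalized Laplacian
    _, eigvecs = eigh(L_sym)                       # eigenvalues in ascending order
    F = normalize(eigvecs[:, :dim])                # new representation, row-wise L2
    return KMeans(n_clusters=k, init="k-means++", n_init=10).fit_predict(F)
```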
2.3. Performance Evaluation
For most practical image clustering tasks, face images generally lack labels or the labels are incomplete. Therefore, we use the Davies–Bouldin index (DBI) as the internal evaluation measure of the clustering performance. The DBI is defined as:
$$\mathrm{DBI} = \frac{1}{K}\sum_{i=1}^{K}\max_{j \neq i}\left(\frac{\bar{s}_i + \bar{s}_j}{\lVert u_i - u_j \rVert}\right)$$
where $K$ is the number of clusters, $\lVert u_i - u_j \rVert$ is the distance between the centers of clusters $C_i$ and $C_j$, $u_i$ is the center vector of cluster $C_i$, and $\bar{s}_i$ is the average distance within cluster $C_i$. Obviously, a smaller DBI value means better clustering performance, with better intracluster compactness and intercluster separation.
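A direct implementation of this index is given below; it mirrors the definition above, and scikit-learn's `davies_bouldin_score` can be used equivalently off the shelf.

```python
import numpy as np

def davies_bouldin_index(X, labels):
    """DBI: average, over clusters, of the worst-case ratio of summed
    within-cluster scatter to between-center distance (smaller is better)."""
    clusters = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # average distance of each cluster's members to its center
    scatter = np.array([
        np.mean(np.linalg.norm(X[labels == c] - centers[i], axis=1))
        for i, c in enumerate(clusters)
    ])
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(k) if j != i
        )
    return total / k
```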
2.4. Feedback Module
The feedback module adjusts the parameters based on the performance evaluation to optimize the clustering performance. For this, we adopt Q-learning to tune the parameters adaptively. To make this paper self-contained, we first introduce the key idea of Q-learning before applying it to the parameter updating.
2.4.1. Q-Learning
Reinforcement learning is a goal-directed learning methodology that constructs a control strategy to maximize the behavioral performance of an agent. Specifically, the agent acquires state information and reward signals from a complex environment. The agent is not told which actions to take, but aims to find the action that brings the best reward by trying the available actions. In addition, the agent can use knowledge learned in the past to guide future actions. A sketch map of reinforcement learning is shown in Figure 3.
An environment model is not necessary, because reinforcement learning also supports model-free algorithms such as Q-learning, which gradually updates the optimal choice of actions to reach the best state under reward incentives. The process is denoted as the Bellman Equation (3):
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] \quad (3)$$
where $Q(s,a)$ is the state quality of all possible actions (Q-table), $s$ is the state, $a$ is the action, $r$ is the reward signal, $\alpha$ is the learning rate, $\gamma$ is the discount index, $s'$ is the state of the next moment, and $a'$ is the action of the next moment.
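In code, one Bellman update of a dictionary-based Q-table looks roughly as follows; the values of $\alpha$ and $\gamma$ here are illustrative placeholders rather than the settings used in the study.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-table: (state, action) -> estimated quality

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning (Bellman) update, as in Equation (3)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```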
2.4.2. The Application of Q-Learning
In our proposed method, we can treat the different parameter settings as actions and the performance evaluation (DBI) as the reward. The state of the clustering system is constantly updated by trying possible parameter settings, whose clustering performance is evaluated with the DBI. With the reward incentive of minimizing the DBI, the dynamic debugging process can be implemented using the Q-learning algorithm.
In the clustering, there are two parameters to set: the clustering number ($K$) and the dimension reduction (dim). Each parameter is given three types of debugging, that is, increasing, decreasing, and remaining unchanged. Therefore, we obtain nine types of parameter debugging actions in each state of Q-learning, where each action combines one independent debugging action for each of the two parameters, as sketched below. We set a step value for each of $K$ and dim to denote the parameter debugging step size. Because the parameters may vary beyond their ranges, the corresponding debugging is set to remain unchanged if the parameter debugging would be out of range at the next state.
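A minimal way to enumerate these nine actions is shown below; the step values are hypothetical placeholders, since the study sets them separately for rough and exact searching.

```python
from itertools import product

STEP_K, STEP_DIM = 5, 5   # hypothetical step values for K and dim

# Each action pairs one adjustment of K (decrease, keep, increase)
# with one adjustment of dim: 3 x 3 = 9 actions per state.
ACTIONS = list(product((-STEP_K, 0, +STEP_K), (-STEP_DIM, 0, +STEP_DIM)))
assert len(ACTIONS) == 9
```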
The aim of the actions is to minimize the performance evaluation, i.e., the DBI. Therefore, the clustering system must give a positive incentive when an action facilitates a reduction in the DBI. For an inapparent change in the DBI, we give a tiny incentive or even no incentive. For a bad action, which increases the DBI, we give a negative incentive. Due to the random initialization of K-means++, the observations of the DBI have small fluctuations for the same action. Therefore, we repeat the observation of the DBI for each action ten times and use the largest DBI value as the final DBI. The process is given as:
$$\mathrm{DBI}_a = \max_{i=1,\ldots,10} \mathrm{DBI}_i$$
where $\mathrm{DBI}_a$ is the final evaluated DBI with the action $a$, and $\mathrm{DBI}_i$ is the $i$-th observation of the clustering process. We define the variable quantity $\Delta\mathrm{DBI}$ of the DBI as:
$$\Delta\mathrm{DBI} = \mathrm{DBI}_{s} - \mathrm{DBI}_{s'}$$
where $\mathrm{DBI}_{s}$ is the DBI of the current state, and $\mathrm{DBI}_{s'}$ is the DBI of the next state. Therefore, the reward $r$ can be expressed as:
$$r = \begin{cases} r_{+}, & \Delta\mathrm{DBI} > \delta \\ 0, & \lvert\Delta\mathrm{DBI}\rvert \le \delta \\ r_{-}, & \Delta\mathrm{DBI} < -\delta \end{cases}$$
where $r$ is the reward signal ($r_{+} > 0$ and $r_{-} < 0$ denote the positive and negative incentives) and $\delta$ is a positive threshold.
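The sketch below ties these pieces together: the DBI of an action is the worst of ten K-means++ runs, and the reward follows the sign of the DBI change. The threshold and the reward magnitudes (±1) are illustrative assumptions, not the study's exact values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def evaluate_dbi(F, k, repeats=10):
    """Repeat K-means++ and keep the largest DBI, smoothing the
    fluctuation caused by random initialization."""
    scores = []
    for _ in range(repeats):
        labels = KMeans(n_clusters=k, init="k-means++", n_init=1).fit_predict(F)
        scores.append(davies_bouldin_score(F, labels))
    return max(scores)

def reward(dbi_current, dbi_next, delta=0.01):
    """Map the DBI change to a reward signal (illustrative magnitudes)."""
    change = dbi_current - dbi_next      # positive when the DBI decreases
    if change > delta:
        return 1.0
    if change < -delta:
        return -1.0
    return 0.0
```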
To prevent the algorithm from converging to a local minimum region, we use the $\varepsilon$-greedy strategy to improve Q-learning. Specifically, we use $P(a)$ to denote the selection probability of each action in Q-learning:
$$P(a) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{m}, & a = \arg\max_{a'} Q(s,a') \\ \dfrac{\varepsilon}{m}, & \text{otherwise} \end{cases}$$
where $\varepsilon$ is the greedy coefficient and $m$ is the number of actions. Different from the strategy in which Q-learning always chooses the best action in each state, the $\varepsilon$-greedy strategy has a probability of $\varepsilon/m$ of choosing each of the remaining nonoptimal actions. In doing this, Q-learning enhances the exploration of unknown states, which may be potential global optimal states.
In addition, we terminate the Q-learning algorithm when the reward remains zero (i.e., $\lvert\Delta\mathrm{DBI}\rvert \le \delta$) for multiple consecutive states. In this study, we set the number of threshold termination states as 10. The pseudocode of Algorithm 1 is as follows.
Algorithm 1: The Q-learning applied in our method.
Input: initial $K$, dim, the step values, the learning rate $\alpha$, the discount index $\gamma$, the greedy coefficient $\varepsilon$, and the threshold $\delta$
Output: the optimal $K$, dim, and the clustering result
Initialize the Q-table $Q(s,a)$
Repeat (for each search episode):
  Initialize the counter of consecutive zero-reward states
  While the counter is less than 10 do
    Choose action $a$ from state $s$ using the policy derived from $Q$ with the $\varepsilon$-greedy strategy;
    Take action $a$, and observe the next state $s'$ and its DBI by spectral clustering;
    Parameter updating: adjust $K$ and dim by their step values according to $a$;
    Calculate $\Delta\mathrm{DBI}$, and generate the reward signal $r$;
    Update $Q(s,a)$ by Equation (3); state updating: $s \leftarrow s'$.
    If $\lvert\Delta\mathrm{DBI}\rvert \le \delta$ then increment the counter
    Else reset the counter
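For readers who prefer code, the following is a compact, self-contained sketch of the whole adaptive loop (rough or exact searching is obtained simply by changing the step sizes). All hyperparameter values, the ε-greedy schedule, and the termination counter are illustrative assumptions rather than the study's exact settings.

```python
import random
from collections import defaultdict
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score


def adaptive_search(X, k_range=(2, 80), dim_range=(2, 60), step_k=5, step_dim=5,
                    alpha=0.1, gamma=0.9, epsilon=0.2, delta=0.01,
                    episodes=10, patience=10):
    """Q-learning search over (K, dim); hyperparameter values are illustrative."""
    actions = list(product((-step_k, 0, step_k), (-step_dim, 0, step_dim)))  # 9 actions
    Q = defaultdict(float)

    def clip(v, lo, hi):
        return max(lo, min(hi, v))

    def embed(dim):
        # Normalized spectral embedding of X truncated to `dim` columns.
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq / 2.0)
        d = W.sum(axis=1)
        L_sym = np.eye(len(X)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
        _, vecs = np.linalg.eigh(L_sym)
        F = vecs[:, :dim]
        return F / np.linalg.norm(F, axis=1, keepdims=True)

    def dbi(k, dim):
        F = embed(dim)
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10).fit_predict(F)
        return davies_bouldin_score(F, labels)

    state = (k_range[0], dim_range[0])        # start from the small end of the ranges
    best = (state, dbi(*state))
    for _ in range(episodes):
        s, current = state, dbi(*state)
        quiet = 0                              # consecutive states with ~zero reward
        while quiet < patience:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next = (clip(s[0] + a[0], *k_range), clip(s[1] + a[1], *dim_range))
            nxt = dbi(*s_next)
            change = current - nxt             # positive when the DBI decreases
            r = 1.0 if change > delta else (-1.0 if change < -delta else 0.0)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                                  - Q[(s, a)])
            quiet = quiet + 1 if abs(change) <= delta else 0
            s, current = s_next, nxt
            if current < best[1]:
                best = (s, current)
    return best                                # ((K, dim), DBI)
```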
3. Experiments
In this section, we use two face datasets to test our proposed algorithm. We scrambled all face images and placed them in the same folder, then used the clustering algorithm to assign a label to each image, and finally compared these labels with the original labels of the dataset to obtain the final accuracy. First, we introduce the face databases used, and then we test our algorithm in four parts.
In the first part, we manually investigate the parameters based on the variable-controlling approach; the results show that the DBI reaches its minimum value when the correct clustering parameters are set, so the reinforcement learning method can be used to find this optimal value.
In the second part, we guide the automatic parameter setting based on the manual search results and test parameter settings with different step sizes, which demonstrates the feasibility of our algorithm; the coarse-to-fine search with two different step sizes also reduces the parameter setting time.
In the third part, we compare our final clustering results with other algorithms, using the F-score as the evaluation index. The comparison shows that our algorithm achieves high clustering accuracy and outperforms the other algorithms. In addition, our algorithm can automatically set important parameters such as the number of clusters, thereby greatly reducing the time spent on parameter tuning in the clustering process.
Finally, we test the robustness of the algorithm on larger in-the-wild face data.
3.1. Datasets
In this study, the ORL [24] and CFP [25] Face Databases are used to evaluate our proposed adaptive face clustering algorithm. The ORL Face Database contains images of 40 people of different ages, skin colors, and genders. Each person has 10 images with different lighting, postures, and expressions. Some of the images contain occlusion by glasses, and each image is 92 × 110 pixels; examples of the face images are shown in Figure 4.
The CFP Face Database contains unconstrained face images of 500 different people. The pixel size of each image is not the same, which makes it more difficult to extract the feature vectors of the faces. Compared with the above face images collected in a laboratory environment, these images were obtained from the Internet. Therefore, the expression, lighting, and posture of the same person vary considerably across images, and face occlusion is more serious. We use 10 near-frontal images of each person. Figure 5 shows examples of images from the CFP Face Database.
3.2. Manual Investigation of Parameter Settings
We assume that the best number of clusters is unknown. However, for 400 facial images, we reasonably consider that they are divided into no more than 80 categories (i.e., the expected maximum number of clusters). Thus, the number of clusters ranges from 0 to 80. In addition, the dimension reduction of spectral clustering should not exceed the original maximum dimension and should retain the primary information of the face representations. Generally, it is reasonable to reserve no more than 50% of the original dimensions; since the original dimension is 128, the dimension reduction (dim) can range from 0 to 60.
We observe several settings of $K$ and dim using the variable-controlling approach. Specifically, when one parameter changes within its range, the other is held invariable. Because this investigation covers a wide range of parameters, we first make a rough global observation in which the interval of the changing parameter is five and that of the invariable parameter is 10. The results of the manual investigations are illustrated in Figure 6 and Figure 7.
As shown in Figure 6, different parameter settings result in different values of the performance evaluation (DBI). Specifically, the optimal value of $K$ is about 40, because most dimension settings at this value yield better clustering performance (a smaller DBI). In fact, this conclusion is of practical significance, since the ORL Face Database indeed comprises 40 different persons.
As shown in Figure 7, several valuable conclusions can be drawn. First, the clustering performance deteriorates sharply when the reserved dimension of the face representations is less than 15. Second, reserving more dimensional information does not mean that the clustering performance is better; on the contrary, it becomes worse, especially when dim is over 50 in most curves. Third, the red curve, whose invariable parameter setting is $K = 40$, in fact corresponds to the optimal setting of $K$. This curve indicates that changing dim has little impact at the ideal $K$ within the range of dim from 20 to 45, which implies that the closer $K$ gets to its optimal value, the less impact changes in dim have. Therefore, a smaller step size is required within the optimal field when we use searching methods for the parameter setting.
3.3. Adaptive Face Clustering
Our proposed adaptive face clustering achieves the optimal parameter setting by Q-learning. To reduce the runtime and computational complexity, we follow the principle of taking a smaller value within the searching range as the initial searching point.
3.3.1. Algorithm Complexity
The computational complexity of the clustering mainly depends on the spectral clustering and the parameter adjustment. The computational complexity of searching the parameter $K$ and the parameter dim depends on the starting clustering number, the starting dimension reduction, the step values, the number of face images, the computational complexity of spectral clustering, and the computational complexity of the Davies–Bouldin index. When $K$ and dim increase, the time and space complexity of the system increase sharply, and the starting points have a greater impact on the algorithm. Therefore, it is necessary to run the searching algorithm from a smaller initial value within the searching range.
3.3.2. Parameter Searching
In this simulation, we compare the performance of rough searching and exact searching, and then perform exact searching on the basis of the rough searching results to speed up the search. For the rough searching, we search the optimal fields of $K$ and dim synchronously with a large step. The system is trained 10 times from the same starting point to update the Q-table. The initial parameters of the Q-learning are shown in the second column of Table 1.
For the exact searching, we search for the exact setting of $K$ and dim synchronously with a small step within the optimal field. The initial parameters of the Q-learning are shown in the third column of Table 1. The searching paths of $K$ and dim are plotted in Figure 8 and Figure 9. In order to show the convergence of the algorithm, the threshold-based termination is not used here.
As shown in Figure 8 and Figure 9, the searching path of the adaptive clustering system is convergent. In the rough searching, the system converges to the field centered at $K$ and dim both equal to 40. Therefore, we acquire the optimal field from this center and the large searching step. In the exact searching, the results indicate that the optimal setting of $K$ and dim is 41. We analyze some likely reasons why we obtain a clustering number of 41 instead of 40 for the ORL. First, the difference in the DBI value within the optimal field is inapparent, which reduces the efficiency of the rewards in the Q-learning. Second, the value of the DBI rests on the face representations, and the architecture of the DCNN leads to representations that are not unique for the same face. Third, the given threshold $\delta$ is too large; it is set in advance to be as small as possible, but this does not always meet the expected effect.
We use a combination of rough searching and exact searching to speed up the clustering. The final adaptive clustering takes 77.01 s, and each spectral clustering takes about 0.185 s. However, manual clustering parameter adjustment requires multiple comparisons of all image clustering results, and then multiple adjustments to the parameters. This process requires a lot of time and effort.
3.4. Comparison of Clustering Accuracy
On the ORL Face Database, we use the adaptive algorithm to obtain clustering results and compare them with other algorithms, including k-means, LGC [26], and AEWLP [27]. We also use dlib to extract the facial features of the images and then apply k-means clustering, adding this result to the comparison. We use the F-score as a comprehensive index for evaluating the clustering accuracy; it is the harmonic mean of the precision rate P and the recall rate R, as shown in Formula (8). The precision rate P refers to the prediction result and represents the proportion of correct positive predictions among all positive predictions. The recall rate R refers to the original samples and represents the proportion of correct positive predictions among all samples that are actually positive. We use scikit-learn and MATLAB to reproduce the above algorithms and the F-score, respectively:
$$P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad F = \frac{2PR}{P+R} \quad (8)$$
where TP is true positive, FP is false positive, and FN is false negative. Since random parameters are used in the clustering of the k-means algorithm, we evaluate the results of multiple clustering runs and take the average. The final results are shown in Table 2.
It can be seen from the above table that our clustering outperforms the other algorithms, with an F-score of 99.2%, indicating that only 2–3 of the 400 ORL face images are assigned to the wrong category. Comparing the result of directly applying k-means with the result of first using dlib for face feature extraction and then applying k-means, the accuracy is greatly improved, which shows the effectiveness of the feature extraction technique we used.
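For reference, one common convention for a clustering F-score counts pairs of samples, which matches the TP/FP/FN definitions above; the sketch below follows this pairwise convention, although the study's exact evaluation protocol is not fully specified.

```python
from itertools import combinations

def pairwise_f_score(true_labels, pred_labels):
    """Pairwise F-score: precision and recall over all sample pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```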
3.5. CFP Face Database Clustering Test
We use 5000 images from the CFP Face Database for the clustering test. We mainly search for the number of clusters automatically, because as the number of identities in the data increases, the number of required feature dimensions also continues to increase, as shown in Figure 7.
We randomly select the images of 50, 100, 200, 300, and 400 people, as well as all people in the dataset, for adaptive clustering, and compare the F-score of the clustering results with that of the k-means algorithm. As shown in Table 3, the clustering accuracy of our algorithm is still higher than that of the k-means algorithm, and it can automatically search for the appropriate number of clusters K. As the number of clusters increases, there is still some deviation between the searched number of clusters K and the actual number of people.
The incorrectly clustered images are presented separately in Figure 10. There is a large age gap between different images of the same person, leading to morphological differences that cause clustering errors: the algorithm assigns the young-age images to separate categories, which increases the number of clusters. However, untagged images acquired through video surveillance, medical equipment, and social media over a short period of time are basically of the same age, so the deviation in the number of clusters caused by age changes is not a major problem in practice.
Figure 11 shows the searching path on the 5000 images of the CFP Face Database. Because the number of clusters is large, the step size is first set to 20 for rough searching, but it is then difficult to converge to the final result, since a large step size causes a large change in the number of clusters and, in turn, a large change in the DBI value. However, the optimal field of the number of clusters, [450, 580], can be obtained through the rough search. Then, the final optimal number of clusters, 532, is obtained through an exact search with a step size of two. Although there is some deviation, this proves that the adaptive clustering parameter algorithm we designed can still obtain a suitable number of clusters. The F-score calculated for the final clustering result is 0.7483, which is better than the 0.4367 obtained by the k-means algorithm.
We also tested the method on the LFW face dataset [28], which contains 13,233 images of 5749 people. First, we deleted the images that contain multiple faces in the same image. The remaining 12,323 images of 5484 people were clustered; the final cluster number K is 5622, and the F-score is 0.6492. Because the data samples in the LFW dataset are not balanced (the number of images per person differs), the clustering accuracy is not high.
We also conducted a test on the YouTube Faces Dataset [29]. First, frames were extracted from the videos at 20-frame intervals, and images containing multiple faces or no face were removed through face detection. While obtaining enough face images, we also ensured that there was a certain amount of variation in deflection angle, illumination, and other factors between the images of the same person. In the end, we constructed a dataset of 27,650 images of 1595 people. Using the proposed approach to cluster this dataset, the final cluster number K is 1684, and the F-score is 0.7724. Although the number of images increases, the accuracy is better than the result achieved on the CFP dataset, because the changes between images of the same person captured in a video are small.
From the above tests, it can be seen that the clustering accuracy decreases as the number of identities and the number of images increase. Therefore, in practical applications, a large number of face images can be split into multiple folders and then clustered. Automatic searching of the clustering parameters can save a lot of parameter adjustment time, because for a large number of unlabeled face images, manual parameter adjustment requires comparing and confirming the clustering results one by one and then adjusting the parameters; this step needs to be performed many times and becomes difficult when the number of images reaches the thousands. Therefore, the adaptive clustering algorithm we proposed has practical application value.
4. Conclusions
In this study, we proposed an adaptive face clustering method via a spectral clustering model and reinforcement learning. The method involves face representations, spectral clustering, and Q-learning. First, we used a DCNN to obtain deep representations of face images. Then, we used the spectral clustering algorithm to partition these face representations. To cluster these representations efficiently, we used an internal clustering measure to evaluate the clustering quality. To seek the optimal parameters for datasets with unknown identities, we employed Q-learning as the feedback module to build the dynamic multiparameter debugging process. The simulation results on the ORL and CFP Face Databases demonstrated that our proposed method effectively searches for the optimal parameters, which helps to improve the clustering quality and accuracy. Because reinforcement learning is used to realize the parameter search, the method can effectively cluster unknown face datasets and reduce the time spent on manual parameter tuning.
In future studies, we plan to focus on the inefficiency of the exact searching within the optimal field. In addition, we plan to pay more attention to reducing the running complexity of our method and modifying the clustering method to improve the accuracy of image clustering for large amounts of data.