1. Introduction
Pedestrian re-identification (re-ID) aims to identify specific pedestrians through cameras at different locations, that is, to establish the correspondence between people at different visual ranges. It is a key task in most surveillance and security applications [
1,
2,
3], and has attracted increasing attention from the computer vision community [
4,
5]. However, in a real complex environment, such as different camera resolutions, viewing angles and background changes, lighting changes, occlusions and person pose changes can adversely affect pedestrian recognition, increase the difficulty of successful pedestrian recognition and make it face many technical challenges. Furthermore, there is still a big gap between the current person re-identification technology and practical applications.
This research direction has attracted the attention of a large number of scholars and research institutions. Aiming to solve the problem of person re-identification, the research mainly focuses on the following two aspects: the expression of pedestrian characteristics [
6,
7,
8,
9,
10,
11,
12] and similarity measurement learning [
13,
14,
15,
16,
17,
18]. The feature descriptors try to determine how to select visual features with good discrimination and robustness for pedestrian image matching. As high-dimensional visual features usually do not capture the invariant factors under sample variance, a distance metric is introduced into pedestrian re-identification. The main concept of metric learning is that the visual characteristics of different pedestrians should be more separate and the visual characteristics of the same pedestrian under different perspectives should be as similar as possible in the embedded space. Since sparse dictionary learning is a special case of metric learning, it has been successfully applied in computer vision fields such as face recognition [
19,
20], and is now applied to pedestrian re-recognition.
In 2010, Cong et al. first introduced dictionary learning into person re-identification [
21]. They first built a dictionary through a camera, and then the pedestrians of the other camera were represented by the dictionary sparsely and linearly. Khedher et al. [
22] used the surf descriptor to extract features from each person’s pictures and then constructed a known dictionary using reference SURFs. They used the sparse representation model to learn a coefficient and then determined the identity information. Karanam et al. [
23] learned a single view invariant dictionary for different cameras. They also improved the discriminative ability of the dictionary by adding explicit constraints on the sparse codes, which made the Euclidean distance between the coding coefficients of the same pedestrian under different views smaller than that between the sparse coding coefficients of different pedestrians. Jing et al. [
24] proposed a novel semi-coupled low-rank discriminant dictionary learning approach for high- and low-resolution images. Karanam et al. [
25] proposed a block sparse representation method based on dictionary learning. An et al. [
26] used canonical correlation analysis (CCA) to learn a subspace in which the goal is to maximize the correlation between data from different cameras but corresponding to the same people. Then, they jointly learned the dictionaries for each camera view in the CCA subspace. Zhou et al. [
27] proposed a novel joint learning framework that unifies representative feature dictionary learning and discriminative metric learning. Xu et al. [
28] proposed to separate the images of the same pedestrian observed from different camera views into view-shared components and view-specific components so as to improve the discriminating performance of the learned dictionary. Peng et al. [
29] proposed a novel dictionary learning model which divides the dictionary space into three parts corresponding to semantic, latent discriminative and latent background attributes, respectively. Li [
30] proposed a discriminative semicoupled projective dictionary learning (DSPDL) model that employs an efficient projection dictionary learning strategy and jointly learns a pair of dictionaries and a mapping function to model the correspondence of cross-view data. Li et al. [
31] proposed a person re-ID method to divide a pedestrian’s appearance features into different components. They developed a framework for learning a pair of commonality and specificity dictionaries, while introducing a distance constraint to force the particularities of the same person over the specificity dictionary to have the same coding coefficients and the coding coefficients of different pedestrians to a have weak correlation. Li et al. [
32] considered novel joint fusion and super-resolution framework based on discriminative dictionary learning. They jointly learned two pairs of low-rank and sparse dictionaries and a conversion dictionary, which are used to represent the low-rank and sparse components of low-resolution images, and to reconstruct a high-resolution fused result. However, to accurately characterize the sparsity and low rank, it is suggested to impose the sparsity and low-rank constraints directly instead of using the approximations/regularizations.
In 2014, Li et al. [
33] first used deep learning methods for person re-identification research, and since then, an increasing number of researchers have tried to combine deep learning methods with person re-identification research. Deep learning can integrate feature extraction and metric learning into a unified learning framework and is mainly focused on extracting global identity features from pedestrian images. He et al. [
34] proposed to use the Spatial Pyramid structure to extract sample features. Huang et al. [
35] used a deep neural network to learn different representation features of different parts of pedestrian appearance images, and then calculated the similarity of the corresponding parts of the image. Then, three sub-networks were constructed for each part to learn the differences between images, feature maps and spatial changes, and the results of the three sub-networks were combined. Wu et al. [
36] introduced a deep architecture that combines Fisher vectors and deep neural networks to learn a mixture of nonlinear transformations of pedestrian images into a deep space where the data can be linearly separated. Tao et al. [
37] utilized Cross-view Quadratic Discriminant Analysis (XQDA) metric learning for person recognition in order to achieve simultaneous spatial localization and feature representation. Compared with images, there are not only spatial dependencies, but also temporal order relationships between frames in video sequences. Reasonable use of the temporal features of videos can reflect the motion characteristics of pedestrians and improve the recognition accuracy. Therefore, for video-based pedestrian re-identification, the spatiotemporal features of videos are often extracted for recognition. Gao et al. [
38] proposed a temporally aligned pooling representation method, which uses the periodic characteristics of walking to divide the video sequence into independent walking cycles, and selects the cycle that best matches the characteristics of the sinusoidal signal to represent the video sequence. Rahmani et al. [
39] proposed a deep fully connected neural network by finding the nonlinear transformations of a set of connected views, which learn from 2D projections of the dense trajectories of synthetic 3D human models fitted to real motion capture data. Using the spatiotemporal motion characteristics of human walking, Khan et al. [
40] proposed a novel view-invariant gait representation deep fully connected neural network for cross-view gait recognition. However, spatiotemporal features are susceptible to factors such as viewing angle, scale and speed. With the substantial increase in pedestrians, the motion similarity between pedestrians also increases, which greatly reduces the ability to distinguish spatiotemporal features. At the same time, the large number of cameras in large datasets increases the pose differences and motion differences of the same pedestrian. Obviously, these all limit the role of spatiotemporal features in pedestrian re-identification.
It this paper, we propose a new special and shared dictionary learning model with structure characteristic constraints, which has stronger interpretability. We divide the learning dictionary into two parts. One is a shared dictionary, which represents some features shared by all pedestrians in the camera, such as the same background. The other is a special dictionary, which represents the unique characteristics of each pedestrian. Then, only the unique part that represents the identity of the pedestrian is considered in the recognition process, which can reduce the ambiguity caused by some other unnecessary visual feature factors. The main contributions of the paper are summarized as follows:
- (I)
The shared dictionary part, whose features are shared by all people, have a strong correlation, so the shared dictionary must be low rank; then, we directly impose the low-rank constraint. Next, we impose a norm constraint to the special dictionary, which has strong sparsity and contains only information unique to each person.
- (II)
In order to better describe the shared information of pedestrians and force the commonality of different pedestrians to have the same coding coefficients in the shared dictionary, we introduce the norm constraint on the coding coefficients .
- (III)
Due to the norm and low-rank constraints, the dictionary learning model with structure characteristics constraints is highly nonconvex and computationally NP-hard in general;therefore, we adopt the method of alternating directions to solve it. When dealing with each subproblem, we directly deal with the original problems with the norm and rank constraints instead of their convex relaxed form. Numerical experiments performed on some real datasets show that our method is superior to traditional methods, and even better than some deep learning methods on some datasets.
The rest of this paper is organized as follows. The joint dictionary learning model is presented in
Section 2, while
Section 3 is devoted to optimization algorithm for the special and shared dictionary learning model and the re-identification process. In
Section 4, the computational experiments are reported. Finally, we conclude the paper with future work in
Section 5.
2. Joint Dictionary Learning Model
We know that the general cameras are fixed in a place, so the picture of each person in the camera contains part of the same elements that do not help in recognition. What is useful is the part of unique features that represents information about each person. Assuming that we have two camera views,
a and
b, and
is a set of training samples composed of
N individuals images from
l-th view,
can be divided into two parts
Considering the actual situation, we mainly study the following dictionary learning model for the person re-identification problem
where
and
are the coding coefficients of person-specific components under camera views
a and
b;
is the coding coefficient of person-shared components under different camera views.
is the dictionary for the person-shared elements, and
is the dictionary for the person-specific elements.
are penalty parameters and
are three integers representing the prior information on the upper bounds of the sparsity and the rank, respectively.
is the zero norm of
, representing the number of its nonzero elements.
represents the rank of the matrix
.
is the zero norm of the rows of the matrix
, representing the sparsity of the rows of the matrix
.
We know that different people have different features, so the coding coefficients of different pedestrians should be largely irrelevant. The same pedestrian has a greater similarity under different cameras, i.e., one person under different views should have the same coding coefficient.
, where
are the given correlation parameters, and
is the correlation function. However, the correlation coefficient is more difficult to calculate; similar to article [
19], we transformed the correlation coefficient constraint into the following form:
,
. The same elements play the same role for each pedestrian under the camera, and the shared features are only a small part of all features. So,
was added. Generally, the common part has a strong correlation. For example, two cameras may have a part of the same background, and the picture background often has a low-rank structure. So, the shared dictionary should have a low-rank structure. At the same time, the unique information for each pedestrian is different, so it should have a sparse structure, i.e.,
,
. The identity information of the same pedestrian under different cameras is the same and should be as similar as possible. Therefore, the same pedestrian should have the same coefficient under different cameras; that is,
.
5. Conclusions and Discussion
In this paper, we propose a new special and shared dictionary learning model with structure characteristic constraints, including sparse, low-rank and row-sparse constraints. Here, we divided the dictionary into two, one to represent features shared by all pedestrians and the other to represent features unique to each individual. Then, only the unique part that represents the identity of the pedestrian was considered in the recognition process, which can reduce the ambiguity caused by some other unnecessary visual feature factors in the recognition process. In order to improve the accuracy of matching and to better characterize the structural characteristics of the dictionary, on the basis of original dictionary learning, norm and low-rank constraints were directly added instead of their convex regular form. We used the method of alternating directions to solve the optimization model, and when solving each sub-problem, we also directly solved the problem with constraints. Finally, experiments on different datasets showed that our algorithm has a high accuracy rate.
Since the objective function of dictionary learning, as well as the highly nonconvex and computationally NP-hard of norm and rank constraints, is bilinear, the optimality condition of the model and the convergence of the algorithm were not established here. It is difficult to not only explore the impact of feature extraction methods, but also compare the performance of metric learning methods. So our research focuses on the comparison of metric learning methods. We think that evaluating the GOG features using view transformation model (VTM) based approaches is a good attempt. These will be addressed in our future work.