1. Introduction
Among the numerous subjects in computer vision, object tracking is one of the most important fields. It has many applications, such as human-computer interaction, video analysis, and robot control systems.
Many object tracking algorithms have been proposed in recent decades. Welch [1] proposed a Kalman filter-based algorithm for Gaussian and linear problems to track a user's pose in interactive computer graphics. Later, particle filter-based approaches were introduced for non-Gaussian and non-linear systems [2,3]. Other common trackers include optical flow-based tracking [4], multiple hypothesis tracking [5,6], and kernel-based tracking [7,8]. Recently, João F. Henriques et al. [9] proposed a kernel tracking algorithm called high-speed tracking with kernelized correlation filters (KCF), which has been widely used. Unlike other kernel algorithms, the method has exactly the same complexity as its linear counterpart.
Though these algorithms have been successful in many real scenes, they still face challenging problems such as illumination changes, object occlusion, image noise, low illumination, fast motion, and similarly colored backgrounds. One effective solution is the mean-shift algorithm, which can handle partial object occlusion and background clutter [10,11,12]. Mean-shift is a non-parametric pattern-matching tracking algorithm. It uses the color histogram as the target model and the Bhattacharyya coefficient as the similarity measure; the location of the target is obtained by an iterative procedure [10]. The performance of the algorithm is determined by the similarity measure and the target feature. Because of background interference, the tracking result may easily become biased or completely wrong: the location obtained with the Bhattacharyya coefficient [7] or other similarity measures, such as normalized cross-correlation, the histogram intersection distance [13], or the Kullback–Leibler divergence [14], may not match the ground truth. To improve the accuracy of object matching, a maximum posterior probability measure was proposed [15]. It makes use of the statistical features of the search region, which effectively reduces the influence of the background and emphasizes the importance of the target.
In some scenes with dramatic intensity or color changes, the effectiveness of color decreases. It is therefore desirable to use additional features as a complement to color to improve the performance of the tracking system [16,17]. For example, Collins et al. [18] presented an online feature selection algorithm based on a basic mean-shift approach. The method can adaptively select the best features for tracking; they used only the RGB histogram, but the algorithm can be extended to other features. Wang et al. [19] proposed integrating color and shape-texture features for reliable tracking, again based on the mean-shift algorithm. Ning et al. [20] presented a mean-shift algorithm using a joint color-texture histogram, which proved more robust than color alone. Most of these methods use multiple features to describe the target model in order to reduce tracking mistakes. Unfortunately, color, shape-texture silhouettes, and other traditional features cannot track the target in scenes with scaled or rotated images. In recent years, new features have been proposed to solve these problems, including the Scale Invariant Feature Transform (SIFT) [21], Principal Components Analysis-SIFT (PCA-SIFT) [22], the Gradient Location and Orientation Histogram (GLOH) [23], the Speeded-Up Robust Feature (SURF) [24], and the Fast Retina Keypoint (FREAK) [25], to name a few. Among them, a texture feature named the local binary pattern (LBP) [26] has been widely used in computer vision [27] due to its fast computation and rotation invariance. Recently, improvements have been made to the LBP, such as the center-symmetric local binary pattern (CS-LBP) [28] and the local ternary pattern (LTP) [29].
This paper proposes a centroid iteration algorithm with multiple features based on a posterior probability measure [15] for object tracking. The main goal is to handle difficulties in real scenes such as similarly colored backgrounds, object occlusion, low-illumination color images, and sudden illumination changes. The proposed algorithm consists of a target model construction step and a localization step. We extend the LBP descriptor to the DCS-LBP descriptor and, for further improvement, use a simplified version called the SDCS-LBP, which captures important structural information of the image (edges, corners, and so on). This new texture feature is then combined with color to constitute the multiple features used in the target model, which we call the color and texture (CT) feature in this paper. After the target is located, three strategies for updating the target model are presented to reduce tracking mistakes.
The rest of the paper is organized as follows: in Section 2, a local color texture feature based on the DCS-LBP, along with its simplified form, is introduced. In Section 3, the proposed tracking algorithm is illustrated in detail. Experimental results are shown in Section 4. Section 5 draws conclusions.
2. Multiple Features
Feature descriptors are very important in matching-based tracking algorithms, especially for applications in real scenes. In some simple scenes, color works well because it distinguishes the target from the background easily and contains a lot of useful information about the target. However, in complex scenes containing similarly colored backgrounds, object occlusion, low-illumination color images, and sudden illumination changes, a tracker using only the color feature may easily miss the target. One solution is to integrate multiple features into the target model for reliable tracking.
2.1. Local Binary Patterns (LBPs)
The LBP is an illumination-invariant texture feature. The operator uses the gray levels of the neighboring pixels to describe the central pixel. The texture model LBP_{P,R} is expressed as follows [26]:

LBP_{P,R} = Σ_{p=0}^{P−1} s(g_p − g_c) 2^p,  s(x) = 1 if x ≥ 0 and 0 otherwise, (1)

where P is the number of neighbours and R is the radius around the central pixel. g_c denotes the gray value of the central pixel, g_p denotes that of the P neighbours with p = 0, …, P − 1, and s(x) represents the sign function. Figure 1 gives an example of the LBP code when P = 8 and R = 1.
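As a concrete illustration of Equation (1), the following sketch computes the 8-bit LBP code of one pixel from its 3 × 3 neighborhood; the clockwise neighbour ordering starting at the top-left is an assumption for the example, not the paper's exact layout.

```python
def lbp_code(neighbourhood):
    """neighbourhood: 3x3 list of gray values; returns the 8-bit LBP code
    of the central pixel (P = 8, R = 1)."""
    gc = neighbourhood[1][1]  # central pixel g_c
    # 8 neighbours, clockwise starting at the top-left corner
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for p, (r, c) in enumerate(coords):
        s = 1 if neighbourhood[r][c] >= gc else 0  # sign function s(g_p - g_c)
        code += s * (2 ** p)
    return code

# A perfectly flat patch thresholds every neighbour to 1, giving code 255.
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(lbp_code(flat))  # 255
```

Note that the flat patch and a noisy version of it can produce very different codes, which is exactly the noise sensitivity discussed in Section 2.2.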
There are two extensions of the LBP [26]. The first makes the LBP a rotation-invariant feature, as proposed by Ojala et al. [26]. It is defined as:

LBP^{ri}_{P,R} = min{ ROR(LBP_{P,R}, i) | i = 0, 1, …, P − 1 }, (2)

where ROR(x, i) performs a circular bit-wise right shift on the P-bit number x by i positions. Equation (2) selects the minimal number to simplify the function. Ojala et al. explained that there are 36 rotation-invariant LBP codes at P = 8, R = 1. The second extension is the uniform LBP, which contains at most one 0-1 and one 1-0 transition when viewed as a circular bit string. The uniform LBP codes contain a lot of useful structural information. Ojala et al. [26] observed that although only 58 of the 256 8-bit patterns are uniform, nearly 90% of all observed image neighborhoods are uniform, and many of the remaining ones contain noise. The operator LBP^{riu2}_{P,R} is a uniform and rotation-invariant pattern with a uniformity measure U of at most 2. If we set P = 8, R = 1, the nine most frequent patterns, with indices from 0 to 8, are selected from the 36 different patterns; these are the rotation-invariant patterns shown in Figure 2.
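The two extensions above can be sketched directly from the standard definitions of Ojala et al.: the rotation-invariant code takes the minimum over circular bit shifts, and the riu2 operator counts set bits when the uniformity measure U is at most 2.

```python
def ror(x, i, P=8):
    """Circular bit-wise right shift of a P-bit number x by i positions."""
    mask = (1 << P) - 1
    return ((x >> i) | (x << (P - i))) & mask

def lbp_ri(code, P=8):
    """Rotation-invariant LBP of Equation (2): minimum over all P shifts."""
    return min(ror(code, i, P) for i in range(P))

def uniformity(code, P=8):
    """U(code): number of 0-1 / 1-0 transitions in the circular bit string."""
    bits = [(code >> p) & 1 for p in range(P)]
    return sum(bits[p] != bits[(p + 1) % P] for p in range(P))

def lbp_riu2(code, P=8):
    """Uniform rotation-invariant LBP: bit count if U <= 2, else P + 1."""
    return bin(code).count("1") if uniformity(code, P) <= 2 else P + 1
```

For example, the codes 0b00000001 and 0b10000000 collapse to the same rotation-invariant value, and an alternating pattern such as 0b01010101 is non-uniform and falls into the P + 1 bin.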
2.2. Center-Symmetric Local Binary Patterns (CS-LBPs) and Local Ternary Patterns (LTPs)
As seen in Section 2.1, LBP codes produce a long histogram, which requires a large amount of calculation. Heikkilä et al. [28] designed a method that compares neighboring pixels with each other in order to reduce computation. They calculate the center-symmetric pairs of pixels as defined by the following function:

CS-LBP_{P,R,T} = Σ_{p=0}^{P/2−1} s(g_p − g_{p+P/2}) 2^p,  s(x) = 1 if x > T and 0 otherwise.

This operator halves the number of calculations of LBP codes for the same neighbours. The LBP threshold depends on the central pixel, which makes the LBP sensitive to noise, especially in flat regions of the image, while the CS-LBP threshold is a constant value T that can be adjusted.

Tan et al. [29] extended the LBP to 3-valued codes, called the local ternary pattern (LTP). They set the codes within a zone of width ±T around g_c to one; the codes above this zone are set to 2 and the ones below it are set to 0. Here, T is the same threshold as in the CS-LBP. Thus, the LTP is more insensitive to noise than the CS-LBP. However, it is no longer invariant to gray-level transformations.
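A minimal sketch of the two operators follows, using the standard definitions; the ordering of the P samples around the circle is an illustrative assumption.

```python
def cs_lbp(neigh, T=5):
    """CS-LBP code from P gray values sampled on the circle (P even):
    bit p is set when the centre-symmetric difference exceeds threshold T."""
    P = len(neigh)
    code = 0
    for p in range(P // 2):
        if neigh[p] - neigh[p + P // 2] > T:
            code |= 1 << p
    return code

def ltp(neigh, gc, T=5):
    """LTP codes in {0, 1, 2} per neighbour: 2 above gc + T, 0 below gc - T,
    and 1 inside the +/-T zone around the centre gc."""
    out = []
    for g in neigh:
        if g - gc > T:
            out.append(2)
        elif g - gc < -T:
            out.append(0)
        else:
            out.append(1)
    return out
```

With T = 5, a flat patch maps to the all-zero CS-LBP code and the all-one LTP code, which is how both operators suppress weak noise.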
2.3. Double Center-Symmetric Local Binary Patterns (DCS-LBPs)
As analyzed in Section 2.2, the CS-LBP is computationally more efficient than the LBP, but both are sensitive to noise. The LTP is insensitive to noise, but its computation is too complex. A simple remedy is to combine the LTP and the CS-LBP, which yields the CS-LTP: the ternary function is applied to the center-symmetric pairs instead of the individual neighbours. By definition, the CS-LTP retains the advantages of the CS-LBP and the LTP, but the ternary values are hard to calculate in the image.

This motivates us to propose the DCS-LBP operator. The operator is divided into two parts: an upper part DCS-LBP^+, in which the center-symmetric differences above T are quantized to one while the others are quantized to zero, and a lower part DCS-LBP^-, in which the center-symmetric differences below -T are quantized to one while the others are quantized to zero. T is the threshold used to eliminate the influence of weak noise; its value determines the anti-noise capability of the operator. The upper part and the lower part of the DCS-LBP are calculated separately and then combined for use. By definition, there are 2 × 2^{P/2} different values, which is much less than for the basic LBP (2^P) and the LTP (3^P), and close to the CS-LBP (2^{P/2}) and the CS-LTP (3^{P/2}). When P = 8, R = 1, the DCS-LBP has 32 different values.
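The two parts can be sketched as follows, under our reading of the verbal definition above; the pairing order of the centre-symmetric samples is an illustrative assumption.

```python
def dcs_lbp(neigh, T=5):
    """Upper and lower DCS-LBP parts from P gray values on the circle (P even).
    For each centre-symmetric pair, d = neigh[p] - neigh[p + P/2]:
    the upper part flags differences above T, the lower part those below -T."""
    P = len(neigh)
    upper = lower = 0
    for p in range(P // 2):
        d = neigh[p] - neigh[p + P // 2]
        if d > T:
            upper |= 1 << p   # pair clearly brighter on this side
        if d < -T:
            lower |= 1 << p   # pair clearly brighter on the opposite side
    return upper, lower
```

With P = 8 each part takes 2^4 = 16 values, giving the 2 × 16 = 32 combined values stated above, and a flat or weakly noisy patch maps to (0, 0) in both parts.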
Table 1 shows examples of all five local patterns. The first row shows three local parts of an image: texture-flat areas, texture-flat areas with noise, and texture-change areas. The threshold is set to 5. It can be seen that the LBP and the CS-LBP cannot exactly distinguish between flat and changing texture areas. The other three patterns are distinguishable and all insensitive to noise, and among them the computational complexity of the DCS-LBP is the lowest.
It should be noted that there is a great amount of redundant information in the DCS-LBP, which might cause matching errors, so further optimization is necessary. The DCS-LBP patterns also have the rotation-invariance property, as shown in Figure 3; there are nine rotation-invariant patterns. Similarly, both DCS-LBP^+ and DCS-LBP^- have the same uniform patterns as the LBP. Patterns 5 to 8, which cannot describe the primitive structural information of the local image, are not uniform patterns. Patterns 0 to 4 each have their own identity: Patterns 0 and 1 represent noise points, dark points, and smooth regions; Pattern 2 represents line terminals; Pattern 3 represents angular points; and Pattern 4 represents boundaries. Thus, we reduce the DCS-LBP to a simplified version (called the SDCS-LBP), which retains only the patterns with indices from 0 to 4.
2.4. Local Color Texture Feature (CT Feature)
Feature representation of the target model is very important for mean-shift based tracking algorithms. The original mean-shift algorithm selects the RGB color space (16 × 16 × 16 = 4096 bins) as the feature. However, in real scenes containing similarly colored backgrounds, object occlusion, low-illumination color images, and sudden illumination changes, the original mean-shift algorithm cannot track the target continuously. Inspired by [16], we design a new feature combining color and texture.

This paper uses the HSV color space, which contains Hue, Saturation, and Value. The Value, which is measured with respect to white points, is often used for the description of surface colors and remains roughly constant even with brightness and color changes under different illuminations. Hence, we replace the Value with the SDCS-LBP in the HSV space for the target model. The new feature, which combines color and texture, is called the CT feature in this paper. The CT feature can be considered a special texture feature (terminal, angular point, boundary, and some special points) with a certain color. The HSV color space is reduced to a size of 8 × 8 = 64 bins after excluding the Value component. Thus, the dimension of the CT feature is 640 (2 parts × 5 patterns × 64 color bins).
Figure 4 shows three target models. Under the CT feature, Figure 4b,c are the same and differ from Figure 4a, whereas they cannot be distinguished using color alone. The CT feature is rotation invariant and can distinguish between different texture patterns.
The calculation process of the CT feature is as follows. First, let S be the set of pixels of the target. For each point in S, calculate the DCS-LBP^+ code, the DCS-LBP^- code, and the HSV color values in turn. If the value of DCS-LBP^+ or DCS-LBP^- does not belong to the SDCS-LBP patterns, the point is regarded as a meaningless point and is eliminated. Second, calculate the upper-part and lower-part CT values by combining the SDCS-LBP with the Hue and the Saturation. Third, after all the points of the target have been calculated, the upper-part and lower-part histograms of the target are worked out by accumulating the CT feature into the histograms; the histogram of the target model is obtained by combining the two.
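The accumulation step above needs a mapping from one pixel's CT feature to a histogram bin. The sketch below assumes an 8 × 8 Hue–Saturation quantization and a particular bin ordering; both are assumptions that are merely consistent with the stated dimension of 640 (2 parts × 5 patterns × 64 color bins).

```python
def ct_bin(part, pattern, hue, sat, H_BINS=8, S_BINS=8):
    """Map one pixel's CT feature to a histogram bin index in [0, 640).
    part: 0/1 for the upper/lower DCS-LBP part; pattern: SDCS-LBP index 0..4;
    hue, sat: values normalized to [0, 1). Bin layout is an assumption."""
    h = min(int(hue * H_BINS), H_BINS - 1)
    s = min(int(sat * S_BINS), S_BINS - 1)
    return ((part * 5 + pattern) * H_BINS + h) * S_BINS + s
```

A full target histogram is then just a 640-entry array incremented at `ct_bin(...)` for every meaningful pixel.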
Figure 5 shows the representation of a target model by the proposed method. Figure 5a is the first frame of a sequence; the target is shown in Figure 5b, and the histogram of its CT feature is shown in Figure 5c.
3. Tracking Algorithm Using the CT Feature
Recently, many similarity measures have been used in object tracking algorithms, such as the Euclidean distance, the Bhattacharyya coefficient, and the histogram intersection distance. However, there is still a lot of mismatching or misidentification in the tracking process. One reason is that the target model contains some background pixels [15]. This paper proposes a similarity measure based on the maximum posterior probability to solve this problem.
3.1. Maximum Posterior Probability Measure
By introducing the candidate area, the maximum posterior probability measure (PPM) is able to decrease the influence of the background and increase the importance of the target model in the tracking process. The PPM is a function evaluating the similarity of the candidate and the target, defined as:

where p_i and q are, respectively, the histogram features of the target candidate region and the target model; c is the histogram feature of the search region of the target candidate; m is the pixel number of the target model; and d is the dimension of the feature.

Now, we define a vector w, computed according to Equation (9): u_j is the feature of the jth pixel; w_j is the PPM value of the jth pixel of the search region; and S_i is the set of pixels of the ith target candidate region in the search region. Thus, the original PPM can be converted into a simple one as [15]:
From this function, it can be seen that the PPM and the sum of the per-pixel values have a linear relationship. Therefore, we compute only the incremental part to obtain the PPM of a neighboring candidate, which makes a recursive algorithm suitable.
According to Equation (9), the PPM value of each pixel is calculated respectively. The matching process is thus simplified to finding the target candidate region with the biggest sum of PPM values. The similarity measure of the target candidate and the target model is:

ρ(y) = Σ_{x ∈ S(y)} w(x),

where S(y) is the set of pixel positions in the present frame centered at y, w(x) is the PPM value at x, and the target candidate centered at y is evaluated by ρ(y). Treating the PPM value of each pixel as a density and the similarity of the target candidate region as a mass, the center of mass y* is the target:

y* = ( Σ_{x ∈ S(y)} w(x) x ) / ( Σ_{x ∈ S(y)} w(x) ). (11)

Figure 6 shows the PPM of the target model. The target, bounded by the blue box, and the target candidate region, bounded by the green box in Figure 6a, are resized. The target model and the target candidate region are shown in Figure 6b. The PPM of the target model, which exhibits a monotonic and distinct peak shape, is shown in Figure 6c.
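The center-of-mass step can be sketched directly: each pixel's PPM value acts as density, and the candidate's similarity (the sum of values) acts as mass.

```python
def centroid(weights, positions):
    """Weighted centre of mass of the candidate-region pixels.
    weights: per-pixel PPM values; positions: matching (x, y) pixel positions."""
    mass = sum(weights)  # similarity of the candidate region
    x = sum(w * p[0] for w, p in zip(weights, positions)) / mass
    y = sum(w * p[1] for w, p in zip(weights, positions)) / mass
    return (x, y)
```

Pixels with high posterior probability of belonging to the target pull the estimate toward themselves, which is what drives the centroid iteration in Section 3.3.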
3.2. Scale Adaptation and Target Model Update
During the tracking process, the target constantly changes in shape, size, or color, so the target model must be updated. The update must follow certain rules to prevent tracking drift. Three strategies are proposed for the target model update.
(1) Introduce an adaptive process to fit the target region to a variable target scale for precise target tracking.
(2) Compute the similarity measure of the scale-adapted target; if it is greater than a threshold parameter, update the target model.
(3) Introduce a parameter into the tracking algorithm to update only part of the target model.
Strategy 1 introduces a scale adaptation function given by [15]: where A_k denotes the size of the target region at frame k, the average of the PPM values of the pixels is compared over the a-pixel-wide outer layer around the target region border, and a is the comparison step of scale adaptation, set to 1 without loss of generality. In Equation (12), the expanding condition means that the pixels around the border are likely to be part of the target, and the contracting condition means that the target region should be reduced accordingly. The function is an empirical one; its parameters should be trained through a great number of experiments.
Strategy 2 means that the model is not updated until the similarity measure exceeds a certain threshold. In real scenes, sudden changes may cause tracking drift, so the update cannot run every frame. Here, p is the current-frame model, q is the target model, and the similarity of the PPM is computed between the current frame and the target model. If Equation (13) is satisfied, we consider p a reliable CT feature model and update the target model with p.

Strategy 3 introduces a parameter into the algorithm to prevent the target model from being updated completely. Because of the limitations of the description of the target model, p cannot simply take the place of q. An update factor λ is used to partially update the target model:

q' = (1 − λ) q + λ p, (14)

where λ is the update factor and q' is the updated CT feature model. In our experiments, λ is set to a small value so that the model adapts slowly to changes of the target.
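The partial update of Strategy 3 is a linear blend of the two histograms; the sketch below uses an illustrative update factor of 0.05 (the paper only states that a small value is used).

```python
def update_model(q, p, lam=0.05):
    """Partial update of the CT target model (Equation (14)): blend the
    reliable current-frame model p into q with a small update factor lam."""
    return [(1.0 - lam) * qu + lam * pu for qu, pu in zip(q, p)]
```

Because lam is small, a single unreliable frame can only nudge the model slightly, which limits drift while still letting the model follow slow appearance changes.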
3.3. Tracking Algorithm
Initialization: select the target object, compute the histogram of the target model, and take the center of the target as the initial position of the tracking object.
Step 1: Set the current estimate as the initial position and calculate the histogram of the search region.
Step 2: Calculate the PPM value of each pixel of the region by Equation (10).
Step 3: Initialize the number of iterations.
Step 4: Calculate the target location by Equation (11) and increment the iteration counter.
Step 5: Repeat Step 4 until the shift between two iterations is small enough or the maximum number of iterations is reached.
Step 6: Adjust the scale of the target region by Equation (12).
Step 7: Decide whether to update the target model by Equation (13); if the condition is satisfied, update the target model by Equation (14).
Step 8: Read the next frame of the sequence and turn to Step 1.
If the distance between two iterations is less than the threshold ε or the number of iterations exceeds N, the algorithm is considered to have converged.
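The iteration loop of Steps 4 and 5 can be sketched as follows; `step` stands for one centroid update (Equation (11)), and the stopping rule mirrors the convergence condition above.

```python
import math

def iterate_location(y0, step, eps=0.5, N=20):
    """Centroid iteration: step maps a location to the new centre of mass.
    Stop when the shift between two iterations is below eps or after N steps.
    (eps and N here are illustrative values.)"""
    y = y0
    for _ in range(N):
        y_next = step(y)
        if math.dist(y, y_next) < eps:
            return y_next  # converged: shift smaller than the threshold
        y = y_next
    return y  # iteration budget exhausted

# Toy usage: a step that halves the distance to a fixed point (10, 6)
# converges to that point.
loc = iterate_location((0.0, 0.0), lambda p: ((p[0] + 10) / 2, (p[1] + 6) / 2))
```

In the real tracker, `step` would recompute the PPM-weighted centroid of the candidate region at the current location.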
4. Experiments
The test environments are real scenes with similarly colored backgrounds, object occlusion, low-illumination color images, and sudden illumination changes [12]. Eight public test sequences from the Visual Object Tracking challenge (http://votchallenge.net/index.html) and the Visual Tracker Benchmark [30] (http://www.visual-tracking.net) are used in the experiments (see Figure 7). Following the visual tracking benchmark, the test sequences are tagged with four attributes: low-illumination color image (LI), sudden illumination changes (IC), object occlusion (OC), and similarly colored background (SCB) (see Table 2). We implemented the tracking system in Matlab R2014a (8.3.0.532). All trackers were run on a standard PC (Intel (R) Core (TM) i5 2.6 GHz CPU with 8 GB RAM).
We compared our algorithm with several state-of-the-art methods, including classical mean-shift tracking (KBT) [10], the PPM-based color tracking algorithm (PPM) [15], a mean-shift algorithm using the joint color-texture histogram (LBPT) [20], and high-speed tracking with kernelized correlation filters (KCF) [9]. In addition, extra experiments were designed to test the two major parts of the proposed method, the CT feature and the PPM, separately. In one, we use the CT feature with the Euclidean distance (CT&ED) instead of the PPM as the similarity measure; in the other, we use the LBP feature with the PPM (LBP&PPM) instead of the CT feature. Both trackers are tested in the same experimental framework. All methods track a single object in our experiments, and the target is tracked continuously through the remaining frames.
4.1. Parameter Setting
The size of the search region of our method is set to 2.5 times the target size. In addition, there are five parameters in our tracking algorithm. Two of them govern the target model update in Section 3.2: the control parameter that determines whether to update the model, and the update factor λ. N and ε are the iteration parameters of the tracking algorithm in Section 3.3: N is the maximum number of iterations, and ε is the minimum shift threshold of the iteration. The threshold parameter T is important in our algorithm. In order to test its sensitivity, the central location error (CLE) is used to describe the tracking result; the CLE is defined as the Euclidean distance between the center of the box predicted by the tracker and that of the ground truth box. We set P = 8 and R = 1 for the calculation of the DCS-LBP. The results on the eight test sequences are shown in Table 3. It can be seen that our algorithm performs well on all the tests when T is a small value between 1 and 5, and it only misses the target in the basketball sequence when T gets larger. Therefore, T is set to a small value within this range in the experiments.
4.2. Qualitative Comparison
Some key frames of each sequence are given in
Figure 8. The results of different trackers are shown by the bounding boxes in different colors.
- (1)
In the basketball sequence, the tracked player moves fast and the environment changes many times. CT&ED loses the target at frame 80. KBT, PPM, and LBP&PPM fail at frame 473, when the player passes his teammate. KCF, LBPT, and our tracker successfully locate the object.
- (2)
In the car sequence, the target is a car on a dark road with bright lights in the background. All of the trackers manage to track the car over the first 200 frames. However, at frame 260, the car turns right, and only KCF can track it accurately.
- (3)
In the coke sequence, the target is a coke can and the light changes three times. The can moves fast and is sometimes occluded by plants. When the can is occluded by the plants the first time, LBPT misses the target. At frame 221, occlusion and an illumination change happen at the same time, and KBT and PPM drift to the wrong place. Throughout the tracking, both KCF and our method perform better than the others.
- (4)
The doll sequence is very long, with 3872 frames. The target is a doll that is occluded by a hand and changes scale at times. Because its color is similar to the background, LBP&PPM, LBPT, and CT&ED fail at frame 2378. KCF gives the best result, followed by PPM and our tracker.
- (5)
The lemming sequence is a challenging one, with fast motion, significant deformation, and long-term occlusion. KCF misses the target at frame 380 because the target moves fast against a similar background. Our method is more effective than the others throughout the tracking.
- (6)
In the matrix sequence, the target is a head. The sequence contains low-illumination color images, sudden illumination changes, object occlusion, and similarly colored backgrounds. Our tracker gives the best result: at frame 30, all of the methods except ours lose the target. Our tracker misses the target at frame 90, when the target changes shape dramatically.
- (7)
In the trellis sequence, the target is a boy's face in an outdoor environment with severe illumination and pose changes. All trackers except KCF and ours show some drifting at frame 270, and CT&ED loses the target at frame 410. Only KCF and our tracker perform well along the whole sequence.
- (8)
In the woman sequence, the target is a walking woman in the street. The difficulty lies in the fact that the woman is heavily occluded by parked cars. All the trackers except KCF and ours fail at frame 124 because of the occlusion and the small size of the target.
4.3. Quantitative Comparison
For performance evaluation and comparison, two metrics are considered: the CLE and the success rate (SR), which have been widely used in object tracking [12,31]. A target is considered successfully tracked in a frame if the overlap between the predicted bounding box and the ground truth exceeds 50% [32]. The overlap score is defined as

score = Area(B_p ∩ B_g) / Area(B_p ∪ B_g),

where B_p is the bounding box predicted by the tracker, B_g is the ground truth bounding box, and the function Area(·) computes the area of a region. The CLE has been described in Section 4.1. The results of the different methods on the eight test sequences are shown in Table 4 and Table 5: our algorithm achieves an SR of 94% and a CLE of 18 pixels, both better than the other algorithms. We also report the central-pixel errors frame-by-frame for each video sequence in Figure 9.
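The per-frame success test can be sketched as below; boxes are given as (x, y, w, h), and the intersection-over-union form of the overlap score follows the benchmark convention cited above.

```python
def success(pred, gt, thresh=0.5):
    """True when the overlap (intersection over union) of the predicted and
    ground-truth bounding boxes exceeds thresh. Boxes are (x, y, w, h)."""
    ax, ay, aw, ah = pred
    bx, by, bw, bh = gt
    # Intersection rectangle (clamped to zero when the boxes are disjoint)
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union > thresh
```

The SR of a sequence is then simply the fraction of frames for which `success` returns True.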
Now, we discuss the influence of the two major parts of our method, the CT feature and the PPM, separately. First, to test the influence of the similarity measure, we compare trackers using the CT feature with different measures: the Euclidean distance (CT&ED) and the PPM (the proposed method, CT&PPM). It can be seen from Table 4 and Table 5 that the PPM achieves an SR of 94% and a CLE of 18 pixels, better than those achieved by the Euclidean distance (40% and 122 pixels). Second, to test the influence of the feature, we compare trackers using the PPM with different features: the color feature (PPM), the LBP (LBP&PPM), and the CT feature (the proposed method, CT&PPM). Table 4 and Table 5 show that the CT feature outperforms the others with the highest SR and the lowest CLE. The results demonstrate the effectiveness of both the CT feature and the PPM in improving tracking accuracy.
4.4. Speed Analysis and Discussions
Table 6 lists the computation times of the trackers on our test platform; they run between 60 and 160 fps in the current Matlab implementation. The speed of each tracker depends on the area of the candidate region and the number of iterations. Compared with KBT, PPM, and KCF, both LBPT and the proposed method spend extra time on texture feature computation, although they only process the useful points. Compared with KBT, KCF, and LBPT, both PPM and our algorithm compute the target model and the search region jointly to decrease the computational complexity. Because the dimension of the CT feature is 640, our tracker takes more time than KBT, PPM, LBPT, and KCF; nevertheless, its speed is sufficient for real-time applications.
5. Conclusions
A new object tracking method has been proposed in this paper. The algorithm can overcome several difficulties in real scenes, such as object occlusion, sudden illumination changes, similarly colored backgrounds, and low-illumination color images. This work integrates a color texture feature with PPM-based centroid iteration tracking. A color texture model called the CT feature is introduced, and a posterior probability measure is used with the CT feature for target localization. Three target model update strategies are designed to improve tracking accuracy.
A tracking algorithm using only color cannot track the target in similarly colored or low-illumination regions. The combination of color and texture overcomes these difficulties, and the SDCS-LBP is a texture feature that is robust against gray-scale changes. In real scenes, our algorithm shows good performance. As our method is based on the histograms of regions, it can handle partial object occlusion, and the PPM and the target update strategies reduce tracking mistakes. In the experiments, our algorithm performs better than the others for most of the test sequences. Future work will be dedicated to decreasing the complexity of the algorithm.