1. Introduction
Compressive measurements [1] can save data storage and transmission costs. The measurements are normally collected by multiplying the original vectorized image with a Gaussian random matrix. Each measurement is a scalar, and the measurement is repeated many times. The saving is achieved because the number of measurements is much smaller than the number of pixels in the original frame. To track a target using such compressive measurements, one normally needs to reconstruct the image scene first.
However, it is difficult, if not impossible, to carry out target tracking and classification directly using the compressive measurements generated by a Gaussian random matrix, because the target location, size, and shape information in an image frame is destroyed by the Gaussian random matrix.
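To make this concrete, the following minimal sketch (Python/NumPy; the frame size, number of measurements, and matrix scaling are illustrative choices, not values from this paper) shows how Gaussian compressive measurements are formed and why each scalar measurement mixes all pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64                  # illustrative frame size
n_pixels = H * W
n_meas = n_pixels // 10        # e.g., 10x fewer measurements than pixels

frame = rng.random((H, W))     # stand-in for an image frame
x = frame.reshape(-1)          # vectorized image
Phi = rng.normal(size=(n_meas, n_pixels)) / np.sqrt(n_meas)  # Gaussian sensing matrix

y = Phi @ x                    # each entry of y is one scalar compressive measurement
# Every measurement is a weighted sum over all pixels, so target location, size,
# and shape are not directly observable from y without reconstructing the frame.
```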
Recently, a new compressive sensing device known as the Pixel-wise Code Exposure (PCE) camera was proposed [2]. In [2], the original frames were reconstructed using L1 [3] or L0 [4,5,6] sparsity based algorithms. It is well known that reconstructing the original frames is computationally intensive, and hence real-time applications may be infeasible. Moreover, information may be lost in the reconstruction process [7]. For real-time applications, it is therefore important to carry out target tracking and classification using compressive measurements directly. Although there are some tracking papers [8] in the literature that appear to use compressive measurements, they actually still use the original video frames for tracking.
In this paper, we propose a target tracking and classification approach in the compressive measurement domain for long-range and low-quality optical and MWIR videos. First, YOLO [9] is used for target tracking. The training of YOLO requires image frames with known target locations, which can be easily prepared. It should be noted that YOLO does have a built-in classifier; however, its performance is not good based on our past experience [10,11,12,13,14]. As a result, ResNet [15] has been used for classification, because some customized training can be done via data augmentation of the limited video frames. Although other deep learning based classifiers could be used, we chose ResNet simply because of its ability to avoid saturation issues. Our proposed approach was demonstrated using low-quality videos (long range, low spatial resolution, and poor illumination) in the SENSIAC database. The tracking and classification results are reasonable up to certain ranges. Significant improvement has been observed over conventional trackers [16,17]. Moreover, conventional trackers do not work well for multiple targets [10].
Although the proposed approach has been applied to shortwave infrared (SWIR) videos in an earlier paper [10], the application of the proposed approach to SENSIAC videos is completely new. Most importantly, the video quality in terms of spatial resolution and illumination in the SENSIAC videos is much worse than that of the SWIR videos in [10]. The SENSIAC database contains both optical and MWIR videos collected from ranges of 1000 m up to 5000 m. In some videos, the camera also moves, and there is air turbulence caused by desert heat. Dust raised by moving vehicles can be seen in some optical videos. There are seven types of vehicles, which are hard to distinguish at long ranges. For MWIR, there are both daytime and nighttime videos. We have demonstrated that the proposed deep learning approach is general and applicable to low-quality optical and MWIR videos. Our studies also showed that optical videos yield better tracking and classification performance than MWIR daytime videos, and that MWIR videos are more appropriate for nighttime operations.
It is worthwhile to briefly review some state-of-the-art algorithms that perform action inference or object classification directly using compressive measurements. We will also highlight the differences between our approach and those other approaches.
Paper [18] presents a reconstruction-free approach to action inference. The key idea is to build smashed filters using training samples that are affine transformed to a canonical viewpoint. The approach works very well even for 100-to-1 compression. However, the approach is for action inference (e.g., detecting a moving car or some other action), not for target detection, tracking, and classification (e.g., determining that the moving car is a Ram, not a Jeep) in the compressed measurement domain. Moreover, the smashed filter may assume that the camera is stationary and the viewing angle is fixed. Extending the approach to target tracking and classification with moving cameras may be non-trivial.
In [19], a CNN approach was presented to perform image classification directly in the compressed measurement domain. The input image is assumed to be cropped and centered, and there is only one target in each image. This is totally different from our paper, in which the target can be anywhere in the image frames.
Papers [20,21] are similar in spirit to [19]. Both papers discuss direct object classification using compressed measurements. However, both papers assume that the targets/objects are already centered. Moreover, they are classification studies only, without target detection and tracking; this is similar to the ResNet portion of our approach. Again, the problem and scenarios in these papers are different from ours, because in our paper the target can be anywhere in the video frames.
Strictly speaking, the approach in [22] is not reconstruction free. The integral image is one type of reconstructed image. After the integral image is obtained, other tracking filters are then applied. There was also no discussion of object classification. Our paper does not require any image reconstruction.
Reference [23] is interesting in that a random mask is applied to conceal the actual contents of the original video; the authors call the masked video a coded aperture video. On close inspection, the coded aperture idea in [23] is very different from the PCE idea in our paper. In addition, the key idea in [23] is about action recognition (similar to [18]), not object tracking and classification. Extending the idea in [23] to object tracking and classification may not be an easy task.
Reference [24] presents an object detection approach using correlation filters and sparse representation. There was no object classification. No reconstruction of compressive measurements is needed, and the results are quite good. One potential limitation of the idea in [24] is that the sparsity approach may be very time consuming when the dictionary size is large, and hence may not be suitable for near real-time applications. Different from [24], our paper focuses on object detection, tracking, and classification. Once trained, our approach can work in a near real-time fashion.
In [25], the authors present an approach that extracts features from the compressed measurements and then uses the features to create a proxy image, which is then used for action recognition. If our interpretation is correct, this approach may not be considered reconstruction free, because a proxy image is constructed. Similar to [19,20,21], the approach appears to be suitable for stationary camera cases, and the objects are assumed to be already centered in the images. In our approach, the camera can be non-stationary and targets can be anywhere in the image.
Paper [26] presents an online reconstruction-free approach to object classification using compressed measurements. Similar to [19,20,21,25], the approach assumes the object is already at the center of the image. For an image frame where the target location is unknown, it is not clear how this approach can be applied. We faced the same problem two years ago when we investigated a sparsity based approach [7] that directly classifies objects using compressive measurements; we still could not solve the classification problem in which the target occupies a small and random location of an image frame. The methods in [19,20,21,25,26] also did not address this issue.
This paper is organized as follows: in Section 2, we describe some background materials, including the PCE camera, YOLO, ResNet, the SENSIAC videos, and performance metrics. In Section 3, we present some tracking results using a conventional tracker, which clearly has poor performance when using compressive measurements directly. Section 4, Section 5 and Section 6 then focus on presenting the deep learning results. In particular, Section 4 summarizes the tracking and classification results using optical videos. Section 5 and Section 6 summarize the tracking and classification results for MWIR daytime and nighttime videos, respectively. Finally, we conclude our paper with some remarks for future research. To make our paper easier to read, we have moved some tracking and classification results to the Appendices.
2. Materials and Methods
2.1. PCE Imaging and Coded Aperture
Here, we briefly review the PCE or Coded Aperture (CA) video frames [2]. The differences between a conventional video sensing scheme and PCE are shown in Figure 1. First, conventional cameras capture frames at 30, 5, or some other number of frames per second. A PCE camera, however, captures a compressed frame, called a motion coded image, over a fixed period of time (Tv). For instance, it is possible to compress 20 original frames into a single motion coded frame, so the compression ratio is very significant. Second, the PCE camera allows one to use different exposure times for different pixel locations; consequently, a high dynamic range can be achieved. Moreover, power can also be saved via a low sampling rate. One notable disadvantage of PCE is that, as shown in the right-hand side of Figure 1, an over-complete dictionary is needed to reconstruct the original frames, and this process may be very computationally intensive and may prohibit real-time applications.
The coded aperture image Y is obtained by:

Y(m, n) = Σ_{t=1}^{T} S(m, n, t) · X(m, n, t)

where X contains a video scene with an image size of M × N and T frames, and S is the sensing data cube, which contains the exposure times for the pixel located at (m, n, t). The value of S(m, n, t) is 1 for frames t ∈ [tstart, tend] and 0 otherwise, where [tstart, tend] denotes the start and end frame numbers for a particular pixel.
The video scene X can be reconstructed via sparsity methods (L1 or L0); details can be found in [2]. However, the reconstruction process is time consuming and hence not suitable for real-time applications.
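For readers who wish to experiment with the measurement model, the following minimal sketch (Python/NumPy; the array shapes are illustrative assumptions) forms a motion coded frame from a video cube and a binary sensing cube according to the summation above.

```python
import numpy as np

def coded_aperture_image(X, S):
    """Form a motion coded frame Y(m, n) = sum_t S(m, n, t) * X(m, n, t).

    X: video cube of shape (M, N, T); S: binary sensing cube of the same shape.
    """
    assert X.shape == S.shape
    return (S * X).sum(axis=2)

# Example: compress a 30-frame cube into one coded frame (sizes are illustrative).
X = np.random.rand(64, 64, 30)
S = np.ones_like(X)            # PCE Full: every pixel exposed in every frame
Y = coded_aperture_image(X, S)
```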
Instead of performing sparse reconstruction, our scheme works directly on the PCE images. Utilizing raw PCE measurements has several challenges. First, moving targets may be smeared if the exposure times are long. Second, there are missing pixels in the raw measurements because not all pixels are activated during the data collection process. Third, there are far fewer frames in the raw video because a number of original frames are compressed into a single coded frame. This means that the training data will be limited.
In this paper, we have focused on simulating PCE measurements. We then proceed to demonstrate that detecting, tracking, and classifying moving objects directly from these measurements is feasible. We carried out multiple experiments with three different sensing models: PCE/CA Full, PCE/CA 50%, and PCE/CA 25%.
The PCE Full model (PCE Full or CA Full) is quite similar to a conventional video sensor: every pixel in the spatial scene is exposed for exactly the same duration of one second. This simple model still produces a compression ratio of 30:1. The number 30 is a design parameter; based on our sponsor’s requirements, we used 5 frames in our experiments, which already achieves 5-to-1 compression.
Next, in the sensing model labeled PCE 50% or CA 50%, roughly 1.85% of the pixels are activated in each frame, with an exposure time of Te = 133.3 ms. Since we are summing 30 frames into a single coded frame, 30 frames at 1.85% would correspond to 55.5% of all pixels receiving exposure in the coded frame. Because the pixels are randomly selected in each frame, some selections overlap, so activating 1.85% of the pixels in each frame is roughly equivalent to 50% of the pixels being activated in the coded frame. Similarly, for the PCE 25% case, the percentage of activated pixels in each frame is reduced by half, from 1.85% to 0.92%. The exposure duration is still set at the same 4-frame duration (133.3 ms).
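The sketch below (Python/NumPy, building on the coded_aperture_image function above) illustrates one way to generate the sensing cube S for the three modes; the activation rule (independent random selection in each frame) and the array sizes are our assumptions for illustration and may differ from the sampling pattern of an actual PCE camera.

```python
import numpy as np

def make_sensing_cube(M, N, T, activation_rate, exposure_frames, seed=0):
    """Random sensing cube: in each frame, a fraction `activation_rate` of the pixels
    starts an exposure lasting `exposure_frames` frames (clipped at the last frame)."""
    rng = np.random.default_rng(seed)
    S = np.zeros((M, N, T), dtype=np.uint8)
    n_activate = int(round(activation_rate * M * N))
    for t in range(T):
        idx = rng.choice(M * N, size=n_activate, replace=False)
        m, n = np.unravel_index(idx, (M, N))
        for dt in range(exposure_frames):
            if t + dt < T:
                S[m, n, t + dt] = 1
    return S

# Illustrative parameters: 30 original frames per coded frame, 4-frame (133.3 ms) exposures.
S_full = np.ones((64, 64, 30), dtype=np.uint8)        # PCE Full: all pixels, full exposure
S_50 = make_sensing_cube(64, 64, 30, 0.0185, 4)       # PCE 50%
S_25 = make_sensing_cube(64, 64, 30, 0.0092, 4)       # PCE 25%
coverage_50 = (S_50.sum(axis=2) > 0).mean()           # fraction of pixels with some exposure
```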
Table 1 below summarizes the comparison between the three sensing models in terms of data and power saving ratios. Details can be found in [10].
2.2. YOLO Tracker
YOLO [9] is fast and similar to Faster R-CNN [27]. We picked YOLO rather than Faster R-CNN simply because of its easier installation and compatibility with our hardware. The training of YOLO is quite simple, as only images with ground-truth target locations are needed.
YOLO mainly performs object detection; tracking is achieved by detection. That is, the detected object locations from all frames are connected together to form object tracks. Conventional trackers usually require a human operator to manually put a bounding box on the target in the first frame. This is not only inconvenient, but also may not be practical, especially for long-term tracking where tracking may need to be re-started after some frames. Compared with conventional trackers [16,17], YOLO does not require any information on the initial bounding boxes. Moreover, YOLO can handle multiple targets simultaneously.
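The paper does not spell out the exact rule used to link detections across frames; a minimal nearest-neighbor association sketch (Python/NumPy; the max_jump threshold and the greedy matching rule are our assumptions) could look as follows.

```python
import numpy as np

def link_detections(detections_per_frame, max_jump=30.0):
    """Greedy tracking-by-detection: attach each detection to the nearest track
    extended in the previous frame; otherwise start a new track.

    detections_per_frame: list over frames; each entry is a list of (cx, cy) box centers.
    Returns a list of tracks, each a list of (frame_index, cx, cy).
    """
    tracks, active = [], []
    for f, dets in enumerate(detections_per_frame):
        new_active, used = [], set()
        for cx, cy in dets:
            best, best_d = None, max_jump
            for ti in active:
                if ti in used:
                    continue
                _, px, py = tracks[ti][-1]
                d = float(np.hypot(cx - px, cy - py))
                if d < best_d:
                    best, best_d = ti, d
            if best is None:
                tracks.append([(f, cx, cy)])       # start a new track
                new_active.append(len(tracks) - 1)
            else:
                tracks[best].append((f, cx, cy))   # extend the matched track
                used.add(best)
                new_active.append(best)
        active = new_active
    return tracks
```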
YOLO also comes with a classification module. However, based on our evaluations, the classification accuracy of YOLO is not good, as can be seen in [10,11,12,13,14]. For completeness, we include a block diagram of YOLO version 1 [9] in Figure 2. The input image needs to be resized to 448 × 448, and there are 24 layers. YOLO version 2 has been used in our experiments.
2.3. ResNet Classifier
A common problem in deep CNNs is performance saturation. The ResNet-18 model is an 18-layer convolutional neural network (CNN) that avoids performance saturation when training deeper layers. The key idea in the ResNet-18 model is the identity shortcut connection, which skips one or more layers. Figure 3 shows the architecture of an 18-layer ResNet.
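A simplified residual block with an identity shortcut is sketched below (PyTorch); this sketch keeps the channel count fixed and omits the strided/downsampling variant used in parts of the full ResNet-18, so it is an illustration of the shortcut idea rather than the exact network used in our experiments.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut skips the two conv layers
```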
Training of ResNet requires target patches, which are cropped from the training videos and then augmented; the augmentation procedure is described below.
The relationship between YOLO and ResNet is that YOLO determines where the targets are and places bounding boxes around them; the pixels inside the bounding boxes are then fed into ResNet-18 for classification.
The training of ResNet was done as follows: first, targets were cropped from training videos at a particular range in the SENSIAC database; second, mirror images were generated; third, data augmentation using scaling (larger and smaller), rotation (every 45 degrees), and illumination (brighter and dimmer) was applied to generate more training data. For every cropped target, 64 additional synthetic targets were generated.
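One plausible reading of this procedure is that the 64 synthetic patches come from the combinations {original, mirrored} × {smaller, larger} × {8 rotations} × {dimmer, brighter} (2 × 2 × 8 × 2 = 64). The sketch below (Python with Pillow and NumPy) implements that reading; the specific scale and brightness factors are our assumptions and were not given in the text.

```python
import numpy as np
from PIL import Image, ImageOps

def augment_patch(patch):
    """Generate 64 augmented copies of a cropped target patch (PIL Image):
    mirror x scale (smaller/larger) x rotation (every 45 deg) x brightness (dimmer/brighter)."""
    variants = []
    for img in (patch, ImageOps.mirror(patch)):                     # original + mirror
        for scale in (0.8, 1.2):                                    # smaller and larger (assumed factors)
            w, h = img.size
            scaled = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
            for angle in range(0, 360, 45):                         # 8 rotations, every 45 degrees
                rotated = scaled.rotate(angle, expand=True)
                for gain in (0.7, 1.3):                             # dimmer and brighter (assumed factors)
                    arr = np.clip(np.asarray(rotated, dtype=np.float32) * gain, 0, 255)
                    variants.append(Image.fromarray(arr.astype(np.uint8)))
    return variants   # 2 * 2 * 8 * 2 = 64 patches
```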
2.4. Data
To fulfill our sponsor’s requirements, our research objective is to perform tracking and classification of seven vehicles using the SENSIAC videos. There are optical and mid-wave infrared (MWIR) videos collected at distances ranging from 1000 to 5000 m in 500 m increments. The seven types of vehicles are shown in Figure 4. These videos are challenging for several reasons. First, the target sizes are small due to the long distances; this is quite different from benchmark datasets such as the MOT Challenge [28], where the range is short and the targets are large. Second, the target orientations change drastically. Third, the illumination differs across videos. Fourth, the camera moves in some videos. Fifth, both optical and MWIR videos are present. Sixth, environmental factors such as air turbulence due to desert heat are also present in some optical videos.
Our sponsor is aware of other benchmark datasets such as the MOT Challenge database. However, since the sponsor is interested in long-range, small targets (vehicles) and grayscale videos, the MOT Challenge dataset does not meet the requirements: most of its videos contain human subjects at close distance, and the videos are in color. Moreover, our limited project funding only allowed us to focus on the most relevant datasets, so we did not explore other videos such as MOT Challenge.
Having said the above, we would like to mention that our experiments used a total of 378 videos, covering seven vehicles, six ranges (1000 to 3500 m in 500 m increments), three imaging modalities (optical, MWIR daytime, and MWIR nighttime), and three coded aperture modes. In short, our experiments are very comprehensive; to our knowledge, no one has carried out such a comprehensive tracking and classification study for the SENSIAC dataset in the compressed measurement domain. In this regard, our paper makes a reasonable contribution to the research community.
Here, we briefly highlight the background of the optical and MWIR videos. Figure 5 shows a few examples of optical and MWIR images. The optical and MWIR videos have very different characteristics. Optical imagers operate at wavelengths between 0.4 and 0.8 microns, whereas MWIR imagers operate between 3 and 5 microns. Optical cameras require external illumination, whereas their MWIR counterparts do not need external illumination sources because MWIR cameras are sensitive to heat radiated from objects. Consequently, target shadows can affect target detection performance in optical videos, whereas there are no shadows in MWIR videos. Moreover, atmospheric obscurants cause much less scattering in the MWIR band than in the optical band. As a result, MWIR cameras are tolerant of heat turbulence, smoke, dust, and fog.
2.5. Performance Metrics
In our earlier papers [10,11,12,13,14], we included some tracking results where conventional trackers such as GMM [17] and STAPLE [16] were used. The tracking performance was poor when there were missing data.
Although other metrics could be used, many of them convey similar information. Hence, we believe that the following popular and commonly used metrics are sufficient for evaluating tracker performance:
Center Location Error (CLE): It is the distance between the center of the detected bounding box and the center of the ground-truth bounding box.
Distance Precision (DP): It is the percentage of frames where the centroids of detected bounding boxes are within 20 pixels of the centroid of ground-truth bounding boxes.
EinGT: It is the percentage of the frames where the centroids of the detected bounding boxes are inside the ground-truth bounding boxes.
Number of frames with detection: This is the total number of frames that have detection.
For classification, we used confusion matrix and classification accuracy as performance metrics.
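For reference, a minimal sketch of how the tracking metrics above can be computed is given below (Python/NumPy); the (x, y, width, height) box convention and the dictionary-based inputs are our assumptions for this example.

```python
import numpy as np

def box_center(box):
    """Box given as (x, y, w, h); returns its center (cx, cy)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0

def tracking_metrics(detected, ground_truth, dp_threshold=20.0):
    """Compute CLE, DP, and EinGT over frames that have a detection.

    detected: dict frame_index -> (x, y, w, h) or None when there is no detection.
    ground_truth: dict frame_index -> (x, y, w, h).
    """
    cle, within_dp, in_gt = [], [], []
    for f, gt in ground_truth.items():
        det = detected.get(f)
        if det is None:
            continue
        dx, dy = box_center(det)
        gx, gy = box_center(gt)
        dist = float(np.hypot(dx - gx, dy - gy))
        cle.append(dist)
        within_dp.append(dist <= dp_threshold)
        x, y, w, h = gt
        in_gt.append(x <= dx <= x + w and y <= dy <= y + h)
    return {
        "CLE": float(np.mean(cle)),                # average center location error (pixels)
        "DP": 100.0 * float(np.mean(within_dp)),   # % of detected frames within 20 pixels
        "EinGT": 100.0 * float(np.mean(in_gt)),    # % of detected frames with center inside GT box
        "frames_with_detection": len(cle),
    }
```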