In all experiments, the metric that we optimize is the partial area under the curve (PAUC) [5,6]. The PAUC differs from the AUC in that it computes performance only over a specified region of the ROC curve. We believe this metric is more appropriate than the AUC for sensor-based alert systems because the PAUC can be tuned to focus on the low-PFA region where these systems operate. We use the PAUC limited to the 0%-10% PFA range for all of our results. Additionally, in all experiments we perform statistical significance testing using the paired t-test with 95% confidence. Results that are the best, or statistically significantly tied for the best, are shown in boldface in the results tables.
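For concreteness, the raw PAUC over this operating region can be computed with a short routine like the following sketch (function and variable names are ours, using NumPy and scikit-learn; note that scikit-learn's `roc_auc_score` with `max_fpr` returns a standardized McClish-corrected variant rather than the raw area computed here):

```python
import numpy as np
from sklearn.metrics import roc_curve

def partial_auc(labels, scores, max_pfa=0.1):
    """Raw area under the ROC curve restricted to PFA in [0, max_pfa]."""
    pfa, pd, _ = roc_curve(labels, scores)
    # Interpolate PD at the PFA cap so the region ends exactly at max_pfa.
    pd_cap = np.interp(max_pfa, pfa, pd)
    keep = pfa <= max_pfa
    x = np.r_[pfa[keep], max_pfa]
    y = np.r_[pd[keep], pd_cap]
    # Trapezoidal integration of PD over the restricted PFA range.
    return float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))
```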
5.1. Simulated Data
We evaluate the effectiveness of the proposed combiner and the any-combiner with the different normalization strategies given in Section 4.4 vis-à-vis the conditionally non-discriminative condition defined in Section 4.1. We do this by simulating scores for agents that are distributed according to a multivariate Gaussian distribution based on the label. We compare the following agent-combiners:
1. (Prpsd. Gaussian WLR) the proposed combiner given in (3) with a Gaussian model for the agent score distributions,
2. (Joint Gaussian) a jointly Gaussian generative meta-classifier that models the joint distribution of the agent scores conditioned on the label as $p(\mathbf{s} \mid y = k) = \mathcal{N}(\mathbf{s}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$,
3. (Ind. Gaussian) an independent Gaussian generative meta-classifier that models the joint distribution of the agent scores conditioned on the label as $p(\mathbf{s} \mid y = k) = \prod_{j} \mathcal{N}(s_j; \mu_{j,k}, \sigma_{j,k}^2)$,
4. (AC Gaussian WLR) the any-combiner with weighted-likelihood normalization,
5. (AC Z-norm) the any-combiner with Z-normalization,
6. (AC F-norm) the any-combiner with F-normalization, and
7. (AC EER-norm) the any-combiner with EER-normalization, where the EER is calculated assuming Gaussian densities as in [18].
Both the joint Gaussian and independent Gaussian combiners form a test statistic using the target vs. clutter likelihood ratio, that is,
$$\Lambda(\mathbf{s}) = \frac{\sum_{k \geq 1} P(y = k)\, p(\mathbf{s} \mid y = k)}{p(\mathbf{s} \mid y = 0)},$$
where $y = 0$ denotes clutter and $y = k \geq 1$ denotes target-type $k$.
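Under the Gaussian models above, both combiners reduce to a few lines of code. The sketch below (our own function and variable names; class priors are estimated from training frequencies) illustrates the two test statistics:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def joint_gaussian_lr(S_train, y_train, S_test):
    """Target-vs-clutter LR with a full-covariance Gaussian per class.

    S_*: (n_samples, n_agents) agent-score matrices.
    y_*: 0 for clutter, k >= 1 for target-type k.
    """
    clutter = S_train[y_train == 0]
    den = multivariate_normal(clutter.mean(0), np.cov(clutter, rowvar=False)).pdf(S_test)
    num = np.zeros(len(S_test))
    for k in np.unique(y_train[y_train > 0]):
        Sk = S_train[y_train == k]
        prior = np.mean(y_train == k)  # empirical target-type prior
        num += prior * multivariate_normal(Sk.mean(0), np.cov(Sk, rowvar=False)).pdf(S_test)
    return num / den

def independent_gaussian_lr(S_train, y_train, S_test):
    """Same statistic, but each class density factors across agents."""
    def density(X, Q):
        # Product over agents of univariate Gaussian densities.
        return np.prod(norm.pdf(Q, X.mean(0), X.std(0)), axis=1)
    num = np.zeros(len(S_test))
    for k in np.unique(y_train[y_train > 0]):
        num += np.mean(y_train == k) * density(S_train[y_train == k], S_test)
    return num / density(S_train[y_train == 0], S_test)
```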
We first evaluate the combiners when the conditionally non-discriminative assumption holds. We do this by simulating a ten-agent problem where the agent outputs are independent and non-discriminative for targets for which they are not trained. We simulate a complex problem in which the degree of difficulty for different targets varies significantly, a situation that has been shown to occur in practice [27]. Define $\mu_{j,k}$ and $\sigma_{j,k}$ as the mean and standard deviation of the $j$th agent's output when $y = k$. We first draw the agent-score distribution parameters for clutter data, drawing $\mu_{j,0}$ and $\sigma_{j,0}$ from uniform distributions for each agent. Next, we make each agent $j$ non-discriminative for targets other than $j$ by setting $\mu_{j,k}$ and $\sigma_{j,k}$, $k \neq j$, $k \geq 1$, equal to the clutter parameters for agent $j$: $\mu_{j,k} = \mu_{j,0}$ and $\sigma_{j,k} = \sigma_{j,0}$. Finally, we draw $\mu_{j,j}$ and $\sigma_{j,j}$ from uniform distributions for each agent.
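A sketch of this construction follows; the uniform ranges shown are illustrative placeholders rather than the exact ranges used in the experiments, and scores are drawn independently across agents as the text specifies:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10  # number of agents, one per target-type

# Column 0 holds clutter parameters; column k >= 1 holds target-type-k parameters.
mu = np.empty((M, M + 1))
sigma = np.empty((M, M + 1))

# Clutter parameters for each agent (illustrative uniform ranges).
mu[:, 0] = rng.uniform(-1.0, 1.0, size=M)
sigma[:, 0] = rng.uniform(0.5, 2.0, size=M)

# Each agent j is non-discriminative for targets k != j: copy its clutter parameters.
for k in range(1, M + 1):
    mu[:, k] = mu[:, 0]
    sigma[:, k] = sigma[:, 0]

# On its own target-type, each agent's parameters are drawn from separate uniform
# ranges (again illustrative), so the difficulty varies across targets.
own = np.arange(M)
mu[own, own + 1] = rng.uniform(0.5, 3.0, size=M)
sigma[own, own + 1] = rng.uniform(0.5, 2.0, size=M)

def draw_scores(y, n):
    """Draw n independent agent-score vectors for label y (0 = clutter)."""
    return rng.normal(mu[:, y], sigma[:, y], size=(n, M))
```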
Figure 2 plots ROC curves in the low-PFA region for the ten agents on their specific target-types for one draw of the agent-score parameters. We generate ten thousand test exemplars and a varying number of combiner training samples in order to evaluate the effectiveness of the combiners when they must estimate their parameters from either few or many combiner training samples. We set the class-priors so that all target-types share the same prior, with the remaining prior mass assigned to clutter. Because the target-type priors are uniform, the weighting in the proposed weighted-likelihood ratio normalization is irrelevant, and the results therefore compare the performance of the normalizations without the influence of the weighting. We repeat the experiment fifty times, generating new agent-output distributions and data for each iteration.
Table 1 gives the results averaged over the fifty experiments for each number of combiner training samples. The rightmost column gives an upper bound on the performance of each combiner when provided with the true parameters; the joint Gaussian, independent Gaussian, and proposed methods all perform best when provided with the true parameters. However, when the combiners must estimate the parameters from data, the joint and independent Gaussian combiners perform worse in all cases than the proposed method. The proposed method is the best, or statistically significantly tied for the best with 95% confidence, at all numbers of training samples. At low numbers of training samples, the any-combiner with Z-normalization is statistically tied for the best with the proposed method. The results show that, given limited training data, the performance of the proposed combiner greatly exceeds that of the joint combiner as well as the combiner based on an independence assumption. We can also see that the weighted-likelihood normalization outperforms the normalization methods given in Section 4.4. However, we note that the performance of the different normalization methods will depend on a variety of factors, including the number of combiner training samples and how well the agent distributions match the assumptions used to derive the normalization methods. Regularization techniques for estimating the model parameters can also improve performance when there are few training samples [18].
We now compare performance when the agents are not conditionally non-discriminative. We simulate a problem with four target-types and four corresponding agents, again using Gaussian conditional densities. We set the parameters of the conditional distributions to make agents one and two offer discriminative power on each other's target-types, coupling the two agents' score distributions through a parameter that we vary between 0 and 1. This coupling parameter controls how closely the conditionally non-discriminative assumption holds: at one extreme, agents one and two are independent and discriminative for one another's target-types, so the conditionally non-discriminative assumption is a poor one; at the other extreme, the conditionally non-discriminative condition holds exactly.
Table 2 and Table 3 give the results, averaged over fifty random draws of the data for each value of the coupling parameter, when the combiners are provided with 100 and 1000 training samples, respectively. Table 2 shows that with fewer training samples, the proposed combiner is the best or statistically significantly tied for the best at all but one value of the coupling parameter, where the combiner based on the independence assumption performs better. Table 3 shows that when the combiners have more training samples, the proposed combiner is the best at only the larger values of the coupling parameter. These results show that, in the case of limited data samples, the proposed combiner exhibits good performance even when the conditionally non-discriminative assumption does not hold.
5.2. Pin-Less Verification with Yale Faces
We perform a pin-less biometric verification experiment using the cropped version of the Extended Yale Faces dataset [28]. The dataset contains frontal images of thirty-nine people, with sixty-five images of each person taken under different lighting conditions. In order to perform the pin-less verification experiment, we make persons one through five the targets, referred to as clients within the biometrics literature, and the remaining users the clutter, referred to as impostors.
The original images are 192 × 168 pixels. We downsample the images by a factor of two (the Yale Faces data is often downsampled in order to improve computational efficiency [29,30]), vectorize the resulting pixels, and perform principal components analysis, retaining the components that contain 95% of the data variance. This results in a sixty-four-dimensional feature vector for each image.
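A minimal sketch of this feature pipeline, assuming an `images` array of shape (n, 192, 168) and simple decimation for the downsampling (the text does not specify the resampling method), is:

```python
import numpy as np
from sklearn.decomposition import PCA

def featurize_faces(images, variance=0.95):
    """Downsample by two, vectorize, and keep 95% of the data variance."""
    small = images[:, ::2, ::2]                    # 192x168 -> 96x84 by decimation
    X = small.reshape(len(small), -1).astype(float)
    pca = PCA(n_components=variance)               # fractional value: keep enough
    return pca.fit_transform(X), pca               # components to reach 95% variance
```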
We randomly select $n$ images from each person in the client group for training and use the rest for testing. We randomly select twenty persons from the impostor group and use all of their images for training, while using all images from the remaining fourteen impostors for testing. We vary the number of training images $n$ per client over a range of values, and repeat the experiment twenty-five times for each $n$, randomly selecting different training and test images for each iteration.
We train five SVM agents, with the training data for the $i$th agent consisting of the $n$ images for client $i$ as target and all images from the twenty training impostors as clutter. We choose the SVM parameters via cross-validation on the training data, and then obtain unbiased training data for the combiners via cross-validation.
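One way to realize this two-level training is sketched below; the parameter grid is an illustrative placeholder, and `cross_val_predict` stands in for whatever cross-validation scheme generates the unbiased combiner-training scores:

```python
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.svm import SVC

def train_agent(X, y):
    """Fit one client-vs-impostor RBF SVM; return out-of-fold combiner scores."""
    grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.1]}  # illustrative pool
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
    # Out-of-fold decision scores are unbiased combiner-training data: no score
    # comes from a model that saw that exemplar during its own training.
    scores = cross_val_predict(search.best_estimator_, X, y, cv=5,
                               method="decision_function")
    return search.best_estimator_, scores
```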
We compare the combiner strategies from the previous section, adding the proposed combiner with weighted-likelihood normalization estimated via Platt probabilistic normalization as described in Section 4.4 (Prpsd. Platt WLR) and a meta-SVM classifier using the Gaussian RBF kernel (Meta SVM). We choose parameters for the meta-SVM via cross-validation on the combiner training data, drawing from the same pool of parameters used to train the agent classifiers.
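Platt probabilistic normalization fits a sigmoid posterior to the raw SVM scores. A minimal stand-in is sketched below (plain logistic regression on the one-dimensional score rather than Platt's regularized fitting procedure with adjusted targets); dividing the posterior odds by the prior odds recovers a likelihood ratio, the form the weighted-likelihood normalization consumes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(train_scores, train_labels):
    """Fit a sigmoid p(target | s) to one agent's scores (labels in {0, 1})."""
    lr = LogisticRegression().fit(train_scores.reshape(-1, 1), train_labels)
    return lambda s: lr.predict_proba(np.asarray(s, float).reshape(-1, 1))[:, 1]

def platt_likelihood_ratio(posterior, s, prior_target):
    """Divide posterior odds by prior odds to recover p(s|target)/p(s|clutter)."""
    p = posterior(s)
    return (p / (1.0 - p)) * ((1.0 - prior_target) / prior_target)
```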
Table 4 gives the results. The top three rows show that, at high numbers of training samples, the joint and independent Gaussian combiners outperform the proposed combiner with a Gaussian model for the weighted likelihood ratio. However, the proposed combiner with the weighted likelihood ratio estimated via Platt probabilistic normalization is statistically tied for the best at all numbers of training samples, outperforming both the generative joint Gaussian model and the discriminative meta-SVM. The any-combiner with F-normalization also performs very well, being statistically tied for the best at all numbers of training samples other than five. At five training samples, the any-combiner with Z-normalization is tied with the proposed combiner with Platt probabilistic normalization.
We also use this experiment to show that the normalized agent scores give additional information indicating which target-type causes an alert. To do this, we calculate the percentage of true-positive alerts for which the maximum normalized agent-score is from the agent trained for the target-type that causes the alert.
Table 5 gives the average accuracy for the different normalization methods when we set the alert threshold to give a false-alarm rate of five percent. The results show that the combiners that achieved the highest probability of detection according to Table 4 also achieve high accuracy in terms of estimating the correct target-type.
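A sketch of this accuracy computation (names are ours; the threshold is set empirically from the clutter scores to hit the desired false-alarm rate) is:

```python
import numpy as np

def target_type_accuracy(norm_scores, y_true, pfa=0.05):
    """Among true-positive alerts, fraction whose top agent matches the target-type.

    norm_scores: (n_samples, n_agents) normalized agent scores.
    y_true: 0 for clutter, k >= 1 for target-type k.
    """
    max_score = norm_scores.max(axis=1)
    # Threshold at the (1 - pfa) quantile of the clutter max-scores.
    tau = np.quantile(max_score[y_true == 0], 1.0 - pfa)
    true_pos = (max_score > tau) & (y_true > 0)
    predicted = norm_scores.argmax(axis=1) + 1      # agent index -> target-type
    return float(np.mean(predicted[true_pos] == y_true[true_pos]))
```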
5.3. Classification of Ground Vehicles Using Acoustic Signatures
The acoustic-seismic classification identification dataset (ACIDS) contains acoustic time-series data collected from nine different types of ground vehicles as they pass by a fixed location. An array of three microphones recorded the sound emitted by each passing vehicle. The recordings were made in four different locations, including desert, arctic, and mid-Atlantic environments, with vehicle speeds varying from five to forty km/h and closest point of approach (CPA) distances from twenty-five to one hundred meters. The data consists of 274 labeled recordings of CPA events, and Table 6 gives a breakdown of the number of events for each type of vehicle. The Army Research Lab collected the data and made it available [31]. We use the acoustic data from microphone one only. The sampling frequency is 1025.641 Hz and the data is bandpassed between twenty-five and four hundred Hz.
We pre-process the data and extract features in a manner that closely matches that of a previous study using the ACIDS dataset by Wu and Mendel [32]. We first estimate the CPA for each time-series by filtering the magnitude of the acoustic response with a two-second moving-average filter and taking the point where the output achieves its maximum value. We then limit the data to a forty-second window centered on the CPA. We convert the windowed time-series into a spectrogram by taking the short-time Fourier transform with a one-second window, a frequency resolution of one Hz, and fifty percent overlap. This gives a spectrogram that contains eighty time-scans and 376 frequency bins. We pass the spectrogram through a spectrum normalizer to remove the broadband energy trend, so that the end result is a spectrogram that characterizes the narrowband content of the vehicle's acoustic signature. The normalized spectrogram for the first CPA event in the ACIDS dataset is shown in Figure 3.
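A sketch of this pre-processing chain using SciPy follows; the window lengths come from the text, while the running-median whitener is an illustrative stand-in for the spectrum normalizer, whose exact form is not specified here:

```python
import numpy as np
from scipy.signal import stft, medfilt

FS = 1025.641  # sampling frequency in Hz

def normalized_spectrogram(x):
    """CPA-centered, spectrum-normalized spectrogram of one acoustic recording."""
    # Estimate the CPA as the peak of a two-second moving average of the magnitude.
    win = int(2 * FS)
    energy = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
    cpa = int(np.argmax(energy))
    # Limit the data to a forty-second window centered on the CPA.
    half = int(20 * FS)
    seg = x[max(cpa - half, 0): cpa + half]
    # One-second windows with 50% overlap give one-Hz frequency resolution.
    nper = int(FS)
    f, t, Z = stft(seg, fs=FS, nperseg=nper, noverlap=nper // 2)
    mag = np.abs(Z)
    # Divide out a running median over frequency to remove the broadband trend.
    background = medfilt(mag, kernel_size=(31, 1))
    return f, t, mag / np.maximum(background, 1e-12)
```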
We featurize the spectrograms using the magnitude of the spectrum at the second through twelfth harmonics of the fundamental frequency, as was done in [32]. To get these features, we first estimate the fundamental frequency of the predominant harmonic set in each spectrogram, which, as described in [33], typically relates to the engine cylinder firing rate. We estimate the predominant fundamental using an algorithm similar to the harmonic relationship detection algorithm described in [33]. The algorithm detects the presence of harmonic sets with fundamental frequencies between six and twenty Hz on a scan-by-scan basis by looking for a repeated pattern of narrowband energy at the correct harmonics within the spectrum. We smooth the resulting estimate using a median filter, and take the magnitude of the spectrum at the second through twelfth harmonics, normalized to have a maximum of one, as the feature vector for the scan. This results in eighty eleven-dimensional feature vectors per event. These features are suitable for this application due to their invariance to changes in the vehicle's speed [33].
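Given per-scan fundamental-frequency estimates, the harmonic features can be read off the normalized spectrogram as sketched below (assuming one-Hz bins indexed from zero Hz; the fundamental estimator itself follows [33] and is not reproduced here):

```python
import numpy as np
from scipy.signal import medfilt

def harmonic_features(spec, f0_per_scan):
    """Per-scan magnitudes at harmonics 2-12 of the fundamental, max-normalized.

    spec: (n_freq_bins, n_scans) normalized spectrogram with one-Hz bins.
    f0_per_scan: fundamental-frequency estimate in Hz for each scan.
    """
    harmonics = np.arange(2, 13)                     # second through twelfth
    f0 = medfilt(f0_per_scan, kernel_size=5)         # smooth the scan-wise estimate
    feats = []
    for scan, f in enumerate(f0):
        bins = np.round(harmonics * f).astype(int)   # bin index ~= frequency in Hz
        v = spec[bins, scan]
        feats.append(v / v.max())                    # normalize to a maximum of one
    return np.asarray(feats)                         # (n_scans, 11)
```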
Vehicles 1, 2, 8, and 9 form a group of heavy tracked vehicles; we define these four vehicle types as our targets and the others as clutter. We perform ten-fold cross-validation: in each fold, we set aside ten percent of the events for testing and use the rest for training. We use the scans in the training data to train an ensemble of Gaussian RBF SVM classifier-agents, one for each target vs. clutter. We choose the SVM parameters for each classifier-agent by ten-fold cross-validation on the training data, and also obtain unbiased data for combiner training by ten-fold cross-validation.
We compare the proposed combiner with weighted-likelihood ratio normalization, where we model the likelihood using Platt probabilistic normalization; Neyman-Pearson combination using a joint Gaussian model; Neyman-Pearson combination using an independent Gaussian model; the any-combiner using Z-, F-, and EER-normalization; and a meta-SVM combiner using a non-linear Gaussian RBF SVM trained to classify the outputs of the agents. We train the meta-SVM combiner using the same parameters and procedure with which we train the classifier-agents. We repeat this experiment twenty-five times, randomizing the cross-validation indices each time, to get our final set of results.
The SVM agents classify scans. In order to classify events, we combine the scan-by-scan scores along the eighty scans for each event. Define $\Lambda_1, \dots, \Lambda_{80}$ as the combiner outputs along the eighty scans for an event. Assuming that the scans are independent, we combine the scores for an event to get a likelihood ratio of target vs. clutter according to
$$\Lambda_{\text{event}} = \prod_{t=1}^{80} \Lambda_t. \quad (27)$$
The two Neyman-Pearson combiners as well as the proposed combiner give score outputs that are already in the form of the factors in (27), so we sum the log of the combiner outputs over the scans to get the event score. In order to convert the Z-norm, F-norm, and EER-norm combiner outputs to a probability, we assume that the output of the combiner is conditionally Gaussian. For the meta-SVM, we train a Platt-scaling function on the meta-SVM output and convert it to a likelihood ratio as in Section 4.4.
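In code, these conversions amount to a few lines; the sketch below (our own names, with the Gaussian parameters fit on combiner-training data) shows the log-sum combination of (27) and the Gaussian conversion used for the normalization-based combiners:

```python
import numpy as np
from scipy.stats import norm

def event_log_lr(scan_lrs):
    """Event score per (27): the sum of per-scan log likelihood ratios."""
    return float(np.sum(np.log(np.maximum(scan_lrs, 1e-300))))

def event_log_lr_from_norm(scan_scores, mu_t, sd_t, mu_c, sd_c):
    """Z/F/EER-norm outputs: treat scores as conditionally Gaussian, form
    per-scan likelihood ratios, then combine them as above."""
    lrs = norm.pdf(scan_scores, mu_t, sd_t) / norm.pdf(scan_scores, mu_c, sd_c)
    return event_log_lr(lrs)
```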
Table 7 gives the PAUC in the 0%-10% PFA operating region, averaged over the twenty-five experiments, when classifying scans and events. The results show that the proposed combiner with Platt weighted-likelihood normalization achieves the highest PAUC in this region for both scans and events, and that the result is statistically significant at 95% confidence.