This section evaluates the running performance, flexibility, and system structure of our proposed hierarchy-based CAR system compared with E2E model-based systems. First, we explain the implementation of both types of systems, the datasets used, and the metrics used to measure running performance, flexibility, and structure.
4.2. Dataset
Two datasets are used in the experiments. The first is the MERL shopping dataset (DS 1) [26], which consists of 106 videos. Each video is about 2 min long, with a resolution of 920 × 680 and an FPS of 30. A fixed RGB top-view camera observes people shopping in a retail environment. The left of Figure 6 shows an example frame of DS 1 and the area division for each frame. Each video is annotated with the start and end frames of each behavior; the behavior types are as follows: reach to shelf (reach a hand into the shelf), retract from shelf (retract a hand from the shelf), hand in shelf (extended period with a hand in the shelf), inspect product (inspect a product while holding it in a hand; the type of product is not identified), and inspect shelf (look at the shelf while not touching or reaching for it).
The second dataset (DS 2) was collected by ourselves. It comprises 19 half-minute videos with a resolution of 480 × 640 and an FPS of 15. Similar to DS 1, we built a laboratory retail environment and installed an RGB top-view camera to obtain an occlusion-free view. Figure 6 also shows its example frame and area division. The videos were collected at a public event, where 19 random participants were asked to simulate shopping in front of the shelf one by one.
Figure 7 shows example behavior annotations of DS 2; the start and end frames of those behaviors are annotated in the videos. Besides the behavior annotations, we also annotate the bounding boxes of the body, the hands, and four products in each frame for the training of level 1 and level 2. Both the behavior and the bounding boxes were annotated by one of the authors. Additionally, due to privacy issues, DS 2 is currently not publicly available.
4.3. Performance Evaluation
To compare the performance of both systems in dealing with target changes, we designed three steps that simulate changes of the target behavior. As shown in Figure 5, both systems use the same method for object detection and tracking; thus, we skip the comparison of those parts, and the following three steps are designed for the customer behavior recognition experiments.
4.3.1. Step 1. Recognize Six Types of Behavior
At the beginning of building a CAR system, we select six common behavior definitions from existing research for recognition; these definitions are described in Table 3, where the symbol “A” refers to product A. For existing methods, all six types of behavior must be annotated to train models for recognition. In our proposed method, by contrast, only four events (highlighted in bold font in Table 3) are required to define these behaviors; each of the other behaviors can be regarded as a union set of these four events.
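To illustrate, the event-based definitions can be sketched as a small lookup table: each behavior is a union of primitive events, so recognition reduces to set operations instead of model re-training. The event and behavior names below are illustrative placeholders, not the exact entries of Table 3.

```python
# Four primitive events recognized at the lower levels (names are
# illustrative, not the exact bold entries of Table 3).
PRIMITIVES = ("reach_to_shelf", "retract_from_shelf",
              "hand_in_shelf", "inspect_product")

# Each behavior is a union of primitive events, so adding a behavior
# only adds a definition here -- no new training data is needed.
BEHAVIOR_DEFS = {
    "reach_to_shelf":     {"reach_to_shelf"},
    "retract_from_shelf": {"retract_from_shelf"},
    "hand_in_shelf":      {"hand_in_shelf"},
    "inspect_product":    {"inspect_product"},
    # Hypothetical composite behaviors defined as union sets:
    "interact_with_shelf": {"reach_to_shelf", "retract_from_shelf",
                            "hand_in_shelf"},
    "examine":             {"inspect_product", "hand_in_shelf"},
}

def recognize(active_events):
    """A behavior is active when any of its defining events is active
    (the union-set interpretation)."""
    active = set(active_events)
    return sorted(b for b, evts in BEHAVIOR_DEFS.items() if evts & active)
```

With this structure, the composite behaviors fire automatically whenever one of their constituent events fires, which is the property the hierarchy relies on.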
4.3.2. Step 2. Add New Behavior “Selecting”
Since the six types of behavior in step 1 cannot reveal the period of selecting products in the shelf area, we recognize a new behavior named “selecting” to reveal more details in the video. “Selecting” means that a customer is choosing products without making any picking decision. To add the recognition of a new type of behavior, the E2E models must be rebuilt and new training data are required to train them entirely. In the case of our proposed hierarchy, however, we only add a definition of “selecting” to step 2 of Table 3: the whole person’s region moves in the shelf area. It is worth mentioning that the hand is usually occluded by the shelf; therefore, we use the whole person’s region (including the arms) in the definition.
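A minimal sketch of how such a rule could look at the event-recognition level, assuming axis-aligned bounding boxes in (x1, y1, x2, y2) form; the function names and box format are our own illustration:

```python
def boxes_overlap(a, b):
    """True when two (x1, y1, x2, y2) boxes intersect."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]

def is_selecting(person_box, shelf_area):
    """'Selecting' rule (step 2): the whole person's region, including
    the arms, moves in the shelf area -- approximated here as the
    person's bounding box overlapping the shelf area."""
    return boxes_overlap(person_box, shelf_area)
```

Because the rule only consumes detector outputs, changing its definition does not require touching the detection or tracking models.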
4.3.3. Step 3. Discriminate the Behavior “Selecting”
Whether a person picks something with one hand or with both hands reveals more about potential intent than the summarized behavior “selecting”, so we look for more detail by discriminating “selecting”. Similarly to step 2, the E2E models require time-consuming data collection and re-training, whereas for the proposed hierarchy we only redefine “selecting” in step 3 of Table 3. Selecting by one hand means that a hand can be found outside the shelf; if no hand is outside the shelf, the customer should be selecting with both hands. The three steps simulate ever-changing needs that require acquiring more and more details from the video.
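The step-3 refinement can be expressed as one more rule on top of the hand detections; again a sketch with assumed (x1, y1, x2, y2) box conventions:

```python
def outside(box, area):
    """True when a (x1, y1, x2, y2) box has no overlap with the area."""
    return (box[2] <= area[0] or box[0] >= area[2]
            or box[3] <= area[1] or box[1] >= area[3])

def classify_selecting(hand_boxes, shelf_area):
    """Step-3 rule: if any detected hand is outside the shelf area,
    the customer selects with one hand; otherwise both hands are
    assumed to be inside the shelf."""
    if any(outside(h, shelf_area) for h in hand_boxes):
        return "selecting by one hand"
    return "selecting by both hands"
```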
Since there are no annotations for the behaviors defined in the three steps in DS 1, we annotate DS 1 with the newly defined behaviors by modifying its original annotations. Since different types of behavior are required in each step, the number of annotated behaviors varies across the three steps: DS 1 contains 6201, 7738, and 7844 behavior annotations in steps 1, 2, and 3, respectively, and DS 2 includes 153, 178, and 181. Since the two datasets are built in different retail environments, we run both systems on both datasets. For training the E2E models, we randomly choose 80% of each dataset as the training set and the remaining 20% as the test set.
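The 80%/20% split can be reproduced with a seeded shuffle; the helper name and seed are our own:

```python
import random

def split_dataset(video_ids, train_ratio=0.8, seed=0):
    """Randomly split video ids into training and test sets."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# DS 1 has 106 videos, so the split yields 84 training and 22 test videos.
train, test = split_dataset(range(106))
```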
Table 4 shows the results of running both systems once. mAP refers to the mean average precision of customer behavior recognition, CPU/GPU denotes the memory usage when running the system, and FPS represents the running speed. The results show that the E2E model-based system has better recognition accuracy but uses more CPU/GPU memory and runs slower.
Both systems share a relatively low mAP on DS 1, mainly for three reasons. First, DS 1 is larger than DS 2, so the models in both systems still need improvement to fit a larger dataset. Second, the detection model has poor accuracy on the various types of products, which feeds wrong information to the higher levels. Third, the fish-eye distortion of the camera in DS 1 produces a coordinate system that is harder to process than the direct top view in DS 2. Nevertheless, the models in the corresponding levels of our hierarchy-based system can be updated individually to solve these problems, whereas the E2E model-based system must be modified entirely because of its tightly coupled design.
Since more and more behaviors must be recognized from step 1 to step 3, the mAP and FPS of both systems decrease, and both use increasing amounts of memory. Although the hierarchy-based system does not outperform the E2E model-based system on mAP, the small difference implies that the proposed system is feasible; moreover, it runs faster and uses less memory, which indicates better efficiency.
4.4. Flexibility Evaluation
We design the three steps and prepare the two datasets not only to evaluate running performance, but also to evaluate both systems’ ability to fit changes of demands and environments, which we call “flexibility”. For the E2E model-based system, each kind of change requires time-consuming re-training of the E2E models; preparing their training data took about two months of annotating behaviors in DS 1 and DS 2. For the hierarchy-based system, behaviors are defined manually by events, so no training data are required for behavior recognition. To fit the three steps, namely the changes of demands, we only spent one day defining the new behaviors with events. To adapt to the changes of dataset and environment, we also spent one day modifying the parameters in level 3 (Section 3.4), where the events are recognized.
To quantify the amount of work a CAR system needs to fit a change of demands or environments, we calculate the flexibility with reference to the penalty of change in [36] as below:

F = 1 / (n · C_a), (1)

where n is the number of videos that must be annotated for the change, N is the number of videos in the dataset, and C_a is the cost of annotating each video. For the E2E model-based system, we should annotate the new behavior in the whole dataset for model training; therefore, n = N and F_E2E = 1/(N · C_a). For the hierarchy-based system, we need to define the new behavior by events, namely, annotate the new behavior by events. If we annotate one behavior in a video, we consider it defining the behavior by events, because we first recognize the events in our brain and then combine them to recognize the behavior. In DS 1 and DS 2, each video contains all types of behavior; therefore, annotating all behaviors in one video is considered as defining all new behaviors by events, so n = 1 and F_hier = 1/C_a. Thus,

F_hier / F_E2E = N. (2)

Equation (2) indicates that our proposed hierarchy-based system has a flexibility N times better than that of the E2E model-based system, where N = 106 for DS 1 and N = 19 for DS 2. This result shows that, compared to the time-consuming re-training of models, the proposed hierarchy-based system is malleable enough to adapt to changes of demands and environments.
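As a numeric check, assume flexibility is the reciprocal of the annotation penalty (our reading of the penalty of change in [36]): the E2E system must annotate all N videos, while the hierarchy needs only one video's worth of effort to define the behaviors by events.

```python
def flexibility(videos_to_annotate, cost_per_video=1.0):
    """Reciprocal of the annotation penalty (assumed form)."""
    return 1.0 / (videos_to_annotate * cost_per_video)

for name, n in (("DS 1", 106), ("DS 2", 19)):
    # The hierarchy annotates 1 video; the E2E system annotates all n.
    ratio = flexibility(1) / flexibility(n)
    print(f"{name}: {ratio:.0f}x more flexible")
```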
4.5. Structural Evaluation
In addition to the experiment that runs the two systems, we design a score to evaluate a system’s structure. The score is calculated from coupling, cohesion, and complexity [37], since these metrics reveal whether a system is efficient and easy to update. To calculate them, we first define the modules to be measured (i.e., the elementary entities): in the proposed system, each level is a module, while in E2E systems, each model is a module.
Based on the formula proposed in [37], we define the coupling of a module as

C_p = 1 − 1/(w + r + 1), (3)

where w is the count of calls to other modules’ functions and r is the count of the module’s functions called by other modules. For the existing system, Figure 5 shows that Model 2 has functions of object detection and tracking similar to those of Model 1; thus, C_p > 0 for Model 2 and for Model 1. For the hierarchy, no levels share similar functions; therefore, w = r = 0 and C_p = 0 for each level. After calculating the coupling of each module, the coupling of the whole system is defined as the average of all modules’ coupling values.
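The coupling computation can be sketched as follows, assuming the simplified form C_p = 1 − 1/(w + r + 1) (our reading of the metric adapted from [37]; it is 0 for a fully decoupled module and approaches 1 as w and r grow):

```python
def coupling(w, r):
    """Module coupling from w (calls made to other modules' functions)
    and r (own functions called by other modules). Assumed form:
    0 when fully decoupled, approaching 1 as w + r grows."""
    return 1.0 - 1.0 / (w + r + 1)

def system_coupling(modules):
    """System coupling = average of all modules' coupling values."""
    return sum(coupling(w, r) for w, r in modules) / len(modules)

# Hierarchy: no level shares functions with another, so w = r = 0
# for each level (one (w, r) pair per level, counts illustrative).
print(system_coupling([(0, 0), (0, 0), (0, 0)]))  # 0.0
```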
In the case of cohesion, both systems are estimated to have high cohesion within their modules, which means there is no difference in this metric; therefore, the cohesion calculation is omitted from the evaluation.
Finally, for the complexity calculation, we use a modified version of the formula in [37], from which some irrelevant values are excluded:

C_x = i + o + m, (4)

where i is the count of required inputs from the user, o is the count of outputs from the system to the user, and m is the count of model files in the system. In our case, i and o indicate the count of data types instead of the amount of data. For instance, i = 1 for the existing system because the input is an image, while i = 2 for the hierarchy because an extra input about the behavior definition is required. Finally, m refers to the number of models in each system.
To normalize the complexity value, the formula is modified as follows:

C_x = 1 − 1/(i + o + m), (5)

where C_x reaches its minimum value when i, o, and m take their smallest values, and increases to a value no more than 1 as i, o, and m increase. The complexity of the whole system is defined as the average of all modules’ complexity values.
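The normalized complexity and the system average can be checked with a short sketch, assuming the form C_x = 1 − 1/(i + o + m):

```python
def complexity(i, o, m):
    """Normalized module complexity from i input types, o output types,
    and m model files. Assumed form: grows with the counts but never
    reaches 1."""
    return 1.0 - 1.0 / (i + o + m)

def system_complexity(modules):
    """System complexity = average of all modules' complexity values."""
    return sum(complexity(i, o, m) for i, o, m in modules) / len(modules)

# A module with one image input, one output, and one model file:
print(complexity(1, 1, 1))
```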
Since a better system has lower coupling and lower complexity, we formulate the structure score S as

S = 1 − (α · C_p + (1 − α) · C_x), (6)

where α ∈ [0, 1] is the weight of the coupling C_p when designing a system; a larger α means more emphasis on improving C_p.
For the structure score, we set two parameters (α and n) and observe how the score changes, where n is the number of implemented models in each system. For the existing system, Figure 5 shows a simple system with only two models; however, more models are usually required in a common structure, such as the one in Figure 1, because it is difficult to recognize all customer behavior with only one model, and the same holds for the other types of data. Therefore, the coupling and complexity values may change with the increasing number of models.
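A small sweep in the spirit of Figure 8 can be sketched as follows, assuming the score form S = 1 − (α·C_p + (1 − α)·C_x), with hypothetical coupling and complexity values standing in for the two systems:

```python
def structure_score(cp, cx, alpha):
    """S = 1 - (alpha*coupling + (1-alpha)*complexity); assumed form.
    Higher is better; alpha weights the coupling term."""
    return 1.0 - (alpha * cp + (1.0 - alpha) * cx)

# Hypothetical values: the hierarchy has zero coupling, while the E2E
# system's coupling stays positive, so its score drops as alpha grows.
for alpha in (0.2, 0.5, 0.8):
    s_hier = structure_score(0.0, 0.6, alpha)
    s_e2e = structure_score(0.5, 0.6, alpha)
    print(alpha, round(s_hier, 2), round(s_e2e, 2))
```

Under these stand-in values, the gap between the two scores widens with α, which mirrors the trend the chart reports.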
Figure 8 shows the change of the structure score when modifying α and n. The x-axis is n, the y-axis is α, and the circle radius refers to the structure score: a larger circle represents a better score. The chart shows that our proposed system obtains a better score as α and n increase, and in most cases the proposed system scores better. Since a better structure score implies high efficiency and maintainability, Figure 8 indicates that our proposed hierarchy-based system is better in most cases. To sum up, the performance, flexibility, and structure evaluation results show that the proposed hierarchy-based system has better adaptability, efficiency, and maintainability.