1. Introduction
Nowadays, video surveillance cameras are deployed everywhere, inside and outside buildings, to monitor people and prevent law-breaking, violence, kidnapping, etc. However, these cameras produce massive volumes of video with extensive durations. Thus, searching for a certain activity within these videos involves browsing the entire video content from the beginning until the required activity appears, which is an exhausting and time-consuming operation. Different solutions have been proposed to tackle this challenge by summarizing these videos, such as video fast-forwarding [
1], video abstraction [
2], video montage, and video summarization [
3]. Some of these approaches summarize the video by omitting inactive frames or selecting keyframes, which leads to losing the original video's dynamic relations. The other approaches shift many space–time regions in both time and space and then stitch them together, leading to obvious stitching seams in the summary.
Recently, the research community presented a new smart technology, called video synopsis, that creates a condensed representation of the original video without losing significant activities from its content. This technology improves the usefulness of surveillance videos because it reduces the hours of captured video the end user must browse to minutes or seconds. Moreover, video synopsis is an activity-based, rather than frame-based, condensation technique, so it achieves higher efficiency, as its finer-grained analysis of video details allows better condensation. The video synopsis is generated by shifting all the objects in time so that they are presented simultaneously, creating a shorter video with a maximum number of activities, as shown in
Figure 1. The video synopsis framework incorporates four principal modules: object detection, object tracking, optimization of the cost function to obtain the optimal temporal rearrangement, and, finally, segmentation and stitching of the objects’ activities onto the generated background.
The video synopsis challenge is to find the best rearrangement of activities, exhibiting most of them in the shortest time span without collisions between activities. Recently, various video synopsis procedures have been proposed to tackle this challenge. Li et al. [
4] introduced a solution to the object collision issue in video synopsis by scaling down the colliding objects. In this technique, the objects are shifted in the temporal domain; then, if a collision is detected, the objects' sizes are reduced. A metric representing the minimization factor of each object is used in an optimization step. Although the problem of object collision is technically curtailed, the suggested approach might disturb the user: reducing object sizes gives the video synopsis an artificial appearance, so a car and a person displayed close to each other in a scene may appear to have equal sizes.
He et al. [
5,
6] defined three collision states between object activities, namely, collision-free, collision in the same direction, and collision in opposite directions, which represents a step further in the analysis of activity collisions. They also proposed a collision graph-based optimization strategy that fills and rearranges the activity tubes in a deterministic manner to reduce the computational complexity. Hence, a more elaborate activity collision analysis is afforded compared to other video synopsis studies. However, besides the improvements accomplished via collision minimization, other metrics, such as chronological order and activity cost, are disregarded. Accordingly, their optimization approaches for finding the optimal temporal rearrangement of the activities still need further development.
Moreover, Nie et al. [
7] presented a video synopsis technique that moves the objects in the temporal and spatial domains to produce a condensed video while reducing their collisions. On the other hand, Lin et al. [
8] introduced a distributed processing approach to decrease the computational complexity of creating a video synopsis. The original video is partitioned into several segments, and each segment is assigned to a specific computer to exploit multi-core capabilities. Raman [
9] suggested a video synopsis procedure that maintains the relationships among objects. In this method, the interaction between objects is measured using the differences among the objects' tubes; if the difference is higher than a predetermined threshold, the tubes are consolidated into a tube set. Ghatak et al. [
10] presented an effort to minimize the activity loss, the number of collisions, and the cost of temporal consistency. They proposed an improved energy minimization strategy that hybridizes simulated annealing (SA) and teaching–learning-based optimization (TLBO) to reach a globally optimal solution with reduced computational processing time.
Additionally, Ghatak and Rup [
11] evaluated the performance of various optimization techniques for energy minimization in video synopsis, namely, SA, the cultural algorithm (CA), TLBO, the forest optimization algorithm (FOA), the gray wolf optimizer (GWO), the non-dominated sorting genetic algorithm-II (NSGA-II), the JAYA algorithm, the elitist-JAYA algorithm, and the self-adaptive multi-population-based JAYA algorithm (SAMP-JAYA). This study infers that the present meta-heuristic methods are incapable of reducing the energy consistently, though they are widely applied to minimize the energy when generating video synopses. In [
12], Ghatak et al. suggested improving the energy minimization procedure utilizing a hybridized algorithm combining SA and JAYA. Yao et al. [
13] suggested using the genetic algorithm (GA) to produce a new formula for minimizing the energy function. Furthermore, Xu et al. [
14] recommended a GA-based optimization scheme to resolve the object tube merging problem when originating a video synopsis. They deduced that the GA-based method outperforms the SA-based one in terms of information loss and time consumption. Moussa and Shoitan [
15] utilized particle swarm optimization to arrange the object tubes using an energy minimization function to decrease the collision, preserve the chronological order, and relate the objects.
Huang et al. [
16] confirmed the prominence of online optimization techniques, which allow tubes to be rearranged at detection time without waiting for the optimization process to begin. The most significant issue with their approach is that it totally ignores the activity collision states to enhance running-time performance. Another defect of their optimization technique is the use of a manually determined threshold value instead of a decision technique. A trade-off between running time and condensation ratio also appears, which reduces precision.
Some other studies have addressed the video synopsis from other points of view. Feng et al. [
17] introduced a background generation method that chooses the video frames with the most activities and background variations. Baskurt and Samet [
18] presented an adaptive background generation technique to increase the robustness of object detection. Afterward, Feng et al. [
19] proposed a tracking method to overcome object blinking, which is responsible for the appearance of ghost objects in video synopsis. Baskurt and Samet [
20] proposed a tracking approach that concentrates on long-term tracking to represent each target object with only one activity in the created video synopsis. Lu et al. [
21] concentrated on the defects of object detection techniques, such as shadows and breaks in object tracking, which reduce the efficiency of content analysis. Hsia et al. [
22] focused on introducing an efficient search technique over an object activity database to produce a synopsis video; a range tree technique was suggested to select object tubes and efficiently reduce the algorithm complexity. Ghatak et al. [
23] explored the notion of a multi-frame and scale procedure together with generative adversarial networks (MFS–GANs) to extract the foreground. A hybrid algorithm combining GWO and SA (HGWOSA) is suggested as the optimization algorithm to achieve the globally optimal result with a low computation cost.
On the other hand, different researchers have addressed the grouping of similar activities in video synopsis systems based on a matching strategy or a user-defined query. Lin et al. [
24] suggested an approach for video synopsis generation incorporating clustering activities and anomaly detection, object tracking, and optimization. Namitha and Narayanan [
25] provided a technique to maintain, in the synopsis video, the relationships among the object tubes of the input video. In the first stage, a recursive tube-grouping algorithm finds the interaction behavior between tubes and groups relevant tubes into tube sets. The second stage optimally rearranges the tubes in the synopsis using a spatial–temporal cube voting approach. Finally, an algorithm that estimates the synopsis video duration by measuring the entropy of tube collisions is introduced. Pritch et al. [
26] proposed a real-time video synopsis that responds to a user query by showing the activities during a specific duration of an endless webcam or surveillance camera stream. Pritch et al. [
27] introduced a video synopsis showing similar activities with the same appearance and motion features.
Ahmed et al. [
28] proposed a video synopsis technique for traffic monitoring applications using a user query based on object attributes, such as the object classes and movements. First, the moving objects are tracked and classified using deep learning into different categories (e.g., car, pedestrian, and bike). Second, a query is obtained from a user, and the tubes fulfilling the query are blended onto the background frame to generate the synopsis. Namitha et al. [
29] proposed an interactive visualization technique to build the synopsis video. Basic visual features, such as color and size, together with some spatial features, are used to retrieve the specific objects to be included in the synopsis. YOLOv3 and Deep-SORT are utilized for the detection and tracking stage. The technique performs tube grouping to preserve the relations between objects and uses a space–time cube algorithm to arrange the tube groups within a predefined synopsis length.
Although all the aforementioned clustering and user query-based methods solve the issue of creating an unsatisfactory synopsis video for a crowded scene due to collisions, they do not consider specific appearance attributes, such as gender, age, carrying something, having a baby buggy, and upper and lower clothing colors. These attributes can help the camera operator find a suspected person using an appearance description or a particular action happening in the scene. To achieve this goal, the recorded video must be analyzed according to the user's requests, and a synopsis video constructed to meet those requirements. Accordingly, the process involves video analysis to retrieve the user's request and an optimization stage to build the synopsis efficiently.
In this paper, a framework that sustains a smart, condensed video synopsis system relying on prescribed user recommendations is developed. The proposed system utilizes a highly detailed user-defined description of the desired persons and then arranges them using an intelligent optimization technique, the whale optimization algorithm [
30], to construct a low-collision condensed synopsis.
The contributions of this work can be summarized as follows:
The proposed technique permits the user to stipulate a detailed description of the desired persons in three distinct aspects, namely, precise visual appearance, motion description, and access to regions of interest in the scene, contrary to traditional user-query synopsis methods.
Several detailed, distinct descriptions of a person are employed to design a user-defined query. These descriptions incorporate the person's elaborated visual appearance, motion style and type, and behavior concerning the regions of interest.
Persons’ tubes are generated and assembled based on the relationships defined by the user’s query. Furthermore, using an intelligent optimization method, the whale optimization algorithm, the retrieved tubes are arranged to construct a highly visually intelligible synopsis video, preventing false overlaps between the persons as well as preserving the chronological order of correlated tubes.
The sections henceforth are arranged as follows:
Section 2 describes the details of the proposed approach,
Section 3 demonstrates the experimental results, and, finally,
Section 4 contains the conclusion.
2. Methodology
Although video synopsis is an emerging technology in video analysis research, it faces different challenges, such as creating a synopsis video that involves a suspected person having a specific appearance description consistent with user preferences. In the proposed system, the user submits a query that specifies the detailed descriptions of the desired persons to retrieve. The descriptions encompass appearance features, such as gender, age (5 age ranges), carrying something or not, having a baby buggy or not, upper clothing color (11 colors), and lower clothing color (11 colors). Furthermore, the user can request to retrieve persons based on their moving direction (8 directions). Moreover, users may desire to retrieve persons entering or exiting a specific region of interest, or based on their motion speed.
Figure 2 illustrates the proposed system architecture. The suggested system proceeds in two phases, each incorporating several steps. The first phase comprises extracting the background, tracking the existing persons, and generating their corresponding tubes. Moreover, during this phase, visual appearance and motion features are extracted. In the second phase, on the other hand, a user-defined query is used to retrieve the desired person’s tubes. These tubes are then arranged and utilized to construct the synopsis video.
As can be noticed in
Figure 2, the first phase commences with the camera operator selecting, from the video store, the video that contains the required person. Then, the temporal median method is applied to estimate the background. Afterward, the persons' bounding boxes are extracted from the selected video using the adopted detection and tracking algorithms. Finally, the visual features, comprising gender, age, carrying something or not, having a baby buggy or not, and upper and lower clothing colors, as well as the motion features, namely the motion speed, motion direction, and the person's access to a region of interest, are extracted for each person tube using a person attribute recognition algorithm. In the second phase, the extracted visual and motion features are stored in the database, and the person tubes that satisfy the user query are retrieved. Subsequently, a whale optimization algorithm is applied to determine the best starting time of each retrieved person tube that minimizes the synopsis length. Eventually, the retrieved person tubes are segmented and stitched onto the estimated background to generate the video synopsis.
2.1. Phase 1: Tube Generation and Feature Extraction
This phase aims to detect and track multiple persons to generate tubes corresponding to each person and extract elaborated features for each person’s tube.
2.1.1. Background Estimation
The first step in video synopsis is extracting the background onto which the generated person tubes are stitched. The temporal median method is applied to groups of 25 neighboring frames, exploiting the fact that surveillance videos have fixed backgrounds with little change in illumination. The background estimation step impacts the visual quality of the synopsis video, but it does not affect the effectiveness of its condensation.
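A minimal sketch of this step is shown below, assuming frames are read with OpenCV; the function name and window handling are illustrative rather than the exact implementation.

```python
import cv2
import numpy as np

def estimate_background(video_path, window=25):
    """Estimate a static background as the per-pixel temporal median
    over `window` neighboring frames (assumes a fixed camera)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < window:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # The median over the time axis suppresses moving foreground objects,
    # since each pixel shows the background in most of the frames.
    return np.median(np.stack(frames), axis=0).astype(np.uint8)
```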
2.1.2. Person Tracking and Tube Creation
In this step, the ByteTrack algorithm is utilized to build a motion tube (tracklet) for each person, which is a group of bounding boxes throughout the video. The ByteTrack algorithm is carried out in three stages: object detection, object localization, and association. The object detection stage is responsible for recognizing objects within the frame; the YOLOX model was adopted for this task. Afterward, the Kalman filter performs object localization by predicting the location of each object in the next frame. The BYTE algorithm is then utilized for the association process to decide whether objects in different frames belong to the same identity. BYTE considers all detected boxes, not only the high-scoring ones. First, it links high-score detected boxes with existing tracklets. Nevertheless, due to occlusion, size variation, and motion blurring, some tracklets remain unmatched to any high-score detected box; these tracklets are then matched to the low-score detected boxes. This strategy guarantees higher tracking performance and less identity switching than traditional multi-object tracking algorithms [
31,
32].
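The two-stage association at the core of BYTE can be sketched as follows; this is a simplified illustration under assumed IoU thresholds, not the published ByteTrack implementation (which also handles track initialization, Kalman prediction, and track termination).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(tracklets, detections, min_iou=0.2):
    """Hungarian matching between predicted tracklet boxes and detections."""
    if not tracklets or not detections:
        return [], list(range(len(tracklets))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracklets])
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
    un_t = [i for i in range(len(tracklets)) if i not in {r for r, _ in pairs}]
    un_d = [j for j in range(len(detections)) if j not in {c for _, c in pairs}]
    return pairs, un_t, un_d

def byte_associate(tracklet_boxes, det_boxes, det_scores, high_thresh=0.6):
    """Two-stage BYTE association: match high-score detections first,
    then try to rescue unmatched tracklets with low-score detections."""
    high = [i for i, s in enumerate(det_scores) if s >= high_thresh]
    low = [i for i, s in enumerate(det_scores) if s < high_thresh]
    # Stage 1: high-score detections vs. all predicted tracklet boxes.
    pairs, un_t, _ = match(tracklet_boxes, [det_boxes[i] for i in high])
    matches = [(t, high[d]) for t, d in pairs]
    # Stage 2: remaining tracklets (e.g., occluded persons) vs. low-score boxes.
    pairs2, _, _ = match([tracklet_boxes[i] for i in un_t],
                         [det_boxes[i] for i in low])
    matches += [(un_t[t], low[d]) for t, d in pairs2]
    return matches
```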
2.1.3. Visual Appearance Features Extraction
For each tracked person, different attributes describing their visual appearance, such as age, gender, and upper and lower clothing colors, are extracted. In the proposed algorithm, one of the part-based person attribute recognition algorithms, the attribute localization module [
33] (ALM), is used. The advantage of ALM is that it concentrates on the person's parts, improving person attribute recognition. In the ALM algorithm, each person bounding box is fed into a main network with a feature pyramid architecture; the features generated at different levels are then sent to a group of attribute localization modules that apply attribute localization and region-based feature learning to obtain the attribute vector. Each attribute localization module serves one attribute at a single level. The features of each pedestrian bounding box are extracted using the batch normalization inception (BN-Inception) network as the backbone, and each attribute localization module is built on a simplified spatial transformer network (STN) [
34]. The algorithm is trained on one of the person attribute recognition datasets, which is PETA [
35].
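As a rough illustration of the idea behind an attribute localization module, the following PyTorch sketch regresses a cropping window with a simplified STN and classifies the attribute from the cropped features; the layer sizes and constraints are assumptions, and the published ALM design is considerably richer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeLocalizationHead(nn.Module):
    """Sketch of one attribute head at one pyramid level: predict an
    affine crop (scale + translation) over the feature map, then
    classify the attribute from the cropped region."""
    def __init__(self, in_channels, num_classes=1):
        super().__init__()
        self.loc = nn.Linear(in_channels, 4)          # sx, sy, tx, ty
        self.classifier = nn.Linear(in_channels, num_classes)

    def forward(self, feat):                          # feat: (N, C, H, W)
        pooled = F.adaptive_avg_pool2d(feat, 1).flatten(1)
        sx, sy, tx, ty = torch.sigmoid(self.loc(pooled)).unbind(1)
        theta = torch.zeros(feat.size(0), 2, 3, device=feat.device)
        theta[:, 0, 0] = sx                           # horizontal scale
        theta[:, 1, 1] = sy                           # vertical scale
        theta[:, 0, 2] = tx * 2 - 1                   # translation in [-1, 1]
        theta[:, 1, 2] = ty * 2 - 1
        grid = F.affine_grid(theta, feat.size(), align_corners=False)
        region = F.grid_sample(feat, grid, align_corners=False)
        region_vec = F.adaptive_avg_pool2d(region, 1).flatten(1)
        return self.classifier(region_vec)

# Example: one head per attribute on a hypothetical (N, 256, 16, 8) level.
feat = torch.randn(4, 256, 16, 8)
head = AttributeLocalizationHead(in_channels=256)
logits = head(feat)  # (4, 1) attribute logits
```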
2.1.4. Motion Features Extraction
Retrieving persons based on their motion in the video is critical for determining who enters various locations within the surveillance scope. Furthermore, specifying the motion style affords awareness of the actions taking place in the scene. The proposed system provides motion information that can assist the individual monitoring the camera in revealing suspected actions. The variations in the location of each person's bounding boxes are employed to express their movement through three aspects, as sketched in the code example after the following list:
Motion style: the speed of change of the bounding boxes' centroids can indicate whether the person is running, walking, or stopping at a specific area for a while;
Motion direction: the system can determine the route of each person through the 8 main directions (north, south, east, west, north-east, north-west, south-east, south-west);
Accessing regions: the camera operator can specify some regions of interest in the surveillance scope to recognize the persons who entered or exited these regions.
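A minimal sketch of these three motion descriptors, computed from a tube's centroid trajectory, is given below; the speed thresholds and region format are illustrative assumptions.

```python
import numpy as np

# Illustrative labels; thresholds (pixels/frame) would be tuned per camera.
DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def motion_features(centroids, stop_thresh=1.0, run_thresh=8.0):
    """Derive motion style and dominant direction from a tube's
    bounding-box centroids, an (N, 2) array of (x, y), N >= 2."""
    c = np.asarray(centroids, dtype=float)
    steps = np.diff(c, axis=0)                    # per-frame displacement
    speed = np.linalg.norm(steps, axis=1).mean()  # mean pixels per frame
    if speed < stop_thresh:
        style = "stopping"
    elif speed > run_thresh:
        style = "running"
    else:
        style = "walking"
    dx, dy = c[-1] - c[0]
    # Image y grows downward, so negate dy to get compass-style angles.
    angle = np.arctan2(-dy, dx)
    idx = int(np.round(angle / (np.pi / 4))) % 8
    return style, DIRECTIONS[idx]

def accessed_region(centroids, region):
    """True if the tube enters the rectangular region (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = region
    c = np.asarray(centroids, dtype=float)
    inside = (c[:, 0] >= x1) & (c[:, 0] <= x2) & \
             (c[:, 1] >= y1) & (c[:, 1] <= y2)
    return bool(inside.any())
```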
2.2. Phase 2: Persons Retrieval and Synopsis Generation
This phase commences with a user-defined query specifying the desired person's description. The query is constructed by giving the camera operator a set of options to select the required person's specifications, such as age, gender, carrying something or not, having a baby buggy or not, lower and upper clothing colors, motion style, motion direction, and access to a region of interest. The attributes from the user query are then matched with the extracted attributes of each person tube to select only the matching ones. Afterward, the matched persons are segmented and stitched onto the estimated background in an optimized order.
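The matching step can be pictured as a simple attribute-by-attribute comparison, as in the following sketch; the dictionary keys and sample values are hypothetical stand-ins for the attributes listed above.

```python
# Hypothetical attribute dictionaries; keys mirror the options listed above.
all_tubes = [
    {"id": 0, "attributes": {"gender": "female", "upper_color": "red",
                             "motion_style": "walking"}},
    {"id": 1, "attributes": {"gender": "male", "upper_color": "blue",
                             "motion_style": "running"}},
]

query = {"gender": "female", "upper_color": "red"}

def matches_query(tube_attrs, query):
    """A tube is retrieved only if every attribute specified in the query
    equals the recognized attribute; omitted attributes act as wildcards."""
    return all(tube_attrs.get(k) == v for k, v in query.items())

retrieved = [t for t in all_tubes if matches_query(t["attributes"], query)]
print([t["id"] for t in retrieved])  # -> [0]
```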
2.2.1. Optimization
The visual appearance and motion features extracted from each tube are compared with the user query to determine the matching tubes. These person tubes are then arranged so as to preserve their chronological order. The whale optimization algorithm is employed to determine each person's appearance order and starting time based on a fitness function.
- (a)
Fitness function
The fitness function enforces several constraints: building a synopsis that is as short as possible, preventing collisions between persons, preserving true collisions, and maintaining the correlation order. Each of these terms has a weight value that the user can tune to set its degree of importance. The proposed fitness function is expressed as

$$E = w_{1} E_{Length} + w_{2} E_{collision} + w_{3} E_{true\_collision} + w_{4} E_{temporal}, \qquad (1)$$

where $E_{Length}$, $E_{collision}$, $E_{true\_collision}$, and $E_{temporal}$ represent the synopsis length cost, the activity collision cost, the true collision cost, and the temporal consistency cost, respectively. Moreover, $w_{1}$, $w_{2}$, $w_{3}$, and $w_{4}$ symbolize the synopsis length weight, collision weight, true collision weight, and temporal consistency weight, respectively. $E_{Length}$ is responsible for keeping the synopsis length as short as possible so that it does not exceed the longest tube, while $E_{collision}$ reduces the objects’ collisions after the object tubes are mapped. On top of that, the role of $E_{true\_collision}$ is to carry the intersection relations of the original video over to the synopsis video, and $E_{temporal}$ preserves the chronological order of the object tubes in the generated synopsis. The whale optimization algorithm minimizes this fitness function to find each tube’s starting time in the synopsis video such that tube collisions are reduced, the synopsis length is decreased, the intersection relations are maintained as much as possible, and the chronological order is preserved. Additionally, the proposed algorithm attempts to maintain the counterpart relation, in which two tubes appear temporally and spatially near each other most of the time: if their temporal overlap exceeds 75% and their spatial distance does not exceed twice the person’s width, the two tubes are coupled together (see the sketch below). Eventually, the synopsis video is created by stitching the motion tubes of the persons in the order given by the optimization step.
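The counterpart (coupling) rule can be sketched as follows, assuming hypothetical tube objects that expose start/end frames and a per-frame centroid accessor; only the 75% temporal overlap and the 2× person-width thresholds come from the description above.

```python
import math

def should_couple(tube_a, tube_b, person_width, overlap_ratio=0.75):
    """Couple two tubes (shift them together in the synopsis) when they
    overlap temporally for more than 75% of the shorter tube and stay,
    on average, within twice a person's width of each other."""
    start = max(tube_a.start, tube_b.start)
    end = min(tube_a.end, tube_b.end)
    shared = max(0, end - start)
    shorter = min(tube_a.end - tube_a.start, tube_b.end - tube_b.start)
    if shorter == 0 or shared / shorter <= overlap_ratio:
        return False
    # Mean Euclidean centroid distance over the shared frames.
    dists = [math.hypot(tube_a.centroid(f)[0] - tube_b.centroid(f)[0],
                        tube_a.centroid(f)[1] - tube_b.centroid(f)[1])
             for f in range(start, end)]
    return sum(dists) / len(dists) <= 2 * person_width
```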
- (b)
Whale optimization algorithm
The whale optimization algorithm is a meta-heuristic optimization algorithm that mimics the humpback whale's hunting behavior. This foraging behavior is known as the bubble-net feeding method and is performed by creating distinctive bubbles along a circular or ‘9’-shaped path. The hunting behavior is simulated by using a spiral model to imitate the bubble-net attacking mechanism of humpback whales. The mathematical model of the whale optimization algorithm includes encircling the prey, the spiral bubble-net attacking method (exploitation phase), and searching for prey (exploration phase).
Humpback whales first recognize the prey’s location and then encircle it. The whale optimization algorithm assumes that the best current solution is the location of the target prey or close to it. After the best search agent is determined, the other search agents update their positions toward it. This behavior is represented by the following equations:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \right|, \qquad \vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D},$$

where $\vec{X}^{*}$ is the position vector of the best solution found so far (updated at each iteration whenever a better solution is found), $\vec{X}$ is the position vector of a search agent, $t$ is the current iteration, and $\vec{A}$ and $\vec{C}$ are coefficient vectors given by the following equations:

$$\vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a}, \qquad \vec{C} = 2\vec{r},$$

where the component $\vec{a}$ is decreased from 2 to 0 linearly over a pre-defined number of iterations and $\vec{r}$ is a random vector with values in $[0, 1]$.
Two approaches for modeling the bubble-net behavior of the humpback whale are presented:
- (a)
Shrinking encircling mechanism
The bubble-net attacking behavior is modeled here by decreasing the value of vector $\vec{a}$; consequently, the fluctuation range of vector $\vec{A}$ also decreases, and $\vec{A}$ takes random values in the interval $[-a, a]$. When $|\vec{A}| < 1$, the updated position of a search agent lies between its original position and the position of the best current agent.
- (b)
Spiral updating behavior
First, the distance between the location of the whale $(X, Y)$ and the location of the prey $(X^{*}, Y^{*})$ is calculated. After that, a spiral equation between the whale position and the prey position is constructed to simulate the helix-shaped movement of humpback whales, described as follows:

$$\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t),$$

where $\vec{D}' = \left| \vec{X}^{*}(t) - \vec{X}(t) \right|$ is the distance between the $i$th whale and the prey location (the best solution found so far), $l$ is a number taken randomly from the interval $[-1, 1]$, and $b$ is a constant defining the shape of the logarithmic spiral.
The humpback whales swim around the prey simultaneously inside a shrinking circle and along a spiral-shaped path. To update the whales’ positions during the optimization process, it is assumed that there is a 50% probability of choosing between the shrinking encircling mechanism and the spiral path. The mathematical model of the two behaviors is described as follows:

$$\vec{X}(t+1) = \begin{cases} \vec{X}^{*}(t) - \vec{A} \cdot \vec{D}, & p < 0.5, \\ \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t), & p \geq 0.5, \end{cases}$$

where $p$ is a random number in $[0, 1]$.
Humpback whales also search for prey randomly according to each other’s locations. Consequently, vector $\vec{A}$ is allowed to take random values greater than 1 or less than −1 to force a search agent to move far away from a reference whale. In contrast to the exploitation phase, in the exploration phase the search agent’s position is updated according to a randomly chosen agent rather than the best agent found so far; this mode is applied when $|\vec{A}| > 1$:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_{rand} - \vec{X} \right|, \qquad \vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D},$$

where $\vec{X}_{rand}$ is a position vector chosen randomly from the current population.
In the proposed system, the whale optimization technique is used to minimize the fitness function in Equation (1) and find the starting positions of the person tubes that minimize the synopsis length. First, the prey's location is represented by a vector whose length equals the number of person tubes matching the query. Then, each element of this vector is initialized with the starting frame at which the corresponding person tube would be stitched into the synopsis video. At each iteration, the whale optimization algorithm attempts to find the best starting frame for each person tube that minimizes the fitness function.
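The following sketch shows how such a WOA loop over per-tube starting frames might look; the fitness here keeps only simplified length and temporal overlap terms as a stand-in for Equation (1), and all parameter values are illustrative.

```python
import numpy as np

def woa_minimize(fitness, dim, bounds, n_whales=30, n_iter=200, b=1.0):
    """Minimal whale optimization algorithm sketch. Each whale is a
    candidate vector of per-tube starting frames within `bounds`."""
    lo, hi = bounds
    X = np.random.uniform(lo, hi, size=(n_whales, dim))
    best = min(X, key=fitness).copy()
    for t in range(n_iter):
        a = 2 - 2 * t / n_iter                      # decreases linearly 2 -> 0
        for i in range(n_whales):
            r = np.random.rand(dim)
            A, C = 2 * a * r - a, 2 * np.random.rand(dim)
            if np.random.rand() < 0.5:              # encircling / exploration
                if np.abs(A).mean() < 1:            # simplification of |A| < 1
                    ref = best                      # exploit best solution
                else:
                    ref = X[np.random.randint(n_whales)]  # explore randomly
                D = np.abs(C * ref - X[i])
                X[i] = ref - A * D
            else:                                   # spiral bubble-net move
                l = np.random.uniform(-1, 1, dim)
                D = np.abs(best - X[i])
                X[i] = D * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lo, hi)
        cand = min(X, key=fitness)
        if fitness(cand) < fitness(best):
            best = cand.copy()
    return np.rint(best).astype(int)

# Hypothetical simplified fitness over tube start frames: penalize total
# synopsis length plus pairwise temporal overlap as a crude collision term.
def make_fitness(durations, w_len=1.0, w_col=10.0):
    durations = np.asarray(durations)
    def fitness(starts):
        ends = starts + durations
        length = ends.max() - starts.min()
        collision = 0.0
        for i in range(len(starts)):
            for j in range(i + 1, len(starts)):
                overlap = min(ends[i], ends[j]) - max(starts[i], starts[j])
                collision += max(0.0, overlap)
        return w_len * length + w_col * collision
    return fitness

starts = woa_minimize(make_fitness([120, 80, 200]), dim=3, bounds=(0, 300))
```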
2.2.2. Segmentation, Stitching, and Synopsis Creation
The Poisson technique is used to stitch the tubes onto the estimated background. First, a mask is constructed for each object in its bounding box using specific morphological operations to obtain the object without the surrounding background, making the image appear more natural once the object tubes are stitched. Subsequently, the segmented objects are stitched onto the estimated background using Poisson-based seamless cloning.
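In OpenCV, this stitching step corresponds to Poisson-based seamless cloning, as in the sketch below; the morphological cleanup is an assumed placeholder for the specific procedures used.

```python
import cv2
import numpy as np

def stitch_person(background, person_bgr, mask, center):
    """Blend one segmented person into the estimated background with
    OpenCV's Poisson-based seamless cloning. `mask` is a binary uint8
    mask of the person inside the bounding-box crop, and `center` is
    the integer (x, y) position of the box center in the background."""
    # Light morphological cleanup so the mask hugs the silhouette.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.dilate(mask, kernel, iterations=1)
    return cv2.seamlessClone(person_bgr, background, mask, center,
                             cv2.NORMAL_CLONE)
```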