1. Introduction
Across many sectors of contemporary society, closed-circuit television (CCTV) cameras have been widely deployed to monitor and record safety incidents and criminal activities. The growing concern for public safety has led to increasing demand for CCTV and security surveillance. According to The Business Research Company’s “Global Security Surveillance System Market Report 2023” [
1], the global security surveillance system market is forecast to expand from USD 130.08 billion in 2022 to USD 148.29 billion in 2023, at an annual growth rate of 14.0%. With the escalating need for safety in high-risk areas, the surveillance system market is anticipated to reach USD 230.47 billion by 2027.
Surveillance systems are engineered to identify patterns within monitored scenes, including risky behavior, abnormal actions, and incidents. Extensive research has been conducted on these systems, primarily focusing on the detection of abnormal behavior such as ‘fighting’ and ‘fainting’, using technologies such as object detection and recognition, tracking, pose estimation, movement detection, and anomaly detection of objects [
2,
3,
4,
5,
6,
7,
8,
9]. Jha et al. [
8] proposed an N-YOLO model designed for the detection of abnormal behaviors, such as fighting. This model tracks the interrelationship of detection results in subimages, integrating them with the inference outcomes through a modified YOLO [
10]. In a related study, Kim et al. [
9] introduced the AT-Net model, specifically designed for abnormal situation detection. The model aims to mitigate classification ambiguities and minimize information loss by integrating object detection and human skeletal information.
These various studies [
2,
3,
4,
5,
6,
7,
8,
9] have improved performance by incorporating diverse feature information related to abnormal behavior. Nevertheless, conventional methods exhibit three main limitations.
Firstly, these methods do not adequately consider the spatial context of abnormal situations. Even with identical objects and behavior, interpretations may differ significantly depending on the spatial context, calling for different responses to specific risk scenarios. For instance, sitting on a bench and sitting on a railing or cliff are both sitting actions, yet they carry very different risk levels: sitting on a bench is safe, whereas sitting on a cliff is extremely dangerous.
Secondly, current surveillance systems designed to detect potentially hazardous scenarios typically present their outcomes as alerts on the monitoring screen or as recorded footage featuring the identified risk elements. However, actions exhibiting more movement than the surrounding behavior often attract attention without indicating an actual accident. For instance, conventional systems may flag regular running in a park as abnormal simply because it involves more movement than walking or sitting. The observer must then re-examine the scene to interpret the situation and determine the precise nature of the detected behavior. For a more accurate and prompt interpretation of risk situations, surveillance systems must therefore consider not only the actions of the detected objects but also the surrounding environment and situation, assess the overall risk level, and provide the observer with specifically interpreted information.
Thirdly, surveillance systems should be able to detect and interpret a broad range of risk elements and accidents, rather than being limited to specific locations or behaviors. For example, a system designed to detect drowning in specific locations, such as swimming pools, may efficiently identify dangerous situations by employing object detection and movement-based anomaly detection with learned behavior patterns associated with drowning. However, CCTV systems installed in public spaces or on streets require versatile detection capabilities, and the spatial context required for detection differs from that of the pool example. Many categories of risk arise in a variety of environments, involving hazardous elements such as fires and safety-related incidents such as traffic accidents or altercations. Relying solely on specific technologies makes it difficult to detect such a wide array of situations and types, limiting the applicability of conventional methods in surveillance systems.
In summary, surveillance systems must be able to comprehensively leverage spatial information, convey results in a format that the observer can grasp rapidly and accurately, and represent potential risk factors in diverse situations as learnable data. To address these challenges, this paper introduces a novel approach for a surveillance system.
Humans possess a structured, high-level expressive tool: language. Applying human-friendly natural language to the surveillance system can effectively address the aforementioned challenges. A situation to be detected by the surveillance system can be articulated as a sentence that expresses the danger of a scene and explains an accident, encompassing information such as the characteristics of the object, the presence or absence of a person, the distinction between adults and children, the type of accident, the place of occurrence, and nearby risk factors. When the scene is conveyed through such sentences, the observer can promptly comprehend the situation.
The rapid advancement of natural language processing (NLP) technology has paved the way for the utilization of words and sentences as data for learning, leading to the development of various models. Large language models (LLMs), in particular, which have been trained on extensive language datasets, exhibit remarkable performance in a multitude of natural language processing tasks. These large language models are proficient in diverse tasks such as sentence generation, translation, text summarization, and question answering.
Image captioning is a task at the intersection of computer vision and NLP that generates natural-language text describing the content of an image. By incorporating image captioning technology, enhanced with large language models offering high generalization performance, into surveillance systems, it is possible to represent images through information-rich sentences. Image captioning is exceptionally well suited to detecting hazardous behavior and analyzing accident scenarios because it can offer a detailed representation of the scene, including information about individual characteristics, actions, and the spatial context. Furthermore, by enabling the system to autonomously assess risk levels for the captions (sentences) generated through image captioning, safety incidents can be mitigated proactively, allowing prompt and accurate responses to any accidents that occur.
In this paper, we introduce a novel surveillance system designed to surpass the limitations of traditional surveillance systems, which typically focus on object-centric behavior analysis. Our system generates descriptive captions covering the objects, actions, and spatial context extracted from surveillance footage. These captions are then used to assess the risk level of the observed scene. To generate captions for surveillance scenes, we construct a dataset of [Image-Caption-Danger Score] triplets. The caption data follow a novel sentence format, deviating from conventional caption structures, and cover objects, actions, and spatial context to facilitate comprehensive interpretation. To enable the image captioning model to interpret scenes, we utilized BLIP-2 [
11], a large multimodal model well regarded for handling multimodal tasks efficiently, with few trainable parameters and low training cost, while delivering state-of-the-art performance. We fine-tuned the BLIP-2 [
11] model with the newly constructed dataset to guide it in generating captions that adhere to the newly defined sentence structure. Subsequently, we utilized BERT (bidirectional encoder representations from transformers) [
12] to interpret the semantic content of the generated sentences and assess the risk level associated with each scene.
The overall structure of this paper is as follows. In
Section 2, we delve into general research related to image captioning, existing studies on surveillance systems that incorporate image captioning, and the exploration of LLMs.
Section 3 provides a detailed description of our newly constructed dataset and the overarching system structure.
Section 4 presents a comprehensive evaluation of the proposed system’s performance, utilizing both quantitative and qualitative analyses. Additionally, we delve into potential avenues for system enhancement through result analysis. Finally, in
Section 5, we conclude the paper by summarizing the proposed system, discussing future development directions, and outlining potential areas for further research.
2. Related Work
Image captioning is a technology that generates descriptive captions for an input image. Notable studies in this domain are referenced as [
13,
14,
15,
16]. Vinyals et al. [
13] proposed a method that connects an encoder constructed with convolutional neural networks (CNN) for extracting image information with a long short-term memory (LSTM) [
17] decoder for caption generation. Xu et al. [
14] introduced an approach that enhances the relationship between images and captions by incorporating an attention mechanism. Liu et al. [
15] suggested an image captioning technique that represents images as sequences to serve as inputs for the transformer [
18] model. Wang et al. [
16] proposed an integrated system that utilizes multimodal pre-training to represent data in a unified space. This approach expresses both image and language information as patches, enabling them to be processed simultaneously within the transformer.
In the domain of surveillance systems incorporating image captioning, Dilawari et al. [
19] introduced a system utilizing the VGG-16 [
20] to extract specific situational information from videos and generate captions through a bidirectional-LSTM [
21], with a primary focus on object-related attributes. Chen et al. [
22] proposed a system that computes anomaly scores by combining captions generated using SwinBERT [
23] and video features extracted via the ResNet-50 architecture [
24]. These studies often rely on datasets like UCF Crime [
25], NTU CCTV-Fights [
26], ShanghaiTech [
27], and XD-Violence [
28], encompassing diverse types of behavior, including fighting, fainting, loitering, and abandonment. However, a noteworthy limitation arises from the prevalent lack of comprehensive captions within these datasets, which typically offer basic object descriptions but fail to capture nuanced object behavior or contextual spatial information. Consequently, despite the application of image captioning to enhance surveillance systems, the resulting captions primarily center on objects, neglecting the spatial context crucial for a comprehensive scene risk assessment. This limitation underscores the need for further research on integrating spatial information to enhance the efficacy of surveillance systems.
The advent of LLMs has brought about substantial advancements in NLP, enhancing the capacity to understand and generate human language by discerning word similarities and contextual relationships, and effectively handling sentence structure, grammar, and meaning.
Prominent models in the domain of LLMs include BERT [
12], GPT [
29], T5 [
30], and LaMDA [
31]. BERT [
12] is a deep learning-based model in NLP distinguished by its capability to extract bidirectional contextual information from extensive volumes of raw text, capturing sequential relationships between sentences and representing words and their context in vectors. This comprehensive approach empowers BERT to consider both preceding and subsequent text, granting it an extensive understanding of language. Originally designed for language comprehension, BERT has significantly advanced the field of NLP. Furthermore, the LLM landscape boasts a multitude of models, such as GPT [
29] and T5 [
30], which have demonstrated remarkable performance in tasks such as sentence translation and summarization. Additionally, LaMDA [
31], developed for interactive applications, represents another notable addition to the suite of LLMs. These models collectively signify the diverse and expanding capabilities of LLMs in the field of NLP.
In the domain of LLMs, a notable constraint is their inherent inability to comprehend image features due to the absence of image data during their training process. To overcome this limitation, extensive research is being conducted on large multimodal models (LMMs) that enrich LLMs with image information to establish a connection between images and text. These LMMs are pre-trained on a massive scale of diverse data types, including text, images, audio, and video, thereby equipping them with the capacity to perform a multitude of tasks, ranging from image captioning to vision question answering (VQA). Prominent systems, such as BLIP-2 [
11], OpenAI’s GPT-4 [
32], Google’s Gemini [
33], and LLaVA 1.5 [
34], are all founded upon these large multimodal models, demonstrating the growing significance of this approach. Furthermore, a prevailing trend in the field involves the utilization of large web datasets for multimodal training, resulting in the development of a multitude of models. For instance, BLIP-2 [
11] is trained on an extensive dataset comprising image–text pairs gathered from the web. This model incorporates frozen pre-trained models in both its encoder and decoder and effectively addresses the modality gap between the encoder and LLMs through the query transformer (Q-Former), achieving remarkable state-of-the-art (SOTA) performance in various vision–language tasks while demonstrating significant zero-shot capabilities.
In this paper, we present a surveillance system designed to address the inherent constraints of prevailing surveillance systems. Our approach involves the creation of a novel dataset specifically tailored for surveillance applications, encompassing comprehensive captions that incorporate object information, action details, and spatial context. To further enhance the interpretative capacity of this surveillance system, we fine-tune BLIP-2 [
11], a model trained on extensive web data, utilizing our newly curated dataset. Additionally, we leverage BERT [
12] to comprehend the semantic nuances of the generated captions, enabling us to quantify the risk level based on this refined interpretation. This combination of novel dataset construction, fine-tuning, and semantic understanding represents a significant step toward a more effective and context-aware surveillance system.
3. Methodology
Figure 1 depicts a schematic representation of the proposed system’s overall workflow. The system first takes video input from CCTV sources. This video data then undergoes processing through an image captioning network, which generates interpretative captions describing the content of the scenes. These captions contain a wealth of detail, including object attributes, behaviors, and spatial context. Following caption generation, a specialized risk assessment module analyzes the information embedded within the generated captions, enabling a thorough evaluation of the scene’s risk level.
In
Figure 1, a pair of surveillance images depicting different levels of risk is presented as the input: the first depicts people running in a park, and the second depicts a car accident scenario. The deployed image captioning network formulates precise captions corresponding to the content of each image, which are subsequently evaluated by the risk assessment module to determine the pertinent risk levels. In this evaluation, the first image, portraying a safe situation, is assigned a danger score of ‘1’, while the second image, depicting an actual accident, is associated with a danger score of ‘6’. The danger scores defined within our proposed system are categorized into seven distinct stages, with ‘1’ indicating a safe situation and higher scores indicating escalating levels of risk. For an in-depth elucidation of the danger score methodology, please refer to
Section 3.1. This systematic approach enables the precise evaluation of risk levels in different surveillance scenarios.
3.1. The Construction of the Dataset
The proposed system utilizes the large multimodal model BLIP-2 [
11]. While BLIP-2 [
11] has undergone pre-training on a diverse dataset of web image–text pairs spanning various vision–language tasks, its inherent design lacks task-specific optimization. Therefore, to achieve appropriate results when applying this large multimodal model to the proposed surveillance system, fine-tuning is imperative.
Conventional training datasets for surveillance systems often lack interpretive captions, and when captions are available, they typically offer only rudimentary object descriptions without comprehensive information about object actions or the surrounding spatial context, which is essential for practical surveillance applications. Therefore, we constructed a dataset incorporating detailed information regarding objects, behavior, and spatial context. Our dataset also covers diverse environmental conditions, including overcast days, nighttime shooting, and low-light environments, ensuring robustness to changes in shooting conditions. Furthermore, the dataset is specifically designed to detect and interpret individuals even when they are far from other objects, focusing on the people who are the subjects of the monitored safety incidents. This ensures the delivery of sufficient information for an accurate understanding and analysis of the monitored situation, going beyond the mere listing of visual elements, and aids in identifying and analyzing individuals, the essential elements for risk assessment within the visual field. As a result, a comprehensive dataset of 2741 samples was built to enable BLIP-2 [
11] to generate relevant captions for the surveillance system. This dataset also facilitates risk assessment after the generated captions are processed through BERT [
12]. The danger score here represents a numerical value corresponding to the risk level, with specific classification criteria provided in
Table 1, thereby facilitating the quantification of risk levels in the surveillance context. The construction of this comprehensive dataset is instrumental in enhancing the suitability and effectiveness of the proposed system for real-world surveillance applications.
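For illustration, the sketch below shows how a single [Image-Caption-Danger Score] record could be represented and loaded in Python; the field names, file layout, and example values are assumptions made for this sketch rather than the exact schema used when building the dataset.

```python
# A minimal sketch of one [Image-Caption-Danger Score] record and a loader.
# Field names, file paths, and the JSON layout are illustrative assumptions.
import json
from dataclasses import dataclass

@dataclass
class SurveillanceSample:
    image_path: str    # path to the CCTV frame
    caption: str       # object type/attributes + behavior + spatial context
    danger_score: int  # 1 (safe) .. 7 (most dangerous), per Table 1

example = SurveillanceSample(
    image_path="frames/park_0001.jpg",
    caption="An adult man wearing a black jacket is walking on a path in a park.",
    danger_score=1,
)

def load_dataset(json_file: str) -> list[SurveillanceSample]:
    """Load [Image-Caption-Danger Score] records from a JSON list file."""
    with open(json_file, encoding="utf-8") as f:
        return [SurveillanceSample(**rec) for rec in json.load(f)]
```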
3.2. Definition of Risk Level and Danger Score
The classification of risk levels for input images involves three fundamental categories: safe situations, hazardous situations, and accident occurrences, each being assigned distinct danger scores.
Table 1 illustrates that the risk levels representing safety and hazards are divided into two stages each, whereas the danger level is further segmented into three stages, contingent upon the varying levels of risk. This methodical classification system serves to delineate the entire range of risk across diverse surveillance scenarios.
The ‘Safety’ risk level typically involves routine activities such as walking, sitting, or running in secure areas like pathways, parks, or indoors. However, it is essential to note that even within the same risk level, situations involving children may inherently pose a relatively higher level of risk compared to those involving adults. Consequently, when adults are engaged in these activities, a danger score of ‘1’ is assigned, while the presence of children is associated with a higher danger score of ‘2’, signifying the increased potential for elevated risk in such scenarios. This nuanced approach enables a more accurate evaluation of potential danger in the presence of children during actions categorized as ‘safe’.
The ‘Hazard’ risk level encompasses routine activities that typically occur in safety situations but are contingent upon the context of the surrounding space. For instance, the act of sitting is considered safe when performed on a bench, meriting a danger score of ‘1’; however, if the same action takes place on a railing, bridge, or roof, which inherently pose higher risks, it is reclassified as hazardous and is assigned a danger score of ‘3’. This systematic approach ensures that the level of risk is appropriately assessed in various space contexts, thereby enhancing the precision of the surveillance system.
The ‘Danger’ risk level is designated for detected actions categorized as accidents, encompassing scenarios where individuals have collapsed or incidents involving activities such as fights, fires, or traffic accidents. The degree of danger, as quantified by the danger score, escalates when these actions occur in hazardous environments like railings, cliffs, roads, or construction sites. Furthermore, a higher danger score is allocated when such actions involve children. This systematic risk assessment approach enables a more nuanced and accurate evaluation of the risk level within situations classified as dangerous, accounting for the influence of environmental factors and the age of the individuals involved.
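The interaction of age group, action, and spatial context described above can be summarized in a small illustrative rule, sketched below in Python; the action and location categories are simplified paraphrases of Table 1 and the examples in this section, not the exact annotation criteria.

```python
# Illustrative (simplified) mapping from (age group, action, location) to a
# danger score of 1-7, paraphrasing the classification logic of Table 1.
SAFE_ACTIONS = {"walking", "sitting", "running"}
ACCIDENT_ACTIONS = {"collapsed", "fighting", "fire", "traffic accident"}
HAZARDOUS_PLACES = {"railing", "bridge", "roof", "cliff", "road", "construction site"}

def danger_score(age_group: str, action: str, location: str) -> int:
    is_child = age_group == "child"
    hazardous_place = location in HAZARDOUS_PLACES

    if action in ACCIDENT_ACTIONS:           # 'Danger': an accident has occurred
        score = 5
        if hazardous_place:
            score += 1                       # riskier surroundings raise the score
        if is_child:
            score += 1                       # children raise the risk further
        return min(score, 7)
    if action in SAFE_ACTIONS and hazardous_place:
        return 4 if is_child else 3          # 'Hazard': routine action, risky place
    return 2 if is_child else 1              # 'Safety': routine action, safe place

# e.g., danger_score("adult", "sitting", "park bench") -> 1
#       danger_score("adult", "sitting", "railing")    -> 3
#       danger_score("adult", "collapsed", "road")     -> 6
```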
Table 2 provides a comprehensive overview of the dataset distribution, systematically organized according to distinct risk levels. A total of 2741 images were collected, each necessitating a thorough analysis of object attributes, behavior, and spatial context. These images were meticulously selected in accordance with the classification criteria outlined in
Table 1. The captions accompanying these collected images were crafted to delineate the distinctive characteristics of the object type, elucidate the actions of the object, and describe the spatial context in which the object is situated. Subsequently, the dataset was systematically constructed by assigning an appropriate danger score to each situational scenario, ensuring the comprehensive representation of various risk levels in the surveillance system.
3.3. The Sentence Structure Format of Captions
During the dataset construction phase, careful consideration was given to the design of caption structures. These structures were intentionally designed to include essential components, specifically, the object type, object attributes, object behavior, and the spatial context. To offer further clarity and insight into the employed caption structure within the dataset,
Table 3 presents illustrative examples.
The captions presented in
Table 3 meticulously delineate the specifics of objects within surveillance footage. Objects are discerned by red highlights, with each highlighted section indicating the object’s category and pertinent details. This information encompasses the identification of whether the subject is a person, a vehicle, or a fire, and further specifies whether the person is an adult or a child, along with details about their attire. Green highlights are employed to depict the behavior of the object, while blue highlights convey information pertaining to the spatial context. For instance, the first and second examples within
Table 3 both depict an adult male engaged in walking. However, the contextual differences between these examples are significant. The first example takes place in a park, characterized as a safe environment. In contrast, the second example involves walking on a road, an action considered perilous due to jaywalking. Additionally, the third and fourth examples depict a fainting incident, classified as an emergent situation within the danger (accident occurrence) category. The third example takes place within a room, devoid of additional environmental hazards, resulting in an assigned danger score of ‘5’. In contrast, the fourth example occurs on a road frequented by cars, posing a higher risk due to the increased likelihood of subsequent accidents and is thus assigned a danger score of ‘6’. This meticulous approach to caption structuring ensures the effective capture and conveyance of comprehensive information regarding the objects, their actions, and the relevant spatial context. Consequently, it enhances the dataset’s applicability in the surveillance system.
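To make the sentence format concrete, the sketch below assembles a caption from the object, behavior, and spatial-context slots described above; the template wording is a hypothetical illustration, since the exact phrasing used during annotation varies by scene.

```python
# Illustrative caption template following the structure of Table 3:
# [object type + attributes] + [behavior] + [spatial context].
def build_caption(obj: str, attributes: str, behavior: str, space: str) -> str:
    return f"{obj} {attributes} is {behavior} {space}.".capitalize()

print(build_caption("an adult man", "wearing a black jacket",
                    "walking", "on a road between cars"))
# -> "An adult man wearing a black jacket is walking on a road between cars."
```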
3.4. Scene Descriptive Caption Generation and Risk Assessment
The architectural framework of the proposed system is illustrated in
Figure 2. The system follows a two-step process: Firstly, it fine-tunes BLIP-2 [
11] using the dataset specifically tailored to generate descriptive captions that interpret scenes. BLIP-2, an advanced model in image captioning, combines visual and textual data, making it adept at understanding and describing complex scenes in surveillance footage. This model’s strength lies in its ability to contextualize visual elements within the framework of natural language, offering a more nuanced interpretation than traditional image recognition models. Subsequently, BERT [
12] is employed to perform a semantic analysis of these captions. BERT’s key feature is its bidirectional training, allowing it to understand the context of a word based on all of its surroundings in a sentence. This is a significant departure from previous models that processed text in one direction, either left-to-right or right-to-left, which could overlook the broader context of certain words or phrases. BERT’s deep understanding of language nuances makes it particularly effective in assessing the risk levels in the captions generated by BLIP-2. Following the analysis, the system conducts a comprehensive assessment of the risk level associated with each scene. The risk levels are categorized and quantified on a scale ranging from 1 to 7, reflecting the severity of the risk. This multistage approach is a crucial component of the system’s capacity to deliver a nuanced evaluation of scene risk levels, significantly enhancing the efficacy of the surveillance system.
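A condensed sketch of this two-step inference path is given below, assuming the fine-tuned models are accessed through the Hugging Face transformers API; the checkpoint paths are placeholders, and using BertForSequenceClassification as the seven-way risk classifier is an assumption that is structurally equivalent to the CLS-based classifier described later in this section.

```python
# Sketch of the two-stage inference path: BLIP-2 caption generation followed by
# BERT-based risk classification. Checkpoint paths are illustrative placeholders.
import torch
from PIL import Image
from transformers import (Blip2Processor, Blip2ForConditionalGeneration,
                          BertTokenizer, BertForSequenceClassification)

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained("path/to/finetuned-blip2").eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
risk_model = BertForSequenceClassification.from_pretrained(
    "path/to/finetuned-bert-risk", num_labels=7).eval()

def assess_frame(image_path: str) -> tuple[str, int]:
    """Generate a descriptive caption for one CCTV frame and map it to a danger score."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        caption_ids = captioner.generate(**inputs, max_new_tokens=40)
    caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()

    tokens = tokenizer(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = risk_model(**tokens).logits
    return caption, int(logits.argmax(dim=-1)) + 1  # classes 0..6 map to danger scores 1..7
```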
In the second image of
Figure 2, the BLIP-2 framework is depicted, featuring three fundamental components: the image encoder, the query transformer (Q-Former), and the large language model (LLM). The Q-Former is a trainable module designed to bridge the gap between the frozen image encoder and the large language model. The system uses ViT-G [
35] for the image encoder and OPT 2.7B [
36] for the large language model. BLIP-2 [
11] undergoes fine-tuning using the constructed dataset, a process known to be computationally intensive. To mitigate the associated costs, the LoRA (low-rank adaptation) [
37] technique proposed by Hu et al. is implemented in this system.
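A minimal sketch of how LoRA adapters could be attached to BLIP-2 using the PEFT library is shown below; the library choice, rank, dropout, and target modules are illustrative assumptions rather than the exact fine-tuning configuration used in our experiments.

```python
# Sketch of parameter-efficient fine-tuning of BLIP-2 with LoRA adapters.
import torch
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the pre-trained BLIP-2 (ViT-G image encoder, Q-Former, OPT-2.7B decoder).
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float32)

# Attach low-rank adapters; only these small matrices are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                                 # rank of the adaptation matrices (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # attention projections of the OPT decoder
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # reports the frozen base / trainable adapter split
```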
The semantic content of the generated captions is analyzed using BERT [
12]. As illustrated on the right side of
Figure 2, the scene’s risk level is subsequently measured through a classifier. BERT represents a significant milestone in the domain of natural language processing, distinguished by its exemplary performance across a spectrum of language-related tasks. Diverging from its predecessors, BERT employs a bidirectional language model that comprehensively evaluates each word within a sentence, resulting in a more profound understanding of context. The special CLS token, situated in the upper right portion of
Figure 2, consistently appears at the start of BERT input sequences. This token encapsulates the overarching context of the input sequence and is particularly well suited for classification tasks. In the proposed system, the captions generated by BLIP-2 are passed through BERT to produce a CLS token. The vector derived from this CLS token is then processed by a linear layer and a softmax function, yielding a classification that determines the scene’s risk level and associated danger score. The combination of BLIP-2 and BERT in the proposed system leverages the strengths of advanced image processing and deep language understanding. This synergy results in a sophisticated risk assessment tool capable of interpreting complex surveillance scenarios with a high degree of accuracy.
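The sketch below illustrates this CLS-based classification head in PyTorch; the bert-base checkpoint and head dimensions are assumptions for illustration, and the structure mirrors the description above: the CLS vector is passed through a linear layer and a softmax over the seven danger scores.

```python
# Sketch of the BERT-based risk classifier: the [CLS] representation is passed
# through a linear layer and softmax to yield one of seven danger scores.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertRiskClassifier(nn.Module):
    def __init__(self, num_scores: int = 7):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, num_scores)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_vector = outputs.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.head(cls_vector)                  # logits over danger scores 1..7

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertRiskClassifier().eval()
tokens = tokenizer("An adult man has collapsed on a road between cars.",
                   return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**tokens), dim=-1)    # per-score probabilities
```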
5. Conclusions and Future Work
As the field of artificial intelligence advances and hardware capabilities improve, surveillance systems have evolved to handle a broad range of tasks. Our proposed surveillance system introduces a novel approach that empowers the system to autonomously interpret a wide spectrum of information, facilitating comprehensive situation analysis. This system leverages large multimodal models to generate descriptive captions for hazardous situations and employs semantic analysis to assess the associated risk levels effectively.
To create these captions, we incorporate object information, behavior details, and spatial context to monitor various situations, leveraging this information to measure risk. Our self-constructed dataset was designed to categorize risk levels based on factors such as the age group of individuals, types of actions, and the nature of locations. Through a series of experiments using this dataset, we demonstrate that it provides comprehensive information for risk assessment and yields strong performance. Compared to models pre-trained on existing datasets, our generated captions comprehensively encompass the requisite object attributes, behavior, and spatial context essential for the surveillance system. Furthermore, they exhibit adaptability to novel sentence structures, ensuring versatility across diverse contexts. The robustness of the dataset has also been evidenced by testing with images captured under various conditions, showing its adaptability to both indoor and outdoor environments. Consequently, monitoring personnel can make more accurate and quicker decisions by receiving the video together with interpreted captions and a risk level assessment. Expanding our system to create caption data for additional situations can further enhance surveillance system performance, potentially culminating in a universally applicable system.
As part of our future research agenda, we plan to explore a system that combines multi-object detection and dense captioning technology to generate captions and seamlessly integrate sentences for multiple concurrently detected scenes. Furthermore, recognizing the constraints associated with detecting abnormal situations at the single-frame level, we aim to investigate the expansion of existing systems by incorporating video captioning technology that accounts for the context preceding and following an incident, thus enabling a more comprehensive and nuanced analysis.