Personalized Video Summarization: A Comprehensive Survey of Methods and Datasets
Abstract
1. Introduction
- Several works produce personalized video summaries in real time. Valdés and Martínez (2010) [17] introduced an application for interactive, on-the-fly video summarization. Chen and De Vleeschouwer (2010) [18] produced personalized basketball video summaries in real time from multi-sensor data.
- Several works are query based. Kannan et al. (2015) [19] proposed a system that creates personalized movie summaries through a query interface. Given a semantic query, Garcia del Molino (2016) [20] proposed a system that can find relevant digital memories and perform personalized summarization. Huang and Worring (2020) [21] created a dataset of query–video pairs.
- A few studies focused on egocentric personalized video summarization. Varini et al. (2015) [24] proposed a personalized egocentric video summarization framework. The most recent study, by Nagar et al. (2021) [25], presented an unsupervised reinforcement learning framework for day-long egocentric videos.
- Many studies have used machine learning to develop video summarization techniques. Park and Cho (2011) [26] proposed the summarization of personalized live video logs from a multicamera system using machine learning. Peng et al. (2011) [27] proposed personalized video summarization by supervised machine learning using a classifier. Ul Haq et al. (2019) [28] used a deep CNN model for facial expression recognition. Zhou et al. (2019) [29] proposed a character-oriented video summarization framework. Fei et al. (2021) [30] proposed a triplet deep-ranking model for personalized video summarization. Mujtaba et al. (2022) [31] proposed a framework for personalized video summarization using a 2D CNN. Köprü and Erzin (2022) [32] used affective visual information for human-centric video summarization. Ul Haq et al. (2022) [33] presented a personalized video summarization framework based on the Object of Interest (OoI).
2. Applications of Personalized Video Summarization
2.1. Personalized Trailer of Automated Serial, Movies, Documentaries and TV Broadcasting
2.2. Personalized Sport Highlights
2.3. Personalized Indoor Video Summary
2.4. Personalized Video Search Engines
2.5. Personalized Egocentric Video Summarization
3. Approaches for Summarizing Personalized Videos
- Keyframe cues are the most representative and important frames or still images drawn from a video stream in temporal order [49]. For example, the personalized video summarization of Köprü and Erzin [32] formulates keyframe selection as maximizing a scoring function based on affective features (a minimal scoring sketch follows this list of cues).
- Video segment cues are a dynamic extension of keyframe cues [11]. They correspond to the most important parts of a video stream, and summaries produced with these dynamic elements in mind preserve both image and audio [49]. Because they retain the sound and motion of the video, such summaries are more attractive to the user; a major drawback is that the user takes longer to understand their content [11]. Sharghi et al. [48] proposed a framework that checks the relevance of a shot both to its importance in the context of the video and to the user's query before including it in the summary.
- Graphical cues complement the other cue types with syntactic and visual annotations [49], adding an extra layer of detail. Thanks to embedded widgets, users perceive the content of a video summary in more detail than other methods can achieve [11]. This is illustrated by Tseng and Smith, who presented a method in which an annotation tool learns semantic concepts either from other sources or from the same video sequence [13].
- Social cues are derived from the user's image connections on social networks. Yin et al. [51] proposed an automatic video summarization framework in which user interests are extracted from their image collections on the social network.
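The following is a minimal, generic sketch of keyframe selection driven by a scoring function, in the spirit of the keyframe cues described above; it is not the method of [32]. The per-frame features and importance scores (e.g., affect-based scores) are assumed to be precomputed by some upstream model.

```python
import numpy as np

def select_keyframes(features, scores, k=5, redundancy_weight=0.5):
    """Greedily pick k frames that maximize importance minus similarity to the
    already selected frames (a generic importance-plus-diversity criterion)."""
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(scores)):
            if i in selected:
                continue
            if selected:
                # cosine similarity to the closest already-selected frame
                sims = features[selected] @ features[i] / (
                    np.linalg.norm(features[selected], axis=1)
                    * np.linalg.norm(features[i]) + 1e-8)
                penalty = redundancy_weight * sims.max()
            else:
                penalty = 0.0
            val = scores[i] - penalty
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return sorted(selected)

# toy usage: 100 frames with 64-d features and hypothetical importance scores
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 64))
importance = rng.random(100)
print(select_keyframes(feats, importance, k=5))
```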
4. Classification of Personalized Video Summarization Techniques
4.1. Type of Personalized Summary
- Static storyboards are also called static summaries. Some personalized video summaries consist of a set of static images that capture the highlights and are presented in the manner of a photo album [15,16,21,22,23,29,33,35,43,44,48,52,53,54,55,56,57,58,59]. These summaries are generated by extracting keyframes according to the user's preferences and the summary criteria.
- Dynamic summaries are also called dynamic video skims. The personalized summary is produced by selecting, from all frames of the original video, a subset of consecutive frames containing the most relevant subshots that represent the original video according to the user's preferences and the summary criteria [12,14,17,18,19,20,25,27,28,30,31,36,38,39,45,46,48,50,55,60,61,62,63,64,65,66,67,68,69,70].
- A hierarchical summary is a multilevel, scalable summary. It consists of a few abstraction layers, where the lowest layer contains the largest number of keyframes and the most detail, while the highest layer contains the smallest number of keyframes. Hierarchical video summaries offer the user several levels of summary, which makes it easier to determine what is appropriate [49]. Based on each user's preferences, the summary presents an appropriate overview of the original video. To accurately identify the concept related to a frame, Sachan and Keshaveni [8] proposed a hierarchical mechanism to classify the images; the system performing the hierarchical classification deeply categorizes a frame against a predefined taxonomy. Tseng and Smith [13] suggested a summarization algorithm in which server metadata descriptions, contextual information, user preferences, and user interface statements are used to produce hierarchical video summaries.
- Multiview summaries are created from videos recorded simultaneously by multiple cameras. Such summaries are useful for sports videos, which are recorded by many cameras. Multiview summarization raises challenges that often stem from overlapping and redundant content, as well as from the lighting conditions and fields of view of the different cameras. As a result, keyframe extraction for static summaries and shot-boundary detection for video skims become difficult, so conventional techniques for video skimming and keyframe extraction from single-camera videos cannot be applied directly in this scenario [49]. In personalized video summarization, the multiview summary is determined by each user's preferences and the summary criteria [18,26,39,71].
- Fast forward summaries: When a user watches a video segment that is not informative or interesting, they will often play it fast or skip forward quickly [72]. In a personalized video summary, the user therefore wants to fast forward through segments that are not of interest. In [51], Yin et al. play the less important parts in fast forward mode in order to preserve context. Chen et al. [41] proposed an adaptive fast forward personalized video summarization framework that performs clip-level fast forwarding, choosing playback speeds from a discrete set of options that includes, as a special case, dropping content entirely (an infinite playback speed). In personalized summarization, fast forwarding is used in sports videos, where each viewer wants to watch the main phases of the match that interest them, while still getting a brief overview of the remaining, less interesting phases. Chen and De Vleeschouwer [42] proposed a personalized summarization framework for soccer broadcasts with adaptive fast forwarding, where a resource-allocation formulation selects the optimal combination of candidate summaries.
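As a hedged illustration of clip-level adaptive fast forwarding (not the specific allocation schemes of [41] or [42]), the sketch below maps hypothetical clip importance scores to a discrete set of playback speeds, with an infinite speed standing for skipping a clip entirely; the importance scores and thresholds are assumed inputs.

```python
def assign_playback_speeds(clip_importance, clip_durations,
                           speeds=(1, 2, 4, float("inf")),
                           thresholds=(0.75, 0.5, 0.25)):
    """Map each clip's importance to a discrete playback speed.
    A speed of 'inf' means the clip is skipped (the special case noted above)."""
    plan = []
    for imp, dur in zip(clip_importance, clip_durations):
        if imp >= thresholds[0]:
            speed = speeds[0]          # play at normal speed
        elif imp >= thresholds[1]:
            speed = speeds[1]
        elif imp >= thresholds[2]:
            speed = speeds[2]
        else:
            speed = speeds[3]          # skip the clip
        played = 0.0 if speed == float("inf") else dur / speed
        plan.append({"speed": speed, "played_seconds": played})
    return plan

# toy usage: four 10-second clips with decreasing importance
print(assign_playback_speeds([0.9, 0.6, 0.3, 0.1], [10, 10, 10, 10]))
```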
4.2. Features
4.2.1. Object Based
4.2.2. Area Based
4.2.3. Perception Based
4.2.4. Motion and Color Based
4.2.5. Event Based
4.2.6. Activity Based
Paper | Type of Summary | Features | Domain | Time | Method | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Storyboard | Video Skimming | Hierarchical | Multiview | Fast Forward | Object Based | Area Based | Perception Based | Motion and Color Based | Event Based | Activity Based | Domain Specific | Non-Domain Specific | Real Time | Static | Supervised | Weakly Supervised | Unsupervised | |
Tsenk et al., 2001 [12] | x | x | x | x | ||||||||||||||
Tseng and Smith, 2003 [13] | x | x | x | x | x | x | ||||||||||||
Tseng et al., 2004 [60] | x | x | x | x | x | x | x | |||||||||||
Lie and Hsu, 2008 [14] | x | x | x | x | x | x | x | |||||||||||
Chen and Vleeschouwer, 2010 [18] | x | x | x | x | x | |||||||||||||
Valdés and Martínez, 2010 [17] | x | x | x | x | x | x | ||||||||||||
Chen et al., 2011 [39] | x | x | x | x | x | x | x | |||||||||||
Hannon et al., 2011 [38] | x | x | x | x | ||||||||||||||
Katti et al., 2011 [22] | x | x | x | x | ||||||||||||||
Park and Cho, 2011 [26] | x | x | x | x | x | x | ||||||||||||
Peng et al., 2011 [27] | x | x | x | x | x | |||||||||||||
Yoshitaka and Sawada, 2012 [72] | x | x | x | x | ||||||||||||||
Kannan et al., 2013 [67] | x | x | x | x | x | x | x | |||||||||||
Niu et al., 2013 [46] | x | x | x | x | x | x | ||||||||||||
Zhang et al., 2013 [16] | x | x | x | x | x | |||||||||||||
Chung et al., 2014 [50] | x | x | x | x | x | |||||||||||||
Darabi et al., 2014 [35] | x | x | x | x | ||||||||||||||
Ghinea et al., 2014 [66] | x | x | x | x | x | x | x | |||||||||||
Kannan et al., 2015 [19] | x | x | x | x | x | x | x | |||||||||||
Garcia, 2016 [20] | x | x | x | x | ||||||||||||||
Sharghi et al., 2016 [48] | x | x | x | x | x | x | ||||||||||||
del et al., 2017 [68] | x | x | x | x | x | x | ||||||||||||
Otani et al., 2017 [74] | x | x | x | x | x | |||||||||||||
Sachan and Keshaveni, 2017 [8] | x | x | x | x | ||||||||||||||
Sukhwani and Kothari, 2017 [44] | x | x | x | x | x | x | ||||||||||||
Varini et al., 2017 [62] | x | x | x | x | x | x | x | x | x | |||||||||
Fei et al., 2018 [36] | x | x | x | x | x | |||||||||||||
Tejero-de-Pablos et al., 2018 [45] | x | x | x | x | x | |||||||||||||
Zhang et al., 2018 [64] | x | x | x | x | x | |||||||||||||
Dong et al., 2019 [53] | x | x | x | x | x | x | x | x | ||||||||||
Dong et al., 2019 [54] | x | x | x | x | x | x | x | x | ||||||||||
Gunawardena et al., 2019 [59] | x | x | x | x | x | |||||||||||||
Jiang and Han, 2019 [69] | x | x | x | x | x | |||||||||||||
Ji et al., 2019 [71] | x | x | x | x | x | |||||||||||||
Lei et al., 2019 [61] | x | x | x | x | x | |||||||||||||
Ul et al., 2019 [28] | x | x | x | x | x | |||||||||||||
Zhang et al., 2019 [65] | x | x | x | x | x | |||||||||||||
Baghel et al., 2020 [52] | x | x | x | x | x | |||||||||||||
Xiao et al., 2020 [55] | x | x | x | x | x | x | ||||||||||||
Fei et al., 2021 [30] | x | x | x | x | x | |||||||||||||
Nagar et al., 2021 [25] | x | x | x | x | x | x | ||||||||||||
Narasimhan et al., 2021 [58] | x | x | x | x | x | x | x | |||||||||||
Mujtaba et al., 2022 [31] | x | x | x | x | x | |||||||||||||
Ul et al., 2022 [33] | x | x | x | x | x | |||||||||||||
Cizmeciler et al., 2022 [70] | x | x | x | x | x |
4.3. Video Domain
- Domain-specific techniques perform video summarization within a particular domain. Common content domains include home videos, news, music, and sports. Focusing on a specific domain reduces ambiguity when analyzing the video content [11]. The video summary must be tailored to the domain, since the criteria for producing a good summary vary dramatically across domains [54].
- In contrast to domain-specific techniques, non-domain-specific techniques perform video summarization for any domain, so there is no restriction on the choice of video to produce the summary. The system proposed by Kannan et al. [67] can generate a video summarization without relying on any specific domain. The summaries presented in [12,13,14,17,20,22,25,27,30,31,33,35,46,48,52,55,56,58,59,60,61,64,65,66,68,69,70,71,72,74] are not domain specific. The types of domains found in the literature are personal videos [53,54], movies/serial clips [19,28,29,36,50], sports videos [18,36,38,39,40,41,42,43,44,45], cultural heritage [62], zoo videos [8], and office videos [26].
4.4. Source of Information
4.4.1. Internal Personalized Summarization Techniques
- Image features include changes in the motion, shape, texture, and color of objects in the video image stream. These changes can be used to segment the video into shots by identifying fades or hard boundaries: fades are revealed by slow changes in image characteristics, whereas hard boundaries, such as cuts, are revealed by sharp changes (see the shot-boundary sketch after this list). Analyzing image features also allows specific objects to be detected and the depth of summarization to be improved for videos with a known structure. Sports videos are well suited to event detection because of their rigid structure; event and object detection can likewise be achieved in other content areas with a rigid structure, such as news videos, where the broadcast starts with an overview of the headlines, then presents a series of reports, and finally returns to the anchor [11].
- Audio features are related to the video stream and appear in the audio stream in different ways. Audio features include music, speech, silence, and sounds. Audio features can help identify segments that are candidates for inclusion in a video summary, and improving the depth of the summary can be achieved using domain-specific knowledge [11].
- Text features appear in the video in the form of text captions or subtitles. Instead of forming a separate stream, text captions are "burned" into, or integrated with, the video's image stream. Text may contain detailed information related to the content of the video stream and is therefore an important source of information [11]. For example, in a live soccer broadcast, captions showing the team names, the current score, the possession percentage, the shots on target, and so on appear during the match. As with video and audio features, events can also be identified from text. Otani et al. [74] proposed a text-based method in which the supporting text of a video blog post is exploited for summarization. The video is first segmented and then, according to the relevance of each segment to the input text, each segment is assigned a priority for the summary video. A subset of segments whose content is similar to the input text is then selected, so a different summary is produced for each input text.
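A minimal sketch of the image-feature analysis described in the first item of this list: shot boundaries are detected from color-histogram differences, with a sharp frame-to-frame change flagged as a hard cut and a slower but accumulating change flagged as a fade. Frames are assumed to be HxWx3 uint8 arrays, and the thresholds are illustrative, not tuned values from any cited work.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Normalized per-channel color histogram, so frames of any size compare fairly."""
    h = np.concatenate([np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
                        for c in range(frame.shape[-1])]).astype(float)
    return h / h.sum()

def detect_boundaries(frames, cut_threshold=0.5, fade_threshold=0.15, fade_window=10):
    """Flag hard cuts (large frame-to-frame change) and fades
    (smaller changes that accumulate over a window of frames)."""
    hists = [color_histogram(f) for f in frames]
    diffs = [0.5 * np.abs(hists[i] - hists[i - 1]).sum() for i in range(1, len(hists))]
    cuts, fades = [], []
    for i, d in enumerate(diffs, start=1):
        if d > cut_threshold:
            cuts.append(i)                      # abrupt change: hard cut
        elif i >= fade_window:
            accumulated = 0.5 * np.abs(hists[i] - hists[i - fade_window]).sum()
            if d > fade_threshold and accumulated > cut_threshold:
                fades.append(i)                 # gradual but large cumulative change: fade
    return cuts, fades
```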
4.4.2. External Personalized Summarization Techniques
4.4.3. Hybrid Personalized Summarization Techniques
4.5. Based on Time of Summarization
- In real-time techniques, the personalized summary is produced during the playback of the video stream. Because the output must be delivered very quickly, producing a summary in real time is difficult; in real-time systems, a delayed output is considered incorrect [47]. The Sequential and Hierarchical Determinantal Point Process (SH-DPP) is a probabilistic model developed by Sharghi et al. [48] to produce extractive summaries from streaming or long-form video. The personalized video summaries presented in [12,13,17,18,24,31,38,39,46,50,60] are produced in real time.
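The sketch below is only a generic illustration of the real-time constraint (each shot is scored once as it arrives and is never revisited); it is not the SH-DPP of [48], and the shot scores are assumed to be produced by some upstream model.

```python
import heapq

class StreamingSummarizer:
    """Keep the k highest-scoring shots seen so far in a min-heap, so each
    incoming shot is processed once and the summary is available at any time."""

    def __init__(self, budget_shots=10):
        self.budget = budget_shots
        self.heap = []                       # (score, shot_id)

    def observe(self, shot_id, score):
        if len(self.heap) < self.budget:
            heapq.heappush(self.heap, (score, shot_id))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, shot_id))

    def summary(self):
        # return the selected shots in chronological order
        return sorted(shot_id for _, shot_id in self.heap)

# toy usage: six shots arrive with hypothetical importance scores
s = StreamingSummarizer(budget_shots=3)
for shot_id, score in enumerate([0.2, 0.9, 0.1, 0.7, 0.8, 0.3]):
    s.observe(shot_id, score)
print(s.summary())   # -> [1, 3, 4]
```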
4.6. Based on Machine Learning
4.6.1. Supervised
- Keyframe selection: The goal is to identify the most inclusive and varied content (or frames) in a video for brief summaries. Keyframes are used to represent the significant information contained in the video [52]. The keyframe-based approach chooses a limited set of image sequences from the original video to provide an approximate visual representation [31]. Baghel et al. [52] proposed a method in which the user preference is entered as an image query; the method is based on object recognition with automatic keyframe detection. Important frames are selected from the input video, and the output video is produced from these frames. A selection score is computed for each frame based on its similarity to the query image, and a threshold is applied: if a frame's selection score exceeds the threshold, it is kept as a keyframe; otherwise, it is discarded (a minimal sketch of this thresholding appears after this list).
- Keyshot selection: Keyshots are standard continuous video segments extracted from the full-length video, each shorter than the original. Keyshot-driven summarization techniques are used to generate excerpts from short videos (such as user-created TikToks and news clips) or long videos (such as full-length movies and soccer games) [31]. Mujtaba et al. [31] presented the LTC-SUM method, which produces personalized keyshot summaries by minimizing the distance between the semantic information and the selected video frames; the importance of the frame sequence is measured using a supervised encoder–decoder network. Zhang et al. [65] proposed a mapping network (MapNet) that expresses the degree of association of a shot with a given query by mapping visual information into the query space, together with a summarization network (SummNet) trained with deep reinforcement learning that integrates diversity, representativeness, and relevance to produce personalized summaries. Jiang and Han [69] proposed a scoring mechanism based on the scene layer and the shot layer of a hierarchical structure: each shot is scored through this mechanism, and the highest-rated shots are selected as key shots.
- Event-based selection: Personalized summarization detects events in a video based on the user's preferences. In the method above [31], a two-dimensional convolutional neural network (2D CNN) model was implemented to identify events from thumbnails without relying on a specific domain.
- Learning shot-level features: It involves learning advanced semantic information from a video segment. The Convolutional Hierarchical Attention Network (CHAN) method was proposed by Xiao et al. [55]. After dividing the video into segments, visual features are extracted using the pre-trained network. To perform shot-level feature learning, visual features are sent to the feature encoding network. To perform learning on a high-level semantic information video segment, they proposed a local self-attention module. To manage the semantic relationship between the given query and all segments, they used a global attention module that responds to the query. To reduce the length of the shot sequence and the dimension of the visual feature, they used a fully convolutional block.
- Transfer learning: Transfer learning involves adjusting the information gained from one area (source domain) to address challenges in a separate, yet connected area (target domain). The concept of transfer learning is rooted in the idea that when tackling a problem, we generally rely on the knowledge and experience we have gained from addressing similar issues in the past [76]. Ul et al. [28] proposed a framework using transfer learning to perform facial expression recognition (FER). More specifically, they presented the learning process that, to be completed, includes two steps. In the first step, a CNN model is trained for face recognition. In the second step, transfer learning is performed for the FER of the same model.
- Adversarial learning: This is a machine learning technique in which a model is tricked or confused by deliberately crafted input, which can be used to carry out an attack or cause a malfunction in a machine learning system. A competitive three-player network was proposed by Zhang et al. [64]. The generator learns the video content as well as the representation of the user query. The discriminator receives three query-conditioned summaries so that it can distinguish the real summary from a random one and a generated one. A three-player loss is used to train the discriminator and the generator; this training prevents the generator from producing trivial random summaries and helps it learn better summaries.
- Vision-language: A vision-language model is an artificial intelligence model that integrates natural language processing and computer vision abilities to comprehend and produce textual descriptions of images, thus connecting visual information with natural language explanations. Plummer et al. [77] used a two-branch network to learn a vision-language embedding model: one branch receives the text features and the other the visual features, and a margin-based triplet loss combining a neighborhood-preserving term and bi-directional ranking terms trains the network.
- Hierarchical self-attentive network: The hierarchical self-attentive network (HSAN) is able to understand the consistent connection between video content and its associated semantic data at both the frame and segment levels, which enables the generation of a comprehensive video summary [78]. A hierarchical self-attentive network was presented by Xiao et al. [78]. First, the original video is divided into segments, and then the visual feature of each frame is extracted using a pre-trained deep convolutional network. A local and a global self-attention module are proposed to capture the semantic relationship at the segment level and at the context level, respectively. To learn the relationship between visual content and captions, the self-attention results are passed to an enhanced caption generator. An importance score is generated for each frame or segment to produce the video summary.
- Weight learning: The weight learning approach was proposed by Dong et al. [54], in which maximum-margin learning is used to automatically learn the weights of different objects. Weights can be learned for different processing styles or product types, since these videos contain annotations that are highly relevant to the domain expert's decisions. Different processing styles or product categories may therefore weight the audio annotations, which are built directly on domain-specific processing decisions, in different ways, and these weights can be stored as defaults so that users can explore the design space efficiently.
- Gaussian mixture model: A Gaussian mixture model is a clustering method used to estimate the likelihood that a specific data point belongs to a cluster. In the user preference learning algorithm proposed by Niu et al. [46], the most representative of the extracted frames are initially selected as temporary keyframes and displayed to the user to indicate scene changes. If the user is not satisfied with the selected temporary keyframes, they can interact by manually selecting keyframes. User preferences are modeled by learning a Gaussian mixture model (GMM), whose parameters are automatically updated from the user's manual keyframe selections. The personalized summary is produced in real time as the personalized keyframes replace the initially selected temporary ones; the personalized keyframes represent the user's preferences and taste.
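As a hedged illustration of the query-based keyframe thresholding described in the first item of this list (not the exact pipeline of Baghel et al. [52]), the sketch below scores frames by cosine similarity to a query-image embedding and keeps those above a threshold. The embeddings are assumed to come from any pre-trained image encoder.

```python
import numpy as np

def query_keyframes(frame_embeddings, query_embedding, threshold=0.6):
    """Keep frames whose embedding is similar enough to the query-image embedding."""
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = f @ q                                   # cosine similarity per frame
    keyframes = np.where(scores > threshold)[0]      # frames passing the threshold
    return keyframes, scores

# toy usage with random embeddings standing in for encoder outputs
rng = np.random.default_rng(1)
frames = rng.normal(size=(200, 128))
query = rng.normal(size=128)
selected, _ = query_keyframes(frames, query, threshold=0.2)
print(len(selected), "frames kept as keyframes")
```

The threshold trades recall against precision: lowering it keeps more frames that only loosely match the query image, while raising it yields a shorter, more query-specific storyboard.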
4.6.2. Unsupervised
- Contrastive learning: Contrastive learning performs self-supervised pretraining of a model using a pretext task. The model learns to push apart representations that should be far away, called negative representations, and to pull together positive representations that should be close, so that it can discriminate between different objects [56].
- Reinforcement learning: A framework for creating an unsupervised personalized video summary that supports the integration of diverse pre-existing preferences, along with dynamic user interaction for selecting or omitting specific content types, was proposed by Nagar et al. [25]. Spatio-temporal features are first extracted from the egocentric video using 3D convolutional neural networks (3D CNNs). The video is then split into non-overlapping shots, and the features are extracted per shot. The features are subsequently fed to the reinforcement learning agent, which employs a bi-directional long short-term memory network (BiLSTM); through its forward and backward passes, the BiLSTM encapsulates future and past information for each subshot (a sketch of a typical unsupervised reward follows this list).
- Event-based keyframe selection (EKP): EKP was developed by Ji et al. [71] so that keyframes can be presented in groups, where the groups are separated according to specific events relevant to the query. The Multigraph Fusion (MGF) method is used to automatically find events that are relevant to the query. Keyframes are then assigned to the different event categories based on the correspondence between the videos and the keyframes. The summary is represented by a two-level structure, with event descriptions in the first layer and keyframes in the second.
- Fuzzy rule based: A fuzzy system is a method for representing human knowledge, which involves fuzziness and uncertainty, and is a representative and important application of fuzzy set theory [26]. Park and Cho [26] used a system based on fuzzy TSK rules to evaluate video event shots. In this rule-based system, the consequent is a function of the input variables rather than a linguistic variable, so the time-consuming defuzzification process can be avoided. The summaries in [50,51,56,59,74] use an unsupervised technique to produce a summary.
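Unsupervised reinforcement-learning summarizers commonly train the agent with a reward that combines diversity and representativeness terms. The sketch below shows one such reward under that assumption; it is not necessarily identical to the reward used in [25], and the shot features are assumed to come from a 3D CNN or similar encoder.

```python
import numpy as np

def diversity_representativeness_reward(features, selected):
    """Reward = dissimilarity among selected shots (diversity)
    + how well the selected shots cover all shots (representativeness)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sel = np.asarray(selected)

    # diversity: mean pairwise dissimilarity among the selected shots
    sim = f[sel] @ f[sel].T
    n = len(sel)
    diversity = (1.0 - sim)[~np.eye(n, dtype=bool)].mean() if n > 1 else 0.0

    # representativeness: every shot should be close to some selected shot
    dists = 1.0 - f @ f[sel].T                 # (num_shots, num_selected)
    representativeness = np.exp(-dists.min(axis=1).mean())

    return diversity + representativeness

# toy usage: reward for selecting shots 0, 10, and 20 out of 30
rng = np.random.default_rng(2)
shot_feats = rng.normal(size=(30, 256))
print(diversity_representativeness_reward(shot_feats, [0, 10, 20]))
```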
4.6.3. Supervised and Unsupervised
4.6.4. Weakly Supervised
5. Types of Methodology
5.1. Feature-Based Video Summarization
5.2. Keyframe Selection
- Object-based keyframe selection: Summaries that are based on objects specifically concentrate on certain objects within a video. These are scrutinized and used as benchmarks to identify the video keyframes. The techniques in the following studies [13,16,26,46,48,52,58] are object-based keyframe selection.
- Event-based keyframe selection: Event-oriented summaries pay special attention to particular events in a video, which are examined and used as benchmarks for the summarization process. The keyframes of the video are identified based on the selected event. The personalized video summary technique in [71] is the selection of event-based keyframes.
5.3. Keyshot Selection
- Object-based shot selection: The summaries that are object based place special emphasis on distinct objects within a video, which are examined and used as benchmarks for the summarization process. Based on the selected object, the keyshots of the video are detected. In the following studies [12,19,60,62,66,67,69,70], the keyshot selection technique is based on object detection.
- Area-based shot selection: Area-based summaries focus particularly on specific regions within a video, which are scrutinized and employed as reference points for the summarization procedure. Depending on the chosen region, the video keyshots are identified. For the personalized summaries presented in [62,66], the selection of keyshots is based on the area.
5.4. Video Summarization Using Trajectory Analysis
Dense Trajectories
5.5. Personalized Video Summarization Using Clustering
- Hierarchical clustering: Hierarchical clustering, or hierarchical cluster analysis, is a method that aggregates similar items into collections known as clusters. The final result is a series of clusters, each distinct from the others, with the elements within each cluster sharing a high degree of similarity. Yin et al. [51] proposed a hierarchy that is used as a dictionary to encode visual features in an improved way. Leaf nodes are first clustered hierarchically based on their pairwise similarities, and the same process is then applied recursively to cluster the images into subgroups. The result is a semantic tree from the root to the leaves.
- K-means clustering: The Non-negative Matrix Factorization (NMF) method was proposed by Liu et al. [63] to produce supervised personalized video summarization. The video is segmented into small clips, and each clip is described by a word from a bag-of-words model; NMF is used for action segmentation and clustering, and the dictionary is learned with the k-means algorithm. The unsupervised method proposed by Ji et al. [71] uses k-means to cluster words with similar meanings. In the method proposed by Cizmeciler et al. [70], visual features are extracted from the frames using a pre-trained CNN, each shot is represented by the average of its visual features, and the shots are clustered into a set of clusters with k-means. The centers are initialized by taking random samples from the data, using the Euclidean distance metric, and the summary is created by selecting the shots closest to each cluster center (see the sketch after this list). To cluster photos, Darabi and Ghinea [7] used the k-means algorithm in their supervised framework to categorize them based on RGB color histograms.
- Dense-neighbor based clustering: Lei et al. [61] proposed an unsupervised framework to produce a personalized video summary. The video is divided into separate segments with a clustering method based on dense neighbors: the video frames are grouped into semantically consistent segments using a clustering-center-based algorithm.
- Concept clustering: Sets of video frames that are different but semantically close to each other are grouped together, forming concept clusters. These clusters are intended to represent macro-optical concepts, which are formed from similar types of micro-optical concepts belonging to different video units. Sets of frames are clustered using a cumulative clustering technique; the technique is top-down and is based on a pairwise similarity function (SimFS) operating on two sets of frames [8].
- Affinity propagation: Affinity propagation (AP) is a clustering approach that achieves a higher fitness value than other approaches. It is based on factor graphs and has been used as a tool to cluster video frames and create storyboard summaries [15].
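A minimal sketch of the k-means-based selection described above: shot-level features (assumed to come from any pre-trained CNN) are clustered and, for each cluster, the shot closest to the centroid is kept. This is a generic illustration rather than the exact pipeline of [70].

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize_by_clustering(shot_features, num_clusters=5):
    """Cluster shot features with k-means and keep, per cluster, the shot
    closest to the centroid (Euclidean distance), in chronological order."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(shot_features)
    chosen = []
    for c in range(num_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(shot_features[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[dists.argmin()])
    return sorted(chosen)

# toy usage: 40 shots described by 512-d averaged CNN features
rng = np.random.default_rng(3)
shots = rng.normal(size=(40, 512))
print(summarize_by_clustering(shots, num_clusters=5))
```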
6. Personalized Video Datasets
1. Virtual Surveillance Dataset for Video Summary (VSSum): Using a permutation and combination method, Zhang et al. [86] created this dataset in 2022. It consists of 1000 virtual surveillance videos simulated in virtual scenarios. Each video is recorded at 30 fps and lasts 5 min.
2. The Open Video Project (OVP): This dataset was created in 1999 and includes 50 videos recorded at a resolution of 352 × 240 pixels at 30 fps, in MPEG-1 format, lasting from 1 to 4 min. The content types are educational, lecture, historical, ephemeral, and documentary [87]. The repository contains more than 40 h of video, mostly in MPEG-1 format, drawn from American agencies such as NASA and the National Archives, and reflects a wide range of video types and features, such as home, entertainment, and news videos [84].
3. TVC: In 2019, Dong et al. [53] created the TVC dataset, which consists of 618 TV commercial videos.
4. Art City: The largest unconstrained dataset of heritage tour visits, including egocentric videos sharing narrative landmark information, behavior classification, and geolocation information, was published by Varini et al. [62] in 2017. The videos record cultural visits by tourists to six art cities and contain geolocation information.
5. Self-compiled: Based on the UT Egocentric (UTE) dataset [82], a new dataset was compiled by Sharghi et al. [5] in 2017 after collecting dense concept annotations for each video shot. The new dataset is designed for query-focused summarization, providing different summaries depending on the query together with automatic and efficient evaluation metrics.
6. Common Film Action: This repository includes 475 video clips from 32 Hollywood films. The clips are annotated with eight categories of actions [88].
7. Soccer video: This dataset contains 50 soccer video clips with manually collected annotations, taken from the UEFA Champions League and the FIFA World Cup 2010 championship [36].
8. UGSV Kendo: This dataset was created in 2016. It contains 10 RGB-D videos covering 12 matches and totaling 90 min, recorded with a Microsoft Kinect V2 sensor [89]. Eighteen additional self-recorded RGB-D Kendo videos were later added, with a total duration of 246 min and a frame rate of around 20 fps [45].
9. Hollywood-2: It includes 2859 videos from 69 films, with resolutions of roughly 400–300 × 300–200 pixels, split into training and test sets. There are 823 action samples for training and 884 for testing, and 570 scene samples for training and 582 for testing [90].
10. Sports-1M: Published in 2014, this dataset includes 1 million YouTube sports videos annotated with 487 classes. Each class includes 1000–3000 videos, and about five percent of the videos are annotated with more than one class. On average, each video is 5 min and 36 s long [85].
11. Weizmann dataset: This dataset includes 90 video sequences with a resolution of 180 × 144 pixels, de-interlaced at 50 fps. The videos show each of 9 people performing 10 physical actions [63].
12. UGSum52: This dataset was created in 2019. It contains 52 videos, one to nine minutes long, recorded in various ways such as egocentric, dynamic, and static views. Each video has 25 human-generated summaries, and all videos are minimally edited [61].
13. UT Egocentric (UTE): Published in 2012, it comprises 10 videos recorded with the portable Looxcie camera at a resolution of 320 × 480 pixels and 15 fps. Each video lasts 3 to 5 h, and the total duration is 37 h. The videos capture various physical daily activities and focus on the most important people and objects. The dataset is used to produce first-person-view video summaries and includes object annotations [82].
14. SumMe: This dataset was created in 2014. It includes 25 videos that are raw or only slightly edited. The content of the videos is related to sports, events, and holidays. The duration of each video varies from 1 to 6 min. The dataset is annotated with human scores for video segments [80].
15. TVSum: Published in 2015, this dataset includes a collection of 50 YouTube videos divided into 10 categories, with 5 videos per category representing genres such as documentary, news, and egocentric video. Each video lasts 2 to 10 min. Shot-level importance scores were annotated through crowdsourcing [81].
16.
17. UTEgo: Created in 2012, the 10 videos were collected with the Looxcie wearable camera. Each video has a resolution of 320 × 480 pixels at 15 fps and lasts 3 to 5 h. Various activities are recorded, such as cooking, driving, attending lectures, shopping, and eating. Annotations collected on MTurk were used as ground truth; each video has on average 680 annotated frames [82].
18. CSumm: This dataset was published in 2017. Ten videos were captured with Google Glass at 29 fps. Each video is annotated, lasts between 15 and 30 min, and is unconstrained. Activities in the videos include having dinner, enjoying nature, watching sports, and exercising. A wide range of motions and viewpoints is included. The best segments are manually annotated with a 10 s time limit [68].
19. Query-Focused dataset (QFVS): Published in 2017, the repository includes 4 videos of different uncontrolled everyday-life scenarios. Each video lasts between 3 and 5 h. For each user query, a dictionary of 48 concepts related to everyday life is provided in order to create a query-based video summary [64]. Users annotate the absence/presence of concepts in each video [5].
20. Animation: The dataset includes 148 videos from 32 cartoons. Each video includes 3000 bullet-screen comments and lasts about 24 min [29].
21. Movie: The dataset includes 19 movies, 2179 bounding boxes, and 96 characters. The test samples consist of 452 bounding boxes and 20 characters [29].
22.
23. TV episodes: This dataset was published in 2014. It includes ground-truth summaries and text annotations for 4 TV episodes, each 45 min long. The annotations cover 40 h of data in total, divided across 11 videos [83].
24. Large-scale Flickr images: This dataset was created in 2021. It includes 420,000 athletic images shared by photographers, divided into seven categories, with 30,000 photos retrieved in response to each category query. The photos returned for each query are labeled as being of "interest". Both "interest" and "no interest" images are used to train the improved deep model [30].
25.
26. Query–video pair based: This dataset was published in 2020. The repository includes 190 YouTube videos with annotations for each video. All videos are sampled at one frame per second (fps), and the annotations are frame-based. Amazon Mechanical Turk (AMT) was used to produce the frame annotations, which carry query-relevance scores and text-based labels [21].
27. MVS1K: Published in 2019, this dataset includes 10 queries with around 1000 videos. The videos were crawled from YouTube, together with relevant web images, manual annotations, and video tags. The queries were selected from Wikipedia News and cover 10 hot events from 2011 to 2016. For each query, about 100 videos were collected from YouTube, each lasting up to 4 min. The videos were annotated with human judgments [71].
28. TRECVID 2001: This dataset was created in 2001. It includes TV and documentary videos. Videos are either short (less than 2 min) or long (longer than 15 min) [59].
29. INRIA Holiday: This dataset includes mostly high-resolution personal vacation photos covering a very wide variety of scene types. Each individual scene is represented by one of the 500 image groups contained in the dataset [94].
30. ImageNet: This dataset was published in 2009. It includes 5247 categories with 3.2 million annotated images in total [95].
31. UCF101: This dataset was published in 2012. It contains more than 13K clips downloaded from YouTube. The videos are in AVI format with a frame rate of 25 fps. The videos are divided into 5 types and include 101 action categories [96].
32. Relevance and Diversity Dataset (RAD): This dataset was created in 2017. It is a query-specific dataset that includes 200 videos annotated with query-specific relevance and diversity labels. Each video was retrieved in response to a different query; the top queries between 2008 and 2016 were pulled from YouTube using seed queries in 22 different categories. Each video is between 2 and 3 min long and is sampled at one frame per second [97].
33. FineGym: Published in 2020, this dataset includes 156 high-resolution (720p and 1080p) gymnastics videos from YouTube, each about 10 min long. The annotations provided in FineGym concern the fine-grained recognition of human actions in gymnastics [98].
34. Disney: This dataset was published in 2012. The videos were collected by more than 25 people using a head-mounted GoPro camera and were sampled at 30 fps. The dataset includes 8 topics, and the total duration of the videos is more than 42 h. Frames were extracted at 15 fps, yielding more than 2 million images in total. The start and end times of the intervals corresponding to each type of social interaction were manually labeled throughout the videos [99].
35.
36. New YouTube: Includes 25 videos in 5 categories: food, urban places, natural scenes, animals, and landmarks [51].
37. YouTube Highlights: This dataset was published in 2014. It includes YouTube videos from six different domains: skiing, surfing, parkour, dogs, gymnastics, and skating. Each domain contains about 100 videos of varying lengths, and the total duration of all videos is 1430 min. The dataset has been labeled using Amazon Mechanical Turk for evaluation and analysis purposes [101].
7. Evaluation
7.1. Characteristics of Well-Crafted Summary
- Representativeness: The summary should effectively convey the main points of the original video. Although some information may be lost in the process of summarizing, it is essential to create a concise summary that clearly distinguishes the original video from others. Each video consists of various sub-videos, each presenting a distinct storyline. Therefore, the ideal summary is a blend of sub-videos that closely represents all narratives within the original video while remaining distinct from sub-videos in other videos [56].
- Sparsity: Original videos may come from various sources, including video news, a video blog, long online streams, or a television program. In all instances, the preferred synopsis should be brief, typically lasting only a few seconds, as that is the typical amount of time a viewer will invest before deciding whether to engage further or move on. Although for short videos the ideal summary might be approximately 15% of the total duration, this proportion may decrease considerably for longer videos [56].
- Diversity: Another essential characteristic of a video summary is the diversity present in its frames. While it may be feasible to generate a summary for certain videos and datasets that predominantly consists of similar frames and clips, it is still preferable to have a summary that offers a broader range of visual content. Although a viewer can grasp the essence of the video by examining a single frame, it is advantageous to have a summary that encompasses a diverse set of visual information [56].
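The three desiderata above can be given rough quantitative proxies. The sketch below is one possible set of such checks, not a standard metric from the cited works: representativeness as the average similarity of each frame to its closest summary frame, sparsity as closeness to an assumed 15% target length, and diversity as mean pairwise dissimilarity among summary frames. Per-frame embeddings from any encoder are assumed.

```python
import numpy as np

def summary_quality(frame_feats, selected, target_ratio=0.15):
    """Rough proxies for representativeness, sparsity, and diversity of a summary."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    s = f[selected]

    # representativeness: every frame should resemble some summary frame
    representativeness = (f @ s.T).max(axis=1).mean()

    # sparsity: how close the summary length is to the target ratio (1.0 = on target)
    ratio = len(selected) / len(frame_feats)
    sparsity = 1.0 - abs(ratio - target_ratio) / target_ratio

    # diversity: mean pairwise dissimilarity among summary frames
    sim = s @ s.T
    mask = ~np.eye(len(selected), dtype=bool)
    diversity = (1.0 - sim)[mask].mean() if len(selected) > 1 else 0.0

    return {"representativeness": representativeness,
            "sparsity": sparsity,
            "diversity": diversity}

# toy usage: score a random 15-frame summary of a 100-frame video
rng = np.random.default_rng(4)
feats = rng.normal(size=(100, 64))
print(summary_quality(feats, list(range(0, 100, 7))))
```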
7.2. Datasets
7.3. Evaluation Metrics and Parameters
7.3.1. Objective Metrics
- : Denotes the number of matching keyframes in the automated summary (AS).
- : Denotes the number of non-matching keyframes in the automated summary.
- : Denotes the number of keyframes in the user manual summary (US).
- TP (true positive): Denotes the number of keyframes extracted by the summarization method that exist in the ground truth.
- FP (false positive): Denotes the number of keyframes extracted by the summarization method that do not exist in the ground truth.
- FN (false negative): Denotes the number of keyframes existing in the ground truth that are not extracted by the summarization method.
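These counts are typically combined into precision, recall, and F-score. The short sketch below computes them from the standard definitions, using an illustrative example rather than values from any cited study.

```python
def precision_recall_f1(tp, fp, fn):
    """Keyframe-matching metrics computed from the counts defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g., 8 matched keyframes, 2 spurious ones, 4 missed ones
print(precision_recall_f1(tp=8, fp=2, fn=4))   # -> (0.8, 0.667, 0.727)
```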
7.3.2. Subjective Metrics
7.4. Evaluation Measures and Protocols
7.4.1. Evaluating Personalized Video Storyboards
7.4.2. Evaluating Personalized Video Skims
7.4.3. Evaluating Personalized Fast Forward
7.4.4. Evaluating Personalized Multiview
8. Quantitative Comparisons
- The FrameRank (KL-divergence-based) approach performs better on the TVSum dataset than on the SumMe dataset. This unbalanced performance shows that the technique is more suited to TVSum. In contrast, the CSUM-MSVA and CLIP-It approaches have balanced performance, showing high performance on both datasets.
- A very good choice for unsupervised personalized video summarization is the contrastive learning framework that incorporates diversity and representativeness, on which the CSUM-MSVA method is based; CSUM-MSVA is the most effective and top-performing method on both datasets.
- The advanced CLIP-It method (CLIP-Image + Video Caption + Transformer) provides a high F-score on both datasets. The comparison of this method with other configurations (GoogleNet + bi-LSTM, ResNet + bi-LSTM, CLIP-Image + bi-LSTM, CLIP-Image + Video Caption + bi-LSTM, GoogleNet + transformer, ResNet + transformer, CLIP-Image + transformer) in [58] shows its benefits, as it obtains better results in all three settings. In CLIP-It, image and language embeddings obtained from pre-trained networks are fused through language-guided multi-head attention. All frames are processed jointly by a frame-scoring transformer, which predicts frame relevance scores. Using the knapsack algorithm, frame scores are converted into shot scores and the highest-scoring shots are selected (a knapsack sketch follows this list). It is therefore one of the top methods for unsupervised personalized video summarization.
Figure 11 underscores the importance of incorporating language, particularly through dense video captions, for creating generic video summaries with a qualitative example. The video features a woman explaining the process of making a chicken sandwich. The ground truth summary presents scores derived by averaging user annotations and highlights the keyframes that scored highly. The figure then displays the outcomes of the baseline CLIP-Image + Transformer, which relies solely on visual cues without linguistic input. The scoring indicates that frames rated highly in the ground truth also score well in the baseline; however, numerous irrelevant frames also score highly. Consequently, the model selects keyframes where the person is either speaking or eating the sandwich (marked in red), which are not representative of the crucial steps in the video. Introducing language through generated captions mitigates this issue. The final row depicts captions created by the bi-modal transformer (BMT) [107]. These results, from the complete CLIP-It model, show predicted scores aligning more closely with the ground truth, with the highest-scoring keyframes matching those of the ground truth.
Figure 11. Comparison of the reference summary with the outcomes obtained from the CLIP-Image + Transformer and the complete CLIP-It model (CLIP-Image + Video Caption + Transformer). The content is a video demonstrating a cooking procedure. Without captions, the model gives high scores to some irrelevant frames, such as those showing the woman talking or eating, which harms precision. With captions included, the cross-attention mechanism ensures that frames featuring significant actions and objects receive elevated scores [58].
- The multimodal Multigraph Fusion (MGF) method, introduced in the QUASC approach to automatically identify events related to the query from the tag information in the video (such as descriptions and titles), does not lead to a good summary. As a result, the QUASC approach is not competitive with any other approach on the TVSum dataset.
- The Actor–Critic method allows the creation of a satisfactory personalized video summary, as shown by its F-score on the two datasets. 3D convolutional neural networks (3D CNNs) are used to capture spatio-temporal features, and a bi-directional long short-term memory network (BiLSTM) is used to learn from the extracted features. However, the Actor–Critic method is not competitive with the leading CSUM-MSVA and CLIP-It methods.
- Using the reinforced attentive description generator and the query-aware scoring module (QSM) does not seem to help, as the method of Xiao et al. [78] is not competitive with the CLIP-It method on either dataset.
- Balanced performance is achieved both by the method of Xiao et al. [78] and by the CLIP-It method, as neither appears to be tailored to a single dataset.
- The CLIP-It method is a very good choice for supervised personalized video summarization, as no other method achieves competitive F-score values on both datasets.
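The knapsack step mentioned for CLIP-It above is a standard 0/1 knapsack over shots: maximize the total score of the selected shots subject to a total-duration budget. The sketch below is a generic dynamic-programming implementation under the assumption of integer shot durations, not CLIP-It's exact code.

```python
def knapsack_shot_selection(scores, durations, budget):
    """0/1 knapsack: pick shots maximizing total score with total duration <= budget.
    Durations and the budget are whole seconds so they can index the DP table."""
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]                       # skip shot i-1
            if durations[i - 1] <= b:                     # or take it if it fits
                dp[i][b] = max(dp[i][b],
                               dp[i - 1][b - durations[i - 1]] + scores[i - 1])
    # backtrack to recover which shots were selected
    selected, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            selected.append(i - 1)
            b -= durations[i - 1]
    return sorted(selected)

# toy usage: four shots with predicted scores, 60-second summary budget
print(knapsack_shot_selection([0.9, 0.4, 0.7, 0.2], [30, 20, 25, 10], budget=60))
# -> [0, 2]
```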
9. Conclusions and Future Directions
- The personalized video summary techniques can be classified based on the type of personalized summary, on the criteria, on the video domain, on the source of information, on the time of summarization, and on the machine learning technique.
- Depending on the type of methodology, the techniques can be classified into five major categories, which are feature-based video summarization, keyframe selection, shot selection-based approach, video summarization using trajectory analysis, and personalized video summarization using clustering.
- RNNs and CNNs, when integrated into neural networks, can greatly contribute to the production of high-quality personalized video summaries.
- The advancement of current techniques and the invention of new methods in unsupervised personalized video summarization call for increased research efforts, given that supervised summarization methods frequently require extensive datasets with human annotations.
- Machine learning methods tend to outperform conventional techniques, given their enhanced ability to extract efficient features.
- For a personalized summary to be more effective, it is crucial to incorporate user preferences extensively. Merely extracting images from the Web is not enough; additional data such as user preferences, status, tags, and shares need to be extracted. To accomplish this, the use of advanced joint vision-language models along with multimodal information will be a necessity in the future.
- Advancing convolution-based deep learning techniques for image classification, aimed at labeling concept clusters, enhances the quality of summarization while maintaining the precision of image classification. Event detection precision can be significantly improved by developing more intricate and expansive structures with deep neural networks.
- Investigating the potential of architectures using CNNs to improve run-time performance, as query-based summarization methods suffer from an accuracy versus efficiency trade-off.
- Exploring the capacity of query-driven methods for summarization to include the extent to which a user retains an image, entropy, and user focus grounded on visual consistency.
- Further research into summarization techniques that exploit the audio of a clip (still an open problem) to identify objects, subjects, or events, for instance, locating scoring events in sports footage or identifying a suspect in surveillance footage. The user's speech can also be considered as a source of additional information to enhance the video summary.
- The development of personalized unsupervised video summarization approaches that incorporate visual language into the multitask learning framework to accurately map the relationship between query and visual content.
- The absence of a uniform approach to evaluate produced summaries using standard datasets is a concern, as each study may employ its own evaluation technique and dataset. This situation raises questions about the precision of the outcomes from objective comparisons across numerous studies, complicating the comparison of personalized summarization methods. Consequently, to ensure the credibility of the results of objective comparisons and to facilitate the comparison of personalized summary methods, the establishment of shared datasets and evaluation techniques is imperative.
Author Contributions
Funding
Conflicts of Interest
References
- Pritch, Y.; Rav-Acha, A.; Peleg, S. Nonchronological video synopsis and indexing. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1971–1984. [Google Scholar] [CrossRef] [PubMed]
- Rochan, M.; Wang, Y. Video summarization by learning from unpaired data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7902–7911. [Google Scholar]
- Gygli, M.; Grabner, H.; Van Gool, L. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3090–3098. [Google Scholar]
- Panagiotakis, C.; Doulamis, A.; Tziritas, G. Equivalent key frames selection based on iso-content principles. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 447–451. [Google Scholar] [CrossRef]
- Sharghi, A.; Laurel, J.S.; Gong, B. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–27 July 2017; pp. 4788–4797. [Google Scholar]
- Panagiotakis, C.; Papoutsakis, K.; Argyros, A. A graph-based approach for detecting common actions in motion capture data and videos. Pattern Recognit. 2018, 79, 1–11. [Google Scholar] [CrossRef]
- Darabi, K.; Ghinea, G. Personalized video summarization using sift. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, 13–17 April 2015; pp. 1252–1256. [Google Scholar]
- Sachan, P.R.; Keshaveni, N. Utilizing Image Classification based Semantic Recognition for Personalized Video Summarization. Int. J. Electron. Eng. Res. 2017, 9, 15–27. [Google Scholar]
- Panagiotakis, C.; Papadakis, H.; Fragopoulou, P. Personalized video summarization based exclusively on user preferences. In Proceedings of the European Conference on Information Retrieval, Lisbon, Portugal, 14–17 April 2020; pp. 305–311. [Google Scholar]
- Darabi, K.; Ghinea, G. Personalized video summarization based on group scoring. In Proceedings of the 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), Xi’an, China, 9–13 July 2014; pp. 310–314. [Google Scholar]
- Money, A.G.; Agius, H. Video summarisation: A conceptual framework and survey of the state of the art. J. Vis. Commun. Image Represent. 2008, 19, 121–143. [Google Scholar] [CrossRef]
- Tseng, B.L.; Lin, C.Y.; Smith, J.R. Video summarization and personalization for pervasive mobile devices. In Proceedings of the Storage and Retrieval for Media Databases 2002, San Jose, CA, USA, 19–25 January 2002; Volume 4676, pp. 359–370. [Google Scholar]
- Tseng, B.L.; Smith, J.R. Hierarchical video summarization based on context clustering. In Proceedings of the Internet Multimedia Management Systems IV, Orlando, FL, USA, 7–11 September 2003; Volume 5242, pp. 14–25. [Google Scholar]
- Lie, W.N.; Hsu, K.C. Video summarization based on semantic feature analysis and user preference. In Proceedings of the 2008 IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (sutc 2008), Taichung, Taiwan, 11–13 June 2008; pp. 486–491. [Google Scholar]
- Shafeian, H.; Bhanu, B. Integrated personalized video summarization and retrieval. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012; pp. 996–999. [Google Scholar]
- Zhang, Y.; Ma, C.; Zhang, J.; Zhang, D.; Liu, Y. An interactive personalized video summarization based on sketches. In Proceedings of the 12th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry, Hong Kong, China, 17–19 November 2013; pp. 249–258. [Google Scholar]
- Valdés, V.; Martínez, J.M. Introducing RISPlayer: Real-time interactive generation of personalized video summaries. In Proceedings of the 2010 ACM Workshop on Social, Adaptive and Personalized Multimedia Interaction and Access, Firenze, Italy, 29 October 2010; pp. 9–14. [Google Scholar]
- Chen, F.; De Vleeschouwer, C. Automatic production of personalized basketball video summaries from multi-sensored data. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 565–568. [Google Scholar]
- Kannan, R.; Ghinea, G.; Swaminathan, S. What do you wish to see? A summarization system for movies based on user preferences. Inf. Process. Manag. 2015, 51, 286–305. [Google Scholar] [CrossRef]
- Garcia del Molino, A. First person view video summarization subject to the user needs. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 1440–1444. [Google Scholar]
- Huang, J.H.; Worring, M. Query-controllable video summarization. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 242–250. [Google Scholar]
- Katti, H.; Yadati, K.; Kankanhalli, M.; Tat-Seng, C. Affective video summarization and story board generation using pupillary dilation and eye gaze. In Proceedings of the 2011 IEEE International Symposium on Multimedia, Dana Point, CA, USA, 5–7 December 2011; pp. 319–326. [Google Scholar]
- Qayyum, H.; Majid, M.; ul Haq, E.; Anwar, S.M. Generation of personalized video summaries by detecting viewer’s emotion using electroencephalography. J. Vis. Commun. Image Represent. 2019, 65, 102672. [Google Scholar] [CrossRef]
- Varini, P.; Serra, G.; Cucchiara, R. Personalized egocentric video summarization for cultural experience. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Shanghai, China, 23–26 June 2015; pp. 539–542. [Google Scholar]
- Nagar, P.; Rathore, A.; Jawahar, C.; Arora, C. Generating Personalized Summaries of Day Long Egocentric Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 45, 6832–6845. [Google Scholar] [CrossRef]
- Park, H.S.; Cho, S.B. A personalized summarization of video life-logs from an indoor multi-camera system using a fuzzy rule-based system with domain knowledge. Inf. Syst. 2011, 36, 1124–1134. [Google Scholar] [CrossRef]
- Peng, W.T.; Chu, W.T.; Chang, C.H.; Chou, C.N.; Huang, W.J.; Chang, W.Y.; Hung, Y.P. Editing by viewing: Automatic home video summarization by viewing behavior analysis. IEEE Trans. Multimed. 2011, 13, 539–550. [Google Scholar] [CrossRef]
- Ul Haq, I.; Ullah, A.; Muhammad, K.; Lee, M.Y.; Baik, S.W. Personalized movie summarization using deep cnn-assisted facial expression recognition. Complexity 2019, 2019, 3581419. [Google Scholar] [CrossRef]
- Zhou, P.; Xu, T.; Yin, Z.; Liu, D.; Chen, E.; Lv, G.; Li, C. Character-oriented video summarization with visual and textual cues. IEEE Trans. Multimed. 2019, 22, 2684–2697. [Google Scholar] [CrossRef]
- Fei, M.; Jiang, W.; Mao, W. Learning user interest with improved triplet deep ranking and web-image priors for topic-related video summarization. Expert Syst. Appl. 2021, 166, 114036. [Google Scholar] [CrossRef]
- Mujtaba, G.; Malik, A.; Ryu, E.S. LTC-SUM: Lightweight Client-driven Personalized Video Summarization Framework Using 2D CNN. arXiv 2022, arXiv:2201.09049. [Google Scholar] [CrossRef]
- Köprü, B.; Erzin, E. Use of Affective Visual Information for Summarization of Human-Centric Videos. IEEE Trans. Affect. Comput. 2022, 14, 3135–3148. [Google Scholar] [CrossRef]
- Ul Haq, H.B.; Asif, M.; Ahmad, M.B.; Ashraf, R.; Mahmood, T. An Effective Video Summarization Framework Based on the Object of Interest Using Deep Learning. Math. Probl. Eng. 2022, 2022, 7453744. [Google Scholar] [CrossRef]
- Saquil, Y.; Chen, D.; He, Y.; Li, C.; Yang, Y.L. Multiple Pairwise Ranking Networks for Personalized Video Summarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1718–1727. [Google Scholar]
- Darabi, K.; Ghinea, G. Personalized video summarization by highest quality frames. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar]
- Fei, M.; Jiang, W.; Mao, W. Creating personalized video summaries via semantic event detection. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 14931–14942. [Google Scholar] [CrossRef]
- Haq, H.B.U.; Asif, M.; Ahmad, M.B. Video summarization techniques: A review. Int. J. Sci. Technol. Res. 2020, 9, 146–153. [Google Scholar]
- Hannon, J.; McCarthy, K.; Lynch, J.; Smyth, B. Personalized and automatic social summarization of events in video. In Proceedings of the 16th International Conference on Intelligent User Interfaces, Palo Alto, CA, USA, 13–16 February 2011; pp. 335–338. [Google Scholar]
- Chen, F.; Delannay, D.; De Vleeschouwer, C. An autonomous framework to produce and distribute personalized team-sport video summaries: A basketball case study. IEEE Trans. Multimed. 2011, 13, 1381–1394. [Google Scholar] [CrossRef]
- Olsen, D.R.; Moon, B. Video summarization based on user interaction. In Proceedings of the 9th European Conference on Interactive TV and Video, Lisbon, Portugal, 29 June–1 July 2011; pp. 115–122. [Google Scholar]
- Chen, F.; De Vleeschouwer, C.; Cavallaro, A. Resource allocation for personalized video summarization. IEEE Trans. Multimed. 2013, 16, 455–469. [Google Scholar] [CrossRef]
- Chen, F.; De Vleeschouwer, C. Personalized summarization of broadcasted soccer videos with adaptive fast-forwarding. In Proceedings of the International Conference on Intelligent Technologies for Interactive Entertainment, Mons, Belgium, 3–5 July 2013; pp. 1–11. [Google Scholar]
- Kao, C.C.; Lo, C.W.; Lu, K.H. A personal video summarization system by integrating RFID and GPS information for marathon activities. In Proceedings of the 2015 IEEE 5th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), Berlin, Germany, 6–9 September 2015; pp. 347–350. [Google Scholar]
- Sukhwani, M.; Kothari, R. A parameterized approach to personalized variable length summarization of soccer matches. arXiv 2017, arXiv:1706.09193. [Google Scholar]
- Tejero-de Pablos, A.; Nakashima, Y.; Sato, T.; Yokoya, N.; Linna, M.; Rahtu, E. Summarization of user-generated sports video by using deep action recognition features. IEEE Trans. Multimed. 2018, 20, 2000–2011. [Google Scholar] [CrossRef]
- Niu, J.; Huo, D.; Wang, K.; Tong, C. Real-time generation of personalized home video summaries on mobile devices. Neurocomputing 2013, 120, 404–414. [Google Scholar] [CrossRef]
- Basavarajaiah, M.; Sharma, P. Survey of compressed domain video summarization techniques. ACM Comput. Surv. (CSUR) 2019, 52, 1–29. [Google Scholar] [CrossRef]
- Sharghi, A.; Gong, B.; Shah, M. Query-focused extractive video summarization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 3–19. [Google Scholar]
- Tiwari, V.; Bhatnagar, C. A survey of recent work on video summarization: Approaches and techniques. Multimed. Tools Appl. 2021, 80, 27187–27221. [Google Scholar] [CrossRef]
- Chung, C.T.; Hsiung, H.K.; Wei, C.K.; Lee, L.S. Personalized video summarization based on Multi-Layered Probabilistic Latent Semantic Analysis with shared topics. In Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, Singapore, 12–14 September 2014; pp. 173–177. [Google Scholar]
- Yin, Y.; Thapliya, R.; Zimmermann, R. Encoded semantic tree for automatic user profiling applied to personalized video summarization. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 181–192. [Google Scholar] [CrossRef]
- Baghel, N.; Raikwar, S.C.; Bhatnagar, C. Image Conditioned Keyframe-Based Video Summarization Using Object Detection. arXiv 2020, arXiv:2009.05269. [Google Scholar]
- Dong, Y.; Liu, C.; Shen, Z.; Han, Y.; Gao, Z.; Wang, P.; Zhang, C.; Ren, P.; Xie, X. Personalized Video Summarization with Idiom Adaptation. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1041–1043. [Google Scholar]
- Dong, Y.; Liu, C.; Shen, Z.; Gao, Z.; Wang, P.; Zhang, C.; Ren, P.; Xie, X.; Yu, H.; Huang, Q. Domain Specific and Idiom Adaptive Video Summarization. In Proceedings of the ACM Multimedia Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
- Xiao, S.; Zhao, Z.; Zhang, Z.; Yan, X.; Yang, M. Convolutional hierarchical attention network for query-focused video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12426–12433. [Google Scholar]
- Sosnovik, I.; Moskalev, A.; Kaandorp, C.; Smeulders, A. Learning to Summarize Videos by Contrasting Clips. arXiv 2023, arXiv:2301.05213. [Google Scholar]
- Choi, J.; Oh, T.H.; Kweon, I.S. Contextually customized video summaries via natural language. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1718–1726. [Google Scholar]
- Narasimhan, M.; Rohrbach, A.; Darrell, T. CLIP-It! language-guided video summarization. Adv. Neural Inf. Process. Syst. 2021, 34, 13988–14000. [Google Scholar]
- Gunawardena, P.; Sudarshana, H.; Amila, O.; Nawaratne, R.; Alahakoon, D.; Perera, A.S.; Chitraranjan, C. Interest-oriented video summarization with keyframe extraction. In Proceedings of the 2019 19th International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 2–5 September 2019; Volume 250, pp. 1–8. [Google Scholar]
- Tseng, B.L.; Lin, C.Y.; Smith, J.R. Video personalization and summarization system for usage environment. J. Vis. Commun. Image Represent. 2004, 15, 370–392. [Google Scholar] [CrossRef]
- Lei, Z.; Zhang, C.; Zhang, Q.; Qiu, G. FrameRank: A text processing approach to video summarization. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 368–373. [Google Scholar]
- Varini, P.; Serra, G.; Cucchiara, R. Personalized egocentric video summarization of cultural tour on user preferences input. IEEE Trans. Multimed. 2017, 19, 2832–2845. [Google Scholar] [CrossRef]
- Liu, H.; Sun, F.; Zhang, X.; Fang, B. Interactive video summarization with human intentions. Multimed. Tools Appl. 2019, 78, 1737–1755. [Google Scholar] [CrossRef]
- Zhang, Y.; Kampffmeyer, M.; Liang, X.; Tan, M.; Xing, E.P. Query-conditioned three-player adversarial network for video summarization. arXiv 2018, arXiv:1807.06677. [Google Scholar]
- Zhang, Y.; Kampffmeyer, M.; Zhao, X.; Tan, M. Deep reinforcement learning for query-conditioned video summarization. Appl. Sci. 2019, 9, 750. [Google Scholar] [CrossRef]
- Ghinea, G.; Kannan, R.; Swaminathan, S.; Kannaiyan, S. A novel user-centered design for personalized video summarization. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar]
- Kannan, R.; Ghinea, G.; Swaminathan, S.; Kannaiyan, S. Improving video summarization based on user preferences. In Proceedings of the 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), Jodhpur, India, 18–21 December 2013; pp. 1–4. [Google Scholar]
- del Molino, A.G.; Boix, X.; Lim, J.H.; Tan, A.H. Active video summarization: Customized summaries via on-line interaction with the user. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
- Jiang, P.; Han, Y. Hierarchical variational network for user-diversified & query-focused video summarization. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 202–206. [Google Scholar]
- Cizmeciler, K.; Erdem, E.; Erdem, A. Leveraging semantic saliency maps for query-specific video summarization. Multimed. Tools Appl. 2022, 81, 17457–17482. [Google Scholar] [CrossRef]
- Ji, Z.; Ma, Y.; Pang, Y.; Li, X. Query-aware sparse coding for web multi-video summarization. Inf. Sci. 2019, 478, 152–166. [Google Scholar] [CrossRef]
- Yoshitaka, A.; Sawada, K. Personalized video summarization based on behavior of viewer. In Proceedings of the 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, Sorrento, Italy, 25–29 November 2012; pp. 661–667. [Google Scholar]
- Namitha, K.; Narayanan, A.; Geetha, M. Interactive visualization-based surveillance video synopsis. Appl. Intell. 2022, 52, 3954–3975. [Google Scholar] [CrossRef]
- Otani, M.; Nakashima, Y.; Sato, T.; Yokoya, N. Video summarization using textual descriptions for authoring video blogs. Multimed. Tools Appl. 2017, 76, 12097–12115. [Google Scholar] [CrossRef]
- Miniakhmetova, M.; Zymbler, M. An approach to personalized video summarization based on user preferences analysis. In Proceedings of the 2015 9th International Conference on Application of Information and Communication Technologies (AICT), Rostov on Don, Russia, 14–16 October 2015; pp. 153–155. [Google Scholar]
- Seera, M.; Lim, C.P. Transfer learning using the online fuzzy min–max neural network. Neural Comput. Appl. 2014, 25, 469–480. [Google Scholar] [CrossRef]
- Plummer, B.A.; Brown, M.; Lazebnik, S. Enhancing video summarization via vision-language embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5781–5789. [Google Scholar]
- Xiao, S.; Zhao, Z.; Zhang, Z.; Guan, Z.; Cai, D. Query-biased self-attentive network for query-focused video summarization. IEEE Trans. Image Process. 2020, 29, 5889–5899. [Google Scholar] [CrossRef]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863. [Google Scholar] [CrossRef]
- Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VII 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 505–520. [Google Scholar]
- Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5179–5187. [Google Scholar]
- Lee, Y.J.; Ghosh, J.; Grauman, K. Discovering important people and objects for egocentric video summarization. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1346–1353. [Google Scholar]
- Yeung, S.; Fathi, A.; Fei-Fei, L. VideoSET: Video summary evaluation through text. arXiv 2014, arXiv:1406.5824. [Google Scholar]
- Geisler, G.; Marchionini, G. The open video project: Research-oriented digital video repository. In Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, TX, USA, 2–7 June 2000; pp. 258–259. [Google Scholar]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
- Zhang, Y.; Xie, Y.; Zhang, Y.; Dai, Y.; Ren, F. VSSum: A Virtual Surveillance Dataset for Video Summary. In Proceedings of the 5th International Conference on Control and Computer Vision, Xiamen, China, 19–21 August 2022; pp. 113–119. [Google Scholar]
- De Avila, S.E.F.; Lopes, A.P.B.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognit. Lett. 2011, 32, 56–68. [Google Scholar] [CrossRef]
- Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
- Tejero-de Pablos, A.; Nakashima, Y.; Sato, T.; Yokoya, N. Human action recognition-based video summarization for RGB-D personal sports video. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar]
- Gautam, S.; Kaur, P.; Gangadharappa, M. An Overview of Human Activity Recognition from Recordings. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; pp. 921–928. [Google Scholar]
- Nitta, N.; Takahashi, Y.; Babaguchi, N. Automatic personalized video abstraction for sports videos using metadata. Multimed. Tools Appl. 2009, 41, 1–25. [Google Scholar] [CrossRef]
- Ringeval, F.; Sonderegger, A.; Sauer, J.; Lalanne, D. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In Proceedings of the 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–8. [Google Scholar]
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
- Jegou, H.; Douze, M.; Schmid, C. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; Proceedings, Part I 10; Springer: Berlin/Heidelberg, Germany, 2008; pp. 304–317. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Vasudevan, A.B.; Gygli, M.; Volokitin, A.; Van Gool, L. Query-adaptive video summarization via quality-aware relevance estimation. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 582–590. [Google Scholar]
- Shao, D.; Zhao, Y.; Dai, B.; Lin, D. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2616–2625. [Google Scholar]
- Fathi, A.; Hodgins, J.K.; Rehg, J.M. Social interactions: A first-person perspective. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1226–1233. [Google Scholar]
- Poleg, Y.; Ephrat, A.; Peleg, S.; Arora, C. Compact CNN for indexing egocentric videos. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9. [Google Scholar]
- Sun, M.; Farhadi, A.; Seitz, S. Ranking domain-specific highlights by analyzing edited videos. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 787–802. [Google Scholar]
- Mirzasoleiman, B.; Jegelka, S.; Krause, A. Streaming non-monotone submodular maximization: Personalized video summarization on the fly. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Garcia del Molino, A.; Lim, J.H.; Tan, A.H. Predicting visual context for unsupervised event segmentation in continuous photo-streams. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 10–17. [Google Scholar]
- Kendall, M.G. The treatment of ties in ranking problems. Biometrika 1945, 33, 239–251. [Google Scholar] [CrossRef]
- Zwillinger, D.; Kokoska, S. CRC Standard Probability and Statistics Tables and Formulae; CRC Press: Boca Raton, FL, USA, 1999. [Google Scholar]
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Iashin, V.; Rahtu, E. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. arXiv 2020, arXiv:2005.08271. [Google Scholar]
Domain | Papers |
---|---|
Cultural heritage | [62] |
Movies/serial clips | [19,28,29,36,50] |
Office videos | [26] |
Personal videos | [53,54] |
Sports video | [18,36,38,39,40,41,42,43,44,45] |
Zoo videos | [8] |
Dataset Name | Year | Short Description | Num. Vids | Length per Video (Min.)
---|---|---|---|---
OVP [84] | 1999 | Home, entertainment and news videos | 50 |
Sports-1M [85] | 2014 | YouTube sports videos | | 5.6 (average)
UTE [82] | 2012 | Egocentric videos capturing physical daily activities | 10 |
SumMe [80] | 2014 | Sport, event and holiday videos | 25 |
TVSum [81] | 2015 | Documentaries, news and egocentric videos | 50 |
FineGym [98] | 2020 | YouTube gymnastics videos | 156 | 10
Method | Dataset | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SumMe | TVSum | UGSum | TVC | UT Egocentric | TV Episodes | OVP | Youtube Action | VSUMM | Sports-1M | Hollywood-2 | Query-Focused | ActivityNet | FineGym | Flickr | UGSum52 | Disney | MVS1K | HUJI | |
Shafeian and Bhanu, 2012 [15] | x | ||||||||||||||||||
Sharghi et al., 2016 [48] | x | x | |||||||||||||||||
Yin et al., 2016 [51] | x | x | |||||||||||||||||
Plummer et al., 2017 [77] | x | x | |||||||||||||||||
Choi et al., 2018 [57] | x | x | |||||||||||||||||
Mirzasoleiman et al., 2018 [102] | x | x | |||||||||||||||||
Tejero-de-Pablos et al., 2018 [45] | x | x | |||||||||||||||||
Zhang et al., 2018 [64] | x | ||||||||||||||||||
Dong et al., 2019 [53] | x | ||||||||||||||||||
Dong et al., 2019 [54] | x | x | |||||||||||||||||
FrameRank, 2019 [61] | x | x | x | x | x | ||||||||||||||
Gunawardena et al., 2019 [59] | x | ||||||||||||||||||
Jiang and Han, 2019 [69] | x | ||||||||||||||||||
Ji et al., 2019 [71] | x | x | |||||||||||||||||
Qayyum et al., 2019 [23] | x | ||||||||||||||||||
Xiao et al., 2020 [78] | x | x | x | ||||||||||||||||
Nagar et al., 2021 [25] | x | x | x | x | x | x | |||||||||||||
Narasimhan et al., 2021 [58] | x | x | x | ||||||||||||||||
Saquil et al., 2021 [34] | x | x | x | ||||||||||||||||
Köprü and Erzin, 2022 [32] | x | ||||||||||||||||||
Mujtaba et al., 2022 [31] | x | ||||||||||||||||||
Ul et al., 2022 [33] | x | x | |||||||||||||||||
Sosnovik et al., 2023 [56] | x | x | x |
Method | F-Score |
---|---|
FrameRank (KL divergence based) [61] | 0.453 |
Actor–Critic [25] | 0.464 |
CLIP-It [58] | 0.525 |
CSUM-MSVA [56] | 0.582 |
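For context, the F-scores compared above follow the overlap-based protocol that is standard in video summarization evaluation: a generated summary is scored by the precision and recall of its selected frames (or shots) against a reference summary, and the two are combined into an F-score. The snippet below is a minimal, generic Python sketch of that computation; the function name and the frame indices in the usage example are illustrative only, and benchmark-specific details such as shot-level segmentation, summary-length budgets, and averaging over multiple human annotators are omitted.

```python
# Generic summary F-score: overlap between a predicted and a reference
# selection of frame indices. This is a simplified sketch, not the exact
# protocol of any individual method cited above.

def summary_f_score(predicted: set[int], reference: set[int]) -> float:
    """F-score between predicted and reference sets of selected frame indices."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)   # frames kept by both summaries
    precision = overlap / len(predicted)   # fraction of predicted frames that are relevant
    recall = overlap / len(reference)      # fraction of reference frames that are recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the reference summary keeps frames 2-5 and the model selects frames 3-6.
print(round(summary_f_score({3, 4, 5, 6}, {2, 3, 4, 5}), 3))  # 0.75
```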