Article

DanceCaps: Pseudo-Captioning for Dance Videos Using Large Language Models

1 Music and Audio Research Group, Department of Intelligence and Information, Seoul National University, Seoul 08826, Republic of Korea
2 Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul 08826, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10116; https://doi.org/10.3390/app142210116
Submission received: 5 October 2024 / Revised: 24 October 2024 / Accepted: 28 October 2024 / Published: 5 November 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In recent years, the dance field has been able to create diverse content by leveraging technical advancements such as deep learning models, producing content that goes beyond the unique artistic creations only humans can make. However, in terms of dance data, there is still a lack of video–label datasets, as well as of datasets that contain multiple tags per video. To address this gap, this paper explores the feasibility of generating dance captions from tags using a pseudo-captioning approach, inspired by the significant improvements large language models (LLMs) have shown in other domains. Various tags are generated from features extracted from videos and audio, and LLMs are then instructed to produce dance captions based on these tags. Captions were generated using both an open dance dataset and Internet dance videos, followed by user evaluations of randomly sampled captions. Participants found the captions effective in describing dance movements, of expert quality, and consistent with the video content. Additionally, positive feedback was received on the evaluation of the frame extraction interval and the inclusion of tag data. This paper introduces and validates a novel pseudo-captioning method for generating dance captions using predefined tags, contributing to the expansion of data available for dance research and offering a practical solution to the current lack of datasets in this field.

1. Introduction

Dance encompasses both visual information from movements and auditory information from music, making it essential to consider these multimodal characteristics in dance research. With recent advancements in deep learning, the dance field has rapidly evolved. In particular, dance generation research has significantly progressed through the use of deep learning models such as RNNs [1], transformers [2], and diffusion [3] models, leading to the creation of more sophisticated dance movements. Despite these advancements, there remains a lack of comprehensive datasets containing the necessary information for training and evaluating these models. As in other fields, dance research requires not only video data, but also datasets that include tags or captions that capture detailed information about the dance itself.
In contrast, vision language models utilize publicly available image-label paired datasets such as MS-COCO Captions [4], Flickr30k Captions [5], VQA 2.0 [6], ImageNet [7], and LAION-5B [8] to address tasks like image captioning, text-to-image generation, and visual question answering. Additionally, CLIP [9] and BLIP [10] are multimodal models that simultaneously understand images and text, having been trained on large-scale web data consisting of image–text pairs. Similarly, audio language models rely on audio–label paired datasets like AudioCaps [11], Clotho [12], and MACS [13] for tasks including audio captioning, text-to-audio generation, and audio question answering. Both vision and audio research benefit from the availability of open or private datasets that facilitate ongoing model training and evaluation through benchmarking. Dance research, however, has lagged behind these traditional fields, with fewer open datasets and limited label information.
The most commonly used open dance datasets in dance research currently include the AIST dance dataset [14], the AIST++ dance dataset [15], and the AniDance dance dataset [16], with the AIST++ dance dataset being the most widely utilized. However, these datasets mainly focus on the joint coordinates of dance motions and provide limited labeling, so more specific research requires the creation of research-specific datasets. Examples of such study-specific datasets include those of Yao et al. [17], who used the game ‘Just Dance’ as a dataset; Sun et al. [18] and Lee et al. [19], who utilized YouTube videos of specific dance genres; and Huang et al. [20], who utilized regional music features. These private datasets often lack transparency about their composition, quality, and features, making it difficult to analyze the models trained on them and the results obtained. Furthermore, the lack of datasets focused on dance-related labels, tags, or dance captions makes it more difficult to conduct research related to dance labels.
Therefore, the development of a dance caption dataset with paired dance movements and natural language descriptions could significantly expand research possibilities. Unlike previous methods requiring direct annotation of dance–caption pairs, recent advancements in LLM and pseudo-captioning techniques present new opportunities for generating such paired datasets. These technological advancements are expected to facilitate the creation of dance caption datasets, enhancing the scope and depth of choreographic research.
This paper focuses on generating captions for dance videos that lack existing labels or tags by leveraging multiple information sources. We propose a pseudo-captioning method that generates captions by selecting meaningful tags according to the goal. First, the visual and audio features extracted from the dance videos are classified into four groups: movement, dance, music, and others, and each category is subdivided into detailed tags and extracted. Then, we propose a method to generate three types of captions using LLMs by utilizing the extracted tags and instructions suitable for the purpose of each caption. Ultimately, the generated captions aim to contribute to caption development more comprehensively by enabling generation from open dance datasets and dance videos on the Internet.
The main objectives of this research are as follows: (1) To propose a method for generating a large amount of dance-related tags using video and audio analysis techniques; (2) to define appropriate instructions for LLM to generate accurate captions based on tags; and (3) to evaluate the quality of captions generated from open dance datasets and Internet dance videos.

2. Related Work

2.1. Dance Dataset

One of the main open dance datasets is the AniDance dataset, which focuses on choreography. It consists of 40 choreographies captured using optical motion capture equipment (Vicon). While Vicon provides high accuracy for capturing movements, the dataset includes a relatively small amount of choreography data, which may present challenges when used for model training. In contrast, the AIST dataset differs from general human activity datasets such as COCO [4] or KIT [21] by focusing solely on dance movements. It features data from 10 dance genres, captured in a controlled studio environment using multiple cameras, allowing for the extraction of clean motion data. The dataset provides detailed information such as the dance genre, specific choreography steps, and camera positions. However, a limitation of the AIST dataset is that it only offers video data, requiring researchers to separately extract motion poses, and it includes only 60 music tracks, which is relatively few compared to the number of videos available. The AIST++ dataset is an extension of the AIST dataset, offering refined joint data in an open format, but its accuracy may be lower compared to the AniDance dataset. In summary, the number of open dance datasets remains limited, and the labels available for use vary across datasets and are often quite sparse.

2.2. Pseudo-Captioning

In the field of computer vision, there is a growing body of work leveraging LLMs [22,23] across various applications such as visual question answering, image captioning, and text-to-image generation. For instance, Rotstein et al. [24] enhanced image captioning by refining critical image details using object detectors and attribute recognizers, and then fusing these details with original captions via LLM. This method was able to process 12 million image-caption pairs to automatically improve the caption quality. Similarly, Bianco et al. [25] addressed the issue of generic captions by using multiple models to rank generated captions and then merging the top two with an LLM for a more descriptive output. In the audio domain, a comparable trend is emerging, with LLMs being applied to tasks like audio question answering [26,27] and text-to-audio captioning [28,29,30]. The expansion of audio datasets has driven performance improvements in these areas. For example, Mei et al. [31] addressed the limitations of current audio–language datasets by utilizing an LLM to filter and refine noisy data sourced from the internet, resulting in a dataset of 0.4 million audio clips paired with captions. Likewise, Doh et al. [32] tackled the challenge of small and costly music–language datasets by generating 2.2 million captions using existing music tags and LLMs. These studies underscore the effectiveness of LLMs in enhancing existing captions or generating new ones, showcasing their promising performance across both visual and audio domains.

3. Pseudo-Captioning Based Dance Caption Generation

This section outlines the process of defining tags that capture various components of a dance video, such as movements, dance, and music, and utilizing an LLM to generate captions based on these tags. Tags are identified as key descriptors that represent the distinctive features of a dance video and are crucial for generating meaningful captions. As shown in Figure 1, the first step is to carefully select and define effective tags for dance captioning. Tags are extracted separately for dance, movement, music, dancers, and other relevant aspects, ensuring that each tag accurately encapsulates the attributes of its respective component. Additionally, the structure of captions that best describe the dance video is established, along with the corresponding LLM instructions for generating captions using the predefined tags. This method allows for precise descriptions of any dance video, producing captions tailored to specific objectives.

3.1. Tags

Tags function as concise representations of the key characteristics or essential information required for generating captions for a dance video using an LLM. They serve as the foundational components from which captions are constructed. Consequently, selecting tags that accurately describe a video is crucial for producing effective captions. By extracting relevant information as tags, rather than directly analyzing the video, the quality of the generated captions can be enhanced, while also significantly reducing the generation time. Since dance is a multidisciplinary art form, it encompasses various components such as dance movements, music, dancers, and environments. Therefore, accurately capturing a dance requires considering not only the characteristics of the dance itself but also the sequence of movements, background music, the dancers, and the surrounding environment. The following subsections detail the process of extracting and defining tags for each of these components. Table 1 and Table 2 provide an overview of the categories, tag names, and corresponding examples for each category.

3.1.1. Motion Tags

Motion tags are tags that describe a dancer’s movements over time, encompassing both the physical changes in body movement and the detailed linguistic descriptions of the dancer’s actions throughout the sequence. These tags not only visually represent the progression of the dance over time but also capture the overall dynamics of the movements.
The physical variation in body movement over time is tracked by monitoring the positions of a dancer’s joints in video frames, which not only captures the overall characteristics of the body’s motion but also indirectly reflects the tempo and mood of the entire dance. To achieve this, the joint positions of the dancer in each video frame must first be detected, and in this study, OpenPose [33] was used for joint position estimation. The positions of 14 joints, as shown in Figure 2, were extracted, and the speed and acceleration of each joint over time were calculated based on these position values. To capture the overall characteristics of the body’s movement, the speed and acceleration values of all joints were averaged to derive the body’s mean speed and acceleration. These values were then averaged over time to calculate the body’s overall mean speed and acceleration. Rather than directly using the calculated average speed and acceleration, these values were translated into descriptive terms that could be intuitively utilized by the LLM to generate captions. The mean speed and acceleration values were categorized into five groups each and used as tags. For category selection, the average speed and acceleration of the dancer were computed for all single-dancer videos in the AIST++ dataset, and the 20th, 40th, 60th, and 80th percentiles of this distribution were chosen as thresholds. Based on these thresholds, speed was divided into five categories: rapid, high, moderate, low, and slow, while acceleration was categorized as rapid, high, moderate, gradual, or slight.
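A minimal sketch of this computation is shown below, assuming the OpenPose joint coordinates are already available as a NumPy array and that the percentile thresholds have been precomputed from the AIST++ distribution; the function name, signature, and variable names are illustrative rather than the authors' exact code.

```python
import numpy as np

SPEED_LABELS = ["slow", "low", "moderate", "high", "rapid"]
ACCEL_LABELS = ["slight", "gradual", "moderate", "high", "rapid"]

def body_motion_tags(joints, fps, speed_thresholds, accel_thresholds):
    """joints: (T, 14, 2) array of per-frame joint positions from OpenPose.
    *_thresholds: 20/40/60/80th percentiles of the dataset-wide distribution."""
    joints = np.asarray(joints, dtype=float)
    velocity = np.diff(joints, axis=0) * fps          # (T-1, 14, 2) per-joint velocity
    speed = np.linalg.norm(velocity, axis=-1)         # (T-1, 14) per-joint speed
    accel = np.abs(np.diff(speed, axis=0)) * fps      # (T-2, 14) per-joint acceleration

    # Averaging over joints and then over time reduces each video to one value.
    mean_speed = float(speed.mean())
    mean_accel = float(accel.mean())

    # Map the scalars to the five descriptive categories used as tags.
    speed_tag = SPEED_LABELS[int(np.searchsorted(speed_thresholds, mean_speed))]
    accel_tag = ACCEL_LABELS[int(np.searchsorted(accel_thresholds, mean_accel))]
    return speed_tag, accel_tag
```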
In addition to the physical calculation of body movement, temporal motion tags were extracted to describe the dancer’s detailed movements in text. To achieve this, video captioning algorithms were utilized. However, the algorithms proposed by Krishna et al. [34] and Iashin et al. [35] focus more on describing the task depicted in the video rather than the temporal details of the movement. For example, when inputting a dance performance video, the resulting captions tend to be simple descriptions like “A man is dancing and singing”, “A man is dancing in a music video”, or “A video game is being played”, focusing on the fact that the video features a man dancing. Moreover, these captions often include false information such as “singing”, “video game is being played”, or “music video”, which are unrelated to the actual content. While CLIP-based models [36,37] improved the precision of frame-level descriptions, they still focus on the overall video description rather than capturing the temporal dynamics of the movement, making them unsuitable for motion tag extraction. As a result, motion-to-text algorithms [38] have been used instead of general video captioning to describe the motion. However, these methods have also faced challenges, as they primarily focus on classifying the type of motion based on predefined motion categories rather than providing detailed temporal descriptions. Additionally, there is a lack of detailed motion categories specifically for dance, leading to vague descriptions like “dancing” or misclassification into unrelated motion categories.
Therefore, an LLM was utilized to extract temporal motion tags. It was observed that depending on how the instructions were designed, the LLM could describe the dancer’s body movements over time rather than simply stating the motion category. Along with the video frames, an instruction like, “The image is a series of dance movements that change over time, from the starting dance movement on the left to the right. Please list the dance moves of the dancers in the image that go well with the music in chronological order, and briefly explain the dance moves. Please express it in one paragraph.” was provided to the LLM. This approach yielded results as shown in Table 2 and Figure 3. As seen in Figure 3, the temporal motion tags effectively describe the dancer’s movements over time in sequential order.
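A hedged sketch of how such a request could be issued with the OpenAI Python client is shown below. The instruction text is the one quoted above, but the tiling of the sampled frames into a single left-to-right image strip and the client configuration are assumptions, not the authors' documented pipeline.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTION = (
    "The image is a series of dance movements that change over time, from the "
    "starting dance movement on the left to the right. Please list the dance "
    "moves of the dancers in the image that go well with the music in "
    "chronological order, and briefly explain the dance moves. "
    "Please express it in one paragraph."
)

def temporal_motion_tag(frame_strip_path: str) -> str:
    """frame_strip_path: image containing the sampled frames tiled left to right."""
    with open(frame_strip_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": INSTRUCTION},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```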
Similar to CLIP-based video captioning algorithms, the process of extracting temporal motion tags using LLMs is costly due to the use of video frames for captioning. Utilizing all frames from the video for captioning inherently introduces redundancy, and duplicated embeddings or captions can negatively affect the final caption quality. Therefore, it is necessary to sample frames in a way that sufficiently captures temporal changes in the motion while minimizing unnecessary frames. Dance is inherently linked to the rhythm and beat of the accompanying music [39,40,41], and the positions of the beats in the music serve as important markers that divide and define key choreography movements. However, if captioning is performed using only the frames corresponding to the beat positions, information about the transitional movements between beats may be lost, so it is also necessary to include inter-beat frames. In the proposed method, frames corresponding to beat positions and evenly spaced intervals between beats are sampled, as shown in Figure 4. The optimal frame selection interval is experimentally determined to achieve the best results.
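The following sketch illustrates this beat-anchored sampling, assuming beat times in seconds are already available from the music analysis; subdivisions=2 corresponds to half-beat sampling, and the function name is illustrative.

```python
import numpy as np

def sample_frame_indices(beat_times, fps, subdivisions=2):
    """Keep frames at beat positions plus evenly spaced frames between beats."""
    beat_times = np.asarray(beat_times, dtype=float)
    times = []
    for t0, t1 in zip(beat_times[:-1], beat_times[1:]):
        # `subdivisions` sample points per beat interval, starting at the beat itself.
        times.extend(np.linspace(t0, t1, subdivisions, endpoint=False))
    times.append(beat_times[-1])  # keep the final beat frame
    return sorted({int(round(t * fps)) for t in times})

# Example: beats at 1.22 s, 4.55 s, and 7.88 s in a 30 fps video, half-beat sampling.
print(sample_frame_indices([1.22, 4.55, 7.88], fps=30, subdivisions=2))
```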

3.1.2. Dance Tags

Dance tags describe the characteristics of choreography in the video, independent of other elements such as body movements, dancers, location, or music. Rather than focusing on detailed movements, these tags represent broader and more general features like the type and genre of dance, providing a summary of the choreography’s key characteristics. In the proposed method, two tags are extracted and used: choreography steps and dance genre.
Choreography steps are fundamental building blocks in dance composition, representing specific sequences of movements that form the core of a dance routine. These steps encompass a wide range of elements, including spatial patterns, body shapes, timing, and energy dynamics, which are carefully combined to convey emotional or narrative content within the dance. Examples of common choreography steps include “walking out step” and “Brooklyn step”. To extract these choreography steps, the metadata of the choreography video is utilized. For in-the-wild choreography videos, metadata typically includes the video title, description, and hashtags, which often provide information about the choreography or music featured in the video. However, the metadata does not always include choreography steps in sufficient detail to build a classifier. When choreography steps are present in the metadata, they are extracted and used, but if they are not, the choreography steps are not extracted separately.
For the dance genre, the method uses the genres from the AIST++ dataset, which include Break, Pop, Lock, Waack, Hip-hop, House, Krump, Street Jazz, and Ballet. K-pop, unlike traditional dance genres, is characterized by its fusion of multiple genres, and therefore is treated as a separate genre, bringing the total number of genres to 10. Similar to choreography steps, if the genre is tagged in the video’s metadata, it is extracted and used as a tag. If the metadata does not contain genre information, a classifier is trained to identify the dance genre from the movements displayed in the video. Since the previously defined temporal motion tag already describes the detailed choreography movements over time, a model is trained to classify dance genres based on the temporal motion tag. To do this, dance videos with genre information in the metadata are collected, and a model, as illustrated in Figure 5, is trained for genre classification. A pretrained BERT encoder is used to encode the temporal motion tag, and its output is connected to a dense layer that classifies the genre. During training, the BERT encoder is frozen, and only the dense layer is updated. The trained model is then used to extract the genre tag from the temporal motion tag in a sequential process.
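A minimal sketch of such a classifier, assuming a standard Hugging Face BERT checkpoint and PyTorch, is given below; the checkpoint name and the usage example are illustrative, not the authors' exact configuration.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class GenreClassifier(nn.Module):
    NUM_GENRES = 10  # Break, Pop, Lock, Waack, Hip-hop, House, Krump, Street Jazz, Ballet, K-pop

    def __init__(self, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():  # freeze the encoder; only the dense layer trains
            p.requires_grad = False
        self.classifier = nn.Linear(self.bert.config.hidden_size, self.NUM_GENRES)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] embedding of the temporal motion tag
        return self.classifier(cls)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = GenreClassifier()
batch = tokenizer(["The dancer begins in a waacking pose, then ..."],
                  padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
genre_id = logits.argmax(dim=-1)  # index into the 10 genre labels
```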

3.1.3. Music Tags

We extracted musical information by focusing on tempo, beat positions, pitch levels, mood, and instruments. Initially, we utilized librosa [42] to determine the tempo, beat positions, and average pitch level of the music. Tempo was categorized into low, medium, and high based on BPM standards like adagio, andante, and allegro in Western music. The average pitch level was calculated from the extracted pitches, classified into high (above 800 Hz), low (below 500 Hz), and mid-range. Beat positions were identified using onset strength from librosa and were aligned with the video’s frame rate to correlate beats with specific frames. Music genres, such as hip-hop, jazz, and electronic, were classified using a genre classification model [43]. Results were compared with those from auto-tagging [44], and where applicable, overlapping and differing outcomes were both included. Mood and instrument data were also extracted using auto-tagging techniques. To enhance accuracy, we imposed limits, selecting only 2–3 instruments and moods for representation, while genre tags were further refined using metadata. Other music captioning methods [23,29] were excluded due to the tendency to generate excessive music-related details that overshadowed movement descriptions, or because they produced numerous inaccurate tags.
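The librosa-based portion of this extraction could be sketched as follows; the 500/800 Hz pitch boundaries follow the text above, while the pitch-tracking choice and the exact BPM cut-offs are illustrative assumptions.

```python
import librosa
import numpy as np

def music_tags(audio_path, video_fps):
    y, sr = librosa.load(audio_path)

    # Tempo and beat positions from onset strength, aligned to video frames.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
    tempo = float(np.atleast_1d(tempo)[0])
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    beat_video_frames = np.rint(beat_times * video_fps).astype(int)

    # Average pitch level from pitch tracking, keeping only salient, voiced bins.
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
    voiced = pitches[(magnitudes > np.median(magnitudes)) & (pitches > 0)]
    mean_pitch = float(voiced.mean()) if voiced.size else 0.0
    pitch_tag = "high" if mean_pitch > 800 else "low" if mean_pitch < 500 else "medium"

    # Illustrative BPM cut-offs loosely following adagio/andante/allegro ranges.
    tempo_tag = "low" if tempo < 76 else "high" if tempo > 120 else "medium"

    return {"tempo": tempo_tag,
            "beat_times": beat_times.tolist(),
            "beat_frames": beat_video_frames.tolist(),
            "pitch_level": pitch_tag}
```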

3.1.4. Other Tags

Features or elements of the choreography video that were not described by the previous tags, such as characteristics of the dancers or the environment, were classified as other tags. These tags included attributes related to the dancer, called dancer characteristics, such as their number, attire, and gender, as well as background details, called spatial characteristics, such as the camera position, whether the setting was indoors or outdoors, and the time of day. Figure 6 shows the entire process undertaken to extract other tags. We chose the very first frame of the video as the thumbnail, since the music has not yet started at that point and the dancer is still holding the initial pose. To extract both the characteristics of the dancer and the place where the dance is performed, the thumbnail was separated into an image of the dancer on an empty background and a residual image representing the location. For this separation, we adopted a salient object detection algorithm [45] to crop the dancer out of the thumbnail. The dancer image was then fed to an image-to-text model [10] to obtain dancer characteristics, such as costume, gender, and other notable features. The residual image was fed to the same image-to-text model to obtain spatial characteristics describing the background.
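A sketch of this separation-and-captioning step is given below, using BLIP from the transformers library as the image-to-text model; the salient-object mask is treated as an input placeholder for the detector cited above, and the helper names are illustrative.

```python
import numpy as np
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image):
    """Caption a PIL image with BLIP."""
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def other_tags(thumbnail, salient_mask):
    """thumbnail: PIL image of the first frame; salient_mask: HxW boolean dancer mask."""
    img = np.array(thumbnail)
    dancer = np.where(salient_mask[..., None], img, 255)      # dancer on an empty (white) background
    background = np.where(salient_mask[..., None], 255, img)  # residual image representing the location
    return {
        "dancer_characteristics": caption(Image.fromarray(dancer.astype(np.uint8))),
        "spatial_characteristics": caption(Image.fromarray(background.astype(np.uint8))),
    }
```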

3.2. Caption Types

In this study, three distinct types of captions were defined based on their purpose in dance captioning. The first type, known as the ‘dance summary caption’, was constructed primarily from the extracted dance tags to provide a basic description of the dance. This caption served as the fundamental form of description, capturing essential components of the dance. However, while the ‘dance summary caption’ provides a general overview, it may not encompass the full scope of the dance video, which includes temporal choreography and additional contextual components beyond the dance itself. To address this limitation, the second type of caption, termed ‘time series motion caption’, was designed to detail the sequence of dance movements over time. This type captured dynamic changes and intricate details that the ‘dance summary caption’ might have overlooked. The final type, referred to as the ‘multimodal caption’, aimed to represent non-dance components, such as music, dancer characteristics, and environmental factors, as well as the interactions and harmonies between these components and the dance.
Although it is feasible to create an all-encompassing caption that integrates all three types, such comprehensive captions may include extraneous information, leading to inefficiencies in both temporal and quality aspects during dataset generation. Consequently, tailoring captions to specific purposes can improve the efficiency and effectiveness of learning and research processes. The subsequent subsections will detail the tags necessary for each caption type and provide the instructions required for generating these captions using an LLM. This study employs GPT-4o [46] as the LLM, and the related instructions and results are derived from this model. Table 3 summarizes the contents of the following subsections, with detailed explanations provided therein.
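As an illustration of this tag-to-caption step, the sketch below serializes the extracted tags into a prompt together with a caption-specific instruction and queries GPT-4o; the instruction wording paraphrases the constraints described in the following subsections and is not the authors' verbatim prompt.

```python
import json
from openai import OpenAI

client = OpenAI()

# Paraphrased, caption-specific instructions (assumed wording, not the original prompts).
INSTRUCTIONS = {
    "dance_summary": "Summarize the dance in a single line, focusing on the movements, "
                     "dance genre, and choreography step.",
    "time_series_motion": "Describe the sequence of dance movements in chronological order, "
                          "movements only, in at most 1000 characters.",
    "multimodal": "Describe the dance, music, dancer, and environment in a balanced way "
                  "in at most 80 words.",
}

def generate_caption(tags: dict, caption_type: str) -> str:
    prompt = INSTRUCTIONS[caption_type] + "\nTags:\n" + json.dumps(tags, indent=2)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```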

3.2.1. Dance Summary Caption

The ‘dance summary caption’ recognizes a series of images as a single dance performance and aims to provide a comprehensive description of the dance movements. This caption is designed to leverage the detailed descriptions from the temporal motion tag and the dance tags to articulate the movements. The primary focus is on delivering a concise description that reflects the basic perceptual understanding of the dance as experienced by viewers. Typically, viewers concentrate on the movements and the genre of the dance. Therefore, the dance summary caption emphasizes the use of key tags such as dance genre and choreography step to describe the movements. To ensure a thorough description, the temporal motion tag is summarized to retain only the essential content, supplemented by dance tags to enhance the completeness of the caption. This approach prevents the caption from being overly simplified, which could otherwise result in a description that covers only a fraction of the movements. Additionally, the instruction is designed to limit the description to a single line, aiming to produce a concise caption that can be quickly grasped at a glance.

3.2.2. Time Series Motion Caption

The ‘time series motion caption’ focuses on detailing the changes in movements over time, with the primary objective of providing a comprehensive account of the actions depicted in the video. This caption is based on the temporal motion tag, which describes a series of images in chronological order. It is designed to guide the generation of captions that chronologically detail the sequence of movements. Similar to the ‘dance summary caption’, including information about the dance genre and characteristics helps in better understanding the movements. The generated caption provides an in-depth explanation of the movements, which aids in creating the ‘dance summary caption’. Therefore, the instructions direct the generation of detailed descriptions of the movements, excluding non-movement related explanations, and the length of the captions is restricted to no more than 1000 characters.

3.2.3. Multimodal Caption

The ‘multimodal caption’ generates a comprehensive caption that encompasses all elements present in the video, including dance tags, motion tags, music tags, and other contextual tags such as dancer attributes and environmental factors. The instructions are crafted to ensure that each tag contributes to a balanced caption, presenting the dance movements temporally without excessive detail. The information is arranged appropriately to provide a coherent output, and the caption length is limited to 80 words to maintain conciseness.

4. Experiments

To assess the effectiveness of the proposed method, we first evaluated the accuracy of the self-defined tags, followed by a user study of the captions generated using these tags. The accuracy of the tags represents the accuracy of the information to be included in the captions, which in turn determines the reliability of the captions. The proposed captioning method was applied to commonly used dance datasets and in-the-wild choreography videos extracted from social media platforms to generate captions describing these choreography videos. A user evaluation was then conducted, where the generated captions were randomly presented to users. This evaluation assessed whether the generated captions accurately reflected the tag information and maintained readability.

4.1. Datasets for Evaluation

To validate the tags and instructions proposed in Section 3, we utilized the AIST dance dataset, which is an open and widely used dataset, to extract the ‘dance summary caption’, ‘time series motion caption’, and ‘multimodal caption’. First, we verified whether the tags we aimed to extract were available through the AIST dance dataset metadata. Information such as the dance steps and dance genre was provided by the AIST dance metadata; thus, we used this information instead of directly extracting it. The AIST dance dataset includes videos captured from multiple cameras, and we used data from only the central c01 camera among the ten cameras, because the tag content from the other camera positions largely matched and therefore added little to the content of the captions. In contrast to the refined videos provided by the AIST dance dataset, this study also aimed to generate tags and dance captions using pseudo-captioning on in-the-wild videos commonly found on the Internet. While the AIST dance dataset provides accurate information on dance genres and dance steps, videos from platforms such as YouTube or TikTok provide metadata through titles and tags, with diverse information on dancers, environments, and music. Thus, unlike the specific music data within the AIST dance dataset, there is a vast variety of music–dance pair data on the Internet. Since Internet videos lack regularity, all tag information for these videos was extracted according to the tag generation methodology.

4.2. Tag Evaluation

This study aimed to quantitatively evaluate the effectiveness of self-defined tags that characterize choreography videos. Specifically, we focused on two types of self-defined tags: temporal motion tags and dance genre tags. For temporal motion tags, we assessed the accuracy of these tags in describing the actual temporal dance movements observed in the videos. This evaluation helped determine how well our defined tags captured the dynamic aspects of choreography. Regarding dance genre tags, in cases where genre information was not included in the video’s metadata, we employed a trained model to infer the genre. The reliability of these tags was then evaluated based on the accuracy of the model’s inference results. This two-pronged approach allowed us to comprehensively assess the efficacy of our tagging system in capturing both the temporal dynamics and stylistic categorization of dance videos.
We first evaluated the similarity between annotations generated by human annotators (professional dancers) and LLMs for 877 frames extracted from 50 videos in the AIST++ dataset. As depicted in Figure 7, three professional dancers and the LLM independently annotated the movements in each frame, and the similarity was analyzed using cosine similarity and BERT Score.
We first transformed the frame annotations into TF–IDF (Term Frequency–Inverse Document Frequency) vectors. This vectorization process allowed us to represent the textual annotations as numerical vectors, capturing the importance of each term within the context of the entire dataset. Subsequently, we computed the cosine similarity between these TF-IDF vectors. This similarity measure enabled us to quantify the semantic relatedness between different frame annotations.
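A minimal sketch of this TF–IDF similarity computation with scikit-learn is given below; the annotation strings are illustrative examples, not data from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

human = "The dancer raises the right arm and steps to the left on the beat."
llm = "On the beat the dancer steps left while lifting the right arm."

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([human, llm])       # 2 x vocabulary sparse matrix
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {similarity:.2f}")
```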
BERT Score is a method for evaluating the semantic similarity between candidate and reference sentences using contextual embeddings from BERT. First, embeddings are generated for each token in both sentences using BERT. Next, the cosine similarity between each token in the candidate sentence and each token in the reference sentence is calculated. To compute precision, the maximum similarity for each token in the candidate sentence with respect to tokens in the reference sentence is used, while recall is calculated by taking the maximum similarity for each token in the reference sentence with respect to tokens in the candidate sentence. Finally, the F1 score is obtained by calculating the harmonic mean of precision and recall. This process allows the BERT Score to capture semantic similarity between sentences more accurately. The analysis revealed an average cosine similarity of 0.81 ± 0.09 and a BERT Score (F1 score) of 0.88 ± 0.06. These results demonstrate that the choreography annotations generated by the LLM have a high level of semantic similarity to those of experts. The high similarity scores indicate that LLMs can accurately capture the core elements and details of choreography, strongly suggesting the feasibility of automated choreography analysis and description systems.
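For reference, the BERT Score computation can be reproduced with the bert-score package as sketched below; the candidate/reference sentences are illustrative.

```python
from bert_score import score

candidates = ["On the beat the dancer steps left while lifting the right arm."]   # LLM annotation
references = ["The dancer raises the right arm and steps to the left on the beat."]  # expert annotation

# Precision, recall, and F1 computed from token-level cosine similarities
# between contextual BERT embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERT Score F1: {F1.mean().item():.2f}")
```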
To assess the accuracy of the dance genre classification, we utilized a dataset comprising 1500 dance videos for training our proposed model and evaluated its performance on a dataset of 200 in-the-wild videos evenly distributed across 10 dance genres. The training dataset was constructed based on the AIST++ dataset, supplemented with additional in-the-wild videos collected from social media platforms to reflect real-world conditions. The proposed model achieved an average classification accuracy of 72% on the test set, demonstrating good performance in classifying diverse dance videos in in-the-wild environments. The confusion matrix shown in Figure 8 provides details of the model’s classification performance.
The Ballet, Break, and Krump genres demonstrated high classification accuracy, attributed to their distinctive movement characteristics. Street Jazz showed the lowest accuracy, with confusion across multiple genres. This reflects the genre’s incorporation of elements from various dance styles, making it challenging to distinguish. It is important to note that this genre classification method was employed specifically for videos lacking genre information in their metadata. Consequently, the overall accuracy of dance genre tagging in a real-world scenario would substantially surpass the accuracy achieved by this classification model alone. This is because a significant portion of dance videos typically already contain accurate genre tags in their metadata, provided by content creators or platforms.

4.3. User Study

To evaluate the significance of the generated tags and captions, as well as to verify the quality of the dataset, a user study was conducted. Since no prior research existed on dance captioning, there were no benchmark datasets available. Thus, we defined evaluation criteria based on studies from audio and music captioning and conducted a qualitative assessment. The study involved a total of 34 participants, including 6 experts and 28 non-experts, who evaluated the captions at their convenience after receiving adequate instructions. The group consisted of 11 men and 23 women. Participants reviewed the videos and captions and answered questions through a web-based survey. Evaluation criteria included whether the captions appeared to be written by a professional dancer, whether they aligned with the video content, and how well the caption details matched the video, all rated on a five-point scale. Additionally, participants evaluated the ‘dance summary caption’ for how comprehensively it covered essential content, the ‘time series motion caption’ for how accurately it described the movements, and the ‘multimodal caption’ for how well it integrated multiple components, all using the same five-point scale. Furthermore, frame extraction positions and caption type preferences were assessed. For frame extraction positions, captions generated at beat, half-beat, and quarter-beat intervals were compared to determine which best described the video. Finally, participants were provided with all three caption styles for a single video and were asked to select their preferred caption style.
The results are summarized in Table 4, Table 5 and Table 6. First, all three types of dance captions received high ratings in terms of appearing as if they were written by dance professionals. This indicates that, despite being automatically generated, the captions accurately reflected the content of the videos. For the first type, the ‘dance summary caption’, the evaluations for captions from both the AIST dance dataset and the in-the-wild video dataset were similar. Although the differences between the two datasets were minimal, this caption type tended to score lower in terms of the amount of information conveyed compared to the other two types. Participants noted that while the dance summary captions effectively described and summarized the dances, they lacked the level of detail present in the other caption types. Additionally, the alignment between the video and the caption was rated relatively lower, likely due to the omission of key movement details, as well as tags related to music and the environment, during the summarization process, leading to a lower information density.
The second type, the ‘time series motion caption’, focused on describing the dance movements chronologically and included the most information, being the longest caption type. It was rated higher in terms of how well it captured the dance content compared to the other two types. Furthermore, regarding the alignment between the video and the caption, the ‘time series motion caption’ received the highest scores. These two metrics suggest that participants prioritized the detailed description of movements when assessing the captions. Despite its longer length, the ‘time series motion caption’ was also seen as efficient in conveying relevant information without including unnecessary details.
Lastly, although the ‘multimodal caption’ incorporated several additional tags, it did not score higher than the ‘time series motion caption’. This implies that when reading dance captions, participants prioritize the structure and composition of the movements over information about the music or environment. However, the ‘multimodal caption’ was rated higher than the ‘dance summary caption’ in terms of the amount of information and the overall structure. Although it is shorter than the ‘time series motion caption’, the ‘multimodal caption’ compresses more information and effectively integrates various elements into a cohesive and consistent format.
As shown in Figure 9, the participants’ responses to the frame extraction location evaluation were analyzed by music BPM and dance genre. After watching the video, the participants evaluated their preferences for temporal motion tags extracted at beat, half-beat, and quarter-beat locations. First, the participants’ preferences did not significantly differ depending on the music BPM, and they preferred the half-beat location the most and the quarter-beat location the least in all music BPMs. This shows that tag preferences were not affected by whether the music BPM was slow or fast, and that tags at the half-beat location most appropriately described the video. Second, the participants evaluated their preferences for temporal motion tags according to the dance genre. Half-beat was the most preferred in most genres, followed by beat and quarter-beat, but dance genres such as middle hip-hop showed exceptions where tags at the beat location were preferred over tags at the half-beat location. In conclusion, regardless of the differences in music BPM and dance genre, most evaluators preferred temporal motion tags at the half-beat location. This is likely because tags at beat positions can miss important information about the movement, and tags at quarter-beat positions tend to be overly long, including excessive details that deviate from the main point. Meanwhile, regarding caption preferences, 71.4% of respondents preferred ‘time series motion captions’, 25.7% preferred ‘dance summary captions’, and 2.9% chose ‘multimodal captions’. Most respondents preferred captions that provided detailed descriptions of dance movements, but captions with additional information, such as music, were generally less preferred.

4.4. Statistics

To evaluate the AIST dance dataset, we analyzed a total of 1510 videos and generated captions. The statistical results are summarized in Table 7 and Table 8. Firstly, the ‘dance summary captions’ had the smallest word count and vocabulary. This indicates that ‘dance summary captions’ were concise and focused on conveying essential information, with the generated sentences being composed of words that best reflected the video from a movement-centric perspective. Secondly, the ‘time series motion captions’ had the highest word count and longest length, which is attributed to their emphasis on describing the temporal changes in movements. The ‘multimodal captions’, in turn, used more words and a larger vocabulary than the ‘dance summary captions’, reflecting their aim to provide a rich description by integrating multiple modalities of information. However, since the word count was restricted to 80 words, the descriptions of movements in the ‘multimodal captions’ were shorter relative to the ‘time series motion captions’, and they balanced the inclusion of diverse modalities of information. The proportion of unique words in the ‘time series motion captions’ was significantly lower compared to the ‘dance summary captions’ and ‘multimodal captions’. This can be attributed to the limited range of terms used for describing movements, such as physical terms, movement directions, and degrees of freedom. Additionally, the statistical analysis of in-the-wild dance videos involved evaluating a total of 200 videos. In-the-wild dance videos exhibited a higher proportion of unique words compared to the AIST dance dataset. This suggests that in-the-wild dance videos likely include a broader variety of backgrounds, styles, and more free-form expressions, with the significant difference indicating that the in-the-wild video dataset contains a much greater diversity of dance types, music, environment, and dancer information compared to the AIST dance dataset. Furthermore, while differences in other caption forms were not markedly prominent, the vocabulary and word count for ‘time series motion captions’ significantly increased compared to the AIST dance dataset, indicating that in-the-wild dance videos available on the Internet feature more complex movements and temporal changes, which the captions effectively capture.

5. Conclusions

This paper presented a novel methodology for generating captions for dance videos using an LLM and a pseudo-captioning technique. Specifically, we proposed three captioning approaches: ‘dance summary captioning’, which summarizes dance movements; ‘time series motion captioning’, which provides detailed descriptions of movements over time; and ‘multimodal captioning’, which integrates various elements such as music and environment. By applying these methodologies, we created and evaluated a dataset of captions based on dance videos. This dataset can be utilized in various applications, including dance retrieval systems and dance-music recommendation. Our study addressed the issue of data scarcity in dance research by providing previously lacking paired data of dance videos and text labels. However, some tags are automatically generated, which may result in lower accuracy, and the absence of baseline datasets limits quantitative evaluation. Future work should focus on establishing publicly available benchmark datasets to address these limitations.

Author Contributions

Conceptualization, S.K.; Methodology, S.K.; Software, S.K.; Validation, S.K.; Formal analysis, S.K.; Investigation, S.K.; Resources, S.K.; Data curation, S.K.; Writing—original draft, S.K.; Writing—review & editing, S.K. and K.L.; Visualization, S.K.; Supervision, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This work involved human subjects in its research. Approval of all ethical and experimental procedures and protocols was granted by the Institutional Review Board of Seoul National University under Application No. 2408/001-002, and the study was performed in line with the Declaration of Helsinki.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  2. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 17, 6000–6010. [Google Scholar]
  3. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning 2015, Lille, France, 6–11 June 2015; PMLR: Warrington, UK, 2015; pp. 2256–2265. [Google Scholar]
  4. Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325. [Google Scholar]
  5. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
  6. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
  7. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  8. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Musllis, C.; Wortsman, M.; et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 2022, 35, 25278–25294. [Google Scholar]
  9. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning 2021, Online, 18–24 July 2021; PMLR: Warrington, UK, 2021; pp. 8748–8763. [Google Scholar]
  10. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning 2022, Baltimore, MD, USA, 17–23 July 2022; PMLR: Warrington, UK, 2022; pp. 12888–12900. [Google Scholar]
  11. Kim, C.D.; Kim, B.; Lee, H.; Kim, G. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 119–132. [Google Scholar]
  12. Drossos, K.; Lipping, S.; Virtanen, T. Clotho: An audio captioning dataset. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 736–740. [Google Scholar]
  13. Morato, I.M.; Mesaros, A. MACS—Multi-Annotator Captioned Soundscapes [Data set]; Zenodo: Geneva, Switzerland, 2021. [Google Scholar] [CrossRef]
  14. Tsuchida, S.; Fukayama, S.; Hamasaki, M.; Goto, M. AIST Dance Video Database: Multi-Genre, Multi-Dancer, and Multi-Camera Database for Dance Information Processing. ISMIR 2019, 1, 6. [Google Scholar]
  15. Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. Ai choreographer: Music conditioned 3d dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 13401–13412. [Google Scholar]
  16. Tang, T.; Mao, H.; Jia, J. Anidance: Real-time dance motion synthesize to the song. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1237–1239. [Google Scholar]
  17. Yao, S.; Sun, M.; Li, B.; Yang, F.; Wang, J.; Zhang, R. Dance with you: The diversity controllable dancer generation via diffusion models. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 8504–8514. [Google Scholar]
  18. Sun, G.; Wong, Y.; Cheng, Z.; Kankanhalli, M.S.; Geng, W.; Li, X. Deepdance: Music-to-dance motion choreography with adversarial learning. IEEE Trans. Multimed. 2020, 23, 497–509. [Google Scholar] [CrossRef]
  19. Lee, H.Y.; Yang, X.; Liu, M.Y.; Wang, T.C.; Lu, Y.D.; Yang, M.H.; Kautz, J. Dancing to music. Adv. Neural Inf. Process. Syst. 2019, 32, 3586–3596. [Google Scholar]
  20. Huang, Y.F.; Liu, W.D. Choreography cGAN: Generating dances with music beats using conditional generative adversarial networks. Neural Comput. Appl. 2021, 33, 9817–9833. [Google Scholar] [CrossRef]
  21. Plappert, M.; Mandery, C.; Asfour, T. The kit motion-language dataset. Big Data 2016, 4, 236–252. [Google Scholar] [CrossRef]
  22. Gan, Z.; Li, L.; Li, C.; Wang, L.; Liu, Z.; Gao, J. Vision-language pre-training: Basics, recent advances, and future trends. Found. Trends® Comput. Graph. Vis. 2022, 14, 163–352. [Google Scholar] [CrossRef]
  23. Hao, Y.; Song, H.; Dong, L.; Huang, S.; Chi, Z.; Wang, W.; Ma, S.; Wei, F. Language models are general-purpose interfaces. arXiv 2022, arXiv:2206.06336. [Google Scholar]
  24. Rotstein, N.; Bensaïd, D.; Brody, S.; Ganz, R.; Kimmel, R. Fusecap: Leveraging large language models for enriched fused image captions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 2024, Waikoloa, HI, USA, 3–8 January 2024; pp. 5689–5700. [Google Scholar]
  25. Bianco, S.; Celona, L.; Donzella, M.; Napoletano, P. Improving image captioning descriptiveness by ranking and llm-based fusion. arXiv 2023, arXiv:2306.11593. [Google Scholar]
  26. Lipping, S.; Sudarsanam, P.; Drossos, K.; Virtanen, T. Clotho-aqa: A crowdsourced dataset for audio question answering. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1140–1144. [Google Scholar]
  27. Fayek, H.M.; Johnson, J. Temporal reasoning via audio question answering. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2283–2294. [Google Scholar] [CrossRef]
  28. Mei, X.; Liu, X.; Plumbley, M.D.; Wang, W. Automated audio captioning: An overview of recent progress and new challenges. EURASIP J. Audio Speech Music. Process. 2022, 2022, 26. [Google Scholar] [CrossRef]
  29. Gontier, F.; Serizel, R.; Cerisara, C. Automated audio captioning by fine-tuning bart with audioset tags. In Proceedings of the DCASE 2021–6th Workshop on Detection and Classification of Acoustic Scenes and Events, Online, 15–19 November 2021. [Google Scholar]
  30. Koizumi, Y.; Masumura, R.; Nishida, K.; Yasuda, M.; Saito, S. A transformer-based audio captioning model with keyword estimation. arXiv 2020, arXiv:2007.00222. [Google Scholar]
  31. Mei, X.; Meng, C.; Liu, H.; Kong, Q.; Ko, T.; Zhao, C.; Plumbley, M.D.; Zou, Y.; Wang, W. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3339–3354. [Google Scholar] [CrossRef]
  32. Doh, S.; Choi, K.; Lee, J.; Nam, J. Lp-musiccaps: Llm-based pseudo music captioning. arXiv 2023, arXiv:2307.16372. [Google Scholar]
  33. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  34. Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; Niebles, J.C. Dense-Captioning Events in Videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2017, Venice, Italy, 22–29 October 2017; pp. 706–715. [Google Scholar]
  35. Iashin, V.; Rahtu, E. Multi-modal Dense Video Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2020, Seattle, WA, USA, 14–19 June 2020; pp. 958–959. [Google Scholar]
  36. Tang, M.; Yin, Z.; Liu, X.; Zhang, X.; Wang, S.; Zhang, W. CLIP4Caption: CLIP for Video Caption. arXiv 2021, arXiv:2110.06615. [Google Scholar]
  37. Xue, H.; Zhao, W.; Zhang, Z.; Wang, Z.; Zhang, H.; Bai, S. CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 20164–20173. [Google Scholar]
  38. Guo, C.; Xu, X.; Shi, B.; Yin, K.; Zheng, S. TM2T: Stochastical and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts. In Proceedings of the European Conference on Computer Vision (ECCV) 2022, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 612–629. [Google Scholar]
  39. Burger, B.; Thompson, M.R.; Luck, G.; Saarikallio, S.; Toiviainen, P. Influences of rhythm- and timbre-related musical features on characteristics of music-induced movement. Front. Psychol. 2013, 4, 183. [Google Scholar] [CrossRef]
  40. Miura, A.; Kudo, K.; Ohtsuki, T.; Kanehisa, H. Coordination modes in sensorimotor synchronization of whole-body movement: A study of street dancers and non-dancers. Hum. Mov. Sci. 2011, 30, 1260–1271. [Google Scholar] [CrossRef] [PubMed]
  41. Kim, J.; Kim, H. A Study on the Correlation between Musical Rhythm and Dance Movement. Korean J. Danc. Sci. 2016, 33, 1–16. [Google Scholar]
  42. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. In SciPy; SciPy: Tacoma, WA, USA, 2015; pp. 18–24. [Google Scholar]
  43. Gouyon, F. Dance music classification: A tempo-based approach. In Proceedings of the ISMIR 2004, 5th International Conference on Music Information Retrieval, Barcelona, Spain, 10–14 October 2004. [Google Scholar]
  44. Won, M.; Ferraro, A.; Bogdanov, D.; Serra, X. Evaluation of cnn-based automatic music tagging models. arXiv 2020, arXiv:2006.00751. [Google Scholar]
  45. Kim, T.; Kim, K.; Lee, J.; Cha, D.; Lee, J.; Kim, D. Revisiting image pyramid structure for high resolution salient object detection. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 108–124. [Google Scholar]
  46. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
Figure 1. The proposed pseudo-captioning algorithm. Tags generated from dance videos are fed into an LLM along with instruction prompts to produce dance captions.
Figure 2. An example of predicting the positions of major joints in the body.
Figure 3. An example of creating temporal motion tags for dance movements over time. These tags are designed to focus exclusively on the movements by describing them in a sequence of images arranged in temporal order, excluding any background elements.
Figure 4. The positions of the beat, half-beat, and quarter-beat used when observing temporal movements in dance. The figure illustrates how many frames remain after downsampling at the beat positions of the music, along with the corresponding temporal intervals.
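As a rough illustration of the beat-aligned sampling in Figure 4, the helper below uses librosa [42] to track beats and convert beat (or half-/quarter-beat) times into video frame indices. The function name and the fixed frame rate are assumptions, not the paper's exact parameters.

```python
import numpy as np
import librosa

def beat_frame_indices(audio_path: str, fps: float = 30.0, subdivision: int = 1) -> np.ndarray:
    """Frame indices aligned to beats; subdivision=2 gives half-beats, 4 gives quarter-beats."""
    y, sr = librosa.load(audio_path)
    _, beat_times = librosa.beat.beat_track(y=y, sr=sr, units="time")
    if len(beat_times) == 0:
        return np.array([], dtype=int)
    times = []
    for t0, t1 in zip(beat_times[:-1], beat_times[1:]):
        # Evenly interpolate sub-beat positions between consecutive beats.
        times.extend(np.linspace(t0, t1, subdivision, endpoint=False))
    times.append(beat_times[-1])
    return np.rint(np.asarray(times) * fps).astype(int)
```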
Figure 5. Overall flow of extracting the genre tag from a dance video. If the genre is contained in the metadata, it is used directly as the genre tag; otherwise, the genre is estimated from the temporal motion tag using a BERT encoder and a trained genre classifier.
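A minimal sketch of the fallback branch in Figure 5 is shown below: the temporal motion tag is encoded with BERT and a classifier head predicts the genre. The checkpoint, label list, and linear head are illustrative assumptions; only the overall flow follows the figure.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Hypothetical label set; the paper's full genre list is not reproduced here.
GENRES = ["Hip-hop", "Break", "Krump", "House", "Waack"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, len(GENRES))  # would be trained on labeled tags

def predict_genre(temporal_motion_tag: str) -> str:
    """Encode the temporal motion tag with BERT and classify its dance genre."""
    inputs = tokenizer(temporal_motion_tag, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token embedding
        logits = classifier(cls)
    return GENRES[int(logits.argmax(dim=-1))]
```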
Figure 6. The process of extracting other tags from the thumbnail of the dance video.
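The "other tags" in Figure 6 (dancer and spatial characteristics) could, for example, be obtained by running an image captioning model on the thumbnail and parsing the result into tags. The BLIP checkpoint below is a stand-in and not necessarily the model used in the paper's pipeline.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_thumbnail(path: str) -> str:
    """Caption the video thumbnail; the text can then be split into dancer/background tags."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```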
Figure 7. Examples of movements described using an LLM and movements annotated by a human.
Figure 8. Confusion matrix of the dance genre classification for 200 in-the-wild dance videos (20 videos per genre).
Figure 9. Participants’ preferences for temporal motion tags extracted at beat, half-beat, and quarter-beat intervals, grouped by music BPM and dance genre. Participants reviewed the videos and tags and selected the tag they felt most accurately described the motion.
Table 1. All tags proposed in this paper, categorized into five main groups. Tags related to ‘Background’ and ‘Person’ are ultimately classified under ‘Other Tags’.
Category | Tag
Motion | Average velocity of movement, average acceleration of movement, temporal motion
Dance | Dance genre, choreography step
Music | Music genre, music tempo, beat location, instrument, average pitch level, mood
Person | Dancer gender, clothes, number of dancers
Background | Camera position, background information
Table 2. The types of tags and their corresponding examples.
Category | Tag Name | Example
Motion Tags | Temporal motion | The dancer begins in a waacking pose, then changes the movement by stretching out her right arm and pulling her right arm inward while lifting her left arm in a cross motion.
Motion Tags | Average velocity of movement | Slow, Low, Moderate, High, Rapid
Motion Tags | Average acceleration of movement | Slight, Gradual, Moderate, High, Rapid
Dance Tags | Dance genre | “Hip-hop”, “Break”, “Krump”
Dance Tags | Choreography step | “Walking out”, “Whip”, “Brooklyn step”
Music Tags | Music genre | “Hip-hop”, “Electronic”, “Jazz”
Music Tags | Music tempo | Low, Medium, High tempo
Music Tags | Beat location | [1.22, 4.55, 7.88, …], frame position
Music Tags | Instrument | “Piano”, “Synthesizer”, “Acoustic guitar”
Music Tags | Average pitch level | Low, Medium, High
Music Tags | Music mood | “Happy”, “Energetic”, “Film”
Other Tags | Dancer characteristics | “Man in a black shirt and gray pants”
Other Tags | Spatial characteristics | “Courtyard”
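For reference, the tag set in Table 2 maps naturally onto a per-video record. The field names below are illustrative choices and the values are taken from the table's example entries; the paper's actual storage format may differ.

```python
# Illustrative per-video tag record assembled from the tag extractors;
# keys mirror Table 2 and values reuse the table's example entries.
video_tags = {
    "temporal_motion": "The dancer begins in a waacking pose, then stretches out her right arm ...",
    "average_velocity": "Moderate",
    "average_acceleration": "Gradual",
    "dance_genre": "Hip-hop",
    "choreography_step": "Brooklyn step",
    "music_genre": "Hip-hop",
    "music_tempo": "High tempo",
    "beat_location": [1.22, 4.55, 7.88],
    "instrument": "Synthesizer",
    "average_pitch_level": "Medium",
    "music_mood": "Energetic",
    "dancer_characteristics": "Man in a black shirt and gray pants",
    "spatial_characteristics": "Courtyard",
}
```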
Table 3. Three types of captions, detailing the caption names, required tags, associated instructions, and examples for each type.
Caption Name | Required Tags | Instructions and Examples
Dance summary caption | Dance genre, choreography step, temporal motion | Instruction: The tag information below includes a description of the dance and details on how the series of dance movements change over time. Based on the given information, please describe how the movements evolve over time in a detailed yet concise manner. Write it in one sentence, naturally as if a person were speaking, focusing solely on the dance movements and making sure to include the dance genre. Example: Break dance moves involve the dancer striking a dynamic pose with their arms outstretched, then rhythmically shifting their weight and performing a fluid two-step movement with sharp, precise arm movements, before confidently returning to the starting position.
Time series motion caption | Temporal motion, average acceleration of movement, average velocity of movement, dance genre, choreography step | Instruction: Please describe the dancer’s movements in detail in chronological order, using the tag information provided. Create a natural-flowing paragraph that captures the progression of the dance moves clearly, without additional explanations. Example: At first, he starts by putting his left foot forward, then slowly moves his right foot to the side, lowers his left arm, and slightly leans his body as if walking, executing these movements in the style of a hip-hop dance. He raises his right arm, steps his right foot forward, then moves his left foot to the side again, turns his body, and stretches both arms forward. He places his left foot forward, slightly rhythmically lifts his right foot, and then he puts his right foot down and repeats the walking motion again. These movements continue continuously, and he seems to be walking, and the dance is repeated continuously.
Multimodal caption | Music genre, tempo, instrument, average pitch level, mood, dance genre, choreography step, temporal motion, dancer characteristics, spatial characteristics | Instruction: Using the tag information, list and briefly describe the dancer’s movements in chronological order, ensuring they align with the music. Keep the paragraph concise and under 80 words, capturing the essence of the dance moves in relation to the music and environment tags. Example: The dancers attempt to embody the Grand Jeté movement, with graceful leg lifts and graceful landings in a white-backed studio, while maintaining the energy of the dance with lively arm movements and turns. The movements are dynamic, with varying speeds, and the rock and pop music creates a lively and happy atmosphere.
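Because the three caption types in Table 3 differ only in which tags they require and which instruction they carry, prompt construction reduces to a lookup plus string formatting. The sketch below captures that idea; the dictionary keys and the shortened instruction strings are paraphrased placeholders, not the paper's verbatim prompts (see Table 3 for those).

```python
# Required tags per caption type, following Table 3; instruction strings are abbreviated here.
CAPTION_SPECS = {
    "dance_summary": {
        "required": ["dance_genre", "choreography_step", "temporal_motion"],
        "instruction": "Describe in one natural sentence how the movements evolve over time; "
                       "include the dance genre.",
    },
    "time_series_motion": {
        "required": ["temporal_motion", "average_acceleration", "average_velocity",
                     "dance_genre", "choreography_step"],
        "instruction": "Describe the dancer's movements in detail, in chronological order, "
                       "as one flowing paragraph.",
    },
    "multimodal": {
        "required": ["music_genre", "music_tempo", "instrument", "average_pitch_level",
                     "music_mood", "dance_genre", "choreography_step", "temporal_motion",
                     "dancer_characteristics", "spatial_characteristics"],
        "instruction": "List the movements chronologically, aligned with the music and "
                       "environment, in under 80 words.",
    },
}

def build_prompt(caption_type: str, video_tags: dict) -> str:
    """Select the required tags for a caption type and prepend its instruction."""
    spec = CAPTION_SPECS[caption_type]
    tag_block = "\n".join(f"- {k}: {video_tags[k]}" for k in spec["required"] if k in video_tags)
    return f"{spec['instruction']}\n\nTags:\n{tag_block}"
```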
Table 4. User study result: dance summary caption (five-point scale, {average} ± {standard deviation}).
Dataset | Professional | Containing Info. | Video-Caption Match | Motion Summary
AIST dance videos | 4.309 ± 0.833 | 4.235 ± 1.024 | 4.206 ± 0.802 | 4.176 ± 1.078
In-the-wild dance videos | 4.279 ± 0.750 | 4.338 ± 0.637 | 4.176 ± 0.791 | 4.235 ± 0.794
Table 5. User study result: time series motion caption (five-point scale, {average} ± {standard deviation}).
Dataset | Professional | Containing Info. | Video-Caption Match | Motion Detail
AIST dance videos | 4.279 ± 0.944 | 4.603 ± 0.694 | 4.632 ± 0.667 | 4.529 ± 0.801
In-the-wild dance videos | 4.412 ± 0.796 | 4.559 ± 0.655 | 4.676 ± 0.609 | 4.456 ± 0.818
Table 6. User study result: multimodal caption (five-point scale, {average} ± {standard deviation}).
Dataset | Professional | Containing Info. | Video-Caption Match | Composition
AIST dance videos | 4.279 ± 1.005 | 4.368 ± 0.960 | 4.412 ± 0.868 | 4.309 ± 0.918
In-the-wild dance videos | 4.309 ± 0.902 | 4.426 ± 0.816 | 4.162 ± 0.891 | 4.338 ± 0.956
Table 7. Statistics of AIST dance dataset video–tags paired captions.
Caption Type | Vocabulary | Words | Unique Word (%)
Dance summary captions | 36.45 ± 7.56 | 32.90 ± 5.03 | 0.130
Time series motion captions | 119.75 ± 29.78 | 79.52 ± 14.79 | 0.088
Multimodal captions | 51.13 ± 7.51 | 42.87 ± 5.99 | 0.129
Table 8. Statistics of in-the-wild dance dataset video–tags paired captions.
Caption Type | Vocabulary | Words | Unique Word (%)
Dance summary captions | 36.60 ± 7.02 | 31.55 ± 5.16 | 0.359
Time series motion captions | 138.05 ± 30.50 | 89.40 ± 13.59 | 0.274
Multimodal captions | 45.30 ± 3.94 | 39.15 ± 2.89 | 0.421
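Statistics of the kind reported in Tables 7 and 8 can be reproduced roughly as sketched below. The paper does not spell out its tokenization or how "Unique Word (%)" is computed, so this implementation (whitespace tokens, corpus-level vocabulary divided by total words) is an assumption for illustration.

```python
import numpy as np

def caption_statistics(captions: list[str]) -> dict:
    """Per-caption word counts and a corpus-level unique-word ratio (assumed definitions)."""
    tokens = [c.lower().split() for c in captions]
    words_per_caption = np.array([len(t) for t in tokens])
    unique_per_caption = np.array([len(set(t)) for t in tokens])
    corpus_vocab = {w for t in tokens for w in t}
    return {
        "words_mean_std": (words_per_caption.mean(), words_per_caption.std()),
        "unique_words_mean_std": (unique_per_caption.mean(), unique_per_caption.std()),
        "unique_word_ratio": len(corpus_vocab) / words_per_caption.sum(),
    }
```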
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.