To assess the effectiveness of the proposed method, we first evaluated the accuracy of the self-defined tags and then conducted a user study of the captions generated from them. Tag accuracy reflects the accuracy of the information to be included in the captions and therefore determines their reliability. The proposed captioning method was applied to commonly used dance datasets and to in-the-wild choreography videos collected from social media platforms. A user evaluation was then conducted in which the generated captions were presented to users in random order, assessing whether the captions accurately reflected the tag information and remained readable.
4.2. Tag Evaluation
This study aimed to quantitatively evaluate the effectiveness of self-defined tags that characterize choreography videos. Specifically, we focused on two types of self-defined tags: temporal motion tags and dance genre tags. For temporal motion tags, we assessed the accuracy of these tags in describing the actual temporal dance movements observed in the videos. This evaluation helped determine how well our defined tags captured the dynamic aspects of choreography. Regarding dance genre tags, in cases where genre information was not included in the video’s metadata, we employed a trained model to infer the genre. The reliability of these tags was then evaluated based on the accuracy of the model’s inference results. This two-pronged approach allowed us to comprehensively assess the efficacy of our tagging system in capturing both the temporal dynamics and stylistic categorization of dance videos.
We first evaluated the similarity between annotations generated by human annotators (professional dancers) and LLMs for 877 frames extracted from 50 videos in the AIST++ dataset. As depicted in Figure 7, three professional dancers and the LLM independently annotated the movements in each frame, and the similarity was analyzed using cosine similarity and BERT Score.
We first transformed the frame annotations into TF-IDF (term frequency–inverse document frequency) vectors, representing each textual annotation as a numerical vector that weights every term by its importance within the entire dataset. We then computed the cosine similarity between these TF-IDF vectors to quantify the semantic relatedness between frame annotations.
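A minimal sketch of this comparison, assuming each frame annotation is available as a plain-text string; the variable names and example sentences below are illustrative, and in the study the TF-IDF statistics are fitted over the entire annotation set rather than a single pair.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical annotations for a single frame (dancer vs. LLM).
dancer_annotation = "right arm extends overhead while the left knee lifts upward"
llm_annotation = "the dancer raises the right arm above the head and lifts the left knee"

# Represent each annotation as a TF-IDF vector, then compare the vectors.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([dancer_annotation, llm_annotation])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity: {similarity:.2f}")
```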
BERT Score is a method for evaluating the semantic similarity between candidate and reference sentences using contextual embeddings from BERT. First, BERT embeddings are generated for each token in both sentences. Next, the cosine similarity between each token in the candidate sentence and each token in the reference sentence is calculated. Precision is computed from the maximum similarity of each candidate token with respect to the reference tokens, while recall is computed from the maximum similarity of each reference token with respect to the candidate tokens. Finally, the F1 score is the harmonic mean of precision and recall. This procedure allows BERT Score to capture semantic similarity between sentences more accurately than surface-level matching. The analysis revealed an average cosine similarity of 0.81 ± 0.09 and a BERT Score (F1) of 0.88 ± 0.06. These results indicate that the choreography annotations generated by the LLM are semantically close to those of experts, suggesting that LLMs can capture the core elements and details of choreography and supporting the feasibility of automated choreography analysis and description systems.
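The greedy token-matching behind this score can be sketched as follows, assuming token embeddings for the candidate and reference sentences have already been extracted from BERT; the function name is illustrative.

```python
import numpy as np

def bert_score_f1(candidate_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """BERT Score F1 from token embeddings of shape [num_tokens, dim]."""
    # L2-normalize so that dot products equal cosine similarities.
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    r = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    sim = c @ r.T  # pairwise token cosine similarities

    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)
```

In practice, the bert-score package wraps this procedure; a call such as `score(candidates, references, lang="en")` returns precision, recall, and F1 directly.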
To assess the accuracy of the dance genre classification, we trained the proposed model on a dataset of 1500 dance videos and evaluated it on 200 in-the-wild videos evenly distributed across 10 dance genres. The training dataset was built on the AIST++ dataset and supplemented with in-the-wild videos collected from social media platforms to reflect real-world conditions. The proposed model achieved an average classification accuracy of 72% on the test set, demonstrating solid performance on diverse dance videos in in-the-wild environments. The confusion matrix in Figure 8 details the model’s per-genre classification performance.
The Ballet, Break, and Krump genres showed high classification accuracy, which we attribute to their distinctive movement characteristics. Street Jazz showed the lowest accuracy, with confusion across multiple genres, reflecting its incorporation of elements from various dance styles, which makes it difficult to distinguish. It is important to note that this genre classification method was applied only to videos lacking genre information in their metadata. Consequently, the overall accuracy of dance genre tagging in a real-world scenario would likely exceed the accuracy of this classification model alone, since a significant portion of dance videos already carry accurate genre tags in their metadata, provided by content creators or platforms.
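As an illustration of how the per-genre results behind Figure 8 can be computed, the following sketch uses scikit-learn on hypothetical label lists; `y_true` and `y_pred` stand in for the ground-truth and predicted genres of the 200 test videos and are not the paper’s actual data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels; in the study each of the 10 genres has 20 test videos.
y_true = ["Ballet", "Break", "Krump", "Street Jazz"]
y_pred = ["Ballet", "Break", "Krump", "Break"]

labels = sorted(set(y_true))
print("overall accuracy:", accuracy_score(y_true, y_pred))

# Row-normalizing the confusion matrix gives per-genre recall, which is what
# separates well-classified genres (e.g., Ballet) from ambiguous ones (e.g., Street Jazz).
cm = confusion_matrix(y_true, y_pred, labels=labels)
per_genre_recall = cm.diagonal() / cm.sum(axis=1)
for genre, recall in zip(labels, per_genre_recall):
    print(f"{genre}: {recall:.2f}")
```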
4.3. User Study
To evaluate the significance of the generated tags and captions, as well as to verify the quality of the dataset, a user study was conducted. Since no prior research existed on dance captioning, there were no benchmark datasets available. Thus, we defined evaluation criteria based on studies from audio and music captioning and conducted a qualitative assessment. The study involved a total of 34 participants, including 6 experts and 28 non-experts, who evaluated the captions at their convenience after receiving adequate instructions. The group consisted of 11 men and 23 women. Participants reviewed the videos and captions and answered questions through a web-based survey. Evaluation criteria included whether the captions appeared to be written by a professional dancer, whether they aligned with the video content, and how well the caption details matched the video, all rated on a five-point scale. Additionally, participants evaluated the ‘dance summary caption’ for how comprehensively it covered essential content, the ‘time series motion caption’ for how accurately it described the movements, and the ‘multimodal caption’ for how well it integrated multiple components, all using the same five-point scale. Furthermore, frame extraction positions and caption type preferences were assessed. For frame extraction positions, captions generated at beat, half-beat, and quarter-beat intervals were compared to determine which best described the video. Finally, participants were provided with all three caption styles for a single video and were asked to select their preferred caption style.
The results are summarized in Table 4, Table 5 and Table 6. First, all three types of dance captions received high ratings in terms of appearing as if they were written by dance professionals. This indicates that, despite being automatically generated, the captions accurately reflected the content of the videos. For the first type, the ‘dance summary caption’, the evaluations for captions from both the AIST dance dataset and the in-the-wild video dataset were similar. Although the differences between the two datasets were minimal, this caption type tended to score lower in terms of the amount of information conveyed compared to the other two types. Participants noted that while the dance summary captions effectively described and summarized the dances, they lacked the level of detail present in the other caption types. Additionally, the alignment between the video and the caption was rated relatively lower, likely because key movement details, as well as tags related to music and the environment, were omitted during summarization, resulting in lower information density.
The second type, the ‘time series motion caption’, focused on describing the dance movements chronologically and included the most information, being the longest caption type. It was rated higher in terms of how well it captured the dance content compared to the other two types. Furthermore, regarding the alignment between the video and the caption, the ‘time series motion caption’ received the highest scores. These two metrics suggest that participants prioritized the detailed description of movements when assessing the captions. Despite its longer length, the ‘time series motion caption’ was also seen as efficient in conveying relevant information without including unnecessary details.
Lastly, although the ‘multimodal caption’ incorporated several additional tags, it did not score higher than the ‘time series motion caption’. This implies that when reading dance captions, participants prioritize the structure and composition of the movements over information about the music or environment. However, the ‘multimodal caption’ was rated higher than the ‘dance summary caption’ in terms of the amount of information and the overall structure. Although it is shorter than the ‘time series motion caption’, the ‘multimodal caption’ compresses more information and effectively integrates various elements into a cohesive and consistent format.
As shown in Figure 9, the participants’ responses to the frame extraction location evaluation were analyzed by music BPM and dance genre. After watching the video, the participants evaluated their preferences for temporal motion tags extracted at beat, half-beat, and quarter-beat locations. First, the participants’ preferences did not significantly differ depending on the music BPM, and they preferred the half-beat location the most and the quarter-beat location the least across all music BPMs. This shows that tag preferences were not affected by whether the music BPM was slow or fast, and that tags at the half-beat location most appropriately described the video. Second, the participants evaluated their preferences for temporal motion tags according to the dance genre. Half-beat was the most preferred in most genres, followed by beat and quarter-beat, but some genres, such as middle hip-hop, showed exceptions where tags at the beat location were preferred over tags at the half-beat location. In conclusion, regardless of differences in music BPM and dance genre, most evaluators preferred temporal motion tags at the half-beat location. This is likely because tags at beat positions can miss important information about the movement, while tags at quarter-beat positions tend to be overly long, including excessive details that deviate from the main point. Meanwhile, regarding caption preferences, 71.4% of respondents preferred ‘time series motion captions’, 25.7% preferred ‘dance summary captions’, and 2.9% chose ‘multimodal captions’. Most respondents preferred captions that provided detailed descriptions of dance movements, whereas captions with additional information, such as music, were generally less preferred.
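The beat, half-beat, and quarter-beat extraction positions compared above can be derived directly from the music BPM and the video frame rate; the sketch below is illustrative rather than the exact implementation used in the pipeline, and the function name and parameters are assumptions.

```python
def extraction_frames(bpm: float, fps: float, duration_s: float, division: int) -> list[int]:
    """Frame indices at beat (division=1), half-beat (2), or quarter-beat (4) positions."""
    step = (60.0 / bpm) / division  # seconds between extraction points
    frames, t = [], 0.0
    while t < duration_s:
        frames.append(round(t * fps))
        t += step
    return frames

# Example: half-beat positions for 120 BPM music at 30 fps over the first two seconds.
print(extraction_frames(bpm=120, fps=30, duration_s=2.0, division=2))
```

Halving the interval (division=2) roughly doubles the number of extracted frames, which matches the observation that quarter-beat tags become overly long while beat-level tags can miss movements between beats.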
4.4. Statistics
To evaluate the AIST dance dataset, we analyzed a total of 1510 videos and generated captions. The statistical results are summarized in Table 7 and Table 8. Firstly, the ‘dance summary caption’ had the fewest words and the smallest vocabulary. This indicates that ‘dance summary captions’ were concise and focused on conveying essential information, with the generated sentences composed of words that best reflected the video from a movement-centric perspective. Secondly, the ‘time series motion captions’ had the highest word count and the longest length, attributable to their emphasis on describing the temporal changes in movements. The ‘multimodal caption’, in turn, used more words and a larger vocabulary than the ‘dance summary caption’, reflecting its aim to provide a rich description by integrating multiple modalities of information. However, since the word count was restricted to 80 words, the descriptions of movements in the ‘multimodal captions’ were shorter than in the ‘time series motion captions’, balancing the inclusion of diverse modalities of information. The proportion of unique words in the ‘time series motion captions’ was significantly lower than in the ‘dance summary captions’ and ‘multimodal captions’, which can be attributed to the limited range of terms used for describing movements, such as physical terms, movement directions, and degrees of freedom.

Additionally, the statistical analysis of in-the-wild dance videos covered a total of 200 videos. These videos exhibited a higher proportion of unique words than the AIST dance dataset, suggesting that in-the-wild dance videos include a broader variety of backgrounds, styles, and more free-form expressions; this sizeable difference indicates that the in-the-wild video dataset contains a much greater diversity of dance types, music, environment, and dancer information than the AIST dance dataset. Furthermore, while differences in the other caption forms were not markedly prominent, the vocabulary and word count of the ‘time series motion captions’ increased significantly compared to the AIST dance dataset, indicating that in-the-wild dance videos available on the internet feature more complex movements and temporal changes, which the captions effectively capture.
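The word-count, vocabulary, and unique-word statistics reported in Table 7 and Table 8 can be computed as sketched below, assuming each caption type is given as a list of strings; whitespace tokenization and the function name are simplifying assumptions, not the paper’s exact procedure.

```python
def caption_stats(captions: list[str]) -> dict[str, float]:
    """Average word count, vocabulary size, and unique-word ratio for a caption set."""
    tokens = [word.lower() for caption in captions for word in caption.split()]
    vocab = set(tokens)
    return {
        "avg_word_count": len(tokens) / len(captions),
        "vocabulary_size": len(vocab),
        "unique_word_ratio": len(vocab) / len(tokens),
    }

# Hypothetical usage, one call per caption type:
# caption_stats(summary_captions); caption_stats(motion_captions); caption_stats(multimodal_captions)
```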