FIOVA (Five-in-One Video Annotations) Benchmark

This initiative aims to evaluate the gap between human and machine video understanding by benchmarking LVLMs against comprehensive human annotations.

(Contact: Shiyu Hu and Xuchen Li)

Latest News

  • [2024.10.22] We have updated the arXiv version of the paper.
  • [2024.10.18] The home page has been released! More information will be available soon.
Can LVLMs Describe Videos Like Humans? A Five-in-One Video Annotations Benchmark

    Figure 0


    Abstract

    Large vision-language models (LVLMs) have made significant strides in addressing complex video tasks, sparking researchers' interest in their human-like multimodal understanding capabilities. Video description serves as a fundamental task for evaluating video comprehension, necessitating a deep understanding of spatial and temporal dynamics, which presents challenges for both humans and machines. Thus, investigating whether LVLMs can describe videos as comprehensively as humans—through reasonable human-machine comparisons using video captioning as a proxy task—will enhance our understanding and application of these models. However, current benchmarks for video comprehension have notable limitations, including short video durations, brief annotations, and reliance on a single annotator's perspective. These factors hinder a comprehensive assessment of LVLMs' ability to understand complex, lengthy videos and prevent the establishment of a robust human baseline that accurately reflects human video comprehension capabilities. To address these issues, we propose a novel benchmark, FIOVA (Five In One Video Annotations), designed to evaluate the differences between LVLMs and human understanding more comprehensively. FIOVA includes 3,002 long video sequences (averaging 33.6 seconds) that cover diverse scenarios with complex spatiotemporal relationships. Each video is annotated by five distinct annotators, capturing a wide range of perspectives and resulting in captions that are 4~15 times longer than existing benchmarks, thereby establishing a robust baseline that represents human understanding comprehensively for the first time in video description tasks. Using the FIOVA benchmark, we conducted an in-depth evaluation of six state-of-the-art LVLMs (VideoLLaMA2, LLaVA-NEXT-Video, Video-LLaVA, VideoChat2, Tarsier, and ShareGPT4Video), comparing their performance with humans. Results show that while current LVLMs demonstrate some perception and reasoning capabilities, they still struggle with information omission and descriptive depth. Moreover, we found significant discrepancies between LVLMs and humans in complex videos, particularly where human annotators exhibited substantial disagreement, whereas LVLMs tended to rely on uniform strategies for challenging content. These findings underscore the limitations of using a single human annotator as the groundtruth for evaluation and highlight the need for new evaluation perspectives. We believe this work offers valuable insights into the differences between LVLMs and humans, ultimately guiding future advancements toward human-level video comprehension.



    Our Contributions


    (1) Comprehensive dataset construction: We curated a dataset of 3,002 long video sequences (averaging 33.6 seconds) that cover diverse scenarios with complex spatiotemporal relationships. Each video is annotated by five distinct annotators, capturing a wide range of human perspectives and resulting in captions that are 4 to 15 times longer than those in existing benchmarks, establishing a robust baseline that comprehensively represents human understanding in video description tasks (see Section 2).

    (2) Evaluation of state-of-the-art LVLMs: We conducted an in-depth evaluation of six representative open-source LVLMs (VideoLLaMA2, LLaVA-NEXT-Video, Video-LLaVA, VideoChat2, Tarsier, and ShareGPT4Video), ensuring our evaluation reflects the latest advancements in the field. Additionally, we applied diverse processing techniques to model outputs, enabling a more comprehensive assessment of their capabilities and limitations (see Section 3).

    (3) Fine-grained human-machine comparative analysis: Leveraging the FIOVA benchmark, we performed detailed experiments to analyze the differences between LVLMs and human annotations across various aspects of video comprehension. This comparative study offers critical insights into the limitations of LVLMs and underscores the need for new evaluation perspectives that capture semantic understanding, fluency, and content relevance (see Section 4).



    Construction of the FIOVA Dataset


    We propose the FIOVA dataset, designed to comprehensively evaluate video comprehension. FIOVA includes 3,002 video sequences covering 38 diverse themes, each annotated by five distinct annotators to capture a wide range of human perspectives. The dataset is unique in its length and detail, providing descriptions 4 to 15 times longer than those in existing benchmarks. Additionally, FIOVA addresses human variability by consolidating the multiple annotations for each video into a single groundtruth baseline, facilitating detailed human-machine comparisons in video description tasks.

    Table 1


    We assess the quality of the human-generated captions along five key dimensions: consistency, context, correctness, detail orientation, and temporality. Each dimension is scored on a scale of 1 to 10, offering a comprehensive picture of how well a caption captures the video's content. We use GPT to analyze the captions for coherence, accuracy, and chronological order. In addition, we measure variability among annotators with the coefficient of variation (CV), which lets us classify videos by the level of human agreement or disagreement. This assessment provides a multidimensional baseline for comparing LVLM-generated captions with human annotations.
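
    As an illustration of the disagreement measure, the sketch below computes the CV across the five annotators' overall scores for one video. The score matrix is hypothetical, and the GPT-based scoring that would produce it is not shown.

```python
import numpy as np

# Hypothetical per-video scores: rows are the five annotators, columns are the
# five dimensions (consistency, context, correctness, detail orientation,
# temporality), each on a 1-10 scale as returned by the GPT-based evaluator.
scores = np.array([
    [8, 7, 9, 6, 8],
    [7, 7, 8, 5, 7],
    [9, 8, 9, 7, 8],
    [6, 6, 7, 4, 6],
    [8, 7, 8, 6, 7],
], dtype=float)

# Collapse the five dimensions into one overall score per annotator.
per_annotator = scores.mean(axis=1)

# Coefficient of variation across annotators: std / mean.
# A higher CV indicates stronger disagreement among the five annotators.
cv = per_annotator.std() / per_annotator.mean()
print(f"per-annotator means: {per_annotator}, CV = {cv:.3f}")
```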

    To create a reliable reference for evaluation, we synthesize the five human-generated captions for each video into a single comprehensive groundtruth using GPT. This process integrates key elements from each annotation, balancing diversity of perspectives with consistency and coherence. The groundtruth is designed to capture the most critical details of the video while maintaining chronological and contextual accuracy. This consolidated groundtruth serves as a robust baseline for comparing LVLM outputs, ensuring that no important details are overlooked in the evaluation of machine-generated captions.
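
    A minimal sketch of this consolidation step is shown below, using the OpenAI chat completions API as one possible backend. The prompt wording and the model name are illustrative assumptions, not the exact configuration used for FIOVA.

```python
from openai import OpenAI  # assumes the `openai` Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def consolidate_captions(captions: list[str], model: str = "gpt-4o") -> str:
    """Merge five human captions for one video into a single groundtruth caption.

    The prompt below is an illustrative placeholder, and the model name is an
    assumption; FIOVA's actual prompt and GPT configuration may differ.
    """
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions))
    prompt = (
        "The following five descriptions were written by different annotators "
        "for the same video. Merge them into one comprehensive description that "
        "keeps all key events, preserves chronological order, and resolves "
        "contradictions in favor of the majority view:\n\n" + numbered
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```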

    Figure 4

    LVLM Response Collection


    We collected video captions generated by six state-of-the-art LVLMs: VideoLLaMA2, LLaVA-NEXT-Video, Video-LLaVA, VideoChat2, Tarsier, and ShareGPT4Video. Each model processed the same set of 3,002 videos, generating captions based on its visual understanding, with each LVLM configured to optimize its performance on the video captioning task. We then compiled a comprehensive dataset of video-description-response pairs, which allows detailed comparisons between the human groundtruth and model-generated captions. This collection enables a robust analysis of how well LVLMs understand and describe complex video content.
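
    The sketch below shows one way such video-description-response pairs could be assembled. The per-model generate_caption wrapper is hypothetical; the actual inference code for each LVLM is not shown.

```python
import json
from pathlib import Path


def collect_responses(models: dict, video_dir: str, groundtruth: dict, out_path: str) -> None:
    """Build video-description-response records for every video in `video_dir`.

    `models` maps a model name to a hypothetical wrapper object exposing
    generate_caption(video_path) -> str; `groundtruth` maps video ids to the
    consolidated human caption.
    """
    records = []
    for video_path in sorted(Path(video_dir).glob("*.mp4")):
        vid = video_path.stem
        records.append({
            "video": vid,
            "groundtruth": groundtruth[vid],
            "responses": {name: m.generate_caption(str(video_path))
                          for name, m in models.items()},
        })
    Path(out_path).write_text(json.dumps(records, indent=2))
```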

    Overall Evaluation for LVLMs


    In the overall evaluation, we assess the performance of six LVLMs—VideoLLaMA2, LLaVA-NEXT-Video, Video-LLaVA, VideoChat2, Tarsier, and ShareGPT4Video—using traditional metrics such as BLEU, GLEU, and METEOR, as well as the AutoCQ framework, which focuses on event-based evaluation. The results show that while models like Tarsier and VideoLLaMA2 excel in covering key events, they often struggle with descriptive precision and omit important details. On the other hand, ShareGPT4Video achieved the highest precision but lacked comprehensiveness, frequently omitting crucial information. The findings highlight the trade-off between accuracy and completeness, underscoring the need for models that can both capture critical events and provide detailed, fluent descriptions.
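
    For reference, the snippet below computes the three traditional metrics for a single groundtruth-candidate pair with NLTK. Whitespace tokenization and the smoothing choice are simplifications, and the event-based AutoCQ evaluation is not reproduced here.

```python
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # required by METEOR
nltk.download("omw-1.4", quiet=True)


def caption_scores(groundtruth: str, candidate: str) -> dict:
    # Simple whitespace tokenization keeps the sketch dependency-free.
    ref = groundtruth.lower().split()
    hyp = candidate.lower().split()
    smooth = SmoothingFunction().method1
    return {
        "bleu": sentence_bleu([ref], hyp, smoothing_function=smooth),
        "gleu": sentence_gleu([ref], hyp),
        "meteor": meteor_score([ref], hyp),
    }


print(caption_scores("a man rides a bike down the street",
                     "a man is riding a bicycle on the street"))
```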

    Table 2

    Batch Score Evaluation for LVLMs


    The batch score evaluation divides the dataset into eight sub-groups based on the complexity and variability of the video descriptions. We rank the LVLMs across these groups, analyzing their ability to handle both simple and complex videos. Tarsier consistently performs well at capturing temporal changes and maintaining coherence, especially in groups with frequent scene transitions. However, all models show a significant performance drop on the most complex videos (Group H), which are characterized by high variability in the human annotations. These results, broken down by sub-group and metric, highlight the models' differing strategies (some prioritize completeness, while others focus on precision) and demonstrate the need for balanced models capable of handling diverse video scenarios.
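
    As one way to reproduce such a split, the sketch below assigns videos to groups A through H by the CV of their human annotations. The equal-frequency (octile) split is an assumption made for illustration; the benchmark's actual grouping criterion may differ.

```python
import numpy as np


def assign_groups(cv_per_video: dict, n_groups: int = 8) -> dict:
    """Map each video id to a group label 'A'..'H' by its human-annotation CV.

    An equal-frequency (octile) split is assumed here for illustration; the
    benchmark's actual grouping criterion may differ.
    """
    vids = list(cv_per_video)
    cvs = np.array([cv_per_video[v] for v in vids])
    # Interior quantile edges split the CV distribution into n_groups buckets.
    edges = np.quantile(cvs, np.linspace(0, 1, n_groups + 1)[1:-1])
    labels = "ABCDEFGH"
    return {v: labels[int(np.searchsorted(edges, c, side="right"))]
            for v, c in zip(vids, cvs)}
```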

    Figure 7

    Batch Ranking for LVLMs and Humans


    The batch ranking for LVLMs and humans is based on the coefficient of variation (CV) computed within each video group to assess performance consistency. In simpler videos (Groups A and B), the models exhibited higher variability, reflecting divergent strategies; as complexity increased (Groups F to H), CV values decreased, indicating more consistent but less diverse outputs. Human annotations were more consistent on simpler videos, while the models were more consistent than humans on the most complex videos (Group H), suggesting that models fall back on uniform strategies in complex situations where human descriptions remain diverse. Tarsier ranked highest overall, particularly on complex scenarios, while ShareGPT4Video excelled in precision but sacrificed completeness on complex videos. Together, these results highlight the trade-off between precision and recall and underscore the need for balanced models that can handle diverse video content effectively.
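
    The snippet below sketches this group-level consistency comparison, assuming each video already has a single scalar quality score per describer (a model or the human baseline) and a group label from the split above. The choice of score is left open, and this is a simplification of the full ranking procedure.

```python
from collections import defaultdict

import numpy as np


def per_group_cv(scores: dict, groups: dict) -> dict:
    """scores: {video_id: score}; groups: {video_id: 'A'..'H'}; returns {group: CV}."""
    buckets = defaultdict(list)
    for vid, score in scores.items():
        buckets[groups[vid]].append(score)
    return {g: float(np.std(v) / np.mean(v)) for g, v in sorted(buckets.items())}

# Comparing, e.g., per_group_cv(tarsier_scores, groups) against
# per_group_cv(human_scores, groups) shows in which groups model outputs are
# more (or less) consistent than the human baseline.
```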

    Figure 8


    More Specific Examples



    Figure A12

    Analysis:

    Human performance is relatively consistent, while there is significant variation among the models, indicating that the models describe these scenarios poorly. In simple scenarios, humans not only capture the key content of a video quickly and describe it effectively, but also do so with a high degree of consistency. In contrast, LVLMs often struggle to grasp key details in such videos, leading to inadequate descriptions. This difficulty primarily stems from the models' limited understanding of the overall context and the interconnections within the video, particularly when integrating video events with background information. As a result, these models often fail to match human performance in narrative coherence and accuracy.