[21] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
[22] Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, and Yale Song. Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10274–10284, 2021.
[23] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu,
and Baining Guo. Advancing high-resolution video-language representation with large-scale
video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 5036–5045, 2022.
[24] David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), Portland, OR, June 2011.
[25] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language, 2017.
[26] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[27] Luowei Zhou, Chenliang Xu, and Jason J. Corso. Towards automatic learning of procedures from web instructional videos, 2017.
[28] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
[29] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
[30] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding, 2023.
[31] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
[32] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. arXiv preprint arXiv:2404.00308, 2024.
[34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[35] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[36] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
[37] Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
[38] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023.
[39] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models, 2024.
[40] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.