Curious to know if VLMs can be adapted/extended for video-based tasks (generating video summaries, question answering over video...) by understanding inter-frame context and temporal dynamics?
Papers like OneVision have started looking into this, but most of the research is still in its nascent stages: models typically answer simple questions with a single word or phrase. I'm not sure there's even a good enough benchmark dataset to evaluate such models.
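To give a concrete sense of the most common adaptation trick: many current approaches simply subsample a handful of frames from the clip and feed them to an image VLM as a multi-image prompt, so temporal coverage comes down to how the frames are picked. A minimal sketch of uniform frame sampling (the function name and parameters are my own, not from any specific paper):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip.

    Each index sits at the center of one of `num_samples` equal-length
    temporal segments, so the whole clip is covered without bias toward
    the start or the end.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# e.g., for a 100-frame clip, pick 8 representative frames
indices = sample_frame_indices(100, 8)
```

The obvious weakness, and part of why video benchmarks are hard, is that uniform sampling throws away fine-grained temporal dynamics between the sampled frames, which is exactly what questions about motion or event order depend on.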