Curious to know if VLMs can be adapted/extended for video-based tasks (generating video summaries, question answering over video...) by understanding inter-frame context and temporal dynamics?
Papers like OneVision have started looking into this, but most of the research is still in its nascent stages: models typically answer simple questions with a single word or phrase. I'm not sure there's even a good enough benchmark dataset to evaluate such models.
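To give a concrete sense of the most common adaptation trick: many current approaches simply subsample a handful of frames from the clip and feed them to an image VLM as a multi-image prompt, so temporal coverage comes down to how the frames are picked. A minimal sketch of uniform frame sampling (the function name and parameters are my own, not from any specific paper):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip.

    Each index sits at the center of one of `num_samples` equal-length
    temporal segments, so the whole clip is covered without bias toward
    the start or the end.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# e.g., for a 100-frame clip, pick 8 representative frames
indices = sample_frame_indices(100, 8)
```

The obvious weakness, and part of why video benchmarks are hard, is that uniform sampling throws away fine-grained temporal dynamics between the sampled frames, which is exactly what questions about motion or event order depend on.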