Hacker News
Bridging Images and Text – A Survey of VLMs (nanonets.com)
9 points by StarrySkies11 on Sept 17, 2024 | 2 comments


Curious to know whether VLMs can be adapted or extended for video-based tasks (generating video summaries, answering questions about a video...) by understanding inter-frame context and temporal dynamics?


Papers like OneVision have started looking into this. But most of the research is still at a nascent stage: models mostly answer simple questions with a single word or phrase. I don't think there's even a good enough benchmark dataset to evaluate such models.



