They're really only abysmal if you attempt to one-shot it and probe with tasks that a human would need a scratchpad to accomplish.

Humans can't one-shot nontrivial planning tasks either. It's the one problem I have with all the papers that try to evaluate planning for LLMs.

Step away from that approach and they're ok.

https://innermonologue.github.io/

https://tidybot.cs.princeton.edu/


