They're really only abysmal if you make them one-shot the answer and probe them with tasks that a human would need a scratchpad to accomplish.

Humans can't one-shot nontrivial planning tasks either. It's the one problem I have with all the papers that try to evaluate planning for LLMs.

Step away from that approach and they're ok.