But we still generalize it to bigger and bigger instances without having seen matching-brace problems of every size we can handle.
In the paper I don't think they're looking for perfection, but they do show which models can't seem to learn sizes significantly beyond the examples they explicitly saw.
I don't know what point you're making; GPT-3 almost certainly has training examples of greater depth.
Pattern recognition can fail. Mine does, GPT's does.
I count further than I subitize, but from experience my mind may wander and I may lose track after 700 of a thing even if there's literally nothing else to focus on.
It seems transformers don't generalize in this class, while neural Turing machines and neural stack machines do.
You get the general rule (known in linguistics as competence) but can't flawlessly execute it (performance). Transformers can't seem to acquire the competence here.
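To make the competence/performance split concrete, here's a minimal sketch (ordinary Python, not from the paper) of what the general rule looks like for single-type brace matching: a depth counter handles arbitrary nesting, so any failure at large sizes is a resource limit (performance), not a gap in the rule itself.

    # A minimal sketch (not from the paper) of the "general rule" for
    # single-type brace matching: a depth counter works at any size.
    def balanced(s: str) -> bool:
        depth = 0
        for ch in s:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth < 0:       # a ')' with nothing open
                    return False
        return depth == 0           # everything opened was closed

    # The rule is identical at depth 5 or depth 5000; only memory/attention
    # (performance) changes with size.
    print(balanced('(' * 5000 + ')' * 5000))  # True

Having this depth-independent rule is roughly what competence means here; losing count at 700 of a thing is a performance failure on top of it.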