The Chomsky hierarchy is beautiful: a fundamental result in formal language theory, automata theory, and complexity, with far-reaching implications for computer science (from compiler design to computability).
What I learned only in a theoretical/formal linguistics class is that
there exist other hierarchies of languages and associated machines (e.g.
A, B, C1, C2, C3 languages) entirely orthogonal to the Chomsky hierarchy's
classes, i.e. recursively enumerable, context-sensitive, context-free, and regular classes.
The way rules may look and how they get applied induce alternative universes of (hierarchies of) formal languages and their automata.
Neural networks have come a long way from the Perceptron's inability to compute
XOR to the OP's paper. It would be interesting to push that work further so as to include alternative hierarchies.
This is a bit of a layman's outsider perspective, so I'd welcome having my perceptions corrected by people who are more up on the way formal grammars are employed in the ML community.
To my mind, though, the Chomsky-ish ML focus on 'grammars' has always seemed to miss some of the power of language. Like, the idea that there is something profound in the fact that "Colorless green ideas sleep furiously" is a grammatical fragment that is semantically meaningless seems to me to miss the fact that it's still a meaningful utterance in a language, even if what it denotes is nonsensical. As is proven by the fact that Chomsky used it in his work on the subject! Its meaninglessness is meaningful - you can use it to illustrate a point. The same is true of ungrammatical sentences, too. Not only do actual humans make ungrammatical utterances all the time, but, for example, if you are trying to teach someone a language, you might show them an ungrammatical sentence as an example of what not to do. So any language that is capable of talking about its own grammar has to admit sentences that are ungrammatical in that language, precisely so you can talk about them!
Human language understanding isn't 'parsing', we don't reject utterances because we can't lex the tokens or construct an unambiguous syntax tree or extract semantic meaning. Chomskyish ML researchers seem focused on the problem of making systems that produce grammatical utterances. But I'm honestly more impressed with, for example, GPT's ability to handle and produce the ungrammatical and semantically nonsensical using the same system that it uses to work with grammatical and sensical input and output. That seems much more human-like to me.
That a particular neural architecture is incapable of 'rejecting' certain ungrammatical structures doesn't feel like it matters, so long as the architecture is capable of recognizing a fuzzier, analogue degree of 'grammaticality' of an utterance.
As many of you pointed out, memory is always finite in practice. The Chomsky hierarchy, however, only makes sense in the infinite-memory case, since anything with finite memory will end up in the lowest category, i.e. finite-state automata (recognizing regular languages). So why does it matter?
Because of generalisation, and deep within it, compression. I wrote this paper because I wanted to understand what it really means to generalize.
Yes you can reverse strings by creating a massive lookup table of all strings and their associated reverse. You can also do it by hard coding it for all lengths between say 1 and 500 (for length n, take the first token and swap it with the nth, then the second token and swap it with the (n-1)th etc). Both solutions would work in practice. But is it really what we mean by 'reversing a string'? Is it what we want an intelligent system to do?
The answer is clearly no. What we want is to do the same thing but using a simple, compressed algorithm (in the reverse-string case, push onto a stack, then pop from it: neat and simple). That's what we mean in general by 'this network generalises well': it has found the most compressed algorithm reproducing the data we fed it.
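To make the contrast concrete, here is a rough sketch (mine, not from the paper) of the two kinds of solution; the function names are purely illustrative:

    def reverse_lookup(s, table):
        # "Lookup table" solution: only works for strings already stored,
        # and the table grows with the number of strings you want to cover.
        return table[s]

    def reverse_with_stack(s):
        # The compressed algorithm: push every token, then pop them all.
        # Works for any length, because the stack grows with the input.
        stack = []
        for ch in s:
            stack.append(ch)
        return "".join(stack.pop() for _ in range(len(stack)))

    assert reverse_with_stack("abcde") == "edcba"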
That's really what the Chomsky hierarchy is about: it tells you how to get systems that generalise better. And that's basically what AI is all about.
That's how you probe the abilities of neural nets! Empirical evaluation done in the context of a theoretical framework. Not just poking models to see what happens [1].
Huh. No wonder Marcus Hutter's name is in the list of authors.
_____________
[1] Although poking stuff to see what happens is the root of all science, don't get me wrong.
Absolutely, but we only understand things in the context of what we already know, and we build up our understanding of new things from that.
Besides, a theory can also guide your experiments. For example, in physics, you wouldn't try to calculate the mass of all strawberry icecream in the world because physics doesn't, as a matter of course, consider the properties of strawberry icecream as particularly special or interesting. They're not a concept of physics, that is.
I like the idea of formalizing the computing power of NN by mimicking what we have done with automata theory, but I didn't quite grok how a NN would make use of an "external memory structure."
we showed that models interacting with an external memory structure, such as a stack or a finite tape, can climb the Chomsky hierarchy
I think of a trained NN as containing a set of weights which control the various perceptrons, convolution kernels, etc. as input data passes thru the network. Exactly how does an external memory source (like a stack or Turing tape) come into play? Are perceptrons augmented with the ability to record information that changes their behavior? What is a simple example of a NN that makes use of an external memory structure?
I'm not an expert, so some or all of this may be wrong, but my understanding of how it works is this. You have a stack memory, which is just a list of numbers (or maybe a list of N-dimensional vectors). You add the contents of the stack at time t-1 to the input of the network at time t. The hidden layer contains a special number that controls whether to push or pop from the stack (or leave it alone). The values in the stack are determined from the values in the hidden layer at the time they were pushed and the special push/pop control parameter. The new value of the stack is passed into the network on the next time step, and so on.
The tricky thing is that the stack operations have to be differentiable, otherwise there's no way to optimize the push/pop control and the set of weights that determines how to turn the hidden layer into a stack value. So the special control variable in the hidden layer is a continuous value that controls "how much" push vs. pop to do. The value of the top of the stack is a weighted average of the values in the hidden layer and the current value on the top of the stack.
So, if the push vs. pop knob is turned all the way toward pop, then the value of the stack will be whatever the second-from-the-top value is. If it's turned all the way toward push, then the value at the top will be a weighted average of the values in the hidden layer. If it's 50% push and 50% pop, then the value at the top of the stack will be 0.5 * <hidden layer value> + 0.5 * <second-from-top value>, and so on. It's hard to think about what a half push, half pop would actually mean (at least for me), but it apparently works.
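To make the "how much push vs. pop" idea concrete, here is a rough numerical sketch of one step of such a soft stack (my own simplification; published differentiable-stack architectures differ in the details):

    import numpy as np

    def soft_stack_step(stack, push, pop, candidate):
        # stack:     (depth, dim) array, stack[0] is the top
        # push, pop: scalars in [0, 1] emitted by the hidden layer,
        #            with push + pop <= 1
        # candidate: (dim,) vector computed from the hidden layer
        keep = 1.0 - push - pop
        pushed = np.vstack([candidate, stack[:-1]])                     # shift down, new top
        popped = np.vstack([stack[1:], np.zeros((1, stack.shape[1]))])  # shift up
        # Every cell becomes a convex combination of the three outcomes,
        # so the update stays differentiable end to end.
        return push * pushed + pop * popped + keep * stack

    stack = np.zeros((4, 3))
    stack = soft_stack_step(stack, push=0.9, pop=0.05, candidate=np.ones(3))
    print(stack[0])   # the top is now mostly the pushed candidate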
As a sibling commenter wrote, there are fancier architectures that provide a full differentiable RAM or other types of gadgets for the network to control.
RETRO has an external source of data but is something entirely different from what is discussed in the paper. In simplistic terms, in the RETRO architecture you store embeddings in an external database and then use kNN search. It is not an end-to-end model capable of things like those with stack or tape memory (what OP asked about); retrieval models are only good for limited Q/A atm.
The Chomsky hierarchy is all about the type of memory.
First note that to get to interesting classes, i.e. everything above regular languages / finite state automata (FSA), you need infinite memory. When you have finite memory, it is always only just as powerful as a FSA. Adding one (infinite) stack will give you pushdown automata complexity, and adding two (or more) (infinite) stacks will get you to Turing machines.
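The two-stacks-give-you-a-tape point is the standard construction: split the tape at the head, keep everything to the left of the head on one stack and everything at or to the right of it on the other. A toy sketch (class name is just illustrative):

    class TwoStackTape:
        # Left stack holds the cells to the left of the head (nearest on top);
        # right stack holds the cell under the head and everything to its right.
        def __init__(self, blank="_"):
            self.left, self.right, self.blank = [], [], blank

        def read(self):
            return self.right[-1] if self.right else self.blank

        def write(self, symbol):
            if self.right:
                self.right[-1] = symbol
            else:
                self.right.append(symbol)

        def move_right(self):
            self.left.append(self.right.pop() if self.right else self.blank)

        def move_left(self):
            self.right.append(self.left.pop() if self.left else self.blank)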
Consider that a float in hardware (f32, bf16, or whatever you like), but also in reality (think of the neuronal spike voltage), is not infinite in memory but finite. This is different from mathematics, where you can store infinite memory in a real number. Note that there is an infamous paper, "On the computational power of neural nets", Siegelmann & Sontag, 1992, which states that RNNs are Turing complete. But this construction assumes that you can store infinite memory in a single neuron activity, which is never true in practice. In practice, an RNN or LSTM has finite memory.
However, in practice, any computer also has finite memory. The human brain also has finite memory. So, it follows, you always only have the complexity of FSAs. But is this right? Wouldn't intuition say something different? Maybe the Chomsky hierarchy is not really so relevant? Or the question is somewhat ill-posed.
What would make a computer Turing complete? You need infinite memory. So, you need to abstract away from a particular computer towards the concept of a computer with infinite memory. The official C language definition calls this an "abstract machine". This abstract machine is Turing complete.
How can you apply this to neural networks? How to define an abstract neural network with some explicit memory component, which can grow to infinite sizes? You get to memory-augmented models like the differential neural computer (extended from the neural Turing machine). In theory, you can think of abstract variants of those models with infinite memory, and then you can think about the Chomsky hierarchy.
In practice, the memory is always finite though. What they do in the paper is to focus more on the Chomsky hierarchy in practice, i.e. applied to some actual benchmarks. When you limit the length of the input problems, there is some maximum amount of memory which should be sufficient to solve them. Depending on the structure of the neural model, it gives you a clue to what compute complexity it mostly corresponds to when you test it.
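Concretely, the flavour of protocol the paper describes is: train on short instances, test on strictly longer ones, and see whether accuracy survives. A hedged sketch of that idea (the toy task and the placeholder predictor here are mine, not the paper's code):

    import random

    def sample_reverse_task(n):
        s = "".join(random.choice("ab") for _ in range(n))
        return s, s[::-1]

    def evaluate(predict, lengths, samples_per_length=100):
        # Score a predictor on lengths it was never trained on.
        scores = {}
        for n in lengths:
            hits = 0
            for _ in range(samples_per_length):
                x, y = sample_reverse_task(n)
                hits += int(predict(x) == y)
            scores[n] = hits / samples_per_length
        return scores

    # A "generalising" solution scores 1.0 at every length; a memorised
    # lookup table fails as soon as n exceeds what it has stored.
    print(evaluate(lambda s: s[::-1], lengths=[10, 100, 500]))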
> The point of stating that a mathematical model is Turing Complete is to reveal the capability of the model to perform any calculation, given a sufficient amount of resources (i.e. infinite), not to show whether a specific implementation of a model does have those resources. Non-Turing complete models would not be able to handle a specific set of calculations, even with enough resources, something that reveals a difference in the way the two models operate, even when they have limited resources. Of course, to prove this property, you do have to assume that the models are able to use an infinite amount of resources, but this property of a model is relevant even when resources are limited.
In other words, when you have a reasonably big practical finite problem with limited memory, an FSA quickly runs out of memory.
> In other words, when you have a reasonably big practical finite problem with limited memory, an FSA quickly runs out of memory.
Yes this is a difference in the expressiveness of each model, because FSAs can require a combinatorial explosion in the number of states to model a Turing machine. The parent is still technically correct though.
> First note that to get to interesting classes, i.e. everything above regular languages / finite state automata (FSA), you need infinite memory. When you have finite memory, it is always only just as powerful as a FSA.
While this is true, I'd like to point out that this distinction is purely theoretical. Nothing infinite exists in the real world, but some things are so plentiful we might consider them infinite for the sake of easier reasoning.
A very practical example of this is cloud computation. All data centers in the world obviously have a finite amount of memory, however, the amount is so huge that for the sake of any practical computation, we might as well consider them infinite (though we are more limited by the money required than by the physical resources themselves).
The distinction is very practical. It's better to think about it in terms of hard-coded limits vs. arbitrary scaling than finite vs. infinite. Anything designed to handle an arbitrary number of items n needs "infinite" memory.
Language classes are about the code, not about the computer. With a finite state automaton, you have to commit to the size of the input in advance and write the state transitions explicitly. While there exist structurally similar automata that can handle larger inputs, your automaton – your code – cannot handle them. You can generate those automata algorithmically, but in order to do that, you need a more expressive formalism that can handle n items.
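For a concrete (made-up) illustration of "your automaton cannot handle larger inputs, but a more expressive formalism can generate the automata": an acceptor for a^n b^n with n up to a hard-coded bound, produced by a generator that itself has counting power no single finite automaton has.

    def make_anbn_dfa(max_n):
        # Builds a *finite* automaton for { a^n b^n : n <= max_n }.
        # Its states are ("a", k) after reading k a's and ("b", k) with k b's
        # still owed, so the state set grows with max_n, and any input longer
        # than 2*max_n is rejected no matter what: the size was committed to
        # in advance. The generator (this function) is the more expressive code.
        def step(state, ch):
            kind, k = state
            if kind == "a" and ch == "a" and k < max_n:
                return ("a", k + 1)
            if kind == "a" and ch == "b" and k >= 1:
                return ("b", k - 1)
            if kind == "b" and ch == "b" and k >= 1:
                return ("b", k - 1)
            return None  # dead state

        def accepts(s):
            state = ("a", 0)
            for ch in s:
                state = step(state, ch)
                if state is None:
                    return False
            return state in (("a", 0), ("b", 0))

        return accepts

    dfa = make_anbn_dfa(3)
    assert dfa("aabb") and not dfa("aaabbbb") and not dfa("aaaabbbb")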
Actually, time (and maybe space) are infinite (countably so, thanks to Heisenberg's Uncertainty Principle) ?
Not a physicist, so better get a second opinion, but I do not think that this is something you can get out of the uncertainty principle. As far as I can tell it does not turn continuous space and time into some discrete voxel space, i.e. while there is uncertainty, the location of the peak of the amplitude, for example, can still be located anywhere in continuous space.
Energy probably isn't though (maybe it is, if space is ?), that is going to put a limit on non-reversible computation ?
The total energy of the universe might be zero. [1]
Physicist here. You are correct. The uncertainty principle doesn't turn space into some sort of voxel grid.
While we still don't know what spacetime looks like at the smallest scales (we're still waiting on a quantum theory of gravity), quantum field theory measures spacetime using continuous real numbers.
> While we still don't know what spacetime looks like at the smallest scales (we're still waiting on a quantum theory of gravity), quantum field theory measures spacetime using continuous real numbers
IIRC, the uncertainty principle limits the precision in any possible measurement to something like 60 or 70 digits. So yes, current theories use continuous real numbers, but I wouldn't generalize that to say that's confirmed because we're nowhere near being able to test that level of precision.
I guess you are confusing two things - think of a mark on a real line at x somewhere between 17 mm and 18 mm from the origin. If you measure its location with a ruler, you might only be able to say that it is somewhere between 17 mm and 18 mm from the origin, but this uncertainty in your measurement in no way constrains x to only be located at integer millimeter positions or something like that.
There is not even a real limit imposed by the uncertainty principle, you can measure positions as precisely as you want, you just have to pay in momentum uncertainty. Where we really seem to run into a wall is that if you keep making the measurement more and more precise, you need higher and higher energies to achieve ever shorter wavelengths and dumping a lot of energy into a small volume to measure the position really precisely will eventually result in black holes.
> There is not even a real limit imposed by the uncertainty principle, you can measure positions as precisely as you want, you just have to pay in momentum uncertainty.
Yes, that's the current continuous model of how this works. That isn't necessarily reflective of reality though.
> Where we really seem to run into a wall is that if you keep making the measurement more and more precise, you need higher and higher energies to achieve ever shorter wavelengths and dumping a lot of energy into a small volume to measure the position really precisely will eventually result in black holes.
Exactly. In other words, all measurements necessarily have finite precision due to various uncertainties or other physical limits.
I didn't pull this limit out of the aether, I came across it from one of Shor's posts [1] where he states that physical constants can't be defined to greater precision than what I specified above. If the physical constants can't have more than 60 digits of precision, then neither can any calculations or measurements based on them.
The fact is, we seem to be bounded on all sides to finite precision.
My point is just that limited measurement precision does not imply anything about the underlying structure, could be continuous or not, I am not arguing for one side or the other.
Because if position and momentum are not quantized like this, you cannot get discrete bits of information (aka negative entropy) and the whole (so-called) "2nd law of thermodynamics" becomes impossible to derive, which is kind of a problem ?
Note that this underlying structure is subjective, relative to the observer, not something objective... (but we already know that there's no such thing as an "objective underlying structure" from elsewhere from quantum mechanics and also from relativity)
> (but we already know that there's no such thing as an "objective underlying structure" from elsewhere from quantum mechanics and also from relativity)
No we don't. Don't confuse the map with the territory.
In the sense that physics is about "map-making", not the "true nature of reality", and has been for a while now (since it split from philosophy ? since the beginning of postmodern physics in 1905 ?).
Even philosophy has pretty much given up that claim : with Gödel/Church/Turing having blown up to smithereens the positivist project of a "theory of everything" for mathematics, and Wittgenstein/Kuhn/Derrida/Foucault/Chomsky having redirected the rest of philosophy towards "the naming of things".
And that project had been ironically doomed from before its start anyway : Descartes both laid the groundwork for it by elevating epistemology to "first philosophy" and for solipsism - which, while a dead-end, cannot be ruled out !
(Also honorable mention for Max Tegmark's Mathematical Universe I guess, which, in a way, indirectly achieves the claim by positing a mathematical multiverse so vast that "our" reality can only be contained inside it.)
So the only discipline left that still lays claim on Truth and the True Nature of Reality is theology. (Note that this is how Descartes "solved" the problem of solipsism.)
Physics is map making, but from that your claims - subjective nature of reality, no objective structure - do not follow; at best you could claim that physics has nothing to say about the nature of reality. And I have doubts about that, at least to some extent. What kind of answer would you expect for a question like what is the true nature of an electron? You can describe the properties of an electron, what more do you want? What is a thing above and beyond the sum of its properties?
The tricky issue here is that "the electron" itself is a specific model that only makes sense under some paradigms...
There's also a point in how one paradigm might be ontologically radically different compared to the previous one... but what science cares about more is the new paradigm being a "tighter fit" between its new models and the results of the new experiments.
Also, beyond a single "thing", it's when we start considering collections of things that the situation can get very tricky very fast, like chaotic behavior from something as simple as 3 masses under the Newtonian paradigm ! (See also : "emergence")
Or the concept of temperature : it doesn't go "down" to the "true nature of reality", but is a statistical one that is not even always defined, yet is still quite useful.
But yet again I would like to emphasize how in several subfields of physics we now are in a situation where we had to give up an objective viewpoint of the situation for a subjective one, and where the information itself that we have about a system (aka negative entropy) is another variable in a super system that includes us (and our instruments) and the system being studied, and we are forced to consider that supersystem instead, or at least also, in order to go "deeper".
(It's also impressive how in some cases we now study "turtles all the way down" situations, with models having an infinite number of correction steps that we can cheat around thanks to the use of advanced mathematics. But maybe in the future these will be seen as trivial as we see the Zeno paradox today ?)
Well then you seem to be agreeing with my original statement:
> So yes, current theories use continuous real numbers, but I wouldn't generalize that to say that's confirmed because we're nowhere near being able to test that level of precision.
Yes and no, I guess, depending on what exactly you intended to say. We agree, I guess, that we are currently using real numbers but that does not mean that the universe is actually continuous. Where I am not so sure that we agree is about the experimental side. We could run into some barrier when probing shorter and shorter distances, but this would not necessarily imply that space is not continuous. On the other hand we could also observe effects that clearly indicate a non-continuous structure of space without running into some measurement limit.
> Where I am not so sure that we agree is about the experimental side. We could run into some barrier when probing shorter and shorter distances, but this would not necessarily imply that space is not continuous
Agreed, though I personally find continuous quantities implausible, despite preferring them at one point. As long as we had a discrete theory with equal predictive and explanatory power and it was equally parsimonious to a continuous theory, I would likely prefer it. I think the next revolution in physics will see an expansion of discretization or other forms of finitism.
> On the other hand we could also observe effects that clearly indicate a non-continuous structure of space without running into some measurement limit.
I want to tone that down somewhat : I think that I misremembered that Planck's length and time were derived from the uncertainty principle... while looking into it, it might not be as obvious ? (So you need both ?)
And would you get quantification of mass~energy from that of the impulse (through v having dimensions of space and time), or do you need to use a different approach and assume black holes ? (Or both, and pick the biggest value ?)
The Planck length and time don't have any special physical significance, they're just convenient, and coincidentally happen to be roughly the size where we know for sure QFT is insufficient. The "smallest meaningful unit of distance" stuff is nonsense.
> And would you get quantification of mass~energy from that of the impulse (through v having dimensions of space and time), or do you need to use a different approach and assume black holes ? (Or both, and pick the biggest value ?)
Neither, though the impulse thing is at least in the right neighborhood. (Black holes have absolutely nothing to do with this). You get quantized energy whenever the Hamiltonian of your system has a pure point spectrum. Typically this happens for finite dimensional systems, and infinite dimensional systems with potentials that grow rapidly with increasing distance. Generic infinite dimensional systems will usually have continuous spectra.
> (Black holes have absolutely nothing to do with this)
Don't they have, in the sense that to keep probing ever more precisely, you need ever more energy, and at some point too much mass~energy in a too small volume is going to form a black hole and you cannot go further ?
----
Argh, I should have known that this discussion would start to involve operators at some point, especially ones with an infinite number of dimensions... XD
Though this seems related mathematically to how the uncertainty principle can be interpreted through Fourier transforms : localized <=> spread out ; quantized <=> continuous ; finite (+ conditions on potential) <=> infinite (+ other conditions on potential) ?
> Don't they have, in the sense that to keep probing ever more precisely, you need ever more energy, and at some point too much mass~energy in a too small volume is going to form a black hole and you cannot go further ?
This may or may not be true - we can't probe those length scales yet, and maybe ever, so we don't really know - but it has no bearing on ordinary QM, which is nonrelativistic.
> Though this seems related mathematically to how the uncertainty principle can be interpreted through Fourier transforms
Not especially. They're both results in functional analysis, but that's about it.
You get the Planck units by setting c, hbar, G, and Boltzmann's constant to 1. This is convenient for notational purposes but it has no inherent physical significance.
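Just to make the dimensional bookkeeping concrete, the usual combinations (approximate SI values; the point is only that these are algebraic combinations of the constants, not physically privileged scales):

    hbar = 1.054571817e-34   # J*s
    G    = 6.67430e-11       # m^3 / (kg * s^2)
    c    = 2.99792458e8      # m / s

    planck_length = (hbar * G / c**3) ** 0.5   # ~1.6e-35 m
    planck_time   = (hbar * G / c**5) ** 0.5   # ~5.4e-44 s
    planck_mass   = (hbar * c / G) ** 0.5      # ~2.2e-8 kg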
That isn't the only way to argue for a minimal significant distance. Arguably QM sets a strict ~60-70 digit precision limit on physical constants [1], beyond which you can't differentiate between discrete and continuous theories, and so a minimum distance seems like a perfectly sensible way to frame it.
> Actually, time (and maybe space) are infinite (countably so, thanks to Heisenberg's Uncertainty Principle) ?
Time is infinite in theory, but not practically. Time is fairly meaningless after the heat death of the universe. Some quantum field theories also suggest that spacetime becomes unstable if energy falls below a certain density IIRC, so even spacetime expansion might have a limit.
What I meant to say is that nothing that can be used for computation is infinite. Resources are finite, there is a finite amount of RAM/hard drive/paper space in the world.
Space is infinite, but we haven't yet found a technique to use pure space as computational memory.
But in a real-world application of course I have to think about the queue's capacity and bandwidth limits and monitor accordingly. Assuming infinite anything in production would be a time bomb of tech debt.
While I see where you're coming from, the whole "scalability" story is about assuming infinite resources and scaling your usage as the demand spikes. Kubernetes (I think) is specialized in that regard - assuming you can create a potentially infinite number of instances, you create a system that consumes as many resources as it needs to fulfill the demand.
Thanks, this made me understand (without reading the whole paper, might do that now) why they were talking about :
> only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks
I guess because in practice (for computers, not brains... at least not yet) we can actually cheat by using "extensible on demand memory" rather than an actually infinite one (which is impossible) ?
Also, I wonder if the difference between countably infinite (a Turing machine's tape) and the continuum (which, assuming ZFC, might somehow be achieved by doubling a number for each slot on the Turing tape) might be of relevance even in their finite, extensible-at-will (for all practical purposes) equivalents ??
but what about human intelligence, clearly it does not require infinite anything.
also, what about accumulated 'memory' in our human culture? how finite is that? how do you quantify this? (I'd think it'd be quantified by a timestamp, like a reference to an era or an historic age?? shrug)
> How to define an abstract neural network with some explicit memory component, which can grow to infinite sizes?
I think "culture" is the answer.. but maybe I actually mean a free "natural language" (in contrast with a formal (or synthetic) language)
How can we know that? If memory in some way uses energy levels for encoding, then we know it must be finite because of quantization - but do we know that?
I don't think there's any reason to assume that it's finite or infinite. There could be physical properties in real life that can vary continuously. Energy isn't one, but there could be others.
Candela is not a physical unit (but one based on human biology).
----
Of course, that's a different question from whether their amounts in our universe can be (countably) infinite : so far as we know, time is (and maybe space, which would make other space-extensive quantities like mass-energy and charges also (countably) infinite ?)
This isn't an argument that should convince anyone unless we both have a theory of everything and you want to then enumerate every possible variable involved. Any ignorance would leave a gap. The Bekenstein bound mentioned in the other comment, though, demonstrates that the approach isn't even necessary to come to the same conclusion (as far as I can tell anyway).
That doesn't seem reasonable to me : we do not need to preface every statement with "with our current understanding of physics", this goes without saying.
We might never get a "theory of everything", and we shouldn't assume the existence of new physical properties without any other reason to do so, this goes against the principle of parsimony ! (Especially continuous ones, none of which seem to exist for now !)
And if our understanding of physics does change, it could have other dramatic consequences - that Bekenstein bound might not be valid any more for instance...
We're not assuming the existence of such physical properties. We're also not assuming the inexistence of such physical properties. That's the point. You're making a leap based on no evidence. I'm making no such commitment and your arguments shouldn't convince anyone to make that leap with you. Worse, you want to use this unfounded leap to claim we know something else. It's terrible epistemology. It turns out we do know that other fact, but certainly not by following you down your unsound road.
I can't say it's totally surprising you need a machine (store, stack, instructions) to handle higher language issues that GPT targets
Not to crit the paper: formally stating things is good. But it feels like the outcome was highly predictable. Inferences being drawn from the language model where the language is a real world language and not a restricted subset were always going to hit up against the complexities of real languages.
And I'd be mindful that conforming to Chomsky Hierarchies says nothing about the nature of universal language models, or {{GPT or turing machines} and emergent intelligence}. Which again, isn't to lay things on the authors. People have a habit (here, in general) of taking things out of context.
I kind of feel that's about it: if the analysis of the classifier architecture shows constructs which map to a machine capable of recognizing the higher Chomsky class, then it's likely that reflects the complexity of the languages being analysed.
Matching braces is part of written natural language and would be beyond the power of regular languages, wouldn't it? I thought there was something special about transformers allowing them to learn that, at least within the context window size, in a generalizable way.
It also sounds like, from the paper, transformers are formally proven to be capable of it, but the training can't get the weights to make it capable of it, even though such weights should exist, if I am reading them right.
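For reference, the reason brace matching sits above the regular languages is that arbitrarily deep nesting needs unbounded memory; a minimal checker (my sketch) makes the needed stack explicit:

    # No fixed number of DFA states can track arbitrary nesting depth,
    # but a stack that grows with the input handles it trivially.
    PAIRS = {")": "(", "]": "[", "}": "{"}

    def braces_balanced(text):
        stack = []
        for ch in text:
            if ch in "([{":
                stack.append(ch)
            elif ch in PAIRS:
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
        return not stack

    assert braces_balanced("({[]()})") and not braces_balanced("([)]")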
But we still generalize it to bigger and bigger sizes without having seen matching-brace problems of all the sizes we can handle.
In the paper I don't think they are looking for perfect, but they show which models can't seem to learn significantly beyond the size of examples they explicitly saw.
I don't know what point you're making, GPT-3 almost certainly has training examples for greater depth.
Pattern recognition can fail. Mine does, GPT's does.
I count further than I subitize, but from experience my mind may wander and I may lose track after 700 of a thing even if there's literally nothing else to focus on.
It seems transformers don't generalize in this class and neural turing machines and neural stack machines do.
You get the general rule (in linguistics known as competency) but can't flawlessly do it (performance). Transformers can't seem to get competence here.
this may well be the only real skill of programmers who learned C-style syntaxes, and/or LISPs
on the other hand, thinking with the parenthesis takes language up to the level of functional thinking. computable stuff is all about substitution of all things inside a parenthesis for a single thing. and this is the essence of all classic computation
The paper covers that. They only test to a finite size, but the models that fail can't generalize significantly beyond the size of explicit examples they were trained on.
Formally transformers should be capable of modeling it with the right weights, but they show good evidence those weights can't be learned with gradient descent. Whereas a neural turing machine or stack machine can learn them with gradient descent.
You might mean something like if you restrict the set of all valid inputs to N characters then the language is regular, but that's a much stronger condition than requiring the input to be finite. Furthermore, such a restriction makes the language non-recursive; it's effectively a trivial language.
In standard automata theory, all inputs to automatons are finite.
I'm thinking if your understanding were sound I would have to listen to it, but I'd rather read about these kinds of things, which is why I say unsound is best
on the other hand, putting things down into writing is difficult. So while you diminish this paper's accomplishments, you couldn't have done it yourself in spite of easily understanding it.
Does it matter if you can outsource computation to Wolfram Alpha or a Python interpreter? I am interested in whether Transformers can (learn to) reason. They certainly have a good statistical model of the world.