Stack Computers: the new wave (1989) (cmu.edu)
98 points by JoachimS on Aug 15, 2022 | 48 comments



Does anyone know of the state of the art of stack and symbolic CPUs?

I'm reading a few books that address stack computers and also symbolic CPUs, ones which can run things like Forth, Prolog, and Lisp directly on the CPU (or at least more directly than via abstractions sitting on top of von Neumann architectures). The ideas in the books, such as those in A High Performance Architecture for Prolog, seem important and useful, but it's hard to tell whether they are no longer useful today or simply fell out of favor or interest.

In today's age of FPGAs, ASICs, RISC-V, etc., it seems like ripe territory for custom CPU development, but about the only thing I know of along these lines is the GreenArrays GA144 chip.

https://www.greenarraychips.com/


There were a bunch of Java chips in the early 2000s, and ARM cores with a subset of the JVM instruction set.

https://en.m.wikipedia.org/wiki/Java_processor

https://en.m.wikipedia.org/wiki/Jazelle


25 years old, but http://bernd-paysan.de/4stack.html sounded interesting at the time.


The conventional view is that stack computers didn't scale because they're very demanding in memory bandwidth - every operation ends up reading 2 operands from memory and perhaps writing one result back. Which is of course exactly the wrong thing, because registers got much faster than RAM. But they're also elegantly small. Are there any stack computers used today, perhaps in systems where slow speed is not a problem but ultra-small size (and code size) is an advantage? I only see https://en.wikipedia.org/wiki/ZPU_(microprocessor) (2008) - 442 LUTs(!).


Memory bandwidth can be managed by all manner of microarchitectural methods. The issue in the modern world is that doing any kind of register renaming that would allow issuing 4 or more instructions at a time gets convoluted very quickly. If you could crack that, the code density would be great. But in the end, giving the compiler 32 or so architectural registers and adding a renaming backend with 100+ microarchitectural registers is probably a good trade-off and avoids gnarly long paths in the logic.


I could see a circular buffer implemented in the registers working. Basically keep the top x entries of the stack loaded in the registers, and push/pull values when you get close to the opposite end (pushing values to main memory if it gets too full, pulling them back if it gets too empty). There's a rough sketch below.

Dunno if I am explaining it well.

On a reread, I think I might just be stating what you did, but in different, worse, words.

Definitely a VERY different hardware architecture than "normal" machines though
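
Roughly, as a toy Python model of the spill/fill idea (all names made up, not any real microarchitecture): keep the top of the stack in a small circular register file, spill the oldest entry to memory when it fills up, and pull it back when it drains.

    # Toy stack cache: the top of the stack lives in a small circular
    # register file; older entries spill to "main memory" when it fills.
    class StackCache:
        def __init__(self, num_regs=8):
            self.regs = [0] * num_regs   # circular buffer of registers
            self.depth = 0               # live entries currently in regs
            self.top = 0                 # index of the top-of-stack register
            self.memory = []             # spilled (older) stack entries

        def push(self, value):
            if self.depth == len(self.regs):
                # Full: spill the oldest register before overwriting it.
                oldest = (self.top + 1) % len(self.regs)
                self.memory.append(self.regs[oldest])
                self.depth -= 1
            self.top = (self.top + 1) % len(self.regs)
            self.regs[self.top] = value
            self.depth += 1

        def pop(self):
            value = self.regs[self.top]
            self.top = (self.top - 1) % len(self.regs)
            self.depth -= 1
            if self.depth == 0 and self.memory:
                # Drained: refill the top from spilled memory.
                self.push(self.memory.pop())
            return value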


The AMD Am29000 family had exactly that. The problem is, you still have to be able to name registers indirectly, or you'll have to either give up on separate compilation or do link-time intra-procedural register allocation. Both of those things are, of course, feasible in small embedded environments... but not in general-purpose computing, especially if you want to bring most of the pre-existing software in.


The problem isn't just faking a stack with a fast buffer, it's figuring out where the sources are, or rather, will be, with pipelined execution, or worse yet, decoding multiple instructions at once.

You also want to have multiple instructions writing a given architectural resource live at once. That's tough if top-of-stack is a buffer location.


Didn't SPARC have some kind of sliding window register set?


It did. I have fond memories (no doubt in part shaped by Stockholm syndrome) of writing a Pascal(ish) compiler targeting the SPARC ISA as an undergrad back in 1993 based on four? pages of ISA documentation. As I recall, handling procedure calls and returns was quite elegant.


And so did the Berkeley RISC design. Also notably the Intel i960 and the Itanium.


When Burroughs started experimenting with stack processors in their B5000 computer, they didn't do it for code size but for the ability to use high-level languages first, going as far as hiding assembly language from the user. Compilers are fairly straightforward to write for such an architecture, and can produce reasonably fast machine code in one pass. A register-based CPU makes this quite a bit more complex, and is indeed a more performant but less elegant approach than a stack-based CPU.

This later inspired HP for the HP 3000 and then its RPN calculators, as well as other, lesser known machines. Nowadays I think I've only seen this with the JVM (to reduce the IL size) and a couple MCUs.
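
To make the "straightforward compiler" point concrete, here's a one-pass code generator sketch in Python (toy opcodes, nothing to do with the actual B5000 instruction set): a single postorder walk of the expression tree emits stack code directly, with no register allocation.

    # Emit stack code for an expression tree in one pass.
    def gen(expr, out):
        if isinstance(expr, (int, float)):
            out.append("PUSH %s" % expr)
        elif isinstance(expr, str):              # a variable name
            out.append("LOAD %s" % expr)
        else:                                    # (op, left, right)
            op, left, right = expr
            gen(left, out)
            gen(right, out)
            out.append({"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}[op])

    code = []
    gen(("*", "a", ("+", "b", "c")), code)       # a * (b + c)
    print("\n".join(code))
    # LOAD a
    # LOAD b
    # LOAD c
    # ADD
    # MUL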


Stack-based VMs are more common than just the JVM. Python and Ruby's bytecodes are stack-oriented. Stack machines are also still common for compilers where "good enough" code is acceptable. As you say, it's relatively easy to write a code generator for a stack machine.
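
You can see it directly with the standard dis module (exact opcode names vary by CPython version):

    import dis

    # CPython bytecode is stack-oriented: operands are pushed, then the
    # operator pops them and pushes the result.
    dis.dis("a + b * c")
    # Roughly:
    #   LOAD_NAME  a
    #   LOAD_NAME  b
    #   LOAD_NAME  c
    #   BINARY_MULTIPLY   (BINARY_OP * on 3.11+)
    #   BINARY_ADD        (BINARY_OP + on 3.11+)
    #   RETURN_VALUE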


CIL (.NET bytecode) is also stack-based, and perhaps the most interesting modern incarnation as well, because it is sufficiently full featured - pointers w/arithmetic, unions, stackalloc etc - to be a target for a C compiler.


No wonder, given that supporting C++ was one of its design goals, first with Managed C++ Extensions, followed by C++/CLI on .NET 2.0.


WebAssembly is explicitly a stack computing model. Even purer than the JVM, as it is quite a bit lower level, with no types/objects other than primitive scalars, and no garbage collection or other runtime services (yet), just a simple linear memory model.


I am not too familiar with WASM, but isn’t it expression-tree-based that happens to have a stack-like flattening as well? I remember reading something like that here, and afaik it only has structured control flow (no gotos).

The JVM just happens to have some higher level primitives, but WASM is not really stack-based in the same way at all.


Linus Torvalds recently explained a deeper reason why stack architectures are unsuccessful/unpopular: they do not handle messy control flow well:

https://www.realworldtech.com/forum/?threadid=207836&curpost...


Look at for example the GA144 and similar tiny deeply embedded cores like http://bernd-paysan.de/b16.html , where the stack is implemented as a register file. Those cores are actually fast, but limited in capability and a nuisance to program.


One thing I always wondered about is why those CPUs that implement the stack in hardware using registers still have so few of them. Is it a cost issue or a space issue?


It depends what you mean by "scale". If you mean "make it as big and as fast as possible" then yes, they suck because of the memory bandwidth thing.

If you mean "can be rapidly deployed from bare metal to a rich and complex application stack" then they scale out quite well, because a stack-based architecture - or even an architecture that lends itself to implementing stack-based languages, like the Motorola 6809 - kind of points you in the direction of Forth, and once you've decided you're using Forth then you've only to write a few dozen lines of assembly and the rest of your Forth is written in Forth.


Wouldn't a stack machine also present difficulties for branch prediction and out of order execution techniques?

I have no proof of this and I am not a processor designer. Just a hunch that this kind of analysis would be harder to perform on a stack oriented program flow.


This works differently on a stack CPU but it can also be done.

https://en.wikipedia.org/wiki/Stack_machine#Out-of-order_exe...


> Which is of course exactly the wrong thing because registers got much faster than RAM.

Why not cache the top N words then?

Stacks are local to a CPU, so you don't have to deal with coherence.


Done. The HP 3000 "Classic" hardware had 2 to 8 16-bit words of top-of-stack registers. There was only one CPU, so CPU-level coherence wasn't an issue. There were memory caches and sometimes multiple ALUs; that was handled by the microcode pipeline logic. I imagine DMA required some bus snooping.

The "hot" area of the stack didn't need to be very deep. Most code loaded a word or two on the stack from some locations in memory, maybe re-used a word on the stack, ran an op which pushed a word back on the stack, and then the result was read or popped & stored back in general memory. Sometimes you'd DUP one of the operands. If you're familiar with HP RPN calculators, it was a lot like that. The HP-41 and -42, for example, only have 4 words for the whole stack.


It would be fun to imagine an architecture that would be needed for OoO / speculative execution in the presence of an on-die stack (say, a few KB of SRAM).

Would it be harder than register renaming? What kind of data structure would it take?


A stack is register renaming. Instead of "Register #n" you refer to it as "SP - n". The difference is that it's rebased with every op, and in ops the registers are almost always implied: "ADD" always loads reg #0 and #1, always puts the result back in #1, then renames #1 .. n to #0 .. n-1. OoO should handle the continuous renaming OK, I think. ADD + PUSH would add (orig #0, orig #1), place the result in orig #1, put new data in new #0, and alias new #0 as "SP-0". Only the aliasing would need to be ordered. Maybe it could be part of the decode. As I noted elsewhere, some HP 3000 models had multiple ALUs, so there was already some form of OoO.
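
As a toy Python model of that aliasing (made-up names, not any real renamer): map each stack slot to a fresh physical register as it's created, and the implicit stack operands turn into explicit register names an OoO core could track.

    # Rewrite stack ops into explicit physical-register form.
    def rename(stack_ops):
        stack, out, next_phys = [], [], 0
        def fresh():
            nonlocal next_phys
            next_phys += 1
            return "p%d" % (next_phys - 1)
        for op in stack_ops:
            if op.startswith("PUSH"):
                dst = fresh()
                stack.append(dst)
                out.append("%s <- %s" % (dst, op))
            elif op == "ADD":                # pops SP-0 and SP-1, pushes result
                a, b = stack.pop(), stack.pop()
                dst = fresh()
                stack.append(dst)
                out.append("%s <- ADD %s, %s" % (dst, b, a))
        return out

    print("\n".join(rename(["PUSH 1", "PUSH 2", "PUSH 3", "ADD", "ADD"])))
    # p0 <- PUSH 1
    # p1 <- PUSH 2
    # p2 <- PUSH 3
    # p3 <- ADD p1, p2
    # p4 <- ADD p0, p3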


An example of this is Intel x87 floating point, although the stack discipline isn't strict there:

https://en.wikibooks.org/wiki/X86_Assembly/Floating_Point#FP...


Renaming is one way to implement a stack, but you can also physically shift the bits around every cycle.


Couldn't the stack just be made out of the same super fast memory processor registers are made of?


I assume part of the problem is the physical distance from the CPU hard limiting speed due to the speed of light. 3d stacking of memory might enable a design with a large enough stack close enough to the CPU? I don't really know what I'm talking about though.


As mentioned before, a circular buffer with the oldest items moved to main memory could work (and was apparently already done on several CPUs). You basically have n registers; if the buffer gets full you just overwrite the oldest entry (which could have already been saved to RAM out of order). Repeated popping can bring these back in.


I think the "new wave" Koopman was talking about in his book title was precisely stack computers that weren't demanding in memory bandwidth. The MuP21 and the F18 core used in the GA144 don't store their operand and return stacks in RAM; they use them as a replacement for a register file, just like the 8087 does.

I think the conventional view on designs like the MuP21 and the 8087 is that they don't scale because they are bad at supporting out-of-order execution, because every instruction has a data dependency on the previous instruction. But if I'm reading https://stackoverflow.com/questions/18138382/which-registers... correctly, it says that in fact the 8087 instructions have decent performance even on current amd64 processors.

Stack computers do tend to have smaller code size than register machines, but a direct comparison of, say, 16 bits for RVC or Thumb instructions vs. 5 bits for GA144 x18 instructions is a little bit misleading, because you usually end up running about twice as many instructions to do the same job. So you end up saving about a third of your code space, not the two thirds you'd naively expect. Compensatorily you need about twice the clock speed to get the same performance.
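
Spelling out that arithmetic with the numbers above:

    # N 16-bit instructions on the register machine vs. roughly 2N 5-bit
    # instructions on the stack machine (per unit of work N).
    register_bits = 16 * 1    # 16N
    stack_bits = 5 * 2        # 10N
    print(1 - stack_bits / register_bits)   # 0.375, i.e. about a third saved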

Chuck Moore says you can get a higher clock speed because you can wire your ALU inputs directly to the top-of-stack and next-on-stack registers, so you don't have operand field decoding and register-file demultiplexing in your critical path. (But then he went off and gave up on clocks altogether in the x18, which is entirely asynchronous.) The attention the RISC-V designers gave to keeping the operand fields in the same place in every instruction format, even in RVC, reinforces my perception that this is still a live issue, not a holdover from the early 90s.

Maybe this is a personal problem stemming from lack of experience, but I still find stack code more bug-prone and harder to read than register-machine assembly languages.

BTW SeRV implements RISC-V in under 200 LUTs, and 4-LUTs at that, but the ZPU is probably faster.

On a register machine, duplicating a value is implicit (it happens implicitly every time you read it) but operand choice is explicit. On a stack machine, duplicating a value is explicit (with DUP or OVER, for example) but operand choice is implicit. A stack is sort of like a register that can hold more than one value, but in which each value pushed must normally be used exactly once.

As I understand it, the F18's operand stack has ten items: TOS, NOS, and eight more that are cyclic. Presumably those eight are cyclic because pushing and popping them is implemented by incrementing and decrementing a counter. But this still takes less area than a multi-ported register file; Shirriff's reverse-engineering of the 8086 found that its register file had three read ports and one write port, so the pass transistors for the muxing took up more area than the inverter transistors that implemented the memory bits themselves!


I worked with Phil back in the day at Harris Semi on their RTX architecture. AFAIK it was used on a couple of space probes due to small size, consistent code execution, and (I think) the ability to offer a rad-hard design for space.

Forth was the language of choice. I wish it could come back, because it was such a neat mind exercise.


Related:

Stack Computers: the new wave (1989) - https://news.ycombinator.com/item?id=12237539 - Aug 2016 (28 comments)

Stack Computers: the new wave (1989) - https://news.ycombinator.com/item?id=8657654 - Nov 2014 (3 comments)

Stack Computers: the new wave (1989) - https://news.ycombinator.com/item?id=4620423 - Oct 2012 (37 comments)


This reminds me that it's been a while since I've checked in on the belt machine that Ivan Godard & friends were working on at Mill Computing.

It looks like their website has changed a bit since I last looked: https://millcomputing.com/


The old Byte magazine's special edition that covered Modula-2 described the Lilith workstation as a stack machine CPU. It's on archive.org.


I had this guy as a professor and he was great.


Would love to hear more about this.

The stack machines book is the only computer architecture book I've read cover to cover.


Koopman's more recent "Better Embedded System Software" is a great overview of its subject.


I wonder how soon we will have microarchitectures that are purely the result of some deep learning algorithm optimizing for performance and compilability of high-level languages. We would be able to prove them correct - that they correctly execute the code and have no exploitable bugs - but we would have almost no idea how exactly they achieve said performance.


I can’t find the link atm, but the reference was a self-optimized FPGA circuit that achieved its objective, yet the actual bitstream was a place-and-route nightmare of latches and impossible simplifications, taking advantage of phantom this and inductive that, all accidentally and due to proximity, driven purely by measured results and feedback. Totally specific and local, down to clock-skew usage; no one understood it… will have to find the link…


That sounds about right. Why bother with a clock anyway, asynchronous is the way of the hardware samurai.

Although it should be fuzzed a bit with signal and power noise, to kill those "solutions" that only work due to a lucky photo-resist error that manifests itself at exactly 3.03V.


I don't think we really know how to learn code generation techniques and prove them correct, yet. That would be, in general, a really neat and useful trick.


What, divide in half and compare?


Not sure if this quite counts for what I'm asking, but:

Is there a good list of "fundamentally new" types of computing paradigms over time? I'd love to see what paradigms branched off of others, what reached dead ends and why, etc.



I've seen some chart like this before. For example, APL started the array family of APL, J, A+, K, Kona, etc. Algol started the procedural family of Algol, Fortran, BASIC, etc. Lisp started the family of Lisp languages... and so on. The taxonomy is convoluted and not quite that neat though.



