Stack Computers: the new wave (1989) (cmu.edu)
98 points by JoachimS on Aug 15, 2022 | 48 comments



Does anyone know of the state of the art of stack and symbolic CPUs?

I'm reading a few books that address stack computers and also symbolic CPUs, ones which can run things like Forth, Prolog, and Lisp directly on the CPU (or at least more directly than via abstractions sitting on top of von Neumann architectures). The ideas in the books, such as those in A High Performance Architecture for Prolog, seem important and useful, but it's hard to tell whether they are no longer useful today or simply fell out of favor or interest.

In today's age of FPGAs, ASICs, RISC-V, etc., it seems like ripe territory for custom CPU development, but about the only thing I know of along these lines is the GreenArrays GA144 chip.

https://www.greenarraychips.com/


There were a bunch of Java chips in the early 2000s, and ARM cores with a subset of the JVM instruction set.

https://en.m.wikipedia.org/wiki/Java_processor

https://en.m.wikipedia.org/wiki/Jazelle


25 years old, but http://bernd-paysan.de/4stack.html sounded interesting at the time.


The conventional view is that stack computers didn't scale because they're very demanding in memory bandwidth - every operation ends up reading 2 operands from memory and perhaps writing one result back. Which is of course exactly the wrong thing, because registers got much faster than RAM. But they're also elegantly small. Are there any stack computers used today, perhaps in systems where slow speed is not a problem but ultra-small size (and code size) is an advantage? I only see https://en.wikipedia.org/wiki/ZPU_(microprocessor) (2008) - 442 LUTs(!).


Memory bandwidth can be managed by all manner of microarchitectural methods. The issue in the modern world is that doing any kind of register renaming that would allow issuing 4 or more instructions at a time gets convoluted very quickly. If you could crack that, the code density would be great. But in the end, giving the compiler 32 or so architectural registers and adding a renaming backend with 100+ microarchitectural registers is probably a good trade-off and avoids gnarly long paths in the logic.


I could see a circular buffer implemented in the registers working. Basically keep the top x entries of the stack loaded in the registers, and push/pull values when you get close to the opposite end (pushing values to main memory if it gets too full, pulling them back if it gets too empty). There's a rough sketch below.

Dunno if I am explaining it well.

On a reread, I think I might just be stating what you did, but in different, worse, words.

Definitely a VERY different hardware architecture than "normal" machines though
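
Roughly, as a toy Python model of the spill/fill idea (all names made up, not any real microarchitecture): keep the top of the stack in a small circular register file, spill the oldest entry to memory when it fills up, and pull it back when it drains.

    # Toy stack cache: the top of the stack lives in a small circular
    # register file; older entries spill to "main memory" when it fills.
    class StackCache:
        def __init__(self, num_regs=8):
            self.regs = [0] * num_regs   # circular buffer of registers
            self.depth = 0               # live entries currently in regs
            self.top = 0                 # index of the top-of-stack register
            self.memory = []             # spilled (older) stack entries

        def push(self, value):
            if self.depth == len(self.regs):
                # Full: spill the oldest register before overwriting it.
                oldest = (self.top + 1) % len(self.regs)
                self.memory.append(self.regs[oldest])
                self.depth -= 1
            self.top = (self.top + 1) % len(self.regs)
            self.regs[self.top] = value
            self.depth += 1

        def pop(self):
            value = self.regs[self.top]
            self.top = (self.top - 1) % len(self.regs)
            self.depth -= 1
            if self.depth == 0 and self.memory:
                # Drained: refill the top from spilled memory.
                self.push(self.memory.pop())
            return value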


The AMD Am29000 family had exactly that. The problem is, you still have to be able to name registers indirectly, or you'll have to either give up on separate compilation or do link-time intra-procedural register allocation. Both of those things are, of course, feasible in small embedded environments... but not in general-purpose computing, especially if you want to bring most of the pre-existing software in.


The problem isn't just faking a stack with a fast buffer, it's figuring out where the sources are, or rather, will be, with pipelined execution, or worse yet, decoding multiple instructions at once.

You also want to have multiple instructions writing a given architectural resource live at once. That's tough if top-of-stack is a buffer location.


Didn't SPARC have some kind of sliding window register set?


It did. I have fond memories (no doubt in part shaped by Stockholm syndrome) of writing a Pascal(ish) compiler targeting the SPARC ISA as an undergrad back in 1993 based on four? pages of ISA documentation. As I recall, handling procedure calls and returns was quite elegant.


And so did the Berkeley RISC design. Also notably the Intel i960 and the Itanium.


When Burroughs started experimenting with stack processors in their B5000 computer, they didn't do it for code size but for the ability to use high-level languages first, going as far as hiding assembly language from the user. Compilers are fairly straightforward to write for such an architecture, and can produce reasonably fast machine code in one pass. A register-based CPU makes this quite a bit more complex, and is indeed a more performant but less elegant approach than a stack-based CPU.

This later inspired HP for the HP 3000 and then its RPN calculators, as well as other, lesser known machines. Nowadays I think I've only seen this with the JVM (to reduce the IL size) and a couple MCUs.
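
To make the "straightforward compiler" point concrete, here's a one-pass code generator sketch in Python (toy opcodes, nothing to do with the actual B5000 instruction set): a single postorder walk of the expression tree emits stack code directly, with no register allocation.

    # Emit stack code for an expression tree in one pass.
    def gen(expr, out):
        if isinstance(expr, (int, float)):
            out.append("PUSH %s" % expr)
        elif isinstance(expr, str):              # a variable name
            out.append("LOAD %s" % expr)
        else:                                    # (op, left, right)
            op, left, right = expr
            gen(left, out)
            gen(right, out)
            out.append({"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}[op])

    code = []
    gen(("*", "a", ("+", "b", "c")), code)       # a * (b + c)
    print("\n".join(code))
    # LOAD a
    # LOAD b
    # LOAD c
    # ADD
    # MUL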


Stack-based VMs are more common than just the JVM. Python and Ruby's bytecodes are stack-oriented. Stack machines are also still common for compilers where "good enough" code is acceptable. As you say, it's relatively easy to write a code generator for a stack machine.
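
You can see it directly with the standard dis module (exact opcode names vary by CPython version):

    import dis

    # CPython bytecode is stack-oriented: operands are pushed, then the
    # operator pops them and pushes the result.
    dis.dis("a + b * c")
    # Roughly:
    #   LOAD_NAME  a
    #   LOAD_NAME  b
    #   LOAD_NAME  c
    #   BINARY_MULTIPLY   (BINARY_OP * on 3.11+)
    #   BINARY_ADD        (BINARY_OP + on 3.11+)
    #   RETURN_VALUE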


CIL (.NET bytecode) is also stack-based, and perhaps the most interesting modern incarnation as well, because it is sufficiently full featured - pointers w/arithmetic, unions, stackalloc etc - to be a target for a C compiler.


No wonder, given that supporting C++ was one of its design goals, first with Managed C++ Extensions, followed by C++/CLI on .NET 2.0.


WebAssembly is explicitly a stack computing model. Even purer than the JVM, as it is quite a bit lower level, with no types/objects other than primitive scalars, and no garbage collection or other runtime services (yet), just a simple linear memory model.


I am not too familiar with WASM, but isn’t it expression-tree-based that happens to have a stack-like flattening as well? I remember reading something like that here, and afaik it only has structured control flow (no gotos).

The JVM just happens to have some higher level primitives, but WASM is not really stack-based in the same way at all.


Linus Torvalds recently explained a deeper reason why stack architectures are unsuccessful/unpopular: they do not handle messy control flow well:

https://www.realworldtech.com/forum/?threadid=207836&curpost...


Look at for example the GA144 and similar tiny deeply embedded cores like http://bernd-paysan.de/b16.html , where the stack is implemented as a register file. Those cores are actually fast, but limited in capability and a nuisance to program.


One thing I always wondered about is why those CPUs that implement the stack in hardware using registers still have so few of them. Is it a cost issue or a space issue?


It depends what you mean by "scale". If you mean "make it as big and as fast as possible" then yes, they suck because of the memory bandwidth thing.

If you mean "can be rapidly deployed from bare metal to a rich and complex application stack" then they scale out quite well, because a stack-based architecture - or even an architecture that lends itself to implementing stack-based languages, like the Motorola 6809 - kind of points you in the direction of Forth, and once you've decided you're using Forth then you've only to write a few dozen lines of assembly and the rest of your Forth is written in Forth.


Wouldn't a stack machine also present difficulties for branch prediction and out of order execution techniques?

I have no proof of this and I am not a processor designer. Just a hunch that this kind of analysis would be harder to perform on a stack oriented program flow.


This works differently on a stack CPU but it can also be done.

https://en.wikipedia.org/wiki/Stack_machine#Out-of-order_exe...


> Which is of course exactly the wrong thing because registers got much faster than RAM.

Why not cache the top N words then?

Stacks are local to a CPU, so you don't have to deal with coherence.


Done. The HP 3000 "Classic" hardware had 2 to 8 16-bit words of top-of-stack registers. There was only one CPU, so CPU-level coherence wasn't an issue. There were memory caches and sometimes multiple ALUs; that was handled by the microcode pipeline logic. I imagine DMA required some bus snooping.

The "hot" area of the stack didn't need to be very deep. Most code loaded a word or two on the stack from some locations in memory, maybe re-used a word on the stack, ran an op which pushed a word back on the stack, and then the result was read or popped & stored back in general memory. Sometimes you'd DUP one of the operands. If you're familiar with HP RPN calculators, it was a lot like that. The HP-41 and -42, for example, only have 4 words for the whole stack.


It would be fun to imagine an architecture that would be needed for OoO / speculative execution in the presence of an on-die stack (say, a few KB of SRAM).

Would it be harder than register renaming? What kind of data structure would it take?


A stack is register renaming. Instead of "Register #n" you refer to it as "SP - n". The difference is that it's rebased with every op, and in ops the registers are almost always implied: "ADD" always loads reg #0 and #1, always puts the result back in #1, then renames #1 .. n to #0 .. n-1. OoO should handle the continuous renaming OK, I think. ADD + PUSH would add (orig #0, orig #1), place the result in orig #1, put new data in new #0, and alias new #0 as "SP-0". Only the aliasing would need to be ordered. Maybe it could be part of the decode. As I noted elsewhere, some HP 3000 models had multiple ALUs, so there was already some form of OoO.
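
As a toy Python model of that aliasing (made-up names, not any real renamer): map each stack slot to a fresh physical register as it's created, and the implicit stack operands turn into explicit register names an OoO core could track.

    # Rewrite stack ops into explicit physical-register form.
    def rename(stack_ops):
        stack, out, next_phys = [], [], 0
        def fresh():
            nonlocal next_phys
            next_phys += 1
            return "p%d" % (next_phys - 1)
        for op in stack_ops:
            if op.startswith("PUSH"):
                dst = fresh()
                stack.append(dst)
                out.append("%s <- %s" % (dst, op))
            elif op == "ADD":                # pops SP-0 and SP-1, pushes result
                a, b = stack.pop(), stack.pop()
                dst = fresh()
                stack.append(dst)
                out.append("%s <- ADD %s, %s" % (dst, b, a))
        return out

    print("\n".join(rename(["PUSH 1", "PUSH 2", "PUSH 3", "ADD", "ADD"])))
    # p0 <- PUSH 1
    # p1 <- PUSH 2
    # p2 <- PUSH 3
    # p3 <- ADD p1, p2
    # p4 <- ADD p0, p3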


An example of this is Intel x87 floating point, although the stack discipline isn't strict there:

https://en.wikibooks.org/wiki/X86_Assembly/Floating_Point#FP...


Renaming is one way to implement a stack, but you can also physically shift the bits around every cycle.


Couldn't the stack just be made out of the same super fast memory processor registers are made of?


I assume part of the problem is the physical distance from the CPU hard limiting speed due to the speed of light. 3d stacking of memory might enable a design with a large enough stack close enough to the CPU? I don't really know what I'm talking about though.


As mentioned before, a circular buffer with the oldest items moved to main memory could work (and was apparently already done on several CPUs). You basically have n registers; if the buffer gets full you just overwrite the oldest entry (which could have already been saved to RAM out of order). Repeated popping can bring these back in.


I think the "new wave" Koopman was talking about in his book title was precisely stack computers that weren't demanding in memory bandwidth. The MuP21 and the F18 core used in the GA144 don't store their operand and return stacks in RAM; they use them as a replacement for a register file, just like the 8087 does.

I think the conventional view on designs like the MuP21 and the 8087 is that they don't scale because they are bad at supporting out-of-order execution, because every instruction has a data dependency on the previous instruction. But if I'm reading https://stackoverflow.com/questions/18138382/which-registers... correctly, it says that in fact the 8087 instructions have decent performance even on current amd64 processors.

Stack computers do tend to have smaller code size than register machines, but a direct comparison of, say, 16 bits for RVC or Thumb instructions vs. 5 bits for GA144 x18 instructions is a little bit misleading, because you usually end up running about twice as many instructions to do the same job. So you end up saving about a third of your code space, not the two thirds you'd naively expect. Compensatorily you need about twice the clock speed to get the same performance.
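
Spelling out that arithmetic with the numbers above:

    # N 16-bit instructions on the register machine vs. roughly 2N 5-bit
    # instructions on the stack machine (per unit of work N).
    register_bits = 16 * 1    # 16N
    stack_bits = 5 * 2        # 10N
    print(1 - stack_bits / register_bits)   # 0.375, i.e. about a third saved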

Chuck Moore says you can get a higher clock speed because you can wire your ALU inputs directly to the top-of-stack and next-on-stack registers, so you don't have operand field decoding and register-file demultiplexing in your critical path. (But then he went off and gave up on clocks altogether in the x18, which is entirely asynchronous.) The attention the RISC-V designers gave to keeping the operand fields in the same place in every instruction format, even in RVC, reinforces my perception that this is still a live issue, not a holdover from the early 90s.

Maybe this is a personal problem stemming from lack of experience, but I still find stack code more bug-prone and harder to read than register-machine assembly languages.

BTW SeRV implements RISC-V in under 200 LUTs, and 4-LUTs at that, but the ZPU is probably faster.

On a register machine, duplicating a value is implicit (it happens implicitly every time you read it) but operand choice is explicit. On a stack machine, duplicating a value is explicit (with DUP or OVER, for example) but operand choice is implicit. A stack is sort of like a register that can hold more than one value, but in which each value pushed must normally be used exactly once.

As I understand it, the F18's operand stack has ten items: TOS, NOS, and eight more that are cyclic. Presumably those eight are cyclic because pushing and popping them is implemented by incrementing and decrementing a counter. But this still takes less area than a multi-ported register file; Shirriff's reverse-engineering of the 8086 found that its register file had three read ports and one write port, so the pass transistors for the muxing took up more area than the inverter transistors that implemented the memory bits themselves!


I worked with Phil back in the day at Harris Semi on their RTX architecture. AFAIK it was used on a couple of space probes due to small size, consistent code execution, and (I think) the ability to offer a rad-hard design for space.

Forth was the language of choice. I wish it could come back, because it was such a neat mind exercise.


Related:

Stack Computers: the new wave (1989) - https://news.ycombinator.com/item?id=12237539 - Aug 2016 (28 comments)

Stack Computers: the new wave (1989) - https://news.ycombinator.com/item?id=8657654 - Nov 2014 (3 comments)

Stack Computers: the new wave (1989) - https://news.ycombinator.com/item?id=4620423 - Oct 2012 (37 comments)


This reminds me that it's been a while since I've checked in on the belt machine that Ivan Godard & friends were working on at Mill Computing.

It looks like their website has changed a bit since I last looked: https://millcomputing.com/


The old Byte magazine's special edition that covered Modula-2 described the Lilith workstation as a stack machine CPU. It's on archive.org.


I had this guy as a professor and he was great.


Would love to hear more about this.

The stack machines book is the only computer architecture book I've read cover to cover.


Koopman's more recent "Better Embedded System Software" is a great overview of its subject.


I wonder how soon we will have microarchitectures that are purely the result of some deep learning algorithm optimizing for performance and compilability of high-level languages. We would be able to prove them correct - that they correctly execute the code and have no exploitable bugs - but we would have almost no idea how exactly they achieve said performance.


I can’t find the link atm, but the reference was a self-optimized FPGA circuit that achieved its objective, yet the actual bitstream was a place-and-route nightmare of latches and impossible simplifications, taking advantage of phantom this and inductive that, all accidentally and due to proximity, driven purely by measured results and feedback. Totally specific and local, down to clock-skew usage; no one understood it… will have to find the link…


That sounds about right. Why bother with a clock anyway, asynchronous is the way of the hardware samurai.

Although it should be fuzzed a bit with signal and power noise, to kill those "solutions" that only work due to a lucky photo-resist error that manifests itself at exactly 3.03V.


I don't think we really know how to learn code generation techniques and prove them correct, yet. That would be, in general, a really neat and useful trick.


What, divide in half and compare?


Not sure if this quite counts for what I'm asking, but:

Is there a good list of "fundamentally new" types of computing paradigms over time? I'd love to see what paradigms branched off of others, what reached dead ends and why, etc.



I've seen some chart like this before. For example, APL started the array family of APL, J, A+, K, Kona, etc. Algol started the procedural family of Algol, Fortran, BASIC, etc. Lisp started the family of Lisp languages... and so on. The taxonomy is convoluted and not quite that neat though.



