The Art of Picking Intel Registers (2003) (swansontec.com)
133 points by Tomte on April 22, 2018 | hide | past | favorite | 44 comments



>"There are three major processor architectures: register, stack, and accumulator. In a register architecture, operations such as addition or subtraction can occur between any two arbitrary registers. In a stack architecture, operations occur between the top of the stack and other items on the stack. In an accumulator architecture, the processor has single calculation register called the accumulator."

I have only ever seen "register" and "stack" used to distinguish between virtual machine implementations, i.e. a "stack machine" (example: the JVM) vs. a "register machine" (example: Dalvik). The author explains that the x86 is a register-based architecture, but what are some examples of a stack-based CPU architecture and an accumulator-based CPU architecture?


The 6502 and many of its kindred are accumulator based machines. From http://nesdev.com/6502.txt

> THE ACCUMULATOR

> This is THE most important register in the microprocessor. Various machine language instructions allow you to copy the contents of a memory location into the accumulator, copy the contents of the accumulator into a memory location, modify the contents of the accumulator or some other register directly, without affecting any memory. And the accumulator is the only register that has instructions for performing math.

It has some other registers (X, Y, Status, Program Counter and Stack Pointer). Looking at the instruction set ( http://www.e-tradition.net/bytes/6502/6502_instruction_set.h... ), you'll see that all of the math operations work off of the accumulator. X and Y are primarily offsets for memory locations.


Thanks. So these are architectures where the accumulator is always used implicitly, then?

I have often heard "register based architecture" used to describe MIPS/load-store type architectures. It seems that the author here is implying that the x86 is "register based" since they're stating it's not actually an "accumulator machine", even though it does have an accumulator-like register?


The definitions are not strict, and there is often significant overlap between the categorization types (esp. as follow-on additions are added to the initial designs).

But generally, things fall out this way:

1) register based

Has a pool of registers, and the registers are general purpose (meaning they can be used as source or destination for most any computation operation the CPU can perform).

2) accumulator based

Has a "dignified" register (the accumulator), and most computations only happen in/out of the accumulator. Other registers exist for more specialized purposes, to facilitate feeding data into and out of the accumulator. Many DSP architectures are very highly "accumulator based". General purpose CPUs, less so, unless you go back in time to late-70's or early-80's architectures.
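A toy model of that shape, sketched in Python (the instruction names are made up for illustration, loosely following 6502-style mnemonics):

```python
# Toy accumulator machine: every arithmetic op implicitly uses ACC.
class AccumulatorMachine:
    def __init__(self):
        self.acc = 0          # the single "dignified" register
        self.mem = {}         # addressable memory

    def lda(self, addr):      # load a memory location into the accumulator
        self.acc = self.mem.get(addr, 0)

    def sta(self, addr):      # store the accumulator to a memory location
        self.mem[addr] = self.acc

    def add(self, addr):      # ADD names no register: ACC is implied
        self.acc += self.mem.get(addr, 0)

m = AccumulatorMachine()
m.mem[0x10] = 5
m.mem[0x11] = 7
m.lda(0x10)
m.add(0x11)
print(m.acc)  # 12
```

Note that `add` takes only a memory address; the other operand and the destination are always the accumulator, which is exactly what lets the encoding omit them.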

3) stack based

Has no registers (or very few) other than a stack pointer. All temporary data is stored on the stack, and all computation instructions involve computing with data items at the top of the stack and returning the result to the top of the stack. An example of such an architecture was the HP3000 system from the 70's (https://en.wikipedia.org/wiki/HP_3000#Use_of_stack_instead_o...).
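The stack-machine model described above can be sketched in a few lines of Python; each operation pops its operands from the top of the stack and pushes the result back:

```python
# Toy stack machine: operations work only on the top of the stack.
def run(program):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":    # consumes the top two items, pushes one result
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack

# (2 + 3) * 4, written in postfix order
print(run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]))  # [20]
```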


Thanks for the detailed explanations. I had a few questions about what you wrote:

>"Has a "dignified" register (the accumulator) and most computations only happen in/out of the accumulator."

I was curious about your use of the word dignified in quotes. Are you just conveying its importance with this, or something else?

>"Many DSP architectures are very highly "accumulator based". "

Is there something about DSP and their workloads that make accumulator based architectures a good fit?

>"All temporary data is stored on the stack, and all computation instructions involve computing with data items at the top of the stack "

So are computations limited to the top two positions of the stack then? As in, the operands can be obtained by two pop operations, and not from arbitrary positions in the stack?


> I was curious about your use of the word dignified in quotes. Are you just conveying its importance or with this or something else?

Just conveying its importance. Since in a pure accumulator based architecture the accumulator is the only register into which computations can be performed, it often is quite important.

> Is there something about DSP and their workloads that make accumulator based architectures a good fit?

They often do a lot of operations that involve accumulating running totals of long strings of numbers, so the math operations they typically perform lend themselves to being designed in this fashion. Beyond that, I am unsure of any other reasons beyond guessing at possibilities.
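The "running totals" workload is essentially a multiply-accumulate (MAC) loop, the inner loop of a dot product or FIR filter, which maps directly onto a single accumulator register:

```python
# Dot product as a multiply-accumulate loop: the shape of most DSP inner loops.
def dot(xs, ys):
    acc = 0                  # one accumulator holds the running total
    for x, y in zip(xs, ys):
        acc += x * y         # often a single MAC instruction on a DSP
    return acc

print(dot([1, 2, 3], [4, 5, 6]))  # 32
```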

> So are computation limited to the top two positions of the stack then?

Typically the number of stack-top elements that are retrieved depends on the instruction being executed. So an add that consumes two inputs would take in the top two stack elements, and return a single sum. But an 'increment' (really just an add where one input is a constant) would only consume one stack element. There is no reason (other than the additional design complexity) that a three or four or five operand input instruction could not be designed, and it would consume its designed number of stack-top positions.

> and not arbitrary positions in the stack?

Generally, no, in stack based systems one usually cannot address arbitrary positions in the stack. However, there are always exceptions, and many systems have a "roll" operation that rotates the top X stack positions one place (a few provide a way to roll X stack-top positions by Y steps), which provides a way to get at data below the top of the stack without losing the data that is currently at the top. With that said, there is probably at least one architecture that allows for specifically addressing particular stack positions without popping/rolling.


Thanks for the great explanations and insights, I really appreciate it. Cheers.


The PIC16 is probably the closest - all arithmetic has to go through the W register. Confusingly PIC calls all internal memory "registers", but really it's more like a one-register machine.

The Z80 routes most arithmetic through its A (accumulator) register.

The inheritor of this is the x86, where there are far fewer register limitations but the tradition and the lettering remain; (E)AX is the usual return-value register in the calling conventions (and the first argument in some). And the article itself details that you can get shorter opcodes by preferring to do arithmetic through EAX.
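Concretely, x86 keeps dedicated short forms for arithmetic with EAX: `add eax, imm32` encodes as opcode 0x05 followed by the immediate (5 bytes), while `add ebx, imm32` needs the general 0x81 /0 form with an extra ModRM byte (6 bytes). A sketch of the two encodings:

```python
import struct

def add_eax_imm32(imm):
    # Short accumulator form: 05 id  (opcode + 4-byte immediate)
    return bytes([0x05]) + struct.pack("<i", imm)

def add_ebx_imm32(imm):
    # General form: 81 /0 with ModRM 0xC3 = 11 000 011
    # (mod=register-direct, reg field /0 = ADD, rm = EBX)
    return bytes([0x81, 0xC3]) + struct.pack("<i", imm)

print(len(add_eax_imm32(1234)))  # 5
print(len(add_ebx_imm32(1234)))  # 6
```

The one-byte saving per instruction is the "shorter opcodes" payoff the article describes.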


In 6502, there is LDA, LDX, LDY for various loads. However, ADC (add with carry) only works on the accumulator.

Another important aspect of (at least) the 6502 is its very limited instruction encoding. With addressing modes and op codes there are at most 256 of them (one byte). ADC with its addressing modes is 8 of the 256 instructions. AND (bitwise AND of a memory location with the accumulator) is another 8. And so on. Adding the corresponding op codes for the X and Y registers, beyond adding more transistors, adds more op codes, which were also at a premium. Granted, 105 of the 256 weren't used, but turning the 6502 into a design that more closely matches a three-register architecture rather than an accumulator architecture would significantly eat into the unused op codes. Hypothetical ADCX and ADCY would be 12 or 16 more op codes; eight more instructions that fully support the X and Y registers would consume all of those potential op codes (only six more if the accumulator is allowed as an offset).


The terms aren't super-formally defined but "register based" usually suggests "has a bunch of general purpose registers", load store doesn't have to be a part of it. A 68k has a bunch of general purpose(ish) registers.


Most of the early computers were accumulator-based; and the concept predates computers, since the early mechanical calculators had a single "register" --- the accumulator --- which... accumulated the results of successive calculation. It's no surprise that computers, which initially were developed to be essentially automatic calculators, adopted the same architecture:

https://en.wikipedia.org/wiki/Arithmometer

https://en.wikipedia.org/wiki/Comptometer


I suspect it was also rational, given the performance trade-offs at the time.

As transistors have become smaller, the caching hierarchy has become more important. A modern register set is a level-0 cache, i.e. a useful pool of memory that is really close to the computation units.

In the old days, the communication between components was less of a bottleneck than the speed of the components themselves. So there was less advantage to this. On the other hand, making a CPU simpler by having just one or two registers was more of an advantage than it is now.


For stack machine, RTX2010 comes to mind. It's a radiation-hardened processor used in various spaceships.


Neat, I didn't know about this chip. Is there anything particular about a stack architecture that makes it ideal for radhard applications, or is it just the one they selected because they wanted to use Forth?


It's mostly about convenience. Hubble uses a radiation hardened 486 instead of those.


Also, the INMOS Transputer. It's a seriously odd but inspired architecture. Didn't scale in performance. Had some really interesting ideas in its ISA and interprocessor communication system.


I think the F18A architecture (used in GA144 chips) is stack based, optimized for running Forth.

http://www.greenarraychips.com/home/documents/index.html#arc...


One way you can look at 'accumulator architecture' is as a special, constrained case of 'register machine' that makes sense when you're making CPUs out of 3510 transistors. A modern virtual machine and the CPUs it would target have grown past the constraint so you don't see that particular approach any more.


Sorry, can you explain why it only makes sense when you're making CPUs out of 3510 transistors? What constraint specifically are you referring to? Thanks.


Others have mentioned various reasons. Here's yet another reason:

Shorter instructions.

Let's say you have an 8-bit machine, like a 6502 or an 8080 (the architectural predecessor to the x86). Let's say everything goes thru the ACCumulator.

Then you can have 1 byte opcodes to do things such as:

   clear accumulator
   shift left everything in ACC by 1 bit
   rotate left everything in ACC by 1 bit
etc.

You can also have 2 byte opcodes to do things such as:

   add an 8 bit immediate to the ACC
      8 bit opcode plus 8 bit data

   exclusive or an 8 bit immediate to the ACC
      8 bit opcode plus 8 bit data
etc.
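On a real 8080, those shapes correspond to actual opcodes: RLC (0x07, rotate the accumulator left) is a 1-byte instruction, while ADI (0xC6 plus a data byte, add immediate to the accumulator) and XRI (0xEE, exclusive-or immediate) are 2 bytes each. A sketch of assembling a few of them:

```python
# Sketch: hand-assembling a few 8080 accumulator instructions into bytes.
RLC = 0x07   # rotate accumulator left: 1-byte instruction
ADI = 0xC6   # add immediate to accumulator: opcode + data = 2 bytes
XRI = 0xEE   # exclusive-or immediate with accumulator: 2 bytes

program = bytes([ADI, 5, RLC, XRI, 0xFF])
print(len(program))  # 5 bytes for three instructions
```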

Those early microprocessors used a lot of 8 bit and 16 bit instructions. Why? Cost and chip count.

Early EPROMs (equivalent of today's flash but could only be erased by shining UV light on them), like the Intel 1702A, were 256 bytes. That's it. 2048 bits of memory in a chip. It was great when the Intel 2708 became reasonably priced. That got you an entire 1024 bytes of ROM in a single chip.

Early RAMs, like the 2102, were 1024 words by 1 bit. Put 8 chips together and you got 1024 bytes of memory.

So, yeah, shorter instructions were a big big deal in "the old days", i.e. the mid 1970s.


Thanks for the examples. I guess I'm missing what the connection is between 1-byte opcodes and the presence of a dedicated accumulator register. Sorry if I'm missing something obvious. Can you elaborate?


You have many things a microprocessor needs to do, even an 8-bit one from 45 or so years ago.

You need opcodes to shift left and right, rotate left and right (possibly with and without carry), add, subtract, and, or, exclusive or. Perhaps other arithmetic and logical operations.

You need opcodes to move values to/from memory, to/from other registers such as index registers. You don't just have opcodes that specify absolute memory locations, you need more opcodes to specify that you want to use an index register instead.

You need opcodes to jump conditionally. You need opcodes to call and return from subroutines. You need opcodes to enable and disable interrupts. You need opcodes to push and pop registers from the stack.

Etc etc etc.

You can't easily encode all those opcodes into a single byte and still set aside bits from that one byte to specify one of a number of accumulators. Hence why the Intel 8080 had a single accumulator. Because all math happened with the accumulator, you didn't need any bits in the opcode to specify that. It was implied.
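The arithmetic behind that constraint: with 8 registers, every explicit register operand costs 3 bits of the one-byte opcode. The 8080's MOV r,r instruction alone (encoded 01 ddd sss) eats 64 of the 256 opcodes; giving every ALU instruction register fields would blow the budget, while an implied accumulator costs zero bits. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope opcode budget for an 8-bit ISA with one-byte opcodes.
OPCODE_SPACE = 2 ** 8   # 256 possible one-byte opcodes
REG_BITS = 3            # 8 registers -> 3 bits per explicit register operand

# 8080-style MOV r,r (01 ddd sss): two register fields eat 6 bits,
# so MOV alone occupies 64 opcodes.
mov_opcodes = 2 ** (2 * REG_BITS)
print(mov_opcodes)      # 64

# An ALU op with one explicit register operand costs 8 opcodes;
# with the accumulator implied, it costs exactly 1.
print(2 ** REG_BITS)    # 8
```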

Having said that, there was one commercially available microprocessor, the Motorola 6800, that had two general purpose accumulators.[1] It still had many single-byte opcodes. But it also gave some stuff up. It only had a single index register. The competitor MOS Technology 6502 had a single accumulator[2], but that left opcode room and die area for two index registers.

The Woz chose the more flexible (and far cheaper) 6502 for the Apple I and Apple II. And the rest, as they say, is history.

[1] https://en.wikipedia.org/wiki/Motorola_6800 [2] https://en.wikipedia.org/wiki/MOS_Technology_6502


>"Because all math happened with the accumulator, you didn't need any bits in the opcode to specify that. It was implied."

Ah OK. This is the bit I was missing. This makes perfect sense. Thank you for the detailed response!


Registers are part of the CPU so they take up room and transistors - when those things are at a premium, you use fewer registers. Then Moore's law magically lets you use more so you do!


It's not so much the registers themselves, it's the plumbing required if you're going to let any register supply any input or output of the ALU. Hardwiring a register to one of the ALU's ports means you avoid a multiplexer (and the same applies to hardwiring a register to the AGU's ports, which is how you get 'base' and 'index' type registers).


Sure, there are lots and lots of deeper details but fundamentally, more registers means more transistors (or a higher-level logical unit of choice) and it's the constraint that drives accumulator-centric designs, specialized registers, etc. Relax the constraint and they become less necessary.


The original CISC HP 3000s were a stack machine.

Newer HP 3000s used PA-RISC CPUs instead (register not stack CPU), but the MPE operating system included an emulator/translator so they could still run the legacy CISC code.

(The same architecture was also used in some of the early HP 9000 workstations, specifically the 500 series.)


Another example of stack machines was the original Tandem architecture, which was based on HP 3000. (Most of the early Tandem team had come from HP.) Tandem then moved away from stack machines, to MIPS – but like the HP-3000 RISC transition, the MIPS Tandems had an emulator/translator for the original stack machine architecture.

From MIPS, Tandem moved to Itanium, and now to x86. And, having been founded by ex-HP engineers, Tandem is now part of HPE.


> what are some examples of a stack based CPU architecture

https://en.wikipedia.org/wiki/Burroughs_large_systems


Knuth’s MIX. Z80 is also sort-of an accumulator architecture, to a greater extent than x86.


Isn't the Microchip PIC an accumulator-based architecture?


Yeah, those are basically fancier clones of a PDP-8.


Another interesting trivia about the x86 registers: while most programmers think of their order as being EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, when you look at how the registers are numbered in the instruction set, you see that their true order is different: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI.
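The encoding order can be checked against the 3-bit register numbers used in x86 ModRM bytes (a small reference table; these values are the standard 32-bit encodings):

```python
# x86 32-bit register numbers as used in ModRM/opcode encodings.
REG_ENCODING = {
    "eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
    "esp": 4, "ebp": 5, "esi": 6, "edi": 7,
}

# Sorting by encoding number gives the "true" order, not the familiar one.
true_order = sorted(REG_ENCODING, key=REG_ENCODING.get)
print(true_order)
# ['eax', 'ecx', 'edx', 'ebx', 'esp', 'ebp', 'esi', 'edi']
```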


What's the significance of this?


Next time you play trivia at your local bar and get asked, "what's the true order of x86 registers," you will get 5 points


Ha ha.... :)


This is a great article. I used it to optimize the code for a code golf puzzle our hackerspace made: http://codegolf.coredump.ch/challenges/asm-compass/ If you want a challenge to reduce the binary size of a very simple program (= choosing instructions with smaller opcodes), give it a try :)


I've done a few challenges like that[1] but seeing that the smallest so far is over 300 bytes makes me think most of those 300 bytes are not actual code but program header overhead, because it should be doable in much fewer bytes of actual code. Once you've done these challenges a few times, you can easily estimate a rough size:

The template is 48 bytes. Add 16 bytes for an [offset,char] table. Maybe another 30 bytes (let's be generous and round that up to 32) for code to pull the commandline parameter, do the indexing and template modification, write the output, and exit. 96 bytes of code. Add 64 bytes of minimal ELF headers[2] and you're still well below 200 bytes.

I don't have an environment at the moment that doesn't do ELF files, so I just tried it with 16-bit DOS and ended up at... 95 bytes:

https://pastebin.com/C7vDfFbR

I wasn't trying very hard so someone will probably find an even smaller version, and it was just for fun since it doesn't meet the rules anyway.

[1] http://www.hugi.scene.org/compo/compoold.htm

[2] http://www.muppetlabs.com/~breadbox/software/tiny/teensy.htm...


Thanks for the links! Most of the submitted approaches don't even mess with the ELF binary structure. But even if you don't do that, there's a lot you can learn by optimizing (sometimes doing something with more instructions actually results in a smaller binary because of shorter opcodes).

It was the first time I really messed around with x86 assembly, and I learned a lot in the process :)

Edit: Wording

Edit2: Also note that for this challenge you need to submit NASM assembly, and not the resulting binary. So you can't resort to hex editing the binary itself.



I've been working on learning x86 ASM, and this was a fun challenge. Cool stuff.


Don’t get any ideas though. Whilst these instructions might make the output code smaller, they will probably make the program slower (1). Processors are optimised to run code generated by compilers fast, and compilers stopped outputting these a long time ago.

1/ There are edge cases: when the code is very big, smaller code might fit in the cache better or perhaps dispatch better... but probably not many.


In my experience, cache matters a lot and compilers aren't smart enough to figure out what parts of the code are time-critical and what parts aren't, so they just use the fastest (and sometimes significantly bigger) instruction sequences everywhere. Optimising non-critical-path code by making it smaller even if it's somewhat slower can actually make the program as a whole faster, since the cache misses decrease.


On Ivybridge and later the string instructions are actually faster than the equivalent unrolled instructions. https://news.ycombinator.com/item?id=12353455



