At the time that Alpha came out, the x86 was struggling to get more clock speed, and the groupthink was that RISC was the way forward. RISC chips would run at 4-8x the clock speed of an x86 and even when you adjusted for needing 2-3x more instructions, RISC was 50-200% faster.
DEC had a winner with Alpha. It had speed, and most importantly, you could run Windows NT on it. NT mattered because most Unix vendors at the time wanted $1,000/CPU for their licenses, and NT was $cheap (the OEM version was around $300 if I recall correctly).
As other posters have said, DEC just could not get out of their own way and let Alpha succeed. Weird sales policies, hostile partnerships, and intense competition all really stymied DEC. A lot of the weirdness came from trying to protect their legacy base of VAX and PDP midrange systems and a general hatred of IBM (who was pushing OS/2).
BTW as I recall, Windows NT supported x86, Alpha and MIPS (another RISC vendor) with the first commercially available version of NT. MS added a few other RISC architectures in the following years (ARM most notably). x86 closed the speed gap a few years later with the Pentium II (the Pentium Pro was largely used in servers) and the rest is history.
Shortly after I graduated in '98, I purchased an Alpha that ran both NT and some Unix (I forget which?). This purchase was motivated by nostalgia for the DECs in the computer lab at school, which were awesome, although I agree that in the long run I wasn't completely satisfied with it.
Yeah, I remember writings and interviews from that time period, where he spoke and wrote well on the subject.
Can't say I understand very well what was going on in his mind, but somehow I got the impression that some of that was not totally specific to Alpha in particular, but that Alpha represented non-x86 at large, and he was excited and enthusiastic to tackle the technical problem of making the kernel portable. And it certainly paid dividends; e.g. Linux on ARM is running on billions of devices today, and that surely would have been more work had they not decided to make the kernel portable in the first few years of its history.
I had a DEC Alpha to play with, back in the day. Was a port target for some of our projects, and I occasionally booted it into NT for the druthers.
I much preferred my SGI boxen at the time. And when Linux hit Alpha, NT was little more than a plaything.
Oddly enough, it's how I feel about the "NT on non-x86" rigs I occasionally run into, also being pitched as a port target for some of my modern projects.
Which alpha? The 21264 was introduced in 1996. It was a 4-issue out of order processor that could have 80 instructions in flight. It had 2 LSUs and could clock up to 500 MHz initially.
The Pentium Pro, released in late 1995, was a 3-way design clocking up to 200 MHz. It was surprisingly competitive. It had all the modern requirements: out-of-order execution, an out-of-order memory pipeline, on-board L1 and L2 caches. It just had fewer resources in each dimension. Intel caught up to 500 MHz in 1999, with the PIII. That was still a 3-way design. Intel's first 4-way design was the Core 2 in 2006. That could hold 96 instructions in flight. Intel didn't release a CPU with two load-store units until 2011 (Sandy Bridge could do two loads per cycle) or possibly 2019 (Sunny Cove was the first that could do two stores per cycle).
Obviously, Intel caught up based on a combination of features long before then. I’d guess around early 2000s, when clock speed hit 1 GHz+. Clock for clock, the Pentium Pro wasn’t actually that much slower than Alpha on integer code. It took much longer for intel to be competitive on floating point code.
(By the way, it blows my mind that we went from 200 MHz to 1 GHz in about 4 years. Meanwhile, it’s been 15 years since my first 2 GHz+ MacBook, and a new MacBook still has a base clock around that range. Maybe it’ll turbo boost to 4 GHz for 10 seconds before it exceeds its thermal envelope.)
A 33 MHz 486 to 200 MHz was about 5 years. An 8 MHz 286 was released 7 years before that, right at the start of the uptick. Before that, a MOS IC was generally capped at about 5 MHz while people slowly figured out how to shrink them and developed EDA tools that allowed designs to scale quickly across processes.
Dennard scaling was a wild ride while it lasted, but it's been done for more than a decade now.
Pentium Pro was really unbelievable at the time. Full Linux kernel recompile went from 15-20 minutes to less than 5 minutes. You had to use 'gcc -pipe' because the drives were then the limiting factor.
But I think it was not great on 16-bit code or something. The boost was not as impressive for DOS and Windows (still had lots of 16-bit code at the time).
I bought my Pentium Pro + motherboard and RAM at a computer show for ~$1000. This was a fairly good purchase. Around the same time I also bought an HP CD burner for around $1000 - a big mistake, since it was slow and the prices soon plummeted.
> Full Linux kernel recompile went from 15-20 minutes to less than 5 minutes.
We had a similar jump going from the 486 to the Pentium. I distinctly recall a Linux kernel compile being a 45-60 minute adventure on a 486DX2-66, but only 5 minutes on a Pentium 100. (I felt like a lot of elements of PC architecture improved together around that time as well.)
People don't realize how good Intel was (and is!). Take a good look at what the latest gen Xeon or i9 can do (and its SIMD instructions which are like a separate RISC computer inside) and you'll see why they're the leader. It wasn't just strongarming vendors, etc.
One minor quibble. Intel pushed Itanium hard at one point. The machines were huge, and I had to have additional power routed to my office to run one. Thankfully AMD introduced the x64 architecture and the PC industry was saved. Between the power consumption of the Itanium and how hard IA64 assembly was to debug, I at least welcomed the x64 architecture.
IA64 assembly was "hard" because it wasn't expected to be something that humans would write or debug routinely - it was basically meant for compilers.
This also describes VLIW architectures more generally, and even more so wrt. current speculative developments like Mill, which AIUI doesn't even have a single, standardized "machine-level" assembly; what the machine runs at its lowest level is only defined by your actual chip configuration. (It is understood that GPU compute works much the same way, with the GPU driver acting as a "compiler" of sorts.)
> Maybe it’ll turbo boost to 4 GHz for 10 seconds before it exceeds its thermal envelope.)
Actually that's on Apple, the manufacturer, for choosing to undercool these chips (probably thinking that most people won't care, which I tend to agree with, but it's problematic when the machine is sold as a "Pro" device).
Anecdotal comparison, my Thinkpad sustains its 4 GHz for hours on end no problem. There's no comparing the performance of these i7 / R7 with decade-old dual cores at 2 gigs, it's just unfair.
If you sell me a "Pro" machine, I expect to be able to fully utilize all its resources for as long as I need to. While I don't need full power all the time (once or twice a month I need it for more than 10 minutes), when I do need it, I'd prefer it to be available for more than a couple of minutes at a time, without making my laptop sound like a hair dryer and compute like an Atom laptop.
Having said that, my 15" MBP is sufficient for my needs, but I'd not recommend it for anyone doing heavy numeric lifting. Or compiling large volumes of code.
There's not, but in general one would guess they use the machine during their work. If a machine is expected to put in an 8-12 hour workday, whether it's a computer, car, truck, power tool or whatever, it needs to be able to handle operation at full load for extended periods of time. Adequate cooling, lubrication, duty cycles, lifespan and so forth should be considered by the manufacturer's engineers before they slap the "Pro" label on anything.
>Meanwhile, it’s been 15 years since my first 2 GHz+ MacBook, and a new MacBook still has a base clock around that range. Maybe it’ll turbo boost to 4 GHz for 10 seconds before it exceeds its thermal envelope.)
The short version is that the MacBook is designed for looks, not speed/performance. It overheats by default.
The 2700K could reach/overclock to 4.5 GHz without much trouble, given proper cooling - in 2011. The reason for the lower base clock is Intel's TDP definition.
The Alpha had an incredibly weird memory model that could create this behavior:
Assume: p=&a, a=1, b=0
Thr. 1         | Thr. 2
b = 1          |
memory_barrier | i = *p
p = &b         |
The result can be i = 0
Even though p=&b happens after b=1 in Thread 1, it is seen as happening before it by Thread 2. This is basically because the b=1 part might get ignored by Thread 2 unless that thread also goes through a memory barrier of its own. When the i = *p is executed, Thr. 2 might read p from main memory as written by Thr. 1, but then still rely on the old b=0 value sitting in its cache. It's hard to describe that as being "superior", but maybe it enabled some performance optimizations in the common case that made it worthwhile.
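For concreteness, here's a hedged C++11 rendering of that example (the std::atomic types and memory orders are my additions, not anything from the Alpha era). On every mainstream CPU other than Alpha, the raw data dependency from the load of p to the load of *p would already provide the ordering at the hardware level; portable C++ still wants the release/acquire pair shown here:

#include <atomic>

int a = 1;
int b = 0;
std::atomic<int*> p{&a};

void thread1() {
    b = 1;
    p.store(&b, std::memory_order_release);   // plays the role of "memory_barrier; p = &b"
}

int thread2() {
    // On Alpha, anything weaker than acquire here (even relying on the data
    // dependency alone) could still observe the stale b == 0 through *q.
    int* q = p.load(std::memory_order_acquire);
    return *q;   // with the release/acquire pair, i == 0 cannot happen
}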
I'm not sure I understand your example. The Alphas had a pretty standard MESI cache architecture and a very typical cache coherence behavior. It's true that individual CPUs require barriers to enforce ordering, which is not true on x86 (where the architecture guarantees consistent ordering of memory operations between all entities on the bus, and spends significant die area to do that). But honestly x86 is the odd architecture here -- almost no one else felt that was a good tradeoff. (In hindsight, x86 was right, of course.)
Where Alpha was weird was in its floating point implementation (two versions on chip! to support the crazy VAX formats that no one cared about even at the time) and its alignment requirements (the hardware, not really atypically, couldn't make a misaligned load, but actually "could" by trapping into a microcode handler that was 10x slower, leading to mysterious insane performance regressions).
The example isn't explained well in general, IMHO.
Suppose thread 1 executes the following code:
x = 1;
store_fence();
p = &x;
At a hardware level, it's guaranteed that the cache coherency traffic to update the value of p is going to happen after the traffic to update the value of x. So it's natural to assume that means that anyone who sees that p == &x must have to see that x == 1 as a result of this traffic. And for most architectures you'd be correct.
But you would be wrong on Alpha. Alpha has two cache banks, and there is no coordination between them (in the absence of memory barriers). So if p and x reside in different cache banks, it's possible for a thread to load the value of p (observing that it is &x), and then load the value of *p and fail to see the assignment of x = 1--if the cache bank that contains x is somewhat overloaded on processing the bus traffic, for example.
Incidentally, you can't actually take advantage of the opportunity to avoid the memory barrier in this scenario on everyone-but-Alpha in the C++11 memory model, because it turns out that figuring out how to specify that the (hardware) data dependency matters for ordering purposes in a source language is a lot trickier than it might appear. memory_order_consume (added for this purpose) is lowered to memory_order_acquire in all compilers I'm aware of.
I've seen it mentioned in several memory model texts that, while ARM's model is more "relaxed" than x86, the DEC Alpha's model is even more relaxed. For example: [1] says "The venerable DEC Alpha is everybody’s favorite example of a weakly-ordered processor. There’s really no mainstream processor with weaker ordering", and [2] says "Some CPUs (such as i386 or x86_64) are more constrained than others (such as powerpc or frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside of arch-specific code." [1] adds, specifically, "[data dependency ordering means that] if you write A->B in C/C++, you are always guaranteed to load a value of B which is at least as new as the value of A. The Alpha doesn’t guarantee that."
The Alpha did have a memory model that was weaker in some respects than pretty much every other RISC CPU. I'm not clear on the details, but I think that one of them is described in https://www.kernel.org/doc/Documentation/memory-barriers.txt in the 'GUARANTEES' section -- if you do "Q = LOAD P; D = LOAD *Q" then on Alpha and only on Alpha extra barriers are required to ensure that the second memory access (whose address depends on the result of the first load) really does get issued in-order after the first.
Because when it doesn't, it leads to "insane" results like the one upthread. Reasoning about memory barriers is outrageously hard, and asymptotic software quality is outrageously expensive. At the end of the day an x86 kernel is going to have fewer race conditions than one for a traditional architecture, and that's worth something.
This is a lesson that software folks have been pushing with tools like static typing and memory lifetime analysis. The same applies at the hardware level.
You "just" need a memory barrier on every ptr indirection (including array indexing, etc.) that might go through something that's being shared among multiple threads. The more interesting question is whether requiring that additional barrier buys more performance in the typical case where you're not touching shared data.
> The more interesting question is whether requiring that additional barrier buys more performance in the typical case where you're not touching shared data.
Probably yes, i.e., there was some hardware reason for it. You can find one explanation [1] floating about that explains the effect was due to the cache banking design, with separate invalidation queues for each bank. Presumably that was better in some respect than the alternative designs available to implementors.
That said, the difference is probably fairly small, and if Alpha were around today they would almost certainly want to abandon that particular reordering, since they are the only ones doing it and there would be a lot of pressure on them to make it fast (this comes up in all sorts of non-trivial scenarios like double-checked locking, reads of final fields in Java, etc.).
That's generally the pattern, I think: if you are the odd man out with the weakest model, and you aren't the dominant player, you will have a lot of pressure to at least offer fast ways of lining up with the stronger models. So even if a weak model is a better design point in a vacuum, it might not be true when real-world implementation pressures are considered.
> This is basically because the b=1 part might get ignored by Thread 2 unless that thread also goes through a memory barrier of its own.
Note: without a memory barrier on BOTH sides, your code is factually wrong anyway and is open to memory-ordering issues.
With the modern "acquire-release" memory model, you need a release-barrier in Thread#1, and an acquire-barrier in Thread#2, for code to be fully correct. Acquire and Release barriers, also known as "half barriers", weren't invented in the time of DEC Alpha. But DEC Alpha's full barrier ("mb") would make the code correct.
I am not sure what you mean by "factually wrong": the whole point of the example is that on no other platform is a memory barrier required on the RHS.
Even on ARM and POWER, very weak models, an address dependency orders the reads. This is the 'MP+dmb/sync+addr′' litmus test in 4.1 in [1].
This type of dependency, which doesn't need a barrier to work, is why C++ invented the whole convoluted "consume" ordering in the memory model, pretty much doubling the size and complexity of the model.
> I am not sure what you mean by "factually wrong": the whole point of the example is on no other platform is a memory barrier required on the RHS.
Think about the C++ code and how the compiler could implement your code.
Thread 1:
b = 1;
memory_barrier();
p = &b;
Thread 2:
*p = (whatever);
...
i = *p;
A slight change, but let's say Thread 2 has the *p = (whatever) a bit earlier. The compiler could cache p inside a register, and may NEVER re-read p from memory and update it in time!
The memory barrier in thread2 is still needed so that the COMPILER knows that its register state is potentially stale. Even today, with half-barriers and a stronger memory model, the compiler can't read the mind of the programmer.
Based on that reply, I am going to assume you are not familiar with how these litmus tests are written, or more generally with how hardware memory models are discussed (hint: this is just C-like pseudocode showing the order of accesses to discuss hardware reordering so saying "what if this is cached in a register by the compiler" is a non sequitur: there isn't any compiler involved here).
So I don't think I can usefully continue this branch of the conversation, sorry. That said, if you want some good reading on the topic I recommend Cambridge memory model group reference I linked above. In fact, all of the Cambridge stuff is good: they got the x86 memory model right (correctly described), even before Intel did!
FWIW, the OP was spot on: this is a canonical example of a reordering the Alpha does (did) that no one else does. It comes up over and over when you talk about memory models. It's almost a meme: "that's the one that Alpha did..."
My main point is that a C++ level programmer BETTER be putting memory barriers in Thread2.
I guess if you're working at the system / assembly level and only using C-like pseudocode for simplicity's sake, there's a different set of expectations. But I know that I've been "burned" before by the compiler moving some stuff around.
Programmers have to work with not only the CPU, but also with their compiler's memory model. It's a fact that is often forgotten. Even if the CPU implements a particular memory model, the compiler has a (slightly) different one, and memory barriers or fences (or Java volatile) may still be required for proper memory ordering.
----------
I see your point that this "weakness" in DEC Alpha's memory model is probably hard for system programmers and compiler writers to figure out. But it should be noted that in C++11, the Thread2 code is still wrong without the presence of a memory barrier of some kind. (and once we add the memory barrier, mb, into Thread2, it fixes the problem at both the C++11 level AND at the DEC Alpha level)
For the code in thread 2 to need a barrier in C++, you'd need your modified version of it where *p is assigned earlier. And even then you only need a compiler fence (`asm volatile ("" ::: "memory")`).
That’s a two part refutation but the first part is most important: the example only needs a barrier in C++ if you change the example as you did. If you don’t change it then from the C++ standpoint there is no earlier value that could have been cached. That arises naturally in C++ code since there’s bound to be a function boundary or lock-related fence nearby.
And if there isn’t one nearby then C++ programmers love compiler fences for that. You don’t need a CPU fence for this except on Alpha.
I still run an AlphaServer DS25. It's amazingly quick, even in 2020. It outperforms CPUs which came years after it.
People sometimes think that the market caused the Alpha to fail. Really, it was Intel. They wanted Itanic to succeed so much that they made a deal with HP to end the Alpha prematurely, even though demand for Alpha systems was high.
Even after HP announced that the Alpha would stop being developed, Alphas were being sold as high-end systems for all sorts of uses and had many entries in the Top 500 list of supercomputers - Alpha systems were four of the top ten on the planet in November 2002, which was after HP announced they wanted to transition from Alpha to Itanic.
My AlphaServer DS25 is beautiful hardware. Everything in the machine is manufactured to a standard we just don't see any more.
I interviewed at Intel & DEC in '95, and worked at DEC '96-'97.
Corporate infighting, incompetent sales, and misaligned vision pervaded the culture. It didn't help that PPro came out of left field and took everybody by surprise
When I interviewed at Intel in '95, they were absolutely giddy with how they stole Alpha's thunder. DEC was a better geographic choice for me, and, well...
HP didn't come into the picture until long after - DEC sold to Compaq in '98, Compaq to HP in '02. The race was over at least 7 years before that.
Edit: Not to mention the absolutely hostile stance DEC took to MS wrt NT. Due to Cutler's involvement with NT, DEC sued MSFT and somehow thought that would make MSFT become a loyal partner.
I remember Office 97 performing terribly on Alpha. We sent a compiler guy there to figure out why; turns out the Office team had a single Alpha in their pipeline, set to compile -O0, and pretty much said "this is there to check the lawsuit checkmark"
I bet the Office software was riddled with C++ undefined behavior. The Alpha was one of the closest CPUs to the mythical Deathstation 9000 which will destroy you if you make even one mistake in your code.
Like Itanium, compiling for Alpha with -O3 would expose you to large amounts of bugs in your code and bugs in the compiler too.
>turns out the Office team had a single Alpha in their pipeline, set to compile -O0, and pretty much said "this is there to check the lawsuit checkmark"
what does this mean? compile -O0? lawsuit checkmark?
-O0 means compiling without optimization. Presumably they were contractually required to have a version of Office for Alpha, so they did the minimum-effort thing to produce one.
> People sometimes think that the market caused the Alpha to fail. Really, it was Intel. They wanted Itanic to succeed so much that they made a deal with HP to end the Alpha prematurely, even though demand for Alpha systems was high.
This is misleading at best - "The Itanium architecture originated at Hewlett-Packard (HP), and was later jointly developed by HP and Intel." (https://en.wikipedia.org/wiki/Itanium)
A more realistic view would see it as a casualty of the DEC->Compaq->HP merger sequence, with HP long committed to Itanium as the successor to its own PA-RISC architecture.
My dad worked in the Alpha memory dept at DEC / Compaq / HP. They continued supporting Alpha servers long after they stopped selling them to new customers. Wikipedia says 2007 which sounds about right. Customers spent hundreds of thousands on these systems and, from what I was told, really liked them.
They still performed well when they ended support and laid off everyone on the Alpha teams, including my dad.
One of my now-retired colleagues worked on real-time compression, and he told me that the Alpha AXP processor was very weak on bit-twiddling instructions. For the algorithm that was ultimately chosen, the smallest unit of encoding in the compressed stream was the nibble; anything smaller would slow things down by too much. This severely hampers your ability to get good compression ratios.
The thing I remember about DEC Alpha was that they were used as renderfarm processors by special effects artists working in Lightwave on Amiga (for shows like Babylon 5, etc). I used to drool over ads for DEC Alpha “screamernet” machines hooked up to Amiga 4000s. Alas, I was stuck with a Tandy 486 SX...
The animation studio I worked for in the mid-late 90s was a Softimage shop using SGI Indigo and Intel pentium workstations with a 20-30 machine Alpha (NT) render farm. It was a very good setup with one weird issue: depending on the type of render being done, especially anything with ray marching, the SGI or Intel workstations couldn't participate in the render because the results looked subtly different.
It was explained to me as a difference in the way certain arithmetic calculations were made in the actual hardware. The render engine was mental ray. It was really only apparent if there were atmospherics, such as volumetric lighting, in the scene. Net renders with the workstations participating would come back with some of the resulting tiles looking noticeably different: the atmospheric effects (ray marching) would appear to have differing densities between the Alpha-rendered tiles and the SGI- or Intel-rendered tiles.
I got a rack of Alphas that were supposed to have been part of the farm doing water effects in "Titanic". One of the Stack Overflow posters mentioned how hard it was to actually get your hands on this hardware; even in 1999/2000, when they were second generation, it was difficult.
They never did show up on the resale market much; it took me years to find out why: credit unions. A big credit union software package was written for Alpha (and apparently not ported), and there are still some folks stuck on this hardware. Something to do with the CPU having a "how to treat roundoff" switch, I gather, and the fact that they're great at chopping strings by bytes, which old finance protocols do quite a lot of.
It was no faster in practice because it was so hard to program: without the ability to load or store anything but natively aligned, natively sized quads, and with a quite useless memory-ordering model (essentially anything could be reordered past any other thing), reading, modifying, and writing one byte on this rig was basically impossible, and don't even think about how hard it would be to write a mutex.
> essentially anything could be reordered past any other thing
Except for memory barriers: the DEC Alpha instruction "mb".
x86 has a (relatively) strong memory model, but ARM and POWER9 both have weak memory models. Not quite as weak as DEC Alpha, but weak enough that you need to be very careful about memory-barrier placement in both ARM and POWER9.
Xbox 360, PS3, and ARM (cell phone) system programmers would know about the difficulties involved. Yeah, it's hard to learn, but totally possible to write a mutex.
--------------
There's probably more code running on "weak memory models" (i.e., ARM) than on "strong memory models" (i.e., x86) today.
> without the ability to load or store anything but natively aligned and sized quads
Check out the assembly code generated with -O3 by GCC or LLVM. x86 has natively aligned quads for performance reasons, and compilers have also solved this problem. In fact, you'll see plenty of "nop"s generated in compiled code so that your assembly is aligned to cache-line boundaries, to maximize uop-cache issue on modern x86 machines.
x86 doesn't have any REQUIREMENT for aligned assembly code or data. But x86 is far more efficient when reading/writing from cache-aligned locations. (In particular: reading across a 64-byte boundary naturally results in 2x L1 cache reads instead of 1x L1 cache read.)
Both the alignment and memory-barrier issues are solved today. Arguably, they weren't solved back in the DEC Alpha days (I was too young to be programming at that time)... but modern compilers and toolchains can certainly work with memory barriers and "aligned only" memory.
ARM’s memory model is stronger than Alpha’s especially on ARM64.
It’s not about just whether it’s possible to write a mutex but also whether you can write a really good one. The best concurrent algorithms - whether mutexes or lock-free data structures - benefit from a careful mix of strength and weakness in the memory model. I think Alpha is too weak. X86 may be too strong - so say smart people - but it still manages to be fast as fuck.
No idea what you’re talking about wrt alignment. On x86, you can load/store misaligned. On Alpha, you can’t. On ARM, you can on some but not on others. The CPUs where you can are easier to program.
No idea what you're talking about wrt memory barriers. It's not a solved problem. Memory barriers slow things down, so it's better not to have to use them. x86 and ARM64 give you tricks to avoid using barriers, or to use cheap barriers, in many important racy algorithms. Alpha gives you fewer tricks. (Specifically, x86 gives you lots of ordering "for free", while ARM64 lets you use the self-xor dependency trick to cheaply order loads and has a generous buffet of half fences.) The memory model matters a lot - the weaker it is, the fewer tricks programmers have to avoid doing expensive things to request specific orderings.
At a performance penalty. Read or write across a 64-byte cacheline boundary, and your CPU will be forced to perform 2x loads to implement your single unaligned load. An unaligned load across a cacheline is literally 1/2 the speed of an aligned load, and is something the modern programmer (and compiler) tries to avoid.
The compiler basically avoids all misaligned loads/stores, even on x86. (Aside from when the programmer really forces it: like *reinterpret_cast<int*>(0x800003f) or something.)
-------
> The best concurrent algorithms - whether mutexes or lock-free data structures - benefit from a careful mix of strength and weakness in the memory model.
And all of those algorithms would run CORRECTLY, if those half-barriers were replaced with full barriers. Maybe slower, but they'd be correct.
That "generous buffet of half fences" can only exist on a weak memory-model system (like ARM), because x86 AUTOMATICALLY performs those half-fences after every load / before every store instruction. That's the thing about strong memory models: once you're "too strong", it doesn't even make sense to have those half-barrier instructions.
ARM and PowerPC were too weak 10-years ago. They've "strengthened" their memory model by adding new half-barriers to their instruction set. That's the real secret: to change your CPU over time to match programmer's preferences. ARM / PowerPC started off too weak, but are now approaching "just right", with the addition of new instructions.
DEC Alpha can't do that, because the DEC company died decades ago. It's only fair to consider the programming environment and expectations of the time period that the DEC Alpha existed in.
The perf penalty of misaligned loads and stores is incredibly low. JavaScriptCore uses them quite a bit. The penalty is way lower than handling the misalignment by way of a trap, which is what you’d do on Alpha if you had to run some code that was designed to use misalignment. That’s just a fact - the original comment was about how this fact made Alpha a worse target.
I think you're glossing over a lot of details about the memory model. What the model says about the ordering of dependent loads isn't a matter of just adding fences later - it's more fundamental than that. If Alpha achieved its perf advantage thanks to speculating loads, then it would have to lose that advantage when it was modernized to current standards. Also, the original comment was about Alpha back then versus Intel-and-others back then, so it's not interesting to say that Alpha could have just improved - that's not really responding to the original comment about how hard Alpha was to program.
> The perf penalty of misaligned loads and stores is incredibly low.
Yeah, this.
You might as well think of them as free in most scenarios involving small reads and writes. If there is some small advantage to misalignment (often, reduced memory use through better packing), do it!
The cross-line penalty would occur only for 3 out of 64 alignments for a 4-byte load: less than 5% of the time, and then the penalty is small.
That % is more or less worst case in well designed code [1]. Sometimes you can do misaligned loads that you guarantee will never cross. A common example is a misaligned LUT where the values overlap. E.g., given n, you want to load 8 bytes with:
[n, n+1, n+2, ..., n+7]
Let's say n ranges from 0 to 15. A traditional aligned LUT would have 16x 8-byte values, one for each n (128 bytes). You could also do it with a single 23 byte LUT, running from 0 to 22, and a misaligned load into that LUT at byte position n. As long as you align the LUT itself (e.g., to 32 bytes), you will never cross a line.
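As a hedged sketch of that overlapping-LUT trick (sizes taken from the example above; the function name is mine, and memcpy is used so the compiler emits a single, possibly misaligned, 8-byte load):

#include <cstdint>
#include <cstring>

// 23-byte table aligned to 32 bytes: a load at offset n (0..15) never crosses a 64-byte line.
alignas(32) static const std::uint8_t lut[23] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
    12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
};

std::uint64_t load_window(unsigned n) {   // n in [0, 15]
    std::uint64_t v;
    std::memcpy(&v, lut + n, 8);          // single (misaligned for most n) 8-byte load
    return v;                             // bytes [n, n+1, ..., n+7]
}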
For large accesses, things become less clear. After all, a 512-bit AVX-512 access is guaranteed to cross if it is misaligned, and 256-bit access randomly distributed cross half the time, etc. Vectorized code is also the type of code that may be written to approach the 2/1 load/store per cycle limit, so it really pays to try to get alignment for any type of non trivial loop.
---
[1] Specifically, this is the crossing % you would get if you could assume nothing about the distribution of the accesses, i.e., they are uniformly randomly distributed. If you know something about the expected alignment, then you can shift everything to make crossing less likely.
On the second point, the OP is complaining more a about the lack of a way to directly access anything smaller than 4 bytes, i.e., that there are no 1 or 2 byte reads or writes.
In addition to being slow for e.g. text processing, you couldn't implement a modern memory model like Java's or C++ on it, without serious compromises, due to the inability to make non-interfering writes of adjacent elements.
> In addition to being slow for e.g. text processing, you couldn't implement a modern memory model like Java's or C++ on it, without serious compromises, due to the inability to make non-interfering writes of adjacent elements.
But consider that the modern model has to account for 64-byte (aka 512-bit) cachelines. Any reads or writes to the same 64-byte cacheline are going to be false-shared between cores. From a performance point of view, you cannot make non-interfering writes of adjacent elements in the modern memory model!
If you're dealing with single-threaded code, then bit-mask instructions for registers should be sufficient to emulate the ability. I can see that bit-masking every character instruction would be slower, but it would certainly get the job done.
I don't think the performance thing is relevant here.
First you have to be able to implement the model correctly. That means non-interfering writes. 99.99% of those writes will never involve any sharing (false or otherwise) and will perform just fine.
Not being able to do byte-granular writes means you simply can't implement things like this at all unless you:
1) Decide that "char" will be 32 bits (i.e., make a byte on your system 4x larger than usual).
2) Do some locking around each narrow write (or perhaps a series of them) and ensure everyone does the same.
So I am not really following your argument with regard to false sharing.
I don't know what you mean by "the modern model accounts for 64-byte (aka: 512-bit) cachelines". I don't know of any (software) memory model that even mentions cache lines. Hardware memory models don't lean heavily on them either, although they do get mentioned (e.g., on x86, Intel provides additional guarantees for some misaligned reads that do not cross a cache line, and AMD provides different guarantees).
> 1) Decide that "char" will be 32 bits (i.e., make a byte on your system 4x larger than usual). 2) Do some locking around each narrow write (or perhaps a series of them) and ensure everyone does the same.
Or 3) Decide that for performance reasons, reads/writes to an array (even a char* array or string) would be locked in 64-byte chunks at a minimum.
My point with false-sharing is that "true" byte-granularity comes with severe performance penalties on modern x86 systems. It's a thing that you want to avoid.
If one thread is reading / writing to char myString[0], while a 2nd thread is reading/writing to myString[63], your memory will ping-pong between the two threads (because x86 shares 64-bytes at a time between caches).
--------
I guess your point is that the CPU will "do the right thing, even if they're forced to do it slowly" in this case. So I guess you're right about that. Your code CAN be written as multithreaded + byte granular, its just going to be slow in practice.
I guess my point is that the high-performance programmer needs to understand this situation anyway, and that high-performance programmers have developed methodologies to avoid false-sharing. Any high-performance data-structure needs to understand false-sharing. You really shouldn't be doing things like reading/writing to myString[0] and myString[63] in two separate threads.
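For what it's worth, a hedged sketch of the usual methodology: pad/align per-thread data to its own cache line so the ping-ponging described above can't happen (the 64-byte figure matches the x86 line size discussed here; the struct name is illustrative):

// Each element gets its own 64-byte cache line, so two threads
// updating counters[0] and counters[1] never false-share.
struct alignas(64) PerThreadCounter {   // 64 = x86 line size; see also
    long value;                         // std::hardware_destructive_interference_size (C++17, <new>)
};

PerThreadCounter counters[2];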
Interesting, TIL that POSIX implies CHAR_BIT == 8.
I knew POSIX required a lot of other things that are unspecified in C/C++ but usually have a single sane implementation, but not that particular one. I know I have static_assert(CHAR_BIT == 8, ...) sprinkled around in quite a few places.
One day I'm going to write a header that just checks a big list of things that are "obviously sane" but not actually specified in C/C++, and error out if any are not true. Then I can include this in every project and just code in this new sane world. Not strictly portable, but "portable to where it matters".
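A hedged sketch of what such a header might look like (the specific checks are illustrative, not exhaustive):

// sanity.h: assumptions this code relies on that C/C++ itself does not guarantee.
#include <climits>
#include <cstdint>

static_assert(CHAR_BIT == 8, "bytes are 8 bits");
static_assert(sizeof(int) == 4, "int is 32 bits");
static_assert(sizeof(void*) == sizeof(std::uintptr_t), "pointers round-trip through uintptr_t");
static_assert((-1 & 3) == 3, "signed integers are two's complement");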
The issue is not with atomics but with all non-atomic accesses. The standard guarantees that two concurrent writes to contiguous but distinct char-sized memory locations are legal and data-race-free.
> Or 3) Decide that for performance reasons, reads/writes to an array (even a char* array or string) would be locked in 64-byte chunks at a minimum.
First note that this doesn't have anything to do with arrays specifically: any type of adjacent elements (in a struct, loose on the stack, whatever) have this problem.
That said, I have no idea how you would implement this suggestion.
E.g., how would you compile the following very simple function?
void write1(char *c) {
*c = 1;
}
> My point with false-sharing is that "true" byte-granularity comes with severe performance penalties on modern x86 systems. Its a thing that you want to avoid.
It obviously does not! x86 has true byte granularity and loads and stores are very fast, just as fast as any other size. Just because the possibility of false sharing exists, doesn't mean it somehow slows down byte writes. False sharing doesn't even have anything to do with byte writes: it affects any size of access. So I am really not following your argument.
To summarize my claim:
- Modern memory models generally require that writing some "object" (which could be a primitive value like a char, int, whatever) not interfere with adjacent objects.
- This applies to all writes, unless perhaps the compiler can prove the data never escaped and cannot be shared, so in practice it means that you can only efficiently write at the smallest granularity supported by hardware.
- This means that you pretty much need 32-bit char on platforms where the smallest hardware write is 32 bits, or just give up on supporting the memory model.
None of that has anything to do with false sharing really. Not even really performance: we are just trying to get it to be correct. Saying "well you don't want to concurrently read nearby elements anyways" doesn't mean much because it doesn't lead to a practical way to implement the language (unless I am missing something).
> I guess your point is that the CPU will "do the right thing, even if they're forced to do it slowly" in this case. So I guess you're right about that. Your code CAN be written as multithreaded + byte granular, its just going to be slow in practice.
Not at all. My point has nothing to do with false sharing. You brought up false sharing.
My point is that the memory model is defined to prohibit behavior X, and platforms that need to do a large-word RMW to modify smaller elements can't effectively prohibit X.
This is true if you have false sharing or no false sharing or a small amount of false sharing. This is true even if you don't start any threads. This is true also for scenarios of "true sharing", e.g., where the other thread wants to share data (i.e., after reading the adjacent element maybe it is going to read the other one too).
> I guess my point is that the high-performance programmer needs to understand this situation anyway
It has nothing to do with the (application) programmer or performance. It is more fundamental than that: how does the compiler writer write a compiler that produces correct code under the memory model?
You said it a lot better than I could. It’s really weird in our field when people can look at a machine in isolation and praise its abundance of execution resources without really considering if it’s possible to exploit them in software. The point of a computer is to run programs after all.
> First note that this doesn't have anything to do with arrays specifically: any type of adjacent elements (in a struct, loose on the stack, whatever) have this problem.
Structs are usually 64-bit aligned in x86 though. You need #pragma pack to be no longer 64-bit aligned on modern compilers.
> E.g., how would you compile the following very simple function?
I don't know DEC Alpha assembly, but by my understanding:
mov R1, [c] ; 32-bit move into R1
and R1, 0xFFFFFF00
or R1, 0x00000001 ; Move 1 into the first byte of R1
mov [c], R1
Later DEC Alphas had byte-wise granularity in registers. But when it was 32-bit only, you'd have to do something along these lines.
If "c" was {4, 4, 4, 4}, after the store, it would be {1, 4, 4, 4}. If you are asking me where to place the memory barriers... note that C++ is Relaxed memory model and has undefined behavior unless you put the appropriate atomics<> in the correct location (which would generate a "mb" instruction for DEC Alpha).
I'm pretty sure DEC Alpha can implement C++11's memory model, albeit inefficiently with full-barriers... but it probably would work.
> Structs are usually 64-bit aligned in x86 though. You need #pragma pack to be no longer 64-bit aligned on modern compilers.
Totally false (just try it).
> I'm pretty sure DEC Alpha can implement C++11's memory model, albeit inefficiently with full-barriers... but it probably would work.
No, they can't (without widening char to 32 bits). The code you have shown modifies the adjacent elements. Anyone that writes a nearby byte in the window between the load and the store will have their write silently clobbered.
You can't fix it with barriers. You need a lock or LL/SC loop.
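For illustration, a hedged C++ sketch of what that looks like when written portably as a CAS loop on the containing 32-bit word (little-endian layout assumed; the function name is mine):

#include <atomic>
#include <cstdint>

// Emulate a byte store on a machine whose narrowest store is 32 bits,
// without clobbering concurrent writes to the neighbouring bytes.
void store_byte(std::atomic<std::uint32_t>* word, unsigned byte_index, std::uint8_t value) {
    const unsigned shift = byte_index * 8;                    // little-endian byte layout assumed
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;
    std::uint32_t old = word->load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        desired = (old & ~mask) | (std::uint32_t{value} << shift);
    } while (!word->compare_exchange_weak(old, desired, std::memory_order_relaxed));
}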
> You can't fix it with barriers. You need a lock or LL/SC loop.
Agreed on this point.
> No, they can't (without widening char to 32 bits). The code you have shown modifies the adjacent elements. Anyone that write a nearby byte in the window between the load and the store will have their write silently clobbered.
I guess my misunderstanding was that C++11 gave this guarantee. That's pretty cool.
I guess I typically expected specific bitwise operators (ex: atomic_and or atomic_or) to be necessary in these cases. But if DEC Alpha is really the only processor in (somewhat) recent memory that doesn't allow 8-bit granularity, then it makes sense to just standardize upon 8-bit granular reads/writes.
It would be very hard to write correct software involving threads without that guarantee. Most threaded programs use lots of private data, and imagine if another thread could just randomly stomp on your private data while it was trying to do a nearby write.
Maybe you could try to formalize the idea of "keep private data away from data on other threads" or something, but this seems very difficult (in practice, private data is private as an emergent property of the whole application, so is not easily identifiable), and you'd have to give up on a lot of things the existing model offers in terms of write-independence.
Of course it is not 8-bit aligned! I never said everything was 8-bit aligned, just that your alignment claim was bizarre.
ints will always be (at least) 32-bit aligned.
The rules are simple [1]:
1-byte primitives (e.g., chars) are 1-byte aligned
2-byte primitives (e.g., shorts) are 2-byte aligned
4-byte primitives (e.g., ints) are 4-byte aligned
8-byte primitives are 8-byte aligned
Etc.
The rules for structures are also simple: they will have the alignment of their highest alignment member. That makes sense: the language rules (and ABI) are designed to make sure that primitives have (at least) the above alignment, regardless of where they are: putting them into a structure (or array, etc) can't violate that.
This is basically common across any modern platform too - not x86 specific, although some details may vary (e.g., size of primitives). There are weird cases too, like on some 32-bit platforms even (some) 64-bit values are only 32-bit aligned, and the whole 80-bit long double thing, etc: but the basic rules hold.
---
[1] Here I give some examples of data types that _commonly_ have the given size, but except for char this is not always true (e.g., some platforms have 8 byte ints, and some have 4 byte longs).
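Going back to the alignment rules above, here is a small hedged illustration with the offsets and sizes you'd get on typical x86-64 ABIs (the struct is purely illustrative):

#include <cstddef>

struct Example {
    char  c;   // offset 0
    int   i;   // padded so it lands at offset 4 (stays 4-byte aligned)
    short s;   // offset 8; 2 bytes of tail padding follow
};             // sizeof(Example) == 12, alignof(Example) == 4 on typical x86-64 ABIs

static_assert(alignof(Example) == alignof(int), "struct takes its strictest member's alignment");
static_assert(offsetof(Example, i) == 4, "int member stays 4-byte aligned inside the struct");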
> Not being able to do byte-granular writes means you simply can't implement things like this at all unless you:
> 1) Decide that "char" will be 32 bits (i.e., make a byte on your system 4x larger than usual).
With the C++ memory model you can make std::atomic<char> larger than char. A lot of code assumes that this is not the case and that it is possible to cast between the two, but it's (surprise) undefined behavior.
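A hedged way to surface that assumption at build time instead of discovering it the hard way (is_always_lock_free is C++17; the asserts below hold on mainstream ABIs but, as noted, the standard doesn't promise them):

#include <atomic>

static_assert(sizeof(std::atomic<char>) == sizeof(char),
              "code that casts char* to atomic<char>* assumes no extra padding");
static_assert(std::atomic<char>::is_always_lock_free,
              "a lock-based atomic<char> would also break that assumption");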
> Check out the assembly code generated with -O3 with GCC or LLVM. x86 has natively aligned quads for performance reasons, and compilers have also solved this problem.
You make it sound like a compiler choice (that possibly only occurs when certain optimization levels are used). In fact, it is part of the language spec: everything has to be properly aligned based on the C and C++ rules. This leaks into the ABI too when it comes to structure packing, etc. So alignment is pervasive because those are the rules, and the rules were made for performance, e.g., to support platforms like Alpha which couldn't even _do_ an unaligned read but would have to construct the value from two aligned reads.
> x86 doesn't have any REQUIREMENT for aligned assembly code or data. But x86 is far more efficient when reading/writing from cache-aligned locations. (In particular: reading across an 64-byte boundary naturally results in 2x L1 cache reads instead of 1x L1 cache read).
x86 is not far more efficient when reading/writing aligned locations, in most cases! One of the performance secrets is that unaligned access is really fast on modern x86. There is no penalty at all if the load doesn't cross a cache line, and especially for the 4 and 8 byte reads we are talking about, that would be the usual case. Even when they cross, they are just half as fast: so "only" 1 load/cycle, rather than 2. Yes, 1 unaligned load every cycle. A lot of code is doing less than 1 load/cycle anyway, so you may not bump up against this limit. So on average those types of reads might be a few % slower in a theoretical sense (due to the occasional line crossing), and often close to zero slower in reality.
Of course you could design pathological cases, like page crossing every load...
> x86 doesn't have any REQUIREMENT for aligned assembly code or data
For data, it does, e.g., any SSE instruction with a memory operand, or any of the "aligned" load variants like vmovdqa. People who say "x86 doesn't need alignment" and create unaligned data will eventually get bitten when a compiler inserts one of those instructions for you. So even x86 compilation relies on the language-level alignment rules.
One interesting note about the Alpha is that the reason PuTTY exists is that the creator had a Windows NT Alpha workstation, but there were no native telnet clients that had good terminal emulation for Alpha NT.
I highly suggest watching the recent chat between Jim Keller, one of the Alpha's original designers, and MIT's Lex Fridman (https://youtu.be/Nb2tebYAaOA). I was stunned by Jim's total command of computer architecture and his deep insights about the subject. I will never again utter the phrase 'modern computer' without requisite awe.
Jim Keller and the teams he leads/enables have their fingerprints all over modern CPU architecture - from AMD's K8 (Athlon 64) and Zen (their modern competitive product) to Apple's A4 and A5.
MIPS/SPARC/POWER chips were all meaningfully better than x86 chips at the time too. Alpha had the trick that it could kind of run Windows NT, but it was a waste to bother since 3rd party app support was so spotty.
DEC was at the forefront of so many technologies and set so many standards (first Unix machine, the VT100, the first viable 64-bit processor, the first viable search engine, and much more even before that).
It is sad to see that this company and its products were brought down by pure political maneuvers and hostile takeovers instead of competition or merit.
In 1993 I had a DEC AXP 3000/300L workstation (Alpha with a 32-bit memory bus). It was fast, but the DEC OSF/1 OS was pretty bad and hobbled the machine.
There were lots of portability issues, for instance GNU Emacs wasn't 64-bit clean whereas Lucid Emacs was, so for two decades I ran Lucid/XEmacs instead.
Alpha was the first microprocessor to hit the billion-instructions-per-second mark (BIPS-0) and was used to build a 50 Gbps router in 1998, using Alpha 21164s with the routing code running entirely in L1 cache.
Lightwave 5.5 on the Alpha was about 6x faster than on a Pentium 66 at the time I used it, configured with 256 megs of memory. That was already faster than what I used at home, which was just an Amiga 3000 at 25 MHz with 16 megs of memory.
I had four Digital Ultimate Workstations (=AlphaServer 1200s) with dual 533MHz 21164 Alpha CPUs running Windows 2000 Server back in 2000. They ran Windows 2000 Server much faster than any PC hardware I could get my hands on at the time.
One real-world look at Alpha vs x86 (and MIPS) performance was this .plan post from John Carmack:
-----------------------------------------
John Carmack's .plan for Jun 25, 1997
-----------------------------------------
We got the new processors running in our big compute server today.
We are now running 16 180mhz r10000 processors in an origin2000. Six months ago, that would have been on the list of the top 500 supercomputing systems in the world. I bet they weren't expecting many game companies. :)
(14 to 1 scalability on 16 cpus, and that's including the IO!)
The timings vary somewhat on other tools - qrad3 stresses the main memory a lot harder, and the intel system doesn't scale as well, but I have found these times to be fairly representative. Alpha is almost twice as fast as intel, and mips is in between.
None of these processors are absolutely top of the line - you can get 195 mhz r10k with 4meg L2, 300 mhz PII, and 600 mhz 21164a. Because my codes are highly scalable, we were better off buing more processors at a lower price, rather than the absolute fastest available.
Some comments on the cost of speed:
A 4 cpu pentium pro with plenty of memory can be had for around $20k from bargain integrators. Most of our Quake licensees have one of these.
For about $60k you can get a 4 cpu, 466 mhz alphaserver 4100. Ion Storm has one of these, and it is twice as fast as a quad intel, and a bit faster than six of our mips processors.
That level of performance is where you run into a wall in terms of cost.
To go beyond that with intel processors, you need to go to one of the "enterprise" systems from sequent, data general, ncr, tandem, etc. There are several 8 and 16 processor systems available, and the NUMA systems from sequent and DG theoretically scale to very large numbers of CPUS (32+). The prices are totally fucked. Up to $40k PER CPU! Absolutely stupid.
The only larger alpha systems are the 8200/8400 series from dec, which go up to 12 processors at around $30k per cpu. We almost bought an 8400 over a year ago when there was talk of being able to run NT on it.
Other options are the high end sun servers (but sparc's aren't much faster than intel) and the convex/hp systems (which wasn't shipping when we purchased).
We settled on the SGI origin systems because it ran my codes well, is scalable to very large numbers of processors (128), and the cost was only about $20k per cpu. We can also add Infinite Reality graphics systems if we want to.
I remember reading a book describing the Alpha architecture as an undergrad. It was so much cleaner than x86. I was sad to see it fail in the marketplace, but sadly it's a common story in tech.
The Alpha was running both Unix (Digital UNIX aka OSF/1 AXP aka Tru64 UNIX) and Windows up until 2000 RC1. Even Linux was officially supported in the last years.
Based upon the original comment I'm guessing the system he used was running OpenVMS. Many universities replaced their previous VAX systems running VMS with Alpha servers running OpenVMS.
I actually bought myself an Alpha for my home machine as a grad student in 1996. I think they were remaindering low end machines; the Multia[1], a little 21066a machine probably designed to run WNT. What I remember about it in particular is it came with a bum memory chip, and ... somehow they sent an actual service engineer to come fiddle with my computer in the Berkeley hills. He had to do this twice. Can't imagine what it cost for that kind of service; I think I paid a few hundred bucks for the Multia -I was poor! Kind of guessed DEC needed to adjust their business model.
The computer itself was fairly poorly designed, and I remember the memory bus (and g77 compiler) in the thing kept it only about as fast as a contemporary intel chip; it eventually suicided itself during one of Berkeley's frequent power outages.
I had one of those too, though I feel like I might have gotten it in 1995. I immediately put Linux on it and found that it was quite fast for some tasks where the open source software compiled well on Alpha.
The biggest practical drawback for me was that Netscape wasn't available as an Alpha binary. Running it with the FX!32 x86 emulator was too slow. It performed about as well as a Python-based browser at the time. I ended up using ethernet to run Netscape and other x86-focused software on a 486DX4-100 with remote X display back to the Alpha, which had my nice monitor.
I brought that Multia with me to my first job, and used it to do 64-bit portability work (on Linux) for a middleware/communication project that we were busy porting to many obscure 64-bit HPC platforms. It was fun to have a 64-bit Linux at home, long before x86_64 came onto the scene.
Somehow I don't recall this limitation. Maybe someone compiled a version of Mosaic for it? I probably wasn't very webby at the time; grew up on text based interbutts. Was definitely ringing up the University internets using a 19k modem.
I do remember playing Quake on the thing; that was pretty cool.
I got a Multia surplus in '99. I've had it sitting in the shipping box ever since. I keep meaning to get it out and tinker with it but never have.
I had one customer in the late '90s with an Alpha-based application (based on SQL Server under Windows NT on an AlphaServer 5305). As a tech who had only used x86-based servers up to that point, I found the machine to be very easy to set up and use. It just felt like a strange PC. (I knew it was different-- one look inside was enough to tell that-- but it didn't feel tremendously different than, say, a fat and heavy dual Pentium Pro-based system from Intel in a similar chassis.)
I'd be curious if it still works. The power supply had some fragility that caused it to eventually nuke the mainboard in a power outage situation. Not sure mine made it to Y2K to see if it was all there on that front.
The first one I used was a DEC UNIX based workstation. That thing was pretty cool at the time, circa 1993. I learned a ton about UNIX on it and was really my only access to that type of OS until Linux first came out a year or so later.
Later on, I worked at a healthcare company that used a cluster of Alphas running OpenVMS, or whatever the hell that OS is called. This was circa 2011-2014. It was DEFINITELY NOT COOL ANYMORE. That hardware was really old, and the software on it was subject to frequent restarts due to memory leaks. The company used it to try and operate order intake for a multi-site online pharmacy. The system was impossible to interact with: it had a bizarre TCP-socket-based API, one awful, buggy SOAP service, and otherwise data could only come out via reports generated in a binary file format. Not strictly DEC's fault, as the Alpha hardware did last a long time, but it was generally an awful experience for me to have to deal with that particular system. The company tried and failed to replace it, so they doubled down, bought the source code for a ridiculous sum in the millions of dollars, and proceeded to try and maintain it themselves by hiring crusty old-timers whose people skills were either way out of date or never existed in the first place.
Oh, and the repairs to the hardware were through a local computer salvage firm that basically bought boards and other bits off of EBay. This is a major player in the online pharmacy space, mind you.