The main issue AMD seems to be solving here is yield at 7nm and smaller processes.
Smaller chips mean higher yield in the presence of defects. Intel builds relatively large ~600mm^2 chips (like the XCC, aka the 28-core Xeon), but AMD thinks the future is to build networks of ~200mm^2 chips, like what they've done with Zen / Threadripper / EPYC.
The advantage for AMD is that they've built a single design: the Zeppelin die. RyZen is simply one Zeppelin. Threadripper is two Zeppelins. And EPYC is four Zeppelins.
That's it. One singular chip design, mass produced over and over again, to handle AMD's entire consumer and high-end line. Keeping this one design small helps yields, and maybe lets AMD gain a cost advantage over Intel's larger designs.
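To put rough numbers on it, here's a back-of-envelope sketch using the classic Poisson yield model (the defect density is an assumed illustrative figure, not AMD's or Intel's actual numbers):

    import math

    def poisson_yield(area_mm2, defects_per_mm2=0.001):
        """Fraction of dies with zero defects under a Poisson defect model."""
        return math.exp(-defects_per_mm2 * area_mm2)

    # One big ~800mm^2 die vs four ~200mm^2 chiplets, at 0.1 defects/cm^2:
    big = poisson_yield(800)      # ~45% of monolithic dies come out clean
    small = poisson_yield(200)    # ~82% of chiplets come out clean
    print(f"monolithic yield: {big:.0%}, chiplet yield: {small:.0%}")
    # Silicon cost per *good* mm^2 scales with area / yield:
    print(f"monolithic costs {(800 / big) / (4 * 200 / small):.2f}x more per good mm^2")

Under those assumptions, the big die wastes nearly twice as much silicon per working part, which is the whole argument for chiplets.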
AMD's "mobile" or "APU" line is Raven Ridge (a 2nd design at 193mm^2) that doesn't use this.
----------------
The above is the current status quo. This "active interposer" that AMD is developing in the article would go above and beyond in terms of integration.
Note that HBM2 (next-generation high-bandwidth RAM) requires the interposer. PCB is not good enough for HBM's protocol. Ditto with Hybrid-Memory Cube (a competing standard). So it seems like the future of computer parts will be the interposer.
The interposer isn't necessary for AMD's CPU strategy, however, so the roadmap for this kind of network is 2020 or later on the CPU side. Unless AMD is building this network for their GPU line? (But that roadmap is also past 2020.) I bet this is all research-and-development, and may never come out as a commercial product.
> The advantage for AMD is that they've built a single design: the Zeppelin die. RyZen is simply one Zeppelin. Threadripper is two Zeppelins. And EPYC is four Zeppelins.
While a popular meme, this is not actually true. Epyc is a totally different die: stepping B2, versus the B1 die used in Ryzen and Threadripper.
The 2700X is actually on a different die as well, and Raven Ridge on another. There will probably be another die for Banded Kestrel, if AMD ever gets around to releasing that. Presumably, their embedded SOC products are their own die as well.
So, about 5 dies per generation, across AMD's lineup (Epyc, Ryzen, APU, Atom, and embedded SOC). They're using about half as many dies as Intel is - still a significant difference, but far from the "one die for the whole lineup!" meme.
The big difference is that they're serving the whole server market with one small die, vs the three that Intel uses.
Of course, a small die isn't all roses either - both mfrs limit you to 8 dies per system, so right now with an 8-core die AMD systems are limited to 64-core systems (dual-socket Epyc) versus the 224-core systems that you can do with octo-socket 28-core Xeons. But, not everybody needs a million-dollar octo-socket system either.
Once AMD takes a node advantage that advantage will be diminished somewhat, but Intel's 10nm woes are a whole different story ;)
Steppings are minor revisions, perhaps analogous to a kernel patch revision. So they do have to manufacture the B2 die and the B1 die, and they are not interchangeable, but they are 99.999% identical.
It still means these dies have to be fabbed and binned separately. Epyc and Ryzen are different dies and not interchangeable, they are not "the same die used across the whole lineup".
All the Ryzen 1000 processors (including TR) were B1, at least as far as AMD told the USB-IF (and AFAIK nobody ever observed anything else in the wild). Epyc is on B2.
At the time, there was some speculation that B2 might be the "mirror image" die that Epyc uses 2 of.
However, they now report Pinnacle Ridge (Ryzen 2000) as being on the B2 stepping.
Ryzen 2000 is on a slightly updated 12nm process but does not incorporate any library changes. I'm not sure if you could just drop the existing die onto the new process (given that 12nm is really a 14+), or if you could un-flip the die to the proper pinout using the substrate, but it seems like that might be where they moved to the B2 stepping.
But yeah, AMD produced the B1 stepping for an uncommonly long time. At least through the end of 2017, and the first B2 steppings showed up like June 2017.
Intel's EMIB is a specific design that is cheaper than a full silicon interposer.
As such, EMIB is a methodology which may enable chiplet designs in the future. Well, I guess Intel did marry that Xeon+FPGA chip earlier this year, so EMIB is deployed today. That's probably as "chiplet" as you can get, since the FPGA is actually on the CPU's cache-coherence network.
Alas: Intel's UPI (Ultra Path Interconnect) is designed for PCBs. It's how 2-socket or 8-socket systems communicate. So Intel isn't really doing anything "special" with EMIB there; it's just shrinking down a PCB in some respects.
> My guess is that future APU will be chiplet based design as well.
It really depends on where the "sweet spot" is. Below a certain size, you don't get any yield benefits. While above a certain size, it becomes expensive... and then impractical... to build chips.
AMD's APUs are generally seen as low-cost, cheap chips. At under 200mm^2, it's likely that APUs will simply be more efficiently manufactured as single pieces of silicon.
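As a toy illustration of that sweet spot (every constant here is invented for illustration): extend the yield math with a fixed per-chiplet area overhead for the interconnect PHYs, and the cost curve bottoms out and then climbs back up:

    import math

    def rel_cost(total_mm2, n_chiplets, d=0.001, overhead_mm2=10):
        """Relative silicon cost: n chiplets, each paying a fixed PHY overhead."""
        die = total_mm2 / n_chiplets + overhead_mm2
        return n_chiplets * die / math.exp(-d * die) / total_mm2

    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} chiplets: {rel_cost(800, n):.2f}x")
    # 1 -> 2.28x, 4 -> 1.30x, 8 -> 1.23x, 16 -> 1.27x:
    # splitting helps a lot at first, then the overhead wins.

For a sub-200mm^2 APU, the overhead term dominates almost immediately, which is why a single piece of silicon makes sense there.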
Now, if AMD decided to make a "high end" APU, kinda like their Intel+AMD collaboration project Hades Canyon, then maybe that would use chiplets of some kind. From my understanding, the Hades Canyon collaboration is just running PCIe between the dies however, so it's not really a chiplet yet.
Writing code to run in parallel across a large number of small "chips" is probably the future of programming. In this sense, new programming languages like Julia might prove useful?
So in many regards, these "chiplets" aren't new, at least conceptually. What needs to happen is for the protocols to be redefined and respec'd for today's technology.
In particular: the silicon interposer allows for many more connections than previous technologies, as well as much lower power consumption. So protocols need to be designed around those power budgets and huge "pin counts", so to speak.
Consider HBM2: it's a 1024-bit bus per stack. There are two stacks on a Vega 64, so that's 2048 data wires on the interposer connecting the GPU to the RAM.
This is a level of integration never before seen, even in the MCM world. The future is with larger pin counts at far lower energy costs than before.
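To see what those wires buy you (pin rate taken from Vega 64's published 1.89 Gbps-per-pin HBM2 spec; treat the numbers as approximate):

    stacks = 2            # HBM2 stacks on a Vega 64
    bus_bits = 1024       # data bus width per stack
    gbps_per_pin = 1.89   # 945 MHz, double data rate

    wires = stacks * bus_bits
    print(f"{wires} data wires, {wires * gbps_per_pin / 8:.0f} GB/s")
    # -> 2048 data wires, ~484 GB/s. No PCB can route that many traces
    #    into a package; the silicon interposer can.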
So it's not so much that these problems are purely research. They're closer to engineering. There already exist protocols for cache coherency and fast communication at these levels; they just need to be tweaked for the new scale of things.
There's already been an example of this with the AMD + Intel combination sold by Intel in the Hades Canyon NUC. It's an Intel CPU and an AMD GPU in a single package on some sort of interposer.
In the beginning, a sea of discrete components made up a system. Investment in fab technology caused process nodes to shrink every 18 months, so these discrete components gave way to the System-on-Chip, where the board of chips was replaced by a single chip.
Now physics is harder to overcome, the cost of development at the bleeding edge of technology is higher than ever, and the continued desire for larger and larger systems caused the SoC to break apart again. It’ll be interesting to see if this is what the future looks like for silicon-based chips, or if this is a temporary shortcut.
I'm not sure the SoC is breaking apart, actually. I think chipbuilders are figuring out that it's more efficient to combine some dies together at the package level.
Consider that EPYC is basically a miniaturized multi-socket design. Infinity Fabric is really AMD's new protocol built on top of HyperTransport (their multi-socket protocol from the past). Before, AMD used to support 8 sockets. But today, AMD stitches four dies together and only supports 2 sockets.
From a software perspective (i.e. NUMA), a 2-socket EPYC system looks like an 8-socket system of old. In effect, AMD has miniaturized the 4-socket setup in the form of EPYC. And it has also miniaturized the 2-socket setup in the form of Threadripper.
----------------------
These Threadripper / EPYC chips have the same downsides as all old 2x, 4x, and 8x NUMA designs of the past. High latency and poor communications between cores.
The thing is: the modern environment is a highly virtualized, highly independent set of systems. Running 8x NUMA efficiently today is as simple as spinning up 8x VMs, one for each NUMA node.
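The same trick works below the VM level too. A minimal Linux-only sketch of "one worker per NUMA node" (the sysfs paths are standard Linux; the worker body is a placeholder):

    import glob
    import os
    from multiprocessing import Process

    def node_cpus(node_path):
        """Parse a sysfs cpulist like '0-7,64-71' into a set of CPU ids."""
        cpus = set()
        with open(os.path.join(node_path, "cpulist")) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    def worker(cpus):
        os.sched_setaffinity(0, cpus)  # pin this process to one NUMA node
        # ... memory touched here now tends to stay node-local ...

    if __name__ == "__main__":
        procs = [Process(target=worker, args=(node_cpus(p),))
                 for p in sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))]
        for p in procs: p.start()
        for p in procs: p.join()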
IIRC, people are finding that Intel's 28-core design is far more effective in, say, unified database performance. Intel's design has a true L3 cache shared by all 28 cores, while AMD's L3 cache is split up per CCX. Eight 8MB slices cannot function as a singular cache in a large-scale database application.
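The raw numbers (from the public Skylake-SP and Zeppelin specs) make the point:

    intel_l3 = 38.5          # MB, one L3 shared by all 28 Xeon cores
    epyc_slices = [8] * 8    # MB, one slice per CCX across 4 Zeppelin dies
    print(sum(epyc_slices))  # 64 MB total: bigger on paper...
    print(max(epyc_slices))  # ...but only 8 MB is local to any one CCX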
But there's enough situations (ie: VMs, multitasking, render farms) where AMD's NUMA + Infinity Fabric is good enough. And with prices anywhere from 1/4th to 1/2 the cost of Intel, AMD's chips these days are certainly worth considering.
What is old is new again. We'll probably go in this direction for a decade or three, breaking things up. Then we'll go the other way again, integrating everything into one chip again.
After having programmed for some unusual architectures (CM2, others) I have to say that the GreenArrays chip looked to be... impressively difficult to program for.
Then I played the game "TIS-100" and found out that my intuition was very likely correct.
It certainly does seem like chip design is marching grudgingly from Core i7 to Cell BE and eventually to Connection Machine. Physics doesn’t really care about ease of programming.
The big difference is that the Connection Machine enjoyed being developed in programming languages that were better suited to distributed computing, whereas in current chip design we still need to drag systems developers away from C.
I think it would be interesting to see someone like Mellanox make a chiplet with their tech which could be fully integrated into an AMD SoC or APU or whatever they're calling them now.
Really, the network is the least expensive part. Storage IO that can saturate a 10G link costs around 20k, while the InfiniBand card sets you back less than 4k. Now you're talking about 100G, which will go faster; you could easily be looking at a 500k box of SSDs.
Wait, what? 10G is 1.25 GByte/s; you can get that from a single SSD easily. 100G is 12.5 GByte/s, so 5 consumer-level SSDs. Squeezing 12.5 GByte/s over the buses twice might be tricky, but certainly not a 500k problem.
A single consumer NVMe SSD does around 2600 MB/s read speeds. Or more than double your 10G Ethernet.
------------
RAID-0 eight of them together with ASRock's Ultra Quad M.2 cards, and you've got $1040 of SSDs + $200 for the 2x Ultra Quad cards, or roughly $1200 for 20GB/s read/write speeds. More than enough to saturate any network I'm aware of.
10 gigabit Ethernet is 10 gigabits of data. It's mainly InfiniBand that used a horrible marketing tactic of quoting the signal rate instead of the data rate.
10 gigabit Ethernet is not 1 GB/s. It's 10 gigabits of data per second, or 1.25 gigabytes per second. The encoding is not an issue with these numbers because Ethernet quotes its data rate as a data rate.
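Spelled out (the 2600 MB/s SSD figure is the ballpark number from upthread, not a spec sheet):

    def gbit_to_gbyte(gbit):
        return gbit / 8   # Ethernet already quotes usable data rate

    print(gbit_to_gbyte(10))   # 1.25 GB/s: a single NVMe SSD covers it
    print(gbit_to_gbyte(100))  # 12.5 GB/s: roughly 5 fast NVMe SSDs
    print(8 * 2.6)             # 20.8 GB/s: the 8-drive RAID-0 upthread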
Sounds like it's just the next step in the chain from discrete components -> ICs on a circuit board -> this. The active interposer is filling the role that the circuit board currently fills, with devices etched into the interposer filling the role of discrete components, making everything more compact.
Any hardware gurus out there care to talk about how this helps? I guess having a flat pool of heterogeneous resources is nice. As long as there's a decent SDK that abstracts the hard stuff away, I'm all for it.
It won't be user-facing, so there won't be an SDK. It's a way of building chips better, like AMD's Infinity Fabric. You could integrate GPUs, multiple CPU dies (like Epyc), and DRAM on a single package and tie them all together with interposers, which would look to the user like a CPU with an integrated GPU and a big L4 cache.
Using multiple small dies and tying them together has several advantages. Small dies yield better, so sometimes several small dies are cheaper than one large one. There's also versatility because you can mix and match components.
As I see it, this is just a network on interposer instead of a network on chip (NoC). NoCs have simple routing rules that also prevent deadlocks, so I am not sure the idea here is that significant. The active interposer is a quite new idea, though; I haven't followed it. Maybe the journalist found the routing rules more exciting than the active-interposer idea.
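For reference, the kind of "simple rule" NoCs use is dimension-ordered (XY) routing on a 2D mesh: route fully in X first, then in Y. Forbidding the Y-to-X turn removes cyclic channel dependencies, which is what prevents deadlock. A minimal sketch:

    def xy_next_hop(cur, dst):
        """Dimension-ordered routing on a 2D mesh: X first, then Y."""
        (cx, cy), (dx, dy) = cur, dst
        if cx != dx:
            return (cx + (1 if dx > cx else -1), cy)
        if cy != dy:
            return (cx, cy + (1 if dy > cy else -1))
        return cur  # already at the destination

    hop = (0, 0)
    while hop != (2, 1):
        hop = xy_next_hop(hop, (2, 1))
        print(hop)  # (1, 0) -> (2, 0) -> (2, 1)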
Either way, the research is one of the many small steps forward to better chips.
I don't think that's even needed. I just assume what you'll see from inside the program is some resources connected to some sort of very fast interconnect. It's nowhere near what Cell was - a really bad barrel-like PPC core that had to manage DMA to feed a bunch of marginally-smarter-than-DSP cores (with an incompatible ISA, obviously) that had no access to the main memory.
Pointing to one example of someone failing to do something in the past is not strong evidence it won't happen this time. If anything, there were lessons learned from that failure.
A more modular package brings a variety of benefits. Small chips are cheaper per unit area, because yields are higher. Reuse in different products can also lower development costs. Different IP can live on the most appropriate technology node: mature & cheap, new & fast, etc. IP can be developed on different schedules. IP from completely different companies can be integrated more easily (see the Intel-AMD partnership and its "chiplet").
It's really not that complicated an idea- modular packages are more flexible! What's new is making it work within a compelling power, price, and performance envelope.
Is the big deal about this that what used to be different cards/chips now share a CPU-style, insanely fast bus, instead of trickling stuff over PCIe or DRAM channels? If that is the case, the advantage will depend on the amount of bus saturation to be eliminated. Should be interesting. Everything works on Infinity Fabric!
That’s the advantage over having multiple packaged chip on a traditional PCB. The advantage over an SoC is that you can have different subsystems on decoupled development schedules and different process nodes all come together on the same “chip.”
With HBM2 being used in AMD's (and NVidia's) high-end products, it seems like the DRAM Channels are going to require an expensive interposer.
But "what else" can benefit from an interposer? If your RAM requires it, are there cheaper or more efficient designs that are built out of a network of chiplets on an interposer, as opposed to building out huge chips all the time?
AMD is already forced to build an interposer for Vega64. Might as well research other uses of it.
I wonder if FPGAs could be a node too. That would allow us to mix programmable, highly parallel analog modeled acceleration onto the same high speed / direct connect bus as all the other fixed, traditional computing components. I really like the approach AMD is suggesting here, treating it like nodes on a network.
You've been able to buy FPGAs tightly coupled with AMD CPUs since 2006 or so. The tech back then was to either plug them into an HTX Hypertransport slot, or in a cpu socket. Very few customers actually wanted to buy these things and I think all of the makers lost money.
They're definitely missing an application, one that probably won't be coming until specialized chip designs are a commodity (e.g. agi+). Right now you can get specialized chips for much less than the cost of the programmable chip + custom design. Once a company identifies a kick-ass FPGA design (and it has any decent market), they move to an ASIC to drive down costs. I guess if FPGA costs were as low as an ASIC's, it would be viable, since you could change the application depending on current needs. But currently FPGA chips are 10-100x the cost of ASICs.
Hypothetically, yes. But FPGAs are very area-hungry; a big, powerful one isn't going to fit nicely into a modestly sized package along with lots of other chips.
Sounds somewhat similar to what Intel started offering recently, in collaboration with AMD. Basically an Intel CPU and an AMD GPU in a single package, for use in laptops that needed something beefier than an Intel GPU but didn't have the volume allowance for a full GPU card.
They also have a budget Ryzen that does the same thing but with their own Vega graphics. I am using one now for web development and distributed ledger development.
My guess is modular will still be an option, if not the only option. x86 PCs have really been modular PCBs for over 20 years. You are now starting to see a move toward SoMs take off in the embedded world, which combine the CPU and RAM on a module to simplify board layout. In the past, these would have been separate components that customers laid down themselves on their custom PCB.
That’s obviously not a hardware development, but I feel like the motivation may be similar: make components more modular; stabilize, standardize and align their interfaces.
By making these components “plug and play” the distance between a logical flow chart and the actual implementation is somewhat reduced, making the development of custom components more efficient and agile.