The main issue AMD seems to be solving here is yield at 7nm and smaller processes.
Smaller chips mean higher yield in the presence of defects. Intel builds relatively large ~600mm^2 chips (like the XCC, aka the 28-core Xeon), but AMD thinks the future is to build networks of ~200mm^2 chips, like what they've done with Zen / Threadripper / EPYC.
The advantage for AMD is that they've built a single design: the Zeppelin die. RyZen is simply one Zeppelin. Threadripper is two Zeppelins. And EPYC is four Zeppelins.
That's it. One singular chip design, mass produced over and over again, to handle AMD's entire consumer and high-end line. Keeping this one design small helps yields, and maybe lets AMD gain a cost advantage over Intel's larger designs.
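To put rough numbers on it, here's a back-of-envelope sketch using the classic Poisson yield model (the defect density is an assumed illustrative figure, not AMD's or Intel's actual numbers):

    import math

    def poisson_yield(area_mm2, defects_per_mm2=0.001):
        """Fraction of dies with zero defects under a Poisson defect model."""
        return math.exp(-defects_per_mm2 * area_mm2)

    # One big ~800mm^2 die vs four ~200mm^2 chiplets, at 0.1 defects/cm^2:
    big = poisson_yield(800)      # ~45% of monolithic dies come out clean
    small = poisson_yield(200)    # ~82% of chiplets come out clean
    print(f"monolithic yield: {big:.0%}, chiplet yield: {small:.0%}")
    # Silicon cost per *good* mm^2 scales with area / yield:
    print(f"monolithic costs {(800 / big) / (4 * 200 / small):.2f}x more per good mm^2")

Under those assumptions, the big die wastes nearly twice as much silicon per working part, which is the whole argument for chiplets.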
AMD's "mobile" or "APU" line is Raven Ridge (a 2nd design at 193mm^2) that doesn't use this.
----------------
The above is the current status quo. This "active interposer" that AMD is developing in the article would go above and beyond in terms of integration.
Note that HBM2 (next-generation high-bandwidth RAM) requires the interposer. PCB is not good enough for HBM's protocol. Ditto with Hybrid-Memory Cube (a competing standard). So it seems like the future of computer parts will be the interposer.
The interposer isn't necessary for AMD's CPU strategy, however, so the roadmap for this kind of network is 2020 or later on the CPU side. Unless AMD is building this network for their GPU line? (But that roadmap is also past 2020.) I bet this is all research-and-development, and may never come out as a commercial product.
> The advantage for AMD is that they've built a single design: the Zeppelin die. RyZen is simply one Zeppelin. Threadripper is two Zeppelins. And EPYC is four Zeppelins.
While a popular meme, this is not actually true. Epyc is a totally different die: stepping B2, versus the B1 die used in Ryzen and Threadripper.
The 2700X is actually on a different die as well, and Raven Ridge on another. There will probably be another die for Banded Kestrel, if AMD ever gets around to releasing that. Presumably, their embedded SOC products are their own die as well.
So, about 5 dies per generation, across AMD's lineup (Epyc, Ryzen, APU, Atom, and embedded SOC). They're using about half as many dies as Intel is - still a significant difference, but far from the "one die for the whole lineup!" meme.
The big difference is that they're serving the whole server market with one small die, vs the three that Intel uses.
Of course, a small die isn't all roses either - both mfrs limit you to 8 dies per system, so right now with an 8-core die AMD systems are limited to 64-core systems (dual-socket Epyc) versus the 224-core systems that you can do with octo-socket 28-core Xeons. But, not everybody needs a million-dollar octo-socket system either.
Once AMD takes a node advantage that advantage will be diminished somewhat, but Intel's 10nm woes are a whole different story ;)
Steppings are minor revisions, perhaps analogous to a kernel patch revision. So they do have to manufacture the B2 die and the B1 die, and they are not interchangeable, but they are 99.999% identical.
It still means these dies have to be fabbed and binned separately. Epyc and Ryzen are different dies and not interchangeable, they are not "the same die used across the whole lineup".
All the Ryzen 1000 processors (including TR) were B1, at least as far as AMD told the USB-IF (and AFAIK nobody ever observed anything else in the wild). Epyc is on B2.
At the time, there was some speculation that B2 might be the "mirror image" die that Epyc uses 2 of.
However, they now report Pinnacle Ridge (Ryzen 2000) as being on the B2 stepping.
Ryzen 2000 is on a slightly updated 12nm process but does not incorporate any library changes. I'm not sure if you could just drop the existing die onto the new process (given that 12nm is really a 14+), or if you could un-flip the die to the proper pinout using the substrate, but it seems like that might be where they moved to the B2 stepping.
But yeah, AMD produced the B1 stepping for an uncommonly long time. At least through the end of 2017, and the first B2 steppings showed up like June 2017.
Intel's EMIB is a specific design that is cheaper than a full silicon interposer.
As such, EMIB is a methodology which may enable chiplet designs in the future. Well, I guess Intel did marry that Xeon+FPGA chip earlier this year, so EMIB is deployed today. That's probably as "chiplet" as you can get, since the FPGA is actually on the CPU's cache-coherence network.
Alas: Intel's UPI (Ultra Path Interconnect) is designed for PCBs. It's how 2-socket or 8-socket systems communicate. So Intel isn't really doing anything "special" with EMIB there; it's just shrinking down a PCB in some respects.
> My guess is that future APU will be chiplet based design as well.
It really depends on where the "sweet spot" is. Below a certain size, you don't get any yield benefits. While above a certain size, it becomes expensive... and then impractical... to build chips.
AMD's APUs are generally seen as low-cost, cheap chips. At under 200mm^2, it's likely that APUs will simply be more efficiently manufactured as single pieces of silicon.
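As a toy illustration of that sweet spot (every constant here is invented for illustration): extend the yield math with a fixed per-chiplet area overhead for the interconnect PHYs, and the cost curve bottoms out and then climbs back up:

    import math

    def rel_cost(total_mm2, n_chiplets, d=0.001, overhead_mm2=10):
        """Relative silicon cost: n chiplets, each paying a fixed PHY overhead."""
        die = total_mm2 / n_chiplets + overhead_mm2
        return n_chiplets * die / math.exp(-d * die) / total_mm2

    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} chiplets: {rel_cost(800, n):.2f}x")
    # 1 -> 2.28x, 4 -> 1.30x, 8 -> 1.23x, 16 -> 1.27x:
    # splitting helps a lot at first, then the overhead wins.

For a sub-200mm^2 APU, the overhead term dominates almost immediately, which is why a single piece of silicon makes sense there.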
Now, if AMD decided to make a "high end" APU, kinda like their Intel+AMD collaboration project Hades Canyon, then maybe that would use chiplets of some kind. From my understanding, the Hades Canyon collaboration is just running PCIe between the dies however, so it's not really a chiplet yet.
Writing code to run in parallel across a large number of small "chips" is probably the future of programming. In this sense, new programming languages like Julia might prove useful?
So in many regards, these "chiplets" aren't new, at least conceptually. What needs to happen is for the protocols to be redefined and respec'd for today's technology.
In particular: the silicon interposer allows for many more connections than previous technologies, as well as much lower power consumption. So protocols need to be designed around those power budgets and huge "pin counts", so to speak.
Consider HBM2: it's a 1024-bit bus per stack. There are two stacks on a Vega 64, so that's 2048 data wires on the interposer connecting the GPU to the RAM.
This is a level of integration never before seen, even in the MCM world. The future is with larger pin counts at far lower energy costs than before.
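To see what those wires buy you (pin rate taken from Vega 64's published 1.89 Gbps-per-pin HBM2 spec; treat the numbers as approximate):

    stacks = 2            # HBM2 stacks on a Vega 64
    bus_bits = 1024       # data bus width per stack
    gbps_per_pin = 1.89   # 945 MHz, double data rate

    wires = stacks * bus_bits
    print(f"{wires} data wires, {wires * gbps_per_pin / 8:.0f} GB/s")
    # -> 2048 data wires, ~484 GB/s. No PCB can route that many traces
    #    into a package; the silicon interposer can.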
So it's not so much that these problems are purely research. They're closer to engineering. There already exist protocols for cache coherency and fast communication at these levels; they just need to be tweaked for the new scale of things.
There's already been an example of this with the AMD + Intel combination sold by Intel in the Hades Canyon NUC. It's an Intel CPU and an AMD GPU in a single package on some sort of interposer.
In the beginning, a sea of discrete components made up a system. Investment in fab technology caused process nodes to shrink every 18 months, so these discrete components gave way to the System-on-Chip, where the board of chips was replaced by a single chip.
Now physics is harder to overcome, the cost of development at the bleeding edge of technology is higher than ever, and the continued desire for larger and larger systems caused the SoC to break apart again. It’ll be interesting to see if this is what the future looks like for silicon-based chips, or if this is a temporary shortcut.
I'm not sure the SoC is breaking apart, actually. I think chipbuilders are figuring out that it's more efficient to combine some dies together at the package level.
Consider that EPYC is basically a miniaturized multi-socket design. Infinity Fabric is really AMD's new protocol built on top of HyperTransport (their multi-socket protocol from the past). Before, AMD used to support 8 sockets. But today, AMD stitches four dies together and only supports 2 sockets.
From a software perspective (i.e. NUMA), a 2-socket EPYC system looks like an 8-socket system of old. In effect, AMD has miniaturized the 4-socket setup in the form of EPYC. And it has also miniaturized the 2-socket setup in the form of Threadripper.
----------------------
These Threadripper / EPYC chips have the same downsides as all old 2x, 4x, and 8x NUMA designs of the past. High latency and poor communications between cores.
The thing is: the modern environment is a highly virtualized, highly independent set of systems. Running 8x NUMA efficiently today is as simple as spinning up 8x VMs, one for each NUMA node.
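The same trick works below the VM level too. A minimal Linux-only sketch of "one worker per NUMA node" (the sysfs paths are standard Linux; the worker body is a placeholder):

    import glob
    import os
    from multiprocessing import Process

    def node_cpus(node_path):
        """Parse a sysfs cpulist like '0-7,64-71' into a set of CPU ids."""
        cpus = set()
        with open(os.path.join(node_path, "cpulist")) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    def worker(cpus):
        os.sched_setaffinity(0, cpus)  # pin this process to one NUMA node
        # ... memory touched here now tends to stay node-local ...

    if __name__ == "__main__":
        procs = [Process(target=worker, args=(node_cpus(p),))
                 for p in sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))]
        for p in procs: p.start()
        for p in procs: p.join()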
IIRC, people are finding that Intel's 28-core design is far more effective in, say, unified database performance. Intel's design has a true L3 cache shared by all 28 cores, while AMD's L3 cache is split up per CCX. Eight 8MB slices cannot function as a singular cache in a large-scale database application.
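The raw numbers (from the public Skylake-SP and Zeppelin specs) make the point:

    intel_l3 = 38.5          # MB, one L3 shared by all 28 Xeon cores
    epyc_slices = [8] * 8    # MB, one slice per CCX across 4 Zeppelin dies
    print(sum(epyc_slices))  # 64 MB total: bigger on paper...
    print(max(epyc_slices))  # ...but only 8 MB is local to any one CCX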
But there's enough situations (ie: VMs, multitasking, render farms) where AMD's NUMA + Infinity Fabric is good enough. And with prices anywhere from 1/4th to 1/2 the cost of Intel, AMD's chips these days are certainly worth considering.
What is old is new again. We'll probably go in this direction for a decade or three, breaking things up. Then we'll go the other way again, integrating everything into one chip again.
After having programmed for some unusual architectures (CM2, others) I have to say that the GreenArrays chip looked to be... impressively difficult to program for.
Then I played the game "TIS-100" and found out that my intuition was very likely correct.
It certainly does seem like chip design is marching grudgingly from Core i7 to Cell BE and eventually to Connection Machine. Physics doesn’t really care about ease of programming.
The big difference is that the Connection Machine enjoyed being developed in programming languages that were better suited to distributed computing, whereas in current chip design we still need to drag systems developers away from C.
I think it would be interesting to see someone like Mellanox make a chiplet with their tech which could be fully integrated into an AMD SoC or APU or whatever they're calling them now.
Really, the network is the least expensive part. Storage IO that can saturate a 10G link costs around 20k, while the InfiniBand card sets you back less than 4k. Now you're talking about 100G, which will go faster; you could easily be looking at a 500k box of SSDs.
Wait, what? 10G is 1.25 GByte/s; you can get that from a single SSD easily. 100G is 12.5 GByte/s, so 5 consumer-level SSDs. Squeezing 12.5 GByte/s over the buses twice might be tricky, but certainly not a 500k problem.
A single consumer NVMe SSD does around 2600 MB/s read speeds. Or more than double your 10G Ethernet.
------------
RAID-0 eight of them together with ASRock's Ultra Quad M.2 cards, and you've got $1040 of SSDs + $200 for the 2x Ultra Quad cards, or roughly $1200 for 20GB/s read/write speeds. More than enough to saturate any network I'm aware of.
10 gigabit Ethernet is 10 gigabits of data. It's mainly InfiniBand that used a horrible marketing tactic of quoting the signal rate instead of the data rate.
10 gigabit Ethernet is not 1 GB/s. It's 10 gigabits of data per second, or 1.25 gigabytes per second. The encoding is not an issue with these numbers because Ethernet quotes its data rate as a data rate.
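Spelled out (the 2600 MB/s SSD figure is the ballpark number from upthread, not a spec sheet):

    def gbit_to_gbyte(gbit):
        return gbit / 8   # Ethernet already quotes usable data rate

    print(gbit_to_gbyte(10))   # 1.25 GB/s: a single NVMe SSD covers it
    print(gbit_to_gbyte(100))  # 12.5 GB/s: roughly 5 fast NVMe SSDs
    print(8 * 2.6)             # 20.8 GB/s: the 8-drive RAID-0 upthread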
Sounds like it's just the next step in the chain from discrete components -> ICs on a circuit board -> this. The active interposer is filling the role that the circuit board currently fills, with devices etched into the interposer filling the role of discrete components, making everything more compact.
Any hardware gurus out there care to talk about how this helps? I guess having a flat pool of heterogeneous resources is nice. As long as there's a decent SDK that abstracts the hard stuff away, I'm all for it.
It won't be user-facing, so there won't be an SDK. It's a way of building chips better, like AMD's Infinity Fabric. You could integrate GPUs, multiple CPU dies (like Epyc), and DRAM on a single package and tie them all together with interposers, which would look to the user like a CPU with an integrated GPU and a big L4 cache.
Using multiple small dies and tying them together has several advantages. Small dies yield better, so sometimes several small dies are cheaper than one large one. There's also versatility because you can mix and match components.
As I see it, this is just a network on interposer instead of a network on chip (NoC). NoCs have simple routing rules that also prevent deadlocks, so I am not sure the idea here is that significant. The active interposer is a quite new idea, though; I haven't followed it. Maybe the journalist found the routing rules more exciting than the active-interposer idea.
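For reference, the kind of "simple rule" NoCs use is dimension-ordered (XY) routing on a 2D mesh: route fully in X first, then in Y. Forbidding the Y-to-X turn removes cyclic channel dependencies, which is what prevents deadlock. A minimal sketch:

    def xy_next_hop(cur, dst):
        """Dimension-ordered routing on a 2D mesh: X first, then Y."""
        (cx, cy), (dx, dy) = cur, dst
        if cx != dx:
            return (cx + (1 if dx > cx else -1), cy)
        if cy != dy:
            return (cx, cy + (1 if dy > cy else -1))
        return cur  # already at the destination

    hop = (0, 0)
    while hop != (2, 1):
        hop = xy_next_hop(hop, (2, 1))
        print(hop)  # (1, 0) -> (2, 0) -> (2, 1)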
Either way, the research is one of the many small steps forward to better chips.
I don't think that's even needed. I just assume what you'll see from inside the program is some resources connected to some sort of very fast interconnect. It's nowhere near what Cell was - a really bad barrel-like PPC core that had to manage DMA to feed a bunch of marginally-smarter-than-DSP cores (with an incompatible ISA, obviously) that had no access to the main memory.
Pointing to one example of someone failing to do something in the past is not strong evidence it won't happen this time. If anything, there were lessons learned from that failure.
A more modular package brings a variety of benefits. Small chips are cheaper per unit area, because yields are higher. Reuse in different products can also lower development costs. Different IP can live on the most appropriate technology node: mature & cheap, new & fast, etc. IP can be developed on different schedules. IP from completely different companies can be integrated more easily (see the Intel-AMD partnership and its "chiplet").
It's really not that complicated an idea- modular packages are more flexible! What's new is making it work within a compelling power, price, and performance envelope.
Is the big deal about this that what used to be different cards/chips now share a CPU-style, insanely fast bus, instead of trickling stuff over PCIe or DRAM channels? If that is the case, the advantage will depend on the amount of bus saturation to be eliminated. Should be interesting. Everything works on Infinity Fabric!
That’s the advantage over having multiple packaged chip on a traditional PCB. The advantage over an SoC is that you can have different subsystems on decoupled development schedules and different process nodes all come together on the same “chip.”
With HBM2 being used in AMD's (and NVidia's) high-end products, it seems like the DRAM Channels are going to require an expensive interposer.
But "what else" can benefit from an interposer? If your RAM requires it, are there cheaper or more efficient designs that are built out of a network of chiplets on an interposer, as opposed to building out huge chips all the time?
AMD is already forced to build an interposer for Vega64. Might as well research other uses of it.
I wonder if FPGAs could be a node too. That would allow us to mix programmable, highly parallel analog modeled acceleration onto the same high speed / direct connect bus as all the other fixed, traditional computing components. I really like the approach AMD is suggesting here, treating it like nodes on a network.
You've been able to buy FPGAs tightly coupled with AMD CPUs since 2006 or so. The tech back then was to either plug them into an HTX Hypertransport slot, or in a cpu socket. Very few customers actually wanted to buy these things and I think all of the makers lost money.
They're definitely missing an application, one that probably won't be coming until specialized chip designs are a commodity (e.g. agi+). Right now you can get specialized chips for much less than the cost of the programmable chip + custom design. Once a company identifies a kick-ass FPGA design (and it has any decent market), they move to an ASIC to drive down costs. I guess if FPGA costs were as low as an ASIC's, it would be viable, since you could change the application depending on current needs. But currently FPGA chips are 10-100x the cost of ASICs.
Hypothetically, yes. But FPGAs are very area-hungry; a big, powerful one isn't going to fit nicely into a modestly sized package along with lots of other chips.
Sounds somewhat similar to what Intel started offering recently, in collaboration with AMD. Basically an Intel CPU and an AMD GPU in a single package, for use in laptops that needed something beefier than an Intel GPU but didn't have the volume allowance for a full GPU card.
They also have a budget Ryzen that does the same thing but with their own Vega graphics. I am using one now for web development and distributed ledger development.
My guess is modular will still be an option, if not the only option. x86 PCs have really been modular PCBs for over 20 years. You are now starting to see a move toward SoMs take off in the embedded world, which combine the CPU and RAM on a module to simplify board layout. In the past, these would have been separate components that customers laid down themselves on their custom PCB.
That’s obviously not a hardware development, but I feel like the motivation may be similar: make components more modular; stabilize, standardize and align their interfaces.
By making these components “plug and play” the distance between a logical flow chart and the actual implementation is somewhat reduced, making the development of custom components more efficient and agile.