512GB of unified memory is truly breaking new ground. I was wondering when Apple would overcome memory constraints, and now we're seeing a half-terabyte level of unified memory. This is incredibly practical for running large AI models locally ("600 billion parameters"), and Apple's approach of integrating this much efficient memory on a single chip is fascinating compared to NVIDIA's solutions.
I'm curious about how this design of "fusing" two M3 Max chips performs in terms of heat dissipation and power consumption, though.
They didn't increase the memory bandwidth; it's the same memory bandwidth that's available on the M2 Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.
The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with the new chip's increased processing speed for AI, having enough uRAM doesn't help much when the memory bandwidth stays the same.
So there must be an optimal ratio between memory bandwidth, the size of the memory pool, and the processing power.
Since no one specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect in the ballpark of 20-30 tok/s for text generation (depending on how much of the memory bandwidth can actually be utilized).
From my napkin math, the M3 Ultra's TFLOPs are still relatively low (around 43 FP16 TFLOPs?), but that should be more than enough to handle bs=1 token generation (which should be way under 10 FLOPs/byte for inference). Now, as far as its prefill/prompt processing speed goes... well, that's another matter.
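A rough version of that napkin math in Python (the bandwidth-utilization factor and bytes-per-parameter are my assumptions, not measured numbers):

```python
# Bandwidth-bound estimate for bs=1 decode on an M3 Ultra (all figures approximate).
bandwidth_gbs = 819        # advertised memory bandwidth, GB/s
utilization = 0.6          # assumed fraction of that bandwidth actually achievable
active_params = 37e9       # DeepSeek-R1 active parameters per token (MoE)
bytes_per_param = 0.57     # ~4.5 bits/weight average for a Q4_K_M GGUF (assumption)

bytes_per_token = active_params * bytes_per_param
tok_per_s = bandwidth_gbs * 1e9 * utilization / bytes_per_token
print(f"~{tok_per_s:.0f} tok/s")   # lands in the ~20-25 tok/s ballpark

# FLOPs/byte sanity check: ~2 FLOPs per weight read -> ~2 / 0.57 ≈ 3.5 FLOPs/byte,
# far below what ~43 FP16 TFLOPs can feed, so decode stays memory-bandwidth-bound.
```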
I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.
Pretty sure this has absolutely nothing to do with DeepSeek, or even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp came around.
Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and six years ago that was the expectation of their high-end customer base. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLMs are the cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double the supported RAM in their SoCs to get there, perhaps in a Mac Pro or a different quad-Max-glued chip.
I understand why they are excited about it—just pointing out that it is a happy coincidence. They would have, and should have, made such a product to address the needs of RAM users alone, not VRAM in particular, before they could make a credible case for cutting macOS releases on Intel.
RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]
Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512, 1 meg, or 2 meg max of RAM addressable by the custom chips ("chip RAM".)
That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.
That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but everyone was still looking at going larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.
The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.
Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.
Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?
"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).
It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)
It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.
I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.
The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.
I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.
This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.
That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus Apple is using OpenAI to provide its larger models anyway, so the need never even existed.
My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory, but it's almost impossible to do that so close to a launch, especially when the memory is fused rather than just a module you can swap.
Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.
I think by "fuse" I meant it's stuck onto the SoC module, not part of the SoC, as I may have worded it. While you could maybe still add NANDs later in the manufacturing process, it's probably not easy, especially if you need more NANDs and a larger module, which might cause more design problems. The NAND is closer because the controller is in the SoC. So the memory controller would probably also change with higher memory sizes, which means this cannot have been a last-minute change.
An M3 Ultra is two M3 Max chips connected via fabric, so physics.
Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.
Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.
I would be curious about the context window size that would be expected when generating a ballpark 20 to 30 tokens per second using DeepSeek-R1 Q4 on this hardware?
Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.
> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.
"Memory bandwidth usage should be limited to the 37B active parameters."
Can someone do a deep dive on the above quote? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.
Context window?
How much of the model can actively be processed, despite being fully loaded into memory, given the memory bandwidth?
With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.
What people who did not actually work with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence length 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn, is a problem on a weak machine like this Mac or the "EPYC build" someone else mentioned). Even for inference, a batch size greater than 1 (more than one inference at a time) or a sequence length greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
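A toy illustration of that batching effect, with made-up expert counts (not DeepSeek's actual config):

```python
# As batch size grows, different tokens route to different experts, so the fraction
# of expert weights you must read per forward pass creeps back toward 100%.
import random

n_experts, top_k = 256, 8   # placeholder values: experts per layer / experts per token
for batch in (1, 4, 16, 64):
    touched = set()
    for _ in range(batch):
        touched.update(random.sample(range(n_experts), top_k))
    print(f"batch={batch:3d}: ~{len(touched) / n_experts:.0%} of expert weights read")
```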
No one should be buying this for batch inference obviously.
I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.
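The back-of-envelope check, for anyone curious:

```python
# Does 175B-parameter GPT-3 at 16-bit precision fit in 512 GB of unified memory?
params = 175e9
weights_gb = params * 2 / 1e9               # 2 bytes per FP16 weight
print(f"~{weights_gb:.0f} GB of weights")   # ~350 GB, leaving room for KV cache and the OS
```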
This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.
It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since there are many people buying them impulsively).
Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.
I don't know. Can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go for cheap on eBay. I've had an easy time selling old Macs, though; the high-end desktop market might be simple too.
Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is JAX. Idk why Apple doesn't implement proper support; it seems important. There's MLX, which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.
This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.
I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.
I hear you. I make these decisions for a public company.
When engineers tell me they want to run their own models in the cloud, I tell them they are free to play with it, but that isn't a project going on the roadmap. OpenAI/Anthropic and others are much cheaper in terms of tokens/dollar thanks to economies of scale.
There is still value in running your own models for privacy reasons, however, and that's why I pay attention to efforts to reduce the cost of running models locally or in your own cloud provider.
Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.
> The question is whether an LLM will run with usable performance at that scale.
This is the big question to have answered. Many people claim Apple can now reliably be used as an ML workstation, but from the benchmark numbers I've seen, the models may fit in memory, yet the tok/sec performance is so slow that it doesn't feel worth it compared to running on NVIDIA hardware.
Although it'd be expensive as hell to get 512GB of VRAM from NVIDIA today, maybe moves like this from Apple could push the prices down at least a little bit.
It is much slower than nVidia, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205
Could you maybe share a lightweight benchmark, with the exact model (+ quantization if you're using it), the runtime, the settings used, and how many tokens/second you're getting? Or just a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?
Also, it would be neat if you could say which AI services you're subscribed to; there is a huge difference between a paid Claude subscription and the OpenAI Pro subscription, for example, both in terms of cost and the quality of responses.
Hm, five years of those AI services cost half as much as a minimal M4 Max configuration, which can barely run a severely lobotomized LLaMA 70B. And they provide significantly better models.
Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.
Do we know if it is slower because the hardware is not as well suited for the task, or is it mostly a software issue -- the code hasn't been optimized to run on Apple Silicon?
AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.
The neural engine is perfectly capable of accelerating matmuls. It's just that autoregressive decoding in single-batch LLM inference is memory-bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although there's a huge power-efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.
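If anyone wants to poke at the GPU path themselves, a minimal MLX sketch (assumes `pip install mlx` on an Apple Silicon Mac; the matrix size is arbitrary):

```python
# Time a single fp32 matmul on the Apple GPU via MLX.
import time
import mlx.core as mx

n = 4096
a = mx.random.normal((n, n))
b = mx.random.normal((n, n))
mx.eval(a, b)                     # materialize inputs before timing

start = time.perf_counter()
c = a @ b
mx.eval(c)                        # MLX is lazy; eval() forces the GPU work to run
elapsed = time.perf_counter() - start

print(f"~{2 * n**3 / elapsed / 1e12:.1f} TFLOPs (fp32, single matmul, very rough)")
```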
I downvote all Reddit-style memes, jokes, reference humor, catchphrases, and so on. It’s low-effort content that doesn’t fit the vibe of HN and actively makes the site worse for its intended purpose.
Any idea what the SRAM-to-uRAM ratio is on these new GPUs? If they have meaningfully more SRAM than the Hopper GPUs, it could lead to meaningful speedups in large-model training.
If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it right? No speedups
For any speedups, you may need some new variant of FlashAttention-3, or something along similar lines, purpose-built for Apple GPUs.
Not that dramatic of an increase actually - the M2 Max already had 400GB/s and M2 Ultra 800GB/s memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4's additional 146GB/s is indeed a more noticeable improvement.
For inference, the bandwidth generally isn't parallelized because the computation has to go through the model layer by layer. The most common model-splitting method assigns each GPU a subset of the LLM's layers, and it doesn't take much bandwidth to send the intermediate activations over PCIe to the next GPU.
My understanding is that the GPU must still load its assigned layer from VRAM into registers and L2 cache for every token, because those aren't large enough to hold a significant portion. So naively, for a 24GB layer, you'd need to move up to 24GB for every token.
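Rough numbers for that split, with a placeholder hidden size (not any specific model's):

```python
# Per-token traffic when layers are pipelined across GPUs: each GPU re-reads its own
# weight slice from local VRAM, but only a small activation vector crosses PCIe.
hidden = 8192                  # assumed hidden dimension
layer_slice_gb = 24            # the 24GB slice from the comment above
act_kb = hidden * 2 / 1024     # fp16 activations handed to the next GPU

print(f"local VRAM reads: ~{layer_slice_gb} GB/token, PCIe hop: ~{act_kb:.0f} KB/token")
```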
My M1 Max regularly pushes 1000+ tabs without breaking a sweat, I feel like this particular metric is no longer useful now that background tab memory is almost always unloaded by the browser.
I'm not sure that unified memory is particularly relevant for that-- so e.g. on zen4/zen5 epyc there is more than enough arithmetic power that LLM inference is purely memory bandwidth limited.
On dual (SP5) Epyc I believe the memory bandwidth is somewhat greater than this apple product too... and at apple's price points you can have about twice the ram too.
Presumably the apple solution is more power efficient.
Nope, it was common in 8- and 16-bit home computers, and as for PCs themselves, graphics memory was mapped into main memory until the arrival of dedicated 3D cards.
And even with 3D, integrated GPUs have existed for years.
The CPUs with iGPUs didn't also have the memory on-chip. The Nintendo 64 did. Not sure about the old home computers, but I thought those had separate memory usually.
Of course not, because they are not designed as SoCs; the only memory on chip is cache. That doesn't change the fact that the memory is one whole block shared between the CPU and iGPU.
New for performance machines maybe. I remember "integrated graphics" when that meant some shitty co-processor and 16 or 32MB of semi-reserved system RAM.
Some possible groups of reasons:
1. Until recently RAM amount was something the end user liked to configure, so little market demand.
2. Technically, building such a large system on a chip or collection of chiplets was not possible.
3. RAM speed wasn't a bottleneck for most tasks, it was IO or CPU. LLMs changed this.
The M1 is in a product segment where discrete GPUs have been gone for decades, in favor of integrated graphics that shares one pool of RAM with the CPU. The better question to ask is why Apple kept using that unified memory design even when moving up to larger chips like the M1 Max and M1 Ultra.
The GPU is built into the same physical die as the CPU.
So if you wanted to give it a second ram pool you would have to add an entire second memory interface just for the on-die GPU.
Now all you’ve done is make it more complicated, slower because now you have to move things between the two pools, and gained what exactly?
I think it was a very clear and obvious decision to make. It's an outgrowth of how the base chips were designed, and it turned out to be extremely handy for some things. Plus, since all their modern devices now work this way, that probably simplifies the software.
I'm not saying it's genius foresight, but it certainly worked out rather well. There's nothing stopping them from supporting discrete GPUs too if they wanted to. They just clearly don't.
Apple debuted dedicated machine learning hardware in 2017 with the Neural Engine on iPhones. While I don’t think they predicted the LLM explosion in particular, they knew machine learning was important and they have been allowing that to influence hardware design.
Apple has always liked to integrate as much as possible on the same chip. It was only natural that they would come to this conclusion, with the improved perf the cherry on top.
Is this on-chip memory? From the 800GB/s I would guess more likely a 512-bit bus (8-channel) to DDR5 modules. Doing it with quad-channel would just about be possible, but would really be pushing the envelope. Still a nice thing.
As for practicality, which mainstream applications would benefit from this much memory paired with nice but relatively mid compute? At this price point ($14K for a fully specced system), would you prefer it over e.g. a couple of NVIDIA Project DIGITS (assuming that arrives on time and for around the announced $3K price point)?
It would be 273 GB/s (gigabytes, not gigabits). But in reality we don't know the bandwidth. Some ex-employee said 500 GB/s.
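For what it's worth, here is one way a ~273 GB/s figure could fall out; the bus width and LPDDR5X speed below are guesses, not confirmed specs:

```python
# Hypothetical: a 256-bit LPDDR5X-8533 bus.
bus_bits = 256
mt_per_s = 8533
print(f"~{bus_bits / 8 * mt_per_s / 1000:.0f} GB/s")   # ≈ 273 GB/s
```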
Your source is a Reddit post in which they try to match the size to existing chips, without realizing that it's very likely that NVIDIA is using custom memory here produced by Micron, like Apple uses custom memory chips.
Yes, but for the price of that single M3 Ultra I could have 4 of those GB10s running in a 2x2 cluster, with the full NVIDIA stack supported (which is still a big thing).
So preference for the M3 will depend on whether a niche can significantly benefit from a monolithic, lower-compute, high-memory setup vs. a higher-compute but distributed one.
It's a game changer for sure.... 512GB of unified memory really pushes the envelope, especially for running complex AI models locally. That said, the real test will be in how well the dual-chip design handles heat and power efficiency
I think the other big thing is that the base model finally starts at a normal amount of memory for a production machine. You can't get less than 96GB. Although an extra $4000 for the 512GB model seems Tim Apple levels of ridiculous. There is absolutely no way that the difference costs anywhere near that much at the fab.
And the storage solution still makes no sense of course, a machine like this should start at 4TB for $0 extra, 8TB for $500 more, and 16TB for $1000 more. Not start at a useless 1TB, with the 8TB version costing an extra $2400 and 16TB a truly idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives for $1000, SoC storage should set you back half that, not over double that.
> but chip lithography errors (thus, yields) at the huge memory density might be partially driving up the cost for huge memory.
Apple's not having TSMC fab a massive die full of memory. They're buying a bunch of small dies of commodity memory and putting them in a package with a pair of large compute dies. How many of those small commodity memory dies they use has nothing to do with yield.
This has been pretty clear about all Apple chip designs, going back to some of the first A series afaik. They are "unified memory" but not "memory on die", they've always been "memory on package"-- ie. the ram is packaged together with the CPU, often under a single heat spreader, but they are separate components.
Apple's own product shots have shown this. Here's a bunch of links that clearly show the memory as separate. On lots of these modules you can make out the serial or model numbers and look up the manufacturer directly :)
This is also a niche product. The number they sell is going to be very tiny compared to the base model MacBook, let alone the iPhone.
Apple absolutely loves to gouge for upgrades, but the chips in this have got to be expensive. I almost wonder if the absolute base model of this machine has noticeably lower margins than a normal Apple product because of that. But they expect/know that most everyone who buys one is going to spec it up.
This is cheap compared to GB200, which has a street price of >$70k for just the chip alone if you can even get one. Also GB200 technically has only 192GB per GPU and access to more than that happens over NVLink/RDMA, whereas here it’s just one big flat pool of unified memory without any tiered access topology.
We finally encountered the situation where an Apple computer is cheaper than its competition ;-)
All joking aside, I don't think Apples are that expensive compared to similar high-end gear. I don't think there is any other compact desktop computer with half a terabyte of RAM accessible to the GPU.
I mean expensive relative to who, Nvidia? Both are enjoying little to no competition in their respective niche and are using that monopoly power to extract massive margins. I have no doubt it could be much cheaper if there was actual competition in the market.
Fortunately it seems like AMD is finally catching on and working towards producing a viable competitor to the M series chips.
Apples to oranges. NVIDIA cards have an order of magnitude more horsepower for compute than this thing. A B100 has 8 TB/s of memory bandwidth, 10 times more than this. If NVIDIA made a card with 512GB of HBM I'd expect it to cost $150K.
The compute and memory bandwidth of the M3 Ultra is more in-line with what you'd get from a Xeon or Epyc/Threadripper CPU on a server motherboard; it's just that the x86 "way" of doing things is usually to attach a GPU for way more horsepower rather than squeezing it out of the CPU.
This will be good for local LLM inference, but not so much for training.
When I was much younger, I got to work on compilers at Cray Computer Corp., which was trying to bring the Cray-3 to market. (This was basically a 16-CPU Cray-2 implemented with GaAs parts; it never worked reliably.)
Back then, HPC performance was measured in mere megaflops. And although the Cray-2 had peak performance of nearly 500MF/s/CPU, it was really hard to attain, since its memory bandwidth was just 250M words/s/CPU (2GB/s/CPU); so you had to have lots of operand re-use to not be memory-bound. The Cray-3 would have had more bandwidth, but it was split between loads and stores, so it was still quite a ways away from the competing Cray X-MP/Y-MP/C-90 architecture, which could load two words per clock, store one, and complete an add and a multiply.
So I asked why the Cray-3 didn't have more read bandwidth to/from memory, and got a lesson from the answer that has stuck. You could actually see how much physical hardware in that machine was devoted to the CPU/memory interconnect, since the case was transparent -- there was a thick nest of tiny blue & white twisted wire pairs between the modules, and the stacks of chips on each CPU devoted to the memory system were a large proportion of the total. So the memory and the interconnect constituted a surprising (to me) majority of the machine. Having more floating-point performance in the CPUs than the memory could sustain meant that the memory system was oversubscribed, and that meant that more of the machine was kept fully utilized. (Or would have been, had it worked...)
In short, don't measure HPC systems with just flops. Measure the effective bandwidth over large data, and make sure that the flops are high enough to keep it utilized.
> so you had to have lots of operand re-use to not be memory-bound
Looking at Nvidia's spec sheet, an H100 SXM can do 989 tf32 teraflops (or 67 non-tensor core fp32 teraflops?) and 3.35 TB/s memory (HBM) bandwidth, so ... similar problem?
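Machine balance (peak FLOPs per byte of bandwidth) makes the comparison concrete; the Cray-2 figures are from the comment above, the H100 figures from the public spec sheet:

```python
# FLOPs needed per byte moved to stay compute-bound, then vs. now.
machines = {
    "Cray-2 (per CPU)":    (0.5e9,  2e9),       # ~500 MFLOP/s, ~2 GB/s
    "H100 SXM (fp32)":     (67e12,  3.35e12),   # non-tensor-core fp32, HBM3
    "H100 SXM (tf32 TC)":  (989e12, 3.35e12),   # tensor cores, sparsity-boosted figure
}
for name, (flops, bw) in machines.items():
    print(f"{name:20s} {flops / bw:7.1f} FLOPs/byte")
```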
Yep, it's apples to oranges. But sometimes you want apples, and sometimes you want oranges, so it's all good!
There's a wide spectrum of potential requirements between memory capacity, memory bandwidth, compute speed, compute complexity, and compute parallelism. In the past, a few GB was adequate for tasks that we assigned to the GPU, you had enough storage bandwidth to load the relevant scene into memory and generate framebuffers, but now we're running different workloads. Conversely, a big database server might want its entire contents to be resident in many sticks of ECC DIMMs for the CPU, but only needed a couple dozen x86-64 threads. And if your workload has many terabytes or petabytes of content to work with, there are network file systems with entirely different bandwidth targets for entire racks of individual machines to access that data at far slower rates.
There's a lot of latency between the needs of programmers and the development and shipping of hardware to satisfy those needs, I'm just happy we have a new option on that spectrum somewhere in the middle of traditional CPUs and traditional GPUs.
As you say, if Nvidia made a 512 GB card it would cost $150k, but this costs an order of magnitude less than that. Even high-end consumer cards like a 5090 have 16x less memory than this does (average enthusiasts on desktops have maybe 8 GB) and just over double the bandwidth (1.7 TB/s).
Also, nit pick FTA:
> Starting at 96GB, it can be configured up to 512GB, or over half a terabyte.
512 GB is exactly half of a terabyte, which is 1024 GB. It's too late for hard drives - the marketing departments have redefined storage to use multipliers of 1000 and invented "tebibytes" - but in memory we still work with powers of two. Please.
Sure, if you want to do training get an NVIDIA card. My point is that it's not worth comparing either Mac or CPU x86 setup to anything with NVIDIA in it.
For inference setups, my point is that instead of paying $10000-$15000 for this Mac you could build an x86 system for <$5K (Epyc processor, 512GB-768GB RAM in 8-12 channels, server mobo) that does the same thing.
The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
But this is how it wonderfully works. +$4000 does two things: 1. Make Apple very, very rich. 2. Make people think this is better than a $10k EPYC. Win-win for Apple. Once you have convinced people that you are the best, a higher price just means they think you are even better.
> The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
That requires an otherwise equivalent PC to exist. I haven’t seen anyone name a PC with a half-TB of unified memory in this thread.
Yeah it’s $4k. Yeah that’s nuts. But it’s the only game in town like that. If the replacement is a $40k setup from Nvidia or whatever that’s a bargain.
Since the GH200 has over a terabyte of VRAM at $343,000, and the H100 has 80GB, that makes it $195,993 for a bit over 512GB of VRAM. You could beat the price of the Apple M3 Ultra with an AMD EPYC build.
The GH200 is nowhere near the $343,000 number. You can get a single-server order for around $45k (with inception discount). If you are buying in bulk, it goes down to sub-$30k-ish. This comes with an H100's performance and an insane amount of high-bandwidth memory.
128GB for $3K; per the announcement, their ConnectX networking allows two Project DIGITS devices to be plugged into each other and work together as one device, giving you 256GB for $6k. AFAIK existing frameworks can split models across devices as well, hence, presumably, the upthread suggestion that Project DIGITS would provide 512GB for $12k, though arguably that last step is cheating.
If you want to split tensorwise yes. Layerwise splits could go over Ethernet.
I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
If the M3 can run 24/7 without overheating, it's a great deal for running agents. Especially considering that it should only use around 350W... so roughly $50/mo in electricity costs.
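Quick check on that figure (the electricity rate is an assumption):

```python
# 350W running 24/7 at an assumed $0.20/kWh.
watts, usd_per_kwh = 350, 0.20
kwh_per_month = watts / 1000 * 24 * 30
print(f"~${kwh_per_month * usd_per_kwh:.0f}/month")   # ≈ $50
```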
Around 5x Nvidia A100 80GB can fit 671b Q4. $50k just for the GPUs and likely much more when including cooling, power, motherboard, CPU, system RAM, etc.
So the M3 Ultra is amazing value then. And from what I could tell, an equivalent AMD Epyc would still be so constrained that we're talking 4-5 tokens/s. Is this a fair assumption?
That's what I'm trying to get to.
Looking to set up a rig, and AMD EPYC seems reasonable, but I'd rather go Mac if it's giving many more tokens per second. It does sound like the Mac with the M3 Ultra will easily give 40 tokens/s, whereas the EPYC is just internally constrained too much, giving 4-5 tokens/s, but I'd like someone to confirm that instead of buying the HW and finding out myself. :)
Well, ChatGPT quotes 25k-75k tokens/s with 5 H100s (so very, very far from the 40 tokens/s), but I doubt this is accurate (e.g. it completely ignored the fact that they are linked together and instead just multiplied the estimated tokens/s for one H100 by 5).
If this is remotely accurate though it's still at least an order of magnitude more convenient than the M3 Ultra, even after factoring in all the other costs associated with the infrastructure.
$5,500 easily gets me either vastly more CPU cores if I care more about that, or a vastly faster GPU if I care more about that. Or, for both, a 9950X + 5090 (assuming you can actually find one in stock) is ~$3000 for the pair plus motherboard, leaving a solid $2500 for whatever amount of RAM, storage, and networking you desire.
The M3 strikes a very particular middle ground for AI, lots of RAM but a significantly slower GPU, which nothing else matches, but that isn't inherently the right balance either. And for any other workloads, it's quite expensive.
You'll need a couple of 32GB 5090s to run a quantized 70B model, maybe 4 to run a 70B model without quantization, and forget about anything larger than that. A huge model might run slowly on an M3 Ultra, but at least you can run it at all.
I have an M3 Max (the non-binned one), and I feel like 64GB or 96GB is within the realm of enabling LLMs that run reasonably fast on it (it is also a laptop, so I can do things on planes or trips). I thought about the Ultra: with 128GB on a top-line M3 Ultra, the models that you could fit into memory would run fairly fast. With 512GB, you could run the bigger models, but not very quickly, so maybe not much point (at least for my use cases).
That config would also use about 10x the power, and you still wouldn't be able to run a model over 32GB whereas the studio can easily cope with 70B llama and plenty of space to grow.
I think it actually is perfect for local inference in a way that that build, or any other PC build in this price range, wouldn't be.
The M3 Ultra studio also wouldn't be able to run path traced Cyberpunk at all no matter how much RAM it has. Workloads other than local inference LLMs exist, you know :) After all, if the only thing this was built to do was run LLMs then they wouldn't have bothered adding so many CPU cores or video engines. CPU cores (along with networking) being 2 of the specs highlighted by the person I was responding to, so they were obviously valuing more than just LLM use cases.
Consumer hardware is cheap, if 192 GB of RAM is enough for you. But if you want to go beyond that, the Mac Studio is very competitively priced. A minimal Threadripper workstation with 256 GB is ~$7400 from Puget Systems. If you increase the memory to 512 GB, the price goes up to ~$10900. Mostly because 128 GB modules are about as expensive as what Apple charges for RAM. A Threadripper Pro workstation can use cheaper 8x64 GB for the same capacity, but because the base system is more expensive, you'll end up paying ~$11600.
Running locally, your data is not sent outside of your security perimeter off to a remote data center.
If you are going to argue that the OS or even below that the hardware could be compromised to still enable exfiltration, that is true, but it is a whole different ballgame from using an external SaaS no matter what the service guarantees.
For enterprise markets, this is table stakes. A lot of datacenter customers will probably ignore this release altogether since there isn't a high-bandwidth option for systems interconnect.
The Mac Studio isn’t meant for data centers anyway? It’s a small and silent desktop form factor — in every respect the opposite of a design you’d want to put in a rack.
A long time ago Apple had a rackmount server called Xserve, but there’s no sign that they’re interested in updating that for the AI age.
Don't forget CI/CD farms for iOS builds, although I think it's much more cost-effective to just make Minis or Studios work, despite their nonstandard form factor.
I genuinely forgot the Mac Pro still exists. It’s been so long since I even saw one.
And I’ve had every previous Mac tower design since 1999: G4, G5, the excellent dual Xeon, the horrible black trash can… But Apple Silicon delivers so much punch in the Studio form factor, the old school Pro has become very niche.
Edit - looks like the new M3 Ultra is only available in Mac Studio anyway? So the existence of the Pro is moot here.
The Mac Studio hit a sweet spot in 2023 that the trash can Mac Pro couldn't ten years earlier. It's mostly thanks to the high integration of Apple Silicon and improved device availability and speed of Thunderbolt.
The 2013 Mac Pro was stuck forever with its original choice of Intel CPU and AMD GPU. And it was unfortunately prone to overheating due to these same components.
Apple recently announced they’re building a new plant in Texas to produce servers. Yes, they need servers for their Private Compute Cloud used by Apple Intelligence, but it doesn’t only need to be for that.
As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.
That's still not even competitive with 100G Ethernet on a per-port basis. An overall bandwidth of 480 Gbps pales in comparison with, for example, the 3200 Gbps you get with a P5 instance on EC2.
But 80Gbit/s is way slower than even regular dual channel RAM, or am I missing something here? That would mean the LLM would be excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.
If I'm not mistaken, each token produced roughly equals the whole model in memory transfers (the exception being MoE models). That's why memory bandwidth is so important in the first place, or not?
My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
You're correct about the weights: each machine could in fact store all of the weights. However I think you still have to transfer the activations and the KV-Cache while performing inference.
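A rough sense of scale, using generic dense-attention dimensions (placeholders, not DeepSeek's actual MLA layout):

```python
# Per-token data that has to move (or be stored) besides the weights themselves.
hidden, n_layers, n_kv_heads, head_dim = 7168, 61, 128, 128   # assumed dims
bytes_fp16 = 2

act_kb = hidden * bytes_fp16 / 1024                              # activation handed across a split
kv_mb = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16 / 1e6  # K+V appended per token
print(f"activation hop: ~{act_kb:.0f} KB/token, KV cache growth: ~{kv_mb:.1f} MB/token")
# Both are tiny next to the hundreds of GB of weights, which is why not shipping
# the weights around is the thing that matters.
```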
It's good enough to run whatever local model you want. 2x 80-core GPUs is no joke. Linking them together gives it effectively 1.6 TB/s of bandwidth and 1TB of total memory.
You can run the full Deepseek 671b q8 model at 40 tokens/s. Q4 model at 80 tokens/s. 37B active params at a time because R1 is MoE.
Linking 2 of these together lets you run a model more capable than GPT-4o (R1) at a comfortable speed at home. That was simply fantasy a year ago.
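Taking the quoted 1.6 TB/s at face value, a bandwidth-only upper bound (ignoring interconnect and software overhead) lands close to those figures:

```python
# Ideal-case decode speed for two linked M3 Ultras on an MoE with 37B active params.
bandwidth_gbs = 2 * 819
active_params = 37e9
for label, bytes_per_param in (("q8", 1.0), ("q4", 0.5)):
    tok_s = bandwidth_gbs * 1e9 / (active_params * bytes_per_param)
    print(f"{label}: ~{tok_s:.0f} tok/s upper bound")
```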
> with a vanishingly small fraction of flops and a small fraction of memory bandwidth
Is it though?
Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPs of FP32. So M3 Ultra is 1/3 the speed of a Blackwell.
Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.
I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.
nVidia likes throwing around numbers like "20 petaFLOPs" for FP4, but that's not real floating point... it's just 1990's-vintage uLaw/aLaw integer math.
Edit: Further, most (all?) of the TFLOPs numbers you see on nVidia datasheets for "Tensor FLOPs" have a little asterisk next to them saying they are "effective" TFLOPs using the sparsity feature, where half the elements of the matrix multiplication are zeroed.
I wonder if that’s something new, or just the same virtual network interface that’s been around since the TB1 days (a new network interface appears when you connect two Macs with a TB cable)
I'm super interested in the clustering capability. At launch people said they were only getting like 11Gbps from their TB4 drive arrays, which was really way less than expected.
Apple does kind of advertise that each TB port has its own controller. Which gives me hope that whatever one port can do, six ports can do 6x better.
AMD's Strix Halo victory feels much more shallow today. Eventually 48GB or 64GB sticks will probably expand Strix Halo to 192 and then 256GB. But Strix Halo is super IO-starved, with basically a desktop's worth of IO and no easy way to do host-to-host, and Apple absolutely understands that the usefulness of a chip is bounded by what it can connect to. 6x TB5, if even half true, will be utterly outstanding.
It's been so so so so cool to see Non-Transparent Bridging atop Thunderbolt, so one host can act like a device. Since it's PCIe, that hypothetically would allow amazing RDMA over TB. USB4 mandates host-to-host networking, but I have no idea how it is implemented and I suspect it's nowhere near as close to the metal.
In 2017 I was working for a company that was trying to develop foundation models, and I was developing a framework for training what were then large neural networks [1] and other models.
It was "yet another mac-oriented startup" but I had them get me an Alienware laptop because I could get one with a 1070 mobile card that meant I could train on my laptop whereas the data sci's had to do everything on our DGX-1. [2]
Today it is the other way around, the Mac Studio looks like the best AI development workstation you can get.
[1] I was really partial to a character-level CNN model we had
[2] CEO presented next to Jensen Huang at a NVIDIA conference, his favorite word was "incredible". I thought it was "incredible" when I heard they got bought by Nike, but it was true.
Whoa. M3 instead of M4. I wonder if this was basically binning, but I thought I had read somewhere that the interposer that enabled this for the M1 chips was not available.
That said, 512GB of unified RAM with access to the NPU is absolutely a game changer. My guess is that Apple developed this chip for their internal AI efforts, and they are now at the point where they are releasing it publicly for others to use. They really need a 2U rack form factor for this, though.
This hardware is really being held back by the operating system at this point.
If Apple supported Linux (headless) natively, and we could rack m4 pros, I absolutely would use them in our Colo.
The CPUs have zero competition in terms of speed, memory bandwidth. Still blown away no other company has been able to produce Arm server chips that can compete.
The last I checked, AMD was outperforming Apple perf/dollar on the high end, though they were close on perf/watt for the TDPs where their parts overlapped.
I’d be curious to know if this changes that. It’d take a lot more than doubling cores to take out the very high power AMD parts, but this might squeeze them a bit.
Interestingly, AMD has also been investing heavily in unified RAM. I wonder if they have / plan an SoC that competes 1:1 with this. (Most of the parts I’m referring to are set up for discrete graphics.)
That’s a laptop part, so it makes different tradeoffs.
Somewhere on the internet there is a tdp wattage vs performance x-y plot. There’s a pareto optimal region where all the apple and amd parts live. Apple owns low tdp, AMD owns high tdp. They duke it out in the middle. Intel is nowhere close to the line.
I’d guess someone has made one that includes datacenter ARM, but I’ve never seen it.
True, but these "Ultra" chips do target the same niche as (some) high-TDP chips.
Workstations (like the Mac Studio) have traditionally been a space where "enthusiast"-grade consumer parts (think Threadripper) and actual server parts competed. The owner of a workstation didn't usually care about their machine's TDP; they just cared that it could chew through their workloads as quickly as possible. But, unlike an actual server, workstations didn't need the super-high core count required for multitenant parallelism; and would go idle for long stretches — thus benefitting (though not requiring) more-efficient power management that could drive down baseline TDP.
> You mean server-grade CPUs? Apple doesn't make those.
Right.
It is coming up because we're in a thread about using them as server CPUs. (c.f. "colo", "2U" in OP and OP's child), and the person you're replying to is making the same point you are
For years now, people will comment "these are the best chips, I'd replace all chips with them."
Then someone points out perf/watt is not perf.
Then someone else points out some M-series is much faster than a random CPU.
And someone else points out that the random CPU is not a top performing CPU.
And someone else points out M-series are optimized for perf/watt and it'd suck if it wasn't.
I love my MacBook, the M-series has no competitors in the case it's designed for.
I'd just prefer, at this point, that we can skip long threads rehashing it.
It's a great chip. It's not the fastest, and it's better for that. We want perf/watt in our mobile devices. There's fundamental, well-understood, engineering tradeoffs that imply being great at that necessitates the existence of faster processors.
Maybe it is, maybe not. UNIX and Windows server software has been multithreaded/multi-process for decades; we want tons of threads and processes, not a single one.
I think that is much too hand-wavy regarding the performance differences.
Both Passmark and Geekbench are aggregates of a variety of tasks. If you dig into the individual tests that constitute this aggregate score, you will find different platforms perform better, or worse, on certain tests than others. I would wager that, for many applications, only a subset of these tasks are relevant to the performance of the application, yet such benchmark suites distil out all nuance into a single value.
Here is a personal anecdote. I have tried running CASTEP (built from source), a density functional theory calculator, on both an M1 Max MacBook Pro [0], and on a Ryzen 7840HS Lenovo laptop [1]. A cursory glance at those Geekbench results linked might make you expect that the performance is roughly equivalent, but the Ryzen outperforms the Mac by about 4x, a huge difference.
What happens if we try and dig into any particular benchmark to explain this? If you click on any particular benchmark in the Geekbench search lists, you will see they test things like "File Compression", "HTML5 Browser", "Clang". Which of these maps most closely to the sorts of instructions used in CASTEP? Your guess is as good as mine.
If anything, I would say Passmark is quite a bit less abstract about this. Looking at the Mac [2] and Ryzen [3] Passmark results, you can see the Ryzen outperforms the Mac by about 2x on "extended instructions", which appear to involve some matrix math, and also about 2x on "integer math". The Mac, meanwhile, appears to be extremely good at finding prime numbers, at over 3x the speed of the Ryzen. Presumably the Ryzen's balance of instruction performance is more useful for DFT calculations than the Mac's, which perhaps is weaker in areas that might matter for this application, but stronger in areas that might matter for others.
Of course, optimization is likely a component of this. How much effort is put into the OpenBLAS, MPI, etc, implementations on aarch64 darwin vs. x86-64 linux? This is a good question. It is, however, mostly irrelevant to the end consumer, who wishes to consume this software for use in their further research, rather than dig into high-performance computing library optimization.
Geekbench correlates with SPEC, the industry-standard CPU benchmark and what enterprise companies such as AWS use to judge CPU performance. It has a .99 correlation.
This is my experience as well. Geekbench heavily favors the type of workload that runs best on Apple hardware (those tend to be general-case workloads, most likely to be used by the masses), but in practice, if you have complex software to run, your experience will not match the bench numbers.
I think PassMark is more honest as well, because it just gives scores for calculation throughput instead of specific tasks. It more closely matches what experience you will get if you have a varied load.
But since it's Apple we are talking about, their users just want to think they have the best and that's all that matters.
The M4 Max had great, I would argue the best at time of release, single core results on Geekbench.
That is a different claim from M4 line has the top single thread performance in the world.
I'm curious:
You're signalling both that you understand the fundamental tradeoff ("Apple doesn't make server-grade CPUs") and that you are talking about something else (follow-up with M4 family has top single-thread performance)
What drives that? What's the other thing you're hoping to communicate?
If you are worried that if you leave it at "Apple doesn't make server-grade CPUs", that people will think M4s aren't as great as they are, this is a technical-enough audience, I think we'll understand :) It doesn't come across as denigrating the M-series, but as understanding a fundamental, physically-based, tradeoff.
At least judging by the mounts, they want them to be used that way, even though the CPU might not fit with the de facto industry label for "server-grade".
The rack mount Mac Pro doesn't really make sense for a data center. It's 5U high, which is much too big for a data center. It doesn't have standard server features like redundant power supplies.
The only use case I can think of is for audio workstations, where people have lots of rack mount equipment, so you can have everything including the computer in the rack. But even for that use case it's quite big.
Indeed. The M3 Ultra is in the midrange where they duke it out. Similarly, for its niche, the iPhone CPU is way better than AMD's low-end processors.
Anyway the Apple config in the article costs about 5x more than a comparable low end AMD server with 512GB of ram, but adds an NPU. AMD has NPUs in lower end stuff; not sure about this TDP range.
If I read this right, the r8g.48xlarge at AMZN [1] has 192 cores and 1536GB which exceeds the M3 Ultra in some metrics.
It reminds me of the 1990s when my old school was using Sun machines based on the 68k series and later SPARC and we were blown away with the toaster-sized HP PA RISC machine that was used for student work for all the CS classes.
Then Linux came out and it was clear the 386 trashed them all in terms of value and as we got the 486 and 586 and further generations, the Intel architecture trashed them in every respect.
The story then was that Intel was making more parts than anybody else so nobody else could afford to keep up the investment.
The same is happening with parts for phones and TSMC's manufacturing dominance -- and today with chiplets you can build up things like the M3 Ultra out of smaller parts.
In fairness, the sun and dec boxes I used back then (up to about 1999) could hold their own against intel machines.
Then, one day, we built a 5 machine amd athlon xp linux cluster for $2000 ($400/machine) that beat all the unix and windows server hardware by at least 10x on $/perf.
It’s nice that we have more than one viable cpu vendor these days, though it seems like there’s only one viable fab company.
The Alpha architecture was 64-bit from the very beginning (though the amount of addressable virtual memory and physical memory depends on the processor implementation).
The EV6 is a bit quirky as it is 43-bit by default, but can use 48-bits when I_CTL<VA_48> or VA_CTL<VA_48> is set. (the distinction of the registers is for each access type, i.e: instruction fetch versus data load/store)
The 21364/EV7 likely has the same characteristics as EV6, but the hardware reference manual seems to have been lost to time...
My understanding is that the VAX from Digital was the mother of all "32-bit" architectures, built to replace the dead-end PDP-11 (which had a 64-kbyte user space, so it wasn't really much better than an Apple ][) and the PDP-10/20 (36-bit words were awkward after the 8-bit byte took over the industry). The 68k and 386 protected mode were imitations of the VAX.
Digital struggled with the microprocessor transition because they didn't want to kill their cash cow minicomputers with microcomputer-based replacements. They went with the 64-bit Alpha because they wanted to rule the high end in the CMOS age. And they did, for a little while. But the mass market caught up.
VMS is the only OS (that I know of) that uses all 4 processor privilege modes.
Side note: The 21064 has such bizarre IPR mappings, the read values have lots of bits scrambled around compared to their write counterparts. This is likely a hardware design decision affecting the programmer's model, if I had to guess.
In 1998 I somehow got my hands on a DEC OEM 21164 533MHz board for cheap. PCs were nowhere near the performance of that at the time. It mounted in a regular PC case. A friend helped me get the power supply working (there was, I think, one wire to solder somewhere). Equipped with an ASUS SCSI card, a DEC Ethernet card, and an Elsa GLoria Synergy, it was a full machine. I ran Digital Unix at home on my desk on that for quite a few years. Wish I had kept it for old times' sake.
One thing I remember about Alpha though was how bad the output from gcc was. Then DEC released a version of their own compilers that was command line compatible with gcc. That changed everything for open source stuff.
One thing I remember about Alpha though was how bad the output from gcc was. Then DEC released a version of their own compilers that was command line compatible with gcc. That changed everything for open source stuff.
For what we needed, five 32-bit address spaces were enough DRAM. The individual CPU parts were way more than 20% as fast, and the 100Mbit switch was good enough.
(The data basically fit in RAM, so network transport time to load a machine was bounded by 4GiB / 8MiB/sec = 500 seconds. Also, the hard disks weren't much faster than the network back then.)
It seems Graviton 4 CPUs have 12 channels of DDR5-5600, i.e. ~540GB/s of main memory bandwidth for the CPU to use. The M3 Ultra has 64 channels of LPDDR5-6400, i.e. ~800GB/s of memory bandwidth for the CPU or the GPU to use. So the M3 Ultra has way fewer (CPU) cores, but way more memory bandwidth. Depends what you're doing.
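Both figures fall straight out of channel count × transfer rate:

```python
# Peak bandwidth = channels × bytes per channel per transfer × transfer rate.
def bw_gbs(channels, bits_per_channel, mt_per_s):
    return channels * bits_per_channel / 8 * mt_per_s / 1000

print(f"Graviton 4: ~{bw_gbs(12, 64, 5600):.0f} GB/s")   # 12 × 64-bit DDR5-5600   ≈ 538
print(f"M3 Ultra:   ~{bw_gbs(64, 16, 6400):.0f} GB/s")   # 64 × 16-bit LPDDR5-6400 ≈ 819
```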
> The CPUs have zero competition in terms of speed, memory bandwidth.
Maybe not at the same power consumption, but I'm sure mid-range Xeons and EPYCs mop the floor with the M3 Ultra in CPU performance. What the M3 Ultra has that nobody else comes close is a decent GPU near a pool of half a terabyte of RAM.
Yeah, I've been thinking about this for a few years. The M-series chips would sell into data centers like crazy if Apple went after that market, especially if they created a server-tuned chip. It could probably be their 2nd-biggest product line behind the iPhone. The performance and efficiency are awesome. I guess it would be neat to see some web-serving and database benchmarks to really know.
TSMC couldn't make enough at the leading node in addition to all the iPhone chips Apple has to sell. There's a physical throughput limit. That's why this isn't M4.
What about serviceability? These come with soldered-in SSDs?
That would be an issue for server use. It's too expensive to throw it all away for a broken SSD.
Nah, in many businesses, everything is on a schedule. For desktop computers, a common cycle is 4 years. For servers, maybe a little longer, but not by much. After that date arrives, it’s liquidate everything and rebuild.
Having things consistently work is much cheaper than down days caused by your ancient equipment. Apple’s SSDs will make it to 5 years no problem - and more likely, 10-15 years.
At my last N jobs, companies built high end server farms and carefully specced all the hardware. Then they looked at SSD specs and said “these are all fine”.
Fast forward 2 years: The $50-$250K machines have a 100% drive failure rate, and some poor bastard has to fly from data center to data center to swap the $60 drive for a $120 one, then re-rack and re-image each machine.
Anyway, soldering a decent SSD to the motherboard would actually improve reliability at all those places.
What does soldering it to the board have to do with reliability?
If they were soldered onto those systems you talk about, all those would have had to be replaced instead of just having the drive swapped out and re-imaged.
I think the implication was that a soldered SSD doesn't give the customer as much chance to pick the wrong SSD. But it's still possible for the customer to have a different use case in mind than the OEM did when the OEM is picking what SSD to include.
It wouldn't solve other mismatched expectations. For example, the vendor might ship those SSDs only to store firmware-initiated crash dumps. They don't expect them to be used to meet production storage requirements. Maybe to occasionally boot a maintenance system, but that's it. To them, this is kind of obvious because everybody has a SAN anyway. But of course, this is not actually true in practice because customers vary a bit.
M3 support in Asahi is still heavily WIP. I don't think it even has display support, Ethernet, or Wi-Fi yet; I believe it's only serial over USB. Without any GPU or ANE support, it's not very useful for AI stuff. https://asahilinux.org/docs/M3-Series-Feature-Support/
Hmm, this page links to an out-of-tree ANE module (same as on M1/M2 I believe). No GPU support is a bummer, though.
On the other hand, you can do without display support if you’re only using it as a server. And I think USB Ethernet dongles might work for the time being?
The Asahi maintainer resigned recently. What that means for the future only time will tell. I probably wouldn't want to make a big investment in it right now.
Your wording makes it sound like it was a one-man show. Asahi has a really strong contributor base, new leadership[1], and the backing of Fedora via the Asahi Fedora Remix. While Hector resigning is a loss, I don't think it's a death knell for the project.
I think it's interesting that everyone who dissented mentioned power consumption.
Our business "only" sees about 1,000-25,000 req/min, our message brokers transmit MAX 25k msg/s. Easily handled by a rack of 10 servers for redundancy.
We are not Google and we don't pretend to be, so we don't care about power, as the difference is a few dollars a month.
> Apple does not make server CPUs, they make consumer low W CPUs, it's very different.
This is silly. Given the performance per watt, the M series would be great in a data center. As you all know, electricity for running the servers and cooling for the servers are the two biggest ongoing costs for a data center; the M series requires less power and runs more efficiently than the average Intel or AMD-based server.
> FYI Apple runs Linux in their DC, so no Apple hardware in their own servers.
That's certainly no longer the case. Apple announced their Private Cloud Compute [1] initiative—Apple designed servers running Apple Silicon to support Apple Intelligence functions that can't run on-device.
BTW, Apple just announced a $500 billion investment [2] in US-based manufacturing, including a 250,000 square foot facility to make servers. Yes, these will obviously be for their Private Cloud Compute servers… but it doesn't have to be only for that purpose.
From the press release:
As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.
Previously manufactured outside the U.S., the servers that will soon be assembled in Houston play a key role in powering Apple Intelligence, and are the foundation of Private Cloud Compute, which combines powerful AI processing with the most advanced security architecture ever deployed at scale for AI cloud computing. The servers bring together years of R&D by Apple engineers, and deliver the industry-leading security and performance of Apple silicon to the data center.
Teams at Apple designed the servers to be incredibly energy efficient, reducing the energy demands of Apple data centers — which already run on 100 percent renewable energy. As Apple brings Apple Intelligence to customers across the U.S., it also plans to continue expanding data center capacity in North Carolina, Iowa, Oregon, Arizona, and Nevada.
Hardly. x86 and ARM follow similar security underpinnings and outcomes. ARM has TrustZone, x86 has TEEs. I cannot think of a single attack demonstrated on x86 and not on ARM or vice versa. Could you please cite one?
> This hardware is really being held back by the operating system at this point.
It really is. Even if they themselves won't bring back their old XServe OS variant, I'd really appreciate it if they at least partnered with a Linux or BSD (good callout, ryao) dev to bring a server OS to the hardware stack. The consumer OS, while still better (to my subjective tastes) than Windows, is increasingly hampered by bloat and cruft that make it untenable for production server workloads, at least to my subjective standards.
A server OS that just treats the underlying hardware like a hypervisor would, making the various components attachable or shareable to VMs and Containers on top, would make these things incredibly valuable in smaller datacenters or Edge use cases. Having an on-prem NPU with that much RAM would be a godsend for local AI acceleration among a shared userbase on the LAN.
You are technically correct (the best kind of correct). I’m just a filthy heathen who lumps the BSDs and Linux distros under “Linux” as an incredibly incorrect catchall for casual discourse.
I’m continually surprised Apple doesn’t just donate something like 0.1% of their software development budget to proton and the asahi projects. It’d give them a big chunk of the gaming and server markets pretty much overnight.
I guess they’re too busy adding dark patterns that re-enable siri and apple intelligence instead.
Sure, but FreeBSD also has a Linux compatibility layer. For a company that's given up on the server market so many times, making MacOS compatible with _THE_ server OS makes a lot of sense.
I feel like Apple and Ubiquiti have a missed collaboration opportunity on the latter point, especially with the latter's recent UniFi Express unit. It feels like pairing Ubiquiti's kit with Apple's Homekit could benefit both, by making it easier for Homekit users to create new VLANs specifically for Homekit devices, thereby improving security - with Apple dubbing the term, say, "Secure Device Network" or some marketingspeak to make it easier for average consumers to understand. An AppleTV unit could even act as a limited CloudKey for UniFi devices like Access Points, or UniFi Cameras to connect/integrate as Homekit Cameras.
Don't get me wrong, I wouldn't use that feature (I prefer self-hosting it all myself), but for folks like my family members, it'd be a killer addition to the lineup that makes my life supporting them much easier.
Ubiquiti was founded by some Apple employees after they closed their Airport division.
I sincerely doubt they want to go through the trouble of collaborating with a company that was too greedy to keep investing in their network hardware.
Ubiquiti is decently priced, especially for niche hardware, unlike Apple. Fundamentally they diverge on the way to do things...
> for folks like my family members, it'd be a killer addition
HomeKit networking existed in Eero briefly. I put that in a lot of casual Apple homes. Seemed like missed oppty for Apple to let Amazon buy Eero, a more "spiritual successor" to the Airports.
>I had read somewhere that the interposer that enabled this for the M1 chips was not available.
With all my love and respect for "Apple rumors" writers; this was always "I read five blogposts about CPU design and now I'm an expert!" territory.
The speculation was based on the M3 Max's die shots not having the interposer visible, which... implies basically nothing about whether that _could have_ been supported in an M3 Ultra configuration; as evidenced by the announcement today.
Baseline M4 doesn't have Thunderbolt 5 either; only the Pro/Max variants do.
The press-release even calls TB5 out:
>Each Thunderbolt 5 port is supported by its own custom-designed controller directly on the chip.
Given that they're doing the same on A-series chips (A18 Pro with 10Gbps USB-C; A18 with USB 2.0); I imagine it's just relatively simple to swap the I/O blocks around and they're doing this for cost and/or product segmentation reasons.
>Which, at this point, why not just use M4 as a base?
I imagine that making those chips is quite a bit more involved than just taking the files for M3 Max, and copy-pasting them twice into a new project.
I imagine it just takes more time to design/verify/produce them; especially given they're not selling very many of them, so they're probably not super-high-priority projects.
TB 5 seems like the sort of thing you could 'slap on' to a beefy enough chip.
Or the sort of thing you put onto a successor when you had your fingers crossed that the spec and hardware would finalize in time for your product launch but the fucking committee went into paralysis again at the last moment and now your product has to ship 4 months before you can put TB 5 hardware on shelves. So you put your TB4 circuitry on a chip that has the bandwidth to handle TB5 and you wait for the sequel.
The world is full of features that didn't make the cutoff for launch date. I believe there's one or two of these publicly known in Apple's history, but it's an old tale.
> This hardware is really being held back by the operating system at this point.
Apple could either create 2U rack hardware and support Linux (and I mean Apple supporting it, not hobbyists), or ship a headless build of Darwin that could run on that hardware. But in the latter case, we probably wouldn't have much software available (though I am sure people would eventually start porting software to it; there are already MacPorts and Homebrew, and I am sure they could be adapted to eventually run on that platform).
But Apple is also not interested in that market, so this will probably never happen.
There has to be someone at Apple with a contact at IBM that could make Fedora Apple Remix happen. It may not be on-brand, but this is a prime opportunity to make the competition look worse. File it under Community projects at https://opensource.apple.com/projects
> But Apple is also not interested in that market, so this will probably never happen.
they're just a tiny company with shareholders who are really tired of never earning back their investments. give 'em a break. I mean they're still so small that they must protect themselves by requiring that macs be used for publishing iPhone and iPad applications.
Not to get in the way of good snark or anything. But.. Apple isn't _requiring_ that everyone uses MacOS on their systems. But you have to bring your own engineering effort to actually make another OS run. And so far Asahi is the only effort that I'm aware of (there were alternatives in the very beginning, but they didn't even get to M2 right?)
> But you have to bring your own engineering effort to actually make another OS run.
I mean, that's usually how it works though. When IBM launched the PS/2, they didn't support anything other than PC-DOS and OS/2, Microsoft had to make MS-DOS work for it (I mean... they did get support from IBM, but not really), the 386BSD and Linux communities brought the engineering effort without IBM's involvement.
When Apple was making Motorola Macs, they may have given Be a little help, but didn't support any other OSes that appeared. Same with PowerPC.
All of the support for alternative OSes has always come from the community, whether that's volunteers or a commercial interest with cash to burn. Why should that change for Apple silicon?
I cannot express how dirt cheap that pricepoint is for what's on offer, especially when you're comparing it to rackmount servers. By the time you've shoehorned in an nVidia GPU and all that RAM, you're easily looking at 5x that MSRP; sure, you get proper redundancy and extendable storage for that added cost, but now you also need redundant UPSes and have local storage to manage instead of centralized SANs or NASes.
For SMBs or Edge deployments where redundancy isn't as critical or budgets aren't as large, this is an incredibly compelling offering...if Apple actually had a competent server OS to layer on top of that hardware, which it does not.
If they did, though...whew, I'd be quaking in my boots if I were the usual Enterprise hardware vendors. That's a damn frightening piece of competition.
> By the time you've shoehorned in an nVidia GPU and all that RAM, you're easily looking at 5x that MSRP
That nvidia GPU setup will actually have the compute grunt to make use of the RAM, though, which this M3 Ultra probably realistically doesn't. After all, if the only thing that mattered was RAM then the 2TB you can shove into an Epyc or Xeon would already be dominating the AI industry. But they aren't, because it isn't. It certainly hits at a unique combination of things, but whether or not that's maximally useful for the money is a completely different story.
You're forgetting what Apple's been baking into their silicon for (nearly? over?) a decade: the Neural Processing Unit (NPU), now called the "Neural Engine". That's their secret sauce that makes their kit more competitive for endpoint and edge inference than standard x86 CPUs. It's why I can get similarly satisfying performance on my old M1 Pro Macbook Pro with a scant 16GB of memory as I can on my 10900k w/ 64GB RAM and an RTX 3090 under the hood. Just to put these two into context, I ran the latest version of LM Studio with the deepseek-r1-distill-llama-8b model @ Q8_0, both with the exact same prompt and maximally offloaded onto hardware acceleration and memory, with a context window that was entirely empty:
Write me an AWS CloudFormation file that does the following:
* Deploys an Amazon Kubernetes Cluster
* Deploys Busybox in the namespace "Test1", including creating that Namespace
* Deploys a second Busybox in the namespace "Test3", including creating that Namespace
* Creates a PVC for 60GB of storage
The M1Pro laptop with 16GB of Unified Memory:
* 21.28 seconds for "Thinking"
* 0.22s to the first token
* 18.65 tokens/second over 1484 tokens in its responses
* 1m:23s from sending the input to completion of the output
The 10900k CPU, with 64GB of RAM and a full-fat RTX 3090 GPU in it:
* 10.88 seconds for "thinking"
* 0.04s to first token
* 58.02 tokens/second over 1905 tokens in its responses
* 0m:34s from sending the input to completion of the output
Same model, same loader, different architectures and resources. This is why a lot of the AI crowd are on Macs: their chip designs, especially the Neural Engine and GPUs, allow quite competent edge inference while sipping comparative thimbles of energy. It's why if I were all-in on LLMs or leveraged them for work more often (which I intend to, given how I'm currently selling my generalist expertise to potential employers), I'd be seriously eyeballing these little Mac Studios for their local inference capabilities.
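For anyone who wants to reproduce this kind of comparison, here's a rough sketch of the timing, assuming LM Studio's local OpenAI-compatible server is running on its default port (1234); the model identifier below is only a placeholder, use whatever your server actually reports:

    import time, requests

    payload = {
        "model": "deepseek-r1-distill-llama-8b",  # placeholder; use the id your server lists
        "messages": [{"role": "user", "content": "Write me an AWS CloudFormation file that ..."}],
        "stream": False,
    }
    start = time.time()
    resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=600).json()
    elapsed = time.time() - start
    tokens = resp.get("usage", {}).get("completion_tokens", 0)
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.2f} tok/s")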
Uh.... I must be missing something here, because you're hyping up Apple's NPU only to show it getting absolutely obliterated by the equally old 3090? Your 10900K having 64gb of RAM is also irrelevant here...
You're missing the bigger picture by getting bogged down in technical details. To an end user, the difference between thirty seconds and ninety seconds is often irrelevant for things like AI, where they expect a delay while it "thinks". When taken in that context, you're now comparing a 14" laptop running off its battery, to a desktop rig gulping down ~500W according to my UPS, for a mere 66% reduction in runtime for a single query at the expense of 5x the power draw.
Sure, the desktop machine performs better, as would a datacenter server jam-packed full of Blackwell GPUs, but that's not what's exciting about Apple's implementation. It's the efficiency of it all, being able to handle modern models on comparatively "weaker" hardware most folks would dismiss outright. That's the point I was trying to make.
We're talking about the M3 Ultra here, which is also wall-powered and also expensive. Nobody is interested in dropping upwards of $10,000 on a Mac Studio to get "okay" performance just because an unrelated product is battery-powered. Similarly, saving a few bucks on electricity while tripling the much, much more expensive engineer time spent waiting on results is foolish.
Also Apple isn't unique in having an NPU in a laptop. Fucking everyone does at this point.
It almost feels like you're deliberately missing the forest for the trees, in order to fit some argument that I'm not quite able to suss out here.
The point is that, in terms of practical usage, the M3 Ultra is uniquely competent and highly affordable in a sea of enterprise technology that is decidedly not. I tried to demonstrate why I'm excited about it by pointing out the similar performance of a battery-powered, four-year-old laptop and a quite gargantuan gaming PC that's pulling over 500W from the wall, as an example of what several years of additional refinements and improvements to the architecture was expected to bring.
The point is that it's affordable, more flexible in deployment, and more efficient than similarly-specced datacenter servers specifically designed for inference. For the cost of a single decked-out Dell or HP rackmount server, I can have five of these Mac Studios with M3 Ultra chips - and without the need for substantial cooling, noise isolation, or other datacenter necessities. If the marketing copy is even in the same ballpark as actual performance, that's easily enough inference to serve an office of fifty to a hundred people or more, depending on latency tolerances; if you don't mind "queuing" work (like CurrentCo does with their internal Agents), one of those is likely enough for a hundred users.
That's the excitement. That's the point. It's not the fastest, it's not the cheapest, it's just the most balanced.
Apple defenders have some special sauce reasoning that makes no sense to anyone but them.
Are you a boomer?
I have Apple hardware but it sucks for anything AI, buying it for that purpose is just extremely dumb, just like buying Macs for engineering CADs or things of the sort.
If you are buying Macs and it's not for media production related reasons you are doing something wrong.
> Apple defenders have some special sauce reasoning that makes no sense to anyone but them. Are you a boomer?
I continue to be in awe of the lengths some people will go just to fling insults and shake out some salt. We're, what, ten layers deep? With all the context above, the best you have to contribute to the discussion are baseless accusations and ageist insults?
Your finite time would have been better spent on literally anything else, than actively seeking out a comment just to throw subjective, unsubstantiated shade around. C'mon, be better.
Had the M3 GPU been much wider, it would be constrained by the memory bandwidth. It might still have an advantage over Nvidia competitors in that it has 512GB accessible to it and will need to push less memory across socket boundaries.
From my outsider perspective, it's pretty straightforward why they don't.
In Intel's case, there's ample coverage of the company's lack of direction and complacency on existing hardware, even as their competitors ate away at their moat, year after year. AMD with their EPYC chips taking datacenter share, Apple moving to in-house silicon for their entire product line, Qualcomm and Microsoft partnering with ongoing exploration of ARM solutions. A lack of competency in leadership over that time period has annihilated their lead in an industry they used to single-handedly dictate, and it's unlikely they'll recover that anytime soon. So in a sense, Intel cannot make a similar product, in a timely manner, that competes in this segment.
As for AMD, it's a bit more complicated. They're seeing pleasant success in their CPU lineup, and have all but thrown in the towel on higher-end GPUs. The industry has broadly rallied around CUDA instead of OpenCL or other alternatives, especially in the datacenter, and AMD realizes it's a fool's errand to try and compete directly there when it's a monopoly in practice. Instead of squandering capital to compete, they can just continue succeeding and working on their own moat in the areas they specialize in - mid-range GPUs for work and gaming, CPUs targeting consumers and datacenters, and APUs finding their way into game consoles, handhelds, and other consumer devices or Edge compute systems.
And that's just getting into the specifics of those two companies. The reality is that any vendor who hasn't already unveiled their own chips or accelerators is coming in at what's perceived to be the "top" of the bubble or market. They'd lack the capital or moat to really build themselves up as a proper competitor, and are more likely to just be acquired in the current regulatory environment (or lack thereof) for a quick payout to shareholders. There's a reason why the persistent rumor of Qualcomm purchasing part or whole of Intel just won't die: the x86 market is rather stagnant, churning out mediocre improvements YoY at growing pricepoints, while ARM and RISC chips continue to innovate on modern manufacturing processes and chip designs. The growth is not in x86, but a juggernaut like Qualcomm would be an ideal buyer for a "dying" or "completed" business like Intel's, where the only thing left to do is constantly iterate for diminishing returns.
The bargain is the lower price in the UK compared to the US, once US sales tax is added. It's not like the pound is strong. It's just cheaper in the UK. And you're right, all Apple products are better value in the UK. I'm not used to any electronics being good value in the UK.
I've been looking at the potential for Apple to make really interesting LLM hardware. Their unified memory model could be a real game-changer because NVidia really forces market segmentation by limiting memory.
It's worth adding the M3 Ultra has 819GB/s memory bandwidth [1]. For comparison the RTX 5090 is 1800GB/s [2]. That's still less but the M4 Mac Minis have 120-300GB/s and this will limit token throughput so 819GB/s is a vast improvement.
For $9500 you can buy a M3 Ultra Mac Studio with 512GB of unified memory. I think that has massive potential.
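As a rough sanity check on what that bandwidth buys you: for single-stream text generation, each token has to read roughly the whole (active) weight set once, so memory bandwidth gives a crude ceiling on tokens/sec. A napkin-math sketch, ignoring KV-cache traffic and compute limits, and assuming a ~40GB dense model (roughly a 70B at Q4):

    def max_tok_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
        # crude ceiling: every generated token streams the active weights once
        return bandwidth_gb_s / active_weights_gb

    print(max_tok_per_s(819, 40))  # ~20 tok/s ceiling on the M3 Ultra
    print(max_tok_per_s(120, 40))  # ~3 tok/s ceiling at the bottom of the M4 Mini range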
It's also just clearly a powerful and interesting tinkering project, which there are valid arguments for, but this can just chill out on your desk as an elegant general productivity machine. What it wouldn't do that the tinkering project could do is be upgraded, act as a powerful gaming pc, or cause migraines from constant fan noise.
The custom build would work great though, even more so in a server room, and it also continues to reveal by comparison how excessively Apple prices its components.
The PC doesn't have to run Windows either. Strictly speaking, professional applications see MacOS support as an Apple-sanctioned detriment.
> If you've ever used git, svn, or an IDE side by side
I still reach for Windows, even though it's a dogshit OS. I would rather use WSL to write and deploy a single app, as opposed to doing my work in a Linux VM or (god forbid) writing and debugging multiple versions just to support my development runtime. If I'm going to use an ad-encumbered commercial service-slop OS, I might as well pick the one that doesn't actively block my work.
The low energy use can be a game changer if you live in a crappy apartment with limited power capacity. I gave up my big GPU box dream because of that.
I've been buying and using MBP for 6 or 7 years now, and just assumed I could run Linux on one if I wanted to. I just spent a couple of days trying to get a 2018 MBP working with Linux and found out [edit to clarify] that my other ARM MBP basically won't work.
I just want a break from MacOS, I'll be buying a Thinkpad and will probably never come back. This isn't my moaning, I understand it's their market, but if their hardware supported Linux (especially dual booting) or Docker native, I'd probably be buying Apple for the next decade and now I just won't be.
Loved my M1 mini, loved my M2 Air. I've moved on to a 2024 HP Elitebook with an AMD R7 8840U, 1TB replaceable NVMe, 32GB of socketed DDR5. 14in laptop with a serviceable enough 1920x1200 matte screen. $800 and a 3-hour drive to the nearest Microcenter. I gave Apple another try (refused Apple from 2009-2020 because of the Nvidia-era issues) and I just can't stomach living off of piles of external drives anymore to make up for the lackluster storage space on the affordable units.
The HP Elitebook was on Ubuntu's list of compatible tested laptops and came in hundreds of dollars less than a Thinkpad. Most of the comparably priced on sale T14's I could find were all crap Intel spec'd ones.
Months in, I don't regret it at all, and Linux support has been fantastic even for a fairly new Ryzen chip and not the latest kernel. (I stick to LTS releases of most distros.) Shoving in 4TB of NVMe storage and 96GB of DDR5, should I feel the need to upgrade, would still put me only around $1300 invested in this machine.
Surely you're using that thing as a laptop in a minority of cases though, looks like it's basically just specs you bought. That's fine, but if that's all you want then it seems like rather than trying to give a mac a reasonable go of it as opposed to whatever else, you were trying to instead explore a fundamental difference in how you value technology products, which is quite a different battle.
Not at all. Sure, when I'm at home it's docked, but so far in Linux battery life has been fantastic. Not Apple fantastic, sure, but I can get a good 5 hours of heavy use, up to around 8 hours of web browsing and video streaming. I often use it on the road, throwback Quake 3 LAN parties, coffee shop creative sessions.
I just want decent enough power and no thermal throttling if I do have to hammer it. I make music so the extra ram and space for sample libraries is a big benefit and why I had to keep external SSD's around with my Macs.
My Macbook Air needed a usb fan ziptied to the laptop stand to not throttle at times.
>it seems like rather than trying to give a mac a reasonable go of it as opposed to whatever else, you were trying to instead explore a fundamental difference in how you value technology products
I re-evaluate how I feel about technology pretty often and its caused some shifts for sure. My side hobby is ARM/RiscV low power computing and Apple's move to ARM tickled that hyper efficiency side of my brain, but ultimately failed to keep me interested because of all the downsides upgrade/repairability wise.
I'm not really moaning about the cost or lack of upgradability. I mean, I don't like it but at least you know what you're getting into. I just always assumed Linux as a backup was an option, and more and more OSX is annoying me (last 2 or 3 days it keeps dropping bluetooth for 30 seconds) and more and more I just find the interface distracting. Plus whether it works with external displays over USB C is a crapshoot.
I'll miss the battery life of the M1 chips, and I'm going to have to re-learn how to type (CTRL instead of ALT, fn rarely being on the left, I use fn+left instead of CTRL A in terminals) but otherwise, I think I'm done.
I think the only laptops you won't find weird issues with linux are from smaller manufacturers dedicated to shipping them like the kde laptop or system76. Every other hardware manufacturer, including those that ship laptops with linux preinstalled, probably have weird hardware incompatibilities because they don't fully customize their SKUs with linux support in mind.
Not that I'm discouraging you from switching or anything. If Linux is what you want/need, there's definitely better laptops to be had than a Macbook for that purpose. It's just that weird incompatibilities and having to fight with the operating system on random issues is, at least in my experience, normal when using a linux laptop. Even my T480 which has overall excellent compatibility isn't trouble-free.
Something like the brightness buttons not working, or sleep being a little erratic is ok. No released wifi drivers, bluetooth issues, and audio and the keyboard not working are not ok. Apple going backwards in terms of supporting Linux is not something I'm ok with.
There are wifi drivers; you just have to install them separately because they use broadcom chips. It's a proprietary blob. The other things do work, but it requires special packages and you'll need an external keyboard while installing. It's a pain to install, for sure, but it's not insurmountably difficult to get it installed.
Apple Silicon chips are arguably more compatible with Asahi Linux [1], but that's largely in thanks to the hard work of Marcan, who's stepped down as project lead from the project [2].
Overall I still think the right choice is to find a laptop better suited for the purpose of running linux on it, just something that requires more careful consideration than people think. Framework laptops, which seem well suited since ideologically it meshes well with linux users, can be a pain to set up as well.
I know there are wifi and keyboard drivers, because the live boots and installers work with them, but then when it comes to installing they're gone. I know it's not insurmountable, and 10 years ago I'd have done it, but I spent a few hours and got sick of it. I agree with you that it's probably better to get another laptop.
Getting Linux running wasn't difficult. But Mint lost audio (everything else worked), the specialised Mint kernel lost both audio and wifi, and Arch lost both wifi and the onboard keyboard.
I'm sure with tinkering I could eventually get it working, but I'm well past the point of wanting to tinker with hardware and drivers to get Linux working.
Because of the T2 chip it's actually pretty annoying. Mainline kernels I think are still missing keyboard and trackpad support for those models. Plus a host of other issues.
I could have written it clearer. I have both, Intel was the first attempt and when I was struggling to get it up without losing one of wifi, audio and onboard keyboard and read that ARM was worse I gave up. Even the best combination I had (no audio but everything else working) would kill bluetooth after a while if wifi was connected to 2.6. I don't like their hardware enough to fight with it.
Given that the M1 Ultra and M2 Ultra also exist, I'd expect either straight binning, or two designs that use mostly the same designs for the cores but more of them and a few extra features.
I love Apple but they love to speak in half truths in product launches. Are they saying the M3 Ultra is their first Thunderbolt 5 computer? I don't recall seeing any previous announcements.
It certainly is held back and that is unfortunate. But if you can run your workloads on this amazing machine, then that's a lot of compute for the buck.
I assume that there's a community of developers focusing on leveraging this hardware instead of complaining about the operating system.
> Apple’s custom-built UltraFusion packaging technology uses an embedded silicon interposer that connects two M3 Max dies across more than 10,000 signals, providing over 2.5TB/s of low-latency interprocessor bandwidth, and making M3 Ultra appear as a single chip to software.
The comment was that the press had reported that the interposer wasn't available. This obviously uses some form of interposer, so the question is if the press missed it, or Apple has something new.
It sounds like they're using TSMC's new LSI (Local Si Interconnect) technology, which is their version of Intel's EMIB. It's essentially small islands of silicon, just around the inter-chip connections, embedded within the organic substrate. This gives the advantages of silicon interconnect, without the cost and size restrictions of a silicon interposer. It would not be visible from just looking at the package.
This is more about "average" end user software, not the type of software that would be running on a machine like this. Yes their applications fell off, but if you're paying for 512gb of RAM apple notes being slow isn't the bottleneck
> Lack of focus on quality of software affects all types of workloads, not just consumer-oriented or professional-oriented in isolation.
The apps are developed by different teams. MacOS apps are containerized. Saying macOS's performance is hindered by Notes.app is like saying that Windows is hindered by Paint.exe. Notes.app is just a default[0]
[0]: though, I dislike saying this because I always feel like I need to mention that even Notes links against a hilarious amount of private APIs that could easily be exposed to other developers but... aren't.
It'd be an interesting flame war in the comments, if nothing else, go for it! I'm happy to give plenty of concrete evidence why Linux is more suitable for professionals than macOS is in 2025 :)
Omg I despise the fact that there's n competing GUI standards on linux, zero visual consistency.
I love diversity in websites, and apps for that matter, but this isn't diversity, it is the uncanny valley between bespoke graphic design and homogeneity.
Say what you want about SwiftUI, but it makes consistent, good looking apps. Unless something has changed, GTK is a usability disaster.
And that's before I get into how much both X11 and wayland suck equally.
There's so much I miss about Linux, but there's so much I don't
No native Docker support, no headless management options (enterprise strength), limited QoS management, lack of robust Python support (out of the box), and an interactive-user-focused security model.
There is no such thing. Tell me, which combination of the 15+ virtual environment, dependency management, and Python version managers would you use? And how would you prevent "project collision" (where one Python project bumps into another one and one just stops working)? Example: SSL library differences across projects are a notorious culprit.
Python is garbage and I don't understand why people put up with this crap unless you seriously only run ONE SINGLE Python project at a time and do not care what else silently breaks. Having to run every Python app in its own Docker image (which is the only real solution to this, if you don't want to learn Nix, which you really should, because it is better thanks to determinism... but entails its own set of issues) is not a reasonable compromise.
This is incoherent to me. Your complaints are about packaging, but the elixir wrapper doesn't deal with that in any way -- it just wraps UV, which you could use without elixir.
What am I missing?
Also, typically when people say things like
> Tell me, which combination of the 15+ virtual environments, dependency management and Python version managers
It means they have been trapped in a cycle of thinking "just one more tool will surely solve my problem", instead of realising that the tools _are_ the problem, and if you just use the official methods (virtualenv and pip from a stock python install), things mostly just work.
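For what it's worth, the "official methods" really are just the standard library plus pip. A minimal sketch, assuming nothing beyond a stock CPython install (the package name is only an example):

    import subprocess, sys, venv

    venv.EnvBuilder(with_pip=True).create(".venv")            # same as: python -m venv .venv
    pip = ".venv/Scripts/pip.exe" if sys.platform == "win32" else ".venv/bin/pip"
    subprocess.run([pip, "install", "requests"], check=True)  # installs into the venv only
    # run the project with the venv's python so its deps never touch the system install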
that's not good enough. If I'm in the business of writing Python code, I (ideally) don't want to _also_ be in the business of working around Python design deficiencies. Either solve the problem definitively, or do not try to solve the problem at all, because the middle road just leads to endless headaches for people WHILE ALSO disincentivizing a better solution.
Node has better dependency management than Python- And that's really saying something.
The thing is, most people who are writing python code are not in the business of writing python code. They're students, scientists, people with the word "business" or "analyst" in their title. They have bigger fish to fry than learning a different language ecosystem.
It took 30 years to get them to switch from excel to python. I think it's unrealistic to expect that they're going to switch from python any time soon. So for better or worse, these are problems that we have to solve.
I agree. Python certainly had its speedbumps, but it's utterly manageable today and has been for years and years. It seems like people get hung up on there not being 1 official way to do things, but I think that's been great, too: the competition gave us nice things like Poetry and UV. The odds are slim that a Rust tool would've been accepted as the official Python.org-supplied system, but now we have it.
There are reasons to want something more featureful than plain pip. Even without them, pip+virtualenv has been completely usable for, what, 15 years now?
I've seen issues with pip + virtualenv (SSL lib issues, IIRC). I've always used those at minimum and have still run into problems. (I like to download random projects to try them out.) I've also seen issues with Python projects silently becoming stale and not working, or Python projects walking over other Python projects, because pip + virtualenv does NOT encompass all Python deps down to the metal. It also doesn't let you have 2 command-line Python apps available in the same shell environment, because PATH would have to prefer one or the other at some point.
Here's a question- If you don't touch a project in 1 year, do you expect it to still work, or not? If your answer is the latter, then we simply won't see eye-to-eye on this.
> at least be able to use Python, but in a very controlled, not-insane way
That's funny, about 10 years ago I started my career in a startup that had Python business logic running under Erlang (via a custom connector) which handled supervision and task distribution, and it looked insane to me at the time.
Even today I think it can be useful but is very hard to maintain, and containers are a good enough way to handle python.
> containers are a good enough way to handle python
I disagree. My take on that is that they are an ugly enough way to handle Python. And, among other problems, don't permit you to easily mess with the code (one of many reasons why this is ugly). Need access to something stateful from the container app? That's another PITA.
I feel you on a lot of this! But out of the box Python support? Does anybody actually want that? It’s pretty darn quick & straightforward to get a Python environment up & running on MacOS. Maybe I’m misunderstanding what you mean here.
>it’ll run reliably on other people’s machines a few years from now
That's optimistic. What if the system Python gets upgraded? For some reason, Python libraries tend to be super picky about the Python versions they support (not just Python 2 vs 3).
That's using a Linux VM. The idea people are asking about is native process isolation. Yes you'd have to rebuild Docker containers based on some sort of (small) macOS base layer and Homebrew/Macports, but hey. Being able to even run nodejs or php with its thousands of files natively would be a gamechanger in performance.
Also, if it were possible to containerize macOS, or even do an unattended VM installation, then it'd be possible for Apple to automatically regression test their stuff.
Honest question: why do you want this in MacOS? Do you understand what docker does? (it's fundamentally a linux technology, unless you are asking for user namespaces and chroot w/o SIP on MacOS, but that doesn't make sense since the app sandbox exists).
MacOS doesn't have the fundamental ecosystem problems that beget the need for docker.
If the answer is "I want to run docker containers because I have them" then use orbstack or run linux through the virtualization framework (not Docker desktop). It's remarkably fast.
I have a small rackmounted rendering farm using Mac minis, which outperform everything in the Intel world, even machines an order of magnitude more expensive.
I've run macOS on my personal and development computers for over a decade, and I've used Linux since its inception on the server side.
My experience: running server-side macOS is such a PITA it's not even funny. It may even pretend it has SSH, while in fact the SSH server is only available on good days and only after Remote Desktop has logged in at least once. launchd makes you wanna crave systemd. Etc., etc.
So, about docker. I would absolutely love to run my app in a containerized environment on a Mac in order to not touch the main OS.
Funny, I ran a bunch of Mac minis in colo for over a decade with no problems. Maybe you have a config problem?
Of course, I had a LOM/KVM and redundant networking etc. They were substantially more reliable than the Dell equipment that I used in my day job for sure.
Software-wise it behaves much differently from what you'd expect. For example, macOS won't let you in over SSH until you log in via Remote Desktop. You'll get "connection closed" immediately.
Or sometimes it will.
And that depends not on the count of connection attempts or anything you can do locally but rather on the boot process somehow. Sometimes it boots in a way that permits ssh, sometimes not. The same computer, the same OS.
Then after you login on screen sharing and log out, macOS will let you in over ssh. For a few days. And then again will force you to login via GUI. Or maybe not. I have no idea what makes it.
I have trouble reading macOS logs or understanding it. It spews a few log messages per second even idle. If you grep ssh these messages contain zero actionable data, like "unsuccessful attempt" or similar.
Another complaint is that launchd reports the same "I/O error" on absolutely all error situations, from syntax error in plist to corrupt binary. Makes development and debugging of launchagents very fun.
What would a containerization environment on MacOS give you that you don't already have? Like concretely - what does containerization mean in the context of a MacOS user space?
In Linux, it means something very specific: a user/mount/pid/network namespace, overlayfs to provide a rootfs, chroot to pivot to the new root to do your work, and port forwarding between the host/guest systems.
On MacOS I don't know what containerization means short of virtualization. But you have virtualization on MacOS already, so why not use that?
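To make "very specific" concrete, here's a minimal sketch of those primitives using nothing but the standard library. Assumptions: Linux only, Python 3.12+ (for os.unshare), and a kernel/distro that allows unprivileged user namespaces:

    import os, socket, subprocess

    # new user, UTS, and mount namespaces for this process
    os.unshare(os.CLONE_NEWUSER | os.CLONE_NEWUTS | os.CLONE_NEWNS)
    socket.sethostname("sandbox")    # only visible inside the new UTS namespace
    subprocess.run(["hostname"])     # prints "sandbox"; the host's hostname is untouched
    # a real runtime would also set up a PID/network namespace, an overlayfs root,
    # chroot/pivot_root into it, and port forwarding, per the list above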
On macOS probably I'd like chroot and pid/mount namespaces. I'd like to install OS and dependencies in a container and run my application from there so that it does not interfere with host OS. My app is GPU heavy and has lots of dependencies (OpenCV, LAPACK, armadillo, lots and lots) and I'd like to not pollute the host OS with it.
Also I want to run the latest OS with all security patches on the host while having a stable and known macOS version in a container given how developer-hostile Apple is.
What you want is virtualization, not containerization. And you have this already. Since MacOS doesn't have a stable syscall interface, decoupling the host/guest in a mount namespace and chroot would lead to horrible breakages when the system libraries of your container are out of date with your host OS. So you would have to share the host OS and a big portion of the userspace to begin with.
Or you can package your app as a .app and not worry about it, there's no "pollution" when everything is bundled.
Yeah, seems like on macOS that level of isolation is achievable solely with virtualization unlike in Linux. We were talking about missing things in macOS and containerization is one of them.
> MacOS doesn't have the fundamental ecosystem problems that beget the need for docker.
Anyone wanting to run and manage their own suite of Macs to build multiple massive iOS and Mac apps at scale, for dozens or hundreds or thousands of developers deploying their changes.
xcodebuild is by far the most obvious "needs native for max perf" but there are a few other tools that require macOS. But obviously if you have multiple repos and apps, you might require many different versions of the same tools to build everything.
Sounds like a perfect use case for native containers.
Docker Desktop now offers an option to use the virtualization framework, and works pretty well. But you're still constantly running a VM because "docker is how devs work now right?". I agree with your comment.
I torrent things from two different hosts on my gigabit network. The macos stack literally cannot handle the full bandwidth I have. It fails and the machine needs to be rebooted to fix it. It’s not pretty on the way into this state, either. Other remote connections to the computer are unreliable. On Linux, running the same app in a docker container works perfectly. Transmission is the app.
I get nearly 10Gbps from my NAS to my Mac Studio. It absolutely can handle that bandwidth. It may not handle that specific client well for unrelated reasons.
I went to Transmission years and years ago because it's just simple. It has all the options if you need them, but no HUUUGE interface with RSS feeds, 10001 stats about your download, categories, tags, etc etc etc.
Transmission is just a small, floating window with your downloads. Click for more. It fits in the macOS vibe. But I'm a person that fully adopted the original macOS "way of working" - kicked the full-screen habit I had in windows and never felt better.
Can I ask, why would you go FROM Transmission to qBittorrent?
>why would you go FROM Transmission to qBittorrent?
In my case: some torrents wouldn't find known-good seeds in Transmission but worked fine in qBittorrent; there's reasonable (but not perfect) support for libtorrent 2.0 in qBittorrent; my download speeds and overall responsiveness are anecdotally better in qBittorrent; and I make use of some of the nitty-gritty settings in qBittorrent.
Well there's a list of good reasons! Thanks for answering. I haven't had any problems with finding seeds, and no need for libtorrent but now I know how to fix that when I do encounter those situations.
The Linux version, in a container no less, handles the entire gigabit bandwidth.
And let's be clear, it wasn't the app that had problems, the Apple Remote Desktop connection to the machine failed when the speeds got above 40MB/s and the network interface stopped working around 80MB/s.
I think Transmission works perfectly fine. I've been using it for 10+ years with no issues at all on Linux.
I forgot to mention this is a Mac mini/Intel (2018).
I haven't had any issue running BiglyBT on my M1 MacBook, granted I don't run it all day every day but everything runs plenty fast for my needs (25-30 MB/s for well-seeded torrents).
we’ve heard that claim for the past three years, but every effort by them points to the opposite. don’t get me wrong, I would love for Apple Intelligence to be smart enough on my iPhone and on my Mac, but honestly, the current version is a complete disappointment.
if that were the case, then it would definitely help Apple Intelligence if the iPhone and Mac had higher amounts of RAM. But the base MacBook Pro announced by Apple a while ago had 8GB of RAM, and even the Pro versions of the iPhone have 8GB, whereas 12GB, 16GB, or even more RAM is very common in Android devices, which helps users run relatively large language models on-device.
Apple have been putting ML models running on their own silicon into production for far longer than any of their competitors. They publish some of the most innovative ML research
They also own distribution to the wealthiest and most influential people in the world
The previous M2 Ultra topped out at 192GB of memory, or 128GB for the Pro and some other M3 models, which I think is plenty for even 99.9% of professional tasks.
They now bump it to 512GB, along with an insane price tag of $9,499 for the 512GB Mac Studio. I am pretty sure this is some AI gold rush.
Every single AI shop on the planet is trying to figure out whether there is enough compute to make this a reasonable AI path. If the answer is yes, that 10k is an absolute bargain.
No, because there is no CUDA. We have fast and cheap alternatives to NVIDIA, but they do not have CUDA. This is why NVIDIA has 90% margins on its hardware.
Can you do absolutely everything? No. But most models will run or retrain fine now without CUDA. This premise keeps getting recycled from the past, even as that past has grown ever more distant.
CUDA is becoming more critical, not less, every day. Software developed around CUDA is vastly outpacing what other companies produce. And saving a few millions when creating new models doesn't matter; NVIDIA is pretty efficient at scale.
I don't know if you've heard, but NVIDIA is about to add a monthly payment for additional CUDA features and I'm almost certain that many big companies will be happy to pay for them.
> But most models will run or retrain fine now without CUDA.
This is correct for some small startups, not big companies.
CUDA is incredibly important still. It's still an incredible amount of work to get packages working on multiple GPU paradigms, and by default everyone still starts with CUDA.
The example I always give is FFT libraries - if you compare cuFFT to rocFFT. rocFFT only just released support for distributed transforms in December 2024, something you've been able to do since CUDA Toolkit v8.0, released in 2017. It's like this across the whole AMD toolkit, they're so far behind CUDA it's kind of laughable.
The higher end NVidia workstation boxes won’t run well on normal 20amp plugs. So you need to move them to a computer room (whoops, ripped those out already) or spend months getting dedicated circuits run to office spaces.
Didn't really think about this before, but that seems to be mainly an issue in Northern / Central America and Japan. In Germany, for example, typical household plugs are 16A at 230V.
While technically true, the NEMA 5-15R receptacles are rated for use on 20A circuits, and circuits for receptacles are almost always 20A circuits, in modern construction at least. Older builds may not be, of course.
That said, if your load is going to be a continuous load drawing 80% of the rated amperage, it really should be a NEMA 5-20 plug and receptacle, the one where one of the prongs is horizontal instead of vertical. Swapping out the receptacle for one that accepts a NEMA 5-20P plug is like $5.
If you are going to actually run such a load on a 20A circuit with multiple receptacles, you will want to make sure you're not plugging anything substantial into any of the other receptacles on that circuit. A couple LED lights are fine. A microwave or kettle, not so much.
To clarify, the circuit is almost always 20A, with 15A being used for lighting. However, the outlet itself is almost always 15A because you put multiple outlets on a single circuit. You are going to see very few 20A outlets (which have a T-shaped prong) in residential construction.
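The napkin math behind that, assuming the usual 80% continuous-load derating for US circuits (the German figure is just nameplate volts times amps):

    def continuous_watts(volts: float, breaker_amps: float, derate: float = 0.8) -> float:
        return volts * breaker_amps * derate

    print(continuous_watts(120, 15))        # ~1440 W continuous on a 15A US circuit
    print(continuous_watts(120, 20))        # ~1920 W continuous on a 20A US circuit
    print(continuous_watts(230, 16, 1.0))   # ~3680 W nameplate on a 16A/230V circuit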
Is this actually true? Were people doing this with the 192gb of the M2 Ultra?
I'm curious to learn how AI shops are actually doing model development if anyone has experience there. What I imagined was: Its all in the "cloud" (or, their own infra), and the local machine doesn't matter. If it did matter, the nvidia software stack is too important, especially given that a 512gb M3 Ultra config costs $10,000+.
True. But with Project Digits supposedly around the corner, costing $3,000, supporting ConnectX, and running Blackwell, what's the over-under on just buying two of those at about half the price of one maxed-out M3 Ultra Mac Studio?
It's half that of a max-spec Mac Studio, but also half the price and eight times faster memory speed. Realistically, which open source LLMs does 512GB over 256GB of memory unlock? My understanding is that the true bleeding-edge ones like R1 won't even handle 512GB well, especially with the anemic memory speed.
I agree project digits looks to be the better all-around option for AI researchers, but I still think the Mac is better for people building products with AI
Re memory speed, digits will be at 273GB/s while the Mac Studio is at 819GB/s
Not to mention the Mac has six Thunderbolt 5 ports (up to 120Gb/s each) and can easily be used for video editing, app development, etc.
We really should see what happens when Project Digits is finally released. Also, I would love it if NVIDIA decided to get into the CPU/GPU + unified memory space.
I can't imagine the M3 Ultra doing well on a model that loads into ~500G, but they should be a blast on 70b models (well, twice as fast as my M3 Max at least) or even a heavily quantized 400b model.
No AI shop is buying macs to use as a server. Apple should really release some server macOS distribution, maybe even rackable M-series chips. I believe they have one internally.
True. But an AI shop doesn't care about that. They get more performance for the money by going for multiple Nvidia GPUs. I have 512 GB ram on my PC too with 8 memory channels, but it's not like it's usable for AI workloads. It's nice to have large amounts of RAM, but increasing the batch size during training isn't going to help when compute is the bottleneck.
The question will be how it will perform. I suspect DeepSeek and Llama 405B demonstrated the need for larger memory. Right now folks could build an Epyc system with that much RAM or more to run DeepSeek at about 6 tokens/sec for a fraction of that cost. However, not everyone is a tinkerer, so there's a market for this among those who don't want to be bothered. You say "AI gold rush" like it's a bad thing; it's not.
With all things semiconductor, low volume = higher cost (and margin).
The people who need the crazy resource can tie it to some need that costs more. You’d spend like $10k running a machine with similar capabilities in AWS in a month.
You have a point; technically they aren't impossible to run if you have enough system RAM (or hell, SSD/HDD space for that matter). But in practice, neither running on the CPU nor running on the GPU by constantly paging data in and out of VRAM is a very attractive option (~10x slowdown at least).
So the only reason the mac is faster is because the RAM is accessible by its GPU, right? Not because the RAM is faster than regular RAM, because AFAIK it isn't far off from workstation RAM speeds.
I think the answer is because they can (there is a market for it). The benefit to a crazy person like me is that with this addition, I might be able to grab the 128GB version at a lower price.
It's not though. For consumer computers somewhere in the $1K-4K range there's nothing better. But for the price of 512GB of RAM you could buy that + a crazy CPU + 2x 5090s by building your own. The market fit is "needs power; needs/wants macOS; has no budget limit", which is incredibly niche. But in terms of raw compute output there's absolutely no chance this is providing bang for buck.
2x 5090s would only give you 64GB of memory to work with re:LLM workloads, which is what people are talking about in this thread. The 512GB of system RAM you’re referring to would not be useful in this context. Apple’s unified memory architecture is the part you’re missing.
When running LLMs on Docker with an Apple M3 or M4 chip, they will operate in CPU mode regardless of the chip's class, as Docker only supports Nvidia and Radeon GPUs.
If you're developing LLMs on Docker, consider getting a Framework laptop with an Nvidia or Radeon GPU instead.
Source: I develop an AI agent framework that runs LLMs inside Docker on an M3 Max (https://kdeps.com).
A server with 512GB of high-bandwidth, GPU-addressable RAM is probably a six-figure expenditure. If memory is your constraint, this is absolutely the machine for you.
(sorry, should have specified that the NPU and GPU cores need to access that ram and have reasonable performance). I specified it above, but people didn't read that :-)
CUDA has had managed memory for a long time now. You absolutely can address the entire host memory from your GPU. It will fetch it, if it's needed. Not fast, but addressable.
There isn't anything particularly high-bandwidth about Apple's DDR5 implementation, either. They just have a lot of channels, which is why I compared it to a 24-channel EPYC system. I agree that their integrated GPU architecture hits a unique design point that you don't get from nvidia, who prefer to ship smaller amounts of very different kinds of memory. Apple's architecture may be more suited to some workloads but it hasn't exactly grabbed the machine learning market.
M3 Ultra has 819GB/s, and a single epyc cpu with 12 channels has 460GB/s. As far as I know, llama.cpp and friends don’t scale across multiple sockets so you can’t use a dual socket Turin system to match the M3 Ultra.
Also, 32GB DDR5 RDIMMs are ~$200 each, so that's ~$5K for 24 right there. Then you need 2 CPUs at ~$1K each for the cheapest option, and a motherboard that's another $1K. So for ~$8K (more, given you need a case, power supply, and cooling!), you get a system with about half the memory bandwidth, much higher power consumption, and a much larger footprint.
Partial correction, an Epyc CPU with 12 channels has 576 GB/s, i.e. DDR5-6000 x 768 bits. That is 70% of the Apple memory bandwidth, but with possibly much more memory (768 GB in your example).
You do not need 2 CPUs. If however you use 2 CPUs, then the memory bandwidth doubles, to 1152 GB/s, exceeding Apple by 40% in memory bandwidth. The cost of the memory would be about the same, by using 16 GB modules, but the MB would be more expensive and the second CPU would add to the price.
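The 460GB/s and 576GB/s figures above are just different assumed DIMM speeds on the same 12 x 64-bit channels; quick arithmetic (the two speeds here are my assumption about what each poster had in mind):

    channels, bytes_per_channel = 12, 8
    for mt_s in (4800, 6000):
        print(mt_s, channels * bytes_per_channel * mt_s / 1000, "GB/s")  # 460.8 and 576.0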
Perhaps this is incorrect now, but I also know with 2x 4090s you don’t get higher tokens per second than 1x 4090 with llama.cpp, just more memory capacity.
(All of this only applies to llama.cpp; I have no experience with other software and how memory bandwidth may scale across sockets.)
The memory bandwidth does double, but in order to exploit it the program must be written and executed with care in the memory placement, taking into account NUMA, so that the cores should access mostly memory attached to the closest memory controller and not memory attached to the other socket.
With a badly organized program, the performance can be limited not by the memory bandwidth, which is always exactly double for a dual-socket system, but by the transfers on the inter-socket links.
Moreover, your link is about older Intel Xeon Sapphire Rapids CPUs, with inferior memory interfaces and with more quirks in memory optimization.
There are published benchmark results on the scaling of llama.cpp and DeepSeek on some dual-socket AMD systems.
While it was rather tricky, after many experiments they have obtained an almost double speed on two sockets, especially on AMD Turin.
However, if you look at the actual benchmark data, that must be much lower than what is really possible, because their test AMD Turin system (named there P1) had only two thirds of the memory channels populated, i.e. performance limited by memory bandwidth could be increased by 50%, and they had 16-core CPUs, so performance limited by computation could be increased around 10 times.
CPUs do not have enough compute typically. You'll be compute bottlenecked before bandwidth if the model is large enough.
Time to first token, context length, and tokens/s are significantly inferior on CPUs when dealing with larger models even if the bandwidth is the same.
One big server CPU can have a computational capability similar to a mid-range desktop NVIDIA GPU.
When used for ML/AI applications, a consumer GPU has much better performance per dollar.
Nevertheless, when it is desired to use much more memory than in a desktop GPU, a dual-socket server can have higher memory bandwidth than most desktop GPUs, i.e. more than an RTX 4090, and a computational capability that for FP32 could exceed an RTX 4080, but it would be slower for low-precision data where the NVIDIA tensor cores can be used.
True, but I have compared the FP32 used in graphics computations because for that the throughput information is easily available.
Both CPUs (with the BF16 instructions and with the VNNI instructions for INT8 inference) and the GPUs have a higher throughput for lower precision data types than for FP32, but the exact acceleration factors are hard to find.
The Intel server CPUs have the advantage vs. AMD that they also have the AMX matrix instructions, which are intended to compete for inference applications with the NVIDIA tensor cores, but the Intel CPUs are much more expensive for a number of cores big enough to be competitive with GPUs.
The bandwidth difference likely doesn't make a difference though. Benchmarks of Apple Silicon show that compute becomes the bottleneck well before bandwidth runs out, even when fully loading all CPU cores, the GPU, etc.
Ah, seems like I was remembering the price of a higher-tier CPU, which can cost $6k on its own.
Thinking about it, you can get a decent 256GB on consumer platforms now too, but the speed will be a bit crap, and you'd need to make sure the platform fully supports ECC UDIMMs.
In a dual-socket EPYC system, the memory bandwidth is higher than in this Apple system by 40% (i.e. 1152 GB/s), and the memory capacity can be many times higher.
Like another poster said, 768 GB of ECC RDIMM DDR5-6000 costs around $5000.
Any program whose performance is limited by memory bandwidth, as can frequently be the case for inference, will run significantly faster in such an EPYC server than in the Apple system, even when running on the CPU.
Even for computationally-limited programs, the difference between server CPUs and consumer GPUs is not great. One Epyc CPU may have about the same number of FP32 execution units as an RTX 4070, while running at a higher clock frequency (but it lacks the tensor units of an NVIDIA GPU, which can greatly accelerate the execution, where applicable).
> Any program whose performance is limited by memory bandwidth, as can frequently be the case for inference, will run significantly faster in such an EPYC server than in the Apple system, even when running on the CPU.
Source on this? CPUs would be very compute constrained.
According to Apple, the GPU of M3 Ultra has 80 graphics cores, which should mean 10240 FP32 execution units, the same like an NVIDIA RTX 4080 Super.
However Apple does not say anything about the GPU clock frequency, which I assume that it is significantly less than that of NVIDIA.
In comparison, a dual-socket AMD Turin can have up to 12288 FP32 execution units, i.e. 20% more than an Apple GPU.
Moreover, the clock frequency of the AMD CPU must be much higher than that of the Apple GPU, so the AMD system is likely to be at least twice as fast as the Apple M3 Ultra GPU for some graphics computations.
I do not know what facilities exist in the Apple GPU for accelerating the computations with low-precision data types, like the tensor cores of NVIDIA GPUs.
While for graphic applications big server CPUs are actually less compute constrained than almost all consumer GPUs (except RTX 4090/5090), the GPUs can be faster for ML/AI applications that use low-precision data types, but this is not at all certain for the Apple GPU.
Even if the Apple GPU happens to be faster for some low-precision data type, the difference cannot be great.
However a server that would beat the Apple M3 Ultra GPU computationally would cost much more than $10k, because it would need CPUs with many cores.
If the goal is only to have a system with 50% more memory and 40% more memory bandwidth than the Apple system, that can be done at a $10k price.
While such a system would become compute constrained more often than an Apple GPU, it would still beat it every time when the memory would be the bottleneck.
I have just compared the FP32 computational capabilities, i.e. what is used for graphics, between the Apple M3 Ultra GPU and AMD server CPUs, because these numbers are easily available and they demonstrate the size relationships between them.
Both GPUs and server CPUs have greater throughputs for lower precision data (CPUs have instructions for BF16 and INT8 inference), but the exact acceleration factors are hard to find and it is more difficult to estimate the speeds without access to such systems for running benchmarks.
Anecdotal but it seems like the big EPYC rigs are getting very low tokens per second, and not even consistent. They are strained, as opposed to e.g. M3 Ultra that can likely sustain 40-50 tokens/s based on previous stats.
I'd like to see some proper benchmarking on this though, but it looks like the Apple systems might just be extremely good value if you want to run the large DeepSeek model.
Are the benchmarks worse? Running LLMs in system memory is rather painful. I am having a hard time finding benchmarks for running large models using system memory. Can you point me to some benchmarks you’re referring to?
If you're going to overhaul your entire AI workflow to use a different API anyway, surely the AMD Instinct accelerator cards make more sense. They're expensive, but also a lot faster, and you don't need to deal with making your code work on macOS.
They updated the Studio to the M3 Ultra now, so the M4 Ultra can presumably go directly into the Mac Pro at WWDC? Interesting timing. Maybe they'll change the form factor of the Mac Pro, too?
Additionally, I would assume this is a very low-volume product, so it being on N3B isn't a dealbreaker. At the same time, these chips must be very expensive to make, so tying them with luxury-priced RAM makes some kind of sense.
Interestingly, Apple apparently confirmed to a French website that M4 lacks the interconnect required to make an "Ultra" [0][1], so contrary to what I originally thought, they maybe won't make this after all? I'll take this report with a grain of salt, but apparently it's coming directly from Apple.
Makes it even more puzzling what they are doing with the M2 Mac Pro.
My understanding was that Apple wanted to figure out how to build systems with multi-SOCs to replace the Ultra chips. The way it is currently done means that the Max chips need to be designed around the interconnect. Theoretically speaking, a multi-SOC setup could also scale beyond two chips and into a wider set of products.
I'm not sure multi-SoC is feasible, because making 2 GPUs appear to the OS as one big GPU is very hard if the SoCs are separate.
Honestly I don't think we'll see the M4 Ultra at all this year. That they introduced the Studio with an M3 Ultra tells me M4 Ultras are too costly or they don't have capacity to build them.
And anyway, I think the M2 Mac Pro was Apple asking customers "hey, can you do anything interesting with these PCIe slots? because we can't think of anything outside of connectivity expansion really"
RIP Mac Pro unless they redesign Apple Silicon to allow for upgradeable GPUs.
> Maybe they'll change the form factor of the Mac Pro, too?
Either that or kill the Mac Pro altogether, the current iteration is such a half-assed design and blatantly terrible value compared to the Studio that it feels like an end-of-the-road product just meant to tide PCIe users over until they can migrate everything to Thunderbolt.
They recycled a design meant to accommodate multiple beefy GPUs even though GPUs are no longer supported, so most of the cooling and power delivery is vestigial. Plus the PCIe expansion was quietly downgraded, Apple Silicon doesn't have a ton of PCIe lanes so the slots are heavily oversubscribed with PCIe switches.
I agree. Nonetheless, I agree with Siracusa that the Mac Pro makes sense as a "halo car" in the Mac lineup.
I just find it interesting that you can currently buy a M2 Ultra Mac Pro that is weaker than the Mac Studio (for a comparable config) at a higher price. I guess it "remains a product in their lineup" and we'll hear more about it later.
Additionally: If they wanted to scrap it down the road, why would they do this now?
Isn't the Mac Studio the new trash can? I can't think of how a non-expandable Mac Pro could be meaningfully different to the Studio unless they introduce an even bigger chip above the Ultra.
Indeed, and tbh it really commits even more to the non-expandability that the Trashcan's designers seemed to be going for. After all, the Trashcan at least had replaceable RAM and storage. The Mac Studio has proprietary storage modules for no reason aside from Apple's convenience/profits (and of course the 'integrated' RAM which I'll charitably assume was done for altruistic reasons because of how it's "shared.")
The difference is that today users are accepting modern Macs where they rejected the Trashcan. I think it's because Apple's practices have become more widespread anyway*, and certain parts of the strategy like the RAM thing at least have upsides. That, and the thermals are better because the Trashcan's thermal design was not fit for purpose.
* I was trying to fix a friend's nice Lenovo laptop recently -- it turned out to just have some bad RAM, but when we opened it up we found it was soldered :(
IMO they had plans for a Mac Pro chip that didn’t work out, so they released the M2 version to let their Mac Pro customers know that they’re still committed to the product in the Apple Silicon era.
Could be. I'm not sure if this current incarnation of the Mac Pro signals a commitment to the product though. Same performance as the Mac Studio but 2-3x the price just to get PCI slots.
The Mac Pro could exist as a PCIe expansion slot storage case that accepts a logic board from the frequently updated consumer models. Or multiple Mac Studio logic boards all in one case with your expansion cards all working together.
Let's say you want to have the absolute max memory(512GB) to run AI models and let's say that you are O.K. with plugging a drive to archive your model weights then you can get this for a little bit shy of $10K. What a dream machine.
Compared to Nvidia's Project DIGITS, which is supposed to cost $3K and be available "soon", you can get a specs-matching 128GB & 4TB version of this Mac for about $4,700, and the difference is that you can actually get it in a week and it will run macOS (no idea how much performance difference to expect).
I can't wait to see someone testing the full DeepSeek model on this, maybe this would be the first little companion AI device that you can fully own and can do whatever you like with it, hassle-free.
There’s an argument that replaceable PC parts are what you want at that price point, but Apple usually provides multi-year durability on their machines. An Apple AI brick should last a while.
You can chain multiple Mac Studios using exo for inference; you'd "only" need two of these. There's a bottleneck in the interconnect speed over TB5, but this may not matter as much for a MoE model.
A back of the napkin calculation: 819GB/s / 37GB/tok = 22 tokens/sec.
Realistically, you’ll have to run quantized to fit inside of the 512GB limit, so it could be more like 22GB of data transfer per token, which would yield 37 tokens per second as the theoretical limit.
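If anyone wants to redo that napkin math with their own assumptions, the decode ceiling is just bandwidth divided by the bytes of active weights streamed per token (a rough sketch; the 819 GB/s and 37B-active figures are the ones quoted above, the bytes-per-parameter values are assumptions):

```python
# Upper bound on decode speed when memory bandwidth is the limit:
# each generated token has to stream the active expert weights once.
def max_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param):
    gb_per_token = active_params_b * bytes_per_param  # GB moved per token
    return bandwidth_gb_s / gb_per_token

BANDWIDTH_GB_S = 819   # M3 Ultra figure quoted upthread
ACTIVE_PARAMS_B = 37   # active parameters per token for the big MoE

for label, bpp in [("8-bit (1.0 B/param)", 1.0), ("~4.8-bit quant (0.6 B/param)", 0.6)]:
    tps = max_tokens_per_sec(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, bpp)
    print(f"{label}: ~{tps:.0f} tok/s ceiling")
# ~22 tok/s at 8-bit and ~37 tok/s at ~4.8-bit -- real numbers will be lower,
# since sustained bandwidth is always below the headline figure.
```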
It is likely going to be very usable. As other people have pointed out, the Mac Studio is also not the only option at this price point… but it is neat that it is an option.
How many t/s would you expect? I think I feel perfectly fine when it's over 50.
Also, people have figured out ways to run these things in parallel easily. The device is pretty small; for someone who doesn't mind the price tag, stacking 2-3 of those wouldn't be that bad.
Not sure why you are being downvoted, we already know the performance numbers due to memory bandwidth constraints on the M4 Max chips, it would apply here as well.
525GB/s to 1000GB/s will double the TPS at best, which is still quite low for large LLMs.
I wonder if Apple needs to reconsider Xserve. While Apple probably has some kind of server infrastructure team, making their own server infrastructure out of their own hardware and software sounds like something they could explore. The app ecosystem coupled with Apple's servers, offered in the cloud or as machines you could buy, would be a very interesting service business for them to get into. Apple's App Store needs better apps given how much the hardware is capable of now, especially with iPads using M chips. A cloud-backed hardware and software service specially designed for the app ecosystem sounds very tempting.
The hardware has evolved faster than software at Apple. It’s usually the opposite with most tech companies where hardware is unable to keep up with software.
Thunderbolt 5 (TB5) is pretty handy: you can have a very thin and lightweight laptop, then get access to an external GPU (eGPU) via TB5 if needed [1]. Now you can have your cake (lightweight laptop) and eat it too (potent GPU).
[1] Asus just announced the world’s first Thunderbolt 5 eGPU:
eGPU has a ton of issues on macOS - I've used it for years, but on Apple Silicon it's probably much worse - but let me give a shout out to the amazing (somewhat new) High Performance screen sharing mode added in Sonoma.
When I connect to my Mac Studio via Macbook I can select that mode, then change the Displays setting to Dynamic Resolution and then my 'thin client':
- Is fullscreen using the entire 16:10 Macbook screen
- Gets 60 fps low latency performance (including on actual games)
- Transfers audio, I can attend meetings in this mode
- Blanks the host Mac Studio screen
All things that were impossible via VNC - RDP is much better but this new High Performance Screen Share is even more powerful.
The thin lightweight laptop that remotes into a loaded machine has always been my idea of high mobility instead of suffering a laptop running everything locally. This works via LTE as well with some firewall setup.
When will Apple silicon natively support OSes such as Linux? Apple seems reluctant to release a detailed technical reference manual for the M-series SoCs, which makes running Linux natively on Apple silicon challenging.
Right. Same goes for macOS and all of its convenient software services. Apple might stand to sell more units with a friendlier stance towards Linux, but unless it sells more Apple One subscriptions or increases hardware margins on the Mac, I doubt Cook would consider it.
If you sit around expecting selflessness from Apple you will waste an enormous amount of time, trust me.
As I replied elsewhere here, I do not run any Apple services on my Mac hardware. I do on my iDevices, but that's a different topic. Again, I could be the edge case.
But if you're being pedantic, I meant Apple SaaS requiring monthly payments or any other form of using something from Apple where I give them money outside the purchase of their hardware.
If you're talking background services as part of macOS, then you're being intentionally obtuse to the point and you know it
All seven of them. I kid, I have a lot of sympathy for that position, but as a practical matter running Linux VMs on an M4 works great, you even get GPU acceleration.
That’s what’s weird to me too. It’s not like they would lose sales of macOS, as it is given away with the hardware. So if someone wants to buy Apple hardware to run Linux, it does not have a negative effect on AAPL.
I have Mac hardware and have spent $0 through the Mac App Store. I do not use iCloud on it either. I do on iDevices, though. I must be an edge case.
All of us on HN are basically edge cases. The main target market of Macs is super dependent on Apple service subscriptions.
Maybe that's why they ship with insultingly-small SSDs by default, so that as people's photo libraries, Desktop and Documents folders fill up, Apple can "fix your problem" for you by selling you the iCloud/Apple One plan to offload most of the stuff to only live in iCloud.
Either they spend the $400 up front to get 2 notches up on the SSD upgrade, to match what a reasonable device would come with, or they spend that $400 as $10 a month over the likely 40-month lifetime of the computer. Apple wins either way.
You also lose out on developers. The more macOS users, the more attractive it is to develop for. Supporting Linux would be a loss for the macOS ecosystem, and we all know what that leads to.
There are a large number of macOS users that are not app software devs. There's a large base of creative users that couldn't code their way out of a wet paper bag, yet spend lots of money on Mac hardware.
This forum loses track of the world outside this echo chamber.
I’m among them, even if creative works aren’t my bread and butter (I’m a dev with a bit of an artistic bent).
That said, attracting creative users also adds value to the platform by creating demand for creative software for macOS, which keeps existing packages for macOS maintained and brings new ones on board every so often.
I'm a mix of both, however, my dev time does not create macOS or iDevice apps. My dev is still focused on creative/media workflows, while I still get work for photo/video. I don't even use Xcode any further than running the CLI command to install the necessary tools to have CLI be useful.
While I don't think Apple wants to change course from its services-oriented profit model, surely someone within Apple has run the calculations for a server-oriented M3/M4 device. They're not far behind server CPUs in terms of performance while running a lot cooler AND having accelerated amd64 support, which Ampere lacks.
Whatever the profit margin on a Mac Studio is these days, surely improving non-consumer options becomes profitable at some point if you start selling them by the thousands to data centers.
> So if someone wants to buy Apple hardware to run Linux, it does not have a negative affect to AAPL
It does. Support costs. How do you prove it's a hardware failure or software? What should they do? Say it "unofficially" supports Linux? People would still try to get support. Eventually they'd have to test it themselves etc.
Apple has already been in this spot. With the trashcan Mac Pro, there was an issue with DaVinci Resolve under OS X at the time where the GPU was causing render issues. If you then rebooted into Windows with Boot Camp using the exact same hardware and opened the exact same Resolve project with the exact same footage, the render errors disappeared. Apple blamed Resolve. DaVinci blamed the GPU drivers. The GPU vendor blamed Apple.
I don't think Darwin has been directly distributed in bootable binary format for many years now. And, as far as I know, it has never been made available in that format for Apple silicon.
Apple keeps talking about the Neural Engine. Does anything actually use it? Seems like all the current LLM and Stable Diffusion packages (including MLX) use the GPU.
If the energy efficiency of things like Face ID was really so bad that you need a more efficient M3 Ultra, how come Face ID was integrated into smartphones years ago, apparently without a significant negative impact on battery life?
FaceID was just one example they gave (which is probably faster and more energy efficient now).
Image recognition, OCR, AR and more are applications of the NPU that didn't exist at all on older iPhones because they would have been too intensive for those chips and batteries.
That's false. Face ID is in fact a complex form of image recognition, so image recognition was definitely possible on older NPUs. OCR is the simplest form of image recognition (OCR was literally the first application of LeCun's CNN), so this was definitely possible as well. "AR" is an extremely vague term. If you refer to Snapchat style video overlays, those have been possible for a long time as well.
The original question was asking what features have taken advantage of a NPU. Face ID was introduced with Apple's first "Neural Engine" CPU, the A11 Bionic.
You're confusing this with what features/enhancements new generations of NPUs bring, which nobody else was talking about. Everyone else in the conversation is comparing pre- and post-NPU.
The original question was clearly about the NPU of the currently discussed M3 Ultra, which is twice as large as the previous one. The question is what this one is good for, not what much, much smaller NPUs are good for which have nothing to do with the M3 Ultra topic.
Indeed, but the neural engine does this faster and using heavier models. For example, on-device Siri was not possible until the introduction of the neural engine in 2017.
Historically no, Ollama and the like have only used the CPU+GPU.
That said, there are efforts being made to use the NPU. See: https://github.com/Anemll/Anemll - you can now run small models directly on your Apple Silicon Mac's NPU.
It doesn't give better performance but it's massively more power efficient than using the GPU.
The Neural Engine is useful for a bunch of Apple features, but seems weirdly useless for any LLM stuff... been wondering if they'd address it on any of these upcoming products. AI is so hype right now it seems odd that they have specialised processor that doesn't get used for the kind of AI people are doing. I can see in the latest release:
> Mac Studio is a powerhouse for AI, capable of running large language models (LLMs) with over 600 billion parameters entirely in memory, thanks to its advanced GPU
Wow, incredible. I told myself I’d stop waffling and just buy the next 800gb/s mini or studio to come out, so I guess I’m getting this.
Not sure how much storage to get. I was floating the idea of getting less storage, and hooking it up to a TB5 NAS array of 2.5” SSDs, 10-20tb for models + datasets + my media library would be nice. Any recommendations for the best enclosure for that?
I also want to build the thing you want. There are no multi-SSD M.2 TB5 bays. I made one that holds 4 drives (16TB) at TB3, and even there the underlying drives are far faster than the cable.
Can someone explain what it would take for Apple to overtake NVIDIA as the preferred solution for AI shops?
This is my understanding (probably incorrect in some places)
1. NVIDIA's big advantage is that they design the hardware (chips) and software (CUDA). But Apple also designs the hardware (chips) and software (Metal and MacOS).
2. CUDA has native support in AI libraries like PyTorch and TensorFlow, so it works extra well during training and inference. It seems Metal is well supported by PyTorch (see the sketch after this list), but not well supported by TensorFlow.
3. NVIDIA uses Linux rather than MacOS, making it easier in general to rack servers.
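To illustrate point 2: at the level most users touch, PyTorch's Metal support really is close to a device-string swap (a minimal sketch; nothing here is Apple- or Nvidia-specific beyond the backend availability checks):

```python
import torch

# Pick whichever accelerator backend is present: CUDA on Nvidia boxes,
# MPS (Metal Performance Shaders) on Apple Silicon, else plain CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b  # same high-level code path on either backend
print(device, c.float().mean().item())
```

The gap shows up below this level -- custom kernels, fused ops, and the surrounding ecosystem -- which is where CUDA's lead actually lives.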
It's still boiling down to hardware and software differences.
In terms of hardware - Apple designs their GPUs primarily for graphics workloads, whereas Nvidia has a decades-old lead in optimizing for general-purpose compute. They've gotten really good at pipelining and keeping their raster performance competitive while also accelerating AI and ML. Meanwhile, Apple is directing most of their performance to just the raster stuff. They could pivot to an Nvidia-style design, but that would be pretty unprecedented (even if a seemingly correct decision).
And then there's CUDA. It's not really appropriate to compare it to Metal, both in feature scope and ease of use. CUDA has expansive support for AI/ML primitives and deeply integrated tensor/SM compute. Metal does boast some compute features, but you're expected to write most of the support yourself in the form of compute shaders. This is a pretty radical departure from the pre-rolled, almost "cargo cult" CUDA mentality.
The Linux shtick matters a tiny bit, but it's mostly a matter of convenience. If Apple hardware started getting competitive, there would be people considering the hardware regardless of the OS it runs.
> keeping their raster performance competitive while also accelerating AI and ML. Meanwhile, Apple is directing most of their performance to just the raster stuff. They could pivot to an Nvidia-style design, but that would be pretty unprecedented (even if a seemingly correct decision).
Isn't Apple also focusing on the AI stuff? How has it not already made that decision? What would prevent Apple from making that decision?
> Metal does boast some compute features, but you're expected to write most of the support yourself in the form of compute shaders. This is a pretty radical departure from the pre-rolled, almost "cargo cult" CUDA mentality.
Can you give an example of where Metal wants you to write something yourself whereas CUDA is pre-rolled?
Yes, but not with their GPU architecture. Apple's big bet was on low-power NPU hardware, assuming the compute cost of inference would go down as the field progressed. This was the wrong bet - LLMs and other AIs have scaled up better than they scaled down.
> How has it not already made that decision? What would prevent Apple from making that decision?
I mean, for one, Apple is famously stubborn. They're the last ones to admit they're wrong whenever they make a mistake, presumably admitting that the NPU is wasted silicon would be a mea-culpa for their AI stance. It's also easier to wait for a new generation of Apple Silicon to overhaul the architecture, rather than driving a generational split as soon as the problem is identified.
As for what's preventing them, I don't think there's anything insurmountable. But logically it might not make sense to adopt Nvidia's strategy even if it's better. Apple can't necessarily block Nvidia from buying the same nodes they get from TSMC, so they'd have to out-design Nvidia if they wanted to compete on their merits. Even then, since Apple doesn't support OpenCL, it's not guaranteed that they would replace CUDA. It would just be another proprietary runtime for vendors to choose from.
> Can you give an example of where Metal wants you to write something yourself whereas CUDA is pre-rolled?
Not exhaustively, no. Some of them are performance-optimized kernels like cuSPARSE, some others are primitive sets like cuDNN, and others yet are graph and signal processing libraries with built-out support for industrial applications.
To Apple's credit, they've definitely started hardware-accelerating the important stuff like FFT and ray tracing. But Nvidia still has a decade of lead time that Apple spent shopping around with AMD for other solutions. The head-start CUDA has is so great that I don't think Apple can seriously respond unless the executives light a fire under their ass to make some changes. It will be an "immovable rock versus an unstoppable force" decision for Apple's board of directors.
I think betting on low-power NPU hardware wasn't necessarily wrong - if you're Apple you're trying to optimise performance/watt across the system as a whole. So in a context where you're shipping first-party bespoke on-device ML features it can make sense to have a modestly sized dedicated accelerator.
I'd say the biggest problem with the NPU is that you can only use it from Core ML. Even MLX can't access it!
As you say the big world-changing LLMs are scaling up, not down. At the same time (at least so far) LLM usage is intermittent - we want to consume thousands of tokens in seconds, but a couple of times a minute. That's a client-server timesharing model for as long as the compute and memory demand can't fit on a laptop.
It took them a while to develop their Ultra chip, and this is what they had ready? I'm sure they are working on the M4 Ultra, but they are just slow at it.
I bought a refurbished M3 Max to run LLMs (can only go up to 70b with 4-bit quant), and it is only slightly slower than the more expensive M4 Max.
Haven't the Max/Ultra type chips always come much later, close to when the next number of standard chips came out? M2 Max was not available when M2 launched, for example.
I'd also point out that there was a rather awkward situation with M1/M2 chips where lower end devices were getting newer chips before the higher end devices. For example, the 14 and 16-inch MacBooks Pro didn't get a M2 series chip until about 6 months after the 13 and 15-inch MacBooks Air. This left some professionals and power users frustrated.
The M3 Ultra might perform as well as the M4 Max - I haven't seen benchmarks yet - but the newer series is in the higher end devices which is what most people expect.
Not to rain on the Apple parade, but cloud video editing with the models running on H100s that can edit videos based on prompts is going to be vastly more productive than anything running locally. This will be useful for local development with the big Deepseek models though. Not sure if it's worth the investment unless Deepseek is close to the capability of cloud models, or privacy concerns overwhelm everything.
I know it's basically nitpicking competing luxury sports cars at this point, but I am very bothered that existing benchmarks for the M3 show single core perf that is approximately 70% of M4 single core perf.
I feel like I should be able to spend all my money to both get the fastest single core performance AND all the cores and available memory, but Apple has decided that we need to downgrade to "go wide". Annoying.
> both get the fastest single core performance AND all the cores
I'm a major Apple skeptic myself, but hasn't there always been a tradeoff between "fastest single core" vs "lots of cores" (and thus best multicore)?
For instance, I remember when you could buy an iMac with an i9 or whatever, with a higher clock speed and faster single core, or you could buy an iMac Pro with a Xeon with more cores, but the iMac (non-Pro) would beat it in a single core benchmark. Note: Though I used Macs as the example due to the simple product lines, I thought this was pretty much universal among all modern computers.
> hasn't there always been a tradeoff between "fastest single core" vs "lots of cores" (and thus best multicore)?
Not in the Apple Silicon line. The M2 Ultra has the same single core performance as the M2 Max and Pro. No benchmarks for the M3 Ultra yet but I'm guessing the same vs M3 Max and Pro.
I think the traditional reason for this is that other chips like to use complex scheduling logic to have more logical cores than physical cores. This costs single threaded speed but allows you to run more threads faster.
Can anyone with older Mac Studios/Minis comment - do you also notice a "throttling" of the hardware?
I'm not sure if this is me not maintaining it properly (e.g fans having dust block them) - but I've always got this sense that Apple throttles their older devices in some indirect ways. I experience it the most with iPhones - my old iPhone is pretty slow doing basic things despite nothing really changing on it (just the OS updating?)
So my only concern with this is - how many years until it's slow enough to annoy you into buying a new one?
"Just the OS updating" is not insignificant. Software developers, in general, are not known for making sure latest versions of their software run smoothly on older hardware.
Also, performance on iPhones is throttled when your battery is very old. There was a whole class-action lawsuit about it.
No, I don't. There are multiple alternatives including (though not limited to) opt-in battery protection, removing battery DRM to enable hardware repairs, or declaring a first-party recall on the faulty units to replace components that are damaging to the hardware.
Conspicuously, Apple just so happened to pick the one that encouraged people to upgrade the entire phone. You know, an entire phone that is otherwise functional without arbitrary restrictions by the OEM.
Ah, if only we could have the hardware and the freedom of installing a good Linux distribution on top of it. How is Asahi? Is it good enough? I assume that since Asahi is focused on Apple hardware, it should have an easier time figuring out drivers, etc.?
It's just not an easy task. I can't help but compare it to the Nouveau project spending years of effort to reverse-engineer just a few GPU designs. Then Nvidia changed their software and hardware architecture, and things went from "relatively hopeful" to "there is no chance" overnight.
I agree, it's a lot of work, plus Apple is definitely not going to help with the project. Maybe an alternative is something like Framework -- find some good-enough hardware and support it.
Too bad it lacks even the streaming mode SVE2 found in M4 cores. If only Apple would provide a full SVE2 implementation to put pressure on ARM to make it non-optional so AArch64 isn't effectively restricted to NEON for SIMD.
LLMs are primarily "memory-bound" rather than "compute-bound" during normal use.
The model weights (billions of parameters) must be loaded into memory before you can use them.
Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.
Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
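To put rough numbers on the counter-space analogy, here's a quick sizing sketch (the parameter counts and bit-widths are illustrative assumptions, and weights are only a floor -- KV cache and activations come on top):

```python
# Estimate how much memory the weights alone need at a given quantization.
def weights_gb(params_billions, bits_per_param):
    return params_billions * bits_per_param / 8  # 1B params ~= 1 GB at 8 bits

for name, size_b in [("70B dense", 70), ("405B dense", 405), ("671B MoE (total)", 671)]:
    row = ", ".join(f"{bits}-bit: ~{weights_gb(size_b, bits):.0f} GB" for bits in (16, 8, 4))
    print(f"{name} -> {row}")
# A 671B model is ~336 GB of raw weights at 4-bit (more with real quant formats),
# which is why a 512 GB pool of GPU-addressable memory is the headline feature here.
```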
It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents etc is going to be slower.
I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.
FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.
Sure, on consumer GPUs, but that is not what is constraining model inference in most actual industry setups. Technically, even then you are bound more by CPU-GPU memory bandwidth than by GPU memory alone, although that is maybe splitting hairs.
A beefy GPU which can't hold models in VRAM is of very limited use. You'll see 16 GB of VRAM on gamer Nvidia cards, the RTX 5090 being an exception with 32 GB VRAM. The professional cards have around 96 GB of VRAM.
The thing with these Apple chips is that they have unified memory, where CPU and GPU use the same memory chips, which means that you can load huge models into RAM (no longer VRAM, because that doesn't exist on those devices). And while Apple's integrated GPU isn't as powerful as an Nvidia GPU, it is powerful enough for non-professional workloads and has the huge benefit of access to lots of memory.
No, with limited VRAM you could offload the model partially or split across CPU and GPU. And since CPU has swap, you could run the absolute largest model. It’s just really really slow.
The difference between Deepseek-r1:70b (edit: actually 32b) running on an M4 Pro (48 GB unified RAM, 14 CPU cores, 20 GPU cores) and on an AMD box (64 GB DDR4, 16 core 5950X, RTX 3080 with 10 GB of RAM) is more than a factor of 2.
The M4 pro was able to answer the test prompt twice--once on battery and once on mains power--before the AMD box was able to finish processing.
The M4's prompt parsing took significantly longer, but token generation was significantly faster.
Having the memory to the cores that matter makes a big difference.
You're adding detail that's not relevant to anything I said. I was saying this statement:
> VRAM is what takes a model from "can not run at all" to "can run" (even if slowly), hence the emphasis.
Is false. Regardless of how much VRAM you have, if the criteria is "can run even if slowly", all machines can run all models because you have swap. It's unusably slow but that's not what OP was claiming the difference is.
The criteria for purchase for anybody trying to use it is "run slowly but acceptably" vs. "run so slow as to be unusable".
My memory is wrong, it was the 32b. I'm running the 70b against a similar prompt and the 5950X is probably going to take over an hour for what the M4 managed in about 7 minutes.
edit: an hour later and the 5950 isn't even done thinking yet. Token generation is generously around 1 token/s.
edit edit: final statistics. M4 Pro managing 4 tokens/s prompt eval, 4.8 tokens/s token generation. 5950X managing 150 tokens/s prompt eval, and 1 token/s generation.
Perceptually I can live with the M4's performance. It's a set prompt, do something else, come back sort of thing. The 5950/RTX3080's is too slow to be even remotely usable with the 70b parameter model.
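Those figures are worth plugging into a quick end-to-end estimate, because prefill and generation speed trade off very differently (a sketch; the prompt/output lengths are made up, and the tok/s numbers are the ones reported above):

```python
# Wall-clock time = prompt_tokens / prefill_speed + output_tokens / generation_speed.
# A 150 tok/s prefill can't rescue a 1 tok/s generation rate on a long answer.
def total_minutes(prompt_toks, out_toks, prefill_tps, gen_tps):
    return (prompt_toks / prefill_tps + out_toks / gen_tps) / 60

PROMPT, OUTPUT = 500, 2000  # e.g. a question plus a long "thinking" response

print(f"M4 Pro     : ~{total_minutes(PROMPT, OUTPUT, 4, 4.8):.0f} min")   # ~9 min
print(f"5950X+3080 : ~{total_minutes(PROMPT, OUTPUT, 150, 1):.0f} min")   # ~33 min
```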
I was able to run and use the DeepSeek distilled 24gb on an M1 Max with 64gb of ram. It wasn't speedy, but it was usable. I imagine the M3/4s are much faster, especially on smaller, more specific models.
1. What are various average joe (as opposed to researchers, etc.) use cases for running powerful AI models locally vs. just using cloud AI. Privacy of course is a benefit, but it by itself may not justify upgrades for an average user. Or are we expecting that new innovation will lead to much more proliferation of AI and use cases that will make running locally more feasible?
2. With the amount of memory used jumping up, would there be a significant growth for companies making memories? If so, which ones would be the best positioned?
I don't think there's a huge use-case locally, if you're happy with the subscription cost and privacy. That is, yet. Give it maybe 2 years and someone will probably invent something which local inference would seriously benefit from. I'm anticipating inference for the home appliances (something mac mini form factor that plugs into your router) but that's based on what would make logical sense for consumers, not what consumers would fall for.
Apple seems to be using LPDDR, but HBM will also likely be a key tech. SK Hynix and Samsung are the most reputable for both.
I forgot about Micron, absolutely. TSMC is the supplier for all of these, so you're covering both memory and compute if that's your strategy (the risk is that US TSMC is over provisioning manufacturing based on the pandemic hardware boom).
IMO it's all about privacy. Perhaps also availability if the main LLM providers start pulling shenanigans but it seems like that's not going to be a huge problem with how many big players are in the space.
I think a great use case for this would be in a company that doesn't want all of their employees sending LLM queries about what they're working on outside the company. Buy one or two of these and give everybody a client to connect to it and hey presto you've got a secure private LLM everybody in the company can use while keeping data private.
I’ll add to this that while I couldn’t care less about open AI seeing my general coding questions, I wouldn’t run actual important data through ChatGPT.
With a local model, I could toss anything in there. Database query outputs, private keys, stuff like that. This’ll probably become more relevant as we give LLM’s broader use over certain systems.
Like right now I still mostly just type or paste stuff into ChatGPT. But what about when I have a little database copilot that needs to read query results, and maybe even run its own subset of queries like schema checks? Or some open source computer-use type thingy needs to click around in all sorts of places I don’t want openAI going, like my .env or my bash profile? That’s the kinda thing I’d only use a local model for
One important one that I haven't seen mentioned is simply working without an internet connection. It was quite important for me when I was using AI whilst travelling through the countryside, where there is very limited network access.
That's quite a reductio ad absurdum. No, I don't generate my own electricity (though I could). But I don't use tools for work that can change out from under me at any moment, or that can increase 10x in price on a corporate whim.
And why would that require running AI models locally? You can be in essentially full control by using open source (/open weight) models (DeepSeek etc.) running on exchangable cloud providers that are as replaceable as your electricity provider.
In my country things like electricity and water supply are considered a right and a supplier has to go to court to get a supply shut off. Unfortunately we don't yet consider an internet connection in the same way, despite the government essentially requiring it these days.
This is pretty exciting. Now an organization could produce an open weights mixture of experts model that has 8-15b active parameters but could still be 500b+ parameters and it could be run locally with INT4 quantization with very fast performance. DeepSeek R1 is a similar model but over 30b active parameters which makes it a little slow.
I do not have a good sense of how well quality scales with narrow MoEs but even if we get something like Llama 3.3 70b in quality at only 8b active parameters people could do a ton locally.
Who is this made for? Who needs a personal computer this powerful? Not trying to be funny - it's a genuine question.
Gamers don't generally use a mac because of the lack of games and I'm guessing those who are really into LLMs use Linux for the flexibility. Video editing can be done on much cheaper hardware.
Very rich LLM enthusiasts who wants to try out mac?
I don't think people into LLMs necessarily use Linux; most devs I see around use a Mac, and I think I'll buy one and move off Ubuntu if I start using them more seriously. Performance is the key selling point, as many professional use cases benefit from running LLMs locally.
I used to run Ubuntu with i3 and recently switched to a macbook air. I thought I would hate it coming from i3, but honestly when using the stage manager + tmux & nvim and betterTouchTool for keybinds it feels just as effective. The fact that fullscreen applications get their own desktop is nice too.
Content creation usually, which consists of audio, photo, and video editing mostly. The hardware and software integration makes video editing on a Mac really superior, especially with their ProRes codec.
You can get a good experience on a Windows or Linux machine with DaVinci Resolve, but that’s mostly because of the way better GPUs like the 4090/RTX series you’ve got at your disposal.
I'm guessing it's more because "Ultra" versions, which "fuse" multiple chips take significant additional engineering work. So we might expect an ultra M4 next year, possibly after non-ultra M5s are released.
Reportedly Apple is using its own silicon in data centers to run “Apple Intelligence” and other things like machine translation in safari. I suspect that the initial supply was sent to Apple’s datacenters.
All this hardware, but I don't know how to best utilize it because 1) I am not a pro, and 2) the apps aren't as helpful at making complex jobs easier, which is what old Apple used to do really well.
Reddit's LocalLLama has a lot of these. 3090s are pretty popular for these purposes. But they're not trivial to build and run at home. Among other issues are that you're drawing >1kW for just the GPUs if you have four of them at 100% usage.
6 * 16 is still nowhere near 512GB of VRAM. On top of that, the monster you create requires hyper-specific server-grade hardware, will be huge and loud, and will pull enough power to trip a circuit breaker. I'm sure most people would rather pay a 30 percent premium to get twice the RAM and a power-sipping device that you can hold in the palm of your hand.
How do people feel about the value of the M3 Ultra vs. the M4 Max for general computing, assuming that you max out the RAM on the M4 version of the Studio?
The kinds of workloads that could truly leverage the M2 Ultra over the M2 Max were vanishingly small. When comparing the M3 Ultra to the M4 Max, that number gets even smaller, because the M4 Max will have ~15% higher single-core perf. The insane memory available on the M3 Ultra is its only interesting capability, but it's still not big enough to run the very largest open-source LLMs.
Hot take: You can tie yourself into six knots trying to spin a yarn about why the M3 Ultra spec is super awesome for some AI use-case, meanwhile you could buy a Mac Mini and like 200 million GPT-4o tokens for the cost of this machine that can't even run R1.
I suspect most people running LLMs locally are unable to use the big cloud models for either legal or ethical reasons. If you could use gpt4, you would, it's just not that expensive.
Because it's two M3 Max chips fused with a high-speed link? The M4 Ultra will presumably be comprised of two M4 Max processors fused similarly (next year).
Surprised they didn't do an M4 Ultra. I really hope they don't do an M4 Ultra for the Mac Pro and add in this very undesirable kind of product matrix just for the sake of differentiation. I would be ok with an M3 Extreme in the Mac Pro, however.
Computers these days - the more appealing, exciting, cool, and desirable, the higher the price, into the stratosphere.
$9499
Whatever happened to competition in computing?
Computing hardware competition used to be cut throat, drop dead, knife fight, last man standing brutally competitive. Now it's just a massive gold rush cash grab.
You're thinking about this as if the average Joe is interested in this. They're not. The tech folk salivating in this discussion are a rounding error when it comes to computing. The vast majority has no need for this.
The MacBook Air with the M4 chip, 16GB of RAM, and an amazing display and camera is just $999. During the back-to-school promotion that will happen soon, it’s $899 with free AirPods. That’s really great value given how good the hardware is.
For example, I'll happily feed my entire directory of private notes/diary entries into an LLM running offline on my laptop. I would never do that with someone else's LLM running in the cloud.
I'm from the dark ages and am interested in this for non-AI things like CFD. What is the state of SDK support for these chips? Is there a nice rust or C++ library that abstracts the hardware and lets you just do very big Matrix multiplications?
Wannabe CTO. Engineering manager at some random Wordpress real estate website and an ex-CTO at an "eBay subsidiary". Looks to be an misinformed AMD fanboy as well. Trolls HN trying to advertise a book that was probably written by ChatGPT.
That's all nice, but if they are to be considered a serious AI hardware player, they will need to invest in better support of their hardware in deep learning frameworks such as PyTorch and Jax. Currently the support is rather poor, and is not suitable for any serious work.
So the question is whether the M1/M2 Ultra was limited by GPU/NPU compute or by memory bandwidth at this point?
I'm curious what instruction sets may have been included with the M3 chip that the other two lack for AI.
So far the candidates seem to be NVIDIA digits, Framework Desktop, M1 64gb M2/M3 128gb studio/ultra.
The GPU market isn't competitive enough for the amount of VRAM needed. I was hoping for a Battlemage GPU model with 24GB that would be reasonably priced and available.
For the Framework Desktop and similar devices, I think a second generation will be significantly better than what's currently on offer today. Rationale below...
For a max spec processor with ram at $2,000, this seems like a decent deal given today's market. However, this might age very fast for three reasons.
Reason 1: LPDDR6 may debut in the next year or two; this could bring massive improvements to memory bandwidth and capacity for soldered-on memory.
LPDDR6 vs LPDDR5:
- Data bus width: 24 bits vs 16 bits
- Burst length: 24 vs 15
- Memory bandwidth: up to 38.4 GB/s vs up to 6.7 GB/s
- CAMM RAM may or may not maintain signal integrity as memory bandwidth increases. Until I see it implemented for an AI use case in a cost-effective manner, I am skeptical.
Reason 2: It's a laptop chip with limited PCIe lanes and a reduced power envelope. Theoretically, a desktop chip could have better performance, more lanes, and be socketable (although I don't think I've seen a socketed CPU with soldered RAM).
Reason 3: In addition, what does this hardware look like when repurposed in the future, compared to alternatives?
- Unlike desktop or server counterparts, which can have higher CPU core counts and PCIe/IO expansion, this processor and its motherboard are limited for repurposing later down the line as a server to self-host other software besides AI. I suppose it could be turned into an overkill NAS with ZFS and an HBA controller card in a new case.
- Buying into the Framework Desktop is pretty limited based on the form factor. The next generation might be able to include a fully populated 16x slot and a 10G NIC. That seems about it if they're going to maintain the backward-compatibility philosophy given the case form factor.
I might like Apple again if the SoC could be sold separately and opened up. It would be interesting to see a PC with Asahi or Windows running on Apple’s chips.
Half a terabyte could run 8 bit quantized versions of some of those full size llama and deepseek models. Looking forward to seeing some benchmarks on that.
Is there even an existing replaceable memory standard that would meet the current needs of Apple's "Unified Memory" architecture? I'm not an expert but I'd suspect probably not. The bus probably looks a lot more like VRAM on GPUs, and I've never seen a GPU with replaceable RAM.
CAMM2 could kinda work, but each module is only 128-bit so I think the furthest you could possibly push it is a 512-bit M Max equivalent with CAMM2 modules north, east, west and south of the SOC. There just isn't room to put eight modules right next to the SOC for a 1024-bit bus like the M Ultra.
Framework said that when they built a Strix Halo machine, AMD assigned an engineer to work with them on seeing if there's a way to get CAMM2 memory working with it, and after a bunch of back and forth it was decided that CAMM2 still made the traces too long to maintain proper signal integrity due to the 256 bit interface.
These machines have a 512 bit interface, so presumably even worse.
My understanding is that works out due to the lower clock speeds of those RAM modules though right?
It's getting that bandwidth by going very wide across very many channels, rather than trying to push a gigantic amount of bandwidth through only a few channels.
Yeah, "channels" are just a roundabout way to say "wider bus" and you can't get too much past 128 GB/s of memory bandwidth without leaning heavily into a very wide bus (i.e. more than the "standard" 128 bit we're used to on consumer x86) regardless who's making the chip. Looking at it from the bus width perspective:
- The AI Max+ 395 is a 256 bit bus ("4 channels") of 8000 MHz instead of 128 bits ("2 channels") of 16000 MHz because you can't practically get past 9000 MHz in a consumer device, even if you solder the RAM, at the moment. Max capacity 128 GB.
- 5th Gen Epyc is a 768 bit bus ("12 channels") of 6000 MHz because that lets you use a standard socketed setup. Max capacity 6 TB.
- M3 Ultra is a 1024 bit bus ("16 channels") of "~6266 MHz" as it's 2x the M3 Max (which is 512 bits wide) and we know the final bandwidth is ~800 GB/s. Max capacity 512 GB.
Note: "Channels" is in quotes because the number of bits per channel isn't actually the same per platform (and DDR5 is actually 2x32 bit channels per DIMM instead of 1x64 per DIMM like older DDR... this kind of shit is why just looking at the actual bit width is easier :p).
So really the frequencies aren't that different even though these are completely different products across completely different segments. The overwhelming factor is bus width (channels) and the rest is more or less design choice noise from the perspective of raw performance.
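If it helps, the arithmetic behind all of those numbers is just bus width times transfer rate (a sketch; the widths and data rates are the ones quoted above, and real sustained bandwidth is always somewhat lower):

```python
# Peak DRAM bandwidth in GB/s = (bus width in bits / 8) * transfer rate in MT/s / 1000.
def peak_gb_s(bus_bits, mt_s):
    return bus_bits / 8 * mt_s / 1000

for name, bits, mt_s in [
    ("AI Max+ 395 (Strix Halo)", 256, 8000),
    ("5th-gen EPYC, 12ch DDR5-6000", 768, 6000),
    ("M3 Ultra", 1024, 6266),
]:
    print(f"{name}: ~{peak_gb_s(bits, mt_s):.0f} GB/s")
# ~256, ~576 and ~802 GB/s -- the bus width term is doing nearly all the work.
```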
Entirely possible. Obviously Apple wouldn't have been interested in letting you upgrade the RAM even if it was doable.
I'd love to have more points of comparison available, but Strix Halo is the most analogous chip to an M-series chip on the market right now from a memory point of view, so it's hard to really know anything.
I very much hope CAMM2 or something else can be made to work with a Strix-like setup in the future, but I have my doubts.
I thought so too when they launched the M1, but I soon got corrected.
The memory bus is the same as for modules, it's just very short. The higher end SoCs have more memory bandwidth because the bus is wider (i.e. more modules in parallel).
You could blame DDR5 (who thought having a speed negotiation that can go over a minute at boot is a good idea?), but I blame the obsession with thin and the ability to overcharge your customers.
> I've never seen a GPU with replaceable RAM
I still have one :) It's an ISA Trident TVGA 8900 that I personally upgraded from 512k VRAM to one full megabyte!!!
It is _not_ on die. It's soldered onto the package.
There's a good reason it's soldered, i.e. the wide memory interface and huge bandwidth mean that the extra trace lengths needed for an upgradable RAM slot would screw up the memory timings too much, but there's no need to make false claims like saying it's on-die.
The longer traces are the problem. They want these modules as physically close as possible to the CPU to make the timings work out and maintain signal integrity.
It's the same reason nobody sells GPUs that have user upgradable non-soldered GDDR VRAM modules.
_That_, in itself, wouldn't be that difficult, and there are shared-memory setups that do use modular memory. Where you'd really run into trouble is making it _fast_; this is very, very high bandwidth memory.
You know that memory can be "easily" de-soldered and soldered at home?
The issue is availability of chips, and most likely you have to know which components to change so the new memory is recognised. For instance, that could be changing a resistor to a different value or bridging certain pads.
This viewpoint is interesting. It is not exactly inaccurate, but it does appear to be missing a point. Soldering in itself is a valuable and useful skill, but I can't say you can just get in and start de-soldering willy-nilly as opposed to opening a box and upgrading ram by plopping stuff in a designated spot.
Do you know that "plopping stuff in a designated spot" can also be out of reach for some people? I know plenty who would give their computer to a tech to do the upgrade for them even if they were shown in person how to do all the steps. Soldering is just one step (albeit a fairly big one) above that.
But the fact that this can be done at home with fairly inexpensive tools means a tech person with reasonable skill could do it, so such an upgrade could be available at a computer/phone repair shop if parts were available.
Soldering is not a barrier - that's what I am trying to say.
> Is anyone other than a vanishingly small number of hard core hobbiests going to upgrade from an M4 to an M4 Ultra?
I expect that the 2 biggest buyers of M4 Ultra will be people who want to run LLMs locally, and people who want the highest performance machine they can get (professionals), but are wedded to mac-only software.
It is a bit misleading to do that, but in fairness to Apple, almost nobody is upgrading to this from an M4 Mac, so those are probably more useful comparisons.
For sure. The work that Asahi have done is unbelievable, especially given the size of the team and the challenges they've faced. Would be amazing if there was more support from mainline devs (or better still, Apple directly).
I have a 4090 and, out of curiosity, I looked up the FLOPS in comparison with Apple chips.
Nvidia RTX 4090 (Ada Lovelace)
FP32: Approximately 82.6 TFLOPS
FP16: When using its 4th‑generation Tensor Cores in FP16 mode with FP32 accumulation, it can deliver roughly 165.2 TFLOPS (in non‑tensor mode, the FP16 rate is similar to FP32).
FP8: The Ada architecture introduces support for an FP8 format; using this mode (again with FP32 accumulation), the RTX 4090 can achieve roughly 330.3 TFLOPS (or about 660.6 TOPS, depending on how you count operations).
Apple M1 Ultra
(The previous‑generation top‑end Apple chip)
FP32: Around 15.9 TFLOPS (as reported in various benchmarks)
FP16: By similar scaling, FP16 performance would be roughly double that value—approximately 31.8 TFLOPS (again, an estimate based on common patterns in Apple’s GPU designs)
FP8: Like the M3 family, the M1 Ultra does not support a dedicated FP8 precision mode.
So a $2000 Nvidia 4090 gives you about 5x the FLOPS, but with far less high-speed RAM (24 GB vs. 512 GB from Apple in the new M3 Ultra). The RAM bandwidth on the Nvidia card is over 1 TB/s, compared with 800 GB/s for Apple Silicon.
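For token generation at batch size 1 the bandwidth matters more than the FLOPS, since every weight has to be streamed once per token. A hedged napkin-math sketch (the efficiency factor and model size are assumptions, not measurements):

    def decode_tok_per_s(model_gb: float, bw_gbps: float, eff: float = 0.6) -> float:
        # Upper bound if generation is purely memory-bandwidth limited
        return bw_gbps * eff / model_gb

    print(decode_tok_per_s(20, 1008))  # hypothetical 20 GB model on a 4090: ~30 tok/s
    print(decode_tok_per_s(20, 800))   # same model on the M3 Ultra: ~24 tok/s

The ~5x FLOPS gap shows up mostly in prompt processing, which is compute-bound, and of course only the Mac can hold a model anywhere near 512 GB in the first place.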
Apple is catching up here and I am very keen for them to continue doing so! Anything that knocks Nvidia down a notch is good for humanity.
> Anything that knocks Nvidia down a notch is good for humanity.
I don't love Nvidia a whole lot but I can't understand where this sentiment comes from. Apple abandoned their partnership with Nvidia, tried to support their own CUDA alternative with blackjack and hookers (OpenCL), abandoned that, and began rolling out a proprietary replacement.
CUDA sucks for the average Joe, but Apple abandoned any chance of taking the high road when they cut ties with Khronos. Apple doesn't want better AI infrastructure for humanity; they envy the control Nvidia wields and want it for themselves. Metal versus CUDA is the type of competition where no matter who wins, humanity loses. Bring back OpenCL, then we'll talk about net positives again.
I cannot believe I'm saying this, but: for Apple that's rather cheap. Threadripper boxes with that amount of memory do not come a lot cheaper. Considering Apple's pricing for memory in other devices, $4K for the 96 GB to 512 GB upgrade is a bargain.
It's not that much cheaper than with earlier comparable models. Apple memory prices have been $25/GB for the base and Pro chips and $12.5/GB for the Max and Ultra chips. With the new Studios, we get $12.5/GB until 128 GB and $9.375/GB beyond that.
If you configure a Threadripper workstation at Puget Systems, memory price seems to be ~$6/GB. Except if you use 128 GB modules, which are almost $10/GB. You can get 768 GB for a Threadripper Pro cheaper than 512 GB for a Threadripper, but the base cost of a Pro system is much higher.
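If you want to sanity-check those numbers, here's the arithmetic using the per-GB rates quoted above (the rates are this thread's estimates, not a published price list):

    def studio_upgrade_cost(base_gb: int, target_gb: int) -> float:
        # ~$12.5/GB up to 128 GB, ~$9.375/GB beyond, per the figures above
        return sum(12.5 if gb < 128 else 9.375 for gb in range(base_gb, target_gb))

    print(studio_upgrade_cost(96, 512))  # ~$4000 for the 96 GB -> 512 GB option
    print(512 * 6)                       # ~$3072 for 512 GB of Threadripper RAM at ~$6/GB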
Disappointing announcement. M4 brings a significant uplift over M3, and the ST performance of the M3 Ultra will be significantly worse than the M4 Max.
Even for its intended AI audience, the ISA additions in M4 brought significant uplift.
Are they waiting to put M4 Ultra into the Mac Pro?
With an M3 Ultra going into the Mac Studio, Apple could differentiate from the Mac Pro, which could then get the M4 Ultra. Right now, the Mac Studio and Mac Pro oddly both have the M2 Ultra and same overall performance.
It is exactly the opposite. Every computer architecture in production addresses memory in powers of two.
SI has no business in memory size nomenclature as it is not derived from fundamental physical units. The whole klownbyte change was pushed through by hard drive marketers in the 1990s.
> Every computer architecture in production addresses memory in the powers of two.
What does it mean to "address memory in powers of two" ? There are certainly machines with non-power-of-two memory quantities; 96 GiB is common for example.
> The whole klownbyte change was pushed through by hard drive marketers in 1990s.
The metric prefixes based on powers of 10 have been around since the 1790s.
> What does it mean to "address memory in powers of two" ? There are certainly machines with non-power-of-two memory quantities; 96 GiB is common for example.
I challenge you to show me any SKU from any memory manufacturer that has a power of 10 capacity. Or a CPU whose address space is a power of 10. This is an unavoidable artefact of using a binary address bus.
> The metric prefixes based on powers of 10 have been around since the 1790s.
*bibytes are a practical joke played on computer scientists by the salespeople to make it sound like we’re drunk. “Tell us more about your mebibytes, Fred elbows colleague, listen to this”.
If Donald Knuth and Gordon Bell say we use base-2 for RAM, that’s good enough for me.
It's more complicated than that. Data storage sizes are not connected to fundamental physical units, but data transfer rates are. Things get annoying when a 1 MB/s connection cannot transfer a megabyte in a second.
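For anyone who wants the gap quantified, a two-line sketch:

    GB, GiB = 10**9, 2**30
    print(GiB / GB)       # 1.073741824 -> ~7.4% gap at giga scale, ~10% at tera
    print(2**20 / 10**6)  # 1.048576 s to move one MiB over a 1 MB/s link (ignoring framing)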
Line discipline rarely has sequences of bytes without any service information (parity, delimiters, preambles etc). So I don't see it as a practical issue.
Lots of AI HW is focused on RAM (512GB!). I have a cost-sensitive application that needs speed (300+ TOPS), but only 1GB of RAM. Are there any HW companies focused on that space?
Like others have said, basically traditional GPUs (RTX 40/50 series in particular, 20/30 series have much weaker tensor cores).
In terms of software, recent NVIDIA and AMD research has focused on fast evaluation of small ~4 layer MLPs using FP8 weights for things like denoising, upscaling, radiance caching, and texture and material BRDF compression/decompression.
NVIDIA has just put out some new graphics API extensions and samples/demos for loading a chunk of neural net weights and performing inference from within a shader.
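The networks involved are tiny. A toy sketch of a 4-layer MLP with quantized weights dequantized on the fly (int8 plus a scale standing in for FP8 here; sizes and layer count are illustrative, not taken from any vendor sample):

    import numpy as np

    rng = np.random.default_rng(0)
    layers = []
    for _ in range(4):
        w = rng.standard_normal((64, 64)).astype(np.float32)
        scale = np.abs(w).max() / 127
        layers.append((np.round(w / scale).astype(np.int8), scale))

    def mlp(x):
        for w_q, scale in layers:
            x = np.maximum(x @ (w_q.astype(np.float32) * scale), 0.0)  # dequant + ReLU
        return x

    print(mlp(rng.standard_normal((1, 64)).astype(np.float32)).shape)  # (1, 64)

Evaluating something like this per sample inside a shader is cheap on compute and barely touches memory, which is the opposite profile of big-LLM inference.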
Just buy any gaming card? Even something like the Jetson AGX Orin boasts 275 TOPS (but they add in all kind of different subsystems to reach that number).
The problem with the TOPS is that they add in ~100 TOPS from the "Deep Learning Accelerator" coprocessors, but they have a lot of awkward limitations on what they can do (and software support is terrible). The GPU is an Ampere generation, but there is no strict consumer GPU equivalent.
Yeah VRAM option is good (if it performs well), just sad we'd have to drop 10K to access it tied to a prev gen M3 when they'll likely have M5 by the end of the year.
At 9 grand I would certainly hope that they support the device software-wise longer than they supported my 2017 MacBook Air. I see no reason to be forced to cough up 10 grand to Apple essentially every 7 years; that's ridiculous.
The memory amount is fantastic, the memory bandwidth is half decent (~800 GB/s), and the compute capabilities are terrible (36 TOPS).
For comparison, a single consumer card like the RTX 5090 has only 32 GB of memory, but 1792 GB/s of memory bandwidth and 3593 TOPS of compute.
The use cases will be limited. While you can't run a 600B model directly like Apple says (because you need more memory for that), you can run a quantized version, but it will be very slow unless it's a MoE architecture.
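Napkin math for that last point, with assumed figures (the achievable bandwidth fraction and quantization level are guesses):

    BW = 800 * 0.6  # GB/s actually achievable out of ~800 GB/s, assumed

    def tok_per_s(active_params_b, bytes_per_param):
        return BW / (active_params_b * bytes_per_param)

    print(tok_per_s(670, 0.5))  # dense ~670B at ~4-bit: ~1.4 tok/s, barely usable
    print(tok_per_s(37, 0.5))   # MoE reading only ~37B active params: ~26 tok/s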
The compute level you’re talking about on the M3 Ultra is the neural engine. Not including the GPU.
I expect the GPU here will be behind a 5090 for compute but not by the unrelated numbers you’re quoting. After all, the 5090 alone is multiple times the wattage of this SoC.
Using the NPU numbers grossly overstates the AI performance of the Apple Silicon hardware, so they're actually giving Apple the benefit of the doubt.
Most AI training and inference (including generative AI) is bound by large-scale matrix MACs. That's why Nvidia fills their devices with enormous numbers of tensor cores and Apple / Qualcomm et al are adding NPUs, filling largely the same gap. Only Nvidia's are not just a magnitude+ more performant, they're also massively more flexible (in types and applications) and usable for both training and inference, while Apple's is only useful for a limited set of inference tasks (due to architecture and type limits).
Apple can put the effort in and make something actually competitive with Nvidia, but this isn't it.
Care to share the TOPs numbers for the Apple GPUs and show how this would “grossly overstate” the numbers?
Apple won’t compete with NVIDIA, I’m not arguing that. But your opening line will only make sense if you can back up the numbers and the GPU performance is lower than the ANE TOPS.
Tensor / neural cores are very easy to benchmark and give a precise number because they do a single well-defined thing at a large scale. So GPU numbers are less common and much more use-specific.
However the M2 Ultra GPU is estimated, with every bit of compute power working together, at about 26 TOPS.
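For context on where a number like that comes from, the usual back-of-envelope is ALU count x 2 FLOPs per FMA x clock (all figures below are commonly reported estimates, not Apple-published specs):

    cores, alus_per_core, clock_ghz = 76, 128, 1.4           # M2 Ultra GPU, approximate
    print(cores * alus_per_core * 2 * clock_ghz / 1000)      # ~27 TFLOPS FP32

Without dedicated matrix units there's no big multiplier for lower precisions, hence the ~26 TOPS ballpark.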
Could you provide a link for that TOPS count? (And specifically TOPs with comparable unit sizes since NVIDIA and Apple did not use the same units till recently)
The only similar number I can find is for TFLOPS vs TOPS
Again I’m not saying the GPU will be comparable to an NVIDIA one, but that the comparison point isn’t sensible in the comments I originally replied to.
> After all, the 5090 alone is multiple times the wattage of this SoC.
FWIW, normalizing the wattages (or even underclocking the GPU) will still give you an Nvidia advantage most days. Apple's GPU designs are closer to AMD's designs than Nvidia's, which means they omit a lot of AI accelerators to focus on a less-LLM-relevant raster performance figure.
Yes, the GPU is faster than the NPU. But Apple's GPU designs haven't traditionally put their competitors out of a job.
M2 Ultra is ~250W (averaging various reports since Apple don’t publish) for the entire SoC.
5090 is 575W without the CPU.
You’d have to cut the Nvidia to a quarter and then find a comparable CPU to normalize the wattage for an actual comparison.
I agree that Apple GPUs aren’t putting the dedicated GPU companies in danger on the benchmarks, but they’re also not really targeting it? They’re in completely different zones on too many fronts to really compare.
But I’m not ignoring the power/performance ratio? If anything, you are doing that by handwaving away the difference.
Give me a comparable system build where the NVIDIA GPU + any CPU of your choice is running at the same wattage as an M2 Ultra, and outperforms it on average. You’d get 150W for the GPU and 150W for the CPU.
Again, you can’t really compare the two. They’re inherently different systems unless you only care about singular metrics.
No, I'm not. I'm comparing the TOPS of the M3 Ultra and the tensor cores of the RTX 5090.
If not, what is the TOPS of the GPU, and why isn't Apple talking about it if there is more performance hidden somewhere? Apple states 18 TOPS for the M3 Max. And why do you think Apple added the neural engine, if not to accelerate compute?
The power draw is quite a bit higher, but it's still much more efficient as the performance is much higher.
The ANE and tensor cores are not comparable though. One is literally meant for low cost inference while the others are meant for acceleration of training.
If you squint, yeah they look the same, but so does the microcontroller on the GPU and a full blown CPU. They’re fundamentally different purposes, architectures and scale of use.
The ANE can’t even really be used directly. Apple heavily restricts the use via CoreML APIs for inference. It’s only usable for smaller, lightweight models.
If you're comparing to the tensor cores, you really need to compare against the GPU, which is what gets used by Apple's ML frameworks such as MLX for training etc.
It will still be behind the NVIDIA GPU, but not by anywhere near the same numbers.
>The ANE and tensor cores are not comparable though
They're both built to do the most common computation in AI (both training and inference), which is multiply and accumulate of matrices - A * B + C. The ANE is far more limited because they decided to spend a lot less silicon space on it, focusing on low-power inference of quantized models. It is fantastically useful for a lot of on-device things like a lot of the photo features (e.g. subject detection, text extraction, etc).
And yes, you need to use CoreML to access it because it's so limited. In the future Apple will absolutely, with 100% certainty, make an ANE that is as flexible and powerful as tensor cores, and they force you through CoreML because it will automatically switch to using it (where now you submit a job to CoreML and for many it will opt to use the CPU/GPU instead, or a combination thereof. It's an elegant, forward thinking implementation). Their AI performance and credibility will greatly improve when they do.
>you really need to compare against the GPU
From a raw performance perspective, the ANE is capable of more matrix multiply/accumulates than the GPU is on Apple Silicon, it's just limited to types and contexts that make it unsuitable for training, or even for many inference tasks.
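For the avoidance of doubt about what operation is being counted, this is the whole thing (sizes arbitrary):

    import numpy as np

    M, K, N = 512, 512, 512
    A = np.random.rand(M, K); B = np.random.rand(K, N); C = np.random.rand(M, N)
    D = A @ B + C          # the matrix multiply-accumulate both units exist to do
    print(2 * M * N * K)   # ~268M FLOPs for this one call

Tensor cores and the ANE both accelerate exactly this; they differ in supported types, how you're allowed to feed them, and how much silicon is thrown at the problem.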
So now the TOPS are not comparable because M3 is much slower than an Nvidia GPU? That's not how comparisons work.
My numbers are correct: the M3 Ultra has around 1% of the TOPS performance of an RTX 5090.
Comparing against the GPU would look even worse for Apple. Do you think Apple added the neural engine just for fun? This is exactly what the neural engine is there for.
You’re completely missing the point. The ANE is not equivalent as a component to the tensor cores. It has nothing to do with comparison of TOPs but as what they’re intended for.
Try and use the ANE in the same way you would use the tensor cores. Hint: you can’t, because the hardware and software will actively block you.
They’re meant for fundamentally different use cases and power loads. Even apples own ML frameworks do not use the ANE for anything except inference.
That's going to be the NPU specifically. Pretty much nothing on the LLM front seems to use NPUs at this stage (Copilot Snapdragon laptops aside), so I'm not sure the low number is a problem.
I do think people are going a little overboard with all the commentary about AI in this discussion, and you rightly cite some of the empirical reasons. People are trying to rationalize convincing themselves to buy one of these, but they're deluding themselves.
It's nice that these devices have loads of memory, but they don't have remotely the necessary level of compute to be competitive in the AI space. As a fun thing to run a local LLM as a hobbyist, sure, but this presents zero threat to nvidia.
Apple hardware is irrelevant in the AI space, outside of making YouTube "I ran a quantized LLM on my 128GB Mac Mini" type content for clicks, and this release doesn't change that.
Looks like a great desktop chip though.
It would be nice if Nvidia could start giving their less expensive offerings more memory, though they're currently in the realm Intel was in 15 years ago, thinking that their biggest competition is themselves.
I'm not confused at all. It's the real numbers. Feel free to provide anything that suggests that the TOPS of the GPU in M chips are faster than the dedicated hardware for it. But you can't, cause it's not true. If you think Apple added the neural engine just for fun then I don't know what to tell you.
You have a fundamental flaw in your understanding of how both chips work. Not using the tensor cores would be slower, and the same goes for Apple's neural engine. Both numbers are for the hardware each vendor has implemented for maximum performance on this task.
> support for more than half a terabyte of unified memory — the most ever in a personal computer
The AMD Ryzen Threadripper PRO 3995WX (64c/128t) was released over four years ago and supports 2TB.
> Take your workstation's performance to the next level with the AMD Ryzen Threadripper PRO 3995WX 2.7 GHz 64-Core sWRX8 Processor. Built using the 7nm Zen Core architecture with the sWRX8 socket, this processor is designed to deliver exceptional performance for professionals such as artists, architects, engineers, and data scientists. Featuring 64 cores and 128 threads with a 2.7 GHz base clock frequency, a 4.2 GHz boost frequency, and 256MB of L3 cache, this processor significantly reduces rendering times for 8K videos, high-resolution photos, and 3D models. The Ryzen Threadripper PRO supports up to 128 PCI Express 4.0 lanes for high-speed throughput to compatible devices. It also supports up to 2TB of eight-channel ECC DDR4 memory at 3200 MHz to help efficiently run and multitask demanding applications.
> It also supports up to 2TB of eight-channel ECC DDR4 memory at 3200 MHz (sic) to help efficiently run and multitask demanding applications.
8 channels at 3200 MT/s (1600 MHz) is only 204.8 GB/sec; less than a quarter of what the M3 Ultra can do. It's also not GPU-addressable, meaning it's not actually unified memory at all.
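The arithmetic behind that figure:

    channels, bus_bits, mtps = 8, 64, 3200
    print(channels * bus_bits / 8 * mtps / 1000)  # 204.8 GB/s, vs ~800 GB/s for the M3 Ultra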
No the comment misunderstood the difference between CPU memory and unified memory.
This can dedicate 500GB of high-bandwidth memory to the GPU, ~3.5x that of an H200.