Sorry this isn't more obvious. Ideally VRAM usage for the context window (the KV cache) becomes dynamic, starting small and growing with token usage, whereas right now Ollama defaults to a fixed size of 2K that can be overridden at runtime. Great examples of this are vLLM's PagedAttention implementation [1] and Microsoft's vAttention [2], which is CUDA-specific (and there are quite a few others).
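To make the idea concrete, here's a toy sketch of the block-allocation scheme behind PagedAttention: KV memory is handed out in small fixed-size blocks as a sequence actually grows, instead of being reserved for the full context up front. This is not vLLM's real implementation; the block size, class names, and shapes are all illustrative.

    # Toy sketch of the block-allocation idea behind PagedAttention.
    # Fixed-size blocks of KV storage are allocated on demand, so memory
    # use tracks actual token count rather than the maximum context.
    BLOCK_TOKENS = 16  # tokens of K/V storage per block (illustrative)

    class BlockPool:
        def __init__(self, num_blocks: int):
            self.free = list(range(num_blocks))  # indices into one big KV tensor

        def allocate(self) -> int:
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            return self.free.pop()

        def release(self, block: int) -> None:
            self.free.append(block)

    class SequenceKV:
        """Tracks which physical blocks hold a sequence's K/V entries."""
        def __init__(self, pool: BlockPool):
            self.pool = pool
            self.blocks: list[int] = []
            self.num_tokens = 0

        def append_token(self) -> None:
            # Only grab a new block when the current one is full.
            if self.num_tokens % BLOCK_TOKENS == 0:
                self.blocks.append(self.pool.allocate())
            self.num_tokens += 1

        def free_all(self) -> None:
            for b in self.blocks:
                self.pool.release(b)
            self.blocks.clear()
            self.num_tokens = 0

    pool = BlockPool(num_blocks=1024)
    seq = SequenceKV(pool)
    for _ in range(100):       # decode 100 tokens
        seq.append_token()
    print(len(seq.blocks))     # 7 blocks used, not the full context's worth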
1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is to use KV cache quantization, which was recently added behind a flag [3] and will roughly quarter the memory footprint when the 4-bit q4_0 type is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve).
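For a rough sense of scale, here is some back-of-the-envelope arithmetic. The model shape (layers, KV heads, head dim) is a hypothetical 7B-class model with grouped-query attention, not any particular model, and the q4_0 figure assumes GGML's 18 bytes per block of 32 values:

    # Back-of-the-envelope KV cache sizing for a hypothetical 7B-class
    # model with grouped-query attention (shape is illustrative only).
    def kv_cache_bytes(n_tokens, n_layers=28, n_kv_heads=4, head_dim=128,
                       bytes_per_elem=2.0):
        # K and V each store n_layers * n_kv_heads * head_dim values per token.
        return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

    tokens = 1_000_000
    f16 = kv_cache_bytes(tokens, bytes_per_elem=2.0)     # fp16: 2 bytes/value
    q4  = kv_cache_bytes(tokens, bytes_per_elem=0.5625)  # q4_0: 18 bytes per 32 values

    print(f"f16 : {f16 / 2**30:.1f} GiB")  # ~53.4 GiB
    print(f"q4_0: {q4 / 2**30:.1f} GiB")   # ~15.0 GiB, roughly a quarter

Even quantized, a full 1M-token cache under these assumptions lands in the tens of GiB before counting the model weights themselves.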
I think Apple stumbled into a problem here, and I hope they solve it: reasonably priced Macs are -- by the new standards set by modern LLMs -- severely memory-constrained. MacBook Airs max out at 24GB. MacBook Pros go to 32GB for $2200, 48GB for something like $2800, and to get to 128GB requires shelling out over $4000. A Mini can get you to 64GB for $2000. A Mac Studio can get you to 96GB for $3000, or 192GB for $5600.
In this LLM era, those are rookie numbers. It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000. I realize part of the issue is the lead time for chip design -- since Mac memory is an integral part of the chip package, the current crop was designed before running something like an LLM locally was a realistic possibility.
But I hope the next year or two show significant increases in the default (and possible) memory for Macs.
> It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000.
Apple is not known for leaving money on the table like that.
Also, projects like NVIDIA DIGITS ($2k for 128G) might make Apple unwilling to enter the market. As you said, a Studio with 192G is $5600. For purely AI purposes, two DIGITS units are a better choice, and non-AI usage doesn't need such a ludicrous amount of RAM (maybe for video, but those customers are willing to pay more).
> Apple is not known for leaving money on the table like that.
True -- although I will say the M series chips were a step change in performance and efficiency from the Intel processors they replaced, and Apple didn't charge a premium for them.
I'm not suggesting that they'll stop charging more for RAM than the industry at large -- I'm hoping they'll unbundle RAM from CPU-type. A base Mac Mini goes for $600, and adding RAM costs $200 per 8GB. That's a ridiculous premium, clearly, and at that rate my proposed Mac Mini with 256GB of RAM would go for $6600 -- which would roll my eyes until they fell out of my head.
But Apple is also leaving money on the table if they're not offering a more expensive model people would buy. A 128GB Mini, let's say, for $2000, might be that machine.
All that said, it's also a heck of a future-proof machine, so maybe the designed-obsolescence crowd have an argument to make here.
This has been the problem with a lot of long-context use cases. It's not just about model support but also about having sufficient compute and acceptable inference time. This is exactly why I was excited for Mamba and now possibly Lightning attention.
That said, the DCA (Dual Chunk Attention) approach these models use to provide long context could be an interesting area to watch.
YES that was it:
I was watching my memory usage and it quickly maxed out my 64GB, so I hit Ctrl+C before my Mac crashed.