Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens

dang · 2025-01-26T21:04:17 1737925457

Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/

(via https://news.ycombinator.com/item?id=42832838, but we merged that thread hither)

anotherpaulg · 2025-01-26T21:51:41 1737928301

In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, etc.

Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a dedicated help page [0].

Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.

[0] https://aider.chat/docs/troubleshooting/edit-errors.html#don...

adamgordonbell · 2025-01-26T22:44:58 1737931498

Aider is great, but you need specific formats from the llm. That might be where the challenge is.

I've used the giant context in Gemini to dump a code base and say: describe the major data structures and data flows.

Things like that, overview documents, work great. It's amazing for orienting in an unfamiliar codebase.

anotherpaulg · 2025-01-26T23:22:32 1737933752

Yes, that is true. Aider expects to work with the LLM to automatically apply edits to the source files. This requires precision from the LLM, which is what breaks down when you overload them with context.

noname120 · 2025-01-27T16:25:42 1737995142

Not true. In Aider the patch produced by the LLM is sent to a second model that is just tasked with fixing the patch — it works wonders.

trentnelson · 2025-01-27T17:00:32 1737997232

Based on an earlier comment, I think the person you're replying to is the author of aider.

anotherpaulg · 2025-01-27T17:18:22 1737998302

Yes, aider can also work in architect/editor mode [0] which tends to produce the best results [1]. An architect model solves the coding problem and describes the needed changes however comes naturally to it. The editor model then takes that solution and turns it into correctly formatted instructions to edit the files.

Too much context can still confuse the LLMs in this situation, but they may be somewhat more resilient.

[0] https://aider.chat/2024/09/26/architect.html

[1] https://aider.chat/2025/01/24/r1-sonnet.html

arkh · 2025-01-27T07:44:04 1737963844

My hypothesis is code completion is not a text completion problem. More of a graph completion one.

So we may have got to a local maximum regarding code helpers with LLMs and we'll have to wait for some breakthrough in the AI field before we get something better.

raincole · 2025-01-27T08:05:56 1737965156

But these models don't work that well even for text when you gave them a huge context. They're reasonably good at summarization, but if you ask them to "continue the story" they will write very inconsistent things (eerily similar to what a sloppy human writer does, though.)

meiraleal · 2025-01-27T12:55:34 1737982534

We should be able to provide 2 fields, context and prompt so the prompt gets higher priority and don't get mixed with the whole context.

meiraleal · 2025-01-27T12:53:57 1737982437

For this breakthrough to happen, big tech will need to hire software engineers again :)

But the good thing is that DeepSeek proved those breakthroughs are going to happen one way or another, fast.

badlogic · 2025-01-27T14:53:11 1737989591

I concur. In my work (analysing news show transcripts and descriptions), I work with about 250k input tokens max. Tasks include:

- Summarize topics (with references to shows) - Find quotes specific to a topic (again with references)

Anything above 32k tokens fails to have acceptable recall, across GPT-4o, Sonnet, and Google's Gemini Flash 1.5 and 2.0.

I suppose it kind of makes sense, given how large context windows are implemented via things like sparse attention etc.

kgeist · 2025-01-27T19:36:31 1738006591

What could be the reason? Do they selectively skip tokens to make it appear they support the full context?

lifty · 2025-01-26T22:09:49 1737929389

Thanks for aider! It has become an integral part of my workflow. Looking forward to try DeepSeek in architect mode with Sonnet as the driver. Curious if it will be a noticeable improvement as compared to using Sonnet by itself.

anotherpaulg · 2025-01-26T23:23:55 1737933835

I'm guessing you're interesting in R1+Sonnet because of the recent SOTA benchmark result? It does seem to be a powerful architect/editor combo.

https://aider.chat/2025/01/24/r1-sonnet.html

cma · 2025-01-26T22:18:58 1737929938

Claude works incredibly well for me with asking for code changes to projects filling up 80% of context (160K tokens). It's way expensive with the API though but reasonable through the web interface with pro.

Yusefmosiah · 2025-01-27T06:07:52 1737958072

It’s not just the quantity of tokens in context that matters, but the coherence of the concepts in the context.

Many conflicting ideas are harder for models to follow than one large unified idea.

seunosewa · 2025-01-26T22:41:44 1737931304

The behaviour you described is what happens when you have small context windows. Perhaps you're feeding the models with more tokens than you think you are. I have enjoyed loading large codebases into AI Studio and getting very satisfying and accurate answers because the models have 1M to 2M token context windows.

dr_kiszonka · 2025-01-27T00:42:49 1737938569

How do you get those large codebases into AI Studio? Concat everything into one big file?

msoad · 2025-01-27T02:25:29 1737944729

I use yek

https://github.com/bodo-run/yek

social_quotient · 2025-01-27T02:23:43 1737944623

Concat to a file but it helps to make an ascii tree at the top and then for each merged file out its path and orientation details. I’ve also started playing with adding line ranges to the ascii tree hoping that the LLMs (more specifically the agentic ones) start getting smart enough to jump to the relevant section.

adamgordonbell · 2025-01-27T01:35:44 1737941744

Basically yes, I have a helper program, but that's mainly what it does.

NiloCK · 2025-01-27T14:46:31 1737989191

I learned this very explicitly recently. I've had some success with project and branch prompts - feeding a bunch of context into the beginning of each dialog.

In one dialog, some 30k tokens later, Claude requested the contents of package.json... which was in the context window already - the whole file!

The strange thing was that after I said so, without re-inserting, Claude successfully read it from context to fill the gap in what it was trying to do.

It's as if a synopsis of what exists in-context delivered with each message would help. But that feels weird!

cyanydeez · 2025-01-27T21:37:25 1738013845

That's what these char models are already doing.

Most chat is just a long running prompt. LLMs have zero actual memory. You just keep feeding it history.

Maybe I misunderstood what you're saying but what you're describing is some kind of 2nd model that condenses that history and that gets fed; this has been done.

Really, what you probably need is another model managing the heap and the stack of the history and brining forward the current context.

But that's easy to say because we are humans.

NiloCK · 2025-01-28T12:49:12 1738068552

To clarify:

I didn't 'fix' the problem by re-inserting the package.json. I just gave a reminder that it that it was already in context.

"I could confirm this root cause if I could see the contents of package.json".

"You can see it."

"Whoops, yes. Package x needs to bump from n to n+1."

The point being, even info inside the current context can actively be overlooked unless it is specifically referenced.

DiogenesKynikos · 2025-01-27T14:51:33 1737989493

Maybe the problem is that the "UI" we're providing to the LLMs is not very useful.

Imagine dumping the entire text of a large code repository in front of a human programmer, and asking them to fix a bug. Human programmers use IDEs, search through the code, flip back and forth between different functions, etc. Maybe with a better interface that the LLM could interact with, it would perform better.

cyanydeez · 2025-01-27T21:42:23 1738014143

I wonder if you could figure out a pseudo code like python. I'd think yaml might work also

Something: Filename: index.js Content: | Class Example...

Another item would be some kind of hyperlinking. Maybe you could load in a hrefs but there might be a more semantically popular way, but the data feeding these AIs just aren't constructed like that.

orbital-decay · 2025-01-27T13:04:57 1737983097

Overall accuracy degradation on longer contexts is just one major issue. Another is that lost-in-the-middle problem starts being much worse on longer contexts, so when it significantly exceeds the length of model's training examples, the tokens in the middle might as well not exist.

ksynwa · 2025-01-27T16:13:59 1737994439

Any idea why this happens?

torginus · 2025-01-27T07:34:25 1737963265

Yeah, and thanks to the features of the programming language, it's very easy to automatically assemble a highly relevant but short context, just by following symbol references recursively.

tmcdonald · 2025-01-26T19:56:56 1737921416

Ollama has a num_ctx parameter that controls the context window length - it defaults to 2048. At a guess you will need to set that.

anotherpaulg · 2025-01-27T02:24:44 1737944684

This is a harsh foot-gun that seems to harm many ollama users.

That 2k default is extremely low, and ollama *silently* discards the leading context. So users have no idea that most of their data hasn’t been provided to the model.

I’ve had to add docs [0] to aider about this, and aider overrides the default to at least 8k tokens. I’d like to do more, but unilaterally raising the context window size has performance implications for users.

Edit: Ok, aider now gives ollama users a clear warning when their chat context exceeds their ollama context window [1].

[0] https://aider.chat/docs/llms/ollama.html#setting-the-context...

[1] https://github.com/Aider-AI/aider/blob/main/aider/coders/bas...

magicalhippo · 2025-01-27T07:10:40 1737961840

There are several issues in the Ollama GitHub issue tracker related to this, like this[1] or this[2].

Fortunately it's easy to create a variant of the model with increased context size using the CLI[3] and then use that variant instead.

Just be mindful that longer context means more memory required[4].

[1]: https://github.com/ollama/ollama/issues/4967

[2]: https://github.com/ollama/ollama/issues/7043

[3]: https://github.com/ollama/ollama/issues/8099#issuecomment-25...

[4]: https://www.reddit.com/r/LocalLLaMA/comments/1848puo/comment...

neuralkoi · 2025-01-27T07:55:16 1737964516

Thank you! I was looking for how to do this. The example in the issue above shows how to increase the context size in ollama:

    $ ollama run llama3.2
    >>> /set parameter num_ctx 32768
    Set parameter 'num_ctx' to '32768'
    >>> /save llama3.2-32k
    Created new model 'llama3.2-32k'
    >>> /bye
    $ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"
    ...

The table in the reddit post above also shows context size vs memory requirements for Model: 01-ai/Yi-34B-200K Params: 34.395B Mode: infer

    Sequence Length vs Bit Precision Memory Requirements
       SL / BP |     4      |     6      |     8      |     16
    --------------------------------------------------------------
           256 |     16.0GB |     24.0GB |     32.1GB |     64.1GB
           512 |     16.0GB |     24.1GB |     32.1GB |     64.2GB
          1024 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
          2048 |     16.1GB |     24.2GB |     32.3GB |     64.5GB
          4096 |     16.3GB |     24.4GB |     32.5GB |     65.0GB
          8192 |     16.5GB |     24.7GB |     33.0GB |     65.9GB
         16384 |     17.0GB |     25.4GB |     33.9GB |     67.8GB
         32768 |     17.9GB |     26.8GB |     35.8GB |     71.6GB
         65536 |     19.8GB |     29.6GB |     39.5GB |     79.1GB
        131072 |     23.5GB |     35.3GB |     47.0GB |     94.1GB
    *   200000 |     27.5GB |     41.2GB |     54.9GB |    109.8GB

    * Model Max Context Size

Code: https://gist.github.com/lapp0/d28931ebc9f59838800faa7c73e3a0...

eurekin · 2025-01-27T14:42:10 1737988930

Can context be split on multiple GPUs?

magicalhippo · 2025-01-27T22:13:45 1738016025

Not my field, but from this[1] blog post which references this[2] paper, it would seem so. Note the optimal approach are a bit different between training and inference. Also note that several of the approaches rely on batching multiple requests (prompts) in order to exploit the parallelism, so won't see the same gains if fed only a single prompt at a time.

[1]: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi...

[2]: https://arxiv.org/abs/2104.04473

simonw · 2025-01-26T20:11:11 1737922271

Huh! I had incorrectly assumed that was for output, not input. Thanks!

YES that was it:

  files-to-prompt \
    ~/Dropbox/Development/llm \
    -e py -c | \
  llm -m q1m 'describe this codebase in detail' \
   -o num_ctx 80000

I was watching my memory usage and it quickly maxed out my 64GB so I hit Ctrl+C before my Mac crashed.

jmorgan · 2025-01-26T20:38:28 1737923908

Sorry this isn't more obvious. Ideally VRAM usage for the context window (the KV cache) becomes dynamic, starting small and growing with token usage, whereas right now Ollama defaults to a size of 2K which can be overridden at runtime. A great example of this is vLLM's PagedAttention implementation [1] or Microsoft's vAttention [2] which is CUDA-specific (and there are quite a few others).

1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is to use KV cache quantization, which has recently been added behind a flag [3] and will 1/4 the memory footprint if 4-bit KV cache quantization is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve)

[1] https://arxiv.org/pdf/2309.06180

[2] https://github.com/microsoft/vattention

[3] https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...

gcanyon · 2025-01-27T03:30:05 1737948605

I think Apple stumbled into a problem here, and I hope they solve it: reasonably priced Macs are -- by the new standards set by modern LLMs -- severely memory-constrained. MacBook Airs max out at 24GB. MacBook Pros go to 32GB for $2200, 48GB for something like $2800, and to get to 128GB requires shelling out over $4000. A Mini can get you to 64GB for $2000. A Mac Studio can get you to 96GB for $3000, or 192GB for $5600.

In this LLM era, those are rookie numbers. It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000. I realize part of the issue is the lead time for chip design -- since Mac memory is an integral part of the chip, and the current crop were designed before the idea of running something like an LLM locally was a real probability.

But I hope the next year or two show significant increases in the default (and possible) memory for Macs.

senko · 2025-01-27T09:37:40 1737970660

> It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000.

Apple is not known for leaving money on the table like that.

Also, projects like NVidia DIGITS ($2k for 128G) might make Apple unwilling to enter the market. As you said, Studio with 192G is $5600k. For purely AI purposes, two DIGITS' are a better choice, and non-AI usages don't need such ludicros amount of RAM (maybe for video, but those customers are willing to pay more).

gcanyon · 2025-01-27T12:18:56 1737980336

> Apple is not known for leaving money on the table like that.

True -- although I will say the M series chips were a step change in performance and efficiency from the Intel processors they replaced, and Apple didn't charge a premium for them.

I'm not suggesting that they'll stop charging more for RAM than the industry at large -- I'm hoping they'll unbundle RAM from CPU-type. A base Mac Mini goes for $600, and adding RAM costs $200 per 8GB. That's a ridiculous premium, clearly, and at that rate my proposed Mac Mini with 256GB of RAM would go for $6600 -- which would roll my eyes until they fell out of my head.

But Apple is also leaving money on the table if they're not offering a more expensive model people would buy. A 128GB Mini, let's say, for $2000, might be that machine.

All that said, it's also a heck of a future-proof machine, so maybe the designed-obsolescence crowd have an argument to make here.

amrrs · 2025-01-26T20:28:56 1737923336

This has been the problem with a lot of long context use cases. It's not just the model's support but also sufficient compute and inference time. This is exactly why I was excited for Mamba and now possibly Lightning attention.

Even though the new DCA based on which these models provide long context could be an interesting area to watch;

thot_experiment · 2025-01-26T21:07:01 1737925621

Ollama is a "easymode" LLM runtime and as such has all the problems that every easymode thing has. It will assume things and the moment you want to do anything interesting those assumptions will shoot you in the foot, though I've found ollama plays so fast and loose even first party things that "should just work" do not. For example if you run R1 (at least as of 2 days ago when i tried this) using the default `ollama run deepseek-r1:7b` you will get different context size, top_p and temperature vs what Deepseek recommends in their release post.

xigency · 2025-01-27T01:13:13 1737940393

Ollama definitely is a strange beast. The sparseness of the documentation seems to imply that things will 'just work' and yet, they often don't.

rahimnathwani · 2025-01-26T20:14:23 1737922463

Yup, and this parameter is supported by the plugin he's using:

https://github.com/taketwo/llm-ollama/blob/4ccd5181c099af963...

simonw · 2025-01-26T20:51:12 1737924672

Here are tips for running it on macOS using MLX: https://twitter.com/awnihannun/status/1883611098081099914 - using https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-...

woadwarrior01 · 2025-01-26T22:03:43 1737929023

MLX does not support dual chunk attention[1] that these models use for long contexts, yet.

[1]: https://arxiv.org/abs/2402.17463

ilaksh · 2025-01-26T20:51:43 1737924703

What's the SOTA for memory-centric computing? I feel like maybe we need a new paradigm or something to bring the price of AI memory down.

Maybe they can take some of those hundreds of billions and invest in new approaches.

Because racks of H100s are not sustainable. But it's clear that increasing the amount of memory available is key to getting more intelligence or capabilities.

Maybe there is a way to connect DRAM with photonic interconnects that doesn't require much data ordering for AI if the neural network software model changes somewhat.

Is there something that has the same capabilities of a transformer but doesn't operate on sequences?

If I was a little smarter and had any math ability I feel like I could contribute.

But I am smart enough to know that just building bigger and bigger data centers is not the ideal path forward.

lovelearning · 2025-01-27T06:00:54 1737957654

I'm not sure how SOTA it is but the sentence about connecting DRAM differently reminded me of Cerebras' scalable MemoryX and its "weight streaming" architecture to their custom ASIC. You may find it interesting.

[1]: https://cerebras.ai/press-release/cerebras-systems-announces...

[2]: https://cerebras.ai/chip/announcing-the-cerebras-architectur...

ilaksh · 2025-01-27T06:09:11 1737958151

Yeah, Cerebras seems to be the SOTA. I suspect we need something more radically different for truly memory-centric computing that will be significantly more efficient.

mkroman · 2025-01-26T21:15:28 1737926128

The AI hardware race is still going strong, but with so many rapid changes to the fundamental architectures, it doesn't make sense to bet everything on specialized hardware just yet.. It's happening, but it's expensive and slow.

There's just not enough capacity to build memory fast enough right now. Everyone needs the biggest and fastest modules they can get, since it directly impacts the performance of the models.

There's still a lot of happening to improve memory, like the latest Titans paper: https://arxiv.org/abs/2501.00663

So I think until a breakthrough happens or the fabs catch up, it'll be this painful race to build more datacenters.

rfoo · 2025-01-26T21:34:07 1737927247

> Because racks of H100s are not sustainable.

Huh? Racks of H100s are the most sustainable thing we can have for LLMs for now.

ilaksh · 2025-01-28T20:05:24 1738094724

Right, they are. But they still use massive amounts of energy compared to brains.

So it seems that we need a new paradigm of some sort.

So much investment is being announced for data centers. I assumed there would be more investments in fundamental or applied research. Such as for scaling memristors or something.

mmaunder · 2025-01-26T20:25:51 1737923151

Just want to confirm: so this is the first locally runnable model with a context length of greater than 128K and it’s gone straight to 1M, correct?

segmondy · 2025-01-26T20:43:15 1737924195

No, this is not the first local model with a context length of greater than 128k, there have been such models, for example the following

https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini 256k https://huggingface.co/THUDM/glm-4-9b-chat-1m 1M

and many other's that supposedly extended traditional models via finetune/rope scaling

mmaunder · 2025-01-26T20:48:28 1737924508

Thanks.

terhechte · 2025-01-26T20:35:18 1737923718

Yes. It requires a lot of ram, and even on a M4 with a lot of ram, if you give it 1mio tokens the prompt processing alone (that is, before you get the first response token) will probably take ~30min or more. However I'm looking forward to check if indeed I can give it a whole codebase and ask questions about it.

marci · 2025-01-27T00:56:33 1737939393

You might want to try caching to a file with mlx.

https://github.com/ml-explore/mlx-examples/pull/956

edit: here's a quick example for qwen2.5-1M from a mlx dev

https://x.com/awnihannun/status/1883611098081099914

terhechte · 2025-01-27T16:46:24 1737996384

That's cool, than you, but does MLX support the Qwen 1M context yet?

marci · 2025-01-28T06:16:56 1738045016

According to the tweet, not the full context yet.

whenlambo · 2025-01-29T16:47:37 1738169257

Qwen2.5-Plus

> have you been trained using openai?

No, I have not been trained using technology or data from OpenAI. I am an artificial intelligence model developed by Yandex, called YandexGPT. My “training” is based on proprietary algorithms and data that was collected and processed by the Yandex team. While I have similarities with other models such as those developed by OpenAI (e.g. GPT-3), my training was conducted independently and uses unique approaches specific to Yandex. This includes using Russian-language data as well as other sources to provide a better understanding of context and provide useful information to users. If you have questions on topics related to AI technologies or anything else, I'd be happy to help!

simonw · 2025-01-26T19:42:11 1737920531

I'm really interested in hearing from anyone who does manage to successfully run a long prompt through one these on a Mac (using one of the GGUF versions, or through other means).

terhechte · 2025-01-27T10:48:05 1737974885

I gave it a 446433 token input, then it calculated for ~4 hours, and gave me a reasonable response. The content was a Rust / Typescript codebase where Typescript is the frontend and Rust is the backend. I asked it which backend apis are currently not used by the frontend. I haven't checked yet, but the answer looked correct.

Running this on a M4 max

laborcontract · 2025-01-27T16:46:28 1737996388

how much ram?

terhechte · 2025-01-26T20:36:49 1737923809

I bought an M4 Max with 128g of ram just for these use cases. Currently downloading the 7b

rcarmo · 2025-01-26T19:48:21 1737920901

I'm missing what `files-to-prompt` does. I have an M3 Max and can take a stab at it, although I'm currently fussing with a few quantized -r1 models...

simonw · 2025-01-26T19:49:46 1737920986

It's this tool: https://pypi.org/project/files-to-prompt/

rcarmo · 2025-01-26T19:59:51 1737921591

You might want to use the file markers that the model outputs while being loaded by ollama:

    lm_load_print_meta: general.name     = Qwen2.5 7B Instruct 1M
    llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
    llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
    llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
    llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
    llm_load_print_meta: LF token         = 148848 'ÄĬ'
    llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
    llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
    llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
    llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
    llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
    llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
    llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
    llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
    llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
    llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
    llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
    llm_load_print_meta: max token length = 256

mmaunder · 2025-01-26T20:41:42 1737924102

This API only model with a 1M context window was released back in Nov. Just for some historical context.

https://qwenlm.github.io/blog/qwen2.5-turbo/

simonw · 2025-01-26T20:51:54 1737924714

That's a different model - the 2.5 Turbo one. Today's release is something different.

bloomingkales · 2025-01-26T20:13:25 1737922405

I’ve heard rumbling about native context length. I don’t know too much about it, but is this natively 1M context length?

So even models like llama3 8b say they have a larger context, but they really don’t in practice. I have a hard time getting past 8k on 16gb vram (you can definitely set the context length higher, but the quality and speed degradation is obvious).

I’m curious how people are doing this on modest hardware.

segmondy · 2025-01-26T20:36:40 1737923800

You can't on modest hardware, VRAM size is a function of model size, KV cache that depends on context length and the quant size of the model and K/V. 16gb isn't much really. You need more vram, the best way for most folks is to buy a macbook with unified memory. You can get a 128gb mac, but it's not cheap. If you are handy and resourceful you can build a GPU cluster.

A4ET8a8uTh0_v2 · 2025-01-27T05:49:19 1737956959

I never thought I would say it, but the 128gb mbp is probably the most cost efficient way ( and probably easiest ) of doing it. New nvidia cards ( 5090 ) are 32gb and supposedly just shy of 2k and used a100 40gb is about 8k..

All in all, not a cheap hobby ( if you are not doing it for work ).

elorant · 2025-01-26T21:43:33 1737927813

You need a model that has specifically been extended for larger context windows. For Llama-3 there's Llama3-gradient with up to 1M tokens. You can find it at ollama.com

jkbbwr · 2025-01-26T20:28:54 1737923334

Everyone keeps making the context windows bigger, which is nice.

But what about output? I want to generate a few thousand lines of code, anyone got any tips?

bugglebeetle · 2025-01-26T20:37:12 1737923832

So context size actually helps with this, relative to how LLMs are actually deployed as applications. For example, if you look at how the “continue” option in the DeepSeek web app works for code gen, what they’re likely doing is reinserting the prior messages (in some form) to a new one to prompt further completion. The more context size a model has and can manage successfully, the better it will likely be able at generating longer code blocks.

nejsjsjsbsb · 2025-01-27T07:44:22 1737963862

Isn't input/output lengths an arbitrary distinction. Under the hood, output becomes the input for the next token at each step. OpenAI may charge you more $$ by forcing you to add output to the input and call the API again. But running local you don't have that issue.

mmaunder · 2025-01-26T20:32:28 1737923548

Repeatedly ask it for more providing the previous output as context. (Back to context length as a limitation)

anotheryou · 2025-01-26T20:32:12 1737923532

Isn't that the same? limit-wise.

Now you just need to convince it to output that much :)

AyyEye · 2025-01-26T20:34:22 1737923662

These things already produce embarrassing output. If you make it longer it's just going to get worse.

buyucu · 2025-01-26T20:21:34 1737922894

first, this is amazing!

second, how does one increase the context window without requiring obscene amounts of RAM? we're really hitting the limitations of the transformer architecture's quadratic scaling...

35mm · 2025-01-26T20:26:12 1737923172

Chain of agents seems to be a promising approach for splitting up tasks into smaller parts and then synthesising the results[1]

[1] https://research.google/blog/chain-of-agents-large-language-...

ein0p · 2025-01-27T08:13:12 1737965592

The main problem isn't actually context length most of the time. 128K is plenty for a lot of practical tasks. It's the generation length, both within turns and especially across turns. And nobody knows how to increase that significantly yet.

refulgentis · 2025-01-26T21:21:34 1737926494

People are getting pretty...clever?...with long context retrieval benchmarking in papers.

Here, the prose says "nearly perfect", the graph is all green except for a little yellow section, and you have to parse a 96 cell table, having familiarity with several models and technical techniques to get the real # (84.4%, and that tops out at 128K, not anywhere near the claimed 1M)

I don't bring this up to denigrate, but rather to highlight that "nearly perfect" is quite far off still. Don't rely on long context for anything you build

oefrha · 2025-01-27T06:29:11 1737959351

“Nearly perfect” is cherry-picked from the sentence

> Even models trained on just 32K tokens, such as the Qwen2.5-7B-Instruct, achieve nearly perfect accuracy in passkey retrieval tasks with 1M-token contexts.

Which is pages after the graph and table you mentioned, which are clearly introduced as

(Graph)

> First off, we evaluate the Qwen2.5-1M models on the Passkey Retrieval task with a context length of 1 million tokens. The results show that these models can accurately retrieve hidden information from documents containing up to 1M tokens, with only minor errors observed in the 7B model.

(Table)

> For more complex long-context understanding tasks, we select RULER, LV-Eval, LongbenchChat used in this blog.

That you went so deep into the post to find your “clever” phrase to complain about tells me you’re probably being intentionally misleading. Most readers won’t read that far and ones that do certainly won’t leave with an impression that this is “nearly perfect” for complex tasks.

refulgentis · 2025-01-27T18:06:08 1738001168

> “Nearly perfect” is cherry-picked from the sentence

You're attempting to imply the rest of the sentence adds context that makes pulling out "nearly perfect" incorrect. Can you explain?

> ...

I'm not sure what the rest of the quotes are implying, as you just copy and paste and don't provide any indication of what you're communicating by sharing them. Can you explain more?

> That you went so deep into the post

It's the 587th word, less than 2 minutes reading at average reading speed.

> you’re probably being intentionally misleading.

!?!?!

#1) I'm certainly not intentionally misleading.

#2) What is misleading about "they say nearly perfect and then the highest # I can steelman from the table is 84%?"

#3) This is the first time in 15 years on HN that I've had someone accuse me of being intentionally misleading. Part of that is because there's numerous rules against that sort of dialogue. The remaining part is people, at least here, are usually self-interested enough to not make up motivations for other people feeling differently from them.

iamnotagenius · 2025-01-26T20:03:49 1737921829

requires obscene amount of memory for context.

woadwarrior01 · 2025-01-26T22:28:22 1737930502

It's on the model's huggingface README[1].

> For processing 1 million-token sequences:

> Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across GPUs).

> Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across GPUs).

[1]: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M

hmottestad · 2025-01-26T20:10:15 1737922215

More than other models? I thought that context used a lot of memory on all models.

And I’d hardly call it obscene. You can buy a Mac Studio with 192GB of memory, that should allow you to max out the context window of the 7B model. Probably not going to be very fast though.

varispeed · 2025-01-26T20:18:16 1737922696

Not attainable to working class though. can is doing a lot of heavy lifting here. Seems like after a brief period where technology was essentially class agnostic, now only the wealthy can enjoy being part of development and everyone else can just be a consumer.

hmottestad · 2025-01-26T20:43:01 1737924181

Not sure what you mean. Cutting edge computing has never been cheap. And a Mac Studio is definitely within the budget of a software developer in Norway. Not going to feel like a cheap investment, but definitely something that would be doable. Unlike a cluster of H100 GPUs, which would cost as much as a small apartment in Oslo.

And you can easily get a dev job in Norway without having to run an LLM locally on your computer.

manmal · 2025-01-26T22:03:14 1737928994

The money would be better invested in a 2-4 3090 x86 build, than in a Mac Studio. While the Macs have a fantastic performance-per-watt ratio, and have decent memory support (both bus width and memory size), they are not great at compute power. A multi RTX 3090 build totally smokes a Mac at the same price point, at inference speed.

hmottestad · 2025-01-27T11:19:08 1737976748

Memory requirement for the 7B model with full context is 120GB, so you would need 5 3090 GPUs, not 2-4. Do you know if you can get a motherboard with space for 5 GPUs and a power supply to match?

I bet that 5 3090s will smoke a Mac Studio. Can't find anyone in Norway with any in stock though. Or any 4090s with 24GB of memory.

You can get a nVidia RTX 5000 with 32GB of memory, there are two webshops that have those in stock. You'll need to wait though, because it looks like there might be one or maybe two in stock in total. And they are 63 000 NOK, and you need 4 of them. At that price you can buy two Mac Studios though.

I see people selling 3090s with 24GB secondhand for around 10 000 NOK each, but those have been running day in and day our for 3 years and don't come with a warranty.

manmal · 2025-01-28T06:17:06 1738045026

If you search on r/localllama, there are people who have improvised builds with eg 8x GPUs. Takes multiple power supplies and server mainboards. And some let the GPUs sit openly on wooden racks - not sure that’s good for longevity?

BTW a Mac wouldn’t be able to run a model with 120GB requirements, 8GB for the rest is likely too tight a fit.

hmottestad · 2025-01-28T20:48:35 1738097315

Mac Studio has up to 196 GB of memory.

sgt · 2025-01-26T22:01:55 1737928915

Agreed - it's probably not unreasonable. So are the M4 Macs becoming the de-facto solution to running an LLM locally? Due to the insane 800 GB/sec internal bandwidth of Apple Silicon at its best?

simonw · 2025-01-26T22:10:26 1737929426

The advantage the Macs have is that they can share RAM between GPU and CPU, and GPU-accessible RAM is everything when you want to run a decent sized LLM.

The problem is that most ML models are released for NVIDIA CUDA. Getting them to work on macOS requires translating them, usually to either GGUF (the llama.cpp format) or MLX (using Apple's own MLX array framework).

As such, as a Mac user I remain envious of people with NVIDIA/CUDA rigs with decent amounts of VRAM.

The NVIDIA "Digits" product may change things when it ships: https://www.theverge.com/2025/1/6/24337530/nvidia-ces-digits... - it may become the new cheapest convenient way to get 128GB of GPU-accessible RAM for running models.

manmal · 2025-01-26T22:03:45 1737929025

No they are lacking compute power to be great at inference.

simonw · 2025-01-26T22:07:12 1737929232

Can you back that up?

manmal · 2025-01-27T07:53:10 1737964390

One 3090 seems to be equivalent to one M3 Max at inference: https://www.reddit.com/r/LocalLLaMA/s/BaoKxHj8ww

There are many such threads on Reddit. M4 Max is incrementally faster, maybe 20%. Even if you factor in electricity costs, a 2x 3090 setup is IMO the sweet spot, cost/benefit wise.

And it’s maybe a zany line of argumentation, but 2x 3090 use 10x the power of an M4 Max. While the M4 is maybe the most efficient setup out there, it’s not nearly 10x as efficient. That’s IMO where the lack of compute power comes from.

sgt · 2025-01-27T12:37:33 1737981453

What is the GPU memory on that 3090?

manmal · 2025-01-27T13:43:34 1737985414

24GB VRAM. Using multiple ones scales well because models can be split by layers, and run in a pipelined fashion.

varispeed · 2025-01-27T21:49:39 1738014579

I am talking about the times where you were only limited by your imagination and skills. All you needed was a laptop and few hundred bucks for servers. Now, to compete, you would need magnitudes more cash. You can still do some things, but you are at a mercy of AI providers that they can cut you off on a whim.

cma · 2025-01-26T22:23:00 1737930180

Not much more than something like a used jetski, but possibly depreciates even faster.

sbarre · 2025-01-26T20:28:35 1737923315

I mean... when has this not been the case?

Technology has never been class-agnostic or universally accessible.

Even saying that, I would argue that there is more, not less, technology that is accessible to more people today than there ever has been.

postepowanieadm · 2025-01-27T07:23:35 1737962615

Only me getting part of my answer in Chinese?

gpualerts · 2025-01-26T20:37:50 1737923870

You tried to run it on CPU? I can't imagine how long that would take you. im tempted to try it out on half tb ram server